# **Pokémon Data Scraper**

📌 **How to Use This Notebook:**

This notebook scrapes Pokémon data from the Pokémon Database website (pokemondb.net) and organizes it into a structured format that includes base stats, generation, and artwork links.
It is organized into the following steps:

**1. Import Libraries:** Loads the necessary tools for scraping and working with data.

**2. Ignore SSL Errors:** Bypasses certificate warnings that could interrupt scraping.

**3. Access the URL and Extract Table Rows:** Connects to the Pokémon database and pulls HTML rows.

**4. Extract Values (excluding special Pokémon):** Filters out alternate forms, mega evolutions, etc., to keep only standard entries.

**5. Save DataFrame with Basic Info ✅:** Writes the scraped data to CSV files. ⚠️ Skip this step if you have already saved the CSVs: re-running it overwrites the existing files, including any generation and image data collected in step 6.

**6. Add Generation & Image URL:** Visits individual Pokémon pages to extract generation and official image links.

💡 **Tips:**

Don’t want to scrape everything at once?
Run steps 1 to 4 to get the general stats only (much faster), then step 5 if you want them saved to CSV.

Want to finish collecting image links in bulk?
You can pick up from Step 6: the code checks for missing entries and continues from where it left off.

**1. Importing all of the libraries**

In [2]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import ssl
import pandas as pd
import time
import os

**2. Ignoring SSL certificate errors**

In [3]:
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
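
An SSL context only affects the requests that receive it explicitly through the `context` argument of `urlopen`. The following is a minimal sketch of how the setup and the request could be wrapped together; the helper names `make_unverified_context` and `fetch_html` and the 30-second timeout are assumptions for illustration, not part of the notebook.

```python
import ssl
from urllib.request import Request, urlopen


def make_unverified_context():
    """Build an SSL context that skips hostname and certificate checks."""
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_html(url, ctx, timeout=30):
    """Download a page, sending a browser-like User-Agent and using ctx for TLS."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    # Passing context=ctx is what actually applies the relaxed SSL settings
    with urlopen(req, context=ctx, timeout=timeout) as resp:
        return resp.read()
```

With these helpers, a call such as `fetch_html('https://pokemondb.net/pokedex/all', make_unverified_context())` would return the raw page bytes.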

**3. Accessing the URL and extracting the rows**

In [4]:
url = 'https://pokemondb.net/pokedex/all'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req, context=ctx).read() #Pass the SSL context from step 2 so it is actually applied
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')
print(rows[1:2])

[<tr>
<td class="cell-num cell-fixed" data-sort-value="1"><picture class="infocard-cell-img">
<source height="56" srcset="https://img.pokemondb.net/sprites/scarlet-violet/icon/avif/bulbasaur.avif" type="image/avif" width="60"/>
<img alt="Bulbasaur" class="img-fixed icon-pkmn" height="56" loading="lazy" src="https://img.pokemondb.net/sprites/scarlet-violet/icon/bulbasaur.png" width="60"/>
</picture><span class="infocard-cell-data">0001</span></td> <td class="cell-name"><a class="ent-name" href="/pokedex/bulbasaur" title="View Pokedex for #0001 Bulbasaur">Bulbasaur</a></td><td class="cell-icon"><a class="type-icon type-grass" href="/type/grass">Grass</a><br/> <a class="type-icon type-poison" href="/type/poison">Poison</a></td>
<td class="cell-num cell-total">318</td>
<td class="cell-num">45</td>
<td class="cell-num">49</td>
<td class="cell-num">49</td>
<td class="cell-num">65</td>
<td class="cell-num">65</td>
<td class="cell-num">45</td>
</tr>]


**4. Extracting the values from the table (excluding special Pokémon)**

In [5]:
pokemon_data = []
dex_links = []
numbers = set()

for row in rows[1:]: #Skip the header row
    line = row.find_all('td')

    #First: get the dex number from the Pokédex table so extra forms can be omitted
    dex_number = line[0].find('span', class_='infocard-cell-data').text.strip()

    if dex_number not in numbers:
        numbers.add(dex_number) #Keeps only the first Pokémon for each dex number (omitting extra forms)
        
        #Link to each pokemon stats
        position_href = line[1].find('a', class_="ent-name")
        relative_link = position_href['href']
        full_link = "https://pokemondb.net" + relative_link    
        
        #Name of the Pokémon (read this way so special cases can be added in the future)
        img_tag = line[0].find('img', class_='img-fixed icon-pkmn') #Gets the image tag, where the real name of special Pokémon is stored
        name = img_tag['alt'] #The real name, including special cases

        #Types of the Pokémon, handling the 1-type and 2-type cases
        types = [t.text for t in line[2].find_all('a')]
        #Split the types into two columns for a clean transition to the database
        if len(types) == 2:
            type_1 = types[0]
            type_2 = types[1]
        else:
            type_1 = types[0]
            type_2 = ''  

        #Data from the stats
        total = line[3].text.strip()
        HP = line[4].text.strip()
        Attack = line[5].text.strip()
        Defense = line[6].text.strip()
        Sp_Atk = line[7].text.strip()
        Sp_Def = line[8].text.strip()
        Speed = line[9].text.strip()

        #Appending the data in a corresponding list
        dex_links.append((dex_number, name, full_link)) #To not have the links in the main data set
        pokemon_data.append((dex_number, name, type_1, type_2, total, HP, Attack, Defense, Sp_Atk, Sp_Def, Speed)) #Appending the data
        

print(pokemon_data[:1])
print(dex_links[:1])

[('0001', 'Bulbasaur', 'Grass', 'Poison', '318', '45', '49', '49', '65', '65', '45')]
[('0001', 'Bulbasaur', 'https://pokemondb.net/pokedex/bulbasaur')]
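
The filtering rule in this step (keep the first row per dex number, then pad the types to two columns) can be seen in isolation with a minimal sketch. The function name `first_form_per_dex` and the sample rows are hypothetical; the real notebook reads these values from the HTML table, where the base form is assumed to appear before its alternate forms.

```python
def first_form_per_dex(entries):
    """Keep only the first entry seen for each dex number."""
    seen = set()
    kept = []
    for dex_number, name, types in entries:
        if dex_number in seen:
            continue  # alternate form of a Pokémon that was already kept
        seen.add(dex_number)
        # Pad to exactly two type slots; a single-type Pokémon gets '' as type 2
        type_1, type_2 = (types + ['', ''])[:2]
        kept.append((dex_number, name, type_1, type_2))
    return kept


# Hypothetical sample rows in table order (base form first, then alternates)
sample = [
    ('0003', 'Venusaur', ['Grass', 'Poison']),
    ('0003', 'Mega Venusaur', ['Grass', 'Poison']),
    ('0004', 'Charmander', ['Fire']),
]
print(first_form_per_dex(sample))
# [('0003', 'Venusaur', 'Grass', 'Poison'), ('0004', 'Charmander', 'Fire', '')]
```

Because the check relies on row order, this approach only keeps the standard entry if the table lists it first.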


**5. Saving the scraped data in a DataFrame**

In [9]:
#Creating the DataFrames to make the data easier to extend
df_poke = pd.DataFrame(pokemon_data, columns=['Dex', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'])
df_links = pd.DataFrame(dex_links,columns=['Dex', 'Name', 'poke_links'])

df_links['Generation'] = 'N/A' # Placeholder column for the generation info
df_links['Image URL'] = 'N/A' # Placeholder column for the image URLs

#Saving the data frames as CSV for reusability
file_path_links = os.path.join('..', 'Data', 'poke_links.csv') #The '..' (2 dots) goes one folder up
df_links.to_csv(file_path_links, index=False)

file_path_poke = os.path.join('..', 'Data', 'poke_data.csv') #The '..' (2 dots) goes one folder up
df_poke.to_csv(file_path_poke, index=False)
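
The save step assumes that a `Data` folder already exists one level above the notebook, since `to_csv` does not create missing folders. A minimal sketch of a guard for that case follows; the helper name `ensure_parent_dir` is an assumption for illustration, not part of the notebook.

```python
import os


def ensure_parent_dir(file_path):
    """Create the folder that will hold file_path if it does not exist yet."""
    parent = os.path.dirname(file_path)
    if parent:  # an empty string means the current directory, nothing to create
        os.makedirs(parent, exist_ok=True)
    return file_path
```

Since the helper returns the path unchanged, it could be used inline, for example `df_poke.to_csv(ensure_parent_dir(file_path_poke), index=False)`.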

**6. Adding the generation and image URL**

In [10]:
#Loading the CSV as the new DataFrame
file_path = os.path.join('..', 'Data', 'poke_links.csv') #The '..' (2 dots) goes one folder up
df_links = pd.read_csv(file_path)

# Replace any string 'N/A' or 'nan' with pandas NA
df_links['Generation'] = df_links['Generation'].replace(['N/A', 'nan'], pd.NA).astype('string')
df_links['Image URL'] = df_links['Image URL'].replace(['N/A', 'nan'], pd.NA).astype('string')

#Filtering for the rows in df_links that still contain N/A (so this code can be re-run without scraping everything again)
missing_data_df = df_links[df_links['Generation'].isna() | df_links['Image URL'].isna()]
total_missing = len(missing_data_df)
print(f"🔎 Found {total_missing} Pokémon entries with missing data.")

#How many to scrape
try:
    to_scrape = int(input(f"How many would you like to scrape (max number {total_missing})?"))
    if to_scrape > total_missing or to_scrape < 1:
        print ("❌ Invalid number, exiting.")
        exit()
        
except ValueError:
    print("❌ Invalid input. Must be a number.")
    exit()

#Slicing the data to scrape to match the requested number
missing_data_df = missing_data_df.iloc[:to_scrape]

#Loop through the missing rows
for idx, row in missing_data_df.iterrows():
    dex = row['Dex']
    name = row['Name']
    poke_url = row['poke_links']
    
    try:
        req_ep = Request(poke_url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(req_ep).read()
        soup = BeautifulSoup(html, 'html.parser')
        
        #Finding the div container for the image
        image_container = soup.find('div', class_='grid-col span-md-6 span-lg-4 text-center')

        #Finding the <a> tag (artwork link) or, if there is none, the <img> tag for the image URL
        img_tag = image_container.find('a', rel='lightbox')
        if img_tag is None:
            img_tag = image_container.find('img')
            img_url = img_tag['src'] if img_tag else 'N/A'
        else:
            img_url = img_tag['href']

        #Finding the generation
        gen_abbr = soup.find('abbr', title=True)
        gen_parts = gen_abbr.text.split() if gen_abbr else [] #Splitting to keep only the generation number
        generation = gen_parts[1] if len(gen_parts) > 1 else 'Unknown'

        #Update the DataFrame cells at this row index
        df_links.at[idx, 'Generation'] = generation
        df_links.at[idx, 'Image URL'] = img_url
        print(f"✅ Retrieved: {name} | Generation: {generation} | Img: {img_url}")

    except Exception as e: #For any unexpected error
        print(f"❌ Error with {name} at {poke_url}: {e}")
    
    time.sleep(2) #Pause for 2 seconds between requests to avoid overloading the server, in line with the site's robots.txt
    
#Updating the CSV file with the new data
df_links.to_csv(file_path, index=False)

🔎 Found 0 Pokémon entries with missing data.


How many would you like to scrape (max number 0)? 0


❌ Invalid number, exiting.
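
The generation lookup in this step relies on the `<abbr>` text being two words, with the number second. As a minimal sketch of a more defensive version, assuming the text looks like `Generation 1` (inferred from the notebook's use of `split()[1]`, not confirmed against the site), a hypothetical helper `parse_generation` could fall back to `'Unknown'` for anything else:

```python
def parse_generation(abbr_text):
    """Return the generation number from text such as 'Generation 1'."""
    parts = abbr_text.split() if abbr_text else []
    # Expect the word 'Generation' followed by a number
    if len(parts) >= 2 and parts[0] == 'Generation' and parts[1].isdigit():
        return parts[1]
    return 'Unknown'


print(parse_generation('Generation 1'))  # 1
print(parse_generation(None))            # Unknown
```

Returning a single string in every case avoids indexing into a fallback value by mistake when the tag is missing.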
