# Scraping for Pokémon base stat data

## Using Beautiful Soup

In [1]:
'''
imports relevant libraries
'''
import json
import numpy as np
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

'''
Reads in the HTML from the URL
'''
url = "https://pokemondb.net/pokedex/game/lets-go-pikachu-eevee"
headers = {'user-agent':'Mozilla/5.0'}
request = urllib.request.Request(url,headers=headers)
html = urllib.request.urlopen(request).read()

'''
Pass the HTML into BeautifulSoup and find the table of all pokemon
'''
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'class':"infocard-list infocard-list-pkmn-lg"})
pokemon = main_table.find_all("a",class_="ent-name")

# to collect just the first 20 pokemon stats
pokemon = pokemon[:20]

'''
Define a function to collect the six stats for each pokemon
'''
def stat_collect(url):
    headers = {'user-agent':'Mozilla/5.0'}
    request = urllib.request.Request(url,headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html,'html.parser')
    stats_table = soup.find(attrs={'class':'grid-col span-md-12 span-lg-8'})
    stat_name = np.array([i.contents for i in stats_table.find_all('th')][:6]).flatten()
    # The stat cells have no ids and share one class name, so the base stats
    # are picked out by taking every third `cell-num` value (indices 0, 3, 6, ...)
    temp_array = np.array([j.contents for j in stats_table.find_all('td', attrs={'class':'cell-num'})]).flatten()
    # Values are kept as the strings found in the page
    stat_value = list(temp_array[::3])
    return list(zip(stat_name, stat_value))

'''
Cycle through each URL and use the function to stitch pokemon name and stat together
'''
pokemon_names = np.array([j.contents for j in pokemon]).flatten()
pokemon_data = []
for link in pokemon:
    url = link['href']
    if not url.startswith('http'):
        url = "https://pokemondb.net"+ url
    pokemon_data.append(dict(stat_collect(url)))

'''
Compiles into a dictionary and then into a dataframe
'''
name_and_stats = dict(zip(pokemon_names,pokemon_data))

pokemon_df = pd.DataFrame(name_and_stats).T
# Note: the scraped values are strings, so this sort compares text, not numbers
pokemon_df = pokemon_df.sort_values(by=['Speed'], ascending=False)
print(pokemon_df)

           Attack Defense  HP Sp. Atk Sp. Def Speed
Raticate       81      60  55      50      70    97
Venusaur       82      83  80     100     100    80
Charmeleon     64      58  58      80      65    80
Blastoise      83     100  79      85     105    78
Beedrill       90      40  65      45      80    75
Rattata        56      35  30      25      35    72
Pidgeotto      60      55  63      50      50    71
Butterfree     45      50  60      90      80    70
Charmander     52      43  39      60      50    65
Ivysaur        62      63  60      80      80    60
Wartortle      63      80  59      65      80    58
Pidgey         45      40  40      35      35    56
Weedle         35      30  40      20      20    50
Caterpie       30      35  45      20      20    45
Bulbasaur      49      49  45      65      65    45
Squirtle       48      65  44      50      64    43
Kakuna         25      50  45      25      25    35
Metapod        20      55  50      25      25    30
Pidgeot     
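
One detail in the output above is worth noting: `stat_collect` returns the stat values as the strings found in the page, so `sort_values` compares them character by character rather than numerically. That would explain why Raticate (97) heads the table while Pidgeot, whose speed is listed as 101 in the API section, falls to the bottom. Below is a minimal sketch of the effect in plain Python, using three speed values taken from the tables in this document. Casting to `int` first (for the DataFrame, something like `pokemon_df.astype(int)`) should restore the numeric order.

```python
# Speed values as scraped: strings, not numbers (taken from the tables in this document).
speeds = {"Raticate": "97", "Venusaur": "80", "Pidgeot": "101"}

# Sorting the raw strings compares character by character, so "101" < "80" < "97".
as_strings = sorted(speeds, key=lambda name: speeds[name], reverse=True)

# Casting to int first gives the numeric ordering.
as_numbers = sorted(speeds, key=lambda name: int(speeds[name]), reverse=True)

print(as_strings)  # ['Raticate', 'Venusaur', 'Pidgeot']
print(as_numbers)  # ['Pidgeot', 'Raticate', 'Venusaur']
```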

## Using an API

In [2]:
'''
imports relevant libraries
'''
import requests
import json
import pandas as pd
import numpy as np

'''
Requests a list of pokemon names
'''
response = requests.get("https://pokeapi.co/api/v2/pokemon/")
data = response.json()
pokemon = pd.DataFrame(data["results"])
pokemon_names = pokemon['name'].tolist()

'''
Goes through each name and collects the base stats for it
'''
name_and_stats = {}
for name in pokemon_names:
    pokemon_data = requests.get(f"https://pokeapi.co/api/v2/pokemon/{name}")
    pokemon_data = pokemon_data.json()
    stats = {i['stat']['name']: i['base_stat'] for i in pokemon_data['stats']}
    name_and_stats[name] = stats

'''
Compiles into a dataframe
'''
name_and_stats = pd.DataFrame(name_and_stats).T
pokemon_df = name_and_stats.sort_values(by=['speed'], ascending=False)
print(pokemon_df)

            attack  defense  hp  special-attack  special-defense  speed
pidgeot         80       75  83              70               70    101
charizard       84       78  78             109               85    100
raticate        81       60  55              50               70     97
venusaur        82       83  80             100              100     80
charmeleon      64       58  58              80               65     80
blastoise       83      100  79              85              105     78
beedrill        90       40  65              45               80     75
rattata         56       35  30              25               35     72
pidgeotto       60       55  63              50               50     71
butterfree      45       50  60              90               80     70
charmander      52       43  39              60               50     65
ivysaur         62       63  60              80               80     60
wartortle       63       80  59              65               80
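
The list endpoint used above is paginated. To my understanding it returns 20 names per response by default, plus a `next` URL pointing at the following page (or `null` on the last one). Here is a hedged sketch of following those links. `fetch_json` is a hypothetical stand-in for something like `requests.get(url).json()`, the name `iter_pokemon_names` is my own, and the two fake pages exist only so the example runs offline.

```python
from typing import Callable, Iterator, Optional


def iter_pokemon_names(fetch_json: Callable[[str], dict], start_url: str,
                       max_pages: Optional[int] = None) -> Iterator[str]:
    """Yield names page by page, following each response's "next" link."""
    url, pages = start_url, 0
    while url and (max_pages is None or pages < max_pages):
        page = fetch_json(url)      # assumed shape: {"results": [...], "next": ...}
        for entry in page["results"]:
            yield entry["name"]
        url = page.get("next")      # None on the last page
        pages += 1


# Offline stand-in for the API: two fake pages keyed by URL (illustrative data only).
fake_pages = {
    "page-1": {"results": [{"name": "bulbasaur"}, {"name": "ivysaur"}], "next": "page-2"},
    "page-2": {"results": [{"name": "venusaur"}], "next": None},
}

names = list(iter_pokemon_names(fake_pages.__getitem__, "page-1"))
print(names)  # ['bulbasaur', 'ivysaur', 'venusaur']
```

With the real API, `fetch_json` could be `lambda url: requests.get(url).json()` and `start_url` the list URL from the cell above. `max_pages` keeps the number of requests bounded.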

## Thoughts

This was a quick project to explore web scraping with Beautiful Soup. The code above took several iterations to get right. Along the way I refreshed my memory on reading HTML, converting JSON and parsing incoming data.

Collecting the data through the API was much easier. The API was well documented, and specific data could be requested without much cleaning or parsing. In the code above, the API version needed about half as many lines as the web scraper to collect the same data.
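
To illustrate how little parsing the API response needs, here is the extraction step from the loop above as a small sketch on its own. The function name `extract_base_stats` is my own, and the sample payload is a hand-written, trimmed-down stand-in with the same `stats` layout the loop relies on. Its values are Charmander's from the API table above.

```python
def extract_base_stats(pokemon_json: dict) -> dict:
    """Map each stat name to its base value, as the loop above does."""
    return {item["stat"]["name"]: item["base_stat"] for item in pokemon_json["stats"]}


# Hand-written sample in the shape the code above expects (not a full API response).
sample = {
    "name": "charmander",
    "stats": [
        {"base_stat": 39, "stat": {"name": "hp"}},
        {"base_stat": 52, "stat": {"name": "attack"}},
        {"base_stat": 43, "stat": {"name": "defense"}},
        {"base_stat": 60, "stat": {"name": "special-attack"}},
        {"base_stat": 50, "stat": {"name": "special-defense"}},
        {"base_stat": 65, "stat": {"name": "speed"}},
    ],
}

stats = extract_base_stats(sample)
print(stats["speed"])  # 65
```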

But that isn't to say that the web scraper is inferior. An API is not always available, and a web scraper is useful because it enables a user to collect information from what they can see on the web. What you can see on a webpage, you can likely collect.

In this instance, it might be useful to combine the API and the web scraper, cross-referencing Pokémon and their stats from both sources to identify any missing data or discrepancies.
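
A rough sketch of what that cross-referencing could look like. The two sources label things differently (capitalised names and "Sp. Atk" on the scraped page, lowercase names and "special-attack" from the API), so the keys are normalised first. `COLUMN_MAP` and `find_discrepancies` are illustrative names of my own. The sample rows are copied from the two tables above, except that Rattata's scraped HP has been changed from 30 to 31 on purpose to show a mismatch.

```python
# Scraped column labels mapped to the API's stat names (assumed one-to-one).
COLUMN_MAP = {
    "HP": "hp", "Attack": "attack", "Defense": "defense",
    "Sp. Atk": "special-attack", "Sp. Def": "special-defense", "Speed": "speed",
}


def find_discrepancies(scraped: dict, api: dict) -> list:
    """Return (pokemon, stat, scraped_value, api_value) for every mismatch or gap."""
    problems = []
    for name, stats in scraped.items():
        api_stats = api.get(name.lower())
        if api_stats is None:
            problems.append((name.lower(), None, None, None))  # missing from the API data
            continue
        for label, value in stats.items():
            key = COLUMN_MAP[label]
            if int(value) != api_stats.get(key):
                problems.append((name.lower(), key, int(value), api_stats.get(key)))
    return problems


# Rattata's scraped HP is deliberately altered (31 instead of 30) for the demo.
scraped = {"Raticate": {"Speed": "97", "HP": "55"}, "Rattata": {"Speed": "72", "HP": "31"}}
api = {"raticate": {"speed": 97, "hp": 55}, "rattata": {"speed": 72, "hp": 30}}

print(find_discrepancies(scraped, api))  # [('rattata', 'hp', 31, 30)]
```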

\* This was a small project that did not overload the servers. To avoid being blacklisted and to maintain appropriate web etiquette, it is important to read the site's terms and conditions, check for a robots.txt file, avoid overloading the servers, and add some randomness to the scraper's request timing to avoid detection.
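
As a hedged illustration of two of these points, the sketch below checks URLs against robots.txt rules with the standard library's `urllib.robotparser` and computes a randomised pause between requests. The rules shown are a made-up example, not the actual robots.txt of any site, and `polite_delay` is a hypothetical helper name.

```python
import random
import urllib.robotparser

# Made-up robots.txt rules for illustration; a real scraper would load the
# site's own file with parser.set_url(...) followed by parser.read().
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(example_rules)


def polite_delay(base: float = 1.0, jitter: float = 2.0) -> float:
    """Return a pause of `base` seconds plus a random extra of up to `jitter` seconds."""
    return base + random.uniform(0, jitter)


allowed = parser.can_fetch("Mozilla/5.0", "https://example.com/pokedex/bulbasaur")
blocked = parser.can_fetch("Mozilla/5.0", "https://example.com/private/page")
print(allowed, blocked)  # True False

# Between requests: time.sleep(polite_delay())
```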