# Project Milestone 3

## DSC 540

## Weeks 7 and 8

## Data Preparation Assignment Weeks 7 and 8

## David Berberena

## 5/5/2024

In [1]:
# To convert the HTML web page data into a Pandas DataFrame that is usable in Python, I need to import the following 
# libraries: requests (to fetch the web page from its URL), bs4's BeautifulSoup (to parse the HTML and extract data from 
# it), and Pandas (to convert the parsed HTML data into a DataFrame). 

import requests
from bs4 import BeautifulSoup
import pandas as pd

# The URL of the web page containing the data needed for this project is stored in a variable to be used for a GET request.

web_page = "https://pokemon.fandom.com/wiki/List_of_Pok%C3%A9mon_according_to_base_statistics"
response = requests.get(web_page)

# BeautifulSoup can now be used on the response textual data with the html.parser argument added so BeautifulSoup knows how 
# to handle the data being provided.

soup = BeautifulSoup(response.text, 'html.parser')

# Using the "Inspect" function on the web page itself, I was able to see the class of the table I need. Now I will search 
# for that table by using the find() function on the soup data.

table = soup.find('table', class_='sortable')

# The prettify() function is used on the table so I can see its contents and extract the necessary data to create a clean 
# DataFrame.

print(table.prettify())

<table class="sortable list wikitable" style="margin:auto; text-align:center;">
 <tbody>
  <tr>
   <th>
    #
   </th>
   <th width="20%">
    Pokémon
   </th>
   <th width="80px">
    <small>
     PS
    </small>
   </th>
   <th width="80px">
    <small>
     Attack
    </small>
   </th>
   <th width="80px">
    <small>
     Defense
    </small>
   </th>
   <th width="80px">
    <small>
     Special attack
    </small>
   </th>
   <th width="80px">
    <small>
     Special defense
    </small>
   </th>
   <th width="80px">
    <small>
     Speed
    </small>
   </th>
   <th width="70px">
    <small>
     Average
    </small>
   </th>
   <th width="50px">
    <small>
     σ
    </small>
   </th>
   <th width="70px">
    <small>
     Total
    </small>
   </th>
  </tr>
  <tr>
   <th>
    001
   </th>
   <td>
    <a href="/wiki/Bulbasaur" title="Bulbasaur">
     Bulbasaur
    </a>
   </td>
   <td>
    45
   </td>
   <td>
    49
   </td>
   <td>
    49
   </td>
   <td>
    65
   </td>
   

In [2]:
# To build my DataFrame, I first need to extract the header names from the table so I can use them as column names once the 
# cell data is obtained. The list comprehension below locates every 'th' element (which contains a header name) within the 
# table body ('tbody') and extracts its text with getText(), using strip() to remove the surrounding whitespace.

pokemon_headers = [th.getText().strip() for th in table.find('tbody').find_all('th')]

# In this particular table, the values in the first table column are also marked up as 'th' elements, so the list holds 
# them in addition to the actual headers. I filtered the extra data out by slicing off the first 11 entries, which are the 
# real headers.

pokemon_headers = pokemon_headers[:11]
pokemon_headers

['#',
 'Pokémon',
 'PS',
 'Attack',
 'Defense',
 'Special attack',
 'Special defense',
 'Speed',
 'Average',
 'σ',
 'Total']

In [3]:
# Capturing the observations within the table is next. I'll start by creating an empty list for the data to be appended to. 

pokemon_data = []

# To identify where the data should be extracted from, I need to define the rows of the data, which are designated as 'tr' 
# in the soup data. I am excluding the first row as that contains the headers we just pulled out.

rows = table.find_all('tr')[1:]

# To extract the data from each cell within the rows, I have created a for loop that locates every cell in a row, extracts 
# the text within each cell, and appends the resulting row to the empty list. Because the values in the first column are 
# marked up as 'th' rather than 'td', I search for both tags here so that those values are captured along with all of the 
# other observations, which sit under the 'td' designation. Inside the loop, a list comprehension applies 
# getText().strip() to each cell and stores the extracted values in a list, which is then appended to pokemon_data.

for row in rows:
    cells = row.find_all(['th', 'td'])
    row_data = [cell.getText().strip() for cell in cells]
    pokemon_data.append(row_data)

# With the web page data extracted into a list, I can now craft the DataFrame using the list and the headers specified as 
# arguments for the pd.DataFrame() function.

gotta_catch_em_all = pd.DataFrame(pokemon_data, columns=pokemon_headers)

# I will print the DataFrame now to show the successful extraction of the web page data into Python.

gotta_catch_em_all

,#,Pokémon,PS,Attack,Defense,Special attack,Special defense,Speed,Average,σ,Total
0,001,Bulbasaur,45,49,49,65,65,45,53.00,8.64,318
1,002,Ivysaur,60,62,63,80,80,60,67.50,8.90,405
2,003,Venusaur,80,82,83,100,100,80,87.50,8.90,525
3,003,Mega Venusaur,80,100,123,122,120,80,104.17,18.75,625
4,004,Charmander,39,52,43,60,50,65,51.50,9.00,309
...,...,...,...,...,...,...,...,...,...,...,...
789,719,Diancie,50,100,150,100,150,50,100.00,40.82,600
790,719,Mega Diancie,50,160,110,160,110,110,116.67,37.27,700
791,720,Hoopa (Confined),80,110,60,150,130,70,100.00,32.66,600
792,720,Hoopa (Unbound),80,160,60,170,130,80,113.33,42.30,680
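
Before passing the extracted rows to pd.DataFrame(), it can help to confirm that every row has as many cells as there are headers, since a short row may be padded silently and a long row raises an error. The following is a minimal sketch of such a check; the helper name `find_ragged_rows` and the truncated second row are assumptions made for illustration and are not part of the scraped data.

```python
def find_ragged_rows(rows, headers):
    """Return (position, cell_count) for each row whose width differs from the header count."""
    expected = len(headers)
    return [(position, len(row)) for position, row in enumerate(rows) if len(row) != expected]


# The 11 headers extracted from the table.
sample_headers = ['#', 'Pokémon', 'PS', 'Attack', 'Defense', 'Special attack',
                  'Special defense', 'Speed', 'Average', 'σ', 'Total']

sample_rows = [
    # A complete row, as extracted for Bulbasaur.
    ['001', 'Bulbasaur', '45', '49', '49', '65', '65', '45', '53.00', '8.64', '318'],
    # Hypothetical truncated row, included only to trigger the check.
    ['002', 'Ivysaur', '60'],
]

ragged = find_ragged_rows(sample_rows, sample_headers)
print(ragged)
```

An empty list means every row lines up with the headers and the DataFrame can be built as usual.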


## Data Transformation 1: Replace Headers

In [4]:
# Looking at the dataset that has been constructed from the web page, some of the header names don't properly explain the 
# information held within. So using the header names from the previous milestone's dataset along with a couple of new ones, 
# I will change the headers to better reflect the meaning of the data. This change of header names to mirror the previous 
# milestone dataset is important for the smooth merging of the datasets during the final project milestone as many of the 
# columns will now share the same header names.

better_headers = ['Pokedex_entry_number', 'Pokemon_name', 'HP_stat', 'Attack_stat', 'Defense_stat', 'Special_attack_stat', 
                  'Special_defense_stat', 'Speed_stat', 'Avg_of_stats', 'Deviation_of_stats', 'Total_stats']
gotta_catch_em_all.columns = better_headers

# The head() function is used simply to verify that the transformation of the data has been performed correctly.

gotta_catch_em_all.head()

,Pokedex_entry_number,Pokemon_name,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat,Avg_of_stats,Deviation_of_stats,Total_stats
0,1,Bulbasaur,45,49,49,65,65,45,53.0,8.64,318
1,2,Ivysaur,60,62,63,80,80,60,67.5,8.9,405
2,3,Venusaur,80,82,83,100,100,80,87.5,8.9,525
3,3,Mega Venusaur,80,100,123,122,120,80,104.17,18.75,625
4,4,Charmander,39,52,43,60,50,65,51.5,9.0,309
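
Assigning a list to `.columns` depends on the new names being in exactly the same order as the scraped headers. As a hedged alternative, the sketch below pairs each scraped header with its new name in a dictionary and uses `rename()`. The one-row DataFrame is a stand-in built from the Bulbasaur row, and `header_map` is a name introduced here for illustration.

```python
import pandas as pd

# Explicit mapping from each scraped header to the name used in this notebook.
header_map = {
    '#': 'Pokedex_entry_number',
    'Pokémon': 'Pokemon_name',
    'PS': 'HP_stat',
    'Attack': 'Attack_stat',
    'Defense': 'Defense_stat',
    'Special attack': 'Special_attack_stat',
    'Special defense': 'Special_defense_stat',
    'Speed': 'Speed_stat',
    'Average': 'Avg_of_stats',
    'σ': 'Deviation_of_stats',
    'Total': 'Total_stats',
}

# One-row stand-in for the scraped DataFrame.
sample = pd.DataFrame(
    [['001', 'Bulbasaur', '45', '49', '49', '65', '65', '45', '53.00', '8.64', '318']],
    columns=list(header_map),
)

# errors='raise' makes rename() fail loudly if an expected header is missing.
renamed = sample.rename(columns=header_map, errors='raise')
print(list(renamed.columns))
```

With a mapping, a header that moves or disappears on the web page produces an error instead of a silently mislabeled column.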


## Data Transformation 2: Remove Outlier Data

In [5]:
# Within the Pokémon universe, there is an upgrade that some Pokémon go through called Mega Evolution. This evolution is 
# temporary and causes stats to swell. As it is induced by outside influences and is not a normal part of a Pokémon's life 
# cycle, I am categorizing all Mega-evolved Pokémon as outliers. In the dataset, these Pokémon are identified by the word 
# "Mega" in their name. I will home in on this word and remove all Pokémon whose name contains it. The pattern uses word 
# boundaries (\b) so that only the standalone word "Mega" matches; a plain substring match would also remove Meganium.

mega_pokemon = gotta_catch_em_all[gotta_catch_em_all['Pokemon_name'].str.contains(r'\bMega\b', regex=True)].index
gotta_catch_em_all = gotta_catch_em_all.drop(mega_pokemon)

# The head() function is used simply to verify that the transformation of the data has been performed correctly. I will set 
# the head() function to 20 observations to view the transformation's effect on the dataset better.

gotta_catch_em_all.head(20)

,Pokedex_entry_number,Pokemon_name,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat,Avg_of_stats,Deviation_of_stats,Total_stats
0,1,Bulbasaur,45,49,49,65,65,45,53.0,8.64,318
1,2,Ivysaur,60,62,63,80,80,60,67.5,8.9,405
2,3,Venusaur,80,82,83,100,100,80,87.5,8.9,525
4,4,Charmander,39,52,43,60,50,65,51.5,9.0,309
5,5,Charmeleon,58,64,58,80,65,80,67.5,9.23,405
6,6,Charizard,78,84,78,109,85,100,89.0,11.58,534
9,7,Squirtle,44,48,65,50,64,43,52.33,8.92,314
10,8,Wartortle,59,63,80,65,80,58,67.5,9.14,405
11,9,Blastoise,79,83,100,85,105,78,88.33,10.39,530
13,10,Caterpie,45,30,35,20,20,45,32.5,10.31,195
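
When filtering on the word "Mega", the choice between a substring match and a whole-word match matters, because Meganium's name begins with the same four letters. The sketch below compares the two on a small sample. The names in the sample are chosen for illustration and the variable names are assumptions.

```python
import pandas as pd

# Small illustrative sample; Meganium is an ordinary Pokémon, not a Mega Evolution.
names = pd.DataFrame({'Pokemon_name': ['Venusaur', 'Mega Venusaur', 'Meganium', 'Mega Charizard X']})

# Plain substring match: flags anything containing the letters "Mega".
substring_hits = names['Pokemon_name'].str.contains('Mega', regex=False)

# Whole-word match: \b requires a word boundary on both sides of "Mega".
word_hits = names['Pokemon_name'].str.contains(r'\bMega\b', regex=True)

kept = names[~word_hits]
print(substring_hits.tolist())
print(word_hits.tolist())
print(kept['Pokemon_name'].tolist())
```

Only the whole-word match keeps Meganium while still removing the Mega-evolved entries.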


## Data Transformation 3: Remove Footnotes from Data

In [6]:
# Viewing the dataset's first 20 observations has revealed that the values in many columns contain bracketed numbers. The 
# web page shows why: these values carry footnote notation that was extracted along with the value. To remove these 
# footnotes, I will need to import the re library to craft and use an expression that can sift through the dataset and 
# eliminate only the footnote while leaving the value intact.

import re

# The expression needs to match only the brackets and everything inside them so that the footnote can be deleted. The 
# pattern below matches an opening bracket, any run of characters that are not a closing bracket, and the closing bracket.

expression = r'\[[^\]]*\]'

# Not every column needs to be exposed to this expression. The Pokemon_name column holds names rather than numeric values, 
# so it is excluded. The Avg_of_stats and Deviation_of_stats columns are also excluded as they do not contain any 
# footnotes. The columns to sift through are defined with a list comprehension.

footnote_columns = [col for col in gotta_catch_em_all.columns if col not in ['Pokemon_name', 'Avg_of_stats', 
                                                                             'Deviation_of_stats']]

# The removal of the footnotes takes place in the for loop below. I have applied a lambda function which uses re.sub() with 
# the regular expression above to delete the footnote from each value in the chosen columns, and strip() to remove any 
# whitespace left behind.

for col in footnote_columns:
    gotta_catch_em_all[col] = gotta_catch_em_all[col].apply(lambda x: re.sub(expression, '', str(x)).strip())
    
# The head() function is used simply to verify that the transformation of the data has been performed correctly.

gotta_catch_em_all.head(20)

,Pokedex_entry_number,Pokemon_name,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat,Avg_of_stats,Deviation_of_stats,Total_stats
0,1,Bulbasaur,45,49,49,65,65,45,53.0,8.64,318
1,2,Ivysaur,60,62,63,80,80,60,67.5,8.9,405
2,3,Venusaur,80,82,83,100,100,80,87.5,8.9,525
4,4,Charmander,39,52,43,60,50,65,51.5,9.0,309
5,5,Charmeleon,58,64,58,80,65,80,67.5,9.23,405
6,6,Charizard,78,84,78,109,85,100,89.0,11.58,534
9,7,Squirtle,44,48,65,50,64,43,52.33,8.92,314
10,8,Wartortle,59,63,80,65,80,58,67.5,9.14,405
11,9,Blastoise,79,83,100,85,105,78,88.33,10.39,530
13,10,Caterpie,45,30,35,20,20,45,32.5,10.31,195
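
One direct way to express "delete the brackets and everything inside them" is a substitution rather than a search for the digits to keep. The sketch below is a minimal example of that idea. The input strings are hypothetical, since the exact footnote notation on the web page is not shown in the output above, and the helper name `strip_footnote` is an assumption.

```python
import re

# Matches "[", then any characters other than "]", then "]".
FOOTNOTE = re.compile(r'\[[^\]]*\]')


def strip_footnote(value):
    """Remove bracketed footnote markers from a value and trim leftover whitespace."""
    return FOOTNOTE.sub('', str(value)).strip()


# Hypothetical inputs: two values with footnotes, one decimal, one name.
for raw in ['45[1]', '100 [2]', '53.00', 'Bulbasaur']:
    print(repr(raw), '->', repr(strip_footnote(raw)))
```

Because the substitution only touches bracketed text, decimal values and names pass through unchanged.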


## Data Transformation 4: Assert the Correct Column Data Types

In [7]:
# Even though I corrected the dataset values in the last transformation, there is another issue that I need to rectify. The 
# values scraped from the web page were extracted as text, and removing the footnotes also returned each value as a 
# string, so every column currently holds string values. To verify that the values are indeed strings, I will print the 
# dataset columns' types using the dtypes attribute.

gotta_catch_em_all.dtypes

Pokedex_entry_number    object
Pokemon_name            object
HP_stat                 object
Attack_stat             object
Defense_stat            object
Special_attack_stat     object
Special_defense_stat    object
Speed_stat              object
Avg_of_stats            object
Deviation_of_stats      object
Total_stats             object
dtype: object

In [8]:
# As I was correct, I will change the columns back to their intended data types so the values within them can be properly 
# manipulated. Passing a dictionary of column names and types to astype() converts all of the columns in a single call.

column_types = {
    'Pokedex_entry_number': int,
    'Pokemon_name': str,
    'HP_stat': int,
    'Attack_stat': int,
    'Defense_stat': int,
    'Special_attack_stat': int,
    'Special_defense_stat': int,
    'Speed_stat': int,
    'Avg_of_stats': float,
    'Deviation_of_stats': float,
    'Total_stats': int,
}
gotta_catch_em_all = gotta_catch_em_all.astype(column_types)

# I will print the dtypes to make sure they have been transformed correctly.

gotta_catch_em_all.dtypes

Pokedex_entry_number      int32
Pokemon_name             object
HP_stat                   int32
Attack_stat               int32
Defense_stat              int32
Special_attack_stat       int32
Special_defense_stat      int32
Speed_stat                int32
Avg_of_stats            float64
Deviation_of_stats      float64
Total_stats               int32
dtype: object
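
A call to astype(int) stops with an error at the first value it cannot convert, which does not show how many other values have the same problem. As a hedged sketch, pd.to_numeric() with errors='coerce' can be used first to list every value that would fail. The sample Series below is hypothetical and exists only to demonstrate the check.

```python
import pandas as pd

# Hypothetical column containing two values that cannot be read as numbers.
raw = pd.Series(['45', '49', 'n/a', '65[1]'], name='HP_stat')

# errors='coerce' turns unparseable values into NaN instead of raising.
converted = pd.to_numeric(raw, errors='coerce')

# Select the original strings at the positions that failed to convert.
problems = raw[converted.isna()]
print(problems.tolist())
```

If `problems` is empty, the column can be converted with astype() without surprises.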

In [9]:
# I will also print the dataset to make sure the data has not been changed with the change of dtypes. 

gotta_catch_em_all.head(20)

,Pokedex_entry_number,Pokemon_name,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat,Avg_of_stats,Deviation_of_stats,Total_stats
0,1,Bulbasaur,45,49,49,65,65,45,53.0,8.64,318
1,2,Ivysaur,60,62,63,80,80,60,67.5,8.9,405
2,3,Venusaur,80,82,83,100,100,80,87.5,8.9,525
4,4,Charmander,39,52,43,60,50,65,51.5,9.0,309
5,5,Charmeleon,58,64,58,80,65,80,67.5,9.23,405
6,6,Charizard,78,84,78,109,85,100,89.0,11.58,534
9,7,Squirtle,44,48,65,50,64,43,52.33,8.92,314
10,8,Wartortle,59,63,80,65,80,58,67.5,9.14,405
11,9,Blastoise,79,83,100,85,105,78,88.33,10.39,530
13,10,Caterpie,45,30,35,20,20,45,32.5,10.31,195


## Data Transformation 5: Remove Variant Duplicates

In [10]:
# As the Pokémon video games have evolved over the years, regional variants and different forms of the same Pokémon have 
# been introduced. For my analysis, I wish to work only with the original form of each Pokémon. So I will drop the 
# duplicate Pokémon using drop_duplicates() on only the Pokedex_entry_number variable using the subset argument, as each 
# variant shares the same Pokedex number as the original. By default, drop_duplicates() keeps the first row for each 
# number, so this relies on the original form being listed before its variants.

gotta_catch_em_all = gotta_catch_em_all.drop_duplicates(subset=['Pokedex_entry_number'])

# The tail() function is used simply to verify that the transformation of the data has been performed correctly. I have 
# chosen to use tail() instead of head() here since there were more variants at the bottom of the dataset than at the top.

gotta_catch_em_all.tail(20)

,Pokedex_entry_number,Pokemon_name,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat,Avg_of_stats,Deviation_of_stats,Total_stats
766,702,Dedenne,67,58,57,81,67,101,71.83,15.24,431
767,703,Carbink,50,50,150,50,150,50,83.33,47.14,500
768,704,Goomy,45,50,35,55,75,40,50.0,12.91,300
769,705,Sliggoo,68,75,53,83,113,60,75.33,19.43,452
770,706,Goodra,90,100,70,110,150,80,100.0,25.82,600
771,707,Klefki,57,80,91,80,87,75,78.33,10.86,470
772,708,Phantump,43,70,48,50,60,38,51.5,10.67,309
773,709,Trevenant,85,110,76,65,82,56,79.0,17.03,474
774,710,Pumpkaboo (small size),44,66,70,44,55,56,55.83,9.87,335
778,711,Gourgeist (small size),55,85,122,58,75,99,82.33,23.28,494
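
Since drop_duplicates() keeps whichever row appears first for each Pokedex number, it can be useful to look at the rows that share a number before dropping any of them. The sketch below uses duplicated() with keep=False for that review. The three rows are taken from the outputs shown earlier in this notebook, and the variable names are assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    'Pokedex_entry_number': [702, 720, 720],
    'Pokemon_name': ['Dedenne', 'Hoopa (Confined)', 'Hoopa (Unbound)'],
    'Total_stats': [431, 600, 680],
})

# keep=False marks every row in a duplicated group, not just the later ones.
shared = sample[sample.duplicated(subset=['Pokedex_entry_number'], keep=False)]

# keep='first' is the default behavior, written out here for clarity.
first_only = sample.drop_duplicates(subset=['Pokedex_entry_number'], keep='first')

print(shared['Pokemon_name'].tolist())
print(first_only['Pokemon_name'].tolist())
```

Reviewing `shared` shows which form survives for each number, which is worth confirming when the first listed form is not obviously the original one.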


## Ethical Implications

With the above changes made to the dataset (replacing headers, removing outliers, removing footnotes, fixing incorrect data types, and removing duplicates), I have arrived at a clean, human-readable dataset. Having completed the transformation of the initial dataset parsed from my chosen web page, I can now see that there are some ethical risks to consider when thinking about the questions I might ask of the data. In the world of Pokémon, each creature is different, as is apparent from the various statistics attached to each observation. With this dataset (just as with the previous milestone's dataset), if I wanted to see which Pokémon had the highest total stats, the "finished product" would not support a fair evaluation of the observations, because Pokémon with alternate forms have been removed. To mitigate the unintentional bias I have created against those Pokémon, I would have to bring back both the regional and other variants and the Mega-evolved Pokémon. Some of the variant Pokémon have different stats than their original counterparts, so to be impartial, I would need to consider their stat totals also. Mega-evolved Pokémon were removed because I declared them outliers, yet since the answer to this question is itself an outlier (the single highest value), I would need to reintroduce this group into the final dataset. Without bringing these Pokémon back into the picture, I would be favoring the Pokémon that are easier to work with in the dataset rather than considering all Pokémon holistically.

Regarding the data itself, I can manipulate it freely with little to no worry about legal issues, other than the fact that all of the observations within the dataset belong to the Pokémon Company. Because the stats originate from the company itself, I consider them verified as accurate, and they have been widely published for the public to access. In this case, the dataset was harvested from a public web page that draws on the Pokémon Company's data. The page reports that data accurately and adds useful columns that allow Pokémon to be analyzed in a different way. With the dataset providing new features, I can now ethically expand on the question of which Pokémon has the highest combined stats by also asking which Pokémon has the highest average stats. I consider this an ethical approach because the new feature (the average column) simply applies the arithmetic mean formula to the already verified Pokémon stats.
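
The effect of the removals on the "highest total stats" question can be illustrated with a few rows from the end of the scraped table. The sketch below is only an illustration on four rows, not an analysis of the full dataset. It assumes a whole-word match on "Mega" for the outlier filter, and the variable names are introduced here.

```python
import pandas as pd

# Four rows taken from the tail of the scraped table shown earlier.
rows = pd.DataFrame({
    'Pokedex_entry_number': [719, 719, 720, 720],
    'Pokemon_name': ['Diancie', 'Mega Diancie', 'Hoopa (Confined)', 'Hoopa (Unbound)'],
    'Total_stats': [600, 700, 600, 680],
})

# Apply the same two removals used in this notebook: Mega forms, then variant duplicates.
is_mega = rows['Pokemon_name'].str.contains(r'\bMega\b', regex=True)
filtered = rows[~is_mega].drop_duplicates(subset=['Pokedex_entry_number'])

# Highest total before and after the removals.
top_all = rows.loc[rows['Total_stats'].idxmax(), 'Pokemon_name']
top_all_total = rows['Total_stats'].max()
top_filtered_total = filtered['Total_stats'].max()

print(top_all, top_all_total)
print(filtered['Pokemon_name'].tolist(), top_filtered_total)
```

In this small sample the highest total belongs to a removed form, so the answer to the question changes depending on whether the removals are applied.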

While transforming the data from the web page, I had assumed that each column of data would already have the correct data type attached to it. However, upon working with the data, that was not the case. Even worse, I actually created a risk during my transformations that I then had to address. In my efforts to clean the data properly, I risked improper further data analysis by converting the values in the cleaned columns to the string type. I had to perform another transformation to fix that side effect of my previous transformation, which goes to show that some transformations look good on the surface yet can cause problems down the road if every aspect of the data is not verified as correct.

Using web scraping via bs4 in Python to acquire the data from my chosen web page may raise some ethical concerns, especially since many websites prohibit web scraping, limit it to certain types of information, or require express permission from the site's owners. My data, however, is widely and publicly accessible, and scraping it poses no threat to those obtaining it or to the web page it is scraped from, so I consider the question of ethics a moot point here.

Looking forward, I am starting to see how my datasets can be merged together to create one cohesive, human-readable dataset that encompasses the best features from each of the datasets I will have by the end of the next milestone. The human-readable dataset I have cleaned and transformed is below.

In [11]:
gotta_catch_em_all.head(20)

,Pokedex_entry_number,Pokemon_name,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat,Avg_of_stats,Deviation_of_stats,Total_stats
0,1,Bulbasaur,45,49,49,65,65,45,53.0,8.64,318
1,2,Ivysaur,60,62,63,80,80,60,67.5,8.9,405
2,3,Venusaur,80,82,83,100,100,80,87.5,8.9,525
4,4,Charmander,39,52,43,60,50,65,51.5,9.0,309
5,5,Charmeleon,58,64,58,80,65,80,67.5,9.23,405
6,6,Charizard,78,84,78,109,85,100,89.0,11.58,534
9,7,Squirtle,44,48,65,50,64,43,52.33,8.92,314
10,8,Wartortle,59,63,80,65,80,58,67.5,9.14,405
11,9,Blastoise,79,83,100,85,105,78,88.33,10.39,530
13,10,Caterpie,45,30,35,20,20,45,32.5,10.31,195


In [12]:
gotta_catch_em_all.to_csv('all_pokemon_web_page.csv', index=False)
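
After exporting, a quick round-trip check can confirm that the CSV reads back with the same columns and values. The sketch below writes to an in-memory buffer instead of a file so that it runs on its own. The two-row DataFrame is a stand-in for the cleaned dataset, and only a few of its columns are included.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned dataset, using values shown earlier in this notebook.
cleaned = pd.DataFrame({
    'Pokedex_entry_number': [1, 2],
    'Pokemon_name': ['Bulbasaur', 'Ivysaur'],
    'Avg_of_stats': [53.0, 67.5],
    'Total_stats': [318, 405],
})

# Write without the index, as in the export above, then read the text back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# check_dtype=False tolerates platform differences such as int32 versus int64.
pd.testing.assert_frame_equal(cleaned, reloaded, check_dtype=False)
print(reloaded.dtypes)
```

If the comparison raises an AssertionError, the export or the data types are worth a second look before the next milestone's merge.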