# Project Milestone 2
## DSC 540
## Weeks 5 and 6
## Data Preparation Assignment Weeks 5 and 6
## David Berberena
## 4/21/2024

In [1]:
# I have imported the necessary libraries for the data transformations I plan to perform on the dataset.

import pandas as pd
import numpy as np

# The reading and initial display of the flat file dataset is performed below.

pokemon = pd.read_csv('pokemon.csv')
pokemon.head()

,Unnamed: 0,image_url,Id,Names,Type1,Type2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,0,https://img.pokemondb.net/sprites/sword-shield...,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,1,https://img.pokemondb.net/sprites/sword-shield...,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,2,https://img.pokemondb.net/sprites/sword-shield...,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,3,https://img.pokemondb.net/sprites/sword-shield...,3,Venusaur Mega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,4,https://img.pokemondb.net/sprites/sword-shield...,4,Charmander,Fire,,309,39,52,43,60,50,65


## Data Transformation 1: Remove Unnecessary Columns

In [2]:
# In the initial dataset, the 'Unnamed: 0' column (which appears to be a leftover row index) and the 'image_url' column are
# not relevant to any analysis I wish to do, so I will remove both in a single call to the drop() method.

pokemon = pokemon.drop(columns=['Unnamed: 0', 'image_url'])

# The head() function is used simply to verify that the transformation of the data has been performed correctly.

pokemon.head()

,Id,Names,Type1,Type2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,3,Venusaur Mega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,4,Charmander,Fire,,309,39,52,43,60,50,65
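
A column named 'Unnamed: 0' usually appears when a CSV was saved with its row index and the index header was left blank. The sketch below assumes that is what happened here: it rebuilds three rows from the preview above as an in-memory CSV (the blank header and the shortened image URL placeholder are assumptions) and shows an alternative to dropping the column, namely reading it as the index with `index_col=0`.

```python
import io
import pandas as pd

# Stand-in for pokemon.csv: three rows from the preview above, with a blank
# first header (assumed) and a placeholder instead of the real image URLs.
csv_text = (
    ",image_url,Id,Names,Type1,Type2,Total\n"
    "0,sprite_url_placeholder,1,Bulbasaur,Grass,Poison,318\n"
    "1,sprite_url_placeholder,2,Ivysaur,Grass,Poison,405\n"
    "2,sprite_url_placeholder,3,Venusaur,Grass,Poison,525\n"
)

# Default read: pandas labels the blank header 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.columns[0])

# Alternative: use the first column as the row index, then drop image_url.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
indexed_read = indexed_read.drop(columns=['image_url'])
print(list(indexed_read.columns))
```

Either route leaves the same analysis columns; reading the column as the index simply avoids creating it as data in the first place.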


## Data Transformation 2: Replace Headers

In [3]:
# Readers familiar with the Pokémon universe can work out what the original headers mean, but clearer headers will make
# the data easier to understand for readers who are not.

changed_headers = ['Pokedex_entry_number', 'Pokemon_name', 'Primary_type', 'Secondary_type', 'Total_stats', 'HP_stat', 
                  'Attack_stat', 'Defense_stat', 'Special_attack_stat', 'Special_defense_stat', 'Speed_stat']
pokemon.columns = changed_headers

# The head() function is used simply to verify that the transformation of the data has been performed correctly.

pokemon.head()

,Pokedex_entry_number,Pokemon_name,Primary_type,Secondary_type,Total_stats,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
3,3,Venusaur Mega Venusaur,Grass,Poison,625,80,100,123,122,120,80
4,4,Charmander,Fire,,309,39,52,43,60,50,65
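
Assigning a list to `columns` relies on the list being in exactly the same order as the existing headers. As a hedged alternative, the sketch below renames by explicit old-to-new mapping with `rename()`, so each new header is tied to the column it replaces. The two-row frame is copied from the preview above, and the name `header_map` is only illustrative.

```python
import pandas as pd

# Two rows copied from the preview above, using the original headers.
sample = pd.DataFrame({
    'Id': [1, 2],
    'Names': ['Bulbasaur', 'Ivysaur'],
    'Type1': ['Grass', 'Grass'],
    'Type2': ['Poison', 'Poison'],
    'Total': [318, 405],
    'HP': [45, 60],
    'Attack': [49, 62],
    'Defense': [49, 63],
    'Sp. Atk': [65, 80],
    'Sp. Def': [65, 80],
    'Speed': [45, 60],
})

# Hypothetical mapping from each original header to its clearer replacement.
header_map = {
    'Id': 'Pokedex_entry_number',
    'Names': 'Pokemon_name',
    'Type1': 'Primary_type',
    'Type2': 'Secondary_type',
    'Total': 'Total_stats',
    'HP': 'HP_stat',
    'Attack': 'Attack_stat',
    'Defense': 'Defense_stat',
    'Sp. Atk': 'Special_attack_stat',
    'Sp. Def': 'Special_defense_stat',
    'Speed': 'Speed_stat',
}

renamed = sample.rename(columns=header_map)
print(list(renamed.columns))
```

With a mapping, a column that is reordered or missing in a future version of the file cannot silently receive the wrong header.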


## Data Transformation 3: Remove Outlier Data

In [4]:
# Within the Pokémon universe, some Pokémon can undergo an upgrade called Mega Evolution. This evolution is temporary and
# causes stats to swell. Because it is induced by outside influences rather than being a normal part of a Pokémon's life
# cycle, I am categorizing all Mega-evolved Pokémon as outliers. In the dataset, these Pokémon are identified by the word
# "Mega" in their name. I will home in on this word and remove all Pokémon with "Mega" in their name. The pattern matches
# "Mega" only as a whole word, so a name that merely begins with those letters (such as Meganium) is kept.

mega_outliers = pokemon[pokemon['Pokemon_name'].str.contains(r'\bMega\b', regex=True)].index
pokemon = pokemon.drop(mega_outliers)

# The head() function is used simply to verify that the transformation of the data has been performed correctly. I will set 
# the head() function to 20 observations to view the transformation's effect on the dataset better.

pokemon.head(20)

,Pokedex_entry_number,Pokemon_name,Primary_type,Secondary_type,Total_stats,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
4,4,Charmander,Fire,,309,39,52,43,60,50,65
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100
9,7,Squirtle,Water,,314,44,48,65,50,64,43
10,8,Wartortle,Water,,405,59,63,80,65,80,58
11,9,Blastoise,Water,,530,79,83,100,85,105,78
13,10,Caterpie,Bug,,195,45,30,35,20,20,45
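
One thing to watch when filtering on the word "Mega": a plain substring test also matches any name that simply starts with those letters. The sketch below compares a substring match with a whole-word match on four names. Meganium is a regular Pokémon (Pokédex number 154); that it appears in this file is an assumption, since it is not in the rows previewed here.

```python
import pandas as pd

# Two names from the previews above plus two illustrative ones.
names = pd.Series([
    'Venusaur',
    'Venusaur Mega Venusaur',
    'Meganium',                    # assumed to be present in the full file
    'Charizard Mega Charizard X',  # illustrative Mega form name
])

# Substring match: True for any name containing the letters "Mega".
substring_hits = names.str.contains('Mega', regex=False)

# Whole-word match: \b requires a word boundary on both sides of "Mega".
whole_word_hits = names.str.contains(r'\bMega\b', regex=True)

print(names[substring_hits].tolist())
print(names[whole_word_hits].tolist())
```

Only the whole-word pattern leaves Meganium in the data while still flagging the Mega-evolved forms.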


## Data Transformation 4: Remove Variant Duplicates

In [5]:
# As the Pokémon video games have evolved over the years, regional variants and different forms of the same Pokémon have
# been introduced. For my analysis, I wish to work only with the original form of each Pokémon, so I will drop the
# duplicates using drop_duplicates() with the subset argument set to Pokedex_entry_number, as each variant shares its
# Pokedex number with the original. The default keep='first' retains the first row for each number, which assumes the
# original form is listed before its variants.

pokemon = pokemon.drop_duplicates(subset=['Pokedex_entry_number'], keep='first')

# The head() function is used simply to verify that the transformation of the data has been performed correctly. I will set 
# the head() function to 20 observations to view the transformation's effect on the dataset better.

pokemon.head(20)

,Pokedex_entry_number,Pokemon_name,Primary_type,Secondary_type,Total_stats,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
4,4,Charmander,Fire,,309,39,52,43,60,50,65
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100
9,7,Squirtle,Water,,314,44,48,65,50,64,43
10,8,Wartortle,Water,,405,59,63,80,65,80,58
11,9,Blastoise,Water,,530,79,83,100,85,105,78
13,10,Caterpie,Bug,,195,45,30,35,20,20,45
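
Because the first ten rows contain no variants, the preview above looks unchanged. A quick way to see what `drop_duplicates()` would discard is to list the rows flagged by `duplicated()` first. This is a minimal sketch on a made-up three-row sample; the variant name follows the "Base Variant Base" pattern seen elsewhere in the data, but that exact spelling is an assumption.

```python
import pandas as pd

# Illustrative sample: an original form, its regional variant, and another Pokémon.
sample = pd.DataFrame({
    'Pokedex_entry_number': [19, 19, 20],
    'Pokemon_name': ['Rattata', 'Rattata Alolan Rattata', 'Raticate'],
})

# Rows that share a Pokedex number with an earlier row (these would be dropped).
would_drop = sample[sample.duplicated(subset=['Pokedex_entry_number'], keep='first')]
print(would_drop['Pokemon_name'].tolist())

# The same rule applied: only the first row per Pokedex number survives.
kept = sample.drop_duplicates(subset=['Pokedex_entry_number'], keep='first')
print(kept['Pokemon_name'].tolist())
```

Inspecting `would_drop` on the real data would confirm that every discarded row is a variant and that no original form was listed after its variant.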


## Data Transformation 5: Remove NaN Values

In [6]:
# In the video games, a Pokémon can have at most two types: a primary type and a secondary type. Not all Pokémon have two
# types, which is why observations without a secondary type show NaN. As NaN values make data analysis more difficult, I
# have decided to remove them and focus on the Pokémon that are more well-balanced by having two types. The dropna()
# method with the subset argument removes the Pokémon missing a secondary type.

pokemon = pokemon.dropna(subset=['Secondary_type'])

# The head() function is used simply to verify that the transformation of the data has been performed correctly.

pokemon.head()

,Pokedex_entry_number,Pokemon_name,Primary_type,Secondary_type,Total_stats,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100
15,12,Butterfree,Bug,Flying,395,60,45,50,90,80,70
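
Before dropping rows, it can help to count how many observations the NaN filter will remove. The sketch below does this with `isna().sum()` on four rows taken from the earlier previews; it is a small illustration, not a count for the full dataset.

```python
import numpy as np
import pandas as pd

# Four rows from the earlier previews; NaN marks a missing secondary type.
sample = pd.DataFrame({
    'Pokemon_name': ['Bulbasaur', 'Charmander', 'Charizard', 'Squirtle'],
    'Secondary_type': ['Poison', np.nan, 'Flying', np.nan],
})

# Number of rows that dropna(subset=['Secondary_type']) would remove.
missing_count = int(sample['Secondary_type'].isna().sum())
print(missing_count, 'of', len(sample), 'rows have no secondary type')

# The filter itself, keeping only dual-type Pokémon.
dual_type = sample.dropna(subset=['Secondary_type'])
print(dual_type['Pokemon_name'].tolist())
```

Running the same count on the real data shows how much of the dataset the filter discards.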


## Ethical Implications

With the above changes made to the dataset (removing unnecessary columns, replacing headers, removing outliers, removing duplicates, and removing NaN values), I have arrived at a clean, human-readable dataset. Having completed the transformations, I can now see that there are some ethical risks to consider when thinking about the questions I might ask of the data. In the world of Pokémon, each creature is different, as the variety of stats and typings shows. If I wanted to see which Pokémon had the highest total stats, the "finished product" dataset would not support a fair evaluation of the observations. In cleaning the dataset of NaN values to make further data manipulation easier, I have created a bias toward Pokémon that have both a primary and a secondary type while completely excluding Pokémon with only one type. This bias prevents impartial analysis and could produce an incomplete and potentially erroneous answer to the question of which Pokémon has the highest combined stats.

To mitigate this ethical implication, I would need to change the way I have transformed the data to reinclude the single-type Pokémon I previously filtered out. Continuing with the hypothetical question of which Pokémon has the highest combined stats, I would also have to reinclude Pokémon with regional or other variants as well as Mega-evolved Pokémon. Some of the variant Pokémon have different stats than their original counterparts, so to be impartial, I would need to consider their stat totals as well. Mega-evolved Pokémon were removed because I declared them outliers, yet since the answer to this question is itself an extreme value, I would need to reintroduce this group into the final dataset.

Regarding the data itself, I am glad that I can manipulate it freely with little to no worry of legal issues. The one consideration is that all of the observations within the dataset belong to the Pokémon Company, which also means that the data has been verified as accurate by the company itself and that it has been widely published for the public to access.
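
To make the bias concrete, here is a minimal sketch using four rows copied from the previews in this notebook (so it is only a tiny sample, not the full dataset). It asks the hypothetical question, which Pokémon has the highest total stats, before and after the three row filters. The whole-word "Mega" pattern is an assumption about how the outlier filter is written.

```python
import numpy as np
import pandas as pd

# Four rows copied from the previews in this notebook.
sample = pd.DataFrame({
    'Pokedex_entry_number': [3, 3, 6, 9],
    'Pokemon_name': ['Venusaur', 'Venusaur Mega Venusaur', 'Charizard', 'Blastoise'],
    'Secondary_type': ['Poison', 'Poison', 'Flying', np.nan],
    'Total_stats': [525, 625, 534, 530],
})

# Answer to the question on the unfiltered sample.
top_unfiltered = sample.loc[sample['Total_stats'].idxmax(), 'Pokemon_name']

# Apply the three row filters: Mega forms, variant duplicates, missing secondary type.
filtered = sample[~sample['Pokemon_name'].str.contains(r'\bMega\b', regex=True)]
filtered = filtered.drop_duplicates(subset=['Pokedex_entry_number'])
filtered = filtered.dropna(subset=['Secondary_type'])

# Answer to the same question on the filtered sample.
top_filtered = filtered.loc[filtered['Total_stats'].idxmax(), 'Pokemon_name']

print('Before filtering:', top_unfiltered)
print('After filtering: ', top_filtered)
```

On this sample the answer changes from the Mega form of Venusaur to Charizard, and single-type Blastoise is never considered at all, which is the kind of distortion described above.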

If I were to pursue the hypothetical question I have identified, I would feel confident using the original dataset, as I can be sure that the Pokémon Company has made no errors in spelling, stat information, or Pokémon typing. I would, however, need to restart the data transformation process to address a few things. The first two transformations would remain the same. Next, the NaN values would need to be filled in, either with a string that designates a Pokémon as having no secondary type or with the same string as its primary type. Then the duplicated Pokémon name within the variant observations would need to be removed so we would not see awkward names like Venusaur Mega Venusaur or Meowth Galarian Meowth. Finally, the Pokedex_entry_number column would need to be changed so that the duplicated entry numbers (signifying variant forms) become unique identifiers, such as 19A and 19B for Rattata and Alolan Rattata. As it stands, this project is solely meant to demonstrate data wrangling, so the hypotheticals mentioned have no effect on the work I have already done. The human-readable dataset I have cleaned and transformed is below.

In [7]:
pokemon.head(20)

,Pokedex_entry_number,Pokemon_name,Primary_type,Secondary_type,Total_stats,HP_stat,Attack_stat,Defense_stat,Special_attack_stat,Special_defense_stat,Speed_stat
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100
15,12,Butterfree,Bug,Flying,395,60,45,50,90,80,70
16,13,Weedle,Bug,Poison,195,40,35,30,20,20,50
17,14,Kakuna,Bug,Poison,205,45,25,50,25,25,35
18,15,Beedrill,Bug,Poison,395,65,90,40,45,80,75
20,16,Pidgey,Normal,Flying,251,40,45,40,35,35,56
21,17,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71
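
The restart described under Ethical Implications could be sketched roughly as follows. This is a hedged illustration on a made-up five-row sample, not the approach actually used in this milestone: the placeholder string `'None'`, the new column name `Pokedex_entry_id`, and the name-cleaning pattern are all assumptions, and the pattern only handles single-word base names.

```python
import numpy as np
import pandas as pd

# Illustrative sample in the renamed-header layout (variant spellings assumed).
df = pd.DataFrame({
    'Pokedex_entry_number': [3, 3, 4, 19, 19],
    'Pokemon_name': ['Venusaur', 'Venusaur Mega Venusaur', 'Charmander',
                     'Rattata', 'Rattata Alolan Rattata'],
    'Secondary_type': ['Poison', 'Poison', np.nan, np.nan, 'Normal'],
})

# Step 1: fill missing secondary types with a placeholder instead of dropping rows.
df['Secondary_type'] = df['Secondary_type'].fillna('None')

# Step 2: strip a leading base name when the same word reappears later,
# e.g. "Venusaur Mega Venusaur" becomes "Mega Venusaur".
df['Pokemon_name'] = df['Pokemon_name'].str.replace(
    r'^(\S+) (?=.*\b\1\b)', '', regex=True
)

# Step 3: give rows that share a Pokedex number a letter suffix (A, B, ...).
group = df.groupby('Pokedex_entry_number')
group_size = group['Pokedex_entry_number'].transform('size')
letters = group.cumcount().map(lambda n: chr(ord('A') + int(n)))
base = df['Pokedex_entry_number'].astype(str)
df['Pokedex_entry_id'] = base.where(group_size == 1, base + letters)

print(df[['Pokedex_entry_id', 'Pokemon_name', 'Secondary_type']])
```

Keeping the original numeric column alongside the new identifier means variants can still be grouped with their original form when that is useful.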
