# Data Cleaning

In this file, we will handle the data cleaning, which involves dealing with NaNs and incorrect data.

In [2]:
# Imports
import pandas as pd

In [4]:
# Data Imports
pokemon_data = pd.read_csv("../data/pokemon.csv")

Let's start with the NaNs by checking which columns contain them.

In [52]:
print(pokemon_data.isnull().any())

abilities            False
against_bug          False
against_dark         False
against_dragon       False
against_electric     False
against_fairy        False
against_fight        False
against_fire         False
against_flying       False
against_ghost        False
against_grass        False
against_ground       False
against_ice          False
against_normal       False
against_poison       False
against_psychic      False
against_rock         False
against_steel        False
against_water        False
attack               False
base_egg_steps       False
base_happiness       False
base_total           False
capture_rate         False
classfication        False
defense              False
experience_growth    False
height_m              True
hp                   False
japanese_name        False
name                 False
percentage_male       True
pokedex_number       False
sp_attack            False
sp_defense           False
speed                False
type1                False
t

Here we can see that the following columns have NaN values: height_m, percentage_male, type2, and weight_kg.
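
If we also want to know how many values are missing in each column, a count is more informative than a True/False flag. Here is a minimal sketch on a toy dataframe; `toy` and its values are made up for illustration and are not part of pokemon.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pokemon_data; names and values are illustrative only.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "height_m": [0.7, np.nan, 1.0],
    "type2": ["poison", None, None],
})

# Count the missing values per column, then keep only columns with at least one.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

The same two lines should work on `pokemon_data` directly.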

The easiest of these to fix is type2. Some Pokemon only have one type, so let's set type2 equal to type1 whenever a Pokemon has only one type.

In [53]:
pokemon_data['type2'] = pokemon_data['type2'].fillna(pokemon_data['type1'])
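
To see what this fill does, here is a small sketch on toy data (the dataframe `toy` and the `single_type` mask are illustrative names, not part of the notebook). `fillna` with a Series fills each gap with the value at the same index, and single-type rows can still be found afterwards because their two types match.

```python
import pandas as pd

# Toy data: the second and third rows have no secondary type.
toy = pd.DataFrame({
    "type1": ["grass", "fire", "water"],
    "type2": ["poison", None, None],
})

# Fill each missing type2 with the type1 value from the same row.
toy["type2"] = toy["type2"].fillna(toy["type1"])

# Rows where both types match are the single-type rows.
single_type = toy["type1"] == toy["type2"]
print(toy)
print(single_type)
```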

Now, let's see what is going on with height_m and weight_kg. These should not be null, as every Pokemon should have a height and weight (as with other living creatures). So let's take a look at which Pokemon are running into these issues.

In [54]:
print(pokemon_data[pokemon_data['height_m'].isnull()]["name"])

18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object


We have 20 Pokemon without a height, and I suspect that these same Pokemon also lack a weight. Looking deeper into this data (and its source), this seems to be because these Pokemon have different forms, such as Alolan Forms or alternate forms like Lycanroc Midday, Midnight and Dusk. So we need to manually set the height and weight of these Pokemon.

We'll replace the missing values with those of their regular/most common forms.
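
Before filling anything in, the suspicion that the same rows are missing both height and weight can be checked rather than assumed. Here is a minimal sketch on toy data (the dataframe and its values are illustrative); the same comparison should apply to `pokemon_data`.

```python
import numpy as np
import pandas as pd

# Toy data: rows B and D are missing both measurements.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height_m": [0.3, np.nan, 1.0, np.nan],
    "weight_kg": [3.5, np.nan, 30.0, np.nan],
})

missing_height = toy["height_m"].isnull()
missing_weight = toy["weight_kg"].isnull()

# True only if exactly the same rows are missing in both columns.
same_rows = missing_height.equals(missing_weight)
print(same_rows)

# Any row missing one value but not the other would show up here.
print(toy.loc[missing_height ^ missing_weight, "name"])
```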

In [55]:
# Rattata
pokemon_data.at[18, "height_m"] = 0.3
pokemon_data.at[18, "weight_kg"] = 3.5
# Raticate
pokemon_data.at[19, "height_m"] = 0.7
pokemon_data.at[19, "weight_kg"] = 18.5
# Raichu
pokemon_data.at[25, "height_m"] = 0.8
pokemon_data.at[25, "weight_kg"] = 30
# Sandshrew
pokemon_data.at[26, "height_m"] = 0.6
pokemon_data.at[26, "weight_kg"] = 12
# Sandslash
pokemon_data.at[27, "height_m"] = 1
pokemon_data.at[27, "weight_kg"] = 29.5
# Vulpix
pokemon_data.at[36, "height_m"] = 0.6
pokemon_data.at[36, "weight_kg"] = 9.9
# Ninetales
pokemon_data.at[37, "height_m"] = 1.1
pokemon_data.at[37, "weight_kg"] = 19.9
# Diglett
pokemon_data.at[49, "height_m"] = 0.2
pokemon_data.at[49, "weight_kg"] = 0.8
# Dugtrio
pokemon_data.at[50, "height_m"] = 0.7
pokemon_data.at[50, "weight_kg"] = 33.3
# Meowth
pokemon_data.at[51, "height_m"] = 0.4
pokemon_data.at[51, "weight_kg"] = 4.2
# Persian
pokemon_data.at[52, "height_m"] = 1
pokemon_data.at[52, "weight_kg"] = 32
# Geodude
pokemon_data.at[73, "height_m"] = 0.4
pokemon_data.at[73, "weight_kg"] = 20
# Graveler
pokemon_data.at[74, "height_m"] = 1
pokemon_data.at[74, "weight_kg"] = 105
# Golem
pokemon_data.at[75, "height_m"] = 1.4
pokemon_data.at[75, "weight_kg"] = 300
# Grimer
pokemon_data.at[87, "height_m"] = 0.9
pokemon_data.at[87, "weight_kg"] = 30
# Muk
pokemon_data.at[88, "height_m"] = 1.2
pokemon_data.at[88, "weight_kg"] = 30
# Exeggutor
pokemon_data.at[102, "height_m"] = 2
pokemon_data.at[102, "weight_kg"] = 120
# Marowak
pokemon_data.at[104, "height_m"] = 1
pokemon_data.at[104, "weight_kg"] = 45
# Hoopa
pokemon_data.at[719, "height_m"] = 0.5
pokemon_data.at[719, "weight_kg"] = 9
# Lycanroc
pokemon_data.at[744, "height_m"] = 0.8
pokemon_data.at[744, "weight_kg"] = 25
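
The cell above relies on hard-coded row indices, which would silently hit the wrong Pokemon if the row order ever changed. One possible alternative, sketched here on toy data, is to key the replacements by name. The `replacements` dictionary and the toy dataframe are illustrative; this is not how the notebook does it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pokemon_data with two rows that need filling.
toy = pd.DataFrame({
    "name": ["Rattata", "Pidgey", "Raichu"],
    "height_m": [np.nan, 0.3, np.nan],
    "weight_kg": [np.nan, 1.8, np.nan],
})

# name -> (height_m, weight_kg); the two entries reuse values from the cell above.
replacements = {
    "Rattata": (0.3, 3.5),
    "Raichu": (0.8, 30.0),
}

for name, (height, weight) in replacements.items():
    mask = toy["name"] == name  # select rows by name, not by position
    toy.loc[mask, "height_m"] = height
    toy.loc[mask, "weight_kg"] = weight

print(toy)
```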

Let's check that no height/weight values are missing now.

In [56]:
print(pokemon_data[pokemon_data['height_m'].isnull()]["name"])
print(pokemon_data[pokemon_data['weight_kg'].isnull()]["name"])

Series([], Name: name, dtype: object)
Series([], Name: name, dtype: object)


Now that that has worked, we just need to look at percentage_male. I suspect that the missing entries belong to Pokemon that cannot be male or female, i.e. they are genderless. So let's see which ones are NaN.

In [57]:
print(pokemon_data[pokemon_data['percentage_male'].isnull()]["name"])

80      Magnemite
81       Magneton
99        Voltorb
100     Electrode
119        Staryu
          ...    
796    Celesteela
797       Kartana
798      Guzzlord
799      Necrozma
800      Magearna
Name: name, Length: 98, dtype: object


As I suspected, these are genderless Pokemon. Let's change each of those NaNs to 0.

In [58]:
pokemon_data['percentage_male'] = pokemon_data['percentage_male'].fillna(0)
print(pokemon_data[pokemon_data['percentage_male'].isnull()]["name"])

Series([], Name: name, dtype: object)
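
One thing to keep in mind: after filling with 0, a genderless Pokemon looks the same as an all-female one in this column. If that distinction matters later, one option is to record a flag before filling. This is a sketch on toy data; the `is_genderless` column is a hypothetical addition, not something the dataset or this notebook defines, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: one genderless row (NaN), one all-female row (0.0), one mixed row.
toy = pd.DataFrame({
    "name": ["Magnemite", "Chansey", "Bulbasaur"],
    "percentage_male": [np.nan, 0.0, 88.1],
})

# Record which rows were missing BEFORE filling, so the information survives.
toy["is_genderless"] = toy["percentage_male"].isnull()
toy["percentage_male"] = toy["percentage_male"].fillna(0)

print(toy)
```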


Now that we've dealt with all the NaNs, we need to see if there are any wonky datatypes in our dataframe.

In [59]:
pokemon_data.dtypes

abilities             object
against_bug          float64
against_dark         float64
against_dragon       float64
against_electric     float64
against_fairy        float64
against_fight        float64
against_fire         float64
against_flying       float64
against_ghost        float64
against_grass        float64
against_ground       float64
against_ice          float64
against_normal       float64
against_poison       float64
against_psychic      float64
against_rock         float64
against_steel        float64
against_water        float64
attack                 int64
base_egg_steps         int64
base_happiness         int64
base_total             int64
capture_rate          object
classfication         object
defense                int64
experience_growth      int64
height_m             float64
hp                     int64
japanese_name         object
name                  object
percentage_male      float64
pokedex_number         int64
sp_attack              int64
sp_defense    

That's odd. capture_rate, which is the rate at which a Pokemon can be captured, is an object as opposed to an int/float. I suspect that something is wrong here. Upon further digging, we can see that the entry for the Pokemon Minior is causing this issue. We will change this value manually and then change the datatype of the column.

In [62]:
print(pokemon_data.iloc[773]["capture_rate"])

pokemon_data.at[773, "capture_rate"] = 30
pokemon_data["capture_rate"] = pd.to_numeric(pokemon_data["capture_rate"])

30
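
The "further digging" step can be sketched as follows: convert the column with `errors="coerce"`, which turns anything unparseable into NaN, and then look at the rows that failed. The toy dataframe and its malformed string are made up for illustration and do not reproduce the actual value in pokemon.csv.

```python
import pandas as pd

# Toy data: the second entry is deliberately malformed.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "capture_rate": ["45", "30 (x)255 (y)", "3"],
})

# Entries that cannot be parsed as numbers become NaN instead of raising.
as_numbers = pd.to_numeric(toy["capture_rate"], errors="coerce")

# The rows that failed to convert are the ones to inspect by hand.
bad_rows = toy[as_numbers.isnull()]
print(bad_rows)
```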


Now all our data is cleaned, so let's save it as a CSV.

In [63]:
# Run if you need a new cleaned file
# pokemon_data.to_csv("../data/pokemon_clean.csv")