# ***Kaggle case***: The Complete Pokémon Dataset

| Name                       | NIU     |
| -------------------------- | ------- |
| Levon Kesoyan Galstyan     | 1668018 |

My Kaggle project uses the dataset from [The Complete Pokemon Dataset](https://www.kaggle.com/datasets/rounakbanik/pokemon/data), which covers the Pokémon from the first seven generations, 801 rows in total (indexed 0 to 800). For each Pokémon it records its names, Pokédex number, types, base stats, type matchups, and more, across 41 columns: 40 features plus the target `is_legendary`.

The goal is to predict whether a given Pokémon is legendary based on its characteristics. To achieve this, we’ll load the dataset, split it into training and test sets, and handle its missing values.

### Let's take an initial look at our dataset and examine which features we have:

| Variable          | Definition                                                | Key                                                 | Type       |
|-------------------|-----------------------------------------------------------|-----------------------------------------------------|------------|
| `name`            | The Pokémon's English name                                |                                                     | Categorical  |
| `japanese_name`   | The Pokémon's original Japanese name                      |                                                     | Categorical  |
| `pokedex_number`  | National Pokédex entry number                             |                                                     | Numeric    |
| `percentage_male` | Percentage of the species that are male                   | Blank if genderless                                 | Numeric    |
| `type1`           | Primary type                                              | e.g., Grass, Fire                                   | Categorical  |
| `type2`           | Secondary type                                            | e.g., Flying, Poison                                | Categorical  |
| `classfication`   | Classification in the Sun and Moon Pokédex                | Column name is misspelled in the CSV                | Categorical  |
| `height_m`        | Height in meters                                          |                                                     | Numeric    |
| `weight_kg`       | Weight in kilograms                                       |                                                     | Numeric    |
| `capture_rate`    | Capture rate                                              |                                                     | Numeric    |
| `base_egg_steps`  | Steps to hatch                                            |                                                     | Numeric    |
| `abilities`       | List of abilities                                         |                                                     | Categorical  |
| `experience_growth` | Experience growth rate                                  |                                                     | Numeric    |
| `base_happiness`  | Base happiness                                            |                                                     | Numeric    |
| `against_?`       | Damage taken from each type (18 columns)                  | e.g., against_fire, against_water                   | Numeric    |
| `hp`              | Base HP                                                   |                                                     | Numeric    |
| `attack`          | Base attack                                               |                                                     | Numeric    |
| `defense`         | Base defense                                              |                                                     | Numeric    |
| `sp_attack`       | Base special attack                                       |                                                     | Numeric    |
| `sp_defense`      | Base special defense                                      |                                                     | Numeric    |
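| `speed`           | Base speed                                                |                                                     | Numeric    |
| `base_total`      | Sum of the six base stats                                 | hp + attack + defense + sp_attack + sp_defense + speed | Numeric    |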
| `generation`      | Generation introduced                                     | Values from 1 to 7                                  | Numeric    |
| `is_legendary`    | Whether the Pokémon is legendary                          | 1 = Legendary, 0 = Not Legendary                    | Binary     |


Firstly, we will import the necessary libraries:

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 50)

plt.rcParams['figure.dpi'] = 80

# import warnings
# warnings.simplefilter(action='ignore')

Next, we will load our dataset and split it into training ($80\%$) and test ($20\%$) sets.

In [8]:
df = pd.read_csv('pokemon.csv')
y = df["is_legendary"]
X = df.drop("is_legendary", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
df

Unnamed: 0,abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,against_ground,against_ice,against_normal,against_poison,against_psychic,against_rock,against_steel,against_water,attack,base_egg_steps,base_happiness,base_total,capture_rate,classfication,defense,experience_growth,height_m,hp,japanese_name,name,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
0,"['Overgrow', 'Chlorophyll']",1.0000,1.0000,1.0000,0.5000,0.5000,0.5000,2.0000,2.0000,1.0000,0.2500,1.0000,2.0000,1.0000,1.0000,2.0000,1.0000,1.0000,0.5000,49,5120,70,318,45,Seed Pokémon,49,1059860,0.7000,45,Fushigidaneフシギダネ,Bulbasaur,88.1000,1,65,65,45,grass,poison,6.9000,1,0
1,"['Overgrow', 'Chlorophyll']",1.0000,1.0000,1.0000,0.5000,0.5000,0.5000,2.0000,2.0000,1.0000,0.2500,1.0000,2.0000,1.0000,1.0000,2.0000,1.0000,1.0000,0.5000,62,5120,70,405,45,Seed Pokémon,63,1059860,1.0000,60,Fushigisouフシギソウ,Ivysaur,88.1000,2,80,80,60,grass,poison,13.0000,1,0
2,"['Overgrow', 'Chlorophyll']",1.0000,1.0000,1.0000,0.5000,0.5000,0.5000,2.0000,2.0000,1.0000,0.2500,1.0000,2.0000,1.0000,1.0000,2.0000,1.0000,1.0000,0.5000,100,5120,70,625,45,Seed Pokémon,123,1059860,2.0000,80,Fushigibanaフシギバナ,Venusaur,88.1000,3,122,120,80,grass,poison,100.0000,1,0
3,"['Blaze', 'Solar Power']",0.5000,1.0000,1.0000,1.0000,0.5000,1.0000,0.5000,1.0000,1.0000,0.5000,2.0000,0.5000,1.0000,1.0000,1.0000,2.0000,0.5000,2.0000,52,5120,70,309,45,Lizard Pokémon,43,1059860,0.6000,39,Hitokageヒトカゲ,Charmander,88.1000,4,60,50,65,fire,,8.5000,1,0
4,"['Blaze', 'Solar Power']",0.5000,1.0000,1.0000,1.0000,0.5000,1.0000,0.5000,1.0000,1.0000,0.5000,2.0000,0.5000,1.0000,1.0000,1.0000,2.0000,0.5000,2.0000,64,5120,70,405,45,Flame Pokémon,58,1059860,1.1000,58,Lizardoリザード,Charmeleon,88.1000,5,80,65,80,fire,,19.0000,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,['Beast Boost'],0.2500,1.0000,0.5000,2.0000,0.5000,1.0000,2.0000,0.5000,1.0000,0.2500,0.0000,1.0000,0.5000,0.0000,0.5000,1.0000,0.5000,1.0000,101,30720,0,570,25,Launch Pokémon,103,1250000,9.2000,97,Tekkaguyaテッカグヤ,Celesteela,,797,107,101,61,steel,flying,999.9000,7,1
797,['Beast Boost'],1.0000,1.0000,0.5000,0.5000,0.5000,2.0000,4.0000,1.0000,1.0000,0.2500,1.0000,1.0000,0.5000,0.0000,0.5000,0.5000,0.5000,0.5000,181,30720,0,570,255,Drawn Sword Pokémon,131,1250000,0.3000,59,Kamiturugiカミツルギ,Kartana,,798,59,31,109,grass,steel,0.1000,7,1
798,['Beast Boost'],2.0000,0.5000,2.0000,0.5000,4.0000,2.0000,0.5000,1.0000,0.5000,0.5000,1.0000,2.0000,1.0000,1.0000,0.0000,1.0000,1.0000,0.5000,101,30720,0,570,15,Junkivore Pokémon,53,1250000,5.5000,223,Akuzikingアクジキング,Guzzlord,,799,97,53,43,dark,dragon,888.0000,7,1
799,['Prism Armor'],2.0000,2.0000,1.0000,1.0000,1.0000,0.5000,1.0000,1.0000,2.0000,1.0000,1.0000,1.0000,1.0000,1.0000,0.5000,1.0000,1.0000,1.0000,107,30720,0,600,3,Prism Pokémon,101,1250000,2.4000,97,Necrozmaネクロズマ,Necrozma,,800,127,89,79,psychic,,230.0000,7,1
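
The split above is purely random. If legendary Pokémon turn out to be a small fraction of the rows (worth checking with `y.mean()`), a stratified split keeps that fraction the same in both subsets. The following is only a sketch on made-up data, not the split used in this notebook; the names `toy_X`, `toy_y`, `Xs_train` and so on are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with 10% positives, mimicking an imbalanced binary target.
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series([1] * 10 + [0] * 90, name="is_legendary")

# stratify=toy_y asks scikit-learn to preserve the class proportions
# of toy_y in both the training and the test subset.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Share of positives in each subset.
print(ys_train.mean(), ys_test.mean())
```

Applied to the real data, this would amount to passing `stratify=y` to the existing `train_test_split` call.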


Let's see which columns have missing values (NaN).

In [9]:
percent_nan_col = df.isna().mean(axis = 0).sort_values(ascending = False)
percent_nan_col

type2             0.4794
percentage_male   0.1223
weight_kg         0.0250
height_m          0.0250
name              0.0000
                   ...  
against_psychic   0.0000
against_rock      0.0000
against_steel     0.0000
against_water     0.0000
is_legendary      0.0000
Length: 41, dtype: float64
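
Because the printed Series is truncated, it does not show directly that only four columns contain NaNs. One way to check this is to keep only the entries with a nonzero share. The snippet below is a minimal sketch on a made-up frame; `toy`, `nan_share` and `nan_only` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "type2": ["poison", np.nan, np.nan, "flying"],
    "height_m": [0.7, np.nan, 1.1, 1.7],
    "hp": [45, 39, 58, 78],
})

# Fraction of NaNs per column, largest first.
nan_share = toy.isna().mean().sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
nan_only = nan_share[nan_share > 0]
print(nan_only)
```

On the real dataset the same filter would be `percent_nan_col[percent_nan_col > 0]`.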

In [11]:
pokemon_with_nan = df[df[['weight_kg', 'height_m']].isna().any(axis=1)]['name']
print(pokemon_with_nan)


18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object


In [12]:
pokemon_without_gender = df[df[['percentage_male']].isna().any(axis=1)]['name']
print(pokemon_without_gender)


80      Magnemite
81       Magneton
99        Voltorb
100     Electrode
119        Staryu
          ...    
796    Celesteela
797       Kartana
798      Guzzlord
799      Necrozma
800      Magearna
Name: name, Length: 98, dtype: object


We have NaNs in the columns 'type2', 'percentage_male', 'weight_kg' and 'height_m'. We will handle them as follows:

- A NaN in the 'type2' column means that the Pokémon has only one type. We will replace these NaNs with the string 'None'.
- A NaN in the 'percentage_male' column means that the Pokémon is genderless. We will add a binary column called 'has_gender' (1 if the Pokémon has a gender, 0 otherwise) and replace the NaNs in 'percentage_male' with 0. When 'has_gender' is 0, the value in 'percentage_male' should be ignored.
- The NaNs in 'weight_kg' and 'height_m' appear to be genuinely missing values. We will drop the rows that have NaNs in these columns, since they make up only a small share of the dataset (0.0250, i.e. 20 rows) and removing them should have little effect on our model.

In [None]:
df.dropna(inplace=True, subset=['weight_kg', 'height_m'])
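
The cell above covers only the third bullet point. The first two could be sketched as follows. This is a minimal illustration on a made-up frame, assuming that 'has_gender' is derived from whether 'percentage_male' is present; the name `toy` is hypothetical and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Magnemite"],
    "type2": ["poison", np.nan, "steel"],
    "percentage_male": [88.1, 88.1, np.nan],
})

# Single-type Pokémon: mark the absent secondary type explicitly.
toy["type2"] = toy["type2"].fillna("None")

# Genderless Pokémon: record the fact in a flag BEFORE filling the NaNs,
# otherwise the information about which rows were missing is lost.
toy["has_gender"] = toy["percentage_male"].notna().astype(int)
toy["percentage_male"] = toy["percentage_male"].fillna(0)

print(toy)
```

The order of the last two assignments matters: once the NaNs are filled with 0, a genderless Pokémon can no longer be told apart from an all-female species without the flag.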
