# Predicting Pokemon Type Using Damage Stats

Data originally from Kaggle datasets: [Pokemon with stats](https://www.kaggle.com/abcsds/pokemon).

From Kaggle's description:
> This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats.

> This are the raw attributes that are used for calculating how much damage an attack will do in the games.

![Pokemons dataset original image](assets/dataset-original.jpg)

## 1 Imports

Apart from the usual imports (i.e. numpy, pandas and matplotlib), we will be importing:

* `accuracy_score`, `precision_score` and `recall_score` from `sklearn`'s `metrics` module, for model evaluation
* `train_test_split`, a function from `sklearn.model_selection` that conveniently partitions the raw data into training and test sets
* `DecisionTreeClassifier` from `sklearn.tree`, a classifier that we will use to illustrate overfitting with and without a training and test split.

In [166]:
import numpy  as np
import pandas as pd

%matplotlib inline
from matplotlib import pyplot as plt

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## 2 About the data

We will import the Pokemon data from the `data` folder, placed in the root directory for the unit, using the `read_csv` method from `pandas`.

In [167]:
data = pd.read_csv('../data/pokemon.csv')

For convenience, we will rename all columns to upper case, so we don't have to remember what is upper or lower case in the future.

In [168]:
data.columns = data.columns.str.upper()

We will also change the index of the dataframe to be the Pokemon's name, and we will use a regular expression to make the names look nicer: it strips everything that precedes a form keyword such as `Mega` or `Primal`.

In [169]:
data = data.set_index('NAME')
data.index = data.index.str.replace(r".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)", "", regex=True)
data = data.drop(['#'], axis=1)
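
To see what the name clean-up does, here is a minimal sketch that applies the same pattern with Python's built-in `re` module instead of pandas. The helper `clean_name` and the sample names are assumptions made for illustration; they only mimic the raw style in which the dataset repeats the base name before the form.

```python
import re

# Same pattern as in the cell above: match everything that precedes a form keyword.
FORM_PATTERN = r".*(?=Mega|Primal|Origin|Therian|Land|Incarnate)"


def clean_name(raw_name):
    """Drop the repeated base name in front of a form keyword (hypothetical helper)."""
    return re.sub(FORM_PATTERN, "", raw_name)


# Sample names written in the raw style (assumed for illustration).
raw_names = ["VenusaurMega Venusaur", "GiratinaOrigin Forme", "Bulbasaur"]
cleaned_names = [clean_name(name) for name in raw_names]

print(cleaned_names)
# → ['Mega Venusaur', 'Origin Forme', 'Bulbasaur']
```

Note that the pattern removes the base name entirely, which is why a form such as `Origin Forme` shows up without the Pokemon's own name in the tables that follow.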

We are ready to take a look at the data for the first time! 

Instead of simply calling `head` on the dataframe, let's precede it with `sort_values`, to get the Top 3 most powerful Pokemon in the dataset.

In [170]:
most_powerful = data.sort_values('TOTAL', ascending=False)
most_powerful.head(n=3)

| NAME | TYPE 1 | TYPE 2 | TOTAL | HP | ATTACK | DEFENSE | SP. ATK | SP. DEF | SPEED | GENERATION | LEGENDARY |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mega Rayquaza | Dragon | Flying | 780 | 105 | 180 | 100 | 180 | 100 | 115 | 3 | True |
| Mega Mewtwo Y | Psychic | | 780 | 106 | 150 | 70 | 194 | 120 | 140 | 1 | True |
| Mega Mewtwo X | Psychic | Fighting | 780 | 106 | 190 | 100 | 154 | 100 | 130 | 1 | True |


In case of doubt, [this](https://bulbapedia.bulbagarden.net/wiki/Rayquaza_(Pok%C3%A9mon)) is a Mega Rayquaza. But what about the most powerful Pokemon by type (`TYPE 1`)?

In [171]:
most_powerful_by_type = most_powerful.drop_duplicates(subset=['TYPE 1'], keep='first')
most_powerful_by_type

| NAME | TYPE 1 | TYPE 2 | TOTAL | HP | ATTACK | DEFENSE | SP. ATK | SP. DEF | SPEED | GENERATION | LEGENDARY |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mega Rayquaza | Dragon | Flying | 780 | 105 | 180 | 100 | 180 | 100 | 115 | 3 | True |
| Mega Mewtwo Y | Psychic | | 780 | 106 | 150 | 70 | 194 | 120 | 140 | 1 | True |
| Primal Kyogre | Water | | 770 | 100 | 150 | 90 | 180 | 160 | 90 | 3 | True |
| Primal Groudon | Ground | Fire | 770 | 100 | 180 | 160 | 150 | 90 | 90 | 3 | True |
| Arceus | Normal | | 720 | 120 | 120 | 120 | 120 | 120 | 120 | 4 | True |
| Mega Metagross | Steel | Psychic | 700 | 80 | 145 | 150 | 105 | 110 | 110 | 3 | False |
| Mega Tyranitar | Rock | Dark | 700 | 100 | 164 | 150 | 95 | 120 | 71 | 2 | False |
| Origin Forme | Ghost | Dragon | 680 | 150 | 120 | 100 | 120 | 100 | 90 | 4 | True |
| Ho-oh | Fire | Flying | 680 | 106 | 130 | 90 | 110 | 154 | 90 | 2 | True |
| Xerneas | Fairy | | 680 | 126 | 131 | 95 | 131 | 98 | 99 | 6 | True |
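
The sort-then-`drop_duplicates` idiom keeps only the first, and therefore strongest, row for each type. The sketch below reproduces it on a made-up toy dataframe (the values and the single-letter index labels are invented for illustration) and cross-checks the result against a `groupby` with `idxmax`.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the Pokemon dataset).
toy = pd.DataFrame(
    {
        "TYPE 1": ["Fire", "Water", "Fire", "Water", "Grass"],
        "TOTAL": [300, 500, 450, 320, 410],
    },
    index=["a", "b", "c", "d", "e"],
)

# Sort from strongest to weakest, then keep the first row seen for each type.
strongest_first = toy.sort_values("TOTAL", ascending=False)
top_by_type = strongest_first.drop_duplicates(subset=["TYPE 1"], keep="first")

# Cross-check: index label of the row with the highest TOTAL within each type.
top_labels = toy.groupby("TYPE 1")["TOTAL"].idxmax()

print(top_by_type)
print(top_labels)
```

Both approaches select rows `b`, `c` and `e` here. The `drop_duplicates` version has the small advantage of returning the full rows already ordered by `TOTAL`.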


## 3 Pre-processing data

We will start by selecting the features we want to use and the label.

We will try to predict the Pokemon's primary type (`TYPE 1`) using three features to represent each Pokemon:

* Attack
* Defense
* Speed

In [172]:
columns = ['ATTACK', 'DEFENSE', 'SPEED', 'TYPE 1']
data_clf = data[columns]

Now, let's briefly describe the raw dataset and the selected features.

In [173]:
print("The dataset contains %s rows and %s columns." % data.shape)
print("The dataset columns are: %s."% data.columns.values)
data_clf.describe()

The dataset contains 800 rows and 11 columns.
The dataset columns are: ['TYPE 1' 'TYPE 2' 'TOTAL' 'HP' 'ATTACK' 'DEFENSE' 'SP. ATK' 'SP. DEF'
 'SPEED' 'GENERATION' 'LEGENDARY'].


| | ATTACK | DEFENSE | SPEED |
|---|---|---|---|
| count | 800.0 | 800.0 | 800.0 |
| mean | 79.00125 | 73.8425 | 68.2775 |
| std | 32.457366 | 31.183501 | 29.060474 |
| min | 5.0 | 5.0 | 5.0 |
| 25% | 55.0 | 50.0 | 45.0 |
| 50% | 75.0 | 70.0 | 65.0 |
| 75% | 100.0 | 90.0 | 90.0 |
| max | 190.0 | 230.0 | 180.0 |


Time to separate our features from the labels, and we're ready to train a simple model.

In [174]:
X = data_clf.drop(['TYPE 1'], axis=1)
y = data_clf['TYPE 1']

## 3 Training and testing a model in a single dataset

We want to use a Decision Tree classifier.

In [175]:
clf = DecisionTreeClassifier(random_state=0)

Using the same dataset for training and testing our model, we get a remarkable accuracy score!

In [176]:
model_using_all_data = clf.fit(X, y)

y_pred = model_using_all_data.predict(X)

accuracy_using_all_data = accuracy_score(y, y_pred)
print("The accuracy score of the model is: %s." % accuracy_using_all_data)

# Work on a copy so that adding the PREDICTED column does not modify `data`
results_using_all_data = data.copy()
results_using_all_data['PREDICTED'] = y_pred

failures = results_using_all_data['TYPE 1'] != results_using_all_data['PREDICTED']
results_using_all_data[failures].head(n=10)

The accuracy score of the model is: 0.965.


| NAME | TYPE 1 | TYPE 2 | TOTAL | HP | ATTACK | DEFENSE | SP. ATK | SP. DEF | SPEED | GENERATION | LEGENDARY | PREDICTED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Golbat | Poison | Flying | 455 | 75 | 80 | 70 | 65 | 75 | 90 | 1 | False | Normal |
| Poliwrath | Water | Fighting | 510 | 90 | 95 | 95 | 70 | 90 | 70 | 1 | False | Fighting |
| Espeon | Psychic | | 525 | 65 | 65 | 60 | 130 | 95 | 110 | 2 | False | Ghost |
| Grovyle | Grass | | 405 | 50 | 65 | 45 | 85 | 65 | 95 | 3 | False | Bug |
| Lotad | Water | Grass | 220 | 40 | 30 | 30 | 40 | 50 | 30 | 3 | False | Grass |
| Lombre | Water | Grass | 340 | 60 | 50 | 50 | 60 | 70 | 50 | 3 | False | Ice |
| Ludicolo | Water | Grass | 480 | 80 | 70 | 70 | 90 | 100 | 70 | 3 | False | Normal |
| Nuzleaf | Grass | Dark | 340 | 70 | 70 | 40 | 60 | 40 | 60 | 3 | False | Fighting |
| Lairon | Steel | Rock | 430 | 60 | 90 | 140 | 50 | 50 | 40 | 3 | False | Bug |
| Kecleon | Normal | | 440 | 60 | 90 | 70 | 60 | 120 | 40 | 3 | False | Bug |
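
A high score on the data the model was trained on says little by itself. As a minimal sketch of why, the example below fits the same kind of unrestricted decision tree to purely random features and random labels, where there is nothing meaningful to learn. The names `X_noise`, `y_noise` and `noise_clf` are hypothetical and the data is synthetic, not the Pokemon dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Random features and random labels: there is no real pattern to learn.
X_noise = rng.rand(200, 3)
y_noise = rng.randint(0, 5, size=200)

# An unrestricted tree keeps splitting until every training row is isolated.
noise_clf = DecisionTreeClassifier(random_state=0)
noise_clf.fit(X_noise, y_noise)

# Score the tree on the very same rows it was trained on.
train_accuracy = accuracy_score(y_noise, noise_clf.predict(X_noise))
print("Accuracy on the data the tree was trained on: %s" % train_accuracy)
```

The tree simply memorizes the training rows. In the Pokemon case, the few remaining failures are presumably rows that share identical attack, defense and speed values with a Pokemon of another type, which no tree can tell apart.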


## 4 Using training and test sets

We're going to set aside 20% of the data for testing; these rows will not be used to train our model.

In [177]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)

The training dataset contains 640 rows and 3 columns.
The test dataset contains 160 rows and 3 columns.


Now, we will use one partition for training and the other for testing or evaluating model performance on previously unseen data.

In [178]:
model_using_test_data = clf.fit(X_train, y_train)

y_pred = model_using_test_data.predict(X_test)

accuracy_using_test_data = accuracy_score(y_test, y_pred)
print("The accuracy score of the model for previously unseen data is: %s." % accuracy_using_test_data)

The accuracy score of the model for previously unseen data is: 0.1125.


## 5 Using train, validation and test split

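Even with a separate test set, there is still a risk of overfitting to it: if we keep adjusting the model based on its test score, the model ends up tuned to the test set as well.

To keep knowledge about the test set from "leaking" into the model, we may want to hold out a validation set.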

To do this in a quick-and-dirty way, we can just use `train_test_split` twice:

* We use 60% of the data for the training dataset, and leave 40% aside
* We split that 40% into validation (20%) and test (20%) sets.

Let's use this new setup to test the inclusion of a new feature: HP, or hit points.

In [179]:
columns = ['HP', 'ATTACK', 'DEFENSE', 'SPEED', 'TYPE 1']
data_clf = data[columns]

X = data_clf.drop(['TYPE 1'], axis=1)
y = data_clf['TYPE 1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

print("The training dataset contains %s rows and %s columns." % X_train.shape)
print("The validation dataset contains %s rows and %s columns." % X_val.shape)
print("The test dataset contains %s rows and %s columns." % X_test.shape)

The training dataset contains 480 rows and 4 columns.
The validation dataset contains 160 rows and 4 columns.
The test dataset contains 160 rows and 4 columns.


By partitioning our data into three sets, we drastically reduce the number of samples that can be used for training. Often, we cannot afford this.

In these cases, the common approach is to use *cross-validation*, something we will talk about in one of the upcoming units.
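
As a brief preview only, here is a minimal sketch of cross-validation using scikit-learn's `cross_val_score`. It runs on synthetic data generated with `make_classification` rather than on the Pokemon dataset, and the variable names (`X_demo`, `y_demo`, `demo_clf`) are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 4 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

demo_clf = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: each fold serves once as the held-out set
# while the model is trained on the remaining four folds.
scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)

print("Accuracy per fold: %s" % scores)
print("Mean accuracy: %s" % scores.mean())
```

Every sample is used for both training and evaluation, just never in the same fold, so no data has to be permanently set aside for validation.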

To close this example notebook, let's try to add the new feature and test it in the validation set.

In [185]:
model_using_validation_data = clf.fit(X_train, y_train)

y_pred = model_using_validation_data.predict(X_train)
accuracy_on_training_data = accuracy_score(y_train, y_pred)

print("The accuracy score of the model for the training set is: %s." % accuracy_on_training_data)

y_pred = model_using_validation_data.predict(X_val)
accuracy_on_validation_data = accuracy_score(y_val, y_pred)

print("The accuracy score of the model for the validation set is: %s." % accuracy_on_validation_data)

The accuracy score of the model for the training set is: 0.989583333333.
The accuracy score of the model for the validation set is: 0.1125.


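Adding an extra feature appears only to worsen our overfitting problem: accuracy on the training set is close to 0.99, while accuracy on the validation set stays at 0.1125. In this case we would revert the change, and use the test set only for our best model.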

Since this is an academic example, let's test the model against our test set.

In [187]:
y_pred = model_using_validation_data.predict(X_test)
accuracy_on_test_data = accuracy_score(y_test, y_pred)

print("The accuracy score of the model for the test set is: %s." % accuracy_on_test_data)

The accuracy score of the model for the test set is: 0.1375.
