# How to build a Machine Learning model with Python (and no PhD)

# Agenda
1. What is ML?
2. What is a tidy dataset?
2. An overview of the data pipeline (with Python)
3. Let's train a model

## What is ML?
Machine Learning is a way to get computers to do what you want without **explicitely** programming them, by instead feeding them **examples** of what you want.

1. Don't use ML if you can program the behaviour explicitely!
2. Don't use ML if you don't have data!

## What is ML?

Machine Learning is mostly composed of two steps:
1. Encoding your data in a high-dimensional vector space
2. Learning, which is minimising your loss function in this vector space

![Gradient descent](img/3d-gradient_descent.png)

## Tidy Datasets

Before starting any Machine Learning project, you'll need data. And in addition, you will want your data to be in the form of a **tidy dataset**: 
* Each variable forms a column and contains values
* Each observation forms a row
* Each type of observational unit forms a table

This form will enable you to perform easily data analysis, and then Machine Learning. This [blog article](http://www.jeannicholashould.com/tidy-data-in-python.html) is a must-read to understand tidy datasets. 

## In practice

1. Data wrangling: prepare your data, using Pandas (or the ETL of your choice) -> getting a tidy dataset out of your data
2. Data preprocessing: Pandas or scikit-learn
3. Model training and evaluation: scikit-learn


### The building block: NumPy

NumPy is the fundamental library for scientific computing in Python. Here I will focus on the NumPy arrays, which are a way to efficiently store numerical matrices in memory, using a C-like representation. 

<img src="img/array_vs_list.png" width="800" />
*Image from [this](http://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) blog article*

In addition to efficient storage of data, NumPy proposes a lot of so-called *universal functions* that are applied in batch (either element-wise to a whole array (map) or as a reduce operation axis-wise). It also defines vectorial and matrix operations. 

*If you really need a universal function that is not already implemented, you can always use [Numba](http://numba.pydata.org/numba-doc/dev/user/vectorize.html).*


In [1]:
import numpy as np

arr = np.array([[12, 117, 47], 
                [17, 178, 72],
                [28, 179, 79]])
np.power(arr, 2)

array([[  144, 13689,  2209],
       [  289, 31684,  5184],
       [  784, 32041,  6241]])

In [2]:
np.mean(arr, axis=0)

array([ 19., 158.,  66.])

In [3]:
np.mean(arr)

81.0

### Step one: extract, transform, load (ETL) with Pandas

NumPy arrays are great for performance, but can be tedious to handle in real world. First, your columns may not be all of the same types (numerical, categorical, datetime); then, it's error-prone to handle lines and columns by index rather than by name. 

Pandas DataFrame are built on top of NumPy arrays to alleviate these issues (and many more). It's my go-to library for ETL; when I'm done, I can convert my DataFrame to a numpy array for further computing.

In [4]:
import pandas as pd
df = pd.DataFrame(data=arr, columns=['Age', 'Size', 'Weight'], index=np.arange(1, 4))
df

Unnamed: 0,Age,Size,Weight
1,12,117,47
2,17,178,72
3,28,179,79


In [5]:
df['Age']

1    12
2    17
3    28
Name: Age, dtype: int64

In [6]:
df.mean(axis=0)

Age        19.0
Size      158.0
Weight     66.0
dtype: float64

### Step two: Machine Learning with scikit-learn

Once I have a numpy array containing all my data, I can train machine learning algorithms on it. 

scikit-learn offers good implementations of typical machine learning estimators, and a unified API to link them all. The documentation is also very well-written and will help you choose the right algorithm for your need. 

```
my_classifier = Classifier(hyperparameters)
my_classifier.fit(X_train, y_train)
my_classifier.score(X_val, y_val)
my_classifier.predict(X_test)
```

### Python ML pipeline
1. **Load** and tidy your data with pandas
3. **Split** your data between a training dataset and a test dataset -> `sklearn.model_selection.train_test_split`
4. **Clean** your data (missing values, categorical variables, etc) -> `df.fillna`, `pd.get_dummies`
5. **Extract** your numerical data as numpy array -> `df.as_matrix`
6. **Preprocess** data with scikit-learn, using a `sklearn.pipeline.Pipeline`
7. **Learn** a scikit-learn model with a `sklearn.model_selection.cross_val_score` wrapper to evaluate how well your model would generalize
8. Maybe **adjust** your model hyperparemeters -> `sklearn.model_selection.GridSearchCV`
9. Learn your model on all your training data, then **evaluate** on your test data

## Let's learn a ML model!

<img src="img/pokemon.jpg" width="200" />

We want to learn to predict the outcome of pokemon battles. 

For that, we have the results of past battles in the file `pokemon-challenge/battles_train.csv`, and some characteristics of each Pokemon can be found in the file `pokemon-challenge/pokemon.csv`.

Let's load the files with pandas and see what we have.



In [7]:
import pandas as pd
import numpy as np
battles = pd.read_csv('pokemon-challenge/battles_train.csv')
battles.head(5)

Unnamed: 0,First_pokemon,Second_pokemon,Winner
0,266,298,298
1,702,701,701
2,191,668,668
3,237,683,683
4,151,231,151


In [8]:
pokemons = pd.read_csv('pokemon-challenge/pokemon.csv', index_col=0)
pokemons.head(5)

Unnamed: 0_level_0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
5,Charmander,Fire,,39,52,43,60,50,65,1,False


In [9]:
pokemons.describe()

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,255.0,190.0,230.0,194.0,230.0,180.0,6.0


In [10]:
pokemons.describe(exclude=[np.number])

Unnamed: 0,Name,Type 1,Type 2,Legendary
count,799,800,414,800
unique,799,18,18,2
top,Goomy,Water,Flying,False
freq,1,112,97,735


Our data is split across two tables, and it is not at all *tidy*. 

We want to modify our data to have the following characteristics:
* each Pokemon battle is represented by one line
* each line contains all necessary information for prediction, ie the characteristics of both Pokemons and the outcome
* the outcome will be stored in a column 'label',  with the value 1 if the first Pokemon won, and 0 if the second Pokemon won.

Hints: 

You can compare two columns in pandas with: `df['A'] == df['B']`

Pandas allows you to do all kind of joins, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html).

In [11]:
battles['label'] = (battles['First_pokemon'] == battles['Winner']).astype(int)
df = battles.join(pokemons, on='First_pokemon').join(pokemons, on='Second_pokemon', rsuffix='_opponent')

Next we want to drop some irrelevant columns, such as indexex. What might they be?

We can first check the name of the columns of the dataframe with `df.columns`.

In [12]:
df.columns

Index(['First_pokemon', 'Second_pokemon', 'Winner', 'label', 'Name', 'Type 1',
       'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed',
       'Generation', 'Legendary', 'Name_opponent', 'Type 1_opponent',
       'Type 2_opponent', 'HP_opponent', 'Attack_opponent', 'Defense_opponent',
       'Sp. Atk_opponent', 'Sp. Def_opponent', 'Speed_opponent',
       'Generation_opponent', 'Legendary_opponent'],
      dtype='object')

In [13]:
df_ml = df.drop(['First_pokemon', 'Second_pokemon', 'Winner',
                 'Name', 'Name_opponent', 
                 'Generation', 'Generation_opponent'], axis=1)
# df_ml = df.drop(columns=['First_pokemon', 'Second_pokemon', 'Winner',
#                 'Name', 'Name_opponent', 
#                 'Generation', 'Generation_opponent'])

One more thing, we said that we wanted to encode all of our data as a numerical vector, and we still have non-numerical data (types and legendary status). 

We can check the types of all columns with `df.dtypes`.

In [14]:
df_ml.dtypes

label                  int64
Type 1                object
Type 2                object
HP                     int64
Attack                 int64
Defense                int64
Sp. Atk                int64
Sp. Def                int64
Speed                  int64
Legendary               bool
Type 1_opponent       object
Type 2_opponent       object
HP_opponent            int64
Attack_opponent        int64
Defense_opponent       int64
Sp. Atk_opponent       int64
Sp. Def_opponent       int64
Speed_opponent         int64
Legendary_opponent      bool
dtype: object

The legendary status is a binary value, we can simply cast it as an int. 

In [15]:
df_ml['Legendary'] = df_ml['Legendary'].astype(int)
df_ml['Legendary_opponent'] = df_ml['Legendary_opponent'].astype(int)

For the types, we will use the one-hot encoding, which is provided by the function [`pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html). 

A small subtetly here, each Pokemon can have up to 2 types, which we choose to encode without consideration for primary or secundary type. 

In [16]:
types = pd.get_dummies(df_ml['Type 1'], prefix='Type') + pd.get_dummies(df_ml['Type 2'], prefix='Type')

In [17]:
types_opponents = pd.get_dummies(df_ml['Type 1_opponent'], prefix='Opponent_Type') + pd.get_dummies(df_ml['Type 2_opponent'], prefix='Opponent_Type')

In [18]:
df_ml = pd.concat((df_ml, types, types_opponents), axis=1).drop(['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent'], axis=1)

That's it, we're done with the data preparation! 

If you want to use the cleaned data for the rest of the workshop, you can load it from `pokemon-challenge/cleaned_train_data.csv`.

In [19]:
df_ml = pd.read_csv('pokemon-challenge/cleaned_train_data.csv', index_col=0)

### Let's train a decision tree model

The first model we will learn is a decision tree. 

Decision trees create *if-then-else* rules in a top-down process, choosing at each level to split on the variable that optimize a criteria (minimize entropy for example). 
 

#### Decision trees : a toy example
Let's try to predict if I should take ice-cream as dessert or not. 

I have written down a few variables that could influence my choice: whereas I'm on a diet or not, whereas it's summer or not, whereas the weather is sunny, whereas strawberries are available as an alternative, and whereas I ate pasta as main dish. The last column register whereas I had ice-cream as dessert.

| Diet | Summer | Sunny | Strawberries | Pasta | Ice-cream |
|------|--------|-------|--------------|-------|-----------|
| 1    | 0      | 1     | 0            | 0     | 0         |
| 0    | 1      | 1     | 0            | 0     | 1         |
| 0    | 0      | 0     | 0            | 1     | 0         |
| 0    | 1      | 0     | 1            | 0     | 0         |
| 1    | 1      | 1     | 0            | 0     | 1         |
| 0    | 0      | 1     | 0            | 0     | 0         |
| 0    | 1      | 1     | 1            | 1     | 0         |
| 0    | 1      | 1     | 0            | 0     | 1         |


From this toy dataset, I could learn the following decision tree:

![A toy decision tree](img/dt.png)

Decision trees have the advantages to be **interpretable**, and to handle nicely multi-class classification and categorical variables. On the downside, they are quite instable and prone to **overfitting**.

In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

In order to apply sklearn algorithms, we need to translate our pandas DataFrame to numpy arrays.

In [21]:
X = df_ml.drop('label', axis=1).as_matrix()
y = df_ml['label'].as_matrix()

Since Decision Trees are so prone to overfitting, if we train our algorithm on the whole of the data, and then see how well it performs (that we call the training error), we will get a very optimistic estimate of how well the algorithm will perform in real life. 

One solution would be to split our training data into a train dataset and a validation dataset (see for example [`sklearn.model_selection.train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). 

In [22]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y)

Now, you can instantiate a `DecisionTreeClassifier` with default hyperparameter values, and train it with the method `fit(X, y)`. You can evaluate how well it did with the method `score(X, y)`. 

In [23]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
print("Training error",
      dt.score(X_train, y_train),
      "Validation error",
      dt.score(X_val, y_val))

Training error 1.0 Validation error 0.9355


We can see that with the default hyperparameters settings, the algorithm learns a tree that perfectly classify the training set, but the result is not impressive on new data (overfitting).

Another option, especially if you got little data, is to use cross-validation.

At each iteration, we use 90% of the data to train a model, and the remaining 10% to evaluate how good the model is. And we repeat that 10 times, using a different 10% to evaluate each time. 

![Cross-validation](img/crossValidation.png)

You can try it out with `sklearn.model_selection.cross_val_score`.

In [24]:
cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()

0.9370249554765596

We said that Decision Trees have the huge advantage of being interpretable. One very nice feature is that they compute feature importances, a numerical value about how important a feature (= a variable) is to predict correctly the label. 

Let's learn again a tree, and see what the feature importances are (`dt.feature_importances_`). 

In [25]:
dt = DecisionTreeClassifier()
dt.fit(X, y)
pd.DataFrame(data=dt.feature_importances_, index=df_ml.columns[1:],
             columns=['Feature importance']).sort_values(by="Feature importance", ascending=False)

Unnamed: 0,Feature importance
Speed,0.407668
Speed_opponent,0.393014
Attack_opponent,0.027161
Attack,0.023932
HP,0.011712
Sp. Atk,0.010365
Defense,0.010343
HP_opponent,0.010013
Defense_opponent,0.009273
Sp. Atk_opponent,0.007721


From the feature importances, we can see that the most important variables are speed of both Pokemons; we can train a model with only those two variables and see how well it does.

In [26]:
X_speed = df_ml[['Speed', 'Speed_opponent']].as_matrix()
cross_val_score(DecisionTreeClassifier(), X_speed, y, cv=10).mean()

0.930625174128136

Our accuracy is slightly worse, but we only use 2 variables instead of 50 ! 

### Ensemble methods: Random Forests

Decision Trees are quick to train and quite easy to understand, but they have the downside of being very brittle and unstable. Changing only one sample in your training set can lead to the learning of a completely different tree. 

Here, I would like to introduce one of my favorite trick in Machine Learning: ensemble methods. 

The intuitive idea is that of the wisdom of crowd, or of expert committees. There are mathematical proofs that for some ensemble methods (like boosting), if you have a classifier that is slightly better than random, you can build an arbitrarily accurate ensemble from them. 

For Random Forests, the base classifier is a decision tree. In order to improve the robustness of prediction, we won't use only one tree, but rather average the prediction over a whole forest. 

How do we train this forest? For each tree, we draw with replacement a training set from the original training set. In addition, to ensure diversity, at each split, we choose the best from a random subset of features. 

But we don't have to worry about the implementation details for now, we can find an implementation in scikit-learn. 

In [27]:
from sklearn.ensemble import RandomForestClassifier

In [28]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print("Training error",
      rf.score(X_train, y_train),
      "Validation error",
      rf.score(X_val, y_val))

Training error 0.996 Validation error 0.926625


In [29]:
rf_cv_scores = cross_val_score(RandomForestClassifier(), X, y, cv=10) 
dt_cv_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)

print(f'Average accuracy for Decision Trees {dt_cv_scores.mean()} with a standard deviation of {dt_cv_scores.std()}')
print(f'Average accuracy for Random Forests {rf_cv_scores.mean()} with a standard deviation of {rf_cv_scores.std()}')

Average accuracy for Decision Trees 0.9375999804828113 with a standard deviation of 0.002793708761436804
Average accuracy for Random Forests 0.924250111568757 with a standard deviation of 0.004572590728928591


For this dataset, we see that Random Forest does not perform better than Decision Tree, if we run them all with defaults parameters. 

We can also see that since scikit-learn provides a united API to interact with all ML estimators, it is tremendously easy to try out a new algorithm, and see what works best for you (in terms of accuracy, speed of execution or memory consumption). The documentation is also very useful to guide you on what usually works best. 

### Bonus step: choose hyperparameters

Until now, we have used all algorithms "out of the box", without choosing any hyperparameters. 

First, what are hyperparameters? We call hyperparameters the algorithm parameters which are not learned, but choosen by the user.

For example, in the Random Forest algorithm, we can choose how many trees we want in our forest. The default is 10, but I can increase it to 15.

In [30]:
cross_val_score(RandomForestClassifier(n_estimators=15), X, y, cv=10).mean()

0.9332002991906437

We see that the accuracy has improved. But how to know if 15 is the optimal value?

scikit-learn has a function `sklearn.model_selection.GridSearchCV` to automatise the search over a grid of parameters. Here, we will only test the impact of the number of trees in Random Forest, varying the value from 5 to 55. 

In [31]:
from sklearn.model_selection import GridSearchCV
tested_values = {'n_estimators': list(range(5, 56, 5))}
grid_search = GridSearchCV(RandomForestClassifier(), tested_values, cv=10)
grid_search.fit(X, y)
print("Best hyperparameter choice", grid_search.best_params_)
print("Averaged accuracy per run", grid_search.cv_results_['mean_test_score'])
print("Standard deviation per run", grid_search.cv_results_['std_test_score'])

Best hyperparameter choice {'n_estimators': 55}
Averaged accuracy per run [0.909375 0.9277   0.9323   0.935575 0.93935  0.939875 0.94025  0.9412
 0.94135  0.940625 0.941725]
Standard deviation per run [0.00851169 0.00745612 0.00226475 0.0031868  0.00391961 0.00402718
 0.00402483 0.0037949  0.00272123 0.00518303 0.00410262]


### Bonus step: encoding type in a different way

Spoiler alert: categorical variables are often difficult to exploit. One-of-k encoding is clumsy when a variable can take many different values, and it can break links existing between categories. 

In this dataset, the type(s) of each Pokemon are categorical. If you know the game, you know that there exists a system of type (dis)advantage. For example, Fire is weak to Water, but strong against Ice.

You can find the numerical values associated with each pair of types in the file type.csv

In [32]:
df_types = pd.read_csv('pokemon-challenge/type.csv', index_col=0)
df_types.columns.name = 'Attack type'
df_types.index.name = 'Defensive pokemon type'
df_types = df_types.replace(0, 0.01)
df_types

Attack type,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
Defensive pokemon type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Bug,1.0,1.0,1.0,1.0,1.0,0.5,2.0,2.0,1.0,0.5,0.5,1.0,1.0,1.0,1.0,2.0,1.0,1.0
Dark,2.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,0.5,1.0,1.0,1.0,1.0,1.0,0.01,1.0,1.0,1.0
Dragon,1.0,1.0,2.0,0.5,2.0,1.0,0.5,1.0,1.0,0.5,1.0,2.0,1.0,1.0,1.0,1.0,1.0,0.5
Electric,1.0,1.0,1.0,0.5,1.0,1.0,1.0,0.5,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0
Fairy,0.5,0.5,0.01,1.0,1.0,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0
Fighting,0.5,0.5,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.5,1.0,1.0
Fire,0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.5,2.0,0.5,1.0,1.0,1.0,2.0,0.5,2.0
Flying,0.5,1.0,1.0,2.0,1.0,0.5,1.0,1.0,1.0,0.5,0.01,2.0,1.0,1.0,1.0,2.0,1.0,1.0
Ghost,0.5,2.0,1.0,1.0,1.0,0.01,1.0,1.0,2.0,1.0,1.0,1.0,0.01,0.5,1.0,1.0,1.0,1.0
Grass,2.0,1.0,1.0,0.5,1.0,1.0,2.0,2.0,1.0,0.5,0.5,2.0,1.0,2.0,1.0,1.0,1.0,0.5


We replaced zeros (type immunity) with a small value, to avoid to mangage infinity (here, 100 will be infinity, which is much greater than all other values). 

Type advantages and disadvantages are combined when the Pokemon has more than one type. Also, it applies to the attack which is performed on the Pokemon, and each attack has only one type. We make the hypothesis when the attacker has more than one type that the type advantage will be the average. This is not necessarily true because, even though attack types are correlated to the Pokemon type, the Pokemon can also perform attacks of a different type (for example an Electric attack even if it's a Fire Pokemon). 

We compute the type advantage as follow.

In [33]:
def type_advantage_offensive(t1, t2, to1, to2):
    if not t2:
        if not to2:
            return df_types[t1][to1]
        return (df_types[t1][to1] * df_types[t1][to2]) 
    if not to2:
        return (df_types[t1][to1] + df_types[t2][to1]) / 2
    return (df_types[t1][to1] * df_types[t2][to1] + df_types[t1][to2] * df_types[t2][to2]) / 2.  


def type_advantage_defensive(t1, t2, to1, to2):
    tao = type_advantage_offensive(to1, to2, t1, t2)
    return  1. / type_advantage_offensive(to1, to2, t1, t2)

In [34]:
df_ml_type = df.drop(['First_pokemon', 'Second_pokemon', 
                      'Winner', 'Name', 'Name_opponent', 
                      'Generation', 'Generation_opponent'], axis=1)
df_ml_type['Legendary'] = df_ml_type['Legendary'].astype(int)
df_ml_type['Legendary_opponent'] = df_ml_type['Legendary_opponent'].astype(int)

In [35]:
types_ =  df_ml_type[['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent']].fillna(value='')
df_ml_type['Type_advantage_offensive'] = types_.fillna('').apply(lambda x: type_advantage_offensive(*x), axis=1)
df_ml_type['Type_advantage_defensive'] = types_.apply(lambda x: type_advantage_defensive(*x), axis=1)
df_ml_type = df_ml_type.drop(['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent'], axis=1)

In [36]:
df_ml_type.head()

Unnamed: 0,label,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Legendary,HP_opponent,Attack_opponent,Defense_opponent,Sp. Atk_opponent,Sp. Def_opponent,Speed_opponent,Legendary_opponent,Type_advantage_offensive,Type_advantage_defensive
0,0,50,64,50,45,50,41,0,70,70,40,60,40,60,0,0.75,0.5
1,0,91,90,72,90,129,108,1,91,129,90,72,90,108,1,2.5,1.333333
2,0,55,40,85,80,105,40,0,75,75,75,125,95,40,0,1.0,1.0
3,0,40,40,40,70,40,20,0,77,120,90,60,90,48,0,0.5,1.0
4,1,70,60,125,115,70,55,0,20,10,230,10,230,5,0,2.0,1.0


Now we can train all of our ML models as previously.

In [37]:
X_type = df_ml_type.drop('label', axis=1).as_matrix()
y_type = df_ml_type['label'].as_matrix()
cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()

0.9377249867656243

In [38]:
dt = DecisionTreeClassifier()
dt.fit(X_type, y_type)
pd.DataFrame(data=dt.feature_importances_, index=df_ml_type.columns[1:],
             columns=['Feature importance']).sort_values(by="Feature importance", ascending=False)

Unnamed: 0,Feature importance
Speed,0.410625
Speed_opponent,0.393651
Type_advantage_offensive,0.039176
Attack_opponent,0.028135
Attack,0.023503
Type_advantage_defensive,0.023084
HP_opponent,0.014308
HP,0.013558
Defense_opponent,0.010831
Defense,0.009824


In this case, the accuracy doesn't change a lot, since the speed is so much more important than any other features. 

We can nonetheless observe that the type advantage is the third most important feature, whereas in the original encoding, the information was basically useless to the algorithm.

## Now we can predict new battle outcomes

We've been interating on our test data with different algorithms and models and hyperparameters, but what if we learned how to be good on this data, and not on real life data (overfitting) ?
There are data in another dataset that we've left untouched, to simulate real life data that we really can't predict. Let's see how we perform on those.

We will need to perform the same preprocessing steps that we ran on the training data to prepare the dataset. 

In [39]:
battles_test = pd.read_csv('pokemon-challenge/battles_test.csv')
pokemons = pd.read_csv('pokemon-challenge/pokemon.csv', index_col=0)
battles_test['label'] = (battles_test['First_pokemon'] == battles_test['Winner']).astype(int)
df_test = battles_test.join(pokemons, on='First_pokemon').join(pokemons, on='Second_pokemon', rsuffix='_opponent')
df_ml_test = df_test.drop(['First_pokemon', 'Second_pokemon', 'Winner',
                           'Name', 'Name_opponent', 
                           'Generation', 'Generation_opponent'], axis=1)
df_ml_test['Legendary'] = df_ml_test['Legendary'].astype(int)
df_ml_test['Legendary_opponent'] =df_ml_test['Legendary_opponent'].astype(int)
types = pd.get_dummies(df_ml_test['Type 1'], prefix='Type') + pd.get_dummies(df_ml_test['Type 2'], prefix='Type')
types_opponents = pd.get_dummies(df_ml_test['Type 1_opponent'], prefix='Opponent_Type') + pd.get_dummies(df_ml_test['Type 2_opponent'], prefix='Opponent_Type')
df_ml_test = pd.concat((df_ml_test, types, types_opponents), axis=1).drop(['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent'], axis=1)
df_ml_test.head()

Unnamed: 0,label,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Legendary,HP_opponent,Attack_opponent,...,Opponent_Type_Ghost,Opponent_Type_Grass,Opponent_Type_Ground,Opponent_Type_Ice,Opponent_Type_Normal,Opponent_Type_Poison,Opponent_Type_Psychic,Opponent_Type_Rock,Opponent_Type_Steel,Opponent_Type_Water
0,1,80,100,123,122,120,80,0,70,90,...,0,0,0,0,0,0,0,0,0,0
1,1,90,93,55,70,55,55,0,120,100,...,0,0,0,0,0,0,0,0,0,0
2,1,70,145,88,140,70,112,0,75,80,...,0,1,0,0,0,0,0,0,0,0
3,0,35,70,55,45,55,25,0,80,70,...,0,0,0,0,0,1,0,0,0,1
4,0,60,69,95,69,95,36,0,65,65,...,0,0,0,0,0,0,0,0,0,0


In [40]:
X = df_ml.drop('label', axis=1).as_matrix()
y = df_ml['label'].as_matrix()
X_test = df_ml_test.drop('label', axis=1).as_matrix()
y_test = df_ml_test['label'].as_matrix()

In [41]:
dt = DecisionTreeClassifier()
dt.fit(X, y)
dt.score(X_test, y_test)

0.9426

In [42]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X, y)
rf.score(X_test, y_test)

0.945