# How to build a Machine Learning model with Python (and no PhD)

# Agenda
1. What is ML?
2. What is a tidy dataset?
3. An overview of the data pipeline (with Python)
4. Let's train a model

Throughout this tutorial, if you are stuck on an exercise, you can uncomment the line `# %load solutions/....` to get the answer.

## What is ML?
Machine Learning is a way to get computers to do what you want without **explicitly** programming them, by instead feeding them **examples** of what you want.

1. Don't use ML if you can program the behaviour explicitly!
2. Don't use ML if you don't have data!

## What is ML?

Machine Learning is mostly composed of two steps:
1. Encoding your data in a high-dimensional vector space
2. Learning, which is minimising your loss function in this vector space

![Gradient descent](img/3d-gradient_descent.png)
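
To make the second step more concrete, here is a minimal sketch of gradient descent on a mean squared error loss, using NumPy only. The synthetic data, the learning rate and the number of iterations are arbitrary choices made for this illustration, not values taken from the workshop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in a 2-dimensional vector space,
# generated from known weights so that we can check the result.
X_demo = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(2)        # starting point
learning_rate = 0.1    # arbitrary step size

for _ in range(200):
    residuals = X_demo @ w - y_demo
    # Gradient of the mean squared error with respect to w
    gradient = 2 * X_demo.T @ residuals / len(y_demo)
    w -= learning_rate * gradient

print(w)  # should end up close to true_w
```

Each iteration moves `w` a small step in the direction that decreases the loss, which is what the surface in the picture illustrates.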

## Tidy Datasets

Before starting any Machine Learning project, you'll need data. You will also want your data to be in the form of a **tidy dataset**:
* Each variable forms a column and contains values
* Each observation forms a row
* Each type of observational unit forms a table

This form makes it easy to perform data analysis, and then Machine Learning. This [blog article](http://www.jeannicholashould.com/tidy-data-in-python.html) is a must-read to understand tidy datasets.
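
As a small illustration, the sketch below reshapes a made-up "wide" table into a tidy one with `DataFrame.melt`. The table, its column names and its values are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so the variable
# "year" is hidden in the column names.
wide = pd.DataFrame({
    'city': ['Paris', 'Lyon'],
    '2016': [10, 4],
    '2017': [12, 5],
})

# Tidy version: one column per variable (city, year, visitors)
# and one row per observation.
tidy = wide.melt(id_vars='city', var_name='year', value_name='visitors')
tidy['year'] = tidy['year'].astype(int)
print(tidy)
```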

## In practice

1. Data wrangling: prepare your data, using Pandas (or the ETL of your choice) -> getting a tidy dataset out of your data
2. Data preprocessing: Pandas or scikit-learn
3. Model training and evaluation: scikit-learn


### The building block: NumPy

NumPy is the fundamental library for scientific computing in Python. Here I will focus on the NumPy arrays, which are a way to efficiently store numerical matrices in memory, using a C-like representation. 

<img src="img/array_vs_list.png" width="800" />

*Image from [this](http://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) blog article*

In addition to efficient storage of data, NumPy provides a lot of so-called *universal functions* that are applied in batch, either element-wise to a whole array (map) or as a reduce operation along an axis. It also defines vector and matrix operations.

*If you really need a universal function that is not already implemented, you can always use [Numba](http://numba.pydata.org/numba-doc/dev/user/vectorize.html).*


In [None]:
import numpy as np

arr = np.array([[12, 117, 47], 
                [17, 178, 72],
                [28, 179, 79]])
np.power(arr, 2)

In [None]:
np.mean(arr, axis=0)

In [None]:
np.mean(arr)

### Step one: extract, transform, load (ETL) with Pandas

NumPy arrays are great for performance, but can be tedious to handle in the real world. First, your columns may not all be of the same type (numerical, categorical, datetime); second, it's error-prone to handle rows and columns by index rather than by name.

Pandas DataFrames are built on top of NumPy arrays to alleviate these issues (and many more). Pandas is my go-to library for ETL; when I'm done, I can convert my DataFrame to a NumPy array for further computing.

In [None]:
import pandas as pd
df = pd.DataFrame(data=arr, columns=['Age', 'Size', 'Weight'], index=np.arange(1, 4))
df

In [None]:
df['Age']

In [None]:
df.mean(axis=0)

### Step two: Machine Learning with scikit-learn

Once I have a numpy array containing all my data, I can train machine learning algorithms on it. 

scikit-learn offers good implementations of typical machine learning estimators, and a unified API to bind them all. The documentation is also very well written and will help you choose the right algorithm for your needs.

```
my_classifier = Classifier(hyperparameters)
my_classifier.fit(X_train, y_train)
my_classifier.score(X_val, y_val)
my_classifier.predict(X_test)
```
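
The template above can be instantiated as follows. This is only a sketch: the data is synthetic (generated with `make_classification`), and the choice of a `DecisionTreeClassifier` with `max_depth=3` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # hyperparameters
clf.fit(X_train_demo, y_train_demo)                # learn from examples
val_accuracy = clf.score(X_val_demo, y_val_demo)   # mean accuracy on held-out data
predictions = clf.predict(X_val_demo[:5])          # predicted labels for 5 rows
print(val_accuracy, predictions)
```

Swapping in another estimator only requires changing the line that creates `clf`.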

### Python ML pipeline
1. **Load** and tidy your data with pandas
2. **Split** your data between a training dataset and a test dataset -> `sklearn.model_selection.train_test_split`
3. **Clean** your data (missing values, categorical variables, etc.) -> `df.fillna`, `pd.get_dummies`
4. **Extract** your numerical data as a NumPy array -> `df.to_numpy` (`df.as_matrix` has been removed from recent pandas versions)
5. **Preprocess** data with scikit-learn, using a `sklearn.pipeline.Pipeline`
6. **Learn** a scikit-learn model with a `sklearn.model_selection.cross_val_score` wrapper to evaluate how well your model would generalize
7. Maybe **adjust** your model hyperparameters -> `sklearn.model_selection.GridSearchCV`
8. Learn your model on all your training data, then **evaluate** on your test data
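
The steps above can be sketched end to end on a small synthetic table. Everything in this example (the columns `height`, `weight`, `team`, the way the label is generated, the grid of `max_depth` values) is made up for illustration; only the scikit-learn and pandas calls are real.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Load: a synthetic table stands in for a real file
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'height': rng.normal(170, 10, size=200),
    'weight': rng.normal(70, 8, size=200),
    'team': rng.choice(['red', 'blue'], size=200),
})
toy['label'] = (toy['height'] + rng.normal(0, 5, size=200) > 170).astype(int)
toy.loc[::25, 'weight'] = np.nan  # a few missing values

# Split
train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Clean: one-hot encode the categorical column, same columns on both sides
train = pd.get_dummies(train, columns=['team'], dtype=int)
test = pd.get_dummies(test, columns=['team'], dtype=int)
test = test.reindex(columns=train.columns, fill_value=0)

# Extract NumPy arrays
X_train_demo = train.drop('label', axis=1).to_numpy()
y_train_demo = train['label'].to_numpy()
X_test_demo = test.drop('label', axis=1).to_numpy()
y_test_demo = test['label'].to_numpy()

# Preprocess inside a Pipeline (here: fill missing values with the median)
model = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('tree', DecisionTreeClassifier(random_state=0)),
])

# Learn, with cross-validation to estimate generalization
cv_scores = cross_val_score(model, X_train_demo, y_train_demo, cv=5)

# Adjust hyperparameters
search = GridSearchCV(model, {'tree__max_depth': [2, 4, 8]}, cv=5)
search.fit(X_train_demo, y_train_demo)

# Evaluate: the best model is refit on all the training data by GridSearchCV
test_accuracy = search.score(X_test_demo, y_test_demo)
print(cv_scores.mean(), search.best_params_, test_accuracy)
```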

## Let's learn a ML model!

<img src="img/pokemon.jpg" width="200" />

We want to learn to predict the outcome of Pokemon battles.

For that, we have the results of past battles in the file `pokemon-challenge/battles_train.csv`, and some characteristics of each Pokemon can be found in the file `pokemon-challenge/pokemon.csv`.

Let's load the files with pandas and see what we have.



In [None]:
import pandas as pd
import numpy as np
battles = pd.read_csv('pokemon-challenge/battles_train.csv')
battles.head(5)

In [None]:
pokemons = pd.read_csv('pokemon-challenge/pokemon.csv', index_col=0)
pokemons.head(5)

In [None]:
pokemons.describe()

In [None]:
pokemons.describe(exclude=[np.number])

Our data is split across two tables, and it is not at all *tidy*. 

We want to modify our data to have the following characteristics:
* each Pokemon battle is represented by one line
* each line contains all necessary information for prediction, i.e. the characteristics of both Pokemons and the outcome
* the outcome will be stored in a column 'label', with the value 1 if the first Pokemon won, and 0 if the second Pokemon won.

Hints: 

You can compare two columns in pandas with: `df['A'] == df['B']`

Pandas allows you to do all kinds of joins, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html).
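
To illustrate both hints without touching the real files, here is a sketch on two hypothetical miniature tables (`fighters` and `fights`, with made-up column names and values).

```python
import pandas as pd

# Hypothetical miniature tables, not the real challenge files
fighters = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Speed': [30, 60, 90]},
                        index=[1, 2, 3])
fights = pd.DataFrame({'First': [1, 3], 'Second': [2, 1], 'Winner': [2, 3]})

# Comparing two columns gives booleans; astype(int) turns them into 0/1
fights['label'] = (fights['First'] == fights['Winner']).astype(int)

# Join the characteristics of each fighter, looked up by index.
# rsuffix disambiguates the columns coming from the second join.
joined = (fights
          .join(fighters, on='First')
          .join(fighters, on='Second', rsuffix='_opponent'))
print(joined)
```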

In [None]:
# %load solutions/create_dataset.py
# Creating the label column (don't forget we want it to be of type int)
battles['label'] = 
# Joining the 3 tables into one
df = 

Next we want to drop some irrelevant columns, such as indices. What might they be?

We can first check the name of the columns of the dataframe with `df.columns`.

In [None]:
# %load solutions/columns.py


In [None]:
# %load solutions/drop_columns.py
# Drop unnecessary columns
df_ml = 

One more thing, we said that we wanted to encode all of our data as a numerical vector, and we still have non-numerical data (types and legendary status). 

We can check the types of all columns with `df.dtypes`.

In [None]:
# %load solutions/types.py


The legendary status is a binary value, so we can simply cast it to an int.

In [None]:
# %load solutions/cast_types.py


For the types, we will use one-hot encoding, which is provided by the function [`pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).

A small subtlety here: each Pokemon can have up to 2 types, which we choose to encode without distinguishing between primary and secondary type.
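
One possible way to do this, sketched on a hypothetical three-row table, is to one-hot encode each type column separately and add the results, so that a Pokemon with two types gets two 1s on its row. The `dtype=int` argument is an addition for this sketch, so that the two tables can be summed as numbers.

```python
import pandas as pd

# Hypothetical miniature table; the second type can be missing
toy = pd.DataFrame({
    'Type 1': ['Fire', 'Water', 'Grass'],
    'Type 2': ['Flying', None, 'Poison'],
})

first = pd.get_dummies(toy['Type 1'], prefix='Type', dtype=int)
second = pd.get_dummies(toy['Type 2'], prefix='Type', dtype=int)  # missing values give all-zero rows

# Align the columns of both tables, treating absent columns as 0, then sum
encoded = first.add(second, fill_value=0).astype(int)
print(encoded)
```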

In [None]:
# %load solutions/one_hot_encoding.py


That's it, we're done with the data preparation! 

If you want to use the cleaned data for the rest of the workshop, you can load it from `pokemon-challenge/cleaned_train_data.csv`.

In [None]:
df_ml = pd.read_csv('pokemon-challenge/cleaned_train_data.csv', index_col=0)

### Let's train a decision tree model

The first model we will learn is a decision tree. 

Decision trees create *if-then-else* rules in a top-down process, choosing at each level to split on the variable that optimizes a criterion (for example, minimizes entropy).
 

#### Decision trees: a toy example
Let's try to predict whether I should have ice-cream for dessert.

I have written down a few variables that could influence my choice: whether I'm on a diet, whether it's summer, whether the weather is sunny, whether strawberries are available as an alternative, and whether I ate pasta as the main dish. The last column records whether I had ice-cream for dessert.

| Diet | Summer | Sunny | Strawberries | Pasta | Ice-cream |
|------|--------|-------|--------------|-------|-----------|
| 1    | 0      | 1     | 0            | 0     | 0         |
| 0    | 1      | 1     | 0            | 0     | 1         |
| 0    | 0      | 0     | 0            | 1     | 0         |
| 0    | 1      | 0     | 1            | 0     | 0         |
| 1    | 1      | 1     | 0            | 0     | 1         |
| 0    | 0      | 1     | 0            | 0     | 0         |
| 0    | 1      | 1     | 1            | 1     | 0         |
| 0    | 1      | 1     | 0            | 0     | 1         |


From this toy dataset, I could learn the following decision tree:

![A toy decision tree](img/dt.png)

Decision trees have the advantage of being **interpretable**, and of handling multi-class classification and categorical variables nicely. On the downside, they are quite unstable and prone to **overfitting**.
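
As a sketch, the toy table above can be fed to scikit-learn and the learned rules printed as text with `export_text`. The tree learned this way may differ from the one in the figure, since several splits can be equally good on such a small dataset; the variable names `toy` and `toy_tree` are only used for this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

features = ['Diet', 'Summer', 'Sunny', 'Strawberries', 'Pasta']
toy = pd.DataFrame(
    [[1, 0, 1, 0, 0, 0],
     [0, 1, 1, 0, 0, 1],
     [0, 0, 0, 0, 1, 0],
     [0, 1, 0, 1, 0, 0],
     [1, 1, 1, 0, 0, 1],
     [0, 0, 1, 0, 0, 0],
     [0, 1, 1, 1, 1, 0],
     [0, 1, 1, 0, 0, 1]],
    columns=features + ['Ice-cream'])

X_toy = toy[features].to_numpy()
y_toy = toy['Ice-cream'].to_numpy()

# criterion='entropy' matches the splitting criterion mentioned earlier
toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_tree.fit(X_toy, y_toy)

# Print the learned if-then-else rules
print(export_text(toy_tree, feature_names=features))
```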

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

In order to apply sklearn algorithms, we need to translate our pandas DataFrame to numpy arrays.

In [None]:
# DataFrame.as_matrix() has been removed from pandas; to_numpy() replaces it
X = df_ml.drop('label', axis=1).to_numpy()
y = df_ml['label'].to_numpy()

Since Decision Trees are so prone to overfitting, if we train our algorithm on the whole of the data and then measure how well it performs on that same data (the training accuracy), we will get a very optimistic estimate of how well the algorithm will perform in real life.

One solution would be to split our training data into a train dataset and a validation dataset (see for example [`sklearn.model_selection.train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). 

In [None]:
# %load solutions/train_test_split.py

X_train, X_val, y_train, y_val =

Now, you can instantiate a `DecisionTreeClassifier` with default hyperparameter values, and train it with the method `fit(X, y)`. You can evaluate how well it did with the method `score(X, y)`. 

In [None]:
# %load solutions/decision_tree.py


We can see that with the default hyperparameter settings, the algorithm learns a tree that perfectly classifies the training set, but the result is not as impressive on new data (overfitting).
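
The same effect can be reproduced on synthetic data. In the sketch below, the dataset is generated with `make_classification` and made deliberately noisy; the value `max_depth=3` for the second tree is an arbitrary choice, used only to contrast with the unconstrained tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y=0.2 assigns a random class
# to 20% of the samples
X_noisy, y_noisy = make_classification(n_samples=500, n_features=10,
                                       flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_noisy, y_noisy,
                                          test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare the accuracy on the training set and on the validation set
for name, tree in [('unconstrained', full_tree), ('max_depth=3', small_tree)]:
    print(name, tree.score(X_tr, y_tr), tree.score(X_va, y_va))
```

The unconstrained tree memorizes the training set, noise included, which is why its training accuracy says little about its accuracy on new data.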

Another option, especially if you have little data, is to use cross-validation.

At each iteration, we use 90% of the data to train a model, and the remaining 10% to evaluate how good the model is. And we repeat that 10 times, using a different 10% to evaluate each time. 

![Cross-validation](img/crossValidation.png)
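
Here is a sketch of what happens under the hood, written as an explicit loop over 10 folds on synthetic data. It uses a plain `KFold`, which is a simplification: for classifiers, scikit-learn's cross-validation helpers use a stratified variant by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

fold_scores = []
folds = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X_demo):
    model = DecisionTreeClassifier(random_state=0)
    # Train on 90% of the rows...
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ...and evaluate on the remaining 10%
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

print(np.mean(fold_scores), np.std(fold_scores))
```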

You can try it out with `sklearn.model_selection.cross_val_score`.

In [None]:
# %load solutions/cross_val.py


We said that Decision Trees have the huge advantage of being interpretable. One very nice feature is that they compute feature importances, a numerical value indicating how important a feature (= a variable) is for correctly predicting the label.

Let's train a tree again, and see what the feature importances are (`dt.feature_importances_`).

In [None]:
# %load solutions/feature_importance.py


From the feature importances, we can see that the most important variables are speed of both Pokemons; we can train a model with only those two variables and see how well it does.

In [None]:
X_speed = df_ml[['Speed', 'Speed_opponent']].to_numpy()
cross_val_score(DecisionTreeClassifier(), X_speed, y, cv=10).mean()

Our accuracy is slightly worse, but we only use 2 variables instead of 50!

### Ensemble methods: Random Forests

Decision Trees are quick to train and quite easy to understand, but they have the downside of being very brittle and unstable. Changing only one sample in your training set can lead to the learning of a completely different tree. 

Here, I would like to introduce one of my favorite tricks in Machine Learning: ensemble methods.

The intuitive idea is that of the wisdom of crowds, or of expert committees. There are mathematical proofs that for some ensemble methods (like boosting), if you have classifiers that are only slightly better than random, you can build an arbitrarily accurate ensemble from them.

For Random Forests, the base classifier is a decision tree. In order to improve the robustness of prediction, we won't use only one tree, but rather average the prediction over a whole forest. 

How do we train this forest? For each tree, we draw with replacement a training set from the original training set. In addition, to ensure diversity, at each split, we choose the best from a random subset of features. 
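
The two ingredients can be sketched by hand on synthetic data. This is a simplified illustration, not scikit-learn's implementation: the number of trees (25) is arbitrary, and the forest's prediction is a plain majority vote over the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
forest = []
for i in range(25):
    # Bootstrap: draw as many rows as the training set, with replacement
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features='sqrt': each split only considers a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    forest.append(tree.fit(X_tr[rows], y_tr[rows]))

# Average the 0/1 predictions of all trees and keep the majority class
votes = np.mean([tree.predict(X_va) for tree in forest], axis=0)
forest_pred = (votes > 0.5).astype(int)
forest_accuracy = (forest_pred == y_va).mean()
print(forest_accuracy)
```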

But we don't have to worry about the implementation details for now: scikit-learn provides an implementation.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# %load solutions/random_forest.py


In [None]:
# %load solutions/compare.py

rf_cv_scores = 
dt_cv_scores = 

print(f'Average accuracy for Decision Trees {dt_cv_scores.mean()} with a standard deviation of {dt_cv_scores.std()}')
print(f'Average accuracy for Random Forests {rf_cv_scores.mean()} with a standard deviation of {rf_cv_scores.std()}')

For this dataset, we see that Random Forest does not perform better than Decision Tree if we run them both with default parameters.

We can also see that since scikit-learn provides a unified API to interact with all ML estimators, it is tremendously easy to try out a new algorithm, and see what works best for you (in terms of accuracy, speed of execution or memory consumption). The documentation is also very useful to guide you on what usually works best.

### Bonus step: choose hyperparameters

Until now, we have used all algorithms "out of the box", without choosing any hyperparameters. 

First, what are hyperparameters? We call hyperparameters the algorithm parameters which are not learned, but chosen by the user.

For example, in the Random Forest algorithm, we can choose how many trees we want in our forest. The default was 10 in scikit-learn versions before 0.22 (it is 100 since then); here I set it to 15.

In [None]:
cross_val_score(RandomForestClassifier(n_estimators=15), X, y, cv=10).mean()

We see that the accuracy has improved. But how do we know whether 15 is the optimal value?

scikit-learn has a class `sklearn.model_selection.GridSearchCV` to automate the search over a grid of parameters. Here, we will only test the impact of the number of trees in Random Forest, varying the value from 5 to 55.

In [None]:
from sklearn.model_selection import GridSearchCV
tested_values = {'n_estimators': list(range(5, 56, 5))}
grid_search = GridSearchCV(RandomForestClassifier(), tested_values, cv=10)
grid_search.fit(X, y)
print("Best hyperparameter choice", grid_search.best_params_)
print("Averaged accuracy per run", grid_search.cv_results_['mean_test_score'])
print("Standard deviation per run", grid_search.cv_results_['std_test_score'])

### Bonus step: encoding type in a different way

Spoiler alert: categorical variables are often difficult to exploit. One-of-k encoding is clumsy when a variable can take many different values, and it can break links existing between categories. 

In this dataset, the type(s) of each Pokemon are categorical. If you know the game, you know that there exists a system of type (dis)advantage. For example, Fire is weak to Water, but strong against Ice.

You can find the numerical values associated with each pair of types in the file `pokemon-challenge/type.csv`.

In [None]:
df_types = pd.read_csv('pokemon-challenge/type.csv', index_col=0)
df_types.columns.name = 'Attack type'
df_types.index.name = 'Defensive pokemon type'
df_types = df_types.replace(0, 0.01)
df_types

We replaced zeros (type immunity) with a small value to avoid having to manage infinity when taking the inverse: here 1 / 0.01 = 100 plays the role of infinity, which is still much greater than all other values.

Type advantages and disadvantages are multiplied together when the defending Pokemon has more than one type. They apply to the attack performed on the Pokemon, and each attack has only one type. When the attacker has more than one type, we assume that its type advantage is the average over its types. This is not necessarily true because, even though attack types are correlated to the Pokemon type, the Pokemon can also perform attacks of a different type (for example an Electric attack even if it's a Fire Pokemon).

We compute the type advantage as follows.

In [None]:
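def type_advantage_offensive(t1, t2, to1, to2):
    # t1, t2: types of the attacking Pokemon; to1, to2: types of the defender
    # ('' when there is no second type). df_types[attack][defense] is the multiplier.
    if not t2:
        if not to2:
            return df_types[t1][to1]
        # two defending types: the multipliers are multiplied together
        return df_types[t1][to1] * df_types[t1][to2]
    if not to2:
        # two attacking types: average over the attacker's types
        return (df_types[t1][to1] + df_types[t2][to1]) / 2
    # two types on each side: average, over the attacker's types,
    # of the combined multiplier against both defending types
    return (df_types[t1][to1] * df_types[t1][to2] + df_types[t2][to1] * df_types[t2][to2]) / 2


def type_advantage_defensive(t1, t2, to1, to2):
    # inverse of the opponent's offensive advantage
    return 1. / type_advantage_offensive(to1, to2, t1, t2)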

In [None]:
df_ml_type = df.drop(['First_pokemon', 'Second_pokemon', 
                      'Winner', 'Name', 'Name_opponent', 
                      'Generation', 'Generation_opponent'], axis=1)
df_ml_type['Legendary'] = df_ml_type['Legendary'].astype(int)
df_ml_type['Legendary_opponent'] = df_ml_type['Legendary_opponent'].astype(int)

In [None]:
types_ = df_ml_type[['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent']].fillna(value='')
df_ml_type['Type_advantage_offensive'] = types_.apply(lambda x: type_advantage_offensive(*x), axis=1)
df_ml_type['Type_advantage_defensive'] = types_.apply(lambda x: type_advantage_defensive(*x), axis=1)
df_ml_type = df_ml_type.drop(['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent'], axis=1)

In [None]:
df_ml_type.head()

Now we can train all of our ML models as previously.

In [None]:
X_type = df_ml_type.drop('label', axis=1).to_numpy()
y_type = df_ml_type['label'].to_numpy()
# evaluate on the newly encoded features (X_type, y_type), not on the previous X, y
cross_val_score(DecisionTreeClassifier(), X_type, y_type, cv=10).mean()

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_type, y_type)
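# use the feature columns (everything except 'label') as index, whatever their position
pd.DataFrame(data=dt.feature_importances_, index=df_ml_type.drop('label', axis=1).columns,
             columns=['Feature importance']).sort_values(by="Feature importance", ascending=False)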

In this case, the accuracy doesn't change a lot, since the speed is so much more important than any other feature.

We can nonetheless observe that the type advantage is the third most important feature, whereas in the original encoding, the information was basically useless to the algorithm.

## Now we can predict new battle outcomes

We've been iterating on the same data with different algorithms, models and hyperparameters, but what if we only learned to be good on this data, and not on real-life data (overfitting)?
Another dataset has been left untouched, to simulate real-life data that we have never seen. Let's see how we perform on it.

We will need to perform the same preprocessing steps that we ran on the training data to prepare the dataset. 

In [None]:
def dummify(type_series, prefix):
    # we need to give our classifier data with the same columns as the data it has learned on
    # in particular, we need to have all type columns present, even if they are not in the test set
    # that's why we extract all the types from the trainset, and fill the missing columns with zeros
    type_columns = sorted(prefix + '_' + tc for tc in pokemons['Type 1'].unique())
    dummies = pd.get_dummies(type_series.iloc[:,0], prefix=prefix).add(
        pd.get_dummies(type_series.iloc[:,1], prefix=prefix),
        fill_value=0)
    missing_cols = set(type_columns) - set( dummies.columns )
    for c in missing_cols:
        dummies[c] = 0
    return dummies[type_columns]

In [None]:
def preprocess_test_data(battles_test):
    # we need to apply all the preprocessing steps we applied on our training data
    # 1°) joining data about battle outcomes and pokemons
    df_test = battles_test.join(pokemons, on='First_pokemon').join(pokemons, on='Second_pokemon', rsuffix='_opponent')
    # 2°) encoding the winner as int (1 if First_pokemon wins, 0 else)
    if 'Winner' in df_test.columns:
        df_test['label'] = (df_test['First_pokemon'] == df_test['Winner']).astype(int)
        df_test = df_test.drop(columns=['Winner'])
    # 3°) dropping useless columns
    df_test = df_test.drop(['First_pokemon', 'Second_pokemon', 
                               'Name', 'Name_opponent', 
                               'Generation', 'Generation_opponent'], axis=1)
    # 4°) encoding booleans as int
    df_test['Legendary'] = df_test['Legendary'].astype(int)
    df_test['Legendary_opponent'] = df_test['Legendary_opponent'].astype(int)
    # 5°) one-hot encoding of categorical columns 
    types = dummify(df_test[['Type 1', 'Type 2']], prefix='Type')
    types_opponents = dummify(df_test[['Type 1_opponent', 'Type 2_opponent']], prefix='Opponent_Type')
    return pd.concat((df_test, types, types_opponents), axis=1).drop(['Type 1', 'Type 2', 'Type 1_opponent', 'Type 2_opponent'], axis=1)

In [None]:
battles_test = pd.read_csv('pokemon-challenge/battles_test.csv')
df_ml_test = preprocess_test_data(battles_test)
df_ml_test.head()

In [None]:
X = df_ml.drop('label', axis=1).to_numpy()
y = df_ml['label'].to_numpy()
X_test = df_ml_test.drop('label', axis=1).to_numpy()
y_test = df_ml_test['label'].to_numpy()

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X, y)
dt.score(X_test, y_test)

In [None]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X, y)
rf.score(X_test, y_test)

Now that our classifiers are trained, and that they seem to perform well enough, we can write a small function to predict which Pokemon will win a battle. 

In [None]:
def predict_battle_outcome(pokemon1, pokemon2, trained_estimator):
    if isinstance(pokemon1, str):
        try:
            pokemon1 = pokemons.index[pokemons['Name'] == pokemon1][0]
        except IndexError:
            raise ValueError(f"The name {pokemon1} is not a valid Pokemon name")
    
    if isinstance(pokemon2, str):
        try:
            pokemon2 = pokemons.index[pokemons['Name'] == pokemon2][0]
        except IndexError:
            raise ValueError(f"The name {pokemon2} is not a valid Pokemon name")
    df = pd.DataFrame(data=[[pokemon1, pokemon2]], columns=['First_pokemon', 'Second_pokemon'])
    X = preprocess_test_data(df).to_numpy()
                
    if trained_estimator.predict(X)[0]:
        print(f"{pokemons.loc[pokemon1, 'Name']} wins")
    else:
        print(f"{pokemons.loc[pokemon2, 'Name']} wins")

In [None]:
predict_battle_outcome('Charizard', 'Mega Venusaur', rf)