# 7. Optimizing models

Evaluation is not only meant to be a reality-check of how well a model measures up to our expectations.
This notebook demonstrates how to evaluate models better, and more importantly, how to optimize a model's performance with just a few lines of code.

In this notebook, we'll use the Titanic dataset to train decision trees.
All of the examples will show how different evaluation measures and optimization techniques lead to very different trees.
The next code cell loads the dataset into a Pandas DataFrame.

In [1]:
import pandas as pd # import pandas
df = pd.read_csv('data/titanic-train.csv') # read the Titanic dataset into a DataFrame
df.head() # show the first 5 rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Throughout this notebook, we'll only be using a few features from this dataset.
Normally, you would experiment with most features, but we'll ignore features with many missing values and most categorical variables to get to optimization techniques more quickly.
The next code cell shows some statistics from the Titanic dataset, including the number of rows, the number of missing values and the data types.

In [2]:
_df = df.copy() # create a copy of the dataset before imputation
_df.info() # print information about the columns, including missing data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Before training the decision tree, the next code cell encodes the `Sex` feature, which we know to be important for predicting survival, and imputes the missing `Age` values.

In [3]:
from sklearn import preprocessing # import the preprocessing module for feature engineering

label_encoder = preprocessing.LabelEncoder() # create the label encoder
encoding = label_encoder.fit_transform(_df.Sex) # encode the `Sex` feature
_df.Sex = encoding # replace the `Sex` values with their encoding
print(f"{ [ 0, 1 ] } are { list(label_encoder.inverse_transform([ 0, 1 ])) }")

_df.Age = _df.Age.fillna(_df.Age.mean()) # fill the Age's missing values with the mean
_df.head() # show the first 5 rows

[0, 1] are ['female', 'male']


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,S


## Cross validation

K-fold cross validation gives us more reliable results by, well, cross validating using several train and test sets.
Instead of training a single model, k-fold cross validation trains several models, always using different training and test sets.

In scikit-learn, there are two ways of applying k-fold cross validation, but either way, the first step is to identify the features and labels.
The next code cell specifies the features we'll use to train decision trees with, and then it extracts all features `X` and labels `y`.

In [4]:
features = [ 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare' ] # the subset of the features we're using
X = _df[ features ] # keep all rows, but only some of the features
y = _df.Survived # keep all the labels in the `Survived` column

The first k-fold cross validation experiment we'll perform gives us complete control and visibility of the evaluation process.
We'll use the [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class to perform cross validation, which involves two steps:

1. Import the `KFold` class
2. Instantiate the `KFold` class

The next code cell performs both steps, instantiating the class with 5 splits ($k=5$).
We also shuffle the rows by setting the optional `shuffle` parameter; because we don't fix `random_state`, the splits, and therefore the results, differ each time we run the experiment.

The `splits` variable is a list, which holds a number of splits, as the name implies.
Each split is made up of two NumPy arrays, each containing row indices:

1. The first array holds the indices of the examples that we should use to **train** the model.
2. The second array holds the indices of the examples that we should use to **test** the model.

In [5]:
from sklearn.model_selection import KFold # import the KFold class

folds = KFold(n_splits=5, shuffle=True) # create the k-fold cross validation class with k=5 and shuffling
splits = list(folds.split(X)) # extract the folds as a list
splits[0]

(array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  25,  27,  28,
         29,  30,  31,  32,  33,  35,  38,  39,  42,  45,  47,  49,  50,
         51,  52,  53,  54,  55,  56,  57,  58,  59,  61,  62,  63,  64,
         65,  66,  67,  68,  69,  71,  72,  73,  74,  75,  76,  77,  79,
         80,  81,  82,  83,  84,  86,  88,  89,  90,  93,  96,  97,  98,
         99, 101, 102, 103, 104, 105, 109, 110, 111, 112, 115, 116, 118,
        121, 122, 123, 124, 125, 126, 127, 128, 129, 131, 133, 134, 135,
        139, 140, 141, 142, 143, 145, 147, 148, 149, 151, 152, 153, 154,
        155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167,
        168, 169, 170, 171, 173, 175, 176, 177, 178, 179, 180, 181, 182,
        183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
        196, 197, 199, 200, 202, 203, 204, 208, 209, 210, 211, 212, 213,
        214, 216, 217, 218, 219, 220, 222, 223, 224
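
To make the shape of `splits` easier to see, here is a minimal sketch on a toy array of 10 rows instead of the Titanic data. The toy array and the `random_state=0` seed are assumptions made for illustration; the seed only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # hypothetical toy data: 10 rows, 1 feature

toy_folds = KFold(n_splits=5, shuffle=True, random_state=0)  # fixed seed (assumption) for repeatability
toy_splits = list(toy_folds.split(toy_X))

for train_index, test_index in toy_splits:
    print(f"train: {train_index}, test: {test_index}")  # 8 training indices and 2 test indices per split

# every row is used for testing exactly once across the 5 splits
all_test = np.sort(np.concatenate([test for _, test in toy_splits]))
print(all_test)
```

With 10 rows and $k=5$, each split trains on 8 rows and tests on the remaining 2, and the test sets never overlap.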

The way scikit-learn applies k-fold cross validation, we need to give it a model with all hyperparameters set.
In other words, we create the model, but without training it.
We'll train the decision tree later for each different split.

In [6]:
from sklearn import tree # import the tree module
dt = tree.DecisionTreeClassifier(max_leaf_nodes=5) # create the DecisionTreeClassifier—but do not train it
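
As a quick check that the tree really is untrained at this point, the sketch below builds a fresh tree and tries to predict with it. It is an illustration only: the `untrained` variable and the single made-up row are hypothetical.

```python
from sklearn import tree
from sklearn.exceptions import NotFittedError

untrained = tree.DecisionTreeClassifier(max_leaf_nodes=5)  # hyperparameters are set, but no data seen yet
print(untrained.get_params()['max_leaf_nodes'])  # the hyperparameter is stored on the model

try:
    untrained.predict([[3, 1, 22.0, 1, 0, 7.25]])  # one made-up row with six feature values
    was_fitted = True
except NotFittedError:
    was_fitted = False  # scikit-learn refuses to predict before fit() is called

print(f"Fitted: {was_fitted}")
```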

We now have the two tools that we need to cross-validate: a list of splits and a decision tree.
The next code cell performs a few, very simple steps:

1. It iterates over every split: a training set and a test set.
2. It trains the model on the training set.
3. It evaluates the model on the test set.

In [7]:
from sklearn import metrics # import the metrics module
for train_index, test_index in splits: # go through each split: a training set and a test set
    X_train, X_test = X.iloc[train_index], X.iloc[test_index] # extract the train and test rows (but only the features)
    y_train, y_test = y.iloc[train_index], y.iloc[test_index] # extract the train and test labels by position
    
    dt = dt.fit(X_train, y_train) # train, or fit, the decision tree
    
    y_pred = dt.predict(X_test) # predict the labels of the unseen observations
    print(f"F-measure: { round(metrics.f1_score(y_test, y_pred) * 100, 2) }%")

F-measure: 75.71%
F-measure: 60.78%
F-measure: 70.68%
F-measure: 69.7%
F-measure: 72.06%


Individual folds vary (from 60.78% to 75.71% above), but no matter how many times you run this notebook, the average across the folds stays similar.
That is what makes k-fold cross validation more reliable than evaluating with one split.
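
One way to check that claim is to summarize the folds with a mean and a standard deviation rather than reading five separate numbers. The sketch below hard-codes the five F-measures printed above; in the loop itself you would append each score to a list (the `scores` name is an assumption).

```python
import statistics

scores = [75.71, 60.78, 70.68, 69.7, 72.06]  # the F-measures (%) printed by the loop above

mean_score = statistics.mean(scores)  # average performance across the 5 folds
spread = statistics.stdev(scores)  # sample standard deviation: how much the folds disagree

print(f"Mean F-measure: {mean_score:.2f}% (+/- {spread:.2f})")
```

Re-running the notebook reshuffles the rows, so the individual scores change, but the mean should move much less than any single fold does.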

There are more ways to improve the reliability *and* performance of models.
For example, the [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) class is a stratified version of k-fold cross validation.
Stratified sampling ensures that we do not end up with a training set with very few or very many survivors: the class balance of the training and test sets is more consistent.
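
The effect of stratification is easiest to see on made-up labels. The sketch below assumes 50 hypothetical passengers, 20 of whom survived, and compares the share of survivors in each test fold under `KFold` and `StratifiedKFold`. The labels and the `random_state=0` seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

toy_y = np.array([0] * 30 + [1] * 20)  # hypothetical labels: 40% survivors
toy_X = np.zeros((len(toy_y), 1))  # placeholder features; only the labels matter here

plain = KFold(n_splits=5, shuffle=True, random_state=0)
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# share of survivors in each test fold
plain_rates = [toy_y[test].mean() for _, test in plain.split(toy_X)]
stratified_rates = [toy_y[test].mean() for _, test in stratified.split(toy_X, toy_y)]

print(f"KFold:           {plain_rates}")  # rates may drift away from 0.4
print(f"StratifiedKFold: {stratified_rates}")  # each fold keeps the overall 0.4 rate
```

Note that `StratifiedKFold.split` needs the labels as well as the features, because it uses them to balance the folds.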

The second way of applying k-fold cross validation is much more straightforward.
In fact, it's straightforward enough to fit in one code cell.

The solution is to use the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. When the model is a classifier and `cv` is a number, as in our case, it is really a shortcut to the `StratifiedKFold` class.
To use this function, we naturally need to import it, but we also need to pass on a few parameters:

- The model we want to use—instantiated, as before, but not trained
- The features that the function will use to train the model, `X`, and the corresponding labels, `y`
- The number of splits, `cv`
- The scoring function—see the [scoring options here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

The function will simply return the scores from each split.
And we're done.

In [8]:
from sklearn.model_selection import cross_val_score # import the cross_val_score function

dt = tree.DecisionTreeClassifier(max_leaf_nodes=5) # create the DecisionTreeClassifier
cross_val_score(dt, X, y, cv=5, scoring='f1_macro') # calculate the macro-averaged F1-score with 5 splits

array([0.69089482, 0.77091377, 0.79400953, 0.72775306, 0.79657143])
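
These scores are not directly comparable with the F-measures from the manual loop: `metrics.f1_score` defaults to the F1 of the positive class, whereas `'f1_macro'` averages the F1 of both classes. A minimal sketch on made-up labels (the `toy_true` and `toy_pred` lists are hypothetical) shows the difference.

```python
from sklearn import metrics

toy_true = [0, 0, 0, 1, 1]  # hypothetical true labels
toy_pred = [0, 0, 1, 1, 0]  # hypothetical predictions

binary_f1 = metrics.f1_score(toy_true, toy_pred)  # default: F1 of the positive class (label 1) only
macro_f1 = metrics.f1_score(toy_true, toy_pred, average='macro')  # unweighted mean of the per-class F1 scores

print(f"Binary F1: {binary_f1:.3f}")
print(f"Macro F1:  {macro_f1:.3f}")
```

To compare like with like, you could pass `scoring='f1'` to `cross_val_score`, or `average='macro'` to `f1_score` in the loop.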

## Hyperparameter tuning

Tuning hyperparameters can be tedious, but scikit-learn can do it for you.
We'll first experiment with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

As usual, we have to import the class, but more importantly, we need to pass on at least two parameters: the model and all possible hyperparameters we want to try.
For example, we'll vary the maximum depth and the maximum number of leaf nodes, each from 2 to 19.

We also set `cv=5` for 5-fold, stratified cross validation, and choose the scoring metric.
After fitting, we can view the best hyperparameters (`best_params_`) and score (`best_score_`).

In [9]:
%%time
# ^ time how long it takes grid search to find the best hyperparameters
from sklearn.model_selection import GridSearchCV # import the GridSearchCV class

dt = tree.DecisionTreeClassifier() # create a decision tree with any default hyperparameters
parameters = { 'max_depth': range(2, 20), 'max_leaf_nodes': range(2, 20) } # choose the hyperparameters that you want to experiment with

# instantiate the grid search with the model and all possible parameters, as well as some other optional parameters
grid = GridSearchCV(dt, parameters, cv=5, scoring='f1_macro')

grid.fit(X, y) # tune the parameters
print(f"Best results: { grid.best_params_ }")
print(f"F-measure: { round(grid.best_score_ * 100, 2) }%")

Best results: {'max_depth': 10, 'max_leaf_nodes': 19}
F-measure: 80.93%
CPU times: user 6.81 s, sys: 0 ns, total: 6.81 s
Wall time: 6.81 s
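
Besides `best_params_` and `best_score_`, a fitted grid search keeps the score of every combination it tried in `cv_results_`. The sketch below uses a small synthetic dataset from `make_classification` and a deliberately tiny grid so that it runs on its own; the dataset, the grid values and the variable names are assumptions, not the Titanic setup above.

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (assumption): 200 rows, 6 features, 2 classes
toy_X, toy_y = make_classification(n_samples=200, n_features=6, random_state=0)

toy_grid = GridSearchCV(
    tree.DecisionTreeClassifier(random_state=0),  # fixed seed so that ties are broken the same way each run
    { 'max_depth': [2, 4], 'max_leaf_nodes': [3, 6] },  # 2 x 2 = 4 combinations
    cv=5, scoring='f1_macro')
toy_grid.fit(toy_X, toy_y)

results = pd.DataFrame(toy_grid.cv_results_)  # one row per combination of hyperparameters
columns = [ 'param_max_depth', 'param_max_leaf_nodes', 'mean_test_score', 'rank_test_score' ]
print(results[columns].sort_values('rank_test_score'))  # best combination first
```

Looking at the full table, rather than only the winner, shows whether the best combination is clearly ahead or just one of several near-ties.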


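Note how long it takes: grid search trains a model for every one of the $18 \times 18 = 324$ combinations, on each of the 5 folds.
[RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) only samples a few combinations (`n_iter=5` below), so it takes much less time, and the score trade-off isn't large.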
Best of all, the code is practically the same.

In [10]:
%%time 
# ^ time how long it takes random search to find the best hyperparameters
from sklearn.model_selection import RandomizedSearchCV # import the RandomizedSearchCV class

dt = tree.DecisionTreeClassifier() # create a decision tree with any default hyperparameters
parameters = { 'max_depth': range(2, 20), 'max_leaf_nodes': range(2, 20) } # choose the hyperparameters that you want to experiment with

# instantiate the random search with the model and all possible parameters, as well as some other optional parameters
rand = RandomizedSearchCV(dt, parameters, n_iter=5, cv=5, scoring='f1_macro')

rand.fit(X, y) # tune the parameters
print(f"Best results: { rand.best_params_ }")
print(f"F-measure: { round(rand.best_score_ * 100, 2) }%")

Best results: {'max_leaf_nodes': 14, 'max_depth': 9}
F-measure: 80.56%
CPU times: user 114 ms, sys: 0 ns, total: 114 ms
Wall time: 113 ms
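
The gap in running time follows from how many models each search trains. As a rough sketch, `ParameterGrid` can count the combinations that grid search walks through, while random search only samples `n_iter` of them; the `n_folds` name is an assumption for illustration.

```python
from sklearn.model_selection import ParameterGrid

parameters = { 'max_depth': range(2, 20), 'max_leaf_nodes': range(2, 20) }  # the same search space as above
n_folds = 5  # cv=5 in both searches

grid_candidates = len(ParameterGrid(parameters))  # every combination: 18 x 18
random_candidates = 5  # n_iter=5 combinations sampled at random

print(f"Grid search:   {grid_candidates} candidates, {grid_candidates * n_folds} cross-validation fits")
print(f"Random search: {random_candidates} candidates, {random_candidates * n_folds} cross-validation fits")
```

Raising `n_iter` lets random search cover more of the space at a proportional cost in time.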


Hyperparameter tuning does not have to be a chore, and sometimes making some extra effort can improve results considerably.