# Training Machine Learning Models
*Curtis Miller*

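In this video I discuss techniques for training machine learning models and for checking how well they generalize: underfitting and overfitting, training/testing splits, and cross-validation.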

## Underfitting and Overfitting

**Underfitting** is when an algorithm trained to predict a value does so poorly both on the training data and on future, unseen data.

Reconsider the *Titanic* dataset example below:

In [None]:
import pandas as pd
from pandas import DataFrame

In [None]:
titanic = pd.read_csv("titanic.csv")
titanic.head()

In [None]:
titanic.Survived.value_counts()

In [None]:
# Predict most common value
if titanic.Survived.value_counts()[0] > titanic.Survived.value_counts()[1]:
    guess = 0
else:
    guess = 1

predicted = pd.Series([guess] * len(titanic))
(titanic.Survived - predicted).abs().sum()    # Error count (trivial here)

In [None]:
(titanic.Survived - predicted).abs().mean()     # Error rate

In [None]:
1 - (titanic.Survived - predicted).abs().mean()     # Correct prediction rate

This algorithm ignores everything we know about each passenger, so it underfits about as much as it possibly can. It may in fact be a worst-case scenario for underfitting.
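The "always guess the most common value" rule is a common baseline. As a minimal sketch, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` behaves the same way; the toy labels below are made up for illustration and are not the *Titanic* data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels (not the Titanic data): three 0s and two 1s
y = np.array([0, 0, 0, 1, 1])
# The baseline ignores the features, so a placeholder column is enough
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_pred = baseline.predict(X)    # Always the most common label

baseline_accuracy = accuracy_score(y_true=y, y_pred=baseline_pred)
print(baseline_pred, baseline_accuracy)
```

Any algorithm worth keeping should beat the accuracy of a baseline like this one.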

**Overfitting** occurs when an algorithm predicts training data well but does not generalize to new data; on new data, the algorithm's error rate increases unacceptably.

Underfitting is usually obvious when training a system, because the error is high even on the training data. Overfitting requires more care to detect since, by definition, we do not have the unseen data to check against. There are techniques, though, for simulating unseen data.
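To make the contrast concrete, here is a small sketch using made-up data (not the *Titanic* data). The `memorizer` and `threshold_rule` functions are hypothetical names introduced only for this illustration: the first memorizes the noisy training labels, the second applies one simple rule.

```python
import random

random.seed(0)    # Fixed seed so the toy data is reproducible

def make_data(n):
    """Toy data: the label is 1 when x > 0.5, but 20% of labels are flipped (noise)"""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

train = make_data(200)
new = make_data(200)    # Plays the role of unseen data

def memorizer(x):
    """Answers with the label of the closest training point (memorizes the noise too)"""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def threshold_rule(x):
    """A simple rule that does not depend on individual training points"""
    return int(x > 0.5)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

memorizer_train_acc = accuracy(memorizer, train)
memorizer_new_acc = accuracy(memorizer, new)
print("memorizer:", memorizer_train_acc, memorizer_new_acc)
print("threshold rule:", accuracy(threshold_rule, train), accuracy(threshold_rule, new))
```

The memorizer looks perfect on the data it was built from and does worse on the new data, which is the signature of overfitting; the simple rule scores about the same on both.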

## Training / Testing Split

The first technique is to split data into a training dataset and a testing dataset. We use the training data for developing our algorithm. We then see how well the algorithm generalizes by applying the trained algorithm to the test data and quantifying the error rate.

`train_test_split()`, from scikit-learn (**sklearn**), makes splitting data easy.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
titanic_train, titanic_test = train_test_split(titanic,          # Dataset to split (array-like)
                                               test_size=0.1)    # How large test set should be; in this case, 10% of
                                                                 # the whole (could also be an integer for fixed size)

In [None]:
titanic_train

In [None]:
titanic_train.shape

In [None]:
titanic_test

In [None]:
titanic_test.shape
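
Because `train_test_split()` shuffles the rows randomly, each run produces a different split. If a reproducible split is wanted, the `random_state` parameter can fix the shuffle. The sketch below uses a made-up table (the `toy` DataFrame is an assumption for illustration, not the *Titanic* data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic table (100 rows)
toy = pd.DataFrame({"Pclass": [1, 2, 3, 3, 2] * 20,
                    "Survived": [1, 0, 0, 1, 0] * 20})

# random_state fixes the shuffle so the same rows land in each set on every run
toy_train, toy_test = train_test_split(toy, test_size=0.1, random_state=0)
toy_train_again, toy_test_again = train_test_split(toy, test_size=0.1, random_state=0)

print(toy_train.shape, toy_test.shape)
print(toy_test.index.equals(toy_test_again.index))    # Same rows both times
```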

Let's now train a new algorithm. The table-lookup algorithm does the following:

1. *Look up all individuals in the training set with the same passenger class (`Pclass`), sex (`Sex`), siblings/spouses aboard (`Siblings/Spouses Aboard`) and parents/children aboard (`Parents/Children Aboard`).*
2. *Predict the most common value amongst those individuals.*

Below is the code for the algorithm.

In [None]:
def table_lookup_predictor(x, table):
    """Implements the table-lookup algorithm"""

    # Get most common label; idxmax() returns the label itself, whereas
    # argmax() returns an integer position in recent versions of pandas
    default = table.Survived.value_counts().idxmax()
    # Get similar individuals
    similar_tab = table.loc[(table["Pclass"] == x["Pclass"]) &
                            (table["Sex"] == x["Sex"]) &
                            (table["Siblings/Spouses Aboard"] == x["Siblings/Spouses Aboard"]) &
                            (table["Parents/Children Aboard"] == x["Parents/Children Aboard"]), "Survived"]
    if len(similar_tab) == 0:
        # If table is empty (no "similar" individuals), guess the most common label
        return default
    else:
        return similar_tab.value_counts().idxmax()

In [None]:
titanic_train.iloc[0,:]

In [None]:
# Demonstration 1
table_lookup_predictor(titanic_train.iloc[0,:], titanic_train)    # Perfect!

In [None]:
tlu_train_predicted = titanic_train.apply(table_lookup_predictor, axis=1,    # axis=1: apply row by row
                                          table=titanic_train)    # Make predictions on training set
tlu_train_predicted

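We can easily measure how often our algorithm was correct on the training set using the scikit-learn function `accuracy_score()`, which returns the fraction of predictions that match the true values (one minus the error rate).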

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_true=titanic_train.Survived,    # True values
               y_pred=tlu_train_predicted)    # Predicted values
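
As a quick check of what `accuracy_score()` reports, the sketch below compares it with a manual count on a few made-up labels (these are hypothetical values, not *Titanic* results).

```python
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not taken from the Titanic data
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
# Same quantity by hand: matching predictions divided by total predictions
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
error_rate = 1 - accuracy

print(accuracy, manual, error_rate)
```

Three of the five predictions match, so the accuracy is 0.6 and the error rate is 0.4.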

The algorithm is very accurate on the training set. This is to be expected; it's just looking up values from the table! What about when it's applied to the test set?

In [None]:
tlu_test_predicted = titanic_test.apply(table_lookup_predictor, axis=1,    # axis=1: apply row by row
                                        table=titanic_train)    # Make predictions on test set

In [None]:
accuracy_score(y_true=titanic_test.Survived,    # True values
               y_pred=tlu_test_predicted)    # Predicted values

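Accuracy is somewhat lower on the test set than on the training set: the algorithm overfit the training data slightly, though the overfitting isn't terrible. (Because `train_test_split()` shuffles the data randomly, the exact figures vary from run to run.)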

**NOTE:** Evaluating a model on the test set should be the *very last thing you do!* If you repeatedly refer to the test set, it is no longer "unseen" data.

## Cross-Validation

Many algorithms include **hyperparameters**, which are parameters that are characteristic of the algorithm itself rather than the underlying phenomenon. We need to choose the value of these parameters and we are indifferent to their values beyond their ability to improve predictions.

Our algorithm does not account for passengers' ages when making predictions. Age unfortunately is not a binary variable, and an exact-match lookup on it would find very few "similar" passengers. We can use it to create a binary variable, though, by fixing a cutoff age and marking every individual younger than that age with 1 and the rest with 0. The cutoff age behaves like a hyperparameter here.
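A minimal sketch of building such a binary variable with pandas follows. The ages and the `Child` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical ages, not taken from the Titanic data
passengers = pd.DataFrame({"Age": [4.0, 22.0, 35.0, 9.5, 61.0]})

cutoff = 10    # Candidate cutoff age (the hyperparameter)
# 1 if the passenger is younger than the cutoff, 0 otherwise
passengers["Child"] = (passengers["Age"] < cutoff).astype(int)

print(passengers)
```

Changing `cutoff` changes which passengers count as similar to one another, which is why its value has to be chosen with some care.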

We don't want to pick our cutoff to maximize predictive accuracy in the training set, though, and we don't want to choose it so that it improves accuracy in the test set either. Instead we will employ **cross-validation**. The procedure works as follows:

1. *Divide data into $k$ **folds** (approximately equal size subsets of the original dataset that together form the whole dataset).*
2. *For each fold, do the following:*
    1. *Treat the fold as the "test" data and the rest of the data as "training" data.*
    2. *For each possible value of the hyperparameter, use the "training" data to fit the model and evaluate its performance on the "test" data; track the performance.*
3. *Aggregate the performance of the algorithm across the different folds for each possible value of the hyperparameter.*
4. *Use the hyperparameter value that overall yielded the best performance.*

Cross-validation can be used for purposes other than choosing hyperparameters. For example, it can be a good way to evaluate an algorithm's performance and thus allow you to choose between different algorithms.

Here I will consider six candidate cutoff ages: 10, 20, 30, 40, 50, 60. I will use 10 folds.

scikit-learn provides multiple functions for supporting cross-validation. The `KFold` class can split a dataset up into folds as described. `cross_val_score()` can perform the entire cross-validation procedure, and is a good choice. Here I will do the cross-validation manually using only `KFold` but in future videos we may use `cross_val_score()`.
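For comparison, here is a rough sketch of the same idea with `cross_val_score()`. It needs a scikit-learn estimator, so the sketch swaps in a `DecisionTreeClassifier` and synthetic data from `make_classification()`; both are assumptions for illustration and are not the table-lookup algorithm or the *Titanic* data. The hyperparameter being tuned here is the tree's `max_depth`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

mean_scores = {}
for depth in [1, 3, 5]:    # Candidate hyperparameter values (chosen arbitrarily)
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # cv=10 requests 10 folds; one accuracy score is returned per fold
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    mean_scores[depth] = scores.mean()

# Pick the candidate with the best mean accuracy across folds
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_depth)
```

Note that when `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds, which differs slightly from the plain `KFold` splits used below.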

In [None]:
from sklearn.model_selection import KFold
import numpy as np

In [None]:
kf = KFold(n_splits=10)    # Prepare for cross-validation, creating an object for managing splits

In [None]:
# Preview; note that these are NumPy arrays
for train, test in kf.split(titanic_train):
    print("Training Indices:")
    print(train)
    print("\nTest Indices")
    print(test)
    print("\n----\n")

In [None]:
def table_lookup_predictor_2(x, table, age):
    """Implements the table-lookup algorithm with an age cutoff"""

    # Get most common label; idxmax() returns the label itself, whereas
    # argmax() returns an integer position in recent versions of pandas
    default = table.Survived.value_counts().idxmax()
    # Get similar individuals (same side of the age cutoff as x)
    similar_tab = table.loc[(table["Pclass"] == x["Pclass"]) &
                            (table["Sex"] == x["Sex"]) &
                            (table["Siblings/Spouses Aboard"] == x["Siblings/Spouses Aboard"]) &
                            (table["Parents/Children Aboard"] == x["Parents/Children Aboard"]) &
                            ((table["Age"] < age) == (x["Age"] < age)), "Survived"]
    if len(similar_tab) == 0:
        # If table is empty (no "similar" individuals), guess the most common label
        return default
    else:
        return similar_tab.value_counts().idxmax()

In [None]:
ages = [10, 20, 30, 40, 50, 60]
performance = dict()

for age in ages:
    cv_perf = list()
    for train, test in kf.split(titanic_train):
        # Get predicted values in "test" data using "train" data
        predicted = titanic_train.iloc[test, :].apply(table_lookup_predictor_2, axis=1,
                                                      table=titanic_train.iloc[train, :],
                                                      age=age)
        actual = titanic_train["Survived"].iloc[test]
        # Add performance to a list
        cv_perf.append(accuracy_score(y_true=actual, y_pred=predicted))
    performance[age] = cv_perf

In [None]:
DataFrame(performance)

In [None]:
DataFrame(performance).mean()

Among the six candidates, a cutoff age of 10 years appears to give the best mean accuracy across the folds, so that is the value we would choose.
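The best cutoff can also be read off programmatically rather than by eye. The sketch below uses made-up fold accuracies purely to show the call; they are not results from the *Titanic* data.

```python
import pandas as pd

# Made-up fold accuracies for illustration only; NOT results from the Titanic data
toy_performance = {10: [0.70, 0.80], 20: [0.80, 0.90], 30: [0.60, 0.70]}

mean_accuracy = pd.DataFrame(toy_performance).mean()    # One mean per candidate cutoff
best_cutoff = mean_accuracy.idxmax()                    # Cutoff with the highest mean
print(best_cutoff)
```

When two candidates have nearly the same mean, it may also be worth looking at the spread across folds (for example with `.std()`) before settling on one.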