# Lecture 5 - Classification & Regression I - Cross Validation

In this short notebook, we perform a "proper" classification task using cross validation. Most classifiers or regressors feature a set of hyperparameters (e.g., the k in KNN) that can significantly affect the results. To find the best parameter settings, we have to train and evaluate for different parameter values. 

However, the evaluation used to find the best parameter values cannot be done on the test set. The test set has to remain unseen until the very end, when it is used for the final evaluation (once the hyperparameters have been fixed). Using the test set to tune the hyperparameters means that the test set has affected the training process.

Let's get started...


## Setting up the notebook

Specify how plots get rendered

In [55]:
%matplotlib notebook

Make all required imports.

In [42]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression


from sklearn import preprocessing
from sklearn.model_selection import KFold


from sklearn.metrics import classification_report, f1_score, mean_squared_error
from sklearn.model_selection import cross_val_score

## KNN Classification of the Diabetes Dataset

The diabetes dataset used here (`data/diabetes.csv`) poses a simple binary classification task: predicting the `Outcome` column (0 or 1) from 8 numerical features such as `Glucose`, `BMI`, and `Age`.

## Load Data

In [101]:
df_diabetes = pd.read_csv('data/diabetes.csv')

# The rows are sorted, so let's shuffle them
df_diabetes = df_diabetes.sample(frac=1).reset_index(drop=True)

# Show the first 5 rows
df_diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,3,111,62,0,0,22.6,0.142,21,0
1,5,99,74,27,0,29.0,0.203,32,0
2,0,95,80,45,92,36.5,0.33,26,0
3,2,114,68,22,0,28.7,0.092,25,0
4,3,90,78,0,0,42.7,0.559,21,0


## Create Training and Test Data

We use all 8 input features as `X` and the `Outcome` column as the class label `y`.

In [102]:
# Convert data to numpy arrays
X = df_diabetes[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']].to_numpy()
y = df_diabetes[['Outcome']].to_numpy().squeeze()

# Let's go for an 80%/20% split -- you can change the value and see its effects
train_test_ratio = 0.80

# Calculate the size of the training data (the size of the test data is also implicitly given)
train_set_size = int(train_test_ratio * len(X))

# Split data and labels into training and test data with respect to the size of the training data
X_train, X_test = X[:train_set_size], X[train_set_size:]
y_train, y_test = y[:train_set_size].squeeze(), y[train_set_size:].squeeze()

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))

Size of training set: 614
Size of test: 154
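
The manual slicing above can also be done with scikit-learn's `train_test_split`, which shuffles and splits in one call. The sketch below uses synthetic stand-in arrays with the same number of rows (768) rather than the notebook's data; `random_state=0` and `stratify=y_demo` are choices made for this illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 768 x 8 feature matrix and binary labels (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

# 80%/20% split; stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

print("Size of training set: {}".format(len(X_tr)))
print("Size of test set: {}".format(len(X_te)))
```

Stratifying is optional, but it can help when one class is much rarer than the other.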


## Standardize Data

Although we have only numerical values as input attributes, their magnitudes and ranges differ noticeably. It's therefore a good idea to normalize/standardize the data. As usual, scikit-learn makes this very convenient by providing a [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) (among other methods for normalization and standardization).

**Important:** We have to fit the scaler on the training data `X_train` only (fitting here means calculating the mean and standard deviation of each feature)! If we used the full `X` for that, it would include the test data `X_test`. In that case, the test data would affect the transformation and training steps. However, `X_test` has to remain truly "unseen" until the very end.

In [103]:
scaler = preprocessing.StandardScaler().fit(X_train)
#scaler = preprocessing.StandardScaler().fit(X)  # WRONG!!!

X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
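
To make concrete what "fitting" the scaler means, here is a minimal sketch on synthetic data (the arrays `X_train_demo` and `X_test_demo` are made up for illustration). It checks that `fit()` stores the per-feature mean and standard deviation of the training data, and that `transform()` applies those training statistics to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic training and test data (assumption, not the notebook's data)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler_demo = StandardScaler().fit(X_train_demo)

# fit() stores the per-feature mean and (population) standard deviation
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)
print(np.allclose(scaler_demo.mean_, train_mean))
print(np.allclose(scaler_demo.scale_, train_std))

# transform() computes (x - mean) / std using the *training* statistics
X_test_scaled = scaler_demo.transform(X_test_demo)
X_test_manual = (X_test_demo - train_mean) / train_std
print(np.allclose(X_test_scaled, X_test_manual))
```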


## Train and Test KNN Classifier Using Cross-Validation

### Semi-Manual K-Fold Validation

We first utilize scikit-learn's [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) to split the training data into $k$ folds (here, $k=10$). The `KFold.split()` method generates the folds and allows us to loop over all combinations of training and validation folds. Each combination contains $k-1$ training folds and 1 validation fold. For each combination, we can retrain and validate the classifier.
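
As a small illustration of what `KFold.split()` yields, the sketch below uses 10 toy samples and 5 folds (both numbers are chosen for readability and differ from the 10 folds used in this notebook). Each iteration returns two index arrays: one for the training folds and one for the validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each (assumption, for illustration only)
X_demo = np.arange(20).reshape(10, 2)

kf_demo = KFold(n_splits=5)
fold_sizes = []
for train_index, val_index in kf_demo.split(X_demo):
    # Each combination: 4 training folds (8 samples) and 1 validation fold (2 samples)
    fold_sizes.append((len(train_index), len(val_index)))
    print("train:", train_index, "validation:", val_index)
```

Every sample ends up in the validation fold exactly once across the 5 combinations.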

In [114]:
# Initialize the best f1-score and respective k value
k_best, f1_best = None, 0.0

# Loop over a range of values for setting k
for k in range(1, 20):

    kf = KFold(n_splits=10)
    f1_scores = []
    for train_index, val_index in kf.split(X_train_transformed):
        
        # Create the next combination of training and validation folds
        X_trn, X_val = X_train_transformed[train_index], X_train_transformed[val_index]
        y_trn, y_val = y_train[train_index], y_train[val_index]
    
        # Train the classifier for the current training folds
        classifier = KNeighborsClassifier(n_neighbors=k).fit(X_trn, y_trn)
        
        # Predict the labels for the validation fold
        y_pred = classifier.predict(X_val)

        # Calculate the f1-score for the validation fold
        f1_scores.append(f1_score(y_val, y_pred, average='macro'))
        
    # Calculate the f1-score across all fold combinations as the mean over all scores
    f1_fold_mean = np.mean(f1_scores)
    
    # Keep track of the best f1-score and the respective k value
    if f1_fold_mean > f1_best:
        k_best, f1_best = k, f1_fold_mean
        
        
print('The best f1-score was {:.3f} for a k={}'.format(f1_best, k_best))

The best f1-score was 0.714 for a k=5


### Automatic Cross-Validation

scikit-learn provides the even more convenient function [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), which generates the folds, splits them into training and validation folds, and trains and scores the classifier for each combination.

In [113]:
# Initialize the best f1-score and respective k value
k_best, f1_best = None, 0.0

# Loop over a range of values for setting k
for k in range(1, 20):
    
    # Specify the type of classifier
    classifier = KNeighborsClassifier(n_neighbors=k)
    
    # Perform cross validation (here with 10 folds)
    # f1_scores is an array containing the 10 f1-scores
    f1_scores = cross_val_score(classifier, X_train_transformed, y_train, cv=10, scoring='f1_macro')
    
    # Calculate the f1-score for the current k value as the mean over all 10 f1-scores
    f1_fold_mean = np.mean(f1_scores)
    
    # Keep track of the best f1-score and the respective k value
    if f1_fold_mean > f1_best:
        k_best, f1_best = k, f1_fold_mean
  

print('The best f1-score was {:.3f} for a k={}'.format(f1_best, k_best))

The best f1-score was 0.724 for a k=5
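
The two approaches report slightly different scores (0.714 vs. 0.724). One likely reason: when `cv` is an integer and the estimator is a classifier, `cross_val_score()` uses `StratifiedKFold`, which preserves the class proportions in each fold, whereas the semi-manual loop used plain `KFold`. As a sketch on synthetic data (generated with `make_classification`, an assumption for illustration), passing the same `KFold` object to `cross_val_score()` should reproduce the manual loop:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kf_demo = KFold(n_splits=10)

# Manual loop over the folds
manual_scores = []
for train_index, val_index in kf_demo.split(X_demo):
    model = KNeighborsClassifier(n_neighbors=5).fit(X_demo[train_index], y_demo[train_index])
    y_pred = model.predict(X_demo[val_index])
    manual_scores.append(f1_score(y_demo[val_index], y_pred, average='macro'))

# Same folds, but handled by cross_val_score()
auto_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=kf_demo, scoring='f1_macro'
)

print(np.allclose(manual_scores, auto_scores))
```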


## Final evaluation on Test Data

Now that we have identified the best value for $k$, we can perform the final evaluation using the test data. We can now also use the full training data and don't need to split it into folds.

In [116]:
classifier = KNeighborsClassifier(n_neighbors=k_best).fit(X_train_transformed, y_train)

y_pred = classifier.predict(X_test_transformed)

f1_final = f1_score(y_test, y_pred, average='macro')

print('The final f1-score of the KNN classifier (k={}) is: {:.3f}'.format(k_best, f1_final))
        

The final f1-score of the KNN classifier (k=5) is: 0.680


This final score is the one to report when quantifying the quality of the classifier.

## Summary

It is tempting to use the test data over and over again to find the best hyperparameter values for a classifier or regressor. However, this defeats the purpose of the test data, which is supposed to be unseen.

To find the best parameter values, it is therefore required to split the training data further into a training and a validation set. The validation set is used to evaluate a classifier for different hyperparameter values. In practice, several splits into training and validation data are typically used for each parameter setting. While different ways to generate these splits exist, in this notebook we used the common k-fold cross-validation approach.
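
The whole workflow can also be written more compactly with scikit-learn's `Pipeline` and `GridSearchCV`. Neither is used in this notebook, so treat the following as an alternative sketch on synthetic data (`make_classification` and all `*_demo` names are assumptions for illustration). Because the scaler is part of the pipeline, it is re-fitted on the training folds of every split, so the validation fold does not influence the scaling either.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and an 80%/20% split (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scaling and classification as one estimator
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# 10-fold cross-validation for k = 1..19, using the training data only
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 20))},
    cv=10,
    scoring="f1_macro",
)
search.fit(X_tr, y_tr)

# The test data is touched exactly once, after k has been fixed
best_k = search.best_params_["knn__n_neighbors"]
f1_test = f1_score(y_te, search.predict(X_te), average="macro")
print("best k: {}, test f1-score: {:.3f}".format(best_k, f1_test))
```

By default, `GridSearchCV` refits the best pipeline on the full training data, which corresponds to the final training step before the test evaluation.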