# Model Evaluation, and Kaggle Inclass Competition

### By Keiron O'Shea, and Chuan Lu

In machine learning, we must ensure that our model learns features that generalise to "real data", that is, data it did not see during training. One means of checking this is cross-validation.

Cross-validation is a technique in which a model is trained using a subset of the dataset, and then evaluated using a different (usually smaller) subset of the data that was held back from training.

This involves three steps:

1. Split the data into portions (training, validation, and testing data).
2. Use the training and validation data to train the model.
3. Use the testing data to evaluate the model performance.

For example, given a dataset of 10 examples:

In [1]:
import numpy as np
    
X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69])
y = np.concatenate([np.zeros(5), np.ones(5)])

print(X)
print(y)


[14 12 25  7  5 17 47 52 26 69]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]


We can split the data into two datasets, where 20% of the data is taken forward for testing purposes, leaving the remaining 8 samples to be used for training the model:

In [2]:
training_data = X[0:8]
testing_data = X[8:]

This is called hold-out validation, and is an extremely simplistic way of evaluating a model. However, one issue with this approach is that the performance estimate depends on a single split: only the held-out 20% is ever used for testing, so it says little about how the model performs across the entirety of the dataset.
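
As a rough sketch of a slightly more careful hold-out split: because the labels in the toy dataset are ordered, slicing off the last two samples leaves a test set containing class 1 only. scikit-learn's `train_test_split` can shuffle and stratify the split instead. The `random_state=0` seed is an arbitrary choice made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69])
y = np.concatenate([np.zeros(5), np.ones(5)])

# Naive slicing: the labels are ordered, so the test set holds class 1 only
naive_test_classes = np.unique(y[8:])
print("Classes in sliced test set:", naive_test_classes)

# Shuffled, stratified 80/20 split: the test set keeps the 50/50 class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # seed is arbitrary
)
print("Train size:", len(X_train), "Test size:", len(X_test))
print("Classes in stratified test set:", np.unique(y_test))
```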

One way of ensuring that the whole dataset is represented when evaluating our models is to repeat the split several times, holding out a different part of the data on each iteration.

**Leave One Out Cross Validation** is a simple way of doing exactly this. In this method, we train on the entire dataset except for a single example, which is held out for testing. This is repeated so that every example is used as the test sample exactly once.

So using the simple example above, we do this by:

In [3]:
y_pred = []
y_true = []

for i in range(len(X)):
    # Take every sample except i for training
    train_index = [j for j in range(len(X)) if j != i]
    X_train, y_train = X[train_index], y[train_index]
    # Hold out sample i for testing
    X_test, y_test = X[i], y[i]
    print("Training the model with %i samples, and %s tests" % (len(X_train), 1))
    # Train model on X_train, y_train
    # Add y_test to y_true
    # Add prediction from X_test to y_pred

# Run metrics here (classification_report, etc)

Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests
Training the model with 9 samples, and 1 tests


A major issue with this method is that the performance estimate can have high variance, since each test set contains only a single example. Another issue is that it may take a lot of execution time if the dataset is large, or if the model takes a while to train, because the model must be trained once per example.
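
The skeleton above can be filled in as follows. This is only a sketch: the `DecisionTreeClassifier` is a placeholder choice (any scikit-learn classifier would do), and scikit-learn's `LeaveOneOut` splitter is used in place of the hand-written index list.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

# scikit-learn expects a 2D feature matrix, hence the reshape
X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69]).reshape(-1, 1)
y = np.concatenate([np.zeros(5), np.ones(5)])

y_true, y_pred = [], []
for train_index, test_index in LeaveOneOut().split(X):
    model = DecisionTreeClassifier(random_state=0)  # placeholder classifier
    model.fit(X[train_index], y[train_index])
    # Store the held-out label and the prediction made for it
    y_true.append(y[test_index][0])
    y_pred.append(model.predict(X[test_index])[0])

# Metrics are computed once, over all ten held-out predictions
loo_accuracy = accuracy_score(y_true, y_pred)
print("Leave-one-out accuracy:", loo_accuracy)
```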

**K-Fold Cross Validation** is a commonly used alternative that needs only k training runs. In this method we split the dataset into k subsets (folds). Each fold is held out in turn for evaluation while the model is trained on the remaining k-1 folds, so every sample is used for testing exactly once.

For example, given 10 samples:

![title](images/dataset.png)

We could set the parameter K to 2. This will effectively split the data into two equal halves (where green is for training, and red is for testing):

![title](images/fold1.png)

Once evaluated, we then swap the roles of the two folds and evaluate the performance once more:

![title](images/fold2.png)


See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

So in code format:

In [10]:
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=2, random_state=None, shuffle= False)
kf.get_n_splits(X)

print(kf)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # build classifier e.g. model = RandomForestClassifier(n_estimators=100)    
    # fit model e.g. model.fit(X_train, y_train)
    
    # prediction e.g. y_prediction = model.predict(X_test)
    # print("Prediction accuracy:", np.mean(y_test == y_prediction))
    

KFold(n_splits=2, random_state=None, shuffle=False)
TRAIN: [5 6 7 8 9] TEST: [0 1 2 3 4]
TRAIN: [0 1 2 3 4] TEST: [5 6 7 8 9]
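
Note that with the unshuffled 2-fold split printed above, each training fold contains samples from only one class, because the labels are ordered. The sketch below therefore shuffles the data, and uses scikit-learn's `cross_val_score`, which wraps the same train/predict/score loop. The classifier, the number of folds and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69]).reshape(-1, 1)
y = np.concatenate([np.zeros(5), np.ones(5)])

# Shuffle so that each training fold sees both classes
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(random_state=0)  # placeholder classifier

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print("Accuracy per fold:", scores)
print("Mean accuracy: %.2f" % scores.mean())
```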


### Stratified cross-validation and random state control
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold

For classification problems, stratified cross-validation is preferred over the non-stratified version. Try stratified cross-validation, where the distribution of the class labels is kept (approximately) the same in each fold, with code in the form of: 

- from sklearn.model_selection import StratifiedKFold
- skf = StratifiedKFold(n_splits=5, random_state=2020, shuffle= True)
- skf.get_n_splits(X, y)

In order to control the random state for sample splitting (or kfold splitting), we can set the random_state parameter to a predefined number and set "shuffle" to True. This is important to ensure reproducible experiment/output in the notebook. 

When comparing different models using k-fold CV, you should also control the random state to ensure that every model is evaluated on the same k-fold splits, for a fair comparison. 
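
A minimal sketch of such a comparison, assuming two arbitrary example models: the same seeded splitter object is passed to both, so both are scored on identical folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69]).reshape(-1, 1)
y = np.concatenate([np.zeros(5), np.ones(5)])

# A fixed integer seed makes the splitter return the same folds on every call
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
first = [test for _, test in cv.split(X, y)]
second = [test for _, test in cv.split(X, y)]
same_splits = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Same folds on every call:", same_splits)

# Both (illustrative) models are evaluated on exactly the same folds
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
mean_scores = {}
for name, model in models.items():
    mean_scores[name] = cross_val_score(model, X, y, cv=cv).mean()
    print("%s: mean accuracy %.2f" % (name, mean_scores[name]))
```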

In [13]:
# Change the random state in the above code, and use stratified CV, observe any changes in the training/testing index
# 
# Your code here

# random_state is a seed for the shuffle, so we get repeatable splits & therefore results
kf = KFold(n_splits=5, random_state=68, shuffle= True)
kf.get_n_splits(X)

print(kf)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

KFold(n_splits=5, random_state=68, shuffle=True)
TRAIN: [0 2 3 4 5 7 8 9] TEST: [1 6]
TRAIN: [0 1 2 4 6 7 8 9] TEST: [3 5]
TRAIN: [1 2 3 4 5 6 7 9] TEST: [0 8]
TRAIN: [0 1 2 3 5 6 7 8] TEST: [4 9]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
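
The cell above shuffles with plain `KFold`. For the stratified variant the exercise asks about, a sketch along these lines could be used, reusing the `random_state=2020` value from the snippet earlier in this section. With five samples per class and five folds, each test fold should receive one sample of each class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69])
y = np.concatenate([np.zeros(5), np.ones(5)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)

test_class_counts = []
# Unlike KFold, the stratified splitter needs the labels y to balance the folds
for train_index, test_index in skf.split(X, y):
    counts = np.bincount(y[test_index].astype(int), minlength=2)
    test_class_counts.append(counts.tolist())
    print("TRAIN:", train_index, "TEST:", test_index, "test class counts:", counts)
```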


###  ROC Curve and AUC (Area under the Curve)

We've already played around with performance metrics in previous workshops, but we've not had any chance to play around with how we can visualise said results effectively.

But before we do this, we must first venture into the wonderfully simplistic, yet incredibly powerful **Confusion Matrix**. The Confusion Matrix is a performance measurement for machine learning classification. In the simplest terms, for a binary problem it consists of a single table with the 4 combinations of predicted and actual values. A nice example I found online was the pregnancy analogy:

![title](images/preggo.png)

- **True Positive/TP:** You have predicted positive, and that is indeed the case.
- **True Negative/TN:** You have predicted negative, and that is indeed the case.
- **False Positive/FP/Type 1 Error:** You have predicted positive, where the case is actually negative.
- **False Negative/FN/Type 2 Error:** You have predicted negative, where the case is actually positive.
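
As a small illustration, the four counts can be read off scikit-learn's `confusion_matrix`. The labels and predictions below are made up for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For labels {0, 1}, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP=%d TN=%d FP=%d FN=%d" % (tp, tn, fp, fn))  # TP=3 TN=5 FP=1 FN=1
```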


Using these four counts, we are able to calculate an array of performance metrics. For example:

$$Recall = {TP \over{TP + FN}}$$

Which calculates the proportion of actual positive examples that were correctly predicted as positive.

$$Precision = {TP \over{TP + FP}}$$

Which calculates the proportion of positive predictions that were actually positive.

$$F\text{-}measure = {2*(Recall*Precision) \over{Recall + Precision}}$$

This uses the harmonic mean of recall and precision, which punishes extreme values more than the arithmetic mean does: a model with low precision and high recall (or vice versa) receives a low F-measure.

For the ROC curve, the major calculations are the **true positive rate** (identical to recall, also called sensitivity):

$$TPR = {TP \over{TP+FN}}$$

The **false positive rate**:

$$FPR = {FP \over{FP+TN}}$$

And the **specificity**:

$$Specificity = {TN \over{TN+FP}}$$
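
A short worked example of these formulas, using made-up counts (TP=3, TN=5, FP=1, FN=1) and the standard definition FPR = FP / (FP + TN). The scikit-learn calls are only there to cross-check the hand calculation.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions giving TP=3, TN=5, FP=1, FN=1
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = 3, 5, 1, 1

recall = tp / (tp + fn)        # also the true positive rate (TPR)
precision = tp / (tp + fp)
f_measure = 2 * (recall * precision) / (recall + precision)
fpr = fp / (fp + tn)
specificity = tn / (tn + fp)   # FPR and specificity sum to 1

print("Recall / TPR: %.3f" % recall)        # 0.750
print("Precision:    %.3f" % precision)     # 0.750
print("F-measure:    %.3f" % f_measure)     # 0.750
print("FPR:          %.3f" % fpr)           # 0.167
print("Specificity:  %.3f" % specificity)   # 0.833

# Cross-check the hand calculation against scikit-learn
sk_recall = recall_score(y_true, y_pred)
sk_precision = precision_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```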


The **Receiver Operating Characteristic** (ROC) curve and the **Area Under the Curve** (AUC) are very popular for evaluating the performance of binary classification models. See: 

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html and https://en.wikipedia.org/wiki/Receiver_operating_characteristic

The ROC curve is constructed by plotting the false positive rate (1 - specificity) on the x-axis against the true positive rate (sensitivity) on the y-axis at various threshold settings (for a continuous output, e.g. the probability output given by a decision tree model). AUC is the area under this curve. The best possible model has an AUC of 1, which means that it separates the two classes perfectly. An AUC of 0 means that the model ranks every negative above every positive, i.e. its predictions are perfectly inverted, and an AUC of 0.5 illustrates that the model has no class separation capacity whatsoever, equivalent to random guessing.
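
To make the thresholding concrete, here is a small sketch with four made-up samples and scores. `roc_curve` returns one (FPR, TPR) point per threshold, and `roc_auc_score` gives the area under those points.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Four made-up samples: true labels and continuous model scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold turns the scores into hard predictions, giving one ROC point
fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold=%s FPR=%.2f TPR=%.2f" % (th, f, t))

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc_value = roc_auc_score(y_true, scores)
print("AUC:", auc_value)  # 0.75

# Extremes: a perfect ranking and a perfectly inverted ranking
auc_perfect = roc_auc_score(y_true, y_true)
auc_inverted = roc_auc_score(y_true, 1 - y_true)
print("Perfect:", auc_perfect, "Inverted:", auc_inverted)  # 1.0 and 0.0
```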

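You can see an example of how to plot ROC curves in scikit-learn v0.22 (note that plot_roc_curve was deprecated in v1.0 and removed in v1.2 in favour of RocCurveDisplay.from_estimator): 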
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html#sklearn.metrics.plot_roc_curve

or ROC curves with cross-validation
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py


Alternatively, for older versions of scikit-learn, an example of plotting the ROC curve in Python can be found below:

**Note:** This only works for binary classification tasks.

In [14]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_auc_and_roc(probabilities, y_true):
    fpr, tpr, _ = roc_curve(y_true, probabilities)
    calc_auc = auc(fpr, tpr)
        
    plt.title("Receiver Operating Characteristic")
    plt.plot(fpr, tpr, "red", label="AUC %0.2f" % calc_auc)
    plt.plot([0, 1], [0, 1], "b--")
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.legend(loc="lower right")
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
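
A sketch of how the inputs for this function might be produced. The synthetic dataset from `make_classification`, the random forest and the seeds are all assumptions standing in for real data and a real model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary dataset (an assumption, standing in for real data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (label 1)
probabilities = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, probabilities)
test_auc = auc(fpr, tpr)
print("AUC: %0.2f" % test_auc)

# With the plotting helper defined in the cell above:
# plot_auc_and_roc(probabilities, y_test)
```

The ROC curve needs continuous scores, so pass the output of `predict_proba` rather than the hard labels from `predict`.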

### Now, over to you

Register for Kaggle using your Aberystwyth University (AU) email address.

https://www.kaggle.com

Log in and join the following competition:

https://www.kaggle.com/t/e366e0f54f694a1fbdd0e921e7617e55


A major part of these workshops is to help you go out and do this for yourself. That being said, we do expect you to achieve the following:

1. Load in the data and explore it. Transform the data into a format that a machine learning technique can use. 

2. You might need to further split the training set into train/validation sets (depending on the sample size of the labelled data and the problem at hand. Due to the time limit in the class, you may skip this step for today's exercise). 

3. On the training set, use model selection and some automated hyperparameter tuning techniques (such as grid search or randomized search, see https://scikit-learn.org/stable/modules/grid_search.html or practical2) to select a 'best' model (i.e. the model type and the relevant hyperparameters with the best performance). Initially, you might start by selecting one simple model and using the default settings or manually selected hyperparameters, in order to get a rough idea of the problem at hand. 

4. Evaluate the model using an appropriate form of cross-validation, in this case providing the AUC metrics and ROC curves with cross-validation or on the validation set (if made available in step 2). Is the performance good enough? If not, go back to step 3, or repeat step 4 with a different model type and/or hyperparameters. 

5. Refit the 'best' model (model type and the relevant hyperparameters) to all the training data, then make predictions with probability output on the test set, and save the results to a file in an appropriate format. 

6. Submit the prediction file to Kaggle inclass competition. 

7. Repeat the procedure from step 3, and see if you can improve the external public validation performance. 
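
Steps 3 to 5 above could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the synthetic data stands in for the competition files, the parameter grid is arbitrary, and the `submission.csv` file name and `id,target` header are hypothetical, so check the competition page for the required format.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the competition data (assumption)
X_all, y_all = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train = X_all[:200], y_all[:200]  # labelled training data
X_test = X_all[200:]                         # test data, labels treated as unknown

# Step 3: grid search, scored by cross-validated AUC on fixed stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="roc_auc", cv=cv
)
search.fit(X_train, y_train)

# Step 4: inspect the cross-validated performance of the selected model
print("Best parameters:", search.best_params_)
print("Cross-validated AUC: %.3f" % search.best_score_)

# Step 5: refit=True by default, so best_estimator_ is already refitted on
# all of the training data; predict probabilities for the positive class
test_probabilities = search.best_estimator_.predict_proba(X_test)[:, 1]

# Save predictions (hypothetical file name and column names)
ids = np.arange(len(X_test))
np.savetxt(
    "submission.csv",
    np.column_stack([ids, test_probabilities]),
    delimiter=",", header="id,target", comments="", fmt=["%d", "%.6f"],
)
```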

**REMEMBER:** IT IS OK NOT TO KNOW, IF YOU ARE STRUGGLING TO MAKE A START ON THIS PROJECT, CALL OVER A DEMONSTRATOR FOR ASSISTANCE. 