# Model Evaluation and Kaggle Datasets

### By Keiron O'Shea, and Chuan Lu

In machine learning, we must ensure that our model learns features that generalise to "real data", rather than only to the examples it was trained on. One means of checking that our models do this is through the use of cross-validation.

Cross-validation is a technique in which a model is trained using one subset of the dataset, and then evaluated using a separate (usually smaller) subset of the data.

This involves three steps:

1. Split the data into training, validation, and testing portions.
2. Use the training and validation data to train the model.
3. Use the testing data to evaluate the model performance.

For example, given a dataset of 10 examples:

In [None]:
import numpy as np
X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69])

We can split the data into two datasets, where 20% of the data is taken forward for testing purposes, leaving the remaining 8 samples to be used for training the model:

In [None]:
training_data = X[0:8]
testing_data = X[8:]

This is called hold-out validation, and is an extremely simplistic way of evaluating a model. However, one issue with this approach is that the model is only ever tested on one fixed portion of the data, so it fails to provide any real measure of how the model performs across the entirety of the dataset.

One way of ensuring that the whole dataset is represented when evaluating our models is to repeat the split several times, holding out a different portion of the data on each iteration.
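
The slicing above always holds out the last two samples. As a minimal sketch, a shuffled split that also sets aside a validation portion could be written with scikit-learn's `train_test_split`. The split sizes and the `random_state` values here are assumptions chosen for illustration, not part of the original workshop.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69])

# Hold back 20% of the samples (2 of 10) for testing, shuffling first.
X_rest, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Carve a validation set out of what is left (25% of 8 = 2 samples).
X_train, X_val = train_test_split(X_rest, test_size=0.25, random_state=0)

print("train: %i, validation: %i, test: %i" % (len(X_train), len(X_val), len(X_test)))
```

Fixing `random_state` only makes the shuffle repeatable. The result is still a single split, so the limitation described above still applies.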

**Leave One Out Cross Validation** is a simple way of doing exactly this. In this method, we train on the entire dataset except for a single example, which is left out for testing purposes. This is repeated until every example has been used for testing exactly once.

So using the simple example above, we do this by:

In [None]:
y_pred = []
y_true = []

for i in range(len(X)):
    X_train = X[[x for x in range(len(X)) if x != i]]
    # y_train ....
    X_test = X[i]
    print("Training the model with %i samples, and %s tests" % (len(X_train), 1))
    # y_test ....
    # Train model
    # Add y_test to y_true
    # Add prediction from X_test to y_pred
    
# Run metrics here (classification_report, etc)
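
The skeleton above leaves the labels and the model as placeholders. One possible way of filling them in is sketched below, using scikit-learn's `LeaveOneOut` splitter. The labels `y` (1 if the value is above 20) and the choice of a 1-nearest-neighbour classifier are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn expects a 2D feature array, hence the reshape.
X = np.array([14, 12, 25, 7, 5, 17, 47, 52, 26, 69]).reshape(-1, 1)
# Hypothetical labels, for illustration only: 1 if the value is above 20.
y = (X.ravel() > 20).astype(int)

y_true, y_pred = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Train a fresh model on the 9 remaining samples.
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])
    # Record the true label and the prediction for the held-out sample.
    y_true.append(y[test_idx][0])
    y_pred.append(model.predict(X[test_idx])[0])

loo_accuracy = accuracy_score(y_true, y_pred)
print("Leave-one-out accuracy: %0.2f" % loo_accuracy)
```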

A major issue of this method is that each test set contains only a single example, which may lead to high variance in the test results. Another issue is that it may take a lot of execution time if the dataset is large, or if the model takes a while to learn, because one model is trained per example.

**K-Fold Cross Validation** is a less expensive alternative that still tests on every example. In this method we split the dataset into k subsets (folds). Each fold is used once for evaluation, with the training performed on the remaining k-1 folds.

For example, given 10 samples:

![title](images/dataset.png)

We could set k to 2. This will effectively split the data into two equally sized folds (where green is for training, and red is for testing):

![title](images/fold1.png)

Once evaluated, we then swap the two folds and evaluate the performance once more:

![title](images/fold2.png)


So in code format:

In [None]:
from sklearn.model_selection import KFold

kf = KFold(2)

for train, test in kf.split(X):
    print("\nTraining the model with a dataset of %i length" % len(train))
    print("Test the model with a dataset of %i length" % len(test))

If needed, adjust the number of folds to help figure out how this works.
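
The loop above only prints the fold sizes. As a rough sketch of scoring a model on each fold, scikit-learn's `cross_val_score` can run the training and evaluation for us. The synthetic dataset from `make_classification`, the logistic regression model, and the choice of 5 folds are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data, used here only as a stand-in.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Shuffle before splitting so each fold is a random sample of the data.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold: train on 4 folds, evaluate on the 5th.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kf)

print("Accuracy per fold:", scores)
print("Mean accuracy: %0.2f (std %0.2f)" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds gives a better idea of how stable the model is than a single hold-out score.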

### AUC - ROC Curve

We've already played around with performance metrics in previous workshops, but we've not had any chance to play around with how we can visualise said results effectively.

But before we do this, we must first venture into the wonderfully simplistic, yet incredibly powerful **Confusion Matrix**. The Confusion Matrix is a performance measurement for machine learning classification. In the simplest terms, for a binary task it consists of a single table with 4 combinations of predicted and actual values. A nice example I found online was the pregnancy analogy:

![title](images/preggo.png)

- **True Positive/TP:** You have predicted positive, and that is indeed the case.
- **True Negative/TN:** You have predicted negative, and that is indeed the case.
- **False Positive/FP/Type 1 Error:** You have predicted positive, where the case is actually negative.
- **False Negative/FN/Type 2 Error:** You have predicted negative, where the case is actually positive.
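
As a small sketch, the four counts listed above can be read off scikit-learn's `confusion_matrix`. The labels and predictions below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], rows are actual classes and columns are predictions,
# so flattening the 2x2 matrix gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("TP=%i TN=%i FP=%i FN=%i" % (tp, tn, fp, fn))
```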

Using this simple metric, we are able to calculate an array of performance metrics. For example:

$$Recall = {TP \over{TP + FN}}$$

This is the proportion of actual positive examples that were correctly predicted as positive.

$$Precision = {TP \over{TP + FP}}$$

This is the proportion of positive predictions that were actually positive.

$$F-measure = {2*(Recall*Precision) \over{Recall + Precision}}$$

The F-measure uses the harmonic mean of the two, which punishes extreme values more than a simple average would. A model with low precision and high recall (or vice versa) therefore receives a low score.
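
To make the three formulas concrete, here is a minimal sketch that computes them by hand and compares the results with scikit-learn's `recall_score`, `precision_score`, and `f1_score`. The labels and predictions are hypothetical.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Count the confusion matrix cells needed by the formulas.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_measure = 2 * (recall * precision) / (recall + precision)

print("By hand:      recall=%0.3f precision=%0.3f F=%0.3f" % (recall, precision, f_measure))
print("scikit-learn: recall=%0.3f precision=%0.3f F=%0.3f" % (
    recall_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    f1_score(y_true, y_pred),
))
```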


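The **Receiver Operating Characteristic** (ROC) curve and the **Area Under the Curve** (AUC) are two of the most important evaluation tools for checking any classification model's true predictive performance. The ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied, and the AUC summarises that curve as a single number.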

The major calculations are the **true positive rate**:

$$TPR = {TP \over{TP+FN}}$$

The **false positive rate**:

$$FPR = {FP \over{FP+TN}}$$

And the **specificity**:

$$Specificity = {TN \over{TN+FP}}$$
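
Each point on a ROC curve comes from applying one decision threshold to the model's scores and computing these rates. The sketch below does this for two thresholds. The helper `rates_at_threshold`, the scores, and the labels are all hypothetical and exist only for illustration.

```python
import numpy as np

def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR, specificity) when scores >= threshold count as positive."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(scores) >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    specificity = tn / (tn + fp)
    return float(tpr), float(fpr), float(specificity)

# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

for threshold in (0.5, 0.3):
    tpr, fpr, spec = rates_at_threshold(y_true, scores, threshold)
    print("threshold=%0.1f TPR=%0.2f FPR=%0.2f specificity=%0.2f" % (threshold, tpr, fpr, spec))
```

Note that the false positive rate is always one minus the specificity, so lowering the threshold trades specificity for a higher true positive rate.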

The best possible model has an AUC of 1, which means that it separates the two classes perfectly. An AUC of 0 is the worst measure of separability: the model ranks every negative example above every positive one, so its predictions are exactly inverted. Where the AUC is 0.5, it illustrates that the model has zero class separation capacity whatsoever, which is no better than random guessing.

The code to calculate and plot the AUC and ROC in Python can be found below:
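
These three reference points can be checked with scikit-learn's `roc_auc_score`. The scores below are made up to produce a perfect ranking, an inverted ranking, and a constant (uninformative) prediction.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels: two negatives followed by two positives.
y_true = [0, 0, 1, 1]

# Every positive is scored above every negative.
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])
# Every positive is scored below every negative.
auc_inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])
# The same score for every example carries no ranking information.
auc_constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print("perfect: %0.1f, inverted: %0.1f, constant: %0.1f" % (auc_perfect, auc_inverted, auc_constant))
```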

**Note:** This only works for binary classification tasks.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_auc_and_roc(probabilities, y_true):
    fpr, tpr, _ = roc_curve(y_true, probabilities)
    calc_auc = auc(fpr, tpr)
    
    
    plt.title("Receiver Operating Characteristic")
    plt.plot(fpr, tpr, "red", label="AUC %0.2f" % calc_auc)
    plt.plot([0, 1], [0, 1], "b--")
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.legend(loc="lower right")
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
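
The function expects one positive-class probability per sample. One way to obtain these under cross validation is sketched below with `cross_val_predict`, so that every probability comes from a model that did not see that sample during training. The synthetic data and the logistic regression model are assumptions, standing in for whatever dataset and model you choose.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data, used here only as a stand-in.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# predict_proba returns one column per class; column 1 is the positive class.
probabilities = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, method="predict_proba"
)[:, 1]

print("Cross-validated AUC: %0.2f" % roc_auc_score(y_demo, probabilities))

# With the plotting function from the cell above defined, the curve can be drawn with:
# plot_auc_and_roc(probabilities, y_demo)
```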

### Now, over to you

A major part of these workshops is to help you go out and do this for yourself. That being said, we do expect you to achieve the following:

- Load the data into a format that a machine learning technique can use.
- Use models and techniques from previous workshops to build a model.
- Evaluate the model using an appropriate form of cross validation (simple validation **will not suffice**).
- Make use of the ROC function above to illustrate model performance.


Register for Kaggle using your Aberystwyth University (AU) email address.

https://www.kaggle.com

Log in and join the following competition:

https://www.kaggle.com/c/csm6420-bbbp/

**REMEMBER:** IT IS OK NOT TO KNOW. IF YOU ARE STRUGGLING TO MAKE A START ON THIS PROJECT, CALL OVER A DEMONSTRATOR FOR ASSISTANCE. MAKE ME EARN MY £8 AN HOUR