# Objectives

1. Data splitting: Train, Test, and Validation
2. What is Data Leakage?
3. Underfitting and Overfitting
4. Evaluation metrics for classifiers

# Warm-up: Cancer Detection

**Consider this simplified example inspired by this paper:**

- Two new skin cancer detection classifiers have been developed and tested on 1000 images of patients’ skin, of which 50 show signs of cancer.

- The first model predicts a high risk of cancer for 800 out of 1000 patients’ images in the test set. Of these 800 images, 50 actually show signs of skin cancer. Hence, all problematic images are correctly identified.

- The second classifier categorizes 100 out of 1000 images into the high-risk group. 40 of these 100 images show real signs of cancer. The remaining 10 cancerous images are not identified and are falsely classified as low-risk.

**Compare the outcome of these two classifiers. Which one do you prefer?**

## 1. Data splitting: Train, Test, and Validation

![data_splitting.png](attachment:data_splitting.png)

**Training set**: what the machine learning model uses to learn. This sample needs to be representative of the population, and it must not contain any information leaked from the test sample.


**Validation set**: what you use to optimise your model parameters and to choose between different models. Since it influences the model(s), it must not contain any information leaked from the test sample either.


**Test set**: what you use to check how your model performs on unseen data. We need to be very strict about keeping the "unseen" property intact.


### Data Splitting convention

**Important Note:** with imbalanced data, take care when splitting that the underrepresented classes are well covered in the validation and test samples (more on that later).

![data_splitting_convention.png](attachment:data_splitting_convention.png)
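
The note on imbalanced data can be illustrated with a stratified split, where each class is split separately so that its proportion is preserved in every subset. The following is a minimal sketch using only the standard library; the helper name `stratified_split` and the 60/20/20 fractions are assumptions chosen for illustration and are not taken from the figure above.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists that preserve class proportions.

    Hypothetical helper for illustration; the default fractions are an assumption.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy imbalanced labels: 950 negatives (0) and 50 positives (1).
labels = [0] * 950 + [1] * 50
train_idx, val_idx, test_idx = stratified_split(labels)

# Number of positive samples that landed in each subset.
positives_per_split = [
    sum(labels[i] for i in part) for part in (train_idx, val_idx, test_idx)
]
print(positives_per_split)  # [30, 10, 10]
```

In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.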

## 2. What is Data Leakage?

![data_leakage.png](attachment:data_leakage.png)

- **Data Leakage** happens when we fail to preserve the "unseen" property of the test sample, so that the model has already learned something about it during training.


- Data leakage can be very subtle. It can creep in at any phase of the project life cycle, and it leads to overly optimistic results.


- Even well-established scientists and researchers sometimes miss subtle sources of leakage.

### Example 1: Feature Leakage

![data_leakage_e1.png](attachment:data_leakage_e1.png)

### Example 2: Premature Featurization Leakage

![data_leakage_e2.png](attachment:data_leakage_e2.png)
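
A minimal sketch of this kind of leakage, assuming that "premature featurization" here means fitting a preprocessing step (standardization in this case) on the full dataset before splitting it. The data is synthetic and the helper `standardize` is hypothetical. The point is only that the statistics used for scaling should come from the training split alone.

```python
import random
import statistics

rng = random.Random(0)
values = [rng.gauss(50, 10) for _ in range(100)]  # synthetic feature values
train, test = values[:80], values[80:]

def standardize(xs, mean, std):
    """Scale values with the given mean and standard deviation."""
    return [(x - mean) / std for x in xs]

# Leaky: statistics computed on ALL data, so the test set influences the features.
mean_all, std_all = statistics.mean(values), statistics.pstdev(values)
train_leaky = standardize(train, mean_all, std_all)

# Safer: statistics computed on the training split only, then reused for the test split.
mean_train, std_train = statistics.mean(train), statistics.pstdev(train)
train_scaled = standardize(train, mean_train, std_train)
test_scaled = standardize(test, mean_train, std_train)
```

The same rule applies to any fitted transformation (imputation, vocabulary building, feature selection): fit on the training data, then apply to validation and test data.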

### Example 3: Non-iid Data Leakage

**link to the paper**: https://arxiv.org/abs/1711.05225

![data_leakage_e3.png](attachment:data_leakage_e3.png)
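
A minimal sketch of one common remedy, assuming the example concerns several images from the same patient ending up on both sides of a random split. Instead of splitting individual images, whole patients are assigned to either the training or the test set. The helper `split_by_group` and the patient IDs are hypothetical.

```python
import random

def split_by_group(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both train and test.

    Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# One entry per image; several images can belong to the same (made-up) patient.
patient_ids = ["p1", "p1", "p1", "p2", "p2", "p3", "p4", "p4", "p5", "p5"]
train_idx, test_idx = split_by_group(patient_ids)

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print(train_patients.isdisjoint(test_patients))  # True
```

scikit-learn offers group-aware splitters such as `GroupShuffleSplit` and `GroupKFold` for the same purpose.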

## 3. Underfitting and Overfitting

![overfitting_underfitting.png](attachment:overfitting_underfitting.png)


<div>
<img src="attachment:bias_variance.png" width="500"/>
</div>

- **Overfitting** means that your model is so complex that it memorizes the training data and cannot generalize well to new data


- **Underfitting** means that your model is so simple that it fails to capture the relationships needed for prediction


- An ideal model lies somewhere in between: not too complex and not too simple, so that it generalizes well to new data

**How can we identify these problems?**


1. Comparing the training and test scores will tell you whether your model is overfitting. In that case, your training score will be much higher than your test score, e.g. 90% accuracy for training but 60% accuracy for test


2. If you have a low score for both training and test, your model is underfitting and you need to work more on the model (e.g. use a more flexible model or do more feature engineering) to achieve better results
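
The two checks above can be sketched as a small helper. The function name `diagnose_fit` and both thresholds (`max_gap`, `min_score`) are assumptions made for illustration; sensible values depend on the task and the metric.

```python
def diagnose_fit(train_score, test_score, max_gap=0.1, min_score=0.7):
    """Rough diagnosis from training and test scores (hypothetical thresholds)."""
    if train_score - test_score > max_gap:
        return "overfitting"   # training score much higher than test score
    if train_score < min_score and test_score < min_score:
        return "underfitting"  # both scores are low
    return "reasonable fit"

print(diagnose_fit(0.90, 0.60))  # overfitting
print(diagnose_fit(0.55, 0.52))  # underfitting
print(diagnose_fit(0.88, 0.85))  # reasonable fit
```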

## 4. Evaluation metrics for classifiers

### 4.1 Confusion Matrix


<div>
<img src="attachment:confusion_matrix.jpeg" width="600"/>
</div>

### 4.2 Accuracy, Precision and Sensitivity (a.k.a. Recall)

![precision_recall.png](attachment:precision_recall.png)

[source](https://medium.com/swlh/how-to-remember-all-these-classification-concepts-forever-761c065be33)

### Example:

For the cancer detection example in the course notes:

1. Draw the confusion matrix for both models


2. Calculate the accuracy, precision and recall for each
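
Once you have worked through the exercise by hand, a short sketch like the one below can be used to check your answers. It uses the standard definitions of accuracy, precision and recall in terms of true/false positives and negatives; the helper name `classification_metrics` is hypothetical, and the counts are derived from the warm-up description (1000 images, 50 of them with signs of cancer).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Model 1: flags 800 images, 50 of them truly cancerous -> 750 false positives,
# no missed cases, and the remaining 200 images are true negatives.
model_1 = classification_metrics(tp=50, fp=750, fn=0, tn=200)

# Model 2: flags 100 images, 40 of them truly cancerous -> 60 false positives,
# 10 missed cases, and 890 true negatives.
model_2 = classification_metrics(tp=40, fp=60, fn=10, tn=890)

print(model_1)  # accuracy 0.25, precision 0.0625, recall 1.0
print(model_2)  # accuracy 0.93, precision 0.4, recall 0.8
```

Note how the model with the much higher accuracy is also the one that misses 10 cancer cases, which is why accuracy alone is a poor guide on imbalanced data.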

### 4.3 F1 score

<div>
<img src="attachment:f1score.png" width="400"/>
</div>

- the harmonic mean of the precision and recall


- a metric that balances the trade-off between precision and recall
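
As a small sketch, the harmonic mean can be computed directly as `2 * precision * recall / (precision + recall)`. The precision and recall values below follow from the warm-up counts (50/800 and 50/50 for the first classifier, 40/100 and 40/50 for the second); the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_model_1 = f1_score(0.0625, 1.0)  # approx. 0.118
f1_model_2 = f1_score(0.4, 0.8)     # approx. 0.533

# For comparison: the arithmetic mean hides the very low precision of model 1.
arithmetic_mean_model_1 = (0.0625 + 1.0) / 2  # 0.53125
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot reach a high F1 score by maximizing recall at the expense of precision, or vice versa.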

### 4.4 Receiver Operating Characteristic (ROC) curve



<div>
<img src="attachment:roc_curve.png" width="500"/>
</div>


<div>
<img src="attachment:auc_equations.png" width="200"/>
</div>
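
A minimal sketch of how a ROC curve and its area under the curve (AUC) can be computed by hand, assuming the usual definitions: true positive rate TPR = TP / (TP + FN) and false positive rate FPR = FP / (FP + TN), evaluated at every decision threshold. The helper names `roc_points` and `auc` and the toy scores are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy example: true labels and predicted scores for four samples.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives. scikit-learn provides `roc_curve` and `roc_auc_score` for real use.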