# Module 7 - Advanced Topics in Performance Evaluation

## Module Overview

- Final module in the programme's first phase, which covers machine learning foundations.

## Learning outcomes

- LO 1: Identify examples where oversampling might be useful.
- LO 2: Apply oversampling to classification problems where the binary data of interest is rare.
- LO 3: Estimate the performance of a given predictor using the k-fold cross-validation algorithm.
- LO 4: Apply predictive techniques to a multi-variable, real-world data set.

## Misc and Keywords

## Introduction to Module Seven

- Explores **two techniques for fine-tuning** machine learning algorithms:
    - **Oversampling** is used to adjust the balance between Type I and Type II errors (also known as false positives and false negatives)
        - False positives: the model incorrectly predicts that a "no" sample is a "yes" sample
        - False negatives: the model incorrectly predicts that a "yes" sample is a "no" sample
    - **Cross-validation** is a more data-efficient alternative to the train/validation/test set approach
        - Intelligently reuses training and validation data to make better use of the available data
        - Requires more computational power
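
The two error types defined above can be counted directly from a list of true labels and a list of predictions. The snippet below is a minimal sketch; the function name `count_errors` and the example labels are made up for illustration.

```python
def count_errors(y_true, y_pred, positive="yes"):
    """Return (false_positives, false_negatives) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    # False positive: the true label is "no" but the prediction is "yes".
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negative: the true label is "yes" but the prediction is "no".
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return fp, fn

# Hypothetical labels, made up for illustration.
y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_pred = ["yes", "no", "yes", "no", "no", "no"]

fp, fn = count_errors(y_true, y_pred)
print(fp, fn)  # 1 2
```

Here the predictor misses two of the three "yes" samples, which is the kind of behaviour oversampling is meant to reduce when the "yes" class is rare.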

## Oversampling

- A technique used to address the issue of class imbalance in a dataset. Class imbalance occurs when one or more classes have significantly fewer samples than others, which can lead to biased models that favour the majority class.
- Involves increasing the size of the minority class by adding more samples, so its representation becomes comparable to that of the majority class.
- **Approaches**
    - **Basic duplication**: create identical duplicates of the minority class samples
    - **Synthetic data**: generate artificial data based on the few samples that exist. Common techniques include:
        - **SMOTE (Synthetic Minority Oversampling Technique)** generates new samples by interpolating between existing minority class samples. For example, it takes two samples from the minority class and creates a synthetic point along the line connecting them in feature space.
        - **ADASYN (Adaptive Synthetic Sampling)** improves on SMOTE by generating synthetic samples in regions where the minority class is underrepresented or more difficult to learn.
        - **VAEs (Variational Autoencoders)** and **GANs (Generative Adversarial Networks)** create realistic synthetic data points for the minority class, often used in domains like image and text data.
    - **Weighted oversampling**: Sometimes, oversampling is combined with assigning different weights to samples to emphasise minority class data without creating excessive duplication or overfitting
- **Challenges**
    - Overfitting: If oversampling is performed by duplicating minority class samples, it can lead to overfitting, as the model sees the same data repeatedly.
    - Computational Cost: Oversampling increases the size of the dataset, which may lead to higher computational requirements during training.
    - Synthetic Sample Quality: In cases where synthetic data is generated (e.g., SMOTE), the quality of these samples can significantly impact the model's performance.
- **Stratified sampling** involves dividing a dataset into distinct, non-overlapping groups, known as strata, and then randomly sampling from each of these strata. The goal of stratified sampling is to ensure that each subgroup of the population is well represented in the sample, which can be particularly useful when dealing with imbalanced datasets.
    - When you split your data into a training set and a test set, stratified sampling ensures that the proportion of each class in both sets is the same as in the original dataset. This avoids splits in which the rare class is over- or under-represented (or missing entirely) in one of the sets, which can result in biased performance metrics.
    - Approach:
        - Divide the available data into two sets (strata):
            - all samples of the class of interest, i.e. the rare class (set A).
            - all other samples (set B).
        - Construct the training set:
            - randomly select 50 per cent of the samples in set A.
            - add equally many samples from set B.
        - Construct the validation set:
            - select the remaining 50 per cent of samples from set A.
            - add enough samples from set B to restore the original ratio from the overall data set.
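
Two ideas from this section lend themselves to a short sketch: the SMOTE interpolation step and the set A / set B construction. The code below is a minimal illustration using only the standard library, not a reference implementation. In particular, real SMOTE interpolates towards one of a sample's k nearest minority neighbours, whereas this sketch simply picks two minority samples at random (a simplifying assumption). The function names `smote_like_point` and `split_rare_class` and the example data are hypothetical, and the sketch assumes the rare class has at least two samples.

```python
import random

def smote_like_point(minority, rng):
    """Return one synthetic point on the segment between two minority samples."""
    a, b = rng.sample(minority, 2)   # two distinct minority samples, chosen at random
    t = rng.random()                 # position along the segment, in [0, 1)
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

def split_rare_class(labels, rare_label, rng):
    """Return (train_idx, val_idx) built from set A (rare) and set B (the rest)."""
    set_a = [i for i, y in enumerate(labels) if y == rare_label]
    set_b = [i for i, y in enumerate(labels) if y != rare_label]
    rng.shuffle(set_a)
    rng.shuffle(set_b)
    half = len(set_a) // 2
    # Training set: 50 per cent of set A plus equally many samples from set B.
    train_idx = set_a[:half] + set_b[:half]
    # Validation set: the rest of set A plus enough unused set B samples
    # to restore the original B-to-A ratio (capped by what is left of set B).
    val_a = set_a[half:]
    n_val_b = min(round(len(val_a) * len(set_b) / len(set_a)), len(set_b) - half)
    val_idx = val_a + set_b[half:half + n_val_b]
    return train_idx, val_idx

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 10 rare "yes" samples and 90 "no" samples.
labels = ["yes"] * 10 + ["no"] * 90
train_idx, val_idx = split_rare_class(labels, "yes", rng)
print(len(train_idx), len(val_idx))  # 10 50

# Hypothetical 2-D minority samples lying on the line y = 2x.
point = smote_like_point([[1.0, 2.0], [3.0, 6.0]], rng)
print(point)
```

With these numbers the training set is balanced (5 rare, 5 other), while the validation set keeps the original 1:9 ratio (5 rare, 45 other), so validation metrics reflect how rare the class really is.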

## K-Fold Cross-validation

- Shortcomings of a single train/validation split:
    - Since you need to split the data into different sets, you almost invariably end up with either too little training data or too little validation data.
    - If the data is randomly split into training and validation data, the approach gives different results whenever the data is reshuffled.
- Cross-validation can alleviate both of the shortcomings above.
- The k-fold process
    - Split the data into k equal parts, called folds (k is often 5, 7 or 10)
    - For each iteration $i = 1, 2, \dots, k$, use fold $i$ as the validation set and the remaining $k - 1$ folds as the training set
    - After k iterations, average the validation performance over all k runs
    - When comparing several models, select the one with the best average validation performance
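- For example:
    - The dataset: $[x_1 = 3.2, x_2 = 1.7, x_3 = 7.2, x_4 = 4.0, x_5 = 8.1]$
    - Take k = 5, so each fold holds a single sample, and use the average predictor, which always predicts the mean of the training values. In fold 1, $x_1$ is the validation set and $x_2, \dots, x_5$ form the training set.
    - For this fold, the average predictor is (1.7 + 7.2 + 4.0 + 8.1) / 4 = 5.25, and its mean absolute error on the validation set is $|3.2 - 5.25| = 2.05$.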
    - Similarly, for the other folds, we obtain:
        - Fold 2: average predictor 5.625, mean absolute error on validation set 3.925
        - Fold 3: average predictor 4.25, mean absolute error on validation set 2.95
        - Fold 4: average predictor 5.05, mean absolute error on validation set 1.05
        - Fold 5: average predictor 4.025, mean absolute error on validation set 4.075
    - So, the overall estimate of the mean absolute error is (2.05 + 3.925 + 2.95 + 1.05 + 4.075) / 5 = 2.81. 
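
The worked example above can be checked with a few lines of Python. This is a minimal sketch rather than a general-purpose implementation: the function name `k_fold_mae` is hypothetical, the data is not shuffled, and sample j is simply assigned to fold j mod k (an assumption that reproduces the one-sample-per-fold layout of the example when k = 5).

```python
from statistics import mean

def k_fold_mae(data, k):
    """Estimate the mean absolute error of the average predictor with k-fold CV."""
    # Assign sample j to fold j % k, so fold sizes differ by at most one.
    folds = [data[i::k] for i in range(k)]
    fold_errors = []
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        prediction = mean(training)  # average predictor: the training mean
        fold_errors.append(mean([abs(x - prediction) for x in validation]))
    # Average the validation error over all k runs.
    return mean(fold_errors)

data = [3.2, 1.7, 7.2, 4.0, 8.1]
print(round(k_fold_mae(data, k=5), 2))  # 2.81
```

To compare several predictors, the same loop would be run once per predictor and the one with the lowest average error kept.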