<a href="https://colab.research.google.com/github/FabriceBeaumont/4216_Biomedical_DS_and_AI/blob/main/Sheet7/Assignment7_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import math
import pandas as pd
import random as rand

In [None]:
def get_dataset_from_github(filename, index_col_str=None, header_str='infer'):
    """Load a CSV file from the course dataset folder on GitHub into a DataFrame."""
    data_file_path = "https://raw.githubusercontent.com/D34dP0oL/4216_Biomedical_DS_and_AI/main/Datasets/"
    # index_col=None and header='infer' are also the pandas defaults,
    # so a single call covers every combination of arguments.
    return pd.read_csv(data_file_path + filename, index_col=index_col_str, header=header_str)

## Biomedical Data Science & AI

## Assignment 7

#### Group members:  Fabrice Beaumont, Fatemeh Salehi, Genivika Mann, Helia Salimi, Jonah

---
### Exercise 1 - Elastic Net & Nested Cross-Validation


#### 1.1. Using the `titanic_survival_data.csv` dataset, train a logistic regression model with elastic net penalization to demonstrate the pros and cons of the different data splitting methods and give a short description on what you observe.

##### 1.1.a) Report the accuracy of data splitting with a test size of $0.2$ and random state as $1$.

##### 1.1.b) Plot the boxplot for the accuracy of the **$K$-fold cross validation** with $5$ splits.

##### 1.1.c) Plot the boxplot for the accuracy of the **Stratified-$K$-fold cross validation** with $5$ splits.

##### 1.1.d) Inform yourself about **leave-one-out cross-validation** (**LOOCV**). Implement LOOCV and mention the pros and cons of the method.
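
As a rough illustration of how the three cross-validation schemes in 1.1.b) to 1.1.d) partition the samples, the sketch below builds the test folds by hand on a toy label vector. The function names and the toy labels are assumptions made for this example; in practice one would more likely rely on the splitters that scikit-learn provides. Note that LOOCV needs one model fit per sample, which is its main cost, while in return each model is trained on all but one sample.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=1):
    """Shuffle the indices and deal them into n_splits test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::n_splits]) for i in range(n_splits)]

def stratified_k_fold_indices(labels, n_splits, seed=1):
    """Deal the indices of each class separately so every fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]

def leave_one_out_indices(n_samples):
    """LOOCV is K-fold with K = n_samples: each sample is its own test fold."""
    return [[i] for i in range(n_samples)]

# Toy labels (assumption): 15 negatives and 5 positives, imbalanced on purpose.
labels = [0] * 15 + [1] * 5

plain_folds = k_fold_indices(len(labels), n_splits=5)
stratified_folds = stratified_k_fold_indices(labels, n_splits=5)
loo_folds = leave_one_out_indices(len(labels))

positives_per_plain_fold = [sum(labels[i] for i in fold) for fold in plain_folds]
positives_per_stratified_fold = [sum(labels[i] for i in fold) for fold in stratified_folds]

print(positives_per_plain_fold)       # depends on the shuffle
print(positives_per_stratified_fold)  # exactly one positive in every fold
print(len(loo_folds))                 # one fold per sample
```

With plain K-fold, a fold may end up with no positive sample at all, which is why stratification tends to give less variable accuracy estimates on imbalanced data.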

#### 1.2. Use the nested cross validation to train a logistic regression with elastic net penalization (`leukemia_small.csv`).

##### 1.2.a) Split the data into training and test samples using an appropriate cross validation method, and in the inner loop carry out **hyperparameter optimization**.

##### 1.2.b) Compute the area under the ROC curve (**AUC-ROC**) and the area under the precision-recall curve (**AUC-PR**).
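
To make the two metrics concrete, here is a small hand-rolled sketch on made-up labels and scores. It computes AUC-ROC as the fraction of positive/negative pairs that are ranked correctly, and uses average precision as a step-wise estimate of AUC-PR. The helper names and the toy data are assumptions for illustration, and the average-precision helper assumes there are no tied scores.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive is retrieved."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits

# Toy predictions (assumption): three positives, three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_roc = roc_auc(labels, scores)
auc_pr = average_precision(labels, scores)
print(f"AUC-ROC: {auc_roc:.3f}, AUC-PR (average precision): {auc_pr:.3f}")
```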

##### 1.2.c) Plot separate boxplots for the two performance metrics.

#### 1.3. In your own words, explain how each of the following metrics can be used to assess the performance of a model and then calculate each metric using the following confusion matrix.

|            | Predicted No | Predicted Yes |
|------------|--------------|---------------|
| Actual No  | $250$        | $20$          |
| Actual Yes | $30$         | $100$         |

##### 1.3.a) Recall

Recall measures what fraction of the actual positives the model predicts as positive; in other words, it is the true positive rate.

Calculation:

$Recall = \frac{TP}{TP+FN} = \frac{100}{100+30} \approx 77\%$

##### 1.3.b) $F_1$

The F1-Score is the harmonic mean of precision and recall and so measures the balance between the two. While recall is lowered by false negatives, precision gives us an indication of the number of false positives. The F1-Score is high only if both recall and precision are high. It is especially useful as a performance measure if we have an uneven class distribution.

Calculation:

$Precision = \frac{TP}{TP+FP} = \frac{100}{100+20} \approx 83\%$

$F1 = 2\cdot \frac{Precision \cdot Recall}{Precision + Recall} = 2\cdot \frac{0.833\cdot 0.769}{0.833 + 0.769} \approx 0.8$

##### 1.3.c) Balanced Accuracy (BAC)

Balanced accuracy is the arithmetic mean of recall (also called sensitivity or true positive rate in this scope) and specificity. The specificity is the true negative rate. Like the F1-Score, balanced accuracy is especially useful for measuring the performance of a model when the classes are imbalanced, because each class contributes equally to the score regardless of its size.

Calculation:

$Specificity = \frac{TN}{TN+FP} = \frac{250}{250+20} \approx 93\%$

$BAC = \frac{TPR + TNR}{2} = \frac{0.769 + 0.926}{2} \approx 0.85$

##### 1.3.d) Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient measures the correlation between the real values and the predicted values and ranges from $-1$ to $1$. It takes true positives, false positives, true negatives and false negatives into account and returns a high score only if the model has good results in all four categories.

Calculation:

$MCC = \frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} = \frac{100\cdot 250 - 20\cdot 30}{\sqrt{(100+20)(100+30)(250+20)(250+30)}} \approx 0.71$
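
The calculations in 1.3.a) to 1.3.d) can be double-checked with a few lines of Python. This is only a minimal sketch that plugs the four counts from the confusion matrix above into the formulas; the variable names are chosen for this example.

```python
import math

# Counts taken from the confusion matrix of exercise 1.3.
tn, fp, fn, tp = 250, 20, 30, 100

recall = tp / (tp + fn)              # true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)         # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Recall", recall),
    ("Precision", precision),
    ("Specificity", specificity),
    ("F1", f1),
    ("BAC", balanced_accuracy),
    ("MCC", mcc),
]:
    print(f"{name}: {value:.2f}")
```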

---
### Exercise 2 - SVM

#### 2.1. Inform yourself about **SVM** and briefly explain the working strategy of linear SVM and why maximizing the margin is a good strategy.

SVM is a *supervised machine learning algorithm* which can be used for 
- classification or 
- regression problems. 

It can use a technique called the *kernel trick* to implicitly transform the data and then finds an optimal boundary between the possible outputs in the transformed space. Simply put, it applies a (possibly very complex) transformation to the data and then figures out how to separate the data based on the labels or outputs you have defined.

A simple linear SVM classifier works by drawing a straight line (in general, a hyperplane) between two classes. All data points on one side of the line are assigned to one category and the data points on the other side to the other category. If the classes are linearly separable, there are infinitely many such lines to choose from.

What sets the linear SVM apart from some other algorithms, like $k$-nearest neighbors, is that it explicitly chooses the best line to classify your data points: the line that separates the data and is as far away as possible from the closest data points (the support vectors). This distance is called the margin.

A large margin effectively corresponds to a regularization of SVM weights which prevents overfitting. Hence, we prefer a large margin (or the right margin chosen by cross-validation) because it helps us generalize our predictions and perform better on the test data by not overfitting the model to the training data.

The intuition is that the decision boundary that maximises the margin is the most useful one, as it creates the most separation between boundary cases, so that small variations are less likely to affect the classification.
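
The idea of preferring the separating line with the largest margin can be sketched on a toy example. The points, the labels and the two candidate lines below are made up for illustration, and the code only compares two hand-picked lines; it does not solve the SVM optimisation problem itself.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points.

    The value is positive only if every point lies on the correct side.
    """
    norm = math.hypot(*w)
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )

# Toy 2-D data (assumption): two points per class.
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating lines of the form w.x + b = 0.
candidates = {
    "x1 = 3.5": ((1.0, 0.0), -3.5),
    "x1 + x2 = 5": ((1.0, 1.0), -5.0),
}

margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print("Larger margin:", best)
```

Both lines classify all four points correctly, but the second one keeps every point further away from the boundary, so a small perturbation of a point is less likely to flip its label.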

#### 2.2. Inform yourself about the non-linearity problem for classifiers. Briefly explain how SVM uses **kernel trick** to overcome this issue.

If the data are not linearly separable, a linear classifier cannot perfectly distinguish the two classes. Nonlinear functions can be used to separate instances that are not linearly separable.

In machine learning, the **kernel trick** makes it possible to use a linear classifier on a dataset that is not linearly separable. The idea is to map the data into a higher-dimensional space in which they become linearly separable. The kernel function returns the inner product of two data instances in that higher-dimensional space without computing the mapping explicitly, and since the SVM only needs these inner products, it can learn a linear boundary there at low cost.
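
A small numerical check may help here. The sketch below compares a degree-2 polynomial kernel with the inner product of an explicit feature map for two made-up 2-D points. The choice of kernel, the feature map `phi` and the points are assumptions for illustration only.

```python
import math

def phi(v):
    """Explicit degree-2 feature map for a 2-D point (maps into 3-D)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2, evaluated in the original 2-D space."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)

explicit = sum(a * b for a, b in zip(phi(u), phi(v)))  # inner product in 3-D
implicit = poly_kernel(u, v)                           # same value, no mapping needed

print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is the point of the trick: the classifier behaves as if it worked in the higher-dimensional space while only ever evaluating the kernel on the original features.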

To get a better understanding, let’s consider the circles dataset:

In [None]:
# TODO: 1.png

The dataset is clearly a non-linear dataset and consists of two features (say, $X$ and $Y$).

In order to use a linear SVM for classifying this data, introduce another feature $Z = X^2 + Y^2$ into the dataset. This projects the 2-dimensional data into 3-dimensional space: the first dimension represents the feature $X$, the second $Y$ and the third $Z$ (which, mathematically, is the squared radius of the circle around the origin on which the point $(x, y)$ lies). For the data shown above, the *yellow* data points belong to a circle of smaller radius and the *purple* data points belong to a circle of larger radius, so their $Z$ values fall into two disjoint ranges. Thus, the data becomes linearly separable along the $Z$-axis.
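
The effect of the additional feature can be reproduced with synthetic data. The sketch below generates two noisy rings and checks that a single threshold on $Z$ separates them. The radii, the noise level and the helper `ring` are assumptions and do not correspond to the dataset in the figure.

```python
import math
import random

rng = random.Random(1)

def ring(radius, n, noise=0.05):
    """Sample n points close to a circle of the given radius around the origin."""
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.uniform(-noise, noise)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

inner = ring(0.5, 100)  # plays the role of the yellow class
outer = ring(1.0, 100)  # plays the role of the purple class

# New feature Z = X^2 + Y^2 for every point.
z_inner = [x * x + y * y for x, y in inner]
z_outer = [x * x + y * y for x, y in outer]

# Any threshold between the largest inner Z and the smallest outer Z
# is a linear decision boundary along the Z-axis.
threshold = (max(z_inner) + min(z_outer)) / 2
print(max(z_inner), threshold, min(z_outer))
```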

In [None]:
# TODO: 2.png

---
### Exercise 3 - Random Forest

For the following questions, use `random_seed = 1` for better reproducibility of your
answers.

#### 3.1. Load the breast cancer dataset from sklearn to your Jupyter notebook. Use label encoding to convert your target variable “class” into numerical form. Split the dataset using a $5$-fold cross validation.

#### 3.2. Set up a parameter grid and use grid search with $5$-fold cross validation to identify the best hyperparameter values used to fit a random forest classifier.

#### 3.3. Use the best hyperparameters from *3.2* to fit the final model. Predict the classes of the test set and count the number of samples assigned to each class.

#### 3.4. Print the importance of each feature in descending order. Identify the top five features.

#### 3.5. Mention a case when permutation feature importance is favored over impurity-based feature importance. Use permutation importance to print the importances of your features in descending order. Compare your answer with that of *3.4*. Do you notice any differences?
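
Permutation importance measures how much a performance score drops when the values of one feature are shuffled. It is often preferred when the features differ strongly in cardinality, because impurity-based importance is computed on the training data and tends to favor features with many distinct values. The sketch below shows the mechanism on toy data with a stand-in model. The data, the `predict` function and the helper names are assumptions for this example. For a fitted random forest, scikit-learn offers `sklearn.inspection.permutation_importance` for this purpose.

```python
import random

rng = random.Random(1)

# Toy data (assumption): the label depends only on feature 0, feature 1 is noise.
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def predict(row):
    """Stand-in for a fitted model (hypothetical): it only looks at feature 0."""
    return int(row[0] > 0.5)

def accuracy(rows, targets):
    return sum(predict(r) == t for r, t in zip(rows, targets)) / len(targets)

def permutation_importance(rows, targets, feature, n_repeats=10):
    """Mean drop in accuracy after shuffling one feature column."""
    baseline = accuracy(rows, targets)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        shuffled = [
            r[:feature] + [value] + r[feature + 1:]
            for r, value in zip(rows, column)
        ]
        drops.append(baseline - accuracy(shuffled, targets))
    return sum(drops) / n_repeats

importances = {f: permutation_importance(X, y, f) for f in range(2)}

# Print the features in descending order of importance.
for feature, value in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"feature {feature}: {value:.3f}")
```

Shuffling the noise feature leaves the accuracy unchanged, whereas shuffling the informative feature destroys most of the predictive power.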

#### 3.6. In your own words, explain the **bootstrapping technique** and mention how random forest benefits from its application.
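
Bootstrapping means drawing a sample of the same size as the original dataset with replacement, so some observations appear several times and others not at all. A random forest trains each tree on a different bootstrap sample, which makes the trees less correlated and reduces the variance of the averaged prediction. The observations a tree never sees (out-of-bag) can be used to estimate its error. The following minimal sketch draws one bootstrap sample. The sample size and variable names are assumptions for this example.

```python
import random

rng = random.Random(1)  # random_seed = 1, as suggested for this exercise

n_samples = 1000
indices = list(range(n_samples))

# One bootstrap sample: draw n_samples indices with replacement.
bootstrap = rng.choices(indices, k=n_samples)

in_bag = set(bootstrap)
out_of_bag = [i for i in indices if i not in in_bag]

unique_fraction = len(in_bag) / n_samples
print(f"Unique samples in the bootstrap: {unique_fraction:.3f}")
print(f"Out-of-bag samples: {len(out_of_bag)}")
```

For large datasets, the expected fraction of distinct observations in a bootstrap sample approaches $1 - 1/e \approx 0.632$, so roughly a third of the data is out-of-bag for each tree.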