<a href="https://colab.research.google.com/github/FabriceBeaumont/4216_Biomedical_DS_and_AI/blob/main/Sheet7/Assignment7_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [48]:
import numpy as np
import math
import pandas as pd
import random as rand
from sklearn import preprocessing
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [None]:
def get_dataset_from_github(filename, index_col_str=None, header_str='infer'):
    """Load a CSV file from the course's GitHub dataset folder into a DataFrame."""
    data_file_path = "https://raw.githubusercontent.com/D34dP0oL/4216_Biomedical_DS_and_AI/main/Datasets/"
    # The defaults (index_col=None, header='infer') are the same as those of pd.read_csv,
    # so both arguments can be passed through without any case distinction.
    return pd.read_csv(data_file_path + filename, index_col=index_col_str, header=header_str)

## Biomedical Data Science & AI

## Assignment 7

#### Group members:  Fabrice Beaumont, Fatemeh Salehi, Genivika Mann, Helia Salimi, Jonah

---
### Exercise 1 - Elastic Net & Nested Cross-Validation


#### 1.1. Using the `titanic_survival_data.csv` dataset, train a logistic regression model with elastic net penalization to demonstrate the pros and cons of the different data splitting methods and give a short description on what you observe.

##### 1.1.a) Report the accuracy of data splitting with a test size of $0.2$ and random state as $1$.

##### 1.1.b) Plot the boxplot for the accuracy of the **$K$-fold cross validation** with $5$ splits.

##### 1.1.c) Plot the boxplot for the accuracy of the **Stratified-$K$-fold cross validation** with $5$ splits.

##### 1.1.d) Inform yourself about **leave-one-out cross-validation** (**LOOCV**). Implement LOOCV and mention the pros and cons of the method.
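
As a rough illustration of LOOCV, the sketch below assumes a small synthetic dataset from `make_classification` and a plain `LogisticRegression` in place of the Titanic data and the elastic net model, so it shows only the splitting scheme. Each sample serves as the test set exactly once, which makes the estimate nearly unbiased but requires one model fit per sample.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Assumption: synthetic stand-in data, not titanic_survival_data.csv.
X, y = make_classification(n_samples=60, n_features=8, n_informative=4, random_state=1)

# Assumption: unpenalized-by-choice default model instead of the elastic net model.
model = LogisticRegression(max_iter=1000)

# One fold per sample: every score is the accuracy on a single held-out point (0 or 1).
scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
loocv_accuracy = scores.mean()

print("Number of fits:", len(scores))
print("LOOCV accuracy: %.3f" % loocv_accuracy)
```

Because every fold contains a single test sample, a boxplot of the per-fold scores is not informative here; only the mean is meaningful.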

#### 1.2. Use the nested cross validation to train a logistic regression with elastic net penalization (`leukemia_small.csv`).

##### 1.2.a) Split the data into training and test samples using an appropriate cross validation method, and in the inner loop carry out **hyperparameter optimization**.

##### 1.2.b) Compute the area under the ROC curve (**AUC-ROC**) and the area under the precision-recall curve (**AUC-PR**).

##### 1.2.c) Plot separate boxplots for the two performance metrics.
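
A minimal sketch of nested cross-validation follows. It assumes synthetic data from `make_classification` as a stand-in for `leukemia_small.csv`, a deliberately tiny hyperparameter grid, and `average_precision_score` as the estimate of AUC-PR; none of these choices come from the assignment itself. The inner loop (`GridSearchCV`) tunes the hyperparameters on the outer training fold only, and the outer test fold is used once for evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, not leukemia_small.csv.
X, y = make_classification(n_samples=120, n_features=20, n_informative=5, random_state=1)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Scaling lives inside the pipeline so it is re-fitted on every training fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
)
# Assumption: illustrative grid values only.
param_grid = {
    "logisticregression__C": [0.1, 1.0],
    "logisticregression__l1_ratio": [0.2, 0.8],
}

roc_aucs, pr_aucs = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: hyperparameter optimization on the outer training fold.
    search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=inner_cv)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: evaluate the refitted best model on the untouched test fold.
    proba = search.predict_proba(X[test_idx])[:, 1]
    roc_aucs.append(roc_auc_score(y[test_idx], proba))
    pr_aucs.append(average_precision_score(y[test_idx], proba))

print("AUC-ROC per outer fold:", roc_aucs)
print("AUC-PR per outer fold: ", pr_aucs)
```

The two lists hold one value per outer fold and can be passed to a boxplot function, one plot per metric.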

#### 1.3. In your own words, explain how each of the following metrics can be used to assess the performance of a model and then calculate each metric using the following confusion matrix.

|            | Predicted No | Predicted Yes |
|---|---|---|
| Actual No  | $250$        | $20$          |
| Actual Yes | $30$         | $100$         |

##### 1.3.a) Recall

With recall we can measure what percentage of the total positives are predicted to be positive, so in other words, it gives us a measure of the true positive rate.

Calculation:

$Recall = \frac{TP}{TP+FN} = \frac{100}{100+30} \approx 77\%$

##### 1.3.b) $F_1$

The $F_1$ score is the harmonic mean of precision and recall. Recall drops as the number of false negatives grows, while precision drops as the number of false positives grows, so the $F_1$ score is high only if both recall and precision are high. It is especially useful as a performance measure if we have an uneven class distribution.

Calculation:

$Precision = \frac{TP}{TP+FP} = \frac{100}{100+20} \approx 83\%$

$F1 = 2\cdot \frac{Precision \cdot Recall}{Precision + Recall} = 2\cdot \frac{0.833\cdot 0.769}{0.833 + 0.769} \approx 0.8$

##### 1.3.c) Balanced Accuracy (BAC)

Balanced accuracy is the arithmetic mean of recall (also called sensitivity or true positive rate) and specificity (the true negative rate). Like the $F_1$ score, it is especially useful for measuring the performance of a model when the classes are imbalanced, because both classes contribute equally to the score regardless of their size.

Calculation:

$Specificity = \frac{TN}{TN+FP} = \frac{250}{250+20} \approx 93\%$

$BAC = \frac{TPR + TNR}{2} = \frac{0.769 + 0.926}{2} \approx 0.85$

##### 1.3.d) Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient measures the correlation between the actual and the predicted labels. It ranges from $-1$ (total disagreement) through $0$ (no better than random guessing) to $+1$ (perfect prediction). It takes true positives, false positives, true negatives and false negatives into account and returns a high score only if the model performs well on all four.

Calculation:

$MCC = \frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} = \frac{100\cdot 250 - 20\cdot 30}{\sqrt{(100+20)(100+30)(250+20)(250+30)}} \approx 0.71$
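
The hand calculations in 1.3 can be double-checked with a few lines of plain Python. This is only a sketch that re-applies the formulas above to the counts in the confusion matrix, treating "Yes" as the positive class; the variable names are chosen for illustration.

```python
import math

# Counts read off the confusion matrix ("Yes" is the positive class).
TN, FP = 250, 20   # Actual No:  predicted No / predicted Yes
FN, TP = 30, 100   # Actual Yes: predicted No / predicted Yes

recall = TP / (TP + FN)                      # true positive rate
precision = TP / (TP + FP)
specificity = TN / (TN + FP)                 # true negative rate
f1 = 2 * precision * recall / (precision + recall)
bac = (recall + specificity) / 2
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(f"Recall:      {recall:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1:          {f1:.3f}")
print(f"BAC:         {bac:.3f}")
print(f"MCC:         {mcc:.3f}")
```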

---
### Exercise 2 - SVM

#### 2.1. Inform yourself about **SVM** and briefly explain the working strategy of linear SVM and why maximizing the margin is a good strategy.

SVM is a *supervised machine learning algorithm* which can be used for 
- classification or 
- regression problems. 

It can use a technique called the *kernel trick* to implicitly transform the data, and it then finds an optimal boundary between the possible outputs in the transformed space. Simply put, it transforms the data and then works out how to separate it based on the labels or outputs you have defined.

A simple linear SVM classifier works by drawing a straight line (in higher dimensions, a hyperplane) between two classes. All data points on one side of the line are assigned to one category, and the data points on the other side are assigned to the other category. If the classes can be separated at all, there are infinitely many such lines to choose from.

What sets the linear SVM apart from some other algorithms, such as $k$-nearest neighbors, is that it explicitly chooses the best of these lines according to a clear criterion: it picks the line that separates the data and is as far away as possible from the closest data points of either class. These closest points are called support vectors, and their distance to the line determines the margin.

A large margin effectively corresponds to a regularization of SVM weights which prevents overfitting. Hence, we prefer a large margin (or the right margin chosen by cross-validation) because it helps us generalize our predictions and perform better on the test data by not overfitting the model to the training data.

The intuition is that the decision boundary that maximises the margin is the most useful one, as it creates the most separation between boundary cases, so that small variations in the data are less likely to affect the classification.
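
The margin can be made concrete with a small sketch. It assumes six hand-picked, linearly separable points and uses sklearn's `SVC` with a linear kernel and a very large `C` to approximate a hard-margin SVM; the data and the value of `C` are illustrative choices. For a separating line $w \cdot x + b = 0$, the width of the margin is $2 / \lVert w \rVert$.

```python
import numpy as np
from sklearn.svm import SVC

# Assumption: toy data made up for illustration, two well separated classes.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves (almost) no room for margin violations.
svm = SVC(kernel="linear", C=1e6)
svm.fit(X, y)

w = svm.coef_[0]          # normal vector of the separating line
b = svm.intercept_[0]     # offset of the separating line
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("Support vectors:\n", svm.support_vectors_)
print("Margin width: %.3f" % margin_width)
```

Only the support vectors determine the line: moving any of the other points slightly, without crossing the margin, leaves `w` and `b` unchanged.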

#### 2.2. Inform yourself about the non-linearity problem for classifiers. Briefly explain how SVM uses **kernel trick** to overcome this issue.

If the data are not linearly separable, a linear classifier cannot perfectly distinguish the two classes. Nonlinear functions can be used to separate instances that are not linearly separable.

In machine learning, the **kernel trick** makes it possible to learn a linear classifier for a dataset that is not linearly separable. The idea is to map the data into a higher-dimensional space in which it becomes linearly separable. A kernel function computes the inner products between data instances in that higher-dimensional space directly from the original features, so the mapping never has to be carried out explicitly.

To get a better understanding, let's consider the circles dataset:

In [None]:
# TODO: 1.png

The dataset is clearly a non-linear dataset and consists of two features (say, $X$ and $Y$).

In order to classify this data with a linear SVM, introduce another feature $Z = X^2 + Y^2$ into the dataset, which projects the 2-dimensional data into 3-dimensional space: the first dimension represents the feature $X$, the second $Y$ and the third $Z$ (which, mathematically, is the squared radius of the circle around the origin on which the point $(x, y)$ lies). For the data shown above, the *yellow* data points belong to a circle of smaller radius and the *purple* data points belong to a circle of larger radius. Thus, the data becomes linearly separable along the $Z$-axis.

In [None]:
# TODO: 2.png
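
The projection described above can be sketched with NumPy alone. The sketch assumes two noisy concentric rings generated by hand (radii $0.5$ and $1.0$), not the exact dataset from the figures, and an illustrative threshold halfway between the two radii. After adding $Z = X^2 + Y^2$, a single threshold on $Z$, i.e. a plane in 3-dimensional space, separates the classes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Assumption: hand-made rings; label 0 = inner ring, label 1 = outer ring.
labels = rng.integers(0, 2, size=n)
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
radii = np.where(labels == 0, 0.5, 1.0) + rng.normal(0.0, 0.03, size=n)

X = radii * np.cos(angles)
Y = radii * np.sin(angles)

# The additional feature: squared distance from the origin.
Z = X**2 + Y**2

# Assumption: threshold chosen halfway between the two ring radii.
threshold = 0.75**2
predicted = (Z > threshold).astype(int)
accuracy = np.mean(predicted == labels)

print("Largest Z on inner ring: %.3f" % Z[labels == 0].max())
print("Smallest Z on outer ring: %.3f" % Z[labels == 1].min())
print("Accuracy of thresholding Z: %.2f" % accuracy)
```

Note that this sketch computes the new feature explicitly. An SVM with a suitable kernel obtains the same effect without ever materializing $Z$.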

---
### Exercise 3 - Random Forest

For the following questions, use `random_seed = 1` for better reproducibility of your
answers.

#### 3.1. Load the breast cancer dataset from sklearn to your Jupyter notebook. Use label encoding to convert your target variable “class” into numerical form. Split the dataset using a $5$-fold cross validation.

In [40]:
cancer_data = pd.read_csv('https://raw.githubusercontent.com/D34dP0oL/4216_Biomedical_DS_and_AI/main/Datasets/cancer_all.csv', index_col = 'Unnamed: 0')
cancer_data.head(4)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173


In [54]:
feature_scaler = preprocessing.StandardScaler()
cancer_data = feature_scaler.fit_transform(cancer_data)

In [42]:
cancer_class_data = pd.read_csv('https://raw.githubusercontent.com/D34dP0oL/4216_Biomedical_DS_and_AI/main/Datasets/cancer_class.csv', index_col = 'Unnamed: 0')
cancer_class_data.head(4)

Unnamed: 0,class
0,malignant
1,malignant
2,malignant
3,malignant


In [43]:
cancer_class_data['class'].unique()

array(['malignant', 'benign'], dtype=object)

In [44]:
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(['malignant', 'benign'])
encoded_cancer_class_data = label_encoder.transform(cancer_class_data.values.ravel())
encoded_cancer_class_data

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,

In [45]:
# cross-validation
cv = KFold(n_splits=5, random_state=1, shuffle=True)
counter = 1
for train_index, test_index in cv.split(cancer_data):
  print("Splitted data number: %d\n" % counter)
  print("Train data indexes:", train_index,2*"\n", "Test data indexes:", test_index, 5*"\n")
  counter += 1

classifier = RandomForestClassifier(n_estimators=300, random_state=1)

Splitted data number: 1

Train data indexes: [  1   2   4   6   7   8  10  11  12  13  14  15  16  18  19  20  21  22
  23  24  25  26  27  28  30  32  33  35  36  37  39  42  43  44  45  46
  48  50  51  52  53  54  55  56  57  58  60  61  63  64  70  71  72  73
  74  75  76  77  78  79  80  81  82  83  84  86  87  88  89  91  93  94
  95  96  97  98  99 100 101 102 103 104 105 106 108 109 110 112 113 114
 115 116 117 118 121 122 123 125 126 127 128 129 130 131 133 134 136 137
 138 139 140 141 142 143 144 145 146 148 149 150 151 152 153 154 155 156
 157 158 162 163 164 166 167 168 169 170 171 173 174 175 176 177 178 181
 182 183 184 185 188 190 191 192 193 194 196 198 199 200 201 202 203 204
 205 206 208 209 210 211 212 213 215 216 217 218 219 220 222 223 224 225
 226 227 228 229 230 231 232 234 235 236 238 239 240 241 243 244 247 248
 249 250 251 252 253 254 255 256 259 260 261 262 263 264 265 266 267 268
 269 270 271 272 275 276 278 279 280 281 282 284 287 288 290 291 293 294
 296 2

In [46]:
# Fit the model and compute the accuracy on each of the 5 folds defined by `cv`
# (the labels are passed as a 1-D array, as scikit-learn expects)
scores = cross_val_score(classifier, cancer_data, cancer_class_data.values.ravel(), scoring='accuracy', cv=cv, n_jobs=-1)
print("scores:\n",scores,"\n")

print('Accuracy: %.3f and Standard Deviation: %.3f' % (np.mean(scores), np.std(scores)))

scores:
 [0.95614035 0.94736842 0.95614035 0.96491228 0.97345133] 

Accuracy: 0.960 and Standard Deviation: 0.009


#### 3.2. Set up a parameter grid and use grid search with $5$-fold cross validation to identify the best hyperparameter values used to fit a random forest classifier.

In [49]:
#from sklearn.model_selection import ParameterGrid

grid_param = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}

gd_sr = GridSearchCV(estimator=classifier,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=cv,
                     n_jobs=-1)

gd_sr.fit(cancer_data, cancer_class_data.values.ravel())
best_parameters = gd_sr.best_params_
print(best_parameters)
best_result = gd_sr.best_score_
print(best_result)

{'bootstrap': False, 'criterion': 'gini', 'n_estimators': 800}
0.9613724576929048


#### 3.3. Use the best hyperparameters from *3.2* to fit the final model. Predict the classes of the test set and count the number of samples assigned to each class.

In [52]:
clf = RandomForestClassifier(bootstrap = False, criterion = 'gini', n_estimators = 800, random_state=1)
clf.fit(cancer_data, cancer_class_data.values.ravel())

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=800,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)
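
The cell above fits the final model on all samples, so no separate test set is left to predict on. A minimal sketch of the remaining steps follows, assuming a hypothetical 80/20 hold-out split and sklearn's bundled copy of the dataset (`load_breast_cancer`) instead of the CSV files. Note that the bundled copy encodes malignant as `0` and benign as `1`, which is the reverse of the `LabelEncoder` output earlier in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Assumption: a simple stratified hold-out split stands in for "the test set".
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=1, stratify=data.target
)

# Hyperparameters as reported by the grid search above.
clf = RandomForestClassifier(
    n_estimators=800, criterion="gini", bootstrap=False, random_state=1
)
clf.fit(X_train, y_train)

# Predict the held-out samples and count the predictions per class.
y_pred = clf.predict(X_test)
classes, counts = np.unique(y_pred, return_counts=True)
for label, count in zip(classes, counts):
    print("%-10s %d" % (data.target_names[label], count))
```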

#### 3.4. Print the importance of each feature in descending order. Identify the top five features.

In [51]:
print(sorted(gd_sr.cv_results_.keys(), reverse=True))

['std_test_score', 'std_score_time', 'std_fit_time', 'split4_test_score', 'split3_test_score', 'split2_test_score', 'split1_test_score', 'split0_test_score', 'rank_test_score', 'params', 'param_n_estimators', 'param_criterion', 'param_bootstrap', 'mean_test_score', 'mean_score_time', 'mean_fit_time']
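
The keys listed above belong to the grid search results and do not contain feature importances. A fitted `RandomForestClassifier` exposes its impurity-based importances as `feature_importances_`. The sketch below assumes sklearn's bundled copy of the dataset (`load_breast_cancer`) so that the feature names are available, and a forest with $300$ trees as in 3.1; it is not the tuned model from 3.3.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Assumption: illustrative forest, not the tuned model from the grid search.
forest = RandomForestClassifier(n_estimators=300, random_state=1)
forest.fit(data.data, data.target)

# Impurity-based importances, labelled with feature names and sorted descending.
importances = pd.Series(
    forest.feature_importances_, index=data.feature_names
).sort_values(ascending=False)

print(importances)

top_five = importances.head(5)
print("\nTop five features:\n", top_five)
```

The importances are normalized to sum to $1$, so each value can be read as the share of the total impurity reduction attributed to that feature.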


#### 3.5. Mention a case when permutation feature importance is favored over impurity-based feature importance. Use permutation importance to print the importances of your features in descending order. Compare your answer with that of *3.4*. Do you notice any differences?
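
Impurity-based importances are computed from the training data and tend to favor features with many distinct values, whereas permutation importance can be evaluated on held-out data. A minimal sketch with `sklearn.inspection.permutation_importance` follows, assuming sklearn's bundled copy of the dataset, a hypothetical 80/20 hold-out split and an illustrative forest with $300$ trees.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Assumption: hold-out split so the importances are measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=1, stratify=data.target
)

forest = RandomForestClassifier(n_estimators=300, random_state=1)
forest.fit(X_train, y_train)

# Shuffle one feature at a time and record the drop in accuracy on the test set.
result = permutation_importance(
    forest, X_test, y_test, scoring="accuracy", n_repeats=5, random_state=1
)

perm_importances = pd.Series(
    result.importances_mean, index=data.feature_names
).sort_values(ascending=False)

print(perm_importances)
```

When several features are strongly correlated, permuting one of them may barely change the score because the model can fall back on the others, so low permutation importances should be interpreted with care.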

#### 3.6. In your own words, explain the **bootstrapping technique** and mention how random forest benefits from its application.
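
Bootstrapping means drawing a sample of the same size as the original dataset *with replacement*. A random forest with `bootstrap=True` trains each tree on such a sample, so the trees see different data and their errors are less correlated. The NumPy sketch below uses an arbitrary dataset size for illustration and shows that roughly $1/e \approx 36.8\%$ of the samples are left out of any single bootstrap sample ("out-of-bag").

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10_000               # assumption: arbitrary size for illustration
indices = np.arange(n_samples)

# Draw n_samples indices with replacement: some repeat, some never appear.
bootstrap_sample = rng.choice(indices, size=n_samples, replace=True)

in_bag = np.unique(bootstrap_sample)
out_of_bag = np.setdiff1d(indices, in_bag)
oob_fraction = len(out_of_bag) / n_samples

print("Distinct samples in bag: %d" % len(in_bag))
print("Out-of-bag fraction:     %.3f" % oob_fraction)
print("Theoretical limit 1/e:   %.3f" % np.exp(-1))
```

The out-of-bag samples of each tree can serve as a built-in validation set, which is what `oob_score=True` in `RandomForestClassifier` makes use of.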