
In [None]:
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt

import tensorflow as tf

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict, train_test_split
from sklearn.calibration import calibration_curve

In [None]:
def get_dataset_from_github(filename, index_col_str=None, header_str='infer'):
    # Raw file URLs on raw.githubusercontent.com do not contain a "tree" path segment
    data_file_path = "https://raw.githubusercontent.com/FabriceBeaumont/4216_Biomedical_DS_and_AI/main/Datasets"
    # None and 'infer' are the pandas defaults, so both arguments can be passed through directly
    return pd.read_csv(data_file_path + filename, index_col=index_col_str, header=header_str)

In [None]:
# titanic_survival_ds = get_dataset_from_github("/titanic_survival_data.csv")
# If this does not work, upload the file (temporarily) from your local machine
# to the Colab file system (panel on the left side). Then execute as usual:
titanic_survival_ds = pd.read_csv("titanic_survival_data.csv")

titanic_survival_ds.head(4)

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,no_cabin,Label
0,1,3,0,22.0,1,0,7.25,0,2,0
1,2,1,1,38.0,1,0,71.2833,1,1,1
2,3,3,1,26.0,0,0,7.925,0,2,1
3,4,1,1,35.0,1,0,53.1,0,1,1


## Biomedical Data Science & AI

## Assignment 8

#### Group members:  Fabrice Beaumont, Fatemeh Salehi, Genivika Mann, Helia Salimi, Jonah

---
### Exercise 1 - Ensemble Learning


#### 1.1. Inform yourself about **gradient boosting**, then answer the following questions in your own words:

In-depth resource for Gradient Boosting: https://explained.ai/gradient-boosting/index.html

Gradient boosting is a machine learning technique that combines gradient descent with boosting. It fits an additive model by adding **weak learners** (typically shallow decision trees) one at a time, such that each newly added weak learner compensates for the shortcomings of the existing ones. These shortcomings are identified by the gradient of the loss function. Any differentiable, user-specified loss function can be optimised this way: the objective is to minimise the loss by adding weak learners in the direction of gradient descent.

a. What do the individual **weak learners** model? How does this relate to the
gradient of the loss function?


- The weak learners are trained with the objective of minimising the loss function, hence they are trained on the residuals of the model. Each new weak learner is fitted to the **residual error** (in general the **pseudo-residual**) left by the existing sequence of learners.

- The gradient boosting algorithm performs **gradient descent minimisation on some loss function** between the true and the predicted values. Each step moves the predicted values closer to the true values by reducing the residual. The residual is a vector that provides not only the magnitude of the difference between the true and the predicted values but also the direction of a better approximation (w.r.t. minimisation of the loss function). For the squared error, the residual equals the negative gradient of the loss with respect to the current prediction (up to a constant factor), so chasing the residual means chasing the negative gradient. Thus we perform gradient descent on the loss function.

- A gradient boosted model that trains its weak learners on the residual vectors optimises the mean squared error (MSE; $L_2$ loss).

- A model that trains its weak learners on the sign vector (only the direction of the residual, without its magnitude) optimises the mean absolute error (MAE; $L_1$ loss).
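
The relation between residuals and gradients described in these points can be sketched in a few lines. This is a minimal toy implementation and not the scikit-learn algorithm: the function name `fit_gradient_boosting`, the synthetic sine data and all parameter values are assumptions made for illustration. For the $L_1$ case it only fits the trees to the sign vector and omits the leaf-value adjustment that full implementations typically perform.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_stages=50, learning_rate=0.1, loss="l2"):
    """Toy gradient boosting for regression (illustrative sketch only)."""
    # Constant start model: the mean minimises L2, the median minimises L1
    init = np.mean(y) if loss == "l2" else np.median(y)
    prediction = np.full(len(y), init, dtype=float)
    trees = []
    for _ in range(n_stages):
        residual = y - prediction
        # Negative gradient of the loss w.r.t. the current prediction:
        # the residual for L2, the sign of the residual for L1
        pseudo_residual = residual if loss == "l2" else np.sign(residual)
        tree = DecisionTreeRegressor(max_depth=2).fit(X, pseudo_residual)
        # Shrunken step in the direction predicted by the new weak learner
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees, prediction


# Hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

_, trees_l2, pred_l2 = fit_gradient_boosting(X_toy, y_toy, loss="l2")
mse_start = np.mean((y_toy - y_toy.mean()) ** 2)  # error of the constant model
mse_end = np.mean((y_toy - pred_l2) ** 2)         # error after 50 boosting stages
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

Calling the function with `loss="l1"` instead trains every tree on the sign vector only.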

b. What is the difference between **gradient boosting** and **random forest**?

- Gradient boosting (GB) is a forward stage-wise additive model that builds and adds one tree at a time with the objective of minimising the loss function (computed from the existing sequence of trees). Random forest (RF), on the other hand, builds all trees independently, each on a random sample of the data (to prevent overfitting).


- GB focuses step by step on the examples that are still predicted poorly, which can help on datasets with class imbalance. RF has no such mechanism. Additionally, any differentiable user-specified loss function can be optimised by a gradient boosting algorithm.


- RF combines the results of all the trees at the end after the construction of all trees. GB on the other hand, takes the predictions of the sequence of trees into consideration at each stage of the algorithm.


- If the hyperparameters are tuned carefully, GB can perform better than RF. However, GB is harder to tune, since it has many more hyperparameters that need to be tuned.


- GB is more sensitive to overfitting if the data is noisy. RF is more robust and should be considered in this case.


- Training GB generally takes longer than RF, since the trees are constructed sequentially.
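
The difference in how the two ensembles combine their trees can be checked directly in scikit-learn. The sketch below uses a synthetic dataset from `make_classification` (an assumption, not the Titanic data) and illustrative variable names. It verifies that the RF probability is the plain average over its independently grown trees, while GB exposes one prediction per boosting stage, the last of which is the final model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Hypothetical toy data for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# RF: trees are built independently and their probabilities are averaged at the end
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
tree_average = np.mean([tree.predict_proba(X_demo) for tree in rf.estimators_], axis=0)
rf_matches_average = np.allclose(tree_average, rf.predict_proba(X_demo))

# GB: every stage builds on the predictions of all previous stages
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
staged_probs = list(gb.staged_predict_proba(X_demo))
gb_matches_last_stage = np.allclose(staged_probs[-1], gb.predict_proba(X_demo))

print(rf_matches_average, gb_matches_last_stage, len(staged_probs))
```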

#### 1.2. Which modifications make gradient boosting **robust against overfitting**?

Gradient boosting is not inherently robust against overfitting the training data, as it is a greedy algorithm. This problem can be mitigated by regularization methods which penalize different aspects of the algorithm. The following methods can be used:
- **Tree constraints:** The idea is that if the trees are more constrained, more trees need to be constructed. Constraints can be imposed on
    - the number of trees (~"keep on adding trees until no improvement is observed"),
    - tree depth (~"shorter trees are preferred as deeper trees are considered more complex"),
    - the number of nodes/leaves per tree and
    - the number of observations per split and the minimum improvement to the loss.

- **Weighted Updates:** The prediction of each tree is weighed by a learning rate or shrinkage to slow down the learning by the algorithm. Shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.


- **Stochastic Gradient Boosting:** The method aims at reducing the correlation between the trees in the sequence. This is achieved by using only a random subsample of the training data to fit each base learner.


- **Penalized Gradient Boosting:** Regression trees (a variant of decision trees which contain only numeric values at leaf nodes) can be used in GB. The leaf values act as weights and can be regularised using $L_1$ or $L_2$ regularization to prevent overfitting.
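
Most of these options map onto constructor arguments of scikit-learn's `GradientBoostingClassifier`. The following sketch shows one possible configuration. The synthetic data and all parameter values are assumptions chosen for illustration, not tuned recommendations. Penalties on the leaf values are not exposed by this estimator; `HistGradientBoostingClassifier` offers an `l2_regularization` argument for that purpose.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical toy data for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

regularised_gb = GradientBoostingClassifier(
    n_estimators=500,           # upper bound on the number of trees
    max_depth=2,                # tree constraint: shallow trees
    min_samples_leaf=5,         # tree constraint: observations per leaf
    min_impurity_decrease=0.0,  # tree constraint: minimum improvement per split
    learning_rate=0.05,         # weighted updates (shrinkage)
    subsample=0.8,              # stochastic gradient boosting
    n_iter_no_change=10,        # stop adding trees once validation loss stalls
    validation_fraction=0.2,    # held-out share used for early stopping
    random_state=0,
)
regularised_gb.fit(X_demo, y_demo)

# Number of boosting stages actually kept after early stopping
n_stages_used = regularised_gb.n_estimators_
print(n_stages_used)
```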

#### 1.3. Using the `titanic_survival_dataset.csv`, train the following models using nested cross validation while optimizing a selected number of hyperparameters in the inner loop using grid search, then compute the probabilities of your targets:

In [None]:
# Load the dataset
titanic_data = pd.read_csv('titanic_survival_data.csv', index_col="PassengerId")
titanic_data.head()

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,no_cabin,Label
1,3,0,22.0,1,0,7.25,0,2,0
2,1,1,38.0,1,0,71.2833,1,1,1
3,3,1,26.0,0,0,7.925,0,2,1
4,1,1,35.0,1,0,53.1,0,1,1
5,3,0,35.0,0,0,8.05,0,2,0


In [None]:
# Separate features and target (which is stored in column 'Label')
y = titanic_data['Label'].to_numpy()
X = titanic_data.drop(columns = ['Label'])

#### 1.3.a) Random forest, optimizing the number of estimators

In [None]:
# Initialize the RF classifier and a parameter grid for the grid search
Random_Forest = RandomForestClassifier()
p_grid_random_forest = {'n_estimators': [100, 150, 200, 300, 400]}

In [None]:
# Inner Fold - to obtain the best hyperparameters
Random_Forest_Fit = GridSearchCV(
    estimator = Random_Forest,
    param_grid = p_grid_random_forest,
    cv = KFold(shuffle = True),
    verbose = 1
)

Random_Forest_Fit.fit(X, y)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:   10.1s finished


GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_sc

In [None]:
# Outer Fold - to perform cross validation based on metrics and compute the probabilities of the target
random_forest_prediction_prob = cross_val_predict(
    estimator = Random_Forest_Fit,
    X = X,
    y = y,
    cv = KFold(shuffle = True),
    method = 'predict_proba', # To obtain prediction probabilities in result
    verbose = 1
)
random_forest_prediction_prob

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    9.6s finished


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    9.5s finished


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    9.4s finished


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    9.5s finished


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    9.5s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   50.1s finished


array([[0.706  , 0.294  ],
       [0.01   , 0.99   ],
       [0.615  , 0.385  ],
       ...,
       [0.605  , 0.395  ],
       [0.155  , 0.845  ],
       [0.98375, 0.01625]])

#### 1.3.b) Gradient boosting, optimizing boosting steps

In [None]:
# Initialize the GB classifier and a parameter grid for the grid search
GB = GradientBoostingClassifier()
p_grid_gb = {'n_estimators': [10, 50, 100, 200, 300]}

# Inner Fold - to obtain the best hyperparameters
GB_Best_Clf = GridSearchCV(
    estimator = GB,
    param_grid = p_grid_gb,
    cv = KFold(shuffle = True),
    verbose = 1
)

GB_Best_Clf.fit(X, y)

# Outer Fold - to perform cross validation based on metrics and compute the probabilities of the target
gb_prediction_prob = cross_val_predict(
    estimator = GB_Best_Clf,
    X = X,
    y = y,
    cv = KFold(shuffle = True),
    method = 'predict_proba', # To obtain prediction probabilities in result
    verbose = 1
)
gb_prediction_prob

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    3.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    3.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    3.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    3.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    3.1s finished


Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    3.1s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   16.0s finished


array([[0.78829091, 0.21170909],
       [0.05449812, 0.94550188],
       [0.49750538, 0.50249462],
       ...,
       [0.54399186, 0.45600814],
       [0.10972886, 0.89027114],
       [0.9760037 , 0.0239963 ]])

#### 1.3.c) Lasso penalized logistic regression, optimizing $L_1$ regularization strength
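
The cell that computes the logistic regression probabilities is not part of this section. A minimal sketch of how this step could look, following the same nested cross validation pattern as in 1.3.a and 1.3.b, is given here. The helper name `nested_cv_lasso_proba`, the grid of `C` values, the scaling step and the synthetic demo data are all assumptions and not the original solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_lasso_proba(X, y, c_values=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Nested CV for L1-penalised logistic regression (hypothetical helper)."""
    # Scaling inside the pipeline keeps the penalty comparable across features
    lasso_lr = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear"),
    )
    # Inner fold: grid search over C, the inverse of the regularization strength
    inner_search = GridSearchCV(
        estimator=lasso_lr,
        param_grid={"logisticregression__C": list(c_values)},
        cv=KFold(shuffle=True),
    )
    # Outer fold: out-of-fold class probabilities for every sample
    return cross_val_predict(
        estimator=inner_search,
        X=X,
        y=y,
        cv=KFold(shuffle=True),
        method="predict_proba",
    )


# Hypothetical toy data for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
lr_demo_prob = nested_cv_lasso_proba(X_demo, y_demo)
print(lr_demo_prob[:3])
```

With `X` and `y` as defined earlier in this notebook, `lr_prediction_prob = nested_cv_lasso_proba(X, y)` would yield an array in the same two-column layout as the RF and GB results.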

In [None]:
# Converting computed prediction probabilities to dataframes
lr_prob_result = pd.DataFrame(lr_prediction_prob, columns = ['0', '1'])
gb_prob_result = pd.DataFrame(gb_prediction_prob, columns = ['0', '1'])
rf_prob_result = pd.DataFrame(random_forest_prediction_prob, columns = ['0', '1'])

(Using a large parameter grid results in an extended computation time. We advise using a maximum of *five* values per hyperparameter)

#### 3.3. How does the neural network perform in comparison to the models in the calibration curve from the previous task? Plot the results alongside the other models in the calibration plot.