# Project 2: Covid ---> III/ Models and predictions

The purpose of this file is to test and compare several models on the matrix extracted from the "II_features-selection" file (training on data from Europe only). Then we will improve the best model and analyse it. Finally, we will apply this model to other continents.

In [1]:
# Import
%matplotlib inline

import os
import os.path as op
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, average_precision_score, auc

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

%load_ext autoreload
%autoreload 2

## 1. Literature

In 2019, the first COVID-19 cases are observed in China. Rapidly, the SARS-Cov2 virus spread worldwide, pushing governments to take strict decisions about the lives of their co-citizens, like containment, to protect the population. Indeed, in some cases, COVID-19 patients ended up in intensive care services and sometimes died.

**The aim of our model is, based on easily computable parameters at the study's beginning, to predict whether the patient will be likely to die or if the chance of survival is important.** The point of this study is to help the hospital organise in the case of a high number of cases.


The studied dataset stem from the IDDO Data Repository of COVID-19 data. This data was pulled from the underlying data collection projects on 2022-09-01. The data comes from 1,200 institutions from over 45 countries and gather various information from 700,000 hospitalised individuals.

To keep only the relevant features, we first dive into the literature, using Meta-analysis papers. First, we have been looking for aggravating factors that will likely lead the patient to ICU.

Obesity: according to a meta-analysis by Sales-Peres, there is a correlation between obesity and ICU admission. This paper also concluded that co-morbidities for obese patients, such as hypertension, type 2 diabetes, smoking habit, lung disease, and/or cardiovascular disease lead to a higher chance of ICU admission.
Age: patients aged 70 years and above have a higher risk of infection and a higher need for intensive care than patients younger than 70.
Sex: men, when infected, have a higher risk of severe COVID-19 disease and a higher need for intensive care than women\cite{pijls_demographic_2021}.
Ethnicity: the risk of contamination was higher in most ethnic minority groups than their White counterparts in North America and Europe. Among people with confirmed infection, African-Americans and Hispanic Americans were also more likely than White Americans to be hospitalised with SARS-CoV-2 infection. However, the probability of ICU admission was equivalent for all groups. Thus, ethnicity is not relevant to our question. 
Blood tests: Patients with increased pancreatic enzymes, including elevated serum lipase or amylase of either type, had worse clinical outcomes. Lower levels of lymphocytes and hemoglobin; elevated levels of leukocytes, aspartate aminotransferase, alanine aminotransferase, blood creatinine, blood urea nitrogen, high-sensitivity troponin, creatine kinase, high-sensitivity C-reactive protein, interleukin 6, D-dimer, ferritin, lactate dehydrogenase, and procalcitonin; and a high erythrocyte sedimentation rate were also associated with severe COVID-19.  

Out of a total of 3009 citations, 17 articles (22 studies, 21 from China and one study from Singapore) with 3396 ranging from 12 to1099 patients were included. Our meta-analyses showed a significant decrease in lymphocyte, monocyte, and eosinophil, hemoglobin, platelet, albumin, serum sodium, lymphocyte to C-reactive protein ratio (LCR), leukocyte to C-reactive protein ratio (LeCR), leukocyte to IL-6 ratio (LeIR), and an increase in the neutrophil, alanine aminotransferase (ALT), aspartate aminotransferase (AST), total bilirubin, blood urea nitrogen (BUN), creatinine (Cr), erythrocyte Sedimentation Rate (ESR), C-reactive protein (CRP), Procalcitonin (PCT), lactate dehydrogenase (LDH), fibrinogen, prothrombin time (PT), D-dimer, glucose level, and neutrophil to lymphocyte ratio (NLR) in the severe group compared with the non-severe group. 

No significant changes in white blood cells (WBC), Creatine Kinase (CK), troponin I, myoglobin, IL-6 and K between the two groups were observed. 

## 2. Load data after data_selection and feature_selection

The file we open already has: 
- lines with NA for DSDECOD that have been removed
- the NAs that have been filled in 
- the standardisation that has been performed 
- the features we want to keep that have been selected
- the data has been stratified over the continents, and we only kept the data from Europe

**WARNING: These are only the data for Europe because we want to do the training only on Europe.**

In [2]:
# Open file
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_train_LogisticRegression_alldata.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_train = pd.concat(mylist, axis=0)
df_train.name = 'df_train'
del mylist

In [3]:
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_test_LogisticRegression_alldata.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_test = pd.concat(mylist, axis=0)
df_test.name = 'df_test'
del mylist

In [4]:
df_train.head(3)

Unnamed: 0,SEX,IETEST_Fever,"INCLAS_ANTIINFLAMMATORY_AND_ANTIRHEUMATIC_PRODUCTS,_NON-STEROIDS",INCLAS_ARTIFICIAL_RESPIRATION,INCLAS_DRUGS_FOR_ACID_RELATED_DISORDERS,INCLAS_PSYCHOLEPTICS,INCLAS_RENAL_REPLACEMENT,INCLAS_VACCINES,AGE,LBTEST_APTT,...,LBTEST_CRP,LBTEST_LDH,LBTEST_SODIUM,LBTEST_UREAN,VSTEST_DIABP,VSTEST_HR,VSTEST_OXYSAT,VSTEST_RESP,VSTEST_SYSBP,DSDECOD
0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.709161,-0.083546,...,-0.225248,-0.086105,-0.031763,-0.235018,-0.268879,-0.541073,0.000606,-0.930559,-0.286771,1.0
1,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,-2.127947,-0.083546,...,-0.225248,-0.472977,-0.031763,-0.585141,-0.332212,0.568828,0.000606,-0.403798,-0.098491,0.0
2,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,-0.34162,-0.083546,...,1.319953,2.033032,-0.68731,-0.723068,-0.268879,0.297519,0.002965,-0.667179,-0.550364,0.0


In [5]:
df_train.shape

(76391, 21)

In [6]:
df_train.columns

Index(['SEX', 'IETEST_Fever',
       'INCLAS_ANTIINFLAMMATORY_AND_ANTIRHEUMATIC_PRODUCTS,_NON-STEROIDS',
       'INCLAS_ARTIFICIAL_RESPIRATION',
       'INCLAS_DRUGS_FOR_ACID_RELATED_DISORDERS', 'INCLAS_PSYCHOLEPTICS',
       'INCLAS_RENAL_REPLACEMENT', 'INCLAS_VACCINES', 'AGE', 'LBTEST_APTT',
       'LBTEST_BILI', 'LBTEST_CRP', 'LBTEST_LDH', 'LBTEST_SODIUM',
       'LBTEST_UREAN', 'VSTEST_DIABP', 'VSTEST_HR', 'VSTEST_OXYSAT',
       'VSTEST_RESP', 'VSTEST_SYSBP', 'DSDECOD'],
      dtype='object')

In [7]:
# Import data for other continents

df_Asia_self = pd.read_csv(op.join(data_folder, 'df_Asia_SelfStd_withINCLAS.csv'), sep=',', index_col=0)
df_Asia_NonSelf = pd.read_csv(op.join(data_folder, 'df_Asia_NonSelfStd_withINCLAS.csv'), sep=',', index_col=0)

df_SouthAmerica_self = pd.read_csv(op.join(data_folder, 'df_SouthAmerica_SelfStd_withINCLAS.csv'), sep=',', index_col=0)
df_SouthAmerica__NonSelf = pd.read_csv(op.join(data_folder, 'df_SouthAmerica_NonSelfStd_withINCLAS.csv'), sep=',', index_col=0)

df_NorthAmerica_self = pd.read_csv(op.join(data_folder, 'df_NorthAmerica_SelfStd_withINCLAS.csv'), sep=',', index_col=0)
df_NorthAmerica_NonSelf = pd.read_csv(op.join(data_folder, 'df_NorthAmerica_NonSelfStd_withINCLAS.csv'), sep=',', index_col=0)

dfs_continents = [df_Asia_self, df_Asia_NonSelf, df_SouthAmerica_self, df_SouthAmerica__NonSelf, df_NorthAmerica_self, df_NorthAmerica_NonSelf]
list_continents = ['Asia_self', 'Asia_NonSelf', 'SouthAmerica_self', 'SouthAmerica__NonSelf', 'NorthAmerica_self', 'NorthAmerica_NonSelf']

## 3.Models

We try to find the best algorithm/model with the best parameters based on Europe data. We will also test if the model perform well or not on other continents.

In [8]:
# Separate data into features and label
X_train = df_train.loc[:, df_train.columns!='DSDECOD']
y_train = df_train['DSDECOD']

X_test = df_test.loc[:, df_test.columns!='DSDECOD']
y_test = df_test['DSDECOD']

In [9]:
# Function to calculate performances

def performance(df_results, model_name, data_name, y_test, y_pred, y_score, print_):
    
    # Calculate performance scores
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_score)
    average_precision = average_precision_score(y_test, y_score)

    # Print performance
    if print_ == True:
        print('Performance for ' + model_name + ':')
        print('  - Accuracy score = {:.2f}'.format(accuracy))
        print('  - F1 score = {:.2f}'.format(f1))
        print('  - Precision score = {:.2f}'.format(precision))
        print('  - Recall score = {:.2f}'.format(recall))
        print('  - ROC AUC score = {:.2f}'.format(roc_auc))
        print('  - Average precision score = {:.2f}'.format(average_precision))

    # Add performance to df_results_performance
    df_results = df_results.append(pd.Series({"Model" : model_name,
                                              "Data" : data_name,
                                              "Parameters": parameters,
                                              "Accuracy" : accuracy,
                                              "F1" : f1,
                                              "Precision": precision,
                                              "Recall" : recall,
                                              "ROC AUC" : roc_auc,
                                              "Average precision" : average_precision}), ignore_index=True)
    return df_results

In [10]:
# Function to apply model to other continents and calculate performances

def other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, model_name):
    
    for i, dfi in enumerate(dfs_continents):
    
        name_continent = list_continents[i]
    
        # Separate in X and y
        X_continent = dfi.loc[:, dfi.columns!='DSDECOD']
        y_continent = dfi['DSDECOD']
    
        # Predicting values
        y_pred_continent = clf.predict(X_continent)

        # Proba for the greater label
        y_score_continent = clf.predict_proba(X_continent)[:, 1]
    
        # Calculate performance scores
        df_results_OtherContinents = performance(df_results_OtherContinents, model_name, name_continent, y_continent, y_pred_continent, y_score_continent, False)

    return df_results_OtherContinents

In [11]:
# For storage of results for each model
df_results_performance = pd.DataFrame()
df_results_coeff = pd.DataFrame(X_train.columns, columns=['Features'])
df_results_OtherContinents = pd.DataFrame()

In [12]:
df_results_OtherContinents = pd.read_csv(op.join(data_folder, "ResultsPerContinent.csv"))
df_results_coeff = pd.read_csv(op.join(data_folder, "EuropeCoeffs.csv"))
df_results_performance = pd.read_csv(op.join(data_folder, "EuropePerformances.csv"))

### 3.1. Model 1: Logistic regression

In [27]:
%%time
# Choose parameter values to test for cross-validation
param_grid = {"C":np.logspace(-3,3,4), 
              "penalty":["l2"]}

# Choose the estimator
estimator = LogisticRegression(max_iter=200)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Proba for the greater label
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'C':best_param['C'], 'penalty':best_param['penalty']}
print('Best parameters for Logistic regression:')
print('C:', parameters['C'])
print('Penalty:', parameters['penalty'])
print('------------')

# Calculate performance on Europe test
df_results_performance = performance(df_results_performance, "Logistic Regression", "Europe test", y_test, y_pred, y_score, True)

# Look at coefs
coefs = clf.best_estimator_.coef_
df_results_coeff["Logistic Regression"] = coefs[0]

# Calculate performance on other continents
df_results_OtherContinents = other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, "Logistic Regression")

Best parameters for Logistic regression:
C: 1000.0
Penalty: l2
------------
Performance for Logistic Regression:
  - Accuracy score = 0.71
  - F1 score = 0.60
  - Precision score = 0.50
  - Recall score = 0.76
  - ROC AUC score = 0.80
  - Average precision score = 0.60
CPU times: total: 2.11 s
Wall time: 2.11 s


  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,


### 3.2. Model 2 : K-Nearest Neighbor Classifier

In [28]:
%%time
# Choose parameter values to test for cross-validation
k_range = list(range(5, 21, 5))
param_grid = dict(n_neighbors=k_range)

# Choose the estimator
estimator = KNeighborsClassifier()

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Proba for the greater label
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'k':best_param['n_neighbors']}
print('Best parameters for K-Nearest Neighbor:')
print('k:', parameters['k'])
print('------------')

# Calculate performance on Europe test
df_results_performance = performance(df_results_performance, "K-Nearest Neighbor", "Europe test", y_test, y_pred, y_score, True)

# Look at coefs
df_results_coeff["K-Nearest Neighbor"] = "-"

# Calculate performance on other continents
df_results_OtherContinents = other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, "K-Nearest Neighbor")

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


Best parameters for K-Nearest Neighbor:
k: 5
------------
Performance for K-Nearest Neighbor:
  - Accuracy score = 0.67
  - F1 score = 0.54
  - Precision score = 0.45
  - Recall score = 0.66
  - ROC AUC score = 0.72
  - Average precision score = 0.45


  df_results = df_results.append(pd.Series({"Model" : model_name,
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  df_results = df_results.append(pd.Series({"Model" : model_name,
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  df_results = df_results.append(pd.Series({"Model" : model_name,
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  df_results = df_results.append(pd.Series({"Model" : model_name,
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  df_results = df_results.append(pd.Series({"Model" : model_name,
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  df_results = df_results.append(pd.Series({"Model" : model_name,
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


CPU times: total: 17min 58s
Wall time: 6min 21s


  df_results = df_results.append(pd.Series({"Model" : model_name,


### 3.3. Model 3 : Support Vector Machines

In [15]:
# Because svm is too long, we will take a sample
X_train_svm = X_train.sample(n=10000)
y_train_svm = y_train.sample(n=10000)

In [16]:
%%time

# Choose parameter values to test for cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear']} 

# Choose the estimator
estimator = svm.SVC(probability=True)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True), verbose=True)

# Fit data into the model
clf.fit(X_train_svm, y_train_svm)

# Predicting values
y_pred = clf.predict(X_test)

# Proba for the greater label
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'C':best_param['C'], 'gamma':best_param['gamma'], 'kernel':best_param['kernel']}
print('Best parameters for SVM:')
print('C:', parameters['C'])
print('gamma:', parameters['gamma'])
print('kernel:', parameters['kernel'])
print('------------')

# Calculate performance on Europe test
df_results_performance = performance(df_results_performance, "SVM", "Europe test", y_test, y_pred, y_score, True)

# Look at coefs
coefs = clf.best_estimator_.coef_
df_results_coeff["SVM"] = coefs[0]

# Calculate performance on other continents
df_results_OtherContinents = other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, "SVM")

Fitting 3 folds for each of 16 candidates, totalling 48 fits
Best parameters for SVM:
C: 1
gamma: 1
kernel: linear
------------
Performance for SVM:
  - Accuracy score = 0.71
  - F1 score = 0.00
  - Precision score = 0.62
  - Recall score = 0.00
  - ROC AUC score = 0.51
  - Average precision score = 0.30


  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  _warn_prf(average, modifier, msg_start, len(result))
  df_results = df_results.append(pd.Series({"Model" : model_name,
  _warn_prf(average, modifier, msg_start, len(result))
  df_results = df_results.append(pd.Series({"Model" : model_name,
  _warn_prf(average, modifier, msg_start, len(result))
  df_results = df_results.append(pd.Series({"Model" : model_name,


CPU times: total: 2h 56min 42s
Wall time: 2h 56min 45s


  _warn_prf(average, modifier, msg_start, len(result))
  df_results = df_results.append(pd.Series({"Model" : model_name,


In [20]:
df_results_performance

Unnamed: 0.1,Unnamed: 0,Model,Data,Parameters,Accuracy,F1,Precision,Recall,ROC AUC,Average precision
0,0.0,Logistic Regression,Europe test,"{'C': 1000.0, 'penalty': 'l2'}",0.705863,0.603858,0.500139,0.761851,0.799059,0.599559
1,1.0,K-Nearest Neighbor,Europe test,{'k': 5},0.665848,0.538607,0.453612,0.662799,0.717902,0.454071
2,2.0,SVM,Europe test,"{'C': 10, 'gamma': 1, 'kernel': 'linear'}",0.705738,0.0,0.0,0.0,0.294392,0.203689
3,3.0,MLP,Europe test,"{'hidden_layer_sizes': 46, 'activation': 'logi...",0.712483,0.59212,0.508213,0.709212,0.791526,0.59279
4,4.0,Quadratic discriminant analysis,Europe test,{'reg_param': 0.1},0.70936,0.548045,0.505193,0.59884,0.763283,0.517587
5,5.0,XGBoost,Europe test,"{'max_depth': 9, 'n_estimators': 140, 'learnin...",0.766447,0.564214,0.625603,0.513797,0.811671,0.631167
6,6.0,SVM,Europe test,"{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}",0.705904,0.006191,0.55,0.003113,0.576587,0.372166
7,7.0,SVM,Europe test,"{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}",0.705904,0.006191,0.55,0.003113,0.576591,0.372184
8,,SVM,Europe test,"{'C': 1, 'gamma': 1, 'kernel': 'linear'}",0.705821,0.001413,0.625,0.000708,0.510855,0.300076


In [17]:
coefs

array([[ 0.01669233, -0.00402645,  0.01256534,  0.00832636, -0.00353805,
        -0.00086363,  0.00329812,  0.00209732,  0.00227995, -0.00085147,
        -0.00048678, -0.0021243 ,  0.02701284, -0.00023855, -0.00053369,
         0.00402095,  0.00286729, -0.06614103, -0.00057383, -0.00108952]])

### 3.4. Model 4 : Multi-layer perceptrons

In [31]:
%%time
# Choose parameter values to test for cross-validation
param_grid = {'hidden_layer_sizes': np.arange(10, 50, 3),
              'activation': ['relu', 'logistic'],
              'solver': ['sgd', 'adam'],
              'learning_rate': ['adaptive'],
              'learning_rate_init': [0.01, 0.001]}

# Choose the estimator
estimator = MLPClassifier(max_iter=300)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Proba for the greater label
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'hidden_layer_sizes':best_param['hidden_layer_sizes'], 'activation':best_param['activation'], 
              'solver':best_param['solver'], 'learning_rate':best_param['learning_rate'], 'learning_rate_init':best_param['learning_rate_init'], }
print('Best parameters for MLP:')
print('hidden_layer_sizes:', parameters['hidden_layer_sizes'])
print('activation:', parameters['activation'])
print('solver:', parameters['solver'])
print('learning_rate:', parameters['learning_rate'])
print('learning_rate_init:', parameters['learning_rate_init'])
print('------------')

# Calculate performance on Europe test
df_results_performance = performance(df_results_performance, "MLP", "Europe test", y_test, y_pred, y_score, True)

# Look at coefs
df_results_coeff["MLP"] = "-"

# Calculate performance on other continents
df_results_OtherContinents = other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, "MLP")



Best parameters for MLP:
hidden_layer_sizes: 46
activation: logistic
solver: adam
learning_rate: adaptive
learning_rate_init: 0.01
------------
Performance for MLP:
  - Accuracy score = 0.71
  - F1 score = 0.59
  - Precision score = 0.51
  - Recall score = 0.71
  - ROC AUC score = 0.79
  - Average precision score = 0.59
CPU times: total: 14h 57min 59s
Wall time: 1h 52min 28s


  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,


### 3.5. Model 5 : Quadratic discriminant analysis

In [32]:
%%time
# Choose parameter values to test for cross-validation
param_grid = [{'reg_param': [0.1, 0.2, 0.3, 0.4, 0.5]}]

# Choose the estimator
estimator = QuadraticDiscriminantAnalysis()

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Proba for the greater label
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'reg_param':best_param['reg_param']}
print('Best parameters for Quadratic discriminant analysis:')
print('reg_param:', parameters['reg_param'])
print('------------')

# Calculate performance on Europe test
df_results_performance = performance(df_results_performance, "Quadratic discriminant analysis", "Europe test", y_test, y_pred, y_score, True)

# Look at coefs
df_results_coeff["Quadratic discriminant analysis"] = "-"

# Calculate performance on other continents
df_results_OtherContinents = other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, "Quadratic discriminant analysis")

Best parameters for Quadratic discriminant analysis:
reg_param: 0.1
------------
Performance for Quadratic discriminant analysis:
  - Accuracy score = 0.71
  - F1 score = 0.55
  - Precision score = 0.51
  - Recall score = 0.60
  - ROC AUC score = 0.76
  - Average precision score = 0.52
CPU times: total: 8.81 s
Wall time: 1.15 s


  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,


### 3.6. Model 6 : XGBoost

In [33]:
%%time

# Choose parameter values to test for cross-validation
param_grid = {'max_depth': range (2, 10, 1),
              'n_estimators': range(60, 220, 40),
              'learning_rate': [0.1, 0.01, 0.05],
              'booster': ['gbtree', 'gblinear']}

# Choose the estimator
estimator = xgb.XGBClassifier(objective="binary:logistic", random_state=42)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Proba for the greater label
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'max_depth':best_param['max_depth'], 'n_estimators':best_param['n_estimators'], 'learning_rate':best_param['learning_rate'], 'booster':best_param['booster']}
print('Best parameters for XGBoost:')
print('max_depth:', parameters['max_depth'])
print('n_estimators:', parameters['n_estimators'])
print('learning_rate:', parameters['learning_rate'])
print('booster:', parameters['booster'])
print('------------')

# Calculate performance on Europe test
df_results_performance = performance(df_results_performance, "XGBoost", "Europe test", y_test, y_pred, y_score, True)

# Look at coefs
if (parameters['booster']=='gblinear'):
    coefs = clf.best_estimator_.coef_
    df_results_coeff["XGBoost"] = coefs
else:
    feature_importances = clf.best_estimator_.feature_importances_
    df_results_coeff["XGBoost"] = feature_importances

# Calculate performance on other continents
df_results_OtherContinents = other_continents(dfs_continents, list_continents, clf, df_results_OtherContinents, "XGBoost")

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters: { "max_depth" } are not used.

Parameters:

  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,
  df_results = df_results.append(pd.Series({"Model" : model_name,


### 3.7. Models comparison

In [25]:
# Performance of each model
df_results_performance.sort_values(by="F1", axis=0, ascending=False).round(2).set_index('Model')

Unnamed: 0_level_0,Unnamed: 0,Data,Parameters,Accuracy,F1,Precision,Recall,ROC AUC,Average precision
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Logistic Regression,0.0,Europe test,"{'C': 1000.0, 'penalty': 'l2'}",0.71,0.6,0.5,0.76,0.8,0.6
MLP,3.0,Europe test,"{'hidden_layer_sizes': 46, 'activation': 'logi...",0.71,0.59,0.51,0.71,0.79,0.59
XGBoost,5.0,Europe test,"{'max_depth': 9, 'n_estimators': 140, 'learnin...",0.77,0.56,0.63,0.51,0.81,0.63
Quadratic discriminant analysis,4.0,Europe test,{'reg_param': 0.1},0.71,0.55,0.51,0.6,0.76,0.52
K-Nearest Neighbor,1.0,Europe test,{'k': 5},0.67,0.54,0.45,0.66,0.72,0.45
SVM,6.0,Europe test,"{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}",0.71,0.01,0.55,0.0,0.58,0.37
SVM,7.0,Europe test,"{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}",0.71,0.01,0.55,0.0,0.58,0.37
SVM,,Europe test,"{'C': 1, 'gamma': 1, 'kernel': 'linear'}",0.71,0.0,0.62,0.0,0.51,0.3
SVM,2.0,Europe test,"{'C': 10, 'gamma': 1, 'kernel': 'linear'}",0.71,0.0,0.0,0.0,0.29,0.2


In [23]:
# Feature importance for each model
df_results_coeff.round(4).set_index('Features')

Unnamed: 0_level_0,Unnamed: 0,Logistic Regression,K-Nearest Neighbor,MLP,Quadratic discriminant analysis,XGBoost,SVM
Features,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
SEX,0,-0.3836,-,-,-,0.1113,0.0167
IETEST_Fever,1,0.0642,-,-,-,0.0486,-0.004
"INCLAS_ANTIINFLAMMATORY_AND_ANTIRHEUMATIC_PRODUCTS,_NON-STEROIDS",2,-0.2758,-,-,-,0.0276,0.0126
INCLAS_ARTIFICIAL_RESPIRATION,3,1.6377,-,-,-,0.0382,0.0083
INCLAS_DRUGS_FOR_ACID_RELATED_DISORDERS,4,0.1614,-,-,-,0.0841,-0.0035
INCLAS_PSYCHOLEPTICS,5,0.5338,-,-,-,0.075,-0.0009
INCLAS_RENAL_REPLACEMENT,6,-1.4487,-,-,-,0.0477,0.0033
INCLAS_VACCINES,7,-0.7067,-,-,-,0.0672,0.0021
AGE,8,1.1289,-,-,-,0.1326,0.0023
LBTEST_APTT,9,0.072,-,-,-,0.0207,-0.0009


In [24]:
# Performance on other continents
df_results_OtherContinents.sort_values(by="Data", axis=0).round(2).set_index("Data")

Unnamed: 0_level_0,Model,Parameters,Accuracy,F1,Precision,Recall,ROC AUC,Average precision
Data,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Asia_NonSelf,Logistic Regression,"{'C': 1000.0, 'penalty': 'l2'}",0.91,0.28,0.41,0.21,0.81,0.28
Asia_NonSelf,SVM,"{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}",0.9,0.02,0.09,0.01,0.62,0.24
Asia_NonSelf,XGBoost,"{'max_depth': 9, 'n_estimators': 140, 'learnin...",0.92,0.27,0.75,0.17,0.87,0.48
Asia_NonSelf,K-Nearest Neighbor,{'k': 5},0.87,0.31,0.29,0.34,0.74,0.22
Asia_NonSelf,Quadratic discriminant analysis,{'reg_param': 0.1},0.9,0.39,0.42,0.36,0.82,0.29
Asia_NonSelf,MLP,"{'hidden_layer_sizes': 46, 'activation': 'logi...",0.92,0.27,0.66,0.17,0.82,0.41
Asia_NonSelf,SVM,"{'C': 1, 'gamma': 1, 'kernel': 'linear'}",0.91,0.0,0.0,0.0,0.39,0.07
Asia_self,Logistic Regression,"{'C': 1000.0, 'penalty': 'l2'}",0.75,0.37,0.23,0.83,0.87,0.49
Asia_self,SVM,"{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}",0.91,0.01,0.43,0.01,0.64,0.31
Asia_self,K-Nearest Neighbor,{'k': 5},0.62,0.22,0.14,0.63,0.67,0.13


In [37]:
# Reminder: percentage of death
dfs = [df_test, df_Asia_self, df_SouthAmerica_self, df_NorthAmerica_self]
continents = ["Europe_test", "Asia", "SouthAmerica", "NorthAmerica"]
for i, dfi in enumerate(dfs):
    distribution = dfi['DSDECOD'].value_counts()
    print("In", continents[i], ":", distribution[0], "survivals,", distribution[1], "deaths, so a", round(distribution[0]/(distribution[0]+distribution[1])*100, 2), "% survival rate.")

In Europe_test : 16949 survivals, 7067 deaths, so a 70.57 % survival rate.
In Asia : 5994 survivals, 561 deaths, so a 91.44 % survival rate.
In SouthAmerica : 5133 survivals, 748 deaths, so a 87.28 % survival rate.
In NorthAmerica : 4077 survivals, 1178 deaths, so a 77.58 % survival rate.


In [None]:
# Investigate features
df_results_features = pd.DataFrame(X_train.columns, columns=['Features'])
df_results_features['Survival_percent'] = [round(X_test[c].value_counts(normalize=True).values[0], 2) if (X_test[c].nunique()==2) else '-' for c in X_test.columns]
df_results_features

In [None]:
fig, axs = plt.subplots(5,5, figsize=(14,8))
axs = axs.flatten()

for ax, col in zip(axs, X_test.columns):
    sns.histplot(X_test[col], ax=ax, stat="density")
    
fig.suptitle("Distribution in each feature")
plt.tight_layout()
plt.show()

In [22]:
df_results_OtherContinents.sort_values(by="Data", axis=0).round(2).set_index("Data").to_csv(op.join(data_folder, "ResultsPerContinent_withINCLAS.csv"))
df_results_coeff.to_csv(op.join(data_folder, "EuropeCoeffs_withINCLAS.csv"))
df_results_performance.to_csv(op.join(data_folder, "EuropePerformances_withINCLAS.csv"))