# Project 2: Covid ---> III/ Models and predictions

The purpose of this file is to test and compare several models on the matrix extracted from the "II_features-selection" file. To keep run time manageable, we work on a sample of the data; the best model will then be applied to the full dataset.

In [14]:
# Import
%matplotlib inline

import os
import os.path as op
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import xgboost as xgb

from pandas_profiling import ProfileReport

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, average_precision_score, auc

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.model_selection import RandomizedSearchCV

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Literature

In 2019, the first COVID-19 cases were observed in China. The SARS-CoV-2 virus then spread rapidly worldwide, pushing governments to take strict measures affecting the lives of their citizens, such as lockdowns, to protect the population. In some cases, COVID-19 patients were admitted to intensive care, and some of them died.

**The aim of our model is to predict, from parameters that are easy to obtain at the start of the study, whether a patient is likely to die or has a high chance of survival.** The point of this study is to help hospitals organise themselves when the number of cases is high.


The dataset studied stems from the IDDO Data Repository of COVID-19 data. The data was pulled from the underlying data collection projects on 2022-09-01. It comes from 1,200 institutions in over 45 countries and gathers various information on 700,000 hospitalised individuals.

To keep only the relevant features, we first reviewed the literature, focusing on meta-analyses. We started by looking for aggravating factors that are likely to lead a patient to the ICU.

- Obesity: according to a meta-analysis by Sales-Peres, there is a correlation between obesity and ICU admission. This paper also concluded that co-morbidities in obese patients, such as hypertension, type 2 diabetes, smoking, lung disease, and/or cardiovascular disease, lead to a higher chance of ICU admission.
- Age: patients aged 70 years and above have a higher risk of infection and a higher need for intensive care than patients younger than 70.
- Sex: men, when infected, have a higher risk of severe COVID-19 disease and a higher need for intensive care than women (Pijls et al., 2021).
- Ethnicity: the risk of infection was higher in most ethnic minority groups than in their White counterparts in North America and Europe. Among people with confirmed infection, African-Americans and Hispanic Americans were also more likely than White Americans to be hospitalised with SARS-CoV-2 infection. However, the probability of ICU admission was equivalent for all groups. Thus, ethnicity is not relevant to our question.
- Blood tests: patients with increased pancreatic enzymes, including elevated serum lipase or amylase of either type, had worse clinical outcomes. Lower levels of lymphocytes and hemoglobin; elevated levels of leukocytes, aspartate aminotransferase, alanine aminotransferase, blood creatinine, blood urea nitrogen, high-sensitivity troponin, creatine kinase, high-sensitivity C-reactive protein, interleukin 6, D-dimer, ferritin, lactate dehydrogenase, and procalcitonin; and a high erythrocyte sedimentation rate were also associated with severe COVID-19.

One meta-analysis screened a total of 3009 citations and included 17 articles (22 studies, 21 from China and one from Singapore) covering 3396 patients, with study sizes ranging from 12 to 1099 patients. It showed a significant decrease in lymphocytes, monocytes, eosinophils, hemoglobin, platelets, albumin, serum sodium, lymphocyte to C-reactive protein ratio (LCR), leukocyte to C-reactive protein ratio (LeCR), and leukocyte to IL-6 ratio (LeIR), and an increase in neutrophils, alanine aminotransferase (ALT), aspartate aminotransferase (AST), total bilirubin, blood urea nitrogen (BUN), creatinine (Cr), erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), procalcitonin (PCT), lactate dehydrogenase (LDH), fibrinogen, prothrombin time (PT), D-dimer, glucose level, and neutrophil to lymphocyte ratio (NLR) in the severe group compared with the non-severe group.

No significant changes in white blood cells (WBC), creatine kinase (CK), troponin I, myoglobin, IL-6, and potassium (K) were observed between the two groups.

## 2. Load data after data_selection and feature_selection

The file we open has already been processed as follows:
- rows with NA for DSDECOD have been removed
- the remaining NAs have been filled in
- the features have been standardised
- the features we want to keep have been selected
- the data has been stratified over the continents, and we only kept the data from Europe
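As a rough illustration of the first three steps listed above, the sketch below applies them to a tiny made-up frame. The actual processing lives in the "II_features-selection" file and is not shown here, so the values, the use of `KNNImputer`, and the names `toy` and `toy_clean` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real matrix (all values are made up)
toy = pd.DataFrame({
    "AGE": [70.0, 45.0, np.nan, 82.0, 60.0],
    "VSTEST_HR": [88.0, np.nan, 110.0, 95.0, 72.0],
    "DSDECOD": [1.0, 0.0, np.nan, 1.0, 0.0],
})

# 1. Remove rows whose label (DSDECOD) is missing
toy = toy.dropna(subset=["DSDECOD"]).reset_index(drop=True)

# 2. Fill the remaining NAs in the features (KNN imputation is an assumption)
features = toy.drop(columns="DSDECOD")
imputed = KNNImputer(n_neighbors=2).fit_transform(features)

# 3. Standardise each feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)

# Reassemble features and label into one frame
toy_clean = pd.DataFrame(scaled, columns=features.columns)
toy_clean["DSDECOD"] = toy["DSDECOD"]
print(toy_clean)
```

In a real workflow the imputer and scaler would be fitted on the training split only and then applied to the test split, to avoid leaking test information.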

In [3]:
# Open file
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_train_svm.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_train = pd.concat(mylist, axis=0)
df_train.name = 'df_train'
del mylist

In [4]:
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_test_svm.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_test = pd.concat(mylist, axis=0)
df_test.name = 'df_test'
del mylist

In [5]:
df_train.head(3)

| | SEX | CONTINENT_AF | CONTINENT_EU | INCLAS_AGENTS_ACTING_ON_THE_RENIN-ANGIOTENSIN_SYSTEM | INCLAS_ANALGESICS | INCLAS_ANTIMALARIALS | INCLAS_ANTITHROMBOTIC_AGENTS | INCLAS_BETA_BLOCKING_AGENTS | INCLAS_CARDIAC_PACING | INCLAS_CARDIOPULMONARY_RESUSCITATION | ... | LBTEST_BICARB | LBTEST_GLUC | LBTEST_HGB | RSCAT_GCS_NINDS_VERSION | VSTEST_DIABP | VSTEST_HR | VSTEST_RESP | VSTEST_SYSBP | VSTEST_WEIGHT | DSDECOD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | -0.00187 | -0.017709 | 0.00142 | 0.088356 | -0.114774 | -0.02362 | -0.066478 | -0.082594 | -0.012458 | 0.0 |
| 1 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | -0.00187 | -0.017709 | 0.00142 | 0.088356 | -0.114774 | -0.02362 | -0.066478 | -0.082594 | -0.012458 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -10.074182 | 0.092409 | 0.244564 | 0.088356 | 4.912852 | 4.341164 | 2.687317 | 1.085295 | -0.012458 | 0.0 |


In [6]:
df_train.columns

Index(['SEX', 'CONTINENT_AF', 'CONTINENT_EU',
       'INCLAS_AGENTS_ACTING_ON_THE_RENIN-ANGIOTENSIN_SYSTEM',
       'INCLAS_ANALGESICS', 'INCLAS_ANTIMALARIALS',
       'INCLAS_ANTITHROMBOTIC_AGENTS', 'INCLAS_BETA_BLOCKING_AGENTS',
       'INCLAS_CARDIAC_PACING', 'INCLAS_CARDIOPULMONARY_RESUSCITATION',
       'INCLAS_CHEMOTHERAPY', 'INCLAS_DIURETICS',
       'INCLAS_DRUGS_FOR_ACID_RELATED_DISORDERS',
       'INCLAS_DRUGS_FOR_OBSTRUCTIVE_AIRWAY_DISEASES',
       'INCLAS_IMMUNOGLOBULINS', 'INCLAS_IMMUNOSTIMULANTS',
       'INCLAS_INSERTION_OF_TRACHEOSTOMY_TUBE',
       'INCLAS_LIPID_MODIFYING_AGENTS', 'INCLAS_MUSCLE_RELAXANTS',
       'INCLAS_OTHER_RESPIRATORY_SYSTEM_PRODUCTS', 'INCLAS_VACCINES',
       'RPSTRESC', 'AGE', 'IETEST_Acute_Respiratory_Infection', 'IETEST_Fever',
       'LBTEST_ALT', 'LBTEST_BICARB', 'LBTEST_GLUC', 'LBTEST_HGB',
       'RSCAT_GCS_NINDS_VERSION', 'VSTEST_DIABP', 'VSTEST_HR', 'VSTEST_RESP',
       'VSTEST_SYSBP', 'VSTEST_WEIGHT', 'DSDECOD'],
      dtype='object')

## 3. Search the best model/algorithm and the best parameters

We will work only on a sample of the data to try to find the best algorithm/model with the best parameters. Then we will apply the best model to the whole data set.
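One way to draw such a sample while preserving the proportion of each outcome is a stratified split. The sketch below uses made-up data and a hypothetical sample fraction of 10%; the fraction actually used for this notebook is not stated, and `df_full` and `df_sample` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 20% positive labels
rng = np.random.default_rng(0)
df_full = pd.DataFrame({
    "AGE": rng.normal(size=1000),
    "DSDECOD": np.r_[np.ones(200), np.zeros(800)],
})

# Keep 10% of the rows, stratified on the label so that the
# class balance of the sample matches the full data
df_sample, _ = train_test_split(
    df_full,
    train_size=0.1,
    stratify=df_full["DSDECOD"],
    random_state=42,
)

print(len(df_sample), df_sample["DSDECOD"].mean())
```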

In [7]:
# Separate data into features and label
X_train = df_train.loc[:, df_train.columns!='DSDECOD']
y_train = df_train['DSDECOD']

X_test = df_test.loc[:, df_test.columns!='DSDECOD']
y_test = df_test['DSDECOD']

In [8]:
# For storage of results for each model
df_results = pd.DataFrame()
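
The model cells in this section all repeat the same fit, predict, and score steps. A small helper along the following lines could factor them out. `evaluate_model` is a hypothetical name that does not exist in the notebook, and the demo runs on synthetic data from `make_classification`, not on the COVID matrix.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_model(name, clf, X_tr, y_tr, X_te, y_te):
    """Fit clf and return one row of test scores as a dict (hypothetical helper)."""
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    # Predicted probability of the positive class (label 1)
    y_score = clf.predict_proba(X_te)[:, 1]
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_te, y_pred),
        "F1": f1_score(y_te, y_pred),
        "Precision": precision_score(y_te, y_pred),
        "Recall": recall_score(y_te, y_pred),
        "ROC AUC": roc_auc_score(y_te, y_score),
        "Average precision": average_precision_score(y_te, y_score),
    }


# Synthetic binary classification problem, for demonstration only
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Collect one dict per model, then build the results table once
rows = [evaluate_model("Logistic regression", LogisticRegression(max_iter=1000),
                       X_tr, y_tr, X_te, y_te)]
df_demo = pd.DataFrame(rows).set_index("Model")
print(df_demo.round(2))
```

Collecting the rows in a list and building the DataFrame once avoids growing a DataFrame row by row.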

### 3.1. Model 1: Logistic regression

In [13]:
# Fit data into the model
clf = LogisticRegression().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for Logistic regression:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model" : "Logistic regression",
                                       "Accuracy" : accuracy,
                                       "F1" : f1,
                                       "Precision": precision,
                                       "Recall" : recall,
                                       "ROC AUC" : roc_auc,
                                       "Average precision" : average_precision}])],
                       ignore_index = True)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Performance for Logistic regression:
  - Accuracy score = 0.78
  - F1 score = 0.21
  - Precision score = 0.55
  - Recall score = 0.13
  - ROC AUC score = 0.74
  - Average precision score = 0.44




### 3.2. Model 2: K-Nearest Neighbor Classifier

In [12]:
# Fit data into the model
clf = KNeighborsClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for K-Nearest Neighbor:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model" : "K-Nearest Neighbor",
                                       "Accuracy" : accuracy,
                                       "F1" : f1,
                                       "Precision": precision,
                                       "Recall" : recall,
                                       "ROC AUC" : roc_auc,
                                       "Average precision" : average_precision}])],
                       ignore_index = True)

Performance for K-Nearest Neighbor:
  - Accuracy score = 0.32
  - F1 score = 0.39
  - Precision score = 0.89
  - Recall score = 0.25
  - ROC AUC score = 0.53


### 3.3. Model 3: Support Vector Machines

In [16]:
# Fit data into the model
clf = svm.SVC(probability=True).fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for SVM:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model" : "SVM",
                                       "Accuracy" : accuracy,
                                       "F1" : f1,
                                       "Precision": precision,
                                       "Recall" : recall,
                                       "ROC AUC" : roc_auc,
                                       "Average precision" : average_precision}])],
                       ignore_index = True)

Performance for Linear SVM:
  - Accuracy score = 0.77
  - F1 score = 0.11
  - Precision score = 0.06
  - Recall score = 0.74
  - ROC AUC score = 0.75




### 3.4. Model 4: Multi-layer perceptrons

In [17]:
# Fit data into the model
clf = MLPClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for Multi-layer perceptrons:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model" : "Multi-layer perceptrons",
                                       "Accuracy" : accuracy,
                                       "F1" : f1,
                                       "Precision": precision,
                                       "Recall" : recall,
                                       "ROC AUC" : roc_auc,
                                       "Average precision" : average_precision}])],
                       ignore_index = True)

Performance for MLP:
  - Accuracy score = 0.77
  - F1 score = 0.11
  - Precision score = 0.06
  - Recall score = 0.80
  - ROC AUC score = 0.79


### 3.5. Model 5: Quadratic discriminant analysis

In [22]:
# Fit data into the model
clf = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for Quadratic discriminant analysis:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model" : "Quadratic discriminant analysis",
                                       "Accuracy" : accuracy,
                                       "F1" : f1,
                                       "Precision": precision,
                                       "Recall" : recall,
                                       "ROC AUC" : roc_auc,
                                       "Average precision" : average_precision}])],
                       ignore_index = True)

Performance for Quadratic discriminant analysis:
  - Accuracy score = 0.26
  - F1 score = 0.39
  - Precision score = 0.98
  - Recall score = 0.24
  - ROC AUC score = 0.54




### 3.6. Model 6: XGBoost

In [15]:
# Fit data into the model
clf = xgb.XGBClassifier(objective="binary:logistic", random_state=42).fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for XGBoost:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model" : "XGBoost",
                                       "Accuracy" : accuracy,
                                       "F1" : f1,
                                       "Precision": precision,
                                       "Recall" : recall,
                                       "ROC AUC" : roc_auc,
                                       "Average precision" : average_precision}])],
                       ignore_index = True)



Performance for XGBoost:
  - Accuracy score = 0.79
  - F1 score = 0.19
  - Precision score = 0.72
  - Recall score = 0.11
  - ROC AUC score = 0.76
  - Average precision score = 0.48




### 3.7. Models comparison

In [28]:
df_results.sort_values(by = "F1", axis=0, ascending=False).round(2).set_index('Model')

| Model | Accuracy | F1 | Precision | Recall | ROC AUC |
|---|---|---|---|---|---|
| Quadratic discriminant analysis | 0.26 | 0.39 | 0.98 | 0.24 | 0.54 |
| K-Nearest Neighbor | 0.32 | 0.39 | 0.89 | 0.25 | 0.53 |
| Passive aggressive | 0.28 | 0.39 | 0.94 | 0.25 | 0.52 |
| Naive Bayes | 0.75 | 0.18 | 0.11 | 0.45 | 0.61 |
| Perceptron | 0.77 | 0.13 | 0.07 | 0.7 | 0.74 |
| Random Forest | 0.76 | 0.13 | 0.07 | 0.5 | 0.63 |
| AdaBoost | 0.77 | 0.12 | 0.07 | 0.74 | 0.76 |
| Decision tree | 0.76 | 0.12 | 0.07 | 0.47 | 0.62 |
| GradientBoostingClassifier | 0.77 | 0.12 | 0.07 | 0.69 | 0.73 |
| Linear discriminant analysis | 0.77 | 0.12 | 0.06 | 0.73 | 0.75 |


## 4. Apply the best model/algorithm

**!!TODO!!**

In [None]:
# SAVED FOR LATER => goal: run a cross-validation once the best model has been chosen!

# Perform Logistic regression with cross-validation

# Split dataset into features and target variable (uses df_train loaded above)
X_reg = df_train.loc[:10000, df_train.columns != 'DSDECOD'].to_numpy()
y_reg = df_train.loc[:10000, df_train.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_reg_woNaN = imputer.fit_transform(X_reg)

# Get rid of rows with missing y values
X_reg_woNaN = (X_reg_woNaN[~np.isnan(y_reg)[:,0], :])
y_reg_woNaN = y_reg[~np.isnan(y_reg)]

# Split X and y into training and testing sets
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg_woNaN, y_reg_woNaN, test_size=0.3, random_state=16)

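# Create pipeline to standardize and make logistic regression
# (the 'saga' solver supports every penalty tested below; l1_ratio is only used by 'elasticnet')
pipe_reg = Pipeline([('scl', StandardScaler()),
                     ('clf', LogisticRegression(solver='saga', l1_ratio=0.5, max_iter=1000))])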

# Set parameters to test
param_reg = {'clf__penalty': [None, 'l1', 'l2', 'elasticnet']}

# Cross-validation
cv_reg = RandomizedSearchCV(estimator = pipe_reg, 
                                         param_distributions=param_reg, 
                                         cv=3, n_iter=30, n_jobs=-1)

# Fit data into the model
cv_reg.fit(X_reg_train, y_reg_train)

# Predicting values
y_reg_pred = cv_reg.predict(X_reg_test)

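# Calculate accuracy score (true labels first, then predictions)
acc_score = accuracy_score(y_reg_test, y_reg_pred)
print('Accuracy score for logistic regression : ', acc_score)

# List the parameter names that can be tuned in the pipeline
pipe_reg.get_params().keys()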

In [None]:
# See best parameters of the model
cv_reg.best_params_

In [None]:
# See coeffs of the best model found by the search
cv_reg.best_estimator_.named_steps['clf'].coef_