# Project 2: Covid ---> III/ Models and predictions

The purpose of this file is to test and compare several models on the matrix extracted from the "II_features-selection" file. We will use a sample of the data for questions of run time. Then the best model will be applied to all the data.

In [1]:
# Import
%matplotlib inline

import os
import os.path as op
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import xgboost as xgb

from pandas_profiling import ProfileReport

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron


from sklearn.model_selection import RandomizedSearchCV

%load_ext autoreload
%autoreload 2

## 1. Literature

In 2019, the first COVID-19 cases are observed in China. Rapidly, the SARS-Cov2 virus spread worldwide, pushing governments to take strict decisions about the lives of their co-citizens, like containment, to protect the population. Indeed, in some cases, COVID-19 patients ended up in intensive care services and sometimes died.

**The aim of our model is, based on easily computable parameters at the study's beginning, to predict whether the patient will be likely to die or if the chance of survival is important.** The point of this study is to help the hospital organise in the case of a high number of cases.


The studied dataset stem from the IDDO Data Repository of COVID-19 data. This data was pulled from the underlying data collection projects on 2022-09-01. The data comes from 1,200 institutions from over 45 countries and gather various information from 700,000 hospitalised individuals.

To keep only the relevant features, we first dive into the literature, using Meta-analysis papers. First, we have been looking for aggravating factors that will likely lead the patient to ICU.

Obesity: according to a meta-analysis by Sales-Peres, there is a correlation between obesity and ICU admission. This paper also concluded that co-morbidities for obese patients, such as hypertension, type 2 diabetes, smoking habit, lung disease, and/or cardiovascular disease lead to a higher chance of ICU admission.
Age: patients aged 70 years and above have a higher risk of infection and a higher need for intensive care than patients younger than 70.
Sex: men, when infected, have a higher risk of severe COVID-19 disease and a higher need for intensive care than women\cite{pijls_demographic_2021}.
Ethnicity: the risk of contamination was higher in most ethnic minority groups than their White counterparts in North America and Europe. Among people with confirmed infection, African-Americans and Hispanic Americans were also more likely than White Americans to be hospitalised with SARS-CoV-2 infection. However, the probability of ICU admission was equivalent for all groups. Thus, ethnicity is not relevant to our question. 
Blood tests: Patients with increased pancreatic enzymes, including elevated serum lipase or amylase of either type, had worse clinical outcomes. Lower levels of lymphocytes and hemoglobin; elevated levels of leukocytes, aspartate aminotransferase, alanine aminotransferase, blood creatinine, blood urea nitrogen, high-sensitivity troponin, creatine kinase, high-sensitivity C-reactive protein, interleukin 6, D-dimer, ferritin, lactate dehydrogenase, and procalcitonin; and a high erythrocyte sedimentation rate were also associated with severe COVID-19.  

Out of a total of 3009 citations, 17 articles (22 studies, 21 from China and one study from Singapore) with 3396 ranging from 12 to1099 patients were included. Our meta-analyses showed a significant decrease in lymphocyte, monocyte, and eosinophil, hemoglobin, platelet, albumin, serum sodium, lymphocyte to C-reactive protein ratio (LCR), leukocyte to C-reactive protein ratio (LeCR), leukocyte to IL-6 ratio (LeIR), and an increase in the neutrophil, alanine aminotransferase (ALT), aspartate aminotransferase (AST), total bilirubin, blood urea nitrogen (BUN), creatinine (Cr), erythrocyte Sedimentation Rate (ESR), C-reactive protein (CRP), Procalcitonin (PCT), lactate dehydrogenase (LDH), fibrinogen, prothrombin time (PT), D-dimer, glucose level, and neutrophil to lymphocyte ratio (NLR) in the severe group compared with the non-severe group. 

No significant changes in white blood cells (WBC), Creatine Kinase (CK), troponin I, myoglobin, IL-6 and K between the two groups were observed. 

## 2. Load data after data_selection and feature_selection

The file we open already has: 
- lines with NA for DSDECOD that have been removed
- the NAs that have been filled in 
- the standardisation that has been performed 
- the features we want to keep that have been selected

In [3]:
# Open file
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_train_svm.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_train = pd.concat(mylist, axis=0)
df_train.name = 'df_train'
del mylist

In [4]:
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_test_svm.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_test = pd.concat(mylist, axis=0)
df_test.name = 'df_test'
del mylist

In [5]:
df_train.head(3)

Unnamed: 0,CONTINENT_AF,CONTINENT_EU,CONTINENT_SA,HODECOD,INCLAS_ANESTHETICS,INCLAS_ANTIHELMINTICS,"INCLAS_ANTIINFLAMMATORY_AND_ANTIRHEUMATIC_PRODUCTS,_NON-STEROIDS",INCLAS_ANTIMYCOTICS_FOR_SYSTEMIC_USE,INCLAS_ANTIVIRALS_FOR_SYSTEMIC_USE,INCLAS_ARTIFICIAL_RESPIRATION,...,INCLAS_PSYCHOLEPTICS,INCLAS_REMOVAL_OF_ENDOTRACHEAL_TUBE,INCLAS_RENAL_REPLACEMENT,INCLAS_TOTAL_PARENTERAL_NUTRITION,INCLAS_VACCINES,MBTEST_OTHER RESPIRATORY PATHOGENS,RSCAT_AVPU,LBTEST_AST,LBTEST_LYM,DSDECOD
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-0.024282,-0.018348,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-0.024282,-0.018348,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-0.024282,-0.018348,0.0


In [6]:
df_train.columns

Index(['CONTINENT_AF', 'CONTINENT_EU', 'CONTINENT_SA', 'HODECOD',
       'INCLAS_ANESTHETICS', 'INCLAS_ANTIHELMINTICS',
       'INCLAS_ANTIINFLAMMATORY_AND_ANTIRHEUMATIC_PRODUCTS,_NON-STEROIDS',
       'INCLAS_ANTIMYCOTICS_FOR_SYSTEMIC_USE',
       'INCLAS_ANTIVIRALS_FOR_SYSTEMIC_USE', 'INCLAS_ARTIFICIAL_RESPIRATION',
       'INCLAS_BRONCHOSCOPY', 'INCLAS_CARDIAC_THERAPY', 'INCLAS_CHEMOTHERAPY',
       'INCLAS_EXTRACORPOREAL_MEMBRANE_OXYGENATION',
       'INCLAS_HIGH_FLOW_OXYGEN_NASAL_CANNULA', 'INCLAS_IMMUNOSTIMULANTS',
       'INCLAS_INSERTION_OF_TRACHEOSTOMY_TUBE',
       'INCLAS_LIPID_MODIFYING_AGENTS', 'INCLAS_MUSCLE_RELAXANTS',
       'INCLAS_NONINVASIVE_POSITIVE_PRESSURE_VENTILATION',
       'INCLAS_NONINVASIVE_VENTILATION',
       'INCLAS_OTHER_RESPIRATORY_SYSTEM_PRODUCTS', 'INCLAS_OXYGEN',
       'INCLAS_PERCUTANEOUS_ENDOSCOPIC_GASTROSTOMY',
       'INCLAS_PRONE_BODY_POSITION', 'INCLAS_PSYCHOLEPTICS',
       'INCLAS_REMOVAL_OF_ENDOTRACHEAL_TUBE', 'INCLAS_RENAL_REPLACEMENT',
  

## 3. Search the best model/algorithm and the best parameters

We will work only on a sample of the data to try to find the best algorithm/model with the best parameters. Then we will apply the best model to the whole data set.

In [7]:
# Separate data into features and label
X_train = df_train.loc[:, df_train.columns!='DSDECOD']
y_train = df_train['DSDECOD']

X_test = df_test.loc[:, df_test.columns!='DSDECOD']
y_test = df_test['DSDECOD']

In [9]:
# For storage of results for each model
df_results = pd.DataFrame()

### 3.1. Model 1: Logistic regression

In [10]:
# Fit data into the model
clf = LogisticRegression().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Logistic regression:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Logistic regression", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Logistic regression:
  - Accuracy score = 0.77
  - F1 score = 0.12
  - Precision score = 0.06
  - Recall score = 0.82
  - ROC AUC score = 0.79


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### 3.2. Model 2: GradientBoostingClassifier

In [11]:
# Fit data into the model
clf = GradientBoostingClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for GradientBoostingClassifier:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "GradientBoostingClassifier", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for GradientBoostingClassifier:
  - Accuracy score = 0.77
  - F1 score = 0.12
  - Precision score = 0.07
  - Recall score = 0.69
  - ROC AUC score = 0.73


### 3.3. Model 3 : K-Nearest Neighbor Classifier

In [12]:
# Fit data into the model
clf = KNeighborsClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for K-Nearest Neighbor:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "K-Nearest Neighbor", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for K-Nearest Neighbor:
  - Accuracy score = 0.32
  - F1 score = 0.39
  - Precision score = 0.89
  - Recall score = 0.25
  - ROC AUC score = 0.53


### 3.4. Model 4 : Naive Bayes

In [13]:
# Fit data into the model
clf = GaussianNB().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Naive Bayes:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Naive Bayes", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Naive Bayes:
  - Accuracy score = 0.75
  - F1 score = 0.18
  - Precision score = 0.11
  - Recall score = 0.45
  - ROC AUC score = 0.61


### 3.5. Model 5 : Random Forest Classifier

In [14]:
# Fit data into the model
clf = RandomForestClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Random Forest:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Random Forest", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Random Forest:
  - Accuracy score = 0.76
  - F1 score = 0.13
  - Precision score = 0.07
  - Recall score = 0.50
  - ROC AUC score = 0.63


### 3.6. Model 6 : Support Vector Machines

#### 3.6.1. Normal one

In [15]:
# Fit data into the model
clf = svm.SVC().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Normal SVM:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Normal SVM", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Normal SVM:
  - Accuracy score = 0.77
  - F1 score = 0.11
  - Precision score = 0.06
  - Recall score = 0.80
  - ROC AUC score = 0.78


#### 3.6.2. Linear SVM

In [16]:
# Fit data into the model
clf = svm.LinearSVC().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Linear SVM:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Linear SVM", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Linear SVM:
  - Accuracy score = 0.77
  - F1 score = 0.11
  - Precision score = 0.06
  - Recall score = 0.74
  - ROC AUC score = 0.75




### 3.7. Model 7 : Multi-layer perceptrons

In [17]:
# Fit data into the model
clf = MLPClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for MLP:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "MLP", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for MLP:
  - Accuracy score = 0.77
  - F1 score = 0.11
  - Precision score = 0.06
  - Recall score = 0.80
  - ROC AUC score = 0.79


### 3.8. Model 8 : Decision tree classifier 

In [18]:
# Fit data into the model
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Decision tree:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Decision tree", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Decision tree:
  - Accuracy score = 0.76
  - F1 score = 0.12
  - Precision score = 0.07
  - Recall score = 0.47
  - ROC AUC score = 0.62


### 3.9. Model 9 : ADABoost Classifier 

In [19]:
# Fit data into the model
clf = AdaBoostClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for AdaBoost:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "AdaBoost", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for AdaBoost:
  - Accuracy score = 0.77
  - F1 score = 0.12
  - Precision score = 0.07
  - Recall score = 0.74
  - ROC AUC score = 0.76


### 3.10. Model 10 : Extra trees classifier

In [20]:
# Fit data into the model
clf = ExtraTreesClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Extra trees:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Extra trees", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Extra trees:
  - Accuracy score = 0.76
  - F1 score = 0.11
  - Precision score = 0.06
  - Recall score = 0.51
  - ROC AUC score = 0.64


### 3.11. Model 11 : Discriminant analysis

#### 3.11.1. Linear discriminant analysis

In [21]:
# Fit data into the model
clf = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Linear discriminant analysis:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Linear discriminant analysis", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Linear discriminant analysis:
  - Accuracy score = 0.77
  - F1 score = 0.12
  - Precision score = 0.06
  - Recall score = 0.73
  - ROC AUC score = 0.75


#### 3.11.2. Quadratic discriminant analysis

In [22]:
# Fit data into the model
clf = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Quadratic discriminant analysis:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Quadratic discriminant analysis", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Quadratic discriminant analysis:
  - Accuracy score = 0.26
  - F1 score = 0.39
  - Precision score = 0.98
  - Recall score = 0.24
  - ROC AUC score = 0.54




### 3.12. Model 12 : Stochastic gradient descent

In [23]:
# Fit data into the model
clf = SGDClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for SGD:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "SGD", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for SGD:
  - Accuracy score = 0.77
  - F1 score = 0.12
  - Precision score = 0.06
  - Recall score = 0.75
  - ROC AUC score = 0.76


### 3.13. Model 13 : XGBoost

In [24]:
# Fit data into the model
clf = xgb.XGBClassifier(objective="binary:logistic", random_state=42).fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for XGBoost:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "XGBoost", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for XGBoost:
  - Accuracy score = 0.76
  - F1 score = 0.12
  - Precision score = 0.06
  - Recall score = 0.59
  - ROC AUC score = 0.68


### 3.14. Model 14 : Gaussian process classifier

In [25]:
# Fit data into the model
clf = GaussianProcessClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Gaussian process')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Gaussian process", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Gaussian process
  - Accuracy score = 0.77
  - F1 score = 0.12
  - Precision score = 0.06
  - Recall score = 0.79
  - ROC AUC score = 0.78


### 3.15. Model 15 : Passive aggressive classifier

In [26]:
# Fit data into the model
clf = PassiveAggressiveClassifier().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Passive aggressive:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Passive aggressive", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Passive aggressive:
  - Accuracy score = 0.28
  - F1 score = 0.39
  - Precision score = 0.94
  - Recall score = 0.25
  - ROC AUC score = 0.52


### 3.16. Model 16 : Linear perceptron classifier

In [27]:
# Fit data into the model
clf = Perceptron().fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Calculate performance scores
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test)
precision = precision_score(y_pred, y_test)
recall = recall_score(y_pred, y_test)
roc_auc = roc_auc_score(y_pred, y_test)

# Print performance
print('Performance for Perceptron:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))

# Add performance to df_results
df_results = df_results.append(pd.Series({"Model" : "Perceptron", 
                                          "Accuracy" : accuracy,
                                          "F1" : f1,
                                          "Precision": precision,
                                          "Recall" : recall,
                                          "ROC AUC" : roc_auc}), ignore_index = True)

Performance for Perceptron:
  - Accuracy score = 0.77
  - F1 score = 0.13
  - Precision score = 0.07
  - Recall score = 0.70
  - ROC AUC score = 0.74


### 3.17. Models comparison

In [28]:
df_results.sort_values(by = "F1", axis=0, ascending=False).round(2).set_index('Model')

Unnamed: 0_level_0,Accuracy,F1,Precision,Recall,ROC AUC
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Quadratic discriminant analysis,0.26,0.39,0.98,0.24,0.54
K-Nearest Neighbor,0.32,0.39,0.89,0.25,0.53
Passive aggressive,0.28,0.39,0.94,0.25,0.52
Naive Bayes,0.75,0.18,0.11,0.45,0.61
Perceptron,0.77,0.13,0.07,0.7,0.74
Random Forest,0.76,0.13,0.07,0.5,0.63
AdaBoost,0.77,0.12,0.07,0.74,0.76
Decision tree,0.76,0.12,0.07,0.47,0.62
GradientBoostingClassifier,0.77,0.12,0.07,0.69,0.73
Linear discriminant analysis,0.77,0.12,0.06,0.73,0.75


## 4. Apply the best model/algorithm

**!!TODO!!**

In [None]:
# SAUVEGARDE EN ATTENDANT => obj faire une cross validation quand on a choisi le meilleur modèle !

# Perform Logistic regression with cross-validation

# Split dataset in features and target variable
X_reg = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_reg = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_reg_woNaN = imputer.fit_transform(X_reg)

# Get rid of rows with missing y values
X_reg_woNaN = (X_reg_woNaN[~np.isnan(y_reg)[:,0], :])
y_reg_woNaN = y_reg[~np.isnan(y_reg)]

# Split X and y into training and testing sets
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg_woNaN, y_reg_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_reg = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression())])

# Set parameters to test
param_reg = {'clf__penalty': [None, 'l1', 'l2', 'elasticnet']}

# Cross-validation
cv_reg = RandomizedSearchCV(estimator = pipe_reg, 
                                         param_distributions=param_reg, 
                                         cv=3, n_iter=30, n_jobs=-1)

# Fit data into the model
cv_reg.fit(X_reg_train, y_reg_train)

# Predicting values
y_reg_pred = cv_reg.predict(X_reg_test)

# Calculate accuracy score
acc_score = accuracy_score(y_reg_pred, y_reg_test)
print('Accuracy score for logistic regression : ',acc_score)

clf.get_params().keys()

In [None]:
# See best parameters of the model
g_search.best_params_

In [None]:
# See coeffs of the model
cv_reg.named_steps['clf'].coef_