# Project 2: Covid ---> III/ Models and predictions

In [18]:
# Import
%matplotlib inline

import os
import os.path as op
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import xgboost as xgb

from pandas_profiling import ProfileReport

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron


from sklearn.model_selection import RandomizedSearchCV

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Literature

In 2019, the first COVID-19 cases are observed in China. Rapidly, the SARS-Cov2 virus spread worldwide, pushing governments to take strict decisions about the lives of their co-citizens, like containment, to protect the population. Indeed, in some cases, COVID-19 patients ended up in intensive care services and sometimes died.

The aim of our model is, based on easily computable parameters at the study's beginning, to predict whether the patient will be likely to die or if the chance of survival is important. The point of this study is to help the hospital organise in the case of a high number of cases.


The studied dataset stem from the IDDO Data Repository of COVID-19 data. This data was pulled from the underlying data collection projects on 2022-09-01. The data comes from 1,200 institutions from over 45 countries and gather various information from 700,000 hospitalised individuals.

To keep only the relevant features, we first dive into the literature, using Meta-analysis papers. First, we have been looking for aggravating factors that will likely lead the patient to ICU.

Obesity: according to a meta-analysis by Sales-Peres, there is a correlation between obesity and ICU admission. This paper also concluded that co-morbidities for obese patients, such as hypertension, type 2 diabetes, smoking habit, lung disease, and/or cardiovascular disease lead to a higher chance of ICU admission.
Age: patients aged 70 years and above have a higher risk of infection and a higher need for intensive care than patients younger than 70.
Sex: men, when infected, have a higher risk of severe COVID-19 disease and a higher need for intensive care than women\cite{pijls_demographic_2021}.
Ethnicity: the risk of contamination was higher in most ethnic minority groups than their White counterparts in North America and Europe. Among people with confirmed infection, African-Americans and Hispanic Americans were also more likely than White Americans to be hospitalised with SARS-CoV-2 infection. However, the probability of ICU admission was equivalent for all groups. Thus, ethnicity is not relevant to our question. 
Blood tests: Patients with increased pancreatic enzymes, including elevated serum lipase or amylase of either type, had worse clinical outcomes. Lower levels of lymphocytes and hemoglobin; elevated levels of leukocytes, aspartate aminotransferase, alanine aminotransferase, blood creatinine, blood urea nitrogen, high-sensitivity troponin, creatine kinase, high-sensitivity C-reactive protein, interleukin 6, D-dimer, ferritin, lactate dehydrogenase, and procalcitonin; and a high erythrocyte sedimentation rate were also associated with severe COVID-19.  

Out of a total of 3009 citations, 17 articles (22 studies, 21 from China and one study from Singapore) with 3396 ranging from 12 to1099 patients were included. Our meta-analyses showed a significant decrease in lymphocyte, monocyte, and eosinophil, hemoglobin, platelet, albumin, serum sodium, lymphocyte to C-reactive protein ratio (LCR), leukocyte to C-reactive protein ratio (LeCR), leukocyte to IL-6 ratio (LeIR), and an increase in the neutrophil, alanine aminotransferase (ALT), aspartate aminotransferase (AST), total bilirubin, blood urea nitrogen (BUN), creatinine (Cr), erythrocyte Sedimentation Rate (ESR), C-reactive protein (CRP), Procalcitonin (PCT), lactate dehydrogenase (LDH), fibrinogen, prothrombin time (PT), D-dimer, glucose level, and neutrophil to lymphocyte ratio (NLR) in the severe group compared with the non-severe group. 

No significant changes in white blood cells (WBC), Creatine Kinase (CK), troponin I, myoglobin, IL-6 and K between the two groups were observed. 

## 2. Load data after data_selection and feature_selection

In [34]:
# Open file
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_I-DataSelection.csv'), sep=',', low_memory=False, chunksize=5000, index_col = 0):
    mylist.append(chunk)
df = pd.concat(mylist, axis=0)
df.name = 'df'
del mylist

In [35]:
# Delete row where DSDECOD is NA
df = df[df.DSDECOD.notna()]
df.DSDECOD.isna().sum()

0

## 3. Search the best model/algorithm and the best parameters

We will work only on a sample of the data to try to find the best algorithm/model with the best parameters. Then we will apply the best model to the whole data set.

In [46]:
# Take a sample of the data
df_sample = df.sample(n = 10000, axis = 0, random_state = 42, replace = False)

# Separate into features and label
X = df_sample.loc[:, df_sample.columns != 'DSDECOD'].to_numpy()
y = df_sample.loc[:, df_sample.columns == 'DSDECOD'].to_numpy()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)

In [47]:
# For storage of results for each model
df_results = pd.DataFrame()

### 3.1 Model 1: Logistic regression

#### 3.1.1 Basic method

In [14]:
# Create pipeline to standardize and make logistic regression
pipe_reg = Pipeline([('clf', LogisticRegression())])

# Fit data into the model
pipe_reg.fit(X_train, y_reg_train)

# Predicting values
y_reg_pred = pipe_reg.predict(X_test)

# Calculate accuracy score
acc_score = accuracy_score(y_reg_pred, y_reg_test)
print('Accuracy score for logistic regression : ',acc_score)

df_results = df_results.append(pd.Series({"model name" : "logistic regression", "accuracy" : acc_score}), ignore_index = True)

Accuracy score for logistic regression :  0.7735652173913043


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [15]:
# See coeffs of the model
pipe_reg.named_steps['clf'].coef_

array([[-0.03163098, -0.03163098,  0.07162374,  0.98571667,  0.09700267,
        -0.40978083,  0.0519647 ,  0.41043135, -0.03441706,  0.01274007,
         0.11726346,  0.02119725, -0.39100685, -0.18191565, -0.07475868,
         0.08623388,  0.02837382,  0.15915803,  0.06301954,  0.06905836,
         0.0503761 ,  0.06332276,  0.10889381,  0.06332276,  0.22128865,
        -0.11430464,  0.03990982,  0.02850262,  0.00954175, -0.10633301,
        -0.09238852,  0.06332276,  0.22128865,  0.06857301, -0.38641174,
         0.        , -0.08333206,  0.00973305,  0.        , -0.03639733,
         0.1181282 , -0.01864331,  0.        ,  0.26474536,  0.06689652,
         0.0503761 , -0.31220915, -0.06315157,  0.        ,  0.0248374 ,
         0.06531094,  0.03766175,  0.01583329,  0.00712008,  0.02470179,
        -0.03246636, -0.02702797,  0.03713892,  0.06925871, -0.02318937,
         0.02580156, -0.00823818,  0.02079294,  0.02535451,  0.01409555,
        -0.06076724, -0.02655724,  0.0874436 , -0.0

### 3.1.2 Play with logistic regression parameters, cross validation

In [16]:
LogisticRegression().get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [17]:
# Perform Logistic regression with cross-validation

# Split dataset in features and target variable
X_reg = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_reg = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_reg_woNaN = imputer.fit_transform(X_reg)

# Get rid of rows with missing y values
X_reg_woNaN = (X_reg_woNaN[~np.isnan(y_reg)[:,0], :])
y_reg_woNaN = y_reg[~np.isnan(y_reg)]

# Split X and y into training and testing sets
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg_woNaN, y_reg_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_reg = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression())])

# Set parameters to test
param_reg = {'clf__penalty': [None, 'l1', 'l2', 'elasticnet']}

# Cross-validation
cv_reg = RandomizedSearchCV(estimator = pipe_reg, 
                                         param_distributions=param_reg, 
                                         cv=3, n_iter=30, n_jobs=-1)

# Fit data into the model
cv_reg.fit(X_reg_train, y_reg_train)

# Predicting values
y_reg_pred = cv_reg.predict(X_reg_test)

# Calculate accuracy score
acc_score = accuracy_score(y_reg_pred, y_reg_test)
print('Accuracy score for logistic regression : ',acc_score)

clf.get_params().keys()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy score for logistic regression :  0.7735652173913043


NameError: name 'clf' is not defined

In [None]:
# See best parameters of the model
g_search.best_params_

In [None]:
# See coeffs of the model
cv_reg.named_steps['clf'].coef_

## 3.2 Model 2: GradientBoostingClassifier

Estimators that allow NaN values for type classifier.

In [None]:
# Perform GradientBoostingClassifier

# Split dataset in features and target variable
X_hgbc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_hgbc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_hgbc_woNaN = imputer.fit_transform(X_hgbc)

# Get rid of rows with missing y values
X_hgbc_woNaN = (X_hgbc_woNaN[~np.isnan(y_hgbc)[:,0], :])
y_hgbc_woNaN = y_hgbc[~np.isnan(y_hgbc)]

# Split X and y into training and testing sets
X_hgbc_train, X_hgbc_test, y_hgbc_train, y_hgbc_test = train_test_split(X_hgbc_woNaN, y_hgbc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_hgbc = Pipeline([('scl', StandardScaler()), 
                     ('clf', GradientBoostingClassifier())])

# Fit data into the model
pipe_hgbc.fit(X_hgbc_train, y_hgbc_train)

# Predicting values
y_hgbc_pred = pipe_hgbc.predict(X_hgbc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_hgbc_pred, y_hgbc_test)
print('Accuracy score for gradient boosting classification : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "gradient boosting classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.3 Model 3 : K-Nearest Neighbor Classifier

In [None]:
# Perform KNeighborsClassifier

# Split dataset in features and target variable
X_knn = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_knn = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_knn_woNaN = imputer.fit_transform(X_knn)

# Get rid of rows with missing y values
X_knn_woNaN = (X_knn_woNaN[~np.isnan(y_knn)[:,0], :])
y_knn_woNaN = y_knn[~np.isnan(y_knn)]

# Split X and y into training and testing sets
X_knn_train, X_knn_test, y_knn_train, y_knn_test = train_test_split(X_knn_woNaN, y_knn_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_knn = Pipeline([('scl', StandardScaler()), 
                     ('clf', KNeighborsClassifier())])

# Fit data into the model
pipe_knn.fit(X_knn_train, y_knn_train)

# Predicting values
y_knn_pred = pipe_knn.predict(X_knn_test)

# Calculate accuracy score
acc_score = accuracy_score(y_knn_pred, y_knn_test)
print('Accuracy score for k-nearest neighbor classification : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "k-nearest neighbors", "accuracy" : acc_score}), ignore_index = True)

## 3.4 Model 4 : Naive Bayes

In [None]:
# Perform GaussianNB

# Split dataset in features and target variable
X_gnb = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_gnb = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_gnb_woNaN = imputer.fit_transform(X_gnb)

# Get rid of rows with missing y values
X_gnb_woNaN = (X_gnb_woNaN[~np.isnan(y_gnb)[:,0], :])
y_gnb_woNaN = y_gnb[~np.isnan(y_gnb)]

# Split X and y into training and testing sets
X_gnb_train, X_gnb_test, y_gnb_train, y_gnb_test = train_test_split(X_gnb_woNaN, y_gnb_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_gnb = Pipeline([('scl', StandardScaler()), 
                     ('clf', GaussianNB())])

# Fit data into the model
pipe_gnb.fit(X_gnb_train, y_gnb_train)

# Predicting values
y_gnb_pred = pipe_gnb.predict(X_gnb_test)

# Calculate accuracy score
acc_score = accuracy_score(y_gnb_pred, y_gnb_test)
print('Accuracy score for naive Bayes : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "naive Bayes", "accuracy" : acc_score}), ignore_index = True)

## 3.5 Model 5 : Random Forest Classifier

In [None]:
# Perform RandomForestClassifier

# Split dataset in features and target variable
X_rfc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_rfc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_rfc_woNaN = imputer.fit_transform(X_rfc)

# Get rid of rows with missing y values
X_rfc_woNaN = (X_rfc_woNaN[~np.isnan(y_rfc)[:,0], :])
y_rfc_woNaN = y_rfc[~np.isnan(y_rfc)]

# Split X and y into training and testing sets
X_rfc_train, X_rfc_test, y_rfc_train, y_rfc_test = train_test_split(X_rfc_woNaN, y_rfc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_rfc = Pipeline([('scl', StandardScaler()), 
                     ('clf', RandomForestClassifier())])

# Fit data into the model
pipe_rfc.fit(X_rfc_train, y_rfc_train)

# Predicting values
y_rfc_pred = pipe_rfc.predict(X_rfc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_rfc_pred, y_rfc_test)
print('Accuracy score for random forest classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "random forest", "accuracy" : acc_score}), ignore_index = True)

## 3.6 Model 6 : Support Vector Machines

### 3.6.1 Normal one

In [None]:
# Perform SVC

# Split dataset in features and target variable
X_svmn = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_svmn = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_svmn_woNaN = imputer.fit_transform(X_svmn)

# Get rid of rows with missing y values
X_svmn_woNaN = (X_svmn_woNaN[~np.isnan(y_svmn)[:,0], :])
y_svmn_woNaN = y_svmn[~np.isnan(y_svmn)]

# Split X and y into training and testing sets
X_svmn_train, X_svmn_test, y_svmn_train, y_svmn_test = train_test_split(X_svmn_woNaN, y_svmn_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_svmn = Pipeline([('scl', StandardScaler()), 
                     ('clf', svm.SVC())])

# Fit data into the model
pipe_svmn.fit(X_svmn_train, y_svmn_train)

# Predicting values
y_svmn_pred = pipe_svmn.predict(X_svmn_test)

# Calculate accuracy score
acc_score = accuracy_score(y_svmn_pred, y_svmn_test)
print('Accuracy score for normal SVC : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "normal SVM classification", "accuracy" : acc_score}), ignore_index = True)

### 3.6.2 Linear SVM

In [None]:
# Perform LinearSVM

# Split dataset in features and target variable
X_svml = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_svml = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_svml_woNaN = imputer.fit_transform(X_svml)

# Get rid of rows with missing y values
X_svml_woNaN = (X_svml_woNaN[~np.isnan(y_svml)[:,0], :])
y_svml_woNaN = y_svml[~np.isnan(y_svml)]

# Split X and y into training and testing sets
X_svml_train, X_svml_test, y_svml_train, y_svml_test = train_test_split(X_svml_woNaN, y_svml_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_svml = Pipeline([('scl', StandardScaler()), 
                     ('clf', svm.LinearSVC())])

# Fit data into the model
pipe_svml.fit(X_svml_train, y_svml_train)

# Predicting values
y_svml_pred = pipe_svml.predict(X_svml_test)

# Calculate accuracy score
acc_score = accuracy_score(y_svml_pred, y_svml_test)
print('Accuracy score for linear SVC : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "linear SVM classification", "accuracy" : acc_score}), ignore_index = True)

## 3.7 Model 7 : Multi-layer perceptrons

In [None]:
# Perform MLPClassifier

# Split dataset in features and target variable
X_mlp = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_mlp = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_mlp_woNaN = imputer.fit_transform(X_mlp)

# Get rid of rows with missing y values
X_mlp_woNaN = (X_mlp_woNaN[~np.isnan(y_mlp)[:,0], :])
y_mlp_woNaN = y_mlp[~np.isnan(y_mlp)]

# Split X and y into training and testing sets
X_mlp_train, X_mlp_test, y_mlp_train, y_mlp_test = train_test_split(X_mlp_woNaN, y_mlp_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_mlp = Pipeline([('scl', StandardScaler()), 
                     ('clf', MLPClassifier())])

# Fit data into the model
pipe_mlp.fit(X_mlp_train, y_mlp_train)

# Predicting values
y_mlp_pred = pipe_mlp.predict(X_mlp_test)

# Calculate accuracy score
acc_score = accuracy_score(y_mlp_pred, y_mlp_test)
print('Accuracy score for MLP Classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "MLP classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.8 Model 8 : Decision tree classifier 

In [None]:
# Perform DecisionTreeClassifier

# Split dataset in features and target variable
X_dtc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_dtc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_dtc_woNaN = imputer.fit_transform(X_dtc)

# Get rid of rows with missing y values
X_dtc_woNaN = (X_dtc_woNaN[~np.isnan(y_dtc)[:,0], :])
y_dtc_woNaN = y_dtc[~np.isnan(y_dtc)]

# Split X and y into training and testing sets
X_dtc_train, X_dtc_test, y_dtc_train, y_dtc_test = train_test_split(X_dtc_woNaN, y_dtc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_dtc = Pipeline([('scl', StandardScaler()), 
                     ('clf', DecisionTreeClassifier())])

# Fit data into the model
pipe_dtc.fit(X_dtc_train, y_dtc_train)

# Predicting values
y_dtc_pred = pipe_dtc.predict(X_dtc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_dtc_pred, y_dtc_test)
print('Accuracy score for Decision Tree classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "decision tree classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.9 Model 9 : ADABoost Classifier 

In [None]:
# Perform ADABoostClassifier

# Split dataset in features and target variable
X_abc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_abc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_abc_woNaN = imputer.fit_transform(X_abc)

# Get rid of rows with missing y values
X_abc_woNaN = (X_abc_woNaN[~np.isnan(y_abc)[:,0], :])
y_abc_woNaN = y_abc[~np.isnan(y_abc)]

# Split X and y into training and testing sets
X_abc_train, X_abc_test, y_abc_train, y_abc_test = train_test_split(X_abc_woNaN, y_abc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_abc = Pipeline([('scl', StandardScaler()), 
                     ('clf', DecisionTreeClassifier())])

# Fit data into the model
pipe_abc.fit(X_abc_train, y_abc_train)

# Predicting values
y_abc_pred = pipe_abc.predict(X_abc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_abc_pred, y_abc_test)
print('Accuracy score for ADABoost classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "ADABoost classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.10 Model 10 : Extra trees classifier

In [None]:
# Perform ExtraTreesClassifier

# Split dataset in features and target variable
X_etc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_etc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_etc_woNaN = imputer.fit_transform(X_etc)

# Get rid of rows with missing y values
X_etc_woNaN = (X_etc_woNaN[~np.isnan(y_etc)[:,0], :])
y_etc_woNaN = y_etc[~np.isnan(y_etc)]

# Split X and y into training and testing sets
X_etc_train, X_etc_test, y_etc_train, y_etc_test = train_test_split(X_etc_woNaN, y_etc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_etc = Pipeline([('scl', StandardScaler()), 
                     ('clf', ExtraTreesClassifier())])

# Fit data into the model
pipe_etc.fit(X_etc_train, y_etc_train)

# Predicting values
y_etc_pred = pipe_etc.predict(X_etc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_etc_pred, y_etc_test)
print('Accuracy score for Extra Trees classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "extra trees classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.11 Model 11 : Discriminant analysis

### 3.11.1 Linear discriminant analysis

In [None]:
# Perform LinearDiscriminantAnalysis

# Split dataset in features and target variable
X_lda = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_lda = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_lda_woNaN = imputer.fit_transform(X_lda)

# Get rid of rows with missing y values
X_lda_woNaN = (X_lda_woNaN[~np.isnan(y_lda)[:,0], :])
y_lda_woNaN = y_lda[~np.isnan(y_lda)]

# Split X and y into training and testing sets
X_lda_train, X_lda_test, y_lda_train, y_lda_test = train_test_split(X_lda_woNaN, y_lda_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_lda = Pipeline([('scl', StandardScaler()), 
                     ('clf', LinearDiscriminantAnalysis())])

# Fit data into the model
pipe_lda.fit(X_lda_train, y_lda_train)

# Predicting values
y_lda_pred = pipe_lda.predict(X_lda_test)

# Calculate accuracy score
acc_score = accuracy_score(y_lda_pred, y_lda_test)
print('Accuracy score for Linear Dicriminant Analysis : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "linear discriminant analysis", "accuracy" : acc_score}), ignore_index = True)

### 3.11.2 Quadratic discriminant analysis

In [None]:
# Perform QuadraticDiscriminantAnalysis

# Split dataset in features and target variable
X_qda = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_qda = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_qda_woNaN = imputer.fit_transform(X_qda)

# Get rid of rows with missing y values
X_qda_woNaN = (X_qda_woNaN[~np.isnan(y_qda)[:,0], :])
y_qda_woNaN = y_qda[~np.isnan(y_qda)]

# Split X and y into training and testing sets
X_qda_train, X_qda_test, y_qda_train, y_qda_test = train_test_split(X_qda_woNaN, y_qda_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_qda = Pipeline([('scl', StandardScaler()), 
                     ('clf', QuadraticDiscriminantAnalysis())])

# Fit data into the model
pipe_qda.fit(X_qda_train, y_qda_train)

# Predicting values
y_qda_pred = pipe_qda.predict(X_qda_test)

# Calculate accuracy score
acc_score = accuracy_score(y_qda_pred, y_qda_test)
print('Accuracy score for Quadratic Discriminant Analysis : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "quadratic discriminant classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.12 Model 12 : Stochastic gradient descent

In [None]:
# Perform SGDClassifier

# Split dataset in features and target variable
X_sgd = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_sgd = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_sgd_woNaN = imputer.fit_transform(X_sgd)

# Get rid of rows with missing y values
X_sgd_woNaN = (X_sgd_woNaN[~np.isnan(y_sgd)[:,0], :])
y_sgd_woNaN = y_sgd[~np.isnan(y_sgd)]

# Split X and y into training and testing sets
X_sgd_train, X_sgd_test, y_sgd_train, y_sgd_test = train_test_split(X_sgd_woNaN, y_sgd_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_sgd = Pipeline([('scl', StandardScaler()), 
                     ('clf', SGDClassifier())])

# Fit data into the model
pipe_sgd.fit(X_sgd_train, y_sgd_train)

# Predicting values
y_sgd_pred = pipe_sgd.predict(X_sgd_test)

# Calculate accuracy score
acc_score = accuracy_score(y_sgd_pred, y_sgd_test)
print('Accuracy score for Stochastic gradient descent classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "stochastic gradient descent", "accuracy" : acc_score}), ignore_index = True)

## 3.13 Model 13 : XGBoost

In [None]:
# Perform XGBClassifier

# Split dataset in features and target variable
X_xgb = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_xgb = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_xgb_woNaN = imputer.fit_transform(X_xgb)

# Get rid of rows with missing y values
X_xgb_woNaN = (X_xgb_woNaN[~np.isnan(y_xgb)[:,0], :])
y_xgb_woNaN = y_xgb[~np.isnan(y_xgb)]

# Split X and y into training and testing sets
X_xgb_train, X_xgb_test, y_xgb_train, y_xgb_test = train_test_split(X_xgb_woNaN, y_xgb_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_xgb = Pipeline([('scl', StandardScaler()), 
                     ('clf', xgb.XGBClassifier(objective="binary:logistic", random_state=42))])

# Fit data into the model
pipe_xgb.fit(X_xgb_train, y_xgb_train)

# Predicting values
y_xgb_pred = pipe_xgb.predict(X_xgb_test)

# Calculate accuracy score
acc_score = accuracy_score(y_xgb_pred, y_xgb_test)
print('Accuracy score for XGBoost classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "XGBoost classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.14 Model 14 : Gaussian process classifier

In [None]:
# Perform GaussianProcessClassifier

# Split dataset in features and target variable
X_gpc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_gpc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_gpc_woNaN = imputer.fit_transform(X_gpc)

# Get rid of rows with missing y values
X_gpc_woNaN = (X_gpc_woNaN[~np.isnan(y_gpc)[:,0], :])
y_gpc_woNaN = y_gpc[~np.isnan(y_gpc)]

# Split X and y into training and testing sets
X_gpc_train, X_gpc_test, y_gpc_train, y_gpc_test = train_test_split(X_gpc_woNaN, y_gpc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_gpc = Pipeline([('scl', StandardScaler()), 
                     ('clf', GaussianProcessClassifier())])

# Fit data into the model
pipe_gpc.fit(X_gpc_train, y_gpc_train)

# Predicting values
y_gpc_pred = pipe_gpc.predict(X_gpc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_gpc_pred, y_gpc_test)
print('Accuracy score for Gaussian process classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "gaussian process classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.15 Model 15 : Passive aggressive classifier

In [None]:
# Perform PassiveAggressiveClassifier

# Split dataset in features and target variable
X_pac = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_pac = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_pac_woNaN = imputer.fit_transform(X_pac)

# Get rid of rows with missing y values
X_pac_woNaN = (X_pac_woNaN[~np.isnan(y_pac)[:,0], :])
y_pac_woNaN = y_pac[~np.isnan(y_pac)]

# Split X and y into training and testing sets
X_pac_train, X_pac_test, y_pac_train, y_pac_test = train_test_split(X_pac_woNaN, y_pac_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_pac = Pipeline([('scl', StandardScaler()), 
                     ('clf', PassiveAggressiveClassifier())])

# Fit data into the model
pipe_pac.fit(X_pac_train, y_pac_train)

# Predicting values
y_pac_pred = pipe_pac.predict(X_pac_test)

# Calculate accuracy score
acc_score = accuracy_score(y_pac_pred, y_pac_test)
print('Accuracy score for Gaussian process classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "passive aggressive classifier", "accuracy" : acc_score}), ignore_index = True)

## 3.16 Model 16 : Linear perceptron classifier

In [None]:
# Perform Perceptron

# Split dataset in features and target variable
X_lpc = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_lpc = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_lpc_woNaN = imputer.fit_transform(X_lpc)

# Get rid of rows with missing y values
X_lpc_woNaN = (X_lpc_woNaN[~np.isnan(y_lpc)[:,0], :])
y_lpc_woNaN = y_lpc[~np.isnan(y_lpc)]

# Split X and y into training and testing sets
X_lpc_train, X_lpc_test, y_lpc_train, y_lpc_test = train_test_split(X_lpc_woNaN, y_lpc_woNaN, test_size=0.3, random_state=16)

# Create pipeline to standardize and make logistic regression
pipe_lpc = Pipeline([('scl', StandardScaler()), 
                     ('clf', Perceptron())])

# Fit data into the model
pipe_lpc.fit(X_lpc_train, y_lpc_train)

# Predicting values
y_lpc_pred = pipe_lpc.predict(X_lpc_test)

# Calculate accuracy score
acc_score = accuracy_score(y_lpc_pred, y_lpc_test)
print('Accuracy score for Gaussian process classifier : ',acc_score)
df_results = df_results.append(pd.Series({"model name" : "perceptron", "accuracy" : acc_score}), ignore_index = True)

In [None]:
df_results.sort_values(by = "accuracy", axis = 0, ascending  = False)