# Project 2: Covid ---> III/ Models and predictions

The purpose of this file is to test and compare several models on the matrix extracted from the "II_features-selection" file. To keep run time manageable, we work on a sample of the data. The best model will then be applied to all the data.
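
The sampling step mentioned above could look like the following minimal sketch, assuming the sample should keep the same proportion of outcomes as the full data. The helper name `stratified_sample` and the toy frame are illustrative and not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(df, label_col, n_samples, seed=0):
    """Return n_samples rows of df, keeping the class balance of label_col."""
    sample, _ = train_test_split(
        df, train_size=n_samples, stratify=df[label_col], random_state=seed
    )
    return sample


# Toy data standing in for the real matrix (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AGE": rng.normal(size=1000),
    "DSDECOD": np.repeat([0.0, 1.0], [700, 300]),
})
sample = stratified_sample(toy, "DSDECOD", n_samples=200)
print(sample["DSDECOD"].value_counts(normalize=True))
```

Stratifying on the label keeps the share of each outcome in the sample close to its share in the full data, so scores measured on the sample remain comparable.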

In [1]:
# Import
%matplotlib inline

import os
import os.path as op
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, average_precision_score, auc

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import xgboost as xgb

from sklearn.model_selection import GridSearchCV

%load_ext autoreload
%autoreload 2

## 1. Literature

In 2019, the first COVID-19 cases were observed in China. The SARS-CoV-2 virus then spread rapidly worldwide, pushing governments to take strict measures affecting the lives of their citizens, such as lockdowns, to protect the population. Indeed, some COVID-19 patients ended up in intensive care units, and some died.

**The aim of our model is to predict, from parameters that are easy to measure at the start of the study, whether a patient is likely to die or has a high chance of survival.** The point of this study is to help hospitals organise themselves when the number of cases is high.


The dataset studied stems from the IDDO Data Repository of COVID-19 data. The data were pulled from the underlying data collection projects on 2022-09-01. They come from 1,200 institutions in over 45 countries and gather various information on 700,000 hospitalised individuals.

To keep only the relevant features, we first reviewed the literature, relying on meta-analyses. We started by looking for aggravating factors likely to lead a patient to the intensive care unit (ICU).

- Obesity: according to a meta-analysis by Sales-Peres, obesity is correlated with ICU admission. The paper also concluded that co-morbidities in obese patients, such as hypertension, type 2 diabetes, smoking, lung disease, and/or cardiovascular disease, lead to a higher chance of ICU admission.
- Age: patients aged 70 years and above have a higher risk of infection and a higher need for intensive care than patients younger than 70.
- Sex: men, when infected, have a higher risk of severe COVID-19 disease and a higher need for intensive care than women \cite{pijls_demographic_2021}.
- Ethnicity: the risk of infection was higher in most ethnic minority groups than in their White counterparts in North America and Europe. Among people with confirmed infection, African-Americans and Hispanic Americans were also more likely than White Americans to be hospitalised with SARS-CoV-2 infection. However, the probability of ICU admission was equivalent for all groups. Thus, ethnicity is not relevant to our question.
- Blood tests: patients with increased pancreatic enzymes, including elevated serum lipase or amylase of either type, had worse clinical outcomes. Lower levels of lymphocytes and hemoglobin; elevated levels of leukocytes, aspartate aminotransferase, alanine aminotransferase, blood creatinine, blood urea nitrogen, high-sensitivity troponin, creatine kinase, high-sensitivity C-reactive protein, interleukin 6, D-dimer, ferritin, lactate dehydrogenase, and procalcitonin; and a high erythrocyte sedimentation rate were also associated with severe COVID-19.

One meta-analysis screened 3,009 citations and included 17 articles (22 studies: 21 from China and one from Singapore) covering 3,396 patients, with 12 to 1,099 patients per study. It showed a significant decrease in lymphocytes, monocytes, eosinophils, hemoglobin, platelets, albumin, serum sodium, lymphocyte to C-reactive protein ratio (LCR), leukocyte to C-reactive protein ratio (LeCR), and leukocyte to IL-6 ratio (LeIR), and an increase in neutrophils, alanine aminotransferase (ALT), aspartate aminotransferase (AST), total bilirubin, blood urea nitrogen (BUN), creatinine (Cr), erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), procalcitonin (PCT), lactate dehydrogenase (LDH), fibrinogen, prothrombin time (PT), D-dimer, glucose level, and neutrophil to lymphocyte ratio (NLR) in the severe group compared with the non-severe group.

No significant changes in white blood cells (WBC), creatine kinase (CK), troponin I, myoglobin, IL-6, or potassium (K) were observed between the two groups.

## 2. Load data after data_selection and feature_selection

The file we open has already been preprocessed:
- rows with a missing DSDECOD value have been removed
- the remaining NAs have been filled in
- the features have been standardised
- only the features we want to keep have been selected
- the data have been stratified by continent, and only the data from Europe have been kept
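
The properties listed above can be checked right after loading. The function below is a minimal sketch: `check_preprocessed` is a hypothetical helper, and treating every column with more than two distinct values as continuous is an assumption (binary columns such as `SEX` appear to be left as 0/1 in the file).

```python
import pandas as pd


def check_preprocessed(df, label_col="DSDECOD"):
    """Summarise whether df looks imputed, labelled, and standardised."""
    features = df.drop(columns=label_col)
    # Assumption: columns with more than two distinct values are continuous
    continuous = [c for c in features.columns if features[c].nunique() > 2]
    return {
        "n_missing": int(df.isna().sum().sum()),
        "label_values": sorted(df[label_col].unique().tolist()),
        "max_abs_mean": float(features[continuous].mean().abs().max()),
    }


# Toy frame standing in for df_train (hypothetical values)
toy = pd.DataFrame({
    "SEX": [1.0, 0.0, 1.0, 0.0],
    "AGE": [-1.0, 0.0, 0.0, 1.0],
    "DSDECOD": [0.0, 1.0, 0.0, 1.0],
})
report = check_preprocessed(toy)
print(report)
```

On the real training set, `max_abs_mean` may not be exactly zero if the standardisation was fitted on a different subset of the data, so it is better read as a rough indicator than as a strict test.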

In [2]:
# Open file
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_train_LogisticRegression.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_train = pd.concat(mylist, axis=0)
df_train.name = 'df_train'
del mylist

In [3]:
data_folder = op.join(os.getcwd(), "data", "results")
mylist = []
for chunk in pd.read_csv(op.join(data_folder, 'df_final_II-FeaturesSelection_test_LogisticRegression.csv'), sep=',', low_memory=False, chunksize=5000, index_col=0):
    mylist.append(chunk)
df_test = pd.concat(mylist, axis=0)
df_test.name = 'df_test'
del mylist

In [4]:
df_train.head(3)

Unnamed: 0,SEX,IETEST_Fever,INCLAS_VACCINES,SACAT_COMPLICATIONS,SACAT_PREVIOUS_COVID-19_INFECTION,AGE,LBTEST_APTT,LBTEST_AST,LBTEST_BILI,LBTEST_CRP,...,LBTEST_HCT,LBTEST_INR,LBTEST_LDH,LBTEST_PCT,LBTEST_SODIUM,LBTEST_UREAN,VSTEST_MAP,VSTEST_OXYSAT,VSTEST_RESP,DSDECOD
0,1.0,1.0,0.0,1.0,0.0,0.14057,1.740176,-0.033027,-0.174741,1.344486,...,-0.003288,-0.054942,0.20266,-0.020325,-0.083268,-0.658366,-0.02455,-0.190663,1.948568,0.0
1,1.0,0.0,1.0,0.0,0.0,-2.264617,-0.088959,-0.033027,-0.174741,-0.211841,...,-0.003288,-0.054942,-0.081009,-0.023414,-0.083268,-0.244866,-0.02455,0.025502,-1.207292,0.0
2,0.0,0.0,0.0,1.0,0.0,0.196505,-0.679003,-0.033027,0.005122,0.513587,...,-0.003288,-0.054942,-0.081009,-0.023414,0.459073,0.247397,-0.02455,-0.100594,0.307521,1.0


In [5]:
df_train.shape

(28000, 21)

In [6]:
df_train.columns

Index(['SEX', 'IETEST_Fever', 'INCLAS_VACCINES', 'SACAT_COMPLICATIONS',
       'SACAT_PREVIOUS_COVID-19_INFECTION', 'AGE', 'LBTEST_APTT', 'LBTEST_AST',
       'LBTEST_BILI', 'LBTEST_CRP', 'LBTEST_GLUC', 'LBTEST_HCT', 'LBTEST_INR',
       'LBTEST_LDH', 'LBTEST_PCT', 'LBTEST_SODIUM', 'LBTEST_UREAN',
       'VSTEST_MAP', 'VSTEST_OXYSAT', 'VSTEST_RESP', 'DSDECOD'],
      dtype='object')

## 3. Search for the best model/algorithm and the best parameters

We will work only on a sample of the data to try to find the best algorithm/model with the best parameters. Then we will apply the best model to the whole data set.
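
Each model in this section follows the same steps: a grid search with stratified 3-fold cross-validation on the training set, then scoring on the test set. As a sketch of how these steps could be factored into one function (the name `fit_and_score`, the fixed `random_state`, and the synthetic data are assumptions for illustration, not the notebook's code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split


def fit_and_score(name, estimator, param_grid, X_train, y_train, X_test, y_test):
    """Tune estimator by F1 with 3-fold CV, then score it on the test set."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    search = GridSearchCV(estimator, param_grid, scoring="f1", cv=cv)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    y_score = search.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "Model": name,
        "Parameters": search.best_params_,
        "Accuracy": accuracy_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "ROC AUC": roc_auc_score(y_test, y_score),
        "Average precision": average_precision_score(y_test, y_score),
    }


# Synthetic data standing in for the real train/test split
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
row = fit_and_score("Logistic regression", LogisticRegression(max_iter=200),
                    {"C": [0.01, 1.0, 100.0]}, X_tr, y_tr, X_te, y_te)
print(row)
```

Fixing `random_state` in the fold splitter makes the selected parameters reproducible from one run to the next, which the cells below do not guarantee.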

In [7]:
# Separate data into features and label
X_train = df_train.loc[:, df_train.columns!='DSDECOD']
y_train = df_train['DSDECOD']

X_test = df_test.loc[:, df_test.columns!='DSDECOD']
y_test = df_test['DSDECOD']

In [8]:
# For storage of results for each model
df_results = pd.DataFrame()

### 3.1. Model 1: Logistic regression

In [9]:
# Choose parameter values to test for cross-validation
param_grid = {"C":np.logspace(-3,3,4), 
              "penalty":["l2"]}

# Choose the estimator
estimator = LogisticRegression(max_iter=200)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'C':best_param['C'], 'penalty':best_param['penalty']}
print('Best parameters for Logistic regression:')
print('C:', parameters['C'])
print('Penalty:', parameters['penalty'])
print('------------')

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for Logistic regression:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model": "Logistic regression",
                                       "Parameters": parameters,
                                       "Accuracy": accuracy,
                                       "F1": f1,
                                       "Precision": precision,
                                       "Recall": recall,
                                       "ROC AUC": roc_auc,
                                       "Average precision": average_precision}])],
                       ignore_index=True)

Best parameters for Logistic regression:
C: 1000.0
Penalty: l2
------------
Performance for Logistic regression:
  - Accuracy score = 0.71
  - F1 score = 0.72
  - Precision score = 0.70
  - Recall score = 0.74
  - ROC AUC score = 0.79
  - Average precision score = 0.76


### 3.2. Model 2: K-Nearest Neighbor Classifier

In [10]:
# Choose parameter values to test for cross-validation
k_range = list(range(5, 21, 5))
param_grid = dict(n_neighbors=k_range)

# Choose the estimator
estimator = KNeighborsClassifier()

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'k':best_param['n_neighbors']}
print('Best parameters for K-Nearest Neighbor:')
print('k:', parameters['k'])
print('------------')

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for K-Nearest Neighbor:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model": "K-Nearest Neighbor",
                                       "Parameters": parameters,
                                       "Accuracy": accuracy,
                                       "F1": f1,
                                       "Precision": precision,
                                       "Recall": recall,
                                       "ROC AUC": roc_auc,
                                       "Average precision": average_precision}])],
                       ignore_index=True)

Best parameters for K-Nearest Neighbor:
k: 15
------------
Performance for K-Nearest Neighbor:
  - Accuracy score = 0.70
  - F1 score = 0.71
  - Precision score = 0.69
  - Recall score = 0.73
  - ROC AUC score = 0.77
  - Average precision score = 0.73


### 3.3. Model 3: Support Vector Machines

In [11]:
# Choose parameter values to test for cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf']} 

# Choose the estimator
estimator = svm.SVC(probability=True)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'C':best_param['C'], 'gamma':best_param['gamma'], 'kernel':best_param['kernel']}
print('Best parameters for SVM:')
print('C:', parameters['C'])
print('gamma:', parameters['gamma'])
print('kernel:', parameters['kernel'])
print('------------')

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for SVM:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model": "SVM",
                                       "Parameters": parameters,
                                       "Accuracy": accuracy,
                                       "F1": f1,
                                       "Precision": precision,
                                       "Recall": recall,
                                       "ROC AUC": roc_auc,
                                       "Average precision": average_precision}])],
                       ignore_index=True)

Best parameters for SVM:
C: 0.1
gamma: 0.1
kernel: rbf
------------
Performance for SVM:
  - Accuracy score = 0.72
  - F1 score = 0.74
  - Precision score = 0.70
  - Recall score = 0.79
  - ROC AUC score = 0.80
  - Average precision score = 0.77


### 3.4. Model 4: Multi-layer perceptrons

In [12]:
# Choose parameter values to test for cross-validation
param_grid = {'hidden_layer_sizes': np.arange(10, 20, 3),
              'activation': ['relu'],
              'solver': ['sgd', 'adam'],
              'alpha': [0.001, 0.05],
              'learning_rate': ['constant']}

# Choose the estimator
estimator = MLPClassifier(max_iter=300)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'hidden_layer_sizes':best_param['hidden_layer_sizes'], 'activation':best_param['activation'], 
              'solver':best_param['solver'], 'alpha':best_param['alpha'], 'learning_rate':best_param['learning_rate'], }
print('Best parameters for MLP:')
print('hidden_layer_sizes:', parameters['hidden_layer_sizes'])
print('activation:', parameters['activation'])
print('solver:', parameters['solver'])
print('alpha:', parameters['alpha'])
print('learning_rate:', parameters['learning_rate'])
print('------------')

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for MLP:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model": "MLP",
                                       "Parameters": parameters,
                                       "Accuracy": accuracy,
                                       "F1": f1,
                                       "Precision": precision,
                                       "Recall": recall,
                                       "ROC AUC": roc_auc,
                                       "Average precision": average_precision}])],
                       ignore_index=True)

Best parameters for MLP:
hidden_layer_sizes: 16
activation: relu
solver: adam
alpha: 0.05
learning_rate: constant
------------
Performance for MLP:
  - Accuracy score = 0.73
  - F1 score = 0.74
  - Precision score = 0.71
  - Recall score = 0.77
  - ROC AUC score = 0.81
  - Average precision score = 0.79


### 3.5. Model 5: Quadratic discriminant analysis

In [13]:
# Choose parameter values to test for cross-validation
param_grid = [{'reg_param': [0.1, 0.2, 0.3, 0.4, 0.5]}]

# Choose the estimator
estimator = QuadraticDiscriminantAnalysis()

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'reg_param':best_param['reg_param']}
print('Best parameters for Quadratic discriminant analysis:')
print('reg_param:', parameters['reg_param'])
print('------------')

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for Quadratic discriminant analysis:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model": "Quadratic discriminant analysis",
                                       "Parameters": parameters,
                                       "Accuracy": accuracy,
                                       "F1": f1,
                                       "Precision": precision,
                                       "Recall": recall,
                                       "ROC AUC": roc_auc,
                                       "Average precision": average_precision}])],
                       ignore_index=True)

Best parameters for Quadratic discriminant analysis:
reg_param: 0.5
------------
Performance for Quadratic discriminant analysis:
  - Accuracy score = 0.56
  - F1 score = 0.29
  - Precision score = 0.79
  - Recall score = 0.18
  - ROC AUC score = 0.77
  - Average precision score = 0.73


### 3.6. Model 6: XGBoost

In [14]:
# Choose parameter values to test for cross-validation
param_grid = {'max_depth': range(2, 10, 1),
              'n_estimators': range(60, 220, 40),
              'learning_rate': [0.1, 0.01, 0.05]
}

# Choose the estimator
estimator = xgb.XGBClassifier(objective="binary:logistic", random_state=42)

# Cross-validation with GridSearchCV
clf = GridSearchCV(estimator, param_grid, scoring='f1', cv=StratifiedKFold(n_splits=3, shuffle=True))

# Fit data into the model
clf.fit(X_train, y_train)

# Predicting values
y_pred = clf.predict(X_test)

# Predicted probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]

# Best parameters
best_param = clf.best_estimator_.get_params()
parameters = {'max_depth':best_param['max_depth'], 'n_estimators':best_param['n_estimators'], 'learning_rate':best_param['learning_rate']}
print('Best parameters for XGBoost:')
print('max_depth:', parameters['max_depth'])
print('n_estimators:', parameters['n_estimators'])
print('learning_rate:', parameters['learning_rate'])
print('------------')

# Calculate performance scores
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
average_precision = average_precision_score(y_test, y_score)

# Print performance
print('Performance for XGBoost:')
print('  - Accuracy score = {:.2f}'.format(accuracy))
print('  - F1 score = {:.2f}'.format(f1))
print('  - Precision score = {:.2f}'.format(precision))
print('  - Recall score = {:.2f}'.format(recall))
print('  - ROC AUC score = {:.2f}'.format(roc_auc))
print('  - Average precision score = {:.2f}'.format(average_precision))

# Add performance to df_results (DataFrame.append was removed in pandas 2.0)
df_results = pd.concat([df_results,
                        pd.DataFrame([{"Model": "XGBoost",
                                       "Parameters": parameters,
                                       "Accuracy": accuracy,
                                       "F1": f1,
                                       "Precision": precision,
                                       "Recall": recall,
                                       "ROC AUC": roc_auc,
                                       "Average precision": average_precision}])],
                       ignore_index=True)

Best parameters for XGBoost:
max_depth: 4
n_estimators: 100
learning_rate: 0.1
------------
Performance for XGBoost:
  - Accuracy score = 0.73
  - F1 score = 0.74
  - Precision score = 0.72
  - Recall score = 0.77
  - ROC AUC score = 0.81
  - Average precision score = 0.80


### 3.7. Models comparison

In [15]:
df_results.sort_values(by = "F1", axis=0, ascending=False).round(2).set_index('Model')

| Model | Parameters | Accuracy | F1 | Precision | Recall | ROC AUC | Average precision |
|---|---|---|---|---|---|---|---|
| XGBoost | {'max_depth': 4, 'n_estimators': 100, 'learnin... | 0.73 | 0.74 | 0.72 | 0.77 | 0.81 | 0.8 |
| MLP | {'hidden_layer_sizes': 16, 'activation': 'relu... | 0.73 | 0.74 | 0.71 | 0.77 | 0.81 | 0.79 |
| SVM | {'C': 0.1, 'gamma': 0.1, 'kernel': 'rbf'} | 0.72 | 0.74 | 0.7 | 0.79 | 0.8 | 0.77 |
| Logistic regression | {'C': 1000.0, 'penalty': 'l2'} | 0.71 | 0.72 | 0.7 | 0.74 | 0.79 | 0.76 |
| K-Nearest Neighbor | {'k': 15} | 0.7 | 0.71 | 0.69 | 0.73 | 0.77 | 0.73 |
| Quadratic discriminant analysis | {'reg_param': 0.5} | 0.56 | 0.29 | 0.79 | 0.18 | 0.77 | 0.73 |
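
The low F1 of quadratic discriminant analysis follows directly from its recall: F1 is the harmonic mean of precision and recall, so a recall of 0.18 outweighs a precision of 0.79. The short check below uses the rounded values reported above, so small rounding differences from the exact scores are possible; `f1_from` is an illustrative helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


qda_f1 = f1_from(0.79, 0.18)  # Quadratic discriminant analysis
xgb_f1 = f1_from(0.72, 0.77)  # XGBoost
print(round(qda_f1, 2), round(xgb_f1, 2))  # prints 0.29 0.74
```

This also explains why the ROC AUC of quadratic discriminant analysis (0.77) stays close to the other models: the ranking of patients is reasonable, but the default decision threshold labels very few of them as positive.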


## 4. Apply the best model/algorithm

**!!TODO!!**

In [None]:
# TEMPORARY BACKUP => goal: run a cross-validation once the best model has been chosen!

# Perform Logistic regression with cross-validation

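# Imports needed by this draft cell that are missing from the import cell
from sklearn.impute import KNNImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Split dataset in features and target variable
# NOTE: `df` is not defined in this notebook; this draft expects the data frame before imputation
X_reg = df.loc[:10000, df.columns != 'DSDECOD'].to_numpy()
y_reg = df.loc[:10000, df.columns == 'DSDECOD'].to_numpy()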

# Fill NA values
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_reg_woNaN = imputer.fit_transform(X_reg)

# Get rid of rows with missing y values
X_reg_woNaN = (X_reg_woNaN[~np.isnan(y_reg)[:,0], :])
y_reg_woNaN = y_reg[~np.isnan(y_reg)]

# Split X and y into training and testing sets
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg_woNaN, y_reg_woNaN, test_size=0.3, random_state=16)

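# Create pipeline to standardize and make logistic regression
# 'saga' is the only solver that supports every penalty tested below;
# l1_ratio is only used when the penalty is 'elasticnet'
pipe_reg = Pipeline([('scl', StandardScaler()),
                     ('clf', LogisticRegression(solver='saga', l1_ratio=0.5, max_iter=1000))])

# Set parameters to test
param_reg = {'clf__penalty': [None, 'l1', 'l2', 'elasticnet']}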

# Cross-validation
cv_reg = RandomizedSearchCV(estimator = pipe_reg, 
                                         param_distributions=param_reg, 
                                         cv=3, n_iter=30, n_jobs=-1)

# Fit data into the model
cv_reg.fit(X_reg_train, y_reg_train)

# Predicting values
y_reg_pred = cv_reg.predict(X_reg_test)

# Calculate accuracy score (true labels first, predictions second)
acc_score = accuracy_score(y_reg_test, y_reg_pred)
print('Accuracy score for logistic regression: ', acc_score)

clf.get_params().keys()

In [None]:
# See best parameters of the model
cv_reg.best_params_

In [None]:
# See coefficients of the model (the fitted pipeline is stored in best_estimator_)
cv_reg.best_estimator_.named_steps['clf'].coef_