# Base Model: Logistic Regression

## Date: Nov 9, 2023

------------------

## Introduction

In this notebook, a baseline classification model will be established using Logistic Regression. Log reg is a good baseline as it is one of the simplest classification models, offers high explainability, and is computationally light compared to its counterparts. Specifically, it offers greater explainability via odds ratios than its more optimized counterpart, SVM. After the data is read in, basic assumptions are established and checked. Any features that show high collinearity or multicollinearity are removed. Then 3 iterations of the log reg are run:
1. Unbalanced unscaled dataset. This allows us to evaluate how balancing the dataset affects the model's performance
2. Balanced scaled dataset. This is the first baseline model to be used
3. Optimized model. An optimized model will be built by varying the solver, iterations, and regularization to achieve the best baseline log reg model.

The optimizations will be done manually to demonstrate the process.

----------------

### Table of Contents

1. [Introduction](#Introduction)
   - [Table of Contents](#Table-of-contents)
   - [Import Librarys](#Import-Librarys)
   - [Data Dictionary](#Data-Dictionary)
   - [Define Functions](#Define-Functions)
   - [Load the data](#Load-the-data)
2. [Logistic Regression Model](#Logistic-Regression-Model)
   - [Assumptions](#Assumptions)
   - [PreProcessing](#PreProcessing)
   - [Modelling](#Modelling)
   - [Evaluation](#Evaluation)
3. [Conclusion](#Conclusion)


### Import Librarys

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector

from statsmodels.stats.outliers_influence import variance_inflation_factor

### Data Dictionary

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [None]:
#pathlib is used to ensure compatibility across operating systems
try:
    data_destination = Path('../Data/Lending_club/Lending Club Data Dictionary Approved.csv')
    dict_df = pd.read_csv(data_destination, encoding='ISO-8859-1')
    display(dict_df.iloc[:,0:2])
except FileNotFoundError as e:
    print(e)
    print('Check file location')

### Load the Data

In [None]:
# Define the relative path to the file
parquet_file_path = Path('../Data/Lending_club/model_cleaned')

try:
    # Read the parquet file
    loans_df = pd.read_parquet(parquet_file_path)
except FileNotFoundError as e:
    print(e)
    print('Check file location')

In [None]:
loans_df.head()

## Logistic Regression Model

-------------

### Assumptions 

Before we can start modeling, some base assumptions must be met in order to use a log reg model.   
These include:  
* **Binary Outcome:** The dependent variable is a binary. This is met as loan status has been encoded as 1 and 0
* **Independence:** It is reasonable to assume loans are independent. Without identifiable information, there is no way of knowing from the dataset whether a borrower has applied for multiple loans, as the member_id data has been removed by LendingClub.
* **No collinearity / multicollinearity.** This will be checked
* **Sufficiently Large sample size:** This is met

### Collinearity

Plot a correlation heatmap for the remaining features.

In [None]:
# Select only the numeric columns for the correlation matrix
numeric_df = loans_df.select_dtypes(include=[np.number])

# Calculate the correlation matrix
corr = numeric_df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(14, 10))
sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, annot=True)
plt.show()

### Collinearity / Multicollinearity

We will check for multicollinearity and collinearity before we split the data or encode categorical variables. We will first check for multicollinearity using the Variance Inflation Factor (VIF). 

Create a dataframe with the vif values for each feature

In [None]:
#create a dataframe to hold the vif scores for each feature
vif_data = pd.DataFrame()
vif_data['feature'] = numeric_df.columns

Calculate the VIF score for each feature and place it in the dataframe. This may take a few moments, as each feature is regressed on all the other features. 
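
To make the quantity concrete: the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the remaining features. The sketch below computes this by hand on synthetic data. It assumes an intercept is included in each auxiliary regression, which may differ from `variance_inflation_factor`, since that function works on the design matrix exactly as it is passed in. The helper `vif_by_hand` and the toy columns `a`, `b`, `c` are hypothetical names used only for illustration.

```python
import numpy as np

def vif_by_hand(X):
    """Return VIF_i = 1 / (1 - R_i^2) for every column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        # Regress column i on all other columns plus an intercept term
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b

independent_vifs = vif_by_hand(np.column_stack([a, b]))
collinear_vifs = vif_by_hand(np.column_stack([a, b, c]))
print(independent_vifs)  # close to 1: no multicollinearity
print(collinear_vifs)    # far above the cutoff of 10
```

Adding the near-redundant column `c` inflates the VIF of every column involved, not only that of `c` itself, which is why high-VIF features are inspected before being dropped.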

In [None]:
#define a vif threshold
vif_cutoff = 10

#calculate the vif. This may take a few minutes
print('Running vif calculations')
vif_data['VIF'] = [variance_inflation_factor(numeric_df.values, i) for i in range(len(numeric_df.columns))]

In [None]:
vif_data.sort_values(by=['VIF'], ascending=False)

Create a list of the columns with a vif greater than the threshold

In [None]:
high_vif_columns = vif_data[vif_data['VIF'] > vif_cutoff]['feature'].tolist()

Before we drop the features with high vif, we will inspect them

In [None]:
display(high_vif_columns)

We will leave `loan_amnt`, `term`, `int_rate`, as these are key features of the dataset.

In [None]:
# Drop features with high VIF
# https://easystats.github.io/performance/reference/check_collinearity.html#:~:text=Interpretation%20of%20the%20Variance%20Inflation%20Factor&text=A%20VIF%20less%20than%205,model%20predictors%20(James%20et%20al.
filtered_high_vif_columns = [feature for feature in high_vif_columns if feature not in ['loan_amnt', 'term', 'int_rate']]

loans_df.drop(columns = filtered_high_vif_columns, inplace=True)

The remaining features:

In [None]:
loans_df.head(0)

***Collinearity***

In [None]:
# Select only the numeric columns for the correlation matrix
numeric_df = loans_df.select_dtypes(include=[np.number])

# Calculate the correlation matrix
corr = numeric_df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(8, 8))
sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, annot=True)
plt.show()

There are still some high correlations between variables. There are no major features highly correlated with the target variable. 

All the assumptions have now been met.

### PreProcessing

For the first iteration, no scaling or resampling will be done; only the categorical variables are encoded. For the second iteration, a standard scaler is applied and the data imbalance is addressed.

***Train Test Split***

In [None]:
# Split the data
X = loans_df.drop(columns=['loan_status'], inplace=False)
y = loans_df['loan_status']

# Split into train and test sets. Stratify so the class imbalance of the original data is preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11, stratify=y)

# Same split, kept under separate names for the unbalanced (1st iteration) model
unbal_X_train, unbal_X_test, unbal_y_train, unbal_y_test = train_test_split(X, y, test_size=0.3, random_state=11, stratify=y)

***Data Imbalance***

As shown in EDA, there is a large imbalance between the number of successful loans (class 1) and failed loans (class 0), approximately 80/20. Since the dataset is sufficiently large, it is acceptable to downsample the instances of class 1 to equal class 0. Balancing the dataset reduces the risk of the model being biased toward a single class simply due to its frequency in the dataset.  

This also has the added advantage of giving a more manageable dataset size. However, if computation power is not an issue, then more failed loans could be sampled from the original dataset, and / or synthetic data created for the minority class using SMOTE.  
More information can be found here:  
https://towardsdatascience.com/smote-fdce2f605729

In [None]:
print('Number of class 1 examples before:', X_train[y_train == 1].shape[0])

# Downsample majority class
X_downsampled, y_downsampled  = resample(X_train[y_train == 1],
                                   y_train[y_train == 1],
                                   replace=False,
                                   n_samples=X_train[y_train == 0].shape[0],
                                   random_state=1)

print('\nNumber of class 1 examples after:', X_downsampled.shape[0])

The downsampled class 1 examples can now be combined with the class 0 examples from the train set.

In [None]:
# Combine the downsampled successful loans with the failed loans. Will keep as a df since changing to 
X_train_bal = pd.concat([X_train[y_train == 0], X_downsampled])
y_train_bal = np.hstack((y_train[y_train == 0], y_downsampled))

print("New X_train shape: ", X_train_bal.shape)
print("New y_train shape: ", y_train_bal.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

***Inspect Categorical Features***

Inspect whether the categorical features are ordinal or nominal. 

In [None]:
categorical_columns = X_train_bal.select_dtypes('object').columns.tolist()
display(categorical_columns)
categorical_columns.remove('verification_status')

In [None]:
X_train_bal['verification_status'].value_counts()

The feature `verification_status` will be **ordinal encoded** since loan applications with verified information should be weighted higher than those with unverified info. The other categorical features can be one-hot encoded.
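
Ordinal encoding amounts to mapping each category to its position in an explicitly ordered list. The sketch below illustrates the idea, assuming the order 'Not Verified' < 'Source Verified' < 'Verified' that is passed to `OrdinalEncoder` in this notebook. The helper `ordinal_encode` and the sample values are hypothetical and not part of scikit-learn.

```python
# Ordered from least to most verified; the position in the list becomes the code
VERIFICATION_ORDER = ['Not Verified', 'Source Verified', 'Verified']

def ordinal_encode(values, ordered_categories):
    """Map each value to its index in ordered_categories."""
    lookup = {category: rank for rank, category in enumerate(ordered_categories)}
    return [lookup[value] for value in values]

sample = ['Verified', 'Not Verified', 'Source Verified', 'Verified']
encoded = ordinal_encode(sample, VERIFICATION_ORDER)
print(encoded)  # [2, 0, 1, 2]
```

Note that this encoding also implies equal spacing between the levels, which is a modelling assumption in its own right.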

***Column Transformation for 1st iteration***

Just encode the categorical variables as described above.

In [None]:
#instantiate onehot encoder
unbal_categorical_transformer = OneHotEncoder(handle_unknown='ignore')

#instantiate ordinal encoder
unbal_ordinal_transformer = OrdinalEncoder(categories=[['Not Verified', 'Source Verified', 'Verified']])

#instantiate the column transformer
unbal_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', unbal_categorical_transformer, ['home_ownership', 'purpose', 'application_type']),
        ('ord', unbal_ordinal_transformer, ['verification_status']),
    ],
    remainder='passthrough',
    n_jobs=2 #use 2 cpu cores for greater speed
)

#fit to the train set
unbal_preprocessor.fit(unbal_X_train)

#transform the train and test sets
unbal_X_train_transformed = unbal_preprocessor.transform(unbal_X_train)
unbal_X_test_transformed = unbal_preprocessor.transform(unbal_X_test)

print("Shape of train transformed: ", unbal_X_train_transformed.shape)
print("Shape of test transformed: ",  unbal_X_test_transformed.shape)

***Column Transformation for 2nd Iteration***

For the second iteration, a standard scaler is fit as well. Although log reg is not a distance-based model, scaling can aid model performance by putting all features on a comparable scale, which improves the conditioning of the optimization problem and allows the solver to converge more easily.  
More information can be found here:  
https://forecastegy.com/posts/does-logistic-regression-require-feature-scaling/
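
As a rough illustration of what standard scaling does, the sketch below standardizes a single feature with z = (x - mean) / std. The mean and standard deviation are learned from the train values only and then reused on the test values, mirroring the fit-on-train, transform-both pattern used in this notebook. The helpers `fit_scaler` and `apply_scaler` and the loan amounts are hypothetical; the population standard deviation is used, on the understanding that this matches `StandardScaler`.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from the train set only."""
    return fmean(train_values), pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize values with previously learned statistics: z = (x - mean) / std."""
    return [(x - mean) / std for x in values]

train_amounts = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]  # hypothetical loan amounts
test_amounts = [3000.0, 6000.0]

mean, std = fit_scaler(train_amounts)
train_scaled = apply_scaler(train_amounts, mean, std)
test_scaled = apply_scaler(test_amounts, mean, std)
print(train_scaled)  # mean 0 and standard deviation 1 by construction
print(test_scaled)   # scaled with the train statistics, so not necessarily mean 0
```

Reusing the train statistics on the test set keeps information from the test data out of the preprocessing step.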

In [None]:
#instantiate onehot encoder
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

#instantiate ordinal encoder
ordinal_transformer = OrdinalEncoder(categories=[['Not Verified', 'Source Verified', 'Verified']])

#combine into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['home_ownership', 'purpose', 'application_type']),
        ('ord', ordinal_transformer, ['verification_status']),
        ('num', StandardScaler(), make_column_selector(dtype_include=['int64','int32','float64','float32']))
    ],
    remainder='passthrough',
    n_jobs=2
)

#fit to the train set
preprocessor.fit(X_train_bal)

#transform the train and test sets
X_train_transformed = preprocessor.transform(X_train_bal)
X_test_transformed = preprocessor.transform(X_test)

print("Shape of train transformed: ", X_train_transformed.shape)
print("Shape of test transformed: ", X_test_transformed.shape)

***Run the model 1st iteration***

In [None]:
# Initializing and training the logistic regression model
unbal_log_reg = LogisticRegression(random_state=1,
                             solver='lbfgs', 
                             max_iter=4000, 
                             verbose=2, #output while the model runs
                             n_jobs=2) #use 2 cpu cores
                             #class_weight='balanced') #weight the class to counter more frequent class

unbal_log_reg.fit(unbal_X_train_transformed, unbal_y_train)

# Making predictions on the test data using the trained model
unbal_y_pred = unbal_log_reg.predict(unbal_X_test_transformed)

***Score the 1st iteration model***

In [None]:
# Scoring the model on both train and test data
unbal_train_score = unbal_log_reg.score(unbal_X_train_transformed, unbal_y_train)
unbal_test_score = unbal_log_reg.score(unbal_X_test_transformed, unbal_y_test)
print(f'Score on train: {unbal_train_score}')
print(f'Score on test: {unbal_test_score}')

# Evaluating the model with confusion matrix and a classification report
conf_matrix = confusion_matrix(unbal_y_test, unbal_y_pred)
class_report = classification_report(unbal_y_test, unbal_y_pred)

ConfusionMatrixDisplay.from_estimator(
    unbal_log_reg, 
    unbal_X_test_transformed, 
    unbal_y_test, 
    cmap='Blues', 
    display_labels=['Class 0', 'Class 1']
)

plt.title('Confusion Matrix for Logistic Regression')
plt.show()
print("-"*20)
print("Confusion matrix:")
print(conf_matrix)
print("-"*20)
print(class_report)
print("-"*20)

num_failed = conf_matrix[0,:].sum()
num_successful = conf_matrix[1,:].sum()

print("Number of failed loans: ", num_failed)
print("Number of successful loans: ", num_successful)

TODO: interpret the results of the 1st iteration.

***Run the model 2nd iteration***

The log reg model is ready to be run. A log reg model is run on both the balanced, downsampled data and the imbalanced data, to showcase how class balance affects model evaluation. The log reg model will use the `lbfgs` solver as it performs well on small datasets, even though it may not converge. If the model does not converge, we will check for any features with high multicollinearity, and a different solver or a higher iteration count can be used. Note that instead of downsampling the set of successful loans in the dataset, the `class_weight` parameter can be used.

https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions
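
For reference, the `class_weight='balanced'` alternative mentioned above reweights each class by `n_samples / (n_classes * class_count)`, so the rarer class contributes more per example to the loss. The sketch below reproduces that formula in plain Python on a hypothetical 80/20 label list. The helper `balanced_class_weights` is an illustrative name, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical 80/20 split mirroring the approximate imbalance described in this notebook
toy_labels = [1] * 80 + [0] * 20
weights = balanced_class_weights(toy_labels)
print(weights)  # {1: 0.625, 0: 2.5}
```

Unlike downsampling, this keeps every training example, at the cost of fitting on the full, larger dataset.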

In [None]:
# Initializing and training the logistic regression model
log_reg = LogisticRegression(random_state=1,
                             solver='lbfgs', 
                             max_iter=4000, 
                             verbose=2, #output while the model runs
                             n_jobs=2) #use 2 cpu cores
                             #class_weight='balanced') #weight the class to counter more frequent class

log_reg.fit(X_train_transformed, y_train_bal)

# Making predictions on the test data using the trained model
y_pred_bal = log_reg.predict(X_test_transformed)

## Model Evaluation

In [None]:
# Scoring the model on both train and test data
train_score = log_reg.score(X_train_transformed, y_train_bal)
test_score = log_reg.score(X_test_transformed, y_test)
print(f'Score on train: {train_score}')
print(f'Score on test: {test_score}')

# Evaluating the model with confusion matrix and a classification report
conf_matrix = confusion_matrix(y_test, y_pred_bal)
class_report = classification_report(y_test, y_pred_bal)

ConfusionMatrixDisplay.from_estimator(
    log_reg, 
    X_test_transformed, 
    y_test, 
    cmap='Blues', 
    display_labels=['Class 0', 'Class 1']
)

plt.title('Confusion Matrix for Logistic Regression')
plt.show()
print("-"*20)
print("Confusion matrix:")
print(conf_matrix)
print("-"*20)
print(class_report)
print("-"*20)

# Row sums of the confusion matrix give the number of actual examples in each class
num_failed = conf_matrix[0,:].sum()
num_successful = conf_matrix[1,:].sum()

print("Number of failed loans: ", num_failed)
print("Number of successful loans: ", num_successful)

### Interpretation

Score on train: 0.6504545944033352
Score on test: 0.6566522963815273

Our model has approximately 65.0% accuracy on the train set and 65.7% accuracy on the test set. The scores on the train and test sets are close, meaning the model generalizes to unseen data and shows no sign of overfitting. However, this will be further explored with the classification report.
 

The confusion matrix above shows the counts for correctly and incorrectly predicted classes, in the format of:  
```
Predicted Label 
    0      1 
+------+------+  
| TN   |  FP  |  0
+------+------+     True Label
| FN   |  TP  |  1
+------+------+  
```


Where,
- **True Negative (TN):** 17,788 loans were correctly predicted as failed (class 0).
- **False Positive (FP):** 10,071 cases were incorrectly predicted as successful (class 1) when they are actually failed (class 0).
- **False Negative (FN):** 35,883 cases were incorrectly predicted as failed (class 0) when they are actually successful (class 1).
- **True Positive (TP):** 70,099 cases were correctly predicted as successful (class 1).  

The model showed a moderate ability to discern successful from failed loans, with the majority of successful loans being accurately identified.
For this project, the primary goal is to minimize false positives, i.e. instances of failed loans incorrectly predicted as successful, in order to minimize credit default risk. Of the 27,859 failed loans, 17,788 were correctly predicted as failed, and of the 105,982 successful loans, 70,099 were correctly predicted as successful. The model will be tuned for precision; however, this can be adjusted based on the lender's risk appetite, allowing for a more balanced approach between granting credit and managing default risk. 


Classification Report
- **Precision for Class 0:** 0.33, meaning when the model predicts failed, it is correct ~ 33% of the time.
- **Recall for Class 0:** 0.64, meaning that the model correctly identifies ~ 64% of the actual failed cases.
- **F1-Score for Class 0:** 0.44, the harmonic mean of precision and recall for failed loans, indicating a moderate balance between precision and recall for this class.
- **Support for Class 0:** There are 27,859 actual occurrences of failed loans in the dataset. 

- **Precision for Class 1:** 0.87, suggesting that when the model predicts successful, it is correct ~ 87% of the time.
- **Recall for Class 1:** 0.66, meaning that the model correctly identifies ~ 66% of the actual successful cases.
- **F1-Score for Class 1:** 0.75, the harmonic mean of precision and recall for successful loans, indicating a strong balance between precision and recall for the successful class.
- **Support for Class 1:** There are 105,982 actual occurrences of successful loans in the dataset.

Overall Metrics
- **Accuracy:** 0.66, indicating that overall, the model correctly predicts 66% of the cases.
- **Macro Average Precision:** 0.60, the average precision across both classes.
- **Macro Average Recall:** 0.65, the average recall across both classes.
- **Macro Average F1-Score:** 0.59, the average F1-score across both classes.
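
The figures in the classification report can be re-derived directly from the four confusion matrix counts listed above. The sketch below does this in plain Python as a sanity check. The helper `precision_recall_f1` is a hypothetical name, and class 1 (successful loan) is treated as the positive class.

```python
# Counts taken from the confusion matrix reported above (class 1 = successful loan)
tn, fp = 17_788, 10_071   # actual class 0: predicted 0, predicted 1
fn, tp = 35_883, 70_099   # actual class 1: predicted 0, predicted 1

def precision_recall_f1(correct, wrongly_predicted, missed):
    """Precision, recall, and F1 (harmonic mean of the two) for one class."""
    precision = correct / (correct + wrongly_predicted)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For class 0: TN are correct, FN are wrong class 0 predictions, FP are missed class 0 cases
class_0 = precision_recall_f1(tn, fn, fp)
# For class 1: TP are correct, FP are wrong class 1 predictions, FN are missed class 1 cases
class_1 = precision_recall_f1(tp, fp, fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print([round(v, 2) for v in class_0])  # [0.33, 0.64, 0.44]
print([round(v, 2) for v in class_1])  # [0.87, 0.66, 0.75]
print(round(accuracy, 2))              # 0.66
```

Precision for class 0 is low because the 35,883 successful loans predicted as failed outnumber the 17,788 correctly identified failed loans.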

Overall, the model performs better at identifying class 1 cases than class 0 cases. Although the train set was balanced, the test set was left unbalanced in order to evaluate how the model would perform with real-world raw data. This imbalance is likely the cause of the low precision for class 0: recall is computed only over the actual failed loans and so is not affected by class frequency, but because class 1 is so much larger, even a moderate share of misclassified successful loans (35,883) outnumbers the correctly identified failed loans (17,788). The low precision for class 0 could be costly, as most loans flagged as failed are actually successful and represent lost lending opportunities. Furthermore, the relatively low F1-Score means that both precision and recall could be improved, especially precision. The number of occurrences for class 0 is approximately 26% of class 1.


On the other hand, the model scored a higher `F1-Score` and `precision` for class 1, meaning the model struck a good balance between precision and recall.
Of the total successful loans, the model identified ~ 66%. Ideally this would be higher, to minimize any potential lost lending opportunities.


We can vary the decision threshold to trade off precision against recall.

In [None]:
from sklearn.metrics import precision_score, recall_score
import numpy as np
import matplotlib.pyplot as plt

#get the probabilities for the positive class
y_proba = log_reg.predict_proba(X_test_transformed)[:, 1]

# Vary thresholds in steps of 0.05 from 0.05 to 0.95
thresholds = np.arange(0.05, 1, 0.05)

precisions = []
recalls = []

for threshold in thresholds:
    # Apply threshold
    y_threshold = np.where(y_proba > threshold, 1, 0)
    
    # Calculate precision and recall
    precision = precision_score(y_test, y_threshold)
    recall = recall_score(y_test, y_threshold)
    
    # Append to list
    precisions.append(precision)
    recalls.append(recall)

# Visualize the result
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions, label='Precision', marker='o')
plt.plot(thresholds, recalls, label='Recall', marker='o')
plt.title('Precision and Recall scores as a function of the decision threshold')
plt.xlim(0, 1)
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Calculate the AUC
roc_auc = auc(fpr, tpr)

# Plotting the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

***Feature importance***

Explore which features were most useful in the prediction by inspecting their weights, keeping an eye out for any leaky features that may have been missed. 

In [None]:
#get the feature weights out. 
feature_weights = pd.DataFrame({
    'Feature': preprocessor.get_feature_names_out(),
    'Coefficient': log_reg.coef_[0]
})

# Sort the features by their coefficient value (most negative first)
feature_weights = feature_weights.sort_values(by='Coefficient', ascending=True)

In [None]:
# Plotting the feature weights
plt.figure(figsize=(10, 10))
plt.barh(feature_weights['Feature'], feature_weights['Coefficient'], color='lightblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.title('Feature Importance')
plt.show()

The categorical features home ownership and loan purpose are the most positively predictive, with home ownership, interest rate, and loan term being the most negatively predictive. The coefficients are log odds, so they can be interpreted by converting them to odds ratios.

In [None]:
log_odds = log_reg.coef_[0]
odds = np.exp(log_odds)

feature_names = preprocessor.get_feature_names_out()
odds_df = pd.DataFrame({'Feature': feature_names, 'LogOdds': log_odds, 'OddsRatio': odds})

#sort df by OddsRatio
odds_df = odds_df.sort_values(by='OddsRatio', ascending=False)

We can look at the odds ratios, i.e. for a unit increase in a feature, by what factor the odds are multiplied.
For example, for someone with no home ownership:

Home Ownership - None: Given an odds ratio of 1.37, the odds of the target event are 37% higher for individuals with no home ownership compared to the baseline group, holding all other variables constant.
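
The link between a coefficient, its odds ratio, and the percentage change in the odds can be sketched as follows. The coefficients below are hypothetical values chosen for illustration (the first is simply the log of the 1.37 odds ratio quoted above), not values read from the fitted model, and the helper names are illustrative.

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log odds) into an odds ratio."""
    return math.exp(coefficient)

def percent_change_in_odds(coefficient):
    """Percentage change in the odds for a one-unit increase in the feature."""
    return (odds_ratio(coefficient) - 1.0) * 100.0

# Hypothetical coefficients chosen for illustration, not taken from the fitted model
for coef in (math.log(1.37), 0.0, -0.5):
    print(f"coef={coef:+.3f}  "
          f"odds ratio={odds_ratio(coef):.2f}  "
          f"change={percent_change_in_odds(coef):+.1f}%")
```

An odds ratio above 1 raises the odds, a ratio of exactly 1 leaves them unchanged, and a ratio below 1 lowers them. For the standardized numeric features, one unit corresponds to one standard deviation.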

***3rd Iteration***

The final iteration for Log Reg. 

In [None]:
#instantiate onehot encoder
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

#instantiate ordinal encoder
ordinal_transformer = OrdinalEncoder(categories=[['Not Verified', 'Source Verified', 'Verified']])

#combine into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['home_ownership', 'purpose', 'application_type']),
        ('ord', ordinal_transformer, ['verification_status']),
        ('num', StandardScaler(), make_column_selector(dtype_include=['int64','int32','float64','float32']))
    ],
    remainder='passthrough',
    n_jobs=2
)

#fit to the train set
preprocessor.fit(X_train_bal)

#transform the train and test sets
X_train_transformed = preprocessor.transform(X_train_bal)
X_test_transformed = preprocessor.transform(X_test)

print("Shape of train transformed: ", X_train_transformed.shape)
print("Shape of test transformed: ", X_test_transformed.shape)

In [None]:
from sklearn.decomposition import PCA

estimators = [('scaler', StandardScaler()),
              ('dim_redu', PCA()),
              ('model', LogisticRegression())]

In [None]:
# Create the column transformations list + columns to which to apply
col_transforms = [('onehot', OneHotEncoder(), ['home_ownership', 'purpose', 'application_type']),
                  ('ordinal', OrdinalEncoder(), ['verification_status'])]

preprocessor = ColumnTransformer(col_transforms)

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


#instantiate ordinal encoder
ordinal_transformer = OrdinalEncoder(categories=[['Not Verified', 'Source Verified', 'Verified']])

#combine into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['home_ownership', 'purpose', 'application_type']),
        ('ord', ordinal_transformer, ['verification_status']),
        ('num', StandardScaler(), make_column_selector(dtype_include=['int64','int32','float64','float32']))
    ],
    remainder='passthrough',
    n_jobs=2
)

# Fit the column transformer to the balanced train set
preprocessor.fit(X_train_bal)

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline(estimators)

In [None]:
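from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

param_grid = [{'scaler': [StandardScaler(), None, MinMaxScaler(), PowerTransformer()],
               'dim_redu': [PCA(), KernelPCA()],
               'model': [LogisticRegression()],
               'model__C': [10**i for i in range(-3, 3)]}]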

In [None]:
from sklearn.model_selection import ParameterGrid

params = ParameterGrid(param_grid)

In [None]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, verbose=4)
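
Before launching the search, it can help to estimate its cost: an exhaustive grid search fits one model per parameter combination per cross-validation fold. The sketch below counts the fits for a grid shaped like the one in this notebook (4 scaler options, 2 dimensionality reducers, 6 values of `C`, 5 folds). The dictionary `toy_grid` is a hypothetical stand-in that uses strings in place of the estimator objects.

```python
from itertools import product

# Hypothetical stand-in for the search space: strings replace the estimator objects
toy_grid = {
    'scaler': ['standard', 'none', 'minmax', 'power'],
    'dim_redu': ['pca', 'kernel_pca'],
    'model__C': [10.0 ** i for i in range(-3, 3)],
}
cv_folds = 5

combinations = list(product(*toy_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # excludes the final refit on the full train set
print(n_candidates, n_fits)  # 48 240
```

With a dataset of this size, 240 pipeline fits may take a long time, so it may be worth trimming the grid or searching on a subsample first.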

In [None]:
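# Fit on the encoded, balanced train set; the raw X_train still contains string columns
fitted_search = grid.fit(X_train_transformed, y_train_bal)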

In [None]:
fitted_search.best_params_

In [None]:
fitted_search.cv_results_['mean_test_score']

## Conclusion

The logistic regression model performed quite well considering its explainability and ease of use. The baseline model achieved approximately 65% accuracy on the train set and 66% on the test set. 