# Base Model: Logistic Regression

## Date: Nov 9, 2023

------------------

## Introduction

In this notebook, we establish a baseline classification model using logistic regression. Logistic regression is a good baseline because it is one of the simplest classification models and is highly explainable compared to its counterparts. It can also be far less computationally heavy.   
After the data is read in, the remaining leaky columns are dropped, and the model assumptions are stated and checked. All checks were completed except the VIF-based multicollinearity check, which hit a bug when running VIF. Once the assumptions have been checked, a model is fitted and evaluated.

----------------

### Table of Contents

1. [Introduction](#Introduction)
   - [Table of Contents](#Table-of-Contents)
   - [Import Libraries](#Import-Libraries)
   - [Data Dictionary](#Data-Dictionary)
   - [Load the Data](#Load-the-Data)
2. [Logistic Regression Model](#Logistic-Regression-Model)
   - [Feature Engineering](#Feature-Engineering)
   - [Assumptions](#Assumptions)
   - [Modelling](#Modelling)
   - [Evaluation](#Evaluation)
3. [Conclusion](#Conclusion)


### Import Libraries

In [None]:
%pip install statsmodels

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from statsmodels.stats.outliers_influence import variance_inflation_factor

### Data Dictionary

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [None]:
# pathlib is used to ensure compatibility across operating systems
try:
    data_destination = Path('../Data/Lending_club/Lending Club Data Dictionary Approved.csv')
    dict_df = pd.read_csv(data_destination, encoding='ISO-8859-1')
    # Show only the first two columns of the data dictionary
    display(dict_df.iloc[:, 0:2])
except FileNotFoundError as e:
    # Print the whole exception so the missing path is shown
    print(e)
    print('Check file location')

### Load the Data

In [None]:
# Define the relative path to the file
parquet_file_path = Path('../Data/Lending_club/Cleaned')

try:
    # Read the parquet file
    loans_df = pd.read_parquet(parquet_file_path)
except FileNotFoundError as e:
    # Print the whole exception; e.args is not guaranteed to have a second element
    print(e)
    print('Check file location')

In [None]:
loans_df.head()

## Logistic Regression Model

-------------

### Feature Engineering

***There is some data prep still left from EDA.***

Drop any leaky columns left over from EDA.

In [None]:
loans_df.drop(columns=['funded_amnt', 'funded_amnt_inv', 'chargeoff_within_12_mths', 'delinq_amnt'], inplace=True)

# Also drop any categorical columns with too many categories for one-hot encoding
loans_df.drop(columns=['issue_d', 'earliest_cr_line'], inplace=True)

Map successful (fully paid) loans to 1, and defaulted or charged-off loans to 0, in our target column.

In [None]:
loans_df['loan_status'] = loans_df['loan_status'].apply(lambda x: 1 if x == 'Fully Paid' else 0)
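
Note that the lambda above sends every status other than `'Fully Paid'` to 0. As a hedged alternative, the sketch below uses an explicit mapping so that any unexpected status surfaces as a missing value instead of silently being counted as a failed loan. The labels `'Charged Off'` and `'Default'`, the `'Current'` example, and the toy series are assumptions for illustration, not values confirmed from the dataset.

```python
import pandas as pd

# Assumed status labels; adjust to whatever loans_df['loan_status'] really contains
status_map = {'Fully Paid': 1, 'Charged Off': 0, 'Default': 0}

# Toy stand-in for the loan_status column (hypothetical values)
statuses = pd.Series(['Fully Paid', 'Charged Off', 'Default', 'Current'])

# Statuses missing from the mapping become NaN instead of 0
mapped = statuses.map(status_map)

# List the statuses that were not covered by the mapping
unexpected = statuses[mapped.isna()].unique()
print(list(unexpected))
```

If the list of unexpected statuses is empty on the real column, the explicit mapping and the lambda give the same target.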

### Assumptions 

Before we can start modelling, some base assumptions must be met in order to use a logistic regression model.   
These include:  
* Binary outcome. The dependent variable is binary. (met)
* Independence. The observations should be independent. It is reasonable to assume independence here: without identifying information, there is no way of knowing from the dataset whether a borrower has applied for multiple loans, as each loan id is unique.
* No collinearity / multicollinearity. This will be checked.
* Large sample size. (met)

In [None]:
# Running VIF to check for multicollinearity raised an error. This will be fixed in the next iteration.

# Check for multicollinearity using Variance Inflation Factor (VIF)
#vif_data = pd.DataFrame()
#vif_data['feature'] = X.columns
#vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

# Drop features with high VIF
#high_vif_columns = vif_data[vif_data['VIF'] > 9]['feature'].tolist()
#X = X.drop(high_vif_columns, axis=1)
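
The notebook does not record what the VIF error was, so the following is only a sketch of one way the check could be run. It assumes the problem was non-numeric or missing values in the feature matrix, and it swaps in a NumPy route in place of the statsmodels call: for predictors in a model with an intercept, the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The helper name `compute_vif` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each numeric column (hypothetical helper)."""
    # Keep numeric columns only, cast to float, and drop rows with missing values
    numeric = features.select_dtypes(include=[np.number]).astype(float).dropna()
    # Correlation matrix of the features (columns are variables)
    corr = np.corrcoef(numeric.to_numpy(), rowvar=False)
    # The diagonal of the inverse correlation matrix holds the VIFs
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=numeric.columns, name='VIF').sort_values(ascending=False)

# Toy data: 'a_plus_noise' is almost a copy of 'a', while 'b' is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': rng.normal(size=500),
    'a_plus_noise': a + 0.1 * rng.normal(size=500),
})

vif = compute_vif(demo)
print(vif)
```

On the toy data the two nearly identical columns get a VIF far above the cutoff of 9 used in the commented-out cell, while the independent column stays close to 1.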

***Collinearity***

We will plot a correlation heatmap. 

In [None]:
# Select only the numeric columns for the correlation matrix
numeric_df = loans_df.select_dtypes(include=[np.number])

# Calculate the correlation matrix
corr = numeric_df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(10, 10))
sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, annot=True)
plt.show()
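
With many features, an annotated heatmap can be hard to read. As a complement, the sketch below lists the feature pairs whose absolute correlation exceeds a cutoff. The helper name `high_corr_pairs`, the 0.8 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs with absolute correlation above the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Flatten to a Series indexed by (feature_1, feature_2)
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: 'x_noisy' is nearly a copy of 'x', while 'y' is independent
rng = np.random.default_rng(1)
x = rng.normal(size=300)
demo = pd.DataFrame({
    'x': x,
    'y': rng.normal(size=300),
    'x_noisy': x + 0.1 * rng.normal(size=300),
})

pairs = high_corr_pairs(demo)
print(pairs)
```

On the real data this would be called as `high_corr_pairs(numeric_df)`.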

We can see that `open_acc` is highly correlated with many other features, as are `tot_cur_bal` and `num_sats`. We drop these, along with `num_accts_ever_120_pd`.

In [None]:
loans_df.drop(columns=['open_acc', 'tot_cur_bal', 'num_sats', 'num_accts_ever_120_pd'], inplace=True)

We can now create dummy variables for our categorical variables and split the data. 

In [None]:
# Convert categorical variables to dummy variables
categorical_cols = loans_df.select_dtypes(include=['object']).columns
loans_df = pd.get_dummies(loans_df, columns=categorical_cols, drop_first=True)

# Separate the features from the target
X = loans_df.drop(columns=['loan_status'])
y = loans_df['loan_status']

# Split into train and test sets. Stratify to account for imbalance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
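
To see what `stratify=y` does, here is a small sketch on made-up labels with an assumed 80/20 class split. It checks that the share of class 1 is the same in the train and test sets after splitting. The arrays are hypothetical and stand in for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 800 repaid loans (1) and 200 failed loans (0)
y_demo = np.array([1] * 800 + [0] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same split settings as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# With stratification, both sets keep the original class proportions
print(f'Share of class 1 in train: {y_tr.mean():.3f}')
print(f'Share of class 1 in test:  {y_te.mean():.3f}')
```

The same check on the real split would compare `y_train.mean()` with `y_test.mean()`.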

### Modelling

We can now run the model.

In [None]:
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initializing and training the logistic regression model
log_reg = LogisticRegression(random_state=1,
                             solver='lbfgs', 
                             max_iter=3000, 
                             verbose=2, #output while the model runs
                             n_jobs=2, #use 2 cpu cores
                             class_weight='balanced') #weight the class to counter more frequent class
log_reg.fit(X_train_scaled, y_train)

# Making predictions on the test data using the trained model
y_pred = log_reg.predict(X_test_scaled)
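
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * np.bincount(y))`, as described in the scikit-learn documentation. The sketch below works through that formula on a made-up target; the 8-to-2 class split is an assumption for illustration only.

```python
import numpy as np

# Hypothetical target: 8 repaid loans (1) and 2 failed loans (0)
y_demo = np.array([1] * 8 + [0] * 2)

# Number of samples in each class, indexed by class label
counts = np.bincount(y_demo)

# Balanced weights: the rarer class receives the larger weight
weights = len(y_demo) / (len(counts) * counts)

for label, weight in enumerate(weights):
    print(f'class {label}: weight {weight}')
```

Here the minority class 0 gets a weight of 2.5 and the majority class 1 gets 0.625, so errors on the rarer class cost more during fitting.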

### Evaluation

In [None]:
# Scoring the model on both train and test data
train_score = log_reg.score(X_train_scaled, y_train)
test_score = log_reg.score(X_test_scaled, y_test)
print(f'Score on train: {train_score}')
print(f'Score on test: {test_score}')

# Evaluating the model with confusion matrix and a classification report
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
display(conf_matrix)
print(class_report)

Our model has approximately 66% accuracy on the train set. The train and test scores are relatively close; however, there might be some overfitting since the accuracy on the test set is lower. We can combat this by passing a smaller `C` value (the inverse of the regularization strength) to regularize the model more strongly. The model is better at predicting successful loans, likely due to the data imbalance, with class 1 having higher precision and recall. Of all the actual fully paid loans, 66% were correctly predicted, compared with 64% of all failed loans. Because false positives are costly in this context (predicting that a loan will be repaid when it will in fact be defaulted on), a higher precision would be preferred. However, this comes at the cost of lower recall, meaning good lending opportunities could be missed (false negatives).  
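
One way to trade recall for precision is to raise the decision threshold applied to the predicted probabilities, rather than using the default of 0.5. The sketch below illustrates this on synthetic data from `make_classification`; the dataset, the 0.7 threshold, and the variable names are assumptions, and the numbers it prints say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan features (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.2, 0.8], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_tr, y_tr)

# Predicted probability of class 1 (the "repaid" class)
proba_repaid = model.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.7):
    # Predict class 1 only when the model is at least this confident
    y_hat = (proba_repaid >= threshold).astype(int)
    precision = precision_score(y_te, y_hat, zero_division=0)
    recall = recall_score(y_te, y_hat)
    results[threshold] = (precision, recall)
    print(f'threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}')
```

Raising the threshold can only keep or reduce the number of loans predicted as repaid, so recall for class 1 cannot increase; whether precision improves enough to justify that is a business decision.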

We can explore which features were most useful in the prediction by inspecting their weights.

In [None]:
# Pair each feature name with its coefficient from the fitted model (log_reg)
feature_weights = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': log_reg.coef_[0]
})

# Sort the features by the absolute value of their coefficient, largest first
feature_weights = feature_weights.sort_values(by='Coefficient', key=lambda s: s.abs(), ascending=False)

# Display the feature weights
feature_weights
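
Because the features were standardized, each coefficient is the change in the log-odds of class 1 (fully paid) for a one standard deviation increase in that feature, holding the others fixed. Exponentiating the coefficient gives an odds ratio, which is often easier to read. The sketch below uses made-up feature names and coefficients, not the fitted values.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients standing in for log_reg.coef_[0]
demo_weights = pd.DataFrame({
    'Feature': ['feature_a', 'feature_b', 'feature_c'],
    'Coefficient': [0.7, -0.4, 0.0],
})

# exp(coefficient) is the multiplicative change in the odds of class 1
demo_weights['Odds ratio'] = np.exp(demo_weights['Coefficient'])

print(demo_weights)
```

An odds ratio above 1 means the feature pushes predictions towards a repaid loan, below 1 towards a failed loan, and exactly 1 means no effect.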

## Conclusion

The logistic regression model performed quite well considering its explainability and ease of use. Our baseline model achieved approximately 66% accuracy on the train set and 65% on the test set.