# Credit Risk Classification

Credit risk poses a classification problem that’s inherently imbalanced. This is because healthy loans easily outnumber risky loans. In this application, we’ll use various techniques to train and evaluate models with imbalanced classes. We’ll use a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.


In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler


import warnings
warnings.filterwarnings('ignore')

---

## Split the Data into Training and Testing Sets

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
lending_data = pd.read_csv(Path('Resources/lending_data.csv'))

# Review the DataFrame
lending_data.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data['loan_status']

# Separate the X variable, the features
X = lending_data.drop(columns='loan_status')

In [4]:
# Review the y variable Series
y.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [5]:
# Review the X variable DataFrame
X.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


In [6]:
# Check the balance of our target values
y.value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64
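
To put these counts in perspective, the class shares can be computed directly. This is a minimal sketch that rebuilds a label Series from the counts shown (75,036 healthy and 2,500 high-risk); the names `y_demo`, `shares`, and `majority_baseline` are illustrative and not part of the notebook.

```python
import pandas as pd

# Rebuild a label Series with the same class counts as the output above
# (y_demo is a stand-in for the notebook's y, assumed for illustration)
y_demo = pd.Series([0] * 75036 + [1] * 2500, name="loan_status")

# normalize=True returns proportions instead of raw counts
shares = y_demo.value_counts(normalize=True)
print(shares)

# Plain accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(f"Always predicting 0 would score {majority_baseline:.4f} plain accuracy")
```

Because a model that never flags a loan would already be about 97% accurate, plain accuracy says little here, which is one reason to report balanced accuracy instead.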

In [7]:
# Split the data using train_test_split assigning a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
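
The call above relies on the default `test_size` of 0.25 and does not stratify. With so few high-risk loans, one option (an assumption here, not something this notebook does) is to pass `stratify=y` so that both splits keep the same class ratio. A small sketch on synthetic data, where every `_demo` name is hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 960 "healthy" rows and 40 "high-risk" rows
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 960 + [1] * 40)

# stratify=y_demo asks for the same class proportions in both splits
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

train_share = y_tr_demo.mean()  # share of class 1 in the training split
test_share = y_te_demo.mean()   # share of class 1 in the testing split
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

Without stratification the two shares can drift apart by chance, which matters more the rarer the minority class is.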

---

## Create a Logistic Regression Model with the Original Data

In [8]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
lr = LogisticRegression(random_state=1)
lr


LogisticRegression(random_state=1)

In [9]:
# Scale the data

scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
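
The scaler is fit on the training set only and then applied to both sets, so no information from the test rows leaks into the preprocessing. As a rough sketch of what that step computes (a per-column mean and population standard deviation learned from the training rows), here is the same idea in plain NumPy on made-up numbers; the arrays below are illustrative, not the lending data.

```python
import numpy as np

# Made-up feature matrices standing in for X_train and X_test
X_train_demo = np.array([[10000.0, 7.0], [8000.0, 6.5], [12000.0, 8.0], [9000.0, 7.5]])
X_test_demo = np.array([[11000.0, 7.2], [7000.0, 6.0]])

# "fit": learn the statistics from the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets
X_train_demo_scaled = (X_train_demo - train_mean) / train_std
X_test_demo_scaled = (X_test_demo - train_mean) / train_std

print(X_train_demo_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_demo_scaled.std(axis=0))   # close to 1 for each column
print(X_test_demo_scaled)                # generally not centered at 0
```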


In [10]:
# Fit the model using training data

lr.fit(X_train_scaled, y_train)

LogisticRegression(random_state=1)

In [11]:
# Make a prediction using the testing data

lending_data_y_pred = lr.predict(X_test_scaled)

In [12]:
# Print the balanced_accuracy score of the model

balanced_accuracy_score(y_test, lending_data_y_pred)

0.9889115309798473

In [13]:
# Generate a confusion matrix for the model
confusion_matrix(y_test, lending_data_y_pred)

array([[18652,   113],
       [   10,   609]])

In [14]:
# Print the classification report for the model
print(classification_report_imbalanced(y_test, lending_data_y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.98      1.00      0.99      0.98     18765
          1       0.84      0.98      0.99      0.91      0.99      0.98       619

avg / total       0.99      0.99      0.98      0.99      0.99      0.98     19384



**How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?**

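The logistic regression model predicts healthy loans (`0`) very well: the confusion matrix shows 18,652 of the 18,765 healthy loans classified correctly and 113 misclassified as high-risk, and the report gives this class a precision of 1.00 and a recall of 0.99. High-risk loans (`1`) are predicted less precisely. The model catches 609 of the 619 high-risk loans (recall of 0.98), but the 113 healthy loans flagged as high-risk bring its precision down to 0.84, which is still a good result.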
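
The figures in the report can be traced back to the four cells of the confusion matrix. The sketch below recomputes them by hand from the matrix printed above; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted
tn, fp = 18652, 113   # actual healthy loans (0)
fn, tp = 10, 609      # actual high-risk loans (1)

# Precision: of the loans predicted as a class, how many really belong to it
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# Recall: of the loans that really belong to a class, how many were found
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)

# Balanced accuracy is the mean of the per-class recalls
balanced_acc = (recall_0 + recall_1) / 2

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"balanced accuracy={balanced_acc:.4f}")
```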

---

## Predict a Logistic Regression Model with Resampled Training Data

In this section, we use the `RandomOverSampler` class from the imbalanced-learn library to resample the training data and test whether doing so improves the precision for high-risk loans.

In [15]:
# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
random_oversampler = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)

In [16]:
# Count the distinct values of the resampled labels data
y_resampled.value_counts()

0    56271
1    56271
Name: loan_status, dtype: int64
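
The balanced counts come from duplicating minority-class rows until both classes are the same size. The following is a simplified sketch of that idea in NumPy, not imbalanced-learn's actual implementation; the toy arrays and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 95 majority rows (label 0) and 5 minority rows (label 1)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(y_demo == 1)
majority_idx = np.flatnonzero(y_demo == 0)

# Draw minority rows with replacement until the classes are the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Keep every original row and append the duplicated minority rows
keep_idx = np.concatenate([np.arange(len(y_demo)), extra_idx])
X_bal, y_bal = X_demo[keep_idx], y_demo[keep_idx]

counts = np.bincount(y_bal)
print(counts)
```

No new information is created: every added row is an exact copy of an existing minority row, which is why only the training set should be resampled and the test set left untouched.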

In [17]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
lr = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
# Note: the resampled features are not scaled in this cell, so this model
# is fit and evaluated on raw (unscaled) features, unlike the first model
lr.fit(X_resampled, y_resampled)

# Make a prediction using the testing data (also unscaled, to match the fit)
resampled_data_pred = lr.predict(X_test)


In [18]:
# Print the balanced_accuracy score of the model 

balanced_accuracy_score(y_test, resampled_data_pred)

0.9936781215845847

In [19]:
# Generate a confusion matrix for the model
confusion_matrix(y_test, resampled_data_pred)

array([[18649,   116],
       [    4,   615]])

In [20]:
# Print the classification report for the model
print(classification_report_imbalanced(y_test, resampled_data_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.99      1.00      0.99      0.99     18765
          1       0.84      0.99      0.99      0.91      0.99      0.99       619

avg / total       0.99      0.99      0.99      0.99      0.99      0.99     19384



**How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?**

The confusion matrix shows a slight improvement: missed high-risk loans fall from 10 to 4, while healthy loans flagged as high-risk rise from 113 to 116. The classification report, however, is almost unchanged from that of the original logistic regression model: precision for high-risk loans stays at 0.84, and recall moves only from 0.98 to 0.99. Since resampling has not had a meaningful impact on the precision it was meant to improve, it would be advisable to skip resampling and use the original logistic regression model.
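
The comparison can be made concrete by subtracting the two confusion matrices and recomputing the high-risk metrics at more decimal places than the report prints. This is a small sketch using the matrices shown in the outputs; the helper `precision_recall_class1` is a hypothetical name introduced here.

```python
import numpy as np

# Confusion matrices from the outputs above (rows: actual, columns: predicted)
cm_original = np.array([[18652, 113], [10, 609]])
cm_resampled = np.array([[18649, 116], [4, 615]])

# Change in each cell after oversampling
delta = cm_resampled - cm_original
print(delta)

def precision_recall_class1(cm):
    """Return (precision, recall) for the high-risk class (label 1)."""
    tp = cm[1, 1]
    fp = cm[0, 1]
    fn = cm[1, 0]
    return tp / (tp + fp), tp / (tp + fn)

p_orig, r_orig = precision_recall_class1(cm_original)
p_res, r_res = precision_recall_class1(cm_resampled)
print(f"precision: {p_orig:.3f} -> {p_res:.3f}")
print(f"recall:    {r_orig:.3f} -> {r_res:.3f}")
```

Recall for high-risk loans rises slightly, while precision does not improve, which is consistent with the report showing 0.84 for both models.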