In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
import os

In [4]:
os.getcwd()

'c:\\Users\\rttay\\Documents\\Education\\UTA_Data_Bootcamp\\Homeworks_and_Projects\\Module_20\\credit-risk-classification'

In [7]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path('Resources') / 'lending_data.csv'

lending_df = pd.read_csv(file_path)

# Review the DataFrame
lending_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [8]:
# Separate the data into labels and features


# Separate the y variable, the labels
y = lending_df['loan_status']

display(y.head())

# Separate the X variable, the features
X = lending_df.copy()

X.drop('loan_status', axis=1, inplace=True)

X.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [9]:
# Review the y variable Series
y.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [10]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [11]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split


In [12]:

# Split the data using train_test_split
# Assign a random_state of 1 to the function

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [13]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

In [14]:

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model = LogisticRegression(random_state=1)


# Fit the model using training data
model.fit(X=X_train, y=y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [15]:
# Make a prediction using the testing data
predictions = model.predict(X=X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [16]:
# Generate a confusion matrix for the model
cm = confusion_matrix(y_test, predictions)

cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

display(cm_df)

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,18663,102
Actual 1,56,563
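
The four cells of this matrix can also be read off by position. The following is a minimal sketch, assuming the layout shown above (rows are actual labels, columns are predicted labels, both ordered 0 then 1). The helper name `cells_for_positive` is hypothetical and is not part of the notebook or of scikit-learn.

```python
# Counts copied from the confusion matrix above.
# Rows are actual labels (0, 1); columns are predicted labels (0, 1).
cm = [[18663, 102],
      [56, 563]]


def cells_for_positive(cm, positive):
    """Return (tp, fp, fn, tn) when `positive` (0 or 1) is the positive class."""
    negative = 1 - positive
    tp = cm[positive][positive]  # actual positive, predicted positive
    fn = cm[positive][negative]  # actual positive, predicted negative
    fp = cm[negative][positive]  # actual negative, predicted positive
    tn = cm[negative][negative]  # actual negative, predicted negative
    return tp, fp, fn, tn


print(cells_for_positive(cm, positive=0))  # healthy loans as the positive class
print(cells_for_positive(cm, positive=1))  # high-risk loans as the positive class
```

Which class counts as "positive" only changes how the same four numbers are named; the matrix itself stays the same.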


In [17]:
# Print the classification report for the model

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384
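
The two average rows in the report can be reproduced by hand. This is a rough sketch for the precision column only, assuming the per-class values are recomputed from the confusion matrix counts rather than taken from the rounded report; the variable names are illustrative.

```python
# Per-class precision, recomputed from the confusion matrix counts.
precision = {0: 18663 / (18663 + 56), 1: 563 / (563 + 102)}
# Support: number of actual occurrences of each class in the test set.
support = {0: 18765, 1: 619}

# Macro average: plain mean over classes, so each class counts equally.
macro = sum(precision.values()) / len(precision)

# Weighted average: mean over classes, weighted by each class's support.
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg precision:    {macro:.2f}")
print(f"weighted avg precision: {weighted:.2f}")
```

Because healthy loans dominate the test set, the weighted average stays close to the healthy-loan value, while the macro average is pulled down by the high-risk class.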



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 



---------------
For 0 = Healthy Loans:

True Positives: (Actual 0, Predicted 0): 18663

True Negatives: (Actual 1, Predicted 1): 563

False Positives: (Actual 1, Predicted 0): 56

False Negatives: (Actual 0, Predicted 1): 102


Precision is defined as $ \frac{\text{\# True Positive}}{\text{\# True Positive + \# False Positive}} $. Precision describes the breakdown of your positive predictions: if you made a positive prediction, how likely is it to be correct? Here, our precision for healthy loans was $ \frac{18663}{18663+56} \approx 0.997 $, which the report rounds to 1.00. Out of all the loans we predicted to be healthy, nearly 100% were True Positives, that is, correct predictions. So when the model predicts a "healthy loan" for a given set of features, we can be fairly certain that it is not a false positive: fewer than 1% of the healthy predictions are wrong.

Precision - if I have 100 things that I predict to be positive, and 60 were actually positive, then I have a precision of 60%. 
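
The same calculation can be written as a small function. This is a minimal sketch; the function name `precision` is illustrative and the counts are the ones listed above.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp)


# Toy case: 100 positive predictions, 60 of them correct.
toy = precision(tp=60, fp=40)

# Healthy loans as the positive class.
healthy = precision(tp=18663, fp=56)

print(f"toy: {toy:.2f}, healthy loans: {healthy:.3f}")
```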

Recall is defined similarly to precision, but it measures how many of the actual positive values our positive predictions capture: $ \frac{\text{\# True Positive}}{\text{\# True Positive + \# False Negative}} $. For healthy loans, our recall was $\frac{18663}{18663+102}$, which is ~0.99. Essentially, we captured 99% of the actually healthy loans with our predictions. Recall matters most when a false negative has a high cost, which is less true here: a healthy loan flagged as high-risk is a missed profitable investment rather than a loss.

Recall - if I have 100 actually positive values, and I predicted that 70 of those were positive, then we have a recall of 70%. 
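
To show where the counts come from, recall can also be computed directly from label lists. The sketch below uses made-up toy labels (10 actual positives, 7 of them predicted positive); none of these values come from the lending data.

```python
# Toy labels: 1 marks the positive class.
actual = [1] * 10 + [0] * 5
predicted = [1] * 7 + [0] * 3 + [0] * 4 + [1] * 1

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # actual positives we caught
fn = sum(a == 1 and p == 0 for a, p in pairs)  # actual positives we missed

recall = tp / (tp + fn)
print(tp, fn, recall)
```

The single false positive at the end of `predicted` does not affect recall, since recall only looks at the actual positives.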

The F1 Score gives you a balance between Precision and Recall: $\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall} } $. Here that gives us about 0.996, which the report rounds to 1.00. This makes sense because both precision and recall were extremely high.
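
As a quick check of the formula, here is a minimal sketch that computes F1 from the unrounded precision and recall for healthy loans. The function name `f1` is illustrative, and the second call uses made-up values to show how F1 behaves when the two inputs are far apart.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Healthy loans as the positive class, using the unrounded values.
p = 18663 / (18663 + 56)
r = 18663 / (18663 + 102)
healthy_f1 = f1(p, r)

# Made-up values: perfect precision but only 50% recall.
lopsided_f1 = f1(1.0, 0.5)

print(f"healthy loans: {healthy_f1:.4f}, lopsided example: {lopsided_f1:.4f}")
```

The lopsided example lands below the simple average of 0.75, which is why F1 is only high when precision and recall are both high.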


The support tells you the number of actual occurrences of the class in the specified dataset. The support for healthy loans is $18663 + 102 = 18765$ because the number of actual healthy loans in the data set is the sum of the True Positives and the False Negatives.

The accuracy metric then is just the total correct predictions out of the total items in the data set. Given no other information, what is the likelihood that our prediction is correct? This is what accuracy allows us to answer. It's a bird's-eye view, and it can miss a lot of the nuance. So accuracy here is

$$ \frac{\text{\# True Positive} + \text{\# True Negative}}{\text{\# True Positive + \# False Positive + \# True Negative + \# False Negative}} $$

$$ = \frac{18663+563}{18663+56+563+102} $$

Which is approximately 99.2%. Given no other information, that would inspire some confidence, but further investigation is needed.
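
One way to see what accuracy can hide is to compare it with a trivial baseline. The sketch below is a hypothetical comparison that is not part of the notebook: it assumes a "model" that labels every loan in the test set as healthy, using the support values from the classification report.

```python
# Counts with healthy loans as the positive class.
tp, tn, fp, fn = 18663, 563, 56, 102
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Hypothetical baseline: predict "healthy" for every loan.
# It is right on all 18765 healthy loans and wrong on all 619 high-risk loans.
baseline_accuracy = 18765 / (18765 + 619)

print(f"model: {accuracy:.3f}, all-healthy baseline: {baseline_accuracy:.3f}")
```

Because the classes are so imbalanced, even this baseline scores about 97%, so the per-class precision and recall carry more information than accuracy alone.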

---------------
The next breakdown is more impactful: the minority class is often the one we want to treat as the positive case.


To treat high-risk loans as the positive class, we relabel high-risk loans as 0 and healthy loans as 1. The (Actual, Predicted) pairs below use these swapped labels:

True Positives: (Actual 0, Predicted 0): 563

True Negatives: (Actual 1, Predicted 1): 18663

False Positives: (Actual 1, Predicted 0): 102

False Negatives: (Actual 0, Predicted 1): 56

Precision: $ \frac{\text{\# True Positive}}{\text{\# True Positive + \# False Positive}} $. Remember, we have swapped the 0 and 1 when calculating the True/False Positives/Negatives, which is why the values differ. Here, our precision for high-risk loans was $\frac{563}{563+102} \approx 0.85$. Out of all the loans that we labeled as high-risk, ~85% were actually high-risk. The other 15% of the time, a loan that the model labels as high-risk is actually healthy. From a financial perspective, it makes sense for precision to matter more for healthy loans than for high-risk loans. Higher precision for healthy loans means a lower likelihood of losing money on a bad investment, while lower precision on high-risk loans just means the volume of healthy loans you give out is reduced slightly. In the first case, high precision minimizes losses; in the second, high precision means missing out on fewer healthy, profitable loans.

Out of 100 loans that we predicted to be high risk, we would find 85 loans to be actually high risk. 

Recall the definition of recall: $ \frac{\text{\# True Positive}}{\text{\# True Positive + \# False Negative}} $. For the high-risk loans, our recall was $\frac{563}{563+56}$, which is ~0.91. Essentially, we captured 91% of the actually high-risk loans with our predictions. Recall is useful if a false negative has a high cost. In this case, a false negative is a high-risk loan that we predicted to be healthy. That absolutely has a high cost, so recall is an important measurement for high-risk loans. Personally, 91% seems like good odds, but I'm not sure what the standards are for the banking industry.

Out of 100 loans that are actually high risk, we would predict that 91 of those are high risk. 


The F1 Score gives you a balance between Precision and Recall: $\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall} } $. Here that gives us 0.88.
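
The three high-risk values can be reproduced together from the counts listed above. This is a minimal sketch; `Metrics` and `class_metrics` are illustrative names, not functions from scikit-learn.

```python
from collections import namedtuple

Metrics = namedtuple("Metrics", ["precision", "recall", "f1"])


def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return Metrics(precision, recall, f1)


# High-risk loans as the positive class.
high_risk = class_metrics(tp=563, fp=102, fn=56)
print(", ".join(f"{name}={value:.2f}" for name, value in high_risk._asdict().items()))
```

True negatives do not appear in any of the three formulas, so the large number of correctly identified healthy loans does not inflate these scores.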


The support tells you the number of actual occurrences of the class in the specified dataset. The support for high-risk loans is $563 + 56 = 619$ because the number of actual high-risk loans in the data set is the sum of the True Positives and the False Negatives.

The accuracy metric does not change if we swap high-risk/healthy loans for 0/1. So the accuracy remains around 99.2%.
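
A short check of this claim, written as a sketch with an illustrative `accuracy` helper: swapping the positive class exchanges True Positives with True Negatives and False Positives with False Negatives, so both the numerator and the denominator keep the same totals.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Healthy loans as the positive class.
healthy_positive = dict(tp=18663, tn=563, fp=56, fn=102)
# High-risk loans as the positive class: TP and TN trade places, as do FP and FN.
high_risk_positive = dict(tp=563, tn=18663, fp=102, fn=56)

same = accuracy(**healthy_positive) == accuracy(**high_risk_positive)
print(same, f"{accuracy(**healthy_positive):.3f}")
```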

----------------------------------

Conclusion:

Without more knowledge of the standards of the banking industry, it is hard to say how impactful these metrics are. But the most significant results seem to stem from treating the minority population (the high-risk loans) as the "positive" value for the prediction. This leads us to insightful metrics for precision and recall. The precision of our predictions for high-risk loans was 85%, which tells us that out of 100 loans that we predict to be high-risk, 85 are actually high-risk. This means that about 15% of the loans we predict to be high-risk are actually healthy loans, which eats into potential profit.

Recall, however, feels more impactful. The recall in this case tells us that out of 100 loans that are actually high-risk, we correctly flagged 91. Essentially, this model is able to capture 91% of the high-risk loans, which allows us to avoid more bad investments.

Without knowledge of the volume, size, and interest rates of these loans, though, it's hard to advise further actions.