In [6]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [8]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_data = pd.read_csv(Path("Resources/lending_data.csv"))

# Review the DataFrame
df_data.head(10)

|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |
| 5 | 10100.0 | 7.438 | 50600 | 0.407115 | 4 | 1 | 20600 | 0 |
| 6 | 10300.0 | 7.49 | 51100 | 0.412916 | 4 | 1 | 21100 | 0 |
| 7 | 8800.0 | 6.857 | 45100 | 0.334812 | 3 | 0 | 15100 | 0 |
| 8 | 9300.0 | 7.096 | 47400 | 0.367089 | 3 | 0 | 17400 | 0 |
| 9 | 9700.0 | 7.248 | 48800 | 0.385246 | 4 | 0 | 18800 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [15]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_data["loan_status"]

# Separate the X variable, the features
# (named X, uppercase, to match every later cell that uses it)
X = df_data.drop(columns=["loan_status"])

In [16]:
# Review the y variable Series
y.head(10)


0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: loan_status, dtype: int64

In [17]:
# Check the distribution of loan_status values
y.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64
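
The counts above show a heavily imbalanced target, which matters when reading accuracy later on. As a rough sketch, the class shares and the accuracy of a trivial "always predict healthy" baseline can be worked out directly from those two counts. The variable names here are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 75036, 1: 2500}, name="count")

# Share of each class in the full dataset
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = shares.max()

print(shares.round(4))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model trained on this data should be judged against that baseline rather than against 50%.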

In [18]:
# Review the X variable DataFrame
X.head(10)

|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |
| 5 | 10100.0 | 7.438 | 50600 | 0.407115 | 4 | 1 | 20600 |
| 6 | 10300.0 | 7.49 | 51100 | 0.412916 | 4 | 1 | 21100 |
| 7 | 8800.0 | 6.857 | 45100 | 0.334812 | 3 | 0 | 15100 |
| 8 | 9300.0 | 7.096 | 47400 | 0.367089 | 3 | 0 | 17400 |
| 9 | 9700.0 | 7.248 | 48800 | 0.385246 | 4 | 0 | 18800 |


In [19]:
# Get summary statistics of the features
X.describe()

# Check for missing values
X.isnull().sum()

# Check the data types of each column
X.dtypes

loan_size           float64
interest_rate       float64
borrower_income       int64
debt_to_income      float64
num_of_accounts       int64
derogatory_marks      int64
total_debt            int64
dtype: object

### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [21]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Verify the shape of the datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((62028, 7), (15508, 7), (62028,), (15508,))
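
With so few high-risk loans, a purely random split can leave the training and testing sets with slightly different class proportions. One common option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below runs on hypothetical stand-in data (`X_demo`, `y_demo`), not on `lending_data.csv`, and only illustrates the call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows, 3% of them labeled 1
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 30 + [0] * 970, name="loan_status")

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fraction of positive labels in each subset
print(y_tr.mean(), y_te.mean())
```

Whether stratification is worth adding here is a judgment call; with more than 77,000 rows the unstratified split is already close to proportional.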

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [22]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using the training data
logistic_model.fit(X_train, y_train)

# Check the model's coefficients
logistic_model.coef_, logistic_model.intercept_

(array([[ 4.65493177e-03, -1.21882486e-03, -1.16564682e-03,
          3.10079032e-01, -1.39173075e-01,  1.34396996e+00,
          2.29851129e-04]]),
 array([-4.9082417e-08]))
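
The raw coefficients are log-odds changes per one-unit increase in each feature, which is hard to read at a glance. A minimal sketch of converting them to odds ratios follows, assuming the coefficient order matches the column order shown in the `X.dtypes` output. The values are copied from the output above. Because the features are unscaled (dollars, percentages, counts), the resulting magnitudes are not directly comparable across features.

```python
import numpy as np
import pandas as pd

# Column order assumed to match X as shown by X.dtypes
feature_names = [
    "loan_size", "interest_rate", "borrower_income", "debt_to_income",
    "num_of_accounts", "derogatory_marks", "total_debt",
]

# Coefficients copied from logistic_model.coef_ above
coefficients = np.array([
    4.65493177e-03, -1.21882486e-03, -1.16564682e-03,
    3.10079032e-01, -1.39173075e-01, 1.34396996e+00,
    2.29851129e-04,
])

# exp(coefficient) = multiplicative change in the odds of class 1
# for a one-unit increase in that feature, other features held fixed
odds_ratios = pd.Series(np.exp(coefficients), index=feature_names)

print(odds_ratios.sort_values(ascending=False).round(4))
```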

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [23]:
# Make a prediction using the testing data
y_pred = logistic_model.predict(X_test)

# Display the first 10 predictions
y_pred[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [24]:
from sklearn.metrics import confusion_matrix, classification_report

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[14924    77]
 [   31   476]]
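
For a binary problem, scikit-learn orders the confusion matrix as `[[TN, FP], [FN, TP]]`, so the four counts can be unpacked with `ravel()`. The sketch below uses the matrix printed above and derives the class `1` metrics by hand as a cross-check; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
conf = np.array([[14924, 77],
                 [31, 476]])

# Row = actual class, column = predicted class
tn, fp, fn, tp = conf.ravel()

precision = tp / (tp + fp)   # of loans flagged high-risk, how many are
recall = tp / (tp + fn)      # of actual high-risk loans, how many are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / conf.sum()

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```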


In [25]:
# Print the classification report for the model
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(class_report)


Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.86      0.94      0.90       507

    accuracy                           0.99     15508
   macro avg       0.93      0.97      0.95     15508
weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model predicts both labels well, with an overall accuracy of 99% on the test set. Because only about 3% of the loans are high-risk (2,500 of 77,536), accuracy alone is a weak yardstick: a model that labeled every loan healthy would still be about 97% accurate. The per-class metrics are more informative.

**Class `0` (healthy loan):**

* Precision 1.00 (rounded): when the model predicts that a loan is healthy, it is almost always correct.
* Recall 0.99: almost all healthy loans are correctly identified.
* F1-score 1.00: precision and recall are both near perfect.

**Class `1` (high-risk loan):**

* Precision 0.86: 86% of the loans predicted as high-risk are truly high-risk.
* Recall 0.94: the model identifies 94% of the actual high-risk loans.
* F1-score 0.90: a good balance of precision and recall, though clearly below class `0`.

**Confusion matrix (15,508 test loans):**

* True negatives (healthy loans correctly classified as `0`): 14,924
* False positives (healthy loans incorrectly classified as high-risk): 77
* False negatives (high-risk loans incorrectly classified as healthy): 31
* True positives (high-risk loans correctly classified as `1`): 476

**Conclusion:** The model is nearly perfect on healthy loans and solid, though weaker, on high-risk loans: it misses 31 of the 507 high-risk loans (about 6%) and raises 77 false alarms. If missed high-risk loans are the costlier error, lowering the decision threshold or rebalancing the training data (for example with class weights or SMOTE oversampling) could raise recall for class `1`, usually at the cost of some precision.
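
The threshold adjustment mentioned in the answer can be sketched as follows. This is a minimal illustration on hypothetical synthetic data from `make_classification`, not on the lending dataset, and the thresholds 0.5 and 0.3 are arbitrary choices. `predict` effectively uses a 0.5 cutoff on the class `1` probability, so applying a lower cutoff to `predict_proba` flags more loans as high-risk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with roughly 3% positives
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=7, weights=[0.97], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

model = LogisticRegression(random_state=1, max_iter=1000).fit(X_tr, y_tr)

# Probability of class 1 for each test row
proba = model.predict_proba(X_te)[:, 1]

# Compare the default-like cutoff with a lower, more cautious one
results = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only keep or increase recall for class `1`; precision typically falls, so the right cutoff depends on the relative cost of the two error types.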

---