In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Step 1: Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path('Resources/lending_data.csv')
lending_data = pd.read_csv(file_path)

# Review the DataFrame
print(lending_data.head())
print(lending_data.info())
print(lending_data.describe())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  loan_status  
0                 1       22800            0  
1                 0       13600            0  
2                 0       16100            0  
3                 1       22700            0  
4                 1       23000            0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 

### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Step 2: Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data['loan_status']

# Separate the X variable, the features
X = lending_data.drop(columns=['loan_status'])

# Review the X and y DataFrames
print("Features (X):")
print(X.head())
print("\nLabels (y):")
print(y.head())


Features (X):
   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  

Labels (y):
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64


In [4]:
# Review the y variable Series
print("Labels (y):")
print(y.value_counts())
print(y.describe())


Labels (y):
loan_status
0    75036
1     2500
Name: count, dtype: int64
count    77536.000000
mean         0.032243
std          0.176646
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: loan_status, dtype: float64
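
The counts above point to a strong class imbalance. A minimal sketch of how the class shares could be computed, assuming a Series rebuilt from the printed counts (the name `labels` is illustrative and stands in for `y`):

```python
import pandas as pd

# Rebuild a label Series from the counts printed above (illustrative stand-in for y)
labels = pd.Series([0] * 75036 + [1] * 2500, name="loan_status")

# normalize=True returns each class's share of the rows instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share.round(4))

# Number of healthy loans for every high-risk loan
imbalance_ratio = (labels == 0).sum() / (labels == 1).sum()
print(f"Healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

In the notebook itself, `y.value_counts(normalize=True)` would give the same shares directly.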


In [5]:
# Review the X variable DataFrame
print("Features (X):")
print(X.head())
print(X.info())
print(X.describe())


Features (X):
   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64

### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Review the split datasets
print("Training Features (X_train):")
print(X_train.head())
print("Training Labels (y_train):")
print(y_train.head())
print("Testing Features (X_test):")
print(X_test.head())
print("Testing Labels (y_test):")
print(y_test.head())


Training Features (X_train):
       loan_size  interest_rate  borrower_income  debt_to_income  \
2713      9300.0          7.069            47100        0.363057   
27355    10400.0          7.528            51400        0.416342   
46607     9900.0          7.350            49800        0.397590   
23276     9600.0          7.212            48500        0.381443   
75139    22200.0         12.546            98700        0.696049   

       num_of_accounts  derogatory_marks  total_debt  
2713                 3                 0       17100  
27355                4                 1       21400  
46607                4                 0       19800  
23276                4                 0       18500  
75139               15                 3       68700  
Training Labels (y_train):
2713     0
27355    0
46607    0
23276    0
75139    1
Name: loan_status, dtype: int64
Testing Features (X_test):
       loan_size  interest_rate  borrower_income  debt_to_income  \
60914    12600.0       
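
The split above does not pass `stratify`, so the share of high-risk loans in each subset is left to chance. With a minority class of roughly 3%, a stratified split is one option worth considering. The sketch below uses synthetic data, and every name ending in `_demo`, `_tr`, or `_te` is an assumption made for illustration, not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1,000 rows with 5% positives (not the lending data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 2))
y_demo = np.array([0] * 950 + [1] * 50)

# stratify=y_demo keeps the class proportions close to equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test: ", y_te.mean())
```

Applied to the notebook's data, the equivalent call would add `stratify=y` to the existing `train_test_split` arguments.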

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model = LogisticRegression(random_state=1)

# Fit the model using training data
model.fit(X_train, y_train)

# Print the model's coefficients to check if it has been trained
print("Model Coefficients:")
print(model.coef_)


Model Coefficients:
[[-1.07343332e-05 -1.11821247e-07 -3.86442644e-04 -2.57250652e-09
   1.61411871e-07  5.41492664e-08  6.42898333e-04]]


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
y_pred = model.predict(X_test)

# Print the predictions to check
print("Predictions on the testing data:")
print(y_pred)


Predictions on the testing data:
[0 0 0 ... 0 0 0]


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [10]:

# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[14926    75]
 [   46   461]]
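
The per-class metrics can be derived by hand from this matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes. The sketch below hard-codes the four counts printed above and computes the class `1` metrics. The variable names are illustrative:

```python
# Counts copied from the confusion matrix above
# (rows are actual classes, columns are predicted classes)
tn, fp = 14926, 75   # actual 0: predicted 0, predicted 1
fn, tp = 46, 461     # actual 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)   # share of predicted high-risk loans that are high-risk
recall_1 = tp / (tp + fn)      # share of actual high-risk loans that were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.86 recall=0.91 f1=0.88 accuracy=0.99
```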


In [11]:
# Print the classification report for the model
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

The logistic regression model predicts healthy loans almost perfectly and predicts high-risk loans well, though with noticeably more errors. The figures below come from the classification report and confusion matrix above, which cover a test set of 15,508 loans.

(a) Healthy Loans (Class 0):

* Precision: 1.00 (rounded) - Of the 14,972 loans predicted as healthy, 14,926 are healthy; the other 46 are high-risk loans that the model missed.
* Recall: 1.00 (rounded) - The model correctly identifies 14,926 of the 15,001 healthy loans and flags the remaining 75 as high-risk.
* F1-Score: 1.00 - Precision and recall are both close to perfect for this class.

(b) High-Risk Loans (Class 1):

* Precision: 0.86 - Of the 536 loans predicted as high-risk, 461 are high-risk; the other 75 are healthy loans flagged by mistake.
* Recall: 0.91 - The model identifies 461 of the 507 actual high-risk loans and misses 46 (about 9%), predicting them as healthy.
* F1-Score: 0.88 - This is lower than for class 0, mainly because of the lower precision.

Overall accuracy is 0.99, but that figure is dominated by healthy loans, which make up about 97% of the data (75,036 healthy versus 2,500 high-risk), so the class 1 metrics are the more informative measure. The weaker performance on high-risk loans is likely related to this class imbalance. Techniques such as resampling, adjusting class weights, or using more sophisticated models may improve the model's performance in predicting high-risk loans.
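
Of the techniques mentioned, adjusting class weights is the smallest change to try. The sketch below compares an unweighted model with `class_weight='balanced'` on synthetic data. The data, the `_demo`, `_tr`, and `_te` names, and the `max_iter` value are all assumptions made for illustration, and the results do not describe the lending data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 3% positives); not the lending data
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.97, 0.03], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fit the same model with and without class weighting and record class 1 recall
recalls = {}
for name, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=weight, random_state=1, max_iter=1000)
    clf.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, clf.predict(X_te))

print(recalls)
```

Class weighting usually raises recall for the minority class at some cost in precision, so both metrics should be checked before adopting it.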

---