In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame

lending_data_df = pd.read_csv("Resources/lending_data.csv")

# Review the DataFrame

lending_data_df.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data_df["loan_status"]

# Separate the X variable, the features
X = lending_data_df.drop(columns="loan_status")

In [4]:
# Review the y variable Series

print(y.head())

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64


In [5]:
# Review the X variable DataFrame

print(X.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(58152, 7)
(19384, 7)
(58152,)
(19384,)
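
The shapes above reflect the default behavior of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out for testing (19,384 of 77,536 here). The sketch below illustrates this on a small synthetic dataset; `X_demo` and `y_demo` are made-up stand-ins, not the lending data. It also shows the `stratify` argument as an optional variation that this notebook does not use, which keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the lending data (illustrative only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# Default split: 75% of the rows for training, 25% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# Optional variation: stratify on the labels so both subsets keep the 80/20 ratio
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)
print(y_tr_s.mean(), y_te_s.mean())
```

With an imbalanced target such as `loan_status`, stratifying is one way to make sure the rare class is represented proportionally in the test set.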


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model

logistic_regression_model = LogisticRegression(random_state = 1)

# Fit the model using training data

lr_model = logistic_regression_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data

testing_predictions = logistic_regression_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [9]:
# Generate a confusion matrix for the model

test_matrix = confusion_matrix(y_test, testing_predictions)

print(test_matrix)

[[18663   102]
 [   56   563]]
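
To make the matrix easier to read, the counts can be wrapped in a labeled DataFrame. This is a minimal sketch that re-enters the counts printed above so that it runs on its own. The row and column labels are illustrative names, and the layout relies on scikit-learn's convention that rows are actual labels and columns are predicted labels.

```python
import numpy as np
import pandas as pd

# Counts copied from the confusion matrix printed above
cm = np.array([[18663, 102],
               [56, 563]])

# Rows = actual labels, columns = predicted labels (scikit-learn convention)
cm_df = pd.DataFrame(
    cm,
    index=["Actual 0 (healthy)", "Actual 1 (high-risk)"],
    columns=["Predicted 0 (healthy)", "Predicted 1 (high-risk)"],
)
print(cm_df)
```

In the notebook itself, `test_matrix` could be passed to `pd.DataFrame` in place of the hard-coded array.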


In [10]:
# Print the classification report for the model

testing_report = classification_report(y_test, testing_predictions)

print(testing_report)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384
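
The printed report rounds every value to two decimals. If unrounded values are needed, `classification_report` can return a dictionary through its `output_dict` parameter. The sketch below rebuilds label vectors that reproduce the confusion matrix counts above, so it runs without the dataset; the `target_names` values are illustrative names for labels `0` and `1`.

```python
from sklearn.metrics import classification_report

# Label vectors that reproduce the confusion matrix counts above
y_true = [0] * 18765 + [1] * 619
y_pred = [0] * 18663 + [1] * 102 + [0] * 56 + [1] * 563

report = classification_report(
    y_true,
    y_pred,
    target_names=["healthy", "high-risk"],  # illustrative names for labels 0 and 1
    output_dict=True,
)

# Unrounded metrics for the high-risk class
print(report["high-risk"]["precision"])
print(report["high-risk"]["recall"])
```

In the notebook, `y_test` and `testing_predictions` would take the place of `y_true` and `y_pred`.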



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model was evaluated on how accurately it distinguishes a healthy loan (label `0`) from a high-risk loan (label `1`). Overall accuracy on the test data is 0.99, but the two classes are not predicted equally well, so precision and recall are examined for each class.

Precision is the ratio of correctly predicted observations of a class to all observations predicted to be in that class. It answers two questions: of all the samples classified as high-risk loans, how many are actually high-risk, and of all the samples classified as healthy loans, how many are actually healthy? Based on the classification report for the test data, precision is 0.85 (85%) for high-risk loans and 1.00 (100% after rounding) for healthy loans. In other words, when the model predicts that a loan is high-risk, the prediction is correct about 85% of the time, and when it predicts that a loan is healthy, the prediction is correct nearly 100% of the time.
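
The precision values can be checked by hand from the confusion matrix. This is a minimal sketch that re-enters the counts from the matrix above and divides each correct count by its column total, assuming scikit-learn's layout of actual labels in rows and predicted labels in columns.

```python
import numpy as np

# Counts copied from the confusion matrix (rows = actual, columns = predicted)
cm = np.array([[18663, 102],
               [56, 563]])

# Precision per class = correct predictions / all predictions of that class,
# i.e. the diagonal divided by the column sums
precision = np.diag(cm) / cm.sum(axis=0)

print(f"Precision, healthy (0):   {precision[0]:.4f}")
print(f"Precision, high-risk (1): {precision[1]:.4f}")
```

The high-risk precision is 563 / (563 + 102), which rounds to the 0.85 shown in the report.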

Recall is the ratio of correctly predicted observations of a class to all actual observations of that class. It answers two questions: of all the actual healthy loans, how many were correctly classified as healthy, and of all the actual high-risk loans, how many were correctly classified as high-risk? Recall leaves very little room for error, because predicting an actual positive as a negative can cause more damage to a company. When analyzing just the healthy loans, a healthy loan is treated as the actual positive. A recall of approximately 99% means that nearly every healthy loan is correctly identified as healthy. With this almost perfect detection, the model can be considered trustworthy for identifying healthy loans and helping customers secure them.
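
Recall can be checked by hand in the same way, this time dividing by the number of actual observations in each class. This is a minimal sketch using the counts from the confusion matrix above, assuming scikit-learn's layout of actual labels in rows.

```python
import numpy as np

# Counts copied from the confusion matrix (rows = actual, columns = predicted)
cm = np.array([[18663, 102],
               [56, 563]])

# Recall per class = correct predictions / all actual members of that class,
# i.e. the diagonal divided by the row sums
recall = np.diag(cm) / cm.sum(axis=1)

print(f"Recall, healthy (0):   {recall[0]:.4f}")
print(f"Recall, high-risk (1): {recall[1]:.4f}")
```

The row sums, 18,765 and 619, match the `support` column of the classification report.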

When analyzing the high-risk loans, the recall is about 91%. Here a high-risk loan is the actual positive, and the remaining 9% of high-risk loans are false negatives: the model predicts them to be healthy. This part of the model may be dangerous for companies, because a higher recall is what they should strive for if their models are to help their customers properly. As mentioned, a healthy loan is very likely to be identified as healthy, with only about a 1% chance of being identified as high-risk. If a healthy loan is identified as high-risk, the lending company may lose some money on that one loan, but the customer is spared a possible loss. There might not be as much profit, if any, from a customer who believes a healthy loan is high-risk, but that customer may come back for another, possibly healthy, loan, which helps the lending company earn a profit while preventing a loss for the customer.

The bigger issue arises when a high-risk loan is predicted to be healthy. That misclassification could cost the customer more money, whereas there is little or no risk when a healthy loan is predicted to be high-risk. With about 91% recall for high-risk loans, the remaining 9% chance that a high-risk loan is classified as healthy is the more dangerous error for the borrower, and possibly for the lender. Even though a high-risk loan is likely to be classified correctly, a 9% chance of misclassification seems too high for the risk involved.
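
The two kinds of error discussed above can be put in absolute terms. This is a minimal sketch that unpacks the counts from the confusion matrix above; the variable names `missed_high_risk` and `flagged_healthy` are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix (rows = actual, columns = predicted)
cm = np.array([[18663, 102],
               [56, 563]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

missed_high_risk = fn   # high-risk loans predicted as healthy
flagged_healthy = fp    # healthy loans predicted as high-risk

# Share of each actual class that is misclassified
miss_rate = fn / (fn + tp)
false_alarm_rate = fp / (fp + tn)

print(f"High-risk loans predicted healthy: {missed_high_risk} ({miss_rate:.2%})")
print(f"Healthy loans predicted high-risk: {flagged_healthy} ({false_alarm_rate:.2%})")
```

Out of 619 high-risk loans in the test set, 56 would be treated as healthy, which is the error the discussion above identifies as the costlier one.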

---