In [34]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [37]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = 'Resources/lending_data.csv'
lending_data = pd.read_csv(file_path)

# Review the DataFrame
lending_data.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [40]:
# Separate the data into labels and features

# Separate the y variable, the labels
# Select the loan_status column on its own as the labels. For the features, it is
# simpler to drop loan_status from the DataFrame than to list every remaining column.

Y = lending_data['loan_status']

# Separate the X variable, the features
X = lending_data.drop(columns=['loan_status'])

In [41]:
# Review the y variable Series
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [42]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [51]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
# train_test_split shuffles the rows and holds out a test set (25% of the rows by default).
# The model is fit on the training set and then scored on the test set, which shows
# whether the features can predict loan_status (0 or 1) for loans the model has not seen.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1)
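
The split can be sanity-checked on a small example. The sketch below uses hypothetical toy data (`toy_X`, `toy_y`), not the lending data, to show the default 75/25 split sizes. It also shows the optional `stratify` argument, which the cell above does not use but which may help when one class is rare.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positive class (not the lending data)
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name="label")

# Default split: 75% train / 25% test; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=1)

# Optional: stratify keeps the class ratio the same in both subsets
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y
)

print(X_tr.shape, X_te.shape)      # row counts of each subset
print(ys_tr.mean(), ys_te.mean())  # share of positives in each stratified subset
```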

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `Y_train`).

In [70]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
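# Fixing random_state does not reduce randomness; it makes any random steps repeatable,
# so the results can be reproduced. It only matters for solvers that shuffle the data
# (sag, saga, liblinear); the default lbfgs solver does not use it.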
lrm = LogisticRegression(random_state=1)

# Fit the model using training data
# The fit method estimates the model coefficients from the training features and labels.
# Once the model is fitted, it has learned the relationships between the input features and the target
# labels and is ready to make predictions on new, unseen data.
fit_lrm = lrm.fit(X_train, Y_train)
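
One detail worth checking is what `fit` returns. The sketch below, on synthetic data from `make_classification` (an assumption, not the lending data), suggests that `fit` returns the same estimator object, so a name like `fit_lrm` is just another reference to `lrm`. All `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with 4 features (illustration only)
demo_X, demo_y = make_classification(n_samples=200, n_features=4, random_state=1)

demo_model = LogisticRegression(random_state=1)
demo_fitted = demo_model.fit(demo_X, demo_y)

# fit returns the estimator itself, so both names refer to one fitted model
print(demo_fitted is demo_model)

# The learned parameters: one coefficient per feature plus an intercept
print(demo_model.coef_.shape, demo_model.intercept_.shape)
```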

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [71]:
# Make a prediction using the testing data
# The predict method applies the learned coefficients from the model to X_test and generates predicted target labels.
# These predictions will help us evaluate the model's performance on new, unseen data.
predictions = fit_lrm.predict(X_test)
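
To see how the hard labels relate to the underlying probabilities, the sketch below compares `predict` with `predict_proba` on synthetic data (an assumption, not the lending data). For a binary logistic regression, the predicted label should correspond to a class-1 probability above 0.5. The names `sample_X`, `sample_y`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustration only)
sample_X, sample_y = make_classification(n_samples=300, n_features=4, random_state=1)
clf = LogisticRegression(random_state=1).fit(sample_X, sample_y)

# Hard 0/1 labels, as used in the notebook
hard_labels = clf.predict(sample_X)

# Class probabilities: column 0 is P(class 0), column 1 is P(class 1)
proba = clf.predict_proba(sample_X)

# Thresholding P(class 1) at 0.5 should reproduce the hard labels
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(hard_labels, labels_from_proba))
```

Working with probabilities instead of hard labels would allow a different threshold, for example to catch more high-risk loans at the cost of more false alarms.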

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [68]:
# Generate a confusion matrix for the model
matrix = confusion_matrix(Y_test, predictions)
print(matrix)

[[18663   102]
 [   56   563]]
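
As a reading aid for the matrix above: scikit-learn puts the true labels on the rows and the predicted labels on the columns. The minimal sketch below copies the printed values into an array (named `cm` here for illustration) and unpacks them into the four named counts.

```python
import numpy as np

# Values copied from the printed confusion matrix above
cm = np.array([[18663, 102],
               [56, 563]])

# Row = true label, column = predicted label, so flattening a
# binary matrix gives the order: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```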


In [72]:
# Print the classification report for the model
report = classification_report(Y_test, predictions)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The confusion matrix shows 18,663 true negatives, 102 false positives, 56 false negatives, and 563 true positives. Errors are rare relative to correct predictions, which is consistent with the overall accuracy of 0.99. The classification report supports this: for healthy loans (`0`), precision is 1.00 and recall is 0.99; for high-risk loans (`1`), precision is 0.85 and recall is 0.91. The model therefore predicts healthy loans almost perfectly and high-risk loans well, although about 15% of the loans it flags as high-risk are actually healthy and about 9% of truly high-risk loans are missed. This gives us confidence that the logistic regression model can identify both healthy and high-risk loans.
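
The report's figures for the high-risk class can be recomputed by hand from the confusion matrix counts. This is a small arithmetic sketch using the four counts above; the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 18663, 102, 56, 563

# Precision: of the loans predicted high-risk, the share that really are
precision_1 = tp / (tp + fp)

# Recall: of the truly high-risk loans, the share the model caught
recall_1 = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these match the `1` row and the accuracy line of the classification report.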

---