In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
customer_df = pd.read_csv(
    Path("../Resources/lending_data.csv")
)

# Review the DataFrame
display(customer_df.head())
display(customer_df.tail())

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
77531,19100.0,11.261,86600,0.65358,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1
77535,15600.0,9.742,72300,0.585062,9,2,42300,1


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = customer_df["loan_status"]

# Separate the X variable, the features
X = customer_df.drop(columns=["loan_status"])

In [4]:
# Review the y variable Series
y[:5]

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [5]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000
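
Before splitting, it can help to check how the labels are distributed, because a heavy imbalance between healthy and high-risk loans affects how the later scores should be read. The sketch below is only an illustration: `toy_df` is a hypothetical six-row stand-in for `customer_df`, not the real lending data.

```python
import pandas as pd

# Hypothetical stand-in for customer_df (illustrative rows only)
toy_df = pd.DataFrame({
    "loan_size": [10700.0, 8400.0, 9000.0, 19100.0, 17700.0, 10800.0],
    "loan_status": [0, 0, 0, 1, 1, 0],
})

# Same separation as in the cell above: labels vs. features
toy_y = toy_df["loan_status"]
toy_X = toy_df.drop(columns=["loan_status"])

# Absolute counts and relative shares of each label
label_counts = toy_y.value_counts()
label_shares = toy_y.value_counts(normalize=True)

print(label_counts)
print(label_shares)
```

Running the same two `value_counts` calls on the real `y` would show how rare the `1` (high-risk) label is in the full dataset.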


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
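
When `test_size` is not given, `train_test_split` holds out 25% of the rows for testing. With imbalanced labels, passing `stratify` keeps the label proportions roughly equal in both subsets. The following is a minimal sketch on synthetic data (the `demo_*` names and the 5% positive rate are assumptions for illustration, not properties of the lending dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the lending dataset)
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({"feature": rng.normal(size=1000)})
demo_y = pd.Series([1] * 50 + [0] * 950, name="label")

# Default split: 75% train / 25% test when test_size is omitted
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    demo_X, demo_y, random_state=1
)

# Stratified split: both subsets keep close to the 5% share of label 1
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    demo_X, demo_y, random_state=1, stratify=demo_y
)

print(len(Xa_train), len(Xa_test))
print(yb_train.mean(), yb_test.mean())
```

Whether stratification changes the results for this notebook is untested here; it is simply an option worth knowing about when one class is rare.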

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [11]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using training data
lr_model = classifier.fit(X_train, y_train)
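
The features here sit on very different scales (for example, `borrower_income` in the tens of thousands versus `debt_to_income` below 1), and the `lbfgs` solver can be sensitive to that. One possible variation, not used in this notebook, is to standardize the features inside a pipeline before fitting. The sketch below uses synthetic data from `make_classification`; the `Xs`/`ys` names and the 90/10 class weights are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with seven features and a rare positive class
Xs, ys = make_classification(
    n_samples=500, n_features=7, weights=[0.9, 0.1], random_state=1
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, random_state=1, stratify=ys
)

# The scaler is fit on the training data only, then applied before the classifier
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", random_state=1),
)
scaled_model.fit(Xs_train, ys_train)

scaled_predictions = scaled_model.predict(Xs_test)
print(scaled_predictions[:10])
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters.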

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [13]:
# Make a prediction using the testing data
predictions = lr_model.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test}).tail(10)

Unnamed: 0,Prediction,Actual
73999,0,0
47267,0,0
35950,0,0
42373,0,0
38631,0,0
45639,0,0
11301,0,0
51614,0,0
4598,0,0
2793,0,0


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [15]:
# Generate a confusion matrix for the model
confusion_matrix(y_test, predictions)


array([[18663,   102],
       [   56,   563]], dtype=int64)
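
The per-class precision and recall can be recomputed by hand from the confusion matrix above, which makes it clearer what each score measures. In scikit-learn's layout, rows are actual labels and columns are predicted labels, so for label `1` the matrix unpacks into true negatives, false positives, false negatives, and true positives. This is a small arithmetic sketch using only the four counts shown above.

```python
import numpy as np

# Confusion matrix from the output above (rows: actual, columns: predicted)
cm = np.array([[18663, 102],
               [56, 563]])

# Unpack with label 1 (high-risk loan) treated as the positive class
tn, fp, fn, tp = cm.ravel()

# Label 1 (high-risk loan)
precision_1 = tp / (tp + fp)   # share of predicted high-risk loans that were high-risk
recall_1 = tp / (tp + fn)      # share of actual high-risk loans that were caught

# Label 0 (healthy loan)
precision_0 = tn / (tn + fn)   # share of predicted healthy loans that were healthy
recall_0 = tn / (tn + fp)      # share of actual healthy loans that were identified

accuracy = (tp + tn) / cm.sum()

print(f"label 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"label 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Rounded to two decimals, these values agree with the classification report: 1.00 and 0.99 for label `0`, 0.85 and 0.91 for label `1`, and 0.99 overall accuracy.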

In [18]:
# Print the classification report for the model
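# Note: target_names labels the two classes of this single model, in label order:
# "Machine Learning Model 1:" is label 0 (healthy loan), "Machine Learning Model 2:" is label 1 (high-risk loan)
target_names = ['Machine Learning Model 1:', 'Machine Learning Model 2:']
print(classification_report(y_test, predictions, target_names=target_names))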

                           precision    recall  f1-score   support

Machine Learning Model 1:       1.00      0.99      1.00     18765
Machine Learning Model 2:       0.85      0.91      0.88       619

                 accuracy                           0.99     19384
                macro avg       0.92      0.95      0.94     19384
             weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

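**Answer:** According to the classification report, the model predicts healthy loans (`0`) almost perfectly, with a precision of 1.00 and a recall of 0.99. For high-risk loans (`1`), precision drops to 0.85, which seems a bit low: about 15% of the loans flagged as high-risk were actually healthy. Recall for high-risk loans is better at 0.91, meaning the model caught 91% of the loans that really were high-risk.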

---

## Module 12 Report

### Overview of the Analysis

The purpose of this analysis was to assess a borrower's creditworthiness by predicting loan risk from their lending activity. Each loan is labeled `0` (healthy loan) or `1` (high-risk loan) in the `loan_status` column.

The data included the following features: loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, and total debt. To begin, the lending dataset was reviewed, separated into labels and features, and split into training and testing data. A logistic regression model from scikit-learn was then fit on the training data, and predictions were made on the testing data.

### Results

Accuracy, precision, and recall scores for the logistic regression model, taken from the classification report. "Machine Learning Model 1" and "Machine Learning Model 2" are the names given to the two classes in the report, not two separate models. Overall accuracy was 99%.

* Machine Learning Model 1 (label `0`, healthy loans):
  * Precision, recall, and f1-score were all high. Precision was 100% (rounded), so nearly every loan predicted as healthy was healthy. Recall was 99%, so 99% of the loans that were actually healthy were classified as healthy.
* Machine Learning Model 2 (label `1`, high-risk loans):
  * Scores were lower for high-risk loans. Precision was 85%, so of all the loans classified as high-risk, only 85% were actually risky. Recall was 91%, so 91% of the loans that were actually high-risk were correctly classified as high-risk.
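
Balanced accuracy, the mean of the per-class recalls, is not computed in the notebook itself. One way to obtain it after the fact is to rebuild label arrays from the four confusion-matrix counts and pass them to scikit-learn's `balanced_accuracy_score`. This is a sketch: the `y_true_rebuilt` and `y_pred_rebuilt` arrays are reconstructions for illustration, not the original `y_test` and `predictions`, although they produce the same confusion matrix.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Counts from the confusion matrix: [[18663, 102], [56, 563]]
tn, fp, fn, tp = 18663, 102, 56, 563

# Rebuild label arrays that reproduce those counts
y_true_rebuilt = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_rebuilt = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Balanced accuracy = average of recall on label 0 and recall on label 1
balanced_acc = balanced_accuracy_score(y_true_rebuilt, y_pred_rebuilt)
print(f"balanced accuracy: {balanced_acc:.3f}")
```

The result is close to 0.95, which matches the macro-average recall in the classification report and is noticeably lower than the 0.99 plain accuracy, because the rare high-risk class counts equally here.
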

### Summary

Recommendation based on the results of the machine learning model:

The scores for Machine Learning Model 1 (healthy loans) are higher than those for Machine Learning Model 2 (high-risk loans), so the model is more reliable at identifying healthy loans. Which of these scores matters most depends on the end goal.

I would not recommend the model as it stands. Predicting healthy loans well is less valuable than catching high-risk loans, and the high-risk scores are the weaker ones. Recall for high-risk loans is 91%, so about 9% of the high-risk loans in the test set (56 of 619) were missed as false negatives. I consider false negatives the more important error, so I would have recommended the model had its high-risk recall been stronger.
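
If missed high-risk loans are the main concern, one option that could be explored (it is not part of this analysis) is lowering the probability cutoff used to flag a loan as high-risk. `predict_proba` returns the estimated probability of each class, and flagging loans above a lower cutoff than the usual 0.5 can only keep or raise recall, typically at the cost of precision. The sketch below uses synthetic data; the data, the `threshold_model` name, and the 0.3 cutoff are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a rare positive class
Xt, yt = make_classification(
    n_samples=2000, n_features=7, weights=[0.95, 0.05], random_state=1
)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    Xt, yt, random_state=1, stratify=yt
)

threshold_model = LogisticRegression(solver="lbfgs", random_state=1, max_iter=1000)
threshold_model.fit(Xt_train, yt_train)

# Estimated probability of label 1 for each test row
risk_scores = threshold_model.predict_proba(Xt_test)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    # Flag a row as label 1 when its estimated risk reaches the cutoff
    flagged = (risk_scores >= threshold).astype(int)
    recalls[threshold] = recall_score(yt_test, flagged)
    precision = precision_score(yt_test, flagged, zero_division=0)
    print(f"cutoff={threshold}: precision={precision:.2f}, recall={recalls[threshold]:.2f}")
```

How far to lower the cutoff would depend on the relative cost of a missed high-risk loan versus a healthy loan flagged by mistake.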


