# Credit Risk Analysis Report

## Overview
The purpose of this analysis is to evaluate the performance of a logistic regression model in predicting credit risk. The model is trained on historical loan data to classify loans as either healthy or high-risk based on various features.

## Model Performance
- **Accuracy:** 99%
- **Precision (Healthy Loan):** 1.00
- **Precision (High-Risk Loan):** 0.86
- **Recall (Healthy Loan):** 1.00
- **Recall (High-Risk Loan):** 0.91

## Summary
The logistic regression model performs well on the held-out test set. Its accuracy is 99%, but accuracy alone is a weak signal here because about 97% of the test loans are healthy (15,001 of 15,508). The precision for healthy loans rounds to 1.00: when the model predicts that a loan is healthy, it is almost always right. The precision for high-risk loans is 0.86, so roughly one in seven loans flagged as high-risk is actually healthy. Recall rounds to 1.00 for healthy loans and is 0.91 for high-risk loans, so the model catches about nine in ten of the loans that are actually high-risk.

Given its high accuracy and strong performance in precision and recall, I highly recommend deploying this logistic regression model for credit risk assessment within the company. It can assist in identifying potentially risky loans early, allowing the company to take appropriate actions to mitigate the associated risks. Additionally, the model's interpretability makes it valuable for explaining the rationale behind loan decisions to stakeholders. However, it's essential to monitor the model's performance regularly and recalibrate it as needed to ensure its continued effectiveness.



In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
lending_data_df = pd.read_csv(Path("Resources/lending_data.csv"))

# Review the DataFrame
lending_data_df

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77531 | 19100.0 | 11.261 | 86600 | 0.653580 | 12 | 2 | 56600 | 1 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 | 1 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 | 1 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 | 1 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`x`) DataFrame from the remaining columns.

In [4]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data_df["loan_status"]

# Separate the x variable, the features
x = lending_data_df.drop(columns="loan_status")

In [8]:
# Review the y variable Series
y.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [9]:
# Review the x variable DataFrame
x.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [11]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
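
The split above does not pass `stratify`, so the share of high-risk loans in each subset is left to chance. As an optional variation, not something this notebook does, the sketch below shows how the `stratify` argument of `train_test_split` keeps the class proportions equal in both subsets. The data is a synthetic stand-in, and the names `features`, `labels`, and the `demo_` variables are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data (an assumption, not the real file):
# 1,000 rows with 3 numeric features, 3% of them labeled high-risk (1).
rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 3))
labels = np.zeros(1000, dtype=int)
labels[:30] = 1

# stratify=labels asks train_test_split to preserve the class proportions
# of `labels` in both the training and the testing subsets.
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels
)

train_rate = demo_y_train.mean()
test_rate = demo_y_test.mean()
print(f"high-risk share: train={train_rate:.3f}, test={test_rate:.3f}")
```

With a minority class as small as the high-risk loans here, a stratified split makes the test metrics less sensitive to which rows happen to land in the test set.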

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`x_train` and `y_train`).

In [13]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
my_classifier = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using training data
my_lr_model = my_classifier.fit(x_train, y_train)

### Step 2: Save the predictions for the testing data labels by using the testing feature data (`x_test`) and the fitted model.

In [14]:
# Make a prediction using the testing data
y_pred = my_classifier.predict(x_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [16]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


Confusion Matrix:
 [[14926    75]
 [   46   461]]
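
The matrix can be read cell by cell: scikit-learn puts the actual labels on the rows and the predicted labels on the columns, so with `1` (high-risk) as the positive class the layout is `[[TN, FP], [FN, TP]]`. The sketch below copies the four counts by hand, rather than reading them from `conf_matrix`, and derives the unrounded accuracy, precision, and recall from them.

```python
# Counts copied from the confusion matrix printed above
# (rows are actual labels, columns are predicted labels).
tn, fp = 14926, 75   # actual healthy: predicted healthy, predicted high-risk
fn, tp = 46, 461     # actual high-risk: predicted healthy, predicted high-risk

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# High-risk (label 1) treated as the positive class.
precision_high_risk = tp / (tp + fp)
recall_high_risk = tp / (tp + fn)

# Healthy (label 0) treated as the positive class.
precision_healthy = tn / (tn + fn)
recall_healthy = tn / (tn + fp)

print(f"accuracy:             {accuracy:.4f}")
print(f"precision, healthy:   {precision_healthy:.4f}")
print(f"recall, healthy:      {recall_healthy:.4f}")
print(f"precision, high-risk: {precision_high_risk:.4f}")
print(f"recall, high-risk:    {recall_high_risk:.4f}")
```

Printing four decimals shows that the healthy-loan scores are slightly below 1; they only appear perfect once rounded to two decimals.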


In [18]:
# Generate a classification report for the model
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508
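
The `f1-score`, `macro avg`, and `weighted avg` rows can be reproduced by hand. The sketch below is an illustration rather than the code path `classification_report` uses internally: it takes the per-class counts from the confusion matrix, computes each F1 score as the harmonic mean of precision and recall, and then averages them two ways, unweighted (macro) and weighted by support.

```python
# Per-class counts taken from the confusion matrix above.
# Each entry: (true positives, false positives, false negatives) for that label.
counts = {
    0: (14926, 46, 75),  # healthy loans
    1: (461, 75, 46),    # high-risk loans
}

f1 = {}
support = {}
for label, (tp, fp, fn) in counts.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1[label] = 2 * precision * recall / (precision + recall)
    # Support is the number of actual samples with this label.
    support[label] = tp + fn

total = sum(support.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1[label] * support[label] for label in f1) / total

print(f"f1 (healthy):   {f1[0]:.2f}")
print(f"f1 (high-risk): {f1[1]:.2f}")
print(f"macro avg f1:   {macro_f1:.2f}")
print(f"weighted avg:   {weighted_f1:.2f}")
```

The weighted average sits close to the healthy-loan score because 15,001 of the 15,508 test loans are healthy; the macro average gives the smaller high-risk class equal weight and is therefore lower.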



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

Based on the provided classification report:

### For the "healthy loan" label (0):
- **Precision:** 1.00
  - When the model predicts a loan as healthy, it is correct about 99.7% of the time (14,926 of 14,972 predictions), which rounds to 1.00.
- **Recall:** 1.00
  - The model correctly identifies about 99.5% of the actual healthy loans (14,926 of 15,001), which rounds to 1.00.
- **F1-score:** 1.00
  - The harmonic mean of precision and recall, providing a balanced measure of model performance.

### For the "high-risk loan" label (1):
- **Precision:** 0.86
  - Indicates that when the model predicts a loan as high-risk, it is correct 86% of the time.
- **Recall:** 0.91
  - Suggests that the model correctly identifies 91% of the actual high-risk loans.
- **F1-score:** 0.88
  - Provides a balance between precision and recall for the high-risk loan class.

### Overall:
- The logistic regression model predicts healthy loans almost perfectly and high-risk loans well, but less reliably.
- It achieves 99% accuracy, though the test set is heavily imbalanced (15,001 healthy loans versus 507 high-risk loans), so the per-class scores are more informative than accuracy.
- The high-risk class is where the errors matter: a precision of 0.86 means 75 healthy loans were flagged as high-risk, and a recall of 0.91 means 46 high-risk loans were missed.
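
To put the 99% accuracy in context, it helps to compare it with a trivial baseline. The sketch below uses the support counts from the classification report and assumes a hypothetical classifier that labels every loan as healthy; no such model is trained in this notebook.

```python
# Test-set class counts from the "support" column of the classification report.
healthy, high_risk = 15001, 507
total = healthy + high_risk

# Hypothetical baseline: label every loan as healthy. It gets every healthy
# loan right and every high-risk loan wrong.
baseline_accuracy = healthy / total
baseline_recall_high_risk = 0 / high_risk

# The fitted model's accuracy, from the diagonal of the confusion matrix.
model_accuracy = (14926 + 461) / total

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
print(f"baseline recall on high-risk loans: {baseline_recall_high_risk:.2f}")
```

The baseline already reaches roughly 97% accuracy while catching no high-risk loans at all, which is why the high-risk recall of 0.91 says more about the model than the accuracy figure does.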


---