In [35]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.decomposition import PCA

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [24]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path("../Credit_Risk/lending_data.csv")
df_lending_data = pd.read_csv(file_path)

# Review the DataFrame
df_lending_data.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [25]:
# Separate the data into labels and features
y = df_lending_data["loan_status"]

# Separate the X variable (the features) - drop the target column
X = df_lending_data.drop(columns=["loan_status"])

In [26]:
# Review the y variable
print("\nReview of the y variable (labels):")
y.head()


Review of the y variable (labels):


0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [27]:
# Review the X variable
print("\nReview of the X variable (features):")
X.head()


Review of the X variable (features):


| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [28]:
from sklearn.model_selection import train_test_split

# X (features) and y (labels) were created in Step 2 and are reused here.
# Hold out 20% of the rows for testing; random_state=1 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Review the shapes of the training and testing sets to confirm the split worked
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (62028, 7)
X_test shape: (15508, 7)
y_train shape: (62028,)
y_test shape: (15508,)
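
The split above is purely random. Because high-risk loans are rare in this dataset, one optional variation is to pass `stratify=y` so that the training and testing sets keep the same class ratio. The sketch below is not what the notebook ran; it uses a small made-up label array (970 zeros, 30 ones) and a placeholder feature column purely to illustrate the parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 970 healthy loans (0) and 30 high-risk loans (1).
y = np.array([0] * 970 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder single feature column

# stratify=y asks scikit-learn to preserve the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(f"Share of class 1 overall:  {y.mean():.3f}")
print(f"Share of class 1 in train: {y_train.mean():.3f}")
print(f"Share of class 1 in test:  {y_test.mean():.3f}")
```

Applying this to the lending data would change which rows land in each set, so the metrics reported later in this notebook would shift slightly.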


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [29]:
# Step 1: Import the LogisticRegression module
from sklearn.linear_model import LogisticRegression

# Step 2: Instantiate the Logistic Regression model with random_state=1
model = LogisticRegression(random_state=1)

# Step 3: Fit the model using the training data
model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [31]:
# Step 2: Use the model to make predictions on the test data
y_pred = model.predict(X_test)

# Step 3: Generate the confusion matrix by comparing y_test and y_pred
cm = confusion_matrix(y_test, y_pred)

# Step 4: Print the confusion matrix
print("Confusion Matrix:")
cm


Confusion Matrix:


array([[14924,    77],
       [   31,   476]], dtype=int64)
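
To make the matrix easier to read, the following sketch derives per-class precision, recall, and F1 directly from the four counts above. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the helper name `prf` is introduced here for illustration only.

```python
# Counts copied from the confusion matrix above.
# Rows are actual classes, columns are predicted classes.
tn, fp = 14924, 77   # actual 0: predicted 0, predicted 1
fn, tp = 31, 476     # actual 1: predicted 0, predicted 1


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for a single class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (high-risk): a false positive is a healthy loan flagged as high-risk.
p1, r1, f1_1 = prf(tp, fp, fn)

# Class 0 (healthy): the roles swap, so a "false positive" for class 0
# is a high-risk loan that was predicted healthy.
p0, r0, f1_0 = prf(tn, fn, fp)

accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"Class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```

Read this way, the 77 in the top-right cell is what holds class 1 precision down, and the 31 in the bottom-left cell is what holds class 1 recall down.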

In [32]:

# Print the classification report by comparing y_test and y_pred
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.86      0.94      0.90       507

    accuracy                           0.99     15508
   macro avg       0.93      0.97      0.95     15508
weighted avg       0.99      0.99      0.99     15508
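
The `macro avg` and `weighted avg` rows differ because of class imbalance. As a rough sketch, using the precision column as the example: the macro average treats both classes equally, while the weighted average scales each class by its support. The counts come from the confusion matrix and the support column shown in this notebook; the variable names are illustrative.

```python
# Confusion-matrix counts and per-class support from the report above.
tn, fp, fn, tp = 14924, 77, 31, 476
support = {0: 15001, 1: 507}

# Unrounded per-class precision.
precision = {
    0: tn / (tn + fn),  # predicted 0 and actually 0
    1: tp / (tp + fp),  # predicted 1 and actually 1
}

# Macro average: plain mean, every class counts the same.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
total_support = sum(support.values())
weighted_precision = sum(precision[c] * support[c] for c in support) / total_support

print(f"Macro avg precision:    {macro_precision:.2f}")
print(f"Weighted avg precision: {weighted_precision:.2f}")
```

Since class 0 makes up most of the test set, the weighted average sits close to the class 0 score, while the macro average is pulled down by class 1.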



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

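**Answer:** The model predicts healthy loans (`0`) almost perfectly, with a precision of 1.00 and a recall of 0.99. It is weaker, though still strong, on high-risk loans (`1`): recall is 0.94 (476 of 507 high-risk loans were caught and 31 were missed), and precision is 0.86 (77 healthy loans were incorrectly flagged as high-risk).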

---

# credit-risk-classification

#### Overview of the Analysis

The purpose of this analysis was to build a machine learning model that predicts loan default risk from historical lending data. In financial contexts, identifying high-risk loans is critical to minimizing losses and making informed lending decisions.

The dataset included features such as loan size, interest rate, borrower income, and total debt. The target variable was `loan_status`, which marks a loan as healthy (`0`) or high-risk (`1`). This report refers to these two classes as repaid and defaulted loans, respectively. Our goal was to develop a predictive model that classifies loans into these two categories.

The machine learning process involved several key stages:

* Data Preprocessing: Splitting the dataset into a feature matrix (`X`) and a target variable (`y`).
* Data Splitting: Using `train_test_split` to divide the data into 80% training and 20% testing sets.
* Model Selection: We chose Logistic Regression, a straightforward and efficient algorithm for binary classification.
* Model Training: The logistic regression model was fit on the training data.
* Model Evaluation: Predictions were made on the test data, and we generated a confusion matrix and classification report to assess the model’s performance.

#### Results

Below is the performance of the Logistic Regression model on the test set, based on the classification report.

**Machine Learning Model 1: Logistic Regression**

* **Accuracy:** 0.99. This metric measures the overall correctness of the model’s predictions. The model correctly predicted 99% of all loan statuses.
* **Precision:**
  * Class 0 (loans repaid): 1.00 after rounding. Nearly all loans predicted to be repaid were repaid; 31 of the 14,955 loans predicted as class 0 were actual defaults.
  * Class 1 (loans defaulted): 0.86. Of all loans predicted to default, 86% did. The other 14% (77 loans) were reliable borrowers who were flagged incorrectly.
* **Recall:**
  * Class 0 (loans repaid): 0.99. The model identified 99% of the loans that were repaid.
  * Class 1 (loans defaulted): 0.94. The model identified 94% of the loans that defaulted, meaning it missed 6% of the actual defaults (31 loans).
* **F1-score:** The F1-score, which balances precision and recall, was 1.00 for loans repaid and 0.90 for loans defaulted. This shows strong overall performance, with some room for improvement in identifying defaulters more accurately.
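
One way to put the accuracy figure in context is to compare it with a trivial baseline. The sketch below reuses the test-set counts from the notebook's confusion matrix and computes the accuracy of a hypothetical classifier that labels every loan as class 0, along with balanced accuracy (the mean of the two per-class recalls). The baseline is an illustration, not a model that was trained in this analysis.

```python
# Test-set counts taken from the confusion matrix reported in the notebook.
tn, fp = 14924, 77   # actual class 0: predicted 0, predicted 1
fn, tp = 31, 476     # actual class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

model_accuracy = (tn + tp) / total

# Hypothetical baseline: predict class 0 for every loan.
# It gets every actual class 0 loan right and every class 1 loan wrong.
baseline_accuracy = (tn + fp) / total

# Balanced accuracy: unweighted mean of the per-class recalls.
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"Model accuracy:        {model_accuracy:.3f}")
print(f"All-zeros baseline:    {baseline_accuracy:.3f}")
print(f"Balanced accuracy:     {balanced_accuracy:.3f}")
```

Because only 507 of the 15,508 test loans are class 1, the all-zeros baseline already scores about 0.967, so the per-class recall and precision above say more about the model than accuracy alone.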

#### Summary

The Logistic Regression model performed exceptionally well, achieving 99% accuracy with strong precision and recall scores for both classes. Below is a summary of its strengths and considerations:

##### Which metric is most important?
For a loan prediction model, recall for Class 1 (loans defaulted) is crucial. Missing a potential defaulter (false negative) could result in financial loss for the lender. In this case, the recall for loans defaulted was 0.94, which is quite high.

##### Is precision for defaults important?
While precision for Class 1 (0.86) indicates some false positives (loans incorrectly predicted to default), this is less concerning than missing actual defaulters. The lender might reject a few reliable borrowers, but this trade-off is often preferable to approving risky loans.

#### Recommendation
The Logistic Regression model performs very well, with high recall and accuracy. If further improvement is needed, especially in precision for defaulters, more advanced models such as Random Forest or XGBoost are worth exploring. Adjusting the classification threshold could also shift the balance between precision and recall.

Given the high accuracy (0.99) and strong recall for defaulters (0.94), this model is recommended for deployment. It provides reliable predictions for managing credit risk and minimizing loan defaults.
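
As a rough illustration of the threshold adjustment mentioned above, the sketch below fits a logistic regression on synthetic, imbalanced data and compares precision and recall for class 1 at three cutoffs. The dataset from `make_classification`, the `max_iter` setting, and the cutoff values are all assumptions chosen for the example; none of them come from the lending data or the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the lending data (illustration only).
X, y = make_classification(
    n_samples=5000, n_features=7, weights=[0.95, 0.05], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

model = LogisticRegression(random_state=1, max_iter=1000)
model.fit(X_train, y_train)

# Estimated probability of class 1 for each test row.
# model.predict() corresponds to a cutoff of 0.5 on this probability.
proba = model.predict_proba(X_test)[:, 1]

results = []
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the cutoff flags more loans as high-risk, so recall for class 1 can only stay the same or rise, usually at the expense of precision. Which cutoff is appropriate depends on how the lender weighs a missed default against a rejected reliable borrower.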
