In [2]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
# Read the CSV file from the Resources folder into a Pandas DataFrame

df_cr_data = pd.read_csv(
    "Resources/lending_data.csv",
    index_col="loan_size")

# Display sample data
df_cr_data.head(10)



loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
10700.0,7.672,52800,0.431818,5,1,22800,0
8400.0,6.692,43600,0.311927,3,0,13600,0
9000.0,6.963,46100,0.349241,3,0,16100,0
10700.0,7.664,52700,0.43074,5,1,22700,0
10800.0,7.698,53000,0.433962,5,1,23000,0
10100.0,7.438,50600,0.407115,4,1,20600,0
10300.0,7.49,51100,0.412916,4,1,21100,0
8800.0,6.857,45100,0.334812,3,0,15100,0
9300.0,7.096,47400,0.367089,3,0,17400,0
9700.0,7.248,48800,0.385246,4,0,18800,0


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [4]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_cr_data['loan_status']

# Separate the X variable, the features
X = df_cr_data.drop(columns='loan_status')

# Preview the features and labels
print("Features (X):")
print(X.head())

print("\nLabels (y):")
print(y.head())

Features (X):
           interest_rate  borrower_income  debt_to_income  num_of_accounts  \
loan_size                                                                    
10700.0            7.672            52800        0.431818                5   
8400.0             6.692            43600        0.311927                3   
9000.0             6.963            46100        0.349241                3   
10700.0            7.664            52700        0.430740                5   
10800.0            7.698            53000        0.433962                5   

           derogatory_marks  total_debt  
loan_size                                
10700.0                   1       22800  
8400.0                    0       13600  
9000.0                    0       16100  
10700.0                   1       22700  
10800.0                   1       23000  

Labels (y):
loan_size
10700.0    0
8400.0     0
9000.0     0
10700.0    0
10800.0    0
Name: loan_status, dtype: int64


In [5]:
# Review the y variable Series
y

loan_size
10700.0    0
8400.0     0
9000.0     0
10700.0    0
10800.0    0
          ..
19100.0    1
17700.0    1
17600.0    1
16300.0    1
15600.0    1
Name: loan_status, Length: 77536, dtype: int64

In [6]:
# Review the X variable DataFrame
X

loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
10700.0,7.672,52800,0.431818,5,1,22800
8400.0,6.692,43600,0.311927,3,0,13600
9000.0,6.963,46100,0.349241,3,0,16100
10700.0,7.664,52700,0.430740,5,1,22700
10800.0,7.698,53000,0.433962,5,1,23000
...,...,...,...,...,...,...
19100.0,11.261,86600,0.653580,12,2,56600
17700.0,10.662,80900,0.629172,11,2,50900
17600.0,10.595,80300,0.626401,11,2,50300
16300.0,10.068,75300,0.601594,10,2,45300


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)


# Display the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (62028, 6)
X_test shape: (15508, 6)
y_train shape: (62028,)
y_test shape: (15508,)
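The split above does not pass `stratify`, so the share of high-risk loans in the training and testing sets is left to chance. As a minimal sketch of the alternative, the following uses synthetic stand-in arrays (`X_demo` and `y_demo` are assumptions, sized to match this dataset's 77,536 rows and 2,500 positive labels) to show how `stratify` keeps the class proportions aligned across both sets. It is an illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same row count and class counts
# as the notebook's dataset, but random feature values.
rng = np.random.default_rng(1)
y_demo = np.zeros(77536, dtype=int)
y_demo[:2500] = 1
X_demo = rng.normal(size=(77536, 6))

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("Train positive rate:", round(y_tr.mean(), 4))
print("Test positive rate:", round(y_te.mean(), 4))
```

With a minority class this small, a stratified split can make evaluation metrics for the `1` label a little more stable from one random seed to the next.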


---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Step 1: Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Step 2: Fit the model using training data
logistic_model.fit(X_train, y_train)

# Display the model coefficients and intercept
print("Model coefficients:", logistic_model.coef_)
print("Model intercept:", logistic_model.intercept_)

Model coefficients: [[-1.12072803e-07 -3.88233073e-04 -2.54756267e-09  1.61200224e-07
   5.41639386e-08  6.42005683e-04]]
Model intercept: [-3.43412919e-08]
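The coefficients above span several orders of magnitude, which likely reflects that the features are on very different scales (for example, `borrower_income` is in the tens of thousands while `debt_to_income` is below 1) and that the model was fit without scaling. One common option, shown here only as a sketch on synthetic data, is to chain a `StandardScaler` and the classifier in a pipeline. The arrays and the labelling rule below are invented for illustration and are not the lending data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (assumption): one small-scale, one large-scale
rng = np.random.default_rng(1)
n = 2000
rate = rng.normal(7.5, 1.0, size=n)
income = rng.normal(50000, 8000, size=n)
X_demo = np.column_stack([rate, income])

# Hypothetical labelling rule that depends on both features plus noise
signal = (rate - 7.5) + (income - 50000) / 8000 + rng.normal(0, 0.5, size=n)
y_demo = (signal > 1.5).astype(int)

# The scaler is fit on the data passed to fit(), then applied before the model
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
scaled_model.fit(X_demo, y_demo)

coefs = scaled_model.named_steps["logisticregression"].coef_
print("Coefficients on standardized features:", coefs)
```

Because both features are standardized before fitting, the resulting coefficients are expressed per standard deviation and can be compared with each other more directly.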


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
y_pred = logistic_model.predict(X_test)

# Display the predictions
print("Predicted labels for the testing data:")
print(y_pred)


Predicted labels for the testing data:
[0 0 0 ... 0 0 0]
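For a binary `LogisticRegression`, `predict` returns the class with the higher estimated probability, which amounts to a 0.5 cutoff on `predict_proba`. The sketch below checks this on a synthetic imbalanced dataset generated with `make_classification` (the data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration, not part of this notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (assumption), roughly 90% class 0 and 10% class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=1
)

demo_model = LogisticRegression(random_state=1).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
proba_1 = demo_model.predict_proba(X_demo)[:, 1]
labels_from_proba = (proba_1 > 0.5).astype(int)
labels_from_predict = demo_model.predict(X_demo)

print("Same labels:", np.array_equal(labels_from_proba, labels_from_predict))
```

Keeping the probabilities around makes it possible to experiment with a different cutoff later if catching more high-risk loans matters more than avoiding false alarms.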


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Generate a confusion matrix for the model
confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat)

Confusion Matrix:
[[14926    75]
 [   46   461]]
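In scikit-learn's layout, rows are the actual labels and columns are the predicted labels, so the matrix above reads as 14,926 true negatives, 75 false positives, 46 false negatives, and 461 true positives. As a quick check, the per-class metrics can be recomputed by hand from those four numbers:

```python
import numpy as np

# Confusion matrix values copied from the output above
cm = np.array([[14926, 75],
               [46, 461]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of loans flagged high-risk, share truly high-risk
recall_1 = tp / (tp + fn)      # of truly high-risk loans, share that were flagged
precision_0 = tn / (tn + fn)   # of loans predicted healthy, share truly healthy
recall_0 = tn / (tn + fp)      # of truly healthy loans, share predicted healthy
accuracy = (tp + tn) / cm.sum()

print(f"Class 1 precision: {precision_1:.3f}, recall: {recall_1:.3f}")
print(f"Class 0 precision: {precision_0:.3f}, recall: {recall_0:.3f}")
print(f"Accuracy: {accuracy:.3f}")
```

Printing three decimals shows that the class `0` scores are very close to, but not exactly, 1.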


In [11]:
# Print the classification report for the model
class_report = classification_report(y_test, y_pred)

print("\nClassification Report:")
print(class_report)


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model predicts healthy loans (0) almost perfectly: precision, recall, and F1-score all round to 1.00, although the confusion matrix shows 75 healthy loans misclassified as high-risk.
For high-risk loans (1), precision is 0.86 and recall is 0.91, so the model is somewhat less effective on this class: it missed 46 of the 507 high-risk loans in the test set.


## Step 5 - Overview of the Analysis



In this section, describe the analysis you completed for the machine learning models used in this Challenge. This might include:

* Explain the purpose of the analysis:

    ANS: The purpose is to build a machine learning model that predicts loan risk. Specifically, the model aims to distinguish between healthy loans (0) and high-risk loans (1) to help financial institutions assess the risk associated with each loan application and make informed decisions.
* Explain what financial information the data was on, and what you needed to predict.

    ANS: The dataset contains financial information related to borrowers and their loan details. Key features include:
    - interest_rate: The interest rate of the loan.
    - borrower_income: The annual income of the borrower.
    - debt_to_income: The ratio of the borrower’s total monthly debt payments to their monthly gross income.
    - num_of_accounts: The number of credit accounts the borrower has.
    - derogatory_marks: The number of derogatory marks on the borrower’s credit history.
    - total_debt: The total debt of the borrower.
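    - loan_size: The size of the loan being applied for. In this notebook it is read in as the DataFrame index, so it is not one of the six feature columns passed to the model.

    The target variable we need to predict is loan_status, which indicates whether a loan is a healthy loan (0) or a high-risk loan (1).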

* Provide basic information about the variables you were trying to predict (e.g., `value_counts`).

    ANS: The variable to predict is loan_status:
    - 0: Healthy loan, where the borrower is likely to repay without issues.
    - 1: High-risk loan, where there is a higher chance of default.

    Value counts of loan_status:
    - 0: 75036 (majority class)
    - 1: 2500 (minority class)



* Describe the stages of the machine learning process you went through as part of this analysis.

1. Data Preprocessing:
    - Loaded `lending_data.csv` into a DataFrame with `loan_size` as the index. No feature scaling was applied in this notebook.
    - Separated the data into features (X) and labels (y).
    - Split the data into training and testing datasets using train_test_split.
2. Model Training:
    - A logistic regression model was instantiated and trained using the training dataset (X_train and y_train).
3. Model Evaluation:   
    - The model’s performance was evaluated using the test dataset (X_test and y_test).
    - Key evaluation metrics such as accuracy, precision, recall, and F1-score were calculated.
    - Confusion matrix and classification report were generated to assess the model’s predictive capabilities.
4. Model Interpretation:
    - Analyzed the confusion matrix and classification report to understand how well the model predicts both healthy and high-risk loans.


* Briefly touch on any methods you used (e.g., `LogisticRegression`, or any other algorithms).

Logistic Regression:

This algorithm was chosen for its simplicity and interpretability, making it a good baseline model for binary classification problems like this one.

## Results

Using bulleted lists, describe the accuracy scores and the precision and recall scores of all machine learning models.
* Machine Learning Model 1: Logistic Regression
    * Accuracy Score:
        - The model achieved an overall accuracy of 99%.
    * Precision and Recall Scores:
        - Healthy Loan (0):
            - Precision: 1.00 after rounding (14,926 of the 14,972 loans predicted as healthy were actually healthy, about 99.7%)
            - Recall: 1.00 after rounding (14,926 of the 15,001 actual healthy loans were correctly predicted, about 99.5%)
            - F1-Score: 1.00
        - High-Risk Loan (1):
            - Precision: 0.86 (86% of loans predicted as high-risk were actually high-risk)
            - Recall: 0.91 (91% of actual high-risk loans were correctly predicted)
            - F1-Score: 0.88

## Summary

Summarize the results of the machine learning models, and include a recommendation on the model to use, if any. For example:

* Which one seems to perform best? How do you know it performs best?

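    ANS: Only one model was trained (logistic regression on the original data), so there is no alternative to compare it against. It performed very well overall, with 99% accuracy and precision and recall for healthy loans that round to 1.00, which suggests it is reliable at identifying low-risk borrowers. It was weaker on high-risk loans, with precision of 0.86 and recall of 0.91.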

* Does performance depend on the problem we are trying to solve? (For example, is it more important to predict the `1`'s, or predict the `0`'s? )

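    ANS: Yes. For a lender, missing a high-risk loan (a false negative) is arguably more costly than flagging a healthy loan for review, so recall on the `1` class matters most. Here that recall is 0.91, meaning 46 of the 507 high-risk loans in the test set were missed. There is also a significant class imbalance, with far more healthy loans than high-risk loans, which can inflate the accuracy score and mask weaker performance on the minority class (high-risk loans).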



In [12]:
print(y.value_counts())


loan_status
0    75036
1     2500
Name: count, dtype: int64
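The counts above help put the 99% accuracy in context. As a rough sketch, the arithmetic below computes the accuracy of a trivial baseline that labels every loan as healthy, and the balanced accuracy (the mean of the two per-class recalls) from the test-set confusion matrix reported earlier. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
healthy, high_risk = 75036, 2500

# Accuracy of always predicting the majority class (0)
baseline_accuracy = healthy / (healthy + high_risk)

# Per-class recalls from the test-set confusion matrix [[14926, 75], [46, 461]]
recall_0 = 14926 / (14926 + 75)
recall_1 = 461 / (461 + 46)
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Balanced accuracy of the model: {balanced_accuracy:.3f}")
```

A baseline that never flags a risky loan already scores about 97% accuracy on the full dataset, which is why the per-class recall and the balanced accuracy are more informative here than accuracy alone.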


---