In [2]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [3]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path('Resources/lending_data.csv')
lending_data = pd.read_csv(file_path)

# Review the DataFrame
lending_data.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [4]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = lending_data['loan_status']

# Separate the X variable, the features
X = lending_data.drop('loan_status', axis=1)

In [5]:
# Review the y variable Series
print(y.head())


0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64


In [6]:
# Review the X variable DataFrame
X.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Split the data using train_test_split
# We'll use 80% of the data for training and 20% for testing
# Assign a random_state of 1 to ensure reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
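
Because high-risk loans are a small minority, it can be worth checking that the training and testing sets carry a similar share of class `1`. The sketch below is only an illustration on made-up labels (`X_demo` and `y_demo` are hypothetical, not the lending data). It compares a plain split with one that passes the optional `stratify` argument of `train_test_split`, which this notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 950 healthy (0) and 50 high-risk (1) loans
y_demo = pd.Series([0] * 950 + [1] * 50, name="loan_status")
X_demo = pd.DataFrame({"loan_size": np.arange(1000, dtype=float)})

# Plain random split, with the same arguments as in the notebook
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Stratified split: asks for the class proportions of y_demo in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Share of class 1 in each testing set
print("plain split:     ", y_te_plain.mean())
print("stratified split:", y_te_strat.mean())
```

With a dataset as large as the one used here, a plain random split usually lands close to the overall class ratio anyway, so stratification matters most for small or very imbalanced samples.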


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using training data
logistic_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
y_pred = logistic_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[14926    75]
 [   46   461]]


In [11]:
# Print the classification report for the model
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508



In [21]:
# Calculate the benefit of applying the model retrospectively.
# The function iterates over all rows of the dataset and sums the money saved
# on correctly predicted defaulters (the loan amount), minus the interest
# forgone on loans that would have been wrongly denied to good clients.

def compute_income(X, y, pred):
    res = 0
    X_ = X.reset_index()
    y_ = list(y)
    for rn, row in X_.iterrows():
        if y_[rn] == 1 and pred[rn] == 1:
            res += row['loan_size']
        if y_[rn] == 0 and pred[rn] == 1:
            res -= row['loan_size'] * row['interest_rate'] / 100
    return res

compute_income(X_test, y_test, y_pred) / X_test.shape[0]
        

548.5335366907401

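The model saves money on loans that would have been issued but not repaid. At the same time, it costs money whenever a loan is denied to a good client. By this estimate, the model yields a net benefit of about 548.53 per loan in the test set, in the same units as `loan_size`. The estimate is rough: it counts the full loan amount as saved for every correctly predicted default and counts `loan_size * interest_rate / 100` as lost for every good loan that would have been denied. A more accurate analysis would need additional information about loan issuance terms and other details.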
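
The same rule can also be written without an explicit loop. The following is a minimal sketch, assuming the same column names (`loan_size`, `interest_rate`) and the same two cases as `compute_income`. The name `compute_income_vectorized` and the demo rows are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

def compute_income_vectorized(X, y, pred):
    """Apply the same rule as compute_income using boolean masks."""
    y_arr = np.asarray(y)
    pred_arr = np.asarray(pred)
    loan_size = X["loan_size"].to_numpy()
    interest_rate = X["interest_rate"].to_numpy()

    # Defaulters the model flagged: the loan amount is counted as saved
    caught_default = (y_arr == 1) & (pred_arr == 1)
    # Good clients the model flagged: the interest is counted as lost
    rejected_good = (y_arr == 0) & (pred_arr == 1)

    saved = loan_size[caught_default].sum()
    lost = (loan_size[rejected_good] * interest_rate[rejected_good] / 100).sum()
    return saved - lost

# Made-up rows for illustration only (not taken from lending_data.csv)
X_demo = pd.DataFrame({
    "loan_size": [10000.0, 20000.0, 8000.0, 9000.0],
    "interest_rate": [7.0, 11.0, 6.5, 10.0],
})
y_demo = pd.Series([0, 1, 0, 0])
pred_demo = np.array([0, 1, 0, 1])

total = compute_income_vectorized(X_demo, y_demo, pred_demo)
print("total:", total, "per loan:", total / len(X_demo))
```

In the demo, one defaulter is caught (20000 saved) and one good client is flagged (9000 at 10% interest, so 900 lost), which makes the arithmetic easy to check by hand.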

### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

### Conclusions

The main diagonal of the confusion matrix holds the correct predictions. The model correctly identified `14926` loans that were repaid and `461` loans that were not. It was wrong `75` times when it predicted that a loan would not be repaid but it was, and `46` times in the opposite direction, when it predicted repayment for a loan that was not repaid.

The model demonstrates high accuracy, with `14926` `True Negatives` and `461` `True Positives`. The number of `False Positives` is low (`75`, about 0.5% of the `15001` healthy loans), which limits the risk of denying credit to reliable clients. The `46` `False Negatives` are about 9% of the `507` high-risk loans, so most risky loans are caught, although each missed one is a potential loss. Overall, the model keeps a good balance between precision and recall for the high-risk class, which supports both the economic efficiency and the safety of lending decisions.

A drawback of this evaluation is that it does not reflect the actual financial benefit of the prediction results. Additional analysis will be provided in the Credit Risk Analysis Report.

The classification report shows strong model performance, with an `overall accuracy` of `0.99`. For class 0, `precision`, `recall`, and `F1-score` all round to `1.00` across `15001` instances. The predictions for this class are not flawless (`75` healthy loans were flagged as high-risk and `46` high-risk loans were predicted as healthy), but these errors are small relative to the class size. Class 1, with `507` instances, shows good `precision` (`0.86`) and higher `recall` (`0.91`), giving an `F1-score` of `0.88`. The macro averages (`0.93` `precision`, `0.95` `recall`, `0.94` `F1-score`) weight both classes equally, while the weighted averages (`0.99` for `precision`, `recall`, and `F1-score`) are dominated by the much larger class 0.

Because the classes are imbalanced (far more loans are repaid than not), accuracy alone can be misleading, so it makes sense to analyze per-class metrics such as precision and recall.
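
To see why accuracy alone says little here, it helps to compare the model with a trivial baseline that labels every loan as healthy. The sketch below uses only the counts printed in the confusion matrix and classification report above. The baseline itself is a hypothetical reference point, not something the notebook fits.

```python
# Test-set class sizes, copied from the "support" column of the report
support_0, support_1 = 15001, 507
n_total = support_0 + support_1

# A baseline that always predicts 0 gets every healthy loan right
# and every high-risk loan wrong
baseline_accuracy = support_0 / n_total
baseline_recall_1 = 0 / support_1

# The fitted model, from the confusion matrix diagonal (TN + TP)
model_accuracy = (14926 + 461) / n_total

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {model_accuracy:.4f}")
print(f"baseline recall for class 1: {baseline_recall_1:.2f}")
```

The gap in accuracy between the two is only a few percentage points, while the baseline finds none of the high-risk loans. That is the reason to look at recall and precision for class 1.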

`Precision` for class 0 rounds to `1.00`: almost every loan predicted as healthy was in fact repaid (the exceptions are the `46` false negatives). For class 1, the precision is `0.86`, meaning that when the model predicts that a loan will not be repaid, it is correct about 86% of the time. This is a good result, but it can be improved, since the remaining 14% are false positives that could negatively affect clients who would have repaid the loan.

`Recall` for class 0 also rounds to `1.00`: nearly all loans that were repaid were identified as such (all but the `75` false positives). For class 1, the `recall` is `0.91`, which is higher than the precision and indicates that the model identifies most loans that are not repaid. However, about 9% of them are still missed, which could lead to financial losses.
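
The per-class figures in the report can be recomputed by hand from the four cells of the confusion matrix, which also shows the values before rounding. This is a small sketch using the counts printed above; the variable names are chosen here for illustration.

```python
# Cells of the confusion matrix printed above:
# rows are actual classes, columns are predicted classes
tn, fp = 14926, 75   # actual 0: predicted 0, predicted 1
fn, tp = 46, 461     # actual 1: predicted 0, predicted 1

# Class 1 (high-risk loan)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (healthy loan)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 1: precision={precision_1:.4f} recall={recall_1:.4f} f1={f1_1:.4f}")
print(f"class 0: precision={precision_0:.4f} recall={recall_0:.4f}")
print(f"accuracy={accuracy:.4f}")
```

The class 0 values come out slightly below 1, which is why the `1.00` entries in the report should be read as rounded rather than perfect.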

---