In [39]:
%env HV_DOC_HTML=true

env: HV_DOC_HTML=true


In [40]:
pip install -q hvplot

In [41]:
pip install -q holoviews

In [42]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report, balanced_accuracy_score, accuracy_score

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [43]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_lending = pd.read_csv("lending_data.csv")

# Review the DataFrame
df_lending

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [44]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_lending["loan_status"]

# Separate the X variable, the features
X = df_lending.drop(columns=["loan_status"])

In [45]:
# Review the y variable Series
y.head(5)

Unnamed: 0,loan_status
0,0
1,0
2,0
3,0
4,0


In [46]:
# Review the X variable DataFrame
X.head(5)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [47]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# X_train and X_test contain the features for the training and testing datasets
# y_train and y_test contain the labels for the training and testing datasets
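
Because the high-risk class is rare in this data, it can help to check that both splits keep a similar class ratio. The sketch below is not part of the original notebook: it uses synthetic labels (`X_demo` and `y_demo` are made-up names and values) and the `stratify` parameter of `train_test_split`, which the cell above does not set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 970 healthy (0) and 30 high-risk (1) loans.
y_demo = np.array([0] * 970 + [1] * 30)
# Hypothetical single-feature matrix; the values themselves do not matter here.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the class ratio to be preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print("Train share of class 1:", y_tr.mean())
print("Test share of class 1:", y_te.mean())
```

With the default 75/25 split, both shares should land close to the 3% share of class `1` in the full synthetic label set.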

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [48]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using training data
logistic_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [49]:
# Make a prediction using the testing data
predictions = logistic_model.predict(X_test)
predictions

array([0, 0, 0, ..., 0, 0, 0])

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [50]:
# Generate a confusion matrix for the model
confusion_mat = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(confusion_mat)

Confusion Matrix:
[[18655   110]
 [   36   583]]


In [51]:
# Print the classification report for the model
class_report = classification_report(y_test, predictions)
print("\nClassification Report:")
print(class_report)


Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.94      0.89       619

    accuracy                           0.99     19384
   macro avg       0.92      0.97      0.94     19384
weighted avg       0.99      0.99      0.99     19384
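
The per-class figures in the report follow directly from the confusion matrix printed earlier. As a minimal sketch, assuming scikit-learn's convention that rows are actual labels and columns are predicted labels, the numbers can be reproduced by hand (the variable names are illustrative):

```python
# Confusion matrix of the model trained on the original data.
tn, fp = 18655, 110   # actual 0: predicted 0, predicted 1
fn, tp = 36, 583      # actual 1: predicted 0, predicted 1

# Class 1 (high-risk loan)
precision_1 = tp / (tp + fp)   # share of predicted high-risk loans that are high-risk
recall_1 = tp / (tp + fn)      # share of actual high-risk loans that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (healthy loan)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

Note that the precision for class `0` is only 1.00 after rounding: 36 loans predicted as healthy were actually high-risk.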



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

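**Answer:** The logistic regression model predicts healthy loans (`0`) almost perfectly, with precision 1.00 and recall 0.99, and reaches about 99% overall accuracy. For high-risk loans (`1`) it is weaker but still strong: recall is 0.94 (36 of 619 high-risk loans are missed) and precision is 0.84 (110 healthy loans are flagged as high-risk). Because high-risk loans make up only about 3% of the test set, the overall accuracy is dominated by the healthy class, so resampling or alternative algorithms may be warranted to improve performance on the minority class.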

## Predict a Logistic Regression Model with Resampled Training Data


### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points.

In [52]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
ros = RandomOverSampler(random_state=1)

# Fit the original training data to the random oversampler model
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# Confirm that the labels have an equal number of data points
print("Original dataset shape:", y_train.value_counts())
print("Resampled dataset shape:", y_resampled.value_counts())

Original dataset shape: loan_status
0    56271
1     1881
Name: count, dtype: int64
Resampled dataset shape: loan_status
0    56271
1    56271
Name: count, dtype: int64


In [53]:
# Count the distinct values of the resampled labels data
unique_values, counts = np.unique(y_resampled, return_counts=True)
# Print the distinct values and their counts
for value, count in zip(unique_values, counts):
    print(f"Value: {value}, Count: {count}")

Value: 0, Count: 56271
Value: 1, Count: 56271
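
To make the idea behind random oversampling concrete, here is a conceptual sketch in plain Python. It is not imbalanced-learn's implementation: `random_oversample` is a hypothetical helper and the toy data is made up. It duplicates randomly chosen minority-class rows until every class matches the size of the largest class.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data with made-up values: 8 healthy loans (0) and 2 high-risk loans (1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

rows_res, labels_res = random_oversample(rows, labels)
print(Counter(labels_res))
```

As in the notebook, resampling of this kind is applied to the training data only; the test set keeps its original class distribution so that the evaluation reflects real conditions.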


### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [54]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
logistic_model.fit(X_resampled, y_resampled)

# Make a prediction using the testing data
predictions = logistic_model.predict(X_test)

predictions

array([0, 0, 0, ..., 0, 0, 0])

### Step 3: Evaluate the model's performance by doing the following:

*   Calculate the accuracy score of the model.
*   Generate a confusion matrix
*   Print the classification report





In [56]:
# Calculate and print the accuracy score of the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy Score:", accuracy)

Accuracy Score: 0.9936545604622369


In [57]:
# Print the balanced accuracy score of the model
balanced_accuracy = balanced_accuracy_score(y_test, predictions)
print("Balanced Accuracy Score:", balanced_accuracy)

Balanced Accuracy Score: 0.9935981855334257


In [59]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, predictions)
conf_matrix

array([[18646,   119],
       [    4,   615]])
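
Both accuracy figures reported for this model can be recovered from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual labels, columns are predicted labels) and uses illustrative variable names. Balanced accuracy is the mean of the per-class recalls, so it is not inflated by the large healthy class.

```python
# Confusion matrix of the model trained on the oversampled data.
tn, fp = 18646, 119   # actual 0: predicted 0, predicted 1
fn, tp = 4, 615       # actual 1: predicted 0, predicted 1

# Plain accuracy: share of all test loans classified correctly.
accuracy = (tn + tp) / (tn + fp + fn + tp)

# Balanced accuracy: average of the recall for each class.
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)
balanced_accuracy = (recall_0 + recall_1) / 2

print("Accuracy:", accuracy)
print("Balanced accuracy:", balanced_accuracy)
```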

In [60]:
# Print the classification report
class_report = classification_report(y_test, predictions)
print("Classification Report:\n", class_report)

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.



**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** Trained on the oversampled data, the model still predicts healthy loans (`0`) exceptionally well (precision 1.00, recall 0.99) and identifies high-risk loans (`1`) much more completely: recall rises to 0.99, with only 4 of 619 high-risk loans missed. Precision for high-risk loans stays at 0.84, so the trade-off is a slightly larger number of healthy loans flagged as high-risk (119). The high recall for high-risk loans is particularly valuable in lending, because it minimizes the chance of missing potential defaults. The overall accuracy of 0.99 and the balanced accuracy of 0.99 further support the effectiveness of the model in distinguishing between the two classes.

## Credit Risk Analysis Report

### Overview of the Analysis

1. Question: Explain the purpose of the analysis.
* Answer: The primary objective of this analysis was to use historical lending data to support lending decisions. We built machine learning models that classify each loan as healthy or high-risk, that is, at risk of default.
2. Question: Explain what financial information the data was on, and what you needed to predict.
* Answer: The dataset used in this analysis featured data on Small Business Administration (SBA) loans. Each record holds the loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, and total debt. The value to predict was `loan_status`.
3. Question: Provide essential details about the variables targeted for prediction (e.g., `value_counts`).
* Answer: The target variable `loan_status` takes two values: healthy loan (`0`), meaning the loan is not at risk, and high-risk loan (`1`), meaning the loan is at risk of default. Calling `value_counts` on `y_train` shows 56,271 healthy loans and 1,881 high-risk loans, so the classes are heavily imbalanced. The features in `X_train` (borrower characteristics, loan amounts, interest rates) are used to predict this target.
4. Question: Describe the stages of the machine learning process you went through as part of this analysis.
* Answer: Data preprocessing (separating the labels from the features and splitting the data into training and testing sets with `train_test_split`), model training (fitting `LogisticRegression` on the original and then on the resampled training data), and model evaluation (accuracy, confusion matrix, and classification report on the test set).
5. Question: Briefly touch on any methods you used (e.g., `LogisticRegression`, or any other algorithms).
* Answer: `LogisticRegression` was used as the classifier, and `RandomOverSampler` from imbalanced-learn was used to address class imbalance by oversampling the minority class in the training data to improve model performance.

## Results
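1. Machine Learning Model 1: Logistic regression trained on the original data. Accuracy, precision, and recall scores:
    *    Accuracy: 0.99 – The model correctly classifies 99% of the instances, indicating a high level of reliability.
    *    Precision for Healthy Loans (0): 1.00 – Nearly all loans predicted as healthy were indeed healthy. The score rounds to 1.00 because only 36 of the 18,691 loans predicted as healthy were actually high-risk.
    *    Recall for Healthy Loans (0): 0.99 – The model identified 99% of the actual healthy loans; the remaining 110 were flagged as high-risk.
    *    Precision for High-Risk Loans (1): 0.84 – 84% of loans predicted as high-risk were actually high-risk, indicating some false positives.
    *    Recall for High-Risk Loans (1): 0.94 – The model successfully identified 94% of the actual high-risk loans (583 of 619), capturing most relevant cases.
2. Machine Learning Model 2: Logistic regression trained on the randomly oversampled data. Accuracy, precision, and recall scores:
    *    Accuracy: 0.99 – The model correctly classifies 99% of the instances; the balanced accuracy is also 0.99.
    *    Precision for Healthy Loans (0): 1.00 – Only 4 of the loans predicted as healthy were actually high-risk.
    *    Recall for Healthy Loans (0): 0.99 – The model identified 99% of the actual healthy loans; the remaining 119 were flagged as high-risk.
    *    Precision for High-Risk Loans (1): 0.84 – 84% of loans predicted as high-risk were actually high-risk.
    *    Recall for High-Risk Loans (1): 0.99 – The model identified 615 of the 619 actual high-risk loans.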

## Summary

1. Which one seems to perform best? How do you know it performs best?
* Both models reach an overall accuracy of 0.99, so accuracy alone does not separate them. The difference is in recall for high-risk loans: 0.94 in the first classification report (original data) versus 0.99 in the second (oversampled data), which corresponds to missed high-risk loans falling from 36 to 4. Precision for high-risk loans is 0.84 in both reports, with false positives rising only from 110 to 119. The second model therefore performs best, because it detects more of the actual high-risk loans at almost no cost in precision.


2. Does performance depend on the problem we are trying to solve? (For example, is it more important to predict the `1`'s, or predict the `0`'s? )
* Yes. Which metric matters most depends on the problem being solved. In this scenario it is more important to predict the `1`s (high-risk loans) correctly. Predicting the `0`s (healthy loans) still matters, but the repercussions of misclassifying a healthy loan are generally less severe than those of missing a high-risk loan, so recall for the `1` class carries the most weight.
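
One way to make this dependence explicit is to attach a cost to each kind of error and compare the two models. The sketch below takes the error counts from the two confusion matrices in this notebook; the cost values are hypothetical and chosen only for illustration, not taken from the data.

```python
# Error counts taken from the two confusion matrices in this notebook.
models = {
    "original": {"fn": 36, "fp": 110},
    "oversampled": {"fn": 4, "fp": 119},
}

# Hypothetical relative costs, chosen only for illustration.
COST_FN = 10.0  # a high-risk loan predicted as healthy
COST_FP = 1.0   # a healthy loan predicted as high-risk

def total_cost(errors, cost_fn=COST_FN, cost_fp=COST_FP):
    """Weighted sum of false negatives and false positives."""
    return errors["fn"] * cost_fn + errors["fp"] * cost_fp

costs = {name: total_cost(errors) for name, errors in models.items()}
for name, cost in costs.items():
    print(f"{name}: {cost}")
```

Under these assumed costs the oversampled model is clearly cheaper. With these error counts, it stays cheaper as long as a missed high-risk loan costs more than roughly 0.28 times a false alarm (32 fewer misses against 9 more false alarms).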