In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame

df = pd.read_csv(
    Path('../Starter_Code/Resources/lending_data.csv')   
)
# Review the DataFrame
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df['loan_status']

# Separate the X variable, the features

X = df.drop(columns='loan_status')

In [4]:
# Review the y variable Series
y[:5]

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [5]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1 )

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
from sklearn.linear_model import LogisticRegression
logistic_regressionmodel = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using training data
lr_model = logistic_regressionmodel.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data
prediction = lr_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [9]:
# Generate a confusion matrix for the model
# (rows are actual labels, columns are predicted labels)
testing_matrix = confusion_matrix(y_test, prediction)
print(testing_matrix)

[[18663   102]
 [   56   563]]


In this confusion matrix, with `1` (high-risk loan) treated as the positive class:
- True Positives (TP): 563 (number of high-risk loans correctly predicted as high-risk)
- True Negatives (TN): 18663 (number of healthy loans correctly predicted as healthy)
- False Positives (FP): 102 (number of healthy loans incorrectly predicted as high-risk)
- False Negatives (FN): 56 (number of high-risk loans incorrectly predicted as healthy)
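
The four counts above determine the precision, recall, and accuracy for the high-risk class. As a minimal sketch, the arithmetic can be checked in plain Python; the variable names are illustrative, and the counts are copied from the matrix printed above.

```python
# Counts copied from the confusion matrix above (class 1 = high-risk is "positive")
tn, fp = 18663, 102
fn, tp = 56, 563

# Precision: share of loans predicted high-risk that really are high-risk
precision = tp / (tp + fp)

# Recall: share of actual high-risk loans that the model caught
recall = tp / (tp + fn)

# Accuracy: share of all loans classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.85 recall=0.91 accuracy=0.9918
```

These values agree with the class `1` row of the classification report and with the accuracy printed by `metrics.accuracy_score`.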

In [10]:
# Print the classification report for the model
from sklearn import metrics
testing_report = classification_report(y_test, prediction)
print(testing_report)

# Use the metrics module to calculate accuracy
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test, prediction))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384

ACCURACY OF THE MODEL:  0.9918489475856377


### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model was 99% accurate at predicting the healthy vs. high-risk loan labels. However, precision and recall are more relevant in this context because of the potentially high cost of misclassification. A high precision score is crucial to minimize false positives, since misclassifying potential customers as high-risk could result in lost business opportunities.

Similarly, a high recall score is vital to minimize false negatives, since misclassifying high-risk loans as healthy could lead to substantial financial losses.

For healthy loans, precision is 1.00 (rounded) and recall is 0.99, so almost every loan predicted as healthy is in fact healthy, and almost every healthy loan is identified.

For high-risk loans, precision is 0.85 and recall is 0.91: about 85% of the loans flagged as high-risk are actually high-risk, and the model catches about 91% of the actual high-risk loans. These results must be interpreted with caution because the dataset is imbalanced: the test set has 18,765 healthy loans and only 619 high-risk loans, so overall accuracy is dominated by the healthy class. It is therefore worth considering a resampled dataset to address the imbalance and obtain more reliable performance metrics.
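
To see why accuracy alone can mislead on imbalanced data, consider a hypothetical baseline that labels every loan as healthy. The sketch below uses only the class counts (support) from the classification report; the baseline model itself is an assumption introduced for illustration and is not part of the analysis.

```python
# Class counts (support) copied from the classification report above
healthy, high_risk = 18765, 619
total = healthy + high_risk

# Accuracy of a hypothetical model that labels every loan as healthy (0)
baseline_accuracy = healthy / total

# That same model would catch none of the high-risk loans
baseline_recall_high_risk = 0 / high_risk

print(f"baseline accuracy={baseline_accuracy:.3f}")
print(f"baseline recall for class 1={baseline_recall_high_risk:.2f}")
# prints: baseline accuracy=0.968
#         baseline recall for class 1=0.00
```

A model that never flags a risky loan would still score about 97% accuracy, which is why the per-class precision and recall carry more information here.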

---

## Creating a Logistic Regression Model with the StandardScaler() Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train_scaled` and `y_train`).

In [11]:
# Import the StandardScaler class from scikit-learn
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()

# Fit the standard scaler to the training data only
X_scaler = scaler.fit(X_train)

# Transform the training data using the scaler
X_train_scaled = X_scaler.transform(X_train)

# Transform the testing data using the same fitted scaler
X_test_scaled = X_scaler.transform(X_test)
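
As a rough sketch of what the fit and transform steps do, the snippet below standardizes a single column by hand with NumPy. The values are a few `loan_size` entries from the `df.head()` output, and the train/test split among them is an arbitrary assumption made for illustration only.

```python
import numpy as np

# A few loan_size values from the df.head() output, used only as toy data
train = np.array([10700.0, 8400.0, 9000.0, 10700.0])
test = np.array([10800.0])

# "Fit": learn the mean and standard deviation from the training data only
mean = train.mean()
std = train.std()  # ddof=0 (population std), the convention StandardScaler uses

# "Transform": apply the training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.round(3), test_scaled.round(3))
```

Reusing the training mean and standard deviation for the test set keeps information about the test data out of the preprocessing step.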



In [12]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
from sklearn.linear_model import LogisticRegression
logistic_regressionmodel = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using the scaled training data
lr_model = logistic_regressionmodel.fit(X_train_scaled, y_train)

### Step 2: Save the predictions on the testing data labels by using the scaled testing feature data (`X_test_scaled`) and the fitted model.

In [13]:
# Make a prediction using the scaled testing data
prediction = lr_model.predict(X_test_scaled)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [14]:
# Generate a confusion matrix for the model
testing_matrix = confusion_matrix(y_test, prediction)
print(testing_matrix)

[[18652   113]
 [   10   609]]


In [15]:
# Print the classification report for the model
from sklearn import metrics
testing_report = classification_report(y_test, prediction)
print(testing_report)

# Use the metrics module to calculate accuracy
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test, prediction))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.98      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

ACCURACY OF THE MODEL:  0.9936545604622369


### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The same logistic regression model was fit on the scaled dataset. Its accuracy remained high (0.99), so the model predicted most samples correctly. Precision for healthy loans rounds to 1.00: of the 18,662 loans predicted as healthy, only 10 were actually high-risk. For high-risk loans, precision was 0.84, meaning about 84% of the loans flagged as high-risk were truly high-risk.

Recall, which measures the share of actual members of a class that the model identifies, was high for both healthy loans (0.99) and high-risk loans (0.98). Compared with the model fit on the original data, recall for high-risk loans rose from 0.91 to 0.98, while precision was nearly unchanged (0.85 versus 0.84).

The F1-score, the harmonic mean of precision and recall, was excellent for healthy loans (1.00) and strong for high-risk loans (0.91).
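
The F1-score for the high-risk class can be reproduced from the confusion matrix of the scaled model. The following is a minimal sketch in plain Python, with counts copied from the matrix above and illustrative variable names.

```python
# Counts copied from the scaled-model confusion matrix above
tn, fp = 18652, 113
fn, tp = 10, 609

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.84 recall=0.98 f1=0.91
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F1-score of 0.91 sits closer to the precision of 0.84 than a simple average would.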

## Creating a Random Forest Classifier Model with the Original Data

### Step 1: Fit a random forest classifier model by using the training data (`X_train` and `y_train`).

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Create a random forest classifier with 100 trees
# (no random_state is set, so results can vary slightly between runs)
clf = RandomForestClassifier(n_estimators=100)

# Train the model on the training dataset
clf.fit(X_train, y_train)

# Make predictions on the test dataset
y_pred = clf.predict(X_test)

# Use the metrics module to calculate accuracy
print()
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test, y_pred))


ACCURACY OF THE MODEL:  0.9913330581923235


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [17]:
# Generate a confusion matrix for the model
rf_testing_matrix = confusion_matrix(y_test, y_pred)
print(rf_testing_matrix)


[[18665   100]
 [   68   551]]


In [19]:
# Print the classification report for the model

from sklearn.metrics import classification_report
testing_report = classification_report(y_test, y_pred)

print(testing_report)
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.89      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.94      0.93     19384
weighted avg       0.99      0.99      0.99     19384

ACCURACY OF THE MODEL:  0.9913330581923235


### Step 4: Answer the following question.

**Question:** How well does the Random Forest Classifier model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The provided classification report is the evaluation result of a Random Forest Classifier model on a dataset with two classes: `0` (healthy loan) and `1` (high-risk loan). The classification report provides metrics such as precision, recall, F1-score, and support for each class, as well as an overall weighted average.

Let's break down the interpretation of the classification report:

**Class 0 (healthy loan):**
- Precision: 1.00
- Recall: 0.99
- F1-score: 1.00
- Support: 18765

Explanation:
- Precision (1.00): The precision for class 0 rounds to 1.00, which means that when the model predicts a loan as healthy (`0`), it is almost always correct (68 of the 18,733 loans predicted as healthy were actually high-risk).
- Recall (0.99): The recall for class 0 is 0.99, indicating that the model identifies approximately 99% of the actual healthy loans correctly.
- F1-score (1.00): The F1-score is the harmonic mean of precision and recall and also rounds to 1.00 for class 0, indicating that both precision and recall are very high.
- Support (18765): The support is the number of instances of class 0 in the test dataset, which is 18765.

**Class 1 (high-risk loan):**
- Precision: 0.85
- Recall: 0.89
- F1-score: 0.87
- Support: 619

Explanation:
- Precision (0.85): The precision for class 1 is 0.85, which means that when the model predicts a loan as high-risk (`1`), it is correct 85% of the time.
- Recall (0.89): The recall for class 1 is 0.89, indicating that the model identifies approximately 89% of the actual high-risk loans correctly.
- F1-score (0.87): The F1-score is 0.87 for class 1, which shows a reasonable balance between precision and recall but is noticeably lower than for class 0, indicating that there is room for improvement.
- Support (619): The support is the number of instances of class 1 in the test dataset, which is 619.

**Overall Metrics:**
- Accuracy: 0.99
- Macro avg (precision, recall, F1-score): 0.92 (precision), 0.94 (recall), 0.93 (F1-score)
- Weighted avg (precision, recall, F1-score): 0.99 (precision), 0.99 (recall), 0.99 (F1-score)

Explanation:
- Accuracy (0.99): The overall accuracy of the model is 0.99, indicating that it correctly predicts approximately 99% of the instances in the test dataset.
- Macro avg: The macro average calculates the unweighted mean of precision, recall, and F1-score across all classes. In this case, it's 0.92 (precision), 0.94 (recall), and 0.93 (F1-score).
- Weighted avg: The weighted average calculates the weighted mean of precision, recall, and F1-score based on the support for each class. In this case, it's 0.99 (precision), 0.99 (recall), and 0.99 (F1-score).
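
The difference between the two averages can be checked by hand for recall. This is a small sketch in plain Python using the counts from the random forest confusion matrix above; the variable names are illustrative.

```python
# Counts copied from the random forest confusion matrix above
tn, fp = 18665, 100
fn, tp = 68, 551

# Per-class recall and support (number of actual instances of each class)
recall_0, support_0 = tn / (tn + fp), tn + fp
recall_1, support_1 = tp / (tp + fn), tp + fn

# Macro average: every class counts equally
macro_recall = (recall_0 + recall_1) / 2

# Weighted average: every class counts in proportion to its support
weighted_recall = (recall_0 * support_0 + recall_1 * support_1) / (support_0 + support_1)

print(f"macro={macro_recall:.2f} weighted={weighted_recall:.2f}")
# prints: macro=0.94 weighted=0.99
```

The output matches the recall column of the `macro avg` and `weighted avg` rows in the classification report. The weighted figure is dominated by the 18,765 healthy loans, so the macro average is the more sensitive indicator of how the minority class is handled.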

In summary, the Random Forest Classifier model performs exceptionally well in predicting the `0` (healthy loan) class, with high precision, recall, and F1-score. For the `1` (high-risk loan) class, the model still performs well, but there is room for improvement: with a precision of 0.85 and a recall of 0.89, about 15% of the loans it flags are actually healthy and about 11% of the actual high-risk loans are missed. The overall accuracy (0.99) is excellent, but because the dataset is imbalanced, the class `1` metrics give the more informative picture of this binary classifier.