# Students Do: Predicting Loan Default with Random Forests

In this activity, you are going to explore how random forest algorithm can be used to identify loans that are likely to default. You will use the `sba_loans_encoded.csv` file that you created before to train the model

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report


## Loading and Preprocessing Loans Encoded Data

Load the `sba_loans_encoded.csv` in a pandas DataFrame called `df_loans`

In [3]:
# Loading data
df_loans = pd.read_csv("../Resources/sba_loans_encoded.csv")
df_loans

Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural,...,City_WILLITS,City_WILMINGTON,City_WINDSOR,City_WINNETKA,City_WOODLAND,City_WOODLAND HILLS,City_WRIGHTWOOD,City_Watsonville,City_YORBA LINDA,City_YUBA CITY
0,2001,11,32812,36,92801,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2001,4,30000,56,90505,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2001,4,30000,36,92103,0,10,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,2003,10,50000,36,92108,0,6,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,7,343000,240,91345,3,65,1,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2092,2006,6,150000,60,92346,0,5,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2093,1997,4,99000,300,92021,0,4,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2094,1997,2,50000,84,93012,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2095,1997,1,251150,120,91352,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Define the features set, by copying the `df_loans` DataFrame and dropping the `Default` column.

In [7]:
# Define features set
x = df_loans.copy()
x = x.drop(columns="Default")

x

Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural,...,City_WILLITS,City_WILMINGTON,City_WINDSOR,City_WINNETKA,City_WOODLAND,City_WOODLAND HILLS,City_WRIGHTWOOD,City_Watsonville,City_YORBA LINDA,City_YUBA CITY
0,2001,11,32812,36,92801,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2001,4,30000,56,90505,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2001,4,30000,36,92103,0,10,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,2003,10,50000,36,92108,0,6,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,7,343000,240,91345,3,65,1,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2092,2006,6,150000,60,92346,0,5,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2093,1997,4,99000,300,92021,0,4,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2094,1997,2,50000,84,93012,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2095,1997,1,251150,120,91352,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Create the target vector by assigning the values of the `Default` column from the `df_loans` DataFrame.

In [9]:
# Define target vector

y = df_loans["Default"].values.reshape(-1,1)

y

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]])

Split the data into training and testing sets.

In [10]:
# Splitting into Train and Test sets

x_test, x_train, y_test, y_train = train_test_split(x,y,random_state=78)

Use the `StandardScaler` to scale the features data, remember that only `X_train` and `X_testing` DataFrames should be scaled.

In [11]:
# Create the StandardScaler instance
ss = StandardScaler()

In [12]:
# Fit the Standard Scaler with the training data

x_scaler = ss.fit(x_train)

In [13]:
# Scale the training data

x_test_scaled = x_scaler.transform(x_test)

x_train_scaled = x_scaler.transform(x_train)

## Fitting the Random Forest Model

Once data is scaled, create a random forest instance and train it with the training data (`X_train_scaled` and `y_train`), define `n_estimators=500` and `random_state=78`.

In [14]:
# Create the random forest classifier instance

rf_model = RandomForestClassifier(n_estimators=500, random_state=78)

In [15]:
# Fit the model

rf_model = rf_model.fit(x_train_scaled, y_train)

  This is separate from the ipykernel package so we can avoid doing imports until


## Making Predictions Using the Random Forest Model

Validate the trained model, by predicting loan defaults using the testing data (`X_test_scaled`).

In [16]:
# Making predictions using the testing data

predictions = rf_model.predict(x_test_scaled)

## Model Evaluation

Evaluate model's results, by using `sklearn` to calculate the confusion matrix, the accuracy score and to generate the classification report.

In [17]:
# Calculating the confusion matrix
cm = confusion_matrix(y_test, predictions)

cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)
# Calculating the accuracy score

acc_score = accuracy_score(y_test, predictions)

In [18]:
# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions))


Confusion Matrix


Unnamed: 0,Predicted 0,Predicted 1
Actual 0,988,79
Actual 1,81,424


Accuracy Score : 0.8982188295165394
Classification Report
              precision    recall  f1-score   support

           0       0.92      0.93      0.93      1067
           1       0.84      0.84      0.84       505

    accuracy                           0.90      1572
   macro avg       0.88      0.88      0.88      1572
weighted avg       0.90      0.90      0.90      1572



## Feature Importance

In this section, you are asked to fetch the features' importance from the random forest model and display the top 10 most important features.

In [19]:
# Get the feature importance array
importances = rf_model.feature_importances_


In [24]:
# List the top 10 most important features

importance_sorted = sorted(zip(rf_model.feature_importances_, x.columns), reverse=True)
importance_sorted

[(0.26364099604720953, 'Term'),
 (0.08351281823786369, 'Year'),
 (0.08330403703338427, 'Amount'),
 (0.05025563647685759, 'Zip'),
 (0.03951469954001163, 'NoEmp'),
 (0.03863888534729672, 'Month'),
 (0.03676387828231279, 'RealEstate'),
 (0.03465420177816074, 'RevLineCr'),
 (0.02493945963174351, 'CreateJob'),
 (0.019465553140959494, 'Bank_BANK OF AMERICA NATL ASSOC'),
 (0.013773859539404996, 'UrbanRural'),
 (0.010222365678964507, 'Bank_BBCN BANK'),
 (0.0064555808800925525, 'Bank_WELLS FARGO BANK NATL ASSOC'),
 (0.006412532940785636, 'Bank_CAPITAL ONE NATL ASSOC'),
 (0.006228986181480801, 'Bank_U.S. BANK NATIONAL ASSOCIATION'),
 (0.006002085106749823, 'City_SAN DIEGO'),
 (0.00585479109151305, 'Bank_CITIBANK, N.A.'),
 (0.005733506962154835, 'Bank_JPMORGAN CHASE BANK NATL ASSOC'),
 (0.005125310304157387, 'City_HERCULES'),
 (0.005024061445096221, 'Bank_SUPERIOR FINANCIAL GROUP, LLC'),
 (0.004856360219994621, 'City_LOS ANGELES'),
 (0.004452247676737376, 'Bank_CDC SMALL BUS. FINAN CORP'),
 (0.00

## Analysis Questions

Finally, analyze the model's evaluation results and answer the following questions.

* **Question 1:** Would you trust in this model to deploy a loan default solution in a bank?

 * **Your answer here**


* **Question 2:** What are your insights about the top 10 most importance features?

 * **Your answer here**