# Group 4: Random Forest Model and Evaluation

In [31]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
%matplotlib inline

# Needed for decision tree visualization
import pydotplus
from IPython.display import Image


## Loading and Preprocessing Loans Encoded Data

In [32]:
# Loading data
file_path = Path("../Resources/loans_data_encoded.csv")
df_loans = pd.read_csv(file_path)
df_loans.head()



Unnamed: 0,amount,term,age,bad,month_num,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male
0,1000,30,45,0,6,0,1,0,0,0,1
1,1000,30,50,0,7,1,0,0,0,1,0
2,1000,30,33,0,8,1,0,0,0,1,0
3,1000,15,27,0,9,0,0,0,1,0,1
4,1000,30,28,0,10,0,0,0,1,1,0


In [33]:
# Define features set
X = df_loans.copy()
X.drop("bad", axis=1, inplace=True)
X.head()



Unnamed: 0,amount,term,age,month_num,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male
0,1000,30,45,6,0,1,0,0,0,1
1,1000,30,50,7,1,0,0,0,1,0
2,1000,30,33,8,1,0,0,0,1,0
3,1000,15,27,9,0,0,0,1,0,1
4,1000,30,28,10,0,0,0,1,1,0


To create the target vector `y` before scaling the data, this demo uses `ravel` rather than the `reshape` call used in the decision tree demo.

`reshape` also works, because `train_test_split` accepts lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames. See the `sklearn` docs for `train_test_split`: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Further testing shows that `y = df_loans["bad"]`, which is a pandas Series, is also sufficient as an input.
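
The three options differ only in the shape of the resulting object. The sketch below uses a small hypothetical DataFrame (`df_demo`, not the loans data) to show those shapes. Scikit-learn classifiers generally expect a 1-D target, and a column vector of shape `(n, 1)` typically triggers a `DataConversionWarning` when the model is fitted.

```python
import pandas as pd

# Hypothetical stand-in for df_loans, holding only a target column
df_demo = pd.DataFrame({"bad": [0, 0, 1, 0, 1]})

y_series = df_demo["bad"]                             # pandas Series, 1-D
y_flat = df_demo["bad"].to_numpy().ravel()            # NumPy array, 1-D
y_column = df_demo["bad"].to_numpy().reshape(-1, 1)   # NumPy array, 2-D column vector

print(y_series.shape, y_flat.shape, y_column.shape)
# → (5,) (5,) (5, 1)
```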

In [None]:
# Define target vector
y = df_loans["bad"].values.ravel()
# y = df_loans["bad"]
# y = df_loans["bad"].values.reshape(-1,1)
y[:5]



In [None]:
# Splitting into Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)



In [None]:
# Creating StandardScaler instance
scaler = StandardScaler()



In [None]:
# Fitting Standard Scaler
X_scaler = scaler.fit(X_train)



In [37]:
# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)



In [None]:
# Confirm scaled values...for fun
df_X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
round(df_X_train_scaled.mean(), 0)


In [38]:
round(df_X_train_scaled.std(), 0)


amount                            1.0
term                              1.0
age                               1.0
month_num                         1.0
education_Bachelor                1.0
education_High School or Below    1.0
education_Master or Above         1.0
education_college                 1.0
gender_female                     1.0
gender_male                       1.0
dtype: float64

## Fitting the Random Forest Model

When the random forest instance is created, there are two important parameters to set:

- `n_estimators`: This is the number of decision trees in the forest. In general, a larger number makes the predictions stronger and more stable, but it also increases training time. A good approach is to start low and increase the number if model performance is not adequate.
- `random_state`: This parameter defines the seed used by the random number generator. It is important to define a random state when comparing multiple models.
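
A rough way to see the effect of these two parameters is to train forests of different sizes with the same seed. This is a minimal sketch on synthetic data from `make_classification`, which is an assumption standing in for the loans dataset. The tree counts `(10, 50, 200)` are arbitrary illustration values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for the loans features and target)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=78)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=78)

scores = {}
for n_trees in (10, 50, 200):
    # The same random_state keeps the comparison reproducible across runs
    model = RandomForestClassifier(n_estimators=n_trees, random_state=78)
    model.fit(X_tr, y_tr)
    # estimators_ holds the individual fitted decision trees
    assert len(model.estimators_) == n_trees
    scores[n_trees] = model.score(X_te, y_te)

print(scores)
```

Rerunning the cell with the same `random_state` should give the same scores, which is what makes comparisons between models meaningful.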

In [None]:
# Create a random forest classifier
rf_model = RandomForestClassifier(n_estimators=500, random_state=78)



Once the random forest model is created, it is fitted with the training data.

In [None]:
# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)


After fitting the model, some predictions are made using the scaled testing data.

## Making Predictions Using the Random Forest Model

In [None]:
# Making predictions using the testing data
predictions = rf_model.predict(X_test_scaled)


In order to evaluate the model, a confusion matrix, the `accuracy_score`, and the `classification_report` from `sklearn.metrics` are used.

The confusion matrix is created from the `y_test` and `predictions` vectors. The matrix shows how well the model predicts fraudulent loan applications.
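
To make the layout of the matrix concrete, here is a small worked example with hand-written labels (hypothetical values, not the loans data). Rows are actual classes and columns are predicted classes, so for a binary problem `ravel()` yields the counts in the order true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 0, 0, 1, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

print(cm_demo.tolist())  # → [[2, 1], [1, 2]]
print(tn, fp, fn, tp)    # → 2 1 1 2

# Accuracy is the share of correct predictions: (TN + TP) / total
assert abs(acc_demo - (tn + tp) / len(y_true_demo)) < 1e-12
```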

## Model Evaluation

In [None]:
# Calculating the confusion matrix
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

# Calculating the accuracy score
acc_score = accuracy_score(y_test, predictions)



In [None]:
# Displaying results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions))


After observing the results, it can be concluded that this model may not be the best one for preventing fraudulent loan applications. Now there are several strategies that may improve this model, such as:

- Reducing the number of features using principal component analysis (PCA).

- Creating new features based on new data from the problem domain.

- Increasing the number of estimators and/or general hyperparameter tuning and cross validation. See https://www.datasciencelearner.com/how-to-improve-accuracy-of-random-forest-classifier/.
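
As one illustration of the last strategy, hyperparameter tuning with cross-validation can be sketched with `GridSearchCV`. The data here is synthetic and the parameter grid is an arbitrary, deliberately small assumption. A real search on the loans data would need a grid chosen for that problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for the scaled loans training set)
X_cv, y_cv = make_classification(n_samples=200, n_features=10, random_state=78)

# Small illustrative grid; the values are not recommendations
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=78),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X_cv, y_cv)

print(search.best_params_, round(search.best_score_, 3))
```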

Finally, a byproduct of the random forest algorithm is a ranking of feature importance (i.e., which features have the most impact on the decision).

The `RandomForestClassifier` of `sklearn` provides an attribute called `feature_importances_`, where you can see which features were the most significant.
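
The attribute holds one value per feature, and for a fitted forest the values are normalized so that they sum to 1. The sketch below shows this on synthetic data with hypothetical feature names, and pairs the values with column names in a pandas Series as one possible way to rank them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (not the loans features)
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y_arr = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=78
)
X_fi = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=78).fit(X_fi, y_arr)

# One importance value per column, sorted from most to least important
importance_series = pd.Series(
    forest.feature_importances_, index=X_fi.columns
).sort_values(ascending=False)

print(importance_series)
print(round(importance_series.sum(), 6))
```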

## Feature Importance

In [None]:
# Random Forests in sklearn will automatically calculate feature importance
importances = rf_model.feature_importances_



In [None]:
# We can sort the features by their importance
sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)



In [None]:
# Visualize the features by importance
importances_df = pd.DataFrame(sorted(zip(rf_model.feature_importances_, X.columns), reverse=True))
importances_df.set_index(importances_df[1], inplace=True)
importances_df.drop(columns=1, inplace=True)
importances_df.rename(columns={0: 'Feature Importances'}, inplace=True)
importances_sorted = importances_df.sort_values(by='Feature Importances')
importances_sorted.plot(kind='barh', color='lightgreen', title='Feature Importances', legend=False)



- In this demo, it can be seen that the `age` of the person and the month of the loan application (`month_num`) are the most relevant features.

- If we need to drop some features, analyzing feature importance could help to decide which features can be removed.
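
One possible way to act on the importance ranking is scikit-learn's `SelectFromModel`, which keeps only the features whose importance meets a threshold. This is a sketch on synthetic data. The `"median"` threshold is an arbitrary choice for illustration, not a recommendation for the loans dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption: stands in for the scaled loans features)
X_sel, y_sel = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=78
)
forest_sel = RandomForestClassifier(n_estimators=50, random_state=78).fit(X_sel, y_sel)

# Keep features whose importance is at least the median importance
selector = SelectFromModel(forest_sel, threshold="median", prefit=True)
X_reduced = selector.transform(X_sel)

print(X_sel.shape, "->", X_reduced.shape)
print(selector.get_support())  # boolean mask of the retained features
```

A model retrained on the reduced feature set should be re-evaluated on the test data, since dropping features can change performance in either direction.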