# Random Forest

Implementing a random forest classifier to detect spam

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import numpy as np
import sklearn.preprocessing

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [2]:
data = pd.read_csv("spambase.data", header=None)
X = data.drop([57], axis=1)  # drop the target column (57) and keep the remaining columns as features
Y = data[57]  # the target column: the spam label to predict

In [90]:
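sc = StandardScaler()  # standardizes each feature to zero mean and unit variance
n_estimators_values = [1, 3, 5, 10, 15, 20, 40, 70]
seeds = np.arange(20)
for num_estimators in n_estimators_values:  # test different values of 'n_estimators' (the number of trees in the forest)
    results1 = []  # accuracies for the gini criterion, one per seed
    results2 = []  # accuracies for the entropy criterion, one per seed
    for x in range(20):  # repeat the split and fit for 20 seeds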
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=seeds[x])
        sc.fit(X_train)
        X_train = pd.DataFrame(sc.transform(X_train))
        X_test = pd.DataFrame(sc.transform(X_test))
        model1 = RandomForestClassifier(n_estimators=num_estimators, criterion = 'gini', random_state=seeds[x])
        model1.fit(X_train, Y_train)
        pred = model1.predict(X_test)
        results1.append(accuracy_score(Y_test, pred))
        
        model2 = RandomForestClassifier(n_estimators=num_estimators, criterion = 'entropy', random_state=seeds[x])
        model2.fit(X_train, Y_train)
        pred = model2.predict(X_test)
        results2.append(accuracy_score(Y_test, pred))
    # Take the best (maximum) accuracy across the 20 seeds for this 'n_estimators' value
    best1 = max(results1)
    best2 = max(results2)
    print('num_estimator', num_estimators)
    print('gini impurity: ', best1)
    print('shannon i.g: ', best2)

num_estimator 1
gini impurity:  0.9102099927588704
shannon i.g:  0.9225199131064447
num_estimator 3
gini impurity:  0.943519188993483
shannon i.g:  0.9478638667632151
num_estimator 5
gini impurity:  0.943519188993483
shannon i.g:  0.9572773352643013
num_estimator 10
gini impurity:  0.9536567704561911
shannon i.g:  0.9630702389572773
num_estimator 15
gini impurity:  0.9608979000724113
shannon i.g:  0.9616220130340333
num_estimator 20
gini impurity:  0.9652425778421434
shannon i.g:  0.9616220130340333
num_estimator 40
gini impurity:  0.9695872556118754
shannon i.g:  0.9695872556118754
num_estimator 70
gini impurity:  0.9710354815351194
shannon i.g:  0.9695872556118754
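
For reference, the two split criteria compared in the output above can be computed by hand from a node's class distribution. The sketch below is a minimal illustration, not scikit-learn's internal implementation. The helper names gini_impurity and shannon_entropy are hypothetical, and the base-2 logarithm is an assumed choice (the base only rescales the entropy).

```python
import math


def gini_impurity(probs):
    """Gini impurity of a class distribution: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in probs)


def shannon_entropy(probs):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), treating 0 * log(0) as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Compare the two measures on a balanced node, an imbalanced node, and a pure node.
for probs in [(0.5, 0.5), (0.9, 0.1), (1.0, 0.0)]:
    print(probs, gini_impurity(probs), shannon_entropy(probs))
```

Both measures are zero for a pure node and largest for a balanced one. A tree picks the split that most reduces the chosen measure, and information gain is that reduction when the measure is entropy.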


The best test accuracy over the 20 seeds generally increased with n_estimators, the number of trees in the forest. Each individual tree is noisy and prone to overfitting, so its predictions have high variance on unseen data. Combining more trees averages out that noise and reduces the overall variance. For Gini impurity, the best accuracy (0.9710) was found with n_estimators = 70. For Shannon information gain, the best accuracy (0.9696) was reached at both 40 and 70. The downside is that the code runs noticeably slower as the number of estimators grows.

Beyond a certain point, adding estimators has little effect, because each new tree reduces the variance only marginally. The results show this: Shannon information gain did not improve at all from 40 to 70 estimators, and both criteria gained less than 0.02 between 10 and 70 estimators.

The gap between the two criteria was largest with few trees: information gain led by about 0.012 at n_estimators = 1 and by about 0.014 at n_estimators = 5. This is likely because a forest with so few trees behaves most like a basic decision tree. Conversely, at n_estimators = 70 (and also at 20) the Gini accuracy was slightly higher than the information gain accuracy. This could be because the variance was reduced enough to offset the imperfections that arise when using Gini impurity (each individual tree has higher variance, which is affected by imbalanced class probabilities). The margin at 70 is only about 0.0014, however, and every figure here is the maximum over 20 seeds rather than the mean, so small differences should be read with caution. Overall, it is important to choose the number of estimators carefully when running a random forest model.
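
The variance-reduction argument above can be illustrated with an idealized calculation. The sketch below assumes that every tree is correct independently with the same probability and that the forest takes a hard majority vote. Both are simplifying assumptions: trees in a real forest are correlated, so the actual gains are smaller. The function name majority_vote_accuracy and the per-tree accuracy of 0.7 are hypothetical choices for illustration.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is correct.

    Each voter is assumed to be correct with probability p_correct,
    independently of the others. n_voters must be odd to avoid ties.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Accuracy of the vote as the number of (idealized) trees grows.
for n in [1, 3, 5, 11, 21, 41, 71]:
    print(n, majority_vote_accuracy(n, 0.7))
```

Under these assumptions the vote's accuracy rises quickly for the first few voters and then flattens, which matches the diminishing returns seen between 40 and 70 estimators.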