# Random Forest Classifier For Spam Emails

This notebook shows how to use a random forest classifier (RFC) to classify spam emails. I chose the RFC model for the following reasons: it is robust because it aggregates many decision trees, it is less prone to overfitting than a single tree because averaging over the trees reduces variance, missing values can be handled (in Breiman's original formulation, using either median values or proximity-weighted averages; scikit-learn does not do this imputation for you), and the relative feature importance is easily accessible.

However, random forests are slower than many simpler models because multiple decision trees have to be built. For each prediction, every tree has to make its own prediction and the model then has to combine them by voting. Moreover, the model is harder to interpret than a single decision tree, since a prediction cannot be traced along one path. Despite these drawbacks, the benefits seem to outweigh the problems, so it is likely a good option here.
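
To make the voting step concrete, here is a minimal sketch in plain Python. The three "trees" and their probabilities are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helper names. It shows a hard majority vote next to the averaging of class probabilities, which is how scikit-learn's forest combines its trees.

```python
from collections import Counter

# Hypothetical per-tree outputs for one email: [P(ham), P(spam)].
# These numbers are invented for illustration only.
tree_probas = [
    [0.90, 0.10],  # tree 1 is confident the email is ham
    [0.40, 0.60],  # tree 2 leans towards spam
    [0.45, 0.55],  # tree 3 leans towards spam
]
classes = ['ham', 'spam']


def hard_vote(probas):
    """Each tree votes for its most likely class; the majority wins."""
    votes = [classes[p.index(max(p))] for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the class probabilities over the trees, then pick the largest."""
    n_trees = len(probas)
    mean = [sum(p[i] for p in probas) / n_trees for i in range(len(classes))]
    return classes[mean.index(max(mean))], mean


print(hard_vote(tree_probas))
print(soft_vote(tree_probas))
```

With these numbers the two schemes disagree: two of the three trees vote for spam, but the averaged probabilities favour ham because the first tree is very confident.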

First, let's install the packages we need, just in case you don't already have them.

In [1]:
%%capture
!pip install pandas
!pip install scikit-learn

Some imports to help us move forward.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd
import re

Now we can read from the files in order to create the dataframe. Then we split into training and testing sets.

In [3]:
# read in the names for the columns
with open('spambase/spambase.names') as fp:
    cols = []
    for line in fp:
        # ignore the documentation lines
        if line.startswith('|'):
            continue
        # essentially matches lines that go "foo_bar: foo bar."
        # colname = foo_bar
        regexer = re.match(r'(?P<colname>.+):.*\.', line)
        # add colname to the list of columns
        if regexer:
            cols.append(regexer.group('colname'))
    # append the label to the list of columns
    cols.append('label')

# now we can create the dataframe, this will contain all of the data
options = {'header': None, 'names': cols, 'skipinitialspace': True}
spambase = pd.read_csv('spambase/spambase.data', **options)

# split the dataframe into predictors and labels
x_spambase = spambase.drop(['label'], axis=1)
y_spambase = spambase['label']

# finally, split that result into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x_spambase, y_spambase, test_size=0.2, shuffle=True)
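
To see what the column-name parsing above does, here is a small self-contained sketch that runs the same regular expression over a few sample lines. The sample lines are an assumption about what `spambase.names` looks like, not a copy of the file.

```python
import re

# Sample lines in the style of spambase.names (assumed for illustration).
sample_lines = [
    '| This is a documentation line.\n',
    '1, 0.    | spam, non-spam classes\n',
    '\n',
    'word_freq_make:         continuous.\n',
    'char_freq_;:            continuous.\n',
]

cols = []
for line in sample_lines:
    # documentation lines start with a pipe and are skipped
    if line.startswith('|'):
        continue
    # only lines of the form "name: description." match
    match = re.match(r'(?P<colname>.+):.*\.', line)
    if match:
        cols.append(match.group('colname'))
# the label column is not listed by name, so it is appended by hand
cols.append('label')

print(cols)
```

The class line and the blank line contain no colon, so they are ignored and only the attribute names end up in `cols`.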

Next we'll run the random forest classifier and check the accuracy of the model.

In [4]:
# set up the classifier
mdl = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)

# fit the model to the training data
mdl.fit(x_train, y_train)

# predict labels for the test set so we can check accuracy
y_pred = mdl.predict(x_test)

# calculate the accuracy and the support-weighted average precision, recall, and F1
accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
scores = classification_report(y_test, y_pred, output_dict=True)['weighted avg']
accuracy_info = f'''Test accuracy score: {accuracy}%
    Precision: {scores['precision']}
    Recall:    {scores['recall']}
    F1-Score:  {scores['f1-score']}
    Support:   {scores['support']}
'''

Tada, relatively high accuracy!

In [5]:
print(accuracy_info)

Test accuracy score: 94.9%
    Precision: 0.9490015478997007
    Recall:    0.9489685124864278
    F1-Score:  0.9487841760666152
    Support:   921
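
The precision, recall, and F1 values above come from the `weighted avg` row of the classification report, which averages each per-class metric weighted by that class's support. The sketch below reproduces that arithmetic in plain Python. The confusion counts are made up for illustration and are not the results from this notebook.

```python
# Hypothetical confusion counts (invented, not this notebook's results).
# confusion[true_label][predicted_label] = number of emails
confusion = {
    'ham':  {'ham': 580, 'spam': 20},
    'spam': {'ham': 40,  'spam': 360},
}
labels = list(confusion)


def support(label):
    """Number of emails whose true class is `label`."""
    return sum(confusion[label].values())


def precision(label):
    """Of the emails predicted as `label`, the fraction that really are."""
    predicted = sum(confusion[true][label] for true in labels)
    return confusion[label][label] / predicted


def recall(label):
    """Of the emails that really are `label`, the fraction that were found."""
    return confusion[label][label] / support(label)


total = sum(support(label) for label in labels)
weighted_precision = sum(precision(l) * support(l) for l in labels) / total
weighted_recall = sum(recall(l) * support(l) for l in labels) / total
accuracy = sum(confusion[l][l] for l in labels) / total

print(weighted_precision, weighted_recall, accuracy)
```

Support-weighted recall always works out to the overall accuracy, which is why the recall printed above agrees with the 94.9% accuracy score.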

