# Random Forest Classifier

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In this section I will fit an initial out-of-the-box Random Forest and then optimise its hyper-parameters to try to improve my model's accuracy.

A Random Forest is an ensemble classifier: it trains many Decision Trees, each on a bootstrap sample of the training data, and averages their predictions in an effort to reduce over-fitting and improve accuracy on unseen data.
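To make the averaging concrete, here is a minimal sketch on synthetic stand-in data (an assumption; it is not the TF-IDF headline data used in this notebook, and the `_demo` names are only illustrative). It checks that scikit-learn's forest probabilities are the mean of the individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the TF-IDF headlines used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

forest_demo = RandomForestClassifier(n_estimators=25, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; average their class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest_demo.estimators_])
manual_average = tree_probas.mean(axis=0)

# The forest's own predict_proba should agree with the manual average
matches = np.allclose(manual_average, forest_demo.predict_proba(X_demo))
print(f'Forest probabilities equal the tree average: {matches}')
```

The predicted class is then simply the one with the highest averaged probability.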

First I need to import my TF-IDF vectorised data set.

In [3]:
# Import TF-IDF data
X_train_head = pd.read_csv('data/X_train_head.csv')

X_test_head = pd.read_csv('data/X_test_head.csv')

y_train_head = pd.read_csv('data/y_train_head.csv')

y_test_head = pd.read_csv('data/y_test_head.csv')

In [4]:
y_train_head.shape

(8258, 1)

In [5]:
from sklearn.ensemble import RandomForestClassifier

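random_forest = RandomForestClassifier()
# y_train_head loads as a single-column DataFrame; flatten it to the 1-D array scikit-learn expects
random_forest.fit(X_train_head, y_train_head.values.ravel())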

training_score = random_forest.score(X_train_head, y_train_head)
test_score = random_forest.score(X_test_head, y_test_head)
 
print(f'Training Accuracy: {training_score}')
print(f'Test Accuracy: {test_score}')


Training Accuracy: 0.6667473964640349
Test Accuracy: 0.6334140435835351


The out-of-the-box Random Forest Classifier has managed to do a little better than my optimised Logistic Regression Classifier using all the features.

It improved the training accuracy by 3.8 percentage points and the test accuracy by 0.7 percentage points. Let's see whether optimising this model can improve the accuracy scores further.

To optimise my model I will use GridSearchCV with 5-fold cross-validation. The hyper-parameters I will be optimising over are:

n_estimators, criterion, max_depth, min_samples_split, min_samples_leaf

In [6]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [10, 25, 50, 75, 100],
    'max_depth': np.arange(1,20),
    'min_samples_split': np.arange(2,10),
    'min_samples_leaf': np.arange(1,10),
    'criterion': ['gini', 'entropy']
}

gridsearch = GridSearchCV(RandomForestClassifier(), params, cv=5, n_jobs=-1)

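# Flatten the single-column target to 1-D so scikit-learn does not warn about its shape
gridsearch_results = gridsearch.fit(X_train_head, y_train_head.values.ravel())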
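An exhaustive grid grows multiplicatively with every hyper-parameter, so it is worth counting the work before launching the search. The sketch below copies the grid above into a separate dictionary (named `grid` here purely for illustration) and uses scikit-learn's `ParameterGrid` to count the candidates.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched above
grid = {
    'n_estimators': [10, 25, 50, 75, 100],
    'max_depth': np.arange(1, 20),
    'min_samples_split': np.arange(2, 10),
    'min_samples_leaf': np.arange(1, 10),
    'criterion': ['gini', 'entropy'],
}

n_candidates = len(ParameterGrid(grid))  # 5 * 19 * 8 * 9 * 2
n_folds = 5
n_fits = n_candidates * n_folds          # excludes the final refit on the full training set
print(f'{n_candidates} candidates x {n_folds} folds = {n_fits} fits')
# prints: 13680 candidates x 5 folds = 68400 fits
```

If that many fits is too slow, `RandomizedSearchCV` from the same module samples a fixed number of candidates instead of trying them all; I have not used it here.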


In [7]:
best_params = gridsearch_results.best_params_
best_params

{'criterion': 'gini',
 'max_depth': 18,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 25}

In [8]:
gridsearch_results.best_score_

0.6290872213333177

In [10]:
# Build a Random Forest with the optimised parameters
random_forest_opt = RandomForestClassifier(**best_params)
# Flatten the single-column target to the 1-D array scikit-learn expects
random_forest_opt.fit(X_train_head, y_train_head.values.ravel())

training_score = random_forest_opt.score(X_train_head, y_train_head)
test_score = random_forest_opt.score(X_test_head, y_test_head)
 
print(f'Training Accuracy: {training_score}')
print(f'Test Accuracy: {test_score}')

  


Training Accuracy: 0.6415596996851538
Test Accuracy: 0.625181598062954


Optimising over all the parameters did not improve my test accuracy over the out-of-the-box model: it fell from 63.3% to 62.5%. So using the Random Forest, my best test accuracy is 63.3%.
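One way to read these scores is through the gap between training and test accuracy. The short calculation below uses only the accuracies printed above, rounded to four decimal places; the dictionary layout is just for illustration.

```python
# Accuracies printed earlier in this notebook, rounded to four decimal places
scores = {
    'out-of-the-box': {'train': 0.6667, 'test': 0.6334},
    'grid-searched': {'train': 0.6416, 'test': 0.6252},
}

# Train-minus-test gap for each model
gaps = {}
for name, s in scores.items():
    gaps[name] = round(s['train'] - s['test'], 4)
    print(f"{name}: train-test gap = {gaps[name]:.4f}")
```

The gap roughly halves after the grid search (about 0.033 versus 0.016), which suggests the tuned forest over-fits less even though its test accuracy is slightly lower.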

Let's take a look at the feature importance from the original out-of-the-box Random Forest.

In [53]:
# Sort the feature indices from most important to least important
features = random_forest.feature_importances_
sorted_indices = np.argsort(-1*features)
sorted_features = features[sorted_indices]

# Pull out the top 10 features and pick out the corresponding columns
top_10 = sorted_indices[:10]
columns = X_train_head.columns[top_10]
print(f'Top features: {columns}')
print(f'Feature values: {sorted_features[:10]}')

Top features: Index(['h_obama', 'h_gop', 'h_debate', 'h_hillary', 'h_10', 'h_trump',
       'h_week', 'h_clinton', 'h_say', 'h_video'],
      dtype='object')
Feature values: [0.06820744 0.0648731  0.05471714 0.05193111 0.04784863 0.04256605
 0.03508505 0.03475165 0.03247237 0.02945785]


The top 10 features by importance from the out-of-the-box Random Forest are a mix of Democrat- and Republican-related terms, with political figures taking the focus: Obama, Hillary Clinton and Trump.
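A more compact way to keep feature names and importances paired is to wrap them in a pandas Series. This is a sketch on synthetic stand-in data with made-up column names (both assumptions; the real columns are the `h_`-prefixed TF-IDF terms above).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names (not the real TF-IDF features)
X_arr, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'h_word_{i}' for i in range(20)])

forest_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Index the importances by column name, then take the ten largest
importances = pd.Series(forest_demo.feature_importances_, index=X_demo.columns)
top_10 = importances.nlargest(10)
print(top_10)
```

Because the names travel with the values, there is no separate index array to keep in sync. Note that these impurity-based importances sum to 1 across all features, so each value is a share of the total rather than an absolute effect size.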

I was optimistic that an optimised Random Forest would provide a bit more improvement over the Logistic Regression, but it was not to be. Let's see what some Recurrent Neural Networks will be able to achieve.