# Twitter Sentiment Analysis - POC
---

## 7. Quickly train various models

**Current State**: I've trained three baseline models with little tweaking, and fine-tuned a single decision tree, on a Bag-of-Words subset of $m\approx250k, n=50k$ of the training data, using cross-validation, and got the following mean accuracy scores:

- Logistic Regression: $78.8\%$
- Naive Bayes: $77.6\%$ 
- SGD (log loss): $77.1\%$ 
- Decision Tree: $69.0\%$

**This Notebook**: Build random forests (DIY and using Scikit-learn's RandomForestClassifier class), and perform small grid searches to see whether I can quickly crack $80\%$ accuracy on the test set. 

- Results: the best cross-validated accuracy I could get was about $76\%$ (the grid search topped out at $75\%$), but each tree saw at most 8,000 instances (out of roughly 200,000 in the training set), so this result can probably be improved with more data. In the next notebook I'll take the best two estimators and plot their learning curves.

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

### Load $m\approx250k$, $n=50k$ training subset

In [2]:
# processed dir
proc_dir = os.path.join("..","data","3_processed","sentiment140")
X_train_transformed = sp.load_npz(os.path.join(proc_dir, "X_train_transformed_BoW_250k_50k.npz"))
with open(os.path.join(proc_dir, "y_array_250k.npy"), 'rb') as f:
    y_array = np.load(f)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X_train_transformed, 
                                                    y_array, 
                                                    test_size=0.2, 
                                                    random_state=42)

### DIY Random Forest

In the [POC for growing your own random forest](10.extra_GrowingRandomForests) I used 1,000 trees of 100 instances each for an 8,000-instance training dataset. If I were to mimic these proportions with the roughly 200,000 instances in our training data, I'd have to train about 25,000 trees with 2,500 instances each. That might take too long, so I'll start with the same 1,000 trees of 100 instances each and see where I get.
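
The scaling arithmetic above can be checked with a few lines. This is only a sketch: the variable names are made up for illustration, and 200,000 is the rounded training-set size quoted in the text.

```python
# Scale the POC forest (1,000 trees x 100 instances on 8,000 rows)
# to a training set of roughly 200,000 rows, keeping the same proportions.
poc_rows, poc_trees, poc_instances = 8_000, 1_000, 100
train_rows = 200_000  # approximate size of the training split

scale = train_rows / poc_rows                  # how many times larger the data is
scaled_trees = int(poc_trees * scale)          # trees needed at the same ratio
scaled_instances = int(poc_instances * scale)  # instances per tree at the same ratio

print(scale, scaled_trees, scaled_instances)
```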

I'm setting `max_features=sqrt(n_features)` so that we get more diverse trees and, hopefully, better accuracy. It should also make the forest run faster: instead of considering all $50,000$ features at each split, each tree considers only about $\sqrt{50,000}\approx224$ of them. The square root is the default for Scikit-learn's **RandomForestClassifier** class, not for the **DecisionTreeClassifier** class, whose default is `max_features=n_features`. [(source)](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/tree/_classes.py#L597)
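
As a quick check of that number, here is a small sketch of the arithmetic. One assumption to flag: I believe Scikit-learn truncates the square root to an integer, which would give 223 rather than the rounded 224; either way each split looks at well under 1% of the features.

```python
import math

n_features = 50_000
sqrt_features = math.sqrt(n_features)  # about 223.6

# Assumption: the library truncates to an integer (with a floor of 1)
features_per_split = max(1, int(sqrt_features))

print(round(sqrt_features, 1), features_per_split)
print(f"fraction of features considered per split: {features_per_split / n_features:.2%}")
```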


In [5]:
n_trees = 1000
n_instances = 100
subsets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=X_train.shape[0] - n_instances, random_state=42)
for train_sub_ix, test_sub_ix in rs.split(X_train):
    X_sub_train = X_train[train_sub_ix]
    y_sub_train = y_train[train_sub_ix]
    subsets.append((X_sub_train, y_sub_train))

With this loop we get `subsets`: a 1,000-long list of tuples, each holding a 100 by 50k sparse matrix of features and a 100-long NumPy array of target values.
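
To make the sampling scheme concrete, here is a dependency-free sketch of the same idea: each tree gets its own random subset of row indices, drawn without replacement within the subset, while different subsets may overlap. The `draw_subsets` helper is hypothetical and only stands in for the `ShuffleSplit` loop above; it is not how Scikit-learn implements it.

```python
import random

def draw_subsets(n_rows, n_trees, n_instances, seed=42):
    """Draw n_trees index subsets of size n_instances, each without replacement.

    Hypothetical stand-in for the ShuffleSplit loop; not its implementation.
    """
    rng = random.Random(seed)
    return [rng.sample(range(n_rows), n_instances) for _ in range(n_trees)]

# Small numbers so the sketch runs instantly
index_subsets = draw_subsets(n_rows=2_000, n_trees=10, n_instances=100)

print(len(index_subsets))                 # number of subsets, one per tree
print(len(index_subsets[0]))              # rows per subset
print(len(set(index_subsets[0])) == 100)  # no repeated rows inside a subset
```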

We start our forest by cloning our best estimator from the previous notebook 1,000 times.

In [6]:
best_estimator_ = DecisionTreeClassifier(random_state=42, 
                                         max_leaf_nodes=99,
                                         max_features="sqrt")

forest = [clone(best_estimator_) for _ in range(n_trees)]

Then we train each tree in our forest, make predictions on the test set and get the accuracy for each of these predictions.

In [7]:
accuracy_scores = []
start_loop = time.time()

for ix, (tree, (X_sub_train, y_sub_train)) in enumerate(zip(forest, subsets)):        
    tree.fit(X_sub_train, y_sub_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    if ix % 100 == 0 and ix != 0:
        ix_time = time.time()
        mins, secs = divmod(ix_time - start_loop, 60)
        avg_ix_time = (ix_time-start_loop)/(ix+1)
        rest_ix = n_trees-ix
        rest_mins, rest_secs = divmod(rest_ix * avg_ix_time, 60)
        print(f'Done {ix+1:0.0f} out of {n_trees:0.0f} tasks | \
elapsed: {mins:0.0f}m {secs:0.0f}s | remaining: {rest_mins:0.0f}m {rest_secs:0.0f}s')
    if ix == n_trees-1:
        ix_time = time.time()
        mins, secs = divmod(ix_time - start_loop, 60)
        print(f'Done {ix+1:0.0f} out of {n_trees:0.0f} tasks | \
elapsed: {mins:0.0f}m {secs:0.0f}s | finished')

Done 101 out of 1000 tasks | elapsed: 0m 6s | remaining: 0m 51s
Done 201 out of 1000 tasks | elapsed: 0m 11s | remaining: 0m 45s
Done 301 out of 1000 tasks | elapsed: 0m 17s | remaining: 0m 39s
Done 401 out of 1000 tasks | elapsed: 0m 23s | remaining: 0m 34s
Done 501 out of 1000 tasks | elapsed: 0m 29s | remaining: 0m 29s
Done 601 out of 1000 tasks | elapsed: 0m 35s | remaining: 0m 23s
Done 701 out of 1000 tasks | elapsed: 0m 40s | remaining: 0m 17s
Done 801 out of 1000 tasks | elapsed: 0m 46s | remaining: 0m 11s
Done 901 out of 1000 tasks | elapsed: 0m 51s | remaining: 0m 6s
Done 1000 out of 1000 tasks | elapsed: 0m 57s | finished


In [8]:
round(np.mean(accuracy_scores), 4)

0.5371

This is the **"magic"** step:

- for each test set instance, generate the predictions of the trees 
- keep only the most frequent prediction (the *mode*)

This procedure gives you the majority-vote predictions over the test set.
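
The two steps above can be sketched on toy data without any libraries. The `majority_vote` helper and the toy predictions are made up for illustration; the notebook itself uses `scipy.stats.mode` on a NumPy matrix.

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    """Return the most frequent label per test instance.

    predictions_per_tree: list of equal-length lists, one list per tree.
    """
    votes_per_instance = zip(*predictions_per_tree)  # transpose: one tuple per instance
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_instance]

# Three toy "trees" voting on four test instances
toy_predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
print(majority_vote(toy_predictions))
```

With an even number of trees a tie is possible; this sketch then keeps whichever label `Counter` saw first, which may differ from the tie-breaking rule of `scipy.stats.mode`.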

In [9]:
X_test.shape[0]

50294

In [10]:
# preallocate an n_trees x n_test_instances matrix (1,000 x 50,294 here)
Y_pred = np.empty([n_trees, X_test.shape[0]], dtype=np.uint8) 

In [11]:
from scipy.stats import mode

# generate predictions for each classifier
for tree_ix, tree in enumerate(forest):              
    Y_pred[tree_ix] = tree.predict(X_test)

# compute the mode for each y_pred
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [12]:
# get accuracy of preds on test target
round(accuracy_score(y_test, y_pred_majority_votes.reshape([-1])), 4)

0.6929

That shows some potential, so let's scale it up to 10,000 trees of 2,000 instances each.

In [13]:
n_trees = 10000
n_instances = 2000
subsets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=X_train.shape[0] - n_instances, random_state=42)
for train_sub_ix, test_sub_ix in rs.split(X_train):
    X_sub_train = X_train[train_sub_ix]
    y_sub_train = y_train[train_sub_ix]
    subsets.append((X_sub_train, y_sub_train))

In [14]:
best_estimator_ = DecisionTreeClassifier(random_state=42, 
                                         max_leaf_nodes=99,
                                         max_features="sqrt")

forest = [clone(best_estimator_) for _ in range(n_trees)]

In [15]:
accuracy_scores = []
start_loop = time.time()

for ix, (tree, (X_sub_train, y_sub_train)) in enumerate(zip(forest, subsets)):
    tree.fit(X_sub_train, y_sub_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    if ix % 500 == 0 and ix != 0:
        ix_time = time.time()
        mins, secs = divmod(ix_time - start_loop, 60)
        avg_ix_time = (ix_time-start_loop)/(ix+1)
        rest_ix = n_trees-ix
        rest_mins, rest_secs = divmod(rest_ix * avg_ix_time, 60)
        print(f'Done {ix+1:0.0f} out of {n_trees:0.0f} tasks | \
elapsed: {mins:0.0f}m {secs:0.0f}s | remaining: {rest_mins:0.0f}m {rest_secs:0.0f}s')
    if ix == n_trees-1:
        ix_time = time.time()
        mins, secs = divmod(ix_time - start_loop, 60)
        print(f'Done {ix+1:0.0f} out of {n_trees:0.0f} tasks | \
elapsed: {mins:0.0f}m {secs:0.0f}s | finished')

Done 501 out of 10000 tasks | elapsed: 0m 15s | remaining: 4m 36s
Done 1001 out of 10000 tasks | elapsed: 0m 29s | remaining: 4m 22s
Done 1501 out of 10000 tasks | elapsed: 0m 44s | remaining: 4m 7s
Done 2001 out of 10000 tasks | elapsed: 0m 59s | remaining: 3m 55s
Done 2501 out of 10000 tasks | elapsed: 1m 13s | remaining: 3m 40s
Done 3001 out of 10000 tasks | elapsed: 1m 28s | remaining: 3m 25s
Done 3501 out of 10000 tasks | elapsed: 1m 43s | remaining: 3m 10s
Done 4001 out of 10000 tasks | elapsed: 1m 57s | remaining: 2m 56s
Done 4501 out of 10000 tasks | elapsed: 2m 12s | remaining: 2m 42s
Done 5001 out of 10000 tasks | elapsed: 2m 27s | remaining: 2m 27s
Done 5501 out of 10000 tasks | elapsed: 2m 41s | remaining: 2m 12s
Done 6001 out of 10000 tasks | elapsed: 2m 56s | remaining: 1m 57s
Done 6501 out of 10000 tasks | elapsed: 3m 11s | remaining: 1m 43s
Done 7001 out of 10000 tasks | elapsed: 3m 26s | remaining: 1m 28s
Done 7501 out of 10000 tasks | elapsed: 3m 41s | remaining: 1m 1

In [16]:
round(np.mean(accuracy_scores), 4)

0.593

**magic** step:

In [17]:
Y_pred = np.empty([n_trees, X_test.shape[0]], dtype=np.uint8)

In [18]:
# generate predictions for each classifier
for ix, tree in enumerate(forest):         
    Y_pred[ix] = tree.predict(X_test)

In [19]:
# compute mode for each y_pred
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [20]:
# get accuracy of preds on test target
round(accuracy_score(y_test, y_pred_majority_votes.reshape([-1])), 4)

0.6842

Interesting: the majority-vote accuracy dropped slightly despite the larger forest. Maybe each split needs more candidate features given the larger instance space. I'll change only `max_features` to 11,000, matching a classifier found in the POC, and train the same forest.

In [21]:
n_trees = 10000
n_instances = 2000
subsets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=X_train.shape[0] - n_instances, random_state=42)
for train_sub_ix, test_sub_ix in rs.split(X_train):
    X_sub_train = X_train[train_sub_ix]
    y_sub_train = y_train[train_sub_ix]
    subsets.append((X_sub_train, y_sub_train))

In [26]:
best_estimator_ = DecisionTreeClassifier(random_state=42, 
                                         max_leaf_nodes=99,
                                         max_features=11000)

forest = [clone(best_estimator_) for _ in range(n_trees)]

In [27]:
accuracy_scores = []
start_loop = time.time()

for ix, (tree, (X_sub_train, y_sub_train)) in enumerate(zip(forest, subsets)):
    tree.fit(X_sub_train, y_sub_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    if ix % 500 == 0 and ix != 0:
        ix_time = time.time()
        mins, secs = divmod(ix_time - start_loop, 60)
        avg_ix_time = (ix_time-start_loop)/(ix+1)
        rest_ix = n_trees-ix
        rest_mins, rest_secs = divmod(rest_ix * avg_ix_time, 60)
        print(f'Done {ix+1:0.0f} out of {n_trees:0.0f} tasks | \
elapsed: {mins:0.0f}m {secs:0.0f}s | remaining: {rest_mins:0.0f}m {rest_secs:0.0f}s')
    if ix == n_trees-1:
        ix_time = time.time()
        mins, secs = divmod(ix_time - start_loop, 60)
        print(f'Done {ix+1:0.0f} out of {n_trees:0.0f} tasks | \
elapsed: {mins:0.0f}m {secs:0.0f}s | finished')

Done 501 out of 10000 tasks | elapsed: 1m 40s | remaining: 31m 44s
Done 1001 out of 10000 tasks | elapsed: 3m 20s | remaining: 29m 57s
Done 1501 out of 10000 tasks | elapsed: 4m 58s | remaining: 28m 9s
Done 2001 out of 10000 tasks | elapsed: 6m 37s | remaining: 26m 28s
Done 2501 out of 10000 tasks | elapsed: 8m 16s | remaining: 24m 47s
Done 3001 out of 10000 tasks | elapsed: 9m 55s | remaining: 23m 8s
Done 3501 out of 10000 tasks | elapsed: 11m 33s | remaining: 21m 27s
Done 4001 out of 10000 tasks | elapsed: 13m 12s | remaining: 19m 48s
Done 4501 out of 10000 tasks | elapsed: 14m 51s | remaining: 18m 9s
Done 5001 out of 10000 tasks | elapsed: 16m 30s | remaining: 16m 29s
Done 5501 out of 10000 tasks | elapsed: 18m 8s | remaining: 14m 50s
Done 6001 out of 10000 tasks | elapsed: 19m 47s | remaining: 13m 11s
Done 6501 out of 10000 tasks | elapsed: 21m 26s | remaining: 11m 33s
Done 7001 out of 10000 tasks | elapsed: 23m 5s | remaining: 9m 54s
Done 7501 out of 10000 tasks | elapsed: 24m 44s

In [28]:
round(np.mean(accuracy_scores), 4)

0.6375

**magic** step:

In [29]:
Y_pred = np.empty([n_trees, X_test.shape[0]], dtype=np.uint8)

In [30]:
# generate predictions for each classifier
for ix, tree in enumerate(forest):         
    Y_pred[ix] = tree.predict(X_test)

In [31]:
# compute mode for each y_pred
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [32]:
# get accuracy of preds on test target
round(accuracy_score(y_test, y_pred_majority_votes.reshape([-1])), 4)

0.687

To recap, we trained DIY random forests leveraging Scikit-learn's **DecisionTreeClassifier** class three times, changing the following parameters (and keeping `max_leaf_nodes=99`). Curiously, the fastest and smallest forest got the best majority-vote accuracy so far.

```
n_trees=1000
n_instances=100
max_features="sqrt"
avg accuracy=0.5371
majority vote accuracy=0.6929

n_trees=10000
n_instances=2000
max_features="sqrt"
avg accuracy=0.593
majority vote accuracy=0.6842

n_trees=10000
n_instances=2000
max_features=11000
avg accuracy=0.6375
majority vote accuracy=0.687
```
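
One way to read the gap between average tree accuracy and majority-vote accuracy: if the trees made independent errors (an idealization that almost certainly does not hold here), the probability that most of them are right follows a binomial distribution. The sketch below computes that idealized figure for the first forest; `majority_correct_prob` is a hypothetical helper, not part of the notebook.

```python
import math

def majority_correct_prob(n_voters, p_correct):
    """P(more than half of n independent voters are right), via the binomial pmf."""
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space for stability
        log_pmf = (math.lgamma(n_voters + 1)
                   - math.lgamma(k + 1)
                   - math.lgamma(n_voters - k + 1)
                   + k * math.log(p_correct)
                   + (n_voters - k) * math.log(1 - p_correct))
        total += math.exp(log_pmf)
    return total

# First DIY forest: 1,000 trees with mean individual accuracy 0.5371
ideal = majority_correct_prob(1000, 0.5371)
print(f"idealized majority-vote accuracy: {ideal:.3f}")
print("observed majority-vote accuracy:  0.6929")
```

The idealized value comes out close to 1, far above the observed 0.6929, which would be consistent with the trees' errors being strongly correlated (for example, the same tweets being hard for most trees).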

### Scikit-learn's RandomForestClassifier

[(source)](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/ensemble/_forest.py#L883)

```
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
```

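Notably, `bootstrap=True` is the default, and with it `max_samples=None` means each tree is trained on a bootstrap sample as large as the training set (`X.shape[0]` draws with replacement). To mimic the DIY forests, I set `max_samples` explicitly below.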
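
A rough, library-free sketch of what those two settings imply for the rows a single tree sees. The `bootstrap_indices` helper is hypothetical and only mirrors the documented behavior (draw with replacement; `None` means as many draws as rows); it is not Scikit-learn's code.

```python
import random

def bootstrap_indices(n_rows, max_samples=None, seed=42):
    """Draw row indices with replacement.

    max_samples=None draws n_rows indices; an int draws that many.
    Hypothetical helper for illustration only.
    """
    n_draws = n_rows if max_samples is None else max_samples
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_draws)]

n_rows = 10_000
full = bootstrap_indices(n_rows)                    # max_samples=None
small = bootstrap_indices(n_rows, max_samples=100)  # a small explicit sample size

print(len(full), len(set(full)) / n_rows)  # roughly 0.63 of the rows are distinct
print(len(small))
```

Because draws are with replacement, a full-size bootstrap sample contains only about 63% distinct rows, whereas the DIY forests above sampled each subset without replacement.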

In [34]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=1000, # same value as first DIY forest
                                max_samples=100, # ibid.
                                max_features="sqrt", # ibid.
                                max_leaf_nodes=99, # ibid.
                                random_state=42, # ibid.
                                n_jobs=-1, 
                                verbose=1)

In [35]:
rf_clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   15.0s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   19.4s finished


RandomForestClassifier(max_features='sqrt', max_leaf_nodes=99, max_samples=100,
                       n_estimators=1000, n_jobs=-1, random_state=42,
                       verbose=1)

In [36]:
y_pred = rf_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.6s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    3.1s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    3.9s finished


In [37]:
round(accuracy_score(y_test, y_pred), 4)

0.6993

In [38]:
# more robust evaluation
scores = cross_val_score(rf_clf, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   10.1s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   17.0s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   21.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.1s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    2.7s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    4.9s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    6.2s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.8s remaining:    0.0s


[CV] ................................................. , total=  28.8s
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   19.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    2.7s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    3.5s finished


[CV] ................................................. , total=  23.9s
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   11.0s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    2.7s


[CV] ................................................. , total=  15.3s
Accuracy: 0.70 (+/- 0.01)


[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    3.5s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.1min finished


In [39]:
rf_clf = RandomForestClassifier(n_estimators=10000, # same value as second DIY forest
                                max_samples=2000, # ibid.
                                max_features="sqrt", # ibid.
                                max_leaf_nodes=99, # ibid.
                                random_state=42, # ibid.
                                n_jobs=-1, 
                                verbose=1)

In [40]:
rf_clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   13.1s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:   39.8s
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:   51.7s
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 8434 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 9784 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 10000 out of 

RandomForestClassifier(max_features='sqrt', max_leaf_nodes=99, max_samples=2000,
                       n_estimators=10000, n_jobs=-1, random_state=42,
                       verbose=1)

In [41]:
y_pred = rf_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.1s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    2.0s
[Parallel(n_jobs=8)]: Done 1234 tasks      | elapsed:    3.2s
[Parallel(n_jobs=8)]: Done 1784 tasks      | elapsed:    4.7s
[Parallel(n_jobs=8)]: Done 2434 tasks      | elapsed:    6.4s
[Parallel(n_jobs=8)]: Done 3184 tasks      | elapsed:    8.4s
[Parallel(n_jobs=8)]: Done 4034 tasks      | elapsed:   10.6s
[Parallel(n_jobs=8)]: Done 4984 tasks      | elapsed:   13.1s
[Parallel(n_jobs=8)]: Done 6034 tasks      | elapsed:   15.9s
[Parallel(n_jobs=8)]: Done 7184 tasks      | elapsed:   18.9s
[Parallel(n_jobs=8)]: Done 8434 tasks      | elapsed:   22.2s
[Parallel(n_jobs=8)]: Done 9784 tasks      | elapsed:   25.7s
[Parallel(n_jobs=8)]: Done 10000 out of 10000 | elapsed:

In [42]:
round(accuracy_score(y_test, y_pred), 4)

0.7441

In [43]:
# more robust evaluation
scores = cross_val_score(rf_clf, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   11.8s
[Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done 2452 tasks      | elapsed:   34.4s
[Parallel(n_jobs=-1)]: Done 3552 tasks      | elapsed:   49.6s
[Parallel(n_jobs=-1)]: Done 4852 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 6352 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 8052 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 9952 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  2.3min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]

[CV] ................................................. , total= 3.1min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   12.2s
[Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:   22.2s
[Parallel(n_jobs=-1)]: Done 2452 tasks      | elapsed:   35.3s
[Parallel(n_jobs=-1)]: Done 3552 tasks      | elapsed:   50.7s
[Parallel(n_jobs=-1)]: Done 4852 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 6352 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 8052 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 9952 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  2.4min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.7s
[Parallel(n_jobs=8)]

[CV] ................................................. , total= 3.1min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   12.0s
[Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done 2452 tasks      | elapsed:   34.5s
[Parallel(n_jobs=-1)]: Done 3552 tasks      | elapsed:   50.4s
[Parallel(n_jobs=-1)]: Done 4852 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 6352 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 8052 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 9952 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  2.4min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.7s
[Parallel(n_jobs=8)]

[CV] ................................................. , total= 3.2min
Accuracy: 0.75 (+/- 0.00)


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  9.4min finished


Using the same values as the third DIY forest decreased accuracy to $73\%$ after cross-validation, which suggests we started overfitting. So the second classifier has performed best so far, with $75\%$ accuracy. I'll attempt one more semi-random choice below and then do a light grid search as a last-ditch attempt to crack $80\%$ accuracy.

In [50]:
rf_clf = RandomForestClassifier(n_estimators=300, # grow few trees...
                                max_depth=500, # ... as deep as they go...
                                max_features="sqrt", # using default num features (about 224)
                                max_samples=5000, # and a reasonable instance space - maybe increase?
                                random_state=42,
                                n_jobs=-1, 
                                verbose=1)

In [51]:
rf_clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   50.9s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.4min finished


RandomForestClassifier(max_depth=500, max_features='sqrt', max_samples=5000,
                       n_estimators=300, n_jobs=-1, random_state=42, verbose=1)

In [52]:
y_pred = rf_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.8s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    2.9s finished


In [53]:
round(accuracy_score(y_test, y_pred), 4)

0.7616

In [54]:
# more robust evaluation
scores = cross_val_score(rf_clf, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   17.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   56.4s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.5min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    3.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    4.9s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.6min remaining:    0.0s


[CV] ................................................. , total= 1.6min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   48.7s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.3min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    2.8s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    4.7s finished


[CV] ................................................. , total= 1.4min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   48.4s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.3min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    3.1s


[CV] ................................................. , total= 1.4min
Accuracy: 0.76 (+/- 0.00)


[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  4.4min finished


### Grid search

Since random forests take time to train, I'm performing only a small grid search, still keeping `max_leaf_nodes=99` constant.
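
Even a "small" grid adds up quickly, because the number of fits is the product of the list lengths times the number of folds. The sketch below counts them for the same value lists and 3 folds used in this section; the variable names are only illustrative.

```python
from itertools import product

# Same value lists as the grid used in this section
param_grid = {
    "n_estimators": [500, 1000, 2000],
    "max_features": [200, 400, 800],
    "max_samples": [1000, 2000, 4000, 8000],
    "max_depth": [2, 8, 64, 512],
}
n_folds = 3

candidates = list(product(*param_grid.values()))  # every parameter combination
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)  # 144 candidates, 432 fits
```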

In [55]:
from sklearn.model_selection import GridSearchCV

params = {'n_estimators': [500, 1000, 2000],
          'max_features':[200, 400, 800],
          'max_samples':[1000, 2000, 4000, 8000],
          'max_depth':[2, 8, 64, 512]}

rf_clf = RandomForestClassifier(random_state=42, max_leaf_nodes=99)

grid_search_cv = GridSearchCV(rf_clf, params, n_jobs=-1, verbose=1, cv=3)

In [56]:
grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 144 candidates, totalling 432 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 37.5min
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed: 140.7min finished


GridSearchCV(cv=3,
             estimator=RandomForestClassifier(max_leaf_nodes=99,
                                              random_state=42),
             n_jobs=-1,
             param_grid={'max_depth': [2, 8, 64, 512],
                         'max_features': [200, 400, 800],
                         'max_samples': [1000, 2000, 4000, 8000],
                         'n_estimators': [500, 1000, 2000]},
             verbose=1)

In [57]:
grid_search_cv.best_params_

{'max_depth': 512,
 'max_features': 200,
 'max_samples': 8000,
 'n_estimators': 1000}

In [58]:
y_pred = grid_search_cv.predict(X_test)
round(accuracy_score(y_test, y_pred), 4)

0.7498

In [59]:
# print every parameter combination that ties for the best mean CV accuracy
mean_scores = grid_search_cv.cv_results_['mean_test_score']
best_score = max(mean_scores)
for i, v in enumerate(mean_scores):
    if v == best_score:
        print('Max mean test accuracy:', round(v, 4),
              '\nParams:', grid_search_cv.cv_results_['params'][i])

Max mean test accuracy: 0.7527 
Params: {'max_depth': 512, 'max_features': 200, 'max_samples': 8000, 'n_estimators': 1000}


### Save grid search results

In [60]:
from joblib import dump, load

model_dir = os.path.join("..","data","4_models","sentiment140")

In [61]:
file_path = os.path.join(model_dir, "20201026_RandomForestClassifier_GridSearchCV.joblib")
dump(grid_search_cv, file_path)

['..\\data\\4_models\\sentiment140\\20201026_RandomForestClassifier_GridSearchCV.joblib']

---