# Twitter Sentiment Analysis - POC
---

## Random Forests 1 - Tfidf

With the BoW representation, random forests reached $75\%$ accuracy for the first time. This notebook tests whether the Tfidf representation, with the same random forest classifiers, reaches a better accuracy.
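As a reminder of what changes between the two representations, here is a minimal sketch on a toy corpus. It assumes scikit-learn's `CountVectorizer` and `TfidfTransformer` with default settings; the toy tweets are made up, and the preprocessing that actually produced the matrices used in this notebook is not shown here and may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, for illustration only (not taken from sentiment140)
toy_tweets = ["love this phone", "hate this phone", "love love love"]

# BoW: raw term counts per tweet
bow = CountVectorizer().fit_transform(toy_tweets)

# Tfidf: the same counts, reweighted by inverse document frequency
# and L2-normalized per row (the TfidfTransformer defaults)
tfidf = TfidfTransformer().fit_transform(bow)

# Same shape and same sparsity pattern, only the values differ
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(bow.shape, tfidf.shape, row_norms)
```

Since only the feature values change, any difference in accuracy between the two notebooks would come from the reweighting rather than from the vocabulary.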

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

### Load $m\approx250k$, $n=50k$ training subset

In [2]:
# processed data directory
proc_dir = os.path.join("..", "data", "3_processed", "sentiment140")
# Tfidf features: sparse matrix, m≈250k rows (tweets) by n=50k columns (terms)
X_train_transformed = sp.load_npz(os.path.join(proc_dir, "X_train_transformed_Tfidf_250k_50k.npz"))
# matching sentiment labels
y_array = np.load(os.path.join(proc_dir, "y_array_250k.npy"))

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X_train_transformed, 
                                                    y_array, 
                                                    test_size=0.2, 
                                                    random_state=42)

### Scikit-learn's RandomForestClassifier

In [4]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=1000, # same value as first DIY forest
                                max_samples=100, # ibid.
                                max_features="sqrt", # ibid.
                                max_leaf_nodes=99, # ibid.
                                random_state=42, # ibid.
                                n_jobs=-1, 
                                verbose=1)

In [5]:
rf_clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   16.7s finished


RandomForestClassifier(max_features='sqrt', max_leaf_nodes=99, max_samples=100,
                       n_estimators=1000, n_jobs=-1, random_state=42,
                       verbose=1)

In [6]:
y_pred = rf_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    2.7s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    3.5s finished


In [7]:
# BoW accuracy was 0.6993
round(accuracy_score(y_test, y_pred), 4)

0.7056

In [8]:
# more robust evaluation (3-fold CV): basically the same as the hold-out accuracy
scores = cross_val_score(rf_clf, X_train_transformed, y_array, cv=3, verbose=1, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   16.5s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   20.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.1s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    2.6s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    4.8s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    6.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elaps

Accuracy: 0.69 (+/- 0.01)


[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    6.0s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.3min finished
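The `+/-` figure in the printed summary is twice the standard deviation of the three fold accuracies. A small sketch of that arithmetic follows; the fold scores are hypothetical values chosen for illustration, not the ones produced by this run.

```python
import numpy as np

# Hypothetical fold accuracies, for illustration only (not results from this notebook)
fold_scores = np.array([0.68, 0.69, 0.70])

mean = fold_scores.mean()
# NumPy's default is the population standard deviation (ddof=0),
# which is what scores.std() computes in the cell above
spread = 2 * fold_scores.std()
# The sample standard deviation (ddof=1) is noticeably larger with only 3 folds
spread_sample = 2 * fold_scores.std(ddof=1)

summary = "Accuracy: %0.2f (+/- %0.2f)" % (mean, spread)
print(summary)
print("with ddof=1: +/- %0.4f" % spread_sample)
```

With only three folds the interval is a rough indication at best, so small differences between BoW and Tfidf should be read with care.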


In [9]:
rf_clf = RandomForestClassifier(n_estimators=10000, # same value as second DIY forest
                                max_samples=2000, # ibid.
                                max_features="sqrt", # ibid.
                                max_leaf_nodes=99, # ibid.
                                random_state=42, # ibid.
                                n_jobs=-1, 
                                verbose=1)

In [10]:
rf_clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   10.0s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   18.0s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   28.4s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   40.9s
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:   56.4s
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 8434 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 9784 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 10000 out of 

RandomForestClassifier(max_features='sqrt', max_leaf_nodes=99, max_samples=2000,
                       n_estimators=10000, n_jobs=-1, random_state=42,
                       verbose=1)

In [11]:
y_pred = rf_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    2.8s
[Parallel(n_jobs=8)]: Done 1234 tasks      | elapsed:    4.5s
[Parallel(n_jobs=8)]: Done 1784 tasks      | elapsed:    6.6s
[Parallel(n_jobs=8)]: Done 2434 tasks      | elapsed:    9.0s
[Parallel(n_jobs=8)]: Done 3184 tasks      | elapsed:   11.9s
[Parallel(n_jobs=8)]: Done 4034 tasks      | elapsed:   15.3s
[Parallel(n_jobs=8)]: Done 4984 tasks      | elapsed:   18.9s
[Parallel(n_jobs=8)]: Done 6034 tasks      | elapsed:   22.9s
[Parallel(n_jobs=8)]: Done 7184 tasks      | elapsed:   27.2s
[Parallel(n_jobs=8)]: Done 8434 tasks      | elapsed:   31.9s
[Parallel(n_jobs=8)]: Done 9784 tasks      | elapsed:   37.1s
[Parallel(n_jobs=8)]: Done 10000 out of 10000 | elapsed:

In [12]:
# BoW accuracy 0.7441
round(accuracy_score(y_test, y_pred), 4)

0.7448

In [13]:
# more robust evaluation
# BoW got 0.75 (+/- 0.00)
scores = cross_val_score(rf_clf, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.5s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   30.0s
[Parallel(n_jobs=-1)]: Done 2040 tasks      | elapsed:   40.6s
[Parallel(n_jobs=-1)]: Done 3340 tasks      | elapsed:   57.7s
[Parallel(n_jobs=-1)]: Done 4840 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 6540 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 8440 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 9985 out of 10000 | elapsed:  2.5min remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  2.5min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1

[CV] ................................................. , total= 3.4min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-1)]: Done 2452 tasks      | elapsed:   35.8s
[Parallel(n_jobs=-1)]: Done 3552 tasks      | elapsed:   51.8s
[Parallel(n_jobs=-1)]: Done 4852 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 6352 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 8052 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 9952 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  2.4min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]

[CV] ................................................. , total= 3.2min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:   22.6s
[Parallel(n_jobs=-1)]: Done 2452 tasks      | elapsed:   35.4s
[Parallel(n_jobs=-1)]: Done 3552 tasks      | elapsed:   51.3s
[Parallel(n_jobs=-1)]: Done 4852 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 6352 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 8052 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 9952 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:  2.4min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.7s
[Parallel(n_jobs=8)]

[CV] ................................................. , total= 3.2min
Accuracy: 0.7469 (+/- 0.0019)


[Parallel(n_jobs=8)]: Done 10000 out of 10000 | elapsed:   42.8s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  9.7min finished


In [14]:
rf_clf = RandomForestClassifier(n_estimators=300, # grow few trees...
                                max_depth=500, # ... as deep as they go...
                                max_features="sqrt", # using default num features (about 224)
                                max_samples=5000, # and a reasonable instance space - maybe increase?
                                random_state=42,
                                n_jobs=-1, 
                                verbose=1)
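The "about 224" in the comment comes from $\sqrt{50000}\approx223.6$. As a quick check, the sketch below fits a single tree on random data with 50k features and reads back the number of features it considers per split. The random data are a stand-in, not the Tfidf matrix, and the check relies on the tree's `max_features_` attribute; it suggests that scikit-learn truncates the square root, giving 223 rather than 224.

```python
import math

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_features = 50_000
print(math.sqrt(n_features))  # about 223.6

# Small random stand-in data set with the same number of columns
rng = np.random.default_rng(42)
X_demo = rng.random((20, n_features))
y_demo = np.array([0, 1] * 10)

tree = DecisionTreeClassifier(max_features="sqrt", random_state=42).fit(X_demo, y_demo)
# Number of candidate features drawn at each split
print(tree.max_features_)
```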

In [15]:
rf_clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   30.0s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   48.8s finished


RandomForestClassifier(max_depth=500, max_features='sqrt', max_samples=5000,
                       n_estimators=300, n_jobs=-1, random_state=42, verbose=1)

In [16]:
y_pred = rf_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    2.3s finished


In [17]:
# BoW got 0.7616
round(accuracy_score(y_test, y_pred), 4)

0.754

In [18]:
# more robust evaluation
# BoW accuracy: 0.76 (+/- 0.00)
scores = cross_val_score(rf_clf, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   30.3s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   47.9s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.8s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    3.1s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   51.7s remaining:    0.0s


[CV] ................................................. , total=  51.8s
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   28.9s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   46.3s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.8s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    3.0s finished


[CV] ................................................. , total=  49.9s
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   46.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.8s


[CV] ................................................. , total=  49.8s
Accuracy: 0.7555 (+/- 0.0043)


[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    3.0s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.5min finished


### Grid search

Only the best configuration found by the grid search is shown here.
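The search itself is not recorded in this notebook. For reference, a search over the same three hyperparameters might be set up roughly as follows with scikit-learn's `GridSearchCV`. The grid values, the small forest, and the synthetic data are all assumptions made to keep the sketch fast; they are not the grid that produced the configuration below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Tfidf matrix and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Hypothetical grid over the hyperparameters tuned in this notebook
param_grid = {
    "max_depth": [4, 16],
    "max_features": [2, 4],
    "max_samples": [50, 100],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```

On the real data each fit takes minutes, so a coarse grid or `RandomizedSearchCV` would probably be the more practical choice.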

In [20]:
grid_search_best = RandomForestClassifier(n_estimators=1000,
                                max_depth=512, 
                                max_features=200, 
                                max_samples=8000, 
                                random_state=42,
                                n_jobs=-1, 
                                verbose=1)

In [21]:
grid_search_best.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    9.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   46.7s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.3min finished


RandomForestClassifier(max_depth=512, max_features=200, max_samples=8000,
                       n_estimators=1000, n_jobs=-1, random_state=42,
                       verbose=1)

In [22]:
y_pred = grid_search_best.predict(X_test)
round(accuracy_score(y_test, y_pred), 4)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    1.3s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    3.1s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    5.6s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    7.7s finished


0.763

In [23]:
scores = cross_val_score(grid_search_best, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   12.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   53.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.4min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    2.1s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    5.2s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    9.5s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:   12.2s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.6min remaining:    0.0s


[CV] ................................................. , total= 4.6min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   53.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.6min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    2.5s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    5.8s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:   10.4s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:   13.2s finished


[CV] ................................................. , total= 4.8min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   50.6s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.5min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    2.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    6.2s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:   11.2s


[CV] ................................................. , total= 4.8min
Accuracy: 0.7639 (+/- 0.0033)


[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:   14.2s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 14.2min finished


$76.4\%$ seems more promising than the $75\%$ for BoW, but it is tough to know without learning curves.
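Learning curves of the kind mentioned here could be computed with scikit-learn's `learning_curve`. The following is a minimal sketch under stated assumptions: synthetic data and a small forest stand in for the Tfidf matrix and `grid_search_best`, so the numbers it prints say nothing about the actual models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption), small enough to run in seconds
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
clf = RandomForestClassifier(n_estimators=30, random_state=42)

# Accuracy on training and validation folds at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    clf,
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.25, 1.0, 4),
    cv=3,
    scoring="accuracy",
)

for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print("m=%d  train=%.3f  validation=%.3f" % (size, tr, va))
```

If the validation curve is still rising at the largest training size, more data would likely help; a persistent gap between the two curves would point to overfitting instead.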

---