# Twitter Sentiment Analysis - POC
---

## 7. Quickly train various models

**Current State**: I've trained three baseline models with little tweaking and fine-tuned a single decision tree on a Bag-of-Words subset of the training data ($m\approx250k$ instances, $n=50k$ features). Using cross-validation, I got the following mean accuracy scores:

- Logistic Regression: $78.8\%$
- Naive Bayes: $77.6\%$ 
- SGD (log loss): $77.1\%$ 
- Decision Tree: $69.0\%$

**This Notebook**: Build my own random forest based on the previous decision tree's best parameters.

In [31]:
import re
import os
import time
import json

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# start the notebook timer (read by the final timing cell)
start_notebook = time.time()

### Load $m\approx250k$, $n=50k$ training subset

In [32]:
# processed dir
proc_dir = os.path.join("..","data","3_processed","sentiment140")
X_train_transformed = sp.load_npz(os.path.join(proc_dir, "X_train_transformed_BoW_250k_50k.npz"))
with open(os.path.join(proc_dir, "y_array_250k.npy"), 'rb') as f:
    y_array = np.load(f)

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X_train_transformed, 
                                                    y_array, 
                                                    test_size=0.2, 
                                                    random_state=42)

### DIY Random Forest

In the [POC for growing your own random forest](10.extra_GrowingRandomForests) (yes, a "POC of the POC", I'm sure agile covers this...) we used 1,000 trees of 100 instances each on a training set of 8,000 instances. To mimic these proportions with the roughly 200,000 instances in our training data, I'd have to scale both numbers by 25 and train approximately 25,000 trees with 2,500 instances each. That might take too long, so I'll start with the same 1,000 trees of 100 instances each and see where I get.
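The scaling arithmetic can be checked in a few lines. This is only a sketch of the proportions quoted above; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Proportions used in the earlier random-forest POC (figures quoted in the text)
poc_instances = 8_000     # training instances in the POC
poc_trees = 1_000         # trees grown in the POC
poc_subset_size = 100     # instances per tree in the POC

# Approximate size of the training split in this notebook
our_instances = 200_000

# Scale the number of trees and the subset size by the same factor
scale = our_instances / poc_instances
scaled_trees = int(poc_trees * scale)
scaled_subset_size = int(poc_subset_size * scale)

print(f"scale factor: {scale:.0f}")
print(f"trees: {scaled_trees:,} | instances per tree: {scaled_subset_size:,}")
```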

In [34]:
from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100
subsets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=X_train.shape[0] - n_instances, random_state=42)
for train_sub_ix, test_sub_ix in rs.split(X_train):
    X_sub_train = X_train[train_sub_ix]
    y_sub_train = y_train[train_sub_ix]
    subsets.append((X_sub_train, y_sub_train))

With this loop we get `subsets`: a 1,000-long list of tuples, each pairing a 100 by 50k sparse matrix of features with a 100-long NumPy array of target values.
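To see why each subset has exactly `n_instances` rows, here is a minimal sketch of the same `ShuffleSplit` trick on a small synthetic sparse matrix. The toy sizes and the `toy_` names are assumptions made for illustration. Setting `test_size` to "everything except `n_instances`" leaves the train side of each split with exactly `n_instances` rows, drawn without replacement.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import ShuffleSplit

# Toy stand-ins for X_train / y_train (hypothetical sizes)
toy_X = sp.random(50, 8, density=0.3, format="csr", random_state=0)
toy_y = np.arange(50) % 2

toy_n_trees = 5
toy_n_instances = 10

# Everything except toy_n_instances rows goes to the unused "test" side
rs = ShuffleSplit(n_splits=toy_n_trees,
                  test_size=toy_X.shape[0] - toy_n_instances,
                  random_state=42)

toy_subsets = []
unique_counts = []
for train_ix, _ in rs.split(toy_X):
    toy_subsets.append((toy_X[train_ix], toy_y[train_ix]))
    unique_counts.append(len(set(train_ix)))  # no duplicates within a split

for X_sub, y_sub in toy_subsets:
    print(X_sub.shape, y_sub.shape)
```

Note that this differs from classic bagging, which samples with replacement; within one split no row is repeated.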

We start our forest by cloning our best estimator from the previous notebook 1,000 times.

In [35]:
from sklearn.base import clone

best_estimator_ = DecisionTreeClassifier(random_state=42, 
                                         max_leaf_nodes=99)

forest = [clone(best_estimator_) for _ in range(n_trees)]

Then we train each tree in our forest, make predictions on the test set and get the accuracy for each of these predictions.

In [36]:
import time 

accuracy_scores = []
start_loop = time.time()

for ix, (tree, (X_sub_train, y_sub_train)) in enumerate(zip(forest, subsets)):
    if ix % 100 == 0:
        mins, secs = divmod(time.time() - start_loop, 60)
        print(''.join(['Fitting estimator ', str(ix+1), ' | ', \
                       f'Time elapsed: {mins:0.0f} mins and {secs:0.0f} secs']))
        
    tree.fit(X_sub_train, y_sub_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

Fitting estimator 1 | Time elapsed: 0 mins and 0 secs
Fitting estimator 101 | Time elapsed: 0 mins and 16 secs
Fitting estimator 201 | Time elapsed: 0 mins and 30 secs
Fitting estimator 301 | Time elapsed: 0 mins and 46 secs
Fitting estimator 401 | Time elapsed: 1 mins and 2 secs
Fitting estimator 501 | Time elapsed: 1 mins and 17 secs
Fitting estimator 601 | Time elapsed: 1 mins and 33 secs
Fitting estimator 701 | Time elapsed: 1 mins and 49 secs
Fitting estimator 801 | Time elapsed: 2 mins and 4 secs
Fitting estimator 901 | Time elapsed: 2 mins and 20 secs


In [37]:
round(np.mean(accuracy_scores), 4)

0.5629

This is the **"magic"** step:

- for each test set instance, generate the predictions of all the trees in the forest (1,000 at this point)
- keep only the most frequent prediction (the *mode*)

This procedure gives you the majority-vote predictions over the test set.
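A toy version of the majority vote may make the two steps concrete. This sketch uses a hand-written prediction matrix for five hypothetical trees and four test instances, and assumes the labels are small non-negative integers (which `np.bincount` requires).

```python
import numpy as np

# Toy prediction matrix: 5 hypothetical trees (rows) x 4 test instances (columns)
Y_pred_toy = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=np.uint8)

def majority_vote(column):
    """Return the most frequent label in a 1-D array of non-negative ints."""
    return np.bincount(column).argmax()

# Apply the vote down each column, i.e. once per test instance
votes = np.apply_along_axis(majority_vote, 0, Y_pred_toy)
print(votes)  # [1 0 1 0]
```

On a tie, `argmax` returns the smallest label, which matches how `scipy.stats.mode` breaks ties.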

In [38]:
X_test.shape[0]

50294

In [39]:
Y_pred = np.empty([n_trees, X_test.shape[0]], dtype=np.uint8) # a 1,000 x 50,294 matrix

In [40]:
# generate predictions for each classifier
for tree_ix, tree in enumerate(forest):              
    Y_pred[tree_ix] = tree.predict(X_test)

In [41]:
from scipy.stats import mode

# compute mode for each y_pred
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [42]:
# get accuracy of preds on test target
round(accuracy_score(y_test, y_pred_majority_votes.reshape([-1])), 4)

0.6608

That shows some potential, so let's scale it up to 10,000 trees of 2,000 instances each.

In [43]:
n_trees = 10000
n_instances = 2000
subsets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=X_train.shape[0] - n_instances, random_state=42)
for train_sub_ix, test_sub_ix in rs.split(X_train):
    X_sub_train = X_train[train_sub_ix]
    y_sub_train = y_train[train_sub_ix]
    subsets.append((X_sub_train, y_sub_train))

In [44]:
best_estimator_ = DecisionTreeClassifier(random_state=42, max_leaf_nodes=99, max_features=11000)
forest = [clone(best_estimator_) for _ in range(n_trees)]

In [45]:
accuracy_scores = []
start_loop = time.time()

for ix, (tree, (X_sub_train, y_sub_train)) in enumerate(zip(forest, subsets)):
    tree.fit(X_sub_train, y_sub_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    
    if ix % 500 == 0:
        mins, secs = divmod(time.time() - start_loop, 60)
        print(''.join(['Fitting estimator ', str(ix+1), ' | ', \
                       f'Time elapsed: {mins:0.0f} mins and {secs:0.0f} secs']))
        
    if ix == n_trees-1:
        mins, secs = divmod(time.time() - start_loop, 60)
        print(''.join(['Fitting completed. | ', \
                       f'Time elapsed: {mins:0.0f} mins and {secs:0.0f} secs']))

Fitting estimator 1 | Time elapsed: 0 mins and 0 secs
Fitting estimator 501 | Time elapsed: 1 mins and 39 secs
Fitting estimator 1001 | Time elapsed: 3 mins and 18 secs
Fitting estimator 1501 | Time elapsed: 4 mins and 56 secs
Fitting estimator 2001 | Time elapsed: 6 mins and 34 secs
Fitting estimator 2501 | Time elapsed: 8 mins and 13 secs
Fitting estimator 3001 | Time elapsed: 9 mins and 52 secs
Fitting estimator 3501 | Time elapsed: 11 mins and 30 secs
Fitting estimator 4001 | Time elapsed: 13 mins and 9 secs
Fitting estimator 4501 | Time elapsed: 14 mins and 47 secs
Fitting estimator 5001 | Time elapsed: 16 mins and 27 secs
Fitting estimator 5501 | Time elapsed: 18 mins and 5 secs
Fitting estimator 6001 | Time elapsed: 19 mins and 44 secs
Fitting estimator 6501 | Time elapsed: 21 mins and 23 secs
Fitting estimator 7001 | Time elapsed: 23 mins and 1 secs
Fitting estimator 7501 | Time elapsed: 24 mins and 50 secs
Fitting estimator 8001 | Time elapsed: 26 mins and 35 secs
Fitting esti

In [46]:
round(np.mean(accuracy_scores), 4)

0.6375

The same **"magic"** step, now over 10,000 trees:

In [27]:
Y_pred = np.empty([n_trees, X_test.shape[0]], dtype=np.uint8)

In [28]:
# generate predictions for each classifier
for ix, tree in enumerate(forest):         
    Y_pred[ix] = tree.predict(X_test)

In [29]:
# compute mode for each y_pred
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [30]:
# get accuracy of preds on test target
round(accuracy_score(y_test, y_pred_majority_votes.reshape([-1])), 4)

0.685
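One way to read the gap between the mean single-tree accuracy and the majority-vote accuracy is to compare it with an idealized benchmark. The sketch below assumes every tree is right with the same probability (the mean accuracies reported above) and that trees err independently. Neither assumption holds for this forest, so the numbers are an upper-bound style reference, not a prediction.

```python
import math

def majority_correct_probability(n_trees, p):
    """P(a strict majority of n_trees independent voters is right), each right with prob. p."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(n_trees // 2 + 1, n_trees + 1):
        # log of the binomial pmf, computed in log space to avoid overflow
        log_pmf = (math.lgamma(n_trees + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_trees - k + 1)
                   + k * log_p + (n_trees - k) * log_q)
        total += math.exp(log_pmf)
    return total

# Mean single-tree accuracies taken from the two experiments above
small_forest = majority_correct_probability(1_000, 0.5629)
large_forest = majority_correct_probability(10_000, 0.6375)

print(f"1,000 independent trees:  {small_forest:.5f}")
print(f"10,000 independent trees: {large_forest:.5f}")
```

Under these assumptions both probabilities come out above 0.999, far higher than the observed 0.6608 and 0.685. A plausible reading is that the trees make strongly correlated errors, so adding more of them helps much less than independence would suggest.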

### Scikit-Learn's RandomForestClassifier class

[(source)](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/ensemble/_forest.py#L883)

```
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
```

`max_features` defaults to `"auto"`, which for this classifier means `max_features=sqrt(n_features)`; in our case $\sqrt{50000}\approx224$.
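The square-root rule is easy to reproduce. In this sketch the truncation to an integer is my understanding of how the library resolves the value, so treat it as an assumption; the comparison figure of 11,000 is the `max_features` used for the DIY forest above.

```python
import math

n_features = 50_000        # vocabulary size of the Bag-of-Words matrix
diy_max_features = 11_000  # value passed to the DIY forest's trees

# Square-root heuristic for the number of features considered at each split
sqrt_features = math.sqrt(n_features)

# Assumption: the value is truncated to an integer, with a floor of 1
resolved = max(1, int(sqrt_features))

print(f"sqrt(n_features) = {sqrt_features:.1f}, resolved to {resolved}")
print(f"share of features per split: {resolved / n_features:.2%} "
      f"(DIY forest: {diy_max_features / n_features:.0%})")
```

So each split in the default forest looks at well under 1% of the vocabulary, whereas the DIY trees considered 22% of it.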

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42, verbose=3)
clf = clf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


building tree 1 of 500building tree 2 of 500
building tree 3 of 500

building tree 4 of 500
building tree 5 of 500
building tree 6 of 500
building tree 7 of 500
building tree 8 of 500


In [10]:
scores = cross_val_score(clf, X_train_transformed, y_array, cv=3, verbose=2, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   49.5s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   56.6s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    1.6s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   59.2s remaining:    0.0s


[CV] ................................................. , total=  59.2s
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   23.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   55.4s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  1.1min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    1.7s finished


[CV] ................................................. , total= 1.1min
[CV]  ................................................................


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   23.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   55.8s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  1.1min finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    1.5s


[CV] ................................................. , total= 1.1min
Accuracy: 0.75 (+/- 0.00)


[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    1.7s finished
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.2min finished


### GridSearch 1

* `n_estimators=100`, vary `max_depth`

In [11]:
start_gridsearch = time.time()

depth, runtime, accuracy = [], [], []
for i in range(2, 10):
    depth.append(i)
    start_clf = time.time()
    clf = RandomForestClassifier(n_estimators=100, 
                                 max_depth=i, 
                                 n_jobs=6, 
                                 random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy.append(round(accuracy_score(y_test, y_pred), 4))
    runtime.append(round(time.time() - start_clf, 1))
    
mins, secs = divmod(time.time() - start_gridsearch, 60)
print(f'Gridsearch total time: {mins:0.0f} mins and {secs:0.0f} secs')

Gridsearch total time: 1 mins and 10 secs


In [16]:
df1 = pd.DataFrame(
    {'depth': depth,
     'runtime': runtime,
     'accuracy': accuracy
    })
df1

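| | depth | runtime (s) | accuracy |
|---|---|---|---|
| 0 | 2 | 4.3 | 0.6512 |
| 1 | 3 | 5.2 | 0.6672 |
| 2 | 4 | 6.2 | 0.6883 |
| 3 | 5 | 8.2 | 0.7035 |
| 4 | 6 | 9.4 | 0.707 |
| 5 | 7 | 10.8 | 0.725 |
| 6 | 8 | 12.1 | 0.725 |
| 7 | 9 | 13.6 | 0.7273 |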


In [17]:
start_gridsearch = time.time()

depth, runtime, accuracy = [], [], []
for i in range(10, 100, 10):
    depth.append(i)
    start_clf = time.time()
    clf = RandomForestClassifier(n_estimators=100, 
                                 max_depth=i, 
                                 n_jobs=6, 
                                 random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy.append(round(accuracy_score(y_test, y_pred), 4))
    runtime.append(round(time.time() - start_clf, 1))
    
mins, secs = divmod(time.time() - start_gridsearch, 60)
print(f'Gridsearch total time: {mins:0.0f} mins and {secs:0.0f} secs')

Gridsearch total time: 15 mins and 25 secs


In [18]:
df2 = pd.DataFrame(
    {'depth': depth,
     'runtime': runtime,
     'accuracy': accuracy
    })
df2

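| | depth | runtime (s) | accuracy |
|---|---|---|---|
| 0 | 10 | 13.4 | 0.7309 |
| 1 | 20 | 27.9 | 0.7355 |
| 2 | 30 | 43.4 | 0.7442 |
| 3 | 40 | 63.0 | 0.7494 |
| 4 | 50 | 83.6 | 0.7531 |
| 5 | 60 | 111.5 | 0.7586 |
| 6 | 70 | 158.9 | 0.7607 |
| 7 | 80 | 193.1 | 0.7643 |
| 8 | 90 | 230.1 | 0.7671 |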


In [19]:
start_gridsearch = time.time()

depth, runtime, accuracy = [], [], []
for i in range(100, 1100, 100):
    depth.append(i)
    start_clf = time.time()
    clf = RandomForestClassifier(n_estimators=100, 
                                 max_depth=i, 
                                 n_jobs=6, 
                                 random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy.append(round(accuracy_score(y_test, y_pred), 4))
    runtime.append(round(time.time() - start_clf, 1))
    
mins, secs = divmod(time.time() - start_gridsearch, 60)
print(f'Gridsearch total time: {mins:0.0f} mins and {secs:0.0f} secs')

Gridsearch total time: 98 mins and 58 secs


In [20]:
df3 = pd.DataFrame(
    {'depth': depth,
     'runtime': runtime,
     'accuracy': accuracy
    })
df3

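| | depth | runtime (s) | accuracy |
|---|---|---|---|
| 0 | 100 | 216.3 | 0.7695 |
| 1 | 200 | 531.9 | 0.7777 |
| 2 | 300 | 867.3 | 0.7795 |
| 3 | 400 | 588.4 | 0.7793 |
| 4 | 500 | 598.6 | 0.7806 |
| 5 | 600 | 584.9 | 0.7808 |
| 6 | 700 | 647.2 | 0.7781 |
| 7 | 800 | 637.5 | 0.7799 |
| 8 | 900 | 604.2 | 0.7794 |
| 9 | 1000 | 661.8 | 0.7789 |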


In [25]:
rf_gridsearch = pd.concat([df1,df2,df3])

In [26]:
# save gridsearch results
model_tuning_dir = os.path.join("..","data","4_models","sentiment140","tuning")

# create the tuning directory (and any missing parents) if it doesn't exist
os.makedirs(model_tuning_dir, exist_ok=True)
    
filepath = os.path.join(model_tuning_dir, "POC7_rf_gridsearch1.csv")
rf_gridsearch.to_csv(filepath, index=False)
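To summarize where the accuracy stops improving, the sketch below re-enters the three result tables by hand and looks for the best depth and for the shallowest depth that comes close to it. The 0.005 tolerance is a hypothetical threshold chosen for illustration, not something used in the notebook.

```python
# (max_depth, runtime in seconds, accuracy) copied from the three result tables
results = [
    (2, 4.3, 0.6512), (3, 5.2, 0.6672), (4, 6.2, 0.6883), (5, 8.2, 0.7035),
    (6, 9.4, 0.7070), (7, 10.8, 0.7250), (8, 12.1, 0.7250), (9, 13.6, 0.7273),
    (10, 13.4, 0.7309), (20, 27.9, 0.7355), (30, 43.4, 0.7442),
    (40, 63.0, 0.7494), (50, 83.6, 0.7531), (60, 111.5, 0.7586),
    (70, 158.9, 0.7607), (80, 193.1, 0.7643), (90, 230.1, 0.7671),
    (100, 216.3, 0.7695), (200, 531.9, 0.7777), (300, 867.3, 0.7795),
    (400, 588.4, 0.7793), (500, 598.6, 0.7806), (600, 584.9, 0.7808),
    (700, 647.2, 0.7781), (800, 637.5, 0.7799), (900, 604.2, 0.7794),
    (1000, 661.8, 0.7789),
]

# Depth with the highest test accuracy
best_depth, best_runtime, best_accuracy = max(results, key=lambda r: r[2])

# Shallowest depth whose accuracy is within the tolerance of the best
tolerance = 0.005  # hypothetical threshold
shallowest = min(d for d, _, acc in results if best_accuracy - acc <= tolerance)

print(f"best: max_depth={best_depth} with accuracy {best_accuracy}")
print(f"shallowest depth within {tolerance} of the best: {shallowest}")
```

With these figures the accuracy is essentially flat from `max_depth=200` onward, and the differences among the deeper settings are small enough that they may be noise from a single train/test split.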

### GridSearch 2

* `max_depth=500`, vary `n_estimators`

In [None]:
start_gridsearch = time.time()

ntrees, runtime, accuracy = [], [], []
for i in range(100, 500, 100):
    ntrees.append(i)
    start_clf = time.time()
    clf = RandomForestClassifier(n_estimators=i,
                                 max_depth=500, 
                                 n_jobs=6, 
                                 random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy.append(round(accuracy_score(y_test, y_pred), 4))
    runtime.append(round(time.time() - start_clf, 1))
    
mins, secs = divmod(time.time() - start_gridsearch, 60)

In [34]:
print(f'Gridsearch total time: {mins:0.0f} mins and {secs:0.0f} secs')

Gridsearch total time: 165 mins and 11 secs


In [39]:
df4 = pd.DataFrame(
    {'ntrees': ntrees,
     'runtime': runtime,
     'accuracy': accuracy
    })
df4

| | ntrees | runtime (s) | accuracy |
|---|---|---|---|
| 0 | 100 | 543.0 | 0.7806 |
| 1 | 200 | 1162.2 | 0.7826 |
| 2 | 300 | 1770.1 | 0.7843 |
| 3 | 400 | 2259.7 | 0.7843 |
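The trade-off in this table can be made explicit by computing the accuracy gained and the time spent for each extra 100 trees. This is a small sketch over the numbers above, re-entered by hand.

```python
# (n_estimators, runtime in seconds, accuracy) copied from the table above
results = [
    (100, 543.0, 0.7806),
    (200, 1162.2, 0.7826),
    (300, 1770.1, 0.7843),
    (400, 2259.7, 0.7843),
]

# Accuracy gain and extra runtime for each step of 100 additional trees
gains = []
for (n0, t0, a0), (n1, t1, a1) in zip(results, results[1:]):
    gain = a1 - a0
    gains.append(gain)
    print(f"{n0} -> {n1} trees: {gain:+.4f} accuracy for {t1 - t0:+.0f} s")

# Fitting cost per tree stays roughly constant
seconds_per_tree = [t / n for n, t, _ in results]
print([round(s, 2) for s in seconds_per_tree])
```

Runtime grows roughly linearly at between 5 and 6 seconds per tree, while the accuracy gain shrinks to zero between 300 and 400 trees.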


In [40]:
# time notebook
mins, secs = divmod(time.time() - start_notebook, 60)
print(f'Total running time: {mins:0.0f} minute(s) and {secs:0.0f} second(s).')

Total running time: 378 minute(s) and 0 second(s).


---