# Twitter Sentiment Analysis - POC
---

## 7. Quickly train various models

**Current State**: I've trained three baseline models on BoW matrices with little tuning. Cross-validation gave the following accuracy scores:

- Logistic Regression: $78.8\%$
- Naive Bayes: $77.6\%$ 
- SGD (log loss): $77.1\%$ 

**Next Steps**: Train Decision Trees.


In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Load $m\approx250k$, $n=50k$ training subset

In [2]:
# time notebook
start_notebook = time.time()

# processed dir
proc_dir = os.path.join("..","data","3_processed","sentiment140")
X_train_transformed = sp.load_npz(os.path.join(proc_dir, "X_train_transformed_BoW_250k_50k.npz"))
with open(os.path.join(proc_dir, "y_array_250k.npy"), 'rb') as f:
    y_array = np.load(f)

In [3]:
# sanity check
X_train_transformed, len(y_array)

(<251468x50001 sparse matrix of type '<class 'numpy.int32'>'
 	with 2569112 stored elements in Compressed Sparse Row format>,
 251468)
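The check above shows the row counts match (251468 in both) and that the matrix is very sparse: 2569112 stored elements over 251468 rows is roughly 10 non-zero terms per tweet. A small sketch of how the same check could be made explicit follows; `check_alignment` and the toy arrays are hypothetical helpers, not part of the project.

```python
import numpy as np
import scipy.sparse as sp

def check_alignment(X, y):
    """Return (n_rows, n_cols, density) after verifying X and y line up."""
    if not sp.issparse(X):
        raise TypeError("expected a SciPy sparse matrix")
    if X.shape[0] != len(y):
        raise ValueError(f"row mismatch: X has {X.shape[0]} rows, y has {len(y)}")
    # Fraction of cells that hold a stored (non-zero) value.
    density = X.nnz / (X.shape[0] * X.shape[1])
    return X.shape[0], X.shape[1], density

# Toy 3 x 4 count matrix with 3 stored values (assumption, for illustration).
X_demo = sp.csr_matrix(
    np.array([[1, 0, 0, 2], [0, 0, 3, 0], [0, 0, 0, 0]], dtype=np.int32)
)
y_demo = np.array([1, 0, 1])

n_rows, n_cols, density = check_alignment(X_demo, y_demo)
print(n_rows, n_cols, density)
```

Failing loudly on a row mismatch is cheaper than discovering it later inside `train_test_split`.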

### Decision Trees


In [4]:
X_train, X_test, y_train, y_test = train_test_split(X_train_transformed, 
                                                    y_array, 
                                                    test_size=0.25, 
                                                    random_state=2)

In [5]:
clf = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)

In [6]:
start_time = time.time()
clf.fit(X_train, y_train)
mins, secs = divmod(time.time() - start_time, 60)
print(f'Time: {mins:0.0f} mins and {secs:0.0f} secs')

Time: 0 mins and 14 secs


In [7]:
y_pred = clf.predict(X_test)

In [8]:
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))

Accuracy: 0.632


In [9]:
# Cross validate with entire 250k rows
scores = cross_val_score(clf, X_train_transformed, y_array, cv=5, verbose=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[CV]  ................................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................................... , score=0.631, total=   4.9s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.9s remaining:    0.0s


[CV] .................................... , score=0.633, total=   7.0s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   11.9s remaining:    0.0s


[CV] .................................... , score=0.632, total=   8.7s
[CV]  ................................................................
[CV] .................................... , score=0.632, total=   8.9s
[CV]  ................................................................
[CV] .................................... , score=0.629, total=   8.8s
Accuracy: 0.63 (+/- 0.00)


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   38.4s finished
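The `(+/- 0.00)` in the summary line is a formatting effect: the fold scores do differ slightly, but two decimals hide it. A quick sketch using the five fold scores from the log (which are themselves rounded to three decimals, so the result is approximate):

```python
import numpy as np

# Fold scores copied from the verbose log (rounded there to 3 decimals).
fold_scores = np.array([0.631, 0.633, 0.632, 0.632, 0.629])

mean = fold_scores.mean()
spread = fold_scores.std() * 2  # population std (ddof=0), as in the cell above

two_dp = "Accuracy: %0.2f (+/- %0.2f)" % (mean, spread)
four_dp = "Accuracy: %0.4f (+/- %0.4f)" % (mean, spread)
print(two_dp)   # Accuracy: 0.63 (+/- 0.00)
print(four_dp)  # Accuracy: 0.6314 (+/- 0.0027)
```

The folds agree to within about a quarter of a percentage point, so the low accuracy at `max_depth=5` is stable rather than an unlucky split.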


### GridSearch too early? Manual sweep over max_depth

In [37]:
start_time = time.time()

depth, runtime, accuracy = [], [], []
for i in range(10, 410, 30):
    clf = DecisionTreeClassifier(criterion="gini", max_depth=i, random_state=42)
    depth.append(i)
    
    start_time = time.time()
    clf.fit(X_train, y_train)
    runtime.append(round(time.time() - start_time, 1))
    
    y_pred = clf.predict(X_test)
    accuracy.append(round(accuracy_score(y_test, y_pred), 4))
    
mins, secs = divmod(time.time() - start_time, 60)
print(f'Time: {mins:0.0f} mins and {secs:0.0f} secs')

Time: 3 mins and 10 secs


In [38]:
df = pd.DataFrame(
    {'depth': depth,
     'runtime': runtime,
     'accuracy': accuracy
    })
df

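|   | depth | runtime (s) | accuracy |
|---|-------|-------------|----------|
| 0 | 10    | 15.7        | 0.6545   |
| 1 | 40    | 37.3        | 0.698    |
| 2 | 70    | 64.0        | 0.7082   |
| 3 | 100   | 80.7        | 0.7129   |
| 4 | 130   | 98.6        | 0.7158   |
| 5 | 160   | 123.4       | 0.7141   |
| 6 | 190   | 131.5       | 0.7146   |
| 7 | 220   | 138.1       | 0.7163   |
| 8 | 250   | 143.0       | 0.7144   |
| 9 | 280   | 145.3       | 0.7154   |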
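For comparison, a sketch of how a depth sweep like this might look with `GridSearchCV` instead of a hand-written loop. The toy data, the depth grid and `cv=3` are assumptions chosen so the example runs in seconds; they are not the settings used above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the BoW matrix (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = sp.csr_matrix(rng.poisson(0.3, size=(300, 30)))
y_demo = (X_demo[:, 0].toarray().ravel() > 0).astype(int)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    param_grid={"max_depth": [2, 4, 8]},  # hypothetical grid
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# cv_results_ holds per-setting fit times and scores, much like the table above.
for d, t, s in zip(
    search.cv_results_["param_max_depth"],
    search.cv_results_["mean_fit_time"],
    search.cv_results_["mean_test_score"],
):
    print(d, round(t, 4), round(s, 4))
print(search.best_params_)
```

The trade-off is cost: with `cv=k` every depth is fitted `k` times, which matters when a single deep tree already takes a couple of minutes.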


In [None]:
# Note: the time printed by In [37] covers only the last loop iteration,
# because start_time is reassigned inside the loop.
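One way to avoid that timing slip is to keep two separate timers, one for the whole sweep and one per fit. A minimal sketch on toy data follows; the names `sweep_start`, `fit_start`, `X_demo` and `y_demo` and the small depth list are assumptions, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring intervals.

```python
import time

import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the BoW matrix (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = sp.csr_matrix(rng.poisson(0.3, size=(400, 30)))
y_demo = (X_demo[:, 0].toarray().ravel() > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)

sweep_start = time.perf_counter()  # started once, never reassigned in the loop

depth, runtime, accuracy = [], [], []
for d in (2, 4, 8):
    clf = DecisionTreeClassifier(criterion="gini", max_depth=d, random_state=42)

    fit_start = time.perf_counter()  # separate timer for this fit only
    clf.fit(X_tr, y_tr)
    runtime.append(time.perf_counter() - fit_start)

    depth.append(d)
    accuracy.append(accuracy_score(y_te, clf.predict(X_te)))

total = time.perf_counter() - sweep_start
mins, secs = divmod(total, 60)
print(f'Time: {mins:0.0f} mins and {secs:0.0f} secs')
```

Here `total` covers every fit and predict call, so it is always at least the sum of the per-fit runtimes.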

In [39]:
# time notebook
mins, secs = divmod(time.time() - start_notebook, 60)
print(f'Total running time: {mins:0.0f} minute(s) and {secs:0.0f} second(s).')

Total running time: 42 minute(s) and 25 second(s).


---