# Twitter Sentiment Analysis - POC
---

## 4. Cleaning Pipeline - streamlined BoW

This notebook refines the previous notebook's cleaning pipeline by moving it into a custom module and testing it on a larger dataset. The simple baseline Logistic Regression (LR) and Naive Bayes (NB) models are tested again, and accuracy improves with more data: from approx. 66% for LR and 70% for NB with approx. 120 observations to approx. 75% for both models with approx. 12,000 observations.

**Next Steps**

Since a gain of only 5 to 9 percentage points in accuracy seems low for a dataset that is 100x larger, I suspect these simple models are underfitting, so adding more data would not improve accuracy much. Formal modeling with learning curves and more testing would confirm this. Alternatively, we could improve the quality of the data representation by adding N-grams, using TF-IDF vectorization, and projecting into a semantic space with SVD, and we could engineer features to further improve model accuracy. I will save all these steps for later, after testing a few more models with this basic Bag of Words (BoW) representation and tweaking the vocabulary size, among other tests.

- Note: the BoW representation is explained in more detail in this [Document Term Matrices notebook](10.extra_Document_Term_Matrices.ipynb).
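
The learning-curve check mentioned under Next Steps could look roughly like the sketch below. It is not part of this notebook's results: the data is a synthetic stand-in generated with `make_classification` (an assumption made only so the block runs on its own), and the names `X_demo` and `y_demo` are hypothetical. With the real data, `X_train_transformed` and `y_array` would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption): swap in X_train_transformed, y_array.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=50, n_informative=10, random_state=42
)

# Cross-validated accuracy at 5 increasing training set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(solver="liblinear", random_state=42),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the CV folds (axis 1) for each training set size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"n={int(n):5d}  train={tr:0.3f}  validation={va:0.3f}")
```

If the training and validation accuracies converge to a similar plateau as the training size grows, that would support the underfitting suspicion. A persistent gap between them would instead suggest that more data could still help.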


## POC Only - Sample Data

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd

import cleanup_module_POC as Cmod
from sklearn.model_selection import cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# time notebook
start_notebook = time.time()

# load minimally prepared X, y train subsets
raw_path = os.path.join("..","data","1_raw","sentiment140")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

In [2]:
# sample down to 1% of the training data (X, y); the other 99% goes to X_rest, y_rest
from sklearn.model_selection import train_test_split

X, X_rest, y, y_rest = train_test_split(X_train, y_train, test_size=0.99, random_state=158)

From here on, the `_rest` datasets are ignored and the small X, y subsets are treated as if they were the entire training data.

In [3]:
print(f'Dataset size: {len(X):0.0f}')
print(f'Target distribution: {sum(y["target"])/len(y):0.3f}')

Dataset size: 11974
Target distribution: 0.503


In [4]:
# keep indices
X.insert(3, 'index', X.index)
X.index = range(len(X))

In [5]:
# transform into 1-D NumPy arrays (Series.ravel() is deprecated in recent pandas)
X_array = X.iloc[:, 2].to_numpy()
y_array = y.iloc[:, 0].to_numpy()

In [6]:
preprocess_pipeline = Pipeline([
    ("document_to_wordcount", Cmod.DocumentToWordCounterTransformer()),
    ("wordcount_to_vector", Cmod.WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_array)

In [7]:
X_train_transformed

<11974x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 105899 stored elements in Compressed Sparse Row format>
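
To make the shape above more concrete, here is a minimal pure-Python sketch of a Bag of Words encoding. It is not the code in `cleanup_module_POC`: the helper names `build_vocabulary` and `to_bow_vector` are hypothetical, tokenization is a plain lowercase split, and reserving column 0 for out-of-vocabulary words is an assumption (it would be consistent with 1001 columns for a 1000-word vocabulary, but the module may work differently).

```python
from collections import Counter


def build_vocabulary(docs, vocabulary_size):
    """Map the most frequent tokens to column indices 1..vocabulary_size."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.lower().split())
    most_common = totals.most_common(vocabulary_size)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}


def to_bow_vector(doc, vocabulary):
    """Count tokens per column; column 0 collects out-of-vocabulary tokens."""
    vector = [0] * (len(vocabulary) + 1)
    for word, count in Counter(doc.lower().split()).items():
        vector[vocabulary.get(word, 0)] += count
    return vector


# Toy corpus (hypothetical), with a 2-word vocabulary.
docs = ["good day good mood", "bad day", "good coffee"]
vocabulary = build_vocabulary(docs, vocabulary_size=2)
matrix = [to_bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(matrix)
```

Each row has `vocabulary_size + 1` entries, and most entries are zero once the vocabulary is large, which is why a sparse matrix format is the natural storage choice.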

In [8]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_array, cv=10, verbose=3, scoring='accuracy')
print('Mean accuracy: ' + str(score.mean()))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.759, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.745, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.742, total=   0.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV] .................................... , score=0.725, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.734, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.754, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.725, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.771, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.750, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.759, total=   0.1s
Mean accuracy: 0.7464518976908046


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.1s finished
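
The mean alone hides how much the folds disagree. As a small worked check, the sketch below recomputes the mean and the sample standard deviation from the ten LR fold scores printed above. The values are copied by hand from that output and rounded to three decimals, so the mean only approximately matches the printed one.

```python
import statistics

# Fold accuracies copied from the verbose LR output above (3 decimals).
lr_fold_scores = [0.759, 0.745, 0.742, 0.725, 0.734,
                  0.754, 0.725, 0.771, 0.750, 0.759]

mean_acc = statistics.mean(lr_fold_scores)
std_acc = statistics.stdev(lr_fold_scores)  # sample standard deviation
print(f"Mean accuracy: {mean_acc:0.3f} +/- {std_acc:0.3f}")
```

The spread comes out at about 1.5 percentage points, so differences between models smaller than that should be read with caution.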


In [9]:
NB_clf = MultinomialNB()
score = cross_val_score(NB_clf, X_train_transformed, y_array, cv=10, verbose=3, scoring='accuracy')
print('Mean accuracy: ' + str(score.mean()))

[CV]  ................................................................
[CV] .................................... , score=0.755, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.740, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.751, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.720, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.736, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.749, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.739, total=   0.0s
[CV]  

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished


In [10]:
# time notebook
mins, secs = divmod(time.time() - start_notebook, 60)
print(f'Total running time: {mins:0.0f} minute(s) and {secs:0.0f} second(s).')

Total running time: 0 minute(s) and 17 second(s).


---