# Cleanup & Preprocess - Bigrams

---

*Features* 

- The **DocumentToWordCounterTransformer** class is extended into a **DocumentToBigramCounterTransformer** class.
- This implementation keeps unigrams alongside the bigrams.
- TF-IDF is applied to this bag-of-(up-to)-bigrams representation.
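
As a rough illustration of the bag-of-(up-to)-bigrams idea, the sketch below counts unigrams and adjacent-token bigrams in a single `Counter`. It is not the `DocumentToBigramCounterTransformer` from `cleanup_module`, whose cleanup steps are not shown here; the function name `count_unigrams_and_bigrams`, the whitespace tokenization, and the underscore used to join bigrams are all assumptions made for illustration.

```python
from collections import Counter


def count_unigrams_and_bigrams(text):
    """Count unigrams and adjacent-token bigrams in one document.

    Hypothetical helper: lowercases and splits on whitespace only.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)  # unigrams are kept
    # Pair each token with its successor to form bigrams.
    counts.update("_".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


counts = count_unigrams_and_bigrams("free entry in a free prize draw")
print(counts["free"])        # 2
print(counts["free_prize"])  # 1
```

Keeping both kinds of term in one counter means a later vocabulary cut-off ranks unigrams and bigrams against each other by frequency.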


*Results*

- Using the new **DocumentToBigramCounterTransformer** class, 10-fold cross-validation on the training set gives the accuracies below (mean and standard deviation across folds, taken from the cell outputs in this notebook).

| Model | Representation | Accuracy | Std. dev. |
|:---|:---|:---|:---|
|Naive Bayes |Bag-of-upto-Bigrams | 0.9787 |(+/- 0.0069)|
|Naive Bayes |BoB + TF-IDF| 0.9731 |(+/- 0.0053)|
|Logistic Regr. |Bag-of-upto-Bigrams | 0.9846 |(+/- 0.0063)|
|Logistic Regr. |BoB + TF-IDF| 0.9695 |(+/- 0.0108)|

### Setup

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd

import urlextract
from datetime import datetime
from nltk import ngrams
from nltk.stem import WordNetLemmatizer

import cleanup_module as Cmod
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.model_selection import train_test_split

start_time = time.time()
dt_object = datetime.fromtimestamp(time.time())
day, T = str(dt_object).split('.')[0].split(' ')
print('Revised on: ' + day)

Revised on: 2020-12-06


### Load Data

In [2]:
with open("contractions_map.json") as f:
    contractions_map = json.load(f)

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()

In [3]:
# load X, y train subsets
raw_path = os.path.join("..","data","1_raw")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

# create arrays
X_array = np.array(X_train.iloc[:, 0]).ravel()
y_array = np.array(y_train.iloc[:,0]).ravel()

In [29]:
# full pipe
pipe = Pipeline([('counter', Cmod.DocumentToBigramCounterTransformer()),
                 ('bow', Cmod.WordCounterToVectorTransformer(vocabulary_size=1000)),
                 ('tfidf', TfidfTransformer())])

### Train and evaluate a couple of quick models

In [30]:
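# run the counter step first (X_counter is not defined in any cell shown here)
X_counter = pipe['counter'].fit_transform(X_array)
X_bow = pipe['bow'].fit_transform(X_counter)
X_bow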

<3900x1001 sparse matrix of type '<class 'numpy.intc'>'
	with 49084 stored elements in Compressed Sparse Row format>
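
The matrix has 1001 columns for `vocabulary_size=1000`. One plausible reading, which is an assumption because `WordCounterToVectorTransformer` is not shown here, is that the vocabulary holds the 1000 most common terms and one extra column (index 0) collects the counts of all out-of-vocabulary terms. A minimal dense sketch of that scheme, using the hypothetical helpers `build_vocabulary` and `to_row`:

```python
from collections import Counter


def build_vocabulary(counters, vocabulary_size):
    """Map the most common terms to column indices 1..vocabulary_size."""
    totals = Counter()
    for counter in counters:
        totals.update(counter)
    most_common = totals.most_common(vocabulary_size)
    return {term: index + 1 for index, (term, _) in enumerate(most_common)}


def to_row(counter, vocabulary, vocabulary_size):
    """Return a dense count row; column 0 holds out-of-vocabulary counts."""
    row = [0] * (vocabulary_size + 1)
    for term, count in counter.items():
        row[vocabulary.get(term, 0)] += count
    return row


docs = [
    Counter({"free": 2, "free_prize": 1}),
    Counter({"free": 1, "free_prize": 1, "hello": 1}),
]
vocabulary = build_vocabulary(docs, vocabulary_size=2)
rows = [to_row(doc, vocabulary, vocabulary_size=2) for doc in docs]
print(len(rows[0]))  # 3 columns: vocabulary_size + 1
```

Under this assumption, rare bigrams are not dropped outright; their counts are pooled into the single catch-all column.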

In [31]:
X_tfidf = pipe['tfidf'].fit_transform(X_bow)
X_tfidf

<3900x1001 sparse matrix of type '<class 'numpy.float64'>'
	with 49084 stored elements in Compressed Sparse Row format>
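
To make the reweighting concrete, the sketch below reproduces in plain Python what `TfidfTransformer` computes with its default settings (`smooth_idf=True`, `norm='l2'`): each count is multiplied by `idf = ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit Euclidean length. The helper name `tfidf_rows` and the toy counts are illustrative only; the real transformer works on sparse matrices.

```python
import math


def tfidf_rows(count_rows):
    """Apply smoothed IDF weighting and L2 row normalization to dense counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return weighted_rows


toy_counts = [[2, 1, 0], [1, 0, 1]]
toy_tfidf = tfidf_rows(toy_counts)
for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

Because every IDF weight is positive, zero counts stay zero, which is consistent with both matrices above reporting the same 49084 stored elements.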

### Naive Bayes

In [32]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

NB_clf = MultinomialNB()

# BoW with bigrams
score = cross_val_score(NB_clf, X_bow, y_array, cv=10, verbose=1, scoring='accuracy')
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


Accuracy: 0.9787 (+/- 0.0069)
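
The figure after `+/-` is `np.std(score)`, i.e. the population standard deviation (`ddof=0`) of the ten fold accuracies. The standard-library sketch below produces the same kind of summary; the fold scores are made-up values for illustration only, not results from this notebook.

```python
from statistics import mean, pstdev

# Hypothetical fold accuracies, for illustration only.
fold_scores = [0.97, 0.98, 0.99, 0.98, 0.97, 0.98, 0.99, 0.98, 0.97, 0.99]

# pstdev divides by n, matching np.std's default ddof=0.
summary = f"Accuracy: {mean(fold_scores):0.4f} (+/- {pstdev(fold_scores):0.4f})"
print(summary)
```

With only ten folds, the spread is a rough indicator of stability rather than a confidence interval.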


In [33]:
# Tfidf with bigrams
score = cross_val_score(NB_clf, X_tfidf, y_array, cv=10, verbose=1, scoring='accuracy')
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.9731 (+/- 0.0053)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


### Logistic Regression

In [34]:
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(solver="liblinear", random_state=42)

# BoW with bigrams
score = cross_val_score(log_clf, X_bow, y_array, cv=10, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.9846 (+/- 0.0063)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished


In [35]:
# Tfidf with bigrams
score = cross_val_score(log_clf, X_tfidf, y_array, cv=10, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.9695 (+/- 0.0108)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished


---