# Cleanup & Preprocess - Bigrams

---

*Features* 

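- The **DocumentToWordCounterTransformer** class is extended into a **DocumentToBigramCounterTransformer** class that also counts bigrams.
- This implementation keeps unigrams, so each document is represented by counts of single words and of adjacent word pairs (a "bag of up to bigrams").
- TF-IDF weighting is then applied to this bag-of-up-to-bigrams representation.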
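
To make the "bag of up to bigrams" idea concrete, here is a minimal sketch of counting unigrams and bigrams together. The function name `count_unigrams_and_bigrams` and the sample sentence are made up for illustration; the actual `DocumentToBigramCounterTransformer` lives in `cleanup_module` and may tokenize and clean the text differently.

```python
from collections import Counter

def count_unigrams_and_bigrams(tokens):
    """Count single tokens and adjacent token pairs in one Counter (hypothetical helper)."""
    unigrams = Counter(tokens)
    # zip(tokens, tokens[1:]) pairs every token with its successor
    bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    # Bigram keys contain a space, so they cannot collide with unigram keys
    return unigrams + bigrams

tokens = "free entry in a free prize draw".split()
counts = count_unigrams_and_bigrams(tokens)
print(counts["free"])        # count of the unigram "free"
print(counts["free prize"])  # count of the bigram "free prize"
```

Keeping both kinds of terms in one counter means a later vocabulary cut-off can pick whichever unigrams or bigrams are most frequent.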


*Results*

- Using the new **DocumentToBigramCounterTransformer** class with a vocabulary of 500 terms yields the following 10-fold cross-validation results:

| Model | Representation | Mean accuracy | Std. dev. |
|:---|:---|:---|:---|
| Naive Bayes | Bag-of-up-to-bigrams | 0.9767 | +/- 0.0062 |
| Naive Bayes | Bag-of-up-to-bigrams + TF-IDF | 0.9672 | +/- 0.0073 |
| Logistic Regr. | Bag-of-up-to-bigrams | 0.9846 | +/- 0.0062 |
| Logistic Regr. | Bag-of-up-to-bigrams + TF-IDF | 0.9649 | +/- 0.0101 |

### Setup

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd

from datetime import datetime

start_time = time.time()
day = datetime.now().strftime('%Y-%m-%d')  # e.g. 2020-12-07
print('Revised on: ' + day)

Revised on: 2020-12-07


### Load Data

In [2]:
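def load_data(data):
    """Read the first column of ../data/1_raw/<data>.csv into a 1-D NumPy array."""
    raw_path = os.path.join("..", "data", "1_raw")
    filename = f"{data}.csv"
    out_dfm = pd.read_csv(os.path.join(raw_path, filename))
    # to_numpy() already returns a flat 1-D array for a single column
    return out_dfm.iloc[:, 0].to_numpy()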

X_train = load_data("X_train")
y_train = load_data("y_train")

### Cleanup & Preprocess

In [3]:
import urlextract
from nltk.stem import WordNetLemmatizer

with open("contractions_map.json") as f:
    contractions_map = json.load(f)

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()

In [4]:
import cleanup_module as Cmod
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

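# full pipe: documents -> term counters -> count vectors (500 terms) -> TF-IDF
pipe = Pipeline([('counter', Cmod.DocumentToBigramCounterTransformer()),
                 ('bow', Cmod.WordCounterToVectorTransformer(vocabulary_size=500)),
                 ('tfidf', TfidfTransformer())])

# fit the steps one at a time so both intermediate representations stay available
X_counter = pipe['counter'].fit_transform(X_train)
X_bow = pipe['bow'].fit_transform(X_counter)    # raw counts
X_tfidf = pipe['tfidf'].fit_transform(X_bow)    # TF-IDF weighted counts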
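
The last two pipeline steps can be pictured with a small pure-Python sketch: pick the most frequent terms as the vocabulary, turn each counter into a row of counts, then apply TF-IDF. All function names and the toy counters below are hypothetical. How `WordCounterToVectorTransformer` actually handles out-of-vocabulary terms is not shown in this notebook, so the sketch simply drops them. The weighting uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which I believe matches the `TfidfTransformer` defaults.

```python
import math
from collections import Counter

def build_vocabulary(counters, vocabulary_size):
    """Map the most frequent terms across all documents to column indices."""
    totals = Counter()
    for counter in counters:
        totals.update(counter)
    return {term: i for i, (term, _) in enumerate(totals.most_common(vocabulary_size))}

def to_count_rows(counters, vocabulary):
    """One dense row of raw counts per document; unknown terms are dropped (assumption)."""
    rows = []
    for counter in counters:
        row = [0] * len(vocabulary)
        for term, count in counter.items():
            if term in vocabulary:
                row[vocabulary[term]] = count
        rows.append(row)
    return rows

def tfidf(rows):
    """Weight counts by smoothed IDF, then scale each row to unit L2 norm."""
    n_docs, n_terms = len(rows), len(rows[0])
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    result = []
    for row in rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Made-up counters standing in for the output of the 'counter' step
counters = [Counter({"free": 2, "free prize": 1}),
            Counter({"free": 1, "see you": 1}),
            Counter({"see you": 1})]
vocabulary = build_vocabulary(counters, vocabulary_size=2)
count_rows = to_count_rows(counters, vocabulary)
tfidf_rows = tfidf(count_rows)
print(vocabulary)
print(count_rows)
```

With a vocabulary of two terms, the rarest term ("free prize") is cut, which is the same trade-off the 500-term limit makes on the real data.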

### Train and evaluate a couple baseline models

#### Naive Bayes

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

NB_clf = MultinomialNB()

# Bag-of-bigrams
score = cross_val_score(NB_clf, X_bow, y_train, cv=10, verbose=1, scoring='accuracy')
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


Accuracy: 0.9767 (+/- 0.0062)
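
A note on reading the `(+/- ...)` figure: it is the standard deviation of the per-fold accuracies, not a variance or a confidence interval. `np.std` divides by the number of folds by default (the population standard deviation). The sketch below reproduces the same summary with the standard library on made-up fold scores; the numbers are illustrative only and are not results from this notebook.

```python
import statistics

# Made-up per-fold accuracies, for illustration only
fold_scores = [0.97, 0.98, 0.99, 0.96, 0.98]

mean = statistics.fmean(fold_scores)
# pstdev divides by n, the same convention as np.std with its default ddof=0
spread = statistics.pstdev(fold_scores)
print(f'Accuracy: {mean:0.4f} (+/- {spread:0.4f})')
```

The sample standard deviation (`statistics.stdev`, or `np.std(score, ddof=1)`) would give a slightly larger value for the same scores.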


In [6]:
# Bag-of-bigrams + Tfidf
score = cross_val_score(NB_clf, X_tfidf, y_train, cv=10, verbose=1, scoring='accuracy')
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.1s finished


Accuracy: 0.9672 (+/- 0.0073)


#### Logistic Regression

In [7]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(solver="liblinear", random_state=42)

# Bag-of-bigrams
score = cross_val_score(log_clf, X_bow, y_train, cv=10, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    1.5s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.6s finished


Accuracy: 0.9846 (+/- 0.0062)


In [8]:
# Bag-of-bigrams + Tfidf
score = cross_val_score(log_clf, X_tfidf, y_train, cv=10, verbose=1, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished


Accuracy: 0.9649 (+/- 0.0101)
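
One caveat about the scores above: the vocabulary and the IDF weights were fitted on all of `X_train` before cross-validation, so every validation fold contributed to them. A possible way to check whether this matters is to pass the whole pipeline to `cross_val_score`, so that each fold refits the preprocessing on its own training part. The sketch below is not the method used in this notebook: it swaps in scikit-learn's built-in `CountVectorizer` with `ngram_range=(1, 2)` and `max_features=500` for the custom `cleanup_module` transformers, and it runs on a made-up toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus, for illustration only
spam = ["win a free prize now", "free entry win cash", "claim your free prize",
        "cash prize call now", "win cash now free", "free prize claim now"]
ham = ["see you at lunch", "are you home yet", "call me after lunch",
       "see you at home", "lunch at home today", "are you free after lunch"]
docs = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Unigrams + bigrams, capped vocabulary, TF-IDF, then the classifier.
# Inside cross_val_score, every step is refit on each training split.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), max_features=500),
    TfidfTransformer(),
    MultinomialNB(),
)
scores = cross_val_score(model, docs, labels, cv=3, scoring='accuracy')
print(f'Accuracy: {scores.mean():0.4f} (+/- {scores.std():0.4f})')
```

The same pattern should work with the custom transformers, provided they implement `fit` and `transform` in the usual scikit-learn way.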


In [9]:
mins, secs = divmod(time.time() - start_time, 60)
print(f'Time elapsed: {mins:0.0f} m {secs:0.0f} s')

Time elapsed: 0 m 12 s


---