## 0. System Configuration

- **RAM:** *30GB*
- **CPU:** *12 Core - 24 thread Thread Reaper, Intel Xeon*
- **GPU:** *Nvidia 1080 Ti - 11GB vRAM* + *Nvidia Titan XP - 12GB vRAM* 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
%config InlineBackend.figure_format = 'retina'

In [2]:
!ls

Bag Of Words.ipynb  data  naive-babes-sklearn.ipynb  Testing.ipynb


## 1. Loading the dataset

In [3]:
with sqlite3.connect('./data/reviewsV1.db') as conn:
    data = pd.read_sql_query('SELECT * FROM Review', conn)

In [4]:
data.drop('index', inplace=True, axis=1)

## 2. Legacy Data Cleaning: dropping a row that causes problems in Word2Vec modelling

In [5]:
data = data[data.index != 258456]

## 3. Time Based Splitting

In [6]:
data.sort_values(by='Time', inplace=True)
data.reset_index(drop=True, inplace=True)
TRAIN_SIZE = int(data.shape[0] * 0.7)
TEST_SIZE = data.shape[0] - TRAIN_SIZE

In [7]:
TRAIN_SIZE

254882

In [8]:
TEST_SIZE

109236

In [9]:
data_train = data[0: TRAIN_SIZE]
data_test = data[TRAIN_SIZE:]

#### 3.1 Check if the Splitting was performed properly

In [10]:
assert(data_train.shape[0] == TRAIN_SIZE)
assert(data_test.shape[0] == TEST_SIZE)
assert(data.Time.max() == data_test.Time.reset_index(drop=True)[TEST_SIZE -1])
assert(data.Time.min() == data_train.Time.reset_index(drop=True)[0])
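
The split above keeps the oldest 70% of reviews for training and the newest 30% for testing, so the model is never evaluated on reviews older than its training data. A minimal sketch of the same idea on toy data (the `reviews` records and the `time_based_split` helper are hypothetical, and plain Python lists stand in for the DataFrame):

```python
def time_based_split(records, train_fraction=0.7, time_key="Time"):
    """Sort records by timestamp and cut them into an older and a newer part."""
    ordered = sorted(records, key=lambda r: r[time_key])
    train_size = int(len(ordered) * train_fraction)
    return ordered[:train_size], ordered[train_size:]


# Made-up reviews with integer timestamps, deliberately out of order.
reviews = [{"Time": t, "Text": f"review {t}"} for t in [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]]
train, test = time_based_split(reviews)

# Every training review is at least as old as every test review.
assert max(r["Time"] for r in train) <= min(r["Time"] for r in test)
print(len(train), len(test))
```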

## 4. Training BagOfWords Model on data_train

#### 4.1 Creating a BoW on train data

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
cunt = CountVectorizer(binary=True, max_features=5000)  # For Bernoulli Naive Bayes
cunt.fit(data_train.Text)

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [13]:
Dtrain = cunt.transform(data_train.Text)

Dtrain.get_shape()

(254882, 5000)

In [14]:
Dtrain = Dtrain.toarray()  # Almost 10.2 GB of RAM usage

#### 4.2 Creating a BoW on test data

In [15]:
Dtest = cunt.transform(data_test.Text)
Dtest.get_shape()

(109236, 5000)

In [16]:
109236 * 5000 * 8

4369440000

In [17]:
Dtest = Dtest.toarray()  # Almost 4.4 GB of RAM usage
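
The `toarray()` calls dominate memory use: a dense `int64` matrix needs 8 bytes per cell, no matter how many cells are zero. A small sketch of that arithmetic for the two matrices in this notebook (the helper name `dense_bytes` is made up for illustration):

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed by a dense n_rows x n_cols array (int64 means 8 bytes per cell)."""
    return n_rows * n_cols * itemsize


train_bytes = dense_bytes(254882, 5000)
test_bytes = dense_bytes(109236, 5000)
print(f"train: {train_bytes / 1e9:.2f} GB, test: {test_bytes / 1e9:.2f} GB")
```

scikit-learn's naive Bayes estimators also accept the sparse matrix returned by `transform`, so the dense conversion may be avoidable if memory becomes a problem.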

In [18]:
assert(any(Dtrain[0] == 1) == True)
assert(any(Dtest[0] == 1) == True)

## 5. K-Fold Cross Validation

In [19]:
from sklearn.model_selection import cross_validate

*Scoring parameters*

**http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter**

```python
['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'mutual_info_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'v_measure_score']
```

We are going to optimize the $\alpha$ parameter of the Laplace smoothing using the F1 score. In addition, we are going to report the following metrics in each iteration.

- Accuracy
- F1 score
- Recall
- Precision
- Confusion Matrix
- Training F1 score
- Validation F1 score
- Training Recall
- Validation Recall
- Training Precision
- Validation Precision
- Average Fitting Time
- Average Prediction time
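
For reference, the $\alpha$ being tuned is the additive (Laplace/Lidstone) smoothing term in the multinomial naive Bayes likelihood, $P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha V}$, where $N_{wc}$ is the count of word $w$ in class $c$, $N_c$ is the total word count in class $c$, and $V$ is the vocabulary size. A minimal sketch with made-up counts (the function name and the numbers are illustrative, not taken from the dataset):

```python
def smoothed_likelihoods(word_counts, alpha):
    """Return P(word | class) for each word using additive smoothing."""
    total = sum(word_counts)
    vocab_size = len(word_counts)
    return [(c + alpha) / (total + alpha * vocab_size) for c in word_counts]


counts = [3, 0, 1]  # hypothetical counts of three words within one class

for alpha in (0.001, 1, 1000):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, [round(p, 4) for p in probs])
```

A small alpha leaves the unseen word with a probability close to zero, while a very large alpha pushes every word toward the uniform value $1/V$, which washes out the evidence in the counts.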

#### 5.1 KFold Cross Validation

In [20]:
NUM_FOLDS = 10
scorings = ['accuracy', 'f1', 'recall', 'precision']

In [55]:
from prettytable import PrettyTable
from IPython.display import display, HTML

In [57]:
def BC_Kfold_cross_validation(
    clf,
    features,
    labels,
    scorings,
    cv,
    test_size,
    shuffle,
    **kwargs):
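    """
    Perform cross validation for a binary classifier and append the
    mean scores as one row to a PrettyTable.
    
    :param clf: Classifier to use for fitting the data
    :param features: Features for the dataset. Expected np.matrix, np.array
    :param labels: Labels for the dataset. Expected np.array, list
    :param scorings: Scoring metrics to evaluate
    :param cv: Number of cross validation iterations (splits) to perform.
    :param test_size: Fraction of the data held out for validation in each split (used when shuffle is True).
    :param shuffle: If True, use ShuffleSplit (random splits); otherwise pass cv straight to cross_validate.
    :param ptable: (keyword) PrettyTable that receives the result row.
    :param hyperparams: (keyword) Names of classifier parameters to report as extra columns.
    """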
    table = kwargs.get('ptable')
    field_names=[
        "fit_time",
        "score_time",
        "train_accuracy",
        "cv_accuracy",
        "train_f1",
        "cv_f1",
        "train_recall",
        "cv_recall",
        "train_precision",
        "cv_precision"
    ]
    if shuffle:
        from sklearn.model_selection import ShuffleSplit
        CV = ShuffleSplit(n_splits=cv, test_size=test_size, random_state=42)
    else:
        CV = cv
        
    scores = cross_validate(
        clf,
        features,
        labels,
        scoring=scorings,
        cv=CV,
        verbose=2,
        return_train_score=True)
    
    ## Formatted strings would only work in python >= 3.6
    table_row = []
    # Report the timings
    table_row.append(round(scores['fit_time'].mean(), 5))
    table_row.append(round(scores['score_time'].mean(), 5))
    table_row.append(round(scores['train_accuracy'].mean(), 5))
    table_row.append(round(scores['test_accuracy'].mean(), 5))
    table_row.append(round(scores['train_f1'].mean(), 5))
    table_row.append(round(scores['test_f1'].mean(), 5))
    table_row.append(round(scores['train_recall'].mean(), 5))
    table_row.append(round(scores['test_recall'].mean(), 5))
    table_row.append(round(scores['train_precision'].mean(), 5))
    table_row.append(round(scores['test_precision'].mean(), 5))
    hparams = kwargs.get("hyperparams", None)
    if hparams:
        for each in hparams:
            field_names.append(each.upper())
            table_row.append(clf.get_params()[each])
    table.field_names = field_names
    table.add_row(table_row)
    print(f'Done for {clf}')
    return table
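
Note that with `shuffle=True` the function above uses `ShuffleSplit`, which draws `cv` independent random train/validation splits rather than partitioning the data into `cv` disjoint folds. A rough pure-Python sketch of that behaviour (the `shuffle_split` helper is hypothetical and only mimics the idea, it is not scikit-learn's implementation):

```python
import math
import random


def shuffle_split(n_samples, n_splits, test_size, seed=42):
    """Yield (train_idx, val_idx) pairs from repeated random permutations."""
    rng = random.Random(seed)
    n_val = math.ceil(test_size * n_samples)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_val:], indices[:n_val]


splits = list(shuffle_split(n_samples=10, n_splits=3, test_size=0.3))
for train_idx, val_idx in splits:
    print(sorted(train_idx), sorted(val_idx))
```

Because each split is drawn independently, a sample can land in the validation part of several splits or of none, unlike in K-fold. Shuffling also mixes older and newer reviews, so the validation here is not time-ordered even though the train/test split is.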

In [66]:
def fit_params_with_ALPHA(ALPHAS, features, labels, scorings, cv, **kwargs):
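    """
    Cross validate MultinomialNB once for each alpha in ALPHAS
    and collect the mean scores in a PrettyTable.
    
    :param ALPHAS: List of smoothing (alpha) values to try
    :param features: Feature matrix
    :param labels: Binary labels
    :param scorings: Scoring metrics passed to cross_validate
    :param cv: Number of cross validation splits
    """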
    table = PrettyTable()
    for ALPHA in ALPHAS:
        clf = MultinomialNB(alpha=ALPHA)
        test_size = kwargs.get('test_size', 0.3)
        shuffle = kwargs.get('shuffle', True)
        hparams = kwargs.get('hyperparams')
        table = BC_Kfold_cross_validation(clf,
                                            features,
                                            labels,
                                            scorings,
                                            cv,
                                            test_size,
                                            shuffle,
                                            hyperparams=hparams,
                                            ptable=table)
    return table

In [59]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

In [60]:
features = Dtrain
labels = data_train.Polarity.apply(lambda x: 1 if x == 'positive' else 0).values

In [61]:
features.shape

(254882, 5000)

In [62]:
labels.shape

(254882,)

In [None]:
alphas = [0.001, 0.01, 0.1, 1, 10]
table = fit_params_with_ALPHA(alphas, features, labels, 
                              scorings, NUM_FOLDS, hyperparams=['alpha'])

[CV]  ................................................................
[CV] ................................................. , total=  19.5s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   38.2s remaining:    0.0s


[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  20.3s
[CV]  ................................................................
[CV] ................................................. , total=  20.0s
[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  20.2s
[CV]  ................................................................
[CV] ................................................. , total=  20.2s
[CV]  ................................................................
[CV] ................................................. , total=  20.3s
[CV]  

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  6.5min finished


Done for MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True)
[CV]  ................................................................
[CV] ................................................. , total=  19.6s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   38.4s remaining:    0.0s


[CV]  ................................................................
[CV] ................................................. , total=  20.0s
[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  19.8s
[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  19.5s
[CV]  ................................................................
[CV] ................................................. , total=  19.3s
[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  

In [90]:
display(HTML(table.get_html_string()))

| fit_time | score_time | train_accuracy | cv_accuracy | train_f1 | cv_f1 | train_recall | cv_recall | train_precision | cv_precision | ALPHA |
|---|---|---|---|---|---|---|---|---|---|---|
| 12.03081 | 8.02428 | 0.90058 | 0.89638 | 0.94182 | 0.93938 | 0.9456 | 0.94406 | 0.93806 | 0.93474 | 0.001 |
| 11.95605 | 7.86128 | 0.90057 | 0.89636 | 0.94181 | 0.93937 | 0.9456 | 0.94404 | 0.93806 | 0.93474 | 0.01 |
| 11.88452 | 7.91433 | 0.90052 | 0.89634 | 0.94179 | 0.93935 | 0.94555 | 0.94401 | 0.93805 | 0.93474 | 0.1 |
| 11.93363 | 7.92629 | 0.90016 | 0.89605 | 0.94157 | 0.93918 | 0.9453 | 0.94376 | 0.93787 | 0.93464 | 1.0 |
| 11.88802 | 7.95451 | 0.89793 | 0.89444 | 0.94045 | 0.93842 | 0.94704 | 0.94572 | 0.93395 | 0.93122 | 10.0 |


The `f1` and `recall` scores are nearly identical for alpha values from 0.001 to 1 (`cv_f1` stays between 0.93918 and 0.93938), while `precision` drops a little at alpha = 10. Hence I feel that alpha = 1, the default value, is good enough. Still, let's see whether the F1 score improves with even higher alpha values.

In [72]:
alphas = [50, 100, 1000]
table2 = fit_params_with_ALPHA(alphas, features, labels, 
                              scorings, NUM_FOLDS, hyperparams=['alpha'])

[CV]  ................................................................
[CV] ................................................. , total=  20.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   38.9s remaining:    0.0s


[CV] ................................................. , total=  19.5s
[CV]  ................................................................
[CV] ................................................. , total=  19.6s
[CV]  ................................................................
[CV] ................................................. , total=  19.8s
[CV]  ................................................................
[CV] ................................................. , total=  20.1s
[CV]  ................................................................
[CV] ................................................. , total=  20.1s
[CV]  ................................................................
[CV] ................................................. , total=  20.1s
[CV]  ................................................................
[CV] ................................................. , total=  19.5s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  6.4min finished


[CV] ................................................. , total=  19.9s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   38.9s remaining:    0.0s


[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  19.8s
[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  19.8s
[CV]  ................................................................
[CV] ................................................. , total=  19.9s
[CV]  ................................................................
[CV] ................................................. , total=  19.7s
[CV]  ................................................................
[CV] ................................................. , total=  20.0s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  6.4min finished


[CV] ................................................. , total=  19.9s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   38.9s remaining:    0.0s


[CV] ................................................. , total=  20.0s
[CV]  ................................................................
[CV] ................................................. , total=  20.0s
[CV]  ................................................................
[CV] ................................................. , total=  19.6s
[CV]  ................................................................
[CV] ................................................. , total=  20.0s
[CV]  ................................................................
[CV] ................................................. , total=  20.1s
[CV]  ................................................................
[CV] ................................................. , total=  20.3s
[CV]  ................................................................
[CV] ................................................. , total=  20.2s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  6.5min finished


In [92]:
display(HTML(table2.get_html_string()))

| fit_time | score_time | train_accuracy | cv_accuracy | train_f1 | cv_f1 | train_recall | cv_recall | train_precision | cv_precision | ALPHA |
|---|---|---|---|---|---|---|---|---|---|---|
| 11.90005 | 7.9462 | 0.89059 | 0.88834 | 0.93798 | 0.93671 | 0.9722 | 0.97163 | 0.90608 | 0.90421 | 50 |
| 11.94663 | 7.93912 | 0.87561 | 0.87409 | 0.93119 | 0.93035 | 0.98905 | 0.98887 | 0.87973 | 0.87837 | 100 |
| 12.02611 | 7.97107 | 0.85103 | 0.85043 | 0.91952 | 0.91917 | 1.0 | 1.0 | 0.85103 | 0.85043 | 1000 |


Hmm, larger alpha values do not help. `cv_f1` decreases steadily as alpha increases: recall keeps rising, but precision drops continuously. For alpha = 1000 recall is perfect, and `cv_precision` equals `cv_accuracy` (0.85043), which is what happens when essentially every review is predicted as positive. This hurts precision a lot, and we can see the effect in the F1 score.
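
The F1 score is the harmonic mean of precision and recall, $F_1 = \frac{2PR}{P + R}$, so a perfect recall cannot compensate for lost precision. As a rough check, plugging the averaged `cv_precision` and `cv_recall` values from the table into that formula approximately reproduces the `cv_f1` column (only approximately, because the table averages per-split scores; the helper name `f1_from` is made up):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (alpha, cv_precision, cv_recall) copied from the table above
rows = [(50, 0.90421, 0.97163), (100, 0.87837, 0.98887), (1000, 0.85043, 1.0)]
for alpha, p, r in rows:
    print(alpha, round(f1_from(p, r), 5))
```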

## Reporting the different evaluation metrics for `Alpha = 1`

In [134]:
report = PrettyTable(field_names=[
    'test_accuracy',
    'test_f1',
    'test_recall',
    'test_precision'
])

In [78]:
clf = MultinomialNB(alpha=1)
clf.fit(features, labels)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

In [80]:
preds = clf.predict(Dtest)

In [96]:
actual = data_test.Polarity.apply(lambda x: 1 if x == 'positive' else 0).values

#### Test Accuracy 

In [97]:
from sklearn.metrics import accuracy_score

In [118]:
ta = round(accuracy_score(actual, preds), 5)

In [119]:
ta

0.8862

#### Test Recall

In [120]:
from sklearn.metrics import recall_score

In [122]:
tr = round(recall_score(actual, preds), 5)

#### Test Precision

In [123]:
from sklearn.metrics import precision_score

In [125]:
tp = round(precision_score(actual, preds), 5)

#### Test F1

In [126]:
from sklearn.metrics import f1_score

In [127]:
tf1 = round(f1_score(actual, preds), 5)

In [136]:
report.add_row([ta, tf1, tr, tp])

In [139]:
print(report)

+---------------+---------+-------------+----------------+
| test_accuracy | test_f1 | test_recall | test_precision |
+---------------+---------+-------------+----------------+
|     0.8862    | 0.93136 |   0.93555   |    0.92721     |
+---------------+---------+-------------+----------------+


#### Confusion Matrix

In [140]:
from sklearn.metrics import confusion_matrix

In [142]:
cm = confusion_matrix(actual, preds)

| actual \ predicted | 0 | 1 |
|---|---|---|
| 0 | 12462 | 6621 |
| 1 | 5810 | 84343 |
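
With scikit-learn's convention (rows are actual labels, columns are predicted labels), the four cells above are TN = 12462, FP = 6621, FN = 5810 and TP = 84343. A short sketch showing that the reported test metrics follow directly from these counts (the variable names are illustrative):

```python
# Cells of the confusion matrix shown above (rows: actual, columns: predicted).
true_neg, false_pos = 12462, 6621
false_neg, true_pos = 5810, 84343

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

# Share of actual negative reviews that were classified as negative.
negative_recall = true_neg / (true_neg + false_pos)

print(round(accuracy, 5), round(f1, 5), round(recall, 5), round(precision, 5))
print(round(negative_recall, 3))
```

The negative class is noticeably harder: only 12462 of the 19083 negative reviews (about 65%) are recognised, which the accuracy and the positive-class F1 score both hide.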


#### Feature Importance

In [189]:
clf.feature_log_prob_

array([[-10.76626381, -10.6609033 , -10.52114135, ..., -11.17172892,
        -10.61211313, -10.47858174],
       [-10.79797896, -10.55989382, -10.38181856, ..., -10.07830502,
        -10.1210923 , -10.01737247]])

In [191]:
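def top_k_features(k):
    # argsort is ascending, so the first k indices are the features with the
    # LOWEST log probability in each class, i.e. the words least typical of it.
    neg_class_prob_sorted = clf.feature_log_prob_[0, :].argsort()
    pos_class_prob_sorted = clf.feature_log_prob_[1, :].argsort()

    # Newer scikit-learn versions replace get_feature_names() with get_feature_names_out().
    print(np.take(cunt.get_feature_names(), neg_class_prob_sorted[:k]))  # least likely under negative
    print(np.take(cunt.get_feature_names(), pos_class_prob_sorted[:k]))  # least likely under positive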

In [193]:
top_k_features(20)

['excellant' 'divide' 'stews' 'rebecca' 'addicting' 'peterson' 'pickiest'
 'lends' 'accompaniment' 'nuke' 'joints' 'jitters' 'scrumptious'
 'smoothest' 'yuban' 'gems' 'crock' 'backpack' 'breasts' 'perfection']
['deceptive' 'moldy' 'disapointed' 'embarrassed' 'emails' 'lousy'
 'defective' 'canceled' 'ashamed' 'bait' 'ruins' 'contacting' 'sorely'
 'returns' 'inedible' 'horrid' 'contaminated' 'vaguely' 'hopeful' 'lesson']
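
The first list holds the words that are least probable under the negative class and the second the words least probable under the positive class, which is why positive-sounding words appear first. One alternative way to rank words, not used in this notebook, is the difference between the two class log probabilities. A toy sketch with invented words and log probabilities standing in for the vectorizer's feature names and `clf.feature_log_prob_`:

```python
# Invented vocabulary and per-class log probabilities (not from the dataset).
words = ["great", "moldy", "the", "delicious", "refund"]
log_prob_neg = [-7.0, -5.5, -2.0, -8.0, -5.0]
log_prob_pos = [-4.5, -9.0, -2.0, -5.0, -8.0]


def rank_by_log_ratio(words, log_prob_neg, log_prob_pos):
    """Sort words from most positive-leaning to most negative-leaning."""
    ratios = [pos - neg for pos, neg in zip(log_prob_pos, log_prob_neg)]
    return sorted(zip(words, ratios), key=lambda pair: pair[1], reverse=True)


ranked = rank_by_log_ratio(words, log_prob_neg, log_prob_pos)
print("most positive:", [w for w, _ in ranked[:2]])
print("most negative:", [w for w, _ in ranked[-2:]])
```

A word such as "the", which is equally likely in both classes, gets a ratio of zero and drops out of both ends of the ranking.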
