## 0. System Configuration

- **RAM:** *30GB*
- **CPU:** *12 Core - 24 Thread Thread Reaper, Intel Xeon*
- **GPU:** *Nvidia 1080 Ti - 11GB vRAM* + *Nvidia Titan XP - 12GB vRAM* 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
%config InlineBackend.figure_format = 'retina'

In [2]:
!ls

Bag Of Words.ipynb	   README.md	  TFIDF_sklearn.ipynb
data			   Testing.ipynb
naive-babes-sklearn.ipynb  TFIDF.ipynb


## 1. Loading the dataset

In [3]:
with sqlite3.connect('./data/reviewsV1.db') as conn:
    data = pd.read_sql_query('SELECT * FROM Review', conn)

In [4]:
data.drop('index', inplace=True, axis=1)

## 2. Legacy Data Cleaning: Dropping a Row That Causes Problems in Word2Vec Modelling

In [5]:
data = data[data.index != 258456]

## 3. Time Based Splitting

In [6]:
data.sort_values(by='Time', inplace=True)
data.reset_index(drop=True, inplace=True)
TRAIN_SIZE = int(data.shape[0] * 0.7)
TEST_SIZE = data.shape[0] - TRAIN_SIZE

In [7]:
TRAIN_SIZE

254882

In [8]:
TEST_SIZE

109236

In [9]:
data_train = data[0: TRAIN_SIZE]
data_test = data[TRAIN_SIZE:]

#### 3.1 Check if the Splitting was performed properly

In [10]:
assert(data_train.shape[0] == TRAIN_SIZE)
assert(data_test.shape[0] == TEST_SIZE)
assert(data.Time.max() == data_test.Time.reset_index(drop=True)[TEST_SIZE -1])
assert(data.Time.min() == data_train.Time.reset_index(drop=True)[0])
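
The split and its checks can be reproduced on a toy frame. The following is a minimal sketch, assuming a hypothetical helper `time_based_split` and made-up data; only the `Time` column name is taken from the notebook.

```python
import pandas as pd


def time_based_split(df, time_col="Time", train_frac=0.7):
    """Sort by time_col and cut once: oldest rows train, newest rows test.

    Hypothetical helper, not part of the notebook above.
    """
    ordered = df.sort_values(by=time_col).reset_index(drop=True)
    train_size = int(ordered.shape[0] * train_frac)
    return ordered.iloc[:train_size], ordered.iloc[train_size:]


# Made-up reviews with shuffled timestamps
toy = pd.DataFrame({
    "Time": [50, 10, 40, 20, 30, 70, 60, 90, 80, 100],
    "Text": [f"review {i}" for i in range(10)],
})
toy_train, toy_test = time_based_split(toy)

# No test review is older than any train review
assert toy_train["Time"].max() <= toy_test["Time"].min()
print(len(toy_train), len(toy_test))
```

Sorting before cutting is what keeps future reviews out of the training set; a random split would not give that guarantee.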

## 4. Training a TF-IDF Model on data_train

#### 4.1 Creating TF-IDF features on train data

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
tfidf = TfidfVectorizer(max_features=5000)
tfidf.fit(data_train.Text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
Dtrain = tfidf.transform(data_train.Text)
Dtrain.get_shape()

(254882, 5000)

In [16]:
Dtrain = Dtrain.toarray()  # Almost 10.2 GB of RAM usage

#### 4.2 Creating TF-IDF features on test data

In [17]:
Dtest = tfidf.transform(data_test.Text)
Dtest.get_shape()

(109236, 5000)

In [18]:
109236 * 5000 * 8

4369440000
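
The product above is the size in bytes of the dense test matrix, assuming 8 bytes per value (float64). A small sketch of the same arithmetic for both matrices; `dense_nbytes` is a hypothetical helper name, and the shapes are the ones printed earlier.

```python
def dense_nbytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (8 bytes = float64)."""
    return n_rows * n_cols * itemsize


train_bytes = dense_nbytes(254882, 5000)
test_bytes = dense_nbytes(109236, 5000)

print(f"train: {train_bytes / 1e9:.2f} GB")  # train: 10.20 GB
print(f"test:  {test_bytes / 1e9:.2f} GB")   # test:  4.37 GB
```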

In [19]:
Dtest = Dtest.toarray()  # About 4.4 GB of RAM usage

In [22]:
# TF-IDF rows are L2-normalised (norm='l2'), so every entry must be <= 1
assert (Dtrain[0] <= 1).all()
assert (Dtest[0] <= 1).all()
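
The `.toarray()` conversions above account for most of the memory use, and they are not strictly required: `MultinomialNB` and `cross_validate` accept the SciPy sparse matrix that `TfidfVectorizer` returns. A minimal sketch on a made-up corpus (the documents, labels and `toy_*` names are illustrative assumptions, not taken from the review dataset):

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: 10 positive and 10 negative one-line reviews
toy_docs = ["great taste would buy again", "stale and awful do not buy"] * 10
toy_labels = np.array([1, 0] * 10)

toy_X = TfidfVectorizer().fit_transform(toy_docs)  # stays sparse (CSR)

toy_cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
toy_scores = cross_validate(
    MultinomialNB(alpha=1.0),
    toy_X,                      # sparse input, no .toarray() needed
    toy_labels,
    scoring=["accuracy", "f1"],
    cv=toy_cv,
    return_train_score=True,
)
print(issparse(toy_X), toy_scores["test_f1"].mean())
```

Whether the sparse path is faster on the full dataset would need to be measured; the sketch only shows that the same calls run without densifying.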

## 5. K-Fold Cross Validation

In [26]:
from sklearn.model_selection import cross_validate

*Scoring parameters*

**http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter**

```python
['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'mutual_info_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'v_measure_score']
```

We are going to optimize the $\alpha$ parameter of Laplace smoothing using the F1 score; however, we are going to report the following metrics in each iteration.

- Accuracy
- F1 score
- Recall
- Precision
- Confusion Matrix
- Training F1 score
- Validation F1 score
- Training Recall
- Validation Recall
- Training Precision
- Validation Precision
- Average Fitting Time
- Average Prediction time
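
For reference, $\alpha$ enters multinomial Naive Bayes through the smoothed per-class estimate $\hat{\theta}_{ci} = (N_{ci} + \alpha) / (N_c + \alpha n)$, where $N_{ci}$ is the total weight of feature $i$ in class $c$, $N_c$ is the total weight of all features in that class, and $n$ is the number of features. A sketch with made-up counts; `smoothed_log_prob` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def smoothed_log_prob(feature_counts, alpha):
    """Log of (N_ci + alpha) / (N_c + alpha * n) for a single class."""
    feature_counts = np.asarray(feature_counts, dtype=float)
    n_features = feature_counts.shape[0]
    smoothed = feature_counts + alpha
    return np.log(smoothed) - np.log(feature_counts.sum() + alpha * n_features)


# Made-up counts for a 4-word vocabulary in one class; the last word never occurs
counts = [6.0, 3.0, 1.0, 0.0]

for alpha in (0.0001, 1, 100):
    probs = np.exp(smoothed_log_prob(counts, alpha))
    print(alpha, np.round(probs, 4))
```

A small $\alpha$ leaves the estimates close to the raw frequencies, while a large $\alpha$ pulls every word toward the uniform value $1/n$.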

#### 5.1 KFold Cross Validation

In [27]:
NUM_FOLDS = 10
scorings = ['accuracy', 'f1', 'recall', 'precision']

In [28]:
from prettytable import PrettyTable
from IPython.display import display, HTML

In [29]:
def BC_Kfold_cross_validation(
    clf,
    features,
    labels,
    scorings,
    cv,
    test_size,
    shuffle,
    **kwargs):
    """
    Perform K-fold cross validation for a binary classifier.

    :param clf: Classifier to fit on the data
    :param features: Features of the dataset. Expected np.matrix or np.array
    :param labels: Labels of the dataset. Expected np.array or list
    :param scorings: Scoring metrics to evaluate
    :param cv: Number of cross-validation iterations to perform.
    :param test_size: Fraction of the data held out for validation in each
        iteration (used only when shuffle is True).
    :param shuffle: If True, draw a fresh random split for each iteration.
    :param ptable: (keyword) PrettyTable to which the result row is appended
    :param hyperparams: (keyword) Names of classifier parameters to report
    """
    table = kwargs.get('ptable')
    field_names=[
        "fit_time",
        "score_time",
        "train_accuracy",
        "cv_accuracy",
        "train_f1",
        "cv_f1",
        "train_recall",
        "cv_recall",
        "train_precision",
        "cv_precision"
    ]
    if shuffle:
        from sklearn.model_selection import ShuffleSplit
        CV = ShuffleSplit(n_splits=cv, test_size=test_size, random_state=42)
    else:
        CV = cv
        
    scores = cross_validate(
        clf,
        features,
        labels,
        scoring=scorings,
        cv=CV,
        verbose=2,
        return_train_score=True)
    
    ## Formatted strings would only work in python >= 3.6
    table_row = []
    # Report the timings
    table_row.append(round(scores['fit_time'].mean(), 5))
    table_row.append(round(scores['score_time'].mean(), 5))
    table_row.append(round(scores['train_accuracy'].mean(), 5))
    table_row.append(round(scores['test_accuracy'].mean(), 5))
    table_row.append(round(scores['train_f1'].mean(), 5))
    table_row.append(round(scores['test_f1'].mean(), 5))
    table_row.append(round(scores['train_recall'].mean(), 5))
    table_row.append(round(scores['test_recall'].mean(), 5))
    table_row.append(round(scores['train_precision'].mean(), 5))
    table_row.append(round(scores['test_precision'].mean(), 5))
    hparams = kwargs.get("hyperparams", None)
    if hparams:
        for each in hparams:
            field_names.append(each.upper())
            table_row.append(clf.get_params()[each])
    table.field_names = field_names
    table.add_row(table_row)
    print(f'Done for {clf}')
    return table

In [30]:
def fit_params_with_ALPHA(ALPHAS, features, labels, scorings, cv, **kwargs):
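    """
    Cross-validate MultinomialNB once for each value of alpha
    and collect the results in a single table.

    :param ALPHAS: List of Laplace smoothing values to try
    :param features: Features of the dataset
    :param labels: Labels of the dataset
    :param scorings: Scoring metrics to evaluate
    :param cv: Number of cross-validation iterations
    """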
    table = PrettyTable()
    for ALPHA in ALPHAS:
        clf = MultinomialNB(alpha=ALPHA)
        test_size = kwargs.get('test_size', 0.3)
        shuffle = kwargs.get('shuffle', True)
        hparams = kwargs.get('hyperparams')
        table = BC_Kfold_cross_validation(clf,
                                            features,
                                            labels,
                                            scorings,
                                            cv,
                                            test_size,
                                            shuffle,
                                            hyperparams=hparams,
                                            ptable=table)
    return table

In [31]:
from sklearn.naive_bayes import MultinomialNB

In [32]:
features = Dtrain
labels = data_train.Polarity.apply(lambda x: 1 if x == 'positive' else 0).values

In [33]:
features.shape

(254882, 5000)

In [34]:
labels.shape

(254882,)

In [35]:
alphas = [0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 500, 1000]
table = fit_params_with_ALPHA(alphas, features, labels, 
                              scorings, NUM_FOLDS, hyperparams=['alpha'])

[CV]  ................................................................
[CV] ................................................. , total=   9.6s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.7s remaining:    0.0s


[CV] ................................................. , total=   9.4s
[CV]  ................................................................
[CV] ................................................. , total=   9.3s
[CV]  ................................................................
[CV] ................................................. , total=   9.3s
[CV]  ................................................................
[CV] ................................................. , total=   9.4s
[CV]  ................................................................
[CV] ................................................. , total=   9.5s
[CV]  ................................................................
[CV] ................................................. , total=   9.3s
[CV]  ................................................................
[CV] ................................................. , total=   9.3s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  2.3min finished



In [36]:
display(HTML(table.get_html_string()))

| fit_time | score_time | train_accuracy | cv_accuracy | train_f1 | cv_f1 | train_recall | cv_recall | train_precision | cv_precision | ALPHA |
|---|---|---|---|---|---|---|---|---|---|---|
| 7.69887 | 1.64806 | 0.87964 | 0.87727 | 0.93368 | 0.93242 | 0.99554 | 0.99561 | 0.87906 | 0.87677 | 0.0001 |
| 7.5817 | 1.64789 | 0.87964 | 0.87727 | 0.93368 | 0.93242 | 0.99554 | 0.99561 | 0.87905 | 0.87677 | 0.001 |
| 7.66021 | 1.60375 | 0.87963 | 0.87726 | 0.93367 | 0.93242 | 0.99554 | 0.99561 | 0.87904 | 0.87677 | 0.01 |
| 7.63644 | 1.64629 | 0.87954 | 0.87714 | 0.93363 | 0.93235 | 0.99557 | 0.99562 | 0.87894 | 0.87664 | 0.1 |
| 7.61501 | 1.6684 | 0.87807 | 0.87577 | 0.93288 | 0.93166 | 0.99571 | 0.99573 | 0.87752 | 0.87533 | 1.0 |
| 7.74889 | 1.71424 | 0.87107 | 0.86922 | 0.92937 | 0.92838 | 0.99675 | 0.99677 | 0.87052 | 0.86878 | 5.0 |
| 7.70566 | 1.64466 | 0.86356 | 0.86225 | 0.92565 | 0.92493 | 0.99797 | 0.99796 | 0.8631 | 0.86187 | 10.0 |
| 7.68087 | 1.62195 | 0.85112 | 0.85051 | 0.91956 | 0.91921 | 0.99998 | 0.99996 | 0.85112 | 0.85052 | 50.0 |
| 7.56968 | 1.6364 | 0.851 | 0.8504 | 0.9195 | 0.91916 | 1.0 | 1.0 | 0.851 | 0.8504 | 100.0 |
| 7.63088 | 1.67178 | 0.851 | 0.8504 | 0.9195 | 0.91916 | 1.0 | 1.0 | 0.851 | 0.8504 | 500.0 |


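We can see that as alpha grows the `recall` increases (reaching 1.0 at alpha = 100) while the `precision` decreases. Hence the `f1` score, which balances the two, is a good criterion for choosing alpha. The validation `f1` is highest (0.93242) for the three smallest alphas and declines from there, so the minimum alpha value of **0.0001** is used.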
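
F1 is the harmonic mean of precision and recall, so it drops whenever either one falls. A quick sketch that plugs two `cv_precision` / `cv_recall` pairs from the table into the formula; `f1_from` is a hypothetical helper, and the results only approximate the reported `cv_f1`, which is a mean of per-fold F1 values rather than the F1 of the mean precision and recall.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (cv_precision, cv_recall) pairs copied from the table above
low_alpha_f1 = f1_from(0.87677, 0.99561)   # alpha = 0.0001
high_alpha_f1 = f1_from(0.8504, 1.0)       # alpha = 100

print(round(low_alpha_f1, 4), round(high_alpha_f1, 4))
```

Even with perfect recall at alpha = 100, the lower precision gives a lower F1 than at alpha = 0.0001.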

## Reporting the different evaluation metrics for `Alpha = 0.0001`

In [37]:
report = PrettyTable(field_names=[
    'test_accuracy',
    'test_f1',
    'test_recall',
    'test_precision'
])

In [38]:
clf = MultinomialNB(alpha=0.0001)
clf.fit(features, labels)

MultinomialNB(alpha=0.0001, class_prior=None, fit_prior=True)

In [39]:
preds = clf.predict(Dtest)

In [40]:
actual = data_test.Polarity.apply(lambda x: 1 if x == 'positive' else 0).values

#### Test Accuracy 

In [41]:
from sklearn.metrics import accuracy_score

In [42]:
ta = round(accuracy_score(actual, preds), 5)

#### Test Recall

In [45]:
from sklearn.metrics import recall_score

In [46]:
tr = round(recall_score(actual, preds), 5)

#### Test Precision

In [47]:
from sklearn.metrics import precision_score

In [48]:
tp = round(precision_score(actual, preds), 5)

#### Test F1

In [49]:
from sklearn.metrics import f1_score

In [50]:
tf1 = round(f1_score(actual, preds), 5)

In [51]:
report.add_row([ta, tf1, tr, tp])

In [52]:
print(report)

+---------------+---------+-------------+----------------+
| test_accuracy | test_f1 | test_recall | test_precision |
+---------------+---------+-------------+----------------+
|    0.85973    |  0.9213 |    0.9948   |    0.85792     |
+---------------+---------+-------------+----------------+


#### Confusion Matrix

In [53]:
from sklearn.metrics import confusion_matrix

In [54]:
cm = confusion_matrix(actual, preds)
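
The matrix is computed here but not displayed. For binary labels, `confusion_matrix` puts true classes on the rows and predicted classes on the columns, i.e. `[[TN, FP], [FN, TP]]`. A sketch on made-up labels that builds the same layout with NumPy and derives recall and precision from it (`binary_confusion` and the `toy_*` names are hypothetical):

```python
import numpy as np


def binary_confusion(y_true, y_pred):
    """2x2 count matrix with rows = true class, columns = predicted class."""
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


# Made-up labels: 1 = positive review, 0 = negative review
toy_actual = [1, 1, 1, 1, 0, 0, 1, 0]
toy_preds = [1, 1, 1, 0, 1, 0, 1, 1]

toy_cm = binary_confusion(toy_actual, toy_preds)
n_tn, n_fp, n_fn, n_tp = toy_cm.ravel()

toy_recall = n_tp / (n_tp + n_fn)      # share of true positives that were found
toy_precision = n_tp / (n_tp + n_fp)   # share of positive predictions that were right
print(toy_cm)
print(toy_recall, toy_precision)
```

A classifier that predicts "positive" almost everywhere shows up here as a large FP cell next to a small TN cell, which is the pattern to look for when recall is close to 1.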

#### Feature Importance

In [55]:
clf.feature_log_prob_

array([[-10.36573982, -10.21059643,  -9.94469937, ..., -10.7197657 ,
        -10.33670262, -10.12949606],
       [-10.38326582, -10.12529537, -10.01238421, ...,  -9.65723172,
         -9.81874281,  -9.71084588]])

In [56]:
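def top_k_features(k):
    # argsort() is ascending, so [:k] selects the k features with the
    # LOWEST log-probability in each class (the words least typical of it).
    neg_class_prob_sorted = clf.feature_log_prob_[0, :].argsort()
    pos_class_prob_sorted = clf.feature_log_prob_[1, :].argsort()

    # get_feature_names() was deprecated in scikit-learn 1.0 and removed
    # in 1.2; newer versions provide get_feature_names_out() instead.
    feature_names = tfidf.get_feature_names()
    print(np.take(feature_names, neg_class_prob_sorted[:k]))
    print(np.take(feature_names, pos_class_prob_sorted[:k]))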

In [57]:
top_k_features(20)

['excellant' 'addicting' 'divide' 'rebecca' 'stews' 'pickiest' 'backpack'
 'yuban' 'scrumptious' 'accompaniment' 'crock' 'peterson' 'joints' 'gems'
 'nutritionist' 'lends' 'arthritis' 'relax' 'jitters' 'comforting']
['deceptive' 'embarrassed' 'moldy' 'emails' 'lousy' 'defective'
 'contacting' 'ashamed' 'bait' 'ruins' 'disapointed' 'horrid' 'canceled'
 'sorely' 'returns' 'linked' 'hopeful' 'lesson' 'inedible' 'choked']
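
Because `argsort()` is ascending, each list above holds the words with the lowest log-probability in that class, which is why positive-sounding words appear in the first (negative-class) list. One common alternative is to rank words by the difference between the two rows of log-probabilities. A sketch with a made-up vocabulary and made-up values standing in for `clf.feature_log_prob_`; `most_discriminative` is a hypothetical helper.

```python
import numpy as np


def most_discriminative(log_prob, vocab, k):
    """Top-k words per class ranked by log P(w | pos) - log P(w | neg)."""
    log_prob = np.asarray(log_prob)
    vocab = np.asarray(vocab)
    log_ratio = log_prob[1] - log_prob[0]
    order = np.argsort(log_ratio)  # ascending: most negative-leaning first
    return vocab[order[:k]], vocab[order[::-1][:k]]


# Made-up values; row 0 = negative class, row 1 = positive class
toy_vocab = ["moldy", "the", "scrumptious", "lousy", "comforting"]
toy_log_prob = [
    [-4.0, -1.0, -9.0, -4.5, -8.0],
    [-9.0, -1.0, -4.0, -8.0, -4.5],
]

neg_words, pos_words = most_discriminative(toy_log_prob, toy_vocab, k=2)
print(neg_words, pos_words)
```

Ranking by the difference keeps frequent but uninformative words such as "the" out of both lists, since their log-probabilities are similar in the two classes.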
