## Outline of steps
* [step0](#step0): import necessary packages
* [step1](#step1): import dataset `part6_dataset.pickle` as `part7_dataset`
* [step2](#step2): stratified sampling **50%** of the dataset
* [step3](#step3): create necessary `class` for sentiment analysis
* [step4](#step4): create a bag of words solely for sentiment analysis
* [step5](#step5): build up the benchmark model (naive bayes) for sentiment analysis
* [step6](#step6): evaluate the naive bayes model performance based on cross-validation
* [step7](#step7): general output of the model prediction and confusion matrix on the training dataset (nb model)
* [step8](#step8): evaluate naive bayes model on the `testing` dataset
* [step9](#step9): the comparison model - multi-layer perceptron model
* [step10](#step10): general output of the model prediction and confusion matrix on the training dataset (mlp model)
* [step11](#step11): evaluate multi-layer perceptron model on the `testing` dataset
* [step12](#step12): choose the better model and predict text sentiment for the remaining 50% dataset
* [step13](#step13): export the remaining 50% dataset for the next modeling process

<a id="step0"></a>
## step0: import necessary packages

In [36]:
# import necessary packages
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # module for missing value visualization
from scipy import stats # implement box-cox transformation
from math import ceil
from sklearn.utils import shuffle # shuffling the dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.naive_bayes import MultinomialNB # for sentiment analysis benchmark model
from sklearn.model_selection import cross_val_score # cross validation score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.metrics import classification_report

from keras.utils import np_utils # encode categorical variable
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.callbacks import ModelCheckpoint, EarlyStopping  


# Pretty display for notebooks
%matplotlib inline

<a id="step1"></a>
## step1: import dataset `part6_dataset.pickle` as `part7_dataset`

In [37]:
part7_dataset = pd.read_pickle("part6_dataset.pickle")

<a id="step2"></a>
## step2: stratified sampling 50% of the dataset
1. Use the **first 50%** of the dataset as the training and validation data for sentiment analysis. `train_test_split()` does the sampling.
    - a. From this first 50%, I hold out a further 20% (that is, 10% of the whole dataset) as a testing dataset, so that I can compare the Naive Bayes and Multi-layer Perceptron models on data that neither model has seen.
    - b. This also avoids **data leakage**: the remaining 50% of the dataset plays no part in deciding which model is chosen.
    - c. Both splits are done with the function **train_test_split**, and the **classes are stratified** each time.
2. The **remaining 50%** is kept for the rating prediction model later.
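
The two-stage split described above can be sketched on toy data to check the resulting proportions. This is only an illustration: the 1000-row label array and the `random_state` values are assumptions made for reproducibility and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 1000 rows, 90% positive (1) and 10% negative (0).
labels = np.array([1] * 900 + [0] * 100)
rows = np.arange(len(labels))

# Stage 1: split the data in half, keeping the class ratio in both halves.
first50, remaining50, y_first50, y_remaining50 = train_test_split(
    rows, labels, test_size=0.5, stratify=labels, random_state=0)

# Stage 2: hold out 20% of the first half (10% of all rows) for testing.
first40, test10, y_first40, y_test10 = train_test_split(
    first50, y_first50, test_size=0.2, stratify=y_first50, random_state=0)

for name, y in [("train 40%", y_first40), ("test 10%", y_test10),
                ("remaining 50%", y_remaining50)]:
    print(name, len(y), "rows, positive share:", y.mean())
```

Because both calls pass `stratify`, each of the three parts should keep roughly the same share of positive reviews as the full dataset.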

In [38]:
# separate target variable out - review_sentiment
target_variable = part7_dataset.review_sentiment
target_variable = target_variable.astype("category")

# just sample 50% of the whole dataset - use train_test_split() to achieve same result
X_first50, X_remaining50, y_first50, y_remaining50 = train_test_split(part7_dataset, target_variable,
                                                                      test_size = 0.5, stratify = target_variable)

In [39]:
# hold out 20% of the first 50% (10% of the whole dataset) as a testing dataset
# to compare the performance of Naive Bayes and Multi-layer Perceptron
X_first40, X_test, y_first40, y_test = train_test_split(X_first50, y_first50,
                                                        test_size = 0.2, stratify = y_first50)

<a id="step3"></a>
## step3: create necessary `class` for sentiment analysis
1. To get a better text sentiment model, I add lemmatization to the tokenizing step to clean up the text data. The tokenizer class is adapted from the [sklearn customized lemmatizer](http://scikit-learn.org/stable/modules/feature_extraction.html) example.

In [40]:
# create a class for lemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

<a id="step4"></a>
## step4: create a bag of words solely for sentiment analysis
1. Set the n-gram range up to **bi-grams**, so that we can see whether influential two-word patterns exist.
2. Because I will use the **MultinomialNB model** later, which works on word counts, I use **`CountVectorizer`** instead of `TfidfVectorizer` for this bag of words, **unlike the LDA model, where `TfidfVectorizer` was used**.
3. Again, the number of words kept in the bag of words is set to **5000**.
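
To see what bi-grams add to a bag of words, here is a minimal sketch on two made-up reviews. The toy sentences are hypothetical, and stop-word removal and the lemmatizing tokenizer are left out (an assumption of this sketch) so that the negation stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy reviews that differ only by the word "not".
toy_reviews = ["the food was not good", "the food was good"]

# Unigrams and bigrams; stop words are left in so that "not good" survives.
toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# The learned terms, single words and two-word sequences alike.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_counts.toarray())
```

With unigrams only, the two reviews differ by a single column ("not"). With bi-grams, patterns such as "not good" and "was good" become separate features that a classifier can weight differently.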

In [41]:
# build up a bag of words for Sentiment Analysis
n_features = 5000

sentiment_count_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                             max_df=0.5, min_df=2, # drop terms found in more than 50% of docs or in fewer than 2 docs
                                             max_features=n_features,
                                             stop_words="english",
                                             ngram_range=(1,2))

# fit and transform data
sentiment_count = sentiment_count_vectorizer.fit_transform(X_first40["combined_review"])
sentiment_count = sentiment_count.toarray() # transform from sparse to dense matrix

<a id="step5"></a>
## step5: build up the benchmark model (naive bayes) for sentiment analysis

In [42]:
# create the benchmark model
naive_bayes = MultinomialNB()
naive_bayes.fit(sentiment_count, y_first40)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

<a id="step6"></a>
## step6: evaluate the naive bayes model performance based on cross-validation 
1. In this classification case, the metrics include **accuracy**, **f1**, **recall**. 
    - a. The **accuracy** of this text sentiment model should be greater than **0.89**: the probability of the positive review class is 0.89 and that of the negative review class is (1-0.89), so a model that always predicts "positive" already reaches 0.89.
    - b. Hence, ideally the model should have an accuracy higher than 0.89.
2. Cross-validation fold is set to be **5**.
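
The majority-class baseline mentioned above can be checked with a few lines of arithmetic. The class counts are taken from the `support` column of the training classification report in step7. On that split the share of positive reviews comes out at roughly 0.90, in line with the 0.89 quoted above.

```python
# Class counts taken from the `support` column of the training classification report.
negative_reviews = 20935   # class 0
positive_reviews = 183842  # class 1

total_reviews = negative_reviews + positive_reviews

# Accuracy of a model that always predicts the majority (positive) class.
majority_baseline = positive_reviews / total_reviews
print("majority-class baseline accuracy: {:.3f}".format(majority_baseline))
```

Any accuracy at or below this value could be matched by a model that ignores the review text entirely, which is why f1 and recall are reported alongside accuracy.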

In [43]:
# evaluate the model performance based on cross-validation
mean_accuracy = cross_val_score(estimator=naive_bayes, X=sentiment_count, y=y_first40, cv=5, scoring="accuracy").mean()
print("mean accuracy of cross-validation: {}".format(round(mean_accuracy,2)))

# calculate the f1 score (for the positive class, label 1)
mean_f1 = cross_val_score(estimator=naive_bayes, X=sentiment_count, y=y_first40, cv=5, scoring="f1").mean()
print("mean f1 of cross-validation: {}".format(round(mean_f1,2)))

# calculate the recall (for the positive class, label 1)
mean_recall = cross_val_score(estimator=naive_bayes, X=sentiment_count, y=y_first40, cv=5, scoring="recall").mean()
print("mean recall of cross-validation: {}".format(round(mean_recall,2)))

mean accuracy of cross-validation: 0.86
mean f1 of cross-validation: 0.92
mean recall of cross-validation: 0.88
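
As an aside, the three metrics can also be collected in a single pass with scikit-learn's `cross_validate`, which fits each fold once instead of three times. The sketch below runs on synthetic counts; `toy_X`, `toy_y` and their sizes are assumptions standing in for `sentiment_count` and `y_first40`, so the printed numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

# Synthetic word counts standing in for `sentiment_count` (assumption: 200 docs, 20 terms).
rng = np.random.RandomState(0)
toy_X = rng.poisson(lam=1.0, size=(200, 20))
toy_y = np.array([1] * 160 + [0] * 40)

# One call fits each fold once and scores it with all three metrics.
scores = cross_validate(MultinomialNB(), toy_X, toy_y, cv=5,
                        scoring=["accuracy", "f1", "recall"])

for metric in ["accuracy", "f1", "recall"]:
    print("mean {} of cross-validation: {:.2f}".format(
        metric, scores["test_" + metric].mean()))
```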


<a id="step7"></a>
## step7: general output of the model prediction and confusion matrix on the training dataset (nb model)

In [44]:
# create the prediction from naive_bayes model
prediction = naive_bayes.predict(sentiment_count)

# print the accuracy metric
print("the accuracy of training dataset")
print(round(accuracy_score(y_first40, prediction),2))
print("\n")

# print the confusion matrix
print("confusion matrix from naive bayes model")
print(confusion_matrix(y_first40, prediction))
print("\n")

# print the overview of performance metrics
print("the overview of performance metrics")
print(classification_report(y_first40, prediction))

the accuracy of training dataset
0.87


confusion matrix from naive bayes model
[[ 14738   6197]
 [ 21189 162653]]


the overview of performance metrics
             precision    recall  f1-score   support

          0       0.41      0.70      0.52     20935
          1       0.96      0.88      0.92    183842

avg / total       0.91      0.87      0.88    204777



<a id="step8"></a>
## step8: evaluate naive bayes model on the `testing` dataset.
**Reminder**: the testing dataset for the text sentiment model is `X_test` and `y_test`, not `X_remaining50` and `y_remaining50`.

In [45]:
# transform the testing dataset
test_sentiment_count = sentiment_count_vectorizer.transform(X_test["combined_review"])
test_sentiment_count = test_sentiment_count.toarray()
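
The testing data above goes through `transform`, not `fit_transform`. A minimal sketch of why that matters, using made-up reviews (all names and sentences here are hypothetical): a vectorizer fitted on new text learns a different vocabulary, so its columns no longer mean what the trained model expects.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_reviews = ["great food", "terrible service"]
new_reviews = ["great service", "slow kitchen"]

fitted = CountVectorizer().fit(train_reviews)

# Reusing the fitted vocabulary keeps every column's training-time meaning.
reused = fitted.transform(new_reviews)

# A fresh fit on the new text learns a different vocabulary.
refitted = CountVectorizer().fit(new_reviews)

print(sorted(fitted.vocabulary_))    # columns the model was trained on
print(sorted(refitted.vocabulary_))  # columns after re-fitting
print(reused.toarray())              # unseen words are simply dropped
```

Words that never appeared in the training text ("slow", "kitchen") contribute nothing after `transform`, which is the intended behavior for a model trained on the original vocabulary.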

In [46]:
# evaluate the performance on the testing dataset
test_prediction = naive_bayes.predict(test_sentiment_count)

# print the accuracy metric
print("the accuracy of testing dataset")
print(round(accuracy_score(y_test, test_prediction),2))
print("\n")

# print the confusion matrix
print("confusion matrix from naive bayes model (in remaining test dataset)")
print(confusion_matrix(y_test, test_prediction))
print("\n")

# print the overview of performance metrics
print("the overview of performance metrics (in remaining test dataset)")
print(classification_report(y_test, test_prediction))

the accuracy of testing dataset
0.86


confusion matrix from naive bayes model (in remaining test dataset)
[[ 3629  1605]
 [ 5351 40610]]


the overview of performance metrics (in remaining test dataset)
             precision    recall  f1-score   support

          0       0.40      0.69      0.51      5234
          1       0.96      0.88      0.92     45961

avg / total       0.90      0.86      0.88     51195



<a id="step9"></a>
## step9: the comparison model - multi-layer perceptron model

In [47]:
# encode the target variable
target_variable = np_utils.to_categorical(y_first40, num_classes=2) # training
test_target_variable = np_utils.to_categorical(y_test, num_classes=2) # test
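
For intuition, the one-hot encoding used for the MLP target and the `argmax` step that later turns predictions back into labels can be sketched with NumPy alone. This is a stand-in that mimics the shape of the encoded labels, not the Keras function itself, and `toy_labels` is a made-up example.

```python
import numpy as np

toy_labels = np.array([1, 0, 1, 1])

# One row of the 2x2 identity matrix per label: 0 -> [1, 0], 1 -> [0, 1].
one_hot = np.eye(2)[toy_labels]

# argmax along each row recovers the original integer label.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```

The softmax output layer produces one probability per class in the same two-column layout, which is why `argmax(axis=1)` is applied to the predictions in the later steps.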

In [48]:
# Building the model architecture
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(n_features,)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.summary()

# Compiling the model using categorical_crossentropy loss, and rmsprop optimizer.
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_43 (Dense)             (None, 512)               2560512   
_________________________________________________________________
dropout_9 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_44 (Dense)             (None, 128)               65664     
_________________________________________________________________
dense_45 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_46 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_47 (Dense)             (None, 2)                 66        
Total params: 2,636,578
Trainable params: 2,636,578
Non-trainable params: 0
_________________________________________________________________


In [49]:
# Running and evaluating the model

checkpointer = ModelCheckpoint(filepath='sentiment.model.best.hdf5', 
                               verbose=1, save_best_only=True)

earlystop = EarlyStopping(patience=2)

hist = model.fit(sentiment_count, target_variable,
          batch_size=50,
          epochs=20,
          validation_split=0.25,
          callbacks=[checkpointer, earlystop],
          verbose=2,
          shuffle=True)

Train on 153582 samples, validate on 51195 samples
Epoch 1/20
Epoch 00000: val_loss improved from inf to 0.22081, saving model to sentiment.model.best.hdf5
27s - loss: 0.2327 - acc: 0.9126 - val_loss: 0.2208 - val_acc: 0.9162
Epoch 2/20
Epoch 00001: val_loss improved from 0.22081 to 0.21949, saving model to sentiment.model.best.hdf5
26s - loss: 0.2167 - acc: 0.9203 - val_loss: 0.2195 - val_acc: 0.9170
Epoch 3/20
Epoch 00002: val_loss improved from 0.21949 to 0.21876, saving model to sentiment.model.best.hdf5
26s - loss: 0.2121 - acc: 0.9229 - val_loss: 0.2188 - val_acc: 0.9178
Epoch 4/20
Epoch 00003: val_loss did not improve
26s - loss: 0.2084 - acc: 0.9257 - val_loss: 0.2240 - val_acc: 0.9174
Epoch 5/20
Epoch 00004: val_loss did not improve
26s - loss: 0.2045 - acc: 0.9287 - val_loss: 0.2293 - val_acc: 0.9172
Epoch 6/20
Epoch 00005: val_loss did not improve
26s - loss: 0.1984 - acc: 0.9321 - val_loss: 0.2358 - val_acc: 0.9163
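
To read the log above: validation loss is lowest at one epoch and rises afterwards, while training loss keeps falling. The sketch below finds that epoch from the logged values. It only inspects the numbers and does not reproduce the internals of the Keras callbacks.

```python
# Validation losses copied from the training log above, one per epoch.
val_losses = [0.2208, 0.2195, 0.2188, 0.2240, 0.2293, 0.2358]

# Epochs are 1-based in the log; pick the one with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1
epochs_without_improvement = len(val_losses) - best_epoch

print("best epoch:", best_epoch)
print("epochs run after the best one:", epochs_without_improvement)
```

With `save_best_only=True`, the file `sentiment.model.best.hdf5` holds the model from the best epoch, whereas the `model` object in memory holds the state after the last epoch. If the checkpointed version is wanted for prediction, it would need to be loaded explicitly (for example with `keras.models.load_model`), a step the notebook does not show.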


<a id="step10"></a>
## step10: general output of the model prediction and confusion matrix on the training dataset (mlp model)

In [50]:
# create the prediction from multi-layer perceptron model
prediction_mlp = model.predict(sentiment_count)
prediction_mlp = prediction_mlp.argmax(axis=1)

# print the accuracy metric
print("the accuracy of training dataset")
print(round(accuracy_score(y_first40, prediction_mlp),2))
print("\n")

# print the confusion matrix
print("confusion matrix from multi-layer perceptron model")
print(confusion_matrix(y_first40, prediction_mlp))
print("\n")

# print the overview of performance metrics
print("the overview of performance metrics")
print(classification_report(y_first40, prediction_mlp))

the accuracy of training dataset
0.93


confusion matrix from multi-layer perceptron model
[[  9229  11706]
 [  2028 181814]]


the overview of performance metrics
             precision    recall  f1-score   support

          0       0.82      0.44      0.57     20935
          1       0.94      0.99      0.96    183842

avg / total       0.93      0.93      0.92    204777



<a id="step11"></a>
## step11: evaluate multi-layer perceptron model on the `testing` dataset.

In [51]:
# evaluate the performance on the testing dataset
test_prediction_mlp = model.predict(test_sentiment_count)
test_prediction_mlp = test_prediction_mlp.argmax(axis=1)

# print the accuracy metric
print("the accuracy of testing dataset")
print(round(accuracy_score(y_test, test_prediction_mlp),2))
print("\n")

# print the confusion matrix
print("confusion matrix from multi-layer perceptron model (in remaining test dataset)")
print(confusion_matrix(y_test, test_prediction_mlp))
print("\n")

# print the overview of performance metrics
print("the overview of performance metrics (in remaining test dataset)")
print(classification_report(y_test, test_prediction_mlp))

the accuracy of testing dataset
0.92


confusion matrix from multi-layer perceptron model (in remaining test dataset)
[[ 1852  3382]
 [  810 45151]]


the overview of performance metrics (in remaining test dataset)
             precision    recall  f1-score   support

          0       0.70      0.35      0.47      5234
          1       0.93      0.98      0.96     45961

avg / total       0.91      0.92      0.91     51195



<a id="step12"></a>
## step12: choose the better model and predict text sentiment for the remaining 50% dataset
The brief performance comparison between the **Naive Bayes** and **Multi-layer Perceptron** models on the testing dataset (F1-score and recall are the support-weighted `avg / total` values from the reports above):

| Naive Bayes |  Metric  | Multi-layer Perceptron |
|------------:|:--------:|------------------------|
|        0.86 | Accuracy | 0.92                   |
|        0.88 | F1-score | 0.91                   |
|        0.86 |  Recall  | 0.92                   |

* class 0: negative review
* class 1: positive review

As we can tell, the MLP model outperforms the Naive Bayes model slightly on all three metrics. But a closer look at the minority class 0 tells a different story: in the testing dataset, Naive Bayes correctly identifies 69% of all class 0 reviews, while the MLP model identifies only 35%. That means the Naive Bayes model has better recall and f1 on the small class (the 0 group). If identifying class 0 is very important, then the Naive Bayes model should be chosen instead.
<br/><br/>
However, in my case, I weight class 0 and class 1 equally (neither class gets extra importance), so I choose the model with the higher overall performance, that is the **Multi-layer Perceptron model**.
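
The per-class figures quoted above can be recomputed directly from the two testing-dataset confusion matrices. The sketch below is plain Python 3. The helper `recall_summary` is a hypothetical name introduced here, and the macro average is included only for contrast with the support-weighted recall.

```python
# Testing-dataset confusion matrices from above: rows are true classes, columns predictions.
confusion = {
    "naive bayes": [[3629, 1605], [5351, 40610]],
    "mlp": [[1852, 3382], [810, 45151]],
}

def recall_summary(matrix):
    """Return per-class recall, support-weighted recall and macro recall."""
    supports = [sum(row) for row in matrix]
    per_class = [matrix[i][i] / supports[i] for i in range(len(matrix))]
    weighted = sum(r * s for r, s in zip(per_class, supports)) / sum(supports)
    macro = sum(per_class) / len(per_class)
    return per_class, weighted, macro

summaries = {name: recall_summary(matrix) for name, matrix in confusion.items()}
for name, (per_class, weighted, macro) in summaries.items():
    print(name, [round(r, 2) for r in per_class], round(weighted, 2), round(macro, 2))
```

The support-weighted recall equals the overall accuracy, so it is dominated by the much larger class 1. The macro average counts both classes the same regardless of size, which makes the gap on class 0 more visible.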

In [52]:
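# transform the remaining data into bag of words with the vectorizer fitted on the training data;
# re-fitting here would build a different vocabulary from the one the model was trained on
remaining_sentiment_count = sentiment_count_vectorizer.transform(X_remaining50["combined_review"])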
remaining_sentiment_count = remaining_sentiment_count.toarray() # transform from sparse to dense matrix

# predict the text sentiment for the remaining 50% dataset
mlp_predict_review_sentiment = model.predict(remaining_sentiment_count)
mlp_predict_review_sentiment = mlp_predict_review_sentiment.argmax(axis=1)

<a id="step13"></a>
## step13: export the remaining 50% dataset for the next modeling process

In [53]:
# work on an explicit copy so that adding a column does not trigger pandas' SettingWithCopyWarning
X_remaining50 = X_remaining50.copy()

# add in the predicted review sentiment from MLP model
X_remaining50["mlp_predict_review_sentiment"] = mlp_predict_review_sentiment
X_remaining50.to_pickle("X_remaining50.pickle")
