# Naive Bayes Binary Classification: Proof of Concept

In this notebook I demonstrate the use of a popular benchmarking ML technique, Naive Bayes, as a binary classifier for Genuine and Deceptive reviews over the small ground truth Chicago Hotel dataset.

The accuracies produced in this notebook should not be trusted: they are based on a very small dataset (1,600 samples) with a tiny test set (480 samples). The notebook is still useful as a proof of concept. It uses simple Bag-of-Words features such as count vectorization and Tf-idf to give a baseline model to improve on with larger datasets and more features.

Without further ado, let's continue.

We start by importing the modules we need.

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn import metrics
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import train_test_split

Now let's pull the dataset and process it into data frames using Pandas.

In [2]:
neg_deceptive_folder_path = r"../../data/hotels/negative_polarity/deceptive_from_MTurk/"
neg_true_folder_path = r'../../data/hotels/negative_polarity/truthful_from_Web/'
pos_deceptive_folder_path = r'../../data/hotels/positive_polarity/deceptive_from_MTurk/'
pos_true_folder_path = r'../../data/hotels/positive_polarity/truthful_from_TripAdvisor/'

sentiment_class = []
reviews = []
deceptive_class =[]

for i in range(1, 6):
    fold = 'fold' + str(i)
    # Keep the original read order within each fold:
    # negative deceptive, negative truthful, positive deceptive, positive truthful.
    fold_sources = [
        ('negative', os.path.join(neg_deceptive_folder_path, fold)),
        ('negative', os.path.join(neg_true_folder_path, fold)),
        ('positive', os.path.join(pos_deceptive_folder_path, fold)),
        ('positive', os.path.join(pos_true_folder_path, fold)),
    ]
    for sentiment, folder in fold_sources:
        for data_file in sorted(os.listdir(folder)):
            sentiment_class.append(sentiment)
            # The file name prefix before the first '_' is 'd' (deceptive) or 't' (truthful).
            deceptive_class.append(str(data_file.split('_')[0]))
            with open(os.path.join(folder, data_file)) as f:
                reviews.append(f.read())


data_fm = pd.DataFrame({'sentiment':sentiment_class,'review':reviews,'deceptive':deceptive_class})

data_fm.loc[data_fm['deceptive']=='d','deceptive']=1
data_fm.loc[data_fm['deceptive']=='t','deceptive']=0

data_fm.loc[data_fm['sentiment']=='positive','sentiment']=1
data_fm.loc[data_fm['sentiment']=='negative','sentiment']=0

data_x = data_fm['review']

data_y = np.asarray(data_fm['deceptive'],dtype=int)
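
As an aside, the label encoding above can also be written with `Series.map`. This is a minimal sketch on a made-up four-row frame (`toy_fm` and its values are hypothetical stand-ins, not the hotel data), assuming the same 'd'/'t' prefixes as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_fm with only the 'deceptive' column.
toy_fm = pd.DataFrame({'deceptive': ['d', 't', 't', 'd']})

# Map the file-name prefixes to integer labels: deceptive -> 1, truthful -> 0.
toy_labels = np.asarray(toy_fm['deceptive'].map({'d': 1, 't': 0}), dtype=int)
print(toy_labels)
```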

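Then, we split our data into training and testing sets. In my experiments a 70/30 split worked best. `train_test_split` shuffles the data before splitting by default; the random_state parameter fixes the seed of that shuffle, so the same split is produced on every run.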

In [3]:
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3, random_state=1)
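
To illustrate what a fixed `random_state` does, here is a small sketch on made-up data (`toy_x` and `toy_y` are hypothetical stand-ins for `data_x` and `data_y`). Calling `train_test_split` twice with the same seed should give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for data_x / data_y, not the hotel reviews.
toy_x = np.array([f"review {i}" for i in range(10)])
toy_y = np.array([0, 1] * 5)

# Two calls with the same seed produce the same shuffled split.
a_train, a_test, _, _ = train_test_split(toy_x, toy_y, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(toy_x, toy_y, test_size=0.3, random_state=1)

same_split = list(a_test) == list(b_test)
print(same_split)
```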

We now want to initialize our feature extractor, CountVectorizer. This turns the reviews into vectors of word counts.

In [11]:
cv = CountVectorizer()
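
To make "vectors of word counts" concrete, here is a toy sketch with two invented reviews (`toy_reviews` is hypothetical). Each column corresponds to one vocabulary word and each row to one review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented reviews, used only for illustration.
toy_reviews = ["great room great view", "terrible room"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)

# vocabulary_ maps each word to its column; columns are in alphabetical order.
vocab = sorted(toy_cv.vocabulary_)
dense = toy_counts.toarray().tolist()

print(vocab)  # ['great', 'room', 'terrible', 'view']
print(dense)  # [[2, 1, 0, 1], [0, 1, 1, 0]]
```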

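And now we fit it to our training data only, then use the fitted vocabulary to transform both the training and the testing data. Fitting on the test set would leak information from it into the features.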

In [12]:
X_traincv = cv.fit_transform(X_train)
X_testcv = cv.transform(X_test)
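
The introduction also mentions Tf-idf. A minimal sketch of swapping it in, assuming scikit-learn's `TfidfVectorizer` and invented toy reviews (`toy_train` and `toy_test` are hypothetical), follows the same fit-on-train, transform-on-test pattern. Words that appear only in the test data are not in the learned vocabulary and are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented reviews for illustration only.
toy_train = ["clean room friendly staff", "dirty room rude staff"]
toy_test = ["friendly staff amazing pool"]  # 'amazing' and 'pool' are unseen

tfidf = TfidfVectorizer()
train_tfidf = tfidf.fit_transform(toy_train)  # learns vocabulary and IDF weights
test_tfidf = tfidf.transform(toy_test)        # reuses them; unseen words are ignored

print(train_tfidf.shape, test_tfidf.shape)  # same number of columns
print("pool" in tfidf.vocabulary_)          # False
```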

Now we initialize our model instance, Complement Naive Bayes.

In [18]:
nbayes = ComplementNB()

And fit it to our training data.

In [19]:
nbayes.fit(X_traincv, y_train)

ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)
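
To see the fit and predict cycle in isolation, here is a sketch on a tiny invented count matrix (`toy_X` and `toy_labels` are hypothetical; think of the two columns as counts of two words). It is not the hotel model, only an illustration of the same `ComplementNB` calls.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# Invented word counts: class 0 rows favour column 0, class 1 rows favour column 1.
toy_X = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
toy_labels = np.array([0, 0, 1, 1])

toy_nb = ComplementNB()  # default smoothing, alpha=1.0
toy_nb.fit(toy_X, toy_labels)

# New samples dominated by one column are assigned to the matching class.
toy_pred = toy_nb.predict(np.array([[4, 0], [0, 4]]))
toy_proba = toy_nb.predict_proba(np.array([[4, 0]]))

print(toy_pred)
print(toy_proba)
```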

And run it over our testing data.

In [20]:
y_predictions_nbayes = list(nbayes.predict(X_testcv))

Let's see the results!

In [21]:
yp=["Genuine" if a==0 else "Deceptive" for a in y_predictions_nbayes]
output_fm = pd.DataFrame({'Review':list(X_test) ,'Genuine(0)/Deceptive(1)':yp})
print(output_fm)
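# target_names must follow label order (0 = Genuine, 1 = Deceptive); a set has no guaranteed order.
print(metrics.classification_report(y_test, y_predictions_nbayes, target_names=["Genuine", "Deceptive"]))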

                                                Review Genuine(0)/Deceptive(1)
0    This hotel has the worst rooms we have stayed ...                 Genuine
1    Terrible experience, I will not stay here agai...               Deceptive
2    Spent a wonderful night at the Amalfi with fri...                 Genuine
3    The concierge was so helpful. Without hesitati...                 Genuine
4    Oh my goodness, this has got to be one of the ...               Deceptive
5    I booked two rooms four months in advance at t...                 Genuine
6    My husband and I made a reservation at the She...               Deceptive
7    I travel to chicago quite often, and have stay...               Deceptive
8    My husband and I and our 10 month old took a q...                 Genuine
9    My Family and I went to this hotel on holiday ...                 Genuine
10   Stayed at the Chicago Affinia Dec 10 thru Dec ...                 Genuine
11   I was really excited to be visiting Chicago fo.

89% accuracy! Wow!
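
An accuracy figure like this can be computed directly with `metrics.accuracy_score`, and `metrics.confusion_matrix` shows where the errors fall. The sketch below uses invented labels and predictions (`demo_true` and `demo_pred` are hypothetical), not the notebook's actual results.

```python
from sklearn import metrics

# Invented labels and predictions (0 = Genuine, 1 = Deceptive).
demo_true = [0, 0, 1, 1, 1, 0, 1, 0]
demo_pred = [0, 1, 1, 1, 0, 0, 1, 0]

demo_accuracy = metrics.accuracy_score(demo_true, demo_pred)

# Rows are true labels, columns are predicted labels, both in the order [0, 1].
demo_confusion = metrics.confusion_matrix(demo_true, demo_pred, labels=[0, 1])

print(demo_accuracy)   # 0.75
print(demo_confusion)  # [[3 1]
                       #  [1 3]]
```

In the notebook itself, the same calls would take `y_test` and `y_predictions_nbayes` as arguments.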