# Naive Bayes Binary Classification: Proof of Concept

In this notebook I demonstrate Naive Bayes, a popular baseline ML technique, as a binary classifier that separates genuine from deceptive reviews in the small ground-truth Chicago hotel dataset.

The accuracies produced in this notebook should not be trusted: they come from a very small dataset (1,600 samples) with a tiny testing set (480 samples). The notebook is a proof of concept that uses simple bag-of-words features (word counts and TF-IDF weights) to give a baseline model to improve on with larger datasets and more features.

Without further ado, let's continue.

We start by importing the modules we need.

In [1]:
import os
import numpy as np
import pandas as pd

Now let's pull the dataset and process it into data frames using Pandas.

In [3]:
neg_deceptive_folder_path = r"../data/hotels/negative_polarity/deceptive_from_MTurk/"
neg_true_folder_path = r'../data/hotels/negative_polarity/truthful_from_Web/'
pos_deceptive_folder_path = r'../data/hotels/positive_polarity/deceptive_from_MTurk/'
pos_true_folder_path = r'../data/hotels/positive_polarity/truthful_from_TripAdvisor/'

sentiment_class = []
reviews = []
deceptive_class = []

# Visit the four review folders in the same order for each of the five folds.
sources = [
    ('negative', neg_deceptive_folder_path),
    ('negative', neg_true_folder_path),
    ('positive', pos_deceptive_folder_path),
    ('positive', pos_true_folder_path),
]

for i in range(1, 6):
    for sentiment, folder_path in sources:
        fold_path = folder_path + 'fold' + str(i)
        for data_file in sorted(os.listdir(fold_path)):
            sentiment_class.append(sentiment)
            # The filename prefix before the first underscore is the label: 'd' or 't'.
            deceptive_class.append(data_file.split('_')[0])
            with open(os.path.join(fold_path, data_file)) as f:
                reviews.append(f.read())


df = pd.DataFrame({'sentiment':sentiment_class,'review':reviews,'deceptive':deceptive_class})

df.loc[df['deceptive']=='d','deceptive']=1
df.loc[df['deceptive']=='t','deceptive']=0

df.loc[df['sentiment']=='positive','sentiment']=1
df.loc[df['sentiment']=='negative','sentiment']=0

data_x = df['review']

data_y = np.asarray(df['deceptive'],dtype=int)
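
To make the labelling step concrete, here is a small sketch of how the filename prefix becomes the 0/1 target. The filenames are hypothetical examples that only follow the 'd'/'t' prefix convention the loading code relies on, and label_from_filename is a helper name made up for illustration.

```python
# Hypothetical filenames following the prefix convention used above:
# 'd' marks a deceptive review, 't' a truthful one.
example_files = ['d_hotel_1.txt', 't_hotel_1.txt', 'd_hotel_2.txt']

def label_from_filename(filename):
    """Map the filename prefix to the numeric target: deceptive=1, truthful=0."""
    prefix = filename.split('_')[0]
    return {'d': 1, 't': 0}[prefix]

example_labels = [label_from_filename(name) for name in example_files]
print(example_labels)
```

A prefix other than 'd' or 't' raises a KeyError here, which is a deliberate choice for the sketch: an unexpected file fails loudly instead of silently ending up with a wrong label.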

Then we split our data into training and testing sets. In my experiments, a 70/30 split worked best. train_test_split shuffles the data by default; fixing the random_state parameter makes that shuffle reproducible from run to run.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3, random_state=1)
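
The role of the seed can be illustrated with a standard-library sketch. This is not how train_test_split is implemented; seeded_split is a made-up helper that only shows the idea of shuffling with a fixed seed and cutting off 30% for testing.

```python
import random

def seeded_split(items, test_size=0.3, seed=1):
    """Shuffle a copy of items with a fixed seed, then cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(1600))  # stand-ins for the 1600 reviews
train_a, test_a = seeded_split(samples, seed=1)
train_b, test_b = seeded_split(samples, seed=1)

print(len(train_a), len(test_a))  # sizes of the two partitions
print(test_a == test_b)           # the same seed reproduces the same split
```

With 1600 samples and a test size of 0.3, the split leaves 1120 samples for training and 480 for testing, which matches the test set size quoted at the top.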

We now want to initialize our feature extractors: CountVectorizer, which turns the reviews into vectors of word counts, and TfidfTransformer, which reweights those counts by TF-IDF.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

cv = CountVectorizer()
tfidf = TfidfTransformer()
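
For intuition, here is a minimal bag-of-words sketch using only the standard library. It is a simplification under my own assumptions, not CountVectorizer itself: the tokenizer, the helper names and the example sentences are all made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters (a simplification).
    return re.findall(r'\b\w\w+\b', text.lower())

def fit_vocabulary(docs):
    """Assign a column index to every distinct token seen in the training docs."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    """Count vocabulary tokens in doc; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocabulary]

train_docs = ['great hotel great location', 'the room was dirty']
vocabulary = fit_vocabulary(train_docs)
train_counts = [to_count_vector(doc, vocabulary) for doc in train_docs]

# 'terrible' and 'service' never appeared in training, so they are dropped.
test_counts = to_count_vector('great room terrible service', vocabulary)
print(list(vocabulary))
print(train_counts)
print(test_counts)
```

The point to notice is that the vocabulary is learned from the training documents only, so words that appear only in the test reviews contribute nothing to the test vectors.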

Now we fit both on the training data only, and then apply the fitted transformations to the testing data.

In [6]:
X_train_count = cv.fit_transform(X_train) # Learn the vocabulary from the training reviews and turn them into count vectors
X_test_count = cv.transform(X_test) # Only transform the test reviews, reusing the training vocabulary

X_train_tfidf = tfidf.fit_transform(X_train_count) # Learn IDF weights from the training counts and turn them into TF-IDF vectors
X_test_tfidf = tfidf.transform(X_test_count) # Transform the test count vectors to TF-IDF vectors using the training IDF weights
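
The TF-IDF step can be sketched by hand as well. The version below assumes a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit Euclidean length, which to my knowledge matches the documented scikit-learn defaults; the helper names and the toy count matrix are made up for illustration.

```python
import math

def fit_idf(count_rows):
    """Smoothed inverse document frequency per column, learned from training rows."""
    n_docs = len(count_rows)
    n_features = len(count_rows[0])
    idf = []
    for j in range(n_features):
        doc_freq = sum(1 for row in count_rows if row[j] > 0)
        idf.append(math.log((1 + n_docs) / (1 + doc_freq)) + 1)
    return idf

def to_tfidf(row, idf):
    """Weight counts by IDF, then scale the row to unit Euclidean length."""
    weighted = [count * w for count, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

toy_train_counts = [[2, 1, 0], [1, 0, 1], [1, 0, 0]]  # toy data, 3 docs x 3 words
idf = fit_idf(toy_train_counts)
toy_train_tfidf = [to_tfidf(row, idf) for row in toy_train_counts]
toy_test_tfidf = to_tfidf([1, 1, 1], idf)  # reuses the IDF learned on training rows
print(idf)
print(toy_test_tfidf)
```

The first column appears in every training document, so it gets the smallest IDF weight, while the rarer words in the other columns are weighted up. The test row is transformed with the IDF values learned from the training rows, which is the same fit-on-train, transform-on-test pattern used in the cell above.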

Now we initialize our model instance, Complement Naive Bayes.

In [7]:
from sklearn.naive_bayes import ComplementNB

nbayes = ComplementNB()

And fit it to our training data.

In [8]:
nbayes.fit(X_train_tfidf, y_train)

ComplementNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False)

And run it over our testing data.

In [9]:
y_predictions = list(nbayes.predict(X_test_tfidf))
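
To give a feel for what the model does during fit and predict, here is a simplified sketch of the Complement Naive Bayes idea: for each class, estimate smoothed word weights from all documents that do not belong to that class, then pick the class whose complement matches the input worst. It leaves out the optional weight normalization (the norm flag in the repr above) and everything else scikit-learn handles, and the function names and toy data are my own assumptions.

```python
import math

def fit_complement_nb(X, y, alpha=1.0):
    """For each class, log of smoothed feature frequencies over the OTHER classes."""
    n_features = len(X[0])
    weights = {}
    for c in sorted(set(y)):
        comp = [alpha] * n_features          # additive smoothing
        for row, label in zip(X, y):
            if label != c:                   # only documents outside class c
                for i, value in enumerate(row):
                    comp[i] += value
        total = sum(comp)
        weights[c] = [math.log(v / total) for v in comp]
    return weights

def predict_complement_nb(weights, row):
    """Return the class whose complement fits the row worst (lowest score)."""
    scores = {c: sum(v * w for v, w in zip(row, ws)) for c, ws in weights.items()}
    return min(scores, key=scores.get)

# Toy counts: column 0 is typical of class 0, column 1 of class 1.
toy_X = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
toy_y = [0, 0, 1, 1]
toy_weights = fit_complement_nb(toy_X, toy_y)
toy_predictions = [predict_complement_nb(toy_weights, r) for r in ([1, 0, 0], [0, 2, 0])]
print(toy_predictions)
```

Estimating from the complement of each class, rather than from the class itself, is the part that distinguishes this variant from plain multinomial Naive Bayes.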

Let's see the results!

In [10]:
from sklearn import metrics

yp=["Genuine" if a==0 else "Deceptive" for a in y_predictions]
output_fm = pd.DataFrame({'Review':list(X_test) ,'Genuine(0)/Deceptive(1)':yp})
print(output_fm)
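# target_names must be an ordered list matching the sorted labels: 0 = Genuine, 1 = Deceptive
print(metrics.classification_report(y_test, y_predictions, target_names=['Genuine', 'Deceptive']))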
print('Overall score: ', nbayes.score(X_test_tfidf, y_test))

                                                Review Genuine(0)/Deceptive(1)
0    I recently stayed at the Sofitel Hotel (Chicag...               Deceptive
1    I was unimpressed by the quality of this hotel...               Deceptive
2    I've been here for 4 days. Great location righ...                 Genuine
3    I recently visited Chicago. I stayed at the Ho...               Deceptive
4    Extravagant, Exuberant Experience! Our stay at...               Deceptive
5    I stayed here in the last weekend of September...               Deceptive
6    We booked the Amalfi looking for a great bouti...                 Genuine
7    I've stayed at several different hotels in Chi...               Deceptive
8    Stays here while on a business trip in Chicago...               Deceptive
9    I just had a conference there. They have bed b...               Deceptive
10   My family really enjoyed this hotel the weeken...               Deceptive
11   We had our hotel reservations at another hotel.

87% accuracy! Wow!
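
For reference, the numbers in a classification report come down to simple counting. The sketch below computes accuracy, precision and recall by hand on made-up labels; these toy values are not results from this notebook, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels only: 1 = Deceptive, 0 = Genuine.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_precision, toy_recall = precision_recall(toy_true, toy_pred, positive=1)
print(toy_accuracy, toy_precision, toy_recall)
```

Looking at precision and recall per class, and not only at the overall score, shows whether the classifier is leaning towards one label, which the preview of mostly "Deceptive" predictions above makes worth checking.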