# Training & Testing on Single Domain (Ground Truth)

This experiment uses a 'ground truth' labelled dataset of NYC restaurant reviews from Yelp. The dataset contains 359,052 reviews, which should be sufficient for training.

The aim of this experiment is to produce a benchmark against which we can compare our exploratory experiments. It is the first of several experiments using statistical modelling, all of which aim to establish that benchmark.

We will use Complement Naive Bayes to produce a model, and Bag of Words to convert our text features to usable predictor features.

First, to access our project files, we add the project directory to the Python path:

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '..'))

Our data is located in the following file:

In [2]:
data_file_path = 'data/yelpNYC'

Our data is in protobuf format, so we read it into a ReviewSet protocol buffer message.

In [3]:
from protos import review_set_pb2
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Let's take a look at our data. We use the following features:
* Review content: the text of the review itself.
* Date: when the user left the review.
* ID of the user that left the review.
* ID of the product the review was left on.

We also use the label (Fake = True, Genuine = False).

In [4]:
import pandas

frame_data = {
    "review content": [],
    "date": [],
    "label": []
}
for review in review_set.reviews:  
  frame_data["review content"].append(review.review_content)
  frame_data["date"].append(review.date)
  frame_data["label"].append(review.label)

data_frame = pandas.DataFrame(frame_data)
data_frame.head()

| | review content | date | label |
|---|---|---|---|
| 0 | The food at snack is a selection of popular Gr... | 2014-12-08 | True |
| 1 | This little place in Soho is wonderful. I had ... | 2013-05-16 | True |
| 2 | ordered lunch for 15 from Snack last Friday. ... | 2013-07-01 | True |
| 3 | This is a beautiful quaint little restaurant o... | 2011-07-28 | True |
| 4 | Snack is great place for a casual sit down lu... | 2010-11-01 | True |


It is useful to know whether fake and genuine reviews are balanced in our training set. If they are not, the classifier may be biased towards whichever class is more common.

In [5]:
reviews = review_set.reviews
fake_reviews = [x for x in reviews if x.label]
num_fake = len(fake_reviews)
print("Fake:", len(fake_reviews), "Genuine:", len(reviews)-num_fake)

Fake: 36885 Genuine: 322167
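
With counts this skewed, it helps to know what a trivial classifier would score. The sketch below uses the counts printed above to compute the accuracy of always predicting "genuine". The variable names are illustrative and not part of the notebook's pipeline.

```python
# Counts copied from the cell output above.
num_fake = 36885
num_genuine = 322167

total = num_fake + num_genuine

# Accuracy of a classifier that labels every review as genuine.
majority_baseline = num_genuine / total

print(f"Total reviews: {total}")
print(f"Always-genuine accuracy: {majority_baseline:.4f}")
```

Any accuracy measured on the full, unbalanced set is best read against this figure.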


There are many more genuine reviews than fake ones (nearly nine genuine reviews for every fake one). To avoid this bias, let's take an equal number of each.

In [6]:
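# Take the first num_fake genuine reviews, in file order.
# i ends up just past the last review taken; a later cell reuses it.
genuine_reviews = []
i = 0
while len(genuine_reviews) < num_fake:
  if not reviews[i].label:
    genuine_reviews.append(reviews[i])
  i += 1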
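
The loop above keeps the first genuine reviews it meets, so the sample follows file order. The first rows shown earlier mostly mention the same restaurant, which hints (as an assumption, not a verified fact) that the file is grouped by restaurant. If so, a random undersample may be more representative. A minimal sketch on toy data, where `undersample` and the `(text, is_fake)` tuples are hypothetical stand-ins for the notebook's review objects:

```python
import random

def undersample(items, is_fake, seed=0):
    """Return a shuffled list with equal numbers of fake and genuine items."""
    fake = [x for x in items if is_fake(x)]
    genuine = [x for x in items if not is_fake(x)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(fake), len(genuine))
    balanced = rng.sample(fake, k) + rng.sample(genuine, k)
    rng.shuffle(balanced)
    return balanced

# Toy data: 100 reviews, every tenth one marked as fake.
toy_reviews = [("review %d" % n, n % 10 == 0) for n in range(100)]
balanced = undersample(toy_reviews, is_fake=lambda r: r[1])
print(len(balanced), sum(1 for r in balanced if r[1]))
```

With this approach the unused genuine reviews would have to be tracked explicitly, since the index `i` from the loop no longer marks where the sample ends.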

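Next we will convert our reviews into features. Since we will be doing cross-validation, we prepare the entire sample set now and split it later. Note that the cell as recorded shuffles the full review list; the balanced alternative (genuine_reviews + fake_reviews) is commented out, which is why the feature matrix further down has 359,052 rows.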

In [28]:
from sklearn.utils import shuffle
X_reviews = shuffle(reviews)  # balanced alternative: shuffle(genuine_reviews + fake_reviews)

We will use Bag of Words to convert our review content into a format that can be used as a feature. In scikit-learn, the Bag of Words representation is created with a CountVectorizer. Any data we later predict on must be transformed with the same fitted vectorizer, so that its columns correspond to the vocabulary learned here.

In [29]:
X_review_content = [x.review_content for x in X_reviews]

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X_review_content)
X_counts.shape

(359052, 128280)
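
To make the idea concrete, here is a simplified pure-Python stand-in for what a Bag of Words transform does. It is not CountVectorizer's implementation (it splits on whitespace only), and `fit_vocabulary` and `to_counts` are hypothetical helper names:

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct lowercase token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: col for col, tok in enumerate(tokens)}

def to_counts(docs, vocabulary):
    """Turn each document into a row of token counts; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        seen = Counter(doc.lower().split())
        rows.append([seen.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["great food great service", "terrible food"]
vocabulary = fit_vocabulary(train_docs)
train_counts = to_counts(train_docs, vocabulary)

# A new document is mapped onto the same columns; "pizza" was never seen, so it is ignored.
new_counts = to_counts(["great pizza"], vocabulary)

print(vocabulary)
print(train_counts)
print(new_counts)
```

Each row has one column per vocabulary entry, which is why the real matrix above is so wide and why it is stored in sparse form.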

Next we convert the dates to numerical ordinals, so we can use them as a feature.

In [30]:
from datetime import datetime as dt
def extract_date_ordinals(reviews):
  return [dt.strptime(x.date, '%Y-%m-%d').date().toordinal() for x in reviews]

X_date_ordinals = extract_date_ordinals(X_reviews)
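
As a small illustration of what an ordinal is, the sketch below converts two dates from the preview table and shows that ordinals are plain day counts. The helper name `to_ordinal` is an assumption made for this example.

```python
from datetime import date, datetime

def to_ordinal(date_string):
    """Parse a 'YYYY-MM-DD' string and return its proleptic Gregorian ordinal."""
    return datetime.strptime(date_string, "%Y-%m-%d").date().toordinal()

first = to_ordinal("2014-12-08")
second = to_ordinal("2013-05-16")

# Ordinals are day counts, so subtracting them gives the gap in days.
gap_in_days = first - second

# The conversion is reversible.
round_trip = date.fromordinal(first)

print(first, second, gap_in_days, round_trip)
```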

Next we read the user IDs and product IDs directly; they are already numeric.

In [31]:
X_user_ids = [x.user_id for x in X_reviews]
X_product_ids = [x.product_id for x in X_reviews]

Now we put our features together. The sparse Bag of Words columns far outnumber our dense features (date, user ID, product ID), so we stack everything into a single sparse matrix that we can train and test on:

In [32]:
from scipy.sparse import coo_matrix, hstack
def format_column(features_row):
  return coo_matrix([[x] for x in features_row])

def stack_features(counts, ordinals, user_ids, product_ids):
  return hstack([counts, format_column(ordinals), format_column(user_ids), format_column(product_ids)])

predictor_data = stack_features(X_counts, X_date_ordinals, X_user_ids, X_product_ids)
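
The stacking step can be seen on a toy example. This sketch assumes NumPy and SciPy are available; the values and the helper name `as_sparse_column` are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def as_sparse_column(values):
    """Turn a flat list of numbers into an (n, 1) sparse column."""
    return csr_matrix(np.asarray(values).reshape(-1, 1))

# Toy stand-ins: 2 reviews, 3 vocabulary columns, plus two dense features each.
toy_counts = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
toy_user_ids = [10, 11]
toy_product_ids = [7, 7]

stacked = hstack([
    toy_counts,
    as_sparse_column(toy_user_ids),
    as_sparse_column(toy_product_ids),
]).tocsr()

print(stacked.shape)
print(stacked.toarray())
```

Each dense feature adds just one column next to the vocabulary columns. Naive Bayes models for count data treat every column value as a count, so a column of large raw numbers (ordinals or IDs) can weigh heavily on the prediction.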

Next we prepare the targets:

In [33]:
targets = [1 if x.label else 0 for x in X_reviews]

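We set out to use Complement Naive Bayes to generate our model. Note that the cell below, as recorded, instantiates MultinomialNB and keeps the variable name cnb.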

In [34]:
from sklearn.naive_bayes import ComplementNB, MultinomialNB
cnb = MultinomialNB()
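
Both classifiers share scikit-learn's `fit`/`predict` interface, so switching between them is a one-line change. The sketch below fits each on a small, made-up count matrix with an imbalanced label split; the data and variable names are assumptions, and the predictions say nothing about the Yelp results.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Toy count matrix: 6 "genuine" rows (label 0) and 2 "fake" rows (label 1).
X_toy = np.array([
    [3, 0, 1], [2, 0, 0], [4, 1, 0], [3, 0, 2], [2, 1, 1], [5, 0, 0],
    [0, 4, 1], [0, 3, 2],
])
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

models = {"multinomial": MultinomialNB(), "complement": ComplementNB()}

# Fit each model and predict on the same rows, purely to show the shared API.
predictions = {
    name: model.fit(X_toy, y_toy).predict(X_toy)
    for name, model in models.items()
}
for name, preds in predictions.items():
    print(name, preds)
```

The scikit-learn documentation describes ComplementNB as particularly suited to imbalanced data sets, which is the reason to consider it here.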

Alright! Now let's test what we have. We will use cross-validation here, splitting our set into 10 folds (cv=10).

In [35]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_data, targets, cv=10, return_train_score=False)

{'fit_time': array([1.12083054, 0.98920536, 0.98635817, 0.97608113, 0.97887635,
        0.9742465 , 0.97270703, 0.98200083, 0.97625303, 0.9753201 ]),
 'score_time': array([0.03546643, 0.03169847, 0.03196883, 0.03173232, 0.03199983,
        0.03211808, 0.03211379, 0.03218532, 0.03198695, 0.03186417]),
 'test_score': array([0.66136579, 0.6618114 , 0.6622013 , 0.66169999, 0.66058597,
        0.66180198, 0.66489347, 0.66048351, 0.65686275, 0.66045566])}

* With only review_content (Bag of Words) and date as features, the score was around 0.52. Adding user_id and product_id increased this to around 0.65.
* Reducing the number of genuine reviews to match the number of fake reviews then increased the accuracy to 0.88.
* Using Multinomial NB with the reduced set of genuine reviews, it beats Stanford's.

The accuracy increases a lot when the number of genuine reviews drops. This might be because, with a small enough sample, the classifier can simply mimic the behaviour of that sample. To check, I will look at how the classifier classifies all the remaining genuine reviews. Ideally it should label as few of them as possible as fake.

In [15]:
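# reviews[:i] was scanned when building the balanced set, so genuine reviews from index i onwards are unused.
spare_genuine_reviews = [x for x in reviews[i:] if not x.label]
print(len(spare_genuine_reviews))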

285282


In [16]:
cnb.fit(predictor_data, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
S_review_content = [s.review_content for s in spare_genuine_reviews]
S_counts = count_vect.transform(S_review_content)

S_date_ordinals = extract_date_ordinals(spare_genuine_reviews)
S_user_ids = [s.user_id for s in spare_genuine_reviews]
S_product_ids = [s.product_id for s in spare_genuine_reviews]

In [18]:
spare_predictor_data = stack_features(S_counts, S_date_ordinals, S_user_ids, S_product_ids)
cnb.predict(spare_predictor_data)

array([0, 0, 0, ..., 1, 1, 0])
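
The printed array is truncated, so it does not show how many spare genuine reviews were flagged. Since every review in this set is genuine, any prediction of 1 is a false positive. A minimal sketch of that count, using a short hypothetical list in place of the real output of `cnb.predict`:

```python
def false_positive_rate(predictions):
    """Fraction of predictions equal to 1 when every true label is 0."""
    predictions = list(predictions)
    if not predictions:
        raise ValueError("no predictions given")
    flagged = sum(1 for p in predictions if int(p) == 1)
    return flagged / len(predictions)

# Hypothetical predictions, NOT results from the notebook.
toy_predictions = [0, 0, 0, 1, 1, 0, 0, 0]
rate = false_positive_rate(toy_predictions)
print(f"{rate:.1%} of genuine reviews were classified as fake")
```

The same function accepts the NumPy array returned by `predict`, so it could be applied directly to the prediction above.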