# Training & Testing on Single Domain (Ground Truth)

This experiment uses a 'ground truth' labelled dataset of NYC restaurant reviews from Yelp. The dataset contains 359,052 reviews, which should be sufficient for training.

The aim of this experiment is to produce a benchmark against which we can compare our exploratory experiments. It is the first of several experiments using statistical modelling, all of which aim to establish that benchmark.

This time we will use:
* Bag of words to convert our review description to usable predictor features
* Date ordinals to convert our dates to usable predictor features
* Complement Naive Bayes to produce our model

First, to access our project files, we add the project directory (the parent of the current working directory) to Python's module search path:

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '..'))

Our data is located in the following file:

In [2]:
data_file_path = 'data/yelpNYC'

Our data is in protobuf format, so we read it into a ReviewSet protocol buffer message.

In [3]:
from protos import review_set_pb2
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Let's take a look at our data. We use the following features from our data:
* Review content: the text of the review
* Date the user left the review
* ID of the user that left the review
* ID of the product the review was left on

We also use the label (Fake = True, Genuine = False).

In [4]:
import pandas

frame_data = {
    "review content": [],
    "date": [],
    "user id": [],
    "product id": [],
    "label": []
}
for review in review_set.reviews:  
  frame_data["review content"].append(review.review_content)
  frame_data["date"].append(review.date)
  frame_data["user id"].append(review.user_id)
  frame_data["product id"].append(review.product_id)
  frame_data["label"].append(review.label)

data_frame = pandas.DataFrame(frame_data)
data_frame.head()

| | review content | date | user id | product id | label |
|---|---|---|---|---|---|
| 0 | The food at snack is a selection of popular Gr... | 2014-12-08 | 923 | 0 | True |
| 1 | This little place in Soho is wonderful. I had ... | 2013-05-16 | 924 | 0 | True |
| 2 | ordered lunch for 15 from Snack last Friday. ... | 2013-07-01 | 925 | 0 | True |
| 3 | This is a beautiful quaint little restaurant o... | 2011-07-28 | 926 | 0 | True |
| 4 | Snack is great place for a casual sit down lu... | 2010-11-01 | 927 | 0 | True |


Now we will shuffle our dataset. Since we will be doing cross-validation, we prepare the entire sample set here and split it later, during cross-validation.

In [5]:
from sklearn.utils import shuffle
X_reviews = shuffle(review_set.reviews)

Next we will convert our review content into features. We will use Bag of Words to convert the text to a usable format. In scikit-learn the Bag of Words format is created using a CountVectorizer.

In [6]:
X_review_content = [x.review_content for x in X_reviews]

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X_review_content)
X_counts.shape

(359052, 128280)
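
To make the Bag of Words step concrete, here is a minimal pure-Python sketch of the idea on two made-up reviews. It is not how CountVectorizer is implemented; the helper names (tokenize, bag_of_words) and the toy reviews are illustrative only. It mimics CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    # Lowercase first so "Great" and "great" count as the same word.
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # One column per distinct token, in alphabetical order.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    column = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(doc).items():
            row[column[token]] = count
        rows.append(row)
    return vocabulary, rows

# Hypothetical toy reviews, not taken from the dataset.
toy_reviews = ["Great food, great service", "The food was cold"]
vocabulary, rows = bag_of_words(toy_reviews)
print(vocabulary)  # ['cold', 'food', 'great', 'service', 'the', 'was']
print(rows)        # [[0, 1, 2, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
```

Each row has one column per vocabulary word, which is why the real matrix above has 128,280 columns: one per distinct token in the corpus.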

Next we convert the dates to numerical ordinals, so we can use them as a feature.

In [7]:
from datetime import datetime as dt
def extract_date_ordinals(reviews):
  return [dt.strptime(x.date, '%Y-%m-%d').date().toordinal() for x in reviews]

X_date_ordinals = extract_date_ordinals(X_reviews)
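
As a quick illustration of what the ordinal conversion gives us, the sketch below applies the same parsing to two consecutive days. The helper name to_ordinal is illustrative; the dates are just examples in the same format as the dataset.

```python
from datetime import date, datetime

def to_ordinal(date_string):
    # Same conversion as extract_date_ordinals, applied to a single string.
    return datetime.strptime(date_string, "%Y-%m-%d").date().toordinal()

first = to_ordinal("2014-12-08")
next_day = to_ordinal("2014-12-09")

print(next_day - first)         # 1
print(date.fromordinal(first))  # 2014-12-08
```

An ordinal is a day count, so consecutive dates differ by exactly one and the conversion can be reversed with date.fromordinal.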

Next we can simply read our user ids and product ids. They are already numbers.

In [8]:
X_user_ids = [x.user_id for x in X_reviews]
X_product_ids = [x.product_id for x in X_reviews]

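Now we put our features together. The sparse Bag of Words features (128,280 columns) far outnumber our dense features (date, user id and product id), so we convert each dense feature to a sparse column and stack everything into a single sparse matrix that we can train and test on: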

In [9]:
from scipy.sparse import coo_matrix, hstack
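def format_column(values):
  # Turn a flat list of n values into an (n, 1) sparse column.
  return coo_matrix([[x] for x in values])

def stack_features(counts, ordinals, user_ids, product_ids):
  # Append the three dense columns to the right of the Bag of Words matrix.
  return hstack([counts, format_column(ordinals), format_column(user_ids), format_column(product_ids)])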

predictor_data = stack_features(X_counts, X_date_ordinals, X_user_ids, X_product_ids)
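
The stacking step can be checked on a toy input. The sketch below is a small stand-in, assuming a 2 x 3 count matrix and made-up ordinals and ids; as_column and the toy_ variables are hypothetical names used only for this illustration.

```python
from scipy.sparse import coo_matrix, csr_matrix, hstack

def as_column(values):
    # A flat list of n values becomes an (n, 1) sparse column.
    return coo_matrix([[v] for v in values])

# Made-up stand-ins for the real features: 2 reviews, 3 vocabulary words.
toy_counts = csr_matrix([[0, 1, 2], [1, 0, 0]])
toy_ordinals = [735575, 734999]
toy_user_ids = [923, 924]
toy_product_ids = [0, 0]

stacked = hstack([
    toy_counts,
    as_column(toy_ordinals),
    as_column(toy_user_ids),
    as_column(toy_product_ids),
])

print(stacked.shape)      # (2, 6)
print(stacked.toarray())  # word counts first, then ordinal, user id, product id
```

The result keeps one row per review and gains one column per dense feature, so the real predictor_data should have three more columns than X_counts.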

Next we prepare the targets:

In [10]:
targets = [1 if x.label else 0 for x in X_reviews]

We will use Complement Naive Bayes to generate our model.

In [11]:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()

Alright! Now let's test what we have. We will use cross-validation here, splitting our set into 10 folds. By default, cross_validate scores a classifier by its accuracy.

In [12]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_data, targets, cv=10, return_train_score=False)

{'fit_time': array([1.18505359, 1.26089787, 1.03071284, 1.01075435, 1.01910639,
        1.06638956, 1.01255083, 1.01158953, 1.01455379, 1.02667451]),
 'score_time': array([0.05481863, 0.05596185, 0.03254151, 0.03274274, 0.03354454,
        0.03227592, 0.03401256, 0.03233457, 0.03264117, 0.03322887]),
 'test_score': array([0.66158859, 0.66055812, 0.65671476, 0.66189495, 0.66420654,
        0.66252611, 0.6621919 , 0.6637422 , 0.65884024, 0.65931373])}
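
The ten fold scores are easier to compare across experiments as a single mean and spread. A minimal sketch, with the test_score values copied from the output above (the variable names are illustrative):

```python
from statistics import mean, stdev

# test_score values copied from the cross_validate output above.
test_scores = [
    0.66158859, 0.66055812, 0.65671476, 0.66189495, 0.66420654,
    0.66252611, 0.6621919, 0.6637422, 0.65884024, 0.65931373,
]

mean_score = mean(test_scores)
score_spread = stdev(test_scores)  # sample standard deviation across folds

print(f"accuracy: {mean_score:.3f} +/- {score_spread:.3f}")
```

The mean comes out at roughly 0.66, with little variation between folds.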

* When the only features were review_content (Bag of Words) and date, the score was around 0.52. Adding user_id and product_id increased this to around 0.66.
* Reducing the number of genuine reviews to match the number of fake reviews increased the accuracy to around 0.88.
* Multinomial NB on the same reduced set of genuine reviews also scores around 0.88.
* Multinomial NB on all of the data scores around 0.66.
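
The notebook does not show how the genuine reviews were reduced. One possible way, sketched here as an assumption rather than the method actually used, is to randomly undersample the majority class. The function undersample_majority, its seed parameter, and the toy data are hypothetical.

```python
import random

def undersample_majority(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are the same size."""
    rng = random.Random(seed)
    fake = [i for i, label in enumerate(labels) if label == 1]
    genuine = [i for i, label in enumerate(labels) if label == 0]
    # Whichever class is smaller is kept whole; the other is sampled down.
    minority, majority = sorted((fake, genuine), key=len)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)  # avoid leaving all minority samples at the front
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 2 fake (1) and 8 genuine (0) reviews.
toy_samples = [f"review {i}" for i in range(10)]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

balanced_samples, balanced_labels = undersample_majority(toy_samples, toy_labels)
print(sorted(balanced_labels))  # [0, 0, 1, 1]
```

On a balanced set a classifier that always predicts one class scores 0.5, so accuracies on the balanced and the full dataset are measured against different baselines.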