# Training & Testing on Single Domain (Ground Truth)

This experiment uses a 'ground truth' labelled dataset of NYC restaurant reviews from Yelp. The dataset contains 359,052 reviews, which should be sufficient for training.

The aim of this experiment is to produce a benchmark against which we can compare our exploratory experiments. It is the first of several experiments using statistical modelling, all of which aim to establish that benchmark.

We will use Complement Naive Bayes to produce a model, and Bag of Words to convert our text features to usable predictor features.
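
To make the combination concrete before working with the real data, here is a minimal sketch on a tiny made-up corpus. The reviews, labels and `toy_` names are illustrative assumptions, not part of the Yelp dataset; only `CountVectorizer` and `ComplementNB` are the scikit-learn classes this experiment relies on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

# Hypothetical toy corpus: 1 = fake, 0 = genuine (made up for illustration).
toy_reviews = [
    "best pizza ever amazing amazing",
    "amazing best place ever",
    "the pasta was fine but service was slow",
    "decent lunch spot, a bit pricey",
]
toy_labels = [1, 1, 0, 0]

# Bag of Words: each review becomes a row of word counts.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# Complement Naive Bayes is trained directly on the sparse count matrix.
toy_model = ComplementNB()
toy_model.fit(toy_counts, toy_labels)

# New text must be transformed with the same fitted vectorizer.
toy_prediction = toy_model.predict(toy_vectorizer.transform(["amazing best pizza"]))
```

The same fit-then-predict shape is what the rest of this experiment builds towards, just with far more rows and one extra date column.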

First, to access our project files, we add the project directory to Python's module search path (sys.path):

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '..'))
from protos import review_set_pb2

Our data is located in the following file:

In [2]:
data_file_path = 'data/yelpNYC'

Our data is in protobuf format, so we parse it into a ReviewSet protobuf message.

In [3]:
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Let's take a look at our data. At this stage we use three fields from our data:
* Review Content. The actual text description of the restaurant.
* Date. The date the user left the review.
* Label (Fake = True, Genuine = False)

In [4]:
import pandas

frame_data = {
    "review content": [],
    "date": [],
    "label": []
}
for review in review_set.reviews:  
  frame_data["review content"].append(review.review_content)
  frame_data["date"].append(review.date)
  frame_data["label"].append(review.label)

data_frame = pandas.DataFrame(frame_data)
data_frame.head()

| | review content | date | label |
|---|---|---|---|
| 0 | The food at snack is a selection of popular Gr... | 2014-12-08 | True |
| 1 | This little place in Soho is wonderful. I had ... | 2013-05-16 | True |
| 2 | ordered lunch for 15 from Snack last Friday. ... | 2013-07-01 | True |
| 3 | This is a beautiful quaint little restaurant o... | 2011-07-28 | True |
| 4 | Snack is great place for a casual sit down lu... | 2010-11-01 | True |


Next we will split our sample into a training set and a test set:

In [37]:
from sklearn.model_selection import train_test_split
training_set, test_set = train_test_split(review_set.reviews)
"Training set size:", len(training_set), "Test set size:", len(test_set)

('Training set size:', 269289, 'Test set size:', 89763)

Next we will convert our review content to a format that can be used as a feature, using Bag of Words. In scikit-learn, a Bag of Words is created using a CountVectorizer. We fit the vectorizer on the training set only, then transform the test set with the same fitted vocabulary, so that its columns correspond to the Bag of Words created for the training set.

In [6]:
get_review_content = lambda review: review.review_content
training_set_review_content = map(get_review_content, training_set)
test_set_review_content = map(get_review_content, test_set)

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(training_set_review_content)
test_counts = count_vect.transform(test_set_review_content)
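
As a small illustration of why the test set is only transformed and never fitted, the sketch below uses two made-up training sentences and one made-up test sentence (the `example_` names and texts are assumptions for illustration). Words the vectorizer did not see during fitting, such as "cocktails" here, are dropped, so the test matrix keeps exactly the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

example_vect = CountVectorizer()

# Fitting learns the vocabulary (one column per distinct word) and counts occurrences.
example_train = example_vect.fit_transform(["great food great service", "slow service"])

# Transforming reuses that vocabulary; unseen words are ignored.
example_test = example_vect.transform(["great cocktails"])

vocabulary = sorted(example_vect.vocabulary_)  # column order is alphabetical
train_dense = example_train.toarray()
test_dense = example_test.toarray()

print(vocabulary)   # ['food', 'great', 'service', 'slow']
print(train_dense)  # rows: [1 2 1 0] and [0 0 1 1]
print(test_dense)   # row:  [0 1 0 0]
```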

Next we convert the dates to numerical ordinals (a count of days), so we can use them as a feature:

In [26]:
from datetime import datetime as dt

to_date_ordinal = lambda review: [dt.strptime(review.date, '%Y-%m-%d').date().toordinal()]
train_date_ordinals = list(map(to_date_ordinal, training_set))
test_date_ordinals = list(map(to_date_ordinal, test_set))
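
For intuition about what an ordinal is, the short sketch below converts one date string in the same '%Y-%m-%d' format and back again. The specific dates are only examples. Because an ordinal is a day count, the difference between two ordinals is the number of days between the two dates.

```python
from datetime import date, datetime as dt

# Parse a date string and convert it to its ordinal (days since 0001-01-01, which is day 1).
example_ordinal = dt.strptime("2014-12-08", "%Y-%m-%d").date().toordinal()

# The conversion is reversible.
round_trip = date.fromordinal(example_ordinal)

# Subtracting ordinals gives the gap in days.
earlier_ordinal = dt.strptime("2014-12-01", "%Y-%m-%d").date().toordinal()
days_apart = example_ordinal - earlier_ordinal
```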

Now we put our features together. The sparse Bag of Words features vastly outnumber our single dense feature (date), so we append the date as one extra column and keep everything in a single sparse matrix that we can train and test on:

In [36]:
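from scipy.sparse import coo_matrix, hstack
# Append the date ordinal as one extra column to the Bag of Words counts.
predictor_data = hstack([train_counts, coo_matrix(train_date_ordinals)])
test_predictor_data = hstack([test_counts, coo_matrix(test_date_ordinals)])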
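
The stacking step can be checked on a tiny hand-written example. The matrix values and ordinals below are made up, and the `example_` names are illustrative assumptions. The result has one more column than the count matrix, with the date ordinal in the last position of each row.

```python
from scipy.sparse import coo_matrix, csr_matrix, hstack

# Hypothetical word counts for two reviews over a three-word vocabulary.
example_counts = csr_matrix([[1, 0, 2], [0, 3, 0]])

# One date ordinal per review, shaped as a column (a list of single-item lists).
example_dates = [[735575], [735000]]

# hstack joins the blocks side by side; the row counts must match.
example_combined = hstack([example_counts, coo_matrix(example_dates)]).tocsr()

print(example_combined.shape)      # (2, 4)
print(example_combined.toarray())  # rows: [1 0 2 735575] and [0 3 0 735000]
```

Converting to CSR with `tocsr()` is optional here; it is included because row slicing is not supported on the COO format that `hstack` returns by default.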

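Finally, we prepare the targets. We take them from the training set, not the full review set, so that they line up row for row with predictor_data:

In [42]:
# Build targets from the training split so their length matches predictor_data.
targets = [1 if x.label else 0 for x in training_set]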
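
Whatever the targets are built from, they need exactly one entry per row of the predictor matrix they will be trained with. The sketch below shows a simple length check using hypothetical stand-in reviews; the `SimpleNamespace` objects and `stand_in_` names are assumptions that only mimic the `review_content` and `label` fields of the real protobuf reviews.

```python
from types import SimpleNamespace
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for protobuf reviews: eight items, half labelled fake.
stand_in_reviews = [
    SimpleNamespace(review_content="review %d" % i, label=(i % 2 == 0))
    for i in range(8)
]

# The default split holds out 25% of the samples for testing.
stand_in_training, stand_in_test = train_test_split(stand_in_reviews, random_state=0)

# Targets are derived from each split separately, so lengths always agree.
stand_in_train_targets = [1 if r.label else 0 for r in stand_in_training]
stand_in_test_targets = [1 if r.label else 0 for r in stand_in_test]

assert len(stand_in_train_targets) == len(stand_in_training)
assert len(stand_in_test_targets) == len(stand_in_test)
```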