Following Experiment 1, we now look for better features, taking inspiration from existing research done at Stanford. We will derive the same features and attempt to replicate their benchmark. [Paper here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjf1Pbo6ZveAhUKBsAKHbeSAicQFjAAegQIBhAC&url=http%3A%2F%2Fcs229.stanford.edu%2Fproj2017%2Ffinal-reports%2F5229663.pdf&usg=AOvVaw1SAoqP8hAkRiRJH9lwmeEn)

First, the same setup as Experiment 1:

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '..'))
data_file_path = 'data/yelpNYC'
from protos import review_set_pb2, review_pb2
import nltk
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Our first features are the structural features: length of the review, average word length, number of sentences, average sentence length, percentage of numerals, and percentage of capitalized words:

In [2]:
from exp2_feature_extraction import find_avg_token_length, find_numerals_ratio
from exp2_feature_extraction import find_capitalised_word_ratio

find_words = lambda text: nltk.tokenize.word_tokenize(text)

def structural_features(review):
  review_content = review.review_content
  length_of_review = len(review_content)
  words = find_words(review_content)
  avg_word_length = find_avg_token_length(words)
  sentences = nltk.tokenize.sent_tokenize(review_content)
  sentence_length_of_review = len(sentences)
  avg_sentence_length = find_avg_token_length(sentences)
  numerals_ratio = find_numerals_ratio(words)
  capitalised_word_ratio = find_capitalised_word_ratio(words)
  return (length_of_review, avg_word_length, sentence_length_of_review,
          avg_sentence_length, numerals_ratio, capitalised_word_ratio)
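
The helpers imported from `exp2_feature_extraction` are not shown in this notebook. As a rough sketch of what they might compute, the versions below are hypothetical stand-ins written only from the feature descriptions above; the real implementations may differ (for example, in how they treat punctuation tokens or empty input).

```python
# Hypothetical stand-ins for the helpers in exp2_feature_extraction.
# These are assumptions based on the feature names, not the real code.

def find_avg_token_length(tokens):
    """Mean length in characters of the tokens (0 if there are none)."""
    if not tokens:
        return 0
    return sum(len(t) for t in tokens) / len(tokens)

def find_numerals_ratio(tokens):
    """Fraction of tokens made up only of digits."""
    if not tokens:
        return 0
    return sum(1 for t in tokens if t.isdigit()) / len(tokens)

def find_capitalised_word_ratio(tokens):
    """Fraction of tokens whose first character is uppercase."""
    if not tokens:
        return 0
    return sum(1 for t in tokens if t[:1].isupper()) / len(tokens)

# Small hand-made example, already tokenized.
tokens = ["The", "pizza", "cost", "12", "dollars", "in", "NYC"]
print(find_avg_token_length(tokens))
print(find_numerals_ratio(tokens))
print(find_capitalised_word_ratio(tokens))
```

Because the same `find_avg_token_length` is applied to both words and sentences, the "average sentence length" feature in this sketch is measured in characters per sentence, not words per sentence.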

Next come the part-of-speech features. There are 36 part-of-speech categories; descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b).

In [3]:
# Tokenize with find_words (defined above), then tag each token.
pos_tags = lambda text: nltk.pos_tag(find_words(text))
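
The tagger returns a list of `(word, tag)` pairs, which still has to be turned into a fixed-length feature vector. One possible way to do that, sketched here as an assumption and not taken from the paper or from this notebook, is to compute the share of tokens carrying each of the 36 Penn Treebank tags. The name `pos_tag_ratios` is hypothetical, and the example input is hand-tagged so the sketch runs without the NLTK tagger models.

```python
from collections import Counter

# The 36 Penn Treebank part-of-speech tags (punctuation tags excluded).
PENN_TAGS = [
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN",
    "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS",
    "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT",
    "WP", "WP$", "WRB",
]

def pos_tag_ratios(tagged_tokens):
    """Hypothetical helper: map (word, tag) pairs to 36 tag ratios.

    Each ratio is the count of one tag divided by the total number of
    tokens, so punctuation tokens count toward the denominator only.
    """
    total = len(tagged_tokens)
    if total == 0:
        return [0.0] * len(PENN_TAGS)
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[tag] / total for tag in PENN_TAGS]

# Hand-tagged stand-in for the output of nltk.pos_tag.
tagged = [("The", "DT"), ("food", "NN"), ("was", "VBD"), ("great", "JJ"), (".", ".")]
ratios = pos_tag_ratios(tagged)
print(len(ratios))
print(ratios[PENN_TAGS.index("NN")])
```

Ratios are non-negative, so a vector like this would stay compatible with a naive Bayes variant that requires non-negative inputs.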

Next are semantic features.

We skip these for now: we want an end-to-end pipeline first, to see how much these features matter at all.

...

Now we compute the reviewer features. These include the maximum number of reviews in a day, *percentage of positive / negative reviews (MISSING)*, average review length, and the standard deviation of the reviewer's ratings:

In [4]:
from exp2_feature_extraction import max_date_occurrences
import statistics

def reviewer_features(review, reviews_by_reviewer):
  # All reviews written by the author of this review.
  reviews = reviews_by_reviewer[review.user_id]
  max_reviews_in_day = max_date_occurrences(reviews)
  average_review_length = sum(len(r.review_content) for r in reviews) / len(reviews)
  # statistics.stdev needs at least two data points, so fall back to 0.
  ratings_stdev = 0 if len(reviews) == 1 else statistics.stdev([r.rating for r in reviews])
  return (max_reviews_in_day, average_review_length, ratings_stdev)
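
The two helpers this step relies on, `max_date_occurrences` and `reviews_by_reviewer`, live in `exp2_feature_extraction` and are not shown here. A minimal sketch of what they might look like follows. It assumes each review exposes a `date` field (that field name is an assumption; only `user_id`, `rating`, `review_content` and `label` appear in this notebook), and it uses a plain `namedtuple` in place of the protobuf message.

```python
from collections import Counter, defaultdict, namedtuple

# Stand-in for the protobuf Review message; the `date` field is assumed.
Review = namedtuple("Review", ["user_id", "date", "rating", "review_content"])

def reviews_by_reviewer(reviews):
    """Hypothetical version: group reviews into {user_id: [reviews]}."""
    grouped = defaultdict(list)
    for r in reviews:
        grouped[r.user_id].append(r)
    return grouped

def max_date_occurrences(reviews):
    """Hypothetical version: largest number of reviews sharing one date."""
    counts = Counter(r.date for r in reviews)
    return max(counts.values()) if counts else 0

sample = [
    Review("u1", "2014-05-01", 5, "Great bagels."),
    Review("u1", "2014-05-01", 4, "Good coffee too."),
    Review("u1", "2014-06-12", 1, "Terrible service."),
    Review("u2", "2014-05-01", 3, "It was fine."),
]
grouped = reviews_by_reviewer(sample)
print(len(grouped["u1"]))
print(max_date_occurrences(grouped["u1"]))
```

Grouping once up front and passing the resulting map into `reviewer_features` avoids rescanning the whole review set for every review.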

Now we put our features together:

In [24]:
from exp2_feature_extraction import reviews_by_reviewer
from sklearn.utils import shuffle
reviews = shuffle(review_set.reviews)
reviews_reviewer_map = reviews_by_reviewer(reviews)
predictor_features = [list(structural_features(x)) + list(reviewer_features(x, reviews_reviewer_map)) for x in reviews]

In [25]:
targets = [x.label for x in reviews]

In [26]:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()

In [27]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=10, return_train_score=False)

{'fit_time': array([0.44158745, 0.49054933, 0.53898025, 0.54834628, 0.62135291,
        0.56310439, 0.66381502, 0.60599899, 0.67143273, 0.53081226]),
 'score_time': array([0.04401064, 0.04098248, 0.05866456, 0.05319452, 0.08195329,
        0.05859399, 0.05887151, 0.06138062, 0.0479188 , 0.05649495]),
 'test_score': array([0.53305854, 0.53576004, 0.53676266, 0.53161032, 0.53381051,
        0.52987049, 0.53488372, 0.53576203, 0.53428587, 0.53503788])}
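
Since no `scoring` argument was passed, `cross_validate` falls back to the estimator's own `score` method, which for a classifier such as `ComplementNB` is mean accuracy. A small sketch for condensing the ten fold scores printed above into a single mean and spread (the variable names are illustrative):

```python
import statistics

# The ten 'test_score' values copied from the cross_validate output above.
test_scores = [
    0.53305854, 0.53576004, 0.53676266, 0.53161032, 0.53381051,
    0.52987049, 0.53488372, 0.53576203, 0.53428587, 0.53503788,
]

mean_score = statistics.mean(test_scores)
score_stdev = statistics.stdev(test_scores)  # sample standard deviation
print(f"mean accuracy: {mean_score:.4f} (+/- {score_stdev:.4f})")
```

The fold scores sit in a narrow band between roughly 0.530 and 0.537, so the estimate is stable across folds even though the accuracy itself is low.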