Following Experiment 1, we now look for better features, taking inspiration from existing research done at Stanford. We will derive the same features as that work and attempt to replicate its benchmark. [Paper here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjf1Pbo6ZveAhUKBsAKHbeSAicQFjAAegQIBhAC&url=http%3A%2F%2Fcs229.stanford.edu%2Fproj2017%2Ffinal-reports%2F5229663.pdf&usg=AOvVaw1SAoqP8hAkRiRJH9lwmeEn)

First, the same setup as Experiment 1:

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '..'))
data_file_path = 'data/yelpNYC'
from protos import review_set_pb2, review_pb2
import nltk
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Our first features are the structural features: length of the review, average word length, number of sentences, average sentence length, percentage of numerals, and percentage of capitalized words:

In [2]:
from exp2_feature_extraction import find_avg_token_length, find_numerals_ratio
from exp2_feature_extraction import find_capitalised_word_ratio

find_words = lambda text: nltk.tokenize.word_tokenize(text)

def structural_features(review):
  review_content = review.review_content
  length_of_review = len(review_content)
  words = find_words(review_content)
  avg_word_length = find_avg_token_length(words)
  sentences = nltk.tokenize.sent_tokenize(review_content)
  sentence_length_of_review = len(sentences)
  avg_sentence_length = find_avg_token_length(sentences)
  numerals_ratio = find_numerals_ratio(words)
  capitalised_word_ratio = find_capitalised_word_ratio(words)
  return (length_of_review, avg_word_length, sentence_length_of_review,
          avg_sentence_length, numerals_ratio, capitalised_word_ratio)
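
The three helpers imported from `exp2_feature_extraction` are not shown in this notebook. The sketch below is one plausible way they could be written, not the actual module: the function bodies are assumptions, in particular that a numeral is a token made only of digits and that a capitalised word is a token starting with an uppercase letter. Since `find_avg_token_length` is also applied to sentences, the average sentence length would then be measured in characters rather than words.

```python
# Hypothetical stand-ins for the helpers in exp2_feature_extraction.
# The real implementations may differ.

def find_avg_token_length(tokens):
    """Mean length in characters of the given tokens (words or sentences)."""
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


def find_numerals_ratio(words):
    """Fraction of tokens made up only of digits (assumed definition)."""
    if not words:
        return 0.0
    return sum(1 for word in words if word.isdigit()) / len(words)


def find_capitalised_word_ratio(words):
    """Fraction of tokens whose first character is uppercase (assumed definition)."""
    if not words:
        return 0.0
    return sum(1 for word in words if word[:1].isupper()) / len(words)
```

Each helper returns 0.0 for an empty token list so that an empty review does not raise a `ZeroDivisionError`.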

Next are the part-of-speech (POS) features. There are 36 part-of-speech categories; descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b).

In [3]:
# Tag each token with its part of speech; returns a list of (token, tag) pairs.
pos_tags = lambda text: nltk.pos_tag(find_words(text))
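
The notebook does not yet turn the tagger output into numeric features. A minimal sketch of one way to do it is below: the share of tokens carrying each of the 36 Penn Treebank tags. The names `PENN_TREEBANK_TAGS` and `pos_features` are hypothetical, and using tag ratios (rather than raw counts) is an assumption, not something confirmed by the paper.

```python
from collections import Counter

# The 36 Penn Treebank part-of-speech tags, in a fixed order so that every
# review yields a feature vector with the same layout.
PENN_TREEBANK_TAGS = [
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN",
    "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS",
    "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT",
    "WP", "WP$", "WRB",
]


def pos_features(tagged_tokens):
    """Return the share of tokens carrying each of the 36 tags.

    tagged_tokens is a list of (token, tag) pairs, the shape that
    nltk.pos_tag returns. Punctuation tags such as '.' count towards the
    total but have no slot of their own in the vector.
    """
    total = len(tagged_tokens)
    if total == 0:
        return tuple(0.0 for _ in PENN_TREEBANK_TAGS)
    counts = Counter(tag for _, tag in tagged_tokens)
    return tuple(counts[tag] / total for tag in PENN_TREEBANK_TAGS)
```

With this in place, `pos_features(pos_tags(review.review_content))` would give 36 extra values per review.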

Next are sentiment features.

In [9]:
#import nltk
#nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
def sentiment_features(review):
  # Count words with a positive or negative VADER compound score; neutral words are ignored.
  num_positive = 0
  num_negative = 0
  for word in find_words(review.review_content):
    polarity = sentiment_analyzer.polarity_scores(word)['compound']
    if polarity > 0:
      num_positive += 1
    elif polarity < 0:
      num_negative += 1
  total = num_positive + num_negative
  if total == 0:
    return (0, 0)
  # Share of positive and of negative words among all polar words.
  return (num_positive / total, num_negative / total)

Now we compute the reviewer features: the maximum number of reviews in a day, *percentage of positive / negative reviews (MISSING)*, average review length, and the standard deviation of the reviewer's ratings:

In [10]:
from exp2_feature_extraction import max_date_occurrences
import statistics

def reviewer_features(review, reviews_by_reviewer):
  # All reviews written by the author of this review.
  reviews = reviews_by_reviewer[review.user_id]
  max_reviews_in_day = max_date_occurrences(reviews)
  average_review_length = sum(len(r.review_content) for r in reviews) / len(reviews)
  # The sample standard deviation is undefined for a single rating, so use 0.
  ratings_stdev = 0 if len(reviews) == 1 else statistics.stdev([r.rating for r in reviews])
  return (max_reviews_in_day, average_review_length, ratings_stdev)
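
The grouping and per-day counting helpers live in `exp2_feature_extraction` and are not shown here, and the positive / negative percentage is still missing. The sketch below guesses at what the helpers do and shows one way the missing feature could be computed. It assumes each review has a `date` field (the field name is hypothetical), and that ratings of 4 and above count as positive and 2 and below as negative (these thresholds are assumptions, not taken from the paper). `rating_polarity_ratios` is a hypothetical name.

```python
from collections import Counter, defaultdict, namedtuple


def reviews_by_reviewer(reviews):
    """Group reviews into a dict keyed by user_id."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review.user_id].append(review)
    return grouped


def max_date_occurrences(reviews):
    """Largest number of reviews sharing one date (assumes a `date` field)."""
    if not reviews:
        return 0
    return max(Counter(review.date for review in reviews).values())


def rating_polarity_ratios(reviews, positive_min=4, negative_max=2):
    """Share of a reviewer's ratings that are positive and negative.

    The thresholds are assumptions; ratings between them count as neither.
    """
    if not reviews:
        return (0.0, 0.0)
    positive = sum(1 for r in reviews if r.rating >= positive_min)
    negative = sum(1 for r in reviews if r.rating <= negative_max)
    return (positive / len(reviews), negative / len(reviews))


# Stand-in for the protobuf Review message, with only the fields used above.
Review = namedtuple("Review", ["user_id", "date", "rating"])
sample = [
    Review("u1", "2014-01-01", 5),
    Review("u1", "2014-01-01", 1),
    Review("u1", "2014-02-03", 4),
    Review("u2", "2014-01-01", 3),
]
grouped = reviews_by_reviewer(sample)
print(max_date_occurrences(grouped["u1"]))  # 2
print(rating_polarity_ratios(grouped["u1"]))
```

The two ratios could be appended to the tuple returned by `reviewer_features` once the thresholds have been checked against the paper.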

Now we put our features together:

In [None]:
from exp2_feature_extraction import reviews_by_reviewer
from sklearn.utils import shuffle
reviews = shuffle(review_set.reviews)
reviews_reviewer_map = reviews_by_reviewer(reviews)

predictor_features = []
i = 0
for review in reviews:
  next_entry = []
  next_entry += list(structural_features(review))
  next_entry += list(sentiment_features(review))
  next_entry += list(reviewer_features(review, reviews_reviewer_map))
  predictor_features.append(next_entry)
  i+=1
  if i % 1000 == 0:
    print(i)

1000
2000
...
60000


In [25]:
print(predictor_features[:10])

[[1113, 3.9012875536480687, 15, 73.2, 0.0, 0.11587982832618025, 0.9166666666666666, 0.08333333333333333, 3, 802.0, 0.5773502691896257], [820, 3.4123711340206184, 8, 101.375, 0.015463917525773196, 0.12886597938144329, 0.6153846153846154, 0.38461538461538464, 1, 811.25, 0.5773502691896257], [232, 4.0638297872340425, 6, 37.666666666666664, 0.0, 0.1702127659574468, 0.7777777777777778, 0.2222222222222222, 1, 221.5, 0.5], [941, 4.085106382978723, 7, 132.28571428571428, 0.005319148936170213, 0.15425531914893617, 0.6923076923076923, 0.3076923076923077, 2, 878.25, 1.2817398889233114], [345, 4.0, 5, 68.0, 0.0, 0.1111111111111111, 1.0, 0.0, 1, 563.5, 0.9574271077563381], [612, 4.235294117647059, 8, 75.5, 0.0, 0.09243697478991597, 0.9230769230769231, 0.07692307692307693, 1, 612.0, 0], [307, 4.080645161290323, 3, 101.0, 0.0, 0.04838709677419355, 0.6, 0.4, 1, 699.6666666666666, 0.5773502691896257], [556, 3.616, 6, 91.66666666666667, 0.0, 0.128, 0.8, 0.2, 3, 558.0, 0.6887372317211945], [1163, 4.06779

In [26]:
targets = [x.label for x in reviews]

In [27]:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()

In [29]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=10, return_train_score=False)

  self.class_log_prior_ = (np.log(self.class_count_) -


{'fit_time': array([0.021451  , 0.00132608, 0.0012238 , 0.00122094, 0.0011251 ,
        0.00103354, 0.001127  , 0.00131989, 0.0012095 , 0.00104022]),
 'score_time': array([0.00076175, 0.00043392, 0.00041699, 0.00041246, 0.00041604,
        0.00036812, 0.00056982, 0.00045633, 0.0004456 , 0.00093961]),
 'test_score': array([0.66666667, 0.        , 1.        , 0.        , 1.        ,
        0.5       , 1.        , 0.5       , 0.5       , 1.        ])}
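
To read the ten fold scores as a single number, we can average them. The sketch below uses only the `test_score` values printed above; for a classifier, `cross_validate` reports accuracy by default. The variable names are illustrative.

```python
import statistics

# The ten per-fold test scores copied from the cross_validate output above.
test_scores = [0.66666667, 0.0, 1.0, 0.0, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0]

mean_score = statistics.mean(test_scores)
score_stdev = statistics.stdev(test_scores)  # sample standard deviation
print(f"mean accuracy: {mean_score:.3f} (stdev {score_stdev:.3f})")
```

The mean comes out at about 0.617, with a large spread between folds. Scores of exactly 0, 0.5, 0.667 and 1 are what we would expect if each test fold held only two or three reviews, so it is worth checking how many rows `predictor_features` contained when this cell was run.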