Following experiment 1, we now want to try to find better features. This will take inspiration from existing research done at Stanford. We will derive the same features and attempt to replicate the same benchmark as them. [Paper here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjf1Pbo6ZveAhUKBsAKHbeSAicQFjAAegQIBhAC&url=http%3A%2F%2Fcs229.stanford.edu%2Fproj2017%2Ffinal-reports%2F5229663.pdf&usg=AOvVaw1SAoqP8hAkRiRJH9lwmeEn)

First the same setup as Experiment 1:

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '..'))
data_file_path = 'data/yelpNYC'
from protos import review_set_pb2, review_pb2
import nltk
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Our first features will be the structural features. These are length of the review, average word length, number of sentences, average sentence length, percentage of numerals, percentage of capitalized words:

In [2]:
from exp2_feature_extraction import find_avg_token_length, find_numerals_ratio
from exp2_feature_extraction import find_capitalised_word_ratio

find_words = lambda text: nltk.tokenize.word_tokenize(text)

def structural_features(review):
  review_content = review.review_content
  length_of_review = len(review_content)
  words = find_words(review_content)
  avg_word_length = find_avg_token_length(words)
  sentences = nltk.tokenize.sent_tokenize(review_content)
  sentence_length_of_review = len(sentences)
  avg_sentence_length = find_avg_token_length(sentences)
  numerals_ratio = find_numerals_ratio(words)
  capitalised_word_ratio = find_capitalised_word_ratio(words)
  return (length_of_review, avg_word_length, sentence_length_of_review,
          avg_sentence_length, numerals_ratio, capitalised_word_ratio)

And then the Part of Speech features. There are 36 part of speech categories. Descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)

In [3]:
pos_tags = lambda text: nltk.pos_tag(words(text))

Next are sentiment features.

In [4]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import functools
sentiment_analyzer = SentimentIntensityAnalyzer()
def sentiment_features(review):
  polarities = []
  num_positive = 0
  num_negative = 0
  for word in find_words(review.review_content):
    polarity = sentiment_analyzer.polarity_scores(word)['compound']
    polarities.append(1 if polarity > 0 else -1 if polarity < 0 else 0)
  reduce_func = lambda c, p: (c[0] + 1, c[1]) if p > 0 else (c[0], c[1] + 1) if p < 0 else (c[0], c[1])
  num_positive, num_negative = functools.reduce(reduce_func, polarities, (0, 0))
  total = num_positive + num_negative
  if total == 0:
    return (0, 0)
  return (num_positive / total, num_negative / total)



Next the topic model features from LDA

In [5]:
all_reviews = review_set.reviews

In [6]:
import gensim

reviews_words = []
for review in all_reviews:
  reviews_words.append(find_words(review.review_content))
dictionary = gensim.corpora.Dictionary(reviews_words)

In [7]:
bow_corpus = [dictionary.doc2bow(doc) for doc in reviews_words]

In [8]:
lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=10, id2word=dictionary, passes=2)

In [9]:
def topic_features(review):
  words = find_words(review.review_content)
  bow = dictionary.doc2bow(words)
  topics = lda_model.get_document_topics(bow)
  t = [0] * 10
  for topic in topics:
    t[topic[0]] = topic[1]
  return t

Now we will find the reviewer features. This includes the maximum number of reviews in a day, average review length, standard deviation of reviewer's ratings

In [10]:
from exp2_feature_extraction import find_capitalised_word_ratio
from exp2_feature_extraction import max_date_occurrences
import statistics
import functools

def reviewer_features(review, reviews_by_reviewer):
  reviews = reviews_by_reviewer[review.user_id]
  max_reviews_in_day = max_date_occurrences(reviews)
  average_review_length = functools.reduce(lambda total, review: total + len(review.review_content), reviews, 0) / len(reviews)
  ratings_stdev = 0 if len(reviews) == 1 else statistics.stdev([x.rating for x in reviews])
  return (max_reviews_in_day, average_review_length, ratings_stdev)

Now we put our features together

In [11]:
from exp2_feature_extraction import reviews_by_reviewer
from sklearn.utils import shuffle
reviews = shuffle(all_reviews)
reviews_reviewer_map = reviews_by_reviewer(reviews)

predictor_features = []
for review in reviews:
  next_entry = []
  next_entry += list(structural_features(review))
  next_entry += list(sentiment_features(review))
  next_entry += topic_features(review)
  next_entry += list(reviewer_features(review, reviews_reviewer_map))
  predictor_features.append(next_entry)





In [12]:
from scipy.sparse import coo_matrix, vstack
def format_row(features_row):
  return coo_matrix(features_row)

new_predictor_features = vstack([format_row(x) for x in predictor_features[:10]])



In [13]:
targets = [x.label for x in all_reviews]

In [14]:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()

In [15]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=10, return_train_score=False)

{'fit_time': array([0.74030423, 0.72387052, 0.70630813, 0.70531082, 0.70363092,
        0.70707583, 0.71101546, 0.70721221, 0.70555878, 0.71357942]),
 'score_time': array([0.07638741, 0.07003665, 0.06944442, 0.06952405, 0.07006812,
        0.06943226, 0.0694654 , 0.06953883, 0.06995726, 0.06973457]),
 'test_score': array([0.56856793, 0.57330251, 0.5556453 , 0.56294213, 0.56528157,
        0.57276145, 0.55217936, 0.56667781, 0.53860294, 0.54417335])}