Following Experiment 1, we now want to find better features, taking inspiration from existing research done at Stanford. We will derive the same features and attempt to replicate their benchmark. [Paper here](http://cs229.stanford.edu/proj2017/final-reports/5229663.pdf)

The code behind this notebook has been heavily unit tested, and as a result much of it has been moved out of the notebook itself. Where possible, I demonstrate each unit by importing it and running it on example values.

First the same setup as Experiment 1:

In [1]:
from protos import review_set_pb2, review_pb2
review_set = review_set_pb2.ReviewSet()
with open("data/yelpNYC", 'rb') as f:
  review_set.ParseFromString(f.read())

In [2]:
from exp2_feature_extraction import find_words
from sklearn.utils import shuffle
all_reviews = [(x, find_words(x.review_content)) for x in shuffle(review_set.reviews)]

Our first features will be the structural features. These are:
* Length of the review
* Average word length
* Number of sentences
* Average sentence length
* Percentage of numerals
* Percentage of capitalized words

Both percentages are returned as fractions between 0 and 1:

In [3]:
from exp2_feature_extraction import structural_features

review = review_pb2.Review()
review.review_content = "1 very horrible restaurant. Eat 10 Starbucks instead."
structural_features((review, ["1", "very", "horrible", "restaurant", "Eat", "10", "Starbucks", "instead"]))

(53, 5.5, 2, 26.0, 0.25, 0.25)
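
The implementation of `structural_features` lives in `exp2_feature_extraction` and is not shown here. As a rough sketch of how the six values above could be computed, assuming that a sentence ends at `.`, `!` or `?` followed by whitespace and that a word counts as capitalized when its first character is uppercase (the name `structural_features_sketch` is hypothetical):

```python
import re

def structural_features_sketch(text, words):
    """Hypothetical stand-in for structural_features; not the notebook's real code."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n_words = len(words)
    return (
        len(text),                                        # review length in characters
        sum(len(w) for w in words) / n_words,             # average word length
        len(sentences),                                   # number of sentences
        sum(len(s) for s in sentences) / len(sentences),  # average sentence length
        sum(w.isdigit() for w in words) / n_words,        # fraction of numerals
        sum(w[0].isupper() for w in words) / n_words,     # fraction of capitalized words
    )

text = "1 very horrible restaurant. Eat 10 Starbucks instead."
words = ["1", "very", "horrible", "restaurant", "Eat", "10", "Starbucks", "instead"]
print(structural_features_sketch(text, words))
```

On this example the sketch gives the same tuple as the real function. It does not guard against empty reviews, which would need a special case.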

In [4]:
features_structural = [structural_features(x) for x in all_reviews]

Next are the part-of-speech (POS) features. There are 36 part-of-speech categories; descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b).

In [7]:
import nltk
from exp2_feature_extraction import pos_features

sample_pos_features = pos_features(["Dog", "and", "beautiful"], nltk)
[(i, str(v)) for i, v in enumerate(sample_pos_features) if v > 0]

[(5, '0.3333333333333333'),
 (11, '0.3333333333333333'),
 (35, '0.3333333333333333')]

Here index 5 is JJ (adjective), which corresponds to "beautiful" in the input. Index 11 is NNP (proper noun), which corresponds to "Dog", and index 35 is CC (coordinating conjunction), which corresponds to "and". Each value is the fraction of the input words carrying that tag, so here each is one third.
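
The mapping from tags to a 36-element vector can be sketched as follows. The real `pos_features` is not shown in this notebook, so the tag ordering below is an assumption chosen only to agree with the indices seen above (JJ at 5, NNP at 11, CC at 35), and `StubTagger` is a hypothetical stand-in for `nltk` so the example runs without the tagger models:

```python
from collections import Counter

# Assumed ordering: the 36 Penn Treebank tags, arranged to match the indices above.
POS_TAGS = [
    "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNP",
    "NNPS", "NNS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM",
    "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$",
    "WRB", "CC",
]

def pos_features_sketch(words, tagger):
    """Hypothetical stand-in: fraction of tagged words falling in each POS category."""
    tags = [tag for _, tag in tagger.pos_tag(words)]
    counts = Counter(tags)
    total = len(tags)
    return [counts[t] / total if total else 0.0 for t in POS_TAGS]

class StubTagger:
    """Test double exposing the same pos_tag(words) call that nltk provides."""
    LOOKUP = {"Dog": "NNP", "and": "CC", "beautiful": "JJ"}

    def pos_tag(self, words):
        return [(w, self.LOOKUP[w]) for w in words]

features = pos_features_sketch(["Dog", "and", "beautiful"], StubTagger())
print([(i, v) for i, v in enumerate(features) if v > 0])
```

Passing the tagger in as an argument, as the notebook does with `nltk`, is what makes it possible to swap in a stub like this one in tests.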

In [6]:
features_pos = [pos_features(x[1], nltk) for x in all_reviews]

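Next are the sentiment features: the fraction of words in the review with a positive sentiment score and the fraction with a negative one.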

In [7]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
print("Scores:")
print("none:", sentiment_analyzer.polarity_scores("none")["compound"])
print("good:", sentiment_analyzer.polarity_scores("good")["compound"])
print("bad:", sentiment_analyzer.polarity_scores("bad")["compound"])

from exp2_feature_extraction import sentiment_features
sentiment_features(["good", "good", "none", "bad"], sentiment_analyzer)

Scores:
none: 0.0
good: 0.4404
bad: -0.5423




(0.5, 0.25)
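
The pair above is consistent with counting the words whose compound score is above or below zero. A minimal sketch under that assumption (the real `sentiment_features` is not shown; `sentiment_features_sketch` and `StubAnalyzer` are hypothetical names, and the stub simply replays the scores printed above so that no VADER lexicon is needed):

```python
def sentiment_features_sketch(words, analyzer):
    """Hypothetical stand-in: fractions of words scored positive and negative."""
    if not words:
        return (0.0, 0.0)
    scores = [analyzer.polarity_scores(w)["compound"] for w in words]
    positive = sum(s > 0 for s in scores) / len(scores)
    negative = sum(s < 0 for s in scores) / len(scores)
    return (positive, negative)

class StubAnalyzer:
    """Test double returning the compound scores printed above."""
    SCORES = {"good": 0.4404, "bad": -0.5423}

    def polarity_scores(self, word):
        return {"compound": self.SCORES.get(word, 0.0)}

print(sentiment_features_sketch(["good", "good", "none", "bad"], StubAnalyzer()))
```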

In [8]:
features_sentiment = [sentiment_features(x[1], sentiment_analyzer) for x in all_reviews]

Next are the topic model features from LDA.

We will first perform some data cleaning, which improves the results when generating LDA topics: stemming and lemmatization, and removing words with three or fewer characters.

### Stemming and Lemmatization

Stemming reduces a word to a 'stem' form by stripping suffixes, which normalises different forms of the same word. For example, 'cleaning' and 'cleaned' would both be stemmed to 'clean'. The stem need not be a real word, as 'coffe' and 'happili' in the output below show. Lemmatization uses a vocabulary and morphological analysis to normalise words more intelligently, mapping each to its dictionary form (lemma), for example 'was' to 'be' or 'better' to 'good'.

In [9]:
from exp2_feature_extraction import preprocess_words
preprocess_words(["I", "was", "gonna", "just", "stay", "in", "for", "coffee", "but", "this", "little", "store", "happily", "surprised", "me"])

['gonna', 'stay', 'coffe', 'littl', 'store', 'happili', 'surpris']

In [10]:
import gensim
from exp2_feature_extraction import topic_features

def get_topic_features_maker(reviews):
  num_topics = 10
  preprocessed_words = [preprocess_words(x[1]) for x in reviews]

  # Build the vocabulary, dropping tokens that appear in fewer than 15 reviews
  # or in more than half of them, and keeping at most 100,000 tokens.
  dictionary = gensim.corpora.Dictionary(preprocessed_words)
  dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
  bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_words]
  lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=2)

  # Print the three highest-weighted words of each topic.
  for index, topic in lda_model.show_topics(formatted=False, num_words=3):
    print('{}: {}'.format(index, [w[0] for w in topic]))

  # Closure that maps a review's words to its topic distribution under the trained model.
  def make_topic_features(review_words):
    topics = lda_model.get_document_topics(dictionary.doc2bow(preprocess_words(review_words)))
    return topic_features(topics, num_topics)
  return make_topic_features

topic_features_maker = get_topic_features_maker(all_reviews)

0: ['food', 'order', 'wait']
1: ['food', 'place', 'great']
2: ['flavor', 'dish', 'like']
3: ['dessert', 'pasta', 'restaur']
4: ['burger', 'fri', 'like']
5: ['brunch', 'good', 'egg']
6: ['place', 'great', 'drink']
7: ['pizza', 'place', 'good']
8: ['chicken', 'sauc', 'good']
9: ['ramen', 'pork', 'noodl']


In [11]:
from exp2_feature_extraction import topic_features

words = ["Pizza", "Pasta", "Italian", "Spaghetti"]
topic_features_maker(words)

[0.02, 0.020000432, 0.02, 0.54804873, 0.02, 0.02, 0.02, 0.2919508, 0.02, 0.02]
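
gensim's `get_document_topics` returns a list of `(topic_id, probability)` pairs and can omit topics whose probability falls below its minimum threshold, so some step has to turn that list into a fixed-length vector. The real `topic_features` is not shown here; a minimal sketch of such a conversion, assuming omitted topics should become 0.0 (the name `topic_features_sketch` is hypothetical):

```python
def topic_features_sketch(topics, num_topics):
    """Hypothetical stand-in: expand (topic_id, probability) pairs to a dense list."""
    dense = [0.0] * num_topics
    for topic_id, probability in topics:
        dense[topic_id] = probability
    return dense

# Made-up sparse output for a document dominated by topics 3 and 7.
sparse = [(3, 0.55), (7, 0.29)]
print(topic_features_sketch(sparse, 10))
```

A fixed length matters because every review must contribute the same number of columns to the feature matrix.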

In [12]:
features_topic = [topic_features_maker(x[1]) for x in all_reviews]

Now we will find the reviewer features. These are the maximum number of reviews the reviewer posted in a single day, the reviewer's average review length, and the standard deviation of the reviewer's ratings.

In [13]:
from exp2_feature_extraction import max_date_occurrences
import statistics

def reviewer_features(review, reviews_by_reviewer):
  # All reviews written by the author of this review.
  reviews = reviews_by_reviewer[review.user_id]
  max_reviews_in_day = max_date_occurrences(reviews)
  average_review_length = sum(len(r.review_content) for r in reviews) / len(reviews)
  # statistics.stdev needs at least two data points, so use 0 for a single review.
  ratings_stdev = 0 if len(reviews) == 1 else statistics.stdev([x.rating for x in reviews])
  return (max_reviews_in_day, average_review_length, ratings_stdev)

In [14]:
from exp2_feature_extraction import reviews_by_reviewer
reviews_reviewer_map = reviews_by_reviewer([x[0] for x in all_reviews])
features_reviewer = [reviewer_features(x[0], reviews_reviewer_map) for x in all_reviews]
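
The helpers `reviews_by_reviewer` and `max_date_occurrences` are imported from `exp2_feature_extraction` and not shown. One plausible shape for them, sketched under the assumption that reviews are grouped by `user_id` and that the busiest day is found by counting identical `date` strings (all names ending in `_sketch`, and the `Review` namedtuple, are hypothetical stand-ins):

```python
from collections import Counter, defaultdict, namedtuple

# Minimal stand-in for the protobuf Review message.
Review = namedtuple("Review", ["user_id", "date", "rating"])

def reviews_by_reviewer_sketch(reviews):
    """Group reviews into a dict keyed by the reviewer's user_id."""
    grouped = defaultdict(list)
    for r in reviews:
        grouped[r.user_id].append(r)
    return dict(grouped)

def max_date_occurrences_sketch(reviews):
    """Largest number of reviews sharing a single date."""
    return max(Counter(r.date for r in reviews).values())

reviews = [
    Review(1, "2011-07-28", 5),
    Review(1, "2011-07-28", 4),
    Review(2, "2011-07-29", 3),
]
grouped = reviews_by_reviewer_sketch(reviews)
print(max_date_occurrences_sketch(grouped[1]))
```

Building the map once and reusing it for every review avoids rescanning the whole dataset per review.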

Now we put our features together:

In [15]:
from exp2_feature_extraction import structural_features

def features_row(review, reviews_reviewer_map, sentiment_analyzer, pos_tagger, make_topic_features):
  words = find_words(review.review_content)
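  # structural_features expects a (review, words) tuple, as in In [3]; passing the
  # bare review raises the TypeError recorded after the test cell.
  row = list(structural_features((review, words)))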
  row += list(sentiment_features(words, sentiment_analyzer))
  row += pos_features(words, pos_tagger)
  row += make_topic_features(words)
  row += list(reviewer_features(review, reviews_reviewer_map))
  return row

Here I include a test to check my features are generated correctly:

In [16]:
from unittest.mock import Mock

review = review_pb2.Review()
review.review_content = "1 really horrible restaurant. Drink 10 Starbucks instead."
review.user_id = 1
review.date = "2011-07-28"
review.rating = 5

review2 = review_pb2.Review()
review2.review_content = "example"
review2.user_id = 1
review2.date = "2011-07-28"
review2.rating = 4

test_reviews_reviewer_map = {
    1: [review, review2]
}

def analyze_sentiment(word):
  score = 0.0
  if word == "1":
    score = 0.1
  if word in ["horrible", "instead"]:
    score = -0.5
  return { "compound": score }

test_sentiment_analyzer = Mock()
test_sentiment_analyzer.polarity_scores = analyze_sentiment

def tag(words):
  tag_map = {
    "really": "CD", "horrible": "DT", "restaurant": "CD", "Drink": "FW", "Starbucks": "JJ"
  }
  return [(w, tag_map[w]) for w in words if w in tag_map]

test_pos_tagger = Mock()
test_pos_tagger.pos_tag = tag

test_make_topic_features = lambda x: [0.1, 0.2, 0.3, 0.4, 0.5]

row = features_row(review, test_reviews_reviewer_map, test_sentiment_analyzer, test_pos_tagger, test_make_topic_features)

# Structural features
assert row[0] == 57
assert row[1] == 6
assert row[2] == 2
assert row[3] == 28.0
assert row[4] == 0.25
assert row[5] == 0.25
# Sentiment features
assert row[6] == 0.125 # 1/8
assert row[7] == 0.25  # 2/8
# POS features
assert row[8:44] == [0.4, 0.2, 0.0, 0.2, 0.0, 0.2] + [0.0] * 30
# Topic features (a stub stands in for the real LDA model, so this only checks placement in the row)
assert row[44:49] == [0.1, 0.2, 0.3, 0.4, 0.5]
# Reviewer features
assert row[49] == 2
assert row[50] == 32
assert round(row[51] ** 2, 2) == 0.5  # stdev of ratings [5, 4] is sqrt(0.5)

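TypeError: 'Review' object does not support indexing

This error was raised because `structural_features` indexes its argument as a `(review, words)` tuple, as in In [3], but received a bare `Review`; `features_row` needs to pass `(review, words)`. The feature matrix built next is unaffected, since it is assembled from the per-feature lists computed earlier rather than from `features_row`.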

In [17]:
from scipy.sparse import coo_matrix, hstack
predictor_features = hstack([coo_matrix(features_structural), coo_matrix(features_sentiment), coo_matrix(features_pos),
                             coo_matrix(features_topic), coo_matrix(features_reviewer)])

In [18]:
targets = [x[0].label for x in all_reviews]

In [19]:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()

In [20]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=10, return_train_score=False)

{'fit_time': array([0.27180743, 0.27057576, 0.26438594, 0.27046919, 0.24466586,
        0.24403763, 0.24148178, 0.24270821, 0.24425125, 0.24098802]),
 'score_time': array([0.01063466, 0.01147199, 0.01149797, 0.01017594, 0.01002407,
        0.01007032, 0.0102942 , 0.01008558, 0.00996137, 0.01010466]),
 'test_score': array([0.5306077 , 0.53459032, 0.53158247, 0.53411686, 0.53523088,
        0.5343267 , 0.530149  , 0.53111074, 0.53704323, 0.52774064])}
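
The ten fold scores are easier to read once summarized. Since `cross_validate` was called without a `scoring` argument, `test_score` is the estimator's default score, which for scikit-learn classifiers is accuracy. A small sketch using only the standard library, with the values copied from the output above:

```python
import statistics

# test_score values copied from the cross_validate output above.
test_scores = [0.5306077, 0.53459032, 0.53158247, 0.53411686, 0.53523088,
               0.5343267, 0.530149, 0.53111074, 0.53704323, 0.52774064]

mean_score = statistics.mean(test_scores)
spread = statistics.stdev(test_scores)  # sample standard deviation across folds
print("mean accuracy: {:.4f}".format(mean_score))
print("std dev across folds: {:.4f}".format(spread))
```

The mean comes out at roughly 0.53, with little variation between folds.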