Following experiment 1, we now want to try to find better features. This will take inspiration from existing research done at Stanford. We will derive the same features and attempt to replicate the same benchmark as them. [Paper here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjf1Pbo6ZveAhUKBsAKHbeSAicQFjAAegQIBhAC&url=http%3A%2F%2Fcs229.stanford.edu%2Fproj2017%2Ffinal-reports%2F5229663.pdf&usg=AOvVaw1SAoqP8hAkRiRJH9lwmeEn)

First the same setup as Experiment 1:

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), '../..'))
data_file_path = 'data/yelpNYC'
from protos import review_set_pb2, review_pb2
import nltk
review_set = review_set_pb2.ReviewSet()
with open(data_file_path, 'rb') as f:
  review_set.ParseFromString(f.read())

Our first features will be the structural features. These are length of the review, average word length, number of sentences, average sentence length, percentage of numerals, percentage of capitalized words:

In [2]:
from exp2_feature_extraction import find_avg_token_length, find_numerals_ratio
from exp2_feature_extraction import find_capitalised_word_ratio

find_words = lambda text: nltk.tokenize.word_tokenize(text)

def structural_features(review):
  review_content = review.review_content
  length_of_review = len(review_content)
  words = find_words(review_content)
  avg_word_length = find_avg_token_length(words)
  sentences = nltk.tokenize.sent_tokenize(review_content)
  sentence_length_of_review = len(sentences)
  avg_sentence_length = find_avg_token_length(sentences)
  numerals_ratio = find_numerals_ratio(words)
  capitalised_word_ratio = find_capitalised_word_ratio(words)
  return (length_of_review, avg_word_length, sentence_length_of_review,
          avg_sentence_length, numerals_ratio, capitalised_word_ratio)

And then the Part of Speech features. There are 36 part of speech categories. Descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)

In [3]:
def pos_features(words):
  tag_map = {
    "CD":  0, "DT":  0, "EX":  0, "FW":   0, "IN":  0, "JJ":  0, "JJR": 0, "JJS": 0, "LS":   0,
    "MD":  0, "NN":  0, "NNP": 0, "NNPS": 0, "NNS": 0, "PDT": 0, "POS": 0, "PRP": 0, "PRP$": 0,
    "RB":  0, "RBR": 0, "RBS": 0, "RP":   0, "SYM": 0, "TO":  0, "UH":  0, "VB":  0, "VBD":  0,
    "VBG": 0, "VBN": 0, "VBP": 0, "VBZ":  0, "WDT": 0, "WP":  0, "WP$": 0, "WRB": 0, "CC":   0
  }
  tags = nltk.pos_tag(words)
  for tag in tags:
    key = tag[1]
    if key in tag_map:
      tag_map[key] += 1
  order = ["CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNP", "NNPS", "NNS", "PDT",
           "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN",
           "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "CC"]
  return [tag_map[x] for x in order]

Next are sentiment features.

In [4]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import functools
sentiment_analyzer = SentimentIntensityAnalyzer()
def sentiment_features(review_words):
  polarities = []
  num_positive = 0
  num_negative = 0
  for word in review_words:
    polarity = sentiment_analyzer.polarity_scores(word)['compound']
    polarities.append(1 if polarity > 0 else -1 if polarity < 0 else 0)
  reduce_func = lambda c, p: (c[0] + 1, c[1]) if p > 0 else (c[0], c[1] + 1) if p < 0 else (c[0], c[1])
  num_positive, num_negative = functools.reduce(reduce_func, polarities, (0, 0))
  total = num_positive + num_negative
  if total == 0:
    return (0, 0)
  return (num_positive / total, num_negative / total)



Next the topic model features from LDA

In [5]:
all_reviews = review_set.reviews[:100000]

We will perform some data cleaning, which will improve our results when generating LDA topics. This cleaning is stemming and lemmatization, and removing words with three or less characters.

### Stemming and Lemmatization

Stemming will reduce a word to a 'stem' form, which can be used to normalise words that mean the same thing. For example 'cleanly' and 'cleanest' would be stemmed to 'clean'. Lemmatization uses a vocabulary and morphological analysis to more intelligently normalise words, for example 'car' and 'automobile' could go to 'vehicle'

In [6]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_words(words):
  lemmatized = []
  for word in words:
    lemmatized.append(stemmer.stem(lemmatizer.lemmatize(word, pos='v')))
  return lemmatized

def preprocess_words(words):
  return lemmatize_words([x for x in words if len(x) > 3])

preprocessed_words = [preprocess_words(find_words(x.review_content)) for x in all_reviews]

In [7]:
import gensim

dictionary = gensim.corpora.Dictionary(preprocessed_words)

In [8]:
bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_words]

In [9]:
lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=10, id2word=dictionary, passes=2)

In [10]:
def topic_features(review):
  words = find_words(review.review_content)
  bow = dictionary.doc2bow(preprocess_words(words))
  topics = lda_model.get_document_topics(bow)
  t = [0] * 10
  for topic in topics:
    t[topic[0]] = topic[1]
  return t

Now we will find the reviewer features. This includes the maximum number of reviews in a day, average review length, standard deviation of reviewer's ratings

In [11]:
from exp2_feature_extraction import find_capitalised_word_ratio
from exp2_feature_extraction import max_date_occurrences
import statistics
import functools

def reviewer_features(review, reviews_by_reviewer):
  reviews = reviews_by_reviewer[review.user_id]
  max_reviews_in_day = max_date_occurrences(reviews)
  average_review_length = functools.reduce(lambda total, review: total + len(review.review_content), reviews, 0) / len(reviews)
  ratings_stdev = 0 if len(reviews) == 1 else statistics.stdev([x.rating for x in reviews])
  return (max_reviews_in_day, average_review_length, ratings_stdev)

Now we put our features together

In [12]:
from exp2_feature_extraction import reviews_by_reviewer
from sklearn.utils import shuffle
reviews = shuffle(all_reviews)
reviews_reviewer_map = reviews_by_reviewer(reviews)

predictor_features = []
for review in reviews:
  review_words = find_words(review.review_content)
  next_entry = []
  next_entry += list(structural_features(review))
  next_entry += list(sentiment_features(review_words))
  next_entry += pos_features(review_words)
  next_entry += topic_features(review)
  next_entry += list(reviewer_features(review, reviews_reviewer_map))
  predictor_features.append(next_entry)

In [14]:
from scipy.sparse import coo_matrix, vstack
def format_row(features_row):
  return coo_matrix(features_row)

new_predictor_features = vstack([format_row(x) for x in predictor_features])

In [15]:
targets = [x.label for x in all_reviews]

In [16]:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()

In [17]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=10, return_train_score=False)

{'fit_time': array([0.45631766, 0.45523643, 0.44877887, 0.44843102, 0.44655967,
        0.44545507, 0.447119  , 0.44638729, 0.44400477, 0.44465542]),
 'score_time': array([0.07666707, 0.0496645 , 0.04661965, 0.04664135, 0.04653859,
        0.04677415, 0.04654527, 0.04655075, 0.04710889, 0.04622293]),
 'test_score': array([0.39326067, 0.601     , 0.562     , 0.5819    , 0.5081    ,
        0.5411    , 0.4716    , 0.6047    , 0.6133    , 0.54705471])}