Following experiment 1, we now want to try to find better features. This will take inspiration from existing research done at Stanford. We will derive the same features and attempt to replicate the same benchmark as them. [Paper here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjf1Pbo6ZveAhUKBsAKHbeSAicQFjAAegQIBhAC&url=http%3A%2F%2Fcs229.stanford.edu%2Fproj2017%2Ffinal-reports%2F5229663.pdf&usg=AOvVaw1SAoqP8hAkRiRJH9lwmeEn)

This notebook has been heavily unit tested, and as a result a lot of code has been removed from the notebook itself. I have demonstrated as much as possible through importing units and running them on example values.

First the same setup as Experiment 1:

In [1]:
from protos import review_set_pb2, review_pb2
review_set = review_set_pb2.ReviewSet()
with open("data/yelpNYC", 'rb') as f:
  review_set.ParseFromString(f.read())

In [2]:
from exp2_feature_extraction import find_words
from sklearn.utils import shuffle

num_each_class = 8141
reviews = shuffle(review_set.reviews)

i = 0
fake_reviews = []
for x in reviews:
  if i == num_each_class:
    break
  if x.label:
    fake_reviews.append(x)
    i+=1

i = 0
genuine_reviews = []
for x in reviews:
  if i == num_each_class:
    break
  if x.label == False:
    fake_reviews.append(x)
    i+=1
    
all_reviews = [(x, find_words(x.review_content)) for x in shuffle(fake_reviews + genuine_reviews)]

In [3]:
print(len(all_reviews))

16282


Our first features will be the structural features. These are:
* Length of the review
* Average word length
* Number of sentences
* Average sentence length
* Percentage of numerals
* percentage of capitalized words:

In [4]:
from exp2_feature_extraction import structural_features

review = review_pb2.Review()
review.review_content = "1 very horrible restaurant. Eat 10 Starbucks instead."
structural_features((review, ["1", "very", "horrible", "restaurant", "Eat", "10", "Starbucks", "instead"]))

(53, 5.5, 2, 26.0, 0.25, 0.25)

In [5]:
features_structural = [structural_features(x) for x in all_reviews]

And then the Part of Speech features. There are 36 part of speech categories. Descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)

In [6]:
import nltk
from exp2_feature_extraction import pos_features

sample_pos_features = pos_features(["Dog", "and", "beautiful"], nltk)
[(i, str(sample_pos_features[i])) for i in range(0, len(sample_pos_features)) if sample_pos_features[i] > 0]

[(5, '0.3333333333333333'),
 (11, '0.3333333333333333'),
 (35, '0.3333333333333333')]

Here 5 is JJ, which is for adjective. This represents "beautiful" in the input. 11 is NNP, which is proper noun, which represents "Dog". 35 is CC which is coordinating conjuction, which is "and". These are given as percentages, where each here is one third.

In [7]:
features_pos = [pos_features(x[1], nltk) for x in all_reviews]

Next are sentiment features. These will be:
* Percentage of words that have positive sentiment
* Percentage of words that have negative sentiment

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
print("Scores:")
print("none:", sentiment_analyzer.polarity_scores("none")["compound"])
print("good:", sentiment_analyzer.polarity_scores("good")["compound"])
print("bad:", sentiment_analyzer.polarity_scores("bad")["compound"])

from exp2_feature_extraction import sentiment_features
sentiment_features(["good", "good", "none", "bad"], sentiment_analyzer)

Scores:
none: 0.0
good: 0.4404
bad: -0.5423




(0.5, 0.25)

In [9]:
features_sentiment = [sentiment_features(x[1], sentiment_analyzer) for x in all_reviews]

Next the topic model features from LDA

We will perform some data cleaning, which will improve our results when generating LDA topics. This cleaning is stemming and lemmatization, and removing words with three or less characters.

### Stemming and Lemmatization

Stemming will reduce a word to a 'stem' form, which can be used to normalise words that mean the same thing. For example 'cleanly' and 'cleanest' would be stemmed to 'clean'. Lemmatization uses a vocabulary and morphological analysis to more intelligently normalise words, for example 'car' and 'automobile' could go to 'vehicle'

In [10]:
from exp2_feature_extraction import preprocess_words
preprocess_words(["I", "was", "gonna", "just", "stay", "in", "for", "coffee", "but", "this", "little", "store", "happily", "surprised", "me"])

['gonna', 'stay', 'coffe', 'littl', 'store', 'happili', 'surpris']

In [11]:
import gensim

def get_topic_features_maker(reviews, num_topics=100, bigrams=False):
  preprocessed_words = [preprocess_words(x[1], bigrams=bigrams) for x in reviews]
  
  dictionary = gensim.corpora.Dictionary(preprocessed_words)
  dictionary.filter_extremes(no_below=15, no_above=0.33, keep_n=100000)
  bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_words]
  lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=2)
    
  def make_topic_features(review_words):
    topics = lda_model.get_document_topics(dictionary.doc2bow(preprocess_words(review_words, bigrams=bigrams)))
    return topic_features(topics, num_topics)

  def get_terms(topic_id):
    return [dictionary.id2token[x[0]] + " " + str(x[1]) for x in lda_model.get_topic_terms(topic_id)]
  return (make_topic_features, get_terms)

topic_features_maker, get_topic_terms = get_topic_features_maker(all_reviews)

  diff = np.log(self.expElogbeta)


In [12]:
from exp2_feature_extraction import topic_features

words = ["Pizza", "Pasta", "Ramen", "Noodles"]
tf = topic_features_maker(words)
[get_topic_terms(i) for i in range(0, len(tf)) if tf[i] > 0]

[['ramen 0.14306363',
  'noodl 0.10816208',
  'pork 0.05333299',
  'broth 0.04551034',
  'wait 0.043994926',
  'bun 0.038505413',
  'spici 0.030381061',
  'bowl 0.029212855',
  'soup 0.023365302',
  'extra 0.019488957'],
 ['pizza 0.3116736',
  'slice 0.050003514',
  'crust 0.04422086',
  'best 0.028866129',
  'chees 0.028013052',
  'fresh 0.018963119',
  'sauc 0.018711366',
  'top 0.018625498',
  'brooklyn 0.01526527',
  'mozzarella 0.014776899'],
 ['pasta 0.16058289',
  'restaur 0.05215843',
  'dish 0.050591186',
  'italian 0.028142707',
  'veget 0.026763672',
  'both 0.02527639',
  'order 0.022927662',
  'veal 0.022089723',
  'ragu 0.016029848',
  'nick 0.01597658']]

In [13]:
features_unigram_topic = [topic_features_maker(x[1]) for x in all_reviews]

In [26]:
bigram_topic_features_maker, get_topic_terms2 = get_topic_features_maker(all_reviews, bigrams=True)

In [30]:
words = ["Fried Chicken", "Pizza Slice"]
tf2 = bigram_topic_features_maker(words)
print(tf2)
[get_topic_terms2(i) for i in range(0, len(tf2)) if tf2[i] > 0]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


[]

In [None]:
features_bigram_topic = [bigram_topic_features_maker(x[1]) for x in all_reviews]

Now we will find the reviewer features. This includes:
* The maximum number of reviews in a day
* average review length
* standard deviation of reviewer's ratings

In [14]:
from exp2_feature_extraction import reviewer_features

review1 = review_pb2.Review()
review1.review_content="1234"
review1.date="2018-11-12"
review1.rating=2

review2 = review_pb2.Review()
review2.review_content="12345678"
review2.date="2018-11-12"
review2.rating=4

user_id = 1234
reviewer_map = {
    user_id: [review1, review2]
}

reviewer_features(user_id, reviewer_map)

(2, 6.0, 1.4142135623730951, 0.5, 0.5)

As shown, our first feature is the number of reviews on this day (2), the average review length (6.0) and the standard devation of our ratings (1.41...)

In [15]:
from exp2_feature_extraction import reviews_by_reviewer
reviews_reviewer_map = reviews_by_reviewer([x[0] for x in all_reviews])
features_reviewer = [reviewer_features(x[0].user_id, reviews_reviewer_map) for x in all_reviews]

Now we put our features together:

### Feature scaling

We normalise all our features to be between one and zero. We need to do this to suppress the mega features vs tiny features situation. Most classifiers use Euclidian distance, which has no knowledge of the units being used.

In [16]:
def features_row(review, reviews_reviewer_map, sentiment_analyzer, pos_tagger, make_topic_features):
  words = find_words(review.review_content)
  row = list(structural_features(review))
  row += list(sentiment_features(words, sentiment_analyzer))
  row += pos_features(words, pos_tagger)
  row += make_topic_features(words)
  row += list(reviewer_features(review, reviews_reviewer_map))
  return row

Here I include a test to check my features are generated correctly:

In [17]:
from unittest.mock import Mock

review = review_pb2.Review()
review.review_content = "1 really horrible restaurant. Drink 10 Starbucks instead."
review.user_id = 1
review.date = "2011-07-28"
review.rating = 5

review2 = review_pb2.Review()
review2.review_content = "example"
review2.user_id = 1
review2.date = "2011-07-28"
review2.rating = 4

test_reviews_reviewer_map = {
    1: [review, review2]
}

def analyze_sentiment(word):
  score = 0.0
  if word == "1":
    score = 0.1
  if word in ["horrible", "instead"]:
    score = -0.5
  return { "compound": score }

test_sentiment_analyzer = Mock()
test_sentiment_analyzer.polarity_scores = analyze_sentiment

def tag(words):
  tag_map = {
    "really": "CD", "horrible": "DT", "restaurant": "CD", "Drink": "FW", "Starbucks": "JJ"
  }
  return [(x, tag_map[x]) for x in [y for y in words if y in tag_map] if x]

test_pos_tagger = Mock()
test_pos_tagger.pos_tag = tag

test_make_topic_features = lambda x: [0.1, 0.2, 0.3, 0.4, 0.5]

row = features_row(review, test_reviews_reviewer_map, test_sentiment_analyzer, test_pos_tagger, test_make_topic_features)

# Structural features
assert row[0] == 57
assert row[1] == 6
assert row[2] == 2
assert row[3] == 28.0
assert row[4] == 0.25
assert row[5] == 0.25
# Sentiment features
assert row[6] == 0.125 # 1/8
assert row[7] == 0.25  # 2/8
# POS features
assert row[8:44] == [0.4, 0.2, 0.0, 0.2, 0.0, 0.2] + [0.0] * 30
# Topic features (This is not testing much as it doesn't use the real thing)
assert row[44:49] == [0.1, 0.2, 0.3, 0.4, 0.5]
# Reviewer features
assert row[49] == 2
assert row[50] == 32
assert float("%0.2f"%row[51]**2) == 0.5

TypeError: 'Review' object does not support indexing

In [18]:
from scipy.sparse import coo_matrix, hstack
predictor_features = hstack([coo_matrix(features_structural), coo_matrix(features_sentiment), coo_matrix(features_pos),
                             coo_matrix(features_topic), coo_matrix(features_reviewer)])

In [19]:
targets = [x[0].label for x in all_reviews]

In [20]:
from sklearn.naive_bayes import MultinomialNB
cnb = MultinomialNB()

In [21]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=5, return_train_score=False)

{'fit_time': array([0.02335429, 0.01304007, 0.01229262, 0.01186943, 0.0153687 ]),
 'score_time': array([0.00145555, 0.00159121, 0.00147748, 0.00139523, 0.00144601]),
 'test_score': array([0.59576427, 0.60380835, 0.59060197, 0.61578624, 0.5958231 ])}

cannot replicate results. adding more topics seems to have virtually no difference. using bigrams alone for lda topics seems to improve by 0.01.