Following experiment 1, we now want to try to find better features. This will take inspiration from existing research done at Stanford. We will derive the same features and attempt to replicate the same benchmark as them. [Paper here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwjf1Pbo6ZveAhUKBsAKHbeSAicQFjAAegQIBhAC&url=http%3A%2F%2Fcs229.stanford.edu%2Fproj2017%2Ffinal-reports%2F5229663.pdf&usg=AOvVaw1SAoqP8hAkRiRJH9lwmeEn)

This notebook has been heavily unit tested, and as a result a lot of code has been removed from the notebook itself. I have demonstrated as much as possible through importing units and running them on example values.

First the same setup as Experiment 1:

In [1]:
from protos import review_set_pb2, review_pb2
review_set = review_set_pb2.ReviewSet()
with open("data/yelpNYC", 'rb') as f:
  review_set.ParseFromString(f.read())

In [2]:
from exp2_feature_extraction import find_words
from sklearn.utils import shuffle

num_each_class = 8141
reviews = shuffle(review_set.reviews)

i = 0
fake_reviews = []
for x in reviews:
  if i == num_each_class:
    break
  if x.label:
    fake_reviews.append(x)
    i+=1

i = 0
genuine_reviews = []
for x in reviews:
  if i == num_each_class:
    break
  if x.label == False:
    fake_reviews.append(x)
    i+=1
    
all_reviews = [(x, find_words(x.review_content)) for x in shuffle(fake_reviews + genuine_reviews)]

In [3]:
print(len(all_reviews))

16282


Our first features will be the structural features. These are:
* Length of the review
* Average word length
* Number of sentences
* Average sentence length
* Percentage of numerals
* percentage of capitalized words:

In [4]:
from exp2_feature_extraction import structural_features

review = review_pb2.Review()
review.review_content = "1 very horrible restaurant. Eat 10 Starbucks instead."
structural_features((review, ["1", "very", "horrible", "restaurant", "Eat", "10", "Starbucks", "instead"]))

(53, 5.5, 2, 26.0, 0.25, 0.25)

In [5]:
features_structural = [structural_features(x) for x in all_reviews]

And then the Part of Speech features. There are 36 part of speech categories. Descriptions can be found [here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)

In [6]:
import nltk
from exp2_feature_extraction import pos_features

sample_pos_features = pos_features(["Dog", "and", "beautiful"], nltk)
[(i, str(sample_pos_features[i])) for i in range(0, len(sample_pos_features)) if sample_pos_features[i] > 0]

[(5, '0.3333333333333333'),
 (11, '0.3333333333333333'),
 (35, '0.3333333333333333')]

Here 5 is JJ, which is for adjective. This represents "beautiful" in the input. 11 is NNP, which is proper noun, which represents "Dog". 35 is CC which is coordinating conjuction, which is "and". These are given as percentages, where each here is one third.

In [7]:
features_pos = [pos_features(x[1], nltk) for x in all_reviews]

Next are sentiment features. These will be:
* Percentage of words that have positive sentiment
* Percentage of words that have negative sentiment

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
print("Scores:")
print("none:", sentiment_analyzer.polarity_scores("none")["compound"])
print("good:", sentiment_analyzer.polarity_scores("good")["compound"])
print("bad:", sentiment_analyzer.polarity_scores("bad")["compound"])

from exp2_feature_extraction import sentiment_features
sentiment_features(["good", "good", "none", "bad"], sentiment_analyzer)

Scores:
none: 0.0
good: 0.4404
bad: -0.5423




(0.5, 0.25)

In [9]:
features_sentiment = [sentiment_features(x[1], sentiment_analyzer) for x in all_reviews]

Next the topic model features from LDA

We will perform some data cleaning, which will improve our results when generating LDA topics. This cleaning is stemming and lemmatization, and removing words with three or less characters.

### Stemming and Lemmatization

Stemming will reduce a word to a 'stem' form, which can be used to normalise words that mean the same thing. For example 'cleanly' and 'cleanest' would be stemmed to 'clean'. Lemmatization uses a vocabulary and morphological analysis to more intelligently normalise words, for example 'car' and 'automobile' could go to 'vehicle'

In [10]:
from exp2_feature_extraction import preprocess_words
preprocess_words(["I", "was", "gonna", "just", "stay", "in", "for", "coffee", "but", "this", "little", "store", "happily", "surprised", "me"])

['gonna', 'stay', 'coffe', 'littl', 'store', 'happili', 'surpris']

In [11]:
import gensim

def get_topic_features_maker(reviews, num_topics=100, bigrams=False):
  preprocessed_words = [preprocess_words(x[1], bigrams=bigrams) for x in reviews]
  
  dictionary = gensim.corpora.Dictionary(preprocessed_words)
  dictionary.filter_extremes(no_below=15, no_above=0.33, keep_n=100000)
  bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_words]
  lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=2)
    
  def make_topic_features(review_words):
    topics = lda_model.get_document_topics(dictionary.doc2bow(preprocess_words(review_words, bigrams=bigrams)))
    return topic_features(topics, num_topics)

  def get_terms(topic_id):
    return [dictionary.id2token[x[0]] + " " + str(x[1]) for x in lda_model.get_topic_terms(topic_id)]
  return (make_topic_features, get_terms)

In [12]:
topic_features_maker, get_topic_terms = get_topic_features_maker(all_reviews)

  diff = np.log(self.expElogbeta)


In [13]:
def print_topic_terms(term_frequencies, term_getter):
  for topic_id in range(0, len(term_frequencies)):
    if term_frequencies[topic_id] > 0:
      print("----- Topic", topic_id)
      print("\n".join(term_getter(topic_id)))

In [14]:
from exp2_feature_extraction import topic_features

words = ["Pizza", "Pasta", "Ramen", "Noodles"]
tf = topic_features_maker(words)
print_topic_terms(tf, get_topic_terms)

----- Topic 22
noodl 0.096852325
broth 0.05096265
bowl 0.045923855
pork 0.03812034
flavor 0.028089475
like 0.02164319
soup 0.019695487
cook 0.018089803
beef 0.016119294
textur 0.015956793
----- Topic 51
pizza 0.32059258
crust 0.05568068
slice 0.049427867
artichok 0.029564379
best 0.027534088
oven 0.02168064
chees 0.02072949
fresh 0.015081689
tast 0.014505719
basil 0.014213047
----- Topic 60
ramen 0.16034897
ippudo 0.049952243
spici 0.048881866
wait 0.043555398
burrito 0.028100757
mexican 0.025394771
extra 0.022398656
miso 0.020769633
best 0.019556362
worth 0.019325428
----- Topic 63
pasta 0.056896374
ravioli 0.04104559
risotto 0.035842206
spaghetti 0.03391445
foie 0.029736236
gras 0.027632628
gelato 0.022858901
like 0.022318793
sauc 0.02117488
polenta 0.019895256


In [15]:
features_unigram_topic = [topic_features_maker(x[1]) for x in all_reviews]

In [16]:
bigram_topic_features_maker, get_topic_terms = get_topic_features_maker(all_reviews, bigrams=True)

In [17]:
words = ["Fried", "Chicken", "Pizza", "Slice"]
tf2 = bigram_topic_features_maker(words)
print_topic_terms(tf2, get_topic_terms)

----- Topic 5
dish come 0.07545652
good pizza 0.06775505
leav hungri 0.060532913
food recommend 0.042339265
food spici 0.042271245
moment walk 0.040766425
place peopl 0.038288265
reserv tabl 0.035552364
look nice 0.034365393
pizza brooklyn 0.033186343
----- Topic 61
fri chicken 0.24945906
feel like 0.23236562
scrambl egg 0.02751178
poor servic 0.02662584
walk door 0.02419437
dinner drink 0.023946911
hollandais sauc 0.021571178
nice wine 0.019685354
romant dinner 0.019272646
like come 0.019226033


In [18]:
features_bigram_topic = [bigram_topic_features_maker(x[1]) for x in all_reviews]

Now we will find the reviewer features. This includes:
* The maximum number of reviews in a day
* average review length
* standard deviation of reviewer's ratings

In [19]:
from exp2_feature_extraction import reviewer_features

review1 = review_pb2.Review()
review1.review_content="1234"
review1.date="2018-11-12"
review1.rating=2

review2 = review_pb2.Review()
review2.review_content="12345678"
review2.date="2018-11-12"
review2.rating=4

user_id = 1234
reviewer_map = {
    user_id: [review1, review2]
}

reviewer_features(user_id, reviewer_map)

(2, 6.0, 1.4142135623730951, 0.5, 0.5)

As shown, our first feature is the number of reviews on this day (2), the average review length (6.0) and the standard devation of our ratings (1.41...)

In [20]:
from exp2_feature_extraction import reviews_by_reviewer
reviews_reviewer_map = reviews_by_reviewer([x[0] for x in all_reviews])
features_reviewer = [reviewer_features(x[0].user_id, reviews_reviewer_map) for x in all_reviews]

Now we put our features together:

### Feature scaling

We normalise all our features to be between one and zero. We need to do this to suppress the mega features vs tiny features situation. Most classifiers use Euclidian distance, which has no knowledge of the units being used.

In [21]:
#def features_row(review, reviews_reviewer_map, sentiment_analyzer, pos_tagger, make_topic_features):
#  words = find_words(review.review_content)
#  row = list(structural_features(review))
#  row += list(sentiment_features(words, sentiment_analyzer))
#  row += pos_features(words, pos_tagger)
#  row += make_topic_features(words)
#  row += list(reviewer_features(review, reviews_reviewer_map))
#  return row

Here I include a test to check my features are generated correctly:

In [22]:
def t():
  """
from unittest.mock import Mock

review = review_pb2.Review()
review.review_content = "1 really horrible restaurant. Drink 10 Starbucks instead."
review.user_id = 1
review.date = "2011-07-28"
review.rating = 5

review2 = review_pb2.Review()
review2.review_content = "example"
review2.user_id = 1
review2.date = "2011-07-28"
review2.rating = 4

test_reviews_reviewer_map = {
    1: [review, review2]
}

def analyze_sentiment(word):
  score = 0.0
  if word == "1":
    score = 0.1
  if word in ["horrible", "instead"]:
    score = -0.5
  return { "compound": score }

test_sentiment_analyzer = Mock()
test_sentiment_analyzer.polarity_scores = analyze_sentiment

def tag(words):
  tag_map = {
    "really": "CD", "horrible": "DT", "restaurant": "CD", "Drink": "FW", "Starbucks": "JJ"
  }
  return [(x, tag_map[x]) for x in [y for y in words if y in tag_map] if x]

test_pos_tagger = Mock()
test_pos_tagger.pos_tag = tag

test_make_topic_features = lambda x: [0.1, 0.2, 0.3, 0.4, 0.5]

row = features_row(review, test_reviews_reviewer_map, test_sentiment_analyzer, test_pos_tagger, test_make_topic_features)

# Structural features
assert row[0] == 57
assert row[1] == 6
assert row[2] == 2
assert row[3] == 28.0
assert row[4] == 0.25
assert row[5] == 0.25
# Sentiment features
assert row[6] == 0.125 # 1/8
assert row[7] == 0.25  # 2/8
# POS features
assert row[8:44] == [0.4, 0.2, 0.0, 0.2, 0.0, 0.2] + [0.0] * 30
# Topic features (This is not testing much as it doesn't use the real thing)
assert row[44:49] == [0.1, 0.2, 0.3, 0.4, 0.5]
# Reviewer features
assert row[49] == 2
assert row[50] == 32
assert float("%0.2f"%row[51]**2) == 0.5
  """
  pass

In [23]:
from scipy.sparse import coo_matrix, hstack
predictor_features = hstack([coo_matrix(features_structural), coo_matrix(features_sentiment), coo_matrix(features_pos),
                             coo_matrix(features_unigram_topic), coo_matrix(features_bigram_topic),
                             coo_matrix(features_reviewer)])

In [24]:
targets = [x[0].label for x in all_reviews]

In [25]:
from sklearn.naive_bayes import MultinomialNB
cnb = MultinomialNB()

In [26]:
from sklearn.model_selection import cross_validate
cross_validate(cnb, predictor_features, targets, cv=5, return_train_score=False)

{'fit_time': array([0.01776314, 0.01655006, 0.01647902, 0.01409626, 0.01379395]),
 'score_time': array([0.00212693, 0.00185037, 0.00241184, 0.0020442 , 0.00201035]),
 'test_score': array([0.60128913, 0.58998771, 0.5970516 , 0.60012285, 0.60042998])}

I do not seem to be able to replicate the results from the paper. Adding more topics for bigrams/unigrams seems to have virtually no difference. Using bigrams alone for lda topics seems to improve by 0.01. By splitting unigrams and bigrams to be modelled by different LDA we can get about + 0.005.

Based on the paper, the possible differences are:

* Structural features - Very clear, difficult to get wrong
* POS Percentages - perhaps another POS tagger could improve results. We are using one from nltk, but there are alternatives
* Semantic features - perhaps another Semantic tagger could improve results, but it's hard to imagine it being that much better at tagging
* Unigram / Bigram features - It's not very obvious exactly what they're doing. Maybe a better LDA topic modeller could be used? The paper claims that high accuracy can be obtained through just unigrams and bigrams, so this could be tried too.
* Reviewer features - all very clear, difficult to get this wrong.

Let's try with just Unigrams/Bigrams to see how accurate we are:

In [37]:
def avg_accuracy(features):
  predictor_features = hstack([coo_matrix(x) for x in features])
  naive_bayes = MultinomialNB()
  fold = 5
  results = cross_validate(naive_bayes, predictor_features, targets, cv=fold, return_train_score=False)
  return sum([x for x in results['test_score']])/fold

In [75]:
avg_accuracy([features_unigram_topic, features_bigram_topic])

0.6111660505306914

Out of interest, let's try training with only our other features to see how predictive they are alone:

In [71]:
avg_accuracy([features_pos])

0.566699735898631

In [72]:
predictor_features_structural_only = hstack([coo_matrix(features_structural)])
cnb_structural_only = MultinomialNB()
cross_validate(cnb_structural_only, predictor_features_structural_only, targets, cv=5, return_train_score=False)
avg_accuracy([features_structural])

0.5937844549723004

In [73]:
avg_accuracy([features_sentiment])

0.5205757364597143

In [74]:
avg_accuracy([features_reviewer])

0.6142977105684289

It appears as though our best predicting features are the reviewer features. Let's try combining them with the next best features, unigrams/bigrams. To avoid swamping the reviewer features, we will reduce our topics to a small number, 10 each.

In [33]:
unigram_topic_features_predictive_maker, get_topic_terms = get_topic_features_maker(all_reviews, num_topics=10)

words = ["Fried", "Chicken", "Pizza", "Slice"]
unigram_topic_features_predictive = unigram_topic_features_predictive_maker(words)
print_topic_terms(unigram_topic_features_predictive, get_topic_terms)

----- Topic 0
dish 0.04223491
flavor 0.011829933
thai 0.011820312
order 0.010700842
dessert 0.010402134
delici 0.010185039
perfect 0.009887086
cook 0.009617706
rice 0.009331908
duck 0.008528771
----- Topic 1
wait 0.027837483
ramen 0.026999397
brunch 0.022529101
pork 0.019509893
noodl 0.016152045
love 0.014694691
soup 0.014545276
egg 0.014516102
come 0.0124907475
delici 0.011654576
----- Topic 2
sandwich 0.026573358
sauc 0.021678047
fri 0.02104345
pancak 0.014249963
pork 0.013925266
chicken 0.012638441
delici 0.0113640465
love 0.011248021
like 0.010264614
meat 0.009770032
----- Topic 3
servic 0.022240961
friend 0.019313317
restaur 0.018040491
love 0.016207548
nice 0.015208652
wine 0.015053305
atmospher 0.013252496
menu 0.012698647
staff 0.012097907
dinner 0.01146572
----- Topic 4
chicken 0.01809943
order 0.017613953
like 0.014749082
come 0.012905807
fri 0.0097111445
tast 0.009418359
taco 0.009069484
drink 0.0084821815
corn 0.007956692
sweet 0.0078054527
----- Topic 5
wait 0.017870381
ti

In [34]:
features_unigram_topic_predictive = [unigram_topic_features_predictive_maker(x[1]) for x in all_reviews]

In [35]:
bigram_topic_features_predictive_maker, get_topic_terms = get_topic_features_maker(all_reviews, num_topics=10,
                                                                                   bigrams=True)

words = ["Fried", "Chicken", "Pizza", "Slice"]
bigram_topic_features_predictive = bigram_topic_features_predictive_maker(words)
print_topic_terms(bigram_topic_features_predictive, get_topic_terms)

----- Topic 0
food good 0.02478526
this place 0.020148406
friday night 0.01274327
good servic 0.01204978
place good 0.009531203
excel food 0.0089609
good food 0.008449192
italian restaur 0.007931107
good friend 0.0072339
servic good 0.0070772353
----- Topic 1
great place 0.02720957
lunch dinner 0.012021749
friend servic 0.009640679
fri chicken 0.009593315
nice place 0.0086197
bread pud 0.007981123
pretti good 0.007927147
good good 0.007824834
food good 0.0076706037
foie gras 0.00701384
----- Topic 2
this place 0.021023594
great price 0.009763191
pita bread 0.009595114
soup dumpl 0.008980876
artichok pizza 0.00883899
credit card 0.008277147
arugula salad 0.008176162
wine select 0.007947528
bake bean 0.0072855265
white sauc 0.007005657
----- Topic 3
tast like 0.012579144
fri chicken 0.012534842
feel like 0.011147479
thai restaur 0.0091396
tast menu 0.0066287043
dish come 0.006467858
food good 0.0063346084
east villag 0.006318108
pretti good 0.006038828
crispi outsid 0.0059070545
----- To

In [36]:
features_bigram_topic_predictive = [bigram_topic_features_maker(x[1]) for x in all_reviews]

In [38]:
avg_accuracy([features_reviewer, features_unigram_topic, features_bigram_topic_predictive])

0.6197030028521742

In [39]:
avg_accuracy([features_reviewer, features_unigram_topic])

0.6224051776537964

In [40]:
avg_accuracy([features_reviewer, features_unigram_topic_predictive])

0.6222206762262011

In [41]:
avg_accuracy([features_reviewer, features_unigram_topic_predictive, features_bigram_topic_predictive])

0.6202558284049997

In [42]:
avg_accuracy([features_reviewer, features_bigram_topic_predictive])

0.6142365494575439

In [43]:
avg_accuracy([features_reviewer, features_bigram_topic])

0.6138679613817736

Let's try forgetting about LDA topic models and using bag of words to generate our features. Unigrams give the best accuracy, even just including bigrams drops the accuracy

In [44]:
corpus = [x[0].review_content for x in all_reviews]

In [204]:
from sklearn.feature_extraction.text import CountVectorizer
unigram_count_vect = CountVectorizer()
features_ngram_bow = unigram_count_vect.fit_transform(corpus)

In [205]:
features = [features_reviewer for i in range(0, 4)]
features.append(features_ngram_bow)
#features.append(features_unigram_topic)
avg_accuracy(features)

0.6695738179163594

Trying it with Tf-idf:

In [69]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
features_ngram_tfidf = tfidf_transformer.fit_transform(features_ngram_bow)

In [159]:
features = [features_sentiment for i in range(0, 9)]
features.append(features_ngram_tfidf)
avg_accuracy(features)

0.6506573499667422