## 2. Sentiment Analysis
In this exercise, we classify the sentiment of text documents. Complete the code wherever it is marked with a TODO tag.

References and Further Readings:
+ http://www.nltk.org/howto/sentiment.html
+ https://www.nltk.org/api/nltk.sentiment.html
+ http://datameetsmedia.com/vader-sentiment-analysis-explained/
+ https://github.com/cjhutto/vaderSentiment
+ https://marcobonzanini.com/2015/05/17/mining-twitter-data-with-python-part-6-sentiment-analysis-basics/
+ https://github.com/marrrcin/ml-twitter-sentiment-analysis


### 2.1. Classification approach

The classification approach uses previously labeled data to determine the sentiment of never-before-seen text. We train a model on labeled text and then use it to predict the sentiment of new input text. With a greater volume of labeled data, prediction results generally improve. However, unlike the lexical approach, this approach requires labeled data.
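
To make the idea of learning from labeled text concrete, here is a minimal sketch of a word-count Naive Bayes classifier in plain Python. It is only an illustration: the function names and the toy training data are made up for this example, and NLTK's `NaiveBayesClassifier` (used in the cells that follow) works on feature dictionaries and differs in its details.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count words per label from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict(tokens, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, not taken from the movie_reviews corpus.
toy_train = [
    (["a", "great", "and", "moving", "film"], "pos"),
    (["great", "acting", "wonderful", "story"], "pos"),
    (["a", "boring", "and", "awful", "film"], "neg"),
    (["awful", "plot", "terrible", "acting"], "neg"),
]
model = train_naive_bayes(toy_train)
print(predict(["wonderful", "film"], model))  # pos
print(predict(["terrible", "film"], model))   # neg
```

With only four training documents the model is fragile; more labeled data gives it more reliable word statistics, which is why this approach tends to improve as the dataset grows.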

In [1]:
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import mark_negation, extract_unigram_feats

nltk.download('vader_lexicon')   # lexicon used by VADER in section 2.2
nltk.download('movie_reviews')   # labeled reviews used for classification

# Total number of documents to load; None loads the whole corpus.
n_instances = None
if n_instances is not None:
    n_instances = n_instances // 2  # half positive, half negative

# Each document is a (list of tokens, label) tuple.
pos_docs = [(list(movie_reviews.words(pos_id)), 'pos') for pos_id in movie_reviews.fileids('pos')[:n_instances]]
neg_docs = [(list(movie_reviews.words(neg_id)), 'neg') for neg_id in movie_reviews.fileids('neg')[:n_instances]]
len(pos_docs), len(neg_docs)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/gu/nltk_data...
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/gu/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


(1000, 1000)

Each document is represented by a tuple (words, label). The review text is tokenized, so it is represented by a list of strings:

In [5]:
pos_docs[0]

(['films',
  'adapted',
  'from',
  'comic',
  'books',
  'have',
  'had',
  'plenty',
  'of',
  'success',
  ',',
  'whether',
  'they',
  "'",
  're',
  'about',
  'superheroes',
  '(',
  'batman',
  ',',
  'superman',
  ',',
  'spawn',
  ')',
  ',',
  'or',
  'geared',
  'toward',
  'kids',
  '(',
  'casper',
  ')',
  'or',
  'the',
  'arthouse',
  'crowd',
  '(',
  'ghost',
  'world',
  ')',
  ',',
  'but',
  'there',
  "'",
  's',
  'never',
  'really',
  'been',
  'a',
  'comic',
  'book',
  'like',
  'from',
  'hell',
  'before',
  '.',
  'for',
  'starters',
  ',',
  'it',
  'was',
  'created',
  'by',
  'alan',
  'moore',
  '(',
  'and',
  'eddie',
  'campbell',
  ')',
  ',',
  'who',
  'brought',
  'the',
  'medium',
  'to',
  'a',
  'whole',
  'new',
  'level',
  'in',
  'the',
  'mid',
  "'",
  '80s',
  'with',
  'a',
  '12',
  '-',
  'part',
  'series',
  'called',
  'the',
  'watchmen',
  '.',
  'to',
  'say',
  'moore',
  'and',
  'campbell',
  'thoroughly',
  'researche

We split the positive and negative instances separately to keep a balanced class distribution in both the training and test sets.
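
As a standalone illustration of this per-class split, the sketch below wraps it in a small helper with a fixed random seed so that the split is reproducible. The helper name `balanced_split`, its parameters, and the toy data are assumptions made for this example; they are not part of NLTK or of the notebook's code.

```python
import random

def balanced_split(docs_by_label, train_fraction=0.8, seed=0):
    """Split each class separately so train and test keep the class balance."""
    rng = random.Random(seed)  # local generator: results do not depend on global state
    train, test = [], []
    for label, docs in docs_by_label.items():
        docs = list(docs)      # copy so the caller's list is not reordered
        rng.shuffle(docs)
        cut = int(len(docs) * train_fraction)
        train.extend(docs[:cut])
        test.extend(docs[cut:])
    return train, test

# Hypothetical toy data: 10 documents per class.
toy = {
    "pos": [(["good"], "pos")] * 10,
    "neg": [(["bad"], "neg")] * 10,
}
train, test = balanced_split(toy)
print(len(train), len(test))  # 16 4
```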

In [27]:
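import random

# Both classes must have the same size so that one split index works for both.
assert len(pos_docs) == len(neg_docs)
split = int(len(pos_docs) * 0.8)  # 80% training, 20% test

# Note: only the positive documents are shuffled; neg_docs keeps corpus order.
random.shuffle(pos_docs)
train_pos_docs = pos_docs[:split]
test_pos_docs = pos_docs[split:]

train_neg_docs = neg_docs[:split]
test_neg_docs = neg_docs[split:]

training_docs = train_pos_docs + train_neg_docs
testing_docs = test_pos_docs + test_neg_docs
sentim_analyzer = SentimentAnalyzer()
# Collect all training words after marking negated tokens with a _NEG suffix.
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

all_words_neg[:5]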

['it', 'may', 'seem', 'weird', 'to']
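
The `mark_negation` call above appends `_NEG` to the words that follow a negation, up to the next clause-ending punctuation mark. The sketch below imitates that idea in plain Python. It is a deliberately simplified stand-in: the negation word list and the function name `mark_negation_sketch` are assumptions for this example, and NLTK's own rules are broader.

```python
import re

# Deliberately small, hypothetical negation list; NLTK recognizes many more forms.
NEGATION_WORDS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"^[.:;!?]$")

def mark_negation_sketch(tokens):
    """Append _NEG to tokens between a negation word and the end of its clause."""
    marked, in_scope = [], False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            in_scope = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATION_WORDS or tok.endswith("n't"):
            in_scope = True           # the negation word itself is left unchanged
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked

print(mark_negation_sketch("i did not like this movie . it was fine".split()))
# ['i', 'did', 'not', 'like_NEG', 'this_NEG', 'movie_NEG', '.', 'it', 'was', 'fine']
```

Marking negated words this way lets the classifier treat `like` and `like_NEG` as different features.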

We use simple unigram word features, handling negation:

In [28]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
print(len(unigram_feats))
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

14690


We apply features to obtain a feature-value representation of our datasets:

In [29]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
print(training_set[0])
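
To show what a feature-value representation looks like, here is a minimal sketch that turns a token list into a dictionary of boolean "contains" features. The function name `unigram_features` is made up for this example, and the `contains(word)` key format is our understanding of what `extract_unigram_feats` produces rather than something shown in this notebook's output.

```python
def unigram_features(tokens, unigrams):
    """Map each selected unigram to True/False depending on its presence."""
    present = set(tokens)
    return {"contains({})".format(word): (word in present) for word in unigrams}

features = unigram_features(["a", "great", "film"], ["great", "awful"])
print(features)  # {'contains(great)': True, 'contains(awful)': False}
```

Each document therefore becomes a fixed-size dictionary with one entry per selected unigram, regardless of the document's length.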



We can now train our classifier on the training set, and subsequently output the evaluation results:

In [30]:
naive_bayes = NaiveBayesClassifier.train
clf = sentim_analyzer.train(naive_bayes, training_set)
sentim_analyzer.evaluate(test_set).items()

Training classifier
Evaluating NaiveBayesClassifier results...


dict_items([('Accuracy', 0.82), ('Precision [pos]', 0.867816091954023), ('Recall [pos]', 0.755), ('F-measure [pos]', 0.8074866310160427), ('Precision [neg]', 0.7831858407079646), ('Recall [neg]', 0.885), ('F-measure [neg]', 0.8309859154929579)])
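
The reported numbers can be cross-checked by hand. Assuming the test set holds 200 positive and 200 negative reviews (the 20% left over from 1000 documents per class), the recall values imply the confusion counts used below. These counts are inferred from the reported metrics, not printed by the notebook, and the helper `precision_recall_f` is a name chosen for this sketch.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

n_pos = n_neg = 200                # assumed test-set size per class
tp_pos = round(0.755 * n_pos)      # 151 positive reviews classified as 'pos'
tp_neg = round(0.885 * n_neg)      # 177 negative reviews classified as 'neg'
fn_pos = n_pos - tp_pos            # 49 positive reviews classified as 'neg'
fn_neg = n_neg - tp_neg            # 23 negative reviews classified as 'pos'

accuracy = (tp_pos + tp_neg) / (n_pos + n_neg)
p_pos, r_pos, f_pos = precision_recall_f(tp_pos, fp=fn_neg, fn=fn_pos)
p_neg, r_neg, f_neg = precision_recall_f(tp_neg, fp=fn_pos, fn=fn_neg)

print(accuracy)             # 0.82
print(p_pos, r_pos, f_pos)  # approximately 0.8678, 0.755, 0.8075
print(p_neg, r_neg, f_neg)  # approximately 0.7832, 0.885, 0.8310
```

Under these assumptions the classifier misses more positive reviews (49) than negative ones (23), which is why recall is lower for `pos` than for `neg`.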

### 2.2. Lexical approach

Lexical approaches map words to sentiment by building a lexicon, or 'dictionary of sentiment'. We can use this dictionary to assess the sentiment of phrases and sentences without looking at any other data. Sentiment can be categorical, such as {negative, neutral, positive}, or numerical, such as a range of intensities or scores. A lexical approach looks up the sentiment category or score of each word in a sentence and combines them to decide the sentiment of the whole sentence. The strength of lexical approaches is that no model needs to be trained on labeled data: everything needed to assess a sentence is already in the lexicon. VADER is an example of a lexical method.
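
As a rough sketch of the lexical idea, the code below sums per-word scores from a tiny hand-made lexicon and maps the total to a category. The lexicon entries and function names are hypothetical and chosen only for illustration. VADER is considerably richer: it also applies rules for negation, intensifiers, punctuation and capitalization, and normalizes its compound score to the range [-1, 1].

```python
# Hypothetical toy lexicon; real lexicons such as VADER's contain thousands of entries.
TOY_LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}

def lexicon_score(tokens, lexicon=TOY_LEXICON):
    """Sum the scores of all tokens; words missing from the lexicon count as 0."""
    return sum(lexicon.get(tok.lower(), 0.0) for tok in tokens)

def lexicon_label(tokens, lexicon=TOY_LEXICON):
    """Turn the numerical score into a categorical sentiment."""
    score = lexicon_score(tokens, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_score("a great film with a bad ending".split()))  # 1.0
print(lexicon_label("a great film with a bad ending".split()))  # positive
print(lexicon_label("the plot was awful".split()))              # negative
print(lexicon_label("the film runs two hours".split()))         # neutral
```

No training step appears anywhere in this sketch: the only "knowledge" is the lexicon itself.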

In [31]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Run the lexical approach on the test documents:

In [32]:
sid = SentimentIntensityAnalyzer()
for doc in testing_docs:
    doc = " ".join(doc[0])
    print(doc[:100] + "...")
    ss = sid.polarity_scores(doc)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

i must admit i ' m going to be a bit biased in my review of the new romantic comedy serendipity , be...
compound: 0.9973, neg: 0.054, neu: 0.765, pos: 0.181, 
you ' ve got mail is a timely romance for this impersonal , computer - driven decade . two people wh...
compound: 0.9903, neg: 0.039, neu: 0.834, pos: 0.127, 
" being john malkovich " is the type of film we need to see more . today ' s films are either blockb...
compound: 0.9971, neg: 0.045, neu: 0.773, pos: 0.182, 
in a flashback , the teenage girl in the eccentric family in the canadian film , the hanging garden ...
compound: 0.999, neg: 0.079, neu: 0.713, pos: 0.209, 
" when it ' s cold , molecules aren ' t moving . everything is clean . " these are the essential wor...
compound: 0.9395, neg: 0.071, neu: 0.836, pos: 0.093, 
a movie that ' s been as highly built up as the truman show , with reviews boasting , " the film of ...
compound: 0.9924, neg: 0.039, neu: 0.826, pos: 0.135, 
since their film debut in 1984 with the tightly

compound: 0.9964, neg: 0.068, neu: 0.798, pos: 0.133, 
national lampoon ' s animal house , made in 1978 and set in 1962 , remains one of the -- no , fuck t...
compound: -0.9536, neg: 0.134, neu: 0.767, pos: 0.099, 
one can not observe a star trek movie and expect to see serious science fiction . the purpose of sta...
compound: 0.9985, neg: 0.081, neu: 0.716, pos: 0.204, 
the event of events is upon us . people have waited twenty - two years for the prequel to star wars ...
compound: 0.9784, neg: 0.105, neu: 0.769, pos: 0.126, 
50 ' s ` aliens vs . earth ' idea revamped ! if you have been following the movie news over the net ...
compound: -0.9751, neg: 0.111, neu: 0.811, pos: 0.079, 
i don ' t box with kid gloves . i don ' t play nice , i ' m not a nice guy , and i never , ever , go...
compound: 0.9791, neg: 0.117, neu: 0.728, pos: 0.156, 
the most amazing thing about paul cox ' s innocence is how unlike a movie it is . i mean that as the...
compound: 0.9972, neg: 0.049, neu: 0.781, po

compound: 0.9997, neg: 0.081, neu: 0.704, pos: 0.215, 
jean - luc picard ( patrick stewart ) and the rest of the crew of the u . s . s . enterprise are bac...
compound: 0.9703, neg: 0.074, neu: 0.818, pos: 0.109, 
118 minutes ; not rated ( though i suspect it would be rated pg for adult themes and language ) mamo...
compound: 0.9957, neg: 0.068, neu: 0.821, pos: 0.112, 
review - peter jackson ' s the frighteners has received some notice for setting the record for most ...
compound: -0.8868, neg: 0.189, neu: 0.636, pos: 0.175, 
marie ( charlotte rampling , " aberdeen " ) and jean ( bruno cremet , " sorcerer " ) are a comfortab...
compound: 0.8318, neg: 0.102, neu: 0.782, pos: 0.116, 
the saint was actually a little better than i expected it to be , in some ways . in this theatrical ...
compound: 0.9394, neg: 0.074, neu: 0.828, pos: 0.098, 
not since attending an ingmar bergman retrospective a few years ago have i seen a film as uncompromi...
compound: 0.9914, neg: 0.066, neu: 0.797, pos

compound: 0.9985, neg: 0.042, neu: 0.821, pos: 0.137, 
usually a movie is about something more than a soiled rug , but not the big lebowski . the new offer...
compound: 0.9966, neg: 0.06, neu: 0.704, pos: 0.236, 
there were four movies that earned jamie lee curtis the title of " scream queen " in the early ' 80s...
compound: -0.9919, neg: 0.18, neu: 0.696, pos: 0.125, 
according to hitchcock and various other filmmakers , isolated motels , diners , gas stations and si...
compound: 0.8968, neg: 0.096, neu: 0.761, pos: 0.142, 
if you ' ve been following william fichtner ' s career ( and there ' s absolutely no reason why you ...
compound: -0.8932, neg: 0.146, neu: 0.741, pos: 0.113, 
note : some may consider portions of the following text to be spoilers . be forewarned . it ' s rath...
compound: 0.9919, neg: 0.119, neu: 0.717, pos: 0.164, 
for his directoral debut , gary oldman chose a highly personal family drama about a violent , alcoho...
compound: -0.9897, neg: 0.159, neu: 0.754, pos

compound: 0.9964, neg: 0.114, neu: 0.701, pos: 0.184, 
how could a g - rated disney film based on meg cabot ' s novel " the princess diaries " be anything ...
compound: 0.9977, neg: 0.103, neu: 0.69, pos: 0.207, 
the corruptor is a big silly mess of an action movie , complete with pointless plot turns and gratui...
compound: 0.9868, neg: 0.089, neu: 0.777, pos: 0.134, 
" return to horror high , " wants to be a couple different types of movies at once . the film tells ...
compound: -0.9949, neg: 0.193, neu: 0.724, pos: 0.083, 
the art of woo attempts to be one of those films like breakfast at tiffany ' s in which the audience...
compound: 0.9886, neg: 0.07, neu: 0.774, pos: 0.156, 
capsule : a science fiction allegory . at the millennium a lethal contagious virus has hit taiwan . ...
compound: -0.9786, neg: 0.169, neu: 0.762, pos: 0.068, 
seen december 2 , 1997 at 6 : 50 p . m . at the glenwood movieplex cinemas ( oneida , ny ) , theater...
compound: -0.8219, neg: 0.133, neu: 0.74, pos:

compound: 0.8147, neg: 0.137, neu: 0.72, pos: 0.142, 
words i thought i ' d never write : the sequel to urban legend lacks the grace , wit , and power of ...
compound: -0.9979, neg: 0.252, neu: 0.671, pos: 0.077, 
godzilla is the ultimate culmination of the " who cares about plot " summer movie . a loose remake o...
compound: 0.9871, neg: 0.103, neu: 0.764, pos: 0.132, 
recently one night a young director named baz luhrmann couldn ' t sleep . he tumbled out of bed and ...
compound: 0.815, neg: 0.084, neu: 0.811, pos: 0.106, 
in times of crisis people are driven to desperate measures . of course what constitutes a crisis dif...
compound: -0.9856, neg: 0.142, neu: 0.756, pos: 0.101, 
" virus " is a monster movie without a monster . any movie with a hurdle that large to overcome had ...
compound: 0.8922, neg: 0.09, neu: 0.807, pos: 0.103, 
dr . alan grant ( sam neill , " jurassic park " ) is becoming disillusioned . paleontology is no lon...
compound: 0.9893, neg: 0.069, neu: 0.803, pos: 

### 2.3. Comparing the two approaches

First, we convert the compound score produced by the lexical approach into a label using the following rules:

+ positive sentiment: compound score > 0
+ negative sentiment: compound score <= 0

In [35]:
def lexical_sentiment(doc, sid=None):
    if sid is None: 
        sid = SentimentIntensityAnalyzer()
    ss = sid.polarity_scores(doc)
    if ss["compound"] > 0:
        return 'pos'
    return 'neg'

for doc in testing_docs:
    doc = " ".join(doc[0])
    label = lexical_sentiment(doc, sid)
    print(doc[:100] + "...", label)

i must admit i ' m going to be a bit biased in my review of the new romantic comedy serendipity , be... pos
you ' ve got mail is a timely romance for this impersonal , computer - driven decade . two people wh... pos
" being john malkovich " is the type of film we need to see more . today ' s films are either blockb... pos
in a flashback , the teenage girl in the eccentric family in the canadian film , the hanging garden ... pos
" when it ' s cold , molecules aren ' t moving . everything is clean . " these are the essential wor... pos
a movie that ' s been as highly built up as the truman show , with reviews boasting , " the film of ... pos
since their film debut in 1984 with the tightly wrought texas thriller " blood simple , " joel and e... pos
bruce lee was a bigger - than - life martial artist ( and ) actor . bruce ' s unique character ( i .... pos
while screen adaptations of john irving ' s novels have been disappointingly uneven , the films have... pos
for a movie about disco - er

probably the most popular and praised film of all time , turned out to be a primitive and predictabl... pos
the most common ( and in many cases the only ) complaint against francis ford coppola ' s 1972 maste... pos
aggressive , bleak , and unrelenting film about an interracial couple , steve and sam ( damon jones ... neg
bob the happy bastard ' s quickie review : the mummy brendan fraser ' s stuck in the past again , bu... pos
directed by : pixote hunt , hendel butoy , eric goldberg , james algar , francis glebas , gaetan & p... pos
not since 1996 ' s shine , which starred geoffrey rush as pianist david helfgott , has a movie so de... pos
when people are talking about good old times , they actually want to make some bad times look better... neg
being the self - proclaimed professional film critic that i am , i am somewhat embarrassed to admit ... pos
on seeing the outrageous previews for bulworth one wonders what plot could possibly allow beatty get... pos
playwright tom stoppard and 

and i thought " stigmata " would be the worst religiously - oriented thriller released this year . t... neg
stephen , please post if appropriate . " mafia ! " - crime isn ' t a funny business by homer yen ( c... pos
die hard 2 is an altogether unfortunate fiasco , inferior to the original in every respect . place t... neg
i admit it . i thought arnold schwarzenegger had a knack for comedy when he made twins and true lies... neg
when the haunting arrived in theaters , all i kept hearing about was the overdone special effects an... neg
plot : set in the future , a courier has uploaded some data into a " hard drive " that resides in hi... neg
i am a steven seagal fan . i only say this now because " mufti splenetik " isn ' t my real name and ... neg
" the 13th warrior " comes at the end of as summer where we ' ve already experienced man eating shar... neg
when i watch a movie like mike nichols ' what planet are you from ? i can ' t help but feel like eve... pos
it seems like i ' m reviewin

writing a screenplay for a thriller is hard . harder than pouring concrete under the texas sun . har... pos
all right , all right , we get the point : despite all similarities to the best - selling story , sp... pos
teenagers have a lot of power in hollywood . every year countless films will be made targeting that ... pos
walken stars as a mobster who is kidnapped and held for ransom by four bratty rich kids . it seems t... pos
> from writer and director darren stein comes jawbreaker , the poorly told tale of what can happen w... neg
well there goes another one . sadly this like other movies this year wasn ' t good . this one being ... pos
my inner flag was at half - mast last year when nick at nite pulled " dragnet " reruns off the air .... neg
frank detorri ' s ( bill murray ) a single dad who lives on beer and junk food with no apparent unde... pos
woof ! too bad that leap of faith was the title of a 1992 comedy starring steve martin and debra win... neg
there ' s only one president

Now we evaluate the lexical approach by computing accuracy, precision, recall, and F-measure:

In [37]:
from collections import defaultdict
from nltk.metrics import (accuracy as eval_accuracy, precision as eval_precision,
        recall as eval_recall, f_measure as eval_f_measure)

# For each label, collect the indices of documents with that gold / predicted label.
gold_results = defaultdict(set)
test_results = defaultdict(set)
# Parallel label lists, used for accuracy.
acc_gold_results = []
acc_test_results = []
labels = set()
for i, (text, label) in enumerate(testing_docs):
    labels.add(label)
    gold_results[label].add(i)
    acc_gold_results.append(label)
    observed = lexical_sentiment(" ".join(text), sid)
    acc_test_results.append(observed)
    test_results[observed].add(i)

metrics_results = {}
metrics_results["Accuracy"] = eval_accuracy(acc_gold_results, acc_test_results)

for label in labels:
    metrics_results["F-measure [" + label + "]"] = eval_f_measure(gold_results[label], test_results[label])
    metrics_results["Precision [" + label + "]"] = eval_precision(gold_results[label], test_results[label])
    metrics_results["Recall [" + label + "]"] = eval_recall(gold_results[label], test_results[label])

for result in sorted(metrics_results):
    print('{0}: {1}'.format(result, metrics_results[result]))

Accuracy: 0.6575
F-measure [neg]: 0.5678233438485805
F-measure [pos]: 0.7163561076604555
Precision [neg]: 0.7692307692307693
Precision [pos]: 0.6113074204946997
Recall [neg]: 0.45
Recall [pos]: 0.865
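
The evaluation code above compares, for each label, the set of document indices whose gold label is that label with the set of indices predicted as that label. The sketch below reproduces this set-based computation in plain Python on a hypothetical five-document example. The helper names `set_precision` and `set_recall` are illustrative and are not NLTK functions.

```python
from collections import defaultdict

def set_precision(reference, test):
    """Fraction of predicted items that are correct."""
    return len(reference & test) / len(test) if test else None

def set_recall(reference, test):
    """Fraction of gold items that were found."""
    return len(reference & test) / len(reference) if reference else None

# Hypothetical gold and predicted labels for five documents.
gold = ["pos", "pos", "neg", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg", "pos"]

gold_sets, pred_sets = defaultdict(set), defaultdict(set)
for i, (g, p) in enumerate(zip(gold, pred)):
    gold_sets[g].add(i)   # documents that truly have label g
    pred_sets[p].add(i)   # documents predicted to have label p

for label in sorted(gold_sets):
    print(label,
          set_precision(gold_sets[label], pred_sets[label]),
          set_recall(gold_sets[label], pred_sets[label]))
# neg: precision 2/3, recall 2/3
# pos: precision 0.5, recall 0.5
```

Read this way, the lexical results above say that 45% of the negative reviews were labeled `neg`, while 86.5% of the positive reviews were labeled `pos`, so the `compound > 0` rule leans toward the positive label on this corpus.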
