## 2. Sentiment Analysis
In this exercise, we will classify the sentiment of text documents. Complete the code with TODO tag.

References and Further Readings:
+ http://www.nltk.org/howto/sentiment.html
+ https://www.nltk.org/api/nltk.sentiment.html
+ http://datameetsmedia.com/vader-sentiment-analysis-explained/
+ https://github.com/cjhutto/vaderSentiment
+ https://marcobonzanini.com/2015/05/17/mining-twitter-data-with-python-part-6-sentiment-analysis-basics/
+ https://github.com/marrrcin/ml-twitter-sentiment-analysis


### 2.1. Classification approach

Classification approach looks at previously labeled data in order to determine the sentiment of never-before-seen sentences. It involves training a model using previously seen text to predict/classify the sentiment of some new input text. The nice thing is that, with a greater volume of data, we generally get better prediction or classification results. However, unlike the lexical approach, we need previously labeled data.

In [1]:
import nltk
from nltk.classify import NaiveBayesClassifier
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
from nltk.corpus import subjectivity
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

n_instances = None
if n_instances is not None:
    n_instances = int(n_instances/2)

pos_docs = [(list(movie_reviews.words(pos_id)), 'pos') for pos_id in movie_reviews.fileids('pos')[:n_instances]]
neg_docs = [(list(movie_reviews.words(neg_id)), 'neg') for neg_id in movie_reviews.fileids('neg')[:n_instances]]
len(pos_docs), len(neg_docs)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


(1000, 1000)

Each document is represented by a tuple (sentence, label). The sentence is tokenized, so it is represented by a list of strings:

In [2]:
pos_docs[0]

(['films',
  'adapted',
  'from',
  'comic',
  'books',
  'have',
  'had',
  'plenty',
  'of',
  'success',
  ',',
  'whether',
  'they',
  "'",
  're',
  'about',
  'superheroes',
  '(',
  'batman',
  ',',
  'superman',
  ',',
  'spawn',
  ')',
  ',',
  'or',
  'geared',
  'toward',
  'kids',
  '(',
  'casper',
  ')',
  'or',
  'the',
  'arthouse',
  'crowd',
  '(',
  'ghost',
  'world',
  ')',
  ',',
  'but',
  'there',
  "'",
  's',
  'never',
  'really',
  'been',
  'a',
  'comic',
  'book',
  'like',
  'from',
  'hell',
  'before',
  '.',
  'for',
  'starters',
  ',',
  'it',
  'was',
  'created',
  'by',
  'alan',
  'moore',
  '(',
  'and',
  'eddie',
  'campbell',
  ')',
  ',',
  'who',
  'brought',
  'the',
  'medium',
  'to',
  'a',
  'whole',
  'new',
  'level',
  'in',
  'the',
  'mid',
  "'",
  '80s',
  'with',
  'a',
  '12',
  '-',
  'part',
  'series',
  'called',
  'the',
  'watchmen',
  '.',
  'to',
  'say',
  'moore',
  'and',
  'campbell',
  'thoroughly',
  'researche

We separately split subjective and objective instances to keep a balanced uniform class distribution in both train and test sets.

In [3]:
# TODO: split training and testing data as 80/20
train_pos_docs = pos_docs[:800]
test_pos_docs = pos_docs[800:]

train_neg_docs = pos_docs[:800]
test_neg_docs = pos_docs[800:]


training_docs = train_pos_docs+train_neg_docs
testing_docs = test_pos_docs+test_neg_docs
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
all_words_neg[:5]

['films', 'adapted', 'from', 'comic', 'books']

We use simple unigram word features, handling negation:

In [4]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
print(len(unigram_feats))
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

14332


We apply features to obtain a feature-value representation of our datasets:

In [5]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
print(training_set[0])



We can now train our classifier on the training set, and subsequently output the evaluation results:

In [6]:
# TODO: Use Naive Bayes to train the sentiment classifier

naive_bayes = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(naive_bayes, training_set)
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 1.0
F-measure [pos]: 1.0
Precision [pos]: 1.0
Recall [pos]: 1.0


### 2.2. Lexical approach

Lexical approaches aim to map words to sentiment by building a lexicon or a 'dictionary of sentiment'. We can use this dictionary to assess the sentiment of phrases and sentences, without the need of looking at anything else. Sentiment can be categorical – such as {negative, neutral, positive} – or it can be numerical – like a range of intensities or scores. Lexical approaches look at the sentiment category or score of each word in the sentence and decide what the sentiment category or score of the whole sentence is. The power of lexical approaches lies in the fact that we do not need to train a model using labeled data, since we have everything we need to assess the sentiment of sentences in the dictionary of emotions. VADER is an example of a lexical method.

In [7]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Run the lexical approach

In [8]:
sid = SentimentIntensityAnalyzer()
for doc in testing_docs:
    doc = " ".join(doc[0])
    print(doc[:100] + "...")
    ss = sid.polarity_scores(doc)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

those of you who frequently read my reviews are not likely to be surprised by the fact that i have n...
compound: 0.9948, neg: 0.041, neu: 0.813, pos: 0.146, 
this is a stagy film adapted from roger rueff ' s play " hospitality suite . '' its about three trav...
compound: -0.3452, neg: 0.088, neu: 0.825, pos: 0.087, 
the coen brothers are back again , this time with homer ' s " odyssey " as the backdrop in their tal...
compound: 0.9975, neg: 0.067, neu: 0.795, pos: 0.138, 
contact ( pg ) there ' s a moment late in robert zemeckis ' s contact where i was reminded of why i ...
compound: 0.9989, neg: 0.053, neu: 0.776, pos: 0.171, 
satirical films usually fall into one of two categories : 1 ) long - term satire where everything , ...
compound: 0.9977, neg: 0.06, neu: 0.783, pos: 0.156, 
after sixteen years francis ford copolla has again returned to his favorite project , making the thi...
compound: 0.9913, neg: 0.131, neu: 0.695, pos: 0.174, 
bruce willis is a type - casted actor . in die

compound: 0.999, neg: 0.099, neu: 0.733, pos: 0.168, 
not since attending an ingmar bergman retrospective a few years ago have i seen a film as uncompromi...
compound: 0.9914, neg: 0.066, neu: 0.797, pos: 0.137, 
after a successful run in australia last year , and with much critical praise heaped upon it , murie...
compound: 0.9941, neg: 0.1, neu: 0.695, pos: 0.205, 
every once in a while , a film sneaks up on me and takes me completely by surprise . i don ' t neces...
compound: 0.9986, neg: 0.064, neu: 0.752, pos: 0.184, 
" through a spyglass , i could see everything . " king louis xvi was beheaded on january 21 , 1793 ,...
compound: 0.993, neg: 0.095, neu: 0.762, pos: 0.144, 
wong kar - wei ' s " fallen angels " is , on a purely visceral level , one of the most exciting film...
compound: 0.8237, neg: 0.108, neu: 0.762, pos: 0.13, 
city of angels is the kind of love story that i enjoy the most : thought - provoking , moving , and ...
compound: 0.9993, neg: 0.046, neu: 0.765, pos: 0.18

compound: -0.4945, neg: 0.148, neu: 0.697, pos: 0.155, 
ingredients : down - on - his - luck evangelist , church synopsis : sonny dewey ( robert duvall ) is...
compound: 0.9901, neg: 0.065, neu: 0.801, pos: 0.134, 
the most amazing thing about paul cox ' s innocence is how unlike a movie it is . i mean that as the...
compound: 0.9966, neg: 0.05, neu: 0.787, pos: 0.164, 
contact is a film that tries to do several different things . it is intended to present a realistic ...
compound: 0.997, neg: 0.041, neu: 0.827, pos: 0.132, 
billy bob thornton , who had a sudden rise to fame with 1996 ' s sling blade after spending years as...
compound: -0.9445, neg: 0.172, neu: 0.67, pos: 0.158, 
tibet has entered the american consciousness slowly during the past few years and burst into the for...
compound: 0.9413, neg: 0.07, neu: 0.848, pos: 0.082, 
i swear i have seen the edge before . in fact , it reminded me of the bear , the river wild , and ot...
compound: 0.9995, neg: 0.062, neu: 0.752, pos: 0

compound: -0.3034, neg: 0.133, neu: 0.731, pos: 0.136, 
i rented this movie with very high hopes . this movie got praise as one of the best films of 1998 , ...
compound: 0.998, neg: 0.07, neu: 0.745, pos: 0.185, 
few films in 1999 have divided the critical consensus as sharply as alan parker ' s adaptation of fr...
compound: 0.9918, neg: 0.101, neu: 0.751, pos: 0.148, 
now that " boogie nights " has made disco respectable again ( well , fashionable at least ) , we sho...
compound: 0.9981, neg: 0.026, neu: 0.784, pos: 0.19, 
this has been an extraordinary year for australian films . " shine " has just scooped the pool at th...
compound: 0.9956, neg: 0.066, neu: 0.782, pos: 0.152, 
i think the first thing this reviewer should mention is wether or not i am a fan of the x - files . ...
compound: 0.9985, neg: 0.078, neu: 0.777, pos: 0.145, 
trees lounge is the directoral debut from one of my favorite actors , steve buscemi . he gave memora...
compound: 0.9748, neg: 0.077, neu: 0.749, pos: 0

compound: -0.9976, neg: 0.222, neu: 0.693, pos: 0.086, 
assume nothing . the phrase is perhaps one of the most used of the 1990 ' s , as first impressions a...
compound: 0.9986, neg: 0.093, neu: 0.702, pos: 0.205, 
the question isn ' t why has grease been reissued . the answer to that one is easy : to celebrate th...
compound: 0.9968, neg: 0.055, neu: 0.78, pos: 0.165, 
let ' s face it : since waterworld floated by , the summer movie season has grown * very * stale . w...
compound: 0.9323, neg: 0.1, neu: 0.724, pos: 0.176, 
if beavis and butthead had a favorite movie , from dusk till dawn would probably be it . scripted by...
compound: 0.9367, neg: 0.075, neu: 0.827, pos: 0.098, 
the small - scale film , in limited release , " waking ned devine " is a pleasant excursion to a tim...
compound: 0.9939, neg: 0.018, neu: 0.808, pos: 0.174, 
warren beatty ' s " bulworth " is a caustic political comedy that doesn ' t attack any particular po...
compound: -0.996, neg: 0.151, neu: 0.75, pos: 0.

compound: 0.9981, neg: 0.078, neu: 0.715, pos: 0.207, 
the only historical figure that has been written about more than william shakespeare is jesus christ...
compound: 0.9991, neg: 0.055, neu: 0.674, pos: 0.271, 
in 1912 , a ship set sail on her maiden voyage across the atlantic for america . this ship was built...
compound: 0.9985, neg: 0.048, neu: 0.808, pos: 0.144, 
the start of this movie reminded me of parts from the movie stargate . people are looking around in ...
compound: -0.8486, neg: 0.12, neu: 0.777, pos: 0.103, 
note : some may consider portions of the following text to be spoilers . be forewarned . james tobac...
compound: 0.9976, neg: 0.064, neu: 0.81, pos: 0.126, 
robert altman ' s cookie ' s fortune is that rare movie that does not depend on sentimentality to be...
compound: 0.9979, neg: 0.106, neu: 0.685, pos: 0.209, 
well i ' ll be damned . . . the canadians can make a good movie . the world is coming to an end . we...
compound: 0.9149, neg: 0.078, neu: 0.815, pos: 

compound: -0.9761, neg: 0.126, neu: 0.767, pos: 0.106, 
the latest epos from lars is a blast , although a rather moody one . this lovestory is situated in a...
compound: 0.9841, neg: 0.123, neu: 0.712, pos: 0.166, 
anastasia contains something that has been lacking from all of the recent disney releases . . . ( es...
compound: 0.9962, neg: 0.064, neu: 0.767, pos: 0.17, 
i can hear the question already . what on earth do these two movies have in common ? to most people ...
compound: 0.9994, neg: 0.065, neu: 0.712, pos: 0.223, 
you ' ve got to think twice before you go see a movie with a title like maximum risk . the title is ...
compound: 0.9972, neg: 0.1, neu: 0.734, pos: 0.166, 
in many ways , " twotg " does for tough - guy movies what la confidential did for police stories . t...
compound: 0.8662, neg: 0.123, neu: 0.743, pos: 0.134, 
marie ( charlotte rampling , " aberdeen " ) and jean ( bruno cremet , " sorcerer " ) are a comfortab...
compound: 0.772, neg: 0.102, neu: 0.783, pos: 0.

### 2.3 Comparing two approaches

First we can transform the sentiment score by the lexical approach into label by the following rules:

+ positive sentiment: compound score > 0
+ negative sentiment: compound score <= 0

In [9]:
def lexical_sentiment(doc, sid=None):
    """TODO: return the label 'pos' or 'neg' for a document"""
    
    if sid is None: 
        sid = SentimentIntensityAnalyzer()
    ss = sid.polarity_scores(doc)
    if ss["compound"] <= 0:
        return 'neg'
    
    return 'pos'

for doc in testing_docs:
    doc = " ".join(doc[0])
    label = lexical_sentiment(doc, sid)
    print(doc[:100] + "...", label)

those of you who frequently read my reviews are not likely to be surprised by the fact that i have n... pos
this is a stagy film adapted from roger rueff ' s play " hospitality suite . '' its about three trav... neg
the coen brothers are back again , this time with homer ' s " odyssey " as the backdrop in their tal... pos
contact ( pg ) there ' s a moment late in robert zemeckis ' s contact where i was reminded of why i ... pos
satirical films usually fall into one of two categories : 1 ) long - term satire where everything , ... pos
after sixteen years francis ford copolla has again returned to his favorite project , making the thi... pos
bruce willis is a type - casted actor . in die hard , he played john mcclaine , a rough and tough ch... pos
there must be some unwritten rule that states , one gets enlightenment not in the way one expects to... pos
did claus von bulow try to kill his wife sunny in their newport mansion ? that is the question rever... neg
driving miss daisy takes its

let me open this one with a confession : i love cop movies . i adore them with such an unwavering , ... pos
carla gugino graduates from high school and instead of staying in her small farming town , she goes ... pos
the soldiers of three kings have taken their cue from movies about vietnam . ( fitting , since the m... neg
bill condon ' s " gods and monsters " is a fascinating look into the last days in the life of gay di... pos
i must admit that i was a tad skeptical of " good will hunting " , based both on the previews and th... pos
a cinematic version of one of john irving ' s novels is always cause for celebration even if , as in... pos
the only historical figure that has been written about more than william shakespeare is jesus christ... pos
in 1912 , a ship set sail on her maiden voyage across the atlantic for america . this ship was built... pos
the start of this movie reminded me of parts from the movie stargate . people are looking around in ... neg
note : some may consider por

( note : there are spoilers regarding the film ' s climax ; the election , of course ) we see matthe... pos
martin scorsese ' s films used to intimidate me . because of his reputation , i felt obligated to ap... pos
robert redford ' s a river runs through it is not a film i watch often . it is a masterpiece -- one ... pos
richard gere is not one of my favorite actors . however , i like courtroom dramas , and this film lo... pos
" when you get out of jail , you can kill him . " starring ashley judd , tommy lee jones , bruce gre... neg
let ' s say you live at the end of an airport runway . large jetliners continuously pass over your h... pos
in this good natured , pleasent and easy going comedy , bill murray ( ghostbusters , 1984 ) plays gr... pos
" no man is an island , " one character quotes john donne in apt pupil , effectively summarizing the... neg
i rented this movie with very high hopes . this movie got praise as one of the best films of 1998 , ... pos
few films in 1999 have divid

what i look for in a movie is not necessarily perfection . sometimes a movie has such strong ideas t... neg
mike myers , you certainly did throw us a ? frickin ' bone here in what you call ? the biggest austi... neg
it is with hesitance that i call " apocalypse now " a masterpiece . certainly , it had the pedigree ... pos
the verdict : spine - chilling drama from horror maestro stephen king , featuring an outstanding , o... neg
an indian runner was more than a courier . he * became * the message he was carrying . what danger i... neg
seen july 8 , 1998 at the crossgates cinema 18 , ( albany , ny ) , theater # 7 , at 8 : 15 p . m . w... pos
not since attending an ingmar bergman retrospective a few years ago have i seen a film as uncompromi... pos
after a successful run in australia last year , and with much critical praise heaped upon it , murie... pos
every once in a while , a film sneaks up on me and takes me completely by surprise . i don ' t neces... pos
" through a spyglass , i cou

it ' s tough to really say something nice about a type of person who ' s so ethnocentric that any hu... neg
synopsis : as a response to accusations of sexual prejudice in the armed forces , a female naval int... pos
the idea at the center of the devil ' s advocate , which is , thus far , one of the three or four be... pos
scream 2 , like its predecessor , is a genre - crossing film . it is about 50 % horror film and 50 %... neg
when i first saw the previews for ron howard ' s latest film , my expectations were discouragingly l... pos
ingredients : james bond , scuba scene , car controlled by cellular telephone synopsis : warped medi... pos
it is always refreshing to see a superstar actor who gets paid more than enough to forget about work... pos
if you thought baz luhrmann ' s radical take on _william_shakespeare ' s_romeo_ + _juliet_ was wild ... pos
meteor threat set to blow away all volcanoes & twisters ! summer is here again ! this season could p... pos
there is a striking scene ea

Now we evaluate the lexical approach by computing accuracy metrics

In [10]:
from collections import defaultdict
from nltk.metrics import (accuracy as eval_accuracy, precision as eval_precision,
        recall as eval_recall, f_measure as eval_f_measure)

gold_results = defaultdict(set)
test_results = defaultdict(set)
acc_gold_results = []
acc_test_results = []
labels = set()
num = 0
for i, (text, label) in enumerate(testing_docs):
    labels.add(label)
    gold_results[label].add(i)
    acc_gold_results.append(label)
    observed = lexical_sentiment(" ".join(text), sid)
    num += 1
    acc_test_results.append(observed)
    test_results[observed].add(i)
metrics_results = {}

# TODO: compute the accuracy metrics

for label in labels:
    metrics_results["F-measure [" + label + "]"] = eval_f_measure(gold_results[label], test_results[label])
    metrics_results["prec [" + label + "]"] = eval_precision(gold_results[label], test_results[label])
    metrics_results["rec [" + label + "]"] = eval_recall(gold_results[label], test_results[label])
    
    
metrics_results["acc"] = eval_accuracy(acc_gold_results, acc_test_results)
print('accuracy:', metrics_results['acc'])

for result in sorted(metrics_results):
        print('{0}: {1}'.format(result, metrics_results[result]))

accuracy: 0.83
F-measure [pos]: 0.907103825136612
acc: 0.83
prec [pos]: 1.0
rec [pos]: 0.83
