<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis of Movie Reviews with Spacy and VADER

_Authors: Kiefer Katovich (SF)_

---

### Learning Objectives
- Understand the goal of basic sentiment analysis.
- Calculate sentiment scores manually using a reviews dataset and scores tagged by word.
- Practice using the spacy parser to extract part of speech tags from text.
- Fit a model using sentiment and grammar features.
- Use the VADER sentiment analyzer to get more accurate sentiment scores and compare the models.

### Lesson Guide
- [Introduction to sentiment analysis](#intro)
- [Load the word sentiment dataset](#load-sen)
    - [Engineer objectivity and positive difference scores](#adj-scores)
    - [Put scores in a part of speech dictionary](#pos-dict)
- [Load the Rotten Tomatoes reviews dataset](#rt-reviews)
    - [Restrict reviews to valid lengths and ratings](#subset)
- [Import spacy](#spacy)
    - [Parse all the quotes using spacy's multithreaded parser](#multi)
- [Part of speech features](#pos-features)
- [Assign sentiment scores](#assign)
- [Print out the most positive and most negative reviews](#print-most)
- [Print out the most objective and most subjective reviews](#print-most-obj)
- [Build a model to classify fresh vs. rotten with the sentiment and grammar features](#model)
- [Use the VADER library to get better sentiment scores](#vader)
    - [Build a model using the VADER sentiment features](#vader-model)

<a id='intro'></a>

## Introduction to sentiment analysis
---

Sentiment analysis is one of the most popular topics in NLP. Most commonly it is the quantification of text into valence and subjectivity scores.

First we will load a dataset of pre-coded positivity and negativity scores for words. These words are also tagged with their part of speech. We can use these valence scores to evaluate the sentiment of Rotten Tomatoes movie reviews. Many packages, such as TextBlob, come pre-packaged with sentiment scores for words after parsing text, but doing the sentiment parsing manually will show you how it can be done without any "magic".

We will also explore a more advanced sentiment analysis library in Python: [VADER](https://github.com/cjhutto/vaderSentiment). We can parse the sentiment of the movie reviews using this package and compare it to our more basic method.



<a id='load-sen'></a>

## Load the word sentiment dataset
---

Below we will load in some pre-tagged positive and negative valence scores for a dictionary of words. Each row of the dataset contains the part of speech, the word, the positive score, and the negative score for the word. A word may appear more than once if it can appear with different part of speech tags. 

These scores are designed so that we can also derive the *objectivity score* of the word from the positive and negative scores.

Objectivity is calculated: 

    1. - (positive_score + negative_score)

Thus, if a word has a positive score of zero and a negative score of zero, it is completely objective (objectivity = 1). If a word has, for example, 0.5 positive and 0.5 negative, it is no more positive than negative, but we can tell that it is fully subjective (objectivity = 0).
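
As a quick check of the formula, here is a small sketch with made-up scores. The three score pairs and the helper name `objectivity` are illustrative and are not taken from the dataset.

```python
def objectivity(pos_score, neg_score):
    """Objectivity as defined above: 1 - (positive + negative)."""
    return 1.0 - (pos_score + neg_score)

# Hypothetical (pos_score, neg_score) pairs, for illustration only.
examples = {
    'fully objective': (0.0, 0.0),
    'balanced but subjective': (0.5, 0.5),
    'mostly positive': (0.75, 0.0),
}

for label, (pos, neg) in sorted(examples.items()):
    print('%s: objectivity=%.2f, pos_vs_neg=%.2f'
          % (label, objectivity(pos, neg), pos - neg))
```

Note that the balanced word and the fully objective word both have a positive-minus-negative difference of zero; only the objectivity score tells them apart.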


In [3]:
import pandas as pd
import numpy as np

In [4]:
sen = pd.read_csv('/Users/Indraja/Documents/Dsi/9.4.1_nlp-sentiment_analysis-lesson/datasets/sentiment_words_simple.csv')

In [5]:
# A:
sen.head()

Unnamed: 0,pos,word,pos_score,neg_score
0,adj,.22-caliber,0.0,0.0
1,adj,.22-calibre,0.0,0.0
2,adj,.22_caliber,0.0,0.0
3,adj,.22_calibre,0.0,0.0
4,adj,.38-caliber,0.0,0.0


**Make the part of speech tags uppercase (this will come in handy later when we use Spacy).**

In [6]:
# A:
sen.pos = sen.pos.map(lambda x: x.upper())

<a id='adj-scores'></a>

### Engineer objectivity and positive difference scores

Since subjectivity vs. objectivity is embedded in the positive and negative scores, we should extract it as its own score and convert the positive and negative scores into a single relative difference score.

**Calculate two new scores:**

    objectivity = 1. - (pos_score + neg_score)
    pos_vs_neg = pos_score - neg_score
    

In [7]:
# A:
sen['objectivity']= 1.0 - (sen.pos_score+sen.neg_score)
sen['pos_vs_neg']=sen.pos_score-sen.neg_score

<a id='pos-dict'></a>

### Put scores in a part of speech dictionary

The dictionary format of the data will be much easier to index using our parsing functions later on. Create a dictionary where the keys are the four part of speech tags:

    ADJ
    NOUN
    VERB
    ADV

For each key, store a dictionary that contains all of the words for that part of speech with their objectivity and positive vs. negative scores.

In [8]:
# A:
sen_dict = {'ADJ':{},'NOUN':{},'VERB':{},'ADV':{}}

for i, row in enumerate(sen.itertuples()):
    if (i % 10000) == 0:
        print i
    # row is a namedtuple, so columns can be read by name.
    sen_dict[row.pos][row.word] = {'objectivity':row.objectivity, 'pos_vs_neg':row.pos_vs_neg}


0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
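
To illustrate how the nested dictionary is meant to be indexed, here is a minimal sketch using a toy stand-in. The name `toy_sen_dict`, its scores, and the helper `lookup` are hypothetical. Chaining `.get` returns `None` when either the tag or the word is missing, instead of raising a `KeyError`.

```python
# Toy stand-in for sen_dict with hypothetical scores (not from the dataset).
toy_sen_dict = {
    'ADJ': {'good': {'objectivity': 0.25, 'pos_vs_neg': 0.75}},
    'NOUN': {'film': {'objectivity': 1.0, 'pos_vs_neg': 0.0}},
    'VERB': {},
    'ADV': {},
}

def lookup(scores, pos, word):
    """Return the score dict for (pos, word), or None if either is missing."""
    return scores.get(pos, {}).get(word)

print(lookup(toy_sen_dict, 'ADJ', 'good'))
print(lookup(toy_sen_dict, 'ADJ', 'film'))   # word exists, but under another tag
print(lookup(toy_sen_dict, 'PRON', 'it'))    # tag is not one of the four keys
```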


<a id='rt-reviews'></a>

## Load the Rotten Tomatoes reviews dataset

---

This dataset has:
    
    critic: critic's name
    fresh: fresh vs. rotten rating
    imdb: code for imdb
    publication: where the review was published
    quote: the review snippet
    review_date: date of review
    rtid: rottentomatoes id
    title: name of movie

In [9]:
rt = pd.read_csv('./datasets/rt_critics.csv')

In [10]:
# A:
rt

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709,Time Out,"So ingenious in concept, design and execution ...",4/10/09,9559,Toy story
1,Richard Corliss,fresh,114709,TIME Magazine,The year's most inventive comedy.,31/8/08,9559,Toy story
2,David Ansen,fresh,114709,Newsweek,A winning animated feature that has something ...,18/8/08,9559,Toy story
3,Leonard Klady,fresh,114709,Variety,The film sports a provocative and appealing st...,9/6/08,9559,Toy story
4,Jonathan Rosenbaum,fresh,114709,Chicago Reader,"An entertaining computer-generated, hyperreali...",10/3/08,9559,Toy story
5,Michael Booth,fresh,114709,Denver Post,"As Lion King did before it, Toy Story revived ...",3/5/07,9559,Toy story
6,Geoff Andrew,fresh,114709,Time Out,The film will probably be more fully appreciat...,24/6/06,9559,Toy story
7,Janet Maslin,fresh,114709,New York Times,Children will enjoy a new take on the irresist...,20/5/03,9559,Toy story
8,Kenneth Turan,fresh,114709,Los Angeles Times,Although its computer-generated imagery is imp...,13/2/01,9559,Toy story
9,Susan Wloszczyna,fresh,114709,USA Today,How perfect that two of the most popular funny...,1/1/00,9559,Toy story


<a id='subset'></a>

### Restrict data to reviews with valid ratings and reviews over 10 words long

Also clean up the reviews, making a column with the case and punctuation removed.

In [11]:
# A:
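rt['quote_len'] = rt.quote.map(lambda x: len(x.split()))
# .copy() makes rt an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
rt = rt[rt.quote_len > 10].copy()
rt.shape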

(11233, 9)
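
The cell above filters on length only. The `fresh` column appears to hold values other than `fresh` and `rotten` (a `none` label shows up in the data), so a rating filter might look like the sketch below. The toy DataFrame and the name `valid` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the reviews table; values are made up for illustration.
toy = pd.DataFrame({
    'fresh': ['fresh', 'rotten', 'none', 'fresh'],
    'quote': [
        'A short one.',
        'This review is comfortably longer than ten words in total, as you can see.',
        'Another review that is also longer than ten words, but it has no rating.',
        'A winning feature that has something for everyone on the age spectrum, truly.',
    ],
})

toy['quote_len'] = toy.quote.map(lambda x: len(x.split()))

# Keep rows with a usable rating and more than 10 words.
valid = toy[toy.fresh.isin(['fresh', 'rotten']) & (toy.quote_len > 10)].copy()
print(valid[['fresh', 'quote_len']])
```

Dropping the unrated rows would also turn the later classification task into a strictly two-class problem.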

<a id='spacy'></a>

## Import spacy

---

The spacy package is a widely used library for parsing the grammatical structure of text. We are going to use it to find the part of speech tags for the review words.

Once we have parsed the tags with spacy, we can assign objectivity and valence scores by finding the match in our sentiment dataset.

In [12]:
import spacy
en_nlp = spacy.load('en')

**Parse a single quote:**

In [26]:
# A:
import string
rt['qt'] = rt.quote.map(lambda x: unicode(''.join([y for y in list(x.lower()) if y in string.ascii_lowercase+" -'"])))
rt.qt = rt.qt.map(lambda x: x.replace('-',' '))
tmp = en_nlp(rt.qt.values[0])



In [27]:
tmp

so ingenious in concept design and execution that you could watch it on a postage stamp sized screen and still be engulfed by its charm

In [28]:
tmp[3]

concept

**Print out the part of speech tags for each word in the quote:**

In [29]:
# A:
for token in tmp:
    print token.pos_

ADV
ADJ
ADP
NOUN
NOUN
CONJ
NOUN
ADJ
PRON
VERB
VERB
PRON
ADP
DET
NOUN
NOUN
VERB
NOUN
CONJ
ADV
VERB
VERB
ADP
ADJ
NOUN


<a id='multi'></a>
### Parse all the quotes using spacy's multithreaded parser

Parsing a lot of text can take quite a while. Luckily, spacy comes with multithreading functionality to speed up the process considerably. Below is code that will parse the quotes across multiple threads and collect the results in a list.

In [30]:
# A:
parsed_quotes = []
for i, parsed in enumerate(en_nlp.pipe(rt.qt.values, batch_size=50, n_threads=3)):
    assert parsed.is_parsed
    if (i % 1000) == 0:
        print i
    parsed_quotes.append(parsed) 

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


<a id='pos-features'></a>

## Create features with part of speech proportions

---

With our spacy parsed reviews, we have a lot of feature engineering potential even before we get to sentiment. Something simple we could do is calculate the proportion of words in the quote that have each part of speech tag. We can try using these as predictors in a model later.

**Find all the unique part of speech categories in the reviews.**

In [31]:
# A:
unique_pos = []
for parsed in parsed_quotes:
    unique_pos.extend([t.pos_ for t in parsed])
unique_pos = np.unique(unique_pos)
print unique_pos

[u'ADJ' u'ADP' u'ADV' u'CONJ' u'DET' u'INTJ' u'NOUN' u'NUM' u'PART' u'PRON'
 u'PROPN' u'PUNCT' u'SPACE' u'SYM' u'VERB' u'X']


**Create the proportion columns for each part of speech.**

In [32]:
# A:
for pos in unique_pos:
    rt[pos+'_prop'] = 0.



**Iterate through the reviews and calculate the proportions of each part of speech tag.**

In [33]:
# A:
rt = rt.reset_index(drop=True)
for i, parsed in enumerate(parsed_quotes):
    if (i % 1000) == 0:
        print i
    parsed_len = len(parsed)
    for pos in unique_pos:
        count = len([x for x in parsed if x.pos_ == pos])
        # .loc replaces the deprecated .ix; the index was reset above,
        # so label i matches position i.
        rt.loc[i, pos+'_prop'] = float(count)/parsed_len
    

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


In [34]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,...,NOUN_prop,NUM_prop,PART_prop,PRON_prop,PROPN_prop,PUNCT_prop,SPACE_prop,SYM_prop,VERB_prop,X_prop
0,Derek Adams,fresh,114709,Time Out,"So ingenious in concept, design and execution ...",4/10/09,9559,Toy story,24,so ingenious in concept design and execution t...,...,0.28,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.2,0.0
1,David Ansen,fresh,114709,Newsweek,A winning animated feature that has something ...,18/8/08,9559,Toy story,13,a winning animated feature that has something ...,...,0.384615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153846,0.0
2,Leonard Klady,fresh,114709,Variety,The film sports a provocative and appealing st...,9/6/08,9559,Toy story,17,the film sports a provocative and appealing st...,...,0.277778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0
3,Jonathan Rosenbaum,fresh,114709,Chicago Reader,"An entertaining computer-generated, hyperreali...",10/3/08,9559,Toy story,14,an entertaining computer generated hyperrealis...,...,0.4375,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.1875,0.0
4,Michael Booth,fresh,114709,Denver Post,"As Lion King did before it, Toy Story revived ...",3/5/07,9559,Toy story,40,as lion king did before it toy story revived t...,...,0.325581,0.0,0.023256,0.046512,0.0,0.0,0.0,0.0,0.139535,0.0
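
The proportion loop above rescans each review once per tag. As an alternative sketch, `collections.Counter` can tally all tags in a single pass. The helper `pos_proportions` and the tag list are hypothetical; the list stands in for `[t.pos_ for t in parsed]`.

```python
from collections import Counter

def pos_proportions(tags, categories):
    """Share of tokens carrying each tag; categories fixes the output keys."""
    counts = Counter(tags)
    total = float(len(tags))
    # Guard against an empty review so we never divide by zero.
    return {pos: (counts[pos] / total if total else 0.0) for pos in categories}

# Hypothetical tag sequence for one short review.
tags = ['ADV', 'ADJ', 'ADP', 'NOUN', 'NOUN', 'CONJ', 'NOUN', 'VERB']
props = pos_proportions(tags, ['ADJ', 'NOUN', 'VERB', 'PUNCT'])
print(props)
```

Passing the full list of categories keeps the output keys identical for every review, which makes it straightforward to build the `_prop` columns from a list of these dictionaries.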


<a id='assign'></a>

## Assign sentiment scores
---

We will now use the parsed reviews and the sentiment dataset to assign the average objectivity and positive vs. negative scores.

If a word cannot be found in the dataset, we can ignore it. If a review has no words that match something in our dataset, we can assign overall neutral scores of `objectivity = 1` and `pos_vs_neg = 0`.

There are definitely problems with this approach, but for now we can keep it "dumb" and see if things improve when we use the VADER analyzer later.

In [35]:
# A:
def scorer(parsed):
    # Collect per-word scores for the four tags covered by sen_dict.
    obj_scores, pvn_scores = [], []
    for token in [t for t in parsed if t.pos_ in ['NOUN','VERB','ADV','ADJ']]:
        try:
            word_scores = sen_dict[token.pos_][str(token)]
        except KeyError:
            # Word not in the sentiment dataset for this tag: skip it.
            continue
        obj_scores.append(word_scores['objectivity'])
        pvn_scores.append(word_scores['pos_vs_neg'])
    # No matches at all: fall back to neutral scores.
    if len(obj_scores) == 0:
        obj_scores = [1.]
    if len(pvn_scores) == 0:
        pvn_scores = [0.]
    return [obj_scores, pvn_scores]


scores = []
for i, parsed in enumerate(parsed_quotes):
    if (i % 1000) == 0:
        print i
    scores.append(scorer(parsed))
    
rt['objectivity_avg'] = [np.mean(x[0]) for x in scores]
rt['pos_vs_neg_avg'] = [np.mean(x[1]) for x in scores]

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
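
One limitation of averaging word scores is that context such as negation is ignored. The sketch below uses a made-up two-word lexicon to show this. The scores and the helper `average_valence` are illustrative and do not come from the sentiment dataset.

```python
# Hypothetical word-level valence scores, for illustration only.
toy_valence = {'good': 0.75, 'bad': -0.625}

def average_valence(text, lexicon):
    """Mean valence of the words found in lexicon; 0.0 if none match."""
    found = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(found) / len(found) if found else 0.0

print(average_valence('a good film', toy_valence))
print(average_valence('not a good film', toy_valence))  # negation is ignored
print(average_valence('an unrated film', toy_valence))  # no match: neutral 0.0
```

Both of the first two sentences receive the same score, even though their meanings are opposite.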


<a id='print-most'></a>
## Print out the most positive and most negative reviews
---

Now that we have the average valence for reviews, try printing out the top 10 most positive and top 10 most negative reviews to visually verify that our approach makes sense.

In [36]:
# A:
rt.sort_values('pos_vs_neg_avg', ascending=False, inplace=True)
for quote in rt.quote[0:10]:
    print quote
    print '============================================================\n'

Almodovar has called his near-unique creations 'screwball drama.' This finds him working at his best.

Streep (the best thing she has done in ages) carries it along.

High Noon combines its points about good citizenship with some excellent picturemaking.

Paths of Glory is all about that greatest of all movie subjects: power.

Improbabilities and all, Simpatico still boasts wonderful scenes and a cast that is truly superb.

As bustling and impassioned as the best Sturges and Capra movies.

From Russia with Love is a preposterous, skillful slab of hardhitting, sexy hokum.

Succeeds, in part, because the film is as non-judgmental as the famed sex researcher himself.

From bumbling infants to majestic adults, a flock hasn't been this charismatic onscreen since Hitchcock went bird-watching.

It's an excellent movie for kids, because it is about how amazing children can be.



In [38]:
rt.sort_values('pos_vs_neg_avg', ascending=True, inplace=True)
for quote in rt.quote[0:10]:
    print quote
    print '============================================================\n'

Unoriginal and insulting, 3 Strikes goes down without scoring a single chuckle.

What pulls you over the bum spots is the electrifying immediacy.

Unfortunately Mr. Fraser comes off as a forlorn, outsize Pee Wee Herman.

Brooding, somber film is ragged around the edges and not without problematic aspects.

Its tone is never exactly comedic and its horrific touches are more disgusting than scary.

It's a disturbing, hopeless, irredeemable series of images that will scar you if you wander into it unprepared.

...Liar Liar stands to make a liar out of those who predicted that Carrey's career was on the skids.

A silly movie, with silly jokes and a silly story. But the talents at work in it are not silly.

Likely to be disappointing to Almodovar's admirers, and inexplicable to anyone else.

It retains the cheesy look of the 1979 original, pure schlock not gussied up to appear to be anything else.



<a id='print-most-obj'></a>

## Print out the most objective and most subjective reviews
---

Do the same as above, but now sort by the objectivity. What kind of differences do you notice between these? Does our approach actually appear to capture meaningful subjectivity and objectivity in the reviews?

In [39]:
# A:
rt.sort_values('objectivity_avg', ascending=False, inplace=True)
for quote in rt.quote[0:10]:
    print quote
    print '============================================================\n'

Dr. Dolittle runs out of ideas long before the projector runs out of film.

This is one of the films that made Jackie Chan Jackie Chan.

Mechanically written, but within its own middlebrow limitations, it delivers the goods.

... I felt as if I were being preached to throughout this film.

Barbara Stanwyck is the sexiest con woman ever captured on film.

One of the finest collaborations between husband and wife ever committed to film.

Everything about Something to Talk About feels off by a few beats.

Vicky Cristina Barcelona is the cinematic equivalent of a book on tape: a movie that watches itself for you and tells you what it sees.

As Chan moved from Hong Kong to Hollywood, something got lost in the translation.

Producer Dore Schary, in association with Adrian Scott, has pulled no punches.



In [40]:
rt.sort_values('objectivity_avg', ascending=True, inplace=True)
for quote in rt.quote[0:10]:
    print quote
    print '============================================================\n'

They keep getting worse and worse and worse . . .

What pulls you over the bum spots is the electrifying immediacy.

Almodovar has called his near-unique creations 'screwball drama.' This finds him working at his best.

I am not sure why this isn't very funny, but it's not.

Hawthorne is by turn outrageous and pathetic and imperious and poignant and very funny.

At its best, this achieves the beauty and grandeur of a Kurosawa epic -- at its worst, however, it feels like a Python remake of The Vikings.

In spite of its shortcomings, children love these characters and will enjoy Tigger.

At its best when it's being lighthearted and at its weakest when it takes a halfhearted stab at semi-seriousness.

Hilarious, sexy, clever, playful and as initially teasing as it is ultimately satisfying.

As bustling and impassioned as the best Sturges and Capra movies.



<a id='model'></a>

## Build a model to classify fresh vs. rotten with the sentiment and grammar features

---

Let's use the features we've created to fit a logistic regression that predicts whether a review is fresh or rotten.

Don't forget to check the baseline score. It is also good practice to standardize your predictors.
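
A common baseline for a classifier is the accuracy of always predicting the most frequent class. Here is a minimal sketch with toy labels standing in for `rt.fresh`; the label values and the names `class_share` and `baseline` are illustrative.

```python
import pandas as pd

# Toy labels standing in for rt.fresh; the values are made up.
labels = pd.Series(['fresh', 'fresh', 'rotten', 'fresh', 'rotten'])

# Share of each class; the largest share is the accuracy of always
# predicting the most common label.
class_share = labels.value_counts(normalize=True)
baseline = class_share.max()
print(class_share)
print(baseline)
```

A cross-validated accuracy is only informative to the extent that it exceeds this number.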


In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

In [43]:
# A:
X = rt[['objectivity_avg','pos_vs_neg_avg','quote_len']+[x for x in rt.columns if x.endswith('_prop')]]
y = rt.fresh.values

ss = StandardScaler()
Xs = ss.fit_transform(X)

lr_scores = cross_val_score(LogisticRegression(), Xs, y, cv=10)
print np.mean(lr_scores)
# For the baseline, use the largest class share from
# rt.fresh.value_counts(normalize=True); rt.fresh holds string labels,
# so rt.fresh.mean does not give a usable baseline.

0.638108349812

<a id='vader'></a>

## Use the VADER library to get better sentiment scores
---

The [VADER](https://github.com/cjhutto/vaderSentiment) package for Python is a more advanced way to calculate positivity, negativity, and objectivity in our reviews. The GitHub page describes VADER as:

> VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

You will likely need to install VADER with pip or conda. Instructions can be found on the GitHub page. Once you have it installed, you can load the `SentimentIntensityAnalyzer` and parse text.

**Parse a couple of quotes with the `SentimentIntensityAnalyzer` and print out the dictionary of scores using `analyzer.polarity_scores`:**

In [44]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

ImportError: No module named vaderSentiment.vaderSentiment

In [None]:
# A:
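
A possible sketch for this step is shown below. The import is guarded because the package may not be installed, as the `ImportError` above shows. The two quotes are taken from the reviews printed earlier, and the helper `describe` is a hypothetical convenience for printing.

```python
def describe(scores):
    """Format a VADER-style score dict as one readable line."""
    return ', '.join('%s=%.3f' % (k, scores[k])
                     for k in ('neg', 'neu', 'pos', 'compound'))

try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
except ImportError:
    analyzer = None  # e.g. run `pip install vaderSentiment`, then retry

quotes = [
    "The year's most inventive comedy.",
    "Unoriginal and insulting, 3 Strikes goes down without scoring a single chuckle.",
]

if analyzer is not None:
    for quote in quotes:
        print(quote)
        print(describe(analyzer.polarity_scores(quote)))
```

Unlike the word-level approach, VADER receives the raw quote, so punctuation and capitalization do not need to be stripped first.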

You can see that these scores look more legitimate. VADER polarity score dictionaries have 4 elements: `neg`, `pos`, `neu` and `compound`. The compound score is a single metric that represents the "overall" valence.

**Calculate the four scores for each review and save them as features in the dataframe.**

In [None]:
# A:
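
One way to store the four scores as features is sketched below. The function `add_vader_features`, the `vader_` column prefix, and the stub scorer are assumptions made for illustration. With VADER installed, `SentimentIntensityAnalyzer().polarity_scores` would be passed as `score_fn`.

```python
import pandas as pd

def add_vader_features(df, score_fn, text_col='quote'):
    """Return a copy of df with vader_neg/neu/pos/compound columns added.

    score_fn maps a string to a dict with keys neg, neu, pos and compound.
    """
    scores = pd.DataFrame([score_fn(text) for text in df[text_col]],
                          index=df.index)
    scores = scores[['neg', 'neu', 'pos', 'compound']].add_prefix('vader_')
    return pd.concat([df, scores], axis=1)

# Stub scorer so the sketch runs without VADER installed.
def stub_scores(text):
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

toy = pd.DataFrame({'quote': ['first review', 'second review']}, index=[10, 20])
print(add_vader_features(toy, stub_scores))
```

Building the score frame on `df.index` keeps the new columns aligned with their reviews even if the DataFrame has been sorted or filtered.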

<a id='vader-model'></a>

### Fit a model using the VADER sentiment features

Does this model perform better? 

In [None]:
# A:

<a id='vader-top'></a>

### Print out the most negative, positive, neutral, and subjective reviews by VADER score

In [None]:
# A:
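
A minimal sketch for ranking reviews by a score column. The toy frame, its `vader_*` values, and the helper `show_extremes` are hypothetical. Taking the least neutral reviews as the most subjective ones is an assumption of this sketch.

```python
import pandas as pd

# Toy frame; the vader_* values are made up for illustration.
toy = pd.DataFrame({
    'quote': ['review a', 'review b', 'review c', 'review d'],
    'vader_compound': [0.9, -0.8, 0.1, -0.2],
    'vader_neu': [0.4, 0.5, 0.95, 0.7],
})

def show_extremes(df, column, n=2):
    """Return (top, bottom) n quotes by the given score column."""
    top = df.nlargest(n, column).quote.tolist()
    bottom = df.nsmallest(n, column).quote.tolist()
    return top, bottom

most_positive, most_negative = show_extremes(toy, 'vader_compound')
most_neutral, least_neutral = show_extremes(toy, 'vader_neu')
print(most_positive)
print(most_negative)
print(most_neutral)
print(least_neutral)
```

`nlargest` and `nsmallest` return new frames, so the original row order is left untouched, unlike `sort_values(..., inplace=True)`.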