# Sentiment analysis codealong using spaCy and movie reviews

Sentiment analysis is one of the more popular topics in NLP. It is concerned with measuring the valence of written text along dimensions such as positivity, negativity, and objectivity, among many others. In this lesson we will look at just those three.

First we will load a dataset of words that have been pre-coded with positivity and negativity scores. Each word is also labeled with its part of speech.

Then we will load snippets of Rotten Tomatoes reviews and explore the sentiment of the writing.

---

### Load packages and sentiment data

In [1]:
import pandas as pd
import numpy as np

In [2]:
sen = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/sentiment_words/sentiment_words_simple.csv')

In [8]:
sen[sen.word == 'magnificent']

Unnamed: 0,pos,word,pos_score,neg_score
10978,adj,magnificent,0.5,0.25


In [10]:
sen.sort_values('neg_score', ascending=False).head(10)

Unnamed: 0,pos,word,pos_score,neg_score
3686,adj,cheapjack,0.0,1.0
115694,noun,scut_work,0.0,1.0
8534,adj,henpecked,0.0,1.0
117755,noun,shitwork,0.0,1.0
10290,adj,lamentable,0.0,1.0
32976,noun,blackguard,0.0,1.0
25791,noun,angriness,0.0,1.0
91963,noun,motormouth,0.0,1.0
5782,adj,distressing,0.0,0.9375
19500,adj,unfortunate,0.037,0.921333


---

### Create a sentiment dataset that does not take into account part of speech tags

We will use this version first, before we know each word's part of speech. Later, when we use spaCy, we will be able to determine the part of speech of each word and look up the matching scores.

In [11]:
sen_agg = sen[['word','pos_score','neg_score']].groupby('word').mean().reset_index()
sen_agg.head()

Unnamed: 0,word,pos_score,neg_score
0,'hood,0.0,0.375
1,'s_gravenhage,0.0,0.0
2,'tween,0.0,0.0
3,'tween_decks,0.0,0.0
4,.22,0.125,0.0


---

### Create a dictionary version of the sentiment data for both the part-of-speech and aggregate versions

The dictionary format will be much easier to index into in our functions later. Dictionary lookups are also much faster than filtering a DataFrame for every word, which matters once we score thousands of reviews.

In [14]:
sen_dict = {
    'ADJ':{},
    'NOUN':{},
    'VERB':{},
    'ADV':{}
}

# itertuples() yields (index, pos, word, pos_score, neg_score)
for row in sen.itertuples():
    sen_dict[row[1].upper()][row[2]] = {'pos_score':row[3], 'neg_score':row[4]}

In [19]:
sen_dict['ADJ']['worst']

{'neg_score': 0.75, 'pos_score': 0.25}

In [16]:
sen_agg_dict = {}
for row in sen_agg.itertuples():
    sen_agg_dict[row[1]] = {'pos_score':row[2], 'neg_score':row[3]}

In [20]:
sen_agg_dict['worst']

{'neg_score': 0.63541666666675001, 'pos_score': 0.125}
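
As a rough illustration of why the nested dictionary is convenient, the sketch below builds the same structure from a few plain tuples. The rows reuse scores shown earlier in this lesson; the tuple layout `(pos, word, pos_score, neg_score)` and the helper name `build_pos_lexicon` are assumptions made for this example, not part of the original notebook.

```python
# A few rows copied from the outputs above: (pos, word, pos_score, neg_score)
rows = [
    ('adj', 'magnificent', 0.5, 0.25),
    ('adj', 'cheapjack', 0.0, 1.0),
    ('noun', 'blackguard', 0.0, 1.0),
    ('adj', 'worst', 0.25, 0.75),
]


def build_pos_lexicon(rows):
    """Hypothetical helper: nest scores as lexicon[POS][word]."""
    lexicon = {'ADJ': {}, 'NOUN': {}, 'VERB': {}, 'ADV': {}}
    for pos, word, pos_score, neg_score in rows:
        lexicon[pos.upper()][word] = {'pos_score': pos_score,
                                      'neg_score': neg_score}
    return lexicon


lexicon = build_pos_lexicon(rows)

# A lookup is two key accesses, with no filtering of a DataFrame
print(lexicon['ADJ']['worst'])

# .get() with a default avoids a KeyError for words that are not in the lexicon
print(lexicon['NOUN'].get('movie', {'pos_score': 0.0, 'neg_score': 0.0}))
```

Keying on the part of speech first means the same word can carry different scores as an adjective and as a noun.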

---

### Load the Rotten Tomatoes dataset

This dataset has the following columns:

    critic: critic's name
    fresh: fresh vs. rotten rating
    imdb: IMDb code for the movie
    publication: where the review was published
    quote: the review snippet
    review_date: date of review
    rtid: Rotten Tomatoes id
    title: name of movie

In [21]:
rt = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/rottentomatoes_critics/rt_critics.csv')

In [22]:
rt.head(2)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story


---

### Restrict data to reviews with valid ratings and reviews over 10 words long

Then clean up the reviews by adding a column with the text lowercased and the punctuation removed (hyphens and apostrophes are kept).

In [23]:
rt.fresh.unique()

array(['fresh', 'rotten', 'none'], dtype=object)

In [24]:
# copy the filtered rows so the assignments below modify rt itself rather than a view
rt = rt[rt.fresh.isin(['fresh','rotten'])].copy()
rt['fresh'] = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)

In [25]:
rt['quote_len'] = rt.quote.map(lambda x: len(x.split()))
rt = rt[rt.quote_len > 10]
rt.shape

(11215, 9)

In [26]:
for q in rt.quote.values[0:4]:
    print(q)

So ingenious in concept, design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm.
A winning animated feature that has something for everyone on the age spectrum.
The film sports a provocative and appealing story that's every bit the equal of this technical achievement.
An entertaining computer-generated, hyperrealist animation feature (1995) that's also in effect a toy catalog.


In [28]:
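import string

# keep only lowercase letters, spaces, hyphens and apostrophes
# (unicode() is Python 2 only; it is used here because the parser expects unicode text)
keep = string.ascii_lowercase + " -'"
rt['qt'] = rt.quote.map(lambda x: unicode(''.join([ch for ch in x.lower() if ch in keep])))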

rt.head(2)


Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...
2,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...
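
For readers on Python 3, the same cleaning step can be sketched without `unicode()`, since strings are already unicode there. The helper name `clean_quote` and the constant `KEEP` are made up for this example; the input is the first review printed above.

```python
import string

# Characters to keep: lowercase letters plus space, hyphen and apostrophe
KEEP = set(string.ascii_lowercase + " -'")


def clean_quote(quote):
    """Hypothetical helper: lowercase a quote and drop every other character."""
    return ''.join(ch for ch in quote.lower() if ch in KEEP)


quote = ("So ingenious in concept, design and execution that you could watch "
         "it on a postage stamp-sized screen and still be engulfed by its charm.")
cleaned = clean_quote(quote)
print(cleaned)
```

Note that digits are removed as well, so a token such as `(1995)` disappears entirely from the cleaned text.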


---

### Write a function to assign positive, negative, and objective scores based on the words in a review

We'll use the dictionary we constructed above (without the part of speech tags).

Objectivity is calculated as:

    1. - (positive_score + negative_score)
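
As a small worked example of this formula, the sketch below plugs in the aggregate scores for `worst` that were printed earlier. The function name `objectivity` is an assumption for this illustration.

```python
def objectivity(pos_score, neg_score):
    """Objectivity is whatever is left after positivity and negativity."""
    return 1.0 - (pos_score + neg_score)


# Aggregate scores for 'worst' as printed earlier in this lesson
worst_obj = objectivity(0.125, 0.63541666666675)
print(round(worst_obj, 4))

# A word scored 0 on both scales is treated as fully objective
unknown_obj = objectivity(0.0, 0.0)
print(unknown_obj)
```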

In [29]:
def agg_scorer(x):
    # score each word; words missing from the lexicon count as fully objective
    x = x.split()
    pos_scores, neg_scores, obj_scores = [], [], []
    for word in x:
        try:
            pos_scores.append(sen_agg_dict[word]['pos_score'])
            neg_scores.append(sen_agg_dict[word]['neg_score'])
            obj_scores.append((1. - (pos_scores[-1] + neg_scores[-1])))
        except KeyError:
            pos_scores.append(0.)
            neg_scores.append(0.)
            obj_scores.append(1.)
    return [pos_scores, neg_scores, obj_scores]

In [31]:
rev = rt.qt[7]
rev

u'children will enjoy a new take on the irresistible idea of toys coming to life adults will marvel at a witty script and utterly brilliant anthropomorphism'

In [32]:
p, n, o = agg_scorer(rev)

In [34]:
for word, n_ in zip(rev.split(), n):
    print('{} {}'.format(word, n_))

children 0.0
will 0.0
enjoy 0.05
a 0.0357142857143
new 0.056818181818
take 0.014880952381
on 0.0
the 0.0
irresistible 0.375
idea 0.075
of 0.0
toys 0.0
coming 0.0
to 0.0
life 0.0
adults 0.0
will 0.0
marvel 0.03125
at 0.0
a 0.0357142857143
witty 0.0
script 0.0
and 0.0
utterly 0.0
brilliant 0.0625
anthropomorphism 0.0


---

### Calculate the sum and average ratings for positive, negative, and objective for each review

In [36]:
# build a list (not a one-shot iterator) because it is reused six times below
agg_scores = [agg_scorer(q) for q in rt.qt]

rt['pos_avg'] = [np.mean(x[0]) for x in agg_scores]
rt['neg_avg'] = [np.mean(x[1]) for x in agg_scores]
rt['obj_avg'] = [np.mean(x[2]) for x in agg_scores]

rt['pos_sum'] = [np.sum(x[0]) for x in agg_scores]
rt['neg_sum'] = [np.sum(x[1]) for x in agg_scores]
rt['obj_sum'] = [np.sum(x[2]) for x in agg_scores]

In [37]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,pos_avg,neg_avg,obj_avg,pos_sum,neg_sum,obj_sum
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...,0.045647,0.024706,0.929647,1.095524,0.592949,22.311527
2,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...,0.062271,0.021978,0.915751,0.809524,0.285714,11.904762
3,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...,0.057831,0.024271,0.917897,0.983135,0.412608,15.604257
4,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14,an entertaining computer-generated hyperrealis...,0.072688,0.042331,0.884982,0.94494,0.550298,11.504762
5,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40,as lion king did before it toy story revived t...,0.028408,0.021935,0.949657,1.136316,0.877397,37.986287
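
The averages are the sums divided by the number of words scored, which keeps long reviews from looking more positive or negative just because they contain more words. As a quick check, the sketch below recomputes the averages for the David Ansen row from its sums and `quote_len`; the variable names are made up for this example.

```python
# Values copied from the David Ansen row above
quote_len = 13
sums = {'pos': 0.809524, 'neg': 0.285714, 'obj': 11.904762}

# Average score per word = summed score / number of words
averages = {name: total / quote_len for name, total in sums.items()}

for name in ('pos', 'neg', 'obj'):
    print('{}_avg {:.6f}'.format(name, averages[name]))

# Each word's three scores add up to 1, so the three sums add up to the word count
print(round(sum(sums.values()), 3))
```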


---

### Evaluate predictive ability using the sentiment scores

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = rt[['pos_avg','neg_avg','obj_avg','quote_len']]
y = rt.fresh.values

lr_scores = cross_val_score(LogisticRegression(), X, y, cv=10)
# mean cross-validated accuracy, then the share of fresh reviews as a baseline
print('{} {}'.format(np.mean(lr_scores), np.mean(y)))

lr = LogisticRegression().fit(X, y)


0.624431608768 0.615069103879
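
The first number printed above is the mean accuracy across the 10 folds; the second, `np.mean(y)`, is the share of reviews labeled fresh. Because more than half of the reviews are fresh, a model that always predicts fresh would already reach that second number, so it serves as a baseline. The short sketch below, with variable names chosen for this example, makes the comparison explicit.

```python
# Numbers copied from the output above
cv_accuracy = 0.624431608768
share_fresh = 0.615069103879

# Always guessing the majority class is right this often
baseline_accuracy = max(share_fresh, 1.0 - share_fresh)

# Improvement of the sentiment model over the baseline, in percentage points
lift_points = 100 * (cv_accuracy - baseline_accuracy)
print('baseline {:.3f}, model {:.3f}, lift {:.2f} points'.format(
    baseline_accuracy, cv_accuracy, lift_points))
```

A lift of under one percentage point suggests that the averaged scores carry only a little signal on their own.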


In [39]:
for predictor, coef in zip(X.columns, lr.coef_[0]):
    print('{} {}'.format(predictor, coef))

pos_avg 9.09342432956
neg_avg -7.52380496815
obj_avg -0.776342845242
quote_len 0.0115949884562


In [40]:
pp = pd.DataFrame({
        'prob_fresh':lr.predict_proba(X)[:,1],
        'prob_rotten':lr.predict_proba(X)[:,0],
        'quote':rt.quote.values
    })

In [41]:
pp.head()

Unnamed: 0,prob_fresh,prob_rotten,quote
0,0.640845,0.359155,"So ingenious in concept, design and execution ..."
1,0.65339,0.34661,A winning animated feature that has something ...
2,0.65046,0.34954,The film sports a provocative and appealing st...
3,0.64818,0.35182,"An entertaining computer-generated, hyperreali..."
4,0.648649,0.351351,"As Lion King did before it, Toy Story revived ..."


In [43]:
pp.sort_values('prob_rotten', ascending=False, inplace=True)
for quote in pp.quote.values[0:10]:
    print(quote)
    print('===============================================\n')


Unfortunately Mr. Fraser comes off as a forlorn, outsize Pee Wee Herman.

Any room in that freezer for this inadequate, inauthentic, indigestible film?

Its tone is never exactly comedic and its horrific touches are more disgusting than scary.

It's a disturbing, hopeless, irredeemable series of images that will scar you if you wander into it unprepared.

Unoriginal and insulting, 3 Strikes goes down without scoring a single chuckle.

If inspiration is lacking, talent is not. Count Lynch down but never out.

This is a terrible, terrible worthless movie that you shouldn't give any time to.

The movie is marred by an overreliance on unfunny bathroom gags.

This landmark movie's madcap humor and terrifying suspense remain undiminished by time.

Uninspired actors intone a banal script, reduced by clumsy pacing to a minimum of suspense.



In [47]:
pp['difference'] = np.abs(pp.prob_fresh - pp.prob_rotten)
pp.sort_values('difference', ascending=True, inplace=True)
for quote in pp.quote.values[0:10]:
    print(quote)
    print('===============================================\n')


A shambolic, deafening, intelligence-insulting mess, a crushing failure on almost all counts.

It's like watching the dreckiest of teen puppy courtships trying to pass itself off as 'Annie Hall.

This cockamamy action flick is excruciatingly formulaic -- brimming with spy movie cliches but devoid of the genre's fun, upper-class pretensions.

The film, for all its mayhem and fury, is too distant to be truly disturbing; it treats everything with an impatient, born-too-late shrug.

The story is no more than a thread stitching set pieces of increasing implausibility and ineptitude.

This may work for you if you settle at the outset for a nostalgic, all-American mood piece.

Notorious has a fine time along the way, with Woolard channeling the rapper's sweetness and wit as comfortably as his pathos.

Never manages more than a glib, TV movie-of- the-week glance at their lives.

An old hand at this sort of thing, Pakula goes through the motions, but not much more.

Another soulless, by-the-num

In [48]:
pp.head()

Unnamed: 0,prob_fresh,prob_rotten,quote,difference
4506,0.499924,0.500076,"A shambolic, deafening, intelligence-insulting...",0.000152
10765,0.500093,0.499907,It's like watching the dreckiest of teen puppy...,0.000186
9491,0.500166,0.499834,This cockamamy action flick is excruciatingly ...,0.000333
3830,0.499817,0.500183,"The film, for all its mayhem and fury, is too ...",0.000366
8824,0.499796,0.500204,The story is no more than a thread stitching s...,0.000408

In [51]:
for quote in rt.sort_values('neg_avg', ascending=False).quote.values[0:10]:
    print(quote)
    print('===============================================\n')

Hawthorne is by turn outrageous and pathetic and imperious and poignant and very funny.

Rounders' script is pretty shabby going. Well, not shabby, really, just simplistic.

Peter Berg's Very Bad Things isn't a bad movie, just a reprehensible one.

Its tone is never exactly comedic and its horrific touches are more disgusting than scary.

Bad taste of this order is rare but not yet dead.

An anarchic slob movie, a celebration of all that is irreverent, reckless, foolhardy, undisciplined, and occasionally scatological. It's a lot of fun.

A sprawling, rowdy, vital film laced with both outrageous absurdist dark humor and unspeakable pain, suffering and injustice.

Regrettably, an overblown finale and redundant trick ending undercut the mild subversiveness of what's gone before.

Not only is the picture woefully short on laughs, it's also coarse, overbearing and, in places, downright insulting.

A generally dumb movie with a smart, appealing, gutsy leading lady.



---

### Import spaCy

spaCy is one of the most widely used Python packages for parsing text. We are going to use it to find the part of speech tags for the review words.

Once we have parsed the tags with spaCy, we can assign sentiment scores at a more granular level, using the scores for the correct part of speech of each word.

In [52]:
import spacy
# 'en' is the model shortcut used by older spaCy releases; newer releases load a named model such as 'en_core_web_sm'
en_nlp = spacy.load('en')

In [53]:
txt = en_nlp(rt.qt.values[0])
txt

so ingenious in concept design and execution that you could watch it on a postage stamp-sized screen and still be engulfed by its charm

In [58]:
for x in txt:
    print(x.pos_)

ADV
ADJ
ADP
NOUN
NOUN
CONJ
NOUN
ADJ
PRON
VERB
VERB
PRON
ADP
DET
NOUN
NOUN
PUNCT
ADJ
NOUN
CONJ
ADV
VERB
VERB
ADP
ADJ
NOUN


In [59]:
token1 = txt[0]
token1

so

In [63]:
#str(token1) == 'so'
token1.string

u'so '

---

### Parse the quotes using spaCy's multithreaded parser

In [64]:
parsed_quotes = []
for i, parsed in enumerate(en_nlp.pipe(rt.qt.values, batch_size=50, n_threads=4)):
    if (i % 1000) == 0:
        print(i)
    parsed_quotes.append(parsed)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


In [67]:
unique_pos = []
for parsed in parsed_quotes:
    unique_pos.extend([t.pos_ for t in parsed])
unique_pos = np.unique(unique_pos)
print("','".join(unique_pos))

ADJ','ADP','ADV','CONJ','DET','INTJ','NOUN','NUM','PART','PRON','PROPN','PUNCT','SPACE','SYM','VERB','X


In [68]:
useful_grammar = ['ADJ','ADP','ADV','CONJ','DET','INTJ','NOUN','PART','PRON','PROPN','VERB']

In [69]:
for pos in useful_grammar:
    rt[pos+'_prop'] = 0.

In [70]:
rt = rt.reset_index(drop=True)
for i, parsed in enumerate(parsed_quotes):
    if (i % 500) == 0:
        print(i)
    parsed_len = len(parsed)
    for pos in useful_grammar:
        prop = len([x for x in parsed if x.pos_ == pos]) / float(parsed_len)
        # .ix has been removed from pandas; after reset_index the row labels are 0..n-1
        rt.loc[i, pos+'_prop'] = prop



0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
7000
7500
8000
8500
9000
9500
10000
10500
11000


In [71]:
rt.head(3)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,...,ADP_prop,ADV_prop,CONJ_prop,DET_prop,INTJ_prop,NOUN_prop,PART_prop,PRON_prop,PROPN_prop,VERB_prop
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...,...,0.115385,0.076923,0.076923,0.038462,0.0,0.269231,0.0,0.076923,0.0,0.153846
1,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...,...,0.153846,0.0,0.0,0.153846,0.0,0.384615,0.0,0.0,0.0,0.153846
2,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...,...,0.055556,0.0,0.055556,0.277778,0.0,0.277778,0.0,0.0,0.0,0.055556


---

### Create columns for part of speech proportions

For each of the part of speech tags, create a column in the dataset that records the proportion of words in the quote that have that part of speech tag. We can try using these as predictors.
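
To make the proportion calculation concrete, here is a minimal sketch that works on a plain list of tags rather than on parsed documents. The tag list is the one printed above for the first review; the helper name `pos_proportions` is an assumption for this example.

```python
from collections import Counter

# Tags printed above for the first review, in token order
tags = ['ADV', 'ADJ', 'ADP', 'NOUN', 'NOUN', 'CONJ', 'NOUN', 'ADJ', 'PRON',
        'VERB', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN', 'NOUN', 'PUNCT', 'ADJ',
        'NOUN', 'CONJ', 'ADV', 'VERB', 'VERB', 'ADP', 'ADJ', 'NOUN']

useful_grammar = ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'INTJ', 'NOUN',
                  'PART', 'PRON', 'PROPN', 'VERB']


def pos_proportions(tags, keep):
    """Hypothetical helper: share of tokens carrying each tag in `keep`."""
    counts = Counter(tags)
    total = float(len(tags))
    # Counter returns 0 for tags that never occur
    return {pos: counts[pos] / total for pos in keep}


props = pos_proportions(tags, useful_grammar)
for pos in useful_grammar:
    print('{}_prop {:.6f}'.format(pos, props[pos]))
```

The denominator counts every token, including punctuation, which matches the use of `len(parsed)` in the loop above. Counting once with `Counter` also avoids rescanning the tokens for each tag.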

---

### Evaluate a model with the new part of speech predictors

---

### Print out the most likely fresh and most likely rotten reviews

Using the predicted probabilities from our model, we can see which reviews are most likely to be fresh or rotten. We can easily validate that our model is doing something that makes sense by looking at these (one of the benefits of doing NLP work!)

---

### Assign sentiment scores using the correct part of speech tag

We need to write another function that will take into account the part of speech tags using the parsed quotes we created earlier and the original sentiment data dictionary.

In [73]:
def scorer(parsed):
    # only score nouns, verbs, adverbs and adjectives, using the entry for that part of speech
    pos_scores, neg_scores, obj_scores = [], [], []
    for token in [t for t in parsed if t.pos_ in ['NOUN','VERB','ADV','ADJ']]:
        try:
            pos_scores.append(sen_dict[token.pos_][str(token)]['pos_score'])
            neg_scores.append(sen_dict[token.pos_][str(token)]['neg_score'])
            obj_scores.append(1. - (pos_scores[-1] + neg_scores[-1]))
        except KeyError:
            pos_scores.append(0.)
            neg_scores.append(0.)
            obj_scores.append(1.)
    return [pos_scores, neg_scores, obj_scores]
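
The same idea can be tried without a parser by passing in `(word, tag)` pairs directly. This is only a sketch: `score_tagged`, `toy_lexicon` and `NEUTRAL` are hypothetical names, the tags are written by hand, and the single lexicon entry reuses the `worst` scores shown earlier.

```python
# Tiny POS-keyed lexicon; the 'worst' scores come from the lookup shown earlier
toy_lexicon = {
    'ADJ': {'worst': {'pos_score': 0.25, 'neg_score': 0.75}},
    'NOUN': {}, 'VERB': {}, 'ADV': {},
}
NEUTRAL = {'pos_score': 0.0, 'neg_score': 0.0}


def score_tagged(tagged_words, lexicon):
    """Hypothetical helper: score (word, tag) pairs with POS-specific entries."""
    pos_scores, neg_scores, obj_scores = [], [], []
    for word, tag in tagged_words:
        if tag not in lexicon:
            continue  # only tags the lexicon covers are scored
        entry = lexicon[tag].get(word, NEUTRAL)  # unknown words count as neutral
        pos_scores.append(entry['pos_score'])
        neg_scores.append(entry['neg_score'])
        obj_scores.append(1.0 - (entry['pos_score'] + entry['neg_score']))
    return [pos_scores, neg_scores, obj_scores]


# Hand-written tags for illustration; a real run would take them from the parser
tagged = [('the', 'DET'), ('worst', 'ADJ'), ('movie', 'NOUN')]
pos, neg, obj = score_tagged(tagged, toy_lexicon)
print(pos, neg, obj)
```

Using `.get()` with a neutral default plays the same role as the `try`/`except` in the function above, without hiding unrelated errors.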

In [74]:
# build a list (not a one-shot iterator) because it is reused three times below
scores = [scorer(parsed) for parsed in parsed_quotes]


In [75]:
rt['pos_part_avg'] = [np.mean(x[0]) for x in scores]
rt['neg_part_avg'] = [np.mean(x[1]) for x in scores]
rt['obj_part_avg'] = [np.mean(x[2]) for x in scores]

In [76]:
rt[[col for col in rt.columns if col.endswith('_avg')]].head(10)

Unnamed: 0,pos_avg,neg_avg,obj_avg,pos_part_avg,neg_part_avg,obj_part_avg
0,0.045647,0.024706,0.929647,0.069186,0.025553,0.90526
1,0.062271,0.021978,0.915751,0.020833,0.0,0.979167
2,0.057831,0.024271,0.917897,0.085227,0.030475,0.884298
3,0.072688,0.042331,0.884982,0.07197,0.036742,0.891288
4,0.028408,0.021935,0.949657,0.03787,0.013122,0.949009
5,0.119091,0.045225,0.835684,0.101595,0.032503,0.865902
6,0.112158,0.028341,0.859501,0.140578,0.042689,0.816733
7,0.055437,0.019369,0.925193,0.067751,0.036273,0.895975
8,0.025202,0.012446,0.962352,0.082465,0.038194,0.87934
9,0.097777,0.011065,0.891158,0.174148,0.018466,0.807386


---

### Evaluate the new predictors with different models

Does regularization help? What about decision trees?

In [77]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [78]:
X = rt[['quote_len'] + [c for c in rt.columns if c.endswith('_avg')] + [c for c in rt.columns if c.endswith('_prop')]]

In [79]:
X.head(2)

Unnamed: 0,quote_len,pos_avg,neg_avg,obj_avg,pos_part_avg,neg_part_avg,obj_part_avg,ADJ_prop,ADP_prop,ADV_prop,CONJ_prop,DET_prop,INTJ_prop,NOUN_prop,PART_prop,PRON_prop,PROPN_prop,VERB_prop
0,24,0.045647,0.024706,0.929647,0.069186,0.025553,0.90526,0.153846,0.115385,0.076923,0.076923,0.038462,0.0,0.269231,0.0,0.076923,0.0,0.153846
1,13,0.062271,0.021978,0.915751,0.020833,0.0,0.979167,0.153846,0.153846,0.0,0.0,0.153846,0.0,0.384615,0.0,0.0,0.0,0.153846


In [80]:
Xn = StandardScaler().fit_transform(X)

In [82]:
sgd_params = {
    'loss':['log'],  # logistic regression loss; newer scikit-learn releases name it 'log_loss'
    'penalty':['elasticnet'],
    'alpha':np.logspace(-4,2,50),
    'l1_ratio':np.linspace(0.01, 1.0, 15)
}

sgd_gs = GridSearchCV(SGDClassifier(), sgd_params, cv=5, verbose=1)
sgd_gs.fit(Xn, y)

Fitting 5 folds for each of 750 candidates, totalling 3750 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.9s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    3.7s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    8.3s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   14.9s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:   23.5s
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:   33.7s
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:   45.7s
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed:  1.0min
[Parallel(n_jobs=1)]: Done 3750 out of 3750 | elapsed:  1.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['elasticnet'], 'loss': ['log'], 'l1_ratio': array([ 0.01   ,  0.08071,  0.15143,  0.22214,  0.29286,  0.36357,
        0.43429,  0.505  ,  0.57571,  0.64643,  0.71714,  0.78786,
        0.85857,  0.92929,  1.     ]), 'alpha': array([  1.00000e-04,   1.32571e-04,   1.75751e-04...    2.44205e+01,   3.23746e+01,   4.29193e+01,   5.68987e+01,
         7.54312e+01,   1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [83]:
print(sgd_gs.best_score_)
print(sgd_gs.best_params_)

0.646723138654
{'penalty': 'elasticnet', 'alpha': 0.0051794746792312128, 'loss': 'log', 'l1_ratio': 0.57571428571428573}


In [84]:
for var, coef in zip(X.columns, sgd_gs.best_estimator_.coef_[0]):
    print('{} {}'.format(var, coef))

quote_len 0.0822293966478
pos_avg 0.173766610449
neg_avg -0.308953504392
obj_avg 0.0
pos_part_avg 0.291492706393
neg_part_avg 0.0
obj_part_avg 0.0
ADJ_prop 0.0728550950628
ADP_prop 0.0
ADV_prop -0.159857092263
CONJ_prop 0.0844512198989
DET_prop -0.0199989322871
INTJ_prop 0.0
NOUN_prop 0.125303028351
PART_prop -0.0937812943019
PRON_prop 0.0572366705904
PROPN_prop 0.0528157059288
VERB_prop -0.072128233792
