# Natural Language Processing - Sentiment Analysis on Rotten Tomatoes quotes

In [1]:
import pandas as pd
import numpy as np
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [2]:
# load data from rt_critics.csv in the data folder of our DAT2 repo
# at 'https://raw.githubusercontent.com/JamesByers/GA-SEA-DAT2/master/data/rt_critics.csv'
df = pd.read_csv('https://raw.githubusercontent.com/JamesByers/GA-SEA-DAT2/master/data/rt_critics.csv')


In [3]:
# look at first 5 rows
df.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559,Toy story
1,Richard Corliss,fresh,114709,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559,Toy story
2,David Ansen,fresh,114709,Newsweek,A winning animated feature that has something ...,2008-08-18,9559,Toy story
3,Leonard Klady,fresh,114709,Variety,The film sports a provocative and appealing st...,2008-06-09,9559,Toy story
4,Jonathan Rosenbaum,fresh,114709,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559,Toy story


In [4]:
# Check the shape of dataframe
df.shape

(14072, 8)

In [5]:
# Fresh is the column with ratings.  Count the number of each value in column fresh
df.fresh.value_counts()

fresh     8613
rotten    5436
none        23
Name: fresh, dtype: int64

In [6]:
# vectorize the quotes and store the result in a variable named Xcv
vectorizer = CountVectorizer()
Xcv = vectorizer.fit_transform(df['quote'])

In [7]:
# Check the shape of the sparse feature matrix Xcv
Xcv.shape

(14072, 21544)

But wait! We have more features (21,544) than samples (14,072), which makes overfitting much more likely. Let's trim the vocabulary down to the 5000 most frequent terms, ranked by term frequency across all documents.
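As a rough illustration of what `max_features` does, the sketch below fits a `CountVectorizer` on a tiny made-up corpus (not the Rotten Tomatoes data) and keeps only the two terms with the highest total count. The corpus and the names `docs`, `small_vectorizer`, and `X_small` are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not from rt_critics.csv)
docs = [
    "great film great cast",
    "great film weak story",
    "great film",
]

# Keep only the 2 terms with the highest total count across all documents
small_vectorizer = CountVectorizer(max_features=2)
X_small = small_vectorizer.fit_transform(docs)

# The surviving vocabulary and the reduced matrix shape
kept_terms = sorted(small_vectorizer.vocabulary_)
print(kept_terms)
print(X_small.shape)
```

Here "great" (4 occurrences) and "film" (3 occurrences) survive, while the terms that appear only once are dropped.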

In [8]:
# Create a vectorizer object as a variable named vectorizer that keeps just the top 5000 features
# Hint: check the documentation for CountVectorizer
vectorizer = CountVectorizer(max_features=5000)

In [9]:
# Create a new vectorized feature matrix named Xcv with the new vectorizer
Xcv = vectorizer.fit_transform(df['quote'])

In [10]:
# Create the response vector y where the value is 1 if "fresh" and 0 if any other value than fresh
y = np.where(df.fresh == 'fresh',1,0)
# or
# y = (df['fresh'] == 'fresh').values.astype(np.int8)
y[:5]

array([1, 1, 1, 1, 1])

In [11]:
# Determine the null accuracy
max(y.mean(), 1 - y.mean())

0.61206651506537801
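The null accuracy can also be checked by hand from the class counts printed earlier. The sketch below is only a worked version of that arithmetic; the dictionary `counts` is typed in from the `value_counts()` output rather than read from the data.

```python
# Class counts reported by df.fresh.value_counts() above
counts = {"fresh": 8613, "rotten": 5436, "none": 23}

# y is 1 for "fresh" and 0 for everything else ("rotten" and "none")
n_total = sum(counts.values())
p_fresh = counts["fresh"] / float(n_total)

# Null accuracy: the accuracy of always predicting the majority class
null_accuracy = max(p_fresh, 1 - p_fresh)
print(round(null_accuracy, 4))
```

Any model worth keeping should beat this baseline of roughly 61%.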

In [12]:
# split the data into training and test sets
xtrain_cv, xtest_cv, ytrain_cv, ytest_cv = train_test_split(Xcv, y)

In [13]:
# Evaluate performance of models using test train split
log_reg = LogisticRegression().fit(xtrain_cv, ytrain_cv)
print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtest_cv, ytest_cv)))

Accuracy: 75.55%


In [14]:
# Tune the logistic Regression regularization parameter "C" to improve performance.
# Evaluate performance of models using test train split
log_reg = LogisticRegression(C=1.5).fit(xtrain_cv, ytrain_cv)
print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtest_cv, ytest_cv)))

Accuracy: 74.99%


In [15]:
#Bonus if you have extra time: Create a for loop to find the C value
# that produces the most accurate model 
step = np.arange(0.5,10,0.5)
for c in step:
    log_reg = LogisticRegression(C=c).fit(xtrain_cv, ytrain_cv)
    print("C=  %s   Accuracy:  %s" % (c, round(100 * log_reg.score(xtest_cv, ytest_cv), 2)))

C=  0.5   Accuracy:  76.09
C=  1.0   Accuracy:  75.55
C=  1.5   Accuracy:  74.99
C=  2.0   Accuracy:  74.9
C=  2.5   Accuracy:  74.84
C=  3.0   Accuracy:  74.62
C=  3.5   Accuracy:  74.5
C=  4.0   Accuracy:  74.22
C=  4.5   Accuracy:  73.93
C=  5.0   Accuracy:  73.91
C=  5.5   Accuracy:  74.02
C=  6.0   Accuracy:  73.91
C=  6.5   Accuracy:  73.91
C=  7.0   Accuracy:  73.76
C=  7.5   Accuracy:  73.74
C=  8.0   Accuracy:  73.71
C=  8.5   Accuracy:  73.68
C=  9.0   Accuracy:  73.65
C=  9.5   Accuracy:  73.65
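Picking `C` by its score on the test set, as the loop above does, lets the test set influence the model, so the reported accuracy may be optimistic. One common alternative is to search over `C` with cross-validation on the training data only. The sketch below uses `GridSearchCV` on a tiny made-up corpus; `train_quotes`, `train_labels`, and `X_train` are hypothetical stand-ins for the notebook's training split.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the training quotes and labels (hypothetical data)
train_quotes = ["a great and inventive film", "wonderful and charming story",
                "dull and lifeless plot", "a boring and terrible mess"] * 5
train_labels = np.array([1, 1, 0, 0] * 5)

X_train = CountVectorizer().fit_transform(train_quotes)

# Search the same grid of C values as the loop above, scoring each
# candidate with 5-fold cross-validation on the training data only
param_grid = {"C": np.arange(0.5, 10, 0.5).tolist()}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, train_labels)

print("best C: %s, CV accuracy: %.3f" % (search.best_params_["C"], search.best_score_))
```

The held-out test set is then used once, at the end, to report the accuracy of the chosen model.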


# Stop Words

The performance isn't bad, but it's not great. Let's see if we can improve things by [using stop words](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer).
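To see what stop-word removal actually discards, the sketch below compares the vocabulary learned with and without `stop_words='english'` on two made-up sentences. The sentences and variable names are assumptions for illustration. The built-in English list appears to include negation words such as "not", which may matter for sentiment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences (not from rt_critics.csv)
docs = ["The film is a winning animated feature", "It is not the worst film"]

# Fit one vectorizer with the default settings and one that drops English stop words
with_stops = CountVectorizer().fit(docs)
without_stops = CountVectorizer(stop_words="english").fit(docs)

# Terms that disappear once stop words are removed
removed = sorted(set(with_stops.vocabulary_) - set(without_stops.vocabulary_))
print(removed)
```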

In [16]:
# Modify your vectorizer to also remove stop words (still allow only 5000 features)

# create a new vectorizer object that only allows 5000 features
vectorizer = CountVectorizer(max_features=5000, stop_words='english')

In [17]:
# Create a new X called Xcvs
Xcvs = vectorizer.fit_transform(df['quote'])

In [18]:
# split the converted data (Xcvs) into training and test sets
xtraincvs, xtestcvs, ytraincvs, ytestcvs = train_test_split(Xcvs, y)

In [19]:
# Evaluate performance of models using the test data
# Tune the regularization parameter, C, to improve performance.

log_reg = LogisticRegression(C=1.0).fit(xtraincvs, ytraincvs)
print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtestcvs, ytestcvs)))

Accuracy: 73.08%


In [20]:
# Tune the regularization parameter, C, to improve performance.
log_reg = LogisticRegression(C=1.5).fit(xtraincvs, ytraincvs)
print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtestcvs, ytestcvs)))

Accuracy: 73.02%


In [21]:
#Alternate tuning of C using for loop
step = np.arange(0.5,10,0.5)
for c in step:
    log_reg = LogisticRegression(C=c).fit(xtraincvs, ytraincvs)
    print("C=  %s   Accuracy:  %s" % (c, round(100 * log_reg.score(xtestcvs, ytestcvs), 2)))

C=  0.5   Accuracy:  73.42
C=  1.0   Accuracy:  73.08
C=  1.5   Accuracy:  73.02
C=  2.0   Accuracy:  73.22
C=  2.5   Accuracy:  73.17
C=  3.0   Accuracy:  73.39
C=  3.5   Accuracy:  73.05
C=  4.0   Accuracy:  72.88
C=  4.5   Accuracy:  72.85
C=  5.0   Accuracy:  72.6
C=  5.5   Accuracy:  72.6
C=  6.0   Accuracy:  72.6
C=  6.5   Accuracy:  72.57
C=  7.0   Accuracy:  72.46
C=  7.5   Accuracy:  72.4
C=  8.0   Accuracy:  72.34
C=  8.5   Accuracy:  72.2
C=  9.0   Accuracy:  71.92
C=  9.5   Accuracy:  71.92
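Each call to `train_test_split` above draws a new random split, so part of the difference between the stop-word results and the earlier ones may be split noise rather than an effect of the features. One way to make such comparisons fairer is to split the row indices once with a fixed seed and reuse them for every feature matrix. This is a sketch only: `random_state=42` and the names `row_ids`, `train_ids`, and `test_ids` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 14072  # number of rows reported by df.shape above

# Split row indices once, with a fixed seed, and reuse them for every feature matrix
row_ids = np.arange(n_rows)
train_ids, test_ids = train_test_split(row_ids, random_state=42)

# Sparse matrices and label arrays can then be indexed consistently, e.g.
# Xcv[train_ids], Xcvs[train_ids], y[train_ids]
print("%d training rows, %d test rows" % (len(train_ids), len(test_ids)))
```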


# tf-idf

If that didn't work, how about using tf-idf weighting?

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
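For intuition about what tf-idf weighting computes, the sketch below rebuilds the weights by hand on a made-up three-document corpus and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed idf, L2 row normalization); the corpus and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration
docs = ["great film", "great great cast", "weak film"]

# Raw term counts and the number of documents containing each term
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, as documented for scikit-learn's defaults (smooth_idf=True)
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Weight raw counts by idf, then scale each row to unit L2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's result
sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that appear in many documents get a smaller idf, so they contribute less to each row than rarer, more distinctive terms.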

In [22]:
# edit this cell to create a TfidfVectorizer instead of a simple CountVectorizer
# or start with your own model with CountVectorizer from the cells above

# create vectorizer object
# vectorizer = CountVectorizer(max_features=5000)

# Create Xti and y
# Xti = vectorizer.fit_transform(df['quote'])
# y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
# xtrainti, xtestti, ytrainti, ytestti = train_test_split(Xti, y)

In [23]:
# solution for "edit this cell to create a TfidfVectorizer instead of a simple CountVectorizer" above
from sklearn.feature_extraction.text import TfidfVectorizer

# create vectorizer object (note: this one also removes English stop words)
vectorizer = TfidfVectorizer(stop_words='english')

# convert the quotes into a sparse tf-idf matrix and the labels into a NumPy array
Xti = vectorizer.fit_transform(df['quote'])
y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
xtrainti, xtestti, ytrainti, ytestti = train_test_split(Xti, y)

In [24]:
# Evaluate performance of the new model
log_reg = LogisticRegression(C=1.0).fit(xtrainti, ytrainti)
print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtestti, ytestti)))

Accuracy: 75.55%


In [25]:
# Tune the regularization parameter, C, to improve performance.
step = np.arange(0.5,10,0.5)
for c in step:
    logr = LogisticRegression(C=c).fit(xtrainti, ytrainti)
    print("Accuracy: %0.2f%%" % (100 * logr.score(xtestti, ytestti)))

Accuracy: 72.65%
Accuracy: 75.55%
Accuracy: 76.44%
Accuracy: 76.58%
Accuracy: 76.98%
Accuracy: 77.12%
Accuracy: 77.12%
Accuracy: 77.09%
Accuracy: 77.17%
Accuracy: 76.98%
Accuracy: 77.06%
Accuracy: 76.89%
Accuracy: 76.92%
Accuracy: 76.86%
Accuracy: 76.75%
Accuracy: 76.81%
Accuracy: 76.78%
Accuracy: 76.78%
Accuracy: 76.66%


In [26]:
#Bonus: if you have time find the best value of C using a for loop
step = np.arange(0.5,10,0.5)
for c in step:
    log_reg = LogisticRegression(C=c).fit(xtrainti, ytrainti)
    print("C=  %s   Accuracy:  %s" % (c, round(100 * log_reg.score(xtestti, ytestti), 2)))

C=  0.5   Accuracy:  72.65
C=  1.0   Accuracy:  75.55
C=  1.5   Accuracy:  76.44
C=  2.0   Accuracy:  76.58
C=  2.5   Accuracy:  76.98
C=  3.0   Accuracy:  77.12
C=  3.5   Accuracy:  77.12
C=  4.0   Accuracy:  77.09
C=  4.5   Accuracy:  77.17
C=  5.0   Accuracy:  76.98
C=  5.5   Accuracy:  77.06
C=  6.0   Accuracy:  76.89
C=  6.5   Accuracy:  76.92
C=  7.0   Accuracy:  76.86
C=  7.5   Accuracy:  76.75
C=  8.0   Accuracy:  76.81
C=  8.5   Accuracy:  76.78
C=  9.0   Accuracy:  76.78
C=  9.5   Accuracy:  76.66


# tf-idf and stop words

Does using both together help?

In [None]:
# edit this cell to create a TfidfVectorizer that uses stop words

# create vectorizer object
#vectorizer = CountVectorizer(max_features=5000)

# convert our documents and their labels into numpy arrays
#Xtis = vectorizer.fit_transform(df['quote'])
#y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
#xtraintis, xtesttis, ytraintis, ytesttis = train_test_split(Xtis, y)

In [None]:
# Solution for "edit this cell to create a TfidfVectorizer that uses stop words" above

# create vectorizer object
vectorizer = TfidfVectorizer(stop_words='english')

# convert the quotes into a sparse tf-idf matrix and the labels into a NumPy array
Xtis = vectorizer.fit_transform(df['quote'])
y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
xtraintis, xtesttis, ytraintis, ytesttis = train_test_split(Xtis, y)

In [None]:
# Evaluate performance of models
# Tune the regularization parameter, C, to improve performance.

log_reg = LogisticRegression(C=1.0).fit(xtraintis, ytraintis)
print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtesttis, ytesttis)))

In [None]:
# Bonus: if you have time find the best value of C using a for loop
step = np.arange(0.5,10,0.5)
for c in step:
    log_reg = LogisticRegression(C=c).fit(xtraintis, ytraintis)
    print("Accuracy: %0.2f%%" % (100 * log_reg.score(xtesttis, ytesttis)))

# Next steps

Are you satisfied with these results? Why might you be less than satisfied? How can you explain the observed behavior? What next steps would you take to improve this classifier? If you have time remaining, try a few strategies out below.

In [None]:
# continue playing here

# Use pipeline to evaluate accuracy with cross validation
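One possible way to follow the pipeline hint is sketched below: the vectorizer and the classifier are chained with `make_pipeline`, so the vocabulary is refit inside each cross-validation fold, and `cross_val_score` reports one accuracy per fold. The toy `quotes` and `labels` are hypothetical stand-ins for `df['quote']` and `y`, and `C=3.0` is just an example value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for df['quote'] and y (hypothetical data)
quotes = ["a great and inventive film", "wonderful and charming story",
          "dull and lifeless plot", "a boring and terrible mess"] * 5
labels = [1, 1, 0, 0] * 5

# Chain the vectorizer and the classifier so both are fit on each training fold
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(C=3.0))

# 5-fold cross-validated accuracy
scores = cross_val_score(text_clf, quotes, labels, cv=5, scoring="accuracy")
print("Mean CV accuracy: %0.2f%%" % (100 * scores.mean()))
```

Averaging over folds gives a steadier estimate than a single random train/test split.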

# More Next Steps

The hardest part of creating a sentiment model is finding good training data. Googling 'sentiment analysis training data' or 'sentiment analysis test data' turns up a few freely available sources. Most of them are hosted by universities.

But notice that determining the judgment of a movie review isn't the same task as determining the emotional content of a tweet. And yet, it kind of is. The computer doesn't know anything about the nature of the text. All it knows is that there are documents with one label (fresh/happy) and documents with another label (rotten/sad), and it needs to fit a model to discriminate between the two. This can be extended to more classes (look into the 20 newsgroups dataset in scikit-learn) and to proprietary corpora.

One application you might use at work is classifying support emails from users. The classes may be 'ranting', 'mischarge', 'lost order', 'gushing'. Or whatever is common. Even if the classifier isn't perfect, it could help streamline the process of getting the right emails to the right support personnel.
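The support-email idea can be sketched with the same tools used in this notebook. Everything below is hypothetical: the example emails, the `email_labels`, and the choice of a tf-idf plus logistic regression pipeline are assumptions for illustration, and a real classifier would need far more labeled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical support emails, two per class
emails = [
    "this is the worst service I have ever used",
    "I am furious about how I was treated",
    "I was charged twice for the same order",
    "there is a wrong charge on my credit card",
    "my package never arrived",
    "the order seems to be lost in shipping",
    "I love this product, thank you so much",
    "amazing experience, your team is wonderful",
]
email_labels = ["ranting", "ranting", "mischarge", "mischarge",
                "lost order", "lost order", "gushing", "gushing"]

# Same recipe as the sentiment model, now with four classes instead of two
email_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
email_clf.fit(emails, email_labels)

# Route a new (made-up) email to one of the four classes
predicted = email_clf.predict(["you charged my card twice"])
print(predicted[0])
```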

In [None]:
from IPython.display import HTML
HTML('''
<style>
.text_cell_render {
  background-color: silver
}
</style>
''')