# Sentiment Analysis

*Sentiment analysis* is the task of evaluating whether a given passage of text is primarily "positive" or "negative." The meanings of these terms depend on context. For example, a "positive" product review would indicate that the customer likes the product, whereas a "positive" tweet might just indicate that the user is happy that day.

In this lecture, we'll discuss how familiar machine learning tools can allow us to perform sentiment analysis on unstructured text. 

Our data set for this task comes from the `nltk` package again. It's a set of movie reviews. 

In [1]:
import numpy as np
import pandas as pd
import nltk

nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/philchodrow/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


The `movie_reviews` object allows us to read in the data. 

In [2]:
movie_reviews

<CategorizedPlaintextCorpusReader in '/Users/philchodrow/nltk_data/corpora/movie_reviews'>

For today, the two most important methods of this object are `fileids()` and `raw()`. The first method will allow us to locate the files on disk in which the movie reviews are contained, and the second method will allow us to then obtain the full text of the reviews from the file path. 

Let's first look at the fileids. 

In [3]:
f = movie_reviews.fileids()[0]
f

'neg/cv000_29416.txt'

Each review is contained in its own file, in one of two folders. The `neg` folder contains negative reviews, while the `pos` folder contains positive reviews. 

Once we have picked a file path, we can use the `raw()` method to extract the raw text of the movie review.

In [4]:
movie_reviews.raw(f)

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

Take a moment to think: how can we read in the complete data set? 

<br> 
<br> 
<br> 
<br> 
<br> 
<br> 

A `for`-loop would be one way. In this approach, we would create an empty list to hold the review texts, iterate over the list of file paths, and populate the list of texts as we go. For example:

In [5]:
raw_texts = []
for p in movie_reviews.fileids():
    raw_texts.append(movie_reviews.raw(p))

This does work, but it requires three lines and still leaves us with the task of bringing the texts into a format (like a data frame) that we know how to work with. 

Using the `apply` method from `pandas` gives us a more concise way, and it leaves the texts in a data frame right away:

In [6]:
# create a data frame whose only column contains the fileids

df = pd.DataFrame({"fileid" : movie_reviews.fileids()})
# create a new column by applying the movie_reviews.raw()
# method to each entry of df['fileid']
df['raw_text'] = df['fileid'].apply(movie_reviews.raw)

In [7]:
df

|      | fileid | raw_text |
|------|--------|----------|
| 0 | neg/cv000_29416.txt | plot : two teen couples go to a church party ,... |
| 1 | neg/cv001_19502.txt | the happy bastard's quick movie review \ndamn ... |
| 2 | neg/cv002_17424.txt | it is movies like these that make a jaded movi... |
| 3 | neg/cv003_12683.txt | " quest for camelot " is warner bros . ' firs... |
| 4 | neg/cv004_12641.txt | synopsis : a mentally unstable man undergoing ... |
| ... | ... | ... |
| 1995 | pos/cv995_21821.txt | wow ! what a movie . \nit's everything a movie... |
| 1996 | pos/cv996_11592.txt | richard gere can be a commanding actor , but h... |
| 1997 | pos/cv997_5046.txt | glory--starring matthew broderick , denzel was... |
| 1998 | pos/cv998_14111.txt | steven spielberg's second epic film on world w... |
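
To see that the `apply` version does the same work as the loop, here is a minimal sketch on toy data. The dictionary `fake_files` and the function `fake_raw` are hypothetical stand-ins for the corpus and for `movie_reviews.raw`, chosen so that the snippet runs without downloading anything.

```python
import pandas as pd

# hypothetical stand-in for the corpus: file id -> review text
fake_files = {
    "neg/a.txt": "dull and far too long",
    "pos/b.txt": "a warm , funny film",
}

def fake_raw(fileid):
    """Toy replacement for movie_reviews.raw()."""
    return fake_files[fileid]

# loop version: build the list of texts one file at a time
loop_texts = []
for p in fake_files:
    loop_texts.append(fake_raw(p))

# apply version: call the same function on each entry of the column
toy_df = pd.DataFrame({"fileid": list(fake_files)})
toy_df["raw_text"] = toy_df["fileid"].apply(fake_raw)

# both approaches produce the same texts in the same order
print(toy_df["raw_text"].tolist() == loop_texts)
```

The practical difference is mostly convenience: with `apply`, the texts end up next to their file ids in a single data frame.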


We have now read in the data. Do we have what we need for sentiment analysis?

Not quite yet, but we're close! In this lecture, we'll treat sentiment analysis as a form of *classification*: our aim is to build a machine learning model that we can use to predict whether a given text is positive or negative. For this approach, we are going to need both target and predictor variables. Fortunately, we know how to obtain both of these. 

In [8]:
# check whether the text came from the pos folder. 
df['is_good'] = df['fileid'].str.split('/').str.get(0) == 'pos'
df

|      | fileid | raw_text | is_good |
|------|--------|----------|---------|
| 0 | neg/cv000_29416.txt | plot : two teen couples go to a church party ,... | False |
| 1 | neg/cv001_19502.txt | the happy bastard's quick movie review \ndamn ... | False |
| 2 | neg/cv002_17424.txt | it is movies like these that make a jaded movi... | False |
| 3 | neg/cv003_12683.txt | " quest for camelot " is warner bros . ' firs... | False |
| 4 | neg/cv004_12641.txt | synopsis : a mentally unstable man undergoing ... | False |
| ... | ... | ... | ... |
| 1995 | pos/cv995_21821.txt | wow ! what a movie . \nit's everything a movie... | True |
| 1996 | pos/cv996_11592.txt | richard gere can be a commanding actor , but h... | True |
| 1997 | pos/cv997_5046.txt | glory--starring matthew broderick , denzel was... | True |
| 1998 | pos/cv998_14111.txt | steven spielberg's second epic film on world w... | True |
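
A quick sanity check on a label column like this one is to count how many rows fall in each class. The sketch below uses a few made-up file ids (not the real corpus) and also shows `str.startswith` as one possible alternative way to build the same column.

```python
import pandas as pd

# made-up file ids that mimic the folder/file layout of the corpus
toy = pd.DataFrame({"fileid": ["neg/a.txt", "neg/b.txt", "pos/c.txt"]})

# same approach as above: take the folder name before the slash
toy["is_good"] = toy["fileid"].str.split("/").str.get(0) == "pos"

# alternative: test the prefix directly
alt = toy["fileid"].str.startswith("pos/")

# count the rows in each class
counts = toy["is_good"].value_counts()
print(counts)
```

On the real data frame, `df['is_good'].value_counts()` would show whether the two classes are roughly balanced, which matters when we later read accuracy scores.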


We can use tools from before to create a term-document matrix. This time, we treat each movie review as a document. 

In [9]:
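from sklearn.feature_extraction.text import CountVectorizer

# ignore words that appear in more than 20% of reviews or in fewer than 30 reviews,
# and drop common English stop words
vec = CountVectorizer(max_df = 0.2, min_df = 30, stop_words = 'english')

# sparse matrix of word counts: one row per review, one column per word
counts = vec.fit_transform(df['raw_text'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
count_df = pd.DataFrame(counts.toarray(), columns = vec.get_feature_names_out())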

In [10]:
df = pd.concat((df, count_df), axis = 1)

In [11]:
df

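|      | fileid | raw_text | is_good | 000 | 10 | 100 | 11 | 12 | 13 | 15 | ... | writing | written | wrong | wrote | yeah | yes | york | younger | youth | zero |
|------|--------|----------|---------|-----|----|-----|----|----|----|----|-----|---------|---------|-------|-------|------|-----|------|---------|-------|------|
| 0 | neg/cv000_29416.txt | plot : two teen couples go to a church party ,... | False | 0 | 10 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | neg/cv001_19502.txt | the happy bastard's quick movie review \ndamn ... | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | neg/cv002_17424.txt | it is movies like these that make a jaded movi... | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | neg/cv003_12683.txt | " quest for camelot " is warner bros . ' firs... | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | neg/cv004_12641.txt | synopsis : a mentally unstable man undergoing ... | False | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | pos/cv995_21821.txt | wow ! what a movie . \nit's everything a movie... | True | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1996 | pos/cv996_11592.txt | richard gere can be a commanding actor , but h... | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1997 | pos/cv997_5046.txt | glory--starring matthew broderick , denzel was... | True | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1998 | pos/cv998_14111.txt | steven spielberg's second epic film on world w... | True | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |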


We have now successfully read in and prepared our data. 

---

# On to Sentiment Analysis

These steps should be pretty familiar. We are going to split our data into training and test sets, create a logistic regression classifier, and evaluate the classifier on the test set.

In [12]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.4)

X_train = train.drop(['fileid', 'raw_text', 'is_good'], axis = 1)
y_train = train['is_good']

X_test = test.drop(['fileid', 'raw_text', 'is_good'], axis = 1)
y_test = test['is_good']
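
Because `train_test_split` shuffles at random, the scores reported in this lecture will vary slightly from run to run. As a hedged sketch (on toy data, not the review data frame), the optional `random_state` and `stratify` arguments can make a split reproducible and keep the class balance the same in both pieces.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data: 10 rows, half positive and half negative
toy = pd.DataFrame({
    "x": range(10),
    "is_good": [True] * 5 + [False] * 5,
})

# random_state fixes the shuffle; stratify keeps the class balance in both pieces
toy_train, toy_test = train_test_split(
    toy, test_size=0.4, random_state=0, stratify=toy["is_good"]
)

print(len(toy_train), len(toy_test))
print(toy_test["is_good"].sum())
```

With 10 rows and `test_size = 0.4`, the split is 6 training rows and 4 test rows, and stratification puts 2 positive rows in the test set.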

In [13]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train, y_train)
LR.score(X_train, y_train)

1.0

In [14]:
from sklearn.model_selection import cross_val_score

cross_val_score(LR, X_train, y_train, cv = 5).mean()

0.7791666666666666

Our model perfectly fits the training data, but based on cross-validation it looks like our predictive accuracy might only be around 78%. This looks like overfitting, which makes sense: overfitting is a very common problem when we have many predictor columns (lots of words) and not that many observations.
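
The gap between the score on the training data and the cross-validated score can be reproduced in an extreme form with synthetic data. In the sketch below (an illustration, not part of the review analysis), the features are pure noise and the labels are unrelated to them, yet a logistic regression with many more columns than rows still fits the training data almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 40 observations, 500 pure-noise features, labels unrelated to the features
X_noise = rng.normal(size=(40, 500))
y_noise = np.array([0, 1] * 20)

model = LogisticRegression(max_iter=1000)
model.fit(X_noise, y_noise)

# accuracy on the data the model was fit to
train_acc = model.score(X_noise, y_noise)
# accuracy estimated on held-out folds
cv_acc = cross_val_score(model, X_noise, y_noise, cv=5).mean()

print(train_acc, cv_acc)
```

The training accuracy should be far above the cross-validated accuracy here, even though there is nothing real to learn. The review data is less extreme, but the mechanism is the same.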

There are multiple ways to address this. In this lecture, let's use the regularization parameter `C`, which controls model complexity in logistic regression: smaller values of `C` mean stronger regularization and hence a simpler model. While one could be more systematic about this, here's a simple little loop:

In [15]:
for C in np.linspace(0.005, 0.05, 10):
    print(str(np.round(C, 4)), end = ": ")
    LR = LogisticRegression(C = C)
    cv_score = cross_val_score(LR, X_train, y_train, cv = 5).mean()
    print(np.round(cv_score, 3))

0.005: 0.81
0.01: 0.815
0.015: 0.815
0.02: 0.811
0.025: 0.811
0.03: 0.812
0.035: 0.811
0.04: 0.807
0.045: 0.807
0.05: 0.805
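
A more systematic version of this search could use `GridSearchCV` from `sklearn`, which runs the same cross-validation for every candidate value and records the best one. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so the numbers it prints are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for (X_train, y_train)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# try the same grid of C values as in the loop above
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.linspace(0.005, 0.05, 10)},
    cv=5,
)
grid.fit(X_toy, y_toy)

# best value of C and its mean cross-validated accuracy
print(grid.best_params_, np.round(grid.best_score_, 3))
```

After fitting, `grid.best_estimator_` holds a model refit on all of the supplied data with the best value of `C`.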


It looks like we can improve our cross-validated accuracy to about 0.815 using `C = 0.01` or so. Let's do that and evaluate on the test set.

In [16]:
LR = LogisticRegression(C = 0.01)
LR.fit(X_train, y_train)
LR.score(X_test, y_test)

0.81125

So, our simple logistic model is able to correctly distinguish positive from negative movie reviews about 81% of the time. Not bad!
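
A single accuracy number doesn't say which kind of mistake the model makes. One common way to look closer is a confusion matrix. The labels below are made up for illustration; on the real data one would pass `y_test` and `LR.predict(X_test)` instead.

```python
from sklearn.metrics import confusion_matrix

# made-up labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

# rows are true classes, columns are predicted classes, both ordered [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# accuracy is the share of predictions on the diagonal
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

In this toy case, one negative review is mistaken for a positive one and one positive review for a negative one, giving an accuracy of 3 out of 5.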

However, we're not done yet.

One of the primary purposes of sentiment analysis is to determine which words carry positive or negative associations. It is common to assign each word a score that reflects how positive or negative it is. We can do this using the coefficients of the logistic model: a positive coefficient pushes the prediction toward "positive," and a negative coefficient pushes it toward "negative." First, let's make a data frame of the words and their scores.

In [17]:
result_df = pd.DataFrame({"coef" : LR.coef_[0], "word" : X_train.columns})
result_df

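|      | coef | word |
|------|------|------|
| 0 | -0.024640 | 000 |
| 1 | 0.051605 | 10 |
| 2 | 0.014529 | 100 |
| 3 | 0.015553 | 11 |
| 4 | -0.004789 | 12 |
| ... | ... | ... |
| 3019 | -0.026830 | yes |
| 3020 | 0.047812 | york |
| 3021 | -0.005082 | younger |
| 3022 | 0.022394 | youth |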


Now let's sort the data frame to see the most negative words according to the model. 

In [18]:
result_df.sort_values('coef', ascending = True).head(10)

|      | coef | word |
|------|------|------|
| 3002 | -0.233484 | worst |
| 2626 | -0.151567 | supposed |
| 2837 | -0.147333 | unfortunately |
| 1601 | -0.140733 | looks |
| 2587 | -0.138808 | stupid |
| 1707 | -0.132609 | mess |
| 975 | -0.130985 | fails |
| 1993 | -0.127598 | poor |
| 2139 | -0.121122 | reason |
| 2926 | -0.120749 | wasted |
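
One way to read these numbers: in a logistic regression, increasing a predictor by one unit multiplies the model's odds of the positive class by `exp(coef)`, holding the other predictors fixed. As a small worked example, using the coefficient of "worst" from the table above:

```python
import numpy as np

# coefficient of "worst", copied from the table above
coef_worst = -0.233484

# each extra occurrence of the word multiplies the odds of "positive" by exp(coef)
odds_multiplier = np.exp(coef_worst)
print(np.round(odds_multiplier, 2))
```

So, according to this model, each additional occurrence of "worst" in a review cuts the odds that the review is positive to roughly 0.79 times what they were.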


That makes sense! What about the most positive words? 

In [19]:
result_df.sort_values('coef', ascending = False).head(10)

|      | coef | word |
|------|------|------|
| 1938 | 0.146348 | performances |
| 118 | 0.130839 | american |
| 1934 | 0.121159 | perfect |
| 1282 | 0.11785 | hilarious |
| 1163 | 0.117374 | gives |
| 745 | 0.107687 | different |
| 989 | 0.107235 | family |
| 1717 | 0.104652 | mike |
| 925 | 0.104095 | excellent |
| 2918 | 0.102962 | war |


This also looks pretty logical. We can conclude that our model has had some success in learning which words have positive and negative meanings. 

Of course, the story isn't over: there are many different models that can be used for sentiment analysis, some of which highlight different features. 

Finally, the combination of term-document extraction with classification models isn't just for sentiment analysis! Essentially the same pipeline can work to produce a functioning spam classifier, in which a "negative" set of text is spam and a "positive" set of text is a legitimate email. 
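
As a rough sketch of what that might look like, the snippet below chains a `CountVectorizer` and a `LogisticRegression` with `make_pipeline` from `sklearn`. The messages and labels are made up for illustration, and four examples are far too few for a real spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# made-up messages; 1 = spam, 0 = legitimate
messages = [
    "win free money now",
    "free prize claim now",
    "meeting moved to noon",
    "lunch at noon tomorrow",
]
is_spam = [1, 1, 0, 0]

# term-document matrix followed by a logistic classifier, in one object
spam_model = make_pipeline(CountVectorizer(), LogisticRegression())
spam_model.fit(messages, is_spam)

# classify two new (also made-up) messages
preds = spam_model.predict(["claim your free money", "see you at noon"])
print(preds)
```

Wrapping both steps in a pipeline means that new text is vectorized with the same vocabulary that was learned during fitting.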