# Sentiment Analysis

*Sentiment analysis* is the task of evaluating whether a given passage of text is primarily "positive" or "negative." The meanings of these terms can change in context. For example, a "positive" product review would indicate that the customer likes the product, whereas a "positive" tweet might just indicate that the user is happy that day. 

In this lecture, we'll discuss how familiar machine learning tools can allow us to perform sentiment analysis on unstructured text. 

Our data set for this task comes from the `nltk` package again. It's a set of movie reviews. 

In [1]:
#standard imports 
import numpy as np
import pandas as pd

In [2]:
#use nltk to get movie reviews
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


The `movie_reviews` object allows us to read in the data. 

In [3]:
movie_reviews

<CategorizedPlaintextCorpusReader in 'C:\\Users\\micha\\AppData\\Roaming\\nltk_data\\corpora\\movie_reviews'>

For today, the two most important methods of this object are `fileids()` and `raw()`. The first method will allow us to locate the files on disk in which the movie reviews are contained, and the second method will allow us to then obtain the full text of the reviews from the file path. 

Let's first look at the fileids. 

In [4]:
type(movie_reviews.fileids())

list

In [5]:
review_list=movie_reviews.fileids()
len(review_list)

2000

`movie_reviews.fileids()` returns a list of 2000 fileids.

In [7]:
print(review_list[0],review_list[1000])

neg/cv000_29416.txt pos/cv000_29590.txt


The first 1000 are in a folder called `neg` and the last 1000 are in a folder called `pos`.

We can use the `raw` method of `movie_reviews` to read the reviews contained in the files.

In [8]:
movie_reviews.raw(review_list[0])

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

### How can we read in the complete dataset???

One option would be to use for loops, but this is cumbersome and won't give us the data in a nice format. A better solution is to use dataframes and the `apply` method.

In [9]:
df=pd.DataFrame({"fileid":review_list})
df.head(3)

| | fileid |
|---|---|
| 0 | neg/cv000_29416.txt |
| 1 | neg/cv001_19502.txt |
| 2 | neg/cv002_17424.txt |


So far we have a dataframe with one column, the fileids. Now, let's create another column by __applying__ the `raw` method to the `fileid` column.

In [10]:
df['raw_text']=df['fileid'].apply(movie_reviews.raw)
df.head()

| | fileid | raw_text |
|---|---|---|
| 0 | neg/cv000_29416.txt | plot : two teen couples go to a church party ,... |
| 1 | neg/cv001_19502.txt | the happy bastard's quick movie review \ndamn ... |
| 2 | neg/cv002_17424.txt | it is movies like these that make a jaded movi... |
| 3 | neg/cv003_12683.txt | " quest for camelot " is warner bros . ' firs... |
| 4 | neg/cv004_12641.txt | synopsis : a mentally unstable man undergoing ... |


We can use tools from before to create a term-document matrix. This time, we treat each movie review as a document. 

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vec=CountVectorizer(max_df=.2,min_df=30,stop_words="english")
counts=vec.fit_transform(df['raw_text'])

In [12]:
# one row per review, one column per vocabulary word
counts_df=pd.DataFrame(counts.toarray(),columns=vec.get_feature_names_out())
counts_df.head(3)

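| | 000 | 10 | 100 | 11 | 12 | 13 | 15 | 17 | 18 | 1993 | ... | writing | written | wrong | wrote | yeah | yes | york | younger | youth | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |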
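To make the `min_df` and `max_df` arguments concrete, here is a minimal pure-Python sketch of a term-document matrix with document-frequency filtering. It is not how `CountVectorizer` is implemented: the toy `docs`, the helper name `term_document_matrix`, and the whitespace tokenizer are all assumptions made for illustration (the real vectorizer lowercases the text, uses a regular-expression tokenizer, and can drop stop words). As in scikit-learn, the integer `min_df` is read as a number of documents and the float `max_df` as a fraction of documents.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the movie review data.
docs = [
    "a great film with great acting",
    "a boring film",
    "great story but boring pacing",
    "a film about a war",
]

def term_document_matrix(docs, min_df=1, max_df=1.0):
    """Count words per document, keeping only words whose document
    frequency is at least min_df (a count) and at most max_df (a fraction)."""
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    max_count = max_df * len(docs)
    vocab = sorted(w for w, df in doc_freq.items() if min_df <= df <= max_count)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocab])
    return vocab, rows

vocab, rows = term_document_matrix(docs, min_df=2, max_df=0.5)
print(vocab)
for row in rows:
    print(row)
```

With these toy documents only `boring` and `great` survive. Words that occur in a single document fall below `min_df=2`, while `a` and `film` occur in three of the four documents, which is above `max_df=0.5`.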


In [13]:
df=pd.concat((df,counts_df),axis=1)
df.head(3)

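| | fileid | raw_text | 000 | 10 | 100 | 11 | 12 | 13 | 15 | 17 | ... | writing | written | wrong | wrote | yeah | yes | york | younger | youth | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | neg/cv000_29416.txt | plot : two teen couples go to a church party ,... | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | neg/cv001_19502.txt | the happy bastard's quick movie review \ndamn ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | neg/cv002_17424.txt | it is movies like these that make a jaded movi... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |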


Now, let's add a column called `is_good` that records whether the review is positive (`True`) or negative (`False`).

In [15]:
# the folder name before the "/" is the label: "pos" or "neg"
df['is_good']=df['fileid'].str.split('/').str.get(0)=="pos"
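The label depends only on the folder prefix of each fileid. As a small sketch of that logic outside of pandas, using the two fileids printed earlier (the helper name `is_positive` is made up for this example):

```python
def is_positive(fileid):
    """Return True when the fileid's folder prefix is 'pos'."""
    folder = fileid.split("/")[0]
    return folder == "pos"

fileids = ["neg/cv000_29416.txt", "pos/cv000_29590.txt"]
labels = [is_positive(f) for f in fileids]
print(labels)

# An ordering comparison such as folder >= "p" gives the same answer for
# these two folder names, because "neg" sorts before "p" and "pos" after it.
same = [(f.split("/")[0] >= "p") == is_positive(f) for f in fileids]
print(same)
```

Testing for equality with `"pos"` states the intent directly, whereas the ordering comparison only works because there happen to be exactly two folder names.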

We have now successfully read in and prepared our data. 

---

# On to Sentiment Analysis

These steps should be pretty familiar. We are going to split our data into training and test sets, create a logistic classifier, and evaluate the logistic classifier on the test set.

As usual, we start by splitting the data into training and test sets.

In [16]:
from sklearn.model_selection import train_test_split
train,test =train_test_split(df,test_size=.4)
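Because the split above is random and not seeded, the scores reported below will vary somewhat from run to run; `train_test_split` accepts a `random_state` argument if a repeatable split is wanted. The following is only a sketch of the idea of a seeded 60/40 split, written with the standard library. The helper `shuffled_split` is hypothetical and is not scikit-learn's implementation.

```python
import random

def shuffled_split(n_rows, test_size, seed):
    """Shuffle the row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# 2000 reviews, 40% held out for testing
train_idx, test_idx = shuffled_split(2000, test_size=0.4, seed=0)
print(len(train_idx), len(test_idx))

# The same seed reproduces exactly the same split.
again_train, again_test = shuffled_split(2000, test_size=0.4, seed=0)
print(train_idx == again_train and test_idx == again_test)
```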

Now, let's select the appropriate columns. Our target is the `is_good` column. For the predictors, we drop the `fileid`, `raw_text`, and `is_good` columns, which leaves only the word counts.

In [17]:
X_train=train.drop(['fileid','raw_text','is_good'],axis=1)
y_train=train['is_good']

X_test=test.drop(['fileid','raw_text','is_good'],axis=1)
y_test=test['is_good']

Now, let's apply a logistic regression model.

In [18]:
from sklearn.linear_model import LogisticRegression

LR=LogisticRegression()
LR.fit(X_train,y_train)

LogisticRegression()

In [19]:
LR.score(X_train,y_train)

1.0

Is this too good to be true???

In [20]:
from sklearn.model_selection import cross_val_score

cross_val_score(LR,X_train,y_train,cv=5).mean()

0.8099999999999999

Our model perfectly fits the training data, but based on cross-validation it looks like our predictive accuracy might only be around 80%. This looks like overfitting, which makes sense: overfitting is a very common problem when we have many predictor columns (lots of words) and not that many observations.
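As a reminder of what `cv=5` means, here is a rough sketch of how five folds carve up the 1200 training rows (60% of the 2000 reviews). The generator `k_fold_indices` is a hypothetical helper written for this note. It uses plain contiguous folds, whereas for a classifier scikit-learn's `cross_val_score` uses stratified folds, so treat this as an illustration of the idea only.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

folds = list(k_fold_indices(1200, k=5))
for train, validation in folds:
    print(len(train), len(validation))
```

Each model is fit on 960 rows and scored on the 240 rows it has not seen, which is why the averaged score is a more honest estimate than the accuracy on the data used for fitting.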

There are multiple ways to address this. In this lecture, let's use cross-validation to tune the regularization parameter `C`. In scikit-learn, smaller values of `C` mean stronger regularization.
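To see why the value of `C` matters, it can help to write the penalized objective out by hand. The sketch below assumes an L2 penalty and an objective of the form 0.5·‖w‖² + C·(sum of logistic losses), with no intercept; to the best of our understanding this matches scikit-learn's default up to scaling, but treat the exact form as an assumption. The function name, the tiny count matrix `X`, and the weights are all made up for illustration.

```python
import math

def l2_logistic_objective(weights, X, y, C):
    """Penalty on the weights plus C times the total logistic loss."""
    penalty = 0.5 * sum(w * w for w in weights)
    data_loss = 0.0
    for row, label in zip(X, y):
        z = sum(w * x for w, x in zip(weights, row))
        signed = z if label else -z          # large and positive when the fit is good
        data_loss += math.log1p(math.exp(-signed))
    return penalty + C * data_loss

X = [[2, 0], [0, 1], [1, 1]]   # hypothetical word counts for three reviews
y = [True, False, True]        # hypothetical labels
weights = [1.0, -1.0]

for C in (0.01, 1.0):
    print(C, l2_logistic_objective(weights, X, y, C))
```

When `C` is small, the penalty term dominates the objective, so the minimizing weights are pulled toward zero and the model has less freedom to memorize the training reviews.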

In [21]:
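# candidate values of C (smaller C = stronger regularization)
C_pool=np.linspace(.005,.05,10)
best_score=-np.inf

for c in C_pool:
    LR=LogisticRegression(C=c)
    # mean accuracy over 5 cross-validation folds of the training data
    score=cross_val_score(LR,X_train,y_train,cv=5).mean()
    # remember the best-scoring value of C seen so far
    if score>best_score:
        best_score=score
        best_c=c
    print("C=",np.round(c,3)," CrossValScore= ",score)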
    

C= 0.005  CrossValScore=  0.8083333333333333
C= 0.01  CrossValScore=  0.8275
C= 0.015  CrossValScore=  0.8241666666666667
C= 0.02  CrossValScore=  0.8233333333333335
C= 0.025  CrossValScore=  0.8191666666666666
C= 0.03  CrossValScore=  0.8200000000000001
C= 0.035  CrossValScore=  0.8216666666666667
C= 0.04  CrossValScore=  0.8233333333333333
C= 0.045  CrossValScore=  0.8233333333333333
C= 0.05  CrossValScore=  0.8225


Now, let's use the best `C` to evaluate the model on the test set.

In [22]:
best_c

0.010000000000000002

In [23]:
LR=LogisticRegression(C=best_c)
LR.fit(X_train,y_train)
LR.score(X_test,y_test)

0.81

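As we can see, tuning `C` raised the cross-validated accuracy from about 0.81 to about 0.83, and the tuned model scores 0.81 on the test set. That is in line with the cross-validation estimates, and far from the perfect score we saw on the training data.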

__However, we're not done yet.__ 

One of the primary purposes of sentiment analysis is to determine which words carry positive or negative associations. It is common to assign scores to each word that govern how positive or negative they are. We can do this using the coefficients of the logistic model. First, let's make a data frame of the words and their scores. 

In [25]:
sentiment_df=pd.DataFrame({"word":X_train.columns,"coef":LR.coef_[0]})
sentiment_df

| | word | coef |
|---|---|---|
| 0 | 000 | -0.022492 |
| 1 | 10 | 0.058030 |
| 2 | 100 | -0.010759 |
| 3 | 11 | 0.002476 |
| 4 | 12 | -0.003182 |
| ... | ... | ... |
| 3019 | yes | 0.001501 |
| 3020 | york | -0.039812 |
| 3021 | younger | 0.004182 |
| 3022 | youth | 0.022853 |


Now, let's look at the most negative words

In [26]:
sentiment_df.sort_values('coef',ascending=True).head(10)

| | word | coef |
|---|---|---|
| 2626 | supposed | -0.20293 |
| 3002 | worst | -0.187666 |
| 308 | boring | -0.176584 |
| 2587 | stupid | -0.166226 |
| 2837 | unfortunately | -0.155261 |
| 1601 | looks | -0.144165 |
| 1993 | poor | -0.13752 |
| 2224 | ridiculous | -0.130595 |
| 2814 | tv | -0.125457 |
| 1707 | mess | -0.11448 |


What about positive words?

In [27]:
sentiment_df.sort_values('coef',ascending=False).head(10)

| | word | coef |
|---|---|---|
| 903 | especially | 0.145686 |
| 118 | american | 0.137636 |
| 989 | family | 0.134044 |
| 2801 | true | 0.131652 |
| 1938 | performances | 0.115545 |
| 1890 | overall | 0.115223 |
| 2918 | war | 0.114682 |
| 1934 | perfect | 0.114127 |
| 1163 | gives | 0.112693 |
| 745 | different | 0.111366 |


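Both lists look pretty logical: words like `worst`, `boring`, and `perfect` land where we would expect, although a few, such as `tv` or `american`, are less obviously tied to sentiment. We can conclude that our model has had some success in learning which words have positive and negative meanings.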
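To connect these word scores back to predictions, recall that logistic regression adds up coefficient times count over the words in a review and passes the total through the logistic function. Here is a minimal sketch using a handful of the coefficients from the tables above. It assumes an intercept of zero (the fitted intercept is not shown in this lecture) and treats every other word as having coefficient zero, so the probabilities are illustrative only. The two example reviews and the helper name `positive_probability` are made up.

```python
import math

# A few coefficients copied from the tables above.
coefs = {
    "worst": -0.187666, "boring": -0.176584, "stupid": -0.166226,
    "perfect": 0.114127, "especially": 0.145686, "performances": 0.115545,
}

def positive_probability(word_counts, coefs, intercept=0.0):
    """Logistic function of the intercept plus the sum of coefficient * count."""
    z = intercept + sum(coefs.get(word, 0.0) * n for word, n in word_counts.items())
    return 1.0 / (1.0 + math.exp(-z))

harsh = {"worst": 1, "boring": 2}          # hypothetical word counts
kind = {"perfect": 1, "performances": 2}   # hypothetical word counts

print(positive_probability(harsh, coefs) < 0.5)
print(positive_probability(kind, coefs) > 0.5)
```

Each individual coefficient is small because of the strong regularization, so a confident prediction typically comes from many words pulling in the same direction rather than from a single one.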

Of course, the story isn't over: there are many different models that can be used for sentiment analysis, some of which highlight different features. 

Finally, the combination of term-document extraction with classification models isn't just for sentiment analysis! Essentially the same pipeline can work to produce a functioning spam classifier, in which a "negative" set of text is spam and a "positive" set of text is a legitimate email. 