#Sentiment Analysis on Movie Reviews

In [1]:
import pandas as pd
import numpy as np


In [4]:
path = 'data/'
file_train = 'traindata.tsv'
file_test = 'testdata.tsv'

In [5]:
# on_bad_lines='skip' replaces the removed error_bad_lines=False (pandas >= 1.3)
train = pd.read_csv(path + file_train, on_bad_lines='skip', sep='\t')
test = pd.read_csv(path + file_test, on_bad_lines='skip', sep='\t')


The dataset consists of tab-separated files containing phrases from the Rotten Tomatoes dataset. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId and each sentence has a SentenceId. Phrases that are repeated (such as short or common words) are included only once in the data.

traindata.tsv and testdata.tsv contain the phrases and their associated sentiment labels.

The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

In [6]:
train.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
124843,46527,2263,the four main actresses,2
124844,140751,7636,hang together,2
124845,88481,4598,", the score is too insistent",1
124846,37285,1770,is either,2
124847,15739,675,gets you riled up,3


In [7]:
test.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
31207,49227,2407,technical skill and rare depth,4
31208,62407,3154,genuinely engaging performers,4
31209,114834,6113,served up,2
31210,135545,7319,post-modern,2
31211,91212,4745,a stark portrait of motherhood deferred and de...,3


##Extracting words from phrases

In [8]:
import re
def grab_words(sentence):
    str_extract = re.findall(r'\w+', sentence)
    return ' '.join(str_extract).lower()

#####Creating a new column in the data frame to store all the formatted phrases

In [9]:
train['formatted_sentence'] = train.Phrase.apply(grab_words)
test['formatted_sentence'] = test.Phrase.apply(grab_words)
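
To make the effect of `grab_words` concrete, here is a small check on two phrases taken from the tables above. It is only a sketch that repeats the function as defined in this notebook; punctuation is dropped, hyphenated words are split, and everything is lowercased.

```python
import re

def grab_words(sentence):
    # Keep runs of word characters, join them with single spaces, lowercase.
    str_extract = re.findall(r'\w+', sentence)
    return ' '.join(str_extract).lower()

print(grab_words(", the score is too insistent"))  # the score is too insistent
print(grab_words("post-modern"))                   # post modern
```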


## Creating tf-idf and removal of stop words

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
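
The weighting described above can be sketched by hand on a tiny corpus. The snippet below is an illustration only: the three documents and the helper names `idf` and `tfidf` are hypothetical, and the formula used is the smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is how the scikit-learn documentation describes the defaults of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already lowercased and tokenized.
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "acting"],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values.
    return math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0

def tfidf(doc):
    # Raw term count times idf, then scale the vector to unit L2 length.
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print(doc, tfidf(doc))
```

In this toy corpus "acting" appears in one document and "good" in two, so "acting" has the larger idf, while the repeated "good" in the third document still ends up with the larger final weight because of its higher term count.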

We also remove stop words: very common words (such as "the" or "and") that are filtered out before or after processing natural-language text. Any set of words can be chosen as the stop words for a given purpose. Here, stop words would add noise to the extracted phrases, so we use the built-in English stop word list provided by scikit-learn.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_generator = TfidfVectorizer(stop_words='english',
                              use_idf = True,
                              ngram_range = (0,4),
                              min_df = 1
                             )
# .values replaces as_matrix(), which was removed in pandas 1.0
train_transformed = tfidf_generator.fit_transform(train.formatted_sentence.values)
test_transformed = tfidf_generator.transform(test.formatted_sentence.values)
train_sentiment = train.Sentiment.values
test_sentiment = test.Sentiment.values

**Printing `tfidf_generator.vocabulary_` shows which token is mapped to which feature column in the output matrix.** For example, the excerpt below shows that **care cleverness wit** is feature column **19163** for our dataset.

{u'': 0, u'90 minute film unfortunately': 839, **u'care cleverness wit': 19163**, u'funny albeit superficial cautionary': 62256, u'thriller directorial debut': 157832, u'bodies hardly laugh': 14942, u'hour photo disappointing generalities': 73674, u'woods': 173298, u'clotted': 25055, u'woody': 173299,
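
Entries such as "care cleverness wit" are word n-grams built after stop words have been dropped. The sketch below imitates that step in plain Python; it is not scikit-learn's implementation, and the function `word_ngrams`, the short stop list, and the example phrase are all hypothetical.

```python
def word_ngrams(text, stop_words, min_n, max_n):
    # Drop stop words first, then slide a window of each length over the rest.
    tokens = [w for w in text.split() if w not in stop_words]
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

# Hypothetical stop list; scikit-learn's built-in English list is much longer.
stop_words = {"a", "of", "the", "and", "is", "with"}

# Hypothetical phrase, chosen so that one n-gram matches the entry above.
phrase = "a film made with care and cleverness and wit"
print(word_ngrams(phrase, stop_words, 1, 4))

# With a lower bound of 0 the window of length zero yields empty strings.
print(word_ngrams("good", set(), 0, 1))  # ['', '', 'good']
```

Because the stop words are removed before the windows are formed, words that were not adjacent in the original phrase can end up in the same n-gram. The second call also suggests that the `u''` entry at column 0 may be a side effect of the lower bound of 0 in `ngram_range = (0,4)`.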

**The train and test data are transformed into matrices with 176288 features, each feature representing an n-gram of up to four words.**

In [11]:
train_transformed

<124848x176288 sparse matrix of type '<type 'numpy.float64'>'
	with 1345597 stored elements in Compressed Sparse Row format>

In [12]:
test_transformed

<31212x176288 sparse matrix of type '<type 'numpy.float64'>'
	with 332471 stored elements in Compressed Sparse Row format>
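
The two summaries above show why a sparse format is used: only a tiny fraction of the entries is non-zero. The short calculation below works this out from the shapes and stored-element counts printed above; the variable names are ours, and the dense size assumes 8 bytes per float64 value.

```python
# Shapes and stored-element counts copied from the two matrix summaries above.
matrices = {
    "train": (124848, 176288, 1345597),
    "test": (31212, 176288, 332471),
}

for name, (n_rows, n_cols, n_stored) in sorted(matrices.items()):
    density = n_stored / float(n_rows * n_cols)   # fraction of non-zero cells
    per_row = n_stored / float(n_rows)            # average features per phrase
    dense_gb = n_rows * n_cols * 8 / 1e9          # size if stored densely
    print("%s: density = %.6f%%, stored values per phrase = %.2f, "
          "dense size = %.1f GB" % (name, 100 * density, per_row, dense_gb))
```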

##Predicting the sentiment using LinearSVC model
A Linear SVC (Support Vector Classifier) fits a "best fit" hyperplane that separates the classes in the training data. Once it is fitted, we can feed the features of a new phrase to the classifier and get its predicted class. With more than two classes, as here, scikit-learn's LinearSVC fits one such hyperplane per class (one-vs-rest, shown as `multi_class='ovr'` in the output below). Linear models of this kind handle high-dimensional sparse inputs such as our tf-idf matrix well, which makes LinearSVC a good fit for this task.
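
As a rough illustration of how a fitted one-vs-rest linear classifier turns features into a class, the sketch below scores a feature vector against one hyperplane per class and picks the class with the highest score. The weights, biases, and function names are hypothetical and are not taken from the model trained in this notebook.

```python
def decision_scores(weights, biases, x):
    # One score per class: dot(w_k, x) + b_k
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def predict(weights, biases, x):
    # The predicted class is the one whose hyperplane gives the highest score.
    scores = decision_scores(weights, biases, x)
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical model: 3 classes over 2 features.
weights = [[-1.0, 0.5], [0.2, 0.1], [1.0, -0.5]]
biases = [0.0, 0.1, 0.0]

print(predict(weights, biases, [2.0, 0.0]))   # 2
print(predict(weights, biases, [-2.0, 0.0]))  # 0
print(predict(weights, biases, [0.0, 0.0]))   # 1
```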

In [13]:
from sklearn.svm import LinearSVC
svc_clf = LinearSVC(penalty = 'l2', dual = False, tol = 1e-3)
svc_clf.fit(train_transformed, train_sentiment)


LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0)

In [14]:
from sklearn.metrics import accuracy_score
print (('Accuracy of the model on the test set - %f')%
(accuracy_score(test_sentiment, svc_clf.predict(test_transformed))))


Accuracy of the model on the test set - 0.643502
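
An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most frequent class. The sketch below shows the calculation on a handful of hypothetical labels; the lists `y_true` and `y_pred` are made up for illustration and are not drawn from the dataset or from the model above.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true label.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))

# Hypothetical labels on the 0-4 sentiment scale.
y_true = [2, 2, 3, 1, 2, 4, 0, 2]
y_pred = [2, 2, 2, 1, 3, 4, 1, 2]

# Baseline: always predict the most frequent true label.
majority = Counter(y_true).most_common(1)[0][0]
baseline = accuracy(y_true, [majority] * len(y_true))

print("model accuracy    - %f" % accuracy(y_true, y_pred))  # 0.625000
print("majority baseline - %f" % baseline)                  # 0.500000
```

Applied to the real data, the same comparison would use `test_sentiment` as the true labels and `svc_clf.predict(test_transformed)` as the predictions.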
