# Sentiment Analysis Using Stanford Data

This exploration uses the large IMDb movie review dataset (50,000 labeled reviews) collected by Maas et al. and introduced in the paper "Learning Word Vectors for Sentiment Analysis". The dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ as a gzipped tar archive.
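
The loading code further down expects the reviews in an ./aclImdb/ folder. As a minimal sketch, assuming the download is a gzipped tar file saved next to the notebook (the file name in the usage comment is an assumption, and extract_archive is a hypothetical helper), the archive could be unpacked like this:

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, target_dir="."):
    """Extract a .tar.gz archive into target_dir and return the target path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target)
    return target

# Hypothetical usage; the archive name is an assumption:
# extract_archive("aclImdb_v1.tar.gz")  # should create ./aclImdb/
```

Only extract archives from a source you trust, since tar members can carry arbitrary paths.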


## Importing Libraries and Processing/Loading Data

In [1]:
# Importing necessary libraries
import os

import pandas as pd
import pyprind # For seeing the progress of loading the data

In [4]:
# Converting the dataset into a dataframe object
# The code below creates a dataframe with a column for the text of the review and a column for the sentiment,
# with 1 representing a positive sentiment and 0 representing a negative sentiment

pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
rows = [] # Collected here first, because DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update() # This allows us to see the action in progress!
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:16


In [5]:
import numpy as np
np.random.seed(0) # Sets random seed for reproducible results
df = df.reindex(np.random.permutation(df.index)) # Shuffles the data
df.to_csv('./stanford_movie_data.csv', index=False) # Saves the shuffled data so it can be reloaded later


In [2]:
df = pd.read_csv('./stanford_movie_data.csv') #Just to confirm that the data was loaded properly
df.head(10)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


## Cleaning Up The Text Data

Raw review text often contains unwanted characters such as HTML markup and punctuation. We can strip these out with Python's regex library (re) while keeping emoticons, which can carry sentiment.

In [3]:
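import re # regex library
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text) # Removes HTML markup tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # Collects emoticons such as :) and ;-(
    # Lowercases the text, replaces runs of non-word characters with a space,
    # then appends the emoticons (without the '-' nose) as separate tokens
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    return text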
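
To see what each regular expression in the preprocessor does, the steps can be run one at a time on a made-up review. The sample string below is hypothetical and not taken from the dataset:

```python
import re

sample = "<br />This movie was GREAT :-) but too long..."

# Step 1: strip anything that looks like an HTML tag
no_html = re.sub(r'<[^>]*>', '', sample)

# Step 2: collect emoticons before the punctuation is removed
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and collapse runs of non-word characters into one space
words = re.sub(r'[\W]+', ' ', no_html.lower())

print(no_html)    # This movie was GREAT :-) but too long...
print(emoticons)  # [':-)']
print(words)      # 'this movie was great but too long ' (note the trailing space)
```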

In [4]:
# Next we can apply this function to each review
df['review'] = df['review'].apply(preprocessor)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


## Using the TF-IDF Statistic and Training Models

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer # Used for vectorizing each review

In [7]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, ngram_range=(1,1))

def tokenizer(text):    # Simple whitespace tokenizer; not used here, but defined just in case
    return text.split()

tfidf_log_model = Pipeline([('vectorizer', tfidf), ('log_model', LogisticRegression(penalty='l2', C=10.0))])
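
As a rough illustration of what the vectorizer computes, the sketch below fits TfidfVectorizer on a three-document toy corpus (hypothetical data) and checks the learned inverse document frequencies against the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which scikit-learn uses when smooth_idf=True (the default). Each row is then scaled to unit length because norm='l2'.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good good plot"]  # toy corpus

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
matrix = vectorizer.fit_transform(docs)

n_docs = len(docs)
doc_freq = {"movie": 2, "good": 2, "bad": 1, "plot": 1}  # documents containing each term

for term, df_t in doc_freq.items():
    expected = np.log((1 + n_docs) / (1 + df_t)) + 1
    learned = vectorizer.idf_[vectorizer.vocabulary_[term]]
    print(term, round(float(expected), 4), round(float(learned), 4))

# With norm='l2', every document vector has Euclidean length 1
row_norms = np.asarray(np.sqrt(matrix.multiply(matrix).sum(axis=1))).ravel()
print(row_norms)
```

Rarer terms ("bad", "plot") receive a larger idf weight than terms that appear in most documents ("movie", "good").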

In [8]:
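# Split into training and testing sets
# .iloc is position-based and excludes the end point; .loc[:30000] is label-based and
# would also put row 30000 into the training set, overlapping with the test set
X_train = df['review'].iloc[:30000].values
y_train = df['sentiment'].iloc[:30000].values
X_test = df['review'].iloc[30000:].values
y_test = df['sentiment'].iloc[30000:].values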
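
When slicing a dataframe into training and testing sets, it helps to remember that label-based .loc slices include both endpoints while position-based .iloc slices exclude the end. The sketch below shows the difference on a ten-row toy dataframe (hypothetical data) and, as an alternative that was not used in this notebook, lets scikit-learn's train_test_split produce a disjoint, class-balanced split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the review dataframe (hypothetical data)
toy = pd.DataFrame({'review': ['review %d' % i for i in range(10)],
                    'sentiment': [0, 1] * 5})

# Label-based slicing includes the end label; position-based slicing does not
n_loc = len(toy.loc[:6])    # rows with labels 0 through 6
n_iloc = len(toy.iloc[:6])  # rows at positions 0 through 5
print(n_loc, n_iloc)        # 7 6

# Alternative: let scikit-learn shuffle and split while keeping the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['review'].values, toy['sentiment'].values,
    test_size=0.4, stratify=toy['sentiment'].values, random_state=0)
print(len(X_tr), len(X_te))  # 6 4
```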

In [9]:
tfidf_log_model.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=T...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [10]:
pred = tfidf_log_model.predict(X_test)
from sklearn.metrics import accuracy_score
print("Accuracy of logistic regression model: {}%".format(100*accuracy_score(y_test, pred)))

Accuracy of logistic regression model: 89.55499999999999%
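
Accuracy alone does not show which kind of mistake a model makes. As a small sketch on made-up labels (not the notebook's predictions), a confusion matrix breaks the same information down into true and false positives and negatives:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, only to illustrate the metrics
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(acc)  # 0.75
print(cm)   # [[3 1]
            #  [1 3]]
```

Here the first row says that three negative reviews were classified correctly and one was mistaken for positive; the second row gives the same breakdown for the positive reviews.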


In [51]:
tfidf_neural_net = Pipeline([('vectorizer', tfidf), ('mlp', MLPClassifier(hidden_layer_sizes=(30, 30, 30)))])
# Note: the 175.9 seconds figure was recorded for 3 hidden layers with 40 units; this configuration uses 30 units per layer

In [52]:
import time
start = time.time()
tfidf_neural_net.fit(X_train, y_train)
end = time.time()
print("Time taken to train model: {} s".format(end - start)) # Prints the time taken to train the model

Time taken to train model: 132.82956314086914 s


In [53]:
pred2 = tfidf_neural_net.predict(X_test)
from sklearn.metrics import accuracy_score
print("Accuracy of neural network model: {}%".format(100*accuracy_score(y_test, pred2)))

Accuracy of neural network model: 88.85%


In [17]:
from sklearn.naive_bayes import MultinomialNB

In [18]:
tfidf_nb = Pipeline([('vectorizer', tfidf), ('nb', MultinomialNB())])
tfidf_nb.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=T...True,
        vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [20]:
pred3 = tfidf_nb.predict(X_test)
from sklearn.metrics import accuracy_score
print("Accuracy of naive bayes model: {}%".format(100*accuracy_score(y_test, pred3)))

Accuracy of naive bayes model: 85.87%
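
The fit, predict, and score steps repeated above for each model can also be written as one loop. The sketch below is one possible way to do it, shown on a tiny hypothetical corpus so that it runs on its own. It gives every pipeline its own TfidfVectorizer instance rather than sharing one, which is a design choice made here and not something the notebook does. The neural network is left out only to keep the example fast.

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Tiny hypothetical corpus, only to show the mechanics
train_texts = ["great film loved it", "wonderful acting great plot",
               "terrible film hated it", "awful acting boring plot"]
train_labels = [1, 1, 0, 0]
test_texts = ["great acting", "boring film"]
test_labels = [1, 0]

candidates = {
    'logistic regression': LogisticRegression(C=10.0),
    'naive bayes': MultinomialNB(),
}

scores = {}
for name, clf in candidates.items():
    # A fresh vectorizer per pipeline keeps the fitted models independent
    model = Pipeline([('vectorizer', TfidfVectorizer()), ('clf', clf)])
    model.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_texts))
    print("Accuracy of {} model: {:.2f}%".format(name, 100 * scores[name]))
```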
