# Part 2: Sentiment Analysis of IMDB Movie Reviews

### Abstract

This is part 2 of a series of projects to help me understand the real-life applications of Machine Learning. I will be trying to answer the question, "How do startup fintech companies provide sentiment-based trading signals to investment professionals?"

#### Aim

Having worked within investment management for many years, I have become interested in understanding the real-life applications of Machine Learning, specifically Natural Language Processing (NLP), within the finance industry. There is a huge amount of information that a Chartered Financial Analyst needs to navigate through. Traditional methods of analysis need to be augmented with state-of-the-art data science techniques.

Some of the really exciting potential applications include:
-  Trading signals derived from Sentiment Analysis
-  Classification and clustering of finance-related documents.
-  Auto-summarisation of text documents, including transcripts of company conference calls.
-  Processing company earnings reports in the quickest time possible in order to gain an information advantage.

My plan is to conduct a series of small projects, narrow in scope, that will allow me to keep within my abilities. During my initial research, I became inspired by the "institutional quality data feeds" built into the Quantopian.com pipeline. For example:
-  StockTwits Trade Mood from PsychSignal
-  Twitter Trader Mood from PsychSignal
-  Sentdex Sentiment Analysis

(I have not developed any trading algorithms on Quantopian yet because my focus is on becoming a Data Scientist.)

**How do startup companies provide sentiment-based trading signals to investment professionals?**

To tackle this question, I will conduct several loosely related project workstreams: 
-  News classification - filtering general news articles to those related to 'Business'
-  Sentiment Analysis (positive or negative) - movie reviews and Twitter feeds.
-  Applying Daily News to predict Stock Market returns.

These building blocks will be powerful components that I can adapt as a framework for use in future work related to Sentiment Analysis and NLP. 

One burning question you might want me to answer immediately is whether I have found any Alpha? Yes. To a certain extent. At this stage of my analysis there is not enough Alpha on a stand-alone basis for a complete trading model. But I am very confident that with more work, there is enough Alpha there to be extracted and used as a part of an overall strategy, namely the Sentiment-based component. 

I am constantly learning and there is always room for improvements. 

Please note that I am only able to use non-proprietary, publicly available datasets, which will obviously lessen the information advantage. 

## 2. Sentiment Analysis

### 2.1 About the Dataset

The dataset consists of movie reviews from the IMDb (Internet Movie Database) website, collected by Stanford researcher Andrew Maas.

-  The dataset is available at <http://ai.stanford.edu/~amaas/data/sentiment/>
-  The original publication: Andrew L. Maas, et al. (2011), [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf).

This dataset contains the text of the reviews, together with a label that indicates whether a review is “positive” or “negative.” The IMDb website itself contains ratings from 1 to 10. To simplify the modeling, this annotation is summarized as a two-class classification dataset: reviews with a score of 7 or higher are labeled as positive, reviews with a score of 4 or lower are labeled as negative, and reviews with more neutral scores are left out of the labelled data.

The dataset is provided as individual txt files within 2 folders:
-  Train data: 25,000 labelled text files
-  Test data: 25,000 labelled text files

The train and test folders each contain two sub‐folders called pos and neg.
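To make the folder layout concrete, here is a minimal sketch of how one sub-folder per class can be turned into texts and labels. It is a hand-rolled stand-in written for illustration, not scikit-learn's load_files itself: the helper name load_text_folders and the three sample reviews are my own assumptions. Like load_files, it numbers the classes by sorted folder name, which is why neg ends up as label 0 and pos as label 1. Unlike load_files, it does not shuffle the documents.

```python
import os
import tempfile


def load_text_folders(container_path):
    """Hypothetical helper: read one sub-folder per class into bytes and labels."""
    # Class names are the sub-folder names in alphabetical order,
    # so "neg" becomes label 0 and "pos" becomes label 1.
    target_names = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    data, target = [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(container_path, name)
        for filename in sorted(os.listdir(folder)):
            with open(os.path.join(folder, filename), "rb") as handle:
                data.append(handle.read())  # raw bytes, not decoded
            target.append(label)
    return data, target, target_names


# Build a tiny throwaway corpus with the same pos/neg layout (made-up reviews).
with tempfile.TemporaryDirectory() as root:
    samples = {"pos": [b"Loved it."], "neg": [b"Dull.", b"Awful film."]}
    for name, docs in samples.items():
        os.makedirs(os.path.join(root, name))
        for i, doc in enumerate(docs):
            path = os.path.join(root, name, "{}.txt".format(i))
            with open(path, "wb") as handle:
                handle.write(doc)
    data, target, target_names = load_text_folders(root)

print(target_names)  # ['neg', 'pos']
print(target)        # [0, 0, 1]
```

The texts come back as bytes, which is why the cleaning step later in this section replaces b"<br />" rather than a plain string.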

### 2.1a Loading data

In [1]:
import numpy as np
import matplotlib.pyplot as plt

#### Load training data

In [2]:
from sklearn.datasets import load_files
%time reviews_train = load_files("C:/Python Project Files/aclImdb/train")
# load_files returns a bunch, containing training texts and training labels

text_train, y_train = reviews_train.data, reviews_train.target

Wall time: 4.94 s


In [3]:
print("length of text_train: {}".format(len(text_train)))
print("text_train[20]:\n{}".format(text_train[20]))

length of text_train: 25000
text_train[20]:
b"This independent, B&W, DV feature consistently shocks, amazes and amuses with it's ability to create the most insane situations and then find humor and interest in them. It's all hilarious and ridiculous stuff, yet as absurd as much of the film should be, there is a heart and a reality here that keeps the film grounded, keeps the entire piece from drifting into complete craziness and therein lies the real message here. This film is about how we all survive in a world gone mad. That seems to be the heart of the film. For as insane and off the wall as things get, Leon, the 30 yr. old paperboy-protagonist, always tries to keep it together. He's like a child forever trying to catch the balloon that is floating away so that everything will work out for the best, so that everyone can have what they want.<br /><br />The acting in the film could have went far over the top but the exceptional cast really keeps the piece cohesive. Van Meter is perhap

Clean data and remove HTML line breaks

In [4]:
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

In [5]:
print("Samples per class (training): {}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500]


#### Load test data

In [6]:
%time reviews_test = load_files("C:/Python Project Files/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target

print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Wall time: 4.53 s
Samples per class (test): [12500 12500]


## 2.2 Simple Bag of Words approach

One of the simplest but most effective ways to represent text is the Bag of Words method. Here the model learns a vocabulary from all of the documents, discards most of the structure of the input text, and only counts the number of times each word appears in each document.

CountVectorizer is used to tokenize the training data and build the vocabulary.
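Before running it on 25,000 reviews, the idea can be sketched by hand on two made-up sentences. This is only an illustration under my own assumptions: it lowercases the text and keeps words of two or more characters, which mirrors CountVectorizer's default behaviour as I understand it, and it builds a dense list of lists instead of a sparse matrix.

```python
import re
from collections import Counter

# Words of two or more characters; single letters such as "a" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


docs = ["A great film, great cast", "Bad film"]

# "fit": the vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# "transform": one row per document, one column per vocabulary entry.
counts = [Counter(tokenize(doc)) for doc in docs]
bag_of_words = [[row[tok] for tok in vocabulary] for row in counts]

print(vocabulary)    # ['bad', 'cast', 'film', 'great']
print(bag_of_words)  # [[0, 1, 1, 2], [1, 0, 1, 0]]
```

Each row keeps the per-document counts but forgets the word order, which is exactly the structure that gets discarded.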

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

%time vect = CountVectorizer().fit(text_train)
%time X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

Wall time: 8.95 s
Wall time: 8.06 s
X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3431196 stored elements in Compressed Sparse Row format>


In [8]:
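# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call vect.get_feature_names_out() instead.
feature_names = vect.get_feature_names()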
print("Number of features: {}".format(len(feature_names)))
print("Features 8000 to 8020:\n{}".format(feature_names[8000:8020]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 74849
Features 8000 to 8020:
['blustering', 'blusters', 'blustery', 'blut', 'bluth', 'bluto', 'blvd', 'blystone', 'blyth', 'blythe', 'blythen', 'blyton', 'bmacv', 'bmi', 'bmob', 'bmoc', 'bmoviefreak', 'bmovies', 'bmw', 'bmws']
Every 2000th feature:
['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']


## 2.3 Logistic Regression using cross-validation

A logistic regression model will be evaluated using k-fold cross-validation.

The following cell estimates the accuracy of a logistic regression classifier on the bag-of-words features by splitting the training data, fitting a model and computing the score 5 consecutive times (with a different split held out each time).
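The splitting itself can be sketched in a few lines. This is a plain, unshuffled k-fold written for illustration, and the function name k_fold_indices is my own. As far as I know, cross_val_score with an integer cv and a classifier uses stratified folds that also preserve the class balance, which this sketch ignores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train on", train_idx, "validate on", val_idx)
```

With 25,000 reviews and 5 folds, each model is trained on 20,000 reviews and scored on the remaining 5,000, and the five scores are averaged.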

In [9]:
%%time
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(scores.mean())

0.88136
Wall time: 1min 39s


**88.1% accuracy using 5-fold cross validation** 

In [10]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))

Number of features: 74849


The number of tokens at 74,849 is too high. To improve the extraction of words we should cut down this number and only use tokens that appear in at least 5 documents.
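What a minimum document frequency does can be sketched on a toy corpus. The helper name and the three tokenized documents below are made up for illustration. The point is that a token is counted once per document, however often it repeats inside that document.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df):
    """Hypothetical helper: keep tokens found in at least min_df documents."""
    # set(doc) makes sure a token counts once per document it appears in.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(tok for tok, df in doc_freq.items() if df >= min_df)


tokenized_docs = [
    ["great", "film"],
    ["bad", "film", "film"],
    ["great", "cast", "film"],
]
kept = filter_by_document_frequency(tokenized_docs, min_df=2)
print(kept)  # ['film', 'great']
```

Rare tokens such as misspellings and usernames appear in very few reviews, so a threshold like this removes them without touching the common vocabulary.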

In [11]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
	with 3354014 stored elements in Compressed Sparse Row format>


27,271 features is just over a third of the original 74,849. Better, but still quite high.

Let's take a look at these tokens.

In [12]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("Features 8000 to 8020:\n{}".format(feature_names[8000:8020]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 27271
Features 8000 to 8020:
['elton', 'elude', 'eluded', 'eludes', 'elusive', 'elves', 'elvira', 'elvis', 'ely', 'em', 'email', 'emails', 'emanating', 'emancipation', 'emanuelle', 'emasculated', 'embarassing', 'embark', 'embarking', 'embarks']
Every 2000th feature:
['00', 'baked', 'centipede', 'cutlery', 'elton', 'gaining', 'ideals', 'leering', 'moxy', 'picasso', 'repartee', 'silvers', 'talkative', 'verisimilitude']


#### Removing Stopwords

Next, we remove words that appear too frequently to be informative. 
scikit-learn has a built-in list of English stopwords.

In [13]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['towards', 'take', 'bottom', 'him', 'moreover', 'via', 'sixty', 'no', 'itself', 'often', 'with', 'seem', 'get', 'a', 'fire', 'is', 'within', 'de', 'whatever', 'amount', 'over', 'see', 'everyone', 'has', 'beyond', 'due', 'throughout', 'keep', 'somewhere', 'below', 'whom', 'in']


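With only 318 stopwords on the list, removing them can shrink the vocabulary by at most 318 features, so it is unlikely to have much impact on the feature count.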

In [14]:
# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>
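The output above shows the feature count falling only from 27,271 to 26,966, while the stored elements fall from 3,354,014 to 2,149,958. A small sketch shows why: stopwords are few in number but account for a large share of the tokens in any text. The stop list below is a made-up five-word stand-in, not scikit-learn's ENGLISH_STOP_WORDS.

```python
# Hypothetical five-word stop list, used only for this illustration.
HYPOTHETICAL_STOP_WORDS = frozenset({"the", "is", "it", "and", "of"})


def remove_stop_words(tokens, stop_words=HYPOTHETICAL_STOP_WORDS):
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["the", "plot", "is", "thin", "and", "the", "acting", "is", "wooden"]
filtered = remove_stop_words(tokens)
print(filtered)                    # ['plot', 'thin', 'acting', 'wooden']
print(len(tokens), len(filtered))  # 9 4
```

More than half of the tokens in this sentence disappear, even though only three distinct words were removed.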


## 2.4 Applying tf–idf to the data

Term frequency–inverse document frequency (tf–idf) gives high weight to any term that appears
often in a particular document, but not in many documents in the corpus. These terms are likely
to be very descriptive of the content of that document.
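The weighting can be sketched numerically. I am assuming the smoothed variant that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. Textbook definitions differ slightly, and the final length normalisation is left out here, as it is when norm=None is passed.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency (assumed scikit-learn default form)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 4
idf_common = smoothed_idf(n_docs, doc_freq=4)  # term found in every document
idf_rare = smoothed_idf(n_docs, doc_freq=1)    # term found in one document

# The same raw count of 3 in a document is weighted very differently.
tfidf_common = 3 * idf_common
tfidf_rare = 3 * idf_rare
print(round(tfidf_common, 3), round(tfidf_rare, 3))
```

A term that appears in every document keeps an idf of exactly 1, so its count is left unchanged, while the rarer term is weighted up.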

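To prevent information leakage when applying tf-idf, we use a pipeline to glue the processing steps together, so that the vectorizer is refit on the training portion of each cross-validation split and never sees the held-out reviews.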
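A minimal sketch of the idea, using a made-up toy vectorizer rather than scikit-learn: within one cross-validation split, the vocabulary is learned from the training fold only and then applied unchanged to the held-out fold. The class name TinyCountVectorizer and the example texts are assumptions for illustration.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in for a vectorizer with fit and transform."""

    def fit(self, docs):
        # The vocabulary comes only from the documents passed to fit.
        self.vocabulary_ = sorted({tok for doc in docs for tok in doc.split()})
        return self

    def transform(self, docs):
        return [[doc.split().count(tok) for tok in self.vocabulary_]
                for doc in docs]


train_fold = ["good film", "bad film"]
validation_fold = ["good plot"]

fold_vect = TinyCountVectorizer().fit(train_fold)  # training fold only
X_val = fold_vect.transform(validation_fold)

print(fold_vect.vocabulary_)  # ['bad', 'film', 'good']
print(X_val)                  # [[0, 0, 1]]
```

The word "plot" occurs only in the held-out fold, so it contributes nothing. Fitting the vectorizer on all of the data before cross-validating would let statistics from the held-out reviews influence the features.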

### 2.4a Using pipeline and GridSearch

In [15]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None),
                     LogisticRegression())
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

Best cross-validation score: 0.89
Best parameters:
{'logisticregression__C': 0.001}
Wall time: 10min 23s


**89% accuracy using 5-fold cross validation** 

There is some improvement using tf-idf but not much.

The features with the highest tf-idf identify specific films, such as Titanic and Bridget Jones's Diary. These words are unlikely to help in our sentiment classification task.

Also, some words that appear frequently and are therefore deemed less important, such as 'movie', 'film', 'good', 'great', and 'bad', are actually very useful in movie sentiment analysis.

## 2.5 Bag-of-Words with More Than One Word (n-Grams)

One of the main disadvantages of using a bag-of-words representation is that word
order is completely discarded. Therefore, the two strings “it’s bad, not good at all” and
“it’s good, not bad at all” have exactly the same representation, even though the meanings are inverted. 

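Pairs of tokens are known as bigrams, triplets of tokens are known as trigrams, and
more generally sequences of tokens are known as n-grams. The ngram_range parameter of CountVectorizer and TfidfVectorizer sets the shortest and longest sequences to extract; for example, CountVectorizer(ngram_range=(1, 3)) uses unigrams, bigrams and trigrams.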
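The two example strings can be checked directly with a small hand-written n-gram extractor. This is a sketch under my own assumptions: the tokenizer keeps lowercase words of two or more characters, so "it's" is reduced to "it", and the function names are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase words of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def ngrams(tokens, n_min, n_max):
    """All token sequences with lengths from n_min to n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


a = tokenize("it's bad, not good at all")
b = tokenize("it's good, not bad at all")

same_unigrams = Counter(ngrams(a, 1, 1)) == Counter(ngrams(b, 1, 1))
same_bigrams = Counter(ngrams(a, 2, 2)) == Counter(ngrams(b, 2, 2))

print(same_unigrams)    # True
print(same_bigrams)     # False
print(ngrams(a, 2, 2))  # ['it bad', 'bad not', 'not good', 'good at', 'at all']
```

With single words the two reviews are indistinguishable, but the bigrams "not good" and "not bad" separate them.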

In [16]:
%%time
pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
# running the grid search takes a long time because of the
# relatively large grid and the inclusion of trigrams
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
              "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
model = grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

Best cross-validation score: 0.91
Best parameters:
{'logisticregression__C': 100, 'tfidfvectorizer__ngram_range': (1, 3)}
Wall time: 1h 23min 34s


**91% accuracy using 5-fold cross validation** 
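The long run time follows from the size of the search. A rough count of the model fits, assuming GridSearchCV's default behaviour of refitting the best candidate once on the full training set (the variable names below are my own):

```python
from itertools import product

grid_spec = {
    "logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(grid_spec, values))
              for values in product(*grid_spec.values())]

n_candidates = len(candidates)
n_folds = 5
total_fits = n_candidates * n_folds + 1  # plus one final refit

print(n_candidates, total_fits)  # 18 91
```

Each fit also rebuilds the tf-idf features, and the candidates that include trigrams have a much larger vocabulary to process.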

## 2.6 Lessons I have learnt from this project

New tools to add to skillset:
-  Using the load_files function from sklearn to handle 50,000 individual text files
-  CountVectorizer(min_df=5)
-  Using pipeline
-  Using GridSearchCV
-  tf-idf
-  n-Grams: (1, 1), (1, 2), (1, 3)

Practiced:
-  Logistic Regression
-  Bag of words
-  tf-idf
-  Removing stopwords
-  sklearn feature extraction