# Applying Machine Learning to Sentiment Analysis

## Chapter 8 from Sebastian Raschka's Python Machine Learning

Sentiment analysis is a sub-discipline of **Natural Language Processing (NLP)**.  

We will be classifying documents based on their polarity: the attitude of the writer.

We will be using a dataset that consists of 25,000 positive movie reviews and 25,000 negative movie reviews for a total of 50,000 reviews that will be split into training and testing sets.

The movie reviews come from the **Internet Movie Database (IMDB)**.  

The purpose is to build a model that can accurately predict the sentiment of a movie review on new data.

The dataset was obtained from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).

*Some parts of this Jupyter Notebook are copied verbatim from Raschka's book.*

In [12]:
import pandas as pd
import numpy as np

# This is the same dataset as on the Stanford site, previously downloaded and randomly shuffled
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |


In [13]:
df.shape

(50000, 2)

In [14]:
df.dtypes

review       object
sentiment     int64
dtype: object

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


In [16]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [17]:
df['sentiment'].value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In `df['sentiment']`, 1 = positive review and 0 = negative review.

## Introducing the bag-of-words model

The **bag-of-words** model allows us to represent textual data as numerical vectors.

Specifically,
1. We create a vocabulary of unique tokens (for example, words) from the entire set of documents.
2. We construct a feature vector from each document that counts how often each word occurs in that particular document.

### Transforming words into feature vectors

We can use the **CountVectorizer** class from scikit-learn to help us out.  **CountVectorizer** takes an array of text data, sentences or entire documents, and turns it into a bag-of-words model:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, The weather is sweet, and one and one is two'
])

bag = count.fit_transform(docs)

In [19]:
type(bag)

scipy.sparse.csr.csr_matrix

`bag` is now a SciPy sparse matrix whose rows are the feature vectors.

We can print the vocabulary that **count** learned with:

In [20]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


This gives us a dictionary of the words and their integer indices.

To print the actual feature vectors:

In [21]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


The rows represent the three sentences, and each column holds the number of times one vocabulary word occurs in each sentence.  For example, index 0 in the dictionary above belongs to the word "and", which occurs 0 times in the first two sentences and twice in the third sentence, giving us the 0, 0, and 2 down the first column.
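
The counts in the matrix above can be reproduced by hand, which makes the two bag-of-words steps concrete. The following is a minimal sketch using only the standard library. The helper names `tokenize` and `bag_of_words` are illustrative, and the token pattern (lowercased runs of two or more word characters) is assumed to mirror CountVectorizer's default.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, The weather is sweet, and one and one is two',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

def bag_of_words(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # Step 1: vocabulary of unique tokens, indexed in alphabetical order
    unique_tokens = sorted({tok for tokens in tokenized for tok in tokens})
    vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}
    # Step 2: one count vector per document, ordered by vocabulary index
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for row in vectors:
    print(row)
```

The three printed rows match the output of `bag.toarray()` shown earlier.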

**The importance of sklearn's CountVectorizer is that it gives us a foundation for understanding and analyzing text data.  At the end of the day, all we are doing is counting words which seems deceptively simple.  But it's the combination of counting and figuring out whether the more frequent words have meaning for the whole document or not that's the hard part.**

The values in the feature vectors are called **raw term frequencies**: the number of times a term *t* occurs in a document *d*.

### Assessing word relevancy via term frequency-inverse document frequency

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes.  These frequently occurring words typically don't contain useful or discriminatory information.  This is where the **term frequency-inverse document frequency** technique can come in handy.  The **tf-idf** is defined as the product of the term frequency *tf(t,d)* and the inverse document frequency *idf(t,d)*:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$

Here *tf(t,d)* is the term frequency that we talked about above with CountVectorizer().  The **idf(t,d)** can be calculated as

$$\text{idf}(t,d) = \log \frac{n_d}{1 + \text{df}(t,d)}$$

where $n_d$ is the total number of documents and *df(t,d)* is the number of documents that contain the term *t*.

#### The logarithm ensures that low document frequencies are not given too much weight.

Now we can use sklearn's **TfidfTransformer**, which takes the raw term frequencies from the CountVectorizer class as input and transforms them into tf-idfs:


In [22]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf = True,
                       norm = 'l2',
                       smooth_idf = True)

np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous subsection, the word 'is' had the largest term frequency in the third document, being the most frequently occurring word. **However, after transforming the same feature vector into tf-idfs, we see that the word 'is' is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.**
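
The numbers printed by TfidfTransformer do not follow the textbook idf formula exactly. With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n_d) / (1 + df(t))) + 1, and with `norm='l2'` it then scales every row to unit length. The sketch below reproduces the 0.45 reported for 'is' in the third document under those assumptions. The raw counts are typed in by hand and `tfidf_l2` is an illustrative helper name.

```python
import math

# Raw term frequencies from CountVectorizer
# (columns: and, is, one, shining, sun, sweet, the, two, weather)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]

def tfidf_l2(rows):
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: number of documents that contain each term
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 normalization: divide by the Euclidean length of the row
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted])
    return result

tfidf_rows = tfidf_l2(counts)
for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

Because 'is' appears in all three documents, its idf is the smallest possible value (1.0), which is why its weight drops relative to rarer words such as 'and' and 'one'.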

### Cleaning Text data

The first important step to textual analysis before we get on with our bag-of-words model is to clean the text data by stripping it of all unwanted characters.

For simplicity, we will remove HTML markup and all punctuation marks except for emoticons such as :) as they contain useful information.  

For this task we will use Python's **regular expression (regex)** library **re**:

In [23]:
import re


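def preprocessor(text):
    # Remove HTML markup (raw strings avoid invalid-escape warnings)
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before the punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace runs of non-word characters with a space,
    # and append the emoticons with the "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text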

Via the first regex '<[^>]*>' in the preceding code, we tried to remove all of the HTML markup from the movie reviews.  And then we used a slightly more complex (really?) regex to find emoticons, which we temporarily stored as emoticons.  Next we removed all non-word characters from the text via the regex [\W]+ and converted the text into lowercase characters.  

Eventually, we added the temporarily stored *emoticons* to the end of the processed document string.  Additionally, we removed the *nose* character (-) from the emoticons for consistency.

Although appending the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, we should note that the order of the words doesn't matter in our bag-of-words model if our vocabulary only consists of one-word tokens.  
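
To see what each regex contributes, the same steps can be run one at a time on a short string. This sketch only restates the logic of `preprocessor`. The sample string and the intermediate variable names are illustrative.

```python
import re

sample = '</a>This :) is :( a test :-)!'

# Step 1: strip HTML markup
no_html = re.sub(r'<[^>]*>', '', sample)

# Step 2: collect emoticons (eyes, optional nose, mouth)
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and collapse non-word characters into single spaces
words_only = re.sub(r'[\W]+', ' ', no_html.lower())

# Step 4: append the emoticons without the nose character
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```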

Let's confirm that our preprocessor works correctly:

In [24]:
preprocessor(df.loc[0, 'review'][-50:])

'to star cinema way to go jericho and claudine '

In [25]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

Let's now apply our **preprocessor** function to all the movie reviews in our DataFrame:

In [26]:
df['review'] = df['review'].apply(preprocessor)

In [27]:
df['review'].head()

0    my family and i normally do not watch local mo...
1    believe it or not this was at one time the wor...
2    after some internet surfing i found the homefr...
3    one of the most unheralded great works of anim...
4    it was the sixties and anyone with long hair a...
Name: review, dtype: object

### Processing documents into tokens



In [28]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [29]:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


In [30]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [31]:
import nltk

In [32]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Alexander/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [33]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]



['runner', 'like', 'run', 'run', 'lot']

In [34]:
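# .loc slices by label and includes the end label, so .loc[:25000] would return
# 25,001 rows and share row 25000 with the test set; .iloc excludes the end position
X_train = df['review'].iloc[:25000].values
y_train = df['sentiment'].iloc[:25000].values
X_test = df['review'].iloc[25000:].values
y_test = df['sentiment'].iloc[25000:].values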

In [35]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# liblinear supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [39]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 55.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 265.6min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 343.1min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_tr

In [40]:
print('Best parameter set: {}'.format(gs_lr_tfidf.best_params_))

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x1a183461e0>}


As we can see in the preceding output, we obtained the best grid search results using the regular **tokenizer** without Porter stemming, no stop-word library, and tf-idfs in combination with a logistic regression classifier that uses L2-regularization with the inverse regularization strength C set to 10.0.
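
The log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follows directly from the shape of `param_grid`. As a rough check, the sketch below counts the combinations with `itertools.product`. The tokenizer functions and the stop-word list are replaced by placeholder strings, since only the number of options per parameter matters here, and `count_candidates` is an illustrative helper name.

```python
from itertools import product

# Same structure as param_grid, with placeholders for functions and lists
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]

def count_candidates(grids):
    # Each dictionary is expanded independently, then the totals are added
    return sum(len(list(product(*grid.values()))) for grid in grids)

n_candidates = count_candidates(param_grid)
n_fits = n_candidates * 5   # cv=5 folds per candidate
print(n_candidates, n_fits)
```

Every extra option doubles or triples the number of fits, which helps explain why this search ran for several hours.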

Using the best model from the grid search, let's print the average 5-fold cross validation accuracy scores on the training set and the classification accuracy on the test dataset:

In [41]:
print('CV Accuracy: {:.3f}'.format(gs_lr_tfidf.best_score_))

CV Accuracy: 0.893


In [42]:
clf = gs_lr_tfidf.best_estimator_

print('Test accuracy: {:.3f}'.format(clf.score(X_test, y_test)))

Test accuracy: 0.900


The results reveal that our machine learning model can predict whether a movie review is positive or negative with 90 percent accuracy.

In [48]:
predict_1 = clf.predict(['This movie was absolute garbage!'])
predict_1

# 0 meaning bad review

array([0])

In [49]:
predict_2 = clf.predict(['This movie made me want to live my life in happiness and joy again.  I loved it from start to finish'])
predict_2

# 1 meaning good review

array([1])

In [50]:
""" 

### How To Save a Model for Future Use ###

filename = 'finalized_model.sav'
joblib.dump(model, filename)
 
# some time later...
 
# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)
"""
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

filename = 'IMDB_trained_model.sav'
joblib.dump(clf, filename)    # clf = gs_lr_tfidf.best_estimator_


['IMDB_trained_model.sav']

In [56]:
# Before loading and running a previously saved model using joblib,
# you need to import all the necessary libraries that you had when you created the model
# and define any custom functions the model references (here, tokenizer)

loaded_model = joblib.load('IMDB_trained_model.sav')
result = loaded_model.predict(['This movie was pure trash.  Who wrote this horrible script?'])

In [57]:
result # Meaning bad review

array([0])

### Working with bigger data - online algorithms and out-of-core learning

In real life, it is not uncommon to work with even larger datasets that can exceed our computer's memory.  Since not everyone has access to supercomputer facilities, we will now apply a technique called **out-of-core learning**, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the dataset.

Specifically, we will be using the **partial_fit** method of the SGDClassifier in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small batches.

In [68]:
import numpy as np
import re

from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
    # Raw strings keep the regex escapes from triggering invalid-escape warnings
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    # Drop stop words after splitting on whitespace
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)    # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To verify that our **stream_docs** function works correctly, let's read in the first document from the **movie_data.csv** file, which should return a tuple consisting of the review text as well as the corresponding class label:

In [69]:
next(stream_docs(path='movie_data.csv'))

('"My family and I normally do not watch local movies for the simple reason that they are poorly made, they lack the depth, and just not worth our time.<br /><br />The trailer of ""Nasaan ka man"" caught my attention, my daughter in law\'s and daughter\'s so we took time out to watch it this afternoon. The movie exceeded our expectations. The cinematography was very good, the story beautiful and the acting awesome. Jericho Rosales was really very good, so\'s Claudine Barretto. The fact that I despised Diether Ocampo proves he was effective at his role. I have never been this touched, moved and affected by a local movie before. Imagine a cynic like me dabbing my eyes at the end of the movie? Congratulations to Star Cinema!! Way to go, Jericho and Claudine!!"',
 1)
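
The slicing in `stream_docs` relies on every line ending in a comma, a single-digit label, and a newline. A small in-memory file shows the idea. Here `stream_docs_from` is a hypothetical variant that accepts an open file object instead of a path, and the two sample reviews are made up.

```python
import io

def stream_docs_from(handle):
    next(handle)  # skip the header line
    for line in handle:
        # line[-1] is the newline, line[-2] the label, line[-3] the comma
        text, label = line[:-3], int(line[-2])
        yield text, label

fake_csv = io.StringIO(
    'review,sentiment\n'
    '"Loved it, would watch again",1\n'
    '"Dull and far too long",0\n'
)

parsed = list(stream_docs_from(fake_csv))
for text, label in parsed:
    print(label, text)
```

The surrounding quotation marks stay part of the text, as in the output above. A final line without a trailing newline would not fit this layout and would raise an error when the label is converted.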

We now define a function, **get_minibatch** that will take a document stream from the **stream_docs** function and return a particular number of documents specified by the **size** parameter:

In [70]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
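
A toy stream makes the behaviour of `get_minibatch` at the end of the data visible. The function is repeated here so the sketch runs on its own. `toy_stream` and its five documents are invented for illustration.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # The stream ran dry before the batch was full
        return None, None
    return docs, y

def toy_stream():
    # Five made-up documents with alternating labels
    for i in range(5):
        yield 'doc %d' % i, i % 2

stream = toy_stream()
first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)
print(first)
print(second)
print(third)
```

As written, a final batch that comes up short is discarded: the fifth document is read but never returned. With 50,000 reviews consumed in batches of 1,000 and 5,000, this case does not arise in the notebook.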



Unfortunately, we can't use **CountVectorizer** for out-of-core learning since it requires holding the complete vocabulary in memory.  **TfidfVectorizer** has the same kind of problem: it needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies.

However, another useful vectorizer for text processing implemented in scikit-learn is **HashingVectorizer**.  It is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function.

In [71]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                        n_features=2**21,
                        preprocessor=None,
                        tokenizer=tokenizer)

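# loss='log_loss' selects logistic regression (spelled 'log' before scikit-learn 1.1);
# the old n_iter argument has been removed and is not needed with partial_fit
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path='movie_data.csv')

In the above code, we initialized **HashingVectorizer** with our tokenizer function and set the number of features to `2**21`.  Furthermore, we reinitialized a logistic regression classifier by setting the **loss** parameter of the **SGDClassifier** to **'log_loss'** (named **'log'** in scikit-learn releases before 1.1).  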
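
The idea behind the hashing trick can be sketched in a few lines: each token is hashed straight to a column index, so no vocabulary has to be stored or fitted. The example below uses `zlib.crc32` from the standard library as a stand-in hash and a tiny feature space of 16 columns. HashingVectorizer itself uses MurmurHash3 and differs in other details, so this illustrates the principle rather than its implementation. `hashed_vector` is an illustrative name.

```python
import zlib

def hashed_vector(tokens, n_features=16):
    vec = [0] * n_features
    for tok in tokens:
        # The hash value alone decides the column; nothing is learned from data
        index = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[index] += 1
    return vec

a = hashed_vector(['great', 'movie', 'great', 'cast'])
b = hashed_vector(['great'])
print(a)
print(b)
```

Distinct tokens can collide in the same column. A large feature space makes such collisions less frequent, at the cost of wider (but still sparse) vectors.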

Now comes the really interesting part.  Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

In [72]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()



0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:38


In the **for** loop, we iterated over 45 mini-batches of documents where each mini-batch consists of 1000 documents.  Having completed the incremental learning process, we will use the last 5000 documents to evaluate the performance of our model:

In [73]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)

print('Test set Accuracy: {:.3f}'.format(clf.score(X_test, y_test)))

Test set Accuracy: 0.868


The accuracy of the model is approximately 87%, slightly below the 90% that we achieved in the previous section using the grid search for hyperparameter tuning.  However, out-of-core learning is very memory efficient and took less than a minute to complete, compared with the several hours the grid search needed.  Finally, we can use the last 5000 documents to update our model:

In [74]:
clf = clf.partial_fit(X_test, y_test)
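
What incremental fitting means can be pictured with a toy model that updates its weights one example at a time and never needs to see earlier batches again. The class below is a simplified sketch of logistic regression trained by stochastic gradient descent on sparse dictionaries. It is not scikit-learn's SGDClassifier, and the class name, learning rate, and sample batches are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

class OnlineLogisticRegression:
    """Toy logistic regression over sparse {feature: value} dicts."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        return sigmoid(z)

    def partial_fit(self, X, y):
        # One pass over the batch, with a gradient step after every example
        for x, label in zip(X, y):
            error = self.predict_proba(x) - label
            for k, v in x.items():
                self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
            self.bias -= self.learning_rate * error
        return self

# Twenty small batches arriving one after another (made-up data)
batches = [
    ([{'great': 1}, {'awful': 1}], [1, 0]),
    ([{'great': 1, 'fun': 1}, {'awful': 1, 'dull': 1}], [1, 0]),
] * 10

model = OnlineLogisticRegression()
for X_batch, y_batch in batches:
    model.partial_fit(X_batch, y_batch)  # only the current batch is in memory

p_great = model.predict_proba({'great': 1})
p_awful = model.predict_proba({'awful': 1})
print(round(p_great, 3), round(p_awful, 3))
```

Only the weights persist between calls, which is why memory use stays flat no matter how many documents are streamed through.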



## Chapter 9: Embedding a Machine Learning Model into a Web Application

In [102]:
import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)
    
# Context managers make sure the files are closed after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)

with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)

Using the preceding code, we created a **movieclassifier** directory where we will later store the files and data for our web application.  Within this **movieclassifier** directory, we created a **pkl_objects** subdirectory to save the serialized Python objects to our local drive.  Via the **dump** method of the pickle module, we then serialized the trained logistic regression model as well as the stop word set from **NLTK** so that we don't have to install the NLTK vocabulary on our server.
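
The same serialize-then-reload round trip can be tried on plain Python objects. The sketch below writes to a temporary directory rather than **movieclassifier**, and the stop-word list and weight dictionary are small stand-ins for the real NLTK list and the trained classifier.

```python
import os
import pickle
import tempfile

stop_words = ['a', 'and', 'the']          # stand-in for the NLTK list
weights = {'great': 1.2, 'awful': -1.5}   # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, 'pkl_objects')
    os.makedirs(dest, exist_ok=True)

    # Serialize each object to its own file
    for name, obj in [('stopwords.pkl', stop_words), ('classifier.pkl', weights)]:
        with open(os.path.join(dest, name), 'wb') as f:
            pickle.dump(obj, f, protocol=4)

    # Load the objects back from disk
    with open(os.path.join(dest, 'stopwords.pkl'), 'rb') as f:
        restored_stop = pickle.load(f)
    with open(os.path.join(dest, 'classifier.pkl'), 'rb') as f:
        restored_weights = pickle.load(f)

print(restored_stop == stop_words, restored_weights == weights)
```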

We don't need to pickle **HashingVectorizer** since it does not need to be fitted.  Instead, we can create a new Python script file from which we can import the vectorizer into our current Python session.  You can copy the following code and save it as **vectorizer.py** in the **movieclassifier** directory, or write the file directly from your Jupyter Notebook like this:

In [104]:
%%writefile movieclassifier/vectorizer.py

from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

# Load the pickled stop words from the directory this file lives in
cur_dir = os.path.dirname(__file__)
with open(os.path.join(cur_dir, 'pkl_objects', 'stopwords.pkl'), 'rb') as f:
    stop = pickle.load(f)

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

Overwriting movieclassifier/vectorizer.py


In [105]:
import os
os.chdir('movieclassifier')

In [106]:
import pickle
import re
import os
from vectorizer import vect

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))

In [107]:
import numpy as np
label = {0:'negative', 1:'positive'}

example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], 
       np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 86.74%


In [108]:
example_2 = ["I can't stand this movie!"]

X = vect.transform(example_2)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], 
       np.max(clf.predict_proba(X))*100))

Prediction: negative
Probability: 53.95%
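
The print statements above pick the predicted label and the larger of the two class probabilities. A small helper can make that explicit. `describe_prediction` is a hypothetical name, and the probabilities passed in below are typed by hand to mirror the first example rather than produced by a model.

```python
label = {0: 'negative', 1: 'positive'}

def describe_prediction(probabilities):
    # probabilities[i] is the estimated probability of class i
    predicted = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[predicted] * 100
    return 'Prediction: %s\nProbability: %.2f%%' % (label[predicted], confidence)

message = describe_prediction([0.1326, 0.8674])
print(message)
```

A probability close to 50%, as in the second example above, means the model is only barely leaning towards its answer.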
