# Chapter 8 - Applying ML to Sentiment Analysis

In [1]:
import pyprind
import pandas as pd
import os

#change the 'basepath' to the directory of the unzipped movie dataset
basepath = 'aclImdb'

labels = {'pos' : 1, 'neg' : 0}
pbar = pyprind.ProgBar(50000)
rows = [] #collect [review, label] pairs; DataFrame.append was removed in pandas 2.0

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                     'r', encoding = 'utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns = ['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:31


#### Sentiment analysis
Sentiment analysis (or opinion mining) is a popular subdiscipline of NLP and is concerned with analyzing the polarity of documents.

In [2]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')

In [3]:
df = pd.read_csv('movie_data.csv', encoding = 'utf-8')
df.head(3)

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0


In [4]:
df.shape

(50000, 2)

### Bag-of-words model

The bag-of-words model represents text as numerical feature vectors:
* 1) Create a vocabulary of unique tokens (for example, words) from the entire set of documents.
* 2) Construct a feature vector for each document that contains the counts of how often each word occurs in that document.

Because each document contains only a small subset of the vocabulary, the feature vectors consist mostly of zeros, i.e. they are sparse. We can do this by using the CountVectorizer() class from scikit-learn.
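To make the two steps concrete, here is a minimal sketch in plain Python, without scikit-learn. The names `vocabulary` and `to_vector` are made up for this illustration, and tokenization is a plain whitespace split, which is simpler than what CountVectorizer does.

```python
from collections import Counter

docs = [
    'the sun is shining',
    'the weather is sweet',
]

# Step 1: vocabulary of unique tokens, mapped to column indices (alphabetical order)
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 2: one count vector per document, one entry per vocabulary word
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Most entries become zero as soon as the vocabulary grows beyond a handful of words, which is why a sparse representation is used in practice.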

In [5]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
    'the sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'
])
bag = count.fit_transform(docs)

In [6]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [7]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


The values in the above vectors represent the raw term frequencies tf(t,d): the number of times a term t occurs in a document d.

The bag-of-words model created above is also called the 1-gram or unigram model, since each item in the vocabulary represents a single word. More generally, contiguous sequences of items in NLP (words, letters or symbols) are called n-grams. The choice of the number n in the n-gram model depends on the particular application (e.g. 3-grams and 4-grams yield good performance in anti-spam filtering). In CountVectorizer, the n-gram size is set via the ngram_range parameter.
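As a small illustration of what word n-grams look like, the sketch below builds them by hand. The helper `ngrams` is a hypothetical name, not a scikit-learn function; CountVectorizer(ngram_range = (2, 2)) would build its vocabulary from 2-grams of this kind.

```python
def ngrams(text, n):
    """Return all contiguous sequences of n whitespace-separated words."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams('the sun is shining', 1)
bigrams = ngrams('the sun is shining', 2)
print(unigrams)
print(bigrams)
```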

#### Word relevancy via term frequency

Some words occur often across both negative and positive classes (such as 'and' or 'is'). Term frequency-inverse document frequency (tf-idf) is a technique to downweight these frequently occurring words.

tf-idf(t,d) can be defined as the product of the term frequency and the inverse document frequency: tf-idf(t,d) = tf(t,d) * idf(t,d).

idf(t,d) = log(n/(1+df(t,d))), where n is the total number of documents and df(t,d) is the number of documents that contain the term t.
idf is small if the term occurs in many documents and thus probably has little discriminative value.

This can be implemented via the TfidfTransformer() class of scikit-learn.

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf = True,
                        norm = 'l2',
                        smooth_idf = True)
np.set_printoptions(precision = 2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
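The printed values do not follow directly from the textbook formula above. With smooth_idf = True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then, with norm = 'l2', scales each row to unit length. The sketch below recomputes the third row by hand under that assumption; the variable names are made up for this illustration.

```python
import math

n_docs = 3
# vocabulary order: and, is, one, shining, sun, sweet, the, two, weather
term_counts = [2, 3, 2, 1, 1, 1, 2, 1, 1]   # raw counts in the third document
doc_freq = [1, 3, 1, 2, 2, 2, 3, 1, 2]      # number of documents containing each term

# smoothed idf, as used by scikit-learn when smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# tf-idf before normalization, then L2-normalize the row
raw = [t * i for t, i in zip(term_counts, idf)]
l2_norm = math.sqrt(sum(v * v for v in raw))
tfidf_row = [round(v / l2_norm, 2) for v in raw]
print(tfidf_row)
```

The word 'is' has the highest raw count in the third document (3), yet its tf-idf weight ends up below that of 'and' and 'one', because 'is' occurs in all three documents.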


## Cleaning text data

Before we can apply the above models to the IMDB dataset we have to clean the text data (for example, strip HTML markup). For simplicity we will also delete punctuation marks (keeping only emoticons), although punctuation can be useful in some cases.

Python's regular expression (regex) library, re, can help with this task.

In [9]:
import re
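def preprocessor(text):
    #strip HTML tags
    text = re.sub(r'<[^>]*>', "", text)
    #collect emoticons such as :) ;-( =D :P before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    #lowercase, replace non-word characters by spaces, append the emoticons without '-' noses
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text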

#Many programmers advise against the use of regex to parse HTML markup.
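One alternative to the regex is an actual HTML parser. The sketch below uses html.parser from the Python standard library; `TagStripper` and `strip_tags` are hypothetical names for this illustration and are not used elsewhere in this notebook.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text found between tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return ''.join(stripper.parts)

cleaned = strip_tags('is seven.<br /><br />Title (Brazil): Not Available')
print(cleaned)
```

For the simple markup in these reviews (mostly `<br />` tags) the regex is sufficient, which is presumably why it is used here.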

In [10]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [11]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [12]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [13]:
df['review'] = df['review'].apply(preprocessor)

### Processing documents into tokens

The dataset is now cleaned; next we want to split the text into 'tokens'. The simplest way to tokenize the cleaned documents is to split them at their whitespace characters. Other useful techniques are word stemming (reducing a word to its root form) and stop-word removal (dropping very common words such as 'is' or 'and' that carry little information).

In [14]:
#Tokenizer
def tokenizer(text):
    return text.split()
#PorterStemmer - word stemmer
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')
#alternative stemmers are 'Snowball stemmer' (Porter2 or English stemmer) and 'Lancaster stemmer' (Paice/Husk stemmer)
#Both are available via the NLTK packages

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [15]:
#stop word removal
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rikkr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

### Training a logistic regression model for document classification

Below, movie reviews are classified as positive or negative by logistic regression, after dividing the DataFrame into 25,000 documents for training and 25,000 for testing.

In [17]:
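#.loc slices by label and includes the end point, so df.loc[:25000] would return 25,001 rows
#and share row 25000 with the test set; positional .iloc gives two disjoint halves
X_train = df.iloc[:25000]['review'].values
y_train = df.iloc[:25000]['sentiment'].values
X_test = df.iloc[25000:]['review'].values
y_test = df.iloc[25000:]['sentiment'].values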

In [None]:
#Use grid-search to find the optimal set of parameters using 5-fold (stratified) cross validation.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents = None,
                       lowercase = False, 
                       preprocessor = None)

param_grid = [{'vect__ngram_range' : [(1,1)],
              'vect__stop_words' : [stop, None],
              'vect__tokenizer': [tokenizer, tokenizer_porter],
              'clf__penalty' : ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range' : [(1,1)],
              'vect__stop_words' : [stop, None],
              'vect__tokenizer': [tokenizer, tokenizer_porter],
              'vect__use_idf' : [False],
              'vect__norm' : [None],
              'clf__penalty' : ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]}]

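#the default 'lbfgs' solver does not support the 'l1' penalty; 'liblinear' handles both 'l1' and 'l2'
lr_tfidf = Pipeline([('vect', tfidf),
                    ('clf', LogisticRegression(random_state = 0, solver = 'liblinear'))])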
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                          scoring = 'accuracy',
                          cv = 5, verbose = 1,
                          n_jobs = 1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
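The figures in this log line can be checked by hand: each dictionary in param_grid contributes the product of its option counts, and every candidate is fitted once per fold. A small sketch of that arithmetic (the dictionaries below only hold the number of options per parameter, copied from param_grid):

```python
from math import prod

# number of options per parameter in the two dictionaries of param_grid
grid_1 = {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
          'penalty': 2, 'C': 3}
grid_2 = {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
          'use_idf': 1, 'norm': 1, 'penalty': 2, 'C': 3}

n_candidates = prod(grid_1.values()) + prod(grid_2.values())
n_fits = n_candidates * 5  # cv = 5 folds
print(n_candidates, n_fits)
```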


In [None]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)

In [None]:
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

### Working with bigger data - online algorithms and out-of-core learning

As we saw in the grid search above, constructing feature vectors for a large dataset is computationally expensive. In real-world applications it is not uncommon to work with even larger datasets, which may even exceed the computer's memory.
We can apply a technique called out-of-core learning, which allows us to work with such datasets by fitting the classifier incrementally on smaller batches of the dataset (read directly from the CSV file stored on the hard disk).

In [18]:
#First a tokenizer function that cleans unprocessed text data from the movie_data.csv file.
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', "", text)
    #search the original-case text: after lowercasing, ':D' and ':P' would no longer match
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

#define a generator that reads in and returns one document at a time
def stream_docs(path):
    with open(path, 'r', encoding = 'utf-8') as csv:
        next(csv) #skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [19]:
#test stream_docs
next(stream_docs(path = 'movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f
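The slicing in stream_docs relies on every line of the CSV file ending with a comma, a one-digit label and a newline. The sketch below runs the same slicing on a small in-memory file; `stream_docs_from` and the two sample reviews are made up for this illustration.

```python
import io

def stream_docs_from(handle):
    """Same slicing as stream_docs, but on an already opened file-like object."""
    next(handle)  # skip header
    for line in handle:
        # line[-1] is the newline, line[-2] the label, line[-3] the separating comma
        yield line[:-3], int(line[-2])

sample = io.StringIO('review,sentiment\n'
                     '"Great movie, loved it",1\n'
                     'Boring plot,0\n')
parsed = list(stream_docs_from(sample))
print(parsed)
```

Note that the CSV quoting is kept as part of the text (the output above also starts with a double quote), and that the approach assumes each review occupies exactly one line of the file.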

In [20]:
#Define function get_minibatch that will take a document stream from stream_docs
# and return a particular number of documents (specified by the size parameter)
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    
    return docs, y

In [24]:
#CountVectorizer requires holding the complete vocabulary in memory and can thus not be used for out-of-core learning.
#The same holds for TfidfVectorizer, which additionally needs all feature vectors to compute the idf; therefore we use HashingVectorizer.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error = 'ignore',
                        n_features = 2**21,
                        preprocessor = None,
                        tokenizer = tokenizer)
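clf = SGDClassifier(loss = 'log_loss', random_state = 1, max_iter = 1)
# loss = 'log_loss' gives a logistic regression classifier trained with stochastic gradient descent (SGD);
# scikit-learn versions before 1.1 call this loss 'log'
# SGD updates the weights using one document at a time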
doc_stream = stream_docs(path='movie_data.csv')
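The idea behind HashingVectorizer can be sketched in a few lines: instead of looking a token up in a stored vocabulary, its column index is computed with a hash function, so nothing has to be kept in memory between batches. The sketch below is an illustration only. It uses zlib.crc32 as a stand-in (scikit-learn uses MurmurHash3), a tiny hypothetical n_features of 16, and it leaves out the sign alternation and normalization that HashingVectorizer applies by default.

```python
import zlib

def hashed_counts(tokens, n_features = 16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode('utf-8')) % n_features  # column index from the hash
        vec[idx] += 1
    return vec

vec = hashed_counts('the sun is shining the weather is sweet'.split())
print(vec)
```

Different tokens can land in the same column (a hash collision); choosing a large n_features, such as 2**21 above, makes this less likely.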

In [25]:
#Implement the out-of-core model
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0,1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size = 1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes = classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:54


In [27]:
#evaluate performance of the model using the last 5000 documents
X_test, y_test = get_minibatch(doc_stream, size = 5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


In [28]:
#Update model with those 5000 documents
clf = clf.partial_fit(X_test, y_test)

#### Note
A more modern alternative to the bag-of-words model is the word2vec algorithm, which learns relationships between words and can solve analogies such as king - man + woman = queen.

## Topic modeling with Latent Dirichlet Allocation

Topic modeling is the task of assigning topics to unlabelled text documents; it is a clustering task and thus a subcategory of unsupervised learning. A concrete example is the categorization of a large text corpus of newspaper articles into categories (sports, finance etc.).

LDA tries to find groups of words that appear frequently together across different documents. It decomposes the bag-of-words matrix into two matrices: 1) a document-to-topic matrix and 2) a word-to-topic matrix, such that the two matrices multiplied together reproduce the bag-of-words matrix with the lowest possible error. The number of topics is a hyperparameter that must be specified beforehand.
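The shapes involved in this decomposition can be illustrated with toy numbers. All values below are chosen by hand for 3 documents, 2 topics and 4 words; they are not the output of any LDA run.

```python
# document-to-topic matrix: 3 documents x 2 topics (hand-picked toy values)
doc_topic = [[0.9, 0.1],
             [0.2, 0.8],
             [0.5, 0.5]]

# word-to-topic matrix, stored as 2 topics x 4 words (hand-picked toy values)
topic_word = [[4, 3, 0, 1],
              [0, 1, 5, 2]]

n_docs, n_topics, n_words = 3, 2, 4

# multiplying the two matrices gives a 3 x 4 approximation of the bag-of-words matrix
approx = [[sum(doc_topic[d][k] * topic_word[k][w] for k in range(n_topics))
           for w in range(n_words)]
          for d in range(n_docs)]

for row in approx:
    print([round(v, 2) for v in row])
```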

In [None]:
#Latent Dirichlet Allocation with scikit-learn
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding = 'utf-8')

#Create CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words = 'english',
                       max_df = .1, #exclude words that occur in more than 10% of the documents
                       max_features = 5000) #keep only the 5000 most frequently occurring words
X = count.fit_transform(df['review'].values)

#LDA with 10 different topics from the document
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 10, #called n_topics in scikit-learn versions before 0.19
                               random_state = 123,
                               learning_method = 'batch') #batch: use all training examples in each update; can also be set to 'online'
X_topics = lda.fit_transform(X)

#print shape of lda components
print(lda.components_.shape)

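#Print the 5 most important words for each of the 10 topics.
n_top_words = 5
feature_names = count.get_feature_names_out() #get_feature_names() in scikit-learn versions before 1.0
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d: " % (topic_idx + 1))
    #argsort sorts ascending, so step backwards from the end to get the largest weights first
    print(" ".join([feature_names[i]
                   for i in topic.argsort()[:-n_top_words - 1:-1]]))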
    
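#print the first 300 characters of the three reviews that score highest on the topic at index 5
#(assumed here to be the horror topic; check the top words printed by the loop above)
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')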