# Logistic Regression for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), and the dataset contains 25,000 positive and 25,000 negative movie reviews.
For simplicity, I assembled the reviews in a single CSV file.


In [17]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


Let us shuffle the reviews, that is, the rows of the data frame together with their class labels.

In [18]:
import numpy as np
## uncomment these lines if you have downloaded the original file:
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)

<br>
<br>

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. We will use simple regular expressions to remove HTML markup and all non-word characters except "emoticons," convert the text to lower case, and remove stopwords. The function also runs the Porter stemming algorithm to reduce the words to their root form, but as written it returns the unstemmed tokens; the stemmed tokens are returned by `tokenizer2` in Exercise 1.

In [19]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

stop = stopwords.words('english')
porter = PorterStemmer()

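def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    # collect emoticons such as :) ;-( =) before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # lowercase, turn runs of non-word characters into spaces, re-append emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]  # drop stopwords
    tokenized = [porter.stem(w) for w in text]  # stems are computed but not returned here
    return text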

[nltk_data] Downloading package stopwords to /home/jose/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Let's give it a try:

In [20]:
tokenizer('This :) is a <a> test! :-)</br>')

['test', ':)', ':)']
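
To see what each regular expression in `tokenizer` contributes, the steps can be traced one at a time. The sketch below uses only the standard library; `trace_tokenizer` and the three-word stopword set are illustrative stand-ins (they are not part of the notebook) for the NLTK stopword list, and stemming is left out.

```python
import re

# Same emoticon pattern as in the tokenizer above
EMOTICON_RE = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

def trace_tokenizer(text, stop_words):
    """Return the intermediate results of each tokenizer step."""
    no_html = re.sub(r'<[^>]*>', '', text)                 # 1. drop HTML tags
    emoticons = re.findall(EMOTICON_RE, no_html.lower())   # 2. collect emoticons
    words = re.sub(r'[\W]+', ' ', no_html.lower())         # 3. non-word runs -> space
    combined = words + ' '.join(emoticons).replace('-', '')  # 4. re-append emoticons
    tokens = [w for w in combined.split() if w not in stop_words]  # 5. stopwords
    return no_html, emoticons, combined, tokens

# Hypothetical mini stopword list, just enough for the example sentence
steps = trace_tokenizer('This :) is a <a> test! :-)</br>', {'this', 'is', 'a'})
for name, value in zip(('no_html', 'emoticons', 'combined', 'tokens'), steps):
    print(name, '->', repr(value))
```

One detail this makes visible: because the emoticon pattern is matched against lowercased text, its `D` and `P` alternatives can never match, so `:D` and `:P` are not captured by the function as written.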

## Learning (scikit-learn)

First, we define a generator that returns the document body and the corresponding class label:

In [21]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
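
The slicing in `stream_docs` relies on every line ending in a comma, a single-digit label, and a newline. The sketch below shows that assumption on a made-up two-review sample and contrasts it with the standard `csv` module; `stream_docs_csv` and the sample text are hypothetical and only meant as an illustration.

```python
import csv
import io

# Hypothetical sample with the same layout as shuffled_movie_data.csv
sample = 'review,sentiment\n"A fine film, truly.",1\n"She said ""no"".",0\n'

def stream_docs_csv(handle):
    """Yield (text, label) pairs using the csv module instead of slicing."""
    reader = csv.reader(handle)
    next(reader)  # skip header
    for text, label in reader:
        yield text, int(label)

# Slicing as in stream_docs: the last three characters are ',', the label, '\n'
line = sample.splitlines(keepends=True)[1]
sliced = (line[:-3], int(line[-2]))

parsed = list(stream_docs_csv(io.StringIO(sample)))
print(sliced)
print(parsed)
```

The slicing approach keeps the surrounding quotes and doubled `""` exactly as stored, and it would fail on a final line without a trailing newline. The `csv` reader removes the quoting, at the cost of somewhat slower parsing.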

To confirm that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [22]:
next(stream_docs(path='shuffled_movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

Having confirmed that our `stream_docs` function works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [23]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
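
Note that `get_minibatch` calls `next` directly, so asking for more documents than the stream still holds raises `StopIteration`. A possible variant, sketched here under the hypothetical name `get_minibatch_safe`, uses `itertools.islice` and simply returns a shorter (or empty) batch at the end of the stream.

```python
from itertools import islice

def get_minibatch_safe(doc_stream, size):
    """Fetch up to `size` (text, label) pairs; return fewer if the stream ends."""
    batch = list(islice(doc_stream, size))
    if not batch:
        return [], []
    docs, y = zip(*batch)
    return list(docs), list(y)

# Toy stream standing in for stream_docs(...)
toy_stream = iter([('good', 1), ('bad', 0), ('fine', 1)])
first = get_minibatch_safe(toy_stream, 2)   # full batch
second = get_minibatch_safe(toy_stream, 2)  # only one document left
third = get_minibatch_safe(toy_stream, 2)   # stream exhausted
```

A caller can then stop looping once an empty batch comes back instead of catching an exception.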

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).
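
The idea behind the hashing trick is that a token's column index is computed by a hash function instead of being looked up in a stored vocabulary, so no pass over the whole corpus is needed before streaming. The following is a minimal sketch of that idea only: `hashed_bag_of_words` is a hypothetical helper, and it uses `hashlib.md5` as a stand-in, whereas `HashingVectorizer` uses a signed 32-bit MurmurHash3, so the indices will not match scikit-learn's.

```python
import hashlib
from collections import Counter

def hashed_bag_of_words(tokens, n_features=2**21):
    """Map tokens to {column_index: count} without storing a vocabulary."""
    counts = Counter()
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        # Interpret the first 4 bytes as an integer and fold it into the index range
        index = int.from_bytes(digest[:4], 'little') % n_features
        counts[index] += 1
    return dict(counts)

features = hashed_bag_of_words(['test', ':)', ':)'])
print(features)
```

Two different tokens can land in the same column (a collision); a large `n_features` such as `2**21` keeps that unlikely.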

In [24]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# Exercise 1: define new features according to https://web.stanford.edu/~jurafsky/slp3/7.pdf

## Exercise 1: New Features

In [25]:
# The new features are added in the tokenizer
def tokenizer2(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized # work only with word stems, e.g. run, running -> run

vect2 = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer2)

Using the `SGDClassifier` from scikit-learn, we can instantiate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent.
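
As a rough illustration of what "learning incrementally" means, the sketch below performs one stochastic gradient descent update of a logistic regression model on a single example. It is a simplified, assumption-laden version: `sgd_logistic_step` is a hypothetical helper with a fixed learning rate, while `SGDClassifier` additionally applies a penalty term and a learning-rate schedule that are left out here.

```python
import math

def sgd_logistic_step(weights, bias, x, y, lr=0.1):
    """One SGD update for the log loss on a single example (x, y), y in {0, 1}."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
    error = p - y                    # gradient of the log loss with respect to z
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias, p

# Start from zero weights and update on one positive example
w, b, p = sgd_logistic_step([0.0, 0.0], 0.0, x=[1.0, 2.0], y=1)
print(w, b, p)
```

Each call moves the weights a small step toward classifying that one example correctly, which is why the model can be updated batch by batch without holding the full dataset in memory.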

In [26]:
from sklearn.linear_model import SGDClassifier
# loss='log' selects logistic regression; recent scikit-learn releases name this loss 'log_loss'
clf = SGDClassifier(loss='log', random_state=1, max_iter=1)
clf2 = SGDClassifier(loss='log', random_state=1, max_iter=1)
doc_stream = stream_docs(path='shuffled_movie_data.csv')
# Exercise 2: implement a MaxEnt classifier, using regularization, according to https://web.stanford.edu/~jurafsky/slp3/7.pdf

## Exercise 2: MaxEnt

In [27]:
# Exercise 2: MaxEnt implementation
class MaximunEntropy():
    def __init__(self, learning_rate = 0.5, max_iter = 1000):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.theta = []
        self.no_examples = 0
        self.no_features = 0
        self.X = None
        self.Y = None
    # prepend a bias column of ones as column 0 of the matrix X
    def add_bias_col_X(self, X):
        bias_col = np.ones((X.shape[0], 1))
        return np.concatenate([bias_col, X], axis=1)
    # the hypothesis is the sigmoid function applied to X . theta
    def hypothesis(self, X):
        return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta)))

    # mean cross-entropy cost over the stored examples
    def cost_function(self):
        predicted_Y_values = self.hypothesis(self.X)
        cost = (-1.0/self.no_examples) * np.sum(self.Y * np.log(predicted_Y_values) + (1 - self.Y) * (np.log(1-predicted_Y_values)))
        return cost

    # gradient of the cost with respect to theta
    def gradient(self):
        predicted_Y_values = self.hypothesis(self.X)
        grad = (-1.0/self.no_examples) * np.dot((self.Y-predicted_Y_values), self.X)
        return grad

    def gradient_descent(self):
        # range(1, max_iter) performs max_iter - 1 updates
        for iter in range(1,self.max_iter):
            cost = self.cost_function()
            delta = self.gradient()
            self.theta = self.theta - self.learning_rate * delta
            # log text kept as in the recorded output ("iteration N : cost C")
            print("iteracion %s : costo %s " % (iter, cost))
    def train(self,X,Y):
        self.X = self.add_bias_col_X(X)
        self.Y = Y
        self.no_examples, self.no_features = np.shape(X)
        # theta is re-initialized on every call, so weights do not carry over between calls
        self.theta = np.ones(self.no_features + 1)
        self.gradient_descent()
    def classify(self,X):
        X = self.add_bias_col_X(X)
        predicted_Y = self.hypothesis(X)
        predicted_Y_binary = np.round(predicted_Y)
        return predicted_Y_binary
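
The exercise asks for regularization, which the class above does not include. One common way to add it is an L2 penalty on the weights (excluding the bias). The sketch below shows how the cost and gradient would change under that assumption; `regularized_cost_and_gradient` and the strength `lam` are hypothetical names and are not part of the notebook's class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_gradient(theta, X, y, lam=0.1):
    """Cross-entropy cost and gradient with an L2 penalty.

    X must already contain the leading bias column; theta[0] (the bias)
    is not penalized.
    """
    m = X.shape[0]
    h = sigmoid(X @ theta)
    penalty = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)) + penalty
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad

# Toy data: bias column plus one feature
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(100):
    cost, grad = regularized_cost_and_gradient(theta, X_toy, y_toy, lam=0.5)
    theta -= 0.5 * grad  # plain gradient descent step
```

Larger values of `lam` pull the weights toward zero, trading some fit on the training batch for smaller, more stable weights.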

## Using MaxEnt

In [12]:
# Using MaxEnt
# Training is done one mini-batch at a time
num_epochs = 1
batch_size = 50
classes = np.array([0, 1])

cme = MaximunEntropy(learning_rate = 0.7,max_iter=100)

for epoch in range(num_epochs):
    doc_stream = stream_docs(path='shuffled_movie_data.csv')
    for ii in range(int(45000/batch_size)):
        X_train, y_train = get_minibatch(doc_stream,size=batch_size)
        X_train = vect.transform(X_train)
        y_train = np.asarray(y_train).ravel()
        # convert the sparse batch into a dense (batch_size, 2**21) array
        X_train_f = X_train.toarray()

        cme.train(X_train_f,y_train)
        del X_train_f
        print("batch ",ii)
# interrupting the training raises an error (KeyboardInterrupt)

iteracion 1 : costo 0.799848333838 
iteracion 2 : costo 0.789439191647 
iteracion 3 : costo 0.780979758463 
iteracion 4 : costo 0.773922606949 
iteracion 5 : costo 0.767870467274 
iteracion 6 : costo 0.7625375958 
iteracion 7 : costo 0.757719773531 
iteracion 8 : costo 0.753271712847 
iteracion 9 : costo 0.749090390405 
iteracion 10 : costo 0.745102923916 
iteracion 11 : costo 0.741257839921 
iteracion 12 : costo 0.737518827112 
iteracion 13 : costo 0.733860289445 
iteracion 14 : costo 0.730264191517 
iteracion 15 : costo 0.726717826345 
iteracion 16 : costo 0.723212238751 
iteracion 17 : costo 0.719741113359 
iteracion 18 : costo 0.716299991107 
iteracion 19 : costo 0.712885717664 
iteracion 20 : costo 0.70949605537 
iteracion 21 : costo 0.706129410364 
iteracion 22 : costo 0.70278464078 
iteracion 23 : costo 0.699460921995 
iteracion 24 : costo 0.696157651971 
iteracion 25 : costo 0.692874384793 
iteracion 26 : costo 0.689610784013 
iteracion 27 : costo 0.6863665899 
iteracion 28 : c

iteracion 26 : costo 0.659642387329 
iteracion 27 : costo 0.656228358667 
iteracion 28 : costo 0.652840625208 
iteracion 29 : costo 0.649478911605 
iteracion 30 : costo 0.646142961229 
iteracion 31 : costo 0.642832530891 
iteracion 32 : costo 0.639547387172 
iteracion 33 : costo 0.636287303866 
iteracion 34 : costo 0.633052060208 
iteracion 35 : costo 0.629841439636 
iteracion 36 : costo 0.626655228942 
iteracion 37 : costo 0.623493217671 
iteracion 38 : costo 0.620355197711 
iteracion 39 : costo 0.617240963011 
iteracion 40 : costo 0.614150309379 
iteracion 41 : costo 0.61108303435 
iteracion 42 : costo 0.608038937091 
iteracion 43 : costo 0.605017818335 
iteracion 44 : costo 0.602019480343 
iteracion 45 : costo 0.599043726867 
iteracion 46 : costo 0.596090363136 
iteracion 47 : costo 0.593159195838 
iteracion 48 : costo 0.590250033113 
iteracion 49 : costo 0.587362684549 
iteracion 50 : costo 0.584496961172 
iteracion 51 : costo 0.581652675448 
iteracion 52 : costo 0.578829641281 
it

iteracion 51 : costo 0.648323892396 
iteracion 52 : costo 0.645559527487 
iteracion 53 : costo 0.642810995996 
iteracion 54 : costo 0.640078213297 
iteracion 55 : costo 0.637361094958 
iteracion 56 : costo 0.634659556742 
iteracion 57 : costo 0.631973514609 
iteracion 58 : costo 0.629302884723 
iteracion 59 : costo 0.626647583449 
iteracion 60 : costo 0.62400752736 
iteracion 61 : costo 0.621382633237 
iteracion 62 : costo 0.618772818072 
iteracion 63 : costo 0.616177999075 
iteracion 64 : costo 0.613598093671 
iteracion 65 : costo 0.611033019504 
iteracion 66 : costo 0.608482694442 
iteracion 67 : costo 0.605947036579 
iteracion 68 : costo 0.603425964236 
iteracion 69 : costo 0.600919395966 
iteracion 70 : costo 0.598427250552 
iteracion 71 : costo 0.595949447016 
iteracion 72 : costo 0.593485904616 
iteracion 73 : costo 0.591036542853 
iteracion 74 : costo 0.588601281468 
iteracion 75 : costo 0.58618004045 
iteracion 76 : costo 0.583772740036 
iteracion 77 : costo 0.581379300711 
ite

iteracion 75 : costo 0.532089253835 
iteracion 76 : costo 0.529851431464 
iteracion 77 : costo 0.527627355297 
iteracion 78 : costo 0.525416937635 
iteracion 79 : costo 0.523220091111 
iteracion 80 : costo 0.521036728701 
iteracion 81 : costo 0.51886676372 
iteracion 82 : costo 0.516710109834 
iteracion 83 : costo 0.514566681056 
iteracion 84 : costo 0.512436391756 
iteracion 85 : costo 0.510319156663 
iteracion 86 : costo 0.508214890867 
iteracion 87 : costo 0.506123509822 
iteracion 88 : costo 0.504044929353 
iteracion 89 : costo 0.501979065655 
iteracion 90 : costo 0.499925835299 
iteracion 91 : costo 0.497885155233 
iteracion 92 : costo 0.495856942787 
iteracion 93 : costo 0.493841115671 
iteracion 94 : costo 0.491837591986 
iteracion 95 : costo 0.489846290217 
iteracion 96 : costo 0.487867129241 
iteracion 97 : costo 0.48590002833 
iteracion 98 : costo 0.483944907149 
iteracion 99 : costo 0.482001685761 
batch  6
iteracion 1 : costo 0.88006372092 
iteracion 2 : costo 0.86967206662

iteracion 1 : costo 0.798004468963 
iteracion 2 : costo 0.793988984987 
iteracion 3 : costo 0.790086726355 
iteracion 4 : costo 0.786270054426 
iteracion 5 : costo 0.782519479985 
iteracion 6 : costo 0.778821271602 
iteracion 7 : costo 0.775165757942 
iteracion 8 : costo 0.771546126089 
iteracion 9 : costo 0.767957572879 
iteracion 10 : costo 0.764396706631 
iteracion 11 : costo 0.760861126032 
iteracion 12 : costo 0.757349124091 
iteracion 13 : costo 0.753859480232 
iteracion 14 : costo 0.750391314425 
iteracion 15 : costo 0.746943984916 
iteracion 16 : costo 0.743517016586 
iteracion 17 : costo 0.740110050786 
iteracion 18 : costo 0.736722810236 
iteracion 19 : costo 0.733355074481 
iteracion 20 : costo 0.730006662727 
iteracion 21 : costo 0.726677421854 
iteracion 22 : costo 0.723367218043 
iteracion 23 : costo 0.720075930932 
iteracion 24 : costo 0.716803449535 
iteracion 25 : costo 0.713549669403 
iteracion 26 : costo 0.710314490642 
iteracion 27 : costo 0.707097816528 
iteracion 

KeyboardInterrupt: 
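
The MaxEnt run above is slow largely because every mini-batch is converted to a dense array with `2**21` columns. A quick back-of-the-envelope calculation, assuming 8-byte `float64` values, shows the size involved:

```python
batch_size = 50
n_features = 2**21
bytes_per_value = 8  # float64

# Memory for one dense mini-batch
dense_bytes = batch_size * n_features * bytes_per_value
dense_mib = dense_bytes / 2**20
print(f"{dense_mib:.0f} MiB per dense batch")

# add_bias_col_X concatenates into a new array with one extra column,
# so a second array of nearly the same size exists at the same time
with_bias_bytes = batch_size * (n_features + 1) * bytes_per_value
print(f"{(dense_bytes + with_bias_bytes) / 2**20:.0f} MiB with the bias copy")
```

Each of the gradient descent iterations per batch then touches all of those columns, even though only a few hundred are non-zero for a typical review, which is why the scikit-learn classifier working on the sparse matrix is so much faster.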

In [28]:
#import pyprind
#pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train_txt, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.transform(X_train_txt)
    clf.partial_fit(X_train, y_train, classes=classes)
    X_train = vect2.transform(X_train_txt)
    clf2.partial_fit(X_train, y_train, classes=classes)
    #pbar.update()

Depending on your machine, it will take about 2-3 minutes to stream the documents and learn the weights for the logistic regression model to classify "new" movie reviews. Executing the preceding code, we used the first 45,000 movie reviews to train the classifier, which means that we have 5,000 reviews left for testing:

In [29]:
X_test_txt, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test_txt)
print('Accuracy: %.3f' % clf.score(X_test, y_test))
X_test = vect2.transform(X_test_txt)
print('Accuracy with new features: %.3f' % clf2.score(X_test, y_test))

Accuracy: 0.867
Accuracy with new features: 0.871


I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 
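
For reference, the accuracy reported by `clf.score` is the fraction of test reviews whose predicted label equals the true label. A minimal sketch of that computation, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 3 of 4 predictions are correct
toy_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print('Accuracy: %.3f' % toy_score)
```

Because the dataset is balanced between the two classes, accuracy is a reasonable summary here; always predicting one class would score about 50%.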

Having estimated the model performance, let us use those last 5,000 test samples to update our model.

In [None]:
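# X_test currently holds the vect2 features, so transform the text with vect again for clf
clf = clf.partial_fit(vect.transform(X_test_txt), y_test)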

<br>
<br>

# Model Persistence

In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we closed this IPython notebook at this point, we would have to repeat the whole learning process every time we wanted to make a prediction on "new data."

So, to reuse this model, we could use the [`pickle`](https://docs.python.org/3.5/library/pickle.html) module to "serialize a Python object structure". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently.

To install:
conda install -c anaconda joblib

In [16]:
import joblib
import os
# note: this directory is created, but the files below are written to the current directory
if not os.path.exists('./pkl_objects'):
    os.mkdir('./pkl_objects')
    
joblib.dump(vect, './vectorizer.pkl')
joblib.dump(clf, './clf.pkl')

['./clf.pkl']

Using the code above, we "pickled" the `HashingVectorizer` and the `SGDClassifier` so that we can reuse those objects later. However, `pickle` and `joblib` have a known issue with pickling objects or functions defined in a `__main__` block: we would get an `AttributeError: Can't get attribute [x] on <module '__main__'>` when unpickling them later. Thus, to pickle the `tokenizer` function, we can write it to a file and import it to get the namespace "right".
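
The reason the namespace matters is that `pickle` stores a function by reference, that is, as a module name plus a function name, rather than storing its code. The small standard-library sketch below illustrates this with `json.dumps` as an arbitrary importable function (the choice of function is only for illustration):

```python
import json
import pickle

# Pickling a function records where to find it, not its source code
payload = pickle.dumps(json.dumps)
print(b'json' in payload, b'dumps' in payload)

# Unpickling imports the module and looks the name up again
restored = pickle.loads(payload)
print(restored is json.dumps)
```

A function defined in a notebook cell lives in `__main__`, so after a restart that lookup only succeeds if a function with the same name exists there again. Importing `tokenizer` from its own module avoids this.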

In [17]:
%%writefile tokenizer.py
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]  # stems are computed but not returned here
    return text

Writing tokenizer.py


In [18]:
from tokenizer import tokenizer
joblib.dump(tokenizer, './tokenizer.pkl')

['./tokenizer.pkl']

Now, let us restart this IPython notebook and check whether we can load our serialized objects:

In [19]:
import joblib
tokenizer = joblib.load('./tokenizer.pkl')
vect = joblib.load('./vectorizer.pkl')
clf = joblib.load('./clf.pkl')

After loading the `tokenizer`, the `HashingVectorizer`, and the trained logistic regression model, we can use them to make predictions on new data. This can be useful, for example, if we want to embed our classifier into a web application, a topic for another IPython notebook.

In [20]:
example = ['I did not like this movie']
X = vect.transform(example)
clf.predict(X)

array([0])

In [21]:
example = ['I loved this movie']
X = vect.transform(example)
clf.predict(X)

array([1])