# Logistic Regression for Sentiment Analysis

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

The dataset consists of 50,000 movie reviews taken from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), and the dataset contains 25,000 positive and 25,000 negative reviews. For simplicity, I assembled the reviews in a single CSV file.
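
As a quick sanity check on the class balance described above, the labels in such a CSV file can be tallied with the standard library alone. This is only a sketch on a toy in-memory file: the helper name `count_labels` and the four example reviews are made up for illustration, and the column names `review` and `sentiment` are assumed to match the table shown in this section.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the real CSV file (assumption: the columns are named
# 'review' and 'sentiment', as in the table shown in this section).
toy_csv = io.StringIO(
    'review,sentiment\n'
    '"Loved it, great cast",1\n'
    '"Dull and far too long",0\n'
    '"A ""must see"" film",1\n'
    '"Not my thing",0\n'
)

def count_labels(handle):
    """Count how often each sentiment label occurs in an open CSV file."""
    reader = csv.DictReader(handle)
    return Counter(int(row['sentiment']) for row in reader)

label_counts = count_labels(toy_csv)
print(label_counts)
```

Once the data is loaded into a pandas DataFrame `df`, `df['sentiment'].value_counts()` gives the same kind of tally; on the full dataset we would expect 25,000 reviews per class.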

In [2]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')

# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')

# show the last 5 rows
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


In [3]:
# import the NumPy library
import numpy as np
## run the lines below only if you have downloaded the original file
## (they shuffle the rows and save a local copy):
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)
df

Unnamed: 0,review,sentiment
11841,"Election is a Chinese mob movie, or triads in ...",1
19602,I was just watching a Forensic Files marathon ...,0
45519,Police Story is a stunning series of set piece...,1
25747,"Dear Readers,<br /><br />The final battle betw...",1
42642,I have seen The Perfect Son about three times....,1
31902,A brilliant portrait of a traitor (Victor McLa...,1
30346,If ever a potential movie must've sounded like...,1
12363,I'd always wanted David Duchovney to go into t...,1
32490,Perhaps if only to laugh at the way my favorit...,0
26128,"Even though the story is light, the movie flow...",1


## Preprocessing Text Data

In [326]:
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re   # regular expressions
from nltk.corpus import stopwords

# Install NLTK first from the command line (e.g. C:\Python27\Scripts> pip install nltk)
# Then download the NLTK data from a Python session: import nltk; nltk.download()

stop = stopwords.words('english')    # list of English stop words

porter = PorterStemmer()             # stemmer: transforms a word into its root form
lemmatizer = WordNetLemmatizer()     # lemmatizer: maps a word to its dictionary form

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML tags and anything else of the form <...>
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())  # find emoticons
    # replace runs of non-word characters with a space, then append the emoticons without their "nose"
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')  # type(text): str
    text = [w for w in text.split() if w not in stop]  # split into words and drop stop words ('this', 'a', ...)
    text = set(text)             # keep each token only once; type(text): set
    #text.add('exciting')
    
    #tokenized = [porter.stem(w) for w in text]
    tokenized = [lemmatizer.lemmatize(w) for w in text]   # lemmatized tokens (computed, but not returned)
    return text                  # note: returns the un-lemmatized token set, as in the outputs below

Let's give it a try:

In [327]:
tokenizer('This :) is a <a> test! :-)</br>')

{':)', 'test'}

In [328]:
tokenizer('Hey my bro! how are you <aa> and (bragfdfs) this is new line for you :-)</br>')

{':)', 'bragfdfs', 'bro', 'hey', 'line', 'new'}
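
To see what each cleaning step of the tokenizer contributes, the following sketch returns the intermediate results separately. It reuses the same regular expressions, but the function name `tokenize_steps` and the three-word stop list `TOY_STOP` are hypothetical stand-ins, so that the example runs without NLTK.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# the tokenizer above uses NLTK's English stop-word list instead.
TOY_STOP = {'this', 'is', 'a'}

def tokenize_steps(text):
    """Return the intermediate results of each cleaning step."""
    # Step 1: strip HTML tags and anything else of the form <...>
    no_html = re.sub(r'<[^>]*>', '', text)
    # Step 2: collect emoticons such as ':)' or ':-)' before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html.lower())
    # Step 3: replace runs of non-word characters with spaces and split into words
    words = re.sub(r'[\W]+', ' ', no_html.lower()).split()
    # Step 4: drop stop words
    kept = [w for w in words if w not in TOY_STOP]
    # Step 5: append emoticons without the optional "nose", e.g. ':-)' becomes ':)'
    kept += [e.replace('-', '') for e in emoticons]
    return no_html, emoticons, kept

no_html, emoticons, kept = tokenize_steps('This :) is a <a> test! :-)</br>')
print(emoticons)
print(kept)
```

Unlike the tokenizer above, this sketch returns a list, so the emoticon `':)'` appears twice; converting the tokens to a set is what collapses duplicates in the outputs shown earlier.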

## Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [329]:
# Stream (text, label) pairs from the CSV file at 'path', one document at a time
def stream_docs(path):
    with open(path, 'r') as csv:  # the file is closed automatically when the block ends
        next(csv) # skip header
        for line in csv:
            text = line[:-3]      # review
            label = int(line[-2]) # sentiment
            yield text, label
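
The slicing in stream_docs relies on every line ending with a comma, a single-digit label, and a newline. The sketch below applies the same slices to a made-up line (`sample_line` is not taken from the dataset) and, as a hedged alternative, parses the same line with Python's `csv` module, which also removes the CSV quoting.

```python
import csv

# A made-up raw line in the same layout as the file: quoted review, comma, label, newline.
sample_line = '"A ""great"" film, truly.",1\n'

text = sample_line[:-3]        # drop the trailing ',1\n' (3 characters)
label = int(sample_line[-2])   # the character just before the newline

print(repr(text))
print(label)

# Alternative: let the csv module split the fields and undo the quoting.
parsed_text, parsed_label = next(csv.reader([sample_line]))
print(repr(parsed_text), int(parsed_label))
```

Note that the sliced text keeps the surrounding quotes and the doubled `""`, just as in the raw review printed in this section. The slicing approach also assumes that each line ends with a newline character.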

To confirm that the stream_docs function fetches the documents as intended, let us execute the following code snippet before we implement the get_minibatch function:

In [330]:
next(stream_docs(path='shuffled_movie_data.csv'))

('"Election is a Chinese mob movie, or triads in this case. Every two years an election is held to decide on a new leader, and at first it seems a toss up between Big D (Tony Leung Ka Fai, or as I know him, ""The Other Tony Leung"") and Lok (Simon Yam, who was Judge in Full Contact!). Though once Lok wins, Big D refuses to accept the choice and goes to whatever lengths he can to secure recognition as the new leader. Unlike any other Asian film I watch featuring gangsters, this one is not an action movie. It has its bloody moments, when necessary, as in Goodfellas, but it\'s basically just a really effective drama. There are a lot of characters, which is really hard to keep track of, but I think that plays into the craziness of it all a bit. A 100-year-old baton, which is the symbol of power I mentioned before, changes hands several times before things settle down. And though it may appear that the film ends at the 65 or 70-minute mark, there are still a couple big surprises waiting. Si

Now that we have confirmed that our stream_docs function works, we will implement a get_minibatch function to fetch a specified number (size) of documents:

In [331]:
def get_minibatch(doc_stream, size):
    docs, y = [], []           # lists
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y

#print( get_minibatch(stream_docs(path='shuffled_movie_data.csv'), 1) )
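
One thing to keep in mind: get_minibatch calls `next` exactly `size` times, so it raises `StopIteration` if fewer than `size` documents remain in the stream. A possible variant, sketched here under the hypothetical name `get_minibatch_safe`, uses `itertools.islice` and simply returns a shorter (or empty) batch instead.

```python
from itertools import islice

def get_minibatch_safe(doc_stream, size):
    """Fetch up to `size` (text, label) pairs; return fewer if the stream ends."""
    batch = list(islice(doc_stream, size))
    docs = [text for text, _ in batch]
    y = [label for _, label in batch]
    return docs, y

# Toy stream standing in for stream_docs(path=...)
toy_stream = iter([('good film', 1), ('bad film', 0), ('fine', 1)])

first = get_minibatch_safe(toy_stream, 2)    # full batch
second = get_minibatch_safe(toy_stream, 2)   # only one document left
third = get_minibatch_safe(toy_stream, 2)    # stream exhausted
print(first, second, third)
```

With this variant, a training loop could stop as soon as an empty batch comes back, rather than relying on the exact number of documents in the file.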

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).
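
To make the idea behind the "hashing trick" concrete, here is a minimal sketch in plain Python. It is not how HashingVectorizer is implemented: the function name `hashed_bag_of_words`, the use of MD5, and the tiny `n_features` value are assumptions chosen for illustration, and scikit-learn uses a different hash function plus further options such as normalization.

```python
import hashlib
import re

def hashed_bag_of_words(text, n_features=16):
    """Map a document to a fixed-length count vector via token hashing."""
    vector = [0] * n_features
    for token in re.findall(r'\w+', text.lower()):
        # Hash the token and fold the result into the range [0, n_features).
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1   # different tokens may collide in the same slot
    return vector

v1 = hashed_bag_of_words('This is the first document')
v2 = hashed_bag_of_words('This is the second second document')
print(v1)
print(v2)
```

The point of the trick is that no vocabulary has to be stored or fitted: the vector length is fixed up front, which is what makes it suitable for streaming documents in minibatches. The price is that distinct tokens can share an index, which becomes less likely as `n_features` grows.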


In [332]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [ 'This is the first document','This is the second second document']
hv = HashingVectorizer(n_features=15)
hv.transform(corpus)

#hv
#A = vectorizer.fit_transform(corpus)
#print(A)
#A

<2x15 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [337]:
from sklearn.feature_extraction.text import HashingVectorizer

# Exercise 1: define new features according to https://web.stanford.edu/~jurafsky/slp3/7.pdf

# ---------------------------------------------------------------------
X, y = get_minibatch(stream_docs(path='shuffled_movie_data.csv'), 50000)
print(X[0])
first_token = tokenizer(X[0])
second_token = tokenizer(X[1])
print(first_token)
tokens = [first_token, second_token]
#tokens


# convert a collection of text documents to a matrix of token occurrences
vect = HashingVectorizer(lowercase=True,
                         stop_words=stop, analyzer='word', decode_error='ignore',
                         n_features=2**21, preprocessor=None,
                         tokenizer=tokenizer)



"Election is a Chinese mob movie, or triads in this case. Every two years an election is held to decide on a new leader, and at first it seems a toss up between Big D (Tony Leung Ka Fai, or as I know him, ""The Other Tony Leung"") and Lok (Simon Yam, who was Judge in Full Contact!). Though once Lok wins, Big D refuses to accept the choice and goes to whatever lengths he can to secure recognition as the new leader. Unlike any other Asian film I watch featuring gangsters, this one is not an action movie. It has its bloody moments, when necessary, as in Goodfellas, but it's basically just a really effective drama. There are a lot of characters, which is really hard to keep track of, but I think that plays into the craziness of it all a bit. A 100-year-old baton, which is the symbol of power I mentioned before, changes hands several times before things settle down. And though it may appear that the film ends at the 65 or 70-minute mark, there are still a couple big surprises waiting. Simon

Using the SGDClassifier from scikit-learn, we can instantiate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent.
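
To illustrate what learning "incrementally using stochastic gradient descent" means, the following is a bare-bones sketch of logistic regression trained one sample at a time on toy data. It is not SGDClassifier's implementation: the names `sigmoid` and `sgd_partial_fit`, the fixed learning rate, and the toy data are assumptions for illustration, and the regularization and learning-rate schedule that scikit-learn applies by default are left out.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_partial_fit(weights, bias, X, y, lr=0.1):
    """One pass over a batch, taking a gradient step on the log loss per sample."""
    for features, target in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, features))
        error = sigmoid(z) - target   # gradient of the log loss with respect to z
        weights = [w - lr * error * x for w, x in zip(weights, features)]
        bias -= lr * error
    return weights, bias

# Toy data: the label is 1 when the first feature is larger than the second.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if a > b else 0 for a, b in X]

weights, bias = [0.0, 0.0], 0.0
for _ in range(50):   # 50 passes over the toy data
    weights, bias = sgd_partial_fit(weights, bias, X, y)

predictions = [1 if sigmoid(bias + weights[0] * a + weights[1] * b) >= 0.5 else 0
               for a, b in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(weights, bias, accuracy)
```

Because each call only needs the current weights and one batch, the same function can be fed minibatches from a stream, which is the role partial_fit plays for SGDClassifier.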

In [338]:
from sklearn.linear_model import SGDClassifier
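# loss='log' selects logistic regression; scikit-learn 1.1 renamed it to 'log_loss' (and 1.3 removed 'log')
clf = SGDClassifier(loss='log', random_state=1, max_iter=1)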
doc_stream = stream_docs(path='shuffled_movie_data.csv')

# Exercise 2: implement a MaxEnt classifier, using regularization

In [339]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:37


Depending on your machine, it can take a few minutes to stream the documents and learn the weights for the logistic regression model to classify "new" movie reviews (the run above took about 37 seconds). By executing the preceding code, we used the first 45,000 movie reviews (45 minibatches of 1,000 reviews each) to train the classifier, which means that we have 5,000 reviews left for testing:

In [340]:
X_test, y_test = get_minibatch(doc_stream, size=5000)

X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.869


The accuracy improved slightly (0.866 before, 0.869 now).

I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 
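
For reference, the accuracy reported by `clf.score` is the fraction of test reviews whose predicted label matches the true label, so 0.869 on 5,000 reviews corresponds to about 4,345 correct predictions. The helper below is a small sketch of that calculation; the name `accuracy` and the toy label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels: 6 of the 8 predictions are correct.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
print('Accuracy: %.3f' % toy_accuracy)
```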

After estimating the model performance, let us use those last 5,000 test samples to update our model.

In [15]:
clf = clf.partial_fit(X_test, y_test)