Data Overview 

For this analysis we’ll be using a dataset of 50,000 movie reviews 
taken from IMDb. The data was compiled by Andrew Maas and can be 
found here: IMDb Reviews (http://ai.stanford.edu/~amaas/data/sentiment/).

The data is split evenly with 25k reviews intended for training and 
25k for testing your classifier. Moreover, each set has 12.5k 
positive and 12.5k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. To label 
these reviews the curator of the data labeled anything with 
≤ 4 stars as negative and anything with ≥ 7 stars as positive. 
Reviews with 5 or 6 stars were left out.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
import os
import re

In [2]:
# Step 1: Read in the reviews
reviews_train = []
for line in open('../../data/aclImdb/movie_data/full_train.txt', 'r'):
    reviews_train.append(line.strip())
    
reviews_test = []
for line in open('../../data/aclImdb/movie_data/full_test.txt', 'r'):
    reviews_test.append(line.strip())

In [3]:
reviews_train[5]  # reviews contain unwanted characters

"This isn't the comedic Robin Williams, nor is it the quirky/insane Robin Williams of recent thriller fame. This is a hybrid of the classic drama without over-dramatization, mixed with Robin's new love of the thriller. But this isn't a thriller, per se. This is more a mystery/suspense vehicle through which Williams attempts to locate a sick boy and his keeper.<br /><br />Also starring Sandra Oh and Rory Culkin, this Suspense Drama plays pretty much like a news report, until William's character gets close to achieving his goal.<br /><br />I must say that I was highly entertained, though this movie fails to teach, guide, inspect, or amuse. It felt more like I was watching a guy (Williams), as he was actually performing the actions, from a third person perspective. In other words, it felt real, and I was able to subscribe to the premise of the story.<br /><br />All in all, it's worth a watch, though it's definitely not Friday/Saturday night fare.<br /><br />It rates a 7.7/10 from...<br />

In [4]:
# Step 2: Clean and Preprocess--Tidy the reviews by getting rid of/replacing unwanted characters with space
import re

REPLACE_NO_SPACE = re.compile("(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")
NO_SPACE = ""
SPACE = " "

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(SPACE, line) for line in reviews]
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [5]:
reviews_train_clean[5]  # reviews are clean now

'this isnt the comedic robin williams nor is it the quirky insane robin williams of recent thriller fame this is a hybrid of the classic drama without over dramatization mixed with robins new love of the thriller but this isnt a thriller per se this is more a mystery suspense vehicle through which williams attempts to locate a sick boy and his keeper also starring sandra oh and rory culkin this suspense drama plays pretty much like a news report until williams character gets close to achieving his goal i must say that i was highly entertained though this movie fails to teach guide inspect or amuse it felt more like i was watching a guy williams as he was actually performing the actions from a third person perspective in other words it felt real and i was able to subscribe to the premise of the story all in all its worth a watch though its definitely not friday saturday night fare it rates a   from the fiend '

In [6]:
# Step 3: Vectorization--convert each review to a numeric representation
"""
The simplest form of this is to create one very large matrix 
with one column for every unique word in your corpus 
(where the corpus is all 50k reviews in our case). 
Then we transform each review into one row containing 0s and 1s, 
where 1 means that the word in the corpus corresponding to 
that column appears in that review. That being said, 
each row of the matrix will be very sparse (mostly zeros). 
This process is also known as one hot encoding.
"""
from sklearn.feature_extraction.text import CountVectorizer

cv = TfidfVectorizer(binary=True)  # gives better prediction results
#cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)
X_test = cv.transform(reviews_test_clean)

In [7]:
# Step 4: Build Classifier
"""
Now that we’ve transformed our dataset into a format suitable for 
modeling we can start building a classifier. 
Logistic Regression is a good baseline model for us to use 
for several reasons: 
(1) They’re easy to interpret, 
(2) linear models tend to perform well on sparse datasets like this one, and 
(3) they learn very fast compared to other algorithms.

To keep things simple I’m only going to worry about 
the hyperparameter C, which adjusts the regularization.

Note: The targets/labels we use will be the same for training 
and testing because both datasets are structured the same, 
where the first 12.5k are positive and the last 12.5k are negative.
"""
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i < 12500 else 0 for i in range(25000)]

# First, use a subset of the train set to get an optimal hyperparameter
X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))
    
#     Accuracy for C=0.01: 0.87472
#     Accuracy for C=0.05: 0.88368
#     Accuracy for C=0.25: 0.88016
#     Accuracy for C=0.5: 0.87808
#     Accuracy for C=1: 0.87648

#  The value of C that gives us the highest accuracy is 0.05



Accuracy for C=0.01: 0.83232
Accuracy for C=0.05: 0.85088
Accuracy for C=0.25: 0.87072
Accuracy for C=0.5: 0.87952
Accuracy for C=1: 0.88544


In [8]:
# Train final model
"""
Now that we’ve found the optimal value for C, we should train a model 
using the entire training set and evaluate our accuracy on the 25k 
test reviews.
"""
final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_model.predict(X_test)))
# Final Accuracy: 0.88128

Final Accuracy: 0.856


In [9]:
# Make some predictions
from sklearn.feature_extraction.text import CountVectorizer

def get_features(review):
    return cv.transform([review])

# Predict sentiment for a glowing review
review1 = "LOVED IT! This movie was amazing. Top 10 this year."
review1_features = get_features(review1)
print("Review:", review1)
print("Probability of positive review:", final_model.predict_proba(review1_features)[0,1])

# Predict sentiment for a poor review
review2 = "Total junk! I'll never watch a film by that director again, no matter how good the reviews."
review2_features = get_features(review2)
print("Review:", review2)
print("Probability of positive review:", lr.predict_proba(review2_features)[0,1])

Review: LOVED IT! This movie was amazing. Top 10 this year.
Probability of positive review: 0.7651982913550238
Review: Total junk! I'll never watch a film by that director again, no matter how good the reviews.
Probability of positive review: 0.21523347721452063


In [10]:
# Sanity check
"""
Let’s look at the 5 most discriminating words for both positive and 
negative reviews. We’ll do this by looking at the largest and 
smallest coefficients, respectively.
"""
feature_to_coef = {word: coef 
                   for word, coef in 
                   zip(cv.get_feature_names(), final_model.coef_[0])}

# Sort dict by values: key=lambda kv: kv[1]
for best_positive in sorted(
    feature_to_coef.items(), key=lambda x: x[1],
    reverse=True)[:5]:
    print (best_positive)
    
#     ('excellent', 0.9288812418118644)
#     ('perfect', 0.7934641227980576)
#     ('great', 0.675040909917553)
#     ('amazing', 0.6160398142631545)
#     ('superb', 0.6063967799425831)
    
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:5]:
    print (best_negative)
    
#     ('worst', -1.367978497228895)
#     ('waste', -1.1684451288279047)
#     ('awful', -1.0277001734353677)
#     ('poorly', -0.8748317895742782)
#     ('boring', -0.8587249740682945)

('great', 1.8840509214410224)
('excellent', 1.3884072707342143)
('best', 1.2961021967900876)
('love', 1.1584900640211526)
('wonderful', 1.1518504420439997)
('bad', -2.3389314858564085)
('worst', -2.2122343185011553)
('waste', -1.5623682742489695)
('awful', -1.4770778282885826)
('no', -1.2354706007960563)
