# Predicting the Sentiment of Movie Reviews

The goal of this analysis is to predict whether a review rates a movie positively or negatively. The dataset contains 25,000 labeled movie reviews for training, 50,000 unlabeled reviews for training, and 25,000 reviews for testing. More information about the data can be found at: https://www.kaggle.com/c/word2vec-nlp-tutorial.

This data comes from the 2015 Kaggle competition, "Bag of Words Meets Bags of Popcorn." Although the competition has finished, I thought it could still serve as a useful exercise for my first Natural Language Processing (NLP) project. This analysis contains two methods for predicting the sentiment of movie reviews. I wanted to experiment with a couple of strategies to understand the different options and compare their results. The two methods that I will use are: 
- Bag of Centroids with word2vec
- TfidfVectorizer

In [36]:
import pandas as pd
import numpy as np
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import nltk.data
import logging  
import multiprocessing
import time
import tflearn
import tensorflow as tf
from tensorflow.contrib import learn
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.linear_model import SGDClassifier as sgd
from sklearn.linear_model import LogisticRegression as lr
from sklearn.ensemble import VotingClassifier as vc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer 
from collections import defaultdict

## Load and Explore the Data

In [2]:
# Load the Data
train = pd.read_csv("labeledTrainData.tsv", 
                    header=0, 
                    delimiter="\t", 
                    quoting=3 )

unlabeled_train = pd.read_csv("unlabeledTrainData.tsv", 
                              header=0, 
                              delimiter="\t", 
                              quoting=3 )

test = pd.read_csv("testData.tsv", 
                   header=0, 
                   delimiter="\t", 
                   quoting=3 )

In [3]:
# Compare the lengths of the datasets
print(train.shape)
print(unlabeled_train.shape)
print(test.shape)

(25000, 3)
(50000, 2)
(25000, 2)


Let's take a look at each of the dataframes.

In [5]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [6]:
train.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [7]:
unlabeled_train.head()

Unnamed: 0,id,review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
1,"""45057_0""","""I saw this film about 20 years ago and rememb..."
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
3,"""7161_0""","""I went to see this film with a great deal of ..."
4,"""43971_0""","""Yes, I agree with everyone on this site this ..."


In [8]:
unlabeled_train.review[0]

'"Watching Time Chasers, it obvious that it was made by a bunch of friends. Maybe they were sitting around one day in film school and said, \\"Hey, let\'s pool our money together and make a really bad movie!\\" Or something like that. What ever they said, they still ended up making a really bad movie--dull story, bad script, lame acting, poor cinematography, bottom of the barrel stock music, etc. All corners were cut, except the one that would have prevented this film\'s release. Life\'s like that."'

In [9]:
test.head()

Unnamed: 0,id,review
0,"""12311_10""","""Naturally in a film who's main themes are of ..."
1,"""8348_2""","""This movie is a disaster within a disaster fi..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio..."
4,"""12128_7""","""A very accurate depiction of small time mob l..."


In [10]:
test.review[0]

'"Naturally in a film who\'s main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce\'s short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."'

Everything looks good with the data. No surprises so far.

## Model #1: Bag of Centroids

In [11]:
def review_to_wordlist(review, remove_stopwords = True, stem_words = True):
    # Clean the text, with the option to remove stopwords and stem words.

    # Remove HTML
    review_text = BeautifulSoup(review, "lxml").get_text()
    
    # Convert words to lower case and split them
    review_text = review_text.lower()

    # Optionally remove stop words (true by default)
    if remove_stopwords:
        words = review_text.split()
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
        review_text = " ".join(words)

    # Clean the text
    review_text = re.sub(r"[^A-Za-z0-9!?\'\`]", " ", review_text)
    review_text = re.sub(r"it's", " it is", review_text)
    review_text = re.sub(r"that's", " that is", review_text)
    review_text = re.sub(r"\'s", " 's", review_text)
    review_text = re.sub(r"\'ve", " have", review_text)
    review_text = re.sub(r"won't", " will not", review_text)
    review_text = re.sub(r"don't", " do not", review_text)
    review_text = re.sub(r"can't", " can not", review_text)
    review_text = re.sub(r"cannot", " can not", review_text)
    review_text = re.sub(r"n\'t", " n\'t", review_text)
    review_text = re.sub(r"\'re", " are", review_text)
    review_text = re.sub(r"\'d", " would", review_text)
    review_text = re.sub(r"\'ll", " will", review_text)
    review_text = re.sub(r"!", " ! ", review_text)
    review_text = re.sub(r"\?", " ? ", review_text)
    review_text = re.sub(r"\s{2,}", " ", review_text)
    
    # Shorten words to their stems
    if stem_words:
        words = review_text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in words]
        review_text = " ".join(stemmed_words)
    
    # Return a list of words, with each word as its own string
    return review_text.split()
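
To see what the regular-expression substitutions in `review_to_wordlist` do, here is a minimal sketch that applies a subset of them to a made-up sentence. The helper name `expand_contractions` and the sample text are assumptions for illustration; HTML stripping, stop-word removal, and stemming are left out so that the sketch runs without NLTK or BeautifulSoup.

```python
import re

# Hypothetical helper: a subset of the substitutions from review_to_wordlist,
# without HTML stripping, stop-word removal, or stemming.
def expand_contractions(text):
    text = text.lower()
    # Replace everything except letters, digits, !, ?, ' and ` with a space
    text = re.sub(r"[^A-Za-z0-9!?\'\`]", " ", text)
    text = re.sub(r"it's", " it is", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"don't", " do not", text)
    # Put spaces around exclamation marks so they become separate tokens
    text = re.sub(r"!", " ! ", text)
    # Collapse runs of whitespace
    text = re.sub(r"\s{2,}", " ", text)
    return text.split()

tokens = expand_contractions("It's great, I've seen it twice. Don't miss it!")
print(tokens)
```

In the full function, stop words are removed before the contractions are expanded, so words produced by the expansion (such as 'have') stay in the output. This is consistent with the sample sentences printed later in the notebook.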

In [12]:
# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def review_to_sentences(review, tokenizer):
    # Split a review into parsed sentences
    # Returns a list of sentences, where each sentence is a list of words
    
    # Use the NLTK tokenizer to split the review into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence))
    
    # Return the list of sentences
    # Each sentence is a list of words, so this returns a list of lists
    return sentences

In [13]:
sentences = [] 

print ("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print ("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set
Parsing sentences from unlabeled set


In [14]:
# Check how many sentences we have in total 
print (len(sentences))
print()
print (sentences[0])
print()
print (sentences[1])

795538

['with', 'stuff', 'go', 'moment', 'mj', 'i', 'have', 'start', 'listen', 'music', 'watch', 'odd', 'documentari', 'there', 'watch', 'wiz', 'watch', 'moonwalk', 'again']

['mayb', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'realli', 'cool', 'eighti', 'mayb', 'make', 'mind', 'whether', 'guilti', 'innoc']


In [18]:
# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
num_features = 300      # Word vector dimensionality                      
min_word_count = 5      # Minimum word count                        
num_workers = 1         # Number of threads to run in parallel
context = 20            # Context window size                                                                                    
downsampling = 1e-4     # Downsample setting for frequent words

from gensim.models import word2vec

# Initialize and train the model
print ("Training model...")
model = word2vec.Word2Vec(sentences, 
                          workers = num_workers,
                          size = num_features,
                          min_count = min_word_count,
                          window = context, 
                          sample = downsampling)

# Call init_sims because we won't train the model any further 
# This will make the model much more memory-efficient.
model.init_sims(replace=True)

# Save the model for potential future use.
model_name = "{}features_{}minwords_{}context".format(num_features,min_word_count,context)
model.save(model_name)

2017-03-13 17:09:08,042 : INFO : collecting all words and their counts
2017-03-13 17:09:08,043 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-03-13 17:09:08,078 : INFO : PROGRESS: at sentence #10000, processed 131544 words, keeping 12784 word types
2017-03-13 17:09:08,132 : INFO : PROGRESS: at sentence #20000, processed 262182 words, keeping 17571 word types
2017-03-13 17:09:08,174 : INFO : PROGRESS: at sentence #30000, processed 388992 words, keeping 20988 word types
2017-03-13 17:09:08,211 : INFO : PROGRESS: at sentence #40000, processed 520101 words, keeping 23883 word types
2017-03-13 17:09:08,246 : INFO : PROGRESS: at sentence #50000, processed 647817 words, keeping 26225 word types


Training model...


2017-03-13 17:09:08,283 : INFO : PROGRESS: at sentence #60000, processed 775677 words, keeping 28215 word types
2017-03-13 17:09:08,331 : INFO : PROGRESS: at sentence #70000, processed 904602 words, keeping 30054 word types
2017-03-13 17:09:08,367 : INFO : PROGRESS: at sentence #80000, processed 1031248 words, keeping 31698 word types
2017-03-13 17:09:08,416 : INFO : PROGRESS: at sentence #90000, processed 1162109 words, keeping 33415 word types
2017-03-13 17:09:08,452 : INFO : PROGRESS: at sentence #100000, processed 1290091 words, keeping 34868 word types
2017-03-13 17:09:08,488 : INFO : PROGRESS: at sentence #110000, processed 1417840 words, keeping 36191 word types
2017-03-13 17:09:08,527 : INFO : PROGRESS: at sentence #120000, processed 1546128 words, keeping 37632 word types
2017-03-13 17:09:08,569 : INFO : PROGRESS: at sentence #130000, processed 1677040 words, keeping 38899 word types
2017-03-13 17:09:08,606 : INFO : PROGRESS: at sentence #140000, processed 1800051 words, keepi

In [19]:
# Load the model, if necessary
# model = word2vec.Word2Vec.load("300features_5minwords_20context") 

In [20]:
# Take a look at the performance of the model
print(model.most_similar("man"))

[('woman', 0.6236348748207092), ('crippl', 0.6064074039459229), ('harden', 0.5961687564849854), ('loner', 0.5913522839546204), ('businessman', 0.5873215198516846), ('unwit', 0.5845123529434204), ('rancher', 0.5801037549972534), ('alonzo', 0.5760775804519653), ('deed', 0.5737305879592896), ('greedi', 0.5680551528930664)]


In [21]:
model.most_similar("great")

[('excel', 0.728122353553772),
 ('fantast', 0.7253319621086121),
 ('terrif', 0.6595006585121155),
 ('outstand', 0.6500605344772339),
 ('superb', 0.6353731751441956),
 ('brilliant', 0.6136918663978577),
 ('fine', 0.5936250686645508),
 ('good', 0.5925694704055786),
 ('awesom', 0.5890749096870422),
 ('amaz', 0.584725022315979)]

In [22]:
model.most_similar("terribl")

[('horribl', 0.8962301015853882),
 ('aw', 0.8750690221786499),
 ('atroci', 0.8494848012924194),
 ('horrend', 0.8438228964805603),
 ('horrid', 0.8207361698150635),
 ('lousi', 0.7899401783943176),
 ('abysm', 0.7819921970367432),
 ('bad', 0.7641528844833374),
 ('laughabl', 0.7585679888725281),
 ('amateurish', 0.738487958908081)]

In [23]:
model.most_similar("movi")

[('this', 0.6199691891670227),
 ('film', 0.607010006904602),
 ('flick', 0.5626986026763916),
 ('honest', 0.5343186855316162),
 ('realli', 0.5284943580627441),
 ('probabl', 0.5283719897270203),
 ('blockbust', 0.514778733253479),
 ('those', 0.5122922658920288),
 ('watcher', 0.5061703324317932),
 ('actual', 0.5050208568572998)]

In [24]:
model.most_similar("best")

[('finest', 0.7128208875656128),
 ('greatest', 0.6819601655006409),
 ('favourit', 0.6107134819030762),
 ('funniest', 0.6005286574363708),
 ('arguabl', 0.5901353359222412),
 ('weakest', 0.5900111794471741),
 ('underr', 0.5632859468460083),
 ('favorit', 0.5546506643295288),
 ('undoubt', 0.5520817041397095),
 ('worst', 0.5461603403091431)]

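The model looks good so far. Each of these words has appropriate similar words. Note that antonyms such as 'worst' and 'weakest' appear among the neighbors of 'best': word2vec groups words that occur in similar contexts, which is not always the same as words that share a meaning.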
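
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking on three made-up 3-dimensional vectors; the numbers and the function names are assumptions for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for this example (not taken from the word2vec model)
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "excel": [0.8, 0.2, 0.4],
    "terribl": [-0.7, 0.6, 0.1],
}

def toy_most_similar(word, vectors):
    # Score every other word against the query and sort from high to low
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar("great", toy_vectors)
print(ranking)
```

Vectors that point in nearly the same direction score close to 1, and vectors that point in opposing directions score below 0.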

In [25]:
model.wv.syn0.shape



(32269, 300)

In [26]:
# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster
word_vectors = model.wv.syn0 
num_clusters = int(word_vectors.shape[0] / 5)

# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans(n_clusters = num_clusters,
                           n_init = 5,
                           verbose = 2)
idx = kmeans_clustering.fit_predict(word_vectors)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 6549.41
start iteration
done sorting
end inner loop
Iteration 1, inertia 6407.38
start iteration
done sorting
end inner loop
Iteration 2, inertia 6368.37
start iteration
done sorting
end inner loop
Iteration 3, inertia 6352.34
start iteration
done sorting
end inner loop
Iteration 4, inertia 6345.52
start iteration
done sorting
end inner loop
Iteration 5, inertia 6341.61
start iteration
done sorting
end inner loop
Iteration 6, inertia 6339.4
start iteration
done sorting
end inner loop
Iteration 7, inertia 6337.72
start iteration
done sorting
end inner loop
Iteration 8, inertia 6336.79
start iteration
done sorting
end inner loop
Iteration 9, inertia 6336.27
start iteration
done sorting
end inner loop
Iteration 10, inertia 6336.03
start iteration
done sorting
end inner loop
Iteration 11, inertia 6335.88
start iteration
done sorting
end inner loop
Iteration 12, inertia 6335.67
start iteration
done sort

In [27]:
# Create a Word / Index dictionary, mapping each vocabulary word to a cluster number                                                                                            
word_centroid_map = dict(zip(model.wv.index2word, idx))
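
A quick way to sanity-check the clustering is to invert the word-to-cluster map and look at the words in each cluster. The sketch below does this on a made-up vocabulary; `toy_index2word` and `toy_idx` are hypothetical stand-ins for `model.wv.index2word` and `idx`.

```python
from collections import defaultdict

# Hypothetical stand-ins for model.wv.index2word and the k-means labels
toy_index2word = ["great", "excel", "terribl", "aw", "movi", "film"]
toy_idx = [0, 0, 1, 1, 2, 2]

# Same construction as above: each word maps to its cluster number
toy_word_centroid_map = dict(zip(toy_index2word, toy_idx))

# Invert the map: each cluster number maps to the list of its words
clusters = defaultdict(list)
for word, cluster in toy_word_centroid_map.items():
    clusters[cluster].append(word)

for cluster in sorted(clusters):
    print(cluster, clusters[cluster])
```

With the real map, printing the first few clusters should show whether related words end up together.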

In [28]:
# Clean the training and testing reviews.
clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_wordlist(review))
    
print("Training reviews are clean")  

clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_wordlist(review))
    
print("Testing reviews are clean") 

Training reviews are clean
Testing reviews are clean


In [29]:
clean_train_reviews[0]

['with',
 'stuff',
 'go',
 'moment',
 'mj',
 'i',
 'have',
 'start',
 'listen',
 'music',
 'watch',
 'odd',
 'documentari',
 'there',
 'watch',
 'wiz',
 'watch',
 'moonwalk',
 'again',
 'mayb',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'realli',
 'cool',
 'eighti',
 'mayb',
 'make',
 'mind',
 'whether',
 'guilti',
 'innoc',
 'moonwalk',
 'part',
 'biographi',
 'part',
 'featur',
 'film',
 'rememb',
 'go',
 'see',
 'cinema',
 'origin',
 'releas',
 'subtl',
 'messag',
 'mj',
 "'s",
 'feel',
 'toward',
 'press',
 'also',
 'obvious',
 'messag',
 'drug',
 'bad',
 "m'kay",
 'visual',
 'impress',
 'cours',
 'michael',
 'jackson',
 'unless',
 'remot',
 'like',
 'mj',
 'anyway',
 'go',
 'hate',
 'find',
 'bore',
 'may',
 'call',
 'mj',
 'egotist',
 'consent',
 'make',
 'movi',
 'mj',
 'fan',
 'would',
 'say',
 'made',
 'fan',
 'true',
 'realli',
 'nice',
 'him',
 'the',
 'actual',
 'featur',
 'film',
 'bit',
 'final',
 'start',
 '20',
 'minut',
 'exclud',
 'smooth',
 'crimin

In [31]:
clean_test_reviews[0]

['natur',
 'film',
 'who',
 "'s",
 'main',
 'theme',
 'mortal',
 'nostalgia',
 'loss',
 'innoc',
 'perhap',
 'surpris',
 'rate',
 'high',
 'older',
 'viewer',
 'younger',
 'one',
 'howev',
 'craftsmanship',
 'complet',
 'film',
 'anyon',
 'enjoy',
 'pace',
 'steadi',
 'constant',
 'charact',
 'full',
 'engag',
 'relationship',
 'interact',
 'natur',
 'show',
 'need',
 'flood',
 'tear',
 'show',
 'emot',
 'scream',
 'show',
 'fear',
 'shout',
 'show',
 'disput',
 'violenc',
 'show',
 'anger',
 'natur',
 'joyc',
 "'s",
 'short',
 'stori',
 'lend',
 'film',
 'readi',
 'made',
 'structur',
 'perfect',
 'polish',
 'diamond',
 'small',
 'chang',
 'huston',
 'make',
 'inclus',
 'poem',
 'fit',
 'neat',
 'truli',
 'masterpiec',
 'tact',
 'subtleti',
 'overwhelm',
 'beauti']

In [32]:
def create_bag_of_centroids(wordlist, word_centroid_map):
    
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max(word_centroid_map.values()) + 1
    
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros(num_centroids, dtype="float32")
    
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    
    return bag_of_centroids
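
As a small worked example of the function above, the sketch below runs the same counting logic on a made-up map and review. It uses a plain list instead of a NumPy array so that it has no dependencies; the map, the review, and the name `toy_bag_of_centroids` are assumptions for illustration.

```python
# Hypothetical word -> cluster map with three clusters
toy_map = {"great": 0, "excel": 0, "terribl": 1, "aw": 1, "movi": 2}

def toy_bag_of_centroids(wordlist, word_centroid_map):
    # One slot per cluster, all counts starting at zero
    num_centroids = max(word_centroid_map.values()) + 1
    bag = [0] * num_centroids
    # Count how many words of the review fall into each cluster;
    # words outside the vocabulary are skipped
    for word in wordlist:
        if word in word_centroid_map:
            bag[word_centroid_map[word]] += 1
    return bag

toy_review = ["great", "movi", "excel", "plot", "great"]
bag = toy_bag_of_centroids(toy_review, toy_map)
print(bag)  # [3, 0, 1]
```

'great' (twice) and 'excel' land in cluster 0, 'movi' lands in cluster 2, and 'plot' is ignored because it is not in the map.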

In [34]:
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros((train["review"].size, num_clusters), dtype="float32")

# Transform the training set reviews into bags of centroids
for i, review in enumerate(clean_train_reviews):
    train_centroids[i] = create_bag_of_centroids(review, word_centroid_map)

print("Training reviews are complete.")

# Repeat for test reviews
test_centroids = np.zeros((test["review"].size, num_clusters), dtype="float32")

for i, review in enumerate(clean_test_reviews):
    test_centroids[i] = create_bag_of_centroids(review, word_centroid_map)

print("Testing reviews are complete.")

Training reviews are complete.
Testing reviews are complete.


Let's use GridSearchCV to find the optimal parameters for each classifier.

In [39]:
def use_GridSearch(model, model_parameters, x_values):
    '''Find the optimal parameters for a model'''
    grid = GridSearchCV(model, model_parameters, scoring = 'roc_auc')
    grid.fit(x_values, train.sentiment)

    print("Best grid score = ", grid.best_score_)
    print("Best Parameters = ", grid.best_params_)

In [111]:
# RandomForest Classifier
rfc_parameters = {'n_estimators':[100,200,300],
                  'max_depth':[3,5,7,None],
                  'min_samples_leaf': [1,2,3]}

rfc_model = rfc()

use_GridSearch(rfc_model, rfc_parameters, train_centroids)

Best grid score =  0.92957452917
Best Parameters =  {'n_estimators': 300, 'min_samples_leaf': 1, 'max_depth': None}
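
A grid search fits one model per parameter combination per cross-validation fold, so its cost grows quickly with the size of the grid. The sketch below counts the fits for the random forest grid used above. The fold count of 3 is an assumption: the cell does not pass `cv`, and the default depends on the scikit-learn version. `count_fits` is a hypothetical helper.

```python
from itertools import product

# The random forest grid from the cell above
rfc_grid = {"n_estimators": [100, 200, 300],
            "max_depth": [3, 5, 7, None],
            "min_samples_leaf": [1, 2, 3]}

def count_fits(param_grid, n_folds):
    # Every combination of one value per parameter is a candidate
    combos = list(product(*param_grid.values()))
    # Each candidate is fitted once per fold
    return len(combos), len(combos) * n_folds

# n_folds=3 is assumed for illustration; pass the value GridSearchCV really uses
n_combos, n_fits = count_fits(rfc_grid, n_folds=3)
print(n_combos, n_fits)  # 36 108
```

Under that assumption the grid has 36 candidates and 108 fits, which is worth knowing before adding more values to the grid.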


In [48]:
# Logistic Regression
lr_parameters = {'C':[0.005,0.01,0.05],
                 'max_iter':[4,5,6],
                 'fit_intercept': [True]}

lr_model = lr()

use_GridSearch(lr_model, lr_parameters, train_centroids)

Best grid score =  0.942853192921
Best Parameters =  {'max_iter': 5, 'C': 0.01, 'fit_intercept': True}


In [49]:
# Stochastic Gradient Descent Classifier 
sgd_parameters = {'loss': ['log'],
                  'penalty': ['l1','l2','none']}

sgd_model = sgd()

use_GridSearch(sgd_model, sgd_parameters, train_centroids)

Best grid score =  0.92408550165
Best Parameters =  {'penalty': 'l1', 'loss': 'log'}


Let's double-check the quality of the classifiers with cross-validation, then train them.

In [51]:
def use_model(model, x_values):
    '''
    Test the quality of a model using cross validation
    Train the model with the x_values
    '''
    scores = cross_val_score(model, x_values, train.sentiment, cv = 5, scoring = 'roc_auc')
    model.fit(x_values, train.sentiment)
    mean_score = round(np.mean(scores) * 100,2) 

    print(scores)
    print()
    print("Mean score = {}".format(mean_score))

In [105]:
rfc_model = rfc(n_estimators = 300,
                max_depth = None,
                min_samples_leaf = 1)

use_model(rfc_model, train_centroids)

[ 0.92891048  0.9312636   0.93165904  0.93291288  0.92610464]

Mean score = 93.02


In [106]:
lr_model = lr(C = 0.01,
              max_iter = 5,
              fit_intercept = True)

use_model(lr_model, train_centroids)

[ 0.94450368  0.94777888  0.94073952  0.94730352  0.9427424 ]

Mean score = 94.46


In [107]:
sgd_model = sgd(loss = 'log',
                penalty = 'l1')

use_model(sgd_model, train_centroids)

[ 0.92591808  0.92870432  0.91723472  0.92715936  0.92216176]

Mean score = 92.42


In [108]:
rfc_result = rfc_model.predict(test_centroids)
lr_result = lr_model.predict(test_centroids)
sgd_result = sgd_model.predict(test_centroids)

avg_result = (lr_result + rfc_result + sgd_result) / 3

avg_result_final = []
for result in avg_result:
    if result > 0.5:
        avg_result_final.append(1)
    else:
        avg_result_final.append(0)
        
avg_output = pd.DataFrame(data={"id":test["id"], "sentiment":avg_result_final})
avg_output.to_csv("avg_centroids_submission.csv", index=False, quoting=3)

In [57]:
# Take a look at the submission file
avg_output[0:10]

Unnamed: 0,id,sentiment
0,"""12311_10""",1
1,"""8348_2""",0
2,"""5828_4""",1
3,"""7186_2""",0
4,"""12128_7""",1
5,"""2913_8""",1
6,"""4396_1""",0
7,"""395_2""",1
8,"""10616_1""",0
9,"""9074_9""",0


When I submitted these results to the Kaggle competition, the score (area under the ROC curve) was 0.880, which ranks 266/578 (top 46%).
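
Area under the ROC curve measures how well the predictions rank positive reviews above negative ones, so it can change when continuous scores are rounded to 0/1 labels. The cross-validation above uses `scoring = 'roc_auc'`, which scores classifiers on continuous outputs, while the submission file contains hard labels, so the two numbers may not be directly comparable. The sketch below shows the effect on made-up values; `roc_auc` is a hypothetical pure-Python helper, not scikit-learn's implementation, and the numbers say nothing about the real submission.

```python
def roc_auc(y_true, y_score):
    # AUC as the fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half a correct pair
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.45, 0.55, 0.3, 0.1]
# The same predictions rounded to hard labels
labels = [1 if p > 0.5 else 0 for p in probs]

auc_probs = roc_auc(y_true, probs)
auc_labels = roc_auc(y_true, labels)
print(round(auc_probs, 3), round(auc_labels, 3))  # 0.889 0.667
```

One option to try, as an untested suggestion, is averaging the outputs of `predict_proba` from the three classifiers and submitting those averages instead of 0/1 labels.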

## Model #2: TfidfVectorizer

In [96]:
# Count the number of distinct reviews (each key is a whole cleaned review
# joined into one string); the total is used as max_features below.
word_counts = defaultdict(int)

for comment in clean_train_reviews:
    word_counts[" ".join(comment)] += 1

for comment in clean_test_reviews:
    word_counts[" ".join(comment)] += 1
print(len(word_counts))

49577


In [97]:
# Set the parameters for vectorizing the words in the reviews.
vectorizer = TfidfVectorizer(max_features = len(word_counts), 
                             ngram_range = (1, 3), 
                             sublinear_tf = True)
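
To make the `sublinear_tf = True` setting concrete, the sketch below computes TF-IDF weights by hand for a toy corpus. It assumes scikit-learn's default smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2 normalization, and it ignores n-grams and `max_features`. The corpus and function names are made up for illustration.

```python
import math
from collections import Counter

def sublinear_tf(count):
    # With sublinear_tf=True, a raw count tf is replaced by 1 + ln(tf)
    return 1.0 + math.log(count)

def smoothed_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def tfidf_vector(doc, docs):
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        # Number of documents that contain the term at least once
        doc_freq = sum(1 for d in docs if term in d)
        weights[term] = sublinear_tf(count) * smoothed_idf(len(docs), doc_freq)
    # L2-normalize so that the document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

toy_docs = [["great", "movi", "great"], ["terribl", "movi"], ["great", "act"]]
weights = tfidf_vector(toy_docs[0], toy_docs)
print(weights)
```

In the first toy document 'great' appears twice and 'movi' once, and both occur in two documents, so 'great' gets 1 + ln(2) times the weight of 'movi' rather than twice the weight.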

In [60]:
# Join the words of the reviews.
# The list of lists becomes just a list of strings (strings = reviews).
clean_train_reviews_join = []
for review in clean_train_reviews:
    clean_train_reviews_join.append(" ".join(review))

clean_test_reviews_join = []
for review in clean_test_reviews:
    clean_test_reviews_join.append(" ".join(review))

In [99]:
# Train the vectorizer on the vocabulary and convert reviews into matrices.
x_train_vec = vectorizer.fit_transform(clean_train_reviews_join)
print("x_train_vec is complete.")
x_test_vec = vectorizer.transform(clean_test_reviews_join)
print("x_test_vec is complete.")

x_train_vec is complete.
x_test_vec is complete.


Use GridSearchCV to find the best parameters, just like with Model #1.

In [78]:
rfc_parameters_vec = {'n_estimators':[100,200,300],
                      'max_depth':[3,5,7,None],
                      'min_samples_leaf': [1,2,3,4]}

rfc_model_vec = rfc()

use_GridSearch(rfc_model_vec, rfc_parameters_vec, x_train_vec)

Best grid score =  0.933537007858
Best Parameters =  {'n_estimators': 200, 'min_samples_leaf': 3, 'max_depth': None}


In [71]:
lr_parameters_vec = {'C':[5,6,7],
                     'max_iter':[1,2,3],
                     'fit_intercept': [True,False]}

lr_model_vec = lr()

use_GridSearch(lr_model_vec, lr_parameters_vec, x_train_vec)

Best grid score =  0.962700581661
Best Parameters =  {'max_iter': 2, 'C': 6, 'fit_intercept': False}


In [68]:
sgd_parameters_vec = {'loss': ['log'],
                      'penalty': ['l1','l2','none']}

sgd_model_vec = sgd()

use_GridSearch(sgd_model_vec, sgd_parameters_vec, x_train_vec)

Best grid score =  0.958539153016
Best Parameters =  {'penalty': 'none', 'loss': 'log'}


Double-check the quality of the classifiers with cross-validation, then train them.

In [100]:
rfc_model_vec = rfc(n_estimators = 200,
                    max_depth = None,
                    min_samples_leaf = 3)

use_model(rfc_model_vec, x_train_vec)

[ 0.93125968  0.93375808  0.93373184  0.93458096  0.93305328]

Mean score = 93.33


In [101]:
lr_model_vec = lr(C = 6,
                  max_iter = 2,
                  fit_intercept = False)

use_model(lr_model_vec, x_train_vec)

[ 0.96271008  0.96529376  0.95973088  0.96456848  0.96171536]

Mean score = 96.28


In [102]:
sgd_model_vec = sgd(loss = 'log',
                    penalty = 'none')

use_model(sgd_model_vec, x_train_vec)

[ 0.95948544  0.96322432  0.9573232   0.96245712  0.95898608]

Mean score = 96.03


In [103]:
lr_result_vec = lr_model_vec.predict(x_test_vec)
rfc_result_vec = rfc_model_vec.predict(x_test_vec)
sgd_result_vec = sgd_model_vec.predict(x_test_vec)

avg_result_vec = (lr_result_vec + rfc_result_vec + sgd_result_vec) / 3

avg_result_final_vec = []
for result in avg_result_vec:
    if result > 0.5:
        avg_result_final_vec.append(1)
    else:
        avg_result_final_vec.append(0)
        
avg_output_vec = pd.DataFrame(data={"id":test["id"], "sentiment":avg_result_final_vec})
avg_output_vec.to_csv("avg_vec_submission1.csv", index=False, quoting=3)

In [104]:
avg_output_vec[0:10]

Unnamed: 0,id,sentiment
0,"""12311_10""",1
1,"""8348_2""",0
2,"""5828_4""",1
3,"""7186_2""",0
4,"""12128_7""",1
5,"""2913_8""",1
6,"""4396_1""",0
7,"""395_2""",1
8,"""10616_1""",0
9,"""9074_9""",0


This method scores slightly higher: 0.895, which ranks 251/578 (top 44%).

Just for fun, let's see what happens when we combine all six predictions.

In [109]:
avg_result_combine = (lr_result + rfc_result + sgd_result +
                      lr_result_vec + rfc_result_vec + sgd_result_vec) / 6

avg_result_final_combine = []
for result in avg_result_combine:
    if result > 0.5:
        avg_result_final_combine.append(1)
    else:
        avg_result_final_combine.append(0)
        
avg_output_combine = pd.DataFrame(data={"id":test["id"], "sentiment":avg_result_final_combine})
avg_output_combine.to_csv("avg_combine_submission.csv", index=False, quoting=3)

In [110]:
avg_output_combine[0:10]

Unnamed: 0,id,sentiment
0,"""12311_10""",1
1,"""8348_2""",0
2,"""5828_4""",1
3,"""7186_2""",0
4,"""12128_7""",1
5,"""2913_8""",1
6,"""4396_1""",0
7,"""395_2""",1
8,"""10616_1""",0
9,"""9074_9""",0


This 'combined' submission scored in between methods 1 and 2, at 0.892. I expected this ensemble strategy to score better than the two previous methods, but unfortunately it did not.
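
One detail of the combined vote that may be worth checking: with six voters the average can be exactly 0.5, and the `> 0.5` test sends every such tie to class 0. With three voters a tie cannot happen. The sketch below illustrates this with made-up votes; `vote` is a hypothetical helper that mirrors the averaging loop above, and this is not a claim that ties explain the score.

```python
from itertools import product

def vote(predictions, threshold=0.5):
    # Average the 0/1 predictions and apply the same > 0.5 rule as above
    avg = sum(predictions) / len(predictions)
    return 1 if avg > threshold else 0

three_way = vote([1, 1, 0])             # average 0.667 -> 1
six_way_tie = vote([1, 1, 1, 0, 0, 0])  # average 0.5 -> 0

# Count how many of the 2**6 possible vote patterns are exact ties
n_ties = sum(1 for votes in product([0, 1], repeat=6) if sum(votes) == 3)
print(three_way, six_way_tie, n_ties)  # 1 0 20
```

Counting how often the six real predictions split three to three on the test set would show whether this matters in practice.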

## Summary

The best-performing tutorial from Google scores 0.845, which is the "Word2Vec - Bag of Centroids" example. I am pleased that I improved upon this example and built another method that scores even higher. Although I did not score at the top of the leaderboard, I am still pleased with my results and have learned a great deal. One thing that I will focus on in future projects of this nature is reducing overfitting. As you probably noticed, my models score much higher in cross-validation on the training data than they do on the test data. If I find some useful strategies for producing models that generalize better, I'll try to return to this code to improve its results.