# Predicting the Sentiment of Movie Reviews

The goal of this analysis is to predict whether a review rates a movie positively or negatively. The dataset contains 25,000 labeled movie reviews for training, 50,000 unlabeled reviews for training, and 25,000 reviews for testing. More information about the data can be found at: https://www.kaggle.com/c/word2vec-nlp-tutorial.

This data comes from the 2015 Kaggle competition, "Bag of Words Meets Bags of Popcorn." Although the competition has finished, I thought it could still serve as a useful dataset for my first Natural Language Processing (NLP) project. This analysis covers three methods for predicting the sentiment of movie reviews. I wanted to experiment with a few strategies to understand how each one works and to compare their results. The three methods that I will use are:
- Bag of Centroids with Random Forest
- Bag of Words with TensorFlow
- Long Short-Term Memory (LSTM) with TensorFlow


In [1]:
import pandas as pd
import numpy as np
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import nltk.data
import logging  
from gensim.models import Word2Vec
import multiprocessing
import time
import tflearn
import tensorflow as tf
from tensorflow.contrib import learn
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
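# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split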
from sklearn.model_selection import GridSearchCV
from bs4 import BeautifulSoup



## Load and Explore the Data

In [2]:
# Load the Data
train = pd.read_csv("labeledTrainData.tsv", 
                    header=0, 
                    delimiter="\t", 
                    quoting=3 )

unlabeled_train = pd.read_csv("unlabeledTrainData.tsv", 
                              header=0, 
                              delimiter="\t", 
                              quoting=3 )

test = pd.read_csv("testData.tsv", 
                   header=0, 
                   delimiter="\t", 
                   quoting=3 )

In [3]:
# Compare the lengths of the datasets
print(train.shape)
print(unlabeled_train.shape)
print(test.shape)

(25000, 3)
(50000, 2)
(25000, 2)


In [4]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [5]:
# Take a look at a review
train.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [6]:
unlabeled_train.head()

Unnamed: 0,id,review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
1,"""45057_0""","""I saw this film about 20 years ago and rememb..."
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
3,"""7161_0""","""I went to see this film with a great deal of ..."
4,"""43971_0""","""Yes, I agree with everyone on this site this ..."


In [7]:
unlabeled_train.review[0]

'"Watching Time Chasers, it obvious that it was made by a bunch of friends. Maybe they were sitting around one day in film school and said, \\"Hey, let\'s pool our money together and make a really bad movie!\\" Or something like that. What ever they said, they still ended up making a really bad movie--dull story, bad script, lame acting, poor cinematography, bottom of the barrel stock music, etc. All corners were cut, except the one that would have prevented this film\'s release. Life\'s like that."'

In [8]:
test.head()

Unnamed: 0,id,review
0,"""12311_10""","""Naturally in a film who's main themes are of ..."
1,"""8348_2""","""This movie is a disaster within a disaster fi..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio..."
4,"""12128_7""","""A very accurate depiction of small time mob l..."


In [9]:
test.review[0]

'"Naturally in a film who\'s main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce\'s short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."'

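Everything looks good with the data. The shapes match the documented dataset sizes, and the reviews are raw text that still contains HTML tags such as <br /> and escaped quotes, which the cleaning step below will handle. Naturally the reviews are of different lengths, but everything is as expected.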

## Model #1: Bag of Centroids

In [10]:
def review_to_wordlist(review, remove_stopwords=False):
    # Clean the text, with the option to remove stopwords.

    # Remove HTML (naming the parser explicitly avoids BeautifulSoup's parser warning)
    review_text = BeautifulSoup(review, "lxml").get_text()

    # Clean the text
    review_text = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", review_text)
    review_text = re.sub(r"\'s", " \'s", review_text)
    review_text = re.sub(r"\'ve", " \'ve", review_text)
    review_text = re.sub(r"n\'t", " n\'t", review_text)
    review_text = re.sub(r"\'re", " \'re", review_text)
    review_text = re.sub(r"\'d", " \'d", review_text)
    review_text = re.sub(r"\'ll", " \'ll", review_text)
    review_text = re.sub(r",", " , ", review_text)
    review_text = re.sub(r"!", " ! ", review_text)
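    # Note: "\(", "\)" and "\?" are not escapes in a replacement string, so these
    # three lines insert a literal backslash before the character (it shows up
    # as "\\?" in the joined review printed later on).
    review_text = re.sub(r"\(", " \( ", review_text)
    review_text = re.sub(r"\)", " \) ", review_text)
    review_text = re.sub(r"\?", " \? ", review_text)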
    review_text = re.sub(r"\s{2,}", " ", review_text)

    # Convert words to lower case and split them
    words = review_text.lower().split()

    # Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    
    # Shorten words to their stems (i.e. remove suffixes and other word endings)
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in words]
    
    # Return a list of words
    return(stemmed_words)
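
To make the regular-expression steps easier to follow, here is a minimal sketch that applies a few of the same substitutions to one short string. The helper name `split_contractions` and the sample sentence are made up for illustration, and the sketch leaves out the HTML stripping, stopword removal, and stemming that `review_to_wordlist` performs.

```python
import re

def split_contractions(text):
    """Hypothetical helper: applies a subset of the substitutions used above."""
    # Replace every character outside the allowed set with a space
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)
    # Put a space in front of common contraction endings
    text = re.sub(r"'s", " 's", text)
    text = re.sub(r"'ve", " 've", text)
    text = re.sub(r"n't", " n't", text)
    # Surround commas and exclamation marks with spaces
    text = re.sub(r",", " , ", text)
    text = re.sub(r"!", " ! ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s{2,}", " ", text)
    # Lower-case and split on whitespace
    return text.lower().split()

tokens = split_contractions("It's great, isn't it!")
print(tokens)
```

Punctuation and contraction endings become separate tokens, which is why entries such as `','` and `"'s"` show up in the cleaned reviews later on.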

In [11]:
# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Split a review into parsed sentences
    # Returns a list of sentences, where each sentence is a list of words
    
    # Use the NLTK tokenizer to split the review into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    
    # Return the list of sentences
    # Each sentence is a list of words, so this returns a list of lists
    return sentences

In [12]:
sentences = [] 

print ("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print ("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set






Parsing sentences from unlabeled set




In [13]:
# Check how many sentences we have in total 
print (len(sentences))
print()
print (sentences[0])
print()
print (sentences[1])

795538

['with', 'all', 'this', 'stuff', 'go', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'start', 'listen', 'to', 'his', 'music', ',', 'watch', 'the', 'odd', 'documentari', 'here', 'and', 'there', ',', 'watch', 'the', 'wiz', 'and', 'watch', 'moonwalk', 'again']

['mayb', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'realli', 'cool', 'in', 'the', 'eighti', 'just', 'to', 'mayb', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilti', 'or', 'innoc']


In [14]:
# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
num_features = 250      # Word vector dimensionality                      
min_word_count = 20     # Minimum word count                        
num_workers = 1         # Number of threads to run in parallel
context = 20            # Context window size                                                                                    
downsampling = 1e-3     # Downsample setting for frequent words

# Initialize and train the model
from gensim.models import word2vec
print ("Training model...")
model = word2vec.Word2Vec(sentences, 
                          workers = num_workers,
                          size = num_features,
                          min_count = min_word_count,
                          window = context, 
                          sample = downsampling)

# Call init_sims because we won't train the model any further 
# This will make the model much more memory-efficient.
model.init_sims(replace=True)

# save the model for potential, future use.
model_name = "250features_20minwords_20context"
model.save(model_name)

2017-02-24 14:46:50,312 : INFO : collecting all words and their counts
2017-02-24 14:46:50,313 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-02-24 14:46:50,361 : INFO : PROGRESS: at sentence #10000, processed 241944 words, keeping 12778 word types
2017-02-24 14:46:50,406 : INFO : PROGRESS: at sentence #20000, processed 483715 words, keeping 17558 word types
2017-02-24 14:46:50,450 : INFO : PROGRESS: at sentence #30000, processed 718901 words, keeping 20970 word types


Training model...


2017-02-24 14:46:50,501 : INFO : PROGRESS: at sentence #40000, processed 961251 words, keeping 23861 word types
2017-02-24 14:46:50,555 : INFO : PROGRESS: at sentence #50000, processed 1196241 words, keeping 26195 word types
2017-02-24 14:46:50,601 : INFO : PROGRESS: at sentence #60000, processed 1433081 words, keeping 28180 word types
2017-02-24 14:46:50,645 : INFO : PROGRESS: at sentence #70000, processed 1671693 words, keeping 30007 word types
2017-02-24 14:46:50,689 : INFO : PROGRESS: at sentence #80000, processed 1906209 words, keeping 31644 word types
2017-02-24 14:46:50,737 : INFO : PROGRESS: at sentence #90000, processed 2146479 words, keeping 33357 word types
2017-02-24 14:46:50,783 : INFO : PROGRESS: at sentence #100000, processed 2384019 words, keeping 34803 word types
2017-02-24 14:46:50,827 : INFO : PROGRESS: at sentence #110000, processed 2619451 words, keeping 36116 word types
2017-02-24 14:46:50,873 : INFO : PROGRESS: at sentence #120000, processed 2856973 words, keepin

In [15]:
# Load the model, if necessary
# model = Word2Vec.load("250features_20minwords_20context") 

In [16]:
# Take a look at the performance of the model
print(model.most_similar("man"))

[('woman', 0.613450825214386), ('businessman', 0.5231447219848633), ('loner', 0.49102887511253357), ('ladi', 0.4824357330799103), ('lad', 0.4822234809398651), ('millionair', 0.4815925359725952), ('boxer', 0.48143839836120605), ('men', 0.465503454208374), ('crippl', 0.4626684784889221), ('policeman', 0.442837119102478)]


In [17]:
model.most_similar("great")

[('fantast', 0.7024080753326416),
 ('terrif', 0.7019121050834656),
 ('superb', 0.6130414009094238),
 ('fine', 0.5928903222084045),
 ('excel', 0.5899552702903748),
 ('brilliant', 0.5725492835044861),
 ('good', 0.5600504875183105),
 ('solid', 0.5397793054580688),
 ('tremend', 0.5350014567375183),
 ('fabul', 0.5297815203666687)]

In [18]:
model.most_similar("terribl")

[('horribl', 0.8503983616828918),
 ('horrend', 0.7642723321914673),
 ('atroci', 0.7585293650627136),
 ('aw', 0.7438712120056152),
 ('horrid', 0.7418596744537354),
 ('lousi', 0.6711216568946838),
 ('abysm', 0.6696984171867371),
 ('shoddi', 0.6531139016151428),
 ('bad', 0.6492224931716919),
 ('laughabl', 0.633672297000885)]

In [19]:
model.most_similar("movi")

[('film', 0.8010729551315308),
 ('flick', 0.642029881477356),
 ('it', 0.5277636647224426),
 ('sequel', 0.45611339807510376),
 ('thing', 0.45259204506874084),
 ('storylin', 0.4260961711406708),
 ('stinker', 0.4210663139820099),
 ('documentari', 0.40751001238822937),
 ('cinema', 0.4012513756752014),
 ('pictur', 0.39962583780288696)]

In [20]:
model.most_similar("best")

[('finest', 0.753976583480835),
 ('funniest', 0.645499587059021),
 ('worst', 0.6364975571632385),
 ('weakest', 0.6331263184547424),
 ('greatest', 0.6280043125152588),
 ('coolest', 0.5717089176177979),
 ('scariest', 0.5357791185379028),
 ('saddest', 0.5317215323448181),
 ('poorest', 0.5277900695800781),
 ('strongest', 0.52679842710495)]

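The model looks good so far. Each query word returns sensible neighbors. Note that antonyms can rank highly too ('worst' and 'weakest' appear near 'best'), because Word2Vec groups words that occur in similar contexts, not only words with similar meanings.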
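
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of that ranking, here is a small sketch in plain Python. The three-dimensional vectors are invented for the example (the real model uses 250 dimensions), and this `most_similar` function is a hypothetical stand-in, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "fantast": [0.8, 0.2, 0.1],
    "terribl": [-0.7, 0.3, 0.2],
}

def most_similar(word, vectors):
    # Score every other word against the query and sort from most to least similar
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("great", toy_vectors)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction, and a negative score means they point in roughly opposite directions.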

In [21]:
model.wv.syn0.shape



(17218, 250)

In [22]:
# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster
word_vectors = model.wv.syn0
num_clusters = int(word_vectors.shape[0] / 5)

# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans(n_clusters = num_clusters,
                           verbose = 2)
idx = kmeans_clustering.fit_predict(word_vectors)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 7277.73
start iteration
done sorting
end inner loop
Iteration 1, inertia 7226.96
start iteration
done sorting
end inner loop
Iteration 2, inertia 7216.98
start iteration
done sorting
end inner loop
Iteration 3, inertia 7213.83
start iteration
done sorting
end inner loop
Iteration 4, inertia 7213.01
start iteration
done sorting
end inner loop
Iteration 5, inertia 7212.54
start iteration
done sorting
end inner loop
Iteration 6, inertia 7212.33
start iteration
done sorting
end inner loop
Iteration 7, inertia 7212.14
start iteration
done sorting
end inner loop
Iteration 8, inertia 7212.14
center shift 0.000000e+00 within tolerance 3.604397e-07
Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 7284.97
start iteration
done sorting
end inner loop
Iteration 1, inertia 7237.15
start iteration
done sorting
end inner loop
Iteration 2, inertia 7225.71
start iteration
done

In [23]:
# Create a Word / Index dictionary, mapping each vocabulary word to a cluster number                                                                                            
word_centroid_map = dict(zip(model.wv.index2word, idx))
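
With the 17,218-word vocabulary reported above, `int(17218 / 5)` gives 3,443 clusters. One way to sanity-check the clustering is to invert the word-to-cluster map and read the words in a few clusters. The sketch below does this on a toy example; `toy_words` and `toy_idx` are hypothetical stand-ins for `model.wv.index2word` and `idx`.

```python
from collections import defaultdict

# Hypothetical stand-ins for the vocabulary list and the k-means labels
toy_words = ["film", "movi", "flick", "great", "fantast", "terribl"]
toy_idx = [0, 0, 0, 1, 1, 2]

# Same construction as word_centroid_map: word -> cluster number
toy_word_centroid_map = dict(zip(toy_words, toy_idx))

# Invert the map: cluster number -> list of words in that cluster
clusters = defaultdict(list)
for word, cluster_id in toy_word_centroid_map.items():
    clusters[cluster_id].append(word)

for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

If the clusters are meaningful, the words printed together should be related, much like the nearest neighbors shown earlier.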

In [24]:
def create_bag_of_centroids(wordlist, word_centroid_map):
    
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max(word_centroid_map.values()) + 1
    
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros(num_centroids, dtype="float32")
    
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    
    return bag_of_centroids
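
Here is a small worked example of the same counting idea, written with plain lists so it runs without NumPy. The function name, the toy map, and the toy review are assumptions made for illustration. Words that are missing from the map are skipped, as in the function above.

```python
def bag_of_centroids_counts(wordlist, word_to_cluster, num_clusters):
    # One counter per cluster, all starting at zero
    counts = [0] * num_clusters
    for word in wordlist:
        cluster_id = word_to_cluster.get(word)
        # Words outside the vocabulary have no cluster and are ignored
        if cluster_id is not None:
            counts[cluster_id] += 1
    return counts

# Toy data, for illustration only
toy_map = {"film": 0, "movi": 0, "great": 1, "terribl": 2}
toy_review = ["great", "movi", "great", "film", "unseenword"]

features = bag_of_centroids_counts(toy_review, toy_map, num_clusters=3)
print(features)
```

Every review becomes a vector of the same length (one entry per cluster), regardless of how many words it contains, which is what the Random Forest needs as input.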

In [25]:
# Clean the training and testing reviews, remove stopwords.
clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords=True))
    
print("Training reviews are clean")  

clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_wordlist(review, remove_stopwords=True))
    
print("Testing reviews are clean") 





Training reviews are clean
Testing reviews are clean


In [26]:
clean_train_reviews[0]

['stuff',
 'go',
 'moment',
 'mj',
 've',
 'start',
 'listen',
 'music',
 ',',
 'watch',
 'odd',
 'documentari',
 ',',
 'watch',
 'wiz',
 'watch',
 'moonwalk',
 'mayb',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'realli',
 'cool',
 'eighti',
 'mayb',
 'make',
 'mind',
 'whether',
 'guilti',
 'innoc',
 'moonwalk',
 'part',
 'biographi',
 ',',
 'part',
 'featur',
 'film',
 'rememb',
 'go',
 'see',
 'cinema',
 'origin',
 'releas',
 'subtl',
 'messag',
 'mj',
 "'s",
 'feel',
 'toward',
 'press',
 'also',
 'obvious',
 'messag',
 'drug',
 'bad',
 "m'kay",
 'visual',
 'impress',
 'cours',
 'michael',
 'jackson',
 'unless',
 'remot',
 'like',
 'mj',
 'anyway',
 'go',
 'hate',
 'find',
 'bore',
 'may',
 'call',
 'mj',
 'egotist',
 'consent',
 'make',
 'movi',
 'mj',
 'fan',
 'would',
 'say',
 'made',
 'fan',
 'true',
 'realli',
 'nice',
 'actual',
 'featur',
 'film',
 'bit',
 'final',
 'start',
 '20',
 'minut',
 'exclud',
 'smooth',
 'crimin',
 'sequenc',
 'joe',
 'pesci',
 '

In [27]:
clean_test_reviews[0]

['natur',
 'film',
 "'s",
 'main',
 'theme',
 'mortal',
 ',',
 'nostalgia',
 ',',
 'loss',
 'innoc',
 'perhap',
 'surpris',
 'rate',
 'high',
 'older',
 'viewer',
 'younger',
 'one',
 'howev',
 'craftsmanship',
 'complet',
 'film',
 'anyon',
 'enjoy',
 'pace',
 'steadi',
 'constant',
 ',',
 'charact',
 'full',
 'engag',
 ',',
 'relationship',
 'interact',
 'natur',
 'show',
 'need',
 'flood',
 'tear',
 'show',
 'emot',
 ',',
 'scream',
 'show',
 'fear',
 ',',
 'shout',
 'show',
 'disput',
 'violenc',
 'show',
 'anger',
 'natur',
 'joyc',
 "'s",
 'short',
 'stori',
 'lend',
 'film',
 'readi',
 'made',
 'structur',
 'perfect',
 'polish',
 'diamond',
 ',',
 'small',
 'chang',
 'huston',
 'make',
 'inclus',
 'poem',
 'fit',
 'neat',
 'truli',
 'masterpiec',
 'tact',
 ',',
 'subtleti',
 'overwhelm',
 'beauti']

In [28]:
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros((train["review"].size, num_clusters), dtype="float32")

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids(review, word_centroid_map)
    counter += 1

print("Training reviews are complete.")    
    
# Repeat for test reviews 
test_centroids = np.zeros((test["review"].size, num_clusters), dtype="float32")

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids(review, word_centroid_map )
    counter += 1
    
print("Testing reviews are complete.")  

Training reviews are complete.
Testing reviews are complete.


In [29]:
# Split the data for testing
x_train, x_test, y_train, y_test = train_test_split(train_centroids,
                                                    train.sentiment,
                                                    test_size = 0.2,
                                                    random_state = 2)

In [32]:
# Use GridSearchCV to find the optimal parameters
parameters = {'n_estimators':[100, 200, 300],
              'max_depth':[1,3,5,7, None],
              'min_samples_leaf': [1,3,5],
              'verbose': [True]}

# Use Random Forest to make the predictions
forest = RandomForestClassifier()
grid = GridSearchCV(forest, parameters)
grid.fit(x_train, y_train)

[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:  1.2min finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.1s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.9s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:  1.1min finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.1s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.9s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:   58.2s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.0s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.8s finished
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:  1.6min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'verbose': [True], 'n_estimators': [300], 'max_depth': [None], 'min_samples_leaf': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [33]:
print("Best training score = ", grid.best_score_)

grid_predictions = grid.best_estimator_.predict(x_test)
grid_score = metrics.accuracy_score(y_test, grid_predictions) 

print("Accuracy: {0:f}".format(grid_score))

print("Best Parameters = ", grid.best_params_)

Best training score =  0.85565
Accuracy: 0.862400
Best Parameters =  {'verbose': True, 'n_estimators': 300, 'max_depth': None, 'min_samples_leaf': 1}


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    0.8s finished


We have a pretty good first result here: about 86.2% accuracy on the held-out 20% of the training data. The original code from the Kaggle tutorial scored about 84%, so it's nice that the changes made here improved on it.
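
For context on the grid search, `GridSearchCV` fits one model per parameter combination per cross-validation fold, and `best_score_` is the mean cross-validated score of the best combination, not the score on the training data itself. The sketch below counts the fits for the grid as written in the cell above. The fold count is an assumption: the default has been 3 or 5 depending on the scikit-learn version.

```python
from itertools import product

# Same grid as in the GridSearchCV cell
parameters = {
    "n_estimators": [100, 200, 300],
    "max_depth": [1, 3, 5, 7, None],
    "min_samples_leaf": [1, 3, 5],
    "verbose": [True],
}

# Every combination of one value per parameter
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

assumed_folds = 3  # assumption: depends on the scikit-learn version
total_fits = len(candidates) * assumed_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

This explains why the search takes a while: each of those fits trains a forest of up to 300 trees.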

In [34]:
forest = RandomForestClassifier(n_estimators = 300,
                                max_depth = None,
                                min_samples_leaf = 1, 
                                verbose = 2)

# Apply the Random Forest Model to the full training data.
forest = forest.fit(train_centroids,train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv("BagOfCentroids.csv", index=False, quoting=3)

building tree 1 of 300


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


building tree 2 of 300
building tree 3 of 300
building tree 4 of 300
building tree 5 of 300
building tree 6 of 300
building tree 7 of 300
building tree 8 of 300
building tree 9 of 300
building tree 10 of 300
building tree 11 of 300
building tree 12 of 300
building tree 13 of 300
building tree 14 of 300
building tree 15 of 300
building tree 16 of 300
building tree 17 of 300
building tree 18 of 300
building tree 19 of 300
building tree 20 of 300
building tree 21 of 300
building tree 22 of 300
building tree 23 of 300
building tree 24 of 300
building tree 25 of 300
building tree 26 of 300
building tree 27 of 300
building tree 28 of 300
building tree 29 of 300
building tree 30 of 300
building tree 31 of 300
building tree 32 of 300
building tree 33 of 300
building tree 34 of 300
building tree 35 of 300
building tree 36 of 300
building tree 37 of 300
building tree 38 of 300
building tree 39 of 300
building tree 40 of 300
building tree 41 of 300
building tree 42 of 300
building tree 43 of 300


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:  2.1min finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    3.6s finished


When I submitted the results to the Kaggle competition, the accuracy was 85.7%, which ranks near the middle of the pack.

## Model #2: Bag of Words

In [35]:
# Find the length of each training and testing review
review_lengths = []
for review in clean_train_reviews:
    review_lengths.append(len(review))
    
review_lengths_test = []
for review in clean_test_reviews:
    review_lengths_test.append(len(review))

In [36]:
# Change the lists to dataframes so that describe() can be used
review_lengths = pd.DataFrame(review_lengths)
review_lengths_test = pd.DataFrame(review_lengths_test)

In [37]:
# Print out a summary of the review lengths
print("Summary Training Reviews:")
print(review_lengths.describe())
print()
print("Summary Testing Reviews:")
print(review_lengths_test.describe())

Summary Training Reviews:
                  0
count  25000.000000
mean     141.511440
std      107.887134
min        4.000000
25%       74.000000
50%      105.000000
75%      173.000000
max     1571.000000

Summary Testing Reviews:
                  0
count  25000.000000
mean     138.179320
std      105.029206
min        6.000000
25%       74.000000
50%      103.000000
75%      168.000000
max     1464.000000


In [38]:
# Find the maximum number of words for a percentile
percentile = 90
print(np.percentile(review_lengths, percentile))
print(np.percentile(review_lengths_test, percentile))

281.0
272.0
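
The 90th percentile is used to pick a maximum review length that covers most reviews without being stretched by a few very long ones. The sketch below shows the calculation on made-up lengths, using the linear interpolation that `np.percentile` applies by default. The `percentile` function here is a hypothetical pure-Python stand-in.

```python
def percentile(values, pct):
    # Position of the requested percentile in the sorted data
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    # Interpolate linearly between the two neighboring values
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Made-up review lengths with one extreme outlier
toy_lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]

cutoff = percentile(toy_lengths, 90)
print(cutoff, max(toy_lengths))
```

In the toy data the cutoff is a tenth of the maximum, so padding every review to the cutoff wastes far less space than padding to the longest review.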


In [39]:
# Join the list of words to make more natural sentences
clean_train_reviews_sentences = [" ".join(review) for review in clean_train_reviews]
clean_test_reviews_sentences = [" ".join(review) for review in clean_test_reviews]

In [40]:
# Take a look at a review to ensure everything is alright
clean_train_reviews_sentences[0]

"stuff go moment mj ve start listen music , watch odd documentari , watch wiz watch moonwalk mayb want get certain insight guy thought realli cool eighti mayb make mind whether guilti innoc moonwalk part biographi , part featur film rememb go see cinema origin releas subtl messag mj 's feel toward press also obvious messag drug bad m'kay visual impress cours michael jackson unless remot like mj anyway go hate find bore may call mj egotist consent make movi mj fan would say made fan true realli nice actual featur film bit final start 20 minut exclud smooth crimin sequenc joe pesci convinc psychopath power drug lord want mj dead bad beyond mj overheard plan \\? nah , joe pesci 's charact rant want peopl know suppli drug etc dunno , mayb hate mj 's music lot cool thing like mj turn car robot whole speed demon sequenc also , director must patienc saint came film kiddi bad sequenc usual director hate work one kid let alon whole bunch perform complex danc scene bottom line , movi peopl like 

In [41]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(clean_train_reviews_sentences, 
                                                    train.sentiment, 
                                                    test_size = 0.2, 
                                                    random_state = 2)

In [42]:
# Process the reviews to limit the number of words and length

# max_document_length = maximum number of words in a review
# min_frequency = minimum number of times a word must be present to be used in the vocabulary
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length = 281,
                                                          min_frequency = 5)
x_train_transformed = np.array(list(vocab_processor.fit_transform(x_train)))
x_test_transformed = np.array(list(vocab_processor.transform(x_test)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 16625
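
To show what this preprocessing step does conceptually, here is a simplified sketch: rare words are dropped from the vocabulary, each remaining word gets an integer id, and every review is truncated or padded to a fixed length. This is not the `VocabularyProcessor` implementation. The function names, the toy documents, and the exact frequency threshold rule are assumptions, with id 0 standing for both unknown words and padding.

```python
from collections import Counter

def build_vocab(documents, min_frequency):
    # Count how often each word appears across all documents
    counts = Counter(word for doc in documents for word in doc.split())
    vocab = {}
    for doc in documents:
        for word in doc.split():
            # Assumption: keep words seen at least min_frequency times
            if counts[word] >= min_frequency and word not in vocab:
                vocab[word] = len(vocab) + 1  # id 0 is reserved
    return vocab

def encode(document, vocab, max_document_length):
    # Unknown words map to 0, long reviews are cut, short ones are padded with 0
    ids = [vocab.get(word, 0) for word in document.split()]
    ids = ids[:max_document_length]
    return ids + [0] * (max_document_length - len(ids))

# Toy documents, for illustration only
toy_docs = ["good movi good stori", "bad movi", "good act bad stori stori"]
toy_vocab = build_vocab(toy_docs, min_frequency=2)
encoded = [encode(doc, toy_vocab, max_document_length=4) for doc in toy_docs]

print(toy_vocab)
print(encoded)
```

The trailing zeros in the transformed review printed below come from this kind of padding.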


In [43]:
# Check to make sure everything looks okay
x_train_transformed[0]

array([  316,   845, 10973,   263,   597,   233,   694,    79,  2270,
         350,  1708,   494,  1822,    79,     5,    79,  6216,   854,
           0,   674,     5,  1369,     0,    34,    14,  1979,   221,
           6,   502,     3,   310,  1336,     0,   605, 16190,    79,
       10973,   263, 10973,     1,    55,    10,   502,     3,     6,
         494,  1822,    79,  2270,   350,  2580,   929,   793,   177,
        8210,     3, 16190,     1,  1331,  1290,    77,   356,     5,
         124,    83,    95,   210,     3,  9115, 10973,    10,   591,
         189,    82,  3264,   120,   177,    90,    32,    35,   558,
           5,   124,   214,   241,     6,    73,     2,    79,    14,
         175,  7112,    49,   248,  9367,    70,  2782,     0,  1843,
        7131,  1931,    34,   624,  6359,   711,    63,  8158,     3,
         721,   189, 10973,     1,   223,    12,    86,   263,   597,
        1263,   578,    63,   114, 16190, 10973,     0,     0,     0,
           0,     0,

In [44]:
print(x_train[0])
print()
print(len(x_train[0].split()))

particular joe mcdoak short subject obvious inspir star warner brother spectacular thank lucki star , one star wartim moral booster period one eddi cantor play would comedian 'd like break film except resembl cantor georg o'hanlon star mcdoak short mcdoak 's tri get break film like thank lucki star warner brother contract player free moment stroll film o'hanlon 's sent central cast small one line role world war film , lookalik mcdoak get messag poor guy nervous big moment , start think way deliv one line mayb sound like real movi star would help 86 take later exasper director ralph sanford patient clyde cook play british cockney soldier find nich film busi poor mcdoak 's worth see funni short subject nomin oscar find happen o'hanlon mcdoak

126


In [45]:
EMBEDDING_SIZE = 15

def bag_of_words_model(features, target):    
    # One-hot encode the target feature - positive and negative
    target = tf.one_hot(target, 2, 1, 0)  
    
    # Embed each word id and average the embeddings over the review.
    # vocab_size should match the vocabulary size reported by vocab_processor (n_words).
    features = tf.contrib.layers.bow_encoder(features, 
                                             vocab_size = n_words, 
                                             embed_dim = EMBEDDING_SIZE)  
    
    logits = tf.contrib.layers.fully_connected(features, 
                                               2, 
                                               activation_fn=None)  
    
    loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
    
    train_op = tf.contrib.layers.optimize_loss(loss, 
                                               tf.contrib.framework.get_global_step(),
                                               optimizer='Adam', 
                                               learning_rate=0.005)  
    
    return ({'class': tf.argmax(logits, 1), 
             'prob': tf.nn.softmax(logits)},      
            loss, train_op)
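
The model above is small: the bag-of-words encoder looks up an embedding for every word id and averages them, and a single fully connected layer turns that average into two class scores. The sketch below walks through the forward pass in plain Python. All numbers (embeddings, weights, word ids) are made up, the embedding size is 2 instead of 15, and no training is shown.

```python
import math

def bow_encode(word_ids, embeddings):
    # Average the embedding vectors of all word ids in the review
    dim = len(embeddings[0])
    total = [0.0] * dim
    for word_id in word_ids:
        for i, value in enumerate(embeddings[word_id]):
            total[i] += value
    return [value / len(word_ids) for value in total]

def softmax(logits):
    # Subtract the maximum for numerical stability, then normalize
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    return [e / sum(exps) for e in exps]

# Made-up parameters: 4 word ids, embedding size 2, 2 output classes
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
toy_weights = [[2.0, -2.0], [-1.0, 1.0]]  # rows: embedding dims, columns: classes
toy_bias = [0.0, 0.0]

review_ids = [1, 1, 2, 0]  # the trailing 0 plays the role of padding
features = bow_encode(review_ids, toy_embeddings)
logits = [sum(features[i] * toy_weights[i][c] for i in range(2)) + toy_bias[c]
          for c in range(2)]
probs = softmax(logits)

print(features, logits, probs)
```

Because the embeddings are averaged, word order is ignored, which is the defining property of a bag-of-words model.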

In [46]:
# Set classifier as bag_of_words_model
classifier_bow = learn.Estimator(model_fn = bag_of_words_model) 
# Train model
classifier_bow.fit(x_train_transformed, y_train, steps=1000) 





2017-02-24 15:23:33,252 : INFO : Using default config.

2017-02-24 15:23:33,254 : INFO : Using config: {'save_checkpoints_steps': None, 'save_checkpoints_secs': 600, '_is_chief': True, '_master': '', 'tf_random_seed': None, 'save_summary_steps': 100, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1f8d6fd30>, '_task_id': 0, '_environment': 'local', 'keep_checkpoint_every_n_hours': 10000, '_task_type': None, '_evaluation_master': '', 'tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_num_ps_replicas': 0, 'keep_checkpoint_max': 5}

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))

2017-02-24 15:23:37,729 : INFO : Create CheckpointSaverHook.

2017-02-24 15:23:41,052 : INFO : loss = 0.693168, step = 1

2017-02-24 15:23:41,053 : INFO : Saving checkpoints for 1 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpncljsidi/model.ckpt.

2017-02-24 15:25:13,504 : INFO : loss = 0.254101, step = 101

2017-02-24 15:25:13,506 : INFO : global_step/sec: 1.08164

2017-02-24 15:26:39,230 : INFO : loss = 0.116709, step = 201

2017-02-24 15:26:39,232 : INFO : global_step/sec: 1.16651

2017-02-24 15:28:05,628 : INFO : loss = 0.0620678, step = 301

2017-02-24 15:28:05,629 : INFO : global_step/sec: 1.15744

2017-02-24 15:29:31,003 : INFO : loss = 0.0354963, step = 401

2017-02-24 15:29:31,004 : INFO : global_step/sec: 1.17131

2017-02-24 15:30:56,350 : INFO : loss = 0.0217942, step = 501

2017-02-24 15:30:56,351 : INFO : global_step/sec: 1.17169

2017-02-24 15:32:21,959 : INFO : loss = 0.0143387, step = 601

2017-02-24 15:32:21,961 : INFO : global_step/sec: 1.16809

2017-02-24 15:33:41,346 : INFO : Saving checkpoints for 694 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpncljsidi/model.ckpt.

2017-02-24 15:33:47,924 : INFO : loss = 0.0100055, step = 701

2017-02-24 15:33:47,925 : INFO : global_step/sec: 1.16327

2017-02-24 15:35:25,411 : INFO : loss = 0.00731796, step = 801

2017-02-24 15:35:25,413 : INFO : global_step/sec: 1.02577

2017-02-24 15:37:18,420 : INFO : loss = 0.00555508, step = 901

2017-02-24 15:37:18,422 : INFO : global_step/sec: 0.884885

2017-02-24 15:39:07,190 : INFO : Saving checkpoints for 1000 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpncljsidi/model.ckpt.

2017-02-24 15:39:07,852 : INFO : Loss for final step: 0.00435374.


Estimator(params=None)
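
The first logged loss, 0.693168, is close to ln 2 ≈ 0.693147. That is the softmax cross-entropy of a two-class model that assigns equal probability to both classes, which is roughly what an untrained network does. The sketch below checks that arithmetic in plain Python for a single example. The function name is hypothetical, and this is not a reproduction of TensorFlow's batched loss.

```python
import math


def softmax_cross_entropy(logits, one_hot_target):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))


# Equal logits give a probability of 0.5 per class, so the loss is ln 2.
untrained_loss = softmax_cross_entropy([0.0, 0.0], [0, 1])

# A confident, correct prediction gives a loss close to zero.
confident_loss = softmax_cross_entropy([-4.0, 4.0], [0, 1])

print(untrained_loss, confident_loss)
```

Seen this way, the falling loss values in the log track how far the model's predicted probabilities on the training batches have moved from a coin flip towards the correct labels.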

In [47]:
bow_predictions = [p['class'] for p in classifier_bow.predict(x_test_transformed, as_iterable=True)] 
score = metrics.accuracy_score(y_test, bow_predictions) 
print("Accuracy: {0:f}".format(score))

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))

2017-02-24 15:39:09,088 : INFO : Loading model from checkpoint: /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpncljsidi/model.ckpt-1000-?????-of-00001.


Accuracy: 0.850600


The accuracy of 85.06% is similar to that of the Bag of Centroids model. Let's see how it compares when we train on all of the labeled data.

In [48]:
# Process the full data set to make final predictions
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length = 281,
                                                          min_frequency = 5)
x_train_all = np.array(list(vocab_processor.fit_transform(clean_train_reviews_sentences)))
x_test_all = np.array(list(vocab_processor.transform(clean_test_reviews_sentences)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 18307
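
As a rough illustration of what the vocabulary step does, the sketch below maps words to integer ids, sends rare words to a reserved id 0, and truncates or pads every document to a fixed length. It is a simplified stand-in written in plain Python, not the `VocabularyProcessor` implementation. The function names, the toy documents, and the exact cut-off rule (`count >= min_frequency`) are assumptions, and the library's boundary condition may differ.

```python
from collections import Counter


def fit_vocabulary(documents, min_frequency):
    """Map each sufficiently frequent word to an integer id; 0 is reserved."""
    counts = Counter(word for doc in documents for word in doc.split())
    vocab = {"<UNK>": 0}
    for word, count in sorted(counts.items()):
        if count >= min_frequency:  # assumed cut-off rule
            vocab[word] = len(vocab)
    return vocab


def transform(document, vocab, max_document_length):
    """Convert a document to a fixed-length list of ids (truncate or pad with 0)."""
    ids = [vocab.get(word, 0) for word in document.split()]
    ids = ids[:max_document_length]
    return ids + [0] * (max_document_length - len(ids))


docs = ["great movi great cast", "bad movi", "great fun"]
vocab = fit_vocabulary(docs, min_frequency=2)
encoded = [transform(d, vocab, max_document_length=3) for d in docs]

print(vocab)
print(encoded)
```

Raising `min_frequency` shrinks the vocabulary and lowering `max_document_length` shortens every input row, so both settings reduce the size of the network's input at the cost of discarding some information.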


In [49]:
# Create a fresh learn.Estimator so that training starts from scratch. 
# Calling fit() again on the old one would continue from its trained weights ('double training').
classifier_bow = learn.Estimator(model_fn = bag_of_words_model) 
classifier_bow.fit(x_train_all, train.sentiment, steps=1000) 

result_bow = [p['class'] for p in classifier_bow.predict(x_test_all, as_iterable=True)] 

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result_bow})
output.to_csv("BagOfWords.csv", index=False, quoting=3)





2017-02-24 15:39:26,462 : INFO : Using default config.

2017-02-24 15:39:26,464 : INFO : Using config: {'save_checkpoints_steps': None, 'save_checkpoints_secs': 600, '_is_chief': True, '_master': '', 'tf_random_seed': None, 'save_summary_steps': 100, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x21e306588>, '_task_id': 0, '_environment': 'local', 'keep_checkpoint_every_n_hours': 10000, '_task_type': None, '_evaluation_master': '', 'tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_num_ps_replicas': 0, 'keep_checkpoint_max': 5}

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))

2017-02-24 15:39:28,001 : INFO : Create CheckpointSaverHook.

2017-02-24 15:39:36,256 : INFO : loss = 0.693106, step = 1

2017-02-24 15:39:36,258 : INFO : Saving checkpoints for 1 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpheckb5hr/model.ckpt.

2017-02-24 15:42:07,473 : INFO : loss = 0.261839, step = 101

2017-02-24 15:42:07,474 : INFO : global_step/sec: 0.661304

2017-02-24 15:44:38,785 : INFO : loss = 0.127082, step = 201

2017-02-24 15:44:38,792 : INFO : global_step/sec: 0.660866

2017-02-24 15:46:59,225 : INFO : loss = 0.0718937, step = 301

2017-02-24 15:46:59,227 : INFO : global_step/sec: 0.712068

2017-02-24 15:49:31,387 : INFO : loss = 0.043267, step = 401

2017-02-24 15:49:31,395 : INFO : global_step/sec: 0.657176

2017-02-24 15:49:37,388 : INFO : Saving checkpoints for 405 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpheckb5hr/model.ckpt.

2017-02-24 15:52:14,244 : INFO : loss = 0.0273597, step = 501

2017-02-24 15:52:14,247 : INFO : global_step/sec: 0.614051

2017-02-24 15:54:55,048 : INFO : loss = 0.018212, step = 601

2017-02-24 15:54:55,051 : INFO : global_step/sec: 0.621875

2017-02-24 15:57:23,882 : INFO : loss = 0.0127388, step = 701

2017-02-24 15:57:23,884 : INFO : global_step/sec: 0.671893

2017-02-24 15:59:37,657 : INFO : Saving checkpoints for 790 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpheckb5hr/model.ckpt.

2017-02-24 16:00:01,916 : INFO : loss = 0.00930042, step = 801

2017-02-24 16:00:01,918 : INFO : global_step/sec: 0.632776

2017-02-24 16:02:53,548 : INFO : loss = 0.00703519, step = 901

2017-02-24 16:02:53,557 : INFO : global_step/sec: 0.582623

2017-02-24 16:05:28,220 : INFO : Saving checkpoints for 1000 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpheckb5hr/model.ckpt.

2017-02-24 16:05:29,604 : INFO : Loss for final step: 0.00549149.

2017-02-24 16:05:30,840 : INFO : Loading model from checkpoint: /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpheckb5hr/model.ckpt-1000-?????-of-00001.


On the Kaggle submission, the accuracy drops to 83.7%.
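
The `quoting=3` argument used when writing the submission file corresponds to `csv.QUOTE_NONE` in Python's standard `csv` module, so no field is wrapped in quotation marks. Here is a minimal sketch of the resulting file layout using only the standard library. The ids and predictions are made up for illustration and merely stand in for `test["id"]` and `result_bow`.

```python
import csv
import io

# Hypothetical ids and predictions, not taken from the real test set.
ids = ["101_1", "102_9", "103_4"]
predictions = [0, 1, 1]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerow(["id", "sentiment"])       # header row
writer.writerows(zip(ids, predictions))    # one row per review

submission_text = buffer.getvalue()
print(submission_text)
```

Writing to an in-memory buffer keeps the example free of side effects. Swapping `buffer` for an open file handle would produce the same content on disk.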

## Model #3: LSTM

In [60]:
EMBEDDING_SIZE = 25
LSTM_SIZE = 25
number_of_layers = 3  # Not used below: the model has a single LSTM layer.

def rnn_model(features, target):  
    """RNN model to predict from sequence of words to a class."""  
    
    # Convert indexes of words into embeddings.
    # This creates an embeddings matrix of shape [n_words, EMBEDDING_SIZE].
    # vocab_size is hard-coded to n_words for the full training set (9600).
    word_vectors = tf.contrib.layers.embed_sequence(features, 
                                                    vocab_size = 9600, #n_words, 
                                                    embed_dim = EMBEDDING_SIZE)   
    
    # Split into list of embeddings per word, while removing doc length dim.
    word_list = tf.unstack(word_vectors, axis=1)
    
    # Create a Long Short Term Memory cell with hidden size of LSTM_SIZE.
    cell = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE, state_is_tuple=False)
    
    # Create an unrolled recurrent neural network of length
    # max_document_length and pass word_list as the inputs for each unit.
    _, encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)   

    target = tf.one_hot(target, 2, 1, 0)
    logits = tf.contrib.layers.fully_connected(encoding, 2, activation_fn=None)  
    loss = tf.contrib.losses.softmax_cross_entropy(logits, target)   
    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(loss, 
                                               tf.contrib.framework.get_global_step(),      
                                               optimizer='Adam', 
                                               learning_rate=0.005, 
                                               clip_gradients=1.0)   
    return ({'class': tf.argmax(logits, 1), 
             'prob': tf.nn.softmax(logits)},      
             loss, train_op)
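
For readers new to LSTMs, the sketch below walks through the standard LSTM update for one word at a time in plain Python, then unrolls it over a short document. It is a teaching aid under assumptions rather than TensorFlow's implementation. The sizes, random weights, and names (`make_gate`, `lstm_step`) are hypothetical, bias terms other than the forget bias are omitted, and the `forget_bias=1.0` default and the `[cell, hidden]` order of the final state are what I understand `BasicLSTMCell` with `state_is_tuple=False` to use.

```python
import math
import random

random.seed(1)

INPUT_SIZE = 3    # hypothetical embedding size (the notebook uses 25)
HIDDEN_SIZE = 2   # hypothetical LSTM size (the notebook uses 25)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_gate():
    """Random weights for one gate: one row per hidden unit over [input, hidden]."""
    return [[random.uniform(-0.5, 0.5) for _ in range(INPUT_SIZE + HIDDEN_SIZE)]
            for _ in range(HIDDEN_SIZE)]


gates = {name: make_gate() for name in ("input", "forget", "candidate", "output")}


def lstm_step(x, h, c, forget_bias=1.0):
    """Advance the cell by one word and return the new hidden and cell states."""
    xh = x + h  # concatenate the current input with the previous hidden state
    pre = {name: [sum(w * v for w, v in zip(row, xh)) for row in rows]
           for name, rows in gates.items()}
    new_h, new_c = [], []
    for k in range(HIDDEN_SIZE):
        i = sigmoid(pre["input"][k])                 # how much new input to write
        f = sigmoid(pre["forget"][k] + forget_bias)  # how much old memory to keep
        g = math.tanh(pre["candidate"][k])           # candidate memory content
        o = sigmoid(pre["output"][k])                # how much memory to expose
        c_k = f * c[k] + i * g
        new_c.append(c_k)
        new_h.append(o * math.tanh(c_k))
    return new_h, new_c


# Unroll over a short "document" of three embedded words.
document = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.4, 0.4, 0.4]]
h, c = [0.0] * HIDDEN_SIZE, [0.0] * HIDDEN_SIZE
for word_vector in document:
    h, c = lstm_step(word_vector, h, c)

encoding = c + h  # final state: cell state followed by hidden state
print(encoding)
```

Unlike the bag-of-words encoder, the state after each word depends on everything that came before it, so word order affects the final `encoding` that the output layer sees.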

In [51]:
# Process the reviews again with shorter documents and a smaller vocabulary,
# because my laptop will crash if the network is too large.
# If you are using a GPU or have more than 8GB of RAM, you should be able to use more data for training.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length = 150,
                                                          min_frequency = 20)
x_train_transformed = np.array(list(vocab_processor.fit_transform(x_train)))
x_test_transformed = np.array(list(vocab_processor.transform(x_test)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 8482


In [57]:
classifier_rnn = learn.Estimator(model_fn = rnn_model) 
classifier_rnn.fit(x_train_transformed, y_train, steps = 500) 

predictions_rnn = [p['class'] for p in classifier_rnn.predict(x_test_transformed, as_iterable=True)] 
score = metrics.accuracy_score(y_test, predictions_rnn) 
print("Accuracy: {0:f}".format(score))

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))

2017-02-24 18:37:31,277 : INFO : Loading model from checkpoint: /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpvggc98tl/model.ckpt-500-?????-of-00001.


Accuracy: 0.832200


The testing accuracy for the LSTM model is the lowest, at 83.22%. I expect this is due to the smaller amount of data used to train this neural network: reviews are capped at 150 words (max_document_length = 150) and rarer words are dropped (min_frequency = 20).

In [58]:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length = 150,
                                                          min_frequency = 20)
x_train_all_rnn = np.array(list(vocab_processor.fit_transform(clean_train_reviews_sentences)))
x_test_all_rnn = np.array(list(vocab_processor.transform(clean_test_reviews_sentences)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 9600


In [61]:
classifier_rnn = learn.Estimator(model_fn = rnn_model) 
classifier_rnn.fit(x_train_all_rnn, train.sentiment, steps=500) 

result_rnn = [p['class'] for p in classifier_rnn.predict(x_test_all_rnn, as_iterable=True)] 

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result_rnn})
output.to_csv("rnn_predictions.csv", index=False, quoting=3)





2017-02-24 18:39:50,737 : INFO : Using default config.

2017-02-24 18:39:50,739 : INFO : Using config: {'save_checkpoints_steps': None, 'save_checkpoints_secs': 600, '_is_chief': True, '_master': '', 'tf_random_seed': None, 'save_summary_steps': 100, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x187de77b8>, '_task_id': 0, '_environment': 'local', 'keep_checkpoint_every_n_hours': 10000, '_task_type': None, '_evaluation_master': '', 'tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_num_ps_replicas': 0, 'keep_checkpoint_max': 5}

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))

2017-02-24 18:40:53,643 : INFO : Create CheckpointSaverHook.

2017-02-24 18:42:23,591 : INFO : loss = 0.693123, step = 1

2017-02-24 18:42:23,601 : INFO : Saving checkpoints for 1 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 18:52:35,953 : INFO : Saving checkpoints for 43 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:02:52,174 : INFO : Saving checkpoints for 81 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:08:11,739 : INFO : loss = 0.30135, step = 101

2017-02-24 19:08:11,769 : INFO : global_step/sec: 0.0645934

2017-02-24 19:13:05,282 : INFO : Saving checkpoints for 122 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:23:17,112 : INFO : Saving checkpoints for 162 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:32:46,876 : INFO : loss = 0.00452801, step = 201

2017-02-24 19:32:46,883 : INFO : global_step/sec: 0.0677903

2017-02-24 19:33:28,787 : INFO : Saving checkpoints for 204 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:43:41,309 : INFO : Saving checkpoints for 246 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:53:53,378 : INFO : Saving checkpoints for 288 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 19:57:23,034 : INFO : loss = 0.000141494, step = 301

2017-02-24 19:57:23,041 : INFO : global_step/sec: 0.0677434

2017-02-24 20:04:01,950 : INFO : Saving checkpoints for 327 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:14:16,240 : INFO : Saving checkpoints for 356 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:24:22,537 : INFO : Saving checkpoints for 386 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:29:20,676 : INFO : loss = 3.63579e-05, step = 401

2017-02-24 20:29:20,684 : INFO : global_step/sec: 0.0521474

2017-02-24 20:34:37,717 : INFO : Saving checkpoints for 417 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:44:45,695 : INFO : Saving checkpoints for 454 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:54:49,834 : INFO : Saving checkpoints for 497 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:55:51,034 : INFO : Saving checkpoints for 500 into /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt.

2017-02-24 20:56:10,961 : INFO : Loss for final step: 2.13439e-05.

2017-02-24 20:56:42,182 : INFO : Loading model from checkpoint: /var/folders/h3/j2h_850j5klb8yns26kmxqfw0000gp/T/tmpn02idq1e/model.ckpt-500-?????-of-00001.


Given its result on the testing data, it was not too surprising to see that the LSTM model also scored the worst on the Kaggle submission, at 81.3%. It would be interesting to see how much the score would improve if we were able to use more data in the model.