# Predicting Movie Scores from IMDB Reviews

### Background & Overview
In this project I use IMDB review data to predict a sentiment label (0 for negative, 1 for positive) for each review. The data set contains 25,000 labeled reviews covering many different movies. The analysis uses a bag-of-words model, meaning word order is ignored: CountVectorizer, scikit-learn's bag-of-words tool, transforms each review into a vector of word counts. The resulting matrix has one row per review and one column per term in the vocabulary, so without a cap it would be as wide as the number of distinct words in the entire corpus (the code below limits this with `max_features`).
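
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the kind of count matrix described above. It is not scikit-learn's implementation; the helper name `bag_of_words` and the two toy reviews are made up for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents.

    Hypothetical helper for illustration only; CountVectorizer does the
    equivalent job in the real pipeline, with its own tokenization rules.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct word in the whole corpus, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document; each entry is how often that word occurs.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["great movie great cast", "terrible movie"]  # toy reviews (assumed)
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['cast', 'great', 'movie', 'terrible']
print(matrix)      # [[1, 2, 1, 0], [0, 0, 1, 1]]
```

Each row has the same length no matter how long the review is, which is what lets a classifier consume the reviews as a fixed-width feature matrix.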



### Coding Outline
1) Import Data

2) Clean (Stem/Lemmatize) the data

3) Run classification algorithms and compare their performance on training versus test data

4) Tune Hyperparameters to achieve highest performance

### Credit
This project is inspired by a homework project I did for my DataX class. In places the code uses cleaning techniques and methods written by the GSIs for the class. I have added a comment above each method that was written by, or heavily inspired by, my DataX TA's work.


In [1]:
#Load the labeled training data into a dataframe and suppress warnings
from __future__ import print_function, division, absolute_import
import matplotlib.pyplot as plt
# Remove warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd       
train = pd.read_csv("Movie Review Data/labeledTrainData.tsv", header=0, \
                    delimiter="\t", quoting=3)

train.head()

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In this next part, we set up the tools used to clean the review text in the training data.

In [2]:
# import packages

import bs4 as bs
import nltk

nltk.download('all')
from nltk.tokenize import sent_tokenize # tokenizes sentences
import re

from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

#making the stopwords a set improves computational performance
eng_stopwords = set(stopwords.words("english"))


In [3]:
# This cleaning method was written primarily by my DataX TA. I have modified it to fit my project.

ps = PorterStemmer()
wnl = WordNetLemmatizer()


def review_cleaner(reviews, lemmatize=True, stem=False):
    '''
    Clean and preprocess a collection of reviews.

    1. Remove HTML tags
    2. Use regex to find emoticons so they survive punctuation removal
    3. Use regex to remove all special characters (only keep letters)
    4. Make strings lower case and tokenize / word split reviews
    5. Remove English stopwords, then lemmatize or stem the remaining words
    6. Rejoin to one string, with any emoticons appended

    Returns a list of cleaned review strings, one per input review.
    '''
    cleaned_reviews = []
    # Iterate over the reviews passed in, not the global training frame
    for i, review in enumerate(reviews):
        # print cleaning progress
        if (i + 1) % 10000 == 0:
            print("Done with %d reviews" % (i + 1))

        #1. Remove HTML tags (parser named explicitly to avoid a bs4 warning)
        review = bs.BeautifulSoup(review, "html.parser").text

        #2. Use regex to find emoticons
        emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P|O)', review)

        #3. Remove punctuation
        review = re.sub("[^a-zA-Z]", " ", review)

        #4. Tokenize into words (all lower case)
        review = review.lower().split()

        #5. Remove stopwords (eng_stopwords is the set built in the cell above)
        clean_review = []
        for word in review:
            if word not in eng_stopwords:
                if lemmatize:
                    word = wnl.lemmatize(word)
                elif stem:
                    if word == 'oed':
                        continue
                    word = ps.stem(word)
                clean_review.append(word)

        #6. Join the review to one sentence
        review_processed = ' '.join(clean_review + emoticons)
        cleaned_reviews.append(review_processed)

    return cleaned_reviews
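
To see what the regex steps of the cleaner do to a single review, here is a small standard-library sketch. It deliberately skips the HTML stripping and the lemmatizing/stemming steps, and `TOY_STOPWORDS` and `clean_one` are hypothetical stand-ins (the real code uses NLTK's English stopword list), so treat it as an illustration rather than a copy of the method above.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {"the", "was", "a", "but", "i", "it"}

# Same emoticon pattern as in the cleaning function, written as a raw string.
EMOTICON_PATTERN = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P|O)')


def clean_one(review):
    """Apply the emoticon, punctuation, lower-casing and stopword steps."""
    # Capture emoticons first; the next step would otherwise erase them.
    emoticons = EMOTICON_PATTERN.findall(review)
    # Replace every non-letter with a space, then lower-case and split.
    words = re.sub("[^a-zA-Z]", " ", review).lower().split()
    kept = [word for word in words if word not in TOY_STOPWORDS]
    # Emoticons are appended at the end, as in the full cleaner.
    return " ".join(kept + emoticons)


print(clean_one("The plot was thin, but I loved it :-)"))  # plot thin loved :-)
```

Collecting the emoticons before removing punctuation is the key ordering detail: once the non-letter characters are replaced by spaces, `:-)` can no longer be recovered.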

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics # for accuracy score etc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras import layers


np.random.seed(0)

#The following method, originally written by a DataX TA, is modified heavily by me.
def train_predict_sentiment_keras(cleaned_reviews, y=train["sentiment"],ngram=1,max_features=1000):
    '''This function will:
    1. split data into train and test set.
    2. get n-gram counts from cleaned reviews 
    3. train a small Keras neural network using train n-gram counts and y (labels)
    4. evaluate the model on the test split
    5. print accuracy of sentiment prediction on test and training data
    6. return the Keras training history
    '''

    print("Creating the bag of words model!\n")
    # Max features allows for us to limit the 'column' count of the resulting matrix.
    vectorizer = CountVectorizer(ngram_range=(1, ngram), analyzer = "word", max_features = max_features) 
    
    X_train, X_test, y_train, y_test = train_test_split(\
    cleaned_reviews, y, random_state=0, test_size=.2)

    # Then we use fit_transform() to fit the model / learn the vocabulary,
    # then transform the data into feature vectors.
    train_bag = vectorizer.fit_transform(X_train).toarray()
    test_bag = vectorizer.transform(X_test).toarray()

    print("Training the model!\n")
    #Small neural network model on keras
    model = Sequential()
    input_dim = train_bag.shape[1]
    model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    history = model.fit(train_bag, y_train, epochs=20, verbose=False, validation_data=(test_bag, y_test),batch_size=10)
    
    # Evaluate on both splits; a large gap between the two accuracies suggests overfitting.
    loss, accuracy = model.evaluate(train_bag, y_train, verbose=False)
    print("Training Accuracy: {:.4f}".format(accuracy))
    loss, accuracy = model.evaluate(test_bag, y_test, verbose=False)
    print("Testing Accuracy:  {:.4f}".format(accuracy))
    return history


Using TensorFlow backend.


In [5]:
np.random.seed(0)

def train_predict_sentiment(model, cleaned_reviews, y=train["sentiment"],ngram=1,max_features=1000):
    '''This function will:
    1. split data into train and test set.
    2. get n-gram counts from cleaned reviews 
    3. train the supplied classifier (model) using train n-gram counts and y (labels)
    4. test the model on your test split
    5. print accuracy of sentiment prediction on test and training data
    6. print confusion matrix on test data results

    To change n-gram type, set value of ngram argument
    To change the number of features you want the CountVectorizer to generate, set the value of max_features argument'''

    print("Creating the bag of words model!\n")
    # CountVectorizer is scikit-learn's bag of words tool; here we spell out more of its keyword arguments
    vectorizer = CountVectorizer(ngram_range=(1, ngram),analyzer = "word",   \
                                 tokenizer = None,    \
                                 preprocessor = None, \
                                 stop_words = None,   \
                                 max_features = max_features) 
    
    X_train, X_test, y_train, y_test = train_test_split(\
    cleaned_reviews, y, random_state=0, test_size=.2)

    # Then we use fit_transform() to fit the model / learn the vocabulary,
    # then transform the data into feature vectors.
    # The input should be a list of strings. .toarray() converts to a numpy array
    
    train_bag = vectorizer.fit_transform(X_train).toarray()
    test_bag = vectorizer.transform(X_test).toarray()
#     print('TOP 20 FEATURES ARE: ',(vectorizer.get_feature_names()[:20]))


    print("Training the classifier!\n")

    # Fit the supplied classifier to the training set, using the bag of words as 
    # features and the sentiment labels as the target variable
    model = model.fit(train_bag, y_train)


    train_predictions = model.predict(train_bag)
    test_predictions = model.predict(test_bag)
    
    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    print(" The training accuracy is: ", train_acc, "\n", "The validation accuracy is: ", valid_acc)
    print()
    print('CONFUSION MATRIX:')
    print('         Predicted')
    print('          neg pos')
    print(' Actual')
    c=confusion_matrix(y_test, test_predictions)
    print('     neg  ',c[0])
    print('     pos  ',c[1])
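
The `ngram` argument above is passed to CountVectorizer as `ngram_range=(1, ngram)`, so `ngram=2` asks for single words and adjacent word pairs. The sketch below shows roughly which terms that produces for one cleaned review. `word_ngrams` is a hypothetical helper, and it only approximates the vectorizer: by default CountVectorizer also lower-cases the text and drops single-character tokens.

```python
def word_ngrams(text, max_n):
    """List every word n-gram of length 1..max_n, shortest first.

    Illustrative approximation of ngram_range=(1, max_n); not the
    scikit-learn implementation.
    """
    tokens = text.split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the review.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("not good movie", 1))  # ['not', 'good', 'movie']
print(word_ngrams("not good movie", 2))
# ['not', 'good', 'movie', 'not good', 'good movie']
```

Bigrams such as "not good" keep a little local word order that a pure unigram model cannot represent, at the cost of a much larger vocabulary, which is why `max_features` is raised alongside `ngram` in the later cells.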

In [6]:
original_clean_reviews=review_cleaner(train['review'],lemmatize=True,stem=False)

Done with 10000 reviews
Done with 20000 reviews


In [7]:
h = train_predict_sentiment_keras(cleaned_reviews=original_clean_reviews, y=train["sentiment"],ngram=1,max_features=1000)

Creating the bag of words model!

Training the model!

Training Accuracy: 0.9883
Testing Accuracy:  0.8300


In [28]:
#Training the model on a logistic regression classifier
train_predict_sentiment(LogisticRegression(tol = 1e-7, solver = 'lbfgs'), cleaned_reviews=original_clean_reviews, y=train["sentiment"],ngram=2,max_features=2000)

Creating the bag of words model!

Training the classifier!

 The training accuracy is:  0.9078 
 The validation accuracy is:  0.8646

CONFUSION MATRIX:
         Predicted
          neg pos
 Actual
     neg   [2191  357]
     pos   [ 320 2132]
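
As a sanity check, the validation accuracy can be recomputed from the confusion matrix printed above (rows are actual labels, columns are predicted labels). The helper `binary_metrics` is a hypothetical name of mine; precision and recall for the positive class are included as a sketch of what else the matrix can tell us.

```python
def binary_metrics(confusion):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    Expects [[tn, fp], [fn, tp]]: rows are actual neg/pos and columns are
    predicted neg/pos, matching the printed layout.
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # of predicted positives, how many were right
        "recall": tp / (tp + fn),     # of actual positives, how many were found
    }


# Confusion matrix reported for the logistic regression run.
logistic = binary_metrics([[2191, 357], [320, 2132]])
print("accuracy:  {:.4f}".format(logistic["accuracy"]))  # accuracy:  0.8646
print("precision: {:.4f}".format(logistic["precision"]))
print("recall:    {:.4f}".format(logistic["recall"]))
```

The four cells sum to 5,000 reviews, which is the 20% test split, and (2191 + 2132) / 5000 reproduces the reported 0.8646.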


In [29]:
#Training the model on a random forest classifier
train_predict_sentiment(RandomForestClassifier(), cleaned_reviews=original_clean_reviews, y=train["sentiment"],ngram=2,max_features=1500)

Creating the bag of words model!

Training the classifier!

 The training accuracy is:  0.9929 
 The validation accuracy is:  0.7722

CONFUSION MATRIX:
         Predicted
          neg pos
 Actual
     neg   [2089  459]
     pos   [ 680 1772]
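
One way to read the three runs side by side is to look at the gap between training accuracy and held-out accuracy, using only the numbers printed in the outputs above. This is a rough sketch rather than a controlled comparison, because the runs used different `ngram` and `max_features` settings; the dictionary keys are just labels I chose.

```python
# (training accuracy, held-out accuracy) copied from the cell outputs above.
reported = {
    "keras dense network": (0.9883, 0.8300),
    "logistic regression": (0.9078, 0.8646),
    "random forest": (0.9929, 0.7722),
}

# A larger gap means the model fits the training split much better
# than it generalizes to the held-out split.
gaps = {
    name: round(train_acc - test_acc, 4)
    for name, (train_acc, test_acc) in reported.items()
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print("{}: gap = {:.4f}".format(name, gap))
```

On these numbers, logistic regression has both the smallest gap and the highest held-out accuracy, while the random forest with default settings shows the largest gap, which makes it a natural first target for hyperparameter tuning.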
