# IMDB Sentiment Analysis

The data is split evenly with 25k reviews intended for training and 25k for testing your classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. To label the reviews, the curator of the data marked anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.

**Import the required libraries**

In [5]:
import numpy as np
import pandas as pd
import os
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

**Load Data**

In [6]:
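# Read one review per line; the context manager closes each file when done.
with open('./movie_data 2/full_train.txt', 'r') as f:
    reviews_train = [line.strip() for line in f]

with open('./movie_data 2/full_test.txt', 'r') as f:
    reviews_test = [line.strip() for line in f]

# Label the first 12,500 reviews as positive (1) and the remaining 12,500 as negative (0).
target = [1 if i < 12500 else 0 for i in range(25000)]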

The raw text of these reviews is pretty messy, so before we can do any analytics we need to clean things up.


**Use regular expressions to remove the non-text characters and the HTML tags**

In [7]:
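import re

# Raw strings keep the backslashes intact for the regex engine.
# Punctuation and digits are deleted outright.
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
# Double line-break tags, hyphens and slashes become a space so neighbouring words stay separate.
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
NO_SPACE = ""
SPACE = " "

def preprocess_reviews(reviews):
    # Lowercase and strip punctuation first, then replace separators with spaces.
    reviews = [REPLACE_NO_SPACE.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(SPACE, line) for line in reviews]
    return reviews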

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

# Train a Baseline Model

Train a Logistic Regression model after transforming the data with CountVectorizer.
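The cell for this step isn't shown here, so what follows is only a minimal sketch of how it could look, not the original code. It assumes binary bag-of-words features and the same C values that appear in the output. The `sweep_c` helper and the eight-review toy corpus are made up so the block runs on its own; on the real data you would pass `reviews_train_clean` and `target` instead, and the accuracies would of course differ from the toy ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for reviews_train_clean / target (assumption for illustration).
reviews = ["great movie loved it", "wonderful acting great story",
           "loved the wonderful cast", "great fun loved every minute",
           "terrible movie hated it", "awful acting boring story",
           "hated the awful script", "boring plot terrible ending"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def sweep_c(vectorizer, texts, y, c_values=(0.01, 0.05, 0.25, 0.5, 1)):
    """Hypothetical helper: fit one model per C on a 75/25 split, return {C: accuracy}."""
    X = vectorizer.fit_transform(texts)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.75, random_state=0, stratify=y)
    scores = {}
    for c in c_values:
        model = LogisticRegression(C=c)
        model.fit(X_train, y_train)
        scores[c] = accuracy_score(y_val, model.predict(X_val))
    return scores

# binary=True records only whether a word occurs, not how often.
scores = sweep_c(CountVectorizer(binary=True), reviews, labels)
for c, acc in scores.items():
    print(f"Accuracy for C={c}: {acc}")
```

Smaller values of C mean stronger regularization, which is why the sweep is worth running before settling on a final model.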



Accuracy for C=0.01: 0.86848
Accuracy for C=0.05: 0.8776
Accuracy for C=0.25: 0.87856
Accuracy for C=0.5: 0.87536
Accuracy for C=1: 0.87296
Final Accuracy: 0.88168


# Remove Stop Words

Stop words are the very common words like ‘if’, ‘but’, ‘we’, ‘he’, ‘she’, and ‘they’. We can usually remove these words without changing the semantics of a text and doing so often (but not always) improves the performance of a model. Removing these stop words becomes a lot more useful when we start using longer word sequences as model features (see n-grams below).

Before we apply the CountVectorizer, let's remove the stop words included in nltk.corpus.

Then apply the CountVectorizer, train the Logistic Regression model, and obtain the accuracy.
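As a rough sketch of the removal step: the notebook takes its list from nltk.corpus (which needs a one-time `nltk.download('stopwords')`), but to keep this block runnable without a download it uses a short hand-written list instead. Both `STOP_WORDS` and `remove_stop_words` are hypothetical names chosen for illustration.

```python
# Hypothetical short stop list (assumption); swap in the NLTK list on real data.
STOP_WORDS = {"if", "but", "we", "he", "she", "they", "the", "a", "of", "in"}

def remove_stop_words(corpus, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return [" ".join(word for word in review.split() if word not in stop_words)
            for review in corpus]

cleaned = remove_stop_words(["we loved the movie but they hated it"])
print(cleaned)  # ['loved movie hated it']
```

The cleaned reviews can then go through the same vectorize-and-train steps as the baseline.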



Accuracy for C=0.01: 0.87136
Accuracy for C=0.05: 0.88128
Accuracy for C=0.25: 0.87776
Accuracy for C=0.5: 0.87616
Accuracy for C=1: 0.87184


**Note:** In practice, an easier way to remove stop words is to use the stop_words argument of any of scikit-learn’s ‘Vectorizer’ classes. Passing stop_words='english' selects scikit-learn’s own built-in English list (not NLTK’s). In practice I’ve found that using NLTK’s list actually decreases my performance because it’s too expansive, so I usually supply my own list of words. For example, stop_words=['in','of','at','a','the'].
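Here is a small sketch of the `stop_words` argument in action, using the five-word list from the note. The two sample sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

custom_stop_words = ['in', 'of', 'at', 'a', 'the']

# The vectorizer filters the listed words out while building its vocabulary.
cv = CountVectorizer(binary=True, stop_words=custom_stop_words)
cv.fit(["the plot of the movie", "a night at the opera"])

vocab = sorted(cv.vocabulary_)
print(vocab)  # ['movie', 'night', 'opera', 'plot']
```

Because the filtering happens inside the vectorizer, the raw reviews don't need a separate stop-word pass.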

A common next step in text preprocessing is to normalize the words in your corpus by trying to convert all of the different forms of a given word into one. Two methods that exist for this are Stemming and Lemmatization.

# Stemming

Stemming is considered to be the more crude/brute-force approach to normalization (although this doesn’t necessarily mean that it will perform worse). There are several algorithms, but in general they all use basic rules to chop off the ends of words.

NLTK has several stemming algorithm implementations. We’ll use the Porter stemmer here, but you can explore all of the options, with examples, in the NLTK stemmers documentation.

Apply a PorterStemmer, vectorize, and train the model again.
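A minimal sketch of the stemming pass, assuming NLTK is installed (the Porter stemmer itself needs no extra corpus download). The function name `get_stemmed_text` is just an illustrative choice.

```python
from nltk.stem.porter import PorterStemmer

def get_stemmed_text(corpus):
    """Stem every whitespace-separated token in every review."""
    stemmer = PorterStemmer()
    return [" ".join(stemmer.stem(word) for word in review.split())
            for review in corpus]

stemmed = get_stemmed_text(["acting acted"])
print(stemmed)  # ['act act']
```

Both inflected forms collapse to the same stem, so the vectorizer will treat them as a single feature.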



Accuracy for C=0.01: 0.85728
Accuracy for C=0.05: 0.8704
Accuracy for C=0.25: 0.87392
Accuracy for C=0.5: 0.87344
Accuracy for C=1: 0.87072
Final Accuracy: 0.87748


# Lemmatization

Lemmatization works by identifying the part-of-speech of a given word and then applying more complex rules to transform the word into its true root.
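To show the shape of this step without requiring a WordNet download, the sketch below uses a tiny hand-made lookup table as a stand-in lemmatizer. `TOY_LEMMAS`, `toy_lemmatize` and `get_lemmatized_text` are all hypothetical; with NLTK you could instead pass `WordNetLemmatizer().lemmatize` as the `lemmatize` argument after running `nltk.download('wordnet')`.

```python
# Toy lookup table standing in for a real lemmatizer (assumption for illustration).
TOY_LEMMAS = {"movies": "movie", "were": "be", "was": "be",
              "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

def get_lemmatized_text(corpus, lemmatize=toy_lemmatize):
    """Apply a word-level lemmatize function to every token of every review."""
    return [" ".join(lemmatize(word) for word in review.split())
            for review in corpus]

lemmatized = get_lemmatized_text(["the movies were better"])
print(lemmatized)  # ['the movie be good']
```

Unlike a stemmer, a lemmatizer can map irregular forms such as ‘were’ or ‘better’ to a real dictionary word rather than just trimming suffixes.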



Accuracy for C=0.01: 0.87792
Accuracy for C=0.05: 0.88448
Accuracy for C=0.25: 0.88016
Accuracy for C=0.5: 0.8792
Accuracy for C=1: 0.876
Final Accuracy: 0.87444


# n-grams

We can potentially add more predictive power to our model by adding two- or three-word sequences (bigrams or trigrams) as features as well. For example, if a review had the three-word sequence “didn’t love movie”, a unigram-only model would consider these words individually and probably not capture that this is actually a negative sentiment, because the word ‘love’ by itself is going to be highly correlated with a positive review.

The scikit-learn library makes this really easy to play around with. Just use the ngram_range argument with any of the ‘Vectorizer’ classes.
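A quick sketch of what `ngram_range` does to the feature set. The range (1, 2) is an assumed setting for illustration, and the apostrophe is dropped from the sample text to keep the tokenization simple.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adds every adjacent word pair.
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(["didnt love movie"])

features = sorted(ngram_vectorizer.vocabulary_)
print(features)  # ['didnt', 'didnt love', 'love', 'love movie', 'movie']
```

The bigram ‘didnt love’ now exists as its own feature, so the model can learn a negative weight for it even if ‘love’ alone is positive.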



Accuracy for C=0.01: 0.88464
Accuracy for C=0.05: 0.89184
Accuracy for C=0.25: 0.8936
Accuracy for C=0.5: 0.89312
Accuracy for C=1: 0.89344
Final Accuracy: 0.898


# Word Counts

Instead of simply noting whether a word appears in the review or not, we can include the number of times a given word appears. This can give our sentiment classifier a lot more predictive power. For example, if a movie reviewer says ‘amazing’ or ‘terrible’ multiple times in a review it is considerably more probable that the review is positive or negative, respectively.
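The difference comes down to the `binary` flag on CountVectorizer, as this small sketch on a made-up review shows.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["amazing cast amazing story amazing ending"]

# binary=True: 1 if the word occurs at all; binary=False: the actual count.
binary_row = CountVectorizer(binary=True).fit_transform(doc).toarray()[0]
count_vectorizer = CountVectorizer(binary=False)
count_row = count_vectorizer.fit_transform(doc).toarray()[0]

# Columns follow alphabetical order: amazing, cast, ending, story.
print(list(binary_row))  # [1, 1, 1, 1]
print(list(count_row))   # [3, 1, 1, 1]
```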



Accuracy for C=0.01: 0.87552
Accuracy for C=0.05: 0.88496
Accuracy for C=0.25: 0.88304
Accuracy for C=0.5: 0.88032
Accuracy for C=1: 0.87872
Final Accuracy: 0.88184


# TF-IDF

Another common way to represent each document in a corpus is to use the tf-idf statistic (term frequency-inverse document frequency) for each word, which is a weighting factor that we can use in place of binary or word count representations.

There are several ways to do the tf-idf transformation, but in a nutshell, tf-idf aims to represent the number of times a given word appears in a document (a movie review in our case) relative to the number of documents in the corpus that the word appears in. Words that appear in many documents have a value closer to zero, and words that appear in fewer documents have values closer to 1.

**Note:** Now that we’ve gone over n-grams, when I refer to ‘words’ I really mean any n-gram (sequence of words) if the model is using an n greater than one.
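As a small illustration on a made-up three-document corpus, the sketch below uses scikit-learn's TfidfVectorizer with its default settings (smoothed idf and L2-normalized rows). The word ‘movie’ appears in every document, so it should end up with a lower weight than ‘great’, which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great movie", "terrible movie", "boring movie"]

tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(docs).toarray()

# Look up the column of each word, then read its weight in the first document.
index = tfidf_vectorizer.vocabulary_
great_weight = X[0, index["great"]]
movie_weight = X[0, index["movie"]]
print(f"great: {great_weight:.3f}, movie: {movie_weight:.3f}")
```

The resulting matrix can be fed to the classifier in place of the binary or count features.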



Accuracy for C=0.01: 0.79872
Accuracy for C=0.05: 0.8288
Accuracy for C=0.25: 0.86768
Accuracy for C=0.5: 0.87728
Accuracy for C=1: 0.88432
Final Accuracy: 0.882


# Support Vector Machines (SVM)

Recall that linear classifiers tend to work well on very sparse datasets (like the one we have). Another algorithm that can produce great results with a quick training time is the Support Vector Machine with a linear kernel.

Build a model with an n-gram range from 1 to 2:
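A minimal sketch of that setup on a made-up toy corpus, using scikit-learn's LinearSVC. The value C=0.01 and the sample reviews are assumptions for illustration; on the real data you would sweep C as before.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned training reviews (assumption for illustration).
train_texts = ["great movie loved it", "wonderful acting great story",
               "loved the wonderful cast", "great fun loved every minute",
               "terrible movie hated it", "awful acting boring story",
               "hated the awful script", "boring plot terrible ending"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Unigrams and bigrams as binary features.
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train = ngram_vectorizer.fit_transform(train_texts)

svm = LinearSVC(C=0.01)
svm.fit(X_train, train_labels)

# Reuse the fitted vectorizer so new text maps onto the same columns.
new_reviews = ["great story wonderful cast", "boring story awful script"]
predictions = svm.predict(ngram_vectorizer.transform(new_reviews))
print(predictions)
```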

Accuracy for C=0.01: 0.89408
Accuracy for C=0.05: 0.892
Accuracy for C=0.25: 0.89024
Accuracy for C=0.5: 0.8896
Accuracy for C=1: 0.8896
Final Accuracy: 0.8974


# Final Model

Removing a small set of stop words along with an n-gram range from 1 to 3 and a linear support vector classifier shows the best results.
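Putting those three choices together might look like the sketch below. It is one possible arrangement rather than the original cell: the toy train and test reviews are made up, the stop-word list is the small example list from the note earlier, and C=0.01 is an assumed pick from near the top of the sweep. The model is fit on the full training set and scored once on held-out test reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned train/test reviews (assumption for illustration).
train_texts = ["great movie loved it", "wonderful acting great story",
               "loved the wonderful cast", "great fun loved every minute",
               "terrible movie hated it", "awful acting boring story",
               "hated the awful script", "boring plot terrible ending"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["loved the great cast", "hated the boring plot"]
test_labels = [1, 0]

stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3),
                                   stop_words=stop_words)
X_train = ngram_vectorizer.fit_transform(train_texts)  # learn vocabulary on train only
X_test = ngram_vectorizer.transform(test_texts)        # reuse it for the test set

final_model = LinearSVC(C=0.01)
final_model.fit(X_train, train_labels)

final_accuracy = accuracy_score(test_labels, final_model.predict(X_test))
print(f"Final Accuracy: {final_accuracy}")
```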

Accuracy for C=0.001: 0.88752
Accuracy for C=0.005: 0.89584
Accuracy for C=0.01: 0.89568
Accuracy for C=0.05: 0.89456
Accuracy for C=0.1: 0.89472
Final Accuracy: 0.90064


# Top Positive and Negative Features

Obtain the most important features of the model.
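For a linear model, one common way to do this is to pair each vocabulary entry with its learned coefficient and sort. The sketch below does that on a made-up toy corpus; it defines `ngram_vectorizer` in the same block as the model, since the recorded output for this step is a NameError for exactly that name. The n-gram range and C value are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned training reviews (assumption for illustration).
train_texts = ["great movie loved it", "wonderful acting great story",
               "loved the wonderful cast", "great fun loved every minute",
               "terrible movie hated it", "awful acting boring story",
               "hated the awful script", "boring plot terrible ending"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train = ngram_vectorizer.fit_transform(train_texts)
model = LinearSVC(C=0.01).fit(X_train, train_labels)

# Map each term to its coefficient; positive weights push toward label 1.
weights = {term: model.coef_[0][column]
           for term, column in ngram_vectorizer.vocabulary_.items()}
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

top_positive = ranked[:3]
top_negative = ranked[-3:][::-1]  # most negative first
print("Positive:", top_positive)
print("Negative:", top_negative)
```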

NameError: name 'ngram_vectorizer' is not defined

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
Accuracy: 81.22%




Final Accuracy: 0.88672
