# Text Classification for Movie Reviews

## Goal

With this project, we aimed to apply the machine learning skills from this class to a popular area of machine learning: text classification. We found a dataset of 50,000 movie reviews labeled for binary classification according to whether each review is positive or negative. This kind of classification is useful for websites that aggregate movie reviews, such as Rotten Tomatoes, because users can submit their own reviews.

## Step 1: Feature Exploration

The dataset contains 25,000 positive reviews and 25,000 negative reviews, split evenly into training and test sets. The creators of the dataset also turned each review into a tokenized bag of words keyed to a vocabulary file. The vocabulary file lists every word that appears in the review set and maps it to a number. However, there were still a few things we needed to do to get the data ready for classification.
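
To make the bag-of-words layout concrete, here is a minimal sketch of unpacking one line of `labeledBow.feat`. The helper name `parse_bow_line` and the sample line are made up for illustration; the layout (a rating followed by `index:count` pairs) and the rating threshold are taken from the code later in this notebook.

```python
def parse_bow_line(line):
    """Split one 'rating idx:count idx:count ...' line into (rating, {idx: count})."""
    tokens = line.split()
    rating = int(tokens[0])
    bag = {}
    for token in tokens[1:]:
        idx, count = token.split(":")
        bag[int(idx)] = int(count)
    return rating, bag

# Made-up sample line in the same layout as labeledBow.feat
rating, bag = parse_bow_line("9 0:9 1:1 2:4 47304:1")
label = 0 if rating <= 4 else 1  # same threshold used when relabeling below
print(rating, label, bag)
```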

## Step 2: Text Clean-Up and Transformation

While the bag of words is a great start, the raw word count doesn't tell the whole story. We want to represent not only how often a word appears in a review but also how common it is across documents. (A word that appears in every document is not as useful as a word that appears only in positive reviews, for example.) We achieve this with a tf-idf (term frequency-inverse document frequency) score, which we then vectorize for classification.
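
As a small worked illustration of the idea, the sketch below scores a toy corpus with raw count times log(N / document frequency). The three documents and the names `docs`, `doc_freq`, and `tf_idf` are hypothetical and exist only for this example.

```python
import math

# Hypothetical toy corpus: each document is a {word_index: count} bag of words.
docs = [
    {0: 3, 1: 1},   # word 0 appears in every document
    {0: 2, 2: 4},   # word 2 appears only here
    {0: 1, 1: 2},
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
doc_freq = {}
for bag in docs:
    for idx in bag:
        doc_freq[idx] = doc_freq.get(idx, 0) + 1

def tf_idf(idx, count):
    # Raw count times log(N / document frequency).
    return count * math.log(n_docs / doc_freq[idx])

scores = [{idx: tf_idf(idx, c) for idx, c in bag.items()} for bag in docs]
print(scores)
```

Word 0 appears in all three documents, so its score is 0 no matter how often it occurs, while word 2, which appears in only one document, receives the largest weight.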

In [62]:
import math
import numpy as np

# Document frequency: the number of training reviews that contain each word

vocab = open("aclImdb/imdb.vocab", errors='ignore').read().splitlines()
trainReviews = open("aclImdb/train/labeledBow.feat").read().splitlines()
testReviews = open("aclImdb/test/labeledBow.feat").read().splitlines()
docFreqTrain = [0] * len(vocab)

for line in trainReviews:
    bag = line.split()
    bag.pop(0)  # drop the leading rating; the rest are index:count pairs
    for word in bag:
        pair = word.split(":")
        docFreqTrain[int(pair[0])] += 1

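# Calculating the tf-idf score: raw count times log(N / document frequency)

def TF_score(pair):
    left = int(pair[0])   # word index
    right = int(pair[1])  # count of the word in this review
    if docFreqTrain[left] == 0:
        # Word never appears in the training set (possible for test reviews);
        # return 0 instead of dividing by zero
        return 0.0
    return right * math.log(len(trainReviews) / docFreqTrain[left])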

def getFile(reviews, score):
    updatedReviews = ""
    for line in reviews:
        bag = line.split()
        # Ratings of 4 or lower become label 0 (negative); all others become label 1 (positive)
        updateLine = "0 " if (int(bag.pop(0)) <= 4) else "1 "
        for word in bag:
            pair = word.split(":")
            updateLine += pair[0]
            updateLine += ":"
            updateLine += str(score(pair))
            updateLine += " "
        updateLine += "\n"
        updatedReviews += updateLine
    return updatedReviews

# Both sets are scored with the training-set document frequencies
updatedReviewsTrain = getFile(trainReviews, TF_score)
updatedReviewsTest = getFile(testReviews, TF_score)

Next, we vectorize the data into arrays. For now, we treat each word as a feature, but this will change when we tune hyperparameters and choose a classifier.

In [65]:
TrainData = updatedReviewsTrain.splitlines()
TestData = updatedReviewsTest.splitlines()

TrainFeatures = np.zeros((len(TrainData), len(vocab)))
TrainLabels = np.zeros(len(TrainData))
TestFeatures = np.zeros((len(TestData), len(vocab)))
TestLabels = np.zeros(len(TestData))

for i in range(0, len(TrainData)):
    bag = TrainData[i].split()
    TrainLabels[i] = int(bag.pop(0))
    for word in bag:
        pair = word.split(":")
        TrainFeatures[i][int(pair[0])] = float(pair[1])

for i in range(0, len(TestData)):
    bag = TestData[i].split()
    TestLabels[i] = int(bag.pop(0))
    for word in bag:
        pair = word.split(":")
        TestFeatures[i][int(pair[0])] = float(pair[1])
        
print(TrainFeatures.shape, TrainLabels.shape)
print(TestFeatures.shape, TestLabels.shape)

(25000, 89527) (25000,)
(25000, 89527) (25000,)
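
Each dense array above holds 25,000 × 89,527 float64 values, roughly 18 GB, even though almost every entry is zero. If memory is a concern, one alternative is a sparse matrix. The sketch below is not what this notebook runs: the helper name `to_sparse` and the sample lines are hypothetical, and it assumes SciPy is installed. `LinearSVC` accepts SciPy sparse matrices as input.

```python
import numpy as np
from scipy.sparse import csr_matrix

def to_sparse(lines, n_features):
    """Build a CSR feature matrix and a label vector from 'label idx:value ...' lines."""
    rows, cols, vals, labels = [], [], [], []
    for i, line in enumerate(lines):
        tokens = line.split()
        labels.append(int(tokens[0]))
        for token in tokens[1:]:
            idx, value = token.split(":")
            rows.append(i)
            cols.append(int(idx))
            vals.append(float(value))
    features = csr_matrix((vals, (rows, cols)), shape=(len(lines), n_features))
    return features, np.array(labels, dtype=float)

# Made-up lines in the same layout as updatedReviewsTrain
sample = ["1 0:0.5 3:1.25", "0 2:2.0"]
X, y = to_sparse(sample, n_features=5)
print(X.shape, X.nnz)
```

Only the nonzero entries are stored, so memory grows with the number of distinct words per review rather than with the vocabulary size.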


Finally, to see how the data performs without any optimization, we run it through a basic linear SVM classifier and check the accuracy.

In [64]:
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(TrainFeatures, TrainLabels)

print("Accuracy of Train Set: ", svc.score(TrainFeatures, TrainLabels), "\n")
print("Accuracy of Test Set: ", svc.score(TestFeatures, TestLabels))



Accuracy of Train Set:  1.0 

Accuracy of Test Set:  0.84776


The classifier has decent accuracy on the test set, but the gap between it and the perfect accuracy on the training set suggests overfitting: the model does not generalize as well to unseen data. We also need a more in-depth comparison to figure out which classifier is best overall. This will be covered in the next section.

## Step 3: Further Refinement

### Tokenizer

This function takes the file name of a movie review and converts the review into a bag of words using the imdb.vocab file.

It returns a string that starts with 0, followed by an index:count pair for each vocabulary word present, delimited by whitespace. There is no newline character at the end.

In [1]:
import re
import string
from collections import Counter

def Tokenizer(file_name):
    # Map each vocabulary word to its line index (first occurrence wins, like list.index)
    vocab_index = {}
    with open("aclImdb/imdb.vocab") as f:
        for i, word in enumerate(f.read().splitlines()):
            vocab_index.setdefault(word, i)

    with open(file_name) as f:
        review = f.read()

    # Keep only letters, whitespace, apostrophes, and hyphens, then lowercase and split
    allow = string.ascii_letters + string.whitespace + "'-"
    review = re.sub('[^%s]' % allow, '', review).lower().split()

    # Count each word once, skipping words that are not in the vocabulary
    word_count = Counter(review)
    bow = {vocab_index[w]: c for w, c in word_count.items() if w in vocab_index}

    # Start with the leading 0, then add index:count pairs in ascending index order
    return_string = "0 "
    for index in sorted(bow):
        return_string += "%d:%d " % (index, bow[index])

    return return_string

In [2]:
Tokenizer("0_9.txt")

'0 0:9 1:1 2:4 3:4 4:6 5:4 6:2 7:2 8:4 10:4 12:2 26:1 28:1 29:2 32:1 41:1 45:1 47:1 50:1 54:2 57:1 59:1 63:2 64:1 66:1 68:2 70:1 72:1 78:1 100:1 106:1 116:1 122:1 125:1 136:1 140:1 142:1 150:1 167:1 183:1 201:1 207:1 208:1 213:1 217:1 230:1 255:1 321:4 343:1 357:1 370:1 390:2 468:1 514:1 571:1 619:1 671:1 766:1 877:1 1057:1 1179:1 1192:1 1402:2 1416:1 1477:2 1940:1 1941:1 2096:1 2243:1 2285:1 2379:1 2934:1 2938:1 3520:1 3647:1 4938:1 5138:3 5715:1 5726:1 5731:1 5812:1 8319:1 8567:1 10480:1 14239:1 20604:1 22409:4 24551:1 47304:1 '
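
The cleaning step inside `Tokenizer` can be checked in isolation. The sketch below copies the same regular expression into a hypothetical helper, `clean_and_split`, and runs it on two made-up strings.

```python
import re
import string

def clean_and_split(text):
    # Same character filter as Tokenizer: keep letters, whitespace, apostrophes, hyphens
    allow = string.ascii_letters + string.whitespace + "'-"
    return re.sub('[^%s]' % allow, '', text).lower().split()

first = clean_and_split("It's a well-made film... 10/10!")
second = clean_and_split("Great.<br />Loved it")
print(first)
print(second)
```

The first call yields `["it's", 'a', 'well-made', 'film']`. The second yields `['greatbr', 'loved', 'it']`: because disallowed characters are deleted rather than replaced with a space, text on either side of punctuation or markup can fuse into a single token that is unlikely to be in the vocabulary.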