# Sentiment Classification - Movie Reviews

*by Apoorv Pandey*

Movie reviews play an important part in shaping an audience's expectations of a movie. They offer multiple perspectives and help people decide whether to watch it. Reviews are far more personal than statistical ratings or scores: they embody the reviewer's opinion and are highly subjective. A key characteristic of a review is its *sentiment*, or **overall opinion** towards the subject matter.

This notebook implements a simple classifier for the sentiment of a movie review. The training data set is used to compute a weighted sentiment for every word. A word's weighted sentiment is the number of times it occurs in positive reviews minus the number of times it occurs in negative reviews, divided by its total number of occurrences. It therefore ranges from -1 (seen only in negative reviews) to +1 (seen only in positive reviews), and it is computed by summing over every occurrence of the word across the training data.

The weighting is required because the same word can express either a positive or a negative sentiment, depending on the context. The weighted sentiment indicates how likely the word's presence is to push the sentiment of the movie review in one direction or the other.
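
As a rough illustration of the weighting, the sketch below computes the weighted sentiment of a few words from a hand-made toy data set. The reviews, the labels, and the plain whitespace tokenisation are assumptions made for this example only; the notebook itself reads the training CSV and tokenises with `get_words`.

```python
from collections import defaultdict

# Toy labelled reviews (hypothetical): 1 = positive, 0 = negative
toy_reviews = [
    (1, "great plot great cast"),
    (1, "slow start great ending"),
    (0, "slow boring plot"),
]

totals = defaultdict(lambda: [0, 0])  # word -> [signed sum, occurrences]
for label, text in toy_reviews:
    for word in text.split():
        totals[word][0] += 1 if label == 1 else -1
        totals[word][1] += 1

# Divide the signed sum by the number of occurrences of each word
toy_weights = {word: signed / count for word, (signed, count) in totals.items()}

print(toy_weights["great"])   # 1.0: only seen in positive reviews
print(toy_weights["slow"])    # 0.0: seen once in each class
print(toy_weights["boring"])  # -1.0: only seen in negative reviews
```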

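The overall sentiment of each review in the testing data is computed as the sum of the weighted sentiments of the words present in the review. A review whose sum is non-negative is classified as positive (`1`); otherwise it is classified as negative (`0`).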
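
A minimal sketch of this scoring step follows. The weight table is made up for illustration and `predict` is a hypothetical helper, not a function defined in the notebook; the threshold at zero mirrors the testing loop further down.

```python
# Hypothetical weighted sentiments, as might come out of training
example_weights = {"great": 0.8, "bore": -0.6, "plot": 0.1}

def predict(words, weights):
    """Return 1 (positive) if the summed word sentiment is non-negative, else 0."""
    score = sum(weights.get(word, 0) for word in words)  # unseen words add 0
    return 1 if score >= 0 else 0

print(predict(["great", "plot"], example_weights))            # 1
print(predict(["bore", "plot", "unknown"], example_weights))  # 0
```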

## Functions

The functions created in this notebook are:
* `simplify(doc)`: Simplifies the input by removing special characters, converting it to lower case, and dropping English stopwords
* `get_words(sentence)`: Returns a list of stemmed words for the input sentence

In [1]:
import nltk
import re
from nltk.stem.porter import PorterStemmer

In [2]:
def simplify(doc):
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A).lower().strip() # Remove non-letter characters (whitespace is kept), make lower case, trim the ends
    tokens = nltk.WordPunctTokenizer().tokenize(doc) # Tokenize
    stop_words = set(nltk.corpus.stopwords.words('english')) # Load the stopword list once (requires the NLTK 'stopwords' corpus)
    filtered_tokens = [token for token in tokens if token not in stop_words] # Remove stopwords
    doc = ' '.join(filtered_tokens) # Re-create document from filtered tokens
    return doc
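
One pitfall worth noting with `re.sub`: its fourth positional parameter is `count`, not `flags`, so flags passed positionally silently become a cap on the number of substitutions. The sketch below makes that explicit on a made-up input string by passing the same integer value as `count` by keyword.

```python
import re

text = "1" * 300 + "abc"  # 300 digits followed by three letters
pattern = r"[^a-zA-Z\s]"

# A fourth positional argument to re.sub is interpreted as count.
# re.I | re.A has the integer value 258, so that is the count it would set.
as_count = re.sub(pattern, "", text, count=int(re.I | re.A))
as_flags = re.sub(pattern, "", text, flags=re.I | re.A)

print(len(as_count))  # 45: only the first 258 digits were removed
print(as_flags)       # abc
```

Passing `flags=` by keyword avoids the problem regardless of input length.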

In [3]:
def get_words(sentence):
    stemmer = PorterStemmer()
    words = [stemmer.stem(x) for x in simplify(sentence).split()]
    return words

## Training

In [4]:
train = open('training_set.csv', 'r') # Training data
train.readline() # Read and remove header row
word_sentiment = {} # Dictionary stores sentiment weight of all words

In [5]:
for data in train:
    sentiment, line = data.split(',', 1) # Split on the first comma only, so commas inside the review text are kept
    label = int(sentiment)
    words = get_words(line)
    for word in words:
        if word not in word_sentiment: # If word doesn't exist, create new entry [weight, occurrences]
            word_sentiment[word] = [0, 0]
        word_sentiment[word][0] += 1 if label == 1 else -1 # Increment weight of the word by 1 if positive, decrement by 1 if negative
        word_sentiment[word][1] += 1 # Increment number of occurrences of the word; used later to compute the weighted sum instead of just the sum
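
Splitting rows on commas by hand can misparse reviews that themselves contain commas or quotes. One alternative is the standard-library `csv` module, sketched below on an in-memory sample. The column names and the quoted-field layout are assumptions here; the actual layout of `training_set.csv` may differ.

```python
import csv
import io

# Hypothetical file contents; the real layout of training_set.csv may differ
sample = io.StringIO(
    'sentiment,text\n'
    '1,"Funny, warm, and well acted"\n'
    '0,Too long and too loud\n'
)

reader = csv.reader(sample)
next(reader)  # skip the header row
rows = [(int(label), text) for label, text in reader]

print(rows[0])  # (1, 'Funny, warm, and well acted')
print(rows[1])  # (0, 'Too long and too loud')
```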

In [6]:
word_weighted_sentiment = {word: weight / count for word, (weight, count) in word_sentiment.items()}
# Weighted sum of the sentiment of each word (divide weight of the word by number of occurrences of the word)

In [7]:
train.close()

## Testing

In [8]:
test = open('test_set.csv', 'r') # Testing data
test.readline() # Read and remove header row
output = open('prediction_file.csv', 'w') # Output file (testing data with predictions); 'w' overwrites instead of appending to earlier runs

In [9]:
for data in test:
    review = data.rstrip('\n') # Drop the trailing newline so the output file has no blank lines
    sentiment = 0
    for word in get_words(review):
        sentiment += word_weighted_sentiment.get(word, 0) # Sum of weighted sentiments; words missing from the training data contribute 0
    if sentiment >= 0:
        output.write('1,' + review + '\n')
    else:
        output.write('0,' + review + '\n')

In [10]:
test.close()
output.close()

The predictions are written to `prediction_file.csv`.
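
As a sketch of an alternative way to write the output, the example below uses a `with` block, so the file is closed even if an error occurs, and `csv.writer`, so reviews containing commas are quoted. The rows and the temporary output path are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up (prediction, review) rows standing in for real output
predictions = [(1, "Funny, warm, and well acted"), (0, "Too long and too loud")]

out_path = os.path.join(tempfile.mkdtemp(), "prediction_file.csv")
with open(out_path, "w", newline="") as out_file:  # "w" overwrites on each run
    writer = csv.writer(out_file)
    writer.writerows(predictions)

# Read the file back to check that the comma-containing review survived intact
with open(out_path, newline="") as in_file:
    written = list(csv.reader(in_file))

print(written[0])  # ['1', 'Funny, warm, and well acted']
```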

## References

[Training and testing datasets](https://inclass.kaggle.com/c/cs6998/data)