# Sentiment Analysis on movie comments
The data we're using are movie comments from [IMDB](http://www.imdb.com/). They have been preprocessed and stored in `reviews.txt` and `labels.txt`. Each line of `reviews.txt` contains one review, and the same line of `labels.txt` contains the corresponding label, which is either `positive` or `negative`. The goal is to train a model on this data that tells whether a comment is positive or negative.

## Part 1: Preparing data
Tasks for this part:  
1. Convert each review to a vector, which is much easier for a neural network to work with
2. Reduce noise in the data

### Load data
The data has been preprocessed so that it contains only lowercase characters. Labels are converted to upper case here so that they read more like constants.

In [1]:
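# Strip only the trailing newline, so a final line without one is not truncated
with open('reviews.txt', 'r') as f:
    reviews = [line.rstrip('\n') for line in f]

# Labels are upper-cased so they read like constants ('POSITIVE' / 'NEGATIVE')
with open('labels.txt', 'r') as f:
    labels = [line.rstrip('\n').upper() for line in f]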

### Count word frequency
The count of a word in all reviews can provide some useful information:
1. If a word that appears more often in positive reviews is found in a review, that review is more likely to be positive (and likewise for negative)
2. A word that appears equally often in positive and negative reviews contributes very little to the prediction

The first step is to count word frequencies. The counts are then used to create a vocabulary for encoding the review data.

In [17]:
from collections import Counter
import numpy as np

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

for review, label in zip(reviews, labels):
    for word in review.split(' '):
        total_counts[word] += 1
        if label == 'POSITIVE':
            positive_counts[word] += 1
        if label == 'NEGATIVE':
            negative_counts[word] += 1

Calculate the ratio of positive to negative uses of each word. Note the +1 in the denominator: it ensures we don't divide by zero for words that are seen only in positive reviews.

In [18]:
pos_neg_ratios = Counter()
for word in total_counts.keys():
    pos_neg_ratios[word] = positive_counts[word] / float(negative_counts[word]+1)
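
As a rough illustration of how the ratio behaves, here is a minimal sketch on hypothetical toy counts (the words and numbers below are made up, not taken from the IMDB data). It also shows that a `Counter` returns 0 for a word it has never seen, which is why the lookup on the negative side does not fail.

```python
from collections import Counter

# Hypothetical toy counts, not taken from the IMDB data
positive_counts = Counter({"great": 9, "the": 50, "superb": 4})
negative_counts = Counter({"great": 2, "the": 49, "awful": 7})
total_counts = positive_counts + negative_counts

# Same formula as above: positive count over (negative count + 1)
pos_neg_ratios = {
    word: positive_counts[word] / float(negative_counts[word] + 1)
    for word in total_counts
}

for word, ratio in sorted(pos_neg_ratios.items()):
    print(word, ratio)
```

In this toy data, "superb" never appears in a negative review, so its denominator is 0 + 1, and "awful" never appears in a positive review, so its ratio is 0.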

Ratio values are not easy to compare for two reasons:
1. The neutral value is 1 instead of 0, and absolute values are easier to compare when they are centered on zero
2. Ratios for positive and negative words don't have the same magnitude: positive words range from 1 upward, while negative words are squeezed between 0 and 1

To solve these two problems, convert the ratios to logarithms as follows:

In [19]:
for word, ratio in pos_neg_ratios.items():
    if ratio >= 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1/(ratio+0.01))
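
To see what the conversion does to individual values, the following sketch applies the same rule to a few hand-picked ratios. The helper name `to_log_ratio` is hypothetical, and it uses `math.log` from the standard library in place of `np.log` only to keep the example dependency-free.

```python
import math

def to_log_ratio(ratio):
    """Apply the same rule as the cell above to a single ratio."""
    if ratio >= 1:
        return math.log(ratio)
    # The small 0.01 offset keeps the log finite when the ratio is 0
    return -math.log(1 / (ratio + 0.01))

# Hand-picked ratios: strongly positive, neutral, strongly negative, never positive
for ratio in (4.0, 1.0, 0.25, 0.0):
    print(f"ratio={ratio:<5} log ratio={to_log_ratio(ratio):+.3f}")
```

A ratio of 4 maps to about +1.386 and a ratio of 0.25 maps to about -1.347, so words that are equally skewed in opposite directions end up with nearly equal magnitudes, and the neutral ratio of 1 maps to 0.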

### Noise Reduction
There are two kinds of noise we want to reduce:
1. Words that appear about equally often in positive and negative reviews, which contribute very little to the prediction
2. Words that appear rarely, which may be made up and are not representative

We use two parameters, `min_count` and `polarity_cutoff`, to filter out those words and add the rest to the vocabulary.

In [22]:
min_count = 50
polarity_cutoff = 0.1

vocab = set()
for word, count in total_counts.items():
    if count > min_count and abs(pos_neg_ratios[word]) > polarity_cutoff:
        vocab.add(word)
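
The effect of the two thresholds can be sketched on a handful of hypothetical words. The counts and log ratios below are invented for illustration; each dropped word fails exactly one of the two conditions.

```python
# Hypothetical toy values chosen to exercise both filters
toy_counts = {"excellent": 120, "the": 9000, "zyzzyva": 3, "waste": 80}
toy_log_ratios = {"excellent": 1.4, "the": 0.02, "zyzzyva": 2.0, "waste": -1.9}

min_count = 50
polarity_cutoff = 0.1

# Keep a word only if it is frequent enough AND polarized enough
toy_vocab = {
    word
    for word, count in toy_counts.items()
    if count > min_count and abs(toy_log_ratios[word]) > polarity_cutoff
}

print(sorted(toy_vocab))
```

Here "the" is dropped because its log ratio is close to zero, and "zyzzyva" is dropped because it appears too rarely, even though its ratio looks strongly positive.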

Create the word-to-index dictionary

In [23]:
word2idx = dict() 
for i, word in enumerate(vocab):
    word2idx[word] = i

Convert a review into a vector

In [26]:
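def review_to_vector(review):
    # One slot per vocabulary word; 1 marks that the word occurs in the review
    vector = np.zeros(len(vocab))
    for word in review.split(' '):
        idx = word2idx.get(word)
        # Compare against None explicitly: index 0 is valid but falsy
        if idx is not None:
            vector[idx] = 1
    return vector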
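
As a quick check of the encoding, here is a self-contained sketch on a hypothetical three-word vocabulary. The names `toy_word2idx` and `encode` are illustrative stand-ins, not part of the notebook. Words outside the vocabulary are simply ignored, and a word that occurs several times still sets its slot to 1.

```python
import numpy as np

# Hypothetical three-word vocabulary for illustration
toy_word2idx = {"great": 0, "boring": 1, "waste": 2}

def encode(review, word2idx):
    """Mark each vocabulary word that occurs in the review with a 1."""
    vector = np.zeros(len(word2idx))
    for word in review.split(' '):
        idx = word2idx.get(word)
        if idx is not None:  # index 0 is a valid position
            vector[idx] = 1
    return vector

vec = encode("a great film but a bit boring", toy_word2idx)
print(vec)
```

The resulting vector has ones in the slots for "great" and "boring" and a zero in the slot for "waste".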