# Sentiment Analysis on movie comments
The data are movie comments from [IMDB](http://www.imdb.com/), preprocessed and stored in `reviews.txt` and `labels.txt`. Each line of `reviews.txt` contains one review, and the same line of `labels.txt` contains the corresponding label, which is either `positive` or `negative`. The goal is to train a model on this data that predicts whether a comment is positive or negative.

## Part1: Preparing data
Tasks for this part:  
1. Convert each review to a vector (a numeric vector is much easier to feed to a neural network)
2. Reduce noise in the data

### Load data
The data has already been preprocessed so that it contains only lower-case characters. Labels are converted to upper case here so that they read like constants.

In [1]:
with open('reviews.txt', 'r') as f:
    # Strip only the trailing newline, so a last line without one is not truncated
    reviews = [line.rstrip('\n') for line in f]

with open('labels.txt', 'r') as f:
    labels = [line.rstrip('\n').upper() for line in f]

### Count word frequency
The count of a word across all reviews provides some useful information:
1. If a review contains a word that appears more often in positive reviews, the review is more likely to be positive (and likewise for negative)
2. A word that appears equally often in positive and negative reviews contributes very little to the prediction

The first step is to count word frequencies. The counts are then used to create a vocabulary for encoding the review data.

In [2]:
from collections import Counter
import numpy as np

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

for review, label in zip(reviews, labels):
    for word in review.split(' '):
        total_counts[word] += 1
        if label == 'POSITIVE':
            positive_counts[word] += 1
        if label == 'NEGATIVE':
            negative_counts[word] += 1

Calculate the ratio of positive to negative uses of each word. Notice the +1 in the denominator: it ensures we don't divide by zero for words that are only seen in positive reviews.

In [3]:
pos_neg_ratios = Counter()
for word in total_counts.keys():
    pos_neg_ratios[word] = positive_counts[word] / float(negative_counts[word]+1)
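
To see what the `+1` does, here is a minimal sketch on made-up counts. The words, the numbers and the helper `smoothed_ratio` are illustrative assumptions, not values from the IMDB data.

```python
from collections import Counter

# Hypothetical toy counts, not taken from the IMDB data
positive_counts = Counter({'superb': 9, 'the': 100, 'dreadful': 1})
negative_counts = Counter({'the': 99, 'dreadful': 19})

def smoothed_ratio(word):
    # The +1 keeps the denominator non-zero when a word
    # never appears in a negative review (e.g. 'superb')
    return positive_counts[word] / float(negative_counts[word] + 1)

ratios = {w: smoothed_ratio(w) for w in ['superb', 'the', 'dreadful']}
print(ratios)
```

A word seen only in positive reviews gets a large ratio, a word used about equally in both classes lands near 1, and a mostly negative word lands between 0 and 1.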

Ratio values are not easy to compare for two reasons:
1. The neutral value is 1 instead of 0, and magnitudes are easier to compare when they are centered on zero
2. Positive and negative words don't have the same magnitude: mostly positive words have ratios from 1 upwards, while mostly negative words are squeezed between 0 and 1

To solve these two problems, convert the ratios to logarithms as follows:

In [4]:
for word, ratio in pos_neg_ratios.items():
    if ratio >= 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1/(ratio+0.01))
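
The effect of the conversion can be checked on a few hand-picked ratios. This is a small sketch that applies the same rule with `math.log` in place of `np.log`; the function name `to_log_ratio` and the sample ratios are assumptions made for illustration.

```python
import math

def to_log_ratio(ratio):
    # Same rule as the cell above; note that -log(1/x) equals log(x),
    # so the else branch is log(ratio + 0.01)
    if ratio >= 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

# A word 4x more frequent in positive reviews, a neutral word,
# and a word 4x more frequent in negative reviews
for r in [4.0, 1.0, 0.25]:
    print(r, round(to_log_ratio(r), 3))
```

After the conversion the neutral ratio maps to zero, and the 4x-positive and 4x-negative words end up with opposite signs and nearly equal magnitudes (the small `0.01` offset keeps them from being exactly symmetric).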

### Noise Reduction
There are two kinds of noise we want to reduce:
1. Words that appear equally often in positive and negative reviews, which contribute very little to the prediction
2. Words that appear rarely, which may be made up and are not representative

We use two parameters, `min_count` and `polarity_cutoff`, to filter out those words and add the rest to the vocabulary.

In [5]:
min_count = 50
polarity_cutoff = 0.1

vocab = set()
for word, count in total_counts.items():
    if count > min_count and abs(pos_neg_ratios[word]) > polarity_cutoff:
        vocab.add(word)
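
As a quick illustration of how the two thresholds interact, the sketch below applies the same filter to four made-up words. The counts and log-ratios are hypothetical values chosen so that each word exercises a different case.

```python
# Hypothetical counts and log-ratios for four words (illustrative values only)
total_counts = {'the': 5000, 'superb': 120, 'awful': 300, 'zxqv': 2}
pos_neg_ratios = {'the': 0.02, 'superb': 1.9, 'awful': -2.1, 'zxqv': 1.1}

min_count = 50
polarity_cutoff = 0.1

# Keep a word only if it is both frequent enough and polarized enough
vocab = {word for word, count in total_counts.items()
         if count > min_count and abs(pos_neg_ratios[word]) > polarity_cutoff}
print(sorted(vocab))
```

Here `the` is dropped because it is nearly neutral, and `zxqv` is dropped because it is too rare, even though its ratio looks strongly positive.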

Create the word-to-index dictionary

In [6]:
word2idx = dict() 
for i, word in enumerate(vocab):
    word2idx[word] = i

Convert a review into a vector

In [7]:
def review_to_vector(review):
    vector = np.zeros(len(vocab))
    for word in review.split(' '):
        idx = word2idx.get(word, None)
        # Compare against None explicitly: index 0 is valid but falsy
        if idx is not None:
            vector[idx] = 1
    return vector
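
The encoding can be tried on a tiny vocabulary. This is a plain-Python sketch using lists instead of NumPy arrays, and the three-word vocabulary is an assumption for illustration. It checks `idx is not None` rather than the truthiness of `idx`, so that the word stored at index 0 is encoded too.

```python
# Hypothetical three-word vocabulary; sorted so the indices are reproducible
vocab = sorted(['awful', 'best', 'superb'])
word2idx = {word: i for i, word in enumerate(vocab)}

def review_to_vector(review):
    vector = [0] * len(vocab)
    for word in review.split(' '):
        idx = word2idx.get(word)
        # Index 0 is a valid position but is falsy, so test for None explicitly
        if idx is not None:
            vector[idx] = 1
    return vector

print(review_to_vector('awful awful movie'))
print(review_to_vector('the best and superb'))
```

Each position records only whether a vocabulary word is present, so repeating a word does not change the vector, and words outside the vocabulary are ignored.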

Now, run through the entire review data set and convert each review to a word vector.

In [8]:
review_vectors = np.array([review_to_vector(r) for r in reviews])

### Train, Validation and Test sets
Here we use the function `to_categorical` from TFLearn to one-hot encode the target data, so that we have two output units and can classify with a softmax activation function. The records are shuffled and split 90/10 into training and test sets. We don't create the validation set here; TFLearn will do that for us later.

In [12]:
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

Y = (np.array(labels) == 'POSITIVE').astype(np.int_)
records = len(labels)

shuffle = np.arange(records)
np.random.shuffle(shuffle)
train_fraction = 0.9  # 90% of the records go to training, the remaining 10% to testing

split_at = int(records * train_fraction)
train_split, test_split = shuffle[:split_at], shuffle[split_at:]
trainX, trainY = review_vectors[train_split,:], to_categorical(Y[train_split], 2)
testX, testY = review_vectors[test_split,:], to_categorical(Y[test_split], 2)

In [13]:
trainY

array([[ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])
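
The layout of the target array can be reproduced with a few lines of plain Python. This is only a sketch of the encoding that `to_categorical` is used for here; `one_hot` is a hypothetical helper, not TFLearn's implementation.

```python
def one_hot(labels, n_classes):
    # Row i has a 1.0 in column labels[i] and 0.0 everywhere else
    return [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]

Y = [1, 0, 0, 1]  # 1 = POSITIVE, 0 = NEGATIVE
for row in one_hot(Y, 2):
    print(row)
```

With this layout, column 0 marks negative reviews and column 1 marks positive reviews, which is why a row such as `[0., 1.]` stands for a positive review.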

## Part2: Build the Network

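Here I use [TFLearn](http://tflearn.org/) to build the network layer by layer: an input layer with one unit per vocabulary word, one hidden layer of 150 ReLU units, and a 2-unit softmax output layer, trained with SGD on a categorical cross-entropy loss.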

In [26]:
def build_model():
    # This resets all parameters and variables, leave this here
    tf.reset_default_graph()
    
    net = tflearn.input_data([None, len(vocab)])
    net = tflearn.fully_connected(net, 150, activation='ReLU')
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.01, loss='categorical_crossentropy')
    
    model = tflearn.DNN(net)
    return model
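
To make the shape of the computation concrete, here is a plain-Python sketch of the forward pass through a network of this form (input, one ReLU hidden layer, softmax output). It is not TFLearn's implementation: the layer sizes are shrunk to toy values, the weights are random rather than trained, and the helpers `dense`, `relu` and `softmax` are names introduced for this illustration.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    # Subtract the maximum for numerical stability
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases):
    # weights holds one row of input weights per output unit
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

random.seed(0)
n_in, n_hidden, n_out = 5, 4, 2  # toy sizes; the model above uses len(vocab), 150, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

x = [1, 0, 1, 0, 0]  # a toy review vector
probs = softmax(dense(relu(dense(x, W1, b1)), W2, b2))
print(probs)
```

The two outputs are non-negative and sum to 1, so they can be read as the probabilities of the negative and positive classes.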

### Train the network

In [24]:
model = build_model()

# Training
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=35)

Training Step: 14309  | total loss: 0.14391 | time: 3.732s
| SGD | epoch: 090 | loss: 0.14391 - acc: 0.9521 -- iter: 20224/20250
Training Step: 14310  | total loss: 0.14102 | time: 4.759s
| SGD | epoch: 090 | loss: 0.14102 - acc: 0.9538 | val_loss: 0.23216 - val_acc: 0.9049 -- iter: 20250/20250
--


### Testing 

In [25]:
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.8864
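
The accuracy calculation can be followed by hand on a few rows. The sketch below repeats the same steps with plain Python lists; the predicted probabilities and targets are made-up values, not model output.

```python
# Hypothetical softmax outputs: column 0 = P(negative), column 1 = P(positive)
predicted = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
# One-hot targets in the same column order
targets = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

# Threshold column 0, as in the cell above, then compare with column 0 of the targets
pred_negative = [1 if row[0] >= 0.5 else 0 for row in predicted]
true_negative = [int(row[0]) for row in targets]

accuracy = sum(p == t for p, t in zip(pred_negative, true_negative)) / len(targets)
print("Test accuracy: ", accuracy)
```

Three of the four toy rows are classified correctly; the third row is a positive review that the made-up model scores as negative.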


### Try my own comments

In [37]:
# Helper function that uses trained model to predict sentiment
def test_review(review):
    positive_prob = model.predict([review_to_vector(review.lower())])[0][1]
    print('Review: {}'.format(review))
    print('P(positive) = {:.3f} :'.format(positive_prob), 
          'Positive' if positive_prob > 0.5 else 'Negative')

In [38]:
review = "GANTZ:O is by far the best CG movie of 2016."
test_review(review)

review = "It's amazing anyone could be talented enough to make something this spectacularly awful"
test_review(review)

Review: GANTZ:O is by far the best CG movie of 2016.
P(positive) = 0.585 : Positive
Review: It's amazing anyone could be talented enough to make something this spectacularly awful
P(positive) = 0.233 : Negative
