# Sentiment analysis with TFLearn

In this notebook, I'll build a neural network that classifies movie reviews as positive or negative.  
I'll be using [TFLearn](http://tflearn.org/), a high-level library built on top of TensorFlow.

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

## Preparing the data

In [3]:
reviews = pd.read_csv('/Users/Jocelyn/Desktop/Data_Science/Intro_to_Deep_Learning/deep-learning-master/intro-to-tflearn/reviews.txt', header=None)
labels = pd.read_csv('/Users/Jocelyn/Desktop/Data_Science/Intro_to_Deep_Learning/deep-learning-master/intro-to-tflearn/labels.txt', header=None)

### Counting word frequency

In [25]:
from collections import Counter

total_counts = Counter()

for sentence in reviews[0]:
    s = sentence.lower().split(' ')
    total_counts.update(s)

print("Total words in data set: ", len(total_counts))

Total words in data set:  74074


Keep only the 10000 most frequent words. Most of the words in the vocabulary are rarely used, so they will have little effect on our predictions.

In [26]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:10000]
print(vocab[:60])

['', 'the', '.', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you', 'on', 't', 'not', 'he', 'are', 'his', 'have', 'be', 'one', 'all', 'at', 'they', 'by', 'an', 'who', 'so', 'from', 'like', 'there', 'her', 'or', 'just', 'about', 'out', 'if', 'has', 'what', 'some', 'good', 'can', 'more', 'she', 'when', 'very', 'up', 'time', 'no']
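The empty string at the top of this list is most likely a side effect of splitting on a single space: `str.split(' ')` emits an empty token between adjacent spaces, whereas `str.split()` with no argument collapses runs of whitespace. A small sketch of the difference, using a made-up sentence rather than the review data:

```python
from collections import Counter

# Made-up text with a double space, standing in for a review
text = "great  movie . loved it"

# Splitting on a single space keeps an empty token between adjacent spaces
print(text.split(' '))   # ['great', '', 'movie', '.', 'loved', 'it']

# Splitting on any whitespace drops the empty token
print(text.split())      # ['great', 'movie', '.', 'loved', 'it']

# The empty token is counted like any other word
counts = Counter(text.split(' '))
print(counts[''])        # 1
```

`text_to_vector` further down splits the same way, so the counts and the vectors stay consistent; switching one of them to `split()` would mean switching both.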


In [27]:
# The last word in vocabulary
print(vocab[-1], ': ', total_counts[vocab[-1]])

fulfilled :  30
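So every word kept in the vocabulary appears at least 30 times. One way to check how much of the corpus such a cutoff retains is to compute the share of all token occurrences covered by the top `k` words. A minimal sketch on toy data (the helper `coverage` and the toy reviews are hypothetical, not part of the notebook):

```python
from collections import Counter

def coverage(counts, k):
    """Fraction of all token occurrences accounted for by the k most frequent tokens."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(k))
    return top / total if total else 0.0

# Toy stand-in for the review corpus
toy_reviews = ["the movie was good", "the movie was bad", "the plot was thin"]
toy_counts = Counter()
for sentence in toy_reviews:
    toy_counts.update(sentence.lower().split(' '))

# Share of the 12 tokens covered by the 3 most frequent words
print(coverage(toy_counts, 3))
```

On the real data the same call would be `coverage(total_counts, 10000)`.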


In [28]:
# Create a dictionary called word2idx that maps each word in the vocabulary to an index
word2idx = {word: i for i, word in enumerate(vocab)}

In [43]:
# Create a function that converts some text to a word vector
def text_to_vector(text):
    vec = np.zeros(len(word2idx))
    
    words = text.lower().split(' ')
    
    for w in words:
        # Check the dict (constant-time lookup) instead of scanning the vocab list
        if w in word2idx:
            i = word2idx[w]
            vec[i] += 1
    
    return vec

In [44]:
text_to_vector('The tea is for a party to celebrate '
               'the movie so she has no time for a cake')[:65]

array([ 0.,  2.,  0.,  0.,  2.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  2.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.])

Now, run through our entire review data set and convert each review to a word vector.

In [45]:
word_vectors = np.zeros((len(reviews), len(vocab)), dtype=np.int_)
for ii, (_, text) in enumerate(reviews.iterrows()):
    word_vectors[ii] = text_to_vector(text[0])

In [46]:
# Printing out the first 5 word vectors
word_vectors[:5, :23]

array([[ 18,   9,  27,   1,   4,   4,   6,   4,   0,   2,   2,   5,   0,
          4,   1,   0,   2,   0,   0,   0,   0,   0,   0],
       [  5,   4,   8,   1,   7,   3,   1,   2,   0,   4,   0,   0,   0,
          1,   2,   0,   0,   1,   3,   0,   0,   0,   1],
       [ 78,  24,  12,   4,  17,   5,  20,   2,   8,   8,   2,   1,   1,
          2,   8,   0,   5,   5,   4,   0,   2,   1,   4],
       [167,  53,  23,   0,  22,  23,  13,  14,   8,  10,   8,  12,   9,
          4,  11,   2,  11,   5,  11,   0,   5,   3,   0],
       [ 19,  10,  11,   4,   6,   2,   2,   5,   0,   1,   2,   3,   1,
          0,   0,   0,   3,   1,   0,   1,   0,   0,   0]])

### Train, Validation, Test sets

In [69]:
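# Encode the labels as integers: 'positive' -> 1, anything else -> 0
Y = (labels=='positive').astype(np.int_)
records = len(labels)

# Shuffle the row indices, then use the first 90% for training and the rest for testing
shuffle = np.arange(records)
np.random.shuffle(shuffle)
train_fraction = 0.9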

train_split, test_split = shuffle[:int(records*train_fraction)], shuffle[int(records*train_fraction):]
trainX, trainY = word_vectors[train_split,:], to_categorical(Y[0][train_split], 2)
testX, testY = word_vectors[test_split,:], to_categorical(Y[0][test_split], 2)

In [70]:
trainY

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])
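Each row of `trainY` is a one-hot encoding of the label: with `positive` mapped to 1, column 0 marks a negative review and column 1 a positive one. The pure-Python sketch below illustrates the encoding that `to_categorical` is used for here; `one_hot` is an illustrative stand-in, not TFLearn's implementation.

```python
def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * n_classes
        row[label] = 1.0   # set the column that matches the class index
        rows.append(row)
    return rows

# 1 = positive, 0 = negative, as in Y above
encoded = one_hot([1, 0, 1], 2)
print(encoded)   # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```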

## Building the network

In [81]:
# Network building
def build_model():
    # This resets all parameters and variables, leave this here
    tf.reset_default_graph()
    
    net = tflearn.input_data([None, 10000])

    net = tflearn.fully_connected(net, 200, activation='ReLU')
    net = tflearn.fully_connected(net, 25, activation='ReLU')
    
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
    
    model = tflearn.DNN(net)
    return model
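The output layer uses a softmax activation, which turns the two raw output values into probabilities that sum to 1. A minimal sketch of that function in plain Python (the input values are made up, and this is only an illustration of the math, not how TFLearn computes it on tensors):

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing and does not change the result
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two made-up output values: [negative score, positive score]
probs = softmax([2.0, -1.0])
print(probs)
```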

## Initializing the model

In [82]:
model = build_model()

## Training the network

In [83]:
# Training
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=100)

Training Step: 15899  | total loss: 0.23645 | time: 7.796s
| SGD | epoch: 100 | loss: 0.23645 - acc: 0.8897 -- iter: 20224/20250
Training Step: 15900  | total loss: 0.22994 | time: 8.856s
| SGD | epoch: 100 | loss: 0.22994 - acc: 0.8953 | val_loss: 0.38010 - val_acc: 0.8582 -- iter: 20250/20250
--
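The step counter in the log follows from the batch size. `validation_set=0.1` holds out part of the training rows, which is consistent with the 20250 samples shown in the log, and each epoch takes one step per batch of 128. A quick check of the arithmetic, with the numbers read off the log above:

```python
import math

train_samples = 20250   # from "iter: 20250/20250" in the log
batch_size = 128        # passed to model.fit
n_epoch = 100           # passed to model.fit

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * n_epoch

print(steps_per_epoch, total_steps)   # 159 15900
```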


## Testing

In [84]:
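# Column 0 of both arrays is the negative class, so this compares "predicted negative" with "labeled negative"
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)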
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.8644
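The cell above thresholds the probabilities in column 0 and compares them with the labels in column 0. With two classes, that amounts to checking whether the most probable class matches the labeled class. A sketch of that equivalent argmax formulation on made-up predictions (the helper `accuracy` is hypothetical):

```python
def accuracy(probs, one_hot_labels):
    """Share of rows where the most probable class equals the labeled class."""
    correct = 0
    for p, y in zip(probs, one_hot_labels):
        predicted = p.index(max(p))   # argmax of the predicted probabilities
        actual = y.index(max(y))      # position of the 1 in the one-hot label
        if predicted == actual:
            correct += 1
    return correct / len(probs)

# Made-up model outputs and labels: [P(negative), P(positive)]
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
toy_labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(accuracy(toy_probs, toy_labels))
```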


## Trying out some sentences

In [85]:
# Helper function that uses your model to predict sentiment
def test_sentence(sentence):
    positive_prob = model.predict([text_to_vector(sentence.lower())])[0][1]
    print('Sentence: {}'.format(sentence))
    print('P(positive) = {:.3f} :'.format(positive_prob), 
          'Positive' if positive_prob > 0.5 else 'Negative')

In [86]:
sentence = "Moonlight is by far the best movie of 2016."
test_sentence(sentence)

sentence = "It's amazing anyone could be talented enough to make something this spectacularly awful"
test_sentence(sentence)

Sentence: Moonlight is by far the best movie of 2016.
P(positive) = 0.952 : Positive
Sentence: It's amazing anyone could be talented enough to make something this spectacularly awful
P(positive) = 0.002 : Negative
