# Sentiment Analysis on Amazon Reviews

## Nick Koutroumpinis, ML|mind Software Development

In this project we perform sentiment analysis on a set of Amazon reviews. Our goal is to predict whether a review is positive or negative. We'll use a Word2Vec-style embedding layer feeding a recurrent (LSTM) neural network to do this.

Let's start by preprocessing the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# dependencies go first, of course.

In [2]:
data = pd.read_csv('Reviews.csv')
# let's take a look at our data
data[0:5]


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [3]:
len(data)

568454

Training a recurrent neural network on all 568,454 reviews is too much for a desktop computer, so we'll use a much smaller subset of the dataset.

In [4]:
data = data[0:50000]

The features we are going to use are "Text" and "Score". First, we map scores in the range [3, 5] to 0 (positive) and scores below 3 (that is, 1 or 2) to 1 (negative), which turns the task into binary classification. In other words, we encode our labels.

In [5]:
def score_proc(data):
    
    labels = []
    for i, score in data['Score'].items():  # .iteritems() was removed in pandas 2.0
        if score == 3 or score == 4 or score == 5:
            labels.append(0)
        else:
            labels.append(1)
    return labels

labels = score_proc(data)
labels = np.array(labels) # transform into a numpy array for later use
print(labels[0:5])
data['Score'][0:5]


[0 1 0 1 0]


0    5
1    1
2    4
3    2
4    5
Name: Score, dtype: int64
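
The same mapping can also be written without an explicit loop. The following is a minimal sketch, assuming a small hand-made `scores` series in place of `data['Score']`; `encode_labels` is a hypothetical helper name that does not appear in the notebook.

```python
import numpy as np
import pandas as pd

def encode_labels(scores):
    """Map scores of 3 or more to 0 (positive) and lower scores to 1 (negative)."""
    return np.where(scores >= 3, 0, 1)

# Toy stand-in for data['Score'] (the first five scores shown above)
scores = pd.Series([5, 1, 4, 2, 5])
toy_labels = encode_labels(scores)
print(toy_labels)  # [0 1 0 1 0]
```

`np.where` already returns a NumPy array, so no separate conversion step is needed.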

The labels line up with the first five scores as intended, so let's continue with the text preprocessing.

In [10]:
text = data['Text'].values.tolist()  # one string per review

def text_proc(text):
    # Keep each review as its own string so it can be encoded separately later.
    # Tokens are split on whitespace only, so punctuation stays attached
    # (e.g. 'quality.'), as in the output below.
    reviews = text

    all_text = ' '.join(reviews)
    words = all_text.split()

    return all_text, words, reviews

all_text, words, reviews = text_proc(text)
print(all_text[:300])
print(words[:30])

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most. Product arrived labeled as Jumbo Sal
['I', 'have', 'bought', 'several', 'of', 'the', 'Vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality.', 'The', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a']


### Why did we do this?

We did this because we have to build an embedding lookup table in order to feed this data into our neural network. That means we have to transform each word into a number.

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

In [11]:
from collections import Counter
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)

vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

reviews_ints = []
for each in reviews:
    reviews_ints.append([vocab_to_int[word] for word in each.split()])
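
To see what this mapping produces, here is a small self-contained sketch on a toy corpus. The two `toy_reviews` strings are made up for illustration; the steps mirror the cell above.

```python
from collections import Counter

# Toy corpus standing in for the Amazon reviews
toy_reviews = ["great taffy great price", "not great"]
toy_words = " ".join(toy_reviews).split()

# The most frequent word gets the smallest integer; 0 stays free for padding
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Encode every review as a list of integers
toy_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]
print(toy_vocab_to_int)
print(toy_ints)
```

Because "great" is the most frequent word it is mapped to 1, and no word is ever mapped to 0.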

In [12]:
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Minimum review length: {}".format(min(review_lens)))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Minimum review length: 6
Maximum review length: 1751


It seems we have no zero-length reviews. The longest review, however, has 1751 words, which would mean far too many time steps for our RNN, so we'll cap the review length later.

In [13]:
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zero_idx)

50000

Let's make it more general, so that the code also works when some reviews are empty:

In [14]:
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
labels = np.array([labels[ii] for ii in non_zero_idx])

Now we create an array `features` that contains the data we'll pass to the network. The data comes from `reviews_ints`, since we want to feed integers to the network. Each row should be 200 elements long. Reviews shorter than 200 words are left-padded with 0s: if the review is ['best', 'cheese', 'ever'], that is [117, 18, 128] as integers, the row will look like [0, 0, 0, ..., 0, 117, 18, 128]. For reviews longer than 200 words, we use only the first 200 words as the feature vector.
This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.



In [15]:
seq_len = 200
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]

In [16]:
features[:10,:150]

array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   7,   4, 930],
       ..., 
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]])
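
The padding and truncation rule is easier to check on a tiny example. Below is a minimal sketch with a sequence length of 5 instead of 200; `pad_features` is a hypothetical helper name, and the guard for empty rows is an addition that the cell above does not need because empty reviews were already filtered out.

```python
import numpy as np

def pad_features(reviews_ints, seq_len):
    """Left-pad short rows with 0s and keep only the first seq_len items of long rows."""
    features = np.zeros((len(reviews_ints), seq_len), dtype=int)
    for i, row in enumerate(reviews_ints):
        if len(row) == 0:
            continue  # an empty review stays all zeros
        trimmed = row[:seq_len]
        features[i, -len(trimmed):] = trimmed
    return features

toy = [[117, 18, 128], [1, 2, 3, 4, 5, 6, 7]]
padded = pad_features(toy, seq_len=5)
print(padded)
# [[  0   0 117  18 128]
#  [  1   2   3   4   5]]
```

The short review ends up right-aligned behind zeros, while the long one loses its last two tokens.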

## Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.

In [17]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(40000, 200) 
Validation set: 	(5000, 200) 
Test set: 		(5000, 200)


## Build the graph

Here, we'll build the graph. First up, defining the hyperparameters.

* `lstm_size`: Number of units in the hidden layers in the LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `batch_size`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `learning_rate`: Learning rate

In [18]:
lstm_size = 256
lstm_layers = 1
batch_size = 500
learning_rate = 0.001

For the network itself, we'll be passing in our 200 element long review vectors. Each batch will be batch_size vectors. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

In [19]:
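import tensorflow as tf  # TensorFlow 1.x graph API (tf.placeholder, tf.contrib)

# +1 because the word indices start at 1 and index 0 is reserved for padding
n_words = len(vocab_to_int) + 1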

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
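
The keep probability controls dropout: each unit is kept with probability `keep_prob` and the surviving units are scaled up by `1/keep_prob`, which is how `tf.nn.dropout` behaves. A minimal NumPy sketch of that idea follows; the `dropout` function and the toy activations are illustrative only and are not part of the graph.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale the survivors by 1/keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((4, 5))
dropped = dropout(acts, keep_prob=0.5, rng=rng)  # training: roughly half the units are zeroed
kept = dropout(acts, keep_prob=1.0, rng=rng)     # evaluation: nothing is dropped
print(dropped)
print(kept)
```

This is why the training loop feeds a keep probability below 1 while the validation pass feeds 1.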

### Embedding

Now we'll add an embedding layer. We need to do this because there are 74000 words in our vocabulary. It is massively inefficient to one-hot encode our classes here. Instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table.

In [20]:
# Size of the embedding vectors (number of units in the embedding layer)
embed_size = 300 

with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
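
To illustrate why a lookup replaces one-hot encoding, the sketch below uses plain NumPy with toy sizes (6 words, 3 dimensions) that are assumptions for the example. It shows that indexing rows of the embedding matrix gives the same result as multiplying one-hot vectors by that matrix, without ever building the large one-hot tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3          # toy sizes, far smaller than the real vocabulary
embedding_matrix = rng.uniform(-1, 1, size=(vocab_size, embed_dim))

word_ids = np.array([[0, 2, 5],       # a batch of two "reviews", three tokens each
                     [4, 4, 1]])

# Lookup: index rows of the matrix directly
looked_up = embedding_matrix[word_ids]     # shape (2, 3, 3)

# Equivalent but wasteful: one-hot encode, then multiply
one_hot = np.eye(vocab_size)[word_ids]     # shape (2, 3, 6)
via_matmul = one_hot @ embedding_matrix    # shape (2, 3, 3)

print(looked_up.shape, np.allclose(looked_up, via_matmul))  # (2, 3, 3) True
```

With a vocabulary of tens of thousands of words, the one-hot tensor would be almost entirely zeros, which is the inefficiency the embedding layer avoids.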

### LSTM cell


Next, we'll create our LSTM cells to use in the recurrent network ([TensorFlow documentation](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn)). Here we are just defining what the cells look like. This isn't actually building the graph, just defining the type of cells we want in our graph.


In [21]:
with graph.as_default():
    def build_cell():
        # basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Adding dropout to the cell
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

    # Stack up multiple LSTM layers, for deep learning.
    # Each layer needs its own cell object.
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])

    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)
    

### RNN forward pass


Now we need to actually run the data through the RNN nodes. We can use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to do this. We'd pass in the RNN cell we created (our multiple layered LSTM `cell` for instance), and the inputs to the network.

```
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
```

Above I created an initial state, `initial_state`, to pass to the RNN. This is the cell state that is passed between the hidden layers in successive time steps. `tf.nn.dynamic_rnn` takes care of most of the work for us. We pass in our cell and the input to the cell, then it does the unrolling and everything else for us. It returns outputs for each time step and the final_state of the hidden layer.


In [22]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)
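
What "unrolling" means can be sketched in a few lines of NumPy. Note the assumptions: this uses a plain tanh RNN rather than an LSTM, and `simple_rnn` with its weight names is hypothetical, so it only illustrates the shape of the computation that `tf.nn.dynamic_rnn` performs, not its implementation.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b, h0):
    """Unroll a tanh RNN over time. inputs has shape (batch, steps, features)."""
    h = h0
    outputs = []
    for t in range(inputs.shape[1]):
        # The state from the previous step feeds into the current step
        h = np.tanh(inputs[:, t] @ W_x + h @ W_h + b)
        outputs.append(h)
    # outputs: (batch, steps, hidden); final state: (batch, hidden)
    return np.stack(outputs, axis=1), h

rng = np.random.default_rng(0)
batch, steps, features, hidden = 2, 4, 3, 5   # toy sizes
x = rng.normal(size=(batch, steps, features))
W_x = rng.normal(size=(features, hidden))
W_h = rng.normal(size=(hidden, hidden))
b = np.zeros(hidden)
h0 = np.zeros((batch, hidden))                # all-zero initial state

rnn_outputs, rnn_final = simple_rnn(x, W_x, W_h, b, h0)
print(rnn_outputs.shape, rnn_final.shape)     # (2, 4, 5) (2, 5)
print(np.allclose(rnn_outputs[:, -1], rnn_final))  # True
```

For this simple cell the final state equals the last output. An LSTM additionally carries a separate cell state, so its final state is a pair.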

### Output

We only care about the final output, since we'll be using that as our sentiment prediction. So we need to grab the last output with `outputs[:, -1]`, then calculate the cost from that and `labels_`.

In [23]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

### Validation accuracy

Here we can add a few nodes to calculate the accuracy which we'll use in the validation pass.

In [24]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
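
The cost and accuracy nodes boil down to a little arithmetic. The following NumPy sketch reproduces it on four made-up values; `logits` stands in for the pre-sigmoid output of the fully connected layer, and all names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy pre-activation values for four reviews and their labels, both shaped (batch, 1)
logits = np.array([[2.0], [-1.0], [0.5], [-3.0]])
toy_labels = np.array([[1], [0], [0], [0]])

toy_predictions = sigmoid(logits)                         # values in (0, 1)
toy_cost = np.mean((toy_labels - toy_predictions) ** 2)   # mean squared error
toy_correct = np.round(toy_predictions).astype(int) == toy_labels
toy_accuracy = np.mean(toy_correct.astype(float))

print(toy_accuracy)  # 0.75
```

Rounding the sigmoid output at 0.5 turns it into a hard 0/1 prediction; the third review is rounded to 1 while its label is 0, which gives three correct out of four.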

### Batching

This is a simple function for returning batches from our data. First it removes data such that we only have full batches. Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`.

In [25]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
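
A quick usage sketch on toy arrays shows the effect of dropping the incomplete batch. The generator is repeated here only so the snippet runs on its own; the toy arrays are made up for the example.

```python
import numpy as np

def get_batches(x, y, batch_size=100):
    # Same generator as above, repeated so this snippet is self-contained
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

toy_x = np.arange(20).reshape(10, 2)   # 10 rows of features
toy_y = np.arange(10)                  # 10 labels

batch_shapes = [(bx.shape, by.shape) for bx, by in get_batches(toy_x, toy_y, batch_size=4)]
print(batch_shapes)  # [((4, 2), (4,)), ((4, 2), (4,))]
```

With 10 rows and a batch size of 4, only two full batches are produced and the last two rows are discarded. Full batches matter here because the initial state is created for exactly `batch_size` rows.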

## Training

Below is the typical training code. 

In [26]:
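import os

epochs = 4

# Make sure the checkpoint directory exists before saving
os.makedirs("checkpoints", exist_ok=True)

with graph.as_default():
    saver = tf.train.Saver()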

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/4 Iteration: 5 Train loss: 0.138
Epoch: 0/4 Iteration: 10 Train loss: 0.115
Epoch: 0/4 Iteration: 15 Train loss: 0.126
Epoch: 0/4 Iteration: 20 Train loss: 0.135
Epoch: 0/4 Iteration: 25 Train loss: 0.177
Val acc: 0.859
Epoch: 0/4 Iteration: 30 Train loss: 0.112
Epoch: 0/4 Iteration: 35 Train loss: 0.116
Epoch: 0/4 Iteration: 40 Train loss: 0.181
Epoch: 0/4 Iteration: 45 Train loss: 0.219
Epoch: 0/4 Iteration: 50 Train loss: 0.126
Val acc: 0.860
Epoch: 0/4 Iteration: 55 Train loss: 0.097
Epoch: 0/4 Iteration: 60 Train loss: 0.112
Epoch: 0/4 Iteration: 65 Train loss: 0.145
Epoch: 0/4 Iteration: 70 Train loss: 0.140
Epoch: 0/4 Iteration: 75 Train loss: 0.101
Val acc: 0.855
Epoch: 0/4 Iteration: 80 Train loss: 0.066
Epoch: 1/4 Iteration: 85 Train loss: 0.109
Epoch: 1/4 Iteration: 90 Train loss: 0.098
Epoch: 1/4 Iteration: 95 Train loss: 0.093
Epoch: 1/4 Iteration: 100 Train loss: 0.099
Val acc: 0.870
Epoch: 1/4 Iteration: 105 Train loss: 0.103
Epoch: 1/4 Iteration: 110 Train loss

## Conclusion

We got 90%+ accuracy on the validation set, which is not bad for just 4 epochs. In a GPU-based environment where we could run more epochs, we might reach 95%+ accuracy. Still, 90% is a solid result for predicting sentiment on a large dataset like this.