<h1 style="text-align:center">Deep Learning   </h1>

<h1 style="text-align:center"> Sentiment Analysis with Recurrent Neural Networks</h1>

The aim of this session is to practice with VanillaRNN and Gated Recurrent Units (GRU). Each group should fill and run appropriate notebook cells. 

# Section 1: Sentiment Analysis with a Vanilla RNN

You will work on a corpus of 3,000 user comments taken from IMDb (1,000), Amazon (1,000) and Yelp (1,000). These comments are split into two categories: positive comments (denoted by "1") and negative comments (denoted by "0"). For each website, 500 comments are positive and 500 comments are negative. This corpus has been created for the paper <i>From Group to Individual Labels using Deep Features</i> by Kotzias <i>et al</i> (KDD '15 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 597-606, Sydney, NSW, Australia — August 10 - 13, 2015, ACM New York, NY, USA ©2015  ISBN: 978-1-4503-3664-2 doi>10.1145/2783258.2783380).

In this lab, we split this dataset into a training set of 2,520 comments (420 positive comments and 420 negative comments from each website), a validation set of 240 comments (40 positive comments and 40 negative comments from each website) and a test set of 240 comments (40 positive comments and 40 negative comments from each website).

Your goal will be to classify automatically these sentences by training a Vanilla RNN and then a GRU. Please note that we use the word2vec method to convert words into vectors (Embedding of 300 dimensions in this lab): these vectors are designed so that they reflect the semantic and the syntactic functions of words. You can read more about word2vec in the paper <i>Distributed representations of words and phrases and their compositionality</i> by Mikolov <i>et al.</i> (NIPS'13 Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, Pages 3111-3119, Lake Tahoe, Nevada — December 05 - 10, 2013).

First of all, please run the following cell.

In [1]:
# Imports
import tensorflow as tf
import numpy as np
import utils

# Parameters
epsilon = 1e-10
max_l = 32 # Max length of sentences

train, val, test, word2vec = utils.load_data()
data = utils.Dataset(train, val, test, word2vec)

  from ._conv import register_converters as _register_converters


#### look into data

In [2]:
print(train[1])
print("the dimension of word2vec:",word2vec['is'].shape)

(['great', 'for', 'the', 'jawbone'], 1.0)
the dimension of word2vec: (300,)


In the following cell, we define a VanillaRNN class. Please read its code carefully before running the cell because you will need to implement a similar class for the GRU.

If our sentence is represented by the sequence $(x_1, ..., x_L)$, the hidden states $h_t$ of the Vanilla RNN are defined as

<div align="center">$h_0 = 0$</div>
<div align="center">$h_{t+1} = f(W_h h_t + W_x x_{t+1} + b)$</div>

where $W_h$, $W_x$ and $b$ are trainable parameters and $f$ is an activation function.
<img src="images/RNN.png">

In [3]:
class VanillaRNN:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "vanilla_rnn") + "/"
        self._candidate_kernel = tf.get_variable(self._name + "candidate/weights",
                                                 shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias = tf.get_variable(self._name + "candidate/bias", shape=[self._hidden_states])
        tf.summary.histogram("weights", self._candidate_kernel)
        tf.summary.histogram("biases", self._candidate_bias)

    def state_size(self):
        return self._hidden_states
    
    # generate zero initial hidden state
    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    # run over one time step
    def __call__(self, inputs, state):
        candidate = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel)
        candidate = tf.nn.bias_add(candidate, self._candidate_bias)
        new_h = self._activation(candidate)
        tf.summary.histogram("activations", candidate)
        tf.summary.histogram("hiddenstate", new_h)
        return new_h

Then we define our model. Please read the code of the process_sequence() function to understand the utility of the MaskData placeholder. If $h_L$ is the last hidden state of the Vanilla RNN, then we define our final prediction $p$ as

<div align="center">$p = \sigma (W_{pred} h_L + b_{pred})$</div>

where $W_{pred}$ and $b_{pred}$ are trainable parameters and $\sigma$ denotes the sigmoid function.

In [6]:
tf.reset_default_graph()
# Parameters
learning_rate = 0.001
training_epochs = 30
batch_size = 128
hidden_states = 50

# graph
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/vanilla.ckpt"
# tf Graph Input:  sentiment analysis data

# Sentences are padded with zero vectors
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks: necessary as we have different sentence lengths
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# positive (1) or negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

with tf.name_scope('RNN'):
    # we define our VanillaRNN cell
    vanilla = VanillaRNN(300, hidden_states)
    # we retrieve its last output
    vanilla_output = utils.process_sequence(vanilla, x, m)

with tf.name_scope('Predict'):
    W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
    b = tf.Variable(tf.zeros([1]), name='Bias')
    # we make the final prediction
    pred = tf.nn.sigmoid(tf.matmul(vanilla_output, W) + b)
    tf.summary.histogram("weights", W)
    tf.summary.histogram("biases", b)



Eventually, we train our model using a cross-entropy loss and the Adam optimizer. At each epoch we check the validation accuracy, and save the model if that accuracy increased. At the end, we load the best model on validation, and print its accuracy on the test set.

We test our model using a $\tanh$ activation function.

In [7]:
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()
tf.summary.scalar("loss", cost)
tf.summary.scalar("accuracy", acc)
merged_summary_op = tf.summary.merge_all()



with tf.Session() as sess:
    sess.run(init)
    summary_writer = tf.summary.FileWriter('logfiles/RNN', graph=tf.get_default_graph())
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            summary_writer.add_summary(summary, epoch * total_batch + i)
            
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

Training started
Accuracy on validation: 0.5
        Model saved in file: models/vanilla.ckpt
Epoch:  01   =====> Loss= 0.694689268
Accuracy on validation: 0.6666667
        Model saved in file: models/vanilla.ckpt
Epoch:  02   =====> Loss= 0.690510916
Accuracy on validation: 0.7416667
        Model saved in file: models/vanilla.ckpt
Epoch:  03   =====> Loss= 0.662441825
Accuracy on validation: 0.69166666
Epoch:  04   =====> Loss= 0.582807086
Accuracy on validation: 0.77916664
        Model saved in file: models/vanilla.ckpt
Epoch:  05   =====> Loss= 0.528961472
Accuracy on validation: 0.79583335
        Model saved in file: models/vanilla.ckpt
Epoch:  06   =====> Loss= 0.471243090
Accuracy on validation: 0.8
        Model saved in file: models/vanilla.ckpt
Epoch:  07   =====> Loss= 0.430939939
Accuracy on validation: 0.8208333
        Model saved in file: models/vanilla.ckpt
Epoch:  08   =====> Loss= 0.417647840
Accuracy on validation: 0.8208333
Epoch:  09   =====> Loss= 0.395544462
A

Did you understand everything? If so, you can move towards Section 2.

# Section 2: GRU

<img src="images/GRU.png">

<b>Question 1</b> - Recall the formulas defining the hidden states of a GRU.<br>
<div class='alert alert-success'>
<ul>
<li><b>Remember gate:</b><br>
$ r^t = sigmoid(W_r \cdot [h^{t-1}, x^t] + b_r) $</li><br>
<li><b>Input:</b><br>
$ h^{'t} = tanh(W_i \cdot [r^t \otimes h^{t-1}, x^t] + b_i) $</li><br>
<li><strong>Update gate:</strong><br>
$ z^t = sigmoid(W_z \cdot [h^{t-1}, x^t] + b_z) $</li>
<br>
$ h^{t} = z^t \otimes h^{'t} + (1-z^t) \otimes h^{t-1} $
</ul>
<img src="https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/11/GRU.png?x31195" width="400">
</div>

<b>Question 2</b> - Define a GRU similar to the Vanilla RNN that we defined in Section 1.

In [8]:
class GRU:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "gru") + "/"
        
        ############ CODE NEEDED ############
        # Define trainable parameters here  #
        #####################################
        # R
        self._candidate_kernel_R = tf.get_variable(self._name + "candidate/weightsR",
                                                 shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_R = tf.get_variable(self._name + "candidate/biasR", shape=[self._hidden_states])
        
        # I
        self._candidate_kernel_I = tf.get_variable(self._name + "candidate/weightsI",
                                                 shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_I = tf.get_variable(self._name + "candidate/biasI", shape=[self._hidden_states])
        
        # Z
        self._candidate_kernel_Z = tf.get_variable(self._name + "candidate/weightsZ",
                                                 shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_Z = tf.get_variable(self._name + "candidate/biasZ", shape=[self._hidden_states])
        
        tf.summary.histogram("R_W", self._candidate_kernel_R)
        tf.summary.histogram("R_b", self._candidate_bias_R)
        tf.summary.histogram("I_W",self._candidate_kernel_I)
        tf.summary.histogram("I_b", self._candidate_bias_I)
        tf.summary.histogram("Z_W",self._candidate_kernel_Z)
        tf.summary.histogram("Z_b", self._candidate_bias_Z)        



    def state_size(self):
        return self._hidden_states

    def output_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    
    def __call__(self, inputs, state):
        
        # Remember
        candidate_R = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_R)
        candidate_R = tf.nn.bias_add(candidate_R, self._candidate_bias_R)
        r = tf.sigmoid(candidate_R)
        
        # Input
        candidate_I = tf.matmul(tf.concat([inputs, tf.multiply(r, state)], 1), self._candidate_kernel_I)
        candidate_I = tf.nn.bias_add(candidate_I, self._candidate_bias_I)
        i = tf.tanh(candidate_I)
        
        # Update
        candidate_Z = tf.matmul(tf.concat([inputs, tf.multiply(r, state)], 1), self._candidate_kernel_Z)
        candidate_Z = tf.nn.bias_add(candidate_Z, self._candidate_bias_Z)
        z = tf.sigmoid(candidate_Z)
        
        new_h = tf.multiply(z, i) + tf.multiply(1-z, state)
        return new_h

<b>Question 3</b> - Train that GRU with a $tanh$ activation function and print its accuracy on the test set.

In [10]:
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/gru.ckpt"
board_path = "logfiles/GRU"

# tf Graph Input:  sentiment analysis data
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# Positive (1) or Negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

with tf.name_scope('RUN_GRU'):
    gru = GRU(300, hidden_states)
    gru_output = utils.process_sequence(gru, x, m)
with tf.name_scope('Projection'):
    W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
    b = tf.Variable(tf.zeros([1]), name='Bias')
    pred = tf.nn.sigmoid(tf.matmul(gru_output, W) + b)
    tf.summary.histogram("P_W",W)
    tf.summary.histogram("P_b", b)

with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))
    
tf.summary.scalar("loss", cost)
tf.summary.scalar("accuracy", acc)
merged_summary_op = tf.summary.merge_all()
    
    
# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    summary_writer = tf.summary.FileWriter(board_path, graph=tf.get_default_graph())
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            summary_writer.add_summary(summary, epoch * total_batch + i)
            
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

Training started
Accuracy on validation: 0.5833333
        Model saved in file: models/gru.ckpt
Epoch:  01   =====> Loss= 0.691213316
Accuracy on validation: 0.75
        Model saved in file: models/gru.ckpt
Epoch:  02   =====> Loss= 0.663495892
Accuracy on validation: 0.78333336
        Model saved in file: models/gru.ckpt
Epoch:  03   =====> Loss= 0.545318318
Accuracy on validation: 0.81666666
        Model saved in file: models/gru.ckpt
Epoch:  04   =====> Loss= 0.446474080
Accuracy on validation: 0.825
        Model saved in file: models/gru.ckpt
Epoch:  05   =====> Loss= 0.399288366
Accuracy on validation: 0.825
Epoch:  06   =====> Loss= 0.378474634
Accuracy on validation: 0.825
Epoch:  07   =====> Loss= 0.356144936
Accuracy on validation: 0.84166664
        Model saved in file: models/gru.ckpt
Epoch:  08   =====> Loss= 0.339461015
Accuracy on validation: 0.84583336
        Model saved in file: models/gru.ckpt
Epoch:  09   =====> Loss= 0.323130208
Accuracy on validation: 0.8541667

<b>Question 4</b> - Comment on your findings:
<div class='alert alert-success'>
<b>GRU</b> (together with LSTM) faces the problem of the <b>vanishing</b> (or exploding) gradient. Vanilla RNN indeed, especially for the long chain of layers which is composed of (thinking of it in a "unrolled" fashion), has the big issue of having a very small gradient during the backpropagation phase: this issue prevents the weights of the network to change their value.<br>
The idea of LSTM is that the network can keep <b>memory</b> (or in the case of the GRU, the <b>hidden state</b>) for a long term in order to avoid the gradient to vanish or explode.<br>
<img src="https://qph.fs.quoracdn.net/main-qimg-d4725497028c5a60241e1524c32f60de">
<br>
While the LSTM uses memory units, the GRU exposes always the entire hidden state (<a href="https://arxiv.org/pdf/1502.02367.pdf">source</a>). Of course, even if accuracy is generally in line with the LSTM, the GRU is more efficient from the computational point of view, because of the lack of complex memory structures.<br>
<br>
In our case the performances are better!<br>
<ul>
    <li>Vanilla RNN: <b>84.6%</b></li>
    <li>GRU RNN: <b>86.7%</b></li>
</ul>
</div>