<h1 style="text-align:center">Deep Learning   </h1>
<h1 style="text-align:center"> Lab Session 3 - 1.5 Hours </h1>
<h1 style="text-align:center"> Sentiment Analysis with Recurrent Neural Networks</h1>

<b> Group name:</b>Chloé Brochet & Alix Tarentelli (DeepLearn12)
 
 
The aim of this session is to practice with VanillaRNN and Gated Recurrent Units (GRU). Each group should fill and run appropriate notebook cells. 


Generate your final report in HTML and upload it (along with any necessary images files using a zip archive) on the submission website http://bigfoot-m1.eurecom.fr/teachingsub/login (using your deeplearnXX/password). Do not forget to run all your cells before generating your final report and do not forget to include the names of all participants in the group. The lab session should be completed and submitted by June 15th 2018 (23:59:59 CET).

# Section 1: Sentiment Analysis with a Vanilla RNN

In this part, you have no code to write. However you should spend some time stydying the code provided, to fully understand how the Vanilla RNN is implemented: you will implement a GRU in a similar way in Section 2.

You will work on a corpus of 3,000 user comments taken from IMDb (1,000), Amazon (1,000) and Yelp (1,000). These comments are split into two categories: positive comments (denoted by "1") and negative comments (denoted by "0"). For each website, 500 comments are positive and 500 comments are negative. This corpus has been created for the paper <i>From Group to Individual Labels using Deep Features</i> by Kotzias <i>et al</i> (KDD '15 Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 597-606, Sydney, NSW, Australia — August 10 - 13, 2015, ACM New York, NY, USA ©2015  ISBN: 978-1-4503-3664-2 doi>10.1145/2783258.2783380).

In this lab, we split this dataset into a training set of 2,520 comments (420 positive comments and 420 negative comments from each website), a validation set of 240 comments (40 positive comments and 40 negative comments from each website) and a test set of 240 comments (40 positive comments and 40 negative comments from each website).

Your goal will be to classify automatically these sentences by training a Vanilla RNN and then a GRU. Please note that we use the word2vec method to convert words into vectors (Embedding of 300 dimensions in this lab): these vectors are designed so that they reflect the semantic and the syntactic functions of words. You can read more about word2vec in the paper <i>Distributed representations of words and phrases and their compositionality</i> by Mikolov <i>et al.</i> (NIPS'13 Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, Pages 3111-3119, Lake Tahoe, Nevada — December 05 - 10, 2013).

First of all, please run the following cell.

In [4]:
# Imports
import tensorflow as tf
import numpy as np
import utils

# Parameters
epsilon = 1e-10
max_l = 32 # Max length of sentences

train, val, test, word2vec = utils.load_data()
data = utils.Dataset(train, val, test, word2vec)

In the following cell, we define a VanillaRNN class. Please read its code carefully before running the cell because you will need to implement a similar class for the GRU.

If our sentence is represented by the sequence $(x_1, ..., x_L)$, the hidden states $h_t$ of the Vanilla RNN are defined as

<div align="center">$h_0 = 0$</div>
<div align="center">$h_{t+1} = f(W_h h_t + W_x x_{t+1} + b)$</div>

where $W_h$, $W_x$ and $b$ are trainable parameters and $f$ is an activation function.

In [5]:
class VanillaRNN:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "vanilla_rnn") + "/"
        self._candidate_kernel = tf.get_variable(self._name + "candidate/weights",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias = tf.get_variable(self._name + "candidate/bias", shape=[self._hidden_states])
        

    def state_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):

        candidate = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel)
        candidate = tf.nn.bias_add(candidate, self._candidate_bias)
        new_h = self._activation(candidate)
        return new_h

<b>Parameters</b>
* Learning rate: 0.001
* Training epochs: 30
* Batch size: 128
* Hidden states: 50

In [6]:
# Parameters
learning_rate = 0.001
training_epochs = 30
batch_size = 128
hidden_states = 50

Then we define our model. Please read the code of the process_sequence() function to understand the utility of the MaskData placeholder. If $h_L$ is the last hidden state of the Vanilla RNN, then we define our final prediction $p$ as

<div align="center">$p = \sigma (W_{pred} h_L + b_{pred})$</div>

where $W_{pred}$ and $b_{pred}$ are trainable parameters and $\sigma$ denotes the sigmoid function.

In [7]:
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/vanilla.ckpt"
# tf Graph Input:  sentiment analysis data
# Sentences are padded with zero vectors
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')

# masks: necessary as we have different sentence lengths
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# positive (1) or negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

# we define our VanillaRNN cell
vanilla = VanillaRNN(300, hidden_states)

# we retrieve its last output
vanilla_output = utils.process_sequence(vanilla, x, m)

W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
# we make the final prediction
pred = tf.nn.sigmoid(tf.matmul(vanilla_output, W) + b)

Eventually, we train our model using a cross-entropy loss and the Adam optimizer. At each epoch we check the validation accuracy, and save the model if that accuracy increased. At the end, we load the best model on validation, and print its accuracy on the test set.

We test our model using a $\tanh$ activation function.

In [18]:
from time import time

with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

t0 = time()
# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

t1 = time()
print("Time : ", t1-t0)

Training started
Accuracy on validation: 0.59583336
        Model saved in file: models/gru.ckpt
Epoch:  01   =====> Loss= 0.691898437
Accuracy on validation: 0.6625
        Model saved in file: models/gru.ckpt
Epoch:  02   =====> Loss= 0.670233720
Accuracy on validation: 0.81666666
        Model saved in file: models/gru.ckpt
Epoch:  03   =====> Loss= 0.590490702
Accuracy on validation: 0.8125
Epoch:  04   =====> Loss= 0.483606856
Accuracy on validation: 0.8208333
        Model saved in file: models/gru.ckpt
Epoch:  05   =====> Loss= 0.431560894
Accuracy on validation: 0.8208333
Epoch:  06   =====> Loss= 0.381465506
Accuracy on validation: 0.825
        Model saved in file: models/gru.ckpt
Epoch:  07   =====> Loss= 0.364984484
Accuracy on validation: 0.84166664
        Model saved in file: models/gru.ckpt
Epoch:  08   =====> Loss= 0.355273253
Accuracy on validation: 0.85
        Model saved in file: models/gru.ckpt
Epoch:  09   =====> Loss= 0.327987916
Accuracy on validation: 0.841666

Did you understand everything? If so, you can move towards Section 2.

# Section 2: Your turn!

<b>Question 1</b> - Recall the formulas defining the hidden states of a GRU.

- **Remember**  $r^t\ =\ \sigma(W_{r}\cdot [h^{t-1},\ x^t] +b_{r})$
- **Input**   $h^{'t}\ = \ \tanh(W_{i}\cdot [r^t\otimes h^{t-1},\ x^t]\ + \ b_{i})$
- **Update Gate** $ z^t\ = \ \sigma(W_{z}\cdot[h^{t-1}, \ x^t]\ +\ b_{z})$     <br>

Then,   $h^{t+1}\ = \ z^t \otimes h^{'t} \ +(1\ - \ z^{t})\otimes h^t$



<b>Question 2</b> - Define a GRU similar to the Vanilla RNN that we defined in Section 1.

In [14]:
class GRU:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "gru") + "/"
        ## WEIGHTS
        # weights of the remember gate
        self._candidate_Wr = tf.get_variable(self._name + "candidate/weightsWr",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
         # weights of the input gate
        self._candidate_Wi = tf.get_variable(self._name + "candidate/weightsWi",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
         # weights of the update gate
        self._candidate_Wz = tf.get_variable(self._name + "candidate/weightsWz",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        
         ## BIASES
        # biases of the remember gate
        self._candidate_bias_r = tf.get_variable(self._name + "candidate/bias_r", shape=[self._hidden_states])
        # biases of the input gate
        self._candidate_bias_i = tf.get_variable(self._name + "candidate/bias_i", shape=[self._hidden_states])
         # biases of the update gate
        self._candidate_bias_z = tf.get_variable(self._name + "candidate/bias_z", shape=[self._hidden_states])
        
        #####################################

    def state_size(self):
        return self._hidden_states

    def output_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):
        
        #REMEMBER
        candidate_r = tf.matmul(tf.concat([inputs, state], 1), self._candidate_Wr)
        candidate_r = tf.nn.bias_add(candidate_r, self._candidate_bias_r)
        new_r = tf.sigmoid(candidate_r)
        
        #INPUT
        candidate_hh = tf.matmul(tf.concat([inputs, candidate_r * state], 1), self._candidate_Wi)
        candidate_hh= tf.nn.bias_add(candidate_hh, self._candidate_bias_i)
        new_hh = tf.tanh(candidate_hh)
        
        #UPDATE
        candidate_z = tf.matmul(tf.concat([inputs, state], 1), self._candidate_Wz)
        candidate_z= tf.nn.bias_add(candidate_z, self._candidate_bias_z)
        new_z = tf.sigmoid(candidate_z)
        
        # h at t+1
        new_h1 = new_z * new_hh
        new_h2 = 1-new_z
        new_h2 = new_h2 *state
        new_h = new_h1 + new_h2
        
        
    
        return new_h


<b>Question 3</b> - Train that GRU with a $tanh$ activation function and print its accuracy on the test set.

In [15]:
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/gru.ckpt"
# tf Graph Input:  sentiment analysis data
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# Positive (1) or Negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

gru = GRU(300, hidden_states)

gru_output = utils.process_sequence(gru, x, m)

W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
pred = tf.nn.sigmoid(tf.matmul(gru_output, W) + b)

with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))



In [17]:
from time import time 

t0 = time()

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("  Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))
    
t1 = time()
print("Time : ", t1-t0)

Training started
Accuracy on validation: 0.55833334
  Model saved in file: models/gru.ckpt
Epoch:  01   =====> Loss= 0.691573548
Accuracy on validation: 0.675
  Model saved in file: models/gru.ckpt
Epoch:  02   =====> Loss= 0.670791287
Accuracy on validation: 0.8041667
  Model saved in file: models/gru.ckpt
Epoch:  03   =====> Loss= 0.589503840
Accuracy on validation: 0.775
Epoch:  04   =====> Loss= 0.525226904
Accuracy on validation: 0.81666666
  Model saved in file: models/gru.ckpt
Epoch:  05   =====> Loss= 0.455104801
Accuracy on validation: 0.825
  Model saved in file: models/gru.ckpt
Epoch:  06   =====> Loss= 0.406399835
Accuracy on validation: 0.8375
  Model saved in file: models/gru.ckpt
Epoch:  07   =====> Loss= 0.375947331
Accuracy on validation: 0.84583336
  Model saved in file: models/gru.ckpt
Epoch:  08   =====> Loss= 0.360052289
Accuracy on validation: 0.8375
Epoch:  09   =====> Loss= 0.352264996
Accuracy on validation: 0.8541667
  Model saved in file: models/gru.ckpt
Epoc

<b>Question 4</b> - comment on your findings:

A Vanilla Network is a one to one RNN. The hidden state is used to decide whether the information in the previous output (h at t-1) will be discarded or not through a sigmoig function. <br>
On the contrary a GRU RNN is designed to remember things from far in the past and has 3 gates:  
- The remember gate
- The input gate which decides which values chosen by the remember gate will be updated 
- The update gate used to decide on the output data

The Loss went from 0.23 to 0.096 which is a good improvement. <br>
Furthermore, the duration of the training decreased by 16% using GRU. <br>
Yet the accuracy increase is only of 0,04. It is still easy to implement and has its advantages.<br> 
A GRU RNN is a simpler version of a LSTM as it works directly on the hidden states and do not use a memory cell.