<h1 style="text-align:center">Deep Learning   </h1>
<h1 style="text-align:center"> Lab Session 3 - 1.5 Hours </h1>
<h1 style="text-align:center"> Sentiment Analysis with Recurrent Neural Networks</h1>

The aim of this session is to practice with VanillaRNN and Gated Recurrent Units (GRU). Each group should fill and run appropriate notebook cells. 

Follow instructions step by step until the end and submit your complete notebook as an archive (tar -cf groupXnotebook.tar DL_lab3/).

Do not forget to run all your cells before generating your final report and do not forget to include the names of all participants in the group. The lab session should be completed by June 12th 2019 (23:59:59 CET).

# Section 1: Sentiment Analysis with a Vanilla RNN

In this part, you will have no code to write. However you should spend some minutes on it, to understand well how the Vanilla RNN is implemented: you will implement a GRU in a similar way in Section 2.

You will work on a corpus of 3,000 user comments taken from IMDb (1,000), Amazon (1,000) and Yelp (1,000). These comments are split into two categories: positive comments (denoted by "1") and negative comments (denoted by "0"). For each website, 500 comments are positive and 500 comments are negative. This corpus has been created for the paper <i>From Group to Individual Labels using Deep Features</i> by Kotzias <i>et al</i>.

In this lab, we split this dataset into a training set of 2,520 comments (420 positive comments and 420 negative comments from each website), a validation set of 240 comments (40 positive comments and 40 negative comments from each website) and a test set of 240 comments (40 positive comments and 40 negative comments from each website).

Your goal will be to classify automatically these sentences by training a Vanilla RNN and then a GRU. Please note that we use the word2vec method to convert words into vectors: these vectors are designed so that they reflect the semantic and the syntactic functions of words. You can read more about word2vec in the paper <i>Distributed representations of words and phrases and their compositionality</i> by Mikolov <i>et al.</i>

First of all, please run the following cell.

In [1]:
# Imports
import tensorflow as tf
import numpy as np
import utils

# Parameters
epsilon = 1e-10
max_l = 32 # Max length of sentences

train, val, test, word2vec = utils.load_data()
data = utils.Dataset(train, val, test, word2vec)

In the following cell, we define a VanillaRNN class. Please read its code carefully before running the cell because you will need to implement a similar class for the GRU.

If our sentence is represented by the sequence $(x_1, ..., x_L)$, the hidden states $h_t$ of the Vanilla RNN are defined as

<div align="center">$h_0 = 0$</div>
<div align="center">$h_{t+1} = f(W_h h_t + W_x x_{t+1} + b)$</div>

where $W_h$, $W_x$ and $b$ are trainable parameters and $f$ is an activation fucntion.

In [2]:
class VanillaRNN:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "vanilla_rnn") + "/"
        self._candidate_kernel = tf.get_variable(self._name + "candidate/weights",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias = tf.get_variable(self._name + "candidate/bias", shape=[self._hidden_states])

    def state_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state): #cell(input, state) in process_sequence function of utils calls this

        candidate = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel)
        candidate = tf.nn.bias_add(candidate, self._candidate_bias)
        new_h = self._activation(candidate)
        return new_h

<b>Parameters</b>
* Learning rate: 0.001
* Training epochs: 30
* Batch size: 128
* Hidden states: 50

In [3]:
# Parameters
learning_rate = 0.001
training_epochs = 30
batch_size = 128
hidden_states = 50

Then we define our model. Please read the code of the process_sequence() function to understand the utility of the MaskData placeholder. If $h_L$ is the last hidden state of the Vanilla RNN, then we define our final prediction $p$ as

<div align="center">$p = \sigma (W_{pred} h_L + b_{pred})$</div>

where $W_{pred}$ and $b_{pred}$ are trainable parameters and $\sigma$ denotes the sigmoid function.

In [4]:
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/vanilla.ckpt"
# tf Graph Input:  sentiment analysis data
# Sentences are padded with zero vectors
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks: necessary as we have different sentence lengths
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# positive (1) or negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

# we define our VanillaRNN cell
vanilla = VanillaRNN(300, hidden_states)

# we retrieve its last output
vanilla_output = utils.process_sequence(vanilla, x, m)

W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
# we make the final prediction
pred = tf.nn.sigmoid(tf.matmul(vanilla_output, W) + b)

<b>Question 0</b> - Why do we need a MaskData placeholder?

Your answer: 

***The maximum length is 32. In order to get the proper output, we "put" the sentence of 10 words at the 10 last of the 32 elements. Therefore, in this case, the first masks are equal to 0 (we don't do updates), and then, when the sentence begins, the masks are equal to 1 (we do updates of the hidden state). Without the masks, the null/non existent words would influence the output (as it is recurrent), it would be useless noise for the prediction.  ***

Eventually, we train our model using a cross-entropy loss and the Adam optimizer. At each epoch we check the validation accuracy, and save the model if that accuracy increased. At the end, we load the best model on validation, and print its accuracy on the test set.

We test our model using a $\tanh$ activation function.

In [5]:
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            #print(batch_ms[0])
            #print(batch_xs[0])
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

Training started
Accuracy on validation: 0.5
        Model saved in file: models/vanilla.ckpt
Epoch:  01   =====> Loss= 0.694689268
Accuracy on validation: 0.6666667
        Model saved in file: models/vanilla.ckpt
Epoch:  02   =====> Loss= 0.690510916
Accuracy on validation: 0.7416667
        Model saved in file: models/vanilla.ckpt
Epoch:  03   =====> Loss= 0.662441825
Accuracy on validation: 0.69166666
Epoch:  04   =====> Loss= 0.582807086
Accuracy on validation: 0.77916664
        Model saved in file: models/vanilla.ckpt
Epoch:  05   =====> Loss= 0.528961472
Accuracy on validation: 0.79583335
        Model saved in file: models/vanilla.ckpt
Epoch:  06   =====> Loss= 0.471243090
Accuracy on validation: 0.8
        Model saved in file: models/vanilla.ckpt
Epoch:  07   =====> Loss= 0.430939939
Accuracy on validation: 0.8208333
        Model saved in file: models/vanilla.ckpt
Epoch:  08   =====> Loss= 0.417647840
Accuracy on validation: 0.8208333
Epoch:  09   =====> Loss= 0.395544462
A

Did you understand everything? If so, you can move towards Section 2.

# Section 2: Your turn!

<b>Question 1</b> - Recall the formulas defining the hidden states of a GRU.

Your answer: 

> RECAP VANILLA RNN:
If our sentence is represented by the sequence $(x_1, ..., x_L)$, the hidden states $h_t$ of the Vanilla RNN are defined as <div align="center">$h_0 = 0$</div><div align="center">$h_{t+1} = f(W_h h_t + W_x x_{t+1} + b)$</div>


* GRU :
<div align="center">$h_0 = 0$</div>
<div align="center">$\tilde{h}_{t+1} = f(W_{hh}\Gamma_r\times h_{t} + W_{hx}x_{t+1}+b_h)$</div>
<div align="center">$\Gamma_u = \sigma( W_{uh}h_{t} + W_{ux}x_{t+1}+b_u)$</div>
<div align="center">$\Gamma_r = \sigma( W_{rh}h_{t} + W_{rx}x_{t+1}+b_r)$</div>
<div align="center">${h}_{t+1} = \Gamma_u \tilde{h}_{t+1} + (1-\Gamma_u)h_t  $</div>

--->With $\Gamma_u $ the update gate, $\Gamma_r $ the relevance gate, $\tilde{h}_t$ the candidate 






<b>Question 2</b> - Define a GRU similar to the Vanilla RNN that we defined in Section 1.

In [6]:
class GRU:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "gru") + "/"
        ############ CODE NEEDED ############
        # Define trainable parameters here  #
        #####################################
        
        #______________CODE of VANILLA (for comparison)_____________________________
        #self._candidate_kernel = tf.get_variable(self._name + "candidate/weights",
                                                   #shape=[input_size + self._hidden_states, self._hidden_states])
        #self._candidate_bias = tf.get_variable(self._name + "candidate/bias", shape=[self._hidden_states])
        #___________________________________________________________________________
        
        #_________h_tilde update parameters_______________
        self._candidate_kernel_h = tf.get_variable(self._name + "candidate/weights/h",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_h = tf.get_variable(self._name + "candidate/bias/h", shape=[self._hidden_states])
        
        #_________Update Gate parameters_______________
        self._candidate_kernel_u = tf.get_variable(self._name + "candidate/weights/u",
                                                   shape=[input_size + self._hidden_states, 1])
        self._candidate_bias_u = tf.get_variable(self._name + "candidate/bias/u", shape=[1])
        
         #_________Relevance Gate parameters_______________
        self._candidate_kernel_r = tf.get_variable(self._name + "candidate/weights/r",
                                                   shape=[input_size + self._hidden_states, 1])
        self._candidate_bias_r = tf.get_variable(self._name + "candidate/bias/r", shape=[1])


    def state_size(self):
        return self._hidden_states

    def output_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):
        ############ CODE NEEDED ############
        #  Write GRU operations according   #
        #   to your answer at question 1    #
        #####################################
        
        #______________CODE of VANILLA______________________________________________
        #candidate = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel)
        #candidate = tf.nn.bias_add(candidate, self._candidate_bias)
        #new_h = self._activation(candidate)
        #___________________________________________________________________________
        
        #______Computing h_tilde
        h_tilde_1 = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_h) #step 1
        h_tilde_2 = tf.nn.bias_add(h_tilde_1, self._candidate_bias_h) #step 2
        h_tilde_3 = self._activation(h_tilde_2) #step 3
        
        #Computing the gates
        gate_u_1 = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_u) #step 1
        gate_u_2 = tf.nn.bias_add(gate_u_1, self._candidate_bias_u) #step 2
        gate_u_3 = tf.sigmoid(gate_u_2) #step 3   TO CHANGE FOR SIGMOID
        
        gate_r_1 = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_r) #step 1
        gate_r_2 = tf.nn.bias_add(gate_r_1, self._candidate_bias_r) #step 2
        gate_r_3 = tf.sigmoid(gate_r_2) #step 3   TO CHANGE FOR SIGMOID
        
        
        #Computing the output
        a=tf.multiply(gate_u_3, h_tilde_3)
        b=tf.multiply(gate_u_3, state)
        new_h_1 = tf.subtract(a,b)
        new_h = tf.add(new_h_1,state)  
        
        return new_h


<b>Question 3</b> - Train that GRU with a $tanh$ activation function and print its accuracy on the test set.

In [7]:
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/gru.ckpt"
# tf Graph Input:  sentiment analysis data
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# Positive (1) or Negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

gru = GRU(300, hidden_states)

gru_output = utils.process_sequence(gru, x, m)

W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
pred = tf.nn.sigmoid(tf.matmul(gru_output, W) + b)

with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

Training started
Accuracy on validation: 0.575
        Model saved in file: models/gru.ckpt
Epoch:  01   =====> Loss= 0.691597361
Accuracy on validation: 0.75
        Model saved in file: models/gru.ckpt
Epoch:  02   =====> Loss= 0.656620019
Accuracy on validation: 0.75416666
        Model saved in file: models/gru.ckpt
Epoch:  03   =====> Loss= 0.562774732
Accuracy on validation: 0.8041667
        Model saved in file: models/gru.ckpt
Epoch:  04   =====> Loss= 0.487438327
Accuracy on validation: 0.8125
        Model saved in file: models/gru.ckpt
Epoch:  05   =====> Loss= 0.426487927
Accuracy on validation: 0.79583335
Epoch:  06   =====> Loss= 0.395798418
Accuracy on validation: 0.8208333
        Model saved in file: models/gru.ckpt
Epoch:  07   =====> Loss= 0.372140412
Accuracy on validation: 0.82916665
        Model saved in file: models/gru.ckpt
Epoch:  08   =====> Loss= 0.355980878
Accuracy on validation: 0.82916665
Epoch:  09   =====> Loss= 0.343101723
Accuracy on validation: 0.85

<b>Question 4</b> - What are the advantages of Gated Recurrent Units over Vanilla RNNs?

Your answer: 

The gate system (update/ relevance gates) has a huge avantage over classic rnn. They help remembering long term depedencies (for exemple : if a word at the beginning of the sentence was plural or singular, to conjugate the next output correctly, thanks to the gate that will be close to 0 or to 1).
On the opposite, classic rnn are not very efficient in comparison, when it comes to remenbering inputs from a long time ago.