# Deep Learning 2019
## Assignment 4 - Recurrent Neural Networks
Please complete the questions below by modifying this notebook and send this file via e-mail to

__[pir-assignments@l3s.de](mailto:pir-assignments@l3s.de?subject=[DL-2019]%20Assignment%20X%20[Name]%20[Mat.%20No.]&)__

using the subject __[DL-2019] Assignment X [Name] [Mat. No.]__. The deadline for this assignment is __May 21st, 2019, 9AM__.

Programming assignments have to be completed using Python 3. __Please do not use Python 2.__

__Always explain your answers__ (do not just write 'yes' or 'no').

Please add your name and matriculation number below:

__Name:__
<br>
__Mat. No.:__

----

### 1. Word Embeddings
Consider a Word2Vec model with the vocabulary $\{A, B, C, D, E\}$ and the weight matrices
\begin{equation*}
    W =
        \begin{pmatrix}
            1 & -1  & 0\\
            0 & 1   & 2\\
            2 & 2   & 2\\
            2 & 1   & 0\\
            2 & -1  & 0
        \end{pmatrix}
    \quad \text{and} \quad
    W' =
        \begin{pmatrix}
            2 & 3   & 4   & 2 & 0\\
            1 & 3   & -1  & 2 & 3\\
            1 & -2  & 1   & 0 & 1
        \end{pmatrix}.
\end{equation*}
Assume that the model uses the __CBOW__ architecture and that the one-hot indices correspond to the order of the words in the vocabulary above (the one-hot vector $(1, 0, 0, 0, 0)$ encodes the word $A$, the vector $(0, 1, 0, 0, 0)$ encodes $B$ and so on).
1. What is one way the word $A$ can be embedded using these matrices?
2. Suppose that at one point in time during training $W$ and $W'$ have the above values and the window is $B\ C\ D$. What would be the corresponding network input and output?
3. What loss function is used in Word2Vec? What would be the inputs of the loss function in the given example?
4. Compute the loss for the given example.

#### Solution
1. The simplest way of embedding a single word is by using its hidden representation in the network. Because we are using one-hot representations, this corresponds to a single row in $W$. In the case of $A$ we get $(1, -1, 0)$.
2. In the CBOW architecture the network predicts the word in the center (target) given all other words in the window (context). In our case the target word is $C$, and thus the expected output is $(0, 0, 1, 0, 0)$. The input is the average of the one-hot vectors of the context words $B$ and $D$, i.e. $x = (0, 0.5, 0, 0.5, 0)$. The hidden representation is $h = x^T \cdot W = (1, 1, 1)$, and the actual network output (before the softmax) is $\hat{y} = h \cdot W' = x^T \cdot W \cdot W' = (4, 4, 4, 4, 4)$.
3. Since this problem can be seen as a classification task with one class per vocabulary word, we use the cross-entropy loss. The inputs of the loss function are the model output after applying the softmax function and the expected output, i.e. $\mathrm{softmax}(\hat{y}) = (0.2, 0.2, 0.2, 0.2, 0.2)$ and $(0, 0, 1, 0, 0)$.
4. Because only the target entry of the expected output is non-zero, the loss is $-\ln(0.2) = \ln 5 \approx 1.609$. We can verify this with the cross-entropy function integrated in Keras. Note that `from_logits=True` tells Keras that `y_pred` contains raw scores, so the softmax is applied internally before the loss is computed:

In [1]:
import numpy as np
import tensorflow as tf

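# TensorFlow 1.x: an interactive session is required for `.eval()` below
tf.InteractiveSession()

y = np.asarray([[0, 0, 1, 0, 0]])            # expected output (one-hot vector of C)
y_pred = np.asarray([[4., 4., 4., 4., 4.]])  # network output before the softmax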

print(tf.keras.backend.categorical_crossentropy(y, y_pred, from_logits=True).eval())

[1.60943791]
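
The numbers in parts 2 to 4 can also be reproduced step by step. The following is a minimal sketch in plain Python (no TensorFlow required); the helper names `one_hot`, `vec_mat` and `softmax` are our own and not part of any library.

```python
import math

vocab = ['A', 'B', 'C', 'D', 'E']
W = [[1, -1, 0],
     [0, 1, 2],
     [2, 2, 2],
     [2, 1, 0],
     [2, -1, 0]]
W_prime = [[2, 3, 4, 2, 0],
           [1, 3, -1, 2, 3],
           [1, -2, 1, 0, 1]]

def one_hot(word):
    """One-hot vector of a word, following the order of the vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def vec_mat(v, m):
    """Product of a row vector and a matrix (given as a list of rows)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def softmax(z):
    """Numerically stable softmax."""
    shift = max(z)
    exps = [math.exp(a - shift) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

# window "B C D": the context is B and D, the target is C
context = [one_hot('B'), one_hot('D')]
x = [sum(col) / len(context) for col in zip(*context)]  # averaged one-hot input
h = vec_mat(x, W)                                       # hidden representation
y_hat = vec_mat(h, W_prime)                             # scores before the softmax
probs = softmax(y_hat)
loss = -math.log(probs[vocab.index('C')])               # cross-entropy w.r.t. target C

print('x     =', x)
print('h     =', h)
print('y_hat =', y_hat)
print('probs =', probs)
print('loss  =', loss)
```

The printed loss should agree with the Keras result above, since both compute $-\ln(0.2)$.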


### 2. Backpropagation through Time 
What happens to the gradient in vanilla RNNs if you backpropagate through a long sequence? What are some of the possible solutions proposed in the literature to solve such problems?

#### Solution

During backpropagation through time, the gradient that arrives at an early time step is, by the chain rule, a product of one factor (Jacobian) per time step, so it repeatedly passes through multiplications with the same recurrent weight matrix. If these factors are small (norm $< 1$), the gradient shrinks exponentially with the sequence length until it vanishes, which makes it impossible for the model to learn long-range dependencies. This is the vanishing gradient problem. If, on the other hand, the factors are large (norm $> 1$), the gradient grows exponentially and eventually blows up. This is the exploding gradient problem.
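
The exponential behaviour can be illustrated with a deliberately simplified model: a scalar linear RNN $h_t = w \cdot h_{t-1}$, for which the gradient of $h_T$ with respect to $h_0$ is simply $w^T$. This is only a sketch under that assumption (no non-linearity, one hidden unit); the function name is hypothetical.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T w.r.t. h_0 for the scalar linear RNN h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one chain-rule factor per time step
    return grad

for w in (0.9, 1.0, 1.1):
    print('w = {}: gradient after 50 steps = {:.5f}'.format(w, gradient_through_time(w, 50)))
```

Even factors close to $1$ lead to a gradient that is orders of magnitude smaller or larger than the one at the last time step once the sequence is long enough.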

The following are some of the possible solutions to mitigate such problems:
1. Gradient Clipping: When gradients explode, they can become NaNs because of numerical overflow, or we might see irregular oscillations in the training cost when we plot the learning curve. A solution is to apply gradient clipping, which places a predefined threshold on the gradients to prevent them from getting too large. When the gradients are clipped by their norm, this doesn’t change their direction, only their length.
2. In LSTM and GRU, each unit consists of "gates" which are applied to the previous state, input vector, output state and/or candidate output vectors. This allows the unit to control the influence that previous states have on the current state, thereby allowing it to control the magnitude of gradient propagation.
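3. Regularize the parameters to avoid vanishing gradients, e.g. with a penalty term that encourages the norm of the gradient to be preserved as it is propagated back through time.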
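
Gradient clipping by norm (item 1) can be sketched in a few lines. The function below is a hypothetical helper written in plain Python for illustration; deep learning frameworks ship their own, more general implementations.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its Euclidean norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)      # small gradients are left untouched
    scale = max_norm / norm    # same direction, shorter length
    return [g * scale for g in grad]

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5 is scaled down to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 is below the threshold
```

Because every component is multiplied by the same factor, the ratio between the components, and therefore the direction of the update, stays the same.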

### 3. Word2Vec
The [Delicious Bookmarks](https://grouplens.org/datasets/hetrec-2011/) dataset contains URLs and corresponding tags from users.
1. For every bookmark in the dataset, concatenate all tags and treat the resulting lists as sentences. Train a Word2Vec model with a window size of $5$ on these sentences to obtain $100$-dimensional word embeddings. Use the [gensim library](https://radimrehurek.com/gensim/models/word2vec.html) for this task.

Now suppose we represent a bookmark $b$ that has a number of tags $T$ as the average of all word vectors, i.e.
\begin{equation}
    v_b = \frac{\sum_{t \in T} E(t)}{|T|}
\end{equation}
where $E(t)$ is the embedding of $t$.

2. Implement a simple search engine, where each bookmark is represented by a vector as described above. Queries are represented the same way. The relevance of a bookmark w.r.t. a query should be the cosine similarity between the two vectors. Print the top-$10$ results (the bookmark URLs) for the query `firefox addons extensions`.

#### Solution

In [1]:
from collections import defaultdict
from gensim.models import Word2Vec
import pandas
import numpy

We use pandas to read the files. We end up with two dictionaries `tags` and `urls` that map IDs to tags and URLs. `bookmarks_tags` contains tuples of bookmark IDs, tag IDs and tag weights. We ignore the weights and concatenate the tags for each bookmark ID to obtain the sentences for Word2Vec.

In [3]:
tags = pandas.read_csv('tags.dat', encoding='latin-1', delimiter='\t').set_index('id').to_dict()['value']
urls = pandas.read_csv('bookmarks.dat', encoding='latin-1', delimiter='\t').set_index('id').to_dict()['url']
bookmarks_tags = pandas.read_csv('bookmark_tags.dat', encoding='latin-1', delimiter='\t').values

sentences = defaultdict(list)
for b_id, t_id, _ in bookmarks_tags:
    if t_id < len(tags):
        sentences[b_id].append(tags[t_id])

We train a Word2Vec model on these sentences and keep only the resulting word vectors:

In [4]:
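EMBEDDING_SIZE = 100
# gensim < 4.0 calls the embedding dimension `size`; gensim >= 4.0 renamed it to `vector_size`
model = Word2Vec(list(sentences.values()), size=EMBEDDING_SIZE, window=5)
# we only need the trained word vectors, not the full model
word_vectors = model.wv
del model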

We represent each bookmark as the average of the embedding vectors of its tags.

In [5]:
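docs = defaultdict(lambda: numpy.zeros([EMBEDDING_SIZE]))
# `b_tags` avoids shadowing the `tags` dictionary defined above
for b_id, b_tags in sentences.items():
    num_tags = 0
    for tag in b_tags:
        # rare tags may be missing from the Word2Vec vocabulary
        if tag in word_vectors:
            num_tags += 1
            docs[b_id] += word_vectors[tag]
    if num_tags > 0:
        docs[b_id] /= num_tags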

We represent the query the same way as the bookmarks:

In [6]:
query = 'firefox addons extensions'
query_vec = numpy.zeros([EMBEDDING_SIZE])
num = 0
for word in query.split():
    if word in word_vectors:
        num += 1
        query_vec += word_vectors[word]
# avoid a division by zero if none of the query words is in the vocabulary
if num > 0:
    query_vec /= num

Finally we compute the similarities for all bookmarks and print the top-$10$ results.

In [7]:
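def cosine_sim(u, v):
    norm = numpy.linalg.norm(u) * numpy.linalg.norm(v)
    # bookmarks without any known tag have a zero vector; give them a score of 0
    return numpy.dot(u, v) / norm if norm > 0 else 0.0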

scores = []
for b_id, doc in docs.items():
    scores.append((b_id, cosine_sim(doc, query_vec)))
scores.sort(key=lambda x: x[1], reverse=True)

print('top-10 results:')
for i in range(10):
    b_id, score = scores[i]
    print('{}\tscore: {}\t{}\t{}'.format(i + 1, score, b_id, urls.get(b_id)))

top-10 results:
1	score: 0.998458575326554	55827	https://addons.mozilla.org/en-US/firefox/addon/11233
2	score: 0.9910288703149589	45170	https://chrome.google.com/extensions/detail/djnnmjiciadfjbpoclahceniaoeiabbb?hl=es
3	score: 0.9907319104315977	53013	https://addons.mozilla.org/en-US/firefox/addon/98440
4	score: 0.9888803425230343	105817	https://addons.mozilla.org/en-US/firefox/addon/3179
5	score: 0.988613161444	19834	https://addons.mozilla.org/es-ES/firefox/addon/14971/
6	score: 0.9878646184918128	61522	http://xsticky.com/
7	score: 0.9869709338469045	59013	http://www.seoquake.com/
8	score: 0.9867048189249178	19616	http://orera.g.hatena.ne.jp/edvakf/20101021/1287659747
9	score: 0.9833431323905616	38240	http://www.clubic.com/navigateur-internet/mozilla-firefox/article-374036-1-extension-naviguer-internet-firefox-chrome-safari.html
10	score: 0.9807386104866544	19480	http://vimperator.g.hatena.ne.jp/teramako/20100921/1285079275
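
As a sanity check of the ranking logic, the same pipeline can be run on a toy example. The sketch below uses plain Python and hand-made 2-dimensional embeddings; all values, bookmark IDs and helper names are hypothetical and only serve to illustrate averaging and cosine similarity.

```python
import math

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration only.
toy_embeddings = {
    'firefox': [1.0, 0.0],
    'addons': [0.8, 0.2],
    'recipes': [0.0, 1.0],
    'cooking': [0.1, 0.9],
}
toy_bookmarks = {
    'bookmark-1': ['firefox', 'addons'],
    'bookmark-2': ['recipes', 'cooking'],
    'bookmark-3': ['firefox', 'cooking'],
}

def average_vector(words, embeddings):
    """Average of the embeddings of all known words (None if there are none)."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def search(query, bookmarks, embeddings):
    """Rank all bookmarks by cosine similarity to the averaged query vector."""
    q = average_vector(query.split(), embeddings)
    results = []
    for b_id, b_tags in bookmarks.items():
        d = average_vector(b_tags, embeddings)
        if q is not None and d is not None:
            results.append((b_id, cosine_sim(q, d)))
    return sorted(results, key=lambda r: r[1], reverse=True)

ranking = search('firefox addons', toy_bookmarks, toy_embeddings)
for b_id, score in ranking:
    print('{}\tscore: {:.3f}'.format(b_id, score))
```

The bookmark whose tags coincide with the query words is ranked first, followed by the one that shares a single tag with the query.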


### 4. Sentiment Classification with LSTMs
In this task we will implement a many-to-one LSTM to do sentiment classification. We will use the IMDB dataset, which is included in Keras. It contains movie reviews associated with sentiments (positive/negative). The task is to classify each review into one of these two classes.

The training of this model requires a GPU. If you do not have one, we recommend using [Google CoLab](https://colab.research.google.com).

If you get an error loading the dataset, try [downgrading to numpy 1.16.2](https://github.com/tensorflow/tensorflow/issues/28102):

`$ pip uninstall numpy`

`$ pip install numpy==1.16.2`

We can use the `num_words` parameter to specify how many of the most frequent words should be included. The rest of the words will be replaced by a placeholder (`<UNK>`).

In [15]:
!pip install numpy==1.16.2

import tensorflow as tf

num_words = 1000
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=num_words, index_from=3)
word_to_id = tf.keras.datasets.imdb.get_word_index()
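
The effect of `num_words` can be pictured with a small sketch. It assumes that every index at or above `num_words` is replaced and that the placeholder index is $2$ (consistent with the `<UNK>` mapping used in this notebook); the helper `limit_vocabulary` is hypothetical, the actual replacement happens inside `load_data`.

```python
def limit_vocabulary(seq, num_words, unk_id=2):
    """Replace all indices outside the `num_words` most frequent words by `unk_id`."""
    return [i if i < num_words else unk_id for i in seq]

toy_sequence = [1, 14, 22, 1622, 973, 4468]  # made-up indices
print(limit_vocabulary(toy_sequence, num_words=1000))
```

A smaller vocabulary therefore means shorter one-hot vectors, but also more words that collapse into the same placeholder.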



The dataset has been preprocessed, e.g. punctuation marks were removed. Each word is assigned an index in the dictionary `word_to_id`. The sequences themselves consist of indices instead of words. Since we'd like to translate the sequences back to words, we need to invert the dictionary ([source](https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset)).

In [0]:
word_to_id = {k: v + 3 for k, v in word_to_id.items()}
word_to_id['<PAD>'] = 0
word_to_id['<START>'] = 1
word_to_id['<UNK>'] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

With this reverse index we can translate sequences of indices back to words:

In [17]:
def get_text(seq):
    return list(map(id_to_word.get, seq))

print(x_train[0])
print(get_text(x_train[0]))

[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
['<START>', 'this', 'film', 'was', 'just', 'brilliant', 'casting', '<UNK>', '<UNK>', 'story', 'direction', '<UNK>', 'really', '<U

1. Define a many-to-one LSTM or GRU model that takes as input a sentence and classifies it as positive or negative. Experiment with different architectures (e.g. number of hidden units, dropout, `tf.keras.layers.Bidirectional` wrapper etc.). _Hint_: If you want to train on a GPU, use `tf.keras.layers.CuDNNLSTM` or `tf.keras.layers.CuDNNGRU`.
2. Train your model. Use the best practices (validation, early stopping). Since the sequences are made out of numbers, you need to convert each number to a one-hot vector using `tf.keras.utils.to_categorical`. In order to use minibatching, all sequences in a batch must have the same length. You can use `tf.keras.preprocessing.sequence.pad_sequences` to pad the sequences with the `<PAD>` token (see above).
3. Evaluate your trained model using `sklearn.metrics.classification_report` and `sklearn.metrics.confusion_matrix`. Compare two models, one with a vocabulary of $500$ words and one with $1000$ words.

#### Solution
We use a simple model with a bidirectional LSTM and dropout. Our output is just a single number indicating the probability that the sentence is positive (class $1$).

In [0]:
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(None, num_words + 3)),
    tf.keras.layers.Bidirectional(tf.keras.layers.CuDNNLSTM(128, return_sequences=False)),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam')

We create a batch generator that yields batches from the data, converted to one-hot vectors and padded.

In [0]:
import numpy as np

def convert_to_onehot(x, num_classes):
    result = []
    for x_ in x:
        result.append(tf.keras.utils.to_categorical(x_, num_classes=num_classes))
    return result

def batch_generator(x_train, y_train, batch_size, num_classes):
    assert len(x_train) == len(y_train)
    # loop forever, as expected by `fit_generator`
    while True:
        i = 0
        # only yield full batches; `<=` makes sure the last full batch is included
        while i + batch_size <= len(x_train):
            b_x = x_train[i:i + batch_size]
            b_y = y_train[i:i + batch_size]
            b_x_oh = convert_to_onehot(b_x, num_classes)
            # shorter sequences are padded at the front with all-zero vectors
            b_x_pad = tf.keras.preprocessing.sequence.pad_sequences(b_x_oh, value=word_to_id['<PAD>'])
            yield b_x_pad, b_y
            i += batch_size
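
To see what a batch looks like after these two steps, here is a minimal sketch in plain Python. It assumes the default behaviour of `pad_sequences`, i.e. padding at the front of each sequence, and a pad value of $0$; `one_hot` and `pad_batch` are hypothetical helpers, not the Keras functions.

```python
def one_hot(index, num_classes):
    return [1 if i == index else 0 for i in range(num_classes)]

def pad_batch(batch, num_classes):
    """One-hot encode every sequence and left-pad it to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        vectors = [one_hot(i, num_classes) for i in seq]
        padding = [[0] * num_classes for _ in range(max_len - len(seq))]
        padded.append(padding + vectors)
    return padded

toy_batch = [[1, 3], [1, 2, 4, 3]]  # two made-up sequences of different length
padded = pad_batch(toy_batch, num_classes=5)
for seq in padded:
    print(seq)
```

Both sequences end up with shape (4, 5), so the batch can be stacked into a single tensor of shape (2, 4, 5).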

In order to use validation, we need to split the training set:

In [0]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1)

Now we can start training:

In [21]:
cb_early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

batch_size = 32
train_gen = batch_generator(x_train, y_train, batch_size, num_words + 3)
validation_gen = batch_generator(x_val, y_val, batch_size, num_words + 3)
steps = int(len(x_train) / batch_size)
val_steps = int(len(x_val) / batch_size)
model.fit_generator(train_gen, epochs=100, steps_per_epoch=steps, callbacks=[cb_early_stop],
                    validation_data=validation_gen, validation_steps=val_steps)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100


<tensorflow.python.keras.callbacks.History at 0x7f9d483313c8>
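
The idea behind early stopping with a patience value can be sketched as follows. This is a simplified model of the logic, not the Keras implementation, and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the (1-based) epoch after which training stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

toy_val_losses = [0.60, 0.50, 0.52, 0.51, 0.49]  # hypothetical values
print(early_stopping_epoch(toy_val_losses, patience=2))
```

With a patience of $2$, training stops as soon as the validation loss has not improved on its best value for two consecutive epochs, even if a later epoch would have been better.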

Finally we predict the labels of the test set and evaluate the model:

In [22]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = []
for x in x_test:
    y, = model.predict([convert_to_onehot([x], num_words + 3)])
    y_pred.append(y)

y_pred = np.around(y_pred)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.64      0.72     12500
           1       0.71      0.86      0.77     12500

   micro avg       0.75      0.75      0.75     25000
   macro avg       0.76      0.75      0.75     25000
weighted avg       0.76      0.75      0.75     25000

[[ 8052  4448]
 [ 1793 10707]]
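
The values in the classification report follow directly from the confusion matrix. The sketch below recomputes them in plain Python, assuming the scikit-learn convention that rows correspond to the true class and columns to the predicted class.

```python
# confusion matrix from above: rows = true class, columns = predicted class
cm = [[8052, 4448],
      [1793, 10707]]

num_classes = len(cm)
total = sum(sum(row) for row in cm)

precision, recall, f1 = [], [], []
for c in range(num_classes):
    true_positives = cm[c][c]
    predicted_as_c = sum(cm[r][c] for r in range(num_classes))  # column sum
    actually_c = sum(cm[c])                                     # row sum (support)
    p = true_positives / predicted_as_c
    r = true_positives / actually_c
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

accuracy = sum(cm[c][c] for c in range(num_classes)) / total

for c in range(num_classes):
    print('class {}: precision {:.2f}, recall {:.2f}, f1 {:.2f}'.format(
        c, precision[c], recall[c], f1[c]))
print('accuracy: {:.2f}'.format(accuracy))
```

The asymmetry between the two classes is visible here as well: many negative reviews are predicted as positive, which lowers the recall of class $0$ and the precision of class $1$.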


We skip the comparison with the other model here since it can be easily trained by changing the `num_words` variable.