# Deep Learning 2019
## Assignment 4 - Recurrent Neural Networks
Please complete the questions below by modifying this notebook and send this file via e-mail to

__[pir-assignments@l3s.de](mailto:pir-assignments@l3s.de?subject=[DL-2019]%20Assignment%20X%20[Name]%20[Mat.%20No.]&)__

using the subject __[DL-2019] Assignment X [Name] [Mat. No.]__. The deadline for this assignment is __May 21st, 2019, 9AM__.

Programming assignments have to be completed using Python 3. __Please do not use Python 2.__

__Always explain your answers__ (do not just write 'yes' or 'no').

Please add your name and matriculation number below:

__Name:__
<br>
__Mat. No.:__

----

### 1. Word Embeddings
Consider a Word2Vec model with the vocabulary $\{A, B, C, D, E\}$ and the weight matrices
\begin{equation*}
    W =
        \begin{pmatrix}
            1 & -1  & 0\\
            0 & 1   & 2\\
            2 & 2   & 2\\
            2 & 1   & 0\\
            2 & -1  & 0
        \end{pmatrix}
    \quad \text{and} \quad
    W' =
        \begin{pmatrix}
            2 & 3   & 4   & 2 & 0\\
            1 & 3   & -1  & 2 & 3\\
            1 & -2  & 1   & 0 & 1
        \end{pmatrix}.
\end{equation*}
Assume that the model uses the __CBOW__ architecture and that the one-hot indices correspond to the order of the words in the vocabulary above (the one-hot vector $(1, 0, 0, 0, 0)$ encodes the word $A$, the vector $(0, 1, 0, 0, 0)$ encodes $B$ and so on).
1. What is one way the word $A$ can be embedded using these matrices?
2. Suppose that at one point in time during training $W$ and $W'$ have the above values and the window is $B\ C\ D$. What would be the corresponding network input and output?
3. What loss function is used in Word2Vec? What would be the inputs of the loss function in the given example?
4. Compute the loss for the given example.
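
The setup above can be sketched in NumPy to check hand calculations. This is only a sketch under stated assumptions: the window $B\ C\ D$ is read as context $(B, D)$ with target $C$, the context embeddings are averaged (some formulations sum them instead), and the loss is cross-entropy over a full softmax (practical implementations often use negative sampling or hierarchical softmax instead). The names `one_hot`, `hidden`, `scores` and `probs` are illustrative and do not come from the assignment.

```python
import numpy as np

vocab = ["A", "B", "C", "D", "E"]

# Input-side weights W (one row per word) and output-side weights W'.
W = np.array([
    [1, -1, 0],
    [0, 1, 2],
    [2, 2, 2],
    [2, 1, 0],
    [2, -1, 0],
], dtype=float)
W_prime = np.array([
    [2, 3, 4, 2, 0],
    [1, 3, -1, 2, 3],
    [1, -2, 1, 0, 1],
], dtype=float)


def one_hot(word):
    """Return the one-hot vector of a word (hypothetical helper)."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec


# Multiplying a one-hot vector by W selects the matching row of W.
embedding_a = one_hot("A") @ W

# Assumed reading of the window "B C D": context (B, D), target C.
context = np.stack([one_hot("B"), one_hot("D")])
hidden = (context @ W).mean(axis=0)      # averaged context embeddings
scores = hidden @ W_prime                # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                     # softmax over the vocabulary
loss = -np.log(probs[vocab.index("C")])  # cross-entropy against the target

print(embedding_a, hidden, probs, loss)
```

Printing the intermediate values makes it easy to compare each step with a calculation done by hand.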

### 2. Backpropagation through Time 
What happens to the gradient in vanilla RNNs if you backpropagate through a long sequence? What are some of the solutions proposed in the literature to address these problems?
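
To build intuition for this question, the following sketch tracks the norm of a gradient that is repeatedly multiplied by the transposed recurrent weight matrix. It makes strong simplifying assumptions: the RNN is linear (the activation derivative is ignored) and the recurrent matrix is a scaled orthogonal matrix, so the norm changes by exactly the factor `scale` per step. The function `backprop_norms` and its parameters are hypothetical.

```python
import numpy as np


def backprop_norms(scale, steps=100, size=8, seed=0):
    """Gradient norms after each backward step through a linear RNN."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix from a QR decomposition, rescaled by `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    grad = np.ones(size) / np.sqrt(size)  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        grad = w.T @ grad                 # one step back in time
        norms.append(np.linalg.norm(grad))
    return norms


small = backprop_norms(scale=0.9)
large = backprop_norms(scale=1.1)
print(small[-1], large[-1])
```

With `scale` below 1 the norm shrinks geometrically with the number of steps, and with `scale` above 1 it grows geometrically.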

### 3. Word2Vec
The [Delicious Bookmarks](https://grouplens.org/datasets/hetrec-2011/) dataset contains URLs and corresponding tags from users.
1. For every bookmark in the dataset, concatenate all tags and treat the resulting lists as sentences. Train a Word2Vec model with a window size of $5$ on these sentences to obtain $100$-dimensional word embeddings. Use the [gensim library](https://radimrehurek.com/gensim/models/word2vec.html) for this task.

Now suppose we represent a bookmark $b$ with a set of tags $T$ as the average of the word vectors of its tags, i.e.
\begin{equation}
    v_b = \frac{\sum_{t \in T} E(t)}{|T|}
\end{equation}
where $E(t)$ is the embedding of $t$.
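
The averaging in the equation above could look as follows. This is a minimal sketch on toy data: `toy_embeddings` stands in for a trained model, and `bookmark_vector` is a hypothetical helper. Note one assumption that deviates slightly from the formula: tags without an embedding are skipped, so the sum is divided by the number of known tags rather than by $|T|$.

```python
import numpy as np


def bookmark_vector(tags, embeddings):
    """Average the embeddings of all tags that have a vector."""
    vectors = [embeddings[t] for t in tags if t in embeddings]
    if not vectors:
        return None  # no known tag, so no vector can be built
    return np.mean(vectors, axis=0)


# Toy 2-dimensional embeddings (assumption, for illustration only).
toy_embeddings = {
    "firefox": np.array([1.0, 0.0]),
    "addons": np.array([0.0, 1.0]),
}
v_b = bookmark_vector(["firefox", "addons", "unseen-tag"], toy_embeddings)
print(v_b)
```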

2. Implement a simple search engine, where each bookmark is represented by a vector as described above. Queries are represented the same way. The relevance of a bookmark w.r.t. a query should be the cosine similarity between the two vectors. Print the top-$10$ results (the bookmark URLs) for the query `firefox addons extensions`.
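
One possible way to rank bookmarks by cosine similarity is sketched below. It assumes the bookmark vectors have already been computed and stacked into a matrix with one row per bookmark. The function `top_k_by_cosine`, the example URLs and the toy vectors are all hypothetical.

```python
import numpy as np


def top_k_by_cosine(query_vec, bookmark_matrix, urls, k=10):
    """Return the k (url, score) pairs with the highest cosine similarity."""
    query_norm = np.linalg.norm(query_vec)
    row_norms = np.linalg.norm(bookmark_matrix, axis=1)
    # Guard against zero vectors to avoid division by zero.
    denom = np.maximum(row_norms * query_norm, 1e-12)
    scores = (bookmark_matrix @ query_vec) / denom
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(urls[i], float(scores[i])) for i in order]


# Toy data (assumption): three bookmarks with 2-dimensional vectors.
urls = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]
bookmark_matrix = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query_vec = np.array([1.0, 0.1])

results = top_k_by_cosine(query_vec, bookmark_matrix, urls, k=2)
for url, score in results:
    print(url, round(score, 3))
```

Computing all similarities with a single matrix-vector product avoids a Python loop over the bookmarks.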

### 4. Sentiment Classification with LSTMs
In this task we will implement a many-to-one LSTM to do sentiment classification. We will use the IMDB dataset, which is included in Keras. It contains movie reviews associated with sentiments (positive/negative). The task is to classify each review into one of these two classes.

Training this model requires a GPU. If you do not have one, we recommend using [Google Colab](https://colab.research.google.com).

If you get an error loading the dataset, try [downgrading to numpy 1.16.2](https://github.com/tensorflow/tensorflow/issues/28102):

`$ pip uninstall numpy`

`$ pip install numpy==1.16.2`

We can use the `num_words` parameter to specify how many of the most frequent words should be included. The rest of the words will be replaced by a placeholder (`<UNK>`).

In [15]:
!pip install numpy==1.16.2

import tensorflow as tf

num_words = 1000
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=num_words, index_from=3)
word_to_id = tf.keras.datasets.imdb.get_word_index()



The dataset has been preprocessed, i.e. stop words and punctuation marks were removed. Each word is assigned an index in the dictionary `word_to_id`. The sequences themselves consist of indices instead of words. Since we'd like to translate the sequences back to words, we need to reverse the dictionary ([source](https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset)).

In [0]:
word_to_id = {k: v + 3 for k, v in word_to_id.items()}
word_to_id['<PAD>'] = 0
word_to_id['<START>'] = 1
word_to_id['<UNK>'] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

With this reverse index we can translate sequences of indices back to words:

In [17]:
def get_text(seq):
    return list(map(id_to_word.get, seq))

print(x_train[0])
print(get_text(x_train[0]))

[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
['<START>', 'this', 'film', 'was', 'just', 'brilliant', 'casting', '<UNK>', '<UNK>', 'story', 'direction', '<UNK>', 'really', '<U

1. Define a many-to-one LSTM or GRU model that takes as input a sentence and classifies it as positive or negative. Experiment with different architectures (e.g. number of hidden units, dropout, `tf.keras.layers.Bidirectional` wrapper etc.). _Hint_: If you want to train on a GPU, use `tf.keras.layers.CuDNNLSTM` or `tf.keras.layers.CuDNNGRU`.
2. Train your model. Use best practices (validation, early stopping). Since the sequences consist of integer indices, you need to convert each index to a one-hot vector using `tf.keras.utils.to_categorical`. To use minibatching, all sequences in a batch must have the same length. You can use `tf.keras.preprocessing.sequence.pad_sequences` to pad the sequences with the `<PAD>` token (see above).
3. Evaluate your trained model using `sklearn.metrics.classification_report` and `sklearn.metrics.confusion_matrix`. Compare two models, one with a vocabulary of $500$ words and one with $1000$ words.
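
To illustrate the shapes involved in the padding and one-hot encoding steps of the second task, here is a NumPy-only sketch of what they produce. It is not a replacement for the Keras utilities named above. The helpers `pad_left` and `one_hot` are hypothetical, and padding at the front of each sequence is an assumption chosen to mirror what we understand to be the default behaviour of `pad_sequences`.

```python
import numpy as np


def pad_left(sequences, maxlen, pad_id=0):
    """Left-pad (and left-truncate) index sequences to a fixed length."""
    batch = np.full((len(sequences), maxlen), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]  # keep the last maxlen indices
        if kept:
            batch[row, -len(kept):] = kept
    return batch


def one_hot(batch, num_classes):
    """Turn an integer array of shape (n, t) into shape (n, t, num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[batch]


# Toy index sequences of different lengths (assumption, for illustration).
toy = [[1, 14, 22], [1, 2]]
padded = pad_left(toy, maxlen=4)
encoded = one_hot(padded, num_classes=25)
print(padded)
print(encoded.shape)
```

The encoded batch has shape (batch size, sequence length, vocabulary size), which is the input shape a recurrent layer expects. One-hot encoding the whole dataset at once can need a lot of memory, so encoding one batch at a time may be preferable.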