# Deep Learning 2019
## Assignment 7 - Attention Mechanism
Please complete the questions below by modifying this notebook and send this file via e-mail to

__[pir-assignments@l3s.de](mailto:pir-assignments@l3s.de?subject=[DL-2019]%20Assignment%20X%20[Name]%20[Mat.%20No.]&)__

using the subject __[DL-2019] Assignment X [Name] [Mat. No.]__. The deadline for this assignment is __June 18th, 2019, 9AM__. Before your submission please replace fields __[Name]__ and __[Mat. No.]__ with your own name and registration number respectively (please keep the brackets), and replace the __X__ in the filename with the number of the current assignment.

Programming assignments have to be completed using Python 3. __Please do NOT use Python 2.__

__Always explain your answers__ (do not just write 'yes' or 'no').

Please add your name and matriculation number below:

__Name:__
<br>
__Mat. No.:__

----

### 1. Quick quiz
1. Why do we care about attention?
2. What are the differences between hard and soft attention? Please also provide the pros and cons of both approaches.
3. What are the differences between standard attention and the Transformer?
4. Does the Transformer use soft or hard attention?

#### solution
1. Before the attention mechanism was introduced, seq2seq models with a fixed-length context vector were widely used. A critical and obvious disadvantage of this fixed-length design is its inability to remember long sentences: the model has often forgotten the first part of the input by the time it finishes processing the whole sequence. Rather than building a single context vector out of the encoder’s last hidden state, attention creates shortcuts between the context vector and the entire source input. The weights of these shortcut connections can differ for each output element.

2. In hard attention, each part of a sentence or patch of an image is either used to obtain the context vector or discarded. In this case, $\alpha_{ti}$ represents the probability that part/patch $i$ is used at output step $t$. <br/>__Pro__: less computation at inference time. <br/>__Con__: the model is non-differentiable and requires more complicated training techniques such as variance reduction or reinforcement learning.<br/><br/>In soft attention, by contrast, the context vector $z_t$ is computed as
$$
z_t=\sum_i\alpha_{ti}x_i,
$$
where $x_i$ is the representation of the $i$-th input element and $\alpha_{ti}$ is its weight at output step $t$.<br/>__Pro__: the model is smooth and differentiable.<br/>__Con__: expensive when the source input is large.

3. The Transformer is essentially built on self-attention. Standard attention relates the output to the input: for each output element it computes a weight (soft attention) or a selection probability (hard attention) for every input element. The Transformer, in contrast, lets each input element attend to every other element of the same input.

4. The Transformer uses soft attention: it assigns a different weight to each token to represent the extent of the attention. Hard attention is stochastic and therefore requires a large amount of corresponding data to estimate the selection probabilities, which is not available in self-attention (each sentence has only one counterpart).
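
The difference between soft and hard attention, and the step from attention to self-attention, can be illustrated with a small NumPy sketch. The sizes, the random inputs and the dot-product scoring function are assumptions chosen for illustration; the learned projections of a real Transformer are left out for brevity.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)

# Toy setup (illustrative assumption): 5 input elements x_i of size 8
# and a single query vector standing in for one output step.
x = rng.normal(size=(5, 8))
query = rng.normal(size=(8,))

# Attention weights for this output step; dot-product scoring is one common choice.
alpha = softmax(x @ query)

# Soft attention: a differentiable weighted average over all inputs.
z_soft = alpha @ x

# Hard attention: sample exactly one input with probability alpha_i.
picked = rng.choice(len(x), p=alpha)
z_hard = x[picked]

# Self-attention (scaled dot-product): every input element acts as a query
# over all elements of the same input.
d_k = x.shape[-1]
self_alpha = softmax(x @ x.T / np.sqrt(d_k))  # shape (5, 5), each row sums to 1
z_self = self_alpha @ x                       # shape (5, 8)
```

The sampling step in the hard variant is what breaks differentiability, while the soft variant touches every input element and therefore costs more for long inputs.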

### 2. Theoretical questions
The lecture briefly mentioned the Transformer for language translation and a Transformer for images. Based on those, we can extend the Transformer to the following situations:

1. We are working with images. Let us assume we have detected objects in the form of bounding boxes with [x,y] coordinates (top-left, bottom-right) and corresponding features (a vector of size 2048 for each object). Could you please design a Transformer working with such an input representation? Please also be precise regarding particular design choices, e.g., which weights are shared, and the dimensionality of all the vectors. Please also take into consideration that images may have a different number of detected objects whereas the vanilla Transformer assumes a fixed number of these objects.

2. Let us work with videos now. Could you also design a variant of the Transformer working with these sequences of images? You can choose your favorite input representation: image tensor for each frame, detected objects in each frame, or both. Please focus on representing temporal sequences, comment on your design choices, and be precise regarding them.

#### solution
1. Solution for reference: <br/>Assume there is only one object in each bounding box. The positions of the top-left and bottom-right corners can be represented by positional embeddings, whose length is four times that of the positional embedding used for text. In that way, relative and absolute positions can be expressed as linear combinations, as in the Transformer for text. The $\mathbf{q}$, $\mathbf{k}$ and $\mathbf{v}$ vectors can keep the same dimensionality as in the original Transformer (e.g. 2048 per vector, or n*2048 for all n attention heads). Since the positional embedding is only one part of the representation of a detected object, it does no harm when multiple objects occur in the same bounding box: different objects are still distinguished by their feature vectors. Since the vanilla Transformer also deals with sequences of different lengths, it is not a problem if the number of detected objects differs between images.
2. A video differs from a single image in that it adds a time dimension. To take this temporal information into account we can adopt, for example, one of two strategies:
    1. Add an additional temporal dimension to the Transformer's input tensor and allow key-value pairs to be selected across different points in time when computing attention.
    2. Concatenate all frames of the video clip into one 'long' tensor along the time dimension, so that its shape becomes $(w * n_{frames}, h, d_{channel}, d_{feature})$, where $w$ and $h$ are the width and height of each frame (assuming all frames have the same size). The rest of the construction is the same as in the first problem.

The downside of both designs is their high computational complexity: on top of the complexity of the vanilla Transformer, the cost is multiplied by another factor of $n_{frames}*n_{frames}$.
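
One possible way to make the object-based design concrete is sketched below in NumPy. It is only an illustration under stated assumptions: the model width `D_MODEL`, the padding length `MAX_OBJECTS` and the helper names are hypothetical, and the box coordinates are embedded with a simple shared linear projection instead of the positional embeddings described above. The last lines reproduce the $n_{frames}*n_{frames}$ cost factor for a video flattened into one sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# D_FEAT comes from the task statement; D_MODEL and MAX_OBJECTS are assumed values.
D_FEAT, D_MODEL, MAX_OBJECTS = 2048, 512, 6

# Projections shared by all objects and all images (randomly initialised here).
W_feat = rng.normal(scale=0.02, size=(D_FEAT, D_MODEL))
W_box = rng.normal(scale=0.02, size=(4, D_MODEL))


def encode_objects(features, boxes, image_size):
    """Map n detected objects to a padded (MAX_OBJECTS, D_MODEL) input and a mask.

    features: (n, 2048) object features
    boxes:    (n, 4) pixel coordinates [x1, y1, x2, y2]
    """
    n = len(features)
    width, height = image_size
    # Normalise coordinates to [0, 1] so the box embedding does not depend on resolution.
    norm_boxes = boxes / np.array([width, height, width, height], dtype=float)
    tokens = features @ W_feat + norm_boxes @ W_box
    padded = np.zeros((MAX_OBJECTS, D_MODEL))
    padded[:n] = tokens
    mask = np.arange(MAX_OBJECTS) < n  # True for real objects, False for padding
    return padded, mask


def masked_self_attention_weights(tokens, mask):
    """Single-head self-attention weights in which padded objects get zero weight."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = np.where(mask[None, :], scores, -1e9)  # block attention to padding
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


features = rng.normal(size=(3, D_FEAT))
boxes = np.array([[10, 20, 110, 220], [0, 0, 50, 50], [300, 100, 400, 240]], dtype=float)
tokens, mask = encode_objects(features, boxes, image_size=(640, 480))
weights = masked_self_attention_weights(tokens, mask)

# Video case: flattening n_frames frames into one sequence multiplies the
# quadratic attention cost by n_frames ** 2.
n_frames, n_objects = 16, MAX_OBJECTS
cost_single_frame = n_objects ** 2
cost_video = (n_frames * n_objects) ** 2
ratio = cost_video // cost_single_frame
```

Padding to a fixed length together with a mask is how a variable number of objects is handled here; the padded positions still occupy rows of the weight matrix, but no real object attends to them.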

### 3. Coding exercise
Recall that we implemented a very simple LSTM-based sentiment classification model in the 4th assignment. Now it is time to replace the LSTM module with a Transformer. Again we will use the IMDB dataset and solve a binary classification task:

In [1]:
!pip install numpy==1.16.2

import tensorflow as tf

num_words = 1000
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=num_words, index_from=3)
word_to_id = tf.keras.datasets.imdb.get_word_index()

word_to_id = {k: v + 3 for k, v in word_to_id.items()}
word_to_id['<PAD>'] = 0
word_to_id['<START>'] = 1
word_to_id['<UNK>'] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

def get_text(seq):
    return list(map(id_to_word.get, seq))

print(x_train[0])
print(get_text(x_train[0]))

Collecting numpy==1.16.2
  Using cached https://files.pythonhosted.org/packages/c4/33/8ec8dcdb4ede5d453047bbdbd01916dbaccdb63e98bba60989718f5f0876/numpy-1.16.2-cp27-cp27mu-manylinux1_x86_64.whl
Installing collected packages: numpy
Successfully installed numpy-1.16.4




[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
['<START>', 'this', 'film', 'was', 'just', 'brilliant', 'casting', '<UNK>', '<UNK>', 'story', 'direction', '<UNK>', 'really', '<U

1. Reuse your code from the 4th assignment and replace the LSTM/GRU unit with a Transformer structure. Compare the requirements of the two models. In order to integrate the Transformer into your sentiment classification model, what additional information is needed as input? Hint 1: you can find the Transformer implementation on [TensorFlow's GitHub page under the `tensor2tensor` repository](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py). Hint 2: This [link](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb) leads to a Colab Python notebook that helps you better understand the Transformer model.

2. Visualize the weight matrix used for selecting tokens (the self-attention matrix) for an input of your choice using `tensor2tensor.visualization.attention.show`.

#### solution
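
Regarding the additional input asked about in the first question: an LSTM reads tokens one after another, whereas self-attention by itself ignores token order, so a Transformer typically needs position information on top of the token embeddings. The sketch below shows the sinusoidal encoding in the interleaved form commonly used for the Transformer; it assumes an even `d_model`, and the function name is hypothetical. In the code that follows, this step appears to be handled inside `transformer_prepare_encoder`, whose exact layout of the signal may differ from this sketch.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)         # even dimensions
    encoding[:, 1::2] = np.cos(angles)         # odd dimensions
    return encoding


# 1003 and 128 mirror the padded length and embedding size used in this notebook.
pos = sinusoidal_positions(seq_len=1003, d_model=128)

# The encoding is simply added to the token embeddings (zeros as a stand-in here).
token_embeddings = np.zeros((1003, 128))
transformer_input = token_embeddings + pos
```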

In [1]:
import os

import keras

import tensorflow as tf
from tensorflow.python.keras.backend import set_session

from sklearn.model_selection import train_test_split

num_words = 1000

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=num_words, index_from=3)
# This solution is for illustration only, so we keep just the first 4 instances of each of the
# training/validation/test sets.
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1)
x_train, y_train = x_train[:4], y_train[:4]
x_test, y_test = x_test[:4], y_test[:4]
x_val, y_val = x_val[:4], y_val[:4]

Using TensorFlow backend.


In [2]:
word_to_id = tf.keras.datasets.imdb.get_word_index()

word_to_id = {k: v + 3 for k, v in word_to_id.items()}
word_to_id['<PAD>'] = 0
word_to_id['<START>'] = 1
word_to_id['<UNK>'] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

In [3]:
from tensor2tensor.utils import t2t_model
from tensor2tensor.layers import transformer_layers, common_layers
from tensor2tensor.models.transformer import features_to_nonpadding
from tensor2tensor.models.transformer import transformer_tiny

transformer_encoder = transformer_layers.transformer_encoder
transformer_prepare_encoder = transformer_layers.transformer_prepare_encoder

# This TransformerEncoder4Vis class is mainly a copy of tensor2tensor.models.transformer.TransformerEncoder.
# For the later visualization we add an additional member, attention_weights, to store all attention weights.
# We also drop the flatten4d3d call because we only deal with sequential inputs in the form of text.
class TransformerEncoder4Vis(t2t_model.T2TModel):
    def __init__(self, *args, **kwargs):
        super(TransformerEncoder4Vis, self).__init__(*args, **kwargs)
        self.attention_weights = {}
        
    def body(self, features):
        hparams = self._hparams
        inputs = features["inputs"]
        target_space = features["target_space_id"]

        (encoder_input, encoder_self_attention_bias, _) = (
            transformer_prepare_encoder(inputs, target_space, hparams))

        encoder_input = tf.nn.dropout(encoder_input,
                                      rate=hparams.layer_prepostprocess_dropout)
        encoder_output = transformer_encoder(
            encoder_input,
            encoder_self_attention_bias,
            hparams,
            save_weights_to=self.attention_weights,
            nonpadding=features_to_nonpadding(features, "inputs"))
        return encoder_output


# This is a custom layer inheriting from keras.layers.Layer. As described in the official documentation,
# __init__() is called to initialize the layer before the shape of the input is known, and
# build() is called once the shape is known. Each time the layer is used in the forward pass,
# call() is invoked.
class TinyTransformerEncoder(keras.layers.Layer):
    def __init__(self, output_dim, **kwargs):
        self.hparams = transformer_tiny()
        self.output_dim = output_dim
        super(TinyTransformerEncoder, self).__init__(**kwargs)

    def build(self, input_shape, **kwargs):
        self.encoder = TransformerEncoder4Vis(self.hparams)
        super(TinyTransformerEncoder, self).build(input_shape)

    def call(self, x):
        # Our transformer layer takes a dict as its inputs, where the value of key 
        # 'inputs' is the input tensor
        output = self.encoder({'inputs': x, 'targets': 0})
        return output[0]

    def compute_output_shape(self, input_shape):
        return input_shape


For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Entry Point [tensor2tensor.envs.tic_tac_toe_env:TicTacToeEnv] registered with id [T2TEnv-TicTacToeEnv-v0]


In [4]:
# Remember not to use tf.keras and keras at the same time, especially when your TensorFlow version
# is 1.x: the two have different low-level implementations and are not fully compatible.
model = keras.models.Sequential([
        keras.layers.Embedding(num_words + 3, 128, input_shape=(None, )),
        TinyTransformerEncoder(output_dim=128, name='tiny_transformer'),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(1, activation='sigmoid')
    ])

model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Building model body
Instructions for updating:
Shapes are always computed; don't use the compute_shapes as it has no effect.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
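
The shape flow through the classification head of the model defined above can be traced with plain NumPy. This is only a sketch: the encoder output is replaced by random numbers, a hidden size of 128 is assumed, and the dense weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 7, 128  # toy sizes; d_model = 128 is assumed

# Stand-in for the Transformer encoder output: one vector per token.
encoder_output = rng.normal(size=(batch, seq_len, d_model))

# GlobalAveragePooling1D: average over the time axis -> (batch, d_model).
pooled = encoder_output.mean(axis=1)

# Dense(1, activation='sigmoid'): one probability per review -> (batch, 1).
w = rng.normal(scale=0.05, size=(d_model, 1))
b = np.zeros(1)
prob = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))
```

Because the embedding layer is created without a mask, the average also includes padded positions; with heavily padded batches this is a design choice worth keeping in mind.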


In [5]:
def batch_generator(x_train, y_train, batch_size):
    assert len(x_train) == len(y_train)
    while True:
        i = 0
        # Use '<=' so that the last full batch is also yielded
        while i <= len(x_train) - batch_size:
            b_x = x_train[i:i + batch_size]
            b_y = y_train[i:i + batch_size]
            b_x_pad = tf.keras.preprocessing.sequence.pad_sequences(b_x, value=word_to_id['<PAD>'], maxlen=1003)
            yield b_x_pad, b_y
            i += batch_size
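
To make the padding step more tangible, the following sketch re-implements in NumPy what `pad_sequences` does with its default settings, namely padding and truncating at the front of each sequence. The helper `pad_pre` is hypothetical and only meant to illustrate the behaviour; it is not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    batch = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]            # keep the last maxlen tokens
        if trimmed:
            batch[row, -len(trimmed):] = trimmed  # right-align the tokens
    return batch


toy_batch = [[1, 14, 22], [1, 9, 8, 7, 6, 5]]
padded = pad_pre(toy_batch, maxlen=5)
# padded[0] is [0, 0, 1, 14, 22]; padded[1] is [9, 8, 7, 6, 5]
```

Note that the visualization cell later in this notebook pads on the right by appending zeros, which differs from this default.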

In [6]:
batch_size = 1

train_gen = batch_generator(x_train[:2], y_train[:2], batch_size)
val_gen = batch_generator(x_val[:2], y_val[:2], batch_size)
es = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

steps = int(len(x_train[:2]) / batch_size)
val_steps = int(len(x_val[:2]) / batch_size)

model.fit_generator(train_gen,
                    steps_per_epoch=steps,
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    callbacks=[es],
                    epochs=10)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f8072ad26a0>

In [7]:
model.save_weights('ckpt/model.ckpt')

2.
##### Attention!
---
From here on we need to restart the notebook to enable Eager Execution, so that we can visualize the attention weights. Eager Execution needs to be activated at the beginning of a program. For more details about Eager Execution please refer to [here](https://www.youtube.com/watch?v=T8AW0fKP0Hs) and [here](https://www.tensorflow.org/guide/eager).

In [1]:
import os

import keras

import tensorflow as tf

from tensorflow.python.keras.backend import set_session

from sklearn.model_selection import train_test_split

num_words = 1000
viz_text_len = 35

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=num_words, index_from=3)

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1)
x_train, y_train = x_train[:4], y_train[:4]
x_test, y_test = x_test[:4], y_test[:4]
x_val, y_val = x_val[:4], y_val[:4]

tfe = tf.contrib.eager
tfe.enable_eager_execution()

Using TensorFlow backend.



For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.



In [2]:
word_to_id = tf.keras.datasets.imdb.get_word_index()

word_to_id = {k: v + 3 for k, v in word_to_id.items()}
word_to_id['<PAD>'] = 0
word_to_id['<START>'] = 1
word_to_id['<UNK>'] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

In [3]:
import numpy as np

from tensor2tensor.utils import t2t_model
from tensor2tensor.layers import transformer_layers, common_layers
from tensor2tensor.models.transformer import features_to_nonpadding, transformer_tiny

transformer_encoder = transformer_layers.transformer_encoder
transformer_prepare_encoder = transformer_layers.transformer_prepare_encoder


class TransformerEncoder4Vis(t2t_model.T2TModel):
    def __init__(self, *args, **kwargs):
        super(TransformerEncoder4Vis, self).__init__(*args, **kwargs)
        self.attention_weights = {}
        
    def body(self, features):
        hparams = self._hparams
        inputs = features["inputs"]
        target_space = features["target_space_id"]

        (encoder_input, encoder_self_attention_bias, _) = (
            transformer_prepare_encoder(inputs, target_space, hparams))

        encoder_input = tf.nn.dropout(encoder_input,
                                      rate=hparams.layer_prepostprocess_dropout)
        encoder_output = transformer_encoder(
            encoder_input,
            encoder_self_attention_bias,
            hparams,
            save_weights_to=self.attention_weights,
            nonpadding=features_to_nonpadding(features, "inputs"))
        return encoder_output


# Notice that this class inherits from tf.keras.layers.Layer instead of keras.layers.Layer, to bypass
# compatibility issues caused by mixing tf.keras and keras in the same program.
class TinyTransformerEncoder(tf.keras.layers.Layer):
    def __init__(self, output_dim, **kwargs):
        self.hparams = transformer_tiny()
        self.output_dim = output_dim
        super(TinyTransformerEncoder, self).__init__(**kwargs)

    def build(self, input_shape, **kwargs):
        self.encoder = TransformerEncoder4Vis(self.hparams)
        super(TinyTransformerEncoder, self).build(input_shape)

    def call(self, x):
        output = self.encoder({'inputs': x, 'targets': 0})
        return output[0]

    def compute_output_shape(self, input_shape):
        return input_shape

INFO:tensorflow:Entry Point [tensor2tensor.envs.tic_tac_toe_env:TicTacToeEnv] registered with id [T2TEnv-TicTacToeEnv-v0]


In [4]:
model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(num_words + 3, 128, input_shape=(1003, )), # In eager execution there are no placeholders,
                                                                             # so the input layer can no longer infer the
                                                                             # shape of the input; we therefore specify
                                                                             # the input shape explicitly. The magic number
                                                                             # 1003 here is simply the maximum length
                                                                             # of the training text
        TinyTransformerEncoder(output_dim=128, name='tiny_transformer'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Building model body
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [5]:
def resize(np_mat):
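    # Keep only the first viz_text_len tokens along both attention axes
    np_mat = np_mat[:, :viz_text_len, :viz_text_len]
    # Sum across heads (axis 0)
    row_sums = np.sum(np_mat, axis=0)
    # Normalize each head by the sum over all heads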
    layer_mat = np_mat / row_sums[np.newaxis, :]
    lsh = layer_mat.shape
    # Add extra dim for viz code to work.
    layer_mat = np.reshape(layer_mat, (1, lsh[0], lsh[1], lsh[2]))
    return layer_mat


def get_att_mats(encoder):
    enc_atts = []
    for i in range(encoder._hparams.num_hidden_layers):
        enc_att = encoder.attention_weights[
          "transformer_encoder4_vis/body/encoder/layer_%i/self_attention/multihead_attention/dot_product_attention" % i][0]
        enc_atts.append(resize(enc_att))
    return enc_atts


def call_html():
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))
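
What the `resize` helper does to the attention weights can be checked on a toy tensor. The sketch below re-declares the same logic with a small `viz_text_len` and random weights for 2 heads and 6 tokens; these sizes are assumptions chosen for illustration.

```python
import numpy as np

viz_text_len = 4  # small value for illustration; the notebook uses 35


def resize(np_mat):
    """Crop to the first viz_text_len tokens and normalize across heads."""
    np_mat = np_mat[:, :viz_text_len, :viz_text_len]
    head_sums = np.sum(np_mat, axis=0)              # sum over heads
    layer_mat = np_mat / head_sums[np.newaxis, :]   # each head's share per position
    return layer_mat[np.newaxis, ...]               # extra leading dim for the viz code


rng = np.random.default_rng(0)
raw = rng.random(size=(2, 6, 6))          # (heads, query tokens, key tokens)
raw /= raw.sum(axis=-1, keepdims=True)    # rows sum to 1, like softmax output
out = resize(raw)

# After resize, the values for each (query, key) pair sum to 1 across heads.
per_pair_total = out.sum(axis=1)
```

In other words, the visualized values express each head's share of the attention at a given token pair rather than a distribution over key tokens.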

In [6]:
from tensor2tensor.visualization import attention
from tensor2tensor.data_generators import text_encoder

inp = x_test[0]
inp += [0 for i in range(1003 - len(inp))]
inp_text = [id_to_word[i] for i in inp]
inp = np.array(inp).reshape((1, 1003))

with tfe.restore_variables_on_create(tf.train.latest_checkpoint('./')):
    encoder = model.get_layer('tiny_transformer').encoder
    Modes = tf.estimator.ModeKeys
    encoder.set_mode(Modes.EVAL)    
    model(inp)
    enc_atts = get_att_mats(encoder)
call_html()
# The attention.show function in tensor2tensor.visualization takes 5 parameters: the input sequence,
# the output sequence, the input-input self-attention, the input-output attention and the output-output
# self-attention. Because we use the encoder only, we pass the input-input self-attention for all three
# attention parameters. For demonstration we show the self-attention of the first viz_text_len==35 input tokens.
attention.show(inp_text[:viz_text_len], [], enc_atts, enc_atts, enc_atts)

INFO:tensorflow:Setting T2TModel mode to 'eval'
INFO:tensorflow:Setting hparams.dropout to 0.0
INFO:tensorflow:Setting hparams.label_smoothing to 0.0
INFO:tensorflow:Setting hparams.layer_prepostprocess_dropout to 0.0
INFO:tensorflow:Setting hparams.symbol_dropout to 0.0
INFO:tensorflow:Setting hparams.attention_dropout to 0.0
INFO:tensorflow:Setting hparams.relu_dropout to 0.0


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>