In [7]:
import numpy as np
import operator
from torch import optim
import torch.nn.functional as F
import torch.nn as nn
import torch
import sys

sys.path.append(".")
import utils

This is an introduction to basic sequence-to-sequence learning using a recurrent neural network (the models below use GRU modules, a gated variant closely related to the Long Short-Term Memory, LSTM).
Given a string of characters representing a math problem, "3141+42", we would like to generate a string of characters representing the correct solution, "3183". Our network will learn how to do basic mathematical operations.
The important part is that we will not first use our human intelligence to break the string up into integers and a mathematical operator. We want the computer to figure all that out by itself.
Each math problem is an input sequence: a list of digits in {0,...,9} and math operation symbols.
The result of the operation ("$3141+42$" $\rightarrow$ "$3183$") is the sequence to decode.

**math_operators** is the set of $5$ operations we are going to use to build our input sequences.<br/>
The math_expressions_generation function uses them to generate a large set of examples.

In [8]:

def math_expressions_generation(n_samples=1000, n_digits=3, invert=True):
    X, Y = [], []
    math_operators = {
        "+": operator.add,
        "-": operator.sub,
        "*": operator.mul,
        "/": operator.truediv,
        "%": operator.mod,
    }
    for i in range(n_samples):
        a, b = np.random.randint(1, 10 ** n_digits, size=2)
        op = np.random.choice(list(math_operators.keys()))
        res = math_operators[op](a, b)
        x = "".join([str(elem) for elem in (a, op, b)])
        if invert is True:
            x = x[::-1]
        y = "{:.5f}".format(res) if isinstance(res, float) else str(res)
        X.append(x)
        Y.append(y)
    return X, Y

In [9]:
quick_for_debugg = True
n_samples = 100 if quick_for_debugg else int(1e5)

X, y = math_expressions_generation(n_samples=n_samples, n_digits=3, invert=True)
for X_i, y_i in list(zip(X, y))[:20]:
    print(X_i[::-1], "=", y_i)

377*236 = 88972
178/347 = 0.51297
480-903 = -423
446/594 = 0.75084
600/761 = 0.78844
380*438 = 166440
818+86 = 904
369+836 = 1205
379%966 = 379
197/141 = 1.39716
548+45 = 593
581+847 = 1428
886/218 = 4.06422
753%134 = 83
152%922 = 152
955*557 = 531935
952/439 = 2.16856
151/334 = 0.45210
274/91 = 3.01099
559-990 = -431


# I - Encoder and decoder models

- encoder and decoder are both GRU models
- encoder and decoder both take an input sequence and output $1$ hidden vector for each step of the input sequence
- the decoder also outputs $1$ softmax per step of its input sequence, which corresponds to the next predicted token

In the next cells the example is:
- sequence to encode: 94+8
- sequence to decode: $102\text{<EOS>}$

**NB: In this lab all tensors have a $\text{batch_size}$ axis in addition to the traditional $\text{nb_timesteps, vector_dim}$ axes.**
**The batch size axis is there because the PyTorch GRU (and most other PyTorch layers) can process tensors organized in batches, that is, tensors that contain several sequences.**
**In the returned tensor, the results for each sequence are given along the batch axis.**
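
As a quick check of this axis convention, the sketch below runs a small `nn.GRU` on a random batch. The sizes (7 timesteps, 4 sequences, 15 input features, 8 hidden units) are arbitrary values chosen for illustration, not the ones used later in the notebook.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not taken from the data set)
nb_timesteps, batch_size, input_dim, hidden_size = 7, 4, 15, 8

gru = nn.GRU(input_dim, hidden_size)  # default layout: (nb_timesteps, batch_size, dim)
x = torch.randn(nb_timesteps, batch_size, input_dim)
h0 = torch.zeros(1, batch_size, hidden_size)  # one initial state per sequence

h_ts, h_final = gru(x, h0)
print(h_ts.shape)     # torch.Size([7, 4, 8]): one hidden vector per timestep and sequence
print(h_final.shape)  # torch.Size([1, 4, 8]): last hidden vector of each sequence
```

For a single-layer, unidirectional GRU such as this one, the last slice of `h_ts` along the time axis is the same tensor as `h_final[0]`.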

**encoder and decoder inputs**
- for the encoder, the input sequence is the operation: $94+8$
<img src="../images/encoder_input.png" style="width: 600px;" />
- for the decoder, if using teacher forcing, the input sequence is the sequence to decode shifted by one step: $\text{<GO>}102$
<img src="../images/decoder_input_all.png" style="width: 600px;" />
- for the decoder, if **not** using teacher forcing, the input sequence is $1$ timestep long and is either the $\text{<GO>}$ token or the previous predicted token:
<img src="../images/decoder_input_one.png" style="width: 600px;" />
For the decoder, those $3$ scenarios are handled in the same way: the input sequence is of shape $(\text{nb_timesteps, batch_size, input_dim})$, the decoder goes through all timesteps of each sequence and produces $1$ hidden vector and $1$ prediction per timestep.
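
The relationship between the sequence to decode, the teacher-forced decoder input and the prediction targets can be sketched with plain strings. The markers are written literally as `<GO>` and `<EOS>` here, and the helper name `teacher_forcing_pairs` is hypothetical, not part of the notebook.

```python
def teacher_forcing_pairs(answer, go="<GO>", eos="<EOS>"):
    """Return (decoder_input, target) token lists for one answer string."""
    tokens = [go] + list(answer) + [eos]
    decoder_input = tokens[:-1]  # what the decoder reads at each timestep
    target = tokens[1:]          # what it should predict at the same timestep
    return decoder_input, target


decoder_input, target = teacher_forcing_pairs("102")
for read, predict in zip(decoder_input, target):
    print(f"reads {read!r:8} -> should predict {predict!r}")
```

Each target token is simply the decoder input token of the following timestep, which is why the input is described as a shifted copy of the sequence to decode.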

**no attention vs attention**
the attention mechanism is handled (and implemented) at the decoder level
**no attention**
<img src="../images/decoder_no_attention_all.png" style="width: 900px;" />
At each timestep, the hidden vector is used to predict the next token
**attention**
<img src="../images/decoder_attention_all.png" style="width: 900px;" />
The attention mechanism used here is applied to the decoder hidden vectors after they are produced.
- For each timestep of the decoder input, the similarity between the decoder hidden vector and all the encoder hidden vectors is computed. It determines which tokens of the encoder input to focus on. Here similarity is just a dot product $hdec^T \cdot henc$ between the vectors.
- For each timestep of the decoder input, this vector of similarity scores is passed to a softmax so that the resulting attention weights sum to $1$.
- For each timestep of the decoder input, a weighted sum of the encoder hidden vectors is computed. This is the context vector. How heavily it is weighted towards certain encoder hidden vectors reflects which tokens the algorithm focuses on.
- The context vector is used to predict the next token: a matrix product brings it to the right dimension, then a softmax is applied.
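
The steps above can also be written without loops. The following is a minimal sketch, assuming the tensor layout used throughout this notebook; the function name `dot_product_attention` and the sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dot_product_attention(hdec_ts, henc_ts):
    """hdec_ts: (T_dec, B, H), henc_ts: (T_enc, B, H).

    Returns the context vectors (T_dec, B, H) and the attention
    weights (B, T_dec, T_enc).
    """
    hdec = hdec_ts.permute(1, 0, 2)                 # (B, T_dec, H)
    henc = henc_ts.permute(1, 0, 2)                 # (B, T_enc, H)
    scores = torch.bmm(hdec, henc.transpose(1, 2))  # dot products, (B, T_dec, T_enc)
    weights = F.softmax(scores, dim=2)              # normalise over encoder timesteps
    context = torch.bmm(weights, henc)              # weighted sums, (B, T_dec, H)
    return context.permute(1, 0, 2), weights


torch.manual_seed(0)
hdec_ts = torch.randn(4, 2, 8)  # 4 decoder timesteps, batch of 2, hidden size 8
henc_ts = torch.randn(7, 2, 8)  # 7 encoder timesteps
context_vectors, weights = dot_product_attention(hdec_ts, henc_ts)
print(context_vectors.shape)    # torch.Size([4, 2, 8])
print(weights.sum(dim=2))       # every entry is 1 up to rounding
```

The batched matrix products replace the two nested loops over sequences and timesteps, which usually matters for speed once the batch size grows.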

In [21]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, device):
        super(EncoderRNN, self).__init__()
        self.device = device
        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size, hidden_size).to(self.device)

    """
    Implement the encoder forward pass.
    Compute henc_ts, a tensor that represent all the encoder hidden vectors
    for all timesteps for all sequences
    henc_ts is of shape (nb_timesteps, batch_size, hidden_size)
    Compute henc_final, the final encoder hidden vector for all sequences. 
    henc_final is of shape (1, batch_size, hidden_size)
    note:
        - encoder_input is of shape (nb_timesteps, batch_size, input_size)
    hints:
        - Use the gru attribute
    """

    def forward(self, encoder_input, henc_init=None):
        if henc_init is None:
            henc_init = torch.zeros(
                1, encoder_input.size()[1], self.hidden_size, device=self.device
            ).to(self.device)
        # TODO:
        henc_ts, henc_final = self.gru(encoder_input, henc_init)
        
        
        return henc_ts, henc_final

In [22]:
# rnn = nn.GRU(10, 20, 2)
# input = torch.randn(5, 3, 10)
# h0 = torch.randn(2, 3, 20)
# output, hn = rnn(input, h0)

In [23]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, device, attention=False):
        super(DecoderRNN, self).__init__()
        self.device = device
        self.hidden_size = hidden_size
        self.gru = nn.GRU(output_size, hidden_size).to(self.device)
        self.linear = nn.Linear(hidden_size, output_size).to(self.device)
        self.attention = attention

    """
    Implement the decoder forward pass.
    Compute hdec_ts, a tensor that represent all the decoder hidden vectors
    for all timesteps for all sequences
    hdec_ts is of shape (nb_timesteps, batch_size, hidden_size)
    Compute h_ts, a tensor that represent all the encoder hidden vectors
    for all timesteps for all sequences
    Compute hdec_final, the final decoder hidden vector for all sequences.
    hdec_final is of shape (1, batch_size, hidden_size)
        Hint: Use the gru attribute 
    Compute output, the tensor that represent all the softmax for all timesteps 
    for all sequences
    output is of shape (nb_timesteps, batch_size, output_size)
        without attention
        with attention
            compute first context_vectors, a tensor that represent a weighted sum
            of encoder hidden vectors at all timesteps for all sequences. 
            context_vectors is of shape (nb_timesteps, batch_size, hidden_size)
                Hint: it is possible to compute it in fully "vectorial" way with 
                pytorch function but do not hesitate to use loops to iterate over
                timesteps etc. if it seems easier
    
    note:
        - for the softmax, use the function torch.nn.functional.log_softmax
        - follow the above diagrams
    """

    def forward(self, decoder_input, hdec_init, henc_ts=None):

        # TODO:
        hdec_ts, hdec_final = self.gru(decoder_input, hdec_init)
        
        if self.attention:
            # TODO: (done)
            batch_size = hdec_ts.shape[1]
            nb_timesteps = hdec_ts.shape[0]
            context_vectors = []
            for b in range(batch_size):
                context = []
                for t in range(nb_timesteps):
                    # Compute similarity
                    sim = hdec_ts[t, b, :].matmul(henc_ts[:, b, :].transpose(0, 1))
                    # Apply softmax (not log_softmax) so the attention weights sum to 1
                    sm_sim = F.softmax(sim, dim=0)
                    # Compute context vector
                    context.append(torch.sum(sm_sim * henc_ts[:, b, :].transpose(0, 1), 1).unsqueeze(0))
                # Concatenate to get context tensor
                context_vectors.append(torch.cat(context).unsqueeze(0))
            # Concatenate to get context tensor for all batches
            context_vectors = torch.cat(context_vectors, 0).permute(1, 0, 2)
            
            # Compute output: log-probabilities over the vocabulary axis
            output = F.log_softmax(self.linear(context_vectors), dim=2)

        else:
            # TODO:
            output = F.log_softmax(self.linear(hdec_ts), dim=2)
            
        return output, hdec_final


# II - Sequence to sequence model

**GO** is the character ("=") that marks the beginning of decoding for the decoder GRU<br/>
**EOS** is the character ("\n") that marks the end of sequence to decode for the decoder GRU
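
To make the role of the two markers concrete, the sketch below wraps one answer with them and one-hot encodes it. It mirrors what the `Seq2seq` class does internally, but the vocabulary is hard-coded and the variable names and maximum length are assumptions made for illustration.

```python
import torch

GO, EOS = "=", "\n"
answer = "102"
sequence = GO + answer + EOS  # "=102\n"

# EOS sits at index 0, so an all-zero (padding) vector decodes to EOS under argmax.
vocabulary = [EOS] + sorted(set("0123456789.-") | {GO})
char_index = {c: i for i, c in enumerate(vocabulary)}

max_length = 10  # illustrative maximum decoder length
one_hot = torch.zeros(max_length, len(vocabulary))
for t, char in enumerate(sequence):
    one_hot[t, char_index[char]] = 1.0

# Decode back: timesteps beyond the sequence are zero vectors
decoded = "".join(vocabulary[i] for i in one_hot.argmax(dim=1).tolist())
print(repr(decoded))
```

Only the first five rows of `one_hot` carry a token; the remaining rows are padding.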

**global Seq2seq architecture (teacher forcing scenario)**
<img src="../images/seq2seq_teacher.png" style="width: 1000px;" />
the teacher forcing mechanism is handled (and implemented) at the seq2seq forward pass level.
teacher forcing or no teacher forcing depends on the kind of input passed to the decoder.
**teacher forcing**
<img src="../images/teacher_forcing.png" style="width: 600px;" />
- the decoder input is the sequence of expected decoded tokens at all timesteps.
- the decoder input is passed in one go to the decoder. The decoder goes through all timesteps and decodes the whole sequence in one go.
- the decoder input is of shape $(\text{nb_timesteps, batch_size, input_dim})$.
**no teacher forcing**
<img src="../images/no_teacher_forcing.png" style="width: 1000px;" />
- the decoder input is $1$ timestep long and is either the $\text{GO}$ token or the previously decoded token
- the decoder inputs are passed iteratively, in several stages, to the decoder. At each stage, the decoder is given the previously returned hidden vector as its state and takes the previously decoded token as input. It produces a new hidden vector and a decoded token that are passed on to the next stage.
- the decoder input for each stage is of shape $(\text{1, batch_size, input_dim})$.
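
The staged decoding described above can be sketched as a loop that feeds each prediction back as the next input. The GRU and linear layer below are untrained stand-ins with arbitrary sizes, and `go_index` is an assumed position of the GO token in the vocabulary; only the data flow matters here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Illustrative sizes (assumptions)
vocabulary_size, hidden_size, batch_size, max_steps = 14, 8, 3, 10
go_index = 1  # assumed index of the GO token

gru = nn.GRU(vocabulary_size, hidden_size)
linear = nn.Linear(hidden_size, vocabulary_size)

# First input: one-hot GO token for every sequence, shape (1, batch_size, vocabulary_size)
decoder_input = torch.zeros(1, batch_size, vocabulary_size)
decoder_input[0, :, go_index] = 1.0
hidden = torch.zeros(1, batch_size, hidden_size)  # would be henc_final in the real model

log_probs_all_ts = []
for _ in range(max_steps):
    hdec_ts, hidden = gru(decoder_input, hidden)       # one stage = one timestep
    log_probs = F.log_softmax(linear(hdec_ts), dim=2)  # (1, batch_size, vocabulary_size)
    log_probs_all_ts.append(log_probs)
    preds_idx = log_probs.argmax(dim=2)                # (1, batch_size)
    # The predicted token, one-hot encoded, becomes the next input
    decoder_input = F.one_hot(preds_idx, num_classes=vocabulary_size).float()

log_probs_all_ts = torch.cat(log_probs_all_ts)
print(log_probs_all_ts.shape)  # torch.Size([10, 3, 14])
```

Because each stage depends on the previous prediction, this loop cannot be collapsed into a single call the way the teacher-forcing case can.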

In [12]:

class Seq2seq(nn.Module):
    def __init__(self, X, y, hidden_size=256, learning_rate=0.01, attention=False):
        super(Seq2seq, self).__init__()
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.X = X
        self.y = y
        self.GO = "="
        self.EOS = "\n"
        self.dataset_size = None
        self.encoder_char_index = None
        self.encoder_index_char = None
        self.decoder_char_index = None
        self.decoder_index_char = None
        self.encoder_vocabulary_size = None
        self.decoder_vocabulary_size = None
        self.max_encoder_sequence_length = None
        self.max_decoder_sequence_length = None
        self.encoder_input_tr = None
        self.encoder_input_val = None
        self.decoder_input_tr = None
        self.decoder_input_val = None
        self.target_tr = None
        self.target_val = None
        self._set_data_properties_attributes()
        self._construct_data_set()
        self.encoder = EncoderRNN(
            input_size=self.encoder_vocabulary_size,
            hidden_size=hidden_size,
            device=self.device,
        )
        self.decoder = DecoderRNN(
            hidden_size=hidden_size,
            output_size=self.decoder_vocabulary_size,
            attention=attention,
            device=self.device,
        )
        # keep the list under its own name so nn.Module.parameters() is not shadowed
        self.trainable_parameters = list(self.encoder.parameters()) + list(
            self.decoder.parameters()
        )
        self.optimizer = optim.Adam(self.trainable_parameters, lr=learning_rate)
        self.criterion = nn.NLLLoss(reduction="mean")
        # training attributes
        self.total_loss = None
        self.total_loss_nb_samples = None

    def _set_data_properties_attributes(self):
        self.y = list(map(lambda token: self.GO + token + self.EOS, self.y))
        self.dataset_size = len(self.X)
        encoder_characters = sorted(list(set("".join(self.X))))
        decoder_characters = sorted(list(set("".join(self.y))))
        decoder_characters.remove(self.EOS)
        # set EOS at 0 index so argmax on zero vector falls at EOS
        decoder_characters = [self.EOS] + decoder_characters
        self.encoder_char_index = dict((c, i) for i, c in enumerate(encoder_characters))
        self.encoder_index_char = dict((i, c) for i, c in enumerate(encoder_characters))
        self.decoder_char_index = dict((c, i) for i, c in enumerate(decoder_characters))
        self.decoder_index_char = dict((i, c) for i, c in enumerate(decoder_characters))
        self.encoder_vocabulary_size = len(self.encoder_char_index)
        self.decoder_vocabulary_size = len(self.decoder_char_index)
        self.max_encoder_sequence_length = max([len(sequence) for sequence in self.X])
        self.max_decoder_sequence_length = max([len(sequence) for sequence in self.y])
        print("Number of samples:", self.dataset_size)
        print("Number of unique encoder tokens:", self.encoder_vocabulary_size)
        print("Number of unique decoder tokens:", self.decoder_vocabulary_size)
        print("Max sequence length for encoding:", self.max_encoder_sequence_length)
        print("Max sequence length for decoding:", self.max_decoder_sequence_length)

    def _construct_data_set(self):
        encoder_input = torch.zeros(
            (
                self.max_encoder_sequence_length,
                self.dataset_size,
                self.encoder_vocabulary_size,
            ),
            dtype=torch.float32,
        )
        decoder_input = torch.zeros(
            (
                self.max_decoder_sequence_length,
                self.dataset_size,
                self.decoder_vocabulary_size,
            ),
            dtype=torch.float32,
        )
        target = torch.zeros(
            (
                self.max_decoder_sequence_length,
                self.dataset_size,
                self.decoder_vocabulary_size,
            ),
            dtype=torch.float32,
        )

        for i, (X_i, y_i) in enumerate(zip(self.X, self.y)):
            for t, char in enumerate(X_i):
                encoder_input[t, i, self.encoder_char_index[char]] = 1.0
            for t, char in enumerate(y_i):
                decoder_input[t, i, self.decoder_char_index[char]] = 1.0
                if t > 0:
                    target[t - 1, i, self.decoder_char_index[char]] = 1.0

        p_val = 0.25
        size_val = int(p_val * self.dataset_size)
        idxs = np.arange(self.dataset_size)
        np.random.shuffle(idxs)
        idxs_tr = idxs[:-size_val]
        idxs_val = idxs[-size_val:]
        (
            self.encoder_input_tr,
            self.encoder_input_val,
            self.decoder_input_tr,
            self.decoder_input_val,
            self.target_tr,
            self.target_val,
        ) = (
            encoder_input[:, idxs_tr, :],
            encoder_input[:, idxs_val, :],
            decoder_input[:, idxs_tr, :],
            decoder_input[:, idxs_val, :],
            target[:, idxs_tr, :],
            target[:, idxs_val, :],
        )
        self.encoder_input_tr = self.encoder_input_tr.to(self.device)
        self.encoder_input_val = self.encoder_input_val.to(self.device)
        self.decoder_input_tr = self.decoder_input_tr.to(self.device)
        self.decoder_input_val = self.decoder_input_val.to(self.device)
        self.target_tr = self.target_tr.to(self.device)
        self.target_val = self.target_val.to(self.device)

    """
    Implement the Seq2seq forward pass.
    Compute henc_ts, the tensor that represent all the encoder hidden vectors
    for all timesteps for all sequences
    henc_ts is of shape (nb_encoder_timesteps, batch_size, hidden_size)
    Compute henc_final, the final encoder hidden vector for all sequences. 
    henc_final is of shape (1, batch_size, hidden_size)
    Compute pred_softmax_all_ts, the tensor that represents all the softmax
    vectors at all timesteps for all sequences.
    pred_softmax_all_ts is of shape (nb_decoder_timesteps, batch_size, output_dim)
        teacher forcing case
            Hint: refer to diagrams notes
        no teacher forcing case
            Before the loop, initialize decoder_input, the tensor that represents
            the first token passed to the decoder for all sequences. 
            The token is <GO>, the decoder_input is of shape (1, batch_size, output_dim).
            It has to be in one-hot encoding representation.
            In the loop, compute pred_softmax. The tensor represents the softmax 
            produced at this timestep, for all sequences. 
            It is of shape (1, batch_size, output_dim)
            In the loop, compute hdec_final. The tensor represents the hidden vector 
            produced at this timestep, for all sequences. 
            It is of shape (1, batch_size, hidden_dim)
            In the loop, set hdec_init to the right value. 
            hdec_init is a tensor that represents the state in which the decoder 
            will start at next stage.
            hdec_init is of shape (1, batch_size, hidden_dim)
    note:
        - in code nb_decoder_timesteps is self.max_decoder_sequence_length
        - in code output_dim is self.decoder_vocabulary_size
    """

    def forward(
        self, encoder_input, decoder_input=None, teacher_enforce=True, inference=False
    ):

        batch_size = encoder_input.size()[1]
        if inference:
            assert (
                batch_size == 1
            ), "during inference batch size must be 1: 1 sequence processed"
            if teacher_enforce:
                print("Warning teacher_enforce will be set to False for inference")
                teacher_enforce = False

        # TODO: (done) encode the whole input sequence in one pass
        henc_ts, henc_final = self.encoder(encoder_input)

        if teacher_enforce:
            assert decoder_input is not None
            # TODO: (done) the decoder starts from the final encoder state and
            # reads the expected tokens of all timesteps in one go
            pred_softmax_all_ts, hdec_final = self.decoder(
                decoder_input, henc_final, henc_ts
            )

        elif not teacher_enforce:
            pred_softmax_all_ts = []
            # TODO: (done) one-hot encoding of the <GO> token for all sequences
            decoder_input = torch.zeros(1, batch_size, self.decoder_vocabulary_size)
            decoder_input[0, :, self.decoder_char_index[self.GO]] = 1.0

            decoder_input = decoder_input.to(self.device)
            hdec_init = henc_final
            # iterate over all decoder stages
            for _ in range(self.max_decoder_sequence_length):
                # TODO: (done) decode a single timestep from the current state
                pred_softmax, hdec_final = self.decoder(
                    decoder_input, hdec_init, henc_ts
                )
                pred_softmax_all_ts.append(pred_softmax)
                # convert softmax predictions to idx
                preds_idx = pred_softmax.argmax(dim=2)
                # convert idx predictions to one-hot encoding
                decoder_input = torch.zeros(1, batch_size, self.decoder_vocabulary_size)
                decoder_input = decoder_input.to(self.device)
                decoder_input[0, np.arange(batch_size), preds_idx] = 1

                # TODO: (done) the next stage starts from the state just produced
                hdec_init = hdec_final
                if inference:
                    pred = preds_idx.squeeze().item()
                    if pred == self.decoder_char_index[self.EOS]:
                        break
            pred_softmax_all_ts = torch.cat(pred_softmax_all_ts)

        return pred_softmax_all_ts

    def _train_on_batch(
        self, encoder_input, target, teacher_forcing, decoder_input=None
    ):
        self.optimizer.zero_grad()
        prediction = self.forward(
            encoder_input, decoder_input=decoder_input, teacher_enforce=teacher_forcing
        )
        target_idx = target.argmax(2)
        loss_on_batch = self.criterion(
            prediction.reshape(-1, prediction.size()[2]), target_idx.reshape(-1)
        )
        loss_on_batch.backward()
        self.optimizer.step()

        return loss_on_batch

    def train(self, nb_epoch=10, batch_size=64, teacher_enforce=True):
        arr = np.arange(self.encoder_input_tr.size()[1])
        np.random.shuffle(arr)
        nb_batch = int(self.encoder_input_tr.size()[1] / batch_size)
        verbose_every = 5 if nb_batch >= 5 else 1

        for epoch in range(nb_epoch):
            self._reset_monitor_train_epoch()
            if epoch > 0:
                print()
            for batch_idx in range(nb_batch):
                idxs = arr[batch_idx * batch_size : (batch_idx + 1) * batch_size]
                encoder_input_batch_tr = self.encoder_input_tr[:, idxs, :]
                target_batch_tr = self.target_tr[:, idxs, :]
                decoder_input_batch_tr = self.decoder_input_tr[:, idxs, :]

                batch_loss_tr = self._train_on_batch(
                    encoder_input_batch_tr,
                    target_batch_tr,
                    teacher_forcing=teacher_enforce,
                    decoder_input=decoder_input_batch_tr,
                )
                self._monitor_train_epoch(
                    batch_loss=batch_loss_tr,
                    batch_size=encoder_input_batch_tr.size()[1],
                )

                if (batch_idx + 1) % verbose_every == 0:
                    self._display_training(
                        epoch, nb_epoch, batch_idx, nb_batch, epoch_ended=False
                    )

            self._monitor_validation(teacher_enforce=teacher_enforce)
            self._display_training(
                epoch, nb_epoch, batch_idx, nb_batch, epoch_ended=True
            )

    def _monitor_train_epoch(self, batch_loss, batch_size):
        self.total_loss += batch_loss * batch_size
        self.total_loss_nb_samples += batch_size

    def _reset_monitor_train_epoch(self):
        self.total_loss = 0
        self.total_loss_nb_samples = 0

    def _monitor_validation(self, teacher_enforce):

        prediction_val = self(
            self.encoder_input_val,
            decoder_input=self.decoder_input_val,
            teacher_enforce=teacher_enforce,
        )
        target_val_idx = self.target_val.argmax(2)
        self.last_loss_val = self.criterion(
            prediction_val.reshape(-1, prediction_val.size()[2]),
            target_val_idx.reshape(-1),
        )

    def _display_training(
        self, epoch, nb_epoch, idx_batch, nb_batch, epoch_ended=False
    ):
        msg = "Epoch {}/{} {} {}".format(
            epoch + 1,
            nb_epoch,
            utils.arrow(idx_batch + 1, nb_batch),
            " mean loss: %.5f" % (self.total_loss.item() / self.total_loss_nb_samples),
        )
        if epoch_ended:
            msg += " val loss: %.5f" % self.last_loss_val
        print(msg, end="\r")

    def _tensor_to_words(self, output, decoded=True):
        dict_index_char = (
            self.decoder_index_char if decoded else self.encoder_index_char
        )
        pred_idx = output.argmax(dim=2)
        decoded_words = []
        for seq in range(pred_idx.size()[1]):
            idxs_chars = pred_idx[:, seq]
            decoded_word = "".join(dict_index_char[idx.item()] for idx in idxs_chars)
            if not decoded:
                # correct errors due to zero vectors at the end
                accepted_end_chars = set(list("0123456789"))
                for i in range(len(decoded_word) - 1, -1, -1):
                    if decoded_word[i] in accepted_end_chars:
                        decoded_word = decoded_word[: i + 1]
                        break
            decoded_words.append(decoded_word)
        return decoded_words

    def evaluate(self, nb=30):
        nb = min(nb, self.encoder_input_val.size()[1])
        for i in range(nb):
            output = self(
                self.encoder_input_val[:, i : i + 1, :],
                inference=True,
                teacher_enforce=False,
            )
            decoded_word = self._tensor_to_words(output, decoded=True)[0]
            operation = self._tensor_to_words(
                self.encoder_input_val[:, i : i + 1, :], decoded=False
            )[0][::-1]
            expected_decoded_word = self._tensor_to_words(
                self.target_val[:, i : i + 1, :], decoded=True
            )[0]
            decoded_word = decoded_word.replace("\n", "")
            operation = operation.replace("\n", "")
            expected_decoded_word = expected_decoded_word.replace("\n", "")
            print(
                "Input sentence: {} Decoded sentence: {} Expected decoded sentence: {}".format(
                    operation, decoded_word, expected_decoded_word
                )
            )
            print()


### no attention - teacher forcing

In [13]:
seq2seq = Seq2seq(X, y, hidden_size=128, attention=False)

Number of samples: 100
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 10


In [14]:
seq2seq.train(nb_epoch=10, batch_size=64, teacher_enforce=True)



In [15]:
seq2seq.evaluate()

### no attention - no teacher forcing

In [10]:
seq2seq = Seq2seq(X, y, hidden_size=128, attention=False)

In [11]:
seq2seq.train(nb_epoch=3, batch_size=64, teacher_enforce=False)

In [12]:
seq2seq.evaluate()

### attention - teacher forcing

In [13]:
seq2seq_attn = Seq2seq(X, y, hidden_size=128, attention=True)

In [14]:
seq2seq_attn.train(nb_epoch=3, batch_size=64, teacher_enforce=True)

In [15]:
seq2seq_attn.evaluate()

### attention - no teacher forcing

In [16]:
seq2seq_attn = Seq2seq(X, y, hidden_size=128, attention=True)

In [17]:
seq2seq_attn.train(nb_epoch=3, batch_size=64, teacher_enforce=False)

In [18]:
seq2seq_attn.evaluate()

### Questions:
- 1) Explain the interest in using teacher forcing during training. What is specific about this process?



- 2) Describe step by step how the encoder-decoder couple works in this case (~ 5-10 lines)

### Questions:
- 1) Describe how the attention mechanism works in the seq2seq setting (~ 5-10 lines)

The attention mechanism works by letting the decoder focus on a specific subsequence of a long sequence when predicting the token at a given timestep. Instead of relying solely on the final $h^{enc}_T$ to predict the whole decoded sequence, all the $h^{enc}_t$ are recombined and weighted at each decoding step so as to focus on those relevant to the prediction. At each decoding step, a scalar product is computed between $h^{dec}_t$ and all the $h^{enc}_t$. This gives a similarity measure between $h^{dec}_t$ and each $h^{enc}_t$. A softmax is applied to this vector to rescale the similarity coefficients and make them sum to $1$. They can then be used to compute a weighted mean $h^{enc}$ vector for prediction, which lets the network focus on some input tokens by making some coefficients much greater than the others. This mean $h^{enc}$ vector is followed by a $\tanh$ operation to reduce the input space of the next operation. The final step is a fully connected softmax layer over the $\tanh$ vector that predicts the next decoded token. Applying the attention mechanism involves repeating this for each decoding timestep.

The attention mechanism recombines all the encoder hidden vectors at each decoding step. Thanks to that, the model can make better predictions on long expressions, since it does not rely only on the final $h^{enc}$.
To implement the attention mechanism, we compute a scalar product between $h^{dec}_t$ and all the $h^{enc}_t$, then apply a softmax to the result. The resulting weights give a weighted mean vector that we use to predict the next token.
The drawback of such a mechanism is that we have to go through all these steps at each timestep.

- 2) Compare the performance of your model at inference time with and without the attention mechanism. Do you see noticeable differences? Why?

Here, we do not expect a large difference in performance between the models with and without the attention mechanism. Almost all the "sentences" here are short and, moreover, almost every token in each expression is involved in computing the result. Attention is therefore less relevant here than it would be for a task such as translation.