## Notebook for running Seq2Seq model in Google Colab

Let's import all the code for the task of Seq2Seq neural machine translation and run it on Colab.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -r '/content/drive/MyDrive/Colab Notebooks/NLP1C/Seq2Seq/' .

In [13]:
import os

### Training the model

Let's create the necessary vocab file and start training the model.

In [8]:
os.chdir('Seq2Seq')

In [11]:
!sh run.sh vocab

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
read in source sentences: ./en_es_data/train.es
read in target sentences: ./en_es_data/train.en
initialize source vocabulary ..
number of word types: 93286, number of word types w/ frequency >= 2: 52658
initialize target vocabulary ..
number of word types: 67535, number of word types w/ frequency >= 2: 39786
generated vocabulary, source 50004 words, target 39788 words
vocabulary saved to vocab.json


In [None]:
!sh run.sh train

### Testing the model

Now let's test the model.

In [27]:
!sh run.sh test

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
load test source sentences from [./en_es_data/test.es]
load test target sentences from [./en_es_data/test.en]
load model from model.bin
Decoding:   0% 0/8064 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 344, in <module>
    main()
  File "run.py", line 338, in main
    decode(args)
  File "run.py", line 283, in decode
    max_decoding_time_step=int(args['--max-decoding-time-step']))
  File "run.py", line 311, in beam_search
    example_hyps = model.beam_search(src_sent, beam_size=beam_size, max_decoding_time_step=max_decoding_time_step)
  File "/content/Seq2Seq/nmt_model.py", line 454, in beam_search
    new_hyp_sent = hypotheses[prev_hyp_id] + [hyp_word]
TypeError: list indices must be integers or slices, not float
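
The traceback above ends with `hypotheses[prev_hyp_id]` being indexed by a float. The surrounding lines of `nmt_model.py` are not shown here, so the following is only a guess at the cause: the hypothesis id may be recovered from a flattened top-k position with true division (`/`) rather than floor division. The sketch below reproduces that failure mode in plain Python. The helper `split_candidate_position` and the toy values are hypothetical, not code from the assignment.

```python
def split_candidate_position(flat_pos, vocab_size):
    """Recover (previous hypothesis id, word id) from a flattened index.

    Hypothetical helper: assumes candidates come from a
    (num_hypotheses x vocab_size) score grid flattened row by row.
    """
    # divmod uses floor division, so both results are ints.
    prev_hyp_id, word_id = divmod(flat_pos, vocab_size)
    return prev_hyp_id, word_id


hypotheses = [["<s>", "the"], ["<s>", "a"]]
vocab_size = 5
flat_pos = 7  # row 1, column 2 of a 2 x 5 score grid

# True division yields a float (1.4 here), which cannot index a list.
try:
    hypotheses[flat_pos / vocab_size]
    float_index_failed = False
except TypeError:
    float_index_failed = True

# Floor division gives a valid integer index.
prev_hyp_id, word_id = split_candidate_position(flat_pos, vocab_size)
new_hyp = hypotheses[prev_hyp_id] + [word_id]
```

If the ids are PyTorch tensors, the analogous change would be `torch.div(a, b, rounding_mode='floor')` in place of `a / b`. Whether that is the right fix depends on the actual code around line 454.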


### Answering the written questions

**(g)** The `generate_sent_masks()` function in `nmt_model.py` produces a tensor called `enc_masks`. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to 'pad' tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the `step()` function.
<br>
First explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this way. 
<br><br>
**Solution.**
<br>
 Looking at the `step()` function, we see that every attention score whose mask entry is 1 (that is, the position holds a 'pad' token) is set to `-float('inf')`. As a result, these positions receive zero weight in the probability distribution after `softmax` is applied, since $\exp(-\infty) = 0$, and the attention output becomes a weighted sum over the real source tokens only. The masks are necessary because 'pad' tokens exist only to bring the sentences in a batch to the same length; without masking, they would take probability mass away from real tokens and distort the attention distribution.
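
The effect of the mask can be checked on a toy example. This is a minimal sketch in plain Python rather than the PyTorch code from `step()`. The function `masked_softmax` and the sample scores are made up for illustration, and the sketch assumes at least one position is not padding.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, ignoring positions where `mask` is 1 (padding)."""
    # Replace the scores of padded positions with -inf.
    masked = [-math.inf if m == 1 else s for s, m in zip(scores, mask)]
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]  # the last two positions are padding
mask = [0, 0, 1, 1]
weights = masked_softmax(scores, mask)
```

The padded position with the raw score 3.0 would otherwise dominate the distribution; after masking, all of the weight is shared between the two real tokens.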
<br><br>
**(i)** Please report the model's corpus BLEU Score. It should be larger than 21.
<br><br>
**Solution.**
<br>
*Early stopped model:* <br>
training: epoch 15, iter 96000, avg. loss 24.51, avg. ppl 3.25, <br>
validation: iter 96000, dev. ppl 7.318153.
<br>
Model's corpus BLEU score: 
<br><br>
**(j)** In class, we learned about dot product attention, multiplicative attention, and additive attention. Please explain one advantage and one disadvantage of dot product attention compared to multiplicative attention. Then explain one advantage and one disadvantage of additive attention compared to multiplicative attention. As a reminder, dot product attention is $\mathbf{e}_{t, i} = \mathbf{s}_{t}^{T} \mathbf{h}_{i}$, multiplicative attention is $\mathbf{e}_{t, i} = \mathbf{s}_{t}^{T} \mathbf{W} \mathbf{h}_{i}$, and additive attention is $\mathbf{e}_{t, i} = \mathbf{v}^{T} \tanh{(\mathbf{W}_{1} \mathbf{h}_{i} + \mathbf{W}_{2} \mathbf{s}_{t})}$.
<br><br>
**Solution.**
<br>
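*Comparing dot product and multiplicative attention:* dot product attention is faster and more memory efficient because it has no learnable parameters; however, it is less flexible than multiplicative attention and requires $\mathbf{s}_{t}$ and $\mathbf{h}_{i}$ to have the same dimensionality.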
<br>
*Comparing multiplicative and additive attention:* multiplicative attention is faster and more memory efficient; however, additive attention performs better for larger dimensions and is more flexible, because the source and target hidden state vectors each have their own learnable matrix, $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$.
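
To make the comparison concrete, the three scoring functions from the question can be written out directly. This is an illustrative sketch in plain Python with hypothetical helper names (`dot`, `matvec`) and toy values; a real model would compute these with batched tensor operations.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [dot(row, x) for row in W]


def dot_product_score(s_t, h_i):
    # e = s_t^T h_i: no parameters, but s_t and h_i must have equal length.
    return dot(s_t, h_i)


def multiplicative_score(s_t, W, h_i):
    # e = s_t^T W h_i: W has shape (len(s_t), len(h_i)).
    return dot(s_t, matvec(W, h_i))


def additive_score(s_t, h_i, W1, W2, v):
    # e = v^T tanh(W1 h_i + W2 s_t): two matrices plus a vector to learn.
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W1, h_i), matvec(W2, s_t))]
    return dot(v, hidden)


s_t = [1.0, 2.0]
h_same = [3.0, 4.0]        # same size as s_t, as dot product requires
h_wide = [3.0, 4.0, 5.0]   # a different size is fine once W is present
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

e_dot = dot_product_score(s_t, h_same)
e_mul = multiplicative_score(s_t, W, h_wide)
e_add = additive_score(s_t, h_wide,
                       W1=[[0.0] * 3, [0.0] * 3],
                       W2=[[0.0] * 2, [0.0] * 2],
                       v=[1.0, 1.0])
```

The parameter counts mirror the trade-off described above: none for dot product, one matrix for multiplicative, and two matrices plus a vector for additive attention.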


