# KAIST AI605 Assignment 2: Token Classification with RNNs and Attention
Author: Minjoon Seo (minjoon@kaist.ac.kr)

TA in charge: Taehyung Kwon (taehyung.kwon@kaist.ac.kr)

**Due date**:  April 19 (Mon) 11:00pm, 2021  


Your name: Radhika Dua

Your student ID: 20204824

Your collaborators: 

## Assignment Objectives
- Verify theoretically and empirically how the Transformer's attention mechanism works for sequence modeling tasks.
- Implement Transformer's encoder attention layer from scratch using PyTorch.
- Design an Attention-based token classification model using PyTorch.
- Apply the token classification model to a popular machine reading comprehension task, Stanford Question Answering Dataset (SQuAD).
- (Bonus) Analyze pros and cons between using RNN + attention versus purely attention.

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Also make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 100 points. There are two bonus questions with 30 points altogether. Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which is already available on Colab:

In [None]:
from platform import python_version
import torch
import os
import numpy as np
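# Set as an environment variable (a bare Python variable has no effect);
# it must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"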
# os.environ["CUDA_VISIBLE_DEVICES"]="5"

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Transformer's Attention Layer

We will first start with going over a few concepts that you learned in your high school statistics class. The variance of a random variable $X$, $\text{Var}(X)$ is defined as $\text{E}[(X-\mu)^2]$ where $\mu$ is the mean of $X$. Furthermore, given two independent random variables $X$ and $Y$ and a constant $a$,
$$ \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y),$$
$$ \text{Var}(aX) = a^2\text{Var}(X),$$
$$ \text{Var}(XY) = \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2.$$

**Problem 1.1** *(10 points)* Suppose we are given two sets of $n$ random variables, $X_1 \dots X_n$ and $Y_1 \dots Y_n$, where all of these $2n$ variables are mutually independent and have a mean of $0$ and a variance of $1$. Prove that
$$\text{Var}\left(\sum_i^n X_i Y_i\right) = n.$$

### $\color{blue}{\text{Solution 1.1}}$

<font color='blue'>
Given two independent random variables $X$ (mean $\mu_x$, standard deviation $\sigma_x$) and $Y$ (mean $\mu_y$, standard deviation $\sigma_y$), $\operatorname{Var}(\mathrm{XY})$ is given by:
 $$\operatorname{Var}(\mathrm{XY}) =\mathrm{E}\left[\mathrm{X}^{2}\right] \mathrm{E}\left[\mathrm{Y}^{2}\right]-(\mathrm{E}[\mathrm{X}])^{2}(\mathrm{E}[\mathrm{Y}])^{2}$$
 <font color='blue'>   
Since $E(X)=\mu_x$ and $E[(X-\mu_x)^{2}]=\sigma_x^{2}$, we can find $E(X^{2})$ as follows, using $E(X-\mu_x)=0$ in the last step:
\begin{aligned}E(X^{2})&=E[((X-\mu_x)+\mu_x)^{2}] \\&=E[(X-\mu_x)^{2}]+2 E[(X-\mu_x) \mu_x]+E\left(\mu_x^{2}\right) \\&=\sigma_x^{2}+2 \mu_x E(X-\mu_x)+\mu_x^{2} \\&=\sigma_x^{2}+\mu_x^{2}\end{aligned}
<font color='blue'>
Therefore,
\begin{aligned} \operatorname{Var}(\mathrm{XY}) &=\mathrm{E}\left[\mathrm{X}^{2}\right] \mathrm{E}\left[\mathrm{Y}^{2}\right]-(\mathrm{E}[\mathrm{X}])^{2}(\mathrm{E}[\mathrm{Y}])^{2} \\ &=\left(\sigma_{\mathrm{x}}^{2}+\mu_{\mathrm{x}}^{2}\right)\left(\sigma_{\mathrm{y}}^{2}+\mu_{\mathrm{y}}^{2}\right)-\mu_{\mathrm{x}}^{2} \mu_{\mathrm{y}}^{2} \\ &=\sigma_{\mathrm{x}}^{2} \sigma_{\mathrm{y}}^{2}+\mu_{\mathrm{x}}^{2} \mu_{\mathrm{y}}^{2}+\sigma_{\mathrm{x}}^{2} \mu_{\mathrm{y}}^{2}+\sigma_{\mathrm{y}}^{2} \mu_{\mathrm{x}}^{2}-\mu_{\mathrm{x}}^{2} \mu_{\mathrm{y}}^{2} \\ &=\sigma_{\mathrm{x}}^{2} \sigma_{\mathrm{y}}^{2}+\sigma_{\mathrm{x}}^{2} \mu_{\mathrm{y}}^{2}+\sigma_{\mathrm{y}}^{2} \mu_{\mathrm{x}}^{2} \end{aligned}
<font color='blue'>    
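Because all $2n$ variables are mutually independent, the products $x_i y_i$ are independent of each other, so the variance of their sum is the sum of their variances. Substituting $\mu_{x_i}=\mu_{y_i}=0$ and $\sigma_{x_i}^{2}=\sigma_{y_i}^{2}=1$, we can rewrite $\operatorname{Var}\left(\sum_{i=1}^{n} x_{i} y_{i}\right)$ as:
\begin{aligned} \operatorname{Var}\left(\sum_{i=1}^{n} x_{i} y_{i}\right) &=\sum_{i=1}^{n} \operatorname{Var}\left(x_{i} y_{i}\right) \\ &=\sum_{i=1}^{n}\left(\sigma_{x_{i}}^{2} \sigma_{y_{i}}^{2}+\sigma_{x_{i}}^{2} \mu_{y_{i}}^{2}+\sigma_{y_{i}}^{2} \mu_{x_{i}}^{2}\right) \\ &=\sum_{i=1}^{n}\left(1 \cdot 1+1 \cdot 0+1 \cdot 0\right) \\ &=\sum_{i=1}^{n} 1 \\ &=n \end{aligned}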
<font color='blue'>
Hence proved that $\operatorname{Var}\left(\sum_{i=1}^{n} x_{i} y_{i}\right)=n$
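
The result can also be checked empirically. The following is a minimal Monte Carlo sketch, assuming standard normal variables (the problem only requires mean $0$ and variance $1$, so the normal distribution is an illustrative choice); the function name `empirical_variance_of_dot` is hypothetical.

```python
import random
import statistics

def empirical_variance_of_dot(n, trials=10000, seed=0):
    """Estimate Var(sum_i X_i * Y_i) for independent standard normal X_i, Y_i."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # one draw of the sum of n independent products
        total = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(n))
        samples.append(total)
    return statistics.pvariance(samples)

# The estimate should be close to n for each n
for n in (4, 16, 64):
    print(n, round(empirical_variance_of_dot(n), 2))
```

The estimates fluctuate with the seed and the number of trials, so they should only be expected to approach $n$, not equal it.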

In Lecture 08 and 09, we discussed how the attention is computed in Transformer via the following equation,
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
**Problem 1.2** *(10 points)*  Suppose $Q$ and $K$ are matrices of independent variables each of which has a mean of $0$ and a variance of $1$. Using what you learned from Problem 1.1., show that
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.$$

### $\color{blue}{\text{Solution 1.2}}$
<font color='blue'>

<font color='blue'> **LHS** <font>
    <font color='blue'>
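Each entry of $QK^\top$ is a dot product $q^\top k$ between one row of $Q$ and one row of $K$, written here as column vectors $q, k \in \mathbb{R}^{d_k}$ with means $\mu_q, \mu_k$ and covariance matrices $\Sigma_q, \Sigma_k$. By assumption, $\mu_q = \mu_k = 0$ and $\Sigma_q = \Sigma_k = I_{d_k}$.

\begin{align}
\text{Var}\left(\frac{q^\top k}{\sqrt{d_k}}\right) &= \frac{1}{d_k}\text{Var}\left(q^\top k \right)
\\ &= \frac{1}{d_k} \left(E\left[(q^\top k)^2\right] - \left(E\left[q^\top k\right]\right)^2\right)
\\&= \frac{1}{d_k} \left(E\left[ q^\top k k^\top q \right] - \left(E[q]^\top E[k] \right)^2\right)
\\&= \frac{1}{d_k} \left(E\left[Tr\left(q q^\top k k^\top \right)\right] - \left(\mu_q^\top \mu_k\right)^2\right)
\\&= \frac{1}{d_k} \left(Tr\left( E[q q^\top] E[k k^\top] \right) - \left(\mu_q^\top \mu_k\right)^2\right)
\\&= \frac{1}{d_k} \left( Tr\left(
        \left( \mu_q \mu_q^\top + \Sigma_q \right)
        \left( \mu_k \mu_k^\top + \Sigma_k \right)
     \right) - \left(\mu_q^\top \mu_k\right)^2\right)
\\&= \frac{1}{d_k} \left(Tr\left( \mu_q \mu_q^\top \mu_k \mu_k^\top \right)
   + Tr\left( \mu_q \mu_q^\top \Sigma_k \right)
   + Tr\left( \Sigma_q \mu_k \mu_k^\top \right)
   + Tr\left( \Sigma_q \Sigma_k \right)
   - \left(\mu_q^\top \mu_k\right)^2\right)
\\&= \frac{1}{d_k} \left(\left(\mu_q^\top \mu_k\right)^2
   + \mu_q^\top \Sigma_k \mu_q
   + \mu_k^\top \Sigma_q \mu_k
   + Tr\left( \Sigma_q \Sigma_k \right)
   - \left(\mu_q^\top \mu_k\right)^2\right)
\\&= \frac{1}{d_k} \left(\mu_q^\top \Sigma_k \mu_q + \mu_k^\top \Sigma_q \mu_k + Tr(\Sigma_q \Sigma_k)\right)
\\&= \frac{1}{d_k} \left(0 + 0 + Tr(I_{d_k})\right)
\\&= \frac{1}{d_k} \cdot d_k
\\&= 1
\end{align}

Equivalently, $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$ by Problem 1.1, and dividing by $\sqrt{d_k}$ scales the variance by $\frac{1}{d_k}$.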
<font color='blue'>
Hence proved that $\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1$


**Problem 1.3** *(10 points)* What would happen if the assumption that the variance of $Q$ and $K$ is $1$ does not hold? Consider each case of it being higher and lower than $1$ and conjecture what it implies, respectively.

### $\color{blue}{\text{Solution 1.3}}$
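<font color='blue'><br> **Case 1:** If the variance of $Q$ and $K$ is greater than $1$, then $\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ will also be greater than $1$ (for zero-mean entries with variance $\sigma^2$, it equals $\sigma^4$). The logits then have large magnitude, so the softmax may saturate at initialization: its output is nearly one-hot and its gradients are very small, making it difficult to learn.  
**Case 2:** If the variance of $Q$ and $K$ is less than $1$, then $\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ will also be less than $1$. The logits are then close to zero, so the softmax output is nearly uniform and the attention may be too flat to optimize effectively. 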
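
To illustrate both cases, the sketch below draws random queries and keys whose entries have standard deviation `sigma` and reports the average of the largest attention weight after the scaled softmax. It is an assumption-laden illustration (normal entries, 10 keys, $d_k = 64$), and the helper name `mean_max_attention` is hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_max_attention(sigma, d_k=64, n_keys=10, trials=200, seed=0):
    """Average largest attention weight when entries of q and k have std sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, sigma) for _ in range(d_k)]
        scores = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, sigma) for _ in range(d_k)]
            # scaled dot product, as in the attention equation
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        total += max(softmax(scores))
    return total / trials

# Small sigma: near-uniform weights (about 1 / n_keys); large sigma: near one-hot
for sigma in (0.3, 1.0, 3.0):
    print(sigma, round(mean_max_attention(sigma), 3))
```

A value near $1/10$ corresponds to a flat distribution over the 10 keys, while a value near $1$ corresponds to a saturated softmax.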


## 2. Preprocessing SQuAD

We will use the `datasets` package offered by Hugging Face, which allows us to easily download various language datasets, including the Stanford Question Answering Dataset (SQuAD).

First, install the package:

In [None]:
!pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/46/1a/b9f9b3bfef624686ae81c070f0a6bb635047b17cdb3698c7ad01281e6f9a/datasets-1.6.2-py3-none-any.whl (221kB)
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/e9/91/2ef649137816850fa4f4c97c6f2eabb1a79bf0aa2c8ed198e387e373455e/fsspec-2021.4.0-py3-none-any.whl (108kB)
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Installing collected packages: fsspec, huggingface-hub, xx

Then, download SQuAD and print the first example:

In [None]:
from datasets import load_dataset
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])


Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.75 MiB, post-processed: Unknown size, total: 119.27 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a...


Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a. Subsequent calls will reuse this data.
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5

Here, `answer_start` corresponds to the character-level start position of the answer, and `text` is the answer text itself. Note that the `answer_start` and `text` fields are given as lists, but each contains only one item; you can safely assume that this is the case for the training data. During evaluation, however, several possible answers are provided so that your prediction can be compared against all of them. So your code needs to handle the multiple-answer case as well.

As we discussed in Lecture 05, we want to formulate this task as a token classification problem. That is, we want to find which token of the context corresponds to the start position of the answer, and which corresponds to the end.

**Problem 2.1** *(10 points)* Write `preprocess()` function that takes a SQuAD example as the input and outputs space-tokenized context and question, as well as the start and end token position of the answer if it has the answer field. That is, a pseudo code would look like:
```python
def preprocess(example):
  out = {'context': ['each', 'token'], 
         'question': ['each', 'token']}
  if 'answers' not in example:
    return out
  out['answers'] = [{'start': 3, 'end': 5}]
  return out
```
Verify that this code works by comparing between the original answer text and the concatenation of the answer tokens from start to end in training data. Report the percentage of the questions that have exact match.

### $\color{blue}{\text{Solution 2.1}}$

In [None]:
########## space tokenization #############
def tokenization(text): 
    temp_tokens = text.split(" ")
    return temp_tokens

########## function to find start and end of a sublist in a list #############
def find_sub_list(sl,l):
    results=[]
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            results.append(ind)
            results.append(ind+sll-1)
    return results

########### preprocess function #############
def preprocess(tokenization, example):
  out = {}
  # Tokenize context and question with the supplied tokenizer
  out['context'] = tokenization(example['context'])
  out['question'] = tokenization(example['question'])
  # Examples without an answer field stop here (check before reading 'answers')
  if 'answers' not in example:
    return out
  # Training examples have exactly one answer; locate its tokens in the context
  answer_ids = tokenization(example['answers']['text'][0])
  startend = find_sub_list(answer_ids, out['context'])
  if startend:
    # find_sub_list returns [start, end, ...]; keep the first occurrence
    out['answers'] = [{'start': startend[0], 'end': startend[1]}]
  else:
    out['answers'] = []
  return out

In [None]:
######### An example to show the original and preprocessed data ############
print("############## EXAMPLE ###############")
i = 0
print(squad_dataset['train'][i])
output = preprocess(tokenization, squad_dataset['train'][i])
orig = squad_dataset['train'][i]['answers']['text'][0]
if (output['answers']):
  concat = ' '.join(output['context'][output['answers'][0]['start']:output['answers'][0]['end']+1])
else:
  concat = ''
if(orig == concat): 
  print("EXACT MATCH!!!!!")
print("Original answer text:", orig)
print("Concatenation of the answer tokens from start to end:", concat)

########## Finding percentage of the questions that have exact match ###########
exact = 0
total = len(squad_dataset['train'])
for i in range(total):
  output = preprocess(tokenization, squad_dataset['train'][i])
  orig = squad_dataset['train'][i]['answers']['text'][0]
  if (output['answers']):
    concat = ' '.join(output['context'][output['answers'][0]['start']:output['answers'][0]['end']+1])
  else:
    concat = ''
  if(orig == concat): 
    exact += 1

print("\n")
print("##### Finding percentage of the questions that have exact match #####")
print("No of questions that have exact match:", exact)
print("Total no of samples:", total)
print("Percentage of the questions that have exact match:", (exact/total)*100)

############## EXAMPLE ###############
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
EXACT MATCH!!!!

<font color='blue'> In the above cell, I present an example to verify that the code works by comparing the original answer text with the concatenation of the answer tokens from start to end in the training data. In this example, the two are exactly the same. 
    <font color='blue'>
The percentage of questions with an exact match is **55.108%**.
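
The sub-list search above returns the first place where the answer tokens occur, which is not necessarily the annotated occurrence. As a hedged alternative, the character-level `answer_start` can be mapped to token indices directly. The sketch below assumes single-space tokenization (`text.split(" ")`); the function name `char_to_token_span` is hypothetical and this is not the method used for the numbers reported here.

```python
def char_to_token_span(context, answer_start, answer_text):
    """Map a character-level answer span to inclusive token indices.

    Assumes tokens are produced by context.split(" "), so consecutive
    tokens are separated by exactly one space character.
    """
    answer_end = answer_start + len(answer_text)  # exclusive character index
    tokens = context.split(" ")
    start_tok, end_tok = None, None
    pos = 0
    for i, tok in enumerate(tokens):
        tok_start, tok_end = pos, pos + len(tok)
        # a token belongs to the answer if the character ranges intersect
        if tok_start < answer_end and tok_end > answer_start:
            if start_tok is None:
                start_tok = i
            end_tok = i
        pos = tok_end + 1  # skip the single space separator
    return tokens, start_tok, end_tok

context = "It appeared to Saint Bernadette Soubirous in 1858."
answer = "Saint Bernadette Soubirous"
tokens, s, e = char_to_token_span(context, context.index(answer), answer)
print(s, e, " ".join(tokens[s:e + 1]))
```

With an answer such as `1858`, the overlapping token is `1858.`, so the concatenation still differs from the answer text; this is the tokenization problem that the next section addresses.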

We want to maximize the percentage of the exact match. You might see a low percentage however, due to bad tokenization. For instance, such space-based tokenization will fail to separate between "world" and "!" in "hello world!". 

**Problem 2.2** *(10 points)* Write an advanced tokenization model that always separates non-alphabet characters as independent tokens. For instance, "hello1 world!!" will be tokenized into "hello", "1", "world", "!", and "!". Using this new tokenizer, re-run the `preprocess` function and report the exact match percentage. How does the ratio change?

### $\color{blue}{\text{Solution 2.2}}$

In [None]:
import nltk
import regex as re
import string
nltk.download('punkt')

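########## advanced tokenization #############
def adv_tokenization(text):
    tokens = []
    # Capturing group: split() keeps each non-alphabet character as its own token
    regex = re.compile('([^a-zA-Z])')
    for s in text.lower().split():
        # Note: split() also emits empty strings around the separators
        # (e.g. 'hello1' -> ['hello', '1', '']); they are kept as tokens here.
        _splitted = re.split(regex, s)
        tokens += _splitted
    return tokens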

########## function to find start and end of a sublist in a list #############
def find_sub_list(sl,l):
    results=[]
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            results.append(ind)
            results.append(ind+sll-1)
    return results

########### preprocess function #############
def preprocess(tokenization, example):
  out = {}
  # Tokenize context and question with the supplied tokenizer
  out['context'] = tokenization(example['context'])
  out['question'] = tokenization(example['question'])
  out['answer'] = []
  out['id'] = example['id']
  # Examples without an answer field stop here (check before reading 'answers')
  if 'answers' not in example:
    return out
  out['answers'] = example['answers']

  # Handle one or several reference answers
  for answer in example['answers']['text']:
    answer_ids = tokenization(answer)
    startend = find_sub_list(answer_ids, out['context'])
    if startend:
      # keep the first occurrence of the answer tokens in the context
      out['answer'].append([{'start': startend[0], 'end': startend[1]}])

  return out

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
######### An example to show the original and preprocessed data ############
print("############## EXAMPLE ###############")
i = 0
print(squad_dataset['train'][i])
output = preprocess(adv_tokenization, squad_dataset['train'][i])
orig = squad_dataset['train'][i]['answers']['text'][0].lower()
if (output['answer']):
  concat = ' '.join(output['context'][output['answer'][0][0]['start']:output['answer'][0][0]['end']+1])
else:
  concat = ''
if(orig == concat): 
  print("EXACT MATCH!!!!!")
print("Original answer text:", orig)
print("Concatenation of the answer tokens from start to end:", concat)

########## Finding percentage of the questions that have exact match ###########
exact = 0
total = len(squad_dataset['train'])
for i in range(total):
  output = preprocess(adv_tokenization, squad_dataset['train'][i])
  orig = squad_dataset['train'][i]['answers']['text'][0].lower()
  if (output['answer']):
    concat = ' '.join(output['context'][output['answer'][0][0]['start']:output['answer'][0][0]['end']+1])
  else:
    concat = ''
  if(orig == concat): 
    exact += 1
print("\n")
print("##### Finding percentage of the questions that have exact match #####")
print("No of questions that have exact match:", exact)
print("Total no of samples:", total)
print("Percentage of the questions that have exact match:", (exact/total)*100)

############## EXAMPLE ###############
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
EXACT MATCH!!!!

<font color='blue'> In the above cell, I present an example to verify that the code works by comparing the original answer text with the concatenation of the answer tokens from start to end in the training data. In this example, the two are exactly the same. 
    <font color='blue'>
The percentage of questions with an exact match is **64.2826%**. With the advanced tokenizer, the exact-match percentage increases by **9.17** percentage points over the space tokenizer (from 55.108%), which is **1.17** times the space-tokenizer result.
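
For reference, `re.split` with a capturing group also returns empty strings around the separators (for example, `'hello1'` becomes `['hello', '1', '']`). A hedged variant that yields exactly the tokens described in Problem 2.2 can be sketched with `re.findall`. The name `split_non_alpha` is hypothetical, the pattern treats only ASCII letters as alphabetic (an assumption), and this is not the tokenizer used for the numbers reported above.

```python
import re

# Either a run of ASCII letters, or a single non-letter, non-space character
_TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z\s]")

def split_non_alpha(text):
    """Tokenize so that every non-alphabet character is its own token."""
    return _TOKEN_RE.findall(text)

print(split_non_alpha("hello1 world!!"))
```

Because whitespace is never matched, no empty tokens are produced, and digits or punctuation marks are always emitted one character at a time.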

In [None]:
import nltk
nltk.download('punkt')

########## advanced tokenization #############
def adv_tokenization(text): 
    tokens = [token.replace("``", '"').replace("''", '"').lower() for token in nltk.tokenize.word_tokenize(text)]
    return tokens

########## function to find start and end of a sublist in a list #############
def find_sub_list(sl,l):
    results=[]
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            results.append(ind)
            results.append(ind+sll-1)
    return results

########### preprocess function #############
def preprocess(tokenization, example):
  out = {}
  # Tokenize context and question with the supplied tokenizer
  out['context'] = tokenization(example['context'])
  out['question'] = tokenization(example['question'])
  out['answer'] = []
  out['id'] = example['id']
  # Examples without an answer field stop here (check before reading 'answers')
  if 'answers' not in example:
    return out
  out['answers'] = example['answers']

  # Handle one or several reference answers
  for answer in example['answers']['text']:
    answer_ids = tokenization(answer)
    startend = find_sub_list(answer_ids, out['context'])
    if startend:
      # keep the first occurrence of the answer tokens in the context
      out['answer'].append([{'start': startend[0], 'end': startend[1]}])

  return out

[nltk_data] Downloading package punkt to /home/radhika/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
######### An example to show the original and preprocessed data ############
print("############## EXAMPLE ###############")
i = 0
print(squad_dataset['train'][i])
output = preprocess(adv_tokenization, squad_dataset['train'][i])
orig = squad_dataset['train'][i]['answers']['text'][0].lower()
if (output['answer']):
  concat = ' '.join(output['context'][output['answer'][0][0]['start']:output['answer'][0][0]['end']+1])
else:
  concat = ''
if(orig == concat): 
  print("EXACT MATCH!!!!!")
print("Original answer text:", orig)
print("Concatenation of the answer tokens from start to end:", concat)

########## Finding percentage of the questions that have exact match ###########
exact = 0
total = len(squad_dataset['train'])
for i in range(total):
  output = preprocess(adv_tokenization, squad_dataset['train'][i])
  orig = squad_dataset['train'][i]['answers']['text'][0].lower()
  if (output['answer']):
    concat = ' '.join(output['context'][output['answer'][0][0]['start']:output['answer'][0][0]['end']+1])
  else:
    concat = ''
  if(orig == concat): 
    exact += 1
print("\n")
print("##### Finding percentage of the questions that have exact match #####")
print("No of questions that have exact match:", exact)
print("Total no of samples:", total)
print("Percentage of the questions that have exact match:", (exact/total)*100)

############## EXAMPLE ###############
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}
EXACT MATCH!!!!

<font color='blue'> In the above cell, I use a more advanced tokenizer (NLTK's `word_tokenize`) and again present an example to verify that the code works by comparing the original answer text with the concatenation of the answer tokens from start to end in the training data. In this example, the two are exactly the same. 
    <font color='blue'>
The percentage of questions with an exact match is **86.6197%**. With this tokenizer, the exact-match percentage increases by **31.51** percentage points over the space tokenizer (from 55.108%), which is **1.57** times the space-tokenizer result.

###### $\color{blue}{\text{Constructing vocabulary from train data using advanced tokenizer and adding 'UNK' token}}$

In [None]:
####### Generating tokens #######
total = len(squad_dataset['train'])
tokens_context = []; tokens_question = []; tokens = []

for i in range(total):
  output = preprocess(adv_tokenization, squad_dataset['train'][i])
  tokens_context.extend(output['context'])
  tokens_question.extend(output['question'])
  tokens.extend(output['context']); tokens.extend(output['question'])

######## Finding the frequency of occurrence of each token ########
token_counts = {}
tokens_final = []

for token in tokens:
    if token in token_counts:
        token_counts[token] += 1
    else:
        token_counts[token] = 1

####### Constructing vocabulary by including only those tokens that occurred at least 2 times #########
for token in token_counts.keys():
    if token_counts[token] >= 2:
        tokens_final.append(token)

vocab = ['PAD'] + ['UNK'] + ['SEP'] + list(set(tokens_final))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print("Vocabulary size:", len(vocab))
print(word2id["UNK"])

Vocabulary size: 97908
1


In [None]:
print(word2id['PAD'])

0
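
The vocabulary construction above (count tokens, keep those seen at least twice, prepend special tokens) can also be sketched with `collections.Counter`. The names `build_vocab` and `encode` are hypothetical. Sorting the kept tokens is a design choice of this sketch so that ids are reproducible across runs, whereas the order of `list(set(...))` over strings can vary between interpreter runs.

```python
from collections import Counter

def build_vocab(token_lists, min_count=2, specials=("PAD", "UNK", "SEP")):
    """Build (vocab, word2id) from tokenized texts, keeping frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # sorted() gives a deterministic id assignment
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    vocab = list(specials) + kept
    return vocab, {tok: i for i, tok in enumerate(vocab)}

def encode(tokens, word2id):
    """Map tokens to ids; out-of-vocabulary tokens map to the 'UNK' id."""
    unk = word2id["UNK"]
    return [word2id.get(tok, unk) for tok in tokens]

vocab, word2id_demo = build_vocab([["a", "b", "a"], ["b", "c"]])
print(vocab)
print(encode(["a", "c", "zzz"], word2id_demo))
```

Tokens that occur only once in the training data, and tokens never seen at all, are both encoded as `UNK`.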


###### $\color{blue}{\text{Preparing train and test data }}$

In [None]:
def prepare_sequence(seq, word2id):
    # Map each token to its id; out-of-vocabulary tokens map to 'UNK' (id 1)
    unk_id = word2id['UNK']
    idxs = [word2id.get(w, unk_id) for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

def prepare_labels(context, token_index):
    idxs = [1 if i == token_index else 0 for i in range(len(context))]
    return torch.tensor(idxs, dtype=torch.long)

In [None]:
########## preparing data for training and testing ###########
def prepare_data(data):
  total = len(data)
  context_ids = []
  question_ids = []
  ques_cont_ids = []
  cont_ques_ids = []
  labels_start = []
  labels_end = []
  references = []
  context_tokenized = []
  ids = []
  for i in range(total):
    output = preprocess(adv_tokenization, data[i])
    output_context = prepare_sequence(output['context'], word2id)
    output_question = prepare_sequence(output['question'], word2id)
    ques_cont = prepare_sequence(output['question'] + ['SEP']+ output['context'], word2id)
    cont_ques = prepare_sequence(output['context'] + ['SEP']+ output['question'], word2id)
    
    if (output['answer']):
      labels_start_per_sample = output['answer'][0][0]['start']
      labels_end_per_sample = output['answer'][0][0]['end']+1
      if(labels_start_per_sample>0 and labels_end_per_sample<len(output_context) and labels_start_per_sample<labels_end_per_sample):
          context_ids.append(output_context)
          question_ids.append(output_question)
          ques_cont_ids.append(ques_cont)
          cont_ques_ids.append(cont_ques)
          labels_start.append(labels_start_per_sample)
          labels_end.append(labels_end_per_sample)
          references.append({'id': output['id'], 'answers': output['answers']})
          ids.append(output['id'])
          context_tokenized.append(output['context'])
          
  print(len(labels_end))
  return context_ids, question_ids, ques_cont_ids, cont_ques_ids, labels_start, labels_end, references, context_tokenized, ids

train_context_ids, train_question_ids, train_ques_cont_ids, train_cont_ques_ids, train_labels_start, train_labels_end, train_references, train_context_tokenized, train_ids = prepare_data(squad_dataset['train'])
test_context_ids, test_question_ids, test_ques_cont_ids, test_cont_ques_ids, test_labels_start, test_labels_end, test_references, test_context_tokenized, test_ids = prepare_data(squad_dataset['validation'])

83570
10166


## 3. LSTM Baseline for SQuAD

We will bring and reuse our model from Assignment 1. There are two key differences, however. First, we need to classify each token instead of the entire sentence. Second, we have two inputs (context and question) instead of just one. 

Before resolving these differences, you will need to define your evaluation function to correctly evaluate how well your model is doing. Note that the evaluation was very straightforward in Assignment 1's sentiment classification (it is either positive or negative), while it is a bit more complicated in SQuAD. We will use the evaluation function provided by `datasets`. You can access it via the following code.  
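
To make concrete what such an evaluation measures, here is a minimal sketch of exact match (EM) and token-level F1 for a single prediction against several reference answers. It is modeled on the commonly used SQuAD v1.1 evaluation logic (lowercasing, removing punctuation and articles, then comparing tokens); the function names are hypothetical and are not part of the `datasets` API.

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truths):
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return max(float(pred == normalize_answer(gt)) for gt in ground_truths)

def f1(prediction, ground_truths):
    """Best token-overlap F1 between the prediction and any reference."""
    def single(pred, gt):
        pred_toks = normalize_answer(pred).split()
        gt_toks = normalize_answer(gt).split()
        num_same = sum((Counter(pred_toks) & Counter(gt_toks)).values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_toks)
        recall = num_same / len(gt_toks)
        return 2 * precision * recall / (precision + recall)
    return max(single(prediction, gt) for gt in ground_truths)

refs = ["Saint Bernadette Soubirous"]
print(exact_match("the Saint Bernadette Soubirous.", refs))
print(f1("Bernadette Soubirous", refs))
```

Taking the maximum over the references is what allows several acceptable answers per question during evaluation.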

In [None]:
from datasets import load_metric
squad_metric = load_metric('squad')

You can also easily learn about how to use the function by simply typing the function:

In [None]:
squad_metric

Metric(name: "squad", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for the answer, as a list of ints
   

**Problem 3.1** *(10 points)* Let's resolve the first issue here. For now, assume that your only input is the context and that you want to obtain the answer without seeing the question. While this may seem nonsensical, it can be viewed as modeling the prior $\text{Prob}(a|c)$ before observing $q$ (we ultimately want $\text{Prob}(a|q,c)$). Transform your model into a token classification model by imposing $\text{softmax}$ over the tokens instead of predefined classes. You will need to do this twice, once each for the start and the end. Report the accuracy (using the metric above) on `squad_dataset['validation']`. 

### $\color{blue}{\text{Solution 3.1}}$
<font color='blue'> We use the context as the only input to the LSTM model and report the performance of the model.
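
Applying softmax over the tokens twice gives one distribution for the start position and one for the end position. At inference time, one common way to turn them into an answer span is to pick the pair with the highest joint probability subject to start $\le$ end. The sketch below assumes inclusive end indices and plain Python lists of logits; `best_span` and its `max_len` parameter are hypothetical names, not part of the model code in this notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def best_span(start_logits, end_logits, max_len=None):
    """Return ((start, end), score) maximizing p_start * p_end with start <= end."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    n = len(p_start)
    best, best_score = (0, 0), -1.0
    for i in range(n):
        # optionally cap the answer length at max_len tokens
        stop = n if max_len is None else min(n, i + max_len)
        for j in range(i, stop):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Independent argmax would give start=1, end=0, which is not a valid span
print(best_span([0.1, 2.0, 0.3, 0.0], [3.0, 0.2, 1.5, 0.1]))
```

The double loop is quadratic in the context length; it is kept simple here for clarity.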

In [None]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)

<torch._C.Generator at 0x7fe6c00f37b0>

In [None]:
##### Dataset class ##########

from torch.utils.data import Dataset, DataLoader
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class Squad_data(Dataset):

    def __init__(self, context_ids = None, question_ids = None, ques_cont_ids = None, cont_ques_ids = None, labels_start=None, labels_end=None):
        self.context_ids = context_ids
        self.question_ids = question_ids
        self.ques_cont_ids = ques_cont_ids
        self.cont_ques_ids = cont_ques_ids
        self.labels_start = labels_start
        self.labels_end = labels_end
        self.len = len(labels_end)

    def __getitem__(self, index):
        return self.context_ids[index], self.question_ids[index], self.ques_cont_ids[index], self.cont_ques_ids[index], self.labels_start[index], self.labels_end[index]

    def __len__(self):
        return self.len


def pad_collate(batch):
  (context_ids, question_ids, ques_cont_ids, cont_ques_ids, labels_start, labels_end) = zip(*batch)
  context_lens = [len(x) for x in context_ids]
  question_lens = [len(y) for y in question_ids]
  ques_cont_lens = [len(y) for y in ques_cont_ids]
  cont_ques_lens = ques_cont_lens

  context_pad = pad_sequence(context_ids, batch_first=True, padding_value=0)
  question_pad = pad_sequence(question_ids, batch_first=True, padding_value=0)
  ques_cont_pad = pad_sequence(ques_cont_ids, batch_first=True, padding_value=0)
  cont_ques_pad = pad_sequence(cont_ques_ids, batch_first=True, padding_value=0)
  labels_start = torch.Tensor(labels_start)
  labels_end = torch.Tensor(labels_end)

  return context_pad, question_pad, ques_cont_pad, cont_ques_pad, labels_start, labels_end, context_lens, question_lens, ques_cont_lens, cont_ques_lens

train_data = Squad_data(train_context_ids[:80000], train_question_ids[:80000], train_ques_cont_ids[:80000], train_cont_ques_ids[:80000], train_labels_start[:80000], train_labels_end[:80000])
val_data = Squad_data(train_context_ids[80000:], train_question_ids[80000:], train_ques_cont_ids[80000:], train_cont_ques_ids[80000:], train_labels_start[80000:], train_labels_end[80000:])
test_data = Squad_data(test_context_ids, test_question_ids, test_ques_cont_ids, test_cont_ques_ids, test_labels_start, test_labels_end)

train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True, collate_fn=pad_collate)
val_loader = DataLoader(dataset=val_data, batch_size=64, shuffle=False, collate_fn=pad_collate)
test_loader = DataLoader(dataset=test_data, batch_size=64, shuffle=False, collate_fn=pad_collate)
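
The collate function above pads every sequence in a batch to the longest one with id 0, and the training loop later derives a mask by comparing ids with 0. The following plain-Python sketch shows that idea on a toy batch. The helper names `pad_batch` and `padding_mask` are hypothetical, and the sketch assumes that no real token is assigned id 0.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

def padding_mask(padded, pad_id=0):
    """1 for real tokens, 0 for padding; assumes no real token uses pad_id."""
    return [[int(tok != pad_id) for tok in row] for row in padded]

# Toy batch of token-id sequences with different lengths.
batch = [[5, 8, 2], [7], [4, 9]]
padded, lengths = pad_batch(batch)
mask = padding_mask(padded)
print(padded)
print(lengths)
print(mask)
```

The original lengths are kept because `pack_padded_sequence` needs them to skip the padded steps inside the LSTM.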

In [None]:
class LSTM(nn.Module):

    def __init__(self, vocab_size = len(vocab), embedding_dim = 256, hidden_dim = 256, tagset_size = 2):
        super(LSTM, self).__init__()

        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.tagset_size = tagset_size
        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True)

        self.hidden2start = nn.Linear(self.hidden_dim, 1)
        self.hidden2end = nn.Linear(self.hidden_dim, 1)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, sentence, lens, mask):
        embeds = self.word_embeddings(sentence)
        embeds_packed = pack_padded_sequence(embeds, lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(embeds_packed)
        lstm_out, output_lengths = pad_packed_sequence(lstm_out, batch_first=True)
        tag_start = self.hidden2start(lstm_out)
        # Squeeze only the last dimension so a batch of size 1 keeps its batch axis.
        tag_start = tag_start.squeeze(-1)
        mask = mask.type(torch.float32)
        # Padded positions get a large negative score, so the softmax ignores them.
        masked_tag_start = mask * tag_start + (1 - mask) * -1e30
        tag_scores_start = F.log_softmax(masked_tag_start, dim=1)
        tag_end = self.hidden2end(lstm_out)
        tag_end = tag_end.squeeze(-1)
        masked_tag_end = mask * tag_end + (1 - mask) * -1e30
        tag_scores_end = F.log_softmax(masked_tag_end, dim=1)

        return tag_scores_start, tag_scores_end

In [None]:
def train(model, train_loader, val_loader, num_epochs = 12, device=None): 
    model.to(device)
    criterion = nn.NLLLoss(reduction='sum')
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    
    train_loss=[]
    val_loss = []
   
    for epoch in range(num_epochs):
      model.train()
      epoch_loss = 0
      for i, (train_context_ids, train_question_ids, train_ques_cont_ids, train_cont_ques_ids, train_labels_start, train_labels_end, context_lens, question_lens, ques_cont_lens, cont_ques_lens) in enumerate(train_loader):
          input_tensor = train_context_ids.to(device)
          c_mask = torch.zeros_like(input_tensor) != input_tensor
          # The labels are already tensors; cast them instead of re-wrapping with torch.tensor.
          train_labels_start = train_labels_start.long().to(device)
          train_labels_end = train_labels_end.long().to(device)
          optimizer.zero_grad()
          tag_scores_start, tag_scores_end = model(input_tensor, context_lens, c_mask)
          loss = criterion(tag_scores_start, train_labels_start) + criterion(tag_scores_end, train_labels_end)
          
          loss.backward()
          optimizer.step()
          epoch_loss += loss.item()
      train_loss.append(round((epoch_loss/len(train_data)), 2))

      with torch.no_grad():
            model.eval()
            val_epoch_loss = 0
            for i, (val_context_ids, val_question_ids, val_ques_cont_ids, val_cont_ques_ids, val_labels_start, val_labels_end, val_context_lens, val_question_lens, val_ques_cont_lens, val_cont_ques_lens) in enumerate(val_loader):
                input_tensor = val_context_ids.to(device)
                c_mask = torch.zeros_like(input_tensor) != input_tensor
                val_labels_start = val_labels_start.long().to(device)
                val_labels_end = val_labels_end.long().to(device)
                tag_scores_start, tag_scores_end = model(input_tensor, val_context_lens, c_mask)

                loss = criterion(tag_scores_start, val_labels_start) + criterion(tag_scores_end, val_labels_end)
                val_epoch_loss += loss.item()

            val_loss.append(round((val_epoch_loss/len(val_data)),2))
      print("epoch {}: Training Loss- {:.2f}  Val loss - {:.2f}".format(epoch, epoch_loss/len(train_data), val_epoch_loss/len(val_data)))
    return model, train_loss, val_loss

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
def test(model, test_loader):
    model.to(device)
    model.eval()
    criterion = nn.NLLLoss(reduction='sum')
    start_list = []
    end_list = []
    with torch.no_grad():
        test_loss = 0
        for i, (test_context_ids, test_question_ids, test_ques_cont_ids, test_cont_ques_ids, test_labels_start, test_labels_end, test_context_lens, test_question_lens, test_ques_cont_lens, test_cont_ques_lens) in enumerate(test_loader):
            input_tensor = test_context_ids.to(device)
            c_mask = torch.zeros_like(input_tensor) != input_tensor
            test_labels_start = test_labels_start.long().to(device)
            test_labels_end = test_labels_end.long().to(device)
            tag_scores_start, tag_scores_end = model(input_tensor, test_context_lens, c_mask)
            # Most likely start and end token for each example in the batch.
            start_idx = torch.argmax(tag_scores_start, dim=-1)
            end_idx = torch.argmax(tag_scores_end, dim=-1)
            start_list.extend(start_idx)
            end_list.extend(end_idx)
            loss = criterion(tag_scores_start, test_labels_start) + criterion(tag_scores_end, test_labels_end)
            test_loss += loss.item()
        print("Test loss: {}". format(round((test_loss/len(test_data)),2)))
    return start_list, end_list

In [None]:
def evaluate(start_list, end_list, context_tokenized, references, ids):
    predictions = []
    for i in range(len(ids)):
        start, end = start_list[i].item(), end_list[i].item()
        # If the predicted end precedes the start, fall back to the single start token.
        if end < start:
            end = start
        pred_text = ' '.join(context_tokenized[i][start:end + 1])
        predictions.append({'id': ids[i], 'prediction_text': pred_text})
    results = squad_metric.compute(predictions=predictions, references=references)
    print(results)
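
Because the start and end positions are chosen independently, the predicted end can fall before the predicted start, which is why `evaluate` falls back to a single token. One common alternative, shown here only as a sketch and not used in this notebook, is to search for the pair that maximises the summed log-probabilities subject to `start <= end`. The function name `best_span` and the cap `max_answer_len=30` are assumptions made for illustration.

```python
def best_span(start_log_probs, end_log_probs, max_answer_len=30):
    """Return (start, end) maximising start + end log-probability with start <= end."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_lp in enumerate(start_log_probs):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_log_probs), s + max_answer_len)
        for e in range(s, last):
            score = s_lp + end_log_probs[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Hypothetical log-probabilities: the independent argmaxes are start=1, end=0.
start_lp = [-3.0, -0.2, -2.5, -4.0]
end_lp = [-0.1, -3.0, -1.0, -2.0]
span = best_span(start_lp, end_lp)
print(span)
```

In this toy case the independent choice would give an invalid span, while the joint search returns a valid one that starts at the same token.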

In [None]:
num_epochs = 10
model = LSTM()
model, train_loss, val_loss = train(model, train_loader, val_loader, num_epochs, device)



epoch 0: Training Loss- 7.90  Val loss - 7.78
epoch 1: Training Loss- 7.18  Val loss - 7.73
epoch 2: Training Loss- 6.71  Val loss - 7.89
epoch 3: Training Loss- 6.27  Val loss - 8.04
epoch 4: Training Loss- 5.86  Val loss - 8.38
epoch 5: Training Loss- 5.49  Val loss - 8.73
epoch 6: Training Loss- 5.17  Val loss - 9.03
epoch 7: Training Loss- 4.92  Val loss - 9.62
epoch 8: Training Loss- 4.71  Val loss - 9.87
epoch 9: Training Loss- 4.54  Val loss - 10.24
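
In the log above, validation loss is lowest at epoch 1 and then rises while training loss keeps falling. The notebook evaluates the model after the final epoch. The sketch below is not part of the original solution; it only shows how the logged values could be used to pick a checkpoint or stop early. The `patience=2` setting is a hypothetical choice.

```python
# Validation losses copied from the training log above (epochs 0-9).
val_losses = [7.78, 7.73, 7.89, 8.04, 8.38, 8.73, 9.03, 9.62, 9.87, 10.24]

def best_epoch(losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(losses)), key=lambda i: losses[i])

def early_stop_epoch(losses, patience=2):
    """Epoch at which training would stop after `patience` epochs without improvement."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses) - 1

chosen = best_epoch(val_losses)
stop_at = early_stop_epoch(val_losses)
print(chosen, stop_at)
```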


In [None]:
start_list, end_list = test(model, test_loader) 
evaluate(start_list, end_list, test_context_tokenized, test_references, test_ids)



Test loss: 10.83
{'exact_match': 2.63623844186504, 'f1': 8.896667938985173}


<font color='blue'> In the above cells, we train the **LSTM model** on the training set and evaluate it on the test set. In this experiment, the **input is the context** only, and we want to obtain the answer without seeing the question.
On evaluating the model on the test set, we get the following scores:<br> 
    **exact_match:** 2.63623844186504<br> 
    **f1:** 8.896667938985173
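
For readers unfamiliar with the two numbers, the following is a simplified sketch of how exact match and token-level F1 are commonly computed for SQuAD v1.1. It follows my understanding of the standard evaluation logic (normalise, then compare against every reference and keep the best score) and is not the code of the `squad` metric object used above. The example strings are made up.

```python
import re
import string
from collections import Counter

PUNCT = set(string.punctuation)

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truths):
    """1.0 if the normalised prediction equals any normalised reference, else 0.0."""
    return max(float(normalize(prediction) == normalize(t)) for t in truths)

def f1(prediction, truths):
    """Best token-overlap F1 between the prediction and any reference."""
    def single(pred, truth):
        pred_toks, true_toks = normalize(pred).split(), normalize(truth).split()
        overlap = sum((Counter(pred_toks) & Counter(true_toks)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_toks)
        recall = overlap / len(true_toks)
        return 2 * precision * recall / (precision + recall)
    return max(single(prediction, t) for t in truths)

em = exact_match("The Denver Broncos.", ["Denver Broncos"])
partial = f1("Broncos won", ["Denver Broncos", "the Broncos of Denver"])
print(em, partial)
```

The reported scores are these per-example values averaged over the dataset and expressed on a 0 to 100 scale.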

**Problem 3.2** *(10 points)*  Now let's resolve the second issue, by simply concatenating the two inputs into one sequence. The simplest way would be to append the question at the start *OR* the end of the context. If you put it at the start, you will need to shift the start and the end positions of the answer accordingly. If you put it at the end, it will be necessary to use a bidirectional LSTM for the context to be aware of what is ahead (though it is recommended to use a bidirectional LSTM even if the question is appended at the start). Whichever you choose, carry it out and report the accuracy. How does it differ from 3.1?

### $\color{blue}{\text{Solution 3.2}}$
<font color='blue'> Now we append the question at the start of the context and use the concatenation as the input to the model.
I trained two models, an LSTM and a BiLSTM, and report their performance.
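
Prepending the question moves every context token to the right by the question length, so the answer labels must be shifted by the same amount during training and shifted back at prediction time. A small worked example with made-up token ids follows; the helper names are hypothetical.

```python
def prepend_question(question_ids, context_ids, start, end):
    """Concatenate question + context and shift answer labels into the new index space."""
    offset = len(question_ids)
    return question_ids + context_ids, start + offset, end + offset

def to_context_span(pred_start, pred_end, question_len):
    """Map positions in the concatenated input back to context positions."""
    return pred_start - question_len, pred_end - question_len

question = [11, 12, 13]            # hypothetical token ids
context = [21, 22, 23, 24, 25]     # the answer is context[1:3] == [22, 23]
tokens, start, end = prepend_question(question, context, start=1, end=2)
recovered = to_context_span(start, end, len(question))
print(start, end, tokens[start:end + 1], recovered)
```

This is the same arithmetic the training loop performs with `train_labels_start + question_lens` and the test loop reverses with `start_idx - test_question_lens`.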

In [None]:
class LSTM1(nn.Module):

    def __init__(self, vocab_size = len(vocab), embedding_dim = 256, hidden_dim = 256, tagset_size = 2):
        super(LSTM1, self).__init__()

        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.tagset_size = tagset_size
        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True)

        self.hidden2start = nn.Linear(self.hidden_dim, 1)
        self.hidden2end = nn.Linear(self.hidden_dim, 1)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, sentence, ques_lens, context_lens, ques_cont_lens, mask):
        mask = mask.type(torch.float32)
        for i in range(len(ques_lens)):
            mask[i][:ques_lens[i]] = 0
        embeds = self.word_embeddings(sentence)
        embeds_packed = pack_padded_sequence(embeds, ques_cont_lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(embeds_packed)
        lstm_out, output_lengths = pad_packed_sequence(lstm_out, batch_first=True)
        tag_start = self.hidden2start(lstm_out)
        # Squeeze only the last dimension so a batch of size 1 keeps its batch axis.
        tag_start = tag_start.squeeze(-1)
        # Question and padding positions get a large negative score, so the softmax ignores them.
        masked_tag_start = mask * tag_start + (1 - mask) * -1e30
        tag_scores_start = F.log_softmax(masked_tag_start, dim=1)
        tag_end = self.hidden2end(lstm_out)
        tag_end = tag_end.squeeze(-1)
        masked_tag_end = mask * tag_end + (1 - mask) * -1e30
        tag_scores_end = F.log_softmax(masked_tag_end, dim=1)

        return tag_scores_start, tag_scores_end

In [None]:
class BiLSTM(nn.Module):

    def __init__(self, vocab_size = len(vocab), embedding_dim = 256, hidden_dim = 256, tagset_size = 2):
        super(BiLSTM, self).__init__()

        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.tagset_size = tagset_size
        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True, bidirectional =True)

        self.hidden2start = nn.Linear(self.hidden_dim*2, 1)
        self.hidden2end = nn.Linear(self.hidden_dim*2, 1)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, sentence, ques_lens, context_lens, ques_cont_lens, mask):
        mask = mask.type(torch.float32)
        for i in range(len(ques_lens)):
            mask[i][:ques_lens[i]] = 0
        embeds = self.word_embeddings(sentence)
        embeds_packed = pack_padded_sequence(embeds, ques_cont_lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(embeds_packed)
        lstm_out, output_lengths = pad_packed_sequence(lstm_out, batch_first=True)
        tag_start = self.hidden2start(lstm_out)
        # Squeeze only the last dimension so a batch of size 1 keeps its batch axis.
        tag_start = tag_start.squeeze(-1)
        # Question and padding positions get a large negative score, so the softmax ignores them.
        masked_tag_start = mask * tag_start + (1 - mask) * -1e30
        tag_scores_start = F.log_softmax(masked_tag_start, dim=1)
        tag_end = self.hidden2end(lstm_out)
        tag_end = tag_end.squeeze(-1)
        masked_tag_end = mask * tag_end + (1 - mask) * -1e30
        tag_scores_end = F.log_softmax(masked_tag_end, dim=1)

        return tag_scores_start, tag_scores_end

In [None]:
def train(model, train_loader, val_loader, num_epochs = 12, device=None): 
    model.to(device)
    criterion = nn.NLLLoss(reduction='sum')
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    
    train_loss=[]
    val_loss = []
   
    for epoch in range(num_epochs):
      model.train()
      epoch_loss = 0
      for i, (train_context_ids, train_question_ids, train_ques_cont_ids, train_cont_ques_ids, train_labels_start, train_labels_end, context_lens, question_lens, ques_cont_lens, cont_ques_lens) in enumerate(train_loader):
          input_tensor = train_ques_cont_ids.to(device)
          c_mask = torch.zeros_like(input_tensor) != input_tensor
          # The labels are already tensors; cast them instead of re-wrapping with torch.tensor.
          train_labels_start = train_labels_start.long().to(device)
          train_labels_end = train_labels_end.long().to(device)
          optimizer.zero_grad()
          tag_scores_start, tag_scores_end = model(input_tensor, question_lens, context_lens, ques_cont_lens, c_mask)
          question_lens = torch.LongTensor(question_lens).to(device)
          loss = criterion(tag_scores_start, train_labels_start+question_lens) + criterion(tag_scores_end, train_labels_end + question_lens)
          loss.backward()
          optimizer.step()
          epoch_loss += loss.item()
      epoch_loss = round(epoch_loss/len(train_data), 2)
      train_loss.append(epoch_loss)

      with torch.no_grad():
            model.eval()
            val_epoch_loss = 0
            for i, (val_context_ids, val_question_ids, val_ques_cont_ids, val_cont_ques_ids, val_labels_start, val_labels_end, val_context_lens, val_question_lens, val_ques_cont_lens, val_cont_ques_lens) in enumerate(val_loader):
                input_tensor = val_ques_cont_ids.to(device)
                c_mask = torch.zeros_like(input_tensor) != input_tensor
                val_labels_start = val_labels_start.long().to(device)
                val_labels_end = val_labels_end.long().to(device)
                tag_scores_start, tag_scores_end = model(input_tensor, val_question_lens, val_context_lens, val_ques_cont_lens, c_mask)
                val_question_lens = torch.LongTensor(val_question_lens).to(device)
                loss = criterion(tag_scores_start, val_labels_start+val_question_lens) + criterion(tag_scores_end, val_labels_end+val_question_lens)
                val_epoch_loss += loss.item()

            val_epoch_loss = round(val_epoch_loss/len(val_data),2)
            val_loss.append(val_epoch_loss)
      
      print("epoch {}: Training Loss- {:.2f}  Val loss - {:.2f}".format(epoch, epoch_loss, val_epoch_loss))
    return model, train_loss, val_loss

In [None]:
def test(model, test_loader):
    model.to(device)
    model.eval()
    start_list = []
    end_list = []
    criterion = nn.NLLLoss(reduction='sum')
    with torch.no_grad():
        test_loss = 0
        for i, (test_context_ids, test_question_ids, test_ques_cont_ids, test_cont_ques_ids, test_labels_start, test_labels_end, test_context_lens, test_question_lens, test_ques_cont_lens, test_cont_ques_lens) in enumerate(test_loader):
            input_tensor = test_ques_cont_ids.to(device)
            c_mask = torch.zeros_like(input_tensor) != input_tensor
            test_labels_start = test_labels_start.long().to(device)
            test_labels_end = test_labels_end.long().to(device)
            tag_scores_start, tag_scores_end = model(input_tensor, test_question_lens, test_context_lens, test_ques_cont_lens, c_mask)
            start_idx = torch.argmax(tag_scores_start, dim=1)
            end_idx = torch.argmax(tag_scores_end, dim=1)
            test_question_lens = torch.LongTensor(test_question_lens).to(device)
            loss = criterion(tag_scores_start, test_labels_start+test_question_lens) + criterion(tag_scores_end, test_labels_end+test_question_lens)
            test_loss += loss.item()
            start_list.extend(start_idx-test_question_lens)
            end_list.extend(end_idx-test_question_lens)
        print("Test loss: {}". format(round((test_loss/len(test_data)),2)))
    return start_list, end_list

#### $\color{blue}{\text{Training and testing LSTM model}}$

In [None]:
num_epochs = 10
model1 = LSTM1()
model1, train_loss, val_loss = train(model1, train_loader, val_loader, num_epochs, device)



epoch 0: Training Loss- 8.33  Val loss - 8.13
epoch 1: Training Loss- 7.51  Val loss - 7.92
epoch 2: Training Loss- 6.91  Val loss - 7.78
epoch 3: Training Loss- 6.27  Val loss - 7.86
epoch 4: Training Loss- 5.64  Val loss - 8.11
epoch 5: Training Loss- 4.99  Val loss - 8.44
epoch 6: Training Loss- 4.35  Val loss - 9.04
epoch 7: Training Loss- 3.74  Val loss - 9.67
epoch 8: Training Loss- 3.19  Val loss - 10.38
epoch 9: Training Loss- 2.70  Val loss - 11.08


In [None]:
start_list, end_list = test(model1, test_loader)    
evaluate(start_list, end_list, test_context_tokenized, test_references, test_ids)



Test loss: 11.26
{'exact_match': 4.642927405075743, 'f1': 13.600260130770142}


<font color='blue'> In the above cells, we train the **LSTM model** on the training set and evaluate it on the test set. In this experiment, **we append the question at the start of the context and use it as the input** to the model. 
On evaluating the model on the test set, we get the following scores:<br> 
    **exact_match:** 4.642927405075743<br> 
    **f1:** 13.600260130770142<br>
<font color='blue'> 
Both exact_match and f1 show that the LSTM with the question prepended to the context outperforms the context-only LSTM from Solution 3.1 (exact_match from 2.64 to 4.64, f1 from 8.90 to 13.60).

#### $\color{blue}{\text{Training and testing BiLSTM model}}$

In [None]:
num_epochs = 10
model1 = BiLSTM()
model1, train_loss, val_loss = train(model1, train_loader, val_loader, num_epochs, device)



epoch 0: Training Loss- 7.75  Val loss - 7.70
epoch 1: Training Loss- 6.83  Val loss - 7.21
epoch 2: Training Loss- 6.02  Val loss - 7.09
epoch 3: Training Loss- 5.32  Val loss - 7.15
epoch 4: Training Loss- 4.64  Val loss - 7.38
epoch 5: Training Loss- 3.98  Val loss - 7.77
epoch 6: Training Loss- 3.34  Val loss - 8.47
epoch 7: Training Loss- 2.75  Val loss - 9.08
epoch 8: Training Loss- 2.22  Val loss - 9.99
epoch 9: Training Loss- 1.77  Val loss - 11.14


In [None]:
start_list, end_list = test(model1, test_loader)    
evaluate(start_list, end_list, test_context_tokenized, test_references, test_ids)



Test loss: 11.26
{'exact_match': 8.115286248278576, 'f1': 18.636382168003795}


<font color='blue'> In the above cells, we train the **BiLSTM model** on the training set and evaluate it on the test set. In this experiment, **we append the question at the start of the context and use it as the input** to the model. 
On evaluating the model on the test set, we get the following scores:<br> 
    **exact_match:** 8.115286248278576<br> 
    **f1:** 18.636382168003795<br>
<font color='blue'> 
**Both exact_match and f1 show that using a BiLSTM instead of a unidirectional LSTM further improves performance (exact_match from 4.64 to 8.12, f1 from 13.60 to 18.64).**

## 4. LSTM + Attention for SQuAD

**Problem 4.1** *(20 points)* Here, we will be appending an attention layer on top of LSTM outputs. We will use a single-head attention sublayer from Transformer. That is, you will implement 
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$
where $Q, K, V$ are obtained by linear transformations of the LSTM hidden states $H$, i.e. $Q = HW^Q, K=HW^K, V=HW^V$ ($W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are trainable weights). Note that the output of the $\text{Attention}$ layer has the same dimension as $H$, so you can directly append your token classification layer on top of it. Report the accuracy and compare it with 3.2.
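
To see the formula at work on numbers, here is a minimal sketch of single-head scaled dot-product attention in plain Python. For illustration only, it assumes identity projections ($W^Q = W^K = W^V = I$), so $Q = K = V = H$, and a made-up $H$ with three tokens and $d = 2$.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of d-dimensional row vectors."""
    d = len(Q[0])
    out, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Hypothetical hidden states: 3 tokens, d = 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(H, H, H)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the output keeps the shape of $H$, which is why the token classification layer can be applied to it unchanged.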



### $\color{blue}{\text{Solution 4.1}}$
<font color='blue'> Now, we will be appending an attention layer on top of LSTM outputs. We will use a single-head attention sublayer from Transformer.
I did two experiments:
   
1.   <font color='blue'>Added attention on top of LSTM outputs and reported its performance

2.   <font color='blue'>Added attention on top of BiLSTM outputs and reported its performance


#### $\color{blue}{\text{Attention on top of LSTM outputs }}$

In [None]:
import math
class LSTM_attention(nn.Module):

    def __init__(self, vocab_size = len(vocab), embedding_dim = 256, hidden_dim = 256):
        super(LSTM_attention, self).__init__()

        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True)

        self.hq = nn.Linear(self.hidden_dim, self.hidden_dim)
        self.hv = nn.Linear(self.hidden_dim, self.hidden_dim)
        self.hk = nn.Linear(self.hidden_dim, self.hidden_dim)
        
        self.hidden2start = nn.Linear(self.hidden_dim, 1)
        self.hidden2end = nn.Linear(self.hidden_dim, 1)
        self.softmax = nn.Softmax(dim=1)
        self.scale = 1. / math.sqrt(hidden_dim)

    def forward(self, sentence, ques_lens, context_lens, ques_cont_lens, mask):
        mask = mask.type(torch.float32)
        batch_size = sentence.shape[0]
        for i in range(len(ques_lens)):
            mask[i][:ques_lens[i]] = 0
        embeds = self.word_embeddings(sentence)
        embeds_packed = pack_padded_sequence(embeds, ques_cont_lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(embeds_packed)
        lstm_out, output_lengths = pad_packed_sequence(lstm_out, batch_first=True)
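        # Project the LSTM outputs to queries, keys and values: (batch, 1 head, seq_len, hidden_dim).
        q = self.hq(lstm_out).view(batch_size, -1, 1, self.hidden_dim).transpose(1, 2)
        k = self.hk(lstm_out).view(batch_size, -1, 1, self.hidden_dim).transpose(1, 2)
        v = self.hv(lstm_out).view(batch_size, -1, 1, self.hidden_dim).transpose(1, 2)
        # Scaled dot-product scores: (batch, 1, seq_len, seq_len).
        scores = torch.matmul(q, k.transpose(-1, -2)) / np.sqrt(self.hidden_dim)
        # Broadcast the per-token mask over the query axis: attn_mask[b, i, j] = mask[b, j].
        attn_mask = mask.unsqueeze(1)
        
        attn_mask = attn_mask.expand(batch_size, mask.shape[1], mask.shape[1])
        attn_mask = torch.as_tensor(attn_mask, dtype=torch.bool)
        attn_mask = attn_mask.unsqueeze(1)
        attn_mask = attn_mask.to(device)
        # masked_fill_ overwrites the entries where attn_mask is True, i.e. the key positions
        # whose mask value is 1 (the context tokens left after the question is zeroed out).
        scores.masked_fill_(attn_mask, -1e9)
        attn = nn.Softmax(dim=-1)(scores)
        # Weighted sum of the values, reshaped back to (batch, seq_len, hidden_dim).
        attn_lstm_out = torch.matmul(attn, v)
        attn_lstm_out = attn_lstm_out.transpose(1, 2).contiguous().view(batch_size, -1, self.hidden_dim)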
    
        tag_start = self.hidden2start(attn_lstm_out)
        # Squeeze only the last dimension so a batch of size 1 keeps its batch axis.
        tag_start = tag_start.squeeze(-1)
        masked_tag_start = mask * tag_start + (1 - mask) * -1e30
        tag_scores_start = F.log_softmax(masked_tag_start, dim=1)
        tag_end = self.hidden2end(attn_lstm_out)
        tag_end = tag_end.squeeze(-1)
        masked_tag_end = mask * tag_end + (1 - mask) * -1e30
        tag_scores_end = F.log_softmax(masked_tag_end, dim=1)

        return tag_scores_start, tag_scores_end
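
`masked_fill_` overwrites the positions where its boolean mask is `True`. A common convention in Transformer-style attention is therefore to pass a mask that is `True` at padded key positions, so that real tokens keep their scores and padding receives no weight. The plain-Python sketch below illustrates that convention for a single query row; it is an assumption about one reasonable masking scheme, not a description of the class above, and all values are made up.

```python
import math

def masked_attention_weights(scores, key_is_padding, fill=-1e9):
    """Softmax over keys after overwriting padded key positions with a large negative score."""
    filled = [fill if pad else s for s, pad in zip(scores, key_is_padding)]
    top = max(filled)
    exps = [math.exp(x - top) for x in filled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical row: three real tokens followed by two padding slots (id 0).
token_ids = [17, 4, 23, 0, 0]
key_is_padding = [tok == 0 for tok in token_ids]
scores = [0.3, 1.2, -0.5, 2.0, 2.0]

attn_weights = masked_attention_weights(scores, key_is_padding)
print([round(w, 3) for w in attn_weights])
```

Even though the padded slots had the highest raw scores, they end up with zero weight, and the remaining weights still sum to 1.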

In [None]:
def train(model, train_loader, val_loader, num_epochs = 12, device=None): 
    model.to(device)
    criterion = nn.NLLLoss(reduction='sum')
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
    
    train_loss=[]
    val_loss = []
   
    for epoch in range(num_epochs):
      model.train()
      epoch_loss = 0
      for i, (train_context_ids, train_question_ids, train_ques_cont_ids, train_cont_ques_ids, train_labels_start, train_labels_end, context_lens, question_lens, ques_cont_lens, cont_ques_lens) in enumerate(train_loader):
          input_tensor = train_ques_cont_ids.to(device)
          c_mask = torch.zeros_like(input_tensor) != input_tensor
          # The labels are already tensors; cast them instead of re-wrapping with torch.tensor.
          train_labels_start = train_labels_start.long().to(device)
          train_labels_end = train_labels_end.long().to(device)
          optimizer.zero_grad()
          tag_scores_start, tag_scores_end = model(input_tensor, question_lens, context_lens, ques_cont_lens, c_mask)
          question_lens = torch.LongTensor(question_lens).to(device)
          loss = criterion(tag_scores_start, train_labels_start+question_lens) + criterion(tag_scores_end, train_labels_end + question_lens)
          loss.backward()
          optimizer.step()
          epoch_loss += loss.item()
      epoch_loss = round(epoch_loss/len(train_data), 2)
      train_loss.append(epoch_loss)

      with torch.no_grad():
            model.eval()
            val_epoch_loss = 0
            for i, (val_context_ids, val_question_ids, val_ques_cont_ids, val_cont_ques_ids, val_labels_start, val_labels_end, val_context_lens, val_question_lens, val_ques_cont_lens, val_cont_ques_lens) in enumerate(val_loader):
                input_tensor = val_ques_cont_ids.to(device)
                c_mask = torch.zeros_like(input_tensor) != input_tensor
                val_labels_start = val_labels_start.long().to(device)
                val_labels_end = val_labels_end.long().to(device)
                tag_scores_start, tag_scores_end = model(input_tensor, val_question_lens, val_context_lens, val_ques_cont_lens, c_mask)
                val_question_lens = torch.LongTensor(val_question_lens).to(device)
                loss = criterion(tag_scores_start, val_labels_start+val_question_lens) + criterion(tag_scores_end, val_labels_end+val_question_lens)
                val_epoch_loss += loss.item()

            val_epoch_loss = round(val_epoch_loss/len(val_data),2)
            val_loss.append(val_epoch_loss)
      
      print("epoch {}: Training Loss- {:.2f}  Val loss - {:.2f}".format(epoch, epoch_loss, val_epoch_loss))
    return model, train_loss, val_loss

In [None]:
num_epochs = 10
model2 = LSTM_attention()
model2, train_loss, val_loss = train(model2, train_loader, val_loader, num_epochs, device)



epoch 0: Training Loss- 8.16  Val loss - 7.84
epoch 1: Training Loss- 7.08  Val loss - 7.42
epoch 2: Training Loss- 6.34  Val loss - 7.46
epoch 3: Training Loss- 5.62  Val loss - 7.85
epoch 4: Training Loss- 4.86  Val loss - 8.41
epoch 5: Training Loss- 4.10  Val loss - 9.46
epoch 6: Training Loss- 3.38  Val loss - 10.72
epoch 7: Training Loss- 2.72  Val loss - 12.20
epoch 8: Training Loss- 2.19  Val loss - 14.45
epoch 9: Training Loss- 1.77  Val loss - 15.84


In [None]:
start_list, end_list = test(model2, test_loader)    
evaluate(start_list, end_list, test_context_tokenized, test_references, test_ids)



Test loss: 16.16
{'exact_match': 5.439700963997639, 'f1': 14.797131155046463}


<font color='blue'> In the above cells, we train the **LSTM (with attention) model** on the training set and evaluate it on the test set. In this experiment, **we added attention on top of LSTM outputs**. 
On evaluating the model on the test set, we get the following scores:<br> 
    **exact_match:** 5.439700963997639<br> 
    **f1:** 14.797131155046463<br>
<font color='blue'> 
Both exact_match and f1 show that adding attention on top of the LSTM outputs improves performance over the LSTM without attention from Solution 3.2 (exact_match from 4.64 to 5.44, f1 from 13.60 to 14.80).

#### $\color{blue}{\text{Attention on top of BiLSTM outputs }}$

In [None]:
class BiLSTM_attention(nn.Module):

    def __init__(self, vocab_size = len(vocab), embedding_dim = 256, hidden_dim = 256, tagset_size = 2):
        super(BiLSTM_attention, self).__init__()

        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.tagset_size = tagset_size
        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True, bidirectional =True)
        
        self.hq = nn.Linear(self.hidden_dim*2, self.hidden_dim*2)
        self.hv = nn.Linear(self.hidden_dim*2, self.hidden_dim*2)
        self.hk = nn.Linear(self.hidden_dim*2, self.hidden_dim*2)
        
        self.hidden2start = nn.Linear(self.hidden_dim*2, 1)
        self.hidden2end = nn.Linear(self.hidden_dim*2, 1)
        self.softmax = nn.Softmax(dim=1)
        self.scale = 1. / math.sqrt(2*hidden_dim)

    def forward(self, sentence, ques_lens, context_lens, ques_cont_lens, mask):
        mask = mask.type(torch.float32)
        batch_size = sentence.shape[0]
        for i in range(len(ques_lens)):
            mask[i][:ques_lens[i]] = 0
        embeds = self.word_embeddings(sentence)
        embeds_packed = pack_padded_sequence(embeds, ques_cont_lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(embeds_packed)
        lstm_out, output_lengths = pad_packed_sequence(lstm_out, batch_first=True)
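        # Project the BiLSTM outputs to queries, keys and values: (batch, 1 head, seq_len, 2 * hidden_dim).
        q = self.hq(lstm_out).view(batch_size, -1, 1, self.hidden_dim*2).transpose(1, 2)
        k = self.hk(lstm_out).view(batch_size, -1, 1, self.hidden_dim*2).transpose(1, 2)
        v = self.hv(lstm_out).view(batch_size, -1, 1, self.hidden_dim*2).transpose(1, 2)
        
        # Scaled dot-product scores: (batch, 1, seq_len, seq_len).
        scores = torch.matmul(q, k.transpose(-1, -2)) / np.sqrt(self.hidden_dim*2)
        # Broadcast the per-token mask over the query axis: attn_mask[b, i, j] = mask[b, j].
        attn_mask = mask.unsqueeze(1)
        
        attn_mask = attn_mask.expand(batch_size, mask.shape[1], mask.shape[1])
        attn_mask = torch.as_tensor(attn_mask, dtype=torch.bool)
        attn_mask = attn_mask.unsqueeze(1)
        attn_mask = attn_mask.to(device)
        # masked_fill_ overwrites the entries where attn_mask is True, i.e. the key positions
        # whose mask value is 1 (the context tokens left after the question is zeroed out).
        scores.masked_fill_(attn_mask, -1e9)
        attn = nn.Softmax(dim=-1)(scores)
        # Weighted sum of the values, reshaped back to (batch, seq_len, 2 * hidden_dim).
        attn_lstm_out = torch.matmul(attn, v)
        attn_lstm_out = attn_lstm_out.transpose(1, 2).contiguous().view(batch_size, -1, self.hidden_dim*2)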
    
    
        tag_start = self.hidden2start(attn_lstm_out)
        # Squeeze only the last dimension so a batch of size 1 keeps its batch axis.
        tag_start = tag_start.squeeze(-1)
        masked_tag_start = mask * tag_start + (1 - mask) * -1e30
        tag_scores_start = F.log_softmax(masked_tag_start, dim=1)
        tag_end = self.hidden2end(attn_lstm_out)
        tag_end = tag_end.squeeze(-1)
        masked_tag_end = mask * tag_end + (1 - mask) * -1e30
        tag_scores_end = F.log_softmax(masked_tag_end, dim=1)

        return tag_scores_start, tag_scores_end

In [None]:
num_epochs = 10
model2 = BiLSTM_attention()
model2, train_loss, val_loss = train(model2, train_loader, val_loader, num_epochs, device)



epoch 0: Training Loss- 7.50  Val loss - 7.00
epoch 1: Training Loss- 6.00  Val loss - 6.53
epoch 2: Training Loss- 4.95  Val loss - 6.56
epoch 3: Training Loss- 3.92  Val loss - 7.01
epoch 4: Training Loss- 2.94  Val loss - 7.98
epoch 5: Training Loss- 2.11  Val loss - 9.48
epoch 6: Training Loss- 1.55  Val loss - 10.51
epoch 7: Training Loss- 1.18  Val loss - 12.17
epoch 8: Training Loss- 0.95  Val loss - 12.97
epoch 9: Training Loss- 0.81  Val loss - 13.15


In [None]:
start_list, end_list = test(model2, test_loader) 
evaluate(start_list, end_list, test_context_tokenized, test_references, test_ids)



Test loss: 13.44
{'exact_match': 9.669486523706473, 'f1': 22.286058898354636}


<font color='blue'> In the above cells, we train the **BiLSTM (with attention) model** on the training set and evaluate it on the test set. In this experiment, **we add attention on top of the BiLSTM outputs**. 
Evaluating the model on the test set gives the following scores:<br> 
    **exact_match:** 9.669486523706473<br> 
    **f1:** 22.286058898354636<br>
<font color='blue'> 
The exact_match and f1 scores suggest that adding attention on top of the BiLSTM outputs significantly improves performance compared with the BiLSTM (without attention) model in Solution 3.2.
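
To illustrate what "attention on top of the BiLSTM outputs" computes, here is a minimal sketch of single-head scaled dot-product attention with a key mask. It is written in plain Python rather than PyTorch so that it runs on its own; the function names `softmax` and `masked_attention` and the toy input are assumptions made for illustration, not code from the notebook. The mask convention follows `masked_fill_`: `True` marks a key that is blocked.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(q, k, v, blocked):
    """Single-head scaled dot-product attention.

    q, k, v: lists of equal-length vectors, one per token.
    blocked: list of bools, True where a key must not be attended to.
    """
    d = len(q[0])
    out, weights = [], []
    for qi in q:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Blocked keys get a very negative score, so softmax sends them to ~0.
        scores = [-1e9 if b else s for s, b in zip(scores, blocked)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out, weights

# Toy input (hypothetical): three tokens, the last one treated as blocked.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
blocked = [False, False, True]
out, weights = masked_attention(x, x, x, blocked)
for w in weights:
    print([round(p, 3) for p in w])
```

Each printed row is the attention distribution of one query token. The blocked key receives essentially zero weight, and the remaining weights sum to one.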

**Problem 4.2** *(10 points)* On top of the attention layer, let's add another layer of (bi-directional) LSTM. So this will look like a *sandwich* where the LSTM is bread and the attention is ham. How does it affect the accuracy? Explain why you think this happens. 

### $\color{blue}{\text{Solution 4.2}}$
<font color='blue'> Now, on top of the attention layer, we added another layer of (bi-directional) LSTM.


In [None]:
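class BiLSTM_attention_BiLSTM(nn.Module):
    """BiLSTM -> single-head self-attention -> BiLSTM, with start/end pointer heads."""

    def __init__(self, vocab_size=len(vocab), embedding_dim=128, hidden_dim=128, tagset_size=2):
        super(BiLSTM_attention_BiLSTM, self).__init__()

        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.tagset_size = tagset_size
        self.word_embeddings = nn.Embedding(self.vocab_size, self.embedding_dim)

        # First BiLSTM: embedding_dim -> 2*hidden_dim (forward and backward states concatenated).
        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_dim, batch_first=True, bidirectional=True)
        # Second BiLSTM on the attention output: 2*hidden_dim -> 4*hidden_dim.
        self.lstm1 = nn.LSTM(self.hidden_dim*2, self.hidden_dim*2, batch_first=True, bidirectional=True)
        # Query, value and key projections for the attention layer.
        self.hq = nn.Linear(self.hidden_dim*2, self.hidden_dim*2)
        self.hv = nn.Linear(self.hidden_dim*2, self.hidden_dim*2)
        self.hk = nn.Linear(self.hidden_dim*2, self.hidden_dim*2)

        # Pointer heads: one logit per token for the answer start and the answer end.
        self.hidden2start = nn.Linear(self.hidden_dim*4, 1)
        self.hidden2end = nn.Linear(self.hidden_dim*4, 1)
        # The two attributes below are not used in forward(), which applies F.log_softmax
        # and recomputes the scaling factor inline.
        self.softmax = nn.Softmax(dim=1)
        self.scale = 1. / math.sqrt(2*hidden_dim)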

    def forward(self, sentence, ques_lens, context_lens, ques_cont_lens, mask):
        mask = mask.type(torch.float32)
        batch_size = sentence.shape[0]
        # Zero the question tokens so that only the remaining positions are answer candidates.
        for i in range(len(ques_lens)):
            mask[i][:ques_lens[i]] = 0

        # First BiLSTM over the (question + context) sequence.
        embeds = self.word_embeddings(sentence)
        embeds_packed = pack_padded_sequence(embeds, ques_cont_lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(embeds_packed)
        lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True)  # (batch, seq, 2*hidden)

        # Single-head projections, reshaped to (batch, 1, seq, 2*hidden).
        q = self.hq(lstm_out).view(batch_size, -1, 1, self.hidden_dim*2).transpose(1, 2)
        k = self.hk(lstm_out).view(batch_size, -1, 1, self.hidden_dim*2).transpose(1, 2)
        v = self.hv(lstm_out).view(batch_size, -1, 1, self.hidden_dim*2).transpose(1, 2)

        # Scaled dot-product scores: (batch, 1, seq, seq).
        scores = torch.matmul(q, k.transpose(-1, -2)) / np.sqrt(self.hidden_dim*2)

        # masked_fill_ blocks the keys where attn_mask is True, i.e. where mask == 1,
        # so every token attends only to positions where mask == 0.
        attn_mask = mask.bool().unsqueeze(1).unsqueeze(2).to(scores.device)  # (batch, 1, 1, seq)
        scores.masked_fill_(attn_mask, -1e9)
        attn = F.softmax(scores, dim=-1)
        attn_lstm_out = torch.matmul(attn, v)
        attn_lstm_out = attn_lstm_out.transpose(1, 2).contiguous().view(batch_size, -1, self.hidden_dim*2)

        # Second BiLSTM over the attention output: (batch, seq, 4*hidden).
        attn_lstm_out_packed = pack_padded_sequence(attn_lstm_out, ques_cont_lens, batch_first=True, enforce_sorted=False)
        lstm_out1, _ = self.lstm1(attn_lstm_out_packed)
        lstm_out1, _ = pad_packed_sequence(lstm_out1, batch_first=True)

        # Start/end logits; squeeze(-1) keeps the batch dimension even when batch_size == 1.
        tag_start = self.hidden2start(lstm_out1).squeeze(-1)
        masked_tag_start = mask * tag_start + (1 - mask) * -1e30
        tag_scores_start = F.log_softmax(masked_tag_start, dim=1)
        tag_end = self.hidden2end(lstm_out1).squeeze(-1)
        masked_tag_end = mask * tag_end + (1 - mask) * -1e30
        tag_scores_end = F.log_softmax(masked_tag_end, dim=1)

        return tag_scores_start, tag_scores_end

In [None]:
num_epochs = 10
model2 = BiLSTM_attention_BiLSTM()
model2, train_loss, val_loss = train(model2, train_loader, val_loader, num_epochs, device)



epoch 0: Training Loss- 7.78  Val loss - 7.29
epoch 1: Training Loss- 6.44  Val loss - 6.71
epoch 2: Training Loss- 5.63  Val loss - 6.54
epoch 3: Training Loss- 4.89  Val loss - 6.57
epoch 4: Training Loss- 4.16  Val loss - 6.78
epoch 5: Training Loss- 3.44  Val loss - 7.34
epoch 6: Training Loss- 2.77  Val loss - 8.16
epoch 7: Training Loss- 2.21  Val loss - 8.95
epoch 8: Training Loss- 1.77  Val loss - 10.23
epoch 9: Training Loss- 1.45  Val loss - 11.50


In [None]:
start_list, end_list = test(model2, test_loader) 
evaluate(start_list, end_list, test_context_tokenized, test_references, test_ids)



Test loss: 11.49
{'exact_match': 11.863072988392682, 'f1': 26.28462934302962}


<font color='blue'> In the above cells, we train the **BiLSTM attention BiLSTM model** on the training set and evaluate it on the test set. In this experiment, **we add attention on top of the BiLSTM outputs, followed by another BiLSTM layer**. 
Evaluating the model on the test set gives the following scores:<br> 
    **exact_match:** 11.863072988392682<br> 
    **f1:** 26.28462934302962<br>
<font color='blue'> 
The exact_match and f1 scores suggest that adding attention on top of the BiLSTM outputs, followed by another BiLSTM layer, significantly improves the performance of the model: relative to Solution 4.1, exact_match rises from 9.67 to 11.86 and f1 from 22.29 to 26.28. This model outperforms all the models trained so far.
Adding attention on top of the first BiLSTM produces attended outputs that focus more on the parts of the input that are important for answering the question. Passing this attended representation through a second BiLSTM layer yields an answer prediction that is informed by the attention output. Since an attended input helps in generating a more accurate answer, the performance of the model improves significantly compared to 4.1.
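
As a rough check on this comparison, the gains can be computed directly from the two evaluation outputs and training logs printed above. The sketch below is plain Python and not part of the original notebook; the dictionary labels and the helper name `gain` are made up for illustration. It also reports the epoch with the lowest validation loss, since both logs show validation loss rising after the first few epochs while training loss keeps falling.

```python
# Test-set scores and per-epoch validation losses copied from the outputs above.
scores = {
    "attention (4.1)": {"exact_match": 9.669486523706473, "f1": 22.286058898354636},
    "sandwich (4.2)": {"exact_match": 11.863072988392682, "f1": 26.28462934302962},
}
val_loss = {
    "attention (4.1)": [7.00, 6.53, 6.56, 7.01, 7.98, 9.48, 10.51, 12.17, 12.97, 13.15],
    "sandwich (4.2)": [7.29, 6.71, 6.54, 6.57, 6.78, 7.34, 8.16, 8.95, 10.23, 11.50],
}

def gain(before, after):
    """Return the absolute gain in points and the relative gain in percent."""
    diff = after - before
    return diff, 100.0 * diff / before

before, after = scores["attention (4.1)"], scores["sandwich (4.2)"]
gains = {metric: gain(before[metric], after[metric]) for metric in before}
for metric, (points, percent) in gains.items():
    print("{}: +{:.2f} points ({:.1f}% relative)".format(metric, points, percent))

# Epoch with the lowest validation loss for each model (0-indexed, as in the logs).
best_epoch = {name: losses.index(min(losses)) for name, losses in val_loss.items()}
print(best_epoch)
```

Both models reach their lowest validation loss within the first three epochs, so the test scores above, obtained after ten epochs of training, may understate what each architecture could achieve with early stopping.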

## 5. Attention is All You Need (bonus)

**Problem 5.1 (bonus)** *(20 points)*  Implement a full Transformer encoder to entirely replace LSTMs. You are allowed to copy and paste code from [*Annotated Transformer*](https://nlp.seas.harvard.edu/2018/04/03/attention.html) (but nowhere else). Report the accuracy and explain what seems to be happening with the attention-only model compared to the LSTM+Attention model(s). 

**Problem 5.2 (bonus)** *(10 points)* Replace Transformer's sinusoidal position encoding with a fixed-length (of 256) position embedding. That is, you will create a 256-by-$d$ trainable parameter matrix for the position encoding that replaces the variable-length sinusoidal encoding. What is the clear disadvantage of this approach? Report the accuracy and compare it with 5.1. Note that this also has a clear advantage, as we will see in our future lecture on Pretrained Language Model, and more specifically, BERT (Devlin et al., 2018).
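
To make the two encodings in Problem 5.2 concrete, here is a minimal plain-Python sketch. It is not a PyTorch solution to the problem. `D_MODEL = 8`, the helper names, and the randomly initialised list standing in for the trainable 256-by-$d$ matrix are all assumptions made for illustration (in PyTorch such a table would typically be an `nn.Embedding(256, d)`). The sinusoidal formula is the one from the Transformer paper.

```python
import math
import random

MAX_LEN = 256  # fixed table length from the problem statement
D_MODEL = 8    # hypothetical small model dimension, for illustration only

def sinusoidal_encoding(pos, d_model=D_MODEL):
    """Sinusoidal position encoding, computed on the fly for any position."""
    enc = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc[:d_model]

# Stand-in for the trainable 256-by-d matrix: one row per position.
random.seed(0)
position_table = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(MAX_LEN)]

def learned_encoding(pos, table=position_table):
    """Look up the row for `pos`; positions outside the table have no row."""
    if not 0 <= pos < len(table):
        raise IndexError("position {} has no row in a {}-row table".format(pos, len(table)))
    return table[pos]

print(len(sinusoidal_encoding(300)))   # the formula is defined for position 300
try:
    learned_encoding(300)
except IndexError as err:
    print("lookup failed:", err)       # the fixed table stops at position 255
```

One property this makes visible is that the sinusoidal encoding is a formula over the position index, whereas the lookup table only covers the 256 positions it was created with.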