# Assignment #2

## Overview 
* In this assignment, we will perform extractive Question Answering (QA), where we find the <b>start and end positions of the answer span</b> in a context (a short paragraph from a Wikipedia article) for a given question. 

* For example, for the question <b>"What causes precipitation to fall?"</b> and context <b>"In meteorology, precipitation is any product
of the condensation of atmospheric water vapor
that falls under $\color{red}{\text{gravity}}$"</b>, a model should output the start position and end position of <b>gravity</b>, which are both 17 under whitespace tokenization (counting from 1).

* You will implement 1) an <b>LSTM</b>-based and 2) a <b>Transformer</b>-based QA model that outputs the start and end positions of the answer span given the input question and context. More details are in the code blocks below.

* We have already provided the code for the whole pipeline, including preprocessing, the training loop, and evaluation. All you need to do is implement the two models by filling in the blanks in each model class we specified and train the models on the SQuAD dataset.

* Report F1 and Exact Match (EM) score for each model.
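
The token position in the precipitation example above can be checked with a few lines of Python. This is only an illustrative sketch: it splits on whitespace, whereas the actual pipeline uses the BERT tokenizer, so the real model positions will differ. Note that the value 17 corresponds to counting from 1; with Python's 0-based indexing the same token sits at index 16.

```python
# Whitespace-tokenize the example context and locate the answer token.
context = (
    "In meteorology, precipitation is any product of the condensation "
    "of atmospheric water vapor that falls under gravity"
)
tokens = context.split()

zero_based = tokens.index("gravity")  # position counting from 0
one_based = zero_based + 1            # position counting from 1

print(len(tokens), zero_based, one_based)
```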

## Reference
- Dataset from [SQuAD](https://www.aclweb.org/anthology/D16-1264.pdf).
- Pretrained model from [BERT](https://arxiv.org/abs/1908.08962).
- Code from [transformers](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb).



## Install libraries - transformers, datasets

In [1]:
!pip install datasets transformers



## Import Libraries

In [2]:
from datasets import load_dataset, load_metric
import transformers
from transformers import (AutoTokenizer, 
                          default_data_collator, 
                          AdamW, get_linear_schedule_with_warmup,
                          BertPreTrainedModel,
                          BertModel,
)
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from torch.utils.data.dataloader import DataLoader
from tqdm.auto import tqdm
import collections
import numpy as np

We will use the BERT-mini model (https://arxiv.org/abs/1908.08962) in this assignment for faster training and lower GPU memory use.

In [3]:
model_checkpoint = "google/bert_uncased_L-4_H-256_A-4"

Hyperparameters (**Do not modify**)

In [4]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed
batch_size = 12
device = "cuda" # We will use gpu device provided by colab. Please set runtime to use the gpu accelerator in colab.
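
To make the roles of `max_length` and `doc_stride` concrete, here is a minimal sketch of how a long context could be cut into overlapping windows. The helper `split_into_windows`, the question length of 12 tokens, and the context length of 600 tokens are all hypothetical; the real splitting is done inside the tokenizer and may differ in detail.

```python
def split_into_windows(num_context_tokens, window_size, overlap):
    """Return (start, end) token ranges covering the context, end exclusive."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # Step forward so that consecutive windows share `overlap` tokens.
        start += window_size - overlap
    return windows


max_length = 384
doc_stride = 128
question_tokens = 12  # hypothetical question length
special_tokens = 3    # [CLS], [SEP], [SEP]
window_size = max_length - question_tokens - special_tokens

windows = split_into_windows(600, window_size, doc_stride)
print(windows)
```

Each window becomes one feature, which is why a single example can yield several features.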

## Loading the dataset

In the cells below, the dataset is automatically loaded and preprocessed.
**Please don't modify them.**

In [5]:
# Load SQuAD v1.1 dataset using datasets library
datasets = load_dataset("squad")

# Loading Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



In [6]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
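
The labeling logic in `prepare_train_features` boils down to mapping a character span onto token indices through the offset mapping. The following is a simplified sketch of that idea on a toy context; `char_span_to_token_span` is a hypothetical helper, the offsets are hand-written for whitespace tokens, and returning `None` stands in for labeling the feature with the CLS index.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    offsets: list of (char_start, char_end) pairs, one per context token.
    Returns None when the answer is not fully covered by these tokens.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that starts at or before the answer start.
    start_tok = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the answer end.
    end_tok = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_tok, end_tok


context = "falls under gravity"
offsets = [(0, 5), (6, 11), (12, 19)]  # character ranges of the three words

answer = "gravity"
start_char = context.index(answer)
end_char = start_char + len(answer)
print(char_span_to_token_span(offsets, start_char, end_char))
```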

In [7]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)



In [8]:
train_dataset = tokenized_datasets["train"]

train_loader = DataLoader(train_dataset, 
                          batch_size=batch_size, 
                          collate_fn=default_data_collator,
                          shuffle=True,)

# Training

In [9]:
def train(model):
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=3e-5)
    # scheduler = get_linear_schedule_with_warmup(optimizer, 0, len(train_loader))
    scheduler = get_linear_schedule_with_warmup(optimizer, 0, len(train_loader) * 2) # Fixed 20210613
    global_step = 0

    for epoch in range(2):
        model.train()
        for batch in tqdm(train_loader):
            batch = {k:v.to(device) for k, v in batch.items()}

            outputs = model(input_ids=batch["input_ids"],
                            attention_mask=batch["attention_mask"],
                            token_type_ids=batch["token_type_ids"],
                            start_positions=batch["start_positions"],
                            end_positions=batch["end_positions"])

            loss = outputs[0]
            # Calculate gradients
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update model parameters
            optimizer.step()
            scheduler.step()
            model.zero_grad()
            global_step += 1
            
            if global_step % 1000 == 0:
                print (f"Loss:{loss.item()}")
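
The scheduler fix noted in `train` matters because the linear schedule reaches zero at its final step. A rough sketch of the multiplier used by a linear schedule with warmup is shown below; `linear_lr` is a hypothetical stand-in rather than the library function, and 7377 is taken as the number of batches per epoch in the run recorded in this notebook.

```python
import math


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate after `step` updates under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 7377  # assumed from the recorded run
base_lr = 3e-5

# Schedule sized for one epoch: the rate is already zero when epoch 2 starts.
lr_old = linear_lr(steps_per_epoch, base_lr, steps_per_epoch)
# Schedule sized for two epochs: half of the base rate remains at that point.
lr_new = linear_lr(steps_per_epoch, base_lr, steps_per_epoch * 2)
print(lr_old, lr_new)
```

With the shorter schedule, the whole second epoch would run at a learning rate of zero, so sizing the schedule to both epochs is what allows the second epoch to contribute.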

# Evaluation

In [10]:
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [11]:
valid_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)



In [12]:
valid_loader = DataLoader(valid_features.remove_columns("offset_mapping"), 
                          batch_size=batch_size, 
                          collate_fn=default_data_collator,
                          shuffle=False,)

In [13]:
# dev_dataset for extracting real answer
def evaluate(model):
    model.to(device)
    all_start_logits = []
    all_end_logits = []
    model.eval()
    for batch in tqdm(valid_loader):
        batch = {k:v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(input_ids=batch["input_ids"],
                            attention_mask=batch["attention_mask"],
                            token_type_ids=batch["token_type_ids"],)
        start_logits, end_logits = outputs[0], outputs[1]
        for start_logit, end_logit in zip(start_logits, end_logits):
            all_start_logits.append(start_logit.cpu().numpy())
            all_end_logits.append(end_logit.cpu().numpy())
    return (all_start_logits, all_end_logits)

# Postprocess answers for final score

In [14]:
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        predictions[example["id"]] = best_answer["text"]

    return predictions
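
The core of `postprocess_qa_predictions` is a search over the top start and end candidates for the highest-scoring valid span. A stripped-down sketch of that search for a single feature follows; `best_span` is a hypothetical helper that leaves out the offset-mapping check restricting answers to context tokens.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring valid span, or None."""

    def top(logits):
        # Indices of the n_best largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for s in top(start_logits):
        for e in top(end_logits):
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [0.2, 0.1, 3.0, 0.4]
print(best_span(start_logits, end_logits))
```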

In [15]:
def calculate_score(raw_predictions):
    final_predictions = postprocess_qa_predictions(datasets["validation"], valid_features, raw_predictions)
    metric = load_metric("squad")
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
    references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
    results = metric.compute(predictions=formatted_predictions, references=references)
    print(results)
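
For intuition about what the two reported numbers mean, here is a simplified re-implementation of the exact match and F1 computations for one prediction and one reference answer. It is a sketch of the usual SQuAD-style scoring, not the metric code itself: the actual metric also takes the maximum over all reference answers of a question and reports averages scaled to 0-100.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the force of gravity", "gravity"))
print(f1_score("the force of gravity", "gravity"))
```

A prediction that contains the answer plus extra words gets no exact-match credit but partial F1 credit, which is why F1 is always at least as high as exact match.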

## Model Specification
- You need to implement LSTM and Transformer based QA models.
- Following BERT, we concatenate an input question and context. For instance,
$$(\text{[CLS]}, q_1, \ldots, q_n, \text{[SEP]}, c_1, \ldots, c_m, \text{[SEP]} )$$ where $q_t, c_{t^\prime}$ denote a token of the question and context respectively, and [CLS], [SEP] are special tokens. 
- vocabulary size (number of rows in the word embedding table): 30522

## Arguments of forward function
The following are the arguments of the forward function:

- input_ids: Two-dimensional long Tensor of indices of the concatenated input question and context tokens ($q_t$ or $c_{t^\prime}$). Each token is mapped to a unique number (index).

- token_type_ids: Two-dimensional long Tensor consisting of 0 or 1, with the same size as input_ids. It indicates whether each token belongs to the question or the context. 

- attention_mask: Two-dimensional long Tensor consisting of 0 or 1, with the same size as input_ids. Since the sequences have different lengths, we add zero-padding tokens to construct a fixed-size tensor. 0 indicates a zero-padding token, which the model should not attend to, and 1 indicates any other token.  

- start_positions: One-dimensional long Tensor which indicates the start position of the answer for each example.

- end_positions: One-dimensional long Tensor which indicates the end position of the answer for each example.
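
The three input tensors described above can be illustrated for a single example with plain Python lists. This is a sketch under assumptions: `build_inputs` is a hypothetical helper, the word ids are made up, and the special-token ids 101 ([CLS]), 102 ([SEP]), and 0 ([PAD]) are assumed to match the uncased BERT vocabulary. In the notebook, the tokenizer produces all of this.

```python
def build_inputs(question_ids, context_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Build padded input_ids, token_type_ids and attention_mask for one example."""
    ids = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    # Segment 0: [CLS] + question + [SEP]; segment 1: context + [SEP].
    type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    if pad < 0:
        raise ValueError("sequence is longer than max_length")
    return ids + [pad_id] * pad, type_ids + [0] * pad, mask + [0] * pad


# Made-up token ids for a two-token question and a three-token context.
input_ids, token_type_ids, attention_mask = build_inputs([2054, 5320], [2000, 3000, 4000], 10)
print(input_ids)
print(token_type_ids)
print(attention_mask)
```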

## Specification of forward function
- The forward function receives input_ids, token_type_ids, attention_mask, start_positions, and end_positions as arguments. start_positions and end_positions are not given at test time, so set their default value to None.

- You need to return loss, start_logits, and end_logits when start_positions and end_positions are given. The start_logits and end_logits denote the unnormalized scores (before softmax) for the start and end positions.

- If start_positions and end_positions are not given, return only start_logits and end_logits.

- See the forward function of BERTModel for reference.
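
As in the provided `BERTModel`, the loss is the average of two cross-entropy terms, one over start positions and one over end positions. The sketch below computes it for a single example in pure Python; `cross_entropy` and `span_loss` are hypothetical helpers, and the batch averaging and ignored-index handling of `CrossEntropyLoss` are left out.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Average of the start and end cross-entropy terms."""
    return (cross_entropy(start_logits, start_pos) + cross_entropy(end_logits, end_pos)) / 2


# A model that scores all 384 positions equally has a loss of ln(384).
uniform = [0.0] * 384
print(span_loss(uniform, uniform, 5, 7))
```

The uniform case gives a loss of about 5.95, a handy baseline when reading the printed training losses.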


# **Problem 0. Sample Pipeline using pre-trained BERT model**

Please run the sample pipeline below and report its exact match / F1 score.

In [16]:
# BERT model (from pre-trained checkpoint)
class BERTModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config, add_pooling_layer=False)
        self.qa_outputs = nn.Linear(config.hidden_size, 2)

        self.init_weights()

    def forward(
        self,
        input_ids,
        attention_mask,
        token_type_ids,
        start_positions=None,
        end_positions=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        sequence_output = outputs[0]
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        output = (start_logits, end_logits)
        return ((total_loss,) + output) if total_loss is not None else output

In [17]:
model = BERTModel.from_pretrained(model_checkpoint)
train(model)
raw_predictions = evaluate(model)
calculate_score(raw_predictions)

Some weights of the model checkpoint at google/bert_uncased_L-4_H-256_A-4 were not used when initializing BERTModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BERTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BERTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BERTModel were not initialized from t


Loss:2.9935789108276367
Loss:3.166799545288086
Loss:1.9664857387542725
Loss:2.601724624633789
Loss:2.7610039710998535
Loss:1.7573943138122559
Loss:1.1449921131134033




Loss:1.8035937547683716
Loss:1.4644100666046143
Loss:2.5865137577056885
Loss:1.9616796970367432
Loss:1.6259715557098389
Loss:1.6531521081924438
Loss:1.909135341644287





Post-processing 10570 example predictions split into 10784 features.




{'exact_match': 60.0, 'f1': 70.67257316462724}


# **Problem 1. LSTM Implementation**

We provide the skeleton code for the LSTM QA model.
Fill in the init and forward functions following the descriptions in the PDF file.

Then, run the pipeline and report the exact match / f1 score.

In [18]:
# LSTM Model
class LSTMModel(nn.Module):
    def __init__(self, dim):
        super().__init__()

        self.embedding = nn.Embedding(30522, dim)

        ### Implement the model here ###
        self.lstm = torch.nn.LSTM(input_size=dim, hidden_size=dim)
        self.qa_outputs = nn.Linear(dim, 2)

    def forward(
        self, 
        input_ids, 
        attention_mask, 
        token_type_ids, 
        start_positions=None,
        end_positions=None,
    ):
        embeds = self.embedding(input_ids)
        
        ## Implement the LSTM forward and qa_outputs (linear) forward here ###
        # Pack the batch so the LSTM skips padding; lengths come from the attention mask.
        pack = torch.nn.utils.rnn.pack_padded_sequence(
            input=embeds,
            lengths=torch.sum(attention_mask, dim=1).tolist(),
            batch_first=True,
            enforce_sorted=False,
        )
        outputs = self.lstm(pack)
        # Unpack to (batch, longest sequence in the batch, dim) and score every position.
        sequence_output, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs[0], batch_first=True)
        logits = self.qa_outputs(sequence_output)
        ####

        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        output = (start_logits, end_logits)
        return ((total_loss,) + output) if total_loss is not None else output

In [19]:
hidden_dim = 768

model = LSTMModel(hidden_dim)
train(model)
raw_predictions = evaluate(model)
calculate_score(raw_predictions)


Loss:4.193328857421875
Loss:4.639159202575684
Loss:4.669092655181885
Loss:3.994865894317627
Loss:4.607333183288574
Loss:4.134253978729248
Loss:4.328581809997559




Loss:4.036929130554199
Loss:3.9858651161193848
Loss:4.279071807861328
Loss:4.3112382888793945
Loss:3.746023654937744
Loss:3.747622013092041
Loss:3.971881866455078





Post-processing 10570 example predictions split into 10784 features.




{'exact_match': 5.089877010406812, 'f1': 11.49986205468953}


# **Problem 2. Transformer Implementation**

We provide the skeleton code for the Transformer QA model.
Fill in the init and forward functions following the descriptions in the PDF file.

Then, run the pipeline and report the exact match / f1 score.

In [20]:
# Transformer Model
class TransformerModel(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()

        self.num_heads = num_heads
        self.embedding = nn.Embedding(30522, dim)

        ### Implement the model here ###
        self.encoder = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads)
        self.qa_outputs = nn.Linear(dim, 2)

    def forward(
        self, 
        input_ids, 
        attention_mask, 
        token_type_ids, 
        start_positions=None,
        end_positions=None,
    ):
        embeds = self.embedding(input_ids)
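        b, t = input_ids.size()
        # Expand the (b, t) padding mask to (b * num_heads, t, t), the 3-D shape accepted as an attention mask.
        attention_mask = attention_mask.unsqueeze(1).unsqueeze(1).expand(-1, self.num_heads, t, -1).reshape(b * self.num_heads, t, -1)
        # True marks padding positions that must not be attended to.
        attention_mask = attention_mask == 0
        ## Implement the Transformer forward and qa_outputs (linear) forward here ###
        # The encoder layer expects (t, b, dim) because batch_first is left at its default (False).
        outputs = self.encoder(
            src=torch.transpose(embeds, 0, 1),
            src_mask=attention_mask,
        )
        sequence_output = torch.transpose(outputs, 0, 1)
        logits = self.qa_outputs(sequence_output)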
        ####

        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        output = (start_logits, end_logits)
        return ((total_loss,) + output) if total_loss is not None else output

In [21]:
hidden_dim = 768
head_size = 12

model = TransformerModel(hidden_dim, head_size)
train(model)
raw_predictions = evaluate(model)
calculate_score(raw_predictions)


Loss:4.756199836730957
Loss:4.391205310821533
Loss:4.321883201599121
Loss:4.261214733123779
Loss:4.342556476593018
Loss:4.480454444885254
Loss:3.9709627628326416




Loss:4.533720016479492
Loss:4.415839672088623
Loss:4.716879367828369
Loss:4.258673667907715
Loss:3.997011423110962
Loss:4.100652694702148
Loss:4.146346092224121





Post-processing 10570 example predictions split into 10784 features.




{'exact_match': 5.089877010406812, 'f1': 10.593915244581673}


# **Problem 3. Analysis Questions**

### 3.1. Which model performed better, LSTM or Transformer? Explain why it performed better.
The results of my LSTM- and Transformer-based QA models are as follows:

|Model|exact_match|f1|
|-----|-----------|--|
|LSTM|5.089877010406812|11.49986205468953|
|Transformer|5.089877010406812|10.593915244581673|

Surprisingly, the LSTM and the Transformer got exactly the same *exact_match* score (most likely a coincidence).   
The LSTM got the higher f1 score, while the two models have a similar number of parameters.

In [22]:
print('# of parameters in LSTM: '+str(sum(p.numel() for p in LSTMModel(hidden_dim).parameters())))
print('# of parameters in Transformer: '+str(sum(p.numel() for p in TransformerModel(hidden_dim, head_size).parameters())))

# of parameters in LSTM: 28167170
# of parameters in Transformer: 28956418
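
The two parameter counts above can be reproduced by hand. The breakdown below is a sketch that assumes the default settings of the PyTorch layers used in this notebook (a single-layer unidirectional LSTM with two bias vectors per gate, and a feed-forward width of 2048 in the encoder layer).

```python
vocab, d, ffn = 30522, 768, 2048

embedding = vocab * d  # shared by both models
qa_head = d * 2 + 2    # linear layer producing start and end logits

# LSTM: four gates, each with input weights, hidden weights and two bias vectors.
lstm = 4 * (d * d + d * d + d + d)

# Encoder layer: Q/K/V projection plus output projection, two feed-forward layers, two LayerNorms.
attention = (3 * d * d + 3 * d) + (d * d + d)
feed_forward = (d * ffn + ffn) + (ffn * d + d)
layer_norms = 2 * (2 * d)

lstm_total = embedding + lstm + qa_head
transformer_total = embedding + attention + feed_forward + layer_norms + qa_head
print(lstm_total, transformer_total)
print(embedding / lstm_total)
```

In both models the embedding table accounts for over 80% of the parameters, so the layers that actually differ are a small part of each total.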


That said, *exact_match* is a strict all-or-nothing metric, so the *f1* score is the more informative of the two.
By that measure, the LSTM performed better than the Transformer.
This might be because my models are small. A strength of the Transformer is that its non-sequential structure scales well.
However, the original Transformer paper stacks *N* encoder/decoder layers, which suggests that it needs many layers to perform well.
Since I used only one encoder layer in the Transformer model, the LSTM was able to outperform it in this case.
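
One further factor may be worth checking, although it was not tested here: the `TransformerModel` in Problem 2 passes token embeddings straight to `nn.TransformerEncoderLayer`, which adds no positional information by itself, so the model has no direct signal about token order, whereas the LSTM reads tokens in sequence. As an illustration only, a sketch of the sinusoidal positional encoding from the original Transformer paper is given below; `sinusoidal_position_encoding` is a hypothetical helper and was not part of the experiments reported above.

```python
import math


def sinusoidal_position_encoding(seq_len, dim):
    """Return a seq_len x dim table; even columns use sin, odd columns use cos."""
    table = []
    for pos in range(seq_len):
        row = []
        for col in range(dim):
            # Columns 2i and 2i+1 share the frequency 1 / 10000^(2i / dim).
            angle = pos / (10000 ** (2 * (col // 2) / dim))
            row.append(math.sin(angle) if col % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_position_encoding(4, 8)
print(pe[0])
print(pe[1][:2])
```

Such a table would be added to the token embeddings before the encoder layer, giving each position a distinct signature.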

### 3.2. Which model shows faster training, LSTM or Transformer? Explain what makes the training faster.
The Transformer finished training a few minutes faster than the LSTM.
The original Transformer paper (https://arxiv.org/pdf/1706.03762.pdf) explains why:
"In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence
length $n$ is smaller than the representation dimensionality $d$ ... "
More precisely, the per-layer complexity is $O(n^2\cdot d)$ for a self-attention layer and $O(n\cdot d^2)$ for a recurrent layer.
In our case, the sequence length $n$ is at most 384 and the representation dimensionality $d$ is 768, so $n < d$. Thus, both theoretically and empirically, the Transformer trains faster than the LSTM.
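
Plugging the numbers from this assignment into the two complexity terms gives a rough feel for the gap. This is back-of-the-envelope arithmetic that ignores constant factors. The same paper also lists the number of sequential operations as $O(1)$ for self-attention versus $O(n)$ for a recurrent layer, which matters on a GPU because the recurrent steps cannot be run in parallel.

```python
n, d = 384, 768  # maximum sequence length and hidden dimension used above

self_attention_ops = n * n * d  # O(n^2 * d)
recurrent_ops = n * d * d       # O(n * d^2)
ratio = recurrent_ops / self_attention_ops  # simplifies to d / n

print(self_attention_ops, recurrent_ops, ratio)
```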

### 3.3. Why does the pre-trained model perform much better than LSTM and Transformer? Please write down your thoughts.
The pre-trained model already has prior knowledge of natural language understanding (NLU) from pre-training. This helps it build useful feature representations, so it trains and performs much better than my LSTM and Transformer models, which are trained from scratch. 

### 3.4. How to improve the performance of the QA model with pre-trained BERT-mini? Please write your ideas if any.
Possible ways to improve the model:
- A larger batch size
- Loss functions other than cross-entropy loss