# Question Answering with Transformer

This notebook performs extractive question answering by fine-tuning a pre-trained transformer model on a custom dataset. The project is based on the Week 4 programming assignment of the Sequence Models course in the Deep Learning Specialization.

## Table of Contents

- [Imports](#0)
- [Data (preparing and loading)](#1)
  - [Preprocessing](#1.1)
  - [Tokenizing and Aligning](#1.2)
- [Build model](#2)
- [Train model](#3)
- [Evaluate](#4)
- [Inference](#5)

<a name="0"></a>
## 0. Imports

In [1]:
from IPython.display import clear_output

In [2]:
!pip install datasets

clear_output()

In [3]:
!pip install transformers

clear_output()

In [4]:
import shutil


import torch
from datasets import load_from_disk
from google.colab import drive
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForQuestionAnswering
from transformers import Trainer
from transformers import TrainingArguments

In [5]:
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
!sudo apt-get install subversion

clear_output()

In [7]:
# Setup device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Device: cuda


<a name="1"></a>
## 1. Data

<a name="1.1"></a>
### 1.1 Preprocessing

In [8]:
!svn checkout https://github.com/amanchadha/coursera-deep-learning-specialization/trunk/C5%20-%20Sequence%20Models/Week%204/Question%20Answering/data

clear_output()

In [9]:
babi_dataset = load_from_disk('data/')
print(babi_dataset['train'][0])

{'story': {'answer': ['', '', 'office'], 'id': ['1', '2', '3'], 'supporting_ids': [[], [], ['1']], 'text': ['The office is north of the kitchen.', 'The garden is south of the kitchen.', 'What is north of the kitchen?'], 'type': [0, 0, 1]}}


In [10]:
babi_dataset['train'][102]

{'story': {'answer': ['', '', 'bedroom'],
  'id': ['1', '2', '3'],
  'supporting_ids': [[], [], ['2']],
  'text': ['The bedroom is west of the office.',
   'The bedroom is east of the hallway.',
   'What is east of the hallway?'],
  'type': [0, 0, 1]}}

In [11]:
type_set = set()
for story in babi_dataset['train']:
    # Collect the distinct sentence-type patterns; a set ignores duplicates on its own
    type_set.add(str(story['story']['type']))

In [12]:
type_set

{'[0, 0, 1]'}

In [13]:
flattened_babi = babi_dataset.flatten()

In [14]:
flattened_babi

DatasetDict({
    train: Dataset({
        features: ['story.answer', 'story.id', 'story.supporting_ids', 'story.text', 'story.type'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['story.answer', 'story.id', 'story.supporting_ids', 'story.text', 'story.type'],
        num_rows: 1000
    })
})

In [15]:
next(iter(flattened_babi['train']))

{'story.answer': ['', '', 'office'],
 'story.id': ['1', '2', '3'],
 'story.supporting_ids': [[], [], ['1']],
 'story.text': ['The office is north of the kitchen.',
  'The garden is south of the kitchen.',
  'What is north of the kitchen?'],
 'story.type': [0, 0, 1]}

In [16]:
def get_question_and_facts(story):
    dic = {}
    dic['question'] = story['story.text'][2]
    dic['sentences'] = ' '.join([story['story.text'][0], story['story.text'][1]])
    dic['answer'] = story['story.answer'][2]
    return dic

In [17]:
processed = flattened_babi.map(get_question_and_facts)

In [18]:
processed['train'][2]

{'story.answer': ['', '', 'bedroom'],
 'story.id': ['1', '2', '3'],
 'story.supporting_ids': [[], [], ['2']],
 'story.text': ['The garden is north of the office.',
  'The bedroom is north of the garden.',
  'What is north of the garden?'],
 'story.type': [0, 0, 1],
 'question': 'What is north of the garden?',
 'sentences': 'The garden is north of the office. The bedroom is north of the garden.',
 'answer': 'bedroom'}

In [19]:
def get_start_end_idx(story):
    # Character offsets of the first occurrence of the answer in the context
    str_idx = story['sentences'].find(story['answer'])
    end_idx = str_idx + len(story['answer'])  # exclusive end offset
    return {'str_idx': str_idx,
            'end_idx': end_idx}

In [20]:
processed = processed.map(get_start_end_idx)

In [21]:
num = 187
print(processed['test'][num])
start_idx = processed['test'][num]['str_idx']
end_idx = processed['test'][num]['end_idx']
print('answer:', processed['test'][num]['sentences'][start_idx:end_idx])

{'story.answer': ['', '', 'garden'], 'story.id': ['1', '2', '3'], 'story.supporting_ids': [[], [], ['2']], 'story.text': ['The hallway is south of the garden.', 'The garden is south of the bedroom.', 'What is south of the bedroom?'], 'story.type': [0, 0, 1], 'question': 'What is south of the bedroom?', 'sentences': 'The hallway is south of the garden. The garden is south of the bedroom.', 'answer': 'garden', 'str_idx': 28, 'end_idx': 34}
answer: garden
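
The offsets above come from `str.find`, which returns the first occurrence of the answer and `-1` when the answer is absent. In the example just shown, the first match (offset 28) falls in the first sentence even though the supporting fact is the second one. A minimal sketch of the same lookup with the missing-answer case made explicit, assuming a hypothetical helper `find_answer_span` that is not part of the original notebook:

```python
def find_answer_span(sentences, answer):
    """Return (start, end) character offsets of the first occurrence of
    `answer` in `sentences`, or None if it is empty or not found.
    The end offset is exclusive, as in get_start_end_idx."""
    if not answer:
        return None
    start = sentences.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


context = "The hallway is south of the garden. The garden is south of the bedroom."
span = find_answer_span(context, "garden")
print(span)                           # (28, 34)
print(context[span[0]:span[1]])       # garden
print(find_answer_span(context, "kitchen"))  # None
```

Returning `None` rather than `-1` keeps a missing answer from silently turning into a wrong but valid-looking slice.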


<a name="1.2"></a>
### 1.2 Tokenizing and Aligning

In [22]:
!svn checkout https://github.com/amanchadha/coursera-deep-learning-specialization/trunk/C5%20-%20Sequence%20Models/Week%204/Question%20Answering/tokenizer

clear_output()

In [23]:
shutil.copyfile("drive/MyDrive/Pet_Projects/vocab.txt", "tokenizer/vocab.txt")

'tokenizer/vocab.txt'

In [24]:
tokenizer = DistilBertTokenizerFast.from_pretrained('tokenizer/')

In [25]:
tokenizer.add_special_tokens({'pad_token': "<pad>", 'mask_token': "<mask>" })

2

In [26]:
tokenizer

PreTrainedTokenizerFast(name_or_path='tokenizer/', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '<pad>', 'cls_token': '[CLS]', 'mask_token': '<mask>'})

In [27]:
def tokenize_align(example):
    # Encode the context first and the question second
    encoding = tokenizer(example['sentences'], example['question'], truncation=True, padding=True, max_length=tokenizer.model_max_length)
    # Map the answer's character offsets to token indices; end_idx is exclusive, hence the -1
    start_positions = encoding.char_to_token(example['str_idx'])
    end_positions = encoding.char_to_token(example['end_idx']-1)
    # char_to_token returns None when no token covers the character (for example, after truncation)
    if start_positions is None:
        start_positions = tokenizer.model_max_length
    if end_positions is None:
        end_positions = tokenizer.model_max_length
    return {'input_ids': encoding['input_ids'],
            'attention_mask': encoding['attention_mask'],
            'start_positions': start_positions,
            'end_positions': end_positions}
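
To make the alignment step concrete, here is a simplified, pure-Python stand-in for what `char_to_token` does. It is only a sketch: the regex tokenizer and both helper names are illustrative assumptions, not the Hugging Face implementation, and the `offset=1` argument merely mimics the leading `[CLS]` token.

```python
import re


def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping each token's
    (start, end) character offsets. A toy stand-in for a real tokenizer."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_to_token(tokens, char_idx, offset=1):
    """Return the index of the token covering char_idx, or None if no token
    covers it (e.g. whitespace). `offset` accounts for a leading [CLS]."""
    for i, (_, start, end) in enumerate(tokens):
        if start <= char_idx < end:
            return i + offset
    return None


context = "The garden is north of the bathroom. The hallway is south of the bathroom."
tokens = tokenize_with_offsets(context)

str_idx, end_idx = 4, 10                      # character span of "garden"
start_token = char_to_token(tokens, str_idx)
end_token = char_to_token(tokens, end_idx - 1)  # end_idx is exclusive
print(start_token, end_token)                 # 2 2
print(char_to_token(tokens, end_idx))         # None: offset 10 is a space
```

The last line shows why the end offset is decremented before the lookup: the exclusive end points at the character after the answer, which may belong to no token at all.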

In [28]:
qa_dataset = processed.map(tokenize_align)

In [29]:
qa_dataset = qa_dataset.remove_columns(['story.answer', 'story.id', 'story.supporting_ids', 'story.text', 'story.type'])

In [30]:
qa_dataset['train'][200]

{'question': 'What is north of the bathroom?',
 'sentences': 'The garden is north of the bathroom. The hallway is south of the bathroom.',
 'answer': 'garden',
 'str_idx': 4,
 'end_idx': 10,
 'input_ids': [101, 1996, 3871, 2003, 2167, 1997, 1996, 5723, 1012, 1996, 6797, 2003, 2148, 1997, 1996, 5723, 1012, 102, 2054, 2003, 2167, 1997, 1996, 5723, 1029, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'start_positions': 2,
 'end_positions': 2}

In [31]:
print(qa_dataset['train'][200])

{'question': 'What is north of the bathroom?', 'sentences': 'The garden is north of the bathroom. The hallway is south of the bathroom.', 'answer': 'garden', 'str_idx': 4, 'end_idx': 10, 'input_ids': [101, 1996, 3871, 2003, 2167, 1997, 1996, 5723, 1012, 1996, 6797, 2003, 2148, 1997, 1996, 5723, 1012, 102, 2054, 2003, 2167, 1997, 1996, 5723, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'start_positions': 2, 'end_positions': 2}


In [32]:
train_ds = qa_dataset['train']
test_ds = qa_dataset['test']

In [33]:
columns_to_return = ['input_ids','attention_mask', 'start_positions', 'end_positions']
train_ds.set_format(type='pt', columns=columns_to_return)
test_ds.set_format(type='pt', columns=columns_to_return)

<a name="2"></a>
## 2. Build model

In [34]:
!svn checkout https://github.com/amanchadha/coursera-deep-learning-specialization/trunk/C5%20-%20Sequence%20Models/Week%204/Question%20Answering/model

clear_output()

In [35]:
pytorch_model = DistilBertForQuestionAnswering.from_pretrained("drive/MyDrive/Pet_Projects/").to(device)

<a name="3"></a>
## 3. Train model

In [36]:
def compute_metrics(pred):
    start_labels = pred.label_ids[0]
    start_preds = pred.predictions[0].argmax(-1)
    end_labels = pred.label_ids[1]
    end_preds = pred.predictions[1].argmax(-1)
    
    f1_start = f1_score(start_labels, start_preds, average='macro')
    f1_end = f1_score(end_labels, end_preds, average='macro')
    
    return {
        'f1_start': f1_start,
        'f1_end': f1_end,
    }
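
The macro F1 above treats every token position as a class and scores the start and end predictions independently. As a complementary check, one could also count how often both boundaries are correct at once. A sketch with a hypothetical `exact_match` helper and made-up toy predictions, neither of which is part of the original notebook:

```python
def exact_match(start_preds, end_preds, start_labels, end_labels):
    """Fraction of examples whose predicted start AND end token indices
    both equal the labels."""
    pairs = zip(start_preds, end_preds, start_labels, end_labels)
    hits = sum(1 for sp, ep, sl, el in pairs if sp == sl and ep == el)
    return hits / len(start_labels)


# Toy values for illustration only
start_preds, end_preds = [2, 10, 2, 10], [2, 10, 3, 10]
start_labels, end_labels = [2, 10, 2, 2], [2, 10, 2, 2]

score = exact_match(start_preds, end_preds, start_labels, end_labels)
print(score)  # 0.5
```

In `compute_metrics`, the same function could in principle be fed the argmax arrays already computed there.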

In [37]:
training_args = TrainingArguments(
    output_dir='results',            # output directory
    overwrite_output_dir=True,       # overwrite existing contents of output_dir
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size per device during evaluation
    warmup_steps=20,                 # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=None,                # directory for storing logs (None falls back to the library default)
    logging_steps=50                 # log the training loss every 50 steps
)

trainer = Trainer(
    model=pytorch_model,             # the instantiated 🤗 Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_ds,          # training dataset
    eval_dataset=test_ds,            # evaluation dataset
    compute_metrics=compute_metrics  # computes macro F1 for start and end positions
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: sentences, str_idx, answer, question, end_idx. If sentences, str_idx, answer, question, end_idx are not expected by `DistilBertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 66364418


| Step | Training Loss |
|------|---------------|
| 50   | 1.5323        |
| 100  | 0.7932        |
| 150  | 0.456         |
| 200  | 0.4368        |
| 250  | 0.3205        |
| 300  | 0.32          |
| 350  | 0.3297        |




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=0.5787511965433757, metrics={'train_runtime': 21.4344, 'train_samples_per_second': 139.962, 'train_steps_per_second': 17.495, 'total_flos': 19904183208000.0, 'train_loss': 0.5787511965433757, 'epoch': 3.0})
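
The figures in the log can be cross-checked with a little arithmetic. The sketch below assumes a single device and one gradient accumulation step, as the log reports, and that the last partial batch of an epoch (if any) is kept:

```python
import math

num_examples = 1000      # "Num examples" in the log
batch_size = 8           # per-device batch size, single device
num_epochs = 3
train_runtime = 21.4344  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)  # 125
total_steps = steps_per_epoch * num_epochs              # 375

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # 375
print(round(samples_per_second, 3))  # 139.962
print(round(steps_per_second, 3))    # 17.495
```

These match `Total optimization steps`, `train_samples_per_second` and `train_steps_per_second` in the output above.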

<a name="4"></a>
## 4. Evaluate

In [38]:
trainer.evaluate(test_ds)

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: sentences, str_idx, answer, question, end_idx. If sentences, str_idx, answer, question, end_idx are not expected by `DistilBertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'eval_loss': 0.3174998164176941,
 'eval_f1_start': 0.802786136980122,
 'eval_f1_end': 0.7979120861612974,
 'eval_runtime': 1.3159,
 'eval_samples_per_second': 759.96,
 'eval_steps_per_second': 94.995,
 'epoch': 3.0}

<a name="5"></a>
## 5. Inference

In [39]:
pytorch_model.to(device)

question, text = 'What is east of the hallway?', 'The kitchen is east of the hallway. The garden is south of the bedroom.'

# Encode the context first and the question second, matching the order used in training
input_dict = tokenizer(text, question, return_tensors='pt')

input_ids = input_dict['input_ids'].to(device)
attention_mask = input_dict['attention_mask'].to(device)

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = pytorch_model(input_ids, attention_mask=attention_mask)

start_logits = outputs[0]
end_logits = outputs[1]

# Take the most likely start and end tokens and join everything between them (inclusive)
all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
answer = ' '.join(all_tokens[torch.argmax(start_logits, 1)[0] : torch.argmax(end_logits, 1)[0]+1])

print(question, answer.capitalize())

What is east of the hallway? Kitchen
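
Taking the two argmaxes independently works here, but nothing prevents the predicted end from landing before the predicted start, in which case the slice is empty. A possible refinement is to search for the best-scoring pair with start <= end. The sketch below uses plain Python lists and made-up toy logits; `best_span`, `join_wordpieces` and `max_answer_len` are hypothetical names, not part of the original notebook or of the Transformers API:

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return the (start, end) pair with the highest summed logit,
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


def join_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' continuation pieces onto the
    previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


# Toy values for illustration only
tokens = ["[CLS]", "the", "kitchen", "is", "east", "[SEP]"]
start_logits = [0.1, 0.2, 3.0, 0.1, 2.5, 0.0]
end_logits = [0.0, 2.8, 2.0, 0.3, 0.1, 0.0]

# Independent argmaxes give start=2, end=1, i.e. an empty slice
naive_start = start_logits.index(max(start_logits))
naive_end = end_logits.index(max(end_logits))
print(tokens[naive_start:naive_end + 1])       # []

start, end = best_span(start_logits, end_logits)
print(join_wordpieces(tokens[start:end + 1]))  # kitchen
```

With real model outputs, the logits would first be moved to the CPU and converted to lists, for instance with `start_logits[0].tolist()`.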
