# <center>Question Answering Machine</center>

In this notebook, we will finetuning BERT model for question answering machine task. We will use facqa dataset from [IndoNLU](https://github.com/indobenchmark/indonlu) and indoBERT model from [indoLEM](https://github.com/indolem).

In [1]:
!pip install sentencepiece==0.1.95
!pip install transformers==4.2.2
!pip install datasets==1.2.0

Collecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
[?25l[K     |▎                               | 10 kB 23.5 MB/s eta 0:00:01[K     |▌                               | 20 kB 24.0 MB/s eta 0:00:01[K     |▉                               | 30 kB 11.7 MB/s eta 0:00:01[K     |█                               | 40 kB 9.3 MB/s eta 0:00:01[K     |█▍                              | 51 kB 5.1 MB/s eta 0:00:01[K     |█▋                              | 61 kB 5.5 MB/s eta 0:00:01[K     |██                              | 71 kB 5.9 MB/s eta 0:00:01[K     |██▏                             | 81 kB 6.7 MB/s eta 0:00:01[K     |██▌                             | 92 kB 6.4 MB/s eta 0:00:01[K     |██▊                             | 102 kB 5.3 MB/s eta 0:00:01[K     |███                             | 112 kB 5.3 MB/s eta 0:00:01[K     |███▎                            | 122 kB 5.3 MB/s eta 0:00:01[K     |███▌                     

## Import library

In [3]:
import copy
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
from transformers import default_data_collator
import numpy as np
from tqdm.auto import tqdm

# if there is a tqdm related error, run this cell one more time

In [4]:
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

## Load data

In [5]:
# download facqa dataset from indoNLU repo

# download train dataset
!wget "https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/facqa_qa-factoid-itb/train_preprocess.csv"

# download validation dataset
!wget "https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/facqa_qa-factoid-itb/valid_preprocess.csv"

--2021-10-08 01:29:31--  https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/facqa_qa-factoid-itb/train_preprocess.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2073762 (2.0M) [text/plain]
Saving to: ‘train_preprocess.csv’


2021-10-08 01:29:32 (32.4 MB/s) - ‘train_preprocess.csv’ saved [2073762/2073762]

--2021-10-08 01:29:32--  https://raw.githubusercontent.com/indobenchmark/indonlu/master/dataset/facqa_qa-factoid-itb/valid_preprocess.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 258917 (253K) [tex

In [6]:
data_files = {"train": 'train_preprocess.csv', "val": 'valid_preprocess.csv'}

dataset = load_dataset('csv', data_files=data_files)

train_dataset = dataset["train"]
valid_dataset = dataset["val"]

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1562.0, style=ProgressStyle(description…

Using custom data configuration default



Downloading and preparing dataset csv/default-726b64211aa62887 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-726b64211aa62887/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2...


HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0), HTML(value=''…

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0), HTML(value=''…

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-726b64211aa62887/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2. Subsequent calls will reuse this data.


In [7]:
# Sample data
valid_dataset[:5]

{'passage': ["['Selain', 'orangutan', ',', 'satwa', 'yang', 'berada', 'TNGL', 'itu', 'di', 'antaranya', 'adalah', 'harimau', 'sumatera', '(', 'Panthera', 'tigris', 'sumatrensis', ')', '.', 'Tim', 'Kompas', 'bulan', 'lalu', 'pernah', 'memergoki', 'harimau', 'ini', 'di', 'tepi', 'jalan', 'raya', 'Tapaktuan-Singkil', ',', 'tepatnya', 'di', 'Desa', 'Sultan', 'Daulat', ',', 'yang', 'masuk', 'koridor', 'ekosistem', 'Leuser', '.']",
  "['Pesawat', 'dengan', 'nomor', 'penerbangan', 'GA', '-', '181', 'yang', 'membawa', '84', 'penumpang', ',', 'termasuk', 'Menteri', 'Sosial', 'Bachtiar', 'Chamsyah', 'dan', 'sejumlah', 'anggota', 'Komisi', 'V', 'DPR', ',', 'ini', 'terpaksa', 'mendarat', 'darurat', 'karena', 'terjadi', 'kerusakan', 'pada', 'mesin', 'di', 'sayap', 'kanan', ',', '40', 'menit', 'setelah', 'lepas', 'landas', 'dari', 'Bandar', 'Udara', 'Polonia', ',', 'Medan', ',', 'Sumatera', 'Utara', '.']",
  "['Pernyataan', 'tersebut', 'disampaikan', 'Larijani', 'dalam', 'acara', 'konferensi', 'pers

## Set config

In [8]:
# we will finetuning using indobert-base-uncased from indoLEM
model_checkpoint = "indolem/indobert-base-uncased" 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
batch_size = 4

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1014.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=234118.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2.0, style=ProgressStyle(description_wi…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=42.0, style=ProgressStyle(description_w…




In [9]:
# set config when tokenizing
encoder_max_len = 384
decoder_max_len = 50
doc_stride = 128
pad_on_right = tokenizer.padding_side == "right"

## Preprocess data

In [10]:
# encode function for encoding data
def encode(example, encoder_max_len=encoder_max_len, decoder_max_len=decoder_max_len):
    
    text = copy.copy(example['passage'])
    question = copy.copy(example['question'])
    answer = copy.copy(example['seq_label'])
    answer_text = [None for i in range(len(text))]
    
    # since the data type in string, we need to convert the data into list
    for i in range(len(text)):
        t = text[i].strip("']['").split("', '")
        a = answer[i].strip("']['").split("', '")
        q = question[i].strip("']['").split(", ")
        q = [b.strip('\'"') for b in q]
        
        if len(t)!=len(a):
            t = text[i].strip("']['").split(", ")
            t_swap = []
            for b in t:
                if b[0] == '"':
                    t_swap.append(b.strip('"'))
                else:
                    t_swap.append(b.strip("\'"))
            t = t_swap
            
        assert len(t)==len(a)
        
        answer[i] = a
        text[i] = t
        question[i] = q
        answer_text[i] = " ".join(list(np.array(t)[np.array(a) != 'O']))
        

    # encode after converting the data
    encoder_inputs = tokenizer(question, text, is_split_into_words=True, truncation="only_second", max_length=encoder_max_len, padding='max_length', 
                               return_overflowing_tokens=True, return_offsets_mapping=True, stride=doc_stride)

    input_ids = encoder_inputs['input_ids']
    input_attention = encoder_inputs['attention_mask']
    offset_mapping = encoder_inputs.pop("offset_mapping") 

    # get the start and end index position of the answer for the question  
    start_answer_token_positions = []
    end_answer_token_positions = []
    
    for i in range(len(text)):
        sequence_ids = encoder_inputs.sequence_ids(i)
        
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1
            
        token_end_index = len(input_ids[i]) - 1
        while sequence_ids[token_end_index] != 1 :
            token_end_index -= 1
            
        start_token_answer = 0
        while answer[i][start_token_answer] == 'O':
            if offset_mapping[i][token_start_index + start_token_answer +1][0] == 0:
                start_token_answer += 1
            else:
                token_start_index += 1
        
        start_answer_token_positions.append(token_start_index + start_token_answer)
        
        end_token_answer = len(answer[i]) -1
        while answer[i][end_token_answer] == 'O':
            if offset_mapping[i][token_end_index][0] == 0:
                end_token_answer -= 1
                token_end_index -= 1
            else:
                token_end_index -= 1
                
        end_answer_token_positions.append(token_end_index + 1)
    
    outputs = {'input_ids':input_ids, 'attention_mask': input_attention, 
               "start_positions": start_answer_token_positions, "end_positions": end_answer_token_positions}
    
    return outputs

In [11]:
tokenized_datasets = dataset.map(encode, batched=True, remove_columns=dataset["train"].column_names)

HBox(children=(FloatProgress(value=0.0, max=3.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))




In [12]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 2495
    })
    val: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 311
    })
})

In [13]:
# check sample of tokenized data
print(tokenized_datasets['train'][:1])

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

## Prepare model

We will finetuning the model using the Trainer class from transformers library

In [14]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=444780374.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at indolem/indobert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at indolem/indobert-base-uncased and a

In [15]:
# set config argument for Trainer object
args = TrainingArguments(
    f"test-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end = True,
    metric_for_best_model = 'eval_loss'
)

In [16]:
# for more info check this https://huggingface.co/transformers/main_classes/data_collator.html
data_collator = default_data_collator

In [17]:
# create Trainer object
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

## Finetuning the model

In [18]:
trainer.train()

Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second
1,1.9702,0.907827,16.0008,19.436
2,0.8368,0.880009,15.9907,19.449
3,0.5178,0.940021,15.9803,19.462


TrainOutput(global_step=1872, training_loss=0.9760072903755383, metrics={'train_runtime': 1217.9146, 'train_samples_per_second': 1.537, 'total_flos': 1896466447157760, 'epoch': 3.0})

In [19]:
# save the model
trainer.save_model('best_qa_model')

In [20]:
# test to make sure that the best model is loaded at the end
trainer.evaluate()

{'epoch': 3.0,
 'eval_loss': 0.880009114742279,
 'eval_runtime': 16.1494,
 'eval_samples_per_second': 19.258}

## Using the saved model

In [21]:
# get device function
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

In [22]:
# set the device to put the best model into
device = get_default_device()
print(device)

cuda


In [23]:
# load the best model
best_model = AutoModelForQuestionAnswering.from_pretrained('best_qa_model')

In [24]:
# set the model to device, using cuda for faster calculation
model.to(device)

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31923, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [25]:
# convert raw input into feature vector
def prepare_features(example):

    text = example['passage']
    question = example['question']
        
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.

    tokenized_examples = tokenizer(question, text, is_split_into_words=False, truncation="only_second", 
                                   max_length=encoder_max_len, padding='max_length', 
                                   return_overflowing_tokens=True, return_offsets_mapping=True, stride=doc_stride)

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    #sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # Grab the sequence corresponding to that example (to know what is the context and what is the question).
    context_index = 1 if pad_on_right else 0

    return tokenized_examples

In [26]:
# example of input
ex_input={'passage': 'Hal ini banyak dicari tahu masyarakat sejak Presiden Joko Widodo memimpin Upacara Penetapan Komponen Cadangan (Komcad) pada hari ini (7/10/2021) di Pusdiklatpassus, Bandung, Jawa Barat.', 
          'question': 'Dimanakah Upacara Penetapan Komponen Cadangan diadakan ?'}

In [27]:
# process input into feature vector
feature_input = prepare_features(ex_input)

In [28]:
# function for prediction
def predict_answers(features, max_answer_length = 30, n_best_size = 20):
     

    best_answers =[]
        
    attention_mask = torch.tensor(features['attention_mask']).to(torch.int64)
    input_ids = torch.tensor(features['input_ids']).to(torch.int64)
    token_type_ids = torch.tensor(features['token_type_ids']).to(torch.int64)

    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    token_type_ids = token_type_ids.to(device)

    with torch.no_grad():
        output = model(attention_mask = attention_mask, input_ids = input_ids, token_type_ids=token_type_ids)


    start_logits = output.start_logits[0].cpu().numpy()
    end_logits = output.end_logits[0].cpu().numpy()
    offset_mapping = features["offset_mapping"][0]

    context = features['input_ids'][0]

    # Gather the indices the best start/end logits:
    start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
    end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

    valid_answers = []
    for start_index in start_indexes:
        for end_index in end_indexes:
            # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
            # to part of the input_ids that are not in the context.
            if (
                start_index >= len(offset_mapping)
                or end_index >= len(offset_mapping)
                or offset_mapping[start_index] is None
                or offset_mapping[end_index] is None
            ):
                continue
            # Don't consider answers with a length that is either < 0 or > max_answer_length.
            if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                continue
            #if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            #start_char = offset_mapping[start_index][0]
            #end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": tokenizer.decode(context[start_index: end_index]),
                    "start_idx": start_index,
                    "end_idx": end_index
                }
            )

    valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
    
    try:
        best_answers.append(valid_answers[0])
    except:
        print(i)
        print(idx)
        print(valid_answers)
        print(start_indexes)
        print(end_indexes)

    #return best_answers
    return valid_answers

In [29]:
# predict funtion return all possible answer sorted by its score, answer with the biggest score is the top answer
predict_answers(feature_input)

[{'end_idx': 51,
  'score': 0.31542462,
  'start_idx': 47,
  'text': 'bandung, jawa barat'},
 {'end_idx': 51,
  'score': -0.31230026,
  'start_idx': 42,
  'text': 'pusdiklatpassus, bandung, jawa barat'},
 {'end_idx': 51,
  'score': -1.6396658,
  'start_idx': 34,
  'text': '7 / 10 / 2021 ) di pusdiklatpassus, bandung, jawa barat'},
 {'end_idx': 51, 'score': -1.7371016, 'start_idx': 49, 'text': 'jawa barat'},
 {'end_idx': 51,
  'score': -2.2303257,
  'start_idx': 41,
  'text': 'di pusdiklatpassus, bandung, jawa barat'},
 {'end_idx': 51,
  'score': -3.944998,
  'start_idx': 26,
  'text': 'komcad ) pada hari ini ( 7 / 10 / 2021 ) di pusdiklatpassus, bandung, jawa barat'},
 {'end_idx': 51, 'score': -4.2310967, 'start_idx': 50, 'text': 'barat'},
 {'end_idx': 48, 'score': -4.54775, 'start_idx': 47, 'text': 'bandung'},
 {'end_idx': 51,
  'score': -4.8506117,
  'start_idx': 31,
  'text': 'hari ini ( 7 / 10 / 2021 ) di pusdiklatpassus, bandung, jawa barat'},
 {'end_idx': 48,
  'score': -5.175475

In [30]:
# print the best answer
best_answer = predict_answers(feature_input)[0]['text']
print(best_answer)

bandung, jawa barat
