# Question Answering Engine

## 02: Single Model Approach

In my initial experiments, I implemented a single BERT-based question answering model for both entity span and relation prediction. A problem with this approach, as I found out, was that the model struggled to learn both tasks effectively, with relation extraction being a more complex task than entity recognition.

### Libraries

Importing the necessary libraries.

In [1]:
# Pytorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Numpy, Plotting, Metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
import random

from unidecode import unidecode

# Testing and metrics
from sklearn.metrics import roc_curve, roc_auc_score, f1_score, precision_score, recall_score, confusion_matrix, precision_recall_fscore_support

# BERT
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup, logging
logging.set_verbosity_error()

# Use Metal (MPS) when running locally on Apple silicon, fall back to CUDA, and use the CPU as a last resort
device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'

### Datasets

Reading the data from the CSV files with pandas to create the train, validation, and test datasets. The relation vocabulary is also loaded as a list of possible relations.

In [2]:
# Firstly read the dictionary I created
df = pd.read_csv("dataset/entity_dict.csv", sep = ',')

# Remove accents from the 'Entity' column before keeping a reference to it
df['Entity'] = df['Entity'].apply(lambda x: unidecode(x))
Entities = df['Entity']
Entity_ids = df['Id']

# Then read the relations
df = pd.read_csv("dataset/relation_vocab.csv")

# Create a list with the relation vocabulary
relation_vocab = df['Relation'].to_list()

# Finally read the train dataset
df = pd.read_csv("dataset/train_dataset.csv", sep = ',')
train_Questions = df['question']
train_Entity_ids = df['entity_id']
train_Entity_labels = df['entity_label']
train_Entity_start = df['entity_start']
train_Entity_end = df['entity_end']
train_Relation_ids = df['relation_id']
train_Answer_ids = df['answer_id']

# Validation dataset
df = pd.read_csv("dataset/val_dataset.csv", sep = ',')
val_Questions = df['question']
val_Entity_ids = df['entity_id']
val_Entity_labels = df['entity_label']
val_Entity_start = df['entity_start']
val_Entity_end = df['entity_end']
val_Relation_ids = df['relation_id']
val_Answer_ids = df['answer_id']

# Test dataset
df = pd.read_csv("dataset/test_dataset.csv", sep = ',')
test_Questions = df['question']
test_Entity_ids = df['entity_id']
test_Entity_labels = df['entity_label']
test_Entity_start = df['entity_start']
test_Entity_end = df['entity_end']
test_Relation_ids = df['relation_id']
test_Answer_ids = df['answer_id']

# Free the dataframe's memory resources
del df

Loading the BERT tokenizer from the pre-trained "bert-base-uncased" model.

In [3]:
# Load the BERT tokenizer
BERT_MODEL = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=True)

### Functions & Model

Seeding function for reproducibility.

In [4]:
def seedTorch(seed=33):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ['PYTHONHASHSEED'] = str(seed)

For preprocessing, the span start and end in the dataset correspond to actual words, but since I'll be using the BERT tokenizer, I need to locate the corresponding start and end tokens in the BERT-tokenized question for each entity. The function returns the tokenized input IDs and attention masks as PyTorch tensors, together with the start and end token positions for each question as lists.

In [5]:
# Returns input_ids and attention_masks as tensors, plus the token-level starts and ends as lists
def preprocess(questions, max_length, entity_starts, entity_ends):

    ids = []
    masks = []
    token_starts = []
    token_ends = []
    
    for i, question in enumerate(questions):

        # Remove "?" and the possessive "'s" from the question, then strip accents
        question = unidecode(question.replace("?", "").replace("'s", ""))
        question_t = question.split()

        # Locate the corresponding start in the BERT tokenized question
        start_token = tokenizer.encode_plus(
            text = question_t[entity_starts[i]],
            return_attention_mask=True,
            add_special_tokens=False
        ) 
        start_token = start_token['input_ids'][0]

        # Locate the corresponding end in the BERT tokenized question
        end_token = tokenizer.encode_plus(
            text = question_t[entity_ends[i]],
            return_attention_mask=True,
            add_special_tokens=False
        ) 
        end_token = end_token['input_ids'][-1]

        # Encode the question
        encoding = tokenizer.encode_plus(
            text = question,
            max_length=max_length,
            truncation=True,
            return_attention_mask=True,
            padding='max_length',
            add_special_tokens=False
        ) 
        tokens = encoding['input_ids']
        
        # Append in the corresponding lists
        ids.append(tokens)
        masks.append(encoding['attention_mask'])
        token_starts.append(tokens.index(start_token))
        token_ends.append(tokens.index(end_token))

    return torch.tensor(ids), torch.tensor(masks), token_starts, token_ends
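
As a side note on the alignment step, `tokens.index(...)` returns the first occurrence of a token id, which may not be the entity when the same word appears earlier in the question. The following is a minimal sketch of an alternative that counts sub-tokens word by word. `word_span_to_token_span` and `toy_wordpiece` are hypothetical names, and `toy_wordpiece` is only a stand-in for the BERT tokenizer so that the example runs without downloading a model.

```python
from typing import Callable, List, Tuple


def word_span_to_token_span(
    words: List[str],
    word_start: int,
    word_end: int,
    tokenize_word: Callable[[str], List[str]],
) -> Tuple[int, int]:
    """Map an inclusive word-level span to an inclusive sub-token span."""
    token_start, token_end = 0, -1
    position = 0  # index of the next sub-token in the full sequence
    for i, word in enumerate(words):
        pieces = tokenize_word(word)
        if i == word_start:
            token_start = position  # first sub-token of the first entity word
        position += len(pieces)
        if i == word_end:
            token_end = position - 1  # last sub-token of the last entity word
    return token_start, token_end


def toy_wordpiece(word: str) -> List[str]:
    # Hypothetical stand-in for a subword tokenizer: words longer than
    # four characters are split into a head and a "##" continuation.
    return [word] if len(word) <= 4 else [word[:4], "##" + word[4:]]


words = "who directed the film the godfather".split()
# The entity "the godfather" covers words 4..5; note that "the" also
# appears earlier, at word 2.
span = word_span_to_token_span(words, 4, 5, toy_wordpiece)
print(span)
```

With the real tokenizer, `tokenizer.tokenize` could play the role of `tokenize_word`; that wiring is an assumption and not what the notebook does.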

Custom dataset class that preprocesses the questions using the function I defined above. It keeps the input_ids, attention_masks, entity start and end token positions, and relation IDs for each question as PyTorch tensors.

In [6]:
# A custom Question Dataset class to use for the dataloaders
class QuestionDataset(Dataset):

    def __init__(self, questions, entity_start, entity_end, relation_ids, length):
        ids, masks, entity_start_T, entity_end_T = preprocess(questions, length, entity_start, entity_end)
        self.max_length = length
        self.input_ids = ids
        self.attention_masks = masks 
        self.entity_start = torch.tensor(entity_start_T, dtype=torch.long)
        self.entity_end = torch.tensor(entity_end_T, dtype=torch.long)
        self.relation_ids = torch.tensor([relation_vocab.index(r) for r in relation_ids.to_list() ])
        self.samples = len(relation_ids)

    def __len__(self):
        return self.samples

    def __getitem__(self, idx):
        return {'input_ids': self.input_ids[idx], 
                'attention_mask': self.attention_masks[idx],
                'entity_start': self.entity_start[idx],
                'entity_end': self.entity_end[idx],
                'relation_ids': self.relation_ids[idx]
                }

Then I calculate the maximum number of input tokens across the training, validation, and test sets. The loop below goes through each question in each set, encodes it with BERT's tokenizer, and keeps the largest token count seen so far. The final value is the maximum number of tokens among all questions in the three sets.

In [7]:
maxlen = 0
listsets = train_Questions, val_Questions, test_Questions
for questions in listsets:
    for question in questions:
        encoding = tokenizer.encode_plus(
            text=question,
            return_attention_mask=True,
            add_special_tokens=False
        )
        # Keep the largest token count seen so far
        maxlen = max(maxlen, len(encoding['input_ids']))
print(maxlen)

34


Since the maximum length is 34, a maximum sequence length of 36 tokens is enough. I also select a batch size of 32, which has worked well so far with BERT (a larger batch size would not run on my local machine). I then create the three datasets for the training, validation, and testing data, along with the corresponding dataloaders.

In [7]:
MAX_LENGTH = 36
BATCH_SIZE = 32

# Tokenizing, preprocessing and dataset creation
train_dataset = QuestionDataset(train_Questions, train_Entity_start, train_Entity_end, train_Relation_ids, MAX_LENGTH)
val_dataset = QuestionDataset(val_Questions, val_Entity_start, val_Entity_end, val_Relation_ids, MAX_LENGTH)
test_dataset = QuestionDataset(test_Questions, test_Entity_start, test_Entity_end, test_Relation_ids, MAX_LENGTH)

# Corresponding Dataloaders
train_dataloader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
val_dataloader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
test_dataloader = DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)

Then I define the `BERT_QA` class, which has three heads, each built around a linear layer: two predict the start and end positions of the entity (`self.start_head` and `self.end_head`, respectively) and one predicts the relation label (`self.relation_head`).

- In the forward method, the input token IDs and attention mask are passed through the BERT model to get the sequence output.
- The sequence output is passed through the start and end heads, each of which ends in a softmax over the sequence positions. The resulting probabilities are multiplied by the attention mask to zero out the padding tokens.
- The pooled output is passed through the relation head, which ends in a softmax over the relation vocabulary. Its output is returned along with the masked start and end probabilities.
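
To illustrate the masking step, here is a minimal sketch on a toy tensor. It compares multiplying the softmax output by the attention mask, as described above, with a commonly used alternative that masks the raw scores before the softmax. The tensor values are made up for illustration, and the alternative is not what the notebook does.

```python
import torch

# Toy scores for one sequence of four positions; the last two are padding.
scores = torch.tensor([[2.0, 1.0, 0.5, 0.0]])
attention_mask = torch.tensor([[1, 1, 0, 0]])

# Approach described above: softmax first, then zero out the padding.
# The remaining probabilities no longer sum to one.
probs_then_mask = torch.softmax(scores, dim=1) * attention_mask

# Alternative: set padded scores to -inf before the softmax, so the
# probability mass is distributed over the real tokens only.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
mask_then_probs = torch.softmax(masked_scores, dim=1)

print(probs_then_mask.sum().item(), mask_then_probs.sum().item())
```

Both variants give zero probability to padded positions; they differ in whether the distribution over the real tokens is renormalized.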

In [8]:
class BERT_QA(torch.nn.Module):
    def __init__(self, bert_model, vocab_size):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model)

        self.start_head = nn.Sequential(
            nn.Dropout(p=0.13),
            nn.Linear(self.bert.config.hidden_size, 1),
            nn.Flatten(),
            nn.Softmax(dim=1)
        )

        self.end_head = nn.Sequential(
            nn.Dropout(p=0.13),
            nn.Linear(self.bert.config.hidden_size, 1),
            nn.Flatten(),
            nn.Softmax(dim=1)
        )

        self.relation_head = nn.Sequential(
            nn.Dropout(0.13),
            nn.Linear(self.bert.config.hidden_size, vocab_size),
            nn.Softmax(dim=1)
        )
        
        self.vocab_size = vocab_size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]
        start_ent = self.start_head(sequence_output)
        end_ent = self.end_head(sequence_output)
        
        return start_ent* attention_mask, end_ent * attention_mask, self.relation_head(outputs[1])

Below are the training functions for the entity span & relation prediction model.
- `train_epoch` trains the model for one epoch and returns the mean loss.
- `train_model` trains the model for a given number of epochs and returns the trained model.
- `optimize_model` trains the model for a given number of epochs, evaluates it on the validation dataset after each epoch, and saves the weights with the best F1 score to `best_model.pt`. It returns the model as it is after the last epoch.
- `evaluation_function` evaluates the model on a given dataset and prints the metrics.
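
To make the overlap idea behind the span metrics concrete, here is a minimal sketch that scores a single predicted span against a true span. It counts overlap purely in tokens, which is a simplification: the notebook's `evaluation_function` counts each overlapping span as one true positive and the non-overlapping tokens as false positives or negatives. `span_overlap_scores` is a hypothetical helper name.

```python
from typing import Tuple


def span_overlap_scores(
    true_span: Tuple[int, int], pred_span: Tuple[int, int]
) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 for two inclusive spans."""
    true_tokens = set(range(true_span[0], true_span[1] + 1))
    pred_tokens = set(range(pred_span[0], pred_span[1] + 1))
    overlap = len(true_tokens & pred_tokens)

    # Guard against empty spans (for example, a predicted end before the start)
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(true_tokens) if true_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# True entity covers tokens 5..7, the prediction covers tokens 6..9
precision, recall, f1 = span_overlap_scores((5, 7), (6, 9))
print(precision, recall, f1)
```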

In [14]:
# Function to train the joint span and relation model for one epoch
def train_epoch(optimizer, scheduler, dataloader, lossfunc, model, device, display=True, clip_value=0.6):

    model = model.train()
    losses = []

    # For each batch
    for batch, data in enumerate(dataloader):

        # In case the GPU is used
        ids = data['input_ids'].to(device)
        mask = data['attention_mask'].to(device)

        # The actual entity span boundaries
        actual_entity_starts = data['entity_start'].to(device)
        actual_entity_ends = data['entity_end'].to(device)

        # The actual relations
        actual_relations = data['relation_ids'].to(device)
        
        # Predict and calculate loss
        start_logits, end_logits, relation_logits  = model(input_ids=ids, attention_mask=mask)

        # Find the start index with the highest probability
        start_preds = torch.argmax(start_logits, dim=1)

        # Create a mask with the same shape as the matrix
        start_mask = torch.zeros((len(start_preds), MAX_LENGTH)).to(device)

        # Set the ones in the mask based on the indices in the tensor
        for i, idx in enumerate(start_preds):
            start_mask[i, idx:] = 1

        masked_end_logits = end_logits * start_mask

        span_loss1 = lossfunc(start_logits, actual_entity_starts)
        span_loss2 = lossfunc(masked_end_logits, actual_entity_ends)

        rel_loss = lossfunc(relation_logits, actual_relations)

        total_loss = span_loss1+span_loss2+3*rel_loss
        losses.append(total_loss.item())

        # Backpropagate, clip the gradients and update the weights
        total_loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip_value)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        # When display is set, print the loss every 64 batches
        if(display):
            if batch % 64 == 0:
                size = len(dataloader.dataset)
                loss, current = total_loss.item(), batch * len(ids)
                print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

    # Return the total mean loss
    meanloss = 0
    if len(losses)!=0:
        meanloss = sum(losses)/len(losses)
    return meanloss
 
# Function to train a model
def train_model(epochs, optimizer, scheduler, dataloader, entropy_loss, model, device, display=True, clip_value=0.6):

    # For each epoch
    for epoch in range(epochs):
        if (display):
            print(f"\nEpoch {epoch+1}\n_________________________________")
        train_epoch(optimizer, scheduler, dataloader, entropy_loss, model, device, display, clip_value)
        if (display):
            print("_________________________________")
        
    # Returns the model
    return model

# Function to train a model and save the checkpoint with the best validation F1-score
def optimize_model(epochs, optimizer, scheduler, dataloader, val_dataloader, lossfunc, model, device, display=True, clip_value=0.6):

    best_f1 = 0

    # For each epoch
    for epoch in range(epochs):
        if (display):
            print(f"\nEpoch {epoch+1}\n_________________________________")
        train_epoch(optimizer, scheduler, dataloader, lossfunc, model, device, display, clip_value)
        _, f1_score, _ = evaluation_function(val_dataloader, model, lossfunc, device, True)
        
        # If F1-score is better in the validation set then keep this model
        if (best_f1<f1_score):
            best_f1=f1_score
            # Save the model
            torch.save(model.state_dict(), './best_model.pt')

        if (display):
            print("_________________________________")
        
    # Returns the model after the final epoch; the best checkpoint, if any, is in ./best_model.pt
    return model


# Evaluation function
def evaluation_function(dataloader, model, lossfunc, device, display=True):

    # Metrics initialisation
    losses = []
    total_f1_score = 0
    total_accuracy = 0 
    total_precision = 0 
    total_recall = 0 
    total_count = 0 
    total_relations = 0
    correct_relations =0

    # So the model is in eval mode
    model.eval()
    with torch.no_grad():

        # For each batch
        for batch, data in enumerate(dataloader):

            # In case the GPU is used
            ids = data['input_ids'].to(device)
            mask = data['attention_mask'].to(device)

            # The actual positions
            actual_entity_starts = data['entity_start'].to(device)
            actual_entity_ends = data['entity_end'].to(device)
            actual_relations = data['relation_ids'].to(device)

            # Predict
            start_logits, end_logits, relation_logits = model(input_ids=ids, attention_mask=mask)
        
            # Find the start and end indices with the highest probability
            start_preds = torch.argmax(start_logits, dim=1)

            # Create a mask with the same shape as the matrix
            start_mask = torch.zeros((len(start_preds), MAX_LENGTH)).to(device)

            # Set the ones in the mask based on the indices in the tensor
            for i, idx in enumerate(start_preds):
                start_mask[i, idx:] = 1

            masked_end_logits = end_logits * start_mask
            end_preds = torch.argmax(masked_end_logits, dim=1)

            span_loss1 = lossfunc(start_logits, actual_entity_starts)
            span_loss2 = lossfunc(masked_end_logits, actual_entity_ends)
            rel_loss = lossfunc(relation_logits, actual_relations)

            total_loss = span_loss1+span_loss2 + 3*rel_loss
            losses.append(total_loss.item())

            predicted_spans = [(start_preds[i].item(), end_preds[i].item(), None) for i in range(len(start_preds))]
            true_spans = [(actual_entity_starts[i].item(), actual_entity_ends[i].item(), None) for i in range(len(actual_entity_starts))]

            _ , predictions = torch.max(relation_logits, dim=1)
            correct_relations += (predictions==actual_relations).sum()
            total_relations += len(actual_relations)

            # Counters for the TP, FP, and FN counts in this batch
            tp = 0
            fp = 0
            fn = 0

            for j in range(len(start_preds)):
                # create range objects
                true_range = range(int(true_spans[j][0]), int(true_spans[j][1]+1))
                pred_range = range(int(predicted_spans[j][0]), int(predicted_spans[j][1]+1))

                # Compute the overlap between the predicted and true ranges
                overlap = set(true_range).intersection(set(pred_range))

                # Update the TP, FP, and FN counts
                if len(overlap) > 0:
                    tp += 1
                    fp += len(pred_range) - len(overlap)
                    fn += len(true_range) - len(overlap)
                else:
                    fp += len(pred_range)
                    fn += len(true_range)

            # Compute the precision and recall, guarding against empty denominators
            precision = float(tp) / float(tp + fp) if (tp + fp) > 0 else 0.0
            recall = float(tp) / float(tp + fn) if (tp + fn) > 0 else 0.0

            # Add to the total metrics
            if precision + recall > 0:
                total_f1_score += float(2 * precision * recall) / float(precision + recall)
            if tp + fp + fn > 0:
                total_accuracy += float(tp) / float(tp + fp + fn)
            total_precision += precision
            total_recall += recall
            total_count +=1

    accuracy_relation = float(correct_relations)/ float(total_relations) * 100
    accuracy_entity_span = float(total_accuracy) / float(total_count) * 100
    f1_score = float(total_f1_score) / float(total_count) * 100
    dataset_wide_f1 = float(2 * total_precision * total_recall) / float(total_precision + total_recall)
    f1_star = float(dataset_wide_f1)/ float(total_count) * 100
    mean_loss = sum(losses)/len(losses)
    precision_score = float(total_precision) / float(total_count) * 100
    recall_score = float(total_recall) / float(total_count) * 100

    # Printing them if display is not false
    if display:
        print("\nEvaluation Results")
        print("_________________________________")
        print(
            f"\nMean Loss: {mean_loss:.2f} "
            f"\n_________________________________"
            f"\nEntity Span Prediction"
            f"\nPrecision: {precision_score:.2f}%"
            f"\nRecall: {recall_score:.2f}%"
            f"\nSpan Entity Accuracy : {accuracy_entity_span:.2f}%"
            f"\nAverage F1-Score: {f1_score:.2f}%"
            f"\nDataset Wide F1*: {f1_star:.2f}%"
            f"\n_________________________________"
            f"\nRelation Prediction"
            f"\nAccuracy : {accuracy_relation:.2f}%"
            )
        print("_________________________________")

    # Reset the model to train mode
    model.train()

    return accuracy_entity_span, f1_score, f1_star
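
Both `train_epoch` and `evaluation_function` build `start_mask` with a Python loop over the batch. A minimal sketch of an equivalent vectorized construction follows, assuming a toy `max_length` of 6 in place of `MAX_LENGTH` and made-up start predictions.

```python
import torch

max_length = 6  # stand-in for MAX_LENGTH
start_preds = torch.tensor([0, 2, 5])  # made-up predicted start indices

# Loop version, as in the functions above
loop_mask = torch.zeros((len(start_preds), max_length))
for i, idx in enumerate(start_preds):
    loop_mask[i, idx:] = 1

# Vectorized version: compare every position with each row's start index.
# In the notebook, `positions` would also need to live on the same device
# as `start_preds`.
positions = torch.arange(max_length).unsqueeze(0)  # shape (1, max_length)
vector_mask = (positions >= start_preds.unsqueeze(1)).float()

print(torch.equal(loop_mask, vector_mask))
```

The broadcasting comparison yields a `(batch, max_length)` mask in one operation, which avoids a per-sample loop on every batch.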

I then train the BERT model for five epochs, initializing the learning rate and gradient clip value with the best values found during tuning. I use the cross entropy loss function for both tasks. I also experimented with the MSE loss and a hybrid one that adds the distance from the actual span boundaries, but they had worse performance.
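
One detail that may be worth checking here: `nn.CrossEntropyLoss` applies a log-softmax internally and expects raw, unnormalized scores, while the heads of `BERT_QA` already end in `nn.Softmax`. The sketch below uses a hypothetical `num_classes` of 50 and a made-up target to show how passing probabilities instead of raw scores keeps the loss high even for a confident, correct prediction. Whether this contributes to the flat loss in this notebook is not verified here.

```python
import math

import torch
import torch.nn as nn

num_classes = 50  # hypothetical size of the relation vocabulary
loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([3])

# A confident and correct prediction expressed as raw scores
raw_scores = torch.zeros(1, num_classes)
raw_scores[0, 3] = 20.0

# Loss on raw scores: close to zero for a confident, correct prediction
loss_from_scores = loss_fn(raw_scores, target).item()

# Loss on softmax output: the values passed in are confined to [0, 1],
# so the loss cannot drop below log(e + num_classes - 1) - 1
loss_from_probs = loss_fn(torch.softmax(raw_scores, dim=1), target).item()
lower_bound = math.log(math.e + num_classes - 1) - 1

print(loss_from_scores, loss_from_probs, lower_bound)
```

If this turns out to matter, one option would be to drop the `nn.Softmax` layers from the heads during training and apply a softmax only when probabilities are needed; this is a suggestion under the assumption above, not a tested change.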

In [15]:
EPOCHS = 5
LEARNING_RATE = 4e-5
CLIP_VALUE = 0.7

# Instantiate the model
seedTorch()
model = BERT_QA(bert_model=BERT_MODEL, vocab_size=len(relation_vocab)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, 0, len(train_dataloader)*EPOCHS)
lossfunc = nn.CrossEntropyLoss()

# Train the model
start_time = time.time()
model = optimize_model(EPOCHS,optimizer, scheduler, train_dataloader, val_dataloader, lossfunc, model, device, True, CLIP_VALUE)
print(f'Training Time: {(time.time() - start_time)/60:.2f} minutes')


Epoch 1
_________________________________
loss: 21.611191  [    0/19463]
loss: 19.657215  [ 2048/19463]
loss: 19.508245  [ 4096/19463]
loss: 19.863255  [ 6144/19463]
loss: 19.632692  [ 8192/19463]
loss: 19.900032  [10240/19463]
loss: 19.598673  [12288/19463]
loss: 19.662283  [14336/19463]
loss: 19.960163  [16384/19463]
loss: 19.975389  [18432/19463]

Evaluation Results
_________________________________

Mean Loss: 19.65 
_________________________________
Entity Span Prediction
Precision: 84.74%
Recall: 87.31%
Span Entity Accuracy : 75.57%
Average F1-Score: 85.62%
Dataset Wide F1*: 86.01%
_________________________________
Relation Prediction
Accuracy : 8.87%
_________________________________
_________________________________

Epoch 2
_________________________________
loss: 19.536558  [    0/19463]
loss: 19.692316  [ 2048/19463]
loss: 19.692572  [ 4096/19463]
loss: 19.887600  [ 6144/19463]
loss: 19.828758  [ 8192/19463]
loss: 19.626688  [10240/19463]
loss: 19.566326  [12288/19463]
loss:

After training, the model achieved good Entity Span Prediction metrics with a Precision of 85.12%, Recall of 92.71%, Accuracy of 80.15%, Average F1-Score of 88.41% and Dataset Wide F1* of 88.75%. However, the model did not perform well in the Relation Prediction task, with an Accuracy of only 8.07%, indicating that it was unable to learn anything relating to this task. 

In [16]:
# Evaluate on the test set
lossfunc = nn.CrossEntropyLoss()
_ = evaluation_function(dataloader=test_dataloader, model=model, lossfunc=lossfunc, device=device,display=True)


Evaluation Results
_________________________________

Mean Loss: 19.65 
_________________________________
Entity Span Prediction
Precision: 85.12%
Recall: 92.71%
Span Entity Accuracy : 80.15%
Average F1-Score: 88.41%
Dataset Wide F1*: 88.75%
_________________________________
Relation Prediction
Accuracy : 8.07%
_________________________________


The same results can be observed with the test set. One possible reason is that the model may have been biased towards learning to predict entity spans during training and may not have received enough information or examples to learn to predict relations accurately. Another reason could be that the relation prediction task is more difficult than the entity span prediction task, which only needs to locate the entity in the text rather than derive context.

I experimented further with different model architectures by enhancing the linear stack with more layers, adding activation functions and more heads, and the model was still unable to learn the relation part of the task. This might be due to the size of the data, or to training data that are not diverse enough or representative of the real-world scenarios the model may encounter during inference, or it might have required a more complex model. In any case, in my next experiments I moved to splitting the two tasks into two different models, which was immediately more successful, as you can see in the next notebook.