<a href="https://colab.research.google.com/github/KevinLolochum/BERT-MODELS/blob/main/RoBERTa_Fine_tuned_for_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Fine-tuning RoBERTa for Extractive Question Answering in PyTorch**

- In this example I assume you have some familiarity with transformers and with the Hugging Face example of using an already fine-tuned model such as BERT or BART for question answering.
- If you don't, you can start [here](https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering) or with most online articles about BERT for question answering.
- As discussed in the sentiment analysis example, RoBERTa uses virtually the same architecture as BERT.

In [None]:
# Install transformers and huggingface datasets
!pip install transformers
!pip install datasets

Import libraries

In [2]:
import numpy as np
import torch
from torch.optim import Adam
from transformers import RobertaForQuestionAnswering, RobertaTokenizerFast, get_linear_schedule_with_warmup


**1. Instantiate model**

- I will start from the pretrained `roberta-base` checkpoint and the `RobertaForQuestionAnswering` class, which adds a span-prediction head on top of RoBERTa.

In [None]:
# We use RobertaTokenizerFast because the slow (pure-Python) tokenizers do not provide the char_to_token method we will need later.

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForQuestionAnswering.from_pretrained('roberta-base', return_dict = True)

optimizer = Adam(model.parameters(), lr=1e-5)

**2. Data**

- I will be using the **S**tanford **Qu**estion **A**nswering **D**ataset (**SQuAD**).
- SQuAD is a pre-cleaned question answering dataset, but I will apply a few changes to get correct answer alignments.
- You can explore the dataset [here](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/), or download it from tfds, Hugging Face datasets, or Kaggle.
- The goal is to find, for each question, a span of text in a paragraph that answers that question.
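
To make the span idea concrete, here is a minimal sketch of how a SQuAD-style record ties an answer to its context through a character offset. The record is a made-up example, not a row from the dataset; only the field names (`context`, `question`, `answers` with `text` and `answer_start`) mirror the schema shown later in this notebook.

```python
# A hypothetical record in the same shape as the `squad` schema.
example = {
    "context": "RoBERTa was released by Facebook AI in 2019.",
    "question": "Who released RoBERTa?",
    "answers": {"text": ["Facebook AI"], "answer_start": [24]},
}

answer_text = example["answers"]["text"][0]
start = example["answers"]["answer_start"][0]
end = start + len(answer_text)  # exclusive character index

# The answer is literally a slice of the context.
span = example["context"][start:end]
print(span)
```

The model never generates text; it only has to point at where this slice starts and ends.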

In [None]:
# Load the SQuAD dataset from Hugging Face datasets
from datasets import load_dataset

# Load small train and validation slices to keep training time manageable
train_data, valid_data = load_dataset('squad', split='train[:200]'), load_dataset('squad', split='validation[:1%]')


In [5]:
# Check the shape of the training split (rows, columns)
train_data.shape

(200, 5)

- Getting correct answer text alignment and tokenizing the dataset

In [6]:
# Dataset cleaning and tokenization
# RobertaTokenizerFast is required because the slow (pure-Python) tokenizers do not provide char_to_token

def correct_alignment(context, answer):
    """Correct the character alignment of a SQuAD answer, which is sometimes off by one or two characters, and compute its end index.

    inputs: a context string and its answer dict (with 'text' and 'answer_start' lists)
    outputs: (start_idx, end_idx) character indices of the answer in the context; end_idx is exclusive
    """
    answer_text = answer['text'][0]
    start_idx = answer['answer_start'][0]
    end_idx = start_idx + len(answer_text)

    # When alignment is okay
    if context[start_idx:end_idx] == answer_text:
        return start_idx, end_idx
    # When alignment is off by 1 character
    elif context[start_idx-1:end_idx-1] == answer_text:
        return start_idx-1, end_idx-1
    # When alignment is off by 2 characters
    elif context[start_idx-2:end_idx-2] == answer_text:
        return start_idx-2, end_idx-2
    else:
        raise ValueError(f'Could not align answer {answer_text!r} near character {start_idx}')


# Tokenize our training dataset
def convert_to_features(example_batch):
    """Tokenize the contexts and questions, then add the token-level start_positions and end_positions computed with correct_alignment.

    inputs: a batch (dict of lists) of contexts, questions and answers
    outputs: the encodings, updated with start_positions and end_positions
    """
    # Tokenize contexts and questions (as pairs of inputs); the context is the first sequence
    encodings = tokenizer(example_batch['context'], example_batch['question'], truncation=True)

    # Compute start and end token labels using the fast tokenizer's alignment methods.
    # Note: char_to_token returns None if the answer was cut off by truncation.
    start_positions, end_positions = [], []
    for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):
        start_idx, end_idx = correct_alignment(context, answer)
        start_positions.append(encodings.char_to_token(i, start_idx))
        # end_idx is exclusive, so the last answer character is at end_idx - 1
        end_positions.append(encodings.char_to_token(i, end_idx-1))
    # Update encodings
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

    return encodings
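
The character-to-token step above can be sketched without the library. This is a toy illustration under stated assumptions: `whitespace_offsets` and `char_to_token_toy` are hypothetical helpers that split on whitespace, whereas the real tokenizer uses byte-level BPE and its own `char_to_token` method. The sketch only shows why the notebook looks up `end_idx - 1` rather than `end_idx`.

```python
def whitespace_offsets(text):
    """Toy tokenizer: (start, end) character spans of whitespace-separated tokens."""
    offsets, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets


def char_to_token_toy(offsets, char_idx):
    """Index of the token whose span contains char_idx, or None if there is none."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None


context = "RoBERTa was released by Facebook AI in 2019."
answer_start, answer_text = 24, "Facebook AI"
answer_end = answer_start + len(answer_text)  # exclusive

offsets = whitespace_offsets(context)
start_token = char_to_token_toy(offsets, answer_start)
# Look up the last character of the answer (end - 1), as the notebook does.
end_token = char_to_token_toy(offsets, answer_end - 1)
print(start_token, end_token)

# The exclusive end index lands on the space after "AI", which belongs to no token.
print(char_to_token_toy(offsets, answer_end))
```

Because the exclusive end index usually points at whitespace or punctuation after the answer, looking it up directly would often return `None`.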

In [None]:
# Apply convert_to_features to both datasets with map (batched), which is faster than looping over examples.

Training_encoded = train_data.map(convert_to_features, batched=True)
Validation_encoded = valid_data.map(convert_to_features, batched=True)

In [8]:
# Our encoded dataset has some columns we don't need
Training_encoded.features

{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'end_positions': Value(dtype='int64', id=None),
 'id': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'question': Value(dtype='string', id=None),
 'start_positions': Value(dtype='int64', id=None),
 'title': Value(dtype='string', id=None)}

In [9]:
# Format our encoded datasets to output torch.Tensor objects for training our PyTorch model

columns = ['input_ids', 'attention_mask', 'start_positions', 'end_positions']
Training_encoded.set_format(type='torch', columns=columns)
Validation_encoded.set_format(type='torch', columns=columns)

In [10]:
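column_names = ['answers', 'context', 'id', 'question', 'title']

# remove_columns returns a new dataset; the in-place remove_columns_ was deprecated in later releases of datasets
Validation_encoded = Validation_encoded.remove_columns(column_names)
Training_encoded = Training_encoded.remove_columns(column_names)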


- Loading the tensor data into dataloader.

In [11]:
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Instantiate a PyTorch Dataloader around our dataset
# Let's do dynamic batching (pad on the fly with our own collate_fn)
def collate_fn(examples):
    return tokenizer.pad(examples, return_tensors='pt')

dataloader_val = DataLoader(Validation_encoded, collate_fn=collate_fn, batch_size=20, sampler=SequentialSampler(Validation_encoded))
dataloader = DataLoader(Training_encoded, collate_fn=collate_fn, batch_size=20, sampler=RandomSampler(Training_encoded))
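
As a rough illustration of what padding on the fly means, the sketch below pads each batch only to its own longest sequence. `pad_batch` is a hypothetical stand-in written in plain Python, not the code behind `tokenizer.pad`; the token ids are made up, and `pad_id=1` is assumed to match the pad token id of `roberta-base`.

```python
def pad_batch(sequences, pad_id):
    """Pad each list of token ids to the longest sequence in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sequences of different lengths.
batch = pad_batch([[0, 713, 16, 2], [0, 713, 2]], pad_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed global length keeps short batches short, which saves memory and compute.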


In [12]:
# Setting the seed for generating random numbers
import random

seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

**3. Training and evaluating the model**

**Inputs/parameters**. Here are the [explanations](https://huggingface.co/transformers/glossary.html#attention-mask) of what these parameters represent.

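*  input_ids - Token IDs, i.e. the vocabulary indices of the tokens in the input sequence.
*  attention_mask - Marks which positions should be attended to, i.e. inputs that are not padding.
*  token_type_ids (also called segment_ids) - In BERT, these mark whether a token belongs to the first or the second sequence of the pair (here, the context or the question). RoBERTa does not use them, so they are not passed to the model in this notebook.
*  start_positions and end_positions - Token indices of the start and end of the answer span.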

**Outputs**
* start_logits - Scores (before SoftMax) for each token being the start of the answer span. (torch.FloatTensor of shape (batch_size, sequence_length))
* end_logits - Scores (before SoftMax) for each token being the end of the answer span. (torch.FloatTensor of shape (batch_size, sequence_length))
* Other return values are the loss (cross-entropy loss), plus hidden states and attentions when requested.
* The start loss is calculated by comparing the correct start_positions with the start_logits from the question answering class.
* The end loss is calculated by comparing the correct end_positions with the end_logits from the question answering class.
* The two losses are added and then divided by two.
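
The loss described above can be worked through by hand for a single example. This is a plain-Python sketch with made-up logits, not the library code; `cross_entropy` here is a hypothetical helper that computes the negative log-softmax of the target position, which is what a cross-entropy loss reduces to for one example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target position (single example)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Made-up logits for a 4-token sequence; the true answer spans tokens 1..2.
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [-0.5, 0.2, 1.5, 0.0]
start_position, end_position = 1, 2

start_loss = cross_entropy(start_logits, start_position)
end_loss = cross_entropy(end_logits, end_position)
total_loss = (start_loss + end_loss) / 2
print(round(total_loss, 4))
```

In the notebook, `outputs.loss` is this quantity averaged over the examples in the batch.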

In [20]:
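# Validation function for the model

def model_validation(dataloader_val):
    """Return the validation loss summed over all batches (not averaged, so it is not directly comparable to the average training loss)."""

    model.eval().to(device)

    val_total_loss = 0
    for batch in dataloader_val:

        batch.to(device)
        # No gradients are needed for evaluation
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        val_total_loss += loss.item()
    return val_total_loss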



In [21]:
# Import progress bar and learning-rate scheduler
from tqdm.notebook import tqdm
from transformers import get_linear_schedule_with_warmup

#Clear cache before running model
torch.cuda.empty_cache()

epochs = 10
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader)*epochs) 

device = 'cuda' if torch.cuda.is_available() else 'cpu'

for epoch in tqdm(range(1, epochs+1)):
    
    model.train().to(device)
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()    

        batch.to(device)  

        outputs = model(**batch)
        
        loss = outputs.loss
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
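        # outputs.loss is already averaged over the batch; len(batch) would count the keys of the encoding, not the examples
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})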
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader)            
    tqdm.write(f'Training loss: {round(loss_train_avg, 2)}')
    
    val_loss = model_validation(dataloader_val)
    tqdm.write(f'Validation loss: {round(val_loss, 2)}')

Epoch 1
Training loss: 4.04
Validation loss: 21.68


Epoch 2
Training loss: 3.63
Validation loss: 20.26


Epoch 3
Training loss: 3.19
Validation loss: 18.7


Epoch 4
Training loss: 2.69
Validation loss: 16.57


Epoch 5
Training loss: 2.2
Validation loss: 15.2


Epoch 6
Training loss: 1.84
Validation loss: 14.56


Epoch 7
Training loss: 1.59
Validation loss: 14.14


Epoch 8
Training loss: 1.55
Validation loss: 13.77


Epoch 9
Training loss: 1.38
Validation loss: 13.41


Epoch 10
Training loss: 1.34
Validation loss: 13.33
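
The training loop uses a linear learning-rate schedule with zero warmup steps, so the learning rate shrinks a little after every optimizer step. The sketch below reproduces that decay in plain Python to show what rate each step sees. `linear_schedule_factor` is a hypothetical helper written for illustration, not the library function, and the step count assumes 10 batches per epoch (200 examples at batch size 20) over 10 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 10 * 10  # batches per epoch * epochs

for step in (0, 50, 99, 100):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

By the last epochs the learning rate is close to zero, which is one reason the loss improves more slowly toward the end of training.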



- Training on more data would likely improve this model substantially, but my CPU and GPU resources do not allow it at the moment.
