## Fine-tuning the smaller pre-trained 'distilbert-base-uncased' model for question answering
This standalone notebook lets you fine-tune the pre-trained 'distilbert-base-uncased' model for extractive question answering. Note that the base model has not been fine-tuned for any task, so you need to supply your own data and carry out fine-tuning from scratch.
The notebook requires a .csv file with the following column headers:
- 'context' - This is the text which you are trying to extract the answer from
- 'question' - This is the question being asked
- 'answer' - This is the answer, which must be in the 'context' character for character
- 'answer_start' - This is the start character of the 'answer' within the 'context'
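
As a rough illustration of this layout, the sketch below builds one row and writes it out as CSV text using only the standard library. The helper name `make_row` and the sample sentence are made up for the example; `answer_start` is derived with `str.find`, which assumes the first occurrence of the answer is the intended one.

```python
import csv
import io

FIELDS = ["context", "question", "answer", "answer_start"]

def make_row(context, question, answer):
    """Build one row; answer_start is the 0-based index of the answer in the context."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer must appear verbatim in context")
    return {"context": context, "question": question,
            "answer": answer, "answer_start": start}

# Hypothetical example data, not taken from any real dataset
rows = [make_row("The Eiffel Tower is in Paris.",
                 "Where is the Eiffel Tower?",
                 "Paris")]

# Write to an in-memory buffer; swap in open('my_data.csv', 'w', newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Slicing the context from `answer_start` for the length of the answer should give back the answer exactly, which is the property the rest of the notebook relies on.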

The notebook takes the .csv as input and carries out the following steps:
- prepares the data for fine-tuning 'distilbert-base-uncased'
- tokenises the data
- trains the model with an AdamW optimizer using the PyTorch library
- carries out validation using a separate, carved-out validation dataset
- saves the model
- enables prediction on your own data using the new fine-tuned model

In [4]:
import pandas as pd
import transformers
import torch
from tqdm import tqdm
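from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering
# AdamW is taken from torch.optim; the implementation in transformers is deprecated
from torch.optim import AdamW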
from torch.utils.data import DataLoader

In [5]:
# Load pre-trained model
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
# Load the matching tokenizer - the fast version is needed for char_to_token later on
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

In [6]:
# Load a small dataset for fine tuning

# Needs to have -
# 'context' - Text where the answer will be extracted
# 'question' - The question itself
# 'answer' - The answer, which must be in the context text
# 'answer_start' - This is the start character for the answer

# Replace with your own dataset

datasets = pd.read_csv('/home/malmason/datasets/squad_csv/SQuAD_csv_sm.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'SQuAD_csv_sm.csv'

In [None]:
datasets.head()
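
Before going further it can help to confirm that the file really has the four expected headers. The following is a minimal standard-library sketch; `REQUIRED_COLUMNS` and `missing_columns` are names invented for this example, and the in-memory sample stands in for a real file.

```python
import csv
import io

REQUIRED_COLUMNS = {"context", "question", "answer", "answer_start"}

def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# In-memory stand-in for a real file, deliberately missing 'answer_start'
sample = io.StringIO("context,question,answer\nsome text,a question?,text\n")
missing = missing_columns(sample)
print(sorted(missing))
```

With pandas already loaded, the same check could be written against `datasets.columns`.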

In [None]:
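# Remove rows where the answer does not appear verbatim in the context
# (values are cast to str so that missing/NaN entries do not raise an error)
mask = [str(a) in str(c) for a, c in zip(datasets['answer'], datasets['context'])]
# Reset the index so that positional lookups such as data_contexts[0] still work
datasets = datasets[mask].reset_index(drop=True)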

In [None]:
data_answers = []
temp_data = {}
for answer, answer_start in zip(datasets.answer, datasets.answer_start):
    temp_data['text'] = str(answer)
    temp_data['answer_start'] = answer_start
    dict_copy = temp_data.copy()
    data_answers.append(dict_copy)

In [None]:
data_contexts = datasets.context
data_questions = datasets.question

In [None]:
# Select the first 80000 examples for training; the remainder (approx 6000) are used for validation
train_answers = data_answers[:80000]
val_answers = data_answers[80000:]
train_contexts = data_contexts[:80000]
val_contexts = data_contexts[80000:]
train_questions = data_questions[:80000]
val_questions = data_questions[80000:]
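
The fixed cut-off of 80000 assumes a dataset of roughly 86000 rows; with a smaller file the validation slices would come out empty. One possible alternative, sketched here with a hypothetical helper `split_train_val`, is to split by fraction instead.

```python
def split_train_val(items, val_fraction=0.1):
    """Split a sequence into (train, val), keeping the original order."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    # Always keep at least one validation example
    n_val = max(1, int(len(items) * val_fraction))
    cut = len(items) - n_val
    return items[:cut], items[cut:]

# Toy data standing in for data_answers / data_contexts / data_questions
examples = list(range(20))
train, val = split_train_val(examples, val_fraction=0.25)
print(len(train), len(val))
```

Applying the same fraction to the answers, contexts and questions keeps the three collections aligned, because each is cut at the same position.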

In [None]:
print(data_contexts[0], data_questions[0], data_answers[0])

In [None]:
def add_end_idx(answers, contexts):
    # loop through each answer-context pair
    for answer, context in zip(answers, contexts):
        # gold_text refers to the answer we are expecting to find in context
        gold_text = answer['text']

        # we already know the start index
        start_idx = answer['answer_start']
        start_idx = int(start_idx)
        
        # and ideally this would be the end index...
        end_idx = start_idx + len(gold_text)

        # ...however, sometimes SQuAD answers are off by a character or two
        if context[start_idx:end_idx] == gold_text:
            # if the answer is not off :)
            answer['answer_end'] = end_idx
        else:
            # this means the answer is off by 1-2 characters
            for n in [1, 2]:
                if context[start_idx-n:end_idx-n] == gold_text:
                    answer['answer_start'] = start_idx - n
                    answer['answer_end'] = end_idx - n


In [None]:
# and apply the function to our two answer lists
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
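
To see the realignment at work on a single example, here is a small standalone sketch. `align_answer` is a hypothetical, simplified variant of `add_end_idx` that returns the corrected span rather than updating a dictionary; the sentence is invented for the illustration.

```python
def align_answer(context, text, start):
    """Return (start, end) for text in context, trying shifts of 0, 1 and 2 characters back."""
    end = start + len(text)
    for shift in (0, 1, 2):
        if start - shift >= 0 and context[start - shift:end - shift] == text:
            return start - shift, end - shift
    # No match within two characters: the caller has to handle this case
    return None

context = "Normandy is a region in France."
print(align_answer(context, "France", 24))  # labelled start is exact
print(align_answer(context, "France", 26))  # labelled start is two characters late
print(align_answer(context, "France", 30))  # too far off to be recovered
```

The last call returns `None`, which corresponds to the case where `add_end_idx` leaves `answer_end` unset.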

In [None]:
# Verify answer_end is there, as can sometimes be missing if answer not where expected
count = 0
for answer in (train_answers):
    if 'answer_end' not in answer:
        print(answer, count)
    count +=1
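
Any answer printed by this check has no `answer_end`, and would raise a `KeyError` in the token-position step further down. One way to handle this, shown as a sketch with a hypothetical helper `drop_unaligned`, is to filter those examples out of all three lists at once.

```python
def drop_unaligned(answers, contexts, questions):
    """Keep only the examples whose answer dict has an 'answer_end' key."""
    kept = [(a, c, q) for a, c, q in zip(answers, contexts, questions)
            if 'answer_end' in a]
    if not kept:
        return [], [], []
    kept_answers, kept_contexts, kept_questions = (list(col) for col in zip(*kept))
    return kept_answers, kept_contexts, kept_questions

# Toy data: the second answer could not be aligned, so it has no 'answer_end'
answers = [{'text': 'a', 'answer_start': 0, 'answer_end': 1},
           {'text': 'b', 'answer_start': 5}]
contexts = ['a b', 'c d']
questions = ['q1', 'q2']

answers, contexts, questions = drop_unaligned(answers, contexts, questions)
print(len(answers), contexts, questions)
```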

In [None]:
# Convert data to lists
train_contexts = list(train_contexts)
train_questions = list(train_questions)
val_contexts = list(val_contexts)
val_questions = list(val_questions)
train_answers = list(train_answers)
val_answers = list(val_answers)

In [None]:
# Tokenise train and val encodings using DistilBertTokenizerFast
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [None]:
def add_token_positions(encodings, answers):
    # initialize lists to contain the token indices of answer start/end
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # append start/end token position using char_to_token method
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        # end position cannot be found, char_to_token found space, so shift position until found
        shift = 1
        while end_positions[-1] is None:
            end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - shift)
            shift += 1
    # update our encodings object with the new token-based start/end positions
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

In [None]:
# apply function to data
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
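
The shifting logic for the end position is easier to follow on a toy example. The sketch below does not use the real tokenizer: it splits on whitespace and mimics the behaviour of `char_to_token` returning `None` when a character index falls on a space. All function names are stand-ins invented for the illustration.

```python
import re

def token_spans(text):
    """Character spans (start, end) of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def toy_char_to_token(spans, char_idx):
    """Index of the token covering char_idx, or None if it falls on whitespace."""
    for i, (start, end) in enumerate(spans):
        if start <= char_idx < end:
            return i
    return None

def end_token(spans, answer_end):
    """Step back one character at a time until a token is found."""
    shift = 0
    while answer_end - shift >= 0:
        token = toy_char_to_token(spans, answer_end - shift)
        if token is not None:
            return token
        shift += 1
    return None

context = "the cat sat on the mat"
answer = "cat sat"
start_char = context.find(answer)
end_char = start_char + len(answer)  # exclusive, so it points at the space after 'sat'

spans = token_spans(context)
print(toy_char_to_token(spans, start_char))  # token index of 'cat'
print(toy_char_to_token(spans, end_char))    # None, the index falls on a space
print(end_token(spans, end_char))            # token index of 'sat' after shifting back
```

A real WordPiece tokenizer splits words into sub-word pieces, so the indices differ, but the reason for stepping backwards is the same.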

In [None]:
# Take a look at keys
train_encodings.keys()

In [None]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

In [None]:
# Get train and val encodings
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [None]:
# Set device to GPU if it exists
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize adam optimizer
optim = AdamW(model.parameters(), lr=5e-5)

# initialize data loader with batch size that will fit GPU
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for epoch in range(10):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # reset gradients left over from the previous step
        optim.zero_grad()
        # get inputs and send to GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # get loss
        loss = outputs[0]
        loss.backward()
        # update weights
        optim.step()
        # show training loss
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

In [None]:
model.eval()

#val_sampler = SequentialSampler(val_dataset)
val_loader = DataLoader(val_dataset, batch_size=16)

acc = []

# initialize loop for progress bar
loop = tqdm(val_loader)
# loop through batches
for batch in loop:
    # we don't need to calculate gradients as we're not training
    with torch.no_grad():
        # pull batched items from loader
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_true = batch['start_positions'].to(device)
        end_true = batch['end_positions'].to(device)
        # make predictions
        outputs = model(input_ids, attention_mask=attention_mask)
        # pull preds out
        start_pred = torch.argmax(outputs['start_logits'], dim=1)
        end_pred = torch.argmax(outputs['end_logits'], dim=1)
        # calculate accuracy for both and append to accuracy list
        acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
        acc.append(((end_pred == end_true).sum()/len(end_pred)).item())
# calculate average accuracy in total
acc = sum(acc)/len(acc)

In [None]:
print(acc)
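
The figure above scores start and end positions separately and averages per batch. A stricter view, sketched below in plain Python with invented numbers, counts a prediction as correct only when both positions match. `position_accuracy` and `span_accuracy` are hypothetical helper names.

```python
def position_accuracy(true_spans, pred_spans):
    """Share of start and end positions predicted correctly, counted separately."""
    hits = sum((t[0] == p[0]) + (t[1] == p[1]) for t, p in zip(true_spans, pred_spans))
    return hits / (2 * len(true_spans))

def span_accuracy(true_spans, pred_spans):
    """Share of examples where both start and end positions are correct."""
    hits = sum(t == p for t, p in zip(true_spans, pred_spans))
    return hits / len(true_spans)

# Made-up (start, end) token positions for four examples
true_spans = [(3, 5), (10, 12), (7, 7), (0, 1)]
pred_spans = [(3, 5), (10, 11), (7, 7), (2, 4)]

print(position_accuracy(true_spans, pred_spans))
print(span_accuracy(true_spans, pred_spans))
```

The span-level figure can never be higher than the position-level one, so it gives a more conservative picture of how often the full answer is right.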

In [None]:
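# Compare true and predicted token positions for the final validation batch
print("T/F\tstart\tend\n")
for i in range(len(start_true)):
    print(f"true\t{start_true[i].item()}\t{end_true[i].item()}\n"
          f"pred\t{start_pred[i].item()}\t{end_pred[i].item()}\n")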

In [None]:
# save model
model_path = 'models/distilbert-custom'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

## Use model for predictions

In [None]:
from pdfminer.high_level import extract_text
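import nltk
# sent_tokenize needs the Punkt data; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')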

In [None]:
model = DistilBertForQuestionAnswering.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

In [None]:
filename = 'DEMO_VitalibisInc_20180316_8-K_EX-10.2_11100168_EX-10.2_Hosting Agreement.pdf'
doc = extract_text(filename)

In [None]:
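# Replace line breaks and page breaks with spaces so words are not glued together
book = doc.replace("\n", " ").replace("\x0c", " ")
# Collapse any run of whitespace into a single space
book = " ".join(book.split())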

In [None]:
sent_corpus = nltk.sent_tokenize(book)

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

In [None]:
def question_answer(question, sent_corpus):
    # Track the best-scoring sentence; scores are raw logits, not probabilities
    max_score = float('-inf')
    answer, text_answer = None, None

    # Loop through sentences
    for sent in sent_corpus:

        # Convert text to string
        text = str(sent)

        # Tokenise the text and question in the same (context, question) order used for training
        inputs = tokenizer(text, question, add_special_tokens=True, max_length=512, truncation=True, return_tensors="pt").to(device)
        input_ids = inputs["input_ids"].tolist()[0]

        # Run the tokenised input through the fine-tuned model without tracking gradients
        with torch.no_grad():
            outputs = model(**inputs)

        # Get start and end logits for the sentence from the model output
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

        # Get the most likely start token and the (exclusive) end token
        answer_start = torch.argmax(answer_start_scores)
        answer_end = torch.argmax(answer_end_scores) + 1

        # Sum the maximum start and end logits to score this sentence
        score = (torch.max(answer_start_scores) + torch.max(answer_end_scores)).item()

        # Keep the prediction if it scores higher than the best so far
        if score > max_score:
            max_score = score

            # Convert answer tokens to string
            answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
            # Store the sentence the answer was taken from
            text_answer = text

    print('BERT Answer:\n------------\n', answer, '\n\nSentence:\n---------\n', text_answer)

In [None]:
question_answer('Which two parties is the agreement between?', sent_corpus)
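
Taking the argmax of the start and end logits independently can produce an end position that comes before the start, which yields an empty answer. A common remedy is to search for the best valid pair instead. The sketch below works on plain lists of made-up logits; `best_span` and `max_answer_len` are names assumed for the example, and this is not part of the notebook's own method.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) maximising start+end logit with start <= end."""
    best = (float("-inf"), 0, 0)
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best

# Made-up logits: independent argmax would give start=3, end=0 (an invalid span)
start_logits = [0.1, 2.0, 0.3, 3.0]
end_logits = [4.0, 0.2, 3.0, 0.1]

score, start, end = best_span(start_logits, end_logits)
print(score, start, end)
```

With tensors from the model, the logits could be converted using `.tolist()` before calling such a helper, and the end index is inclusive here, so the token slice would run to `end + 1`.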