## Carry out Q&A on PDF documents
This notebook carries out question answering on PDF files. It can equally be used with plain-text documents, simply by loading the text directly instead of extracting it from a PDF.

The notebook itself does the following:
- reads in the PDF file identified as 'filename'
- splits the document into sentences using the NLTK library, because the model accepts at most 512 tokens per input
- attempts to answer the question against each sentence, recording the highest start and end scores (logits) for each sentence
- presents the answer from the sentence whose highest start and end scores sum to the largest value across all sentences
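
The selection step in the last two bullets can be sketched without the model, assuming each sentence has already produced one list of start scores and one list of end scores. The function name `pick_best_sentence` and the toy scores are hypothetical and only illustrate the idea.

```python
def pick_best_sentence(scored_sentences):
    """Return (sentence_index, start, end, score) for the sentence whose
    highest start score plus highest end score is largest.

    scored_sentences: list of (start_scores, end_scores) pairs, one pair per
    sentence, each a list of floats with one score per token.
    Returns None if the list is empty.
    """
    best = None
    for idx, (start_scores, end_scores) in enumerate(scored_sentences):
        # Position of the highest start score and the highest end score
        start = max(range(len(start_scores)), key=start_scores.__getitem__)
        end = max(range(len(end_scores)), key=end_scores.__getitem__)
        score = start_scores[start] + end_scores[end]
        # Keep the sentence with the largest summed score seen so far
        if best is None or score > best[3]:
            best = (idx, start, end, score)
    return best


# Toy scores (made up for illustration): two sentences of three tokens each
toy_scores = [
    ([0.1, 2.0, 0.3], [0.2, 0.1, 1.5]),
    ([4.0, 0.5, 0.2], [0.3, 3.0, 0.1]),
]
print(pick_best_sentence(toy_scores))  # (1, 0, 1, 7.0)
```

The scores are raw model outputs rather than normalised probabilities, so the comparison across sentences is a heuristic ranking.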

The model is a pre-trained and fine-tuned version of BERT large, available from the Hugging Face transformers library. The 'bert-large-uncased-whole-word-masking-finetuned-squad' model is pre-trained using masked language modelling and next sentence prediction. It is further fine-tuned on the Stanford SQuAD dataset, which contains around 100,000 questions and answers.

The model can be further fine-tuned on your own dataset using the '2F BERT DEMO BERT_LARGE FT using csv files.ipynb' notebook.

In [None]:
# Load libraries
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import nltk
from pdfminer.high_level import extract_text

In [None]:
# Select model we will use
model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'

# Loads the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loads the fine tuned model for Question Answering
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

In [None]:
# set document to be loaded as filename
filename = '2D DEMO_VitalibisInc_20180316_8-K_EX-10.2_11100168_EX-10.2_Hosting Agreement.pdf'
# Use pdfminer to extract text from pdf
doc = extract_text(filename)

In [None]:
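# Remove characters not needed for prediction
# Replace newlines and form feeds with a space so words on adjacent lines are not fused
book = doc.replace("\n", " ")
book = book.replace("\x0c", " ")
# Collapse any run of whitespace into a single space
book = " ".join(book.split())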
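
The clean-up step can also be written with a single regular expression, assuming it is acceptable to collapse every run of whitespace (newlines, form feeds, tabs, repeated spaces) into one space. The helper name `normalise_whitespace` and the sample string are hypothetical and not taken from the PDF.

```python
import re


def normalise_whitespace(raw_text):
    # \s matches spaces, tabs, newlines and form feeds (\x0c),
    # so each run of them becomes a single space
    return re.sub(r"\s+", " ", raw_text).strip()


# Made-up sample resembling raw text extracted from a PDF
sample = "HOSTING  AGREEMENT\nThis Agreement\x0cis made"
print(normalise_whitespace(sample))  # HOSTING AGREEMENT This Agreement is made
```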

In [None]:
# The NLTK 'punkt' sentence tokenizer models only need to be downloaded once
# (newer NLTK releases may ask for nltk.download('punkt_tab') instead)
nltk.download('punkt')

In [None]:
# tokenise document into sentences
sent_corpus = nltk.sent_tokenize(book)

In [None]:
# Move model to GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [None]:
def question_answer(question, sent_corpus):
    # Best summed score seen so far, with the answer and sentence it came from
    max_score = float("-inf")
    answer, text_answer = "", ""

    # Loop through sentences
    for sent in sent_corpus:

        # Convert text to string
        text = str(sent)

        # Tokenise the question and text
        inputs = tokenizer(question, text, add_special_tokens=True, max_length=512, truncation=True, return_tensors="pt").to(device)
        input_ids = inputs["input_ids"].tolist()[0]

        # Run the tokenised text through the fine-tuned question answering model
        # (no gradients are needed for inference)
        with torch.no_grad():
            outputs = model(**inputs)

        # Get start and end scores (logits) for each token from the model output
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

        # Get locations of the maximum start and end scores (end is exclusive)
        answer_start = torch.argmax(answer_start_scores)
        answer_end = torch.argmax(answer_end_scores) + 1

        # Sum the maximum start and end scores
        score = (torch.max(answer_start_scores) + torch.max(answer_end_scores)).item()

        # Check if the score for this sentence is higher than previously recorded
        if score > max_score:
            max_score = score

            # Convert answer tokens to string
            answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
            # Store the sentence the answer was derived from
            text_answer = text

    print('BERT Answer:\n------------\n', answer, '\n\nSentence:\n---------\n', text_answer)
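
Taking the highest start score and the highest end score independently can place the end before the start, in which case the extracted answer is empty. A minimal sketch of one common alternative is shown below on plain lists: it searches for the best pair with the start at or before the end and caps the span length. The function name `best_valid_span` and the `max_answer_len` parameter are hypothetical and not part of this notebook.

```python
def best_valid_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start_scores[start] + end_scores[end]
    subject to start <= end < start + max_answer_len. `end` is inclusive.
    Returns None if the score lists are empty."""
    best = None
    for start, start_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length cap
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Toy scores (made up): the independent maxima are start=2 and end=0, which is invalid
start_scores = [0.1, 0.2, 5.0, 0.3]
end_scores = [4.0, 0.1, 0.2, 3.0]
print(best_valid_span(start_scores, end_scores))  # (2, 3, 8.0)
```

With model outputs, the lists would come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens would be `input_ids[start:end + 1]`.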

In [None]:
question_answer('When is the agreement made?', sent_corpus)

In [None]:
question_answer('Which two parties is the agreement between?', sent_corpus)

In [None]:
question_answer('Who is the licensee?', sent_corpus)

In [None]:
question_answer("What is the address of Vitalibis Inc?", sent_corpus)

In [None]:
question_answer("What are the services provided?", sent_corpus)

In [None]:
question_answer("Are there any Additional Services?", sent_corpus)

In [None]:
question_answer("How much notice do the parties have to give?", sent_corpus)

In [None]:
question_answer("How long is the agreement for?", sent_corpus)