## Carry out Q&A on PDF documents
This code is designed to carry out question answering on PDF files. It can equally be used with plain-text documents by loading the text directly instead of extracting it from a PDF.
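
As a minimal sketch of the plain-text route, the snippet below reads a text file into a single string. `load_text` is a hypothetical helper and the sample sentence is made up for the demonstration; neither is part of the notebook.

```python
import tempfile
from pathlib import Path

def load_text(path, encoding="utf-8"):
    """Read a plain-text document into a single string."""
    return Path(path).read_text(encoding=encoding)

# Demonstrate with a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("This Agreement is made on March 7th, 2018.", encoding="utf-8")
    doc = load_text(sample_path)

print(doc)
```

The resulting `doc` string would take the place of the output of `extract_text(filename)` in the cells that follow.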

The notebook itself does the following:
- reads in the PDF file identified as 'filename'
- splits the document into sentences using the nltk library, because the model accepts at most 512 tokens per input
- attempts to answer the question against each sentence, recording the highest start and end token scores for each sentence
- presents the answer from the sentence whose highest start and end scores sum to the largest value across all sentences
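
The selection rule in the last two steps can be sketched without loading a model. The sketch assumes each sentence has already produced one start score and one end score per token. The numbers are made up, and `best_sentence` is a hypothetical helper rather than a function from the notebook.

```python
def best_sentence(scored_sentences):
    """Pick the sentence whose best start score plus best end score is largest.

    scored_sentences: list of (sentence, start_scores, end_scores) tuples,
    where the two score lists hold one value per token.
    Returns (sentence, start_index, end_index, score), or None for an empty list.
    """
    best = None
    for sentence, start_scores, end_scores in scored_sentences:
        # Position of the highest start score and of the highest end score
        start = max(range(len(start_scores)), key=start_scores.__getitem__)
        end = max(range(len(end_scores)), key=end_scores.__getitem__)
        score = start_scores[start] + end_scores[end]
        # Keep this sentence only if it beats the best one seen so far
        if best is None or score > best[3]:
            best = (sentence, start, end, score)
    return best

# Made-up scores for two three-token "sentences"
example = [
    ("first sentence here", [0.1, 2.0, 0.3], [0.2, 0.1, 1.5]),
    ("second sentence here", [4.0, 0.5, 0.2], [0.3, 3.5, 0.1]),
]
result = best_sentence(example)
print(result)
```

Here the second sentence wins because its best start and end scores sum to more than those of the first.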

The notebook uses a pre-trained and fine-tuned version of BERT large, available from the Hugging Face transformers library. The 'bert-large-uncased-whole-word-masking-finetuned-squad' model is pre-trained using masked language modelling and next sentence prediction. It is further fine-tuned on the Stanford SQuAD dataset, which contains around 100,000 questions and answers.

The model can be further fine-tuned on your own dataset using the '2F BERT DEMO BERT_LARGE FT using csv files.ipynb' notebook.

In [1]:
# Load libraries
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import nltk
from pdfminer.high_level import extract_text

In [2]:
# Select model we will use
model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'

# Loads the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loads the fine tuned model for Question Answering
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

In [3]:
# set document to be loaded as filename
filename = '2D DEMO_VitalibisInc_20180316_8-K_EX-10.2_11100168_EX-10.2_Hosting Agreement.pdf'
# Use pdfminer to extract text from pdf
doc = extract_text(filename)

In [4]:
# Remove characters not needed to predict
book = doc.replace("\n" , "")
book = book.replace("\x0c", "")
book = book.replace("  ", " ")
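
Note that deleting newlines outright joins the last word of one line to the first word of the next, which is a plausible cause of run-together words such as "theServices" in the answers further down. A minimal alternative sketch follows, assuming it is acceptable to replace every run of whitespace (newlines, form feeds, repeated spaces) with a single space. `normalise_whitespace` is a hypothetical helper and the sample string is made up.

```python
import re

def normalise_whitespace(raw_text):
    """Collapse every run of whitespace into one space and trim the ends.

    In Python's `re`, \\s matches newlines and form feeds (\\x0c) as well as spaces,
    so words on adjacent lines stay separated instead of being glued together.
    """
    return re.sub(r"\s+", " ", raw_text).strip()

# Made-up sample containing a line break, a form feed and a double space
sample = "usage and ongoing support of the\nServices only.\x0c  Next page"
cleaned = normalise_whitespace(sample)
print(cleaned)
```

One trade-off: a word that was hyphenated across a line break would keep a space after the hyphen, so this is not a complete clean-up for every PDF.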

In [5]:
# The NLTK 'punkt' sentence tokenizer models only need to be downloaded once
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/phil/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
# tokenise document into sentences
sent_corpus = nltk.sent_tokenize(book)

In [7]:
# Move the model to the GPU if one is available, otherwise keep it on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-12,

In [8]:
def question_answer(question, sent_corpus):
    # Best combined score so far, with the answer and sentence it came from
    max_score = float("-inf")
    answer = None
    text_answer = None

    # Loop through sentences
    for sent in sent_corpus:

        # Convert text to string
        text = str(sent)

        # Tokenise the question and text, truncating to the model's 512-token limit
        inputs = tokenizer(question, text, add_special_tokens=True, max_length=512, truncation=True, return_tensors="pt").to(device)
        input_ids = inputs["input_ids"].tolist()[0]

        # Run the model without tracking gradients, as this is inference only
        with torch.no_grad():
            outputs = model(**inputs)

        # Get the start and end logits (unnormalised scores, not probabilities) for each token
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

        # Get the positions of the highest start and end scores
        answer_start = int(torch.argmax(answer_start_scores))
        answer_end = int(torch.argmax(answer_end_scores)) + 1

        # Sum the highest start and end logits to score this sentence
        score = (torch.max(answer_start_scores) + torch.max(answer_end_scores)).item()

        # Check if the score for this sentence is higher than the best recorded so far
        if score > max_score:
            max_score = score

            # Convert answer tokens to string
            answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
            # Store the sentence the answer was derived from
            text_answer = text

    # Nothing to report if no sentences were supplied
    if answer is None:
        print('No sentences were provided, so no answer was found.')
        return

    print('BERT Answer:\n------------\n', answer, '\n\nSentence:\n---------\n', text_answer)
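
Because the start and end positions are chosen independently, the end position can land before the start position, which yields an empty answer. One common alternative is to search for the best pair with the end at or after the start. The sketch below is plain Python over lists of scores. `best_span`, its `max_answer_len` parameter and the numbers are illustrative assumptions, not part of the notebook.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start_scores[start] + end_scores[end]
    subject to start <= end < start + max_answer_len. `end` is inclusive.
    Returns None when the score lists are empty.
    """
    best = None
    for start, start_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length cap
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# Made-up scores where independent argmax would give start=2 and end=1
start_scores = [0.1, 0.2, 5.0, 0.3]
end_scores = [0.1, 4.0, 0.2, 3.0]
span = best_span(start_scores, end_scores)
print(span)
```

To try this inside `question_answer`, the score lists could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens sliced as `input_ids[start:end + 1]` since `end` is inclusive here.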

In [9]:
question_answer('When is the agreement made?', sent_corpus)

BERT Answer:
------------
 march 7th, 2018 

Sentence:
---------
 "RECITALS WHEREAS, Licensee and VOTOCAST have entered into a Services and Hosting Agreement (the "Agreement') dated March 7th, 2018(the "Effective Date").


In [10]:
question_answer('Which two parties is the agreement between?', sent_corpus)

BERT Answer:
------------
 licensee and votocast 

Sentence:
---------
 "RECITALS WHEREAS, Licensee and VOTOCAST have entered into a Services and Hosting Agreement (the "Agreement') dated March 7th, 2018(the "Effective Date").


In [11]:
question_answer('Who is the licensee?', sent_corpus)

BERT Answer:
------------
 vitalibis inc 

Sentence:
---------
 Examples include the following: ·General question such as "how-to"·Issue with little or no impact on business·Documentation issues·Issue is essentially resolved but remains open for Licensee confirmation   12 Source: VITALIBIS, INC., 8-K, 3/16/2018 EXHIBIT DFORM OF SOW STATEMENT OF WORK #XXXX TO SERVICES AND HOSTING AGREEMENT THIS STATEMENT OF WORK # XA2X TO SERVICES AND HOSTING AGREEMENT (this "Statement of Work") is made and entered intoas of <DATE> (the "SOW Effective Date"), by and between VITALIBIS INC. a Nevada C Corporation having its principal place of business at 5348Vegas Drive, Las Vegas NV 89108 (hereinafter, "Licensee"), and VOTOCAST, Inc., a California corporation (dba, newkleus), having its principalplace of business at PO Box 7302 Newport Beach, CA 92658 (hereinafter, "VOTOCAST").


In [12]:
question_answer("What is the address of vitalibis inc", sent_corpus)

BERT Answer:
------------
 5348vegas drive, las vegas nv 89108 

Sentence:
---------
 Examples include the following: ·General question such as "how-to"·Issue with little or no impact on business·Documentation issues·Issue is essentially resolved but remains open for Licensee confirmation   12 Source: VITALIBIS, INC., 8-K, 3/16/2018 EXHIBIT DFORM OF SOW STATEMENT OF WORK #XXXX TO SERVICES AND HOSTING AGREEMENT THIS STATEMENT OF WORK # XA2X TO SERVICES AND HOSTING AGREEMENT (this "Statement of Work") is made and entered intoas of <DATE> (the "SOW Effective Date"), by and between VITALIBIS INC. a Nevada C Corporation having its principal place of business at 5348Vegas Drive, Las Vegas NV 89108 (hereinafter, "Licensee"), and VOTOCAST, Inc., a California corporation (dba, newkleus), having its principalplace of business at PO Box 7302 Newport Beach, CA 92658 (hereinafter, "VOTOCAST").


In [13]:
question_answer("What are the services provided?", sent_corpus)

BERT Answer:
------------
 usage and ongoing support of theservices only 

Sentence:
---------
 3.Scone of Services: - Branded iOS Mobile App- Branded Android Mobile App- Application APIs- Application SDKs- Admin Database Unless specifically noted othenwise, the scope of all services provided by VOTOCAST is limited to the usage and ongoing support of theServices only, and does not include analysis, operation, integration, development, additional training or modification of other custom or OEMpackaged software applications, hardware or systems.


In [14]:
question_answer("Are there any Additional Services?", sent_corpus)

BERT Answer:
------------
 any additional services ( " additional services " ) to be provided byvotocast to licensee shall be described in a statement of work 

Sentence:
---------
 VOTOCAST and Licensee agree that any additional services ("Additional Services") to be provided byVOTOCAST to Licensee shall be described in a statement of work ("SOW').


In [15]:
question_answer("How much notice do the parties have to give?", sent_corpus)

BERT Answer:
------------
 at leastone hundred eighty ( 180 ) days 

Sentence:
---------
 This Agreement shall automatically renew beyond theInitial Term for successive one (I) year terms (each, a "Renewal Term"), unless a Party provides the other with written notice of termination at leastone hundred eighty (180) days prior to the expiration of the Initial Term or the then-current Renewal Term.


In [16]:
question_answer("How long is the agreement for?", sent_corpus)

BERT Answer:
------------
 one ( i ) year 

Sentence:
---------
 This Agreement shall commence as of the Effective Date and shall continue in effect for one (I) year, unless earlier terminated asexpressly provided in Sections 1.3.
