## Carry out Q&A and document summarisation using the T5 pre-trained and fine-tuned transformer
This notebook carries out question answering and document summarisation on PDF files. It can equally be used with plain-text documents by loading the text directly.

The notebook itself does the following:
- reads in the PDF file identified as 'filename', and cleans up the text
- attempts to answer the question against the entire document, and also provides a document summary
- presents the answer

The model uses a pre-trained and fine-tuned version of 't5-large', available from the Hugging Face transformers library. The 't5-large' model is pre-trained with a span-corruption (denoising) objective, in which contiguous spans of the input text are masked and the model learns to generate them. It is further trained on a mixture of supervised tasks, including document summarisation, question answering, language translation and sentiment analysis.
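
T5 treats every task as text in, text out, and the task is selected by a plain-text prefix on the input. The sketch below shows the two prompt formats this notebook relies on. The helper names `build_qa_prompt` and `build_summary_prompt` are illustrative only and are not part of the transformers library.

```python
def build_qa_prompt(question: str, context: str) -> str:
    """Format a question-answering input in the 'question: ... context: ...' style."""
    return f"question: {question} context: {context}"


def build_summary_prompt(document: str) -> str:
    """Format a summarisation input using the 'summarize: ' task prefix."""
    return f"summarize: {document}"


# Example inputs (made up for illustration)
print(build_qa_prompt("Who is the seller?", "Shi Farms agrees to sell Product."))
print(build_summary_prompt("Shi Farms agrees to sell Product."))
```

The same model weights handle both prompts; only the prefix changes.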

The model can be further fine-tuned using your own dataset through the '6F T5_DEMO_FT using QCA CSV.ipynb' notebook. Note that this model is very difficult to fine-tune further, because its accuracy is already high and it has been tuned across a breadth of tasks.

In [1]:
# Load libraries: the T5 tokeniser and model, the PDF text extractor and torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from pdfminer.high_level import extract_text #pip install pdfminer.six
import torch
import torch.nn.functional as F

In [2]:
# Set document as filename
filename = '6D DEMO_GRIDIRONBIONUTRIENTS,INC_02_05_2020-EX-10.3-SUPPLY AGREEMENT.PDF'
# Extract text using pdfminer
doc = extract_text(filename)
# Clean the extracted text and store it as book
book = doc.replace("\n" , "")
book = book.replace("\x0c", "")
book = book.replace("  ", " ")
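
Removing newlines outright can fuse words that were separated only by a line break, which is visible in the printed text as "limitedliability". As a possible alternative, the sketch below collapses every run of whitespace (including newlines and form feeds) into a single space. The function name `clean_text` is an assumption for illustration, and using it would change the text slightly compared with the outputs shown in this notebook.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse all whitespace runs (spaces, newlines, form feeds) into one space."""
    return re.sub(r"\s+", " ", raw).strip()


# Made-up sample resembling raw pdfminer output
sample = "a Delaware limited\nliability  company\x0cEXHIBIT 10.3"
print(clean_text(sample))
```

With this approach the call in the cell above would become `book = clean_text(doc)`.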

In [3]:
print(book[:1000])

EXHIBIT 10.3SUPPLY AGREEMENTThis Agreement (“the Agreement”), is made by and between EWSD 1, LLC, d/b/a/ SHI FARMS (“Shi Farms”), a Delaware limitedliability company and Gridiron BioNutrients, Inc, a Nevada Corporation (“Gridiron”) , each individually “a Party,” and collectively,“the Parties.”WHEREAS Shi Farms grows industrial hemp and wishes to sell hemp biomass (“Product”); andGridiron wishes to purchase Product from Shi Farms; andBoth Parties acknowledge that Shi Farms is the owner of the Product as defined below; and Shi Farms is willing to sell Product toGridiron and Gridiron desires to acquire in accordance with the terms and conditions set forth below.NOW, THEREFORE for good and valuable consideration, the receipt and sufficiency of which are hereby acknowledged, theParties agree as follows:1.Products and PaymentsA.Product. Shi Farms agrees to sell Product and Gridiron agrees to purchase 30,000 lbs. of hemp biomass (“Biomass”)from Shi Farms. Biomass must contain a minimum of six

In [4]:
tokenizer = T5Tokenizer.from_pretrained('t5-large') #pip install sentencepiece
model = T5ForConditionalGeneration.from_pretrained('t5-large', return_dict=True)

In [5]:
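# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")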
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=1024, out_features=4096, bias=False)
              (wo): Linear(in_features=4096, out_features=1024, bias=False)
              (

In [6]:
# Function to answer questions
def qa(question):
    # Define input using the question and the legal document
    another = "question: " + question + " context: " + book
    # Tokenise inputs
    input_ids = tokenizer.encode(another, max_length=5000, truncation=True, return_tensors='pt').to(device)
    # Predict
    outputs = model.generate(input_ids = input_ids)
    # De-tokenise output
    output_str = tokenizer.decode(outputs.reshape(-1), skip_special_tokens=True)
    # Return answer
    return output_str
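
The `qa` function truncates its input at 5000 tokens, while the T5 checkpoints were pre-trained on inputs of up to 512 tokens, so very long contracts may be cut off or handled less reliably. One possible workaround, sketched here under stated assumptions, is to split the document into overlapping windows and ask the question against each window. The helper `chunk_words` and its default sizes are hypothetical, and it counts whitespace-separated words as a rough stand-in for tokens.

```python
def chunk_words(text, window=300, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        # Stop once the current window has reached the end of the text
        if start + window >= len(words):
            break
    return chunks


# Tiny demonstration with ten "words"
demo = " ".join(str(i) for i in range(10))
print(chunk_words(demo, window=4, overlap=1))
```

Each chunk could then be passed to the model as the context part of the prompt, with the candidate answers compared afterwards.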

In [7]:
# Function to summarise the document
def summarise(doc_summ):
    # Tokenise the document, prepending the 'summarize: ' task prefix
    input_ids = tokenizer.encode("summarize: " + doc_summ, max_length=5000, truncation=True, return_tensors="pt").to(device)
    # Generate summary
    outputs = model.generate(input_ids)
    # De-tokenise output
    outputs_str = tokenizer.decode(outputs.reshape(-1), skip_special_tokens=True)
    # Return output
    return outputs_str

In [8]:
summarise(book)

'Shi Farms is the owner of the hemp biomass that Gridiron is purchasing. the parties'
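
The summary above stops mid-sentence ("the parties"). A likely cause, though not confirmed here, is that `generate` was called without any length setting, so the library's short default output length applies. The sketch below passes explicit generation settings. The function name `summarise_with_limits` and the default values are assumptions chosen for illustration, while `max_new_tokens`, `num_beams` and `early_stopping` are standard `generate` arguments.

```python
def summarise_with_limits(text, tokenizer, model, device="cpu",
                          max_input_tokens=512, max_new_tokens=150, num_beams=4):
    """Summarise text with an explicit output length and beam search."""
    # Tokenise with the 'summarize: ' task prefix, truncating long inputs
    input_ids = tokenizer.encode(
        "summarize: " + text,
        max_length=max_input_tokens,
        truncation=True,
        return_tensors="pt",
    ).to(device)
    # Allow a longer output than the default and search over several beams
    outputs = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode the first (and only) generated sequence
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

In this notebook it could be called as `summarise_with_limits(book, tokenizer, model, device)`. Larger `max_new_tokens` values give longer summaries at the cost of generation time.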

In [9]:
qa("When is the contract dated?")

'1/26/2020'

In [10]:
qa("Which two parties is the agreement between?")

'Shi Farms and Gridiron BioNutrients, Inc'

In [11]:
qa("What is the pricing of the product?")

'$5.00 per pound'

In [12]:
qa("How does the buyer pay for the service?")

'remit payment upon execution of this agreement'

In [13]:
qa("What is the point of delivery for the product?")

'a laboratory determined by Gridiron'

In [14]:
qa("Who can terminate the agreement?")

'Either Party'

In [15]:
qa("Where is the governing law?")

'State of Colorado'

In [16]:
qa("Which parties witnessed the signing of the contract?")

'both Parties'

In [17]:
qa("What is the Confidential Information clause?")

'includes all Intellectual Property, processes, pricing, and any information that is marked confidential'

In [18]:
qa("When can either party terminate?")

'any time prior to delivery of the Product'
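
The questions above are asked one cell at a time. For a longer checklist, a small loop can collect the answers in one pass. This is a minimal sketch: `ask_all` is a hypothetical helper, and it accepts any function that maps a question string to an answer string, such as `qa`.

```python
def ask_all(questions, answer_fn):
    """Run answer_fn on each question and return a question -> answer mapping."""
    return {question: answer_fn(question) for question in questions}


# Questions taken from the cells above
questions = [
    "When is the contract dated?",
    "Which two parties is the agreement between?",
    "What is the pricing of the product?",
]
```

In this notebook the call would be `answers = ask_all(questions, qa)`, after which `answers` can be printed or saved alongside the document name.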