## Carry out Q&A and document summarisation using the T5 pre-trained and fine-tuned transformer
This code carries out question answering and document summarisation on PDF files. It can equally be used with plain-text documents by loading the text directly.
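
For a plain-text document, the PDF extraction step could be replaced with something like the sketch below. `load_text_document` is a hypothetical helper, not part of this notebook, and it assumes that collapsing all whitespace runs into single spaces is acceptable cleaning for your text.

```python
from pathlib import Path


def load_text_document(path, encoding="utf-8"):
    """Read a plain-text file and collapse whitespace runs into single spaces."""
    raw = Path(path).read_text(encoding=encoding)
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, newlines, form feeds), so joining with " " normalises the text.
    return " ".join(raw.split())
```

The result can then be assigned to `book` in place of the cleaned PDF text, for example `book = load_text_document("contract.txt")`, where the filename is only a placeholder.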

The notebook itself does the following:
- reads in the PDF file identified as `filename`, and cleans up the text
- attempts to answer the question against the entire document, and also provides a document summary
- presents the answer

The model uses a pre-trained and fine-tuned version of 't5-large', available from the Hugging Face transformers library. The 't5-large' model is pre-trained with a text-to-text denoising objective, in which masked spans of the input are reconstructed. It is further trained on a mixture of supervised tasks, including document summarisation, question answering, language translation and sentiment analysis. Each task is selected by a text prefix on the input, such as "summarize: ".
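
T5 treats every task as text in, text out, and the task is chosen by the prefix on the input string. A minimal sketch of how those inputs are built is shown here. `build_t5_input` and its task names are hypothetical and not part of this notebook; the prefixes for translation and sentiment are the ones used by the original T5 checkpoints, included here as an assumption about what 't5-large' will respond to.

```python
def build_t5_input(task, text, question=None):
    """Return a T5 input string with the task prefix prepended."""
    if task == "qa":
        if question is None:
            raise ValueError("question is required for the 'qa' task")
        # Same format as the qa() function in this notebook
        return "question: " + question + " context: " + text
    prefixes = {
        "summarise": "summarize: ",  # the prefix itself uses US spelling
        "translate_en_de": "translate English to German: ",
        "sentiment": "sst2 sentence: ",
    }
    if task not in prefixes:
        raise ValueError(f"unknown task: {task}")
    return prefixes[task] + text
```

The returned string is what gets passed to the tokenizer; the model weights are the same for every task.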

The model can be further fine-tuned on your own dataset using the 6F T5_DEMO_FT using QCA CSV.ipynb notebook. Note that this model is very difficult to fine-tune, because its accuracy is already high and it is tuned across a breadth of tasks.

In [None]:
# Load libraries to tokenise and de-tokenise, the T5 model itself, the PDF extractor and torch libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration
from pdfminer.high_level import extract_text
import torch
import torch.nn.functional as F

In [None]:
# Set document as filename
filename = '6D DEMO_GRIDIRONBIONUTRIENTS,INC_02_05_2020-EX-10.3-SUPPLY AGREEMENT.PDF'
# Extract text using pdfminer
doc = extract_text(filename)
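# Clean doc and output as book: replace line breaks and form feeds (page breaks)
# with spaces so that words on adjacent lines are not fused together
book = doc.replace("\n", " ")
book = book.replace("\x0c", " ")
# Collapse any run of whitespace into a single space
book = " ".join(book.split())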

In [None]:
print(doc[:1000])

In [None]:
# Load the pre-trained tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-large')
model = T5ForConditionalGeneration.from_pretrained('t5-large', return_dict=True)

In [None]:
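# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)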

In [None]:
# Function to answer questions
def qa(question):
    # Define input using the question and the legal document
    another = "question: " + question + " context: " + book
    # Tokenise inputs
    input_ids = tokenizer.encode(another, max_length=5000, truncation=True, return_tensors='pt').to(device)
    # Predict
    outputs = model.generate(input_ids = input_ids)
    # De-tokenise output
    output_str = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return answer
    return output_str
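
Because `qa` truncates its input at `max_length`, the model may see only part of a long contract. One possible workaround, sketched here, is to split the text into overlapping word windows and ask the question of each window. `split_into_chunks`, `qa_over_chunks` and the two-argument `answer_fn` are hypothetical helpers, not part of this notebook, and the window sizes are arbitrary assumptions.

```python
def split_into_chunks(text, chunk_words=300, overlap_words=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once a window has reached the end of the text
        if start + chunk_words >= len(words):
            break
    return chunks


def qa_over_chunks(question, text, answer_fn, **chunk_kwargs):
    """Call answer_fn(question, chunk) per chunk; return (chunk_index, answer) pairs."""
    return [
        (i, answer_fn(question, chunk))
        for i, chunk in enumerate(split_into_chunks(text, **chunk_kwargs))
    ]
```

To use this with T5, `answer_fn` would be a variant of `qa` that takes the context as a second argument instead of reading the global `book`. The per-chunk answers still need to be compared by hand, since this sketch does not rank them.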

In [None]:
# Function to summarise the document
def summarise(doc_summ):
    # Tokenise the document, prepending the "summarize: " task prefix
    input_ids = tokenizer.encode("summarize: " + doc_summ, max_length=5000, truncation=True, return_tensors="pt").to(device)
    # Generate summary
    outputs = model.generate(input_ids)
    # De-tokenise output
    outputs_str = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return output
    return outputs_str

In [None]:
summarise(book)

In [None]:
qa("When is the contract dated?")

In [None]:
qa("Which two parties is the agreement between?")

In [None]:
qa("What is the pricing of the product?")

In [None]:
qa("How does the buyer pay for the service?")

In [None]:
qa("What is the point of delivery for the product?")

In [None]:
qa("Who can terminate the agreement?")

In [None]:
qa("Where is the governing law?")

In [None]:
qa("Which parties witnessed the signing of the contract?")

In [None]:
qa("What is the Confidential Information clause?")

In [None]:
qa("When can either party terminate?")