**Introduction**

Most PDF-to-text parsers do not provide layout information. Often, even sentences are split by arbitrary CR/LFs, which makes paragraph boundaries very hard to find. This complicates chunking and makes it hard to attach long-running contextual information, such as section headers, to passages when indexing/vectorizing PDFs for LLM applications such as retrieval augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

- Sections and subsections along with their levels.
- Paragraphs, with their lines combined.
- Links between sections and paragraphs.
- Tables along with the section the tables are found in.
- Lists and nested lists.

With LayoutPDFReader, developers can find optimal chunks of text to vectorize and work around the limited context window sizes of LLMs.

**Installation**

Install the llmsherpa library.

In [None]:
!pip install llmsherpa

The first step in using LayoutPDFReader is to give it a URL or file path and get back a document object.

In [None]:
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

**Install LlamaIndex**

In the following examples, we will use LlamaIndex for simplicity. Install the library if you haven't already.

In [None]:
!pip install llama-index

**Setup OpenAI**

Make sure your API Key is inserted.

In [None]:
import openai
openai.api_key = "<your-api-key>"  # insert your API key here
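
To keep the key out of the notebook itself, it can be read from the environment instead. The following is a minimal sketch, not part of llmsherpa or LlamaIndex: `load_openai_key` is a hypothetical helper, and it assumes the key is stored in the `OPENAI_API_KEY` environment variable.

```python
import os


def load_openai_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `env` is any mapping; it defaults to the process environment.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key


# Usage in the notebook (uncomment once the variable is set):
# import openai
# openai.api_key = load_openai_key()
```

Failing early with an explicit message tends to be easier to debug than an authentication error raised later by the first LLM call.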

**Summarize a Section using prompts**

LayoutPDFReader offers powerful ways to pick sections and subsections from a large document and use LLMs to extract insights from a section.

The following code looks for the Fine-tuning section of the document:

In [None]:
from IPython.display import HTML

selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
# use include_children=True and recurse=True to fully expand the section.
# include_children alone returns only one sublevel of children, whereas recurse walks all descendants
HTML(selected_section.to_html(include_children=True, recurse=True))
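
The exact-match loop above fails silently if the title differs in case or spacing. A slightly more forgiving lookup can be sketched as follows. `find_section` is a hypothetical helper, not an llmsherpa function; it only assumes that each section exposes a `title` attribute, as in the loop above. The `SimpleNamespace` objects are stand-ins so the sketch runs without calling the parsing service.

```python
from types import SimpleNamespace


def find_section(sections, title):
    """Return the first section whose title matches `title`.

    The comparison ignores case and surrounding whitespace.
    Returns None when no section matches.
    """
    wanted = title.strip().lower()
    for section in sections:
        if section.title.strip().lower() == wanted:
            return section
    return None


# Stand-in sections for demonstration only.
demo_sections = [
    SimpleNamespace(title="1 Introduction"),
    SimpleNamespace(title="3 Fine-tuning BART"),
]

match = find_section(demo_sections, "3 fine-tuning bart")
print(match.title if match else "no such section")  # prints "3 Fine-tuning BART"
```

With a parsed document, the call would be `find_section(doc.sections(), '3 Fine-tuning BART')`, and checking the result for `None` avoids calling `to_html` on a missing section.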

Now, let's create a custom summary of this text using a prompt:

In [None]:
from llama_index.llms import OpenAI
context = selected_section.to_html(include_children=True, recurse=True)
question = "list all the tasks discussed and one line about each task"
resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")
print(resp.text)

Tasks discussed in the text:

1. Sequence Classification Tasks: The same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is used for multi-class linear classification.
2. Token Classification Tasks: The complete document is fed into the encoder and decoder, and the top hidden state of the decoder is used as a representation for each word to classify the token.
3. Sequence Generation Tasks: BART can be fine-tuned for tasks like abstractive question answering and summarization, where the encoder input is the input sequence and the decoder generates outputs autoregressively.
4. Machine Translation: BART can be used to improve machine translation decoders by incorporating pre-trained encoders and using the entire BART model as a single pretrained decoder. The new encoder is trained to map foreign words into an input that BART can de-noise to English.
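
A fully expanded section can be long, so it may help to cap the context before sending it to the model. The sketch below wraps the prompt format used above in a hypothetical `build_prompt` helper. The character budget is an assumption and only a crude proxy for the model's token limit; truncating HTML this way can also cut through a tag.

```python
def build_prompt(question, context, max_context_chars=12000):
    """Build the question-answering prompt, trimming context to a character budget.

    max_context_chars is an illustrative default, not a model limit.
    """
    if len(context) > max_context_chars:
        # Keep the beginning of the context and mark the cut explicitly.
        context = context[:max_context_chars] + "\n[context truncated]"
    return f"read this text and answer question: {question}:\n{context}"


print(build_prompt("list all the tasks", "<p>short section</p>"))
```

The returned string can be passed to `OpenAI().complete(...)` in place of the inline f-string.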


**Analyze a Table using prompts**

With LayoutPDFReader, you can iterate through all the tables in a document and use LLMs to analyze them. Let's look at the 6th table in this document. If you are using a notebook, you can display the table as follows:

In [None]:
from IPython.display import HTML
HTML(doc.tables()[5].to_html())

0,1,2,3,4,5,6,7,8,9,10
BERT,84.1/90.9,79.0/81.8,86.6/-,93.2,91.3,92.3,90.0,70.4,88.0,60.6
UniLM,-/-,80.5/83.4,87.0/85.9,94.5,-,92.7,-,70.9,-,61.1
XLNet,89.0/94.5,86.1/88.8,89.8/-,95.6,91.8,93.9,91.8,83.8,89.2,63.6
RoBERTa,88.9/94.6,86.5/89.4,90.2/90.2,96.4,92.2,94.7,92.4,86.6,90.9,68.0
BART,88.8/94.6,86.1/89.2,89.9/90.1,96.6,92.5,94.9,91.2,87.0,90.4,62.8


Now let's ask a question to analyze this table:

In [None]:
from llama_index.llms import OpenAI
context = doc.tables()[5].to_html()
resp = OpenAI().complete(f"read this table and answer question: which model has the best performance on squad 2.0:\n{context}")
print(resp.text)

The model with the best performance on SQuAD 2.0 is RoBERTa, with an EM/F1 score of 86.5/89.4.
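
Since LLM answers about numeric tables can be wrong, a deterministic cross-check is cheap to add. The sketch below re-derives the answer from the rows printed above using only the standard library. It assumes that column 2 holds the SQuAD 2.0 EM/F1 pair, which is consistent with the answer above but is not stated in the table output itself; `parse_pair` and `best_by_f1` are hypothetical helpers.

```python
import csv
import io

# Rows copied from the table output above; column 0 is the model name.
TABLE_TEXT = """\
BERT,84.1/90.9,79.0/81.8,86.6/-,93.2,91.3,92.3,90.0,70.4,88.0,60.6
UniLM,-/-,80.5/83.4,87.0/85.9,94.5,-,92.7,-,70.9,-,61.1
XLNet,89.0/94.5,86.1/88.8,89.8/-,95.6,91.8,93.9,91.8,83.8,89.2,63.6
RoBERTa,88.9/94.6,86.5/89.4,90.2/90.2,96.4,92.2,94.7,92.4,86.6,90.9,68.0
BART,88.8/94.6,86.1/89.2,89.9/90.1,96.6,92.5,94.9,91.2,87.0,90.4,62.8
"""


def parse_pair(cell):
    """Split an 'EM/F1' cell into two floats; a '-' entry becomes None."""
    return tuple(None if part == "-" else float(part) for part in cell.split("/"))


def best_by_f1(table_text, column):
    """Return (model, (em, f1)) for the row with the highest F1 in `column`.

    Only meaningful for columns whose cells are 'EM/F1' pairs.
    """
    scored = []
    for row in csv.reader(io.StringIO(table_text)):
        em, f1 = parse_pair(row[column])
        if f1 is not None:  # skip rows with a missing score
            scored.append((row[0], (em, f1)))
    return max(scored, key=lambda item: item[1][1])


# Assumption: column 2 is the SQuAD 2.0 EM/F1 pair.
print(best_by_f1(TABLE_TEXT, column=2))  # prints ('RoBERTa', (86.5, 89.4))
```

Here the computed result agrees with the model's answer.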


That's it! LayoutPDFReader also supports tables with nested headers and header rows.

Here's an example with nested headers (note that the HTML doesn't render properly in IPython, but the HTML structure is correct):

In [None]:
from IPython.display import HTML
HTML(doc.tables()[6].to_html())

0,1,2,3,4,5,6
Lead-3,40.42,17.62,36.67,16.30,1.60,11.95
"PTGEN (See et al., 2017)",36.44,15.66,33.42,29.70,9.21,23.24
"PTGEN+COV (See et al., 2017)",39.53,17.28,36.38,28.10,8.02,21.72
UniLM,43.33,20.21,40.51,-,-,-
"BERTSUMABS (Liu & Lapata, 2019)",41.72,19.39,38.76,38.76,16.33,31.15
"BERTSUMEXTABS (Liu & Lapata, 2019)",42.13,19.6,39.18,38.81,16.50,31.27
BART,44.16,21.28,40.9,45.14,22.27,37.25


Now let's ask an interesting question:

In [None]:
from llama_index.llms import OpenAI
context = doc.tables()[6].to_html()
question = "tell me about R1 of bart for different datasets"
resp = OpenAI().complete(f"read this table and answer question: {question}:\n{context}")
print(resp.text)

R1 of BART for different datasets:

- For the CNN/DailyMail dataset, the R1 score of BART is 44.16.
- For the XSum dataset, the R1 score of BART is 21.28.



**Vector search and Retrieval Augmented Generation with Smart Chunking**

LayoutPDFReader does smart chunking that keeps related text together:

- All list items are kept together, including the paragraph that precedes the list.
- Items in a table are chunked together.
- Contextual information from section headers and nested section headers is included.

The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks:

In [None]:
from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
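
The indexing loop above passes an empty `extra_info`. If per-chunk metadata is wanted for filtering or citing sources, it could be collected along these lines. This is a sketch under assumptions: `chunk_records` is a hypothetical helper, `to_context_text()` is the only chunk method taken from the code above, and `page_idx` and `tag` are assumed to be available as chunk attributes (they do appear as keys in the raw JSON). `FakeChunk` is a stand-in so the example runs offline.

```python
def chunk_records(chunks):
    """Turn chunk-like objects into (text, metadata) pairs, skipping empty text."""
    records = []
    for position, chunk in enumerate(chunks):
        text = chunk.to_context_text().strip()
        if not text:
            continue  # nothing useful to embed
        metadata = {
            "chunk_position": position,
            # Assumed attributes; getattr returns None if they are absent.
            "page_idx": getattr(chunk, "page_idx", None),
            "tag": getattr(chunk, "tag", None),
        }
        records.append((text, metadata))
    return records


class FakeChunk:
    """Stand-in for a parsed chunk, for demonstration only."""

    def __init__(self, text, page_idx, tag):
        self._text = text
        self.page_idx = page_idx
        self.tag = tag

    def to_context_text(self):
        return self._text


demo = [FakeChunk("3 Fine-tuning BART\nSome text", 3, "para"), FakeChunk("   ", 3, "para")]
for text, metadata in chunk_records(demo):
    print(metadata, text[:30])
```

In the loop above, each pair could then be inserted with `index.insert(Document(text=text, extra_info=metadata))`.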


Let's run one query:

In [None]:
response = query_engine.query("list all the tasks that work with bart")
print(response)

BART works well for text generation, comprehension tasks, abstractive dialogue, question answering, and summarization tasks.


Let's try another query that needs an answer from a table:

In [None]:
response = query_engine.query("what is the bart performance score on squad")
print(response)

The BART performance score on SQuAD is 88.8 for Exact Match (EM) and 94.6 for F1 score.


**Get the Raw JSON**

To get the complete JSON returned by the llmsherpa service and process it differently, use the json attribute.
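
One way to process that JSON is to rebuild a header outline from it. The sketch below assumes the JSON is a list of block dicts with the `tag`, `level`, and `sentences` keys visible in the output further down; `header_outline` is a hypothetical helper, and the sample blocks are copied from that output, trimmed to the keys the sketch uses.

```python
def header_outline(blocks):
    """Return header titles from a list of block dicts, indented by level."""
    lines = []
    for block in blocks:
        if block.get("tag") == "header":
            title = " ".join(block.get("sentences", []))
            lines.append("  " * block.get("level", 0) + title)
    return lines


# Sample blocks copied from the raw JSON output of this document.
sample_blocks = [
    {"block_idx": 1, "level": 0, "tag": "para",
     "sentences": ["Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, "
                   "Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI"]},
    {"block_idx": 2, "level": 1, "tag": "header",
     "sentences": ["{mikelewis,yinhanliu,naman}@fb.com"]},
    {"block_idx": 3, "level": 2, "tag": "header",
     "sentences": ["Abstract"]},
]

for line in header_outline(sample_blocks):
    print(line)
```

With a parsed document, `header_outline(doc.json)` would take the place of the sample list. Note that the parser tags the e-mail line as a header here, so an outline built this way may need some filtering.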

In [None]:
doc.json

[{'block_class': 'cls_0',
  'block_idx': 0,
  'level': 0,
  'page_idx': 0,
  'sentences': ['BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension'],
  'tag': 'header'},
 {'block_class': 'cls_1',
  'block_idx': 1,
  'level': 0,
  'page_idx': 0,
  'sentences': ['Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI'],
  'tag': 'para'},
 {'block_class': 'cls_5',
  'block_idx': 2,
  'level': 1,
  'page_idx': 0,
  'sentences': ['{mikelewis,yinhanliu,naman}@fb.com'],
  'tag': 'header'},
 {'block_class': 'cls_1',
  'block_idx': 3,
  'level': 2,
  'page_idx': 0,
  'sentences': ['Abstract'],
  'tag': 'header'},
 {'block_class': 'cls_7',
  'block_idx': 4,
  'level': 3,
  'page_idx': 0,
  'sentences': ['We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.',
   'BART is trained by (1) corrupting text with an arbitrary noisin