## LLMSherpa Smart Document Parser

### Introduction to `nlm-ingestor`

The `nlm-ingestor` repository provides an efficient parsing service tailored for Retrieval-Augmented Generation (RAG) systems. Previously a proprietary tool available exclusively to Microsoft Azure users, the parser is now open source, making it accessible for broader use.

- `nlm-ingestor` is recommended for its accuracy, its speed, and its RAG-friendly output.
- It combines rule-based and model-based techniques to handle various document formats, including PDF, HTML, and DOCX, and is optimized for text-heavy documents and complex layouts.


### Installation

To run `nlm-ingestor` locally using Docker, follow these steps:

1. **Pull the Docker Image**

   Download the latest Docker image:

   ```bash
   docker pull ghcr.io/nlmatics/nlm-ingestor:latest
   ```

2. **Run the Docker Container**

   Start the Docker container and map the container's port to a port on your local machine:

   ```bash
   docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest
   ```


3. **Clone the `llmsherpa` Repository**

   Download the `llmsherpa` repository from GitHub:

   ```bash
   git clone https://github.com/nlmatics/llmsherpa.git
   ```

   Note the path to the `llmsherpa` directory after cloning. For example, if you cloned it to your home directory, the path might be `~/llmsherpa`.

4. **Set Up Environment Variables**

   Create a `.env` file in your project directory with the following content, replacing the path with your `llmsherpa` directory path:

   `PATH_TO_LLMSHERPA=~/llmsherpa`

5. **Verify the Configuration**

   ```bash
   curl -I http://localhost:5010/
   ```

   This should print something like:

   ```
   HTTP/1.1 200 OK
   Server: Werkzeug/3.0.3 Python/3.11.9
   Date: Thu, 19 Sep 2024 22:47:35 GMT
   Content-Type: text/html; charset=utf-8
   Content-Length: 18
   Connection: close
   ```
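The same reachability check can be run from Python, which is handy inside the notebook. This is a minimal sketch using only the standard library; `is_server_up` is a hypothetical helper name, not part of `nlm-ingestor` or `llmsherpa`, and it assumes the server answers a `HEAD` request on `/` with status 200, as the `curl -I` output above suggests.

```python
import http.client
import urllib.error
import urllib.request


def is_server_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if a HEAD request to base_url answers with HTTP 200."""
    request = urllib.request.Request(base_url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_server_up("http://localhost:5010/")` should return `True` once the container from step 2 is running, where 5010 is the host port chosen in the `docker run` command.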

(TODO: Automate this step by creating a Docker container to handle the setup process.)

To check that the environment variable is set correctly, run the following Python code in a Jupyter cell:


In [None]:
from dotenv import load_dotenv
from typing import List
import os
import sys

load_dotenv()

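llmsherpa_path = os.getenv('PATH_TO_LLMSHERPA')

if llmsherpa_path:
    # Expand "~" explicitly: neither python-dotenv nor sys.path resolves it
    llmsherpa_path = os.path.expanduser(llmsherpa_path)
    sys.path.insert(0, llmsherpa_path)
    print(f"Added {llmsherpa_path} to sys.path")
else:
    print("Environment variable PATH_TO_LLMSHERPA is not set.")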


In [3]:
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
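If the container is mapped to a different host port, the endpoint URL has to change with it. A small sketch of assembling the URL from its parts follows; `build_parse_url` is a hypothetical helper introduced here for illustration, and the path and `renderFormat` query parameter are copied from the URL above.

```python
from urllib.parse import urlencode


def build_parse_url(host: str = "localhost", port: int = 5010, render_format: str = "all") -> str:
    """Assemble the parseDocument endpoint URL for a locally running ingestor."""
    query = urlencode({"renderFormat": render_format})
    return f"http://{host}:{port}/api/parseDocument?{query}"


llmsherpa_api_url = build_parse_url()
print(llmsherpa_api_url)  # http://localhost:5010/api/parseDocument?renderFormat=all
```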

### Parse a Single Document

In [4]:
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # a local file path also works, e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

The `doc` object contains chunks of the document, extracted according to its logical structure (paragraphs, tables, lists, and so on).

In [5]:
for chunk in doc.chunks()[:5]:
    print(chunk)

<llmsherpa.readers.layout_reader.Paragraph object at 0x7f92abfbd910>
<llmsherpa.readers.layout_reader.Paragraph object at 0x7f92abfbd9a0>
<llmsherpa.readers.layout_reader.Paragraph object at 0x7f92abfbda90>
<llmsherpa.readers.layout_reader.Paragraph object at 0x7f92abfbdb50>
<llmsherpa.readers.layout_reader.Paragraph object at 0x7f92abfbdb80>
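Printing a chunk only shows its object address, so a short summary per chunk is usually more useful. The sketch below relies only on `chunk.tag` and `chunk.to_text()`, both of which are used later in this notebook; `summarize_chunks` and its `width` parameter are hypothetical names introduced for illustration.

```python
from typing import Any, Iterable, List, Tuple


def summarize_chunks(chunks: Iterable[Any], limit: int = 5, width: int = 80) -> List[Tuple[str, str, str]]:
    """Return (type name, tag, text preview) for the first `limit` chunks."""
    rows = []
    for index, chunk in enumerate(chunks):
        if index >= limit:
            break
        # Flatten newlines so each preview fits on one line.
        text = chunk.to_text().replace("\n", " ")
        preview = text if len(text) <= width else text[: width - 3] + "..."
        rows.append((type(chunk).__name__, chunk.tag, preview))
    return rows
```

In the notebook this could be used as `for row in summarize_chunks(doc.chunks()): print(row)`.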


We can access chunks by type and display them as HTML.

In [6]:
from IPython.display import HTML
# Access the table at index 5 (the sixth table) and render it as HTML
HTML(doc.tables()[5].to_html())

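| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 84.1/90.9 | 79.0/81.8 | 86.6/- | 93.2 | 91.3 | 92.3 | 90.0 | 70.4 | 88.0 | 60.6 |
| UniLM | -/- | 80.5/83.4 | 87.0/85.9 | 94.5 | - | 92.7 | - | 70.9 | - | 61.1 |
| XLNet | 89.0/94.5 | 86.1/88.8 | 89.8/- | 95.6 | 91.8 | 93.9 | 91.8 | 83.8 | 89.2 | 63.6 |
| RoBERTa | 88.9/94.6 | 86.5/89.4 | 90.2/90.2 | 96.4 | 92.2 | 94.7 | 92.4 | 86.6 | 90.9 | 68.0 |
| BART | 88.8/94.6 | 86.1/89.2 | 89.9/90.1 | 96.6 | 92.5 | 94.9 | 91.2 | 87.0 | 90.4 | 62.8 |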


In [7]:
# Get the section at index 1 (the second section) as JSON
doc.sections()[1].block_json

{'bbox': [179.52, 165.33, 421.01, 177.29000000000002],
 'block_class': 'cls_5',
 'block_idx': 2,
 'level': 1,
 'page_idx': 0,
 'sentences': ['{mikelewis,yinhanliu,naman}@fb.com'],
 'tag': 'header'}
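The `block_json` dictionary can be reduced to the few fields that matter for retrieval. The following is a sketch under the assumption that every block carries the keys shown in the output above and that `sentences` is a list of strings; `describe_block` is a hypothetical helper, not part of `llmsherpa`.

```python
from typing import Any, Dict


def describe_block(block_json: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the fields of a block_json dict that matter most for retrieval."""
    return {
        "tag": block_json.get("tag", "unknown"),
        "level": block_json.get("level"),
        "page_idx": block_json.get("page_idx"),
        # Join the sentence list into one string of text.
        "text": " ".join(block_json.get("sentences", [])),
    }


# The dictionary printed by the cell above, copied verbatim.
example = {
    "bbox": [179.52, 165.33, 421.01, 177.29000000000002],
    "block_class": "cls_5",
    "block_idx": 2,
    "level": 1,
    "page_idx": 0,
    "sentences": ["{mikelewis,yinhanliu,naman}@fb.com"],
    "tag": "header",
}
print(describe_block(example))
```

Note that this block is tagged `header` even though it holds author email addresses, so filtering on `tag` alone may not be enough to isolate real section titles.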

In [8]:
doc.sections()[0].to_text()

'BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension'

In [None]:
# HTML(doc.to_html())
# HTML(doc.sections()[0].to_html(include_children=True, recurse=True))

### Process a Folder of Files

In [29]:
class Document:
    """
    A class to represent a document with metadata and content.
    The metadata and hierarchy information can be used in retrieval and search for RAG.

    Attributes:
        metadata (dict): A dictionary containing metadata about the document.
        page_content (str): The content of the document.
    """

    def __init__(self, metadata: dict, page_content: str):
        """
        Initialize a Document instance.

        Args:
            metadata (dict): A dictionary containing metadata about the document.
            page_content (str): The content of the document.
        """
        self.metadata = metadata
        self.page_content = page_content

    def format_metadata(self):
        """
        Format the metadata of the document.

        Returns:
            dict: A dictionary containing formatted metadata with predefined attributes.
        """
        # Define attributes based on observed attributes in chunks
        attributes = ['source', 'page_idx', 'block_idx', 'tag', 'type']
        # Extract relevant metadata, defaulting to 'unknown' if not present
        formatted_metadata = {key: self.metadata.get(key, 'unknown') for key in attributes}
        return formatted_metadata
    
    @property
    def source(self):
        return self.metadata.get('source', 'unknown')

    @property
    def page_idx(self):
        return self.metadata.get('page_idx', 'unknown')

    @property
    def block_idx(self):
        return self.metadata.get('block_idx', 'unknown')

    @property
    def tag(self):
        return self.metadata.get('tag', 'unknown')

    @property
    def type(self):
        return self.metadata.get('type', 'unknown')

    def __repr__(self):
        """
        Return a string representation of the Document instance.

        Returns:
            str: A formatted string containing the metadata and the full content.
        """
        # Build a "key=value" string from the formatted metadata
        metadata_str = ', '.join(f"{key}={value}" for key, value in self.format_metadata().items())
        return f"Document({metadata_str}, content={self.page_content})"

def process_pdf(file_path: str, pdf_reader: LayoutPDFReader) -> List[Document]:
    """
    Process a PDF file and extract its content into Document instances.

    Args:
        file_path (str): The path to the PDF file.
        pdf_reader (LayoutPDFReader): An instance of LayoutPDFReader to read the PDF.

    Returns:
        List[Document]: A list of Document instances representing the content of the PDF.
    """
    # Read PDF content
    doc = pdf_reader.read_pdf(file_path)

    documents = []
    for chunk in doc.chunks():
        # Extract metadata
        page_idx = chunk.page_idx
        block_idx = chunk.block_idx
        tag = chunk.tag
        chunk_type = type(chunk).__name__  # Get the type of the chunk (e.g., 'Paragraph', 'Table', etc.)
        page_content = chunk.to_text()

        # Convert to Document format
        metadata = {
            'source': os.path.basename(file_path),
            'page_idx': page_idx,
            'block_idx': block_idx,
            'tag': tag,
            'type': chunk_type
        }
        document = Document(metadata=metadata, page_content=page_content)
        documents.append(document)

    return documents

def process_folder(folder_path: str, pdf_reader: LayoutPDFReader) -> List[Document]:
    """
    Process all PDF files in a folder and extract their content into Document instances.

    Args:
        folder_path (str): The path to the folder containing PDF files.
        pdf_reader (LayoutPDFReader): An instance of LayoutPDFReader to read the PDFs.

    Returns:
        List[Document]: A list of Document instances representing the content of all PDFs in the folder.
    """
    all_documents = []
    for file_name in os.listdir(folder_path):
        if file_name.lower().endswith('.pdf'):  # also match upper-case extensions such as .PDF
            file_path = os.path.join(folder_path, file_name)
            print(f"Processing file: {file_name}")
            # Process each PDF file
            documents = process_pdf(file_path, pdf_reader)
            all_documents.extend(documents)

    return all_documents

In [30]:
# EXAMPLE USE
# llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
# pdf_reader = LayoutPDFReader(llmsherpa_api_url)

folder_path = "../data"  # Path to your folder containing PDFs
all_docs = process_folder(folder_path, pdf_reader)

# Output the first document for verification
print(all_docs[0].metadata)
print(all_docs[0].page_content)

Processing file: ShowcaseUserGuide.pdf
Processing file: access_019-access-management-system-user-guide-v4-0.pdf
Processing file: nihms-1769170.pdf
Processing file: Bookshelf_NBK5295.pdf
{'source': 'ShowcaseUserGuide.pdf', 'page_idx': 0, 'block_idx': 3, 'tag': 'para', 'type': 'Paragraph'}
UK Biobank holds an unprecedented amount of data on half a million participants aged 40-69 years (with a roughly even number of men and women) recruited between 2006 and 2010 throughout the UK.
Showcase (available at http://www.ukbiobank.ac.uk) aims to present the data available for health-related research in a comprehensive and concise way, and to provide technical information for researchers considering applying to use the resource.


(Processed roughly 200-300 pages across 4 documents in 7.3 seconds; this might be improved with asynchronous or concurrent requests.)
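One way to explore that speed-up without rewriting the code around `asyncio` is a thread pool, since each file is handled by an independent HTTP request to the ingestor. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` rather than true async I/O. `process_folder_concurrently` and `process_one` are hypothetical names, and whether `LayoutPDFReader` is safe to share across threads has not been verified here, so treat this as an experiment rather than a measured improvement.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def process_folder_concurrently(
    folder_path: str,
    process_one: Callable[[str], List[T]],
    max_workers: int = 4,
) -> List[T]:
    """Apply process_one to every PDF in folder_path using a thread pool."""
    pdf_paths = sorted(
        os.path.join(folder_path, name)
        for name in os.listdir(folder_path)
        if name.lower().endswith(".pdf")
    )
    results: List[T] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the output order is stable.
        for documents in pool.map(process_one, pdf_paths):
            results.extend(documents)
    return results
```

With the functions defined above, a call could look like `process_folder_concurrently("../data", lambda path: process_pdf(path, pdf_reader))`.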

In [31]:
all_docs[:10]

[Document(source=ShowcaseUserGuide.pdf, page_idx=0, block_idx=3, tag=para, type=Paragraph, content=UK Biobank holds an unprecedented amount of data on half a million participants aged 40-69 years (with a roughly even number of men and women) recruited between 2006 and 2010 throughout the UK.
 Showcase (available at http://www.ukbiobank.ac.uk) aims to present the data available for health-related research in a comprehensive and concise way, and to provide technical information for researchers considering applying to use the resource.),
 Document(source=ShowcaseUserGuide.pdf, page_idx=0, block_idx=4, tag=para, type=Paragraph, content=This user guide is designed to give you an overview of the data and provides some instructions on how to navigate your way through the system.),
 Document(source=ShowcaseUserGuide.pdf, page_idx=0, block_idx=6, tag=list_item, type=ListItem, content=• Have a printout of this user guide handy when you first use Showcase),
 Document(source=ShowcaseUserGuide.pdf,

In [32]:
all_docs[0]

Document(source=ShowcaseUserGuide.pdf, page_idx=0, block_idx=3, tag=para, type=Paragraph, content=UK Biobank holds an unprecedented amount of data on half a million participants aged 40-69 years (with a roughly even number of men and women) recruited between 2006 and 2010 throughout the UK.
Showcase (available at http://www.ukbiobank.ac.uk) aims to present the data available for health-related research in a comprehensive and concise way, and to provide technical information for researchers considering applying to use the resource.)

In [34]:
doc = all_docs[0]
print(doc.source)      # Access source
print(doc.page_idx)    # Access page index
print(doc.block_idx)   # Access block index
print(doc.tag)         # Access tag
print(doc.type)        # Access type
print(doc.page_content)  # Access content

ShowcaseUserGuide.pdf
0
3
para
Paragraph
UK Biobank holds an unprecedented amount of data on half a million participants aged 40-69 years (with a roughly even number of men and women) recruited between 2006 and 2010 throughout the UK.
Showcase (available at http://www.ukbiobank.ac.uk) aims to present the data available for health-related research in a comprehensive and concise way, and to provide technical information for researchers considering applying to use the resource.
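Because every `Document` carries its source, page, tag, and type, retrieval code can narrow the candidate set by metadata before any text search. A minimal sketch follows; `filter_by_metadata` is a hypothetical helper, and the sample objects are stand-ins that only mimic the `metadata` and `page_content` attributes of the `Document` class above.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def filter_by_metadata(docs: Iterable[Any], **criteria: Any) -> List[Any]:
    """Keep documents whose metadata matches every key=value pair in criteria."""
    return [
        doc for doc in docs
        if all(doc.metadata.get(key) == value for key, value in criteria.items())
    ]


# Stand-in objects with the same attributes as Document (made-up sample data).
sample_docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "page_idx": 0, "type": "Paragraph"}, page_content="Intro"),
    SimpleNamespace(metadata={"source": "a.pdf", "page_idx": 1, "type": "Table"}, page_content="Results"),
    SimpleNamespace(metadata={"source": "b.pdf", "page_idx": 0, "type": "Paragraph"}, page_content="Other"),
]
tables = filter_by_metadata(sample_docs, type="Table")
print([doc.page_content for doc in tables])  # ['Results']
```

The same call works on the real list, for example `filter_by_metadata(all_docs, source="ShowcaseUserGuide.pdf", page_idx=0)`.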


### Resources

- https://github.com/nlmatics/llmsherpa
- https://github.com/nlmatics/nlm-ingestor