This notebook helps test a local or pip-installed copy of llmsherpa with the ingestor core code.

In [1]:
!pip install llmsherpa





## Add llmsherpa
### This is used to parse PDFs and allows for smart chunking based on the document's structure (chapters, sections, tables, paragraphs).

In [1]:
import os, sys
%load_ext autoreload
from llmsherpa.readers import LayoutPDFReader
from IPython.display import display, HTML  # importing these from IPython.core.display is deprecated
%autoreload 2


In [2]:
directory_path = "/Users/fabia/Desktop/testapi"
sys.path.insert(0, directory_path)


llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=true"
pdf_url = "data/test_manual_new.pdf"

do_ocr = True
if do_ocr:
    llmsherpa_api_url = llmsherpa_api_url + "&applyOcr=yes"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
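
The parser URL above is assembled by string concatenation. A minimal sketch of the same construction with `urllib.parse`, assuming only the query parameters shown in the cell (`renderFormat`, `useNewIndentParser`, `applyOcr`); the helper name `build_parser_url` is hypothetical and not part of llmsherpa:

```python
from urllib.parse import urlencode


def build_parser_url(base_url: str, apply_ocr: bool = False) -> str:
    """Assemble the parseDocument URL from its query parameters."""
    params = {"renderFormat": "all", "useNewIndentParser": "true"}
    if apply_ocr:
        # Same flag the cell above appends by hand
        params["applyOcr"] = "yes"
    return f"{base_url}?{urlencode(params)}"


print(build_parser_url("http://localhost:5001/api/parseDocument", apply_ocr=True))
```

The resulting string can be passed to `LayoutPDFReader` in the same way as `llmsherpa_api_url`.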

## Complete parsed document:

In [54]:
# HTML(doc.sections()[0].to_html(include_children=True, recurse=True))
# doc.sections()[1].block_json
# doc.sections()[0].to_text()
# doc.sections()[1].bbox
# llmsherpa.readers.Layout
HTML(doc.to_html())
# print(doc.to_html())


| 0 | 1 | 2 | 3 |
| --- | --- | --- | --- |
| ② | Digital input module | ⑦ | Motor starter |
| ③ | Digital output module | ⑧ | Server module |
| ④ | Dummy module | ⑨ | Infeed bus cover |
| ⑤ | Motor starter |  |  |


## Parsed Chunks

In [75]:
for chunk in doc.chunks():
    print(chunk.to_context_text())
    print("===================================================================")

Installation
7
Installation > 7.1 Basics > Introduction
All modules of the ET 200SP distributed I/O system are open equipment.
This means you may only install the ET 200SP distributed I/O system in housings, cabinets or electrical operating rooms and in a dry indoor environment (degree of protection IP20).
The housings, cabinets and electrical operating rooms must guarantee protection against electric shock and spread of fire.
The requirements regarding mechanical strength must also be met.
The housings, cabinets, and electrical operating rooms must not be accessible without a key or tool.
Personnel with access must have been trained or authorized.
Installation > 7.1 Basics > Installation location
Install the ET 200SP distributed I/O system in a suitable enclosure/control cabinet with sufficient mechanical strength and fire protection.
Take into account the environmental conditions for operating the devices.
Installation > 7.1 Basics > Mounting position
You can mount the ET 200SP distr

## Parsed Tables

In [4]:
for table in doc.tables():
    print(table.to_text())

 | ① | Interface module | ⑥ | Motor starter
 | --- | --- | --- | ---
 | ② | Digital input module | ⑦ | Motor starter
 | ③ | Digital output module | ⑧ | Server module
 | ④ | Dummy module | ⑨ | Infeed bus cover
 | ⑤ | Motor starter |  | 



## Parsed Sections

In [5]:
for section in doc.sections():
    print(section.to_text())

Installation
7.1 Basics
Introduction
Installation location
Mounting position
Mounting rail
NOTE
NOTE
NOTE
Minimum clearances
NOTE Ex module group
General rules for installation
NOTE
Mounting rules for reducing the thermal load
7.2 Installation conditions for motor starters
Mechanical brackets
Designing interference-free motor starters
Mount the dummy module
NOTICE Ensure interference immunity
7.3 Mounting the CPU/interface module
Introduction
Requirement
Required tools
Mounting the CPU/interface module
Dismantling the CPU/interface module
NOTE
7.4 Installing ET 200SP R1
Introduction
Requirement
Tools required
Mounting the ET 200SP R1 system


## Search for specific sections/tables in the document:
### Here: section "Mounting the CPU/interface module"

In [6]:
def get_section_text(doc, section_title):
    """
    Returns the HTML of a specific section in a parsed PDF document.

    Parameters:
    - doc (Document): A Document object from llmsherpa.readers.layout_reader.
    - section_title (str): The exact title of the section to extract.

    Returns:
    - str: The HTML representation of the section's content (including child blocks),
      or a message if the section is not found. If several sections share the title,
      the first one is used.
    """

    selected_section = None

    # Find the desired section by title
    for section in doc.sections():
        if section.title == section_title:
            selected_section = section
            break

    # If the section is not found, return a message
    if selected_section is None:
        return f"No section titled '{section_title}' found."

    # Return the full content of the section as HTML
    return selected_section.to_html(include_children=True, recurse=True)

In [7]:
section_text = get_section_text(doc, 'Mounting the CPU/interface module')
HTML(section_text)
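
The section listing above contains several sections with the same title (for example `Introduction` under 7.1, 7.3 and 7.4), so an exact-title lookup that stops at the first hit may not return the intended one. A minimal sketch of a lookup that returns every match and ignores case; `FakeSection` is a hypothetical stand-in for the objects returned by `doc.sections()` and only models the `title` attribute used above:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeSection:
    """Hypothetical stand-in exposing only the `title` attribute."""
    title: str


def find_sections(sections: Iterable, query: str) -> List:
    """Return all sections whose title contains `query`, ignoring case."""
    needle = query.casefold()
    return [s for s in sections if needle in s.title.casefold()]


sections = [
    FakeSection("7.1 Basics"),
    FakeSection("Introduction"),
    FakeSection("7.3 Mounting the CPU/interface module"),
    FakeSection("Introduction"),
    FakeSection("Mounting the CPU/interface module"),
]
matches = find_sections(sections, "mounting the cpu")
print([s.title for s in matches])
```

With a parsed document, `find_sections(doc.sections(), ...)` should behave the same way as long as every section exposes `title`.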

## Create RAG

### Create ChromaDB
ChromaDB persists the vector store (embeddings) on disk so that it can be reused across sessions.

In [8]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("NewCollection")



### Setup LLM model and embeddings model

In [9]:
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.llm = Ollama(model="llama3", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [10]:
from llama_index.core import VectorStoreIndex
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import Document

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)


### Insert parsed document chunks into the vectorstore

Print chunk

In [62]:
def get_chunk_parent_titles(chunk):
    # Walk up the parent chain and collect the section titles, innermost first
    titles = []
    parent = chunk.parent
    while parent is not None:
        title = getattr(parent, "title", None)
        if title is None:
            # Reached a node without a title (e.g. the document root)
            break
        titles.append(title)
        parent = parent.parent
    # Join outermost-first, e.g. "Installation > 7.1 Basics > Introduction"
    return " > ".join(reversed(titles))
    
def print_chunk_metadata(chunk):
    # print("Full json: ", chunk.block_json)
    # print()
    print("Page Nr: ", chunk.page_idx + 1)
    print("Tag: ", chunk.tag)
    print("Parent Section: ", chunk.parent.title)
    print("Parent Section Hierarchy: ", get_chunk_parent_titles(chunk))
    print("Level in the hierarchy: ", chunk.level)
    # print("Text: ", chunk.to_text())
    print()
    print("Context Text: ", chunk.to_context_text())
    print()
    print("="*100)
    


chunk = doc.chunks()[26]
print_chunk_metadata(chunk)




Page Nr:  4
Tag:  para
Level in the hierarchy:  4

This can be achieved, for example, by installing the devices in a control cabinet with the appropriate degree of protection.



### Insert chunk with metadata into the vectorstore index

In [63]:
from llama_index.core.schema import MetadataMode

# If there is no existing vector store create a new one and assign it to the index
index = VectorStoreIndex([], storage_context=storage_context)

for chunk_id, chunk in enumerate(doc.chunks()):
    document = Document(text=chunk.to_context_text(),
                        id_=str(chunk_id),  # document IDs are strings
                        metadata={
                              #"block_idx": chunk.block_idx, # Not sure if needed
                              "page_number": chunk.page_idx + 1, # We add 1 to the page index to match the actual page number
                              "tag": chunk.tag,
                              "parent_section": chunk.parent.title,
                              "parent_section_hierarchy": get_chunk_parent_titles(chunk)
                              #,"hierarchy_level": chunk.level # Not sure if needed
                              },
                        text_template="Metadata\n{metadata_str}\nContent:\n{content}",)
    print_chunk_metadata(chunk)
                        
    index.insert(document)

    # # Print the context of the chunk for debugging
    # if chunk_id < 10:
    #     print("---------------------The LLM sees this:---------------------",)
    #     print(document.get_content(metadata_mode=MetadataMode.LLM),)
    #     print("---------------------The Embedding model sees this:---------------------",)
    #     print(document.get_content(metadata_mode=MetadataMode.EMBED),)
    #     print("="*100)


Page Nr:  1
Tag:  para
Parent Section:  Installation
Parent Section Hierarchy:  Installation
Level in the hierarchy:  0

Context Text:  Installation
7

Page Nr:  1
Tag:  para
Parent Section:  Introduction
Parent Section Hierarchy:  Installation > 7.1 Basics > Introduction
Level in the hierarchy:  3

Context Text:  Installation > 7.1 Basics > Introduction
All modules of the ET 200SP distributed I/O system are open equipment.
This means you may only install the ET 200SP distributed I/O system in housings, cabinets or electrical operating rooms and in a dry indoor environment (degree of protection IP20).
The housings, cabinets and electrical operating rooms must guarantee protection against electric shock and spread of fire.
The requirements regarding mechanical strength must also be met.
The housings, cabinets, and electrical operating rooms must not be accessible without a key or tool.
Personnel with access must have been trained or authorized.

Page Nr:  1
Tag:  para
Parent Section:  I
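
The `text_template` passed to `Document` above decides how metadata and content are merged into the single string that the models see. A minimal sketch of that merge in plain Python, not LlamaIndex code; it assumes each metadata entry is rendered as `key: value` on its own line, and `render_for_llm` is a hypothetical helper:

```python
def render_for_llm(text: str, metadata: dict, text_template: str) -> str:
    """Emulate how metadata and content are merged into one string.

    Assumption: each metadata entry is rendered as "key: value" on its own line.
    """
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return text_template.format(metadata_str=metadata_str, content=text)


rendered = render_for_llm(
    text=(
        "Installation > 7.1 Basics > Introduction\n"
        "All modules of the ET 200SP distributed I/O system are open equipment."
    ),
    metadata={"page_number": 1, "tag": "para", "parent_section": "Introduction"},
    text_template="Metadata\n{metadata_str}\nContent:\n{content}",
)
print(rendered)
```

Printing the rendered string for a few chunks is a quick way to check that the section hierarchy is not repeated more often than intended, since `to_context_text()` already prefixes it to the content.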

Load existing vectorstore if one exists

In [64]:
# load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
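
Inserting and loading are alternatives: running the insert loop again against a persistent collection would likely store the same chunks a second time. A minimal sketch of a guard, assuming the collection object exposes `count()` as Chroma collections do; `needs_ingest` and `FakeCollection` are hypothetical names used only for illustration:

```python
class FakeCollection:
    """Hypothetical stand-in for a Chroma collection; only `count()` is modelled."""

    def __init__(self, n_items: int) -> None:
        self._n_items = n_items

    def count(self) -> int:
        return self._n_items


def needs_ingest(collection) -> bool:
    """Return True when the collection is empty and chunks still have to be inserted."""
    return collection.count() == 0


for collection in (FakeCollection(0), FakeCollection(42)):
    action = "insert chunks" if needs_ingest(collection) else "load existing index"
    print(collection.count(), "->", action)
```

In the notebook, `needs_ingest(chroma_collection)` could decide which of the two cells to run.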

### Query llm with context using Ollama Serve

Create prompt

In [65]:
from llama_index.core import PromptTemplate

template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as accurate to the context as possible and keep the answer concise.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Question: {query_str}
Context: {context_str}
Answer: <|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

qa_template = PromptTemplate(template)

# you can create text prompt (for completion API)
prompt = qa_template.format(context_str=..., query_str=...)
print("Prompt:")
print(prompt)

print()
print("="*100)
print()

# or easily convert to message prompts (for chat API)
messages = qa_template.format_messages(context_str=..., query_str=...)
print("Messages:")
print(messages)

Prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as accurate to the context as possible and keep the answer concise.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Question: Ellipsis
Context: Ellipsis
Answer: <|eot_id|>
<|start_header_id|>assistant<|end_header_id|>


Messages:
[ChatMessage(role=<MessageRole.USER: 'user'>, content="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as accurate to the context as possible and keep the answer concise.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nQuestion: Ellipsis\nContext: Ellipsis\nAnswer: <|eot_id|>\n<|start_header_id|>assistant<
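
The printed prompt shows `Ellipsis` because the cell passes the literal `...` for both placeholders. A minimal sketch of the same substitution with real values, using plain `str.format` rather than `PromptTemplate`; the shortened template and the sample strings are assumptions for illustration:

```python
TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Question: {query_str}\n"
    "Context: {context_str}\n"
    "Answer: <|eot_id|>"
)


def fill_prompt(query: str, context_chunks: list) -> str:
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(query_str=query, context_str=context)


prompt = fill_prompt(
    "Can the motor starter be mounted vertically?",
    ["You can fit the motor starter vertically or horizontally."],
)
print(prompt)
```

At query time the query engine performs this substitution itself, so the `...` calls in the cell only serve to preview the template.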

### Init query engine with prompt template

In [66]:
query_engine = index.as_query_engine(text_qa_template=qa_template)

### Print context nodes

In [67]:
def print_context(response):
    print("Nr of context nodes: ", len(response.source_nodes))
    print()
    print("="*150)
    for node in response.source_nodes:
        print("---------------------Metadata:---------------------")
        print(node.node.metadata)
        print("---------------------Text:---------------------")
        print(node.text)
        print("="*150)

In [69]:
response = query_engine.query("What do you know about the installing conditions of motor starters?")
print(response)

print()
print("Actual context: ")
print_context(response)

According to the provided context, the installing conditions for motor starters include:

* The motor starter can be installed vertically or horizontally.
* The maximum permissible ambient temperature range depends on the mounting position:
	+ Up to 60°C for horizontal mounting position.
	+ Up to 50°C for vertical installation position.

Actual context: 
Nr of context nodes:  1

---------------------Metadata:---------------------
{'page_number': 5, 'tag': 'para', 'parent_section': '7.2 Installation conditions for motor starters', 'parent_section_hierarchy': 'Installation > 7.2 Installation conditions for motor starters'}
---------------------Text:---------------------
Installation > 7.2 Installation conditions for motor starters
You can fit the motor starter vertically or horizontally.
The mounting position refers to the alignment of the mounting rail The maximum permissible ambient temperature range depends on the mounting position:
– Up to 60° C: Horizontal mounting position
– Up to 5

## Customize Retrievers
### Based on Similarity
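Retrieves the 20 most similar nodes (top-k = 20). The similarity-cutoff postprocessor is commented out in the cell below (cutoff 0.2 when enabled).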

In [94]:
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=20,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer()


# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    # node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.2)],
)
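
How top-k retrieval and a similarity cutoff interact can be sketched in plain Python. This is not LlamaIndex code; `select_nodes` and the scores are made up for illustration, and the sketch assumes that a higher score means a closer match:

```python
from typing import List, Optional, Tuple

ScoredText = Tuple[str, float]


def select_nodes(
    scored: List[ScoredText], top_k: int, cutoff: Optional[float] = None
) -> List[ScoredText]:
    """Keep the top_k highest-scoring items, then drop those below the cutoff."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
    if cutoff is None:
        return ranked
    return [item for item in ranked if item[1] >= cutoff]


candidates = [("mounting rail", 0.71), ("motor starter", 0.35), ("server module", 0.15)]
print(select_nodes(candidates, top_k=2))
print(select_nodes(candidates, top_k=3, cutoff=0.2))
print(select_nodes(candidates, top_k=3, cutoff=0.9))
```

The last call shows that a cutoff above every score leaves no context nodes at all.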

In [93]:
# query
response = query_engine.query("Tell me all you know about mounting a CPU module. What are the requirements?")
print(response)

print()
print("Actual context: ")
print_context(response)

To install a CPU/interface module, follow these steps:

1. Install the CPU/interface module on the mounting rail.
2. Swivel the CPU/interface module towards the back until you hear the mounting rail release button click into place.

The mounting rail is fitted as a requirement for mounting the CPU/interface module. The required tool is not specified in the provided context information, but it is mentioned that a 3 to 3.5 mm screwdriver might be needed for mounting and removing the BusAdapter.

Actual context: 
Nr of context nodes:  12

---------------------Metadata:---------------------
{'page_number': 7, 'tag': 'para', 'parent_section': 'Mounting the CPU/interface module', 'parent_section_hierarchy': 'Installation > 7.3 Mounting the CPU/interface module > Mounting the CPU/interface module'}
---------------------Text:---------------------
Installation > 7.3 Mounting the CPU/interface module > Mounting the CPU/interface module
Watch the video sequence (https://support.automation.siemens.

In [178]:
# query
response = query_engine.query("What is the first step I have to do in the installation?")
print(response)

Empty Response


### TODO: Add own retrieval algorithm based on sections (Install, Mounting, Assembly, Disassembly, ...)
https://docs.llamaindex.ai/en/stable/understanding/querying/querying/

https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/

In [99]:
#A dummy node-postprocessor can be implemented in just a few lines of code:
from typing import List, Optional

from llama_index.core import QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore


class DummyNodePostprocessor(BaseNodePostprocessor):
    def _postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle]
    ) -> List[NodeWithScore]:
        # subtracts 1 from the score
        for n in nodes:
            n.score -= 1

        return nodes
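
One possible starting point for the section-based retrieval in the TODO is to filter retrieved nodes by the `parent_section_hierarchy` metadata stored during insertion. A minimal sketch in plain Python; `ScoredChunk` and `filter_by_section` are hypothetical names, and wiring the filter into a `BaseNodePostprocessor` subclass is left open:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node: text, score and metadata."""
    text: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_by_section(chunks: List[ScoredChunk], section_keyword: str) -> List[ScoredChunk]:
    """Keep chunks whose section hierarchy mentions the keyword (case-insensitive)."""
    needle = section_keyword.casefold()
    return [
        c for c in chunks
        if needle in c.metadata.get("parent_section_hierarchy", "").casefold()
    ]


chunks = [
    ScoredChunk(
        "You can fit the motor starter vertically or horizontally.",
        0.8,
        {"parent_section_hierarchy": "Installation > 7.2 Installation conditions for motor starters"},
    ),
    ScoredChunk(
        "All modules of the ET 200SP distributed I/O system are open equipment.",
        0.6,
        {"parent_section_hierarchy": "Installation > 7.1 Basics > Introduction"},
    ),
]
print([c.text for c in filter_by_section(chunks, "motor starters")])
```

Matching on the full hierarchy rather than on `parent_section` alone also catches chunks that sit in sub-sections of the requested section.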

### Based on Keywords
Keep only context nodes that contain the required keywords and drop nodes that contain excluded keywords.

In [95]:
from llama_index.core.postprocessor import KeywordNodePostprocessor

# This works with or logic, so if any of the required keywords are found, the node is kept
node_postprocessors = [
    KeywordNodePostprocessor(
        # required_keywords=["Dismantling"]#, exclude_keywords=["Italy"]
        required_keywords=[]
    )
]
query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=node_postprocessors, response_mode="tree_summarize"
)
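
The include/exclude behaviour described in the comment above can be sketched without the library. This is an illustration of OR-style matching on plain strings, not the implementation of `KeywordNodePostprocessor`; `keyword_filter` is a hypothetical helper:

```python
from typing import List, Sequence


def keyword_filter(
    texts: List[str],
    required_keywords: Sequence[str] = (),
    exclude_keywords: Sequence[str] = (),
) -> List[str]:
    """Keep texts containing any required keyword and none of the excluded ones."""
    kept = []
    for text in texts:
        lowered = text.casefold()
        if required_keywords and not any(k.casefold() in lowered for k in required_keywords):
            continue  # none of the required keywords is present
        if any(k.casefold() in lowered for k in exclude_keywords):
            continue  # an excluded keyword is present
        kept.append(text)
    return kept


texts = [
    "Dismantling the CPU/interface module",
    "Mounting the CPU/interface module",
    "Mount the dummy module",
]
print(keyword_filter(texts, required_keywords=["Dismantling", "dummy"]))
print(keyword_filter(texts, exclude_keywords=["dummy"]))
print(keyword_filter(texts))
```

With both keyword lists empty, as in the cell above, nothing is filtered out.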

In [97]:
# query
response = query_engine.query("Tell me all you know about installing the ET 200SP R1? What are the requirements? Which tools do I need?")
print(response)

print()
print("Actual context: ")
print_context(response)

Based on the provided context information, here's what can be gathered about installing the ET 200SP R1 system:

**Requirements:** 

* Install the ET 200SP distributed I/O system in a suitable enclosure/control cabinet with sufficient mechanical strength and fire protection.
* Ensure the environmental conditions for operating the devices are taken into account.
* For vibration and shock loads, mechanically fix both ends of the ET 200SP system assembly to the mounting rail.

**Tools:** 

* No specific tools are mentioned in the context information. However, it is assumed that standard installation tools such as screwdrivers, wrenches, etc. would be required for the process.
* Optional tool: If vibration and shock loads are present, you may need 8WA1010-1PH01 ground terminals to mechanically fix both ends of the ET 200SP system assembly.

**Additional Information:** 

* The ET 200SP distributed I/O system is open equipment that must be installed in housings, cabinets, or electrical opera

In [118]:
#default: "create and refine" an answer by sequentially going through each retrieved Node; 
#This makes a separate LLM call per Node. Good for more detailed answers.

#tree_summarize: Given a set of Node objects and the query, recursively construct a tree and return the root node as the response. 
# Good for summarization purposes.
#response_mode = "tree_summarize" 

query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=node_postprocessors, #response_mode=response_mode
)
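
The difference between the two response modes described in the comments can be sketched with a stand-in for the LLM. This is a simplified illustration, not the LlamaIndex implementation: the fixed `fan_in` is an assumption (the library decides the batching itself), and `fake_refine_llm` and `fake_tree_llm` only concatenate text while counting calls:

```python
from typing import Callable, List


def refine(chunks: List[str], llm: Callable[[str, str], str]) -> str:
    """Sequential mode: one LLM call per chunk, each refining the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(answer, chunk)
    return answer


def tree_summarize(chunks: List[str], llm: Callable[[List[str]], str], fan_in: int = 2) -> str:
    """Tree mode: summarize groups of `fan_in` items until a single root remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    if not level:
        return ""
    while len(level) > 1:
        level = [llm(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


calls = {"refine": 0, "tree": 0}


def fake_refine_llm(previous: str, chunk: str) -> str:
    calls["refine"] += 1
    return (previous + " " + chunk).strip()


def fake_tree_llm(group: List[str]) -> str:
    calls["tree"] += 1
    return " ".join(group)


chunks = ["a", "b", "c", "d"]
print(refine(chunks, fake_refine_llm), calls["refine"])
print(tree_summarize(chunks, fake_tree_llm), calls["tree"])
```

In this sketch the sequential mode needs one call per chunk, while the tree mode needs one call per group on each level.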

### TODO: Explore structured output
https://docs.llamaindex.ai/en/stable/module_guides/querying/structured_outputs/query_engine/

This allows parsing the LLM output into an object, for example a manual entry with a string `section` and a string `text`.
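
A minimal sketch of such a target object, using only the standard library; `ManualEntry`, its field names, and the JSON shape of the LLM reply are assumptions, and the linked LlamaIndex guide describes its own mechanism:

```python
import json
from dataclasses import dataclass


@dataclass
class ManualEntry:
    """Hypothetical target object: one manual entry with its section and text."""
    section: str
    text: str


def parse_manual_entry(llm_output: str) -> ManualEntry:
    """Parse a JSON object with `section` and `text` keys into a ManualEntry."""
    data = json.loads(llm_output)
    return ManualEntry(section=str(data["section"]), text=str(data["text"]))


raw = (
    '{"section": "7.2 Installation conditions for motor starters", '
    '"text": "You can fit the motor starter vertically or horizontally."}'
)
entry = parse_manual_entry(raw)
print(entry.section)
print(entry.text)
```

A reply that is not valid JSON or lacks one of the keys raises an exception here, which makes malformed output easy to detect.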