In [20]:
from src.parsers import DoclingParser, DoclingNativeChunker
from src.retrievers import BM25sRetriever, ChromaDBRetriever
import os
from typing import Optional

## Parse documents

In this example we parse and index the ESG/EFRAG documentation. The PDF files are located in the `../data/ESG Documentation/EFRAG/` directory (relative to this notebook) and cover European Financial Reporting Advisory Group (EFRAG) sustainability reporting standards.

The data includes:
- IG 1 Materiality Assessment
- ESRS-ISSB standards interoperability guidance
- ED_ESRS_AP5 document

We'll use the DoclingParser to parse these documents and then apply different chunking strategies using TextChunker or DoclingNativeChunker. After chunking, we'll create retrievers (BM25sRetriever and ChromaDBRetriever) to efficiently store and search through this documentation.

In [2]:
data_path = "../data/ESG Documentation/EFRAG/"
# Get all files in the data_path directory and its subdirectories
documentation_path = []
for root, dirs, files in os.walk(data_path):
    for file in files:
        documentation_path.append(os.path.join(root, file))

# Display the first few file paths
print(f"Found {len(documentation_path)} files")
if documentation_path:
    print("Sample file paths:")
    for path in documentation_path[:5]:
        print(path)

Found 3 files
Sample file paths:
../data/ESG Documentation/EFRAG/IG 1 Materiality Assessment_final.pdf
../data/ESG Documentation/EFRAG/esrs-issb-standards-interoperability-guidance.pdf
../data/ESG Documentation/EFRAG/ED_ESRS_AP5.pdf
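
The walk above collects every file under the directory, so a stray non-PDF file would also be handed to the parser. A minimal sketch of a stricter variant, assuming only PDFs should be parsed; `find_pdfs` is a hypothetical helper and not part of the project code:

```python
from pathlib import Path


def find_pdfs(root: str) -> list[str]:
    """Return the sorted paths of all PDF files under `root` (recursive).

    Hypothetical helper: the suffix check is case-insensitive, so both
    `report.pdf` and `REPORT.PDF` are kept, while other files are skipped.
    """
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Sorting the paths also makes the document order reproducible across runs, which `os.walk` alone does not guarantee.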


Define the parser

In [3]:
parser = DoclingParser()

In [4]:
parsed_documents = parser.parse(documentation_path)



In [8]:
print(f"Parsed {len(parsed_documents)} documents")
print("Sample parsed documents:")
print(f"First document: {parsed_documents[0].filename}")
print(parsed_documents[0].text[100:150])

Parsed 3 documents
Sample parsed documents:
First document: IG 1 Materiality Assessment_final.pdf
of

55

## Disclaimer

This  implementation  guida


## Chunk Documents for Retrieval

After parsing the ESG/EFRAG documentation files, we need to prepare them for efficient retrieval. This involves chunking the documents into manageable pieces that can be indexed and searched.

The chunking process is crucial because:
- It breaks down large documents into smaller, more focused segments
- It preserves context through metadata
- It enables more precise retrieval of relevant information
- It optimizes performance of retrieval systems

In the following cells, we'll use DoclingNativeChunker to segment our documents and then index these chunks using both BM25sRetriever and ChromaDBRetriever to enable efficient search capabilities.
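
To make the idea of chunking concrete, here is a minimal sketch of a fixed-size, character-based splitter with overlap. It is only an illustration of the general technique: `chunk_text` is a hypothetical function, and it is not the implementation of `TextChunker` or `DoclingNativeChunker`, whose internals are not shown in this notebook.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

A structure-aware chunker would instead split along headings, paragraphs, and tables, which is why the chunks shown later in this notebook start with a section title.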

Define the chunker (chunk strategy)

In [None]:
chunker = DoclingNativeChunker()
# alternative chunker
# chunker = TextChunker(chunk_size=1000, chunk_overlap=100)

In [7]:
docling_chunks = parser.chunk_documents(parsed_documents, chunker)

Token indices sequence length is longer than the specified maximum sequence length for this model (3320 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (711 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1090 > 512). Running this sequence through the model will result in indexing errors


In [10]:
print(f"Parsed {len(parsed_documents)} documents into {len(docling_chunks)} chunks")

Parsed 3 documents into 309 chunks


In [13]:
docling_chunks[100]

Chunk(text="FAQ 15: Do the ESRS mandate to actively engage in dialogue with affected stakeholders for the materiality assessment process?\n197. The ESRS require disclosure on the materiality assessment and its outcomes but do not mandate specific behaviour on stakeholder engagement or the due diligence process.\n198. However, ESRS 1 paragraph 45 states that the impact materiality assessment is informed by the undertaking's due diligence process. In addition, ESRS 1 paragraph 24 points to affected stakeholders' engagement as central to the materiality assessment. Engagement with affected stakeholders is a tool that supports the undertaking's business processes (for example, due diligence) as well  as  the  management  of  sustainability  matters.  The  undertaking,  when preparing  its  sustainability  statement,  can  leverage  its  engagement  with affected stakeholders per its due diligence process, if applicable.\n199. Stakeholder  engagement  informs  the  identification and  asses

Now let's define the retrievers

In [11]:
bm25s_retriever = BM25sRetriever(index_path="bm25s_index_notebook01")
chromadb_retriever = ChromaDBRetriever(persist_directory="chromadb_index_notebook01")

In [14]:
docs = [chunk.text for chunk in docling_chunks]
metas = [chunk.metadata for chunk in docling_chunks]

In [15]:
bm25s_retriever.add_documents(docs, metas)

In [16]:
chromadb_retriever.add_documents(docs, metas)

Persist the changes

In [48]:
bm25s_retriever.save("bm25s_index_notebook01")
chromadb_retriever.save("chromadb_index_notebook01")

## Test retrievers

Some generated queries

In [41]:
# Define sample queries
sample_queries = [
    # Relevant queries
    # "What is double materiality?",
    # "How to implement materiality assessment?",
    "ESRS sustainability reporting requirements",
    # "Financial materiality vs impact materiality",
    "Value chain considerations in ESG reporting",
    # "Stakeholder engagement in sustainability reporting",
    # "EFRAG guidance on materiality assessment",
    "IROs in sustainability reporting",
    # Less relevant queries
    "Carbon footprint calculation methodologies",
    "ESG investment strategies",
    # Non-relevant queries
    "Recipe for chocolate cake",
    "Best hiking trails in Europe",
]

Define helper functions

In [None]:
def print_results(indices, scores, docs):
    """Helper function to print retrieval results."""
    print(f"Found {len(indices)} results:")
    for i, (idx, score, doc) in enumerate(zip(indices, scores, docs)):
        # Truncate long documents for display
        doc_display = doc[:200]
        # doc_display = doc if len(doc) < 60 else doc[:57] + "..."
        print(f"  {i + 1}. [ID: {idx}, Score: {score:.4f}] {doc_display}")


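def compare_retrievers(
    query: str,
    topk: int,
    metadata: Optional[dict] = None,
    threshold: Optional[float] = None,
):
    """Run the same query against both retrievers and print the results.

    Note: `threshold` is accepted but currently unused (see the commented-out lines).
    """
    print(f"Query: {query}")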
    print("BM25s Results:")
    indices, scores, docs = bm25s_retriever.retrieve(
        query=query,
        top_k=topk,
        metadata_filter=metadata,
        # threshold=threshold*10,    # normalize threshold!
    )
    print_results(indices, scores, docs)
    print("=========================")
    print("ChromaDB Results:")
    indices, scores, docs = chromadb_retriever.retrieve(
        query=query,
        top_k=topk,
        metadata_filter=metadata,
        # threshold=threshold,
    )
    print_results(indices, scores, docs)

Test 1: What is the definition of carbon credit? Retrieving 3 chunks

In [43]:
compare_retrievers(query="What is the definition of carbon credit", topk=3)

Query: What is the definition of carbon credit
BM25s Results:
Found 3 results:
  1. [ID: 159, Score: 6.5858] 4.1 Choices to be made for an entity starting with ISSB Standards
Explanation, (iii) Carbon credits = Definition of carbon credit. , (iii) Carbon credits = Paragraph 36(e) of IFRS S2 requires an entit
  2. [ID: 160, Score: 5.1806] 4.1 Choices to be made for an entity starting with ISSB Standards
E1 should be aware that non-verified or carbon credits verified under schemes not recognised as quality standards,or carbon credits th
  3. [ID: 153, Score: 5.1246] Section 3. ESRS to IFRS S2 (climate):  information that an entity starting with ESRS needs to know when also applying ISSB Standards to enable compliance with both sets of standards
Explanation, (vii)
ChromaDB Results:




Found 3 results:
  1. [ID: 159, Score: 0.6197] 4.1 Choices to be made for an entity starting with ISSB Standards
Explanation, (iii) Carbon credits = Definition of carbon credit. , (iii) Carbon credits = Paragraph 36(e) of IFRS S2 requires an entit
  2. [ID: 185, Score: 0.5947] Greenhouse gas removals
Carbon credits, Reference to.IFRS S2 = . Carbon credits, TABLE 4.2.2 Requirements not covered by IFRS S2. = . ESRS E1.59 and AR61-AR63, Reference to.IFRS S2 = IFRS S2.36(e)(i)-
  3. [ID: 153, Score: 0.5445] Section 3. ESRS to IFRS S2 (climate):  information that an entity starting with ESRS needs to know when also applying ISSB Standards to enable compliance with both sets of standards
Explanation, (vii)


Test 2: What is Double Materiality? Retrieving 3 chunks

In [44]:
compare_retrievers(
    query="What is double materiality?",
    topk=3,
)

Query: What is double materiality?
BM25s Results:
Found 3 results:
  1. [ID: 10, Score: 2.4930] Table of contents
materiality...............................................................................37, 1 = . FAQ 1: Is impact materiality based on materiality for the undertaking or for stake
  2. [ID: 20, Score: 2.4515] Table of contents
IROs?.............................................................................................................................................51 FAQ 23: Are remediation and mitig
  3. [ID: 125, Score: 2.3326] 1.1 Materiality
This alignment means that in assessing whether a particular disclosure is considered material in  applying  ISSB  Standards,  that  assessment  is  aligned  with  the  assessment  of  
ChromaDB Results:




Found 3 results:
  1. [ID: 208, Score: 0.5823] ESRS E1 § 46 to 56- Double materiality as the basis for sustainability disclosures
Double materiality is a concept which provides criteria for determination of whether a sustainability matter has to b
  2. [ID: 30, Score: 0.5222] 2. The ESRS approach to materiality
23. The  ESRS  require  that  the  sustainability  statement  include  sustainability information  related  to  material  IROs  identified  through  a  MA  process 
  3. [ID: 90, Score: 0.5180] FAQ 9: How to consider time horizon in the double materiality analysis?
- 177. A sustainability matter might be material  from  an  impact  or  financial perspective over the short-, mediumor long-ter
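
One simple way to quantify how much the two retrievers agree is the overlap of the chunk IDs they return. The sketch below is an illustration only: `result_overlap` is a hypothetical helper, and the IDs are copied by hand from the Test 1 and Test 2 outputs above.

```python
def result_overlap(ids_a, ids_b):
    """Return the shared chunk IDs and the Jaccard similarity of two result lists."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Test 1 ("carbon credit"): BM25s IDs vs ChromaDB IDs
print(result_overlap([159, 160, 153], [159, 185, 153]))  # ([153, 159], 0.5)

# Test 2 ("double materiality"): BM25s IDs vs ChromaDB IDs
print(result_overlap([10, 20, 125], [208, 30, 90]))  # ([], 0.0)
```

For the keyword-heavy query in Test 1 the retrievers share two of three chunks, while for Test 2 they share none: BM25s matches table-of-contents entries that repeat the word "materiality", and ChromaDB returns chunks that describe the concept.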


Test 3: What is Double Materiality? Retrieving 3 chunks from the "IG 1 Materiality Assessment_final.pdf" document

In [45]:
compare_retrievers(
    query="What is double materiality?",
    topk=3,
    metadata={"file": "IG 1 Materiality Assessment_final.pdf"},
)

Query: What is double materiality?
BM25s Results:
Found 3 results:
  1. [ID: 10, Score: 2.4930] Table of contents
materiality...............................................................................37, 1 = . FAQ 1: Is impact materiality based on materiality for the undertaking or for stake
  2. [ID: 20, Score: 2.4515] Table of contents
IROs?.............................................................................................................................................51 FAQ 23: Are remediation and mitig
  3. [ID: 94, Score: 2.2479] Example of severity
If the  undertaking  concludes,  based  on qualitative criteria, that an impact connected  to  the  undertaking  sits  on  the edge dividing what is material from what is non-mater
ChromaDB Results:




Found 3 results:
  1. [ID: 30, Score: 0.5222] 2. The ESRS approach to materiality
23. The  ESRS  require  that  the  sustainability  statement  include  sustainability information  related  to  material  IROs  identified  through  a  MA  process 
  2. [ID: 90, Score: 0.5180] FAQ 9: How to consider time horizon in the double materiality analysis?
- 177. A sustainability matter might be material  from  an  impact  or  financial perspective over the short-, mediumor long-ter
  3. [ID: 5, Score: 0.4656] Table of contents
used........................................................................................................................, 1 = 8. 2. The ESRS approach to materiality..............
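
With the filter applied, chunks 125 (BM25s) and 208 (ChromaDB) from Test 2 no longer appear. The retrievers' filtering code is not shown in this notebook; the sketch below only illustrates one plausible behaviour, assuming exact-match semantics on every key of the filter. `matches_filter` and `filter_candidates` are hypothetical names, and the sample metadata is made up apart from the `file` key and the file names.

```python
from typing import Optional


def matches_filter(metadata: dict, metadata_filter: Optional[dict]) -> bool:
    """True if every key/value pair in the filter is present in the metadata."""
    if not metadata_filter:
        return True  # No filter means every chunk is a candidate.
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filter_candidates(metadatas: list, metadata_filter: Optional[dict]) -> list:
    """Return the positions of the chunks whose metadata passes the filter."""
    return [i for i, meta in enumerate(metadatas) if matches_filter(meta, metadata_filter)]


# Illustrative metadata only; real chunks may carry more keys.
sample_metas = [
    {"file": "IG 1 Materiality Assessment_final.pdf"},
    {"file": "ED_ESRS_AP5.pdf"},
    {"file": "IG 1 Materiality Assessment_final.pdf"},
]
print(filter_candidates(sample_metas, {"file": "IG 1 Materiality Assessment_final.pdf"}))
# [0, 2]
```

Only the surviving candidates would then be scored and ranked, so the top-k results can change even though the query is identical.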


Test 4: Best hiking trails in Europe? Retrieving 3 chunks from the "IG 1 Materiality Assessment_final.pdf" document

This test shows how the retrievers behave with an irrelevant query. Both still return 3 chunks, but with lower scores than in the previous tests (BM25s tops out at 1.7557 and ChromaDB at 0.1700). A score threshold would let us return only the chunks that are likely to be relevant to the query.

In [None]:
compare_retrievers(
    query="Best hiking trails in Europe",
    topk=3,
    metadata={"file": "IG 1 Materiality Assessment_final.pdf"},
    # threshold=0.5
)

Query: Best hiking trails in Europe
BM25s Results:
Found 3 results:
  1. [ID: 96, Score: 1.7557] FAQ 12: Should the materiality assessment be documented/evidenced?
186. The ESRS do not prescribe specific documentation as this is outside its remit, but it is reasonable to expect a certain level of
  2. [ID: 99, Score: 1.5586] FAQ 14: Will the implementation of sector-specific standards create any new subtopics or sub-subtopics to be considered in the materiality assessment?
194. Yes, it may. The sector-specific standards w
  3. [ID: 51, Score: 1.4647] Understanding of affected stakeholders
in  ESRS  S1  paragraph  AR73  may  help  to  assess  whether  the  sub-subtopic 'adequate wages' is  material.  It  is  equally  important  for  the  undertakin
ChromaDB Results:




Found 3 results:
  1. [ID: 3, Score: 0.1700] About EFRAG
EFRAG's  mission  is  to  serve  the  European  public  interest  in  both  financial  and sustainability reporting by developing and promoting European views in the field of corporate  re
  2. [ID: 23, Score: 0.1301] Summary in 13 key points
(b) identification  of  actual  and  potential  IROs  related  to  sustainability matters;
(c) assessment and  determination of the material IROs related to sustainability mat
  3. [ID: 117, Score: 0.1281] FAQ 25: What is the relationship between taxonomy eligible activities and materiality?
236. The  EU  Taxonomy  Regulation  and  its  Delegated  Acts  define  criteria  for  a number of economic activi
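
The thresholding idea from Test 4 can be sketched as a post-processing step on the `(indices, scores, docs)` triple that `retrieve` returns. This is an assumption-laden illustration: `apply_threshold` is a hypothetical helper, the cut-off of 0.5 is taken from the commented-out `threshold=0.5` above, and the scores are the ChromaDB scores copied from the Test 2 and Test 4 outputs.

```python
def apply_threshold(indices, scores, docs, threshold):
    """Keep only the results whose score is at least `threshold`."""
    kept = [
        (idx, score, doc)
        for idx, score, doc in zip(indices, scores, docs)
        if score >= threshold
    ]
    if not kept:
        return [], [], []
    kept_indices, kept_scores, kept_docs = zip(*kept)
    return list(kept_indices), list(kept_scores), list(kept_docs)


# ChromaDB scores for "What is double materiality?" (Test 2); doc text shortened.
relevant = apply_threshold(
    [208, 30, 90], [0.5823, 0.5222, 0.5180], ["chunk 208", "chunk 30", "chunk 90"], 0.5
)
print(relevant[0])  # [208, 30, 90]

# ChromaDB scores for "Best hiking trails in Europe" (Test 4); doc text shortened.
irrelevant = apply_threshold(
    [3, 23, 117], [0.1700, 0.1301, 0.1281], ["chunk 3", "chunk 23", "chunk 117"], 0.5
)
print(irrelevant[0])  # []
```

The same cut-off would not carry over to BM25s, whose scores are on a different scale: the third BM25s result in Test 3 scores 2.2479, while the top result for the hiking query scores 1.7557. Each retriever therefore needs its own threshold, which is presumably what the `normalize threshold!` comment in `compare_retrievers` alludes to.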
