# Ingestion Pipeline

# 01 - Load and Prepare ESG/Biodiversity Documents

This notebook loads ESG and biodiversity-related PDFs and prepares them for embedding. It uses LlamaIndex to chunk and structure the documents.

### Install required packages
```bash
pip install llama-index chromadb PyMuPDF python-dotenv
```

In [1]:
from llama_index.core import SimpleDirectoryReader
from pathlib import Path
import os
from dotenv import load_dotenv
from llama_index.core.schema import Document
from llama_index.core.node_parser import SimpleNodeParser
load_dotenv()

True

### Load PDFs from your local `/data/raw/` directory

In [2]:
# Set path to your ESG/Biodiversity documents
doc_path = Path('../data/raw')

documents = SimpleDirectoryReader(
    input_dir=doc_path,
    recursive=False,  # Do not load subdirectories
    file_extractor={'pdf': 'pymupdf'},  # Extract full text from PDFs
    filename_as_id=True  # Ensure document metadata includes filename
).load_data()

# 🔹 2️⃣ Merge multiple pages into a single document per file
merged_documents = {}
for doc in documents:
    filename = doc.metadata.get('file_name', 'unknown')
    if filename not in merged_documents:
        merged_documents[filename] = Document(
            text=doc.text, metadata=doc.metadata
        )
    else:
        merged_documents[filename] = Document(
            text=merged_documents[filename].text + "\n" + doc.text,  # Merge text
            metadata=merged_documents[filename].metadata  # Preserve metadata
        )

# Convert dictionary back to a list of single-document PDFs
documents = list(merged_documents.values())

print(f"Loaded {len(documents)} full documents.") 

# Apply controlled chunking: break docs into 3-sentence segments
parser = SimpleNodeParser.from_defaults(
    chunk_size=1024,      # Target ~1024 characters
    chunk_overlap=200     # Allow 200 characters of overlap to retain context
)

chunked_documents = [parser.get_nodes_from_documents([doc]) for doc in documents]
flattened_docs = [node for sublist in chunked_documents for node in sublist]

print(f"Chunked into {len(chunked_documents)} total chunks.")

Loaded 11 full documents.
Chunked into 11 total chunks.


### Preview the first few chunks

In [3]:
# Preview first few
# Step 5: Preview first few chunks
for i, chunk in enumerate(flattened_docs[:5]):
    print(f"\n📄 Chunk {i+1} (From: {chunk.metadata.get('file_name', 'Unknown')})")
    print(chunk.text[:500])
    print('-' * 50)


📄 Chunk 1 (From:  How investors and corporates are approaching natural capital _ LinkedIn.pdf)
Critical thinking13,427 subscribers Subscribed
How investors and corporates areapproaching natural capital
Mercer - Investments36,369 followers
September 27, 2024
Cara Williams, Global Head of ESG, Climate and Sustainability
As the sustainability agenda evolves, corporates and investors alike aregrappling with the complexity of valuing natural capital. The transitionfrom the Taskforce on Climate-related Financial Disclosures (TCFD) tothe Taskforce on Nature-related Financial Disclosures (TNFD) i
--------------------------------------------------

📄 Chunk 2 (From:  How investors and corporates are approaching natural capital _ LinkedIn.pdf)
The first step towards TNFD alignment is measurement. Investors andcorporates can undertake an initial health check of their portfolio and/oroperations to understand areas in which their investments and/oroperations are negatively impacting natural capital