# Chapter 4: Mess Management with Documents


>This notebook is based on the open-source project [wow-rag](https://github.com/datawhalechina/wow-rag) by Datawhale China.  
>I’ve adapted and annotated parts of it for personal learning and experimentation.


## 1. Introduction 

Before we can retrieve answers or build intelligent applications, we need to ensure that our document data is well-managed. This chapter focuses on the often-overlooked but essential tasks of **document ingestion**, **inspection**, and **modification** using `LlamaIndex`. Whether we're inserting new data, examining existing nodes, or attempting tricky operations like deletion and updates, understanding how the index handles documents and nodes is critical.

We’ll explore how to:

- View and inspect stored documents and their internal node structures.

- Add new nodes or re-ingest documents using transformation pipelines.

- Handle updates by replacing outdated nodes (since direct modification isn’t supported).

- Avoid common pitfalls, such as accidental deletion, that can break your pipeline.

Let’s roll up our sleeves and bring order to the mess. 


---

> ⚠️ **Warning: Real-World Document Management is Harder Than It Looks**

While this tutorial uses a **highly simplified dataset and workflow**, real-world RAG projects are far more demanding. 

In production, **document management can consume up to 80% of the total project effort**. This includes not only indexing, but also data cleaning, transformation, and retrieval optimization.


### 1.1 Core Responsibilities of a Document Loader

- **Document Format Parsing**  
  Convert content from formats such as PDF, DOCX, Markdown, or HTML into raw text.

- **Metadata Extraction**  
  Extract contextual metadata during parsing (e.g., page numbers, file origin, section titles).

- **Unified Data Structuring**  
  Normalize output into a unified schema (e.g., `TextNode`, `Document`) to enable consistent downstream processing.
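
The three responsibilities above can be sketched with the standard library alone. This is a minimal illustration, not how any particular loader is implemented: `LoadedDocument`, `parse_markdown`, `PARSERS`, and `load_file` are hypothetical names invented for this example, and the Markdown "parser" only strips heading markers.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class LoadedDocument:
    """Hypothetical unified schema: raw text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def parse_markdown(raw: str) -> str:
    """Toy Markdown parser: strip leading '#' heading markers, keep the text."""
    return "\n".join(line.lstrip("#").strip() for line in raw.splitlines())


# Format parsing: map each supported extension to a parser function.
PARSERS = {
    ".txt": lambda raw: raw,
    ".md": parse_markdown,
}


def load_file(path: Path) -> LoadedDocument:
    """Parse one file, extract metadata, and return the unified schema."""
    suffix = path.suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"Unsupported format: {suffix}")
    raw = path.read_text(encoding="utf-8")
    metadata = {
        "file_name": path.name,
        "file_type": suffix,
        "file_size": path.stat().st_size,
    }
    return LoadedDocument(text=PARSERS[suffix](raw), metadata=metadata)


# Usage: write a small Markdown file to a temporary directory and load it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.md"
    sample.write_text("# Title\nSome body text.", encoding="utf-8")
    doc = load_file(sample)

print(doc.metadata["file_name"], "->", repr(doc.text))
```

Because every parser returns the same `LoadedDocument` shape, downstream steps such as chunking or embedding do not need to know which format the text came from.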

---

### 1.2 Popular RAG Document Loaders and Tools

| Tool Name         | Key Features                                       | Suitable Scenarios                  | Performance Notes                   |
|-------------------|----------------------------------------------------|-------------------------------------|-------------------------------------|
| **PyMuPDF4LLM**   | PDF to Markdown, OCR, table recognition            | Research papers, technical manuals  | Open source, GPU acceleration       |
| **TextLoader**    | Basic plaintext loading                            | Plain text files                    | Lightweight and efficient           |
| **DirectoryLoader** | Batch processing across mixed formats            | Local knowledge bases               | Extensible to multiple file types   |
| **Unstructured**  | Multi-format parsing with smart layout detection   | PDFs, Word, HTML                    | Unified API, intelligent parsing    |
| **FireCrawlLoader** | Web content scraping                             | Online docs, news pages             | Real-time fetching                  |
| **LlamaParse**    | Deep PDF parsing with structural fidelity          | Legal contracts, academic papers    | High accuracy, commercial API       |
| **Docling**       | Modular, enterprise-level parsing                  | Contracts, internal reports         | IBM ecosystem integration           |
| **Marker**        | Fast PDF-to-Markdown with GPU support              | Scientific literature, books        | Optimized for PDF conversion        |
| **MinerU**        | Multimodal parsing (text + layout + vision)        | Scientific/financial reports        | Combines LayoutLMv3 + YOLOv8        |







## 2. Preparation

As in earlier chapters, we use `example.txt` as the document source and adopt **Method 1 from Chapter 3**: building a semantic index directly from documents with `VectorStoreIndex`. This approach is ideal for rapid prototyping or small-scale testing.

In [2]:
import os
from dotenv import load_dotenv


load_dotenv()
api_key = os.getenv('API_KEY')

base_url = "https://api.openai.com/v1"  
chat_model = "gpt-4.1-nano-2025-04-14"   
emb_model = "text-embedding-3-small"


from llama_index.llms.openai import OpenAI
llm = OpenAI(
    api_key = api_key,
    model = chat_model,
)


from llama_index.embeddings.openai import OpenAIEmbedding
embedding = OpenAIEmbedding(
    api_key = api_key,
    model = emb_model,
)
# Embed a short string as a quick check that the key and model work
emb = embedding.get_text_embedding("Hello")

# Method 1 from Chapter 3: build the index directly from documents

from llama_index.core import SimpleDirectoryReader, Document
documents = SimpleDirectoryReader(input_files=['./docs/example.txt']).load_data()

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents, embed_model=embedding)


## 3. Knowing the Index Internals

This section shows what happens behind the scenes after an index is built. Specifically:

- `index.docstore.docs` shows all the stored nodes (chunks of your documents).

- `index.index_struct.nodes_dict` maps internal node IDs for quick lookup.

- `index.ref_doc_info` shows document-level metadata and references.

- `index.docstore.get_node()` lets us retrieve and inspect a specific indexed chunk by its ID.

Together, these let us verify:

- How many chunks were created from our document,

- What metadata is attached,

- How the document and node IDs are tracked.

This step is useful for debugging, transparency, and gaining confidence in the structure before we perform queries or apply filters.
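
To make the relationship between these three views concrete, here is a toy model built from plain dictionaries. It is only a sketch of the bookkeeping, not LlamaIndex's implementation: `ToyNode` and `build_toy_index` are hypothetical names, and the three dicts merely mirror the shape of the outputs shown in the subsections that follow.

```python
from dataclasses import dataclass
import uuid


@dataclass
class ToyNode:
    """Hypothetical stand-in for a chunk: its own ID, its source document, its text."""
    node_id: str
    ref_doc_id: str
    text: str


def build_toy_index(doc_id: str, chunks: list[str]):
    """Return three dicts loosely mirroring docs, nodes_dict, and ref_doc_info."""
    docs = {}                    # node ID -> node object
    nodes_dict = {}              # index entry -> node ID
    ref_doc_info = {doc_id: []}  # source document ID -> list of node IDs
    for chunk in chunks:
        node = ToyNode(node_id=str(uuid.uuid4()), ref_doc_id=doc_id, text=chunk)
        docs[node.node_id] = node
        nodes_dict[node.node_id] = node.node_id
        ref_doc_info[doc_id].append(node.node_id)
    return docs, nodes_dict, ref_doc_info


docs, nodes_dict, ref_doc_info = build_toy_index(
    "example.txt", ["first chunk", "second chunk"]
)

# Walk from the document level down to each chunk's text.
print("chunks created:", len(docs))
for doc_id, node_ids in ref_doc_info.items():
    for node_id in node_ids:
        print(doc_id, "->", node_id[:8], "->", docs[node_id].text)
```

The point of the sketch is that the same node ID appears in all three structures, which is why they have to stay in sync when nodes are added or removed.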






### 3.1 View all nodes in the docstore

`index.docstore.docs` is a dictionary that maps each **Node ID** to its corresponding `TextNode` object.   

Each `TextNode` contains the chunked content, metadata, and optionally embedding info.

In [12]:
print(index.docstore.docs)

{'073ef4c9-ef91-4fc0-9851-eae86bc4ecc5': TextNode(id_='073ef4c9-ef91-4fc0-9851-eae86bc4ecc5', embedding=None, metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='7f19020e-21bb-4786-adb0-6f2def2d8a3e', node_type='4', metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, hash='a2b68e6316c64e022863136a58cae33d2dc26e9576e335f9ae1d8007a21aba56')}, metadata_template='{key}: {value}', metadata_separator='\

In [16]:
from pprint import pprint
import re

def print_node(node, show_full_text=False, wrap_sentences=True):
    """
    Pretty-print a TextNode with metadata, relationships, and formatted content.

    Parameters:
    - node: TextNode object
    - show_full_text (bool): Whether to show the full text content
    - wrap_sentences (bool): If True, add line breaks after each period
    """

    print(" Node ID:", node.node_id if hasattr(node, 'node_id') else node.id_)

    print("\n Metadata:")
    pprint(node.metadata)

    print("\n Relationships:")
    pprint(node.relationships)

    print("\n Text Content:\n")
    if wrap_sentences:
        text = re.sub(r'\.\s+', '.\n', node.text.strip())
    else:
        text = node.text.strip()

    # Only add the truncation marker when something was actually cut off
    if show_full_text or len(text) <= 1000:
        print(text)
    else:
        print(text[:1000] + "\n... [truncated]\n")



In [20]:
print_node(list(index.docstore.docs.values())[0], show_full_text=True)

 Node ID: 073ef4c9-ef91-4fc0-9851-eae86bc4ecc5

 Metadata:
{'creation_date': '2025-07-19',
 'file_name': 'example.txt',
 'file_path': 'docs\\example.txt',
 'file_size': 3355,
 'file_type': 'text/plain',
 'last_modified_date': '2025-07-19'}

 Relationships:
{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='7f19020e-21bb-4786-adb0-6f2def2d8a3e', node_type='4', metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, hash='a2b68e6316c64e022863136a58cae33d2dc26e9576e335f9ae1d8007a21aba56')}

 Text Content:

Multimodal Agent AI systems have many applications.
In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to re-play, paraphrase, action prediction or synthesize 3D or 2D scenario.
Fundamental advances in agent AI help contribute towards the

### 3.2 View all node IDs in the index

In [25]:
print(index.index_struct.nodes_dict)

{'073ef4c9-ef91-4fc0-9851-eae86bc4ecc5': '073ef4c9-ef91-4fc0-9851-eae86bc4ecc5'}


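**Why multiple nodes?**  

LlamaIndex splits long documents into smaller chunks so that each chunk fits within the token limit of the embedding model (such as OpenAI's `text-embedding-3-small`). Each chunk is stored as its own node. In the run above, `example.txt` is small enough to fit in a single chunk, so only one node ID appears; a longer file would produce several nodes, each with its own ID.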
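
The splitting idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm LlamaIndex uses: `split_into_chunks` is a hypothetical helper, it counts words as a rough stand-in for tokens, and it ignores chunk overlap.

```python
import re


def split_into_chunks(text: str, max_words: int = 20) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue  # skip the empty string produced by empty input
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Multimodal agents have many applications. "
    "They can assist in productivity tools. "
    "They can also help generate content for bots. "
    "Advances in agent AI build on these ideas."
)
for i, chunk in enumerate(split_into_chunks(text, max_words=12), start=1):
    print(f"Chunk {i}: {chunk}")
```

With a 12-word budget the four sentences end up in three chunks; raising `max_words` far enough would leave a single chunk, which is the situation in this chapter's small example file.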

```
[ Example.txt ]
 ├── Chunk 1 → Node ID: ae2f75...
 ├── Chunk 2 → Node ID: adda4a...
 └── Chunk 3 → Node ID: e81b06...
```

### 3.3 View all document references

In [26]:
print(index.ref_doc_info)


{'7f19020e-21bb-4786-adb0-6f2def2d8a3e': RefDocInfo(node_ids=['073ef4c9-ef91-4fc0-9851-eae86bc4ecc5'], metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'})}


### 3.4 View a node by its ID

In [11]:
index.docstore.get_node('e1ab309c-8f22-4900-bd46-b740b8a5fc5e')  # any node ID from the previous step

TextNode(id_='e1ab309c-8f22-4900-bd46-b740b8a5fc5e', embedding=None, metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='bf6c6d9b-81cf-4370-a9df-5f97304fbcaa', node_type='4', metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, hash='a2b68e6316c64e022863136a58cae33d2dc26e9576e335f9ae1d8007a21aba56')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Multimodal Agent AI systems hav

## 4. Modifying Your Vector Index: Add & Delete Operations

In this section, we explore how to manually update the vector index by either deleting a document or inserting new nodes.


**⚠️ Avoid using the delete operation unless absolutely necessary.
Improper deletion can lead to inconsistencies or runtime errors in later code blocks.**

While adding new nodes can be useful during incremental updates, deletion might break relationships between nodes and cause the retriever or query engine to fail. For stable workflows, it’s usually better to rebuild the index from scratch when document changes are needed.
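
One plausible way such a failure can arise is a dangling reference: a node is removed from one structure while another still points at it. The sketch below reproduces that situation with plain dictionaries. The names `docstore`, `nodes_dict`, and `retrieve_all` are toy stand-ins chosen for this illustration, not LlamaIndex objects, and the real library may fail differently.

```python
# Toy stand-ins for the two structures (not LlamaIndex classes).
docstore = {"n1": "chunk one", "n2": "chunk two"}  # node ID -> text
nodes_dict = {"n1": "n1", "n2": "n2"}              # index entry -> node ID


def retrieve_all(nodes_dict: dict, docstore: dict) -> list[str]:
    """Resolve every index entry to its text, roughly as a retriever would."""
    return [docstore[node_id] for node_id in nodes_dict.values()]


# Delete from the docstore only; the index entry is left behind.
del docstore["n1"]

try:
    retrieve_all(nodes_dict, docstore)
    lookup_failed = False
except KeyError as missing:
    lookup_failed = True
    print("dangling reference to node", missing)

# Removing the entry from both structures keeps them consistent.
nodes_dict.pop("n1")
print(retrieve_all(nodes_dict, docstore))
```

Rebuilding the index from scratch avoids this class of problem because every structure is regenerated from the same set of documents.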



### 4.1 Delete Node

In [None]:
# index.docstore.delete_document('51595901-ebe3-48b5-b57b-dc8794ef4556')

### 4.2 Add Node

In [None]:
# index.insert_nodes([doc_single])

Note: `doc_single` must be a `TextNode` object, such as the one we saw earlier when viewing a node.  

You can also construct a `TextNode` manually. Here's how:

In [27]:
from llama_index.core.schema import TextNode
nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
            "year": 1972,
        },
    )
]
index.insert_nodes(nodes)

It's also possible to build `TextNode` objects from a document by running it through a transformation pipeline, as in the previous chapter:

In [31]:

from llama_index.core import SimpleDirectoryReader,Document
documents = SimpleDirectoryReader(input_files=['./docs/another_example.txt']).load_data()
from llama_index.core.node_parser import SentenceSplitter
transformations = [SentenceSplitter(chunk_size = 512)]

from llama_index.core.ingestion.pipeline import run_transformations
nodes = run_transformations(documents, transformations=transformations)
index.insert_nodes(nodes)
print(nodes)

[TextNode(id_='80daf58b-f469-4e45-bf01-a126e84c0f7d', embedding=None, metadata={'file_path': 'docs\\another_example.txt', 'file_name': 'another_example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-21', 'last_modified_date': '2025-07-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='03bbc8b7-7e41-47a0-86fb-9fdbdff3b1b4', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'docs\\another_example.txt', 'file_name': 'another_example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-21', 'last_modified_date': '2025-07-19'}, hash='e21b18d7025601f845c7287c1f5cfc20cbd20d88aabe3050302d7a9742561127'), <NodeRelationship.NEXT: '3'>: RelatedNodeIn

In [36]:
print_node(list(index.docstore.docs.values())[3], show_full_text=True)

 Node ID: f020803c-4af5-487e-b433-1dccaf7bbf13

 Metadata:
{'creation_date': '2025-07-21',
 'file_name': 'another_example.txt',
 'file_path': 'docs\\another_example.txt',
 'file_size': 3355,
 'file_type': 'text/plain',
 'last_modified_date': '2025-07-19'}

 Relationships:
{<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='ccc33b28-2732-4681-bc18-d82b47233275', node_type='4', metadata={'file_path': 'docs\\another_example.txt', 'file_name': 'another_example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-21', 'last_modified_date': '2025-07-19'}, hash='e21b18d7025601f845c7287c1f5cfc20cbd20d88aabe3050302d7a9742561127'),
 <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='cdd709db-f588-4fc7-a22b-0bfddd434145', node_type='1', metadata={}, hash='5128f538caf583931f5dcc7bd0d013dd356a78f5eaf21fb9dae1dba236603900')}

 Text Content:

Multimodal Agent AI systems have many applications.
In addition to interactive AI, grounded multimodal models could help driv

### 4.3 Updating Existing Content: No Direct Edit, Only Replace

`LlamaIndex` does not currently support editing a node in place.
To change an existing node, the recommended approach is:

1. Delete the old node using its node ID: `index.docstore.delete_document('<your_node_id>')`
2. Create a new `TextNode` with the updated content or metadata.
3. Insert the new node: `index.insert_nodes([your_new_node])`
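
The three steps above can be sketched as a single helper on a toy store. This is an illustration of the delete-then-insert pattern only: `replace_node` is a hypothetical function, and `docs` and `ref_doc_info` are plain dictionaries shaped like the outputs in Section 3, not LlamaIndex objects.

```python
import uuid


def replace_node(docs: dict, ref_doc_info: dict, old_id: str, new_text: str) -> str:
    """Delete the node old_id and insert a replacement; return the new node ID."""
    old = docs.pop(old_id)                                           # step 1: delete
    new_id = str(uuid.uuid4())
    new_node = {"text": new_text, "ref_doc_id": old["ref_doc_id"]}   # step 2: create
    docs[new_id] = new_node                                          # step 3: insert
    # Keep the document-level bookkeeping pointing at the new node.
    node_ids = ref_doc_info[old["ref_doc_id"]]
    node_ids[node_ids.index(old_id)] = new_id
    return new_id


docs = {"n1": {"text": "outdated sentence", "ref_doc_id": "example.txt"}}
ref_doc_info = {"example.txt": ["n1"]}

new_id = replace_node(docs, ref_doc_info, "n1", "corrected sentence")
print(docs[new_id]["text"], ref_doc_info)
```

Note that the replacement gets a fresh ID, so anything that cached the old node ID has to be refreshed after an update.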

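## 5. Loading PDF Documents

Import `PDFReader` and load a PDF file.

In [10]:
from llama_index.readers.file import PDFReader  # from the llama-index-readers-file package

documents = PDFReader().load_data("./docs/PDFexample.pdf")
print(f"Loaded {len(documents)} document(s)")
print(documents[0].text[:1000])  # Preview first 1000 characters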

Loaded 1 document(s)
Agent AI:
Surveying the Horizons of Multimodal Interaction A PREPRINT
Thirdly, the various elements of our event, including the expert presentations, informative posters, and notably the
winners of our two leader-board, are set to offer a substantive yet succinct overview of the latest and significant trends,
research directions, and innovative concepts in the realm of multimodal agents. These presentations will encapsulate
pivotal findings and developments, shining a light on new systems, ideas, and technologies in the field of mulitmodal
agent AI. This assortment of knowledge is not only beneficial for the attendees of our forum, who are looking to
deepen their understanding and expertise in this domain, but it also serves as a dynamic and rich resource board. Those
visiting our forum’s website can tap into this reservoir of information to discover and understand the cutting-edge
advancements and creative ideas steering the future of multimodal agent AI. We striv