# Chapter 4: Mess Management with Documents


>This notebook is based on the open-source project [wow-rag](https://github.com/datawhalechina/wow-rag) by Datawhale China.  
>I’ve adapted and annotated parts of it for personal learning and experimentation.


## 1. Introduction 

Before we can retrieve answers or build intelligent applications, we need to make sure our document data is well managed. This chapter focuses on the often-overlooked but essential tasks of **document ingestion**, **inspection**, and **modification** using `LlamaIndex`. Whether we're inserting new data, examining existing nodes, or attempting tricky operations like deletions and updates, understanding how the index handles documents and nodes is critical.

We’ll explore how to:

- View and inspect stored documents and their internal node structures.

- Add new nodes or re-ingest documents using transformation pipelines.

- Handle updates by replacing outdated nodes (since direct modification isn’t supported).

- Avoid common pitfalls—like accidental deletion—that can break your pipeline.

Let’s roll up our sleeves and bring order to the mess. 


---

> ⚠️ **Warning: Real-World Document Management is Harder Than It Looks**

While this tutorial uses a **highly simplified dataset and workflow**, real-world RAG projects are far more demanding. 

In production, **document management can consume up to 80% of the total project effort**. This includes not only indexing, but also data cleaning, transformation, and retrieval optimization.

Here’s what makes document management so challenging in practice:

###  1. Complex Document Formats
- Input data may come from PDFs, Word files, scanned images, websites, or databases.
- Some require OCR or DOM parsing, which adds complexity.

###  2. Smart Chunking is Critical
- Chunk size and overlap must balance **context preservation** with **embedding efficiency**.
- Poor chunking leads to semantic drift and irrelevant retrieval results.
- In Chinese (or multilingual) documents, sentence segmentation is even more difficult.
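
To make the size/overlap trade-off concrete, here is a minimal word-based sliding-window sketch. It is not how LlamaIndex's own splitters work internally; `chunk_words` and its parameters are hypothetical names used only for illustration.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word windows of `chunk_size` that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

A larger overlap repeats more context across neighbouring chunks, at the cost of more chunks to embed and store.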

###  3. Embedding Quality Drives Results
- Choosing the right embedding model (e.g., BGE, m3e, text2vec) is crucial.
- Inclusion of metadata (title, section headers) often improves semantic clarity.
- Embeddings must be recomputed whenever chunks are updated or split differently.
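
One common way to decide which chunks need re-embedding is to compare content hashes. The sketch below is a standalone illustration, not LlamaIndex's internal mechanism; `content_hash`, `chunks_to_reembed`, and the chunk IDs are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_reembed(stored_hashes, current_chunks):
    """Return IDs of chunks that are new or whose text changed since last run."""
    stale = []
    for chunk_id, text in current_chunks.items():
        if stored_hashes.get(chunk_id) != content_hash(text):
            stale.append(chunk_id)
    return stale


# Hashes saved after the previous embedding run (toy data).
stored = {"c1": content_hash("alpha"), "c2": content_hash("beta")}
# Chunks produced by the current run: c2 was edited, c3 is new.
current = {"c1": "alpha", "c2": "beta (edited)", "c3": "gamma"}
stale = chunks_to_reembed(stored, current)
print(stale)  # ['c2', 'c3']
```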

###  4. Indexing & Metadata Management
- Storing and querying metadata (e.g., tags, dates, authors) alongside vectors is essential.
- Vector stores and index libraries (such as FAISS, Qdrant, Chroma) vary in scalability, update support, and performance.

###  5. Updates and Deletions Are Non-Trivial
- There is **no direct in-place editing** of indexed nodes.
- You must **track document IDs** and **replace or rebuild** affected chunks carefully.

###  6. Inspection and Debugging Are Time-Consuming
- You’ll often need to **inspect individual nodes**, verify embeddings, or simulate retrieval results manually.
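
Simulating retrieval by hand can be as simple as ranking chunk vectors by cosine similarity to a query vector. This is a toy sketch with made-up two-dimensional vectors; `cosine` and `top_k` are hypothetical helpers, and a real index would use the embedding model's vectors and its own similarity search.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the IDs of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


toy_vectors = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
ranked = top_k([1.0, 0.1], toy_vectors, k=2)
print(ranked)  # ['c1', 'c3']
```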




## 2. Preparation

As in earlier chapters, we use `example.txt` as the document source and adopt **Method 1 from Chapter 3**: the simplest way to build a semantic index directly from documents using `VectorStoreIndex`. This approach is ideal for rapid prototyping or small-scale testing.

In [2]:
import os
from dotenv import load_dotenv


load_dotenv()
api_key = os.getenv('API_KEY')

base_url = "https://api.openai.com/v1"  
chat_model = "gpt-4.1-nano-2025-04-14"   
emb_model = "text-embedding-3-small"


from llama_index.llms.openai import OpenAI
llm = OpenAI(
    api_key = api_key,
    model = chat_model,
)


from llama_index.embeddings.openai import OpenAIEmbedding
embedding = OpenAIEmbedding(
    api_key = api_key,
    model = emb_model,
)
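# Sanity check: one embedding call confirms the API key and model name work
emb = embedding.get_text_embedding("Hello")

# Method 1 from Chapter 3: build the index directly from documents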

from llama_index.core import SimpleDirectoryReader,Document
documents = SimpleDirectoryReader(input_files=['./docs/example.txt']).load_data()

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents,embed_model=embedding)


## 3. Knowing the Index Internals

This section helps us understand what happens behind the scenes after building an index. Specifically:

- `index.docstore.docs` shows all the stored nodes (chunks of your documents).

- `index.index_struct.nodes_dict` lists the node IDs tracked by the index structure, for quick lookup.

- `index.ref_doc_info` shows document-level metadata and the node IDs that belong to each source document.

- `get_node()` lets us retrieve and inspect a specific indexed chunk by its ID.

Together, these let us verify:

- How many chunks were created from our document,

- What metadata is attached,

- How the document and node IDs are tracked.

This step is useful for debugging, transparency, and gaining confidence in the structure before we perform queries or apply filters.






### 3.1 View all documents in the index

`index.docstore.docs` is a dictionary that maps each **node ID** to its corresponding `TextNode` object.

Each `TextNode` contains the chunked content, metadata, and optionally embedding info.

In [3]:
print(index.docstore.docs)

{'e1ab309c-8f22-4900-bd46-b740b8a5fc5e': TextNode(id_='e1ab309c-8f22-4900-bd46-b740b8a5fc5e', embedding=None, metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='bf6c6d9b-81cf-4370-a9df-5f97304fbcaa', node_type='4', metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, hash='a2b68e6316c64e022863136a58cae33d2dc26e9576e335f9ae1d8007a21aba56')}, metadata_template='{key}: {value}', metadata_separator='\
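
The raw dictionary is hard to scan, so a small helper that prints one line per node can be handy. This is a minimal sketch: `preview_nodes` is a hypothetical name, and the stand-in objects below only mimic the `text` and `metadata` attributes seen in the output above so the example runs without an API key.

```python
from types import SimpleNamespace


def preview_nodes(docs, width=60):
    """Return one short line per node: ID prefix, source file, text preview."""
    lines = []
    for node_id, node in docs.items():
        file_name = node.metadata.get("file_name", "?")
        text = " ".join(node.text.split())  # collapse newlines and repeated spaces
        lines.append(f"{node_id[:8]} | {file_name} | {text[:width]}")
    return lines


# Stand-in for index.docstore.docs (made-up ID and text).
fake_docs = {
    "aaaaaaaa-1111-2222": SimpleNamespace(
        text="Multimodal Agent AI systems\nare an example.",
        metadata={"file_name": "example.txt"},
    )
}
lines = preview_nodes(fake_docs)
print("\n".join(lines))
```

With a real index, the call would be `preview_nodes(index.docstore.docs)`.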

### 3.2 View all node IDs in the index

In [4]:
print(index.index_struct.nodes_dict)

{'e1ab309c-8f22-4900-bd46-b740b8a5fc5e': 'e1ab309c-8f22-4900-bd46-b740b8a5fc5e'}
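
Because the index structure and the docstore are separate containers, a quick consistency check is to confirm that every node ID in `nodes_dict` also exists in the docstore. The helper below is a sketch for the default in-memory setup used here; `find_missing_nodes` is a hypothetical name and the dictionaries are stand-ins.

```python
def find_missing_nodes(nodes_dict, docstore_docs):
    """Return node IDs the index structure references but the docstore lacks."""
    return sorted(set(nodes_dict.values()) - set(docstore_docs))


# Stand-ins for index.index_struct.nodes_dict and index.docstore.docs.
nodes_dict = {"n1": "n1", "n2": "n2"}
docstore_docs = {"n1": object()}  # "n2" is missing on purpose
missing = find_missing_nodes(nodes_dict, docstore_docs)
print(missing)  # ['n2']
```

With a real index, `find_missing_nodes(index.index_struct.nodes_dict, index.docstore.docs)` returning an empty list suggests the two are in sync.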


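**Why multiple nodes?**

LlamaIndex splits long documents into smaller chunks so that each one fits within the token limit of the embedding model (such as OpenAI's `text-embedding-3-small`). Our `example.txt` produced a single node in the run above, presumably because it fits in one chunk. A longer file would map to several nodes, for example:

```
[ Example.txt ]
 ├── Chunk 1 → Node ID: ae2f75...
 ├── Chunk 2 → Node ID: adda4a...
 └── Chunk 3 → Node ID: e81b06...
```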

### 3.3 View all reference document info

In [5]:
print(index.ref_doc_info)


{'bf6c6d9b-81cf-4370-a9df-5f97304fbcaa': RefDocInfo(node_ids=['e1ab309c-8f22-4900-bd46-b740b8a5fc5e'], metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'})}
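
To see at a glance how many nodes each source document produced, the `ref_doc_info` mapping can be condensed into a small summary. This is a sketch under the assumption that each value exposes `node_ids` and `metadata`, as in the output above; `summarize_ref_docs` is a hypothetical helper and the data below is made up.

```python
from types import SimpleNamespace


def summarize_ref_docs(ref_doc_info):
    """Map each source document ID to its file name and node count."""
    summary = {}
    for ref_doc_id, info in ref_doc_info.items():
        summary[ref_doc_id] = {
            "file_name": info.metadata.get("file_name"),
            "node_count": len(info.node_ids),
        }
    return summary


# Stand-in for index.ref_doc_info.
fake_info = {
    "doc-1": SimpleNamespace(
        node_ids=["n1", "n2", "n3"],
        metadata={"file_name": "example.txt"},
    )
}
summary = summarize_ref_docs(fake_info)
print(summary)
```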


### 3.4 View a node by its ID

In [11]:
index.docstore.get_node('e1ab309c-8f22-4900-bd46-b740b8a5fc5e')  # any node ID from the previous steps

TextNode(id_='e1ab309c-8f22-4900-bd46-b740b8a5fc5e', embedding=None, metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='bf6c6d9b-81cf-4370-a9df-5f97304fbcaa', node_type='4', metadata={'file_path': 'docs\\example.txt', 'file_name': 'example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-19', 'last_modified_date': '2025-07-19'}, hash='a2b68e6316c64e022863136a58cae33d2dc26e9576e335f9ae1d8007a21aba56')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Multimodal Agent AI systems hav

## 4. Modifying Your Vector Index: Add & Delete Operations

In this section, we explore how to manually update the vector index by either deleting a document or inserting new nodes.


**⚠️ Avoid using the delete operation unless absolutely necessary.
Improper deletion can lead to inconsistencies or runtime errors in later code blocks.**

While adding new nodes can be useful during incremental updates, deletion might break relationships between nodes and cause the retriever or query engine to fail. For stable workflows, it’s usually better to rebuild the index from scratch when document changes are needed.



### 4.1 Delete Node

In [None]:
# index.docstore.delete_document('51595901-ebe3-48b5-b57b-dc8794ef4556')
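
A likely reason this call is risky is that `docstore.delete_document` touches only the docstore, so the vector store and `index_struct` may keep pointing at a node that no longer exists. A hedged alternative is to delete at the index level by source document. The sketch below assumes the index exposes `ref_doc_info` and `delete_ref_doc(ref_doc_id, delete_from_docstore=True)`; `delete_source_document` and `FakeIndex` are hypothetical, and the fake class exists only so the example runs without an API key.

```python
from types import SimpleNamespace


def delete_source_document(index, ref_doc_id):
    """Remove a source document and its nodes; return the removed node IDs."""
    info = index.ref_doc_info.get(ref_doc_id)
    if info is None:
        raise KeyError(f"unknown ref_doc_id: {ref_doc_id}")
    removed = list(info.node_ids)
    # Assumption: this clears the vector store and index structure, and the
    # docstore as well because delete_from_docstore=True.
    index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)
    return removed


class FakeIndex:
    """Stand-in exposing only the two members used above."""

    def __init__(self):
        self.ref_doc_info = {"doc-1": SimpleNamespace(node_ids=["n1", "n2"])}

    def delete_ref_doc(self, ref_doc_id, delete_from_docstore=False):
        self.ref_doc_info.pop(ref_doc_id)


fake = FakeIndex()
removed = delete_source_document(fake, "doc-1")
print(removed)  # ['n1', 'n2']
```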

### 4.2 Add Node

In [None]:
# index.insert_nodes([doc_single])

Note: `doc_single` must be a `TextNode` object, such as the one we saw earlier when viewing a node.  

You can also construct a `TextNode` manually. Here's how:

In [None]:
from llama_index.core.schema import TextNode
nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
            "year": 1972,
        },
    )
]
index.insert_nodes(nodes)

It's also possible to build `TextNode` objects from a document, as we did in the previous chapter.

In [9]:

from llama_index.core import SimpleDirectoryReader,Document
documents = SimpleDirectoryReader(input_files=['./docs/another_example.txt']).load_data()
from llama_index.core.node_parser import SentenceSplitter
transformations = [SentenceSplitter(chunk_size = 512)]

from llama_index.core.ingestion.pipeline import run_transformations
nodes = run_transformations(documents, transformations=transformations)
index.insert_nodes(nodes)
print(nodes)

[TextNode(id_='516c5101-798f-41e1-9332-fe8e96ca6383', embedding=None, metadata={'file_path': 'docs\\another_example.txt', 'file_name': 'another_example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-21', 'last_modified_date': '2025-07-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='5b853f0a-2fd0-4a35-a2e2-033f53e0acf6', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'docs\\another_example.txt', 'file_name': 'another_example.txt', 'file_type': 'text/plain', 'file_size': 3355, 'creation_date': '2025-07-21', 'last_modified_date': '2025-07-19'}, hash='e21b18d7025601f845c7287c1f5cfc20cbd20d88aabe3050302d7a9742561127'), <NodeRelationship.NEXT: '3'>: RelatedNodeIn

### 4.3 Updating Existing Content: No Direct Edit, Only Replace

`LlamaIndex` does not currently support editing a node in place. To change an existing node, the recommended approach is to replace it:

1. Delete the old node using its node ID: `index.docstore.delete_document('<your_node_id>')`
2. Create a new `TextNode` with the updated content or metadata.
3. Insert the new node: `index.insert_nodes([your_new_node])`
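
The steps above can be wrapped into one helper. This is a minimal sketch, not the library's own update mechanism: `replace_node` is a hypothetical name, and it assumes a recent llama-index release in which `VectorStoreIndex.delete_nodes(node_ids, delete_from_docstore=True)` is implemented. It uses that call rather than `docstore.delete_document` on the assumption that it also clears the matching vector-store entry. `FakeIndex` is only a stand-in so the example runs without an API key.

```python
from types import SimpleNamespace


def replace_node(index, old_node_id, new_node):
    """Swap one node for another: delete the old entry, then insert the new one."""
    if old_node_id not in index.docstore.docs:
        raise KeyError(f"unknown node ID: {old_node_id}")
    # Assumption: delete_nodes removes the node from the vector store and the
    # index structure, and from the docstore when delete_from_docstore=True.
    index.delete_nodes([old_node_id], delete_from_docstore=True)
    index.insert_nodes([new_node])
    return new_node.id_


class FakeIndex:
    """Stand-in exposing only the members replace_node touches."""

    def __init__(self, nodes):
        self.docstore = SimpleNamespace(docs={n.id_: n for n in nodes})

    def delete_nodes(self, node_ids, delete_from_docstore=False):
        if delete_from_docstore:
            for node_id in node_ids:
                self.docstore.docs.pop(node_id, None)

    def insert_nodes(self, nodes):
        for node in nodes:
            self.docstore.docs[node.id_] = node


old = SimpleNamespace(id_="old-1", text="draft text")
new = SimpleNamespace(id_="new-1", text="revised text")
fake = FakeIndex([old])
new_id = replace_node(fake, "old-1", new)
print(sorted(fake.docstore.docs))  # ['new-1']
```

With a real index, `new_node` would be a `TextNode` built as in section 4.2.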