### Topic: Splitters with Metadata

When working with large documents, splitting text into smaller chunks is essential for processing by Large Language Models (LLMs). However, simply splitting text into chunks isn’t always enough. Sometimes, we need to **preserve additional information** about the chunks, such as where they came from (e.g., page number, section title, or source document). This is where **Splitters with Metadata** come into play.

---

## **1. Why is Metadata Needed?**

### **Problem: Loss of Context**
- When you split a document into chunks, each chunk is separated from information about its **origin**, so the LLM never sees that information.
- For example:
  - If a chunk comes from a specific section of a document (e.g., "Introduction" or "Conclusion"), the LLM won’t know this unless you explicitly tell it.
  - If a chunk comes from a specific page in a PDF, the LLM won’t know which page it’s from.

### **Solution: Metadata**
- **Metadata** is additional information attached to each chunk.
- It helps the LLM understand the **context** of the chunk, such as:
  - The source document.
  - The page number.
  - The section title.
  - The author or timestamp.
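
As a rough illustration of the idea above, a chunk can be modeled as text plus a metadata dictionary. This is a minimal sketch, not a library API: the `Chunk` class and the `render_for_prompt` helper are hypothetical names chosen for this example, and the field values are made up.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """A piece of text plus the metadata that records where it came from."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def render_for_prompt(chunk: Chunk) -> str:
    """Prefix the text with its origin so the model can actually see it."""
    m = chunk.metadata
    return f"[{m['source']}, page {m['page']}, {m['section']}]\n{chunk.text}"


# Hypothetical chunk with illustrative metadata values
chunk = Chunk(
    text="Generative AI is a type of artificial intelligence...",
    metadata={"source": "document.pdf", "page": 2, "section": "Introduction"},
)

print(render_for_prompt(chunk))
```

The point of the helper is that metadata only helps the LLM if it reaches the prompt; otherwise it is useful only to the surrounding system (for filtering or tracing).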

---

## **2. Benefits of Splitters with Metadata**

1. **Preserves Context**:
   - When metadata is passed to the LLM alongside a chunk, the model can see where the chunk came from, which helps it interpret and cite the text correctly.

2. **Enables Better Retrieval**:
   - When using techniques like **Retrieval-Augmented Generation (RAG)**, metadata lets the retriever filter or rank candidate chunks (for example, by section or source), so fewer irrelevant chunks reach the LLM.

3. **Improves Traceability**:
   - You can trace each chunk back to its original source, which is useful for debugging or auditing.

4. **Supports Complex Workflows**:
   - Metadata allows you to build more advanced workflows, such as filtering chunks based on specific criteria (e.g., only use chunks from the "Introduction" section).

---

## **3. What Problems Does It Solve?**

### **Scenario: Building a Document-Based Q&A System**
Imagine you’re building a Q&A system that answers questions based on a large document (e.g., a 100-page PDF). Here’s how metadata solves key problems:

---

### **Problem 1: Loss of Source Information**
- Without metadata, the LLM won’t know which page or section a chunk came from.
- For example:
  - If a user asks, *"What does the Introduction say about AI?"*, the system won’t know which chunks belong to the "Introduction" section.

### **Solution: Add Metadata**
- Attach metadata to each chunk, such as:
  - `section: "Introduction"`
  - `page: 5`
- Now, when the user asks about the "Introduction," the system can filter chunks based on the `section` metadata.
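
One way to picture this filtering step is a plain list comprehension over chunk dictionaries. This is a sketch under assumptions: the `filter_by_metadata` helper and the sample chunks are hypothetical, and a real system would more likely pass a filter to its vector store.

```python
# Hypothetical chunks; texts and page numbers are illustrative only
chunks = [
    {"text": "Generative AI can create new content.",
     "metadata": {"section": "Introduction", "page": 2}},
    {"text": "LLMs are trained on vast amounts of text.",
     "metadata": {"section": "Main Content", "page": 6}},
    {"text": "AI learns patterns from existing data.",
     "metadata": {"section": "Introduction", "page": 5}},
]


def filter_by_metadata(chunks, **criteria):
    """Keep only chunks whose metadata matches every key/value in criteria."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in criteria.items())
    ]


intro_chunks = filter_by_metadata(chunks, section="Introduction")
for c in intro_chunks:
    print(c["metadata"]["page"], c["text"])
```

Several criteria can be combined, for example `filter_by_metadata(chunks, section="Introduction", page=5)`.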

---

### **Problem 2: Inefficient Retrieval**
- Without metadata, the system might retrieve irrelevant chunks.
- For example:
  - If a user asks, *"What is the conclusion?"*, the system might retrieve chunks from the middle of the document instead of the "Conclusion" section.

### **Solution: Use Metadata for Filtering**
- Attach metadata like `section: "Conclusion"` to relevant chunks.
- During retrieval, filter chunks based on the `section` metadata to ensure only relevant chunks are used.
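
The effect of filtering before ranking can be sketched with a toy retriever. Everything here is an assumption made for illustration: `retrieve` and `tokenize` are hypothetical helpers, and the shared-word count stands in for the embedding similarity a real RAG system would use.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, section=None, top_k=2):
    """Filter by section first (if given), then rank by shared-word count."""
    candidates = [
        c for c in chunks
        if section is None or c["metadata"].get("section") == section
    ]
    query_words = tokenize(query)
    ranked = sorted(
        candidates,
        key=lambda c: len(query_words & tokenize(c["text"])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical chunks: the first one mentions "conclusion" but is not the Conclusion section
chunks = [
    {"text": "The conclusion of the pilot study was mixed.",
     "metadata": {"section": "Main Content"}},
    {"text": "Generative AI could reshape many industries.",
     "metadata": {"section": "Conclusion"}},
]

question = "What is the conclusion?"
unfiltered = retrieve(question, chunks, top_k=1)
filtered = retrieve(question, chunks, section="Conclusion", top_k=1)

print(unfiltered[0]["metadata"]["section"])
print(filtered[0]["metadata"]["section"])
```

Without the filter, the text-only score favors the mid-document chunk that happens to contain the word "conclusion"; with the filter, only chunks tagged `section: "Conclusion"` are considered.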

---

### **Problem 3: Lack of Traceability**
- Without metadata, you can’t trace a chunk back to its original source.
- For example:
  - If the LLM generates an incorrect response, you won’t know which part of the document caused the error.

### **Solution: Include Source Information in Metadata**
- Attach metadata like `source: "document.pdf"` and `page: 10` to each chunk.
- Now, you can trace each chunk back to its original source for debugging or verification.
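
A small sketch of how such metadata might be turned into a source list for an answer. The `format_sources` helper and the `used_chunks` data are hypothetical; the chunk texts are left as placeholders because only the metadata matters here.

```python
def format_sources(chunks):
    """Return de-duplicated 'file, p. N' references in first-seen order."""
    references = []
    for c in chunks:
        m = c["metadata"]
        ref = f"{m['source']}, p. {m['page']}"
        if ref not in references:
            references.append(ref)
    return references


# Hypothetical chunks that were used to produce one answer
used_chunks = [
    {"text": "...", "metadata": {"source": "document.pdf", "page": 10}},
    {"text": "...", "metadata": {"source": "document.pdf", "page": 10}},
    {"text": "...", "metadata": {"source": "document.pdf", "page": 12}},
]

print(format_sources(used_chunks))
```

Logging this list next to each generated answer gives a starting point when a response turns out to be wrong.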

---

## **4. Example Scenario: Splitting a PDF with Metadata**

Let’s say you have a PDF document with the following structure:
- **Title Page**: Page 1
- **Introduction**: Pages 2–5
- **Main Content**: Pages 6–50
- **Conclusion**: Pages 51–52

You want to split this PDF into chunks and attach metadata to each chunk.

---

### **Step 1: Split the Document**
Use a text splitter to divide the document into smaller chunks.

---

### **Step 2: Attach Metadata**
For each chunk, attach metadata like:
- `source`: The name of the document (e.g., `"document.pdf"`).
- `page`: The page number the chunk came from.
- `section`: The section the chunk belongs to (e.g., `"Introduction"`, `"Conclusion"`).
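
Given the page ranges listed for this example PDF, the `section` value can be derived from the page number. The sketch below assumes those ranges are known in advance; `SECTION_RANGES`, `section_for_page`, and `metadata_for_page` are hypothetical names, not part of any library.

```python
# Page ranges taken from the example structure: (first_page, last_page, section)
SECTION_RANGES = [
    (1, 1, "Title Page"),
    (2, 5, "Introduction"),
    (6, 50, "Main Content"),
    (51, 52, "Conclusion"),
]


def section_for_page(page):
    """Return the section containing the page, or None if it is out of range."""
    for first, last, name in SECTION_RANGES:
        if first <= page <= last:
            return name
    return None


def metadata_for_page(page, source="document.pdf"):
    """Build the metadata dictionary for a chunk taken from the given page."""
    return {"source": source, "page": page, "section": section_for_page(page)}


print(metadata_for_page(5))
print(metadata_for_page(51))
```

In practice the ranges would come from the PDF's outline or from detected headings rather than a hard-coded table.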

---

### **Example: Splitting with Metadata**

```python
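# Newer LangChain releases ship the splitters in the separate
# langchain_text_splitters package; adjust the import if this one fails.
from langchain.text_splitter import CharacterTextSplitter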

# Sample text with three sections separated by blank lines
text = """
Introduction
Generative AI is a type of artificial intelligence that can create new content, such as text, images, or music. 
It works by learning patterns from existing data and using those patterns to generate new, similar data.

Main Content
Large Language Models (LLMs) like GPT-4 are examples of generative AI models. They are trained on vast amounts of text data and can generate human-like text.

Conclusion
In conclusion, generative AI has the potential to revolutionize many industries, from healthcare to entertainment.
"""

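# Initialize the splitter.
# CharacterTextSplitter splits only on its separator (default "\n\n"), so a
# section longer than chunk_size is kept whole and a warning is logged.
splitter = CharacterTextSplitter(
    chunk_size=100,  # Target maximum chunk length in characters
    chunk_overlap=20  # Characters shared between neighboring chunks when pieces are merged
)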

# Split the text
chunks = splitter.split_text(text)

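# One metadata record per section, in document order
metadata = [
    {"source": "document.pdf", "page": 2, "section": "Introduction"},
    {"source": "document.pdf", "page": 6, "section": "Main Content"},
    {"source": "document.pdf", "page": 51, "section": "Conclusion"}
]

# Pairing by position is only valid when there is exactly one chunk per section
if len(chunks) != len(metadata):
    raise ValueError(f"Got {len(chunks)} chunks for {len(metadata)} metadata records")

# Combine chunks with metadata
chunks_with_metadata = [
    {"text": chunk, "metadata": meta}
    for chunk, meta in zip(chunks, metadata)
]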

# Print the chunks with metadata
for chunk in chunks_with_metadata:
    print(f"Text: {chunk['text']}")
    print(f"Metadata: {chunk['metadata']}\n")
```

**Output**:
```
Text: Introduction
Generative AI is a type of artificial intelligence that can create new content, such as text, images, or music.
It works by learning patterns from existing data and using those patterns to generate new, similar data.
Metadata: {'source': 'document.pdf', 'page': 2, 'section': 'Introduction'}

Text: Main Content
Large Language Models (LLMs) like GPT-4 are examples of generative AI models. They are trained on vast amounts of text data and can generate human-like text.
Metadata: {'source': 'document.pdf', 'page': 6, 'section': 'Main Content'}

Text: Conclusion
In conclusion, generative AI has the potential to revolutionize many industries, from healthcare to entertainment.
Metadata: {'source': 'document.pdf', 'page': 51, 'section': 'Conclusion'}
```
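
The example pairs chunks and metadata by position, which only works while each section yields exactly one chunk. A more robust pattern is to split each section separately and copy that section's metadata onto every chunk it produces. The sketch below illustrates this in plain Python; `split_fixed` is a hypothetical fixed-window splitter written for this example, not LangChain's algorithm.

```python
def split_fixed(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Each section carries its own metadata and body text
sections = [
    ({"source": "document.pdf", "page": 2, "section": "Introduction"},
     "Generative AI is a type of artificial intelligence that can create new "
     "content, such as text, images, or music. It works by learning patterns "
     "from existing data and using those patterns to generate new, similar data."),
    ({"source": "document.pdf", "page": 6, "section": "Main Content"},
     "Large Language Models (LLMs) like GPT-4 are examples of generative AI "
     "models. They are trained on vast amounts of text data and can generate "
     "human-like text."),
    ({"source": "document.pdf", "page": 51, "section": "Conclusion"},
     "In conclusion, generative AI has the potential to revolutionize many "
     "industries, from healthcare to entertainment."),
]

# Every chunk inherits a copy of the metadata of the section it was cut from
chunks_with_metadata = [
    {"text": piece, "metadata": dict(meta)}
    for meta, body in sections
    for piece in split_fixed(body)
]

for chunk in chunks_with_metadata:
    print(chunk["metadata"]["section"], len(chunk["text"]))
```

LangChain's text splitters expose a similar idea through `create_documents(texts, metadatas=...)`, which takes one metadata dictionary per input text; check the documentation of your installed version for the exact behavior.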

---

## **5. Recap: Benefits of Splitters with Metadata**

1. **Context Preservation**:
   - Metadata helps the LLM understand the context of each chunk, improving response accuracy.

2. **Efficient Retrieval**:
   - Metadata allows you to filter and retrieve only the most relevant chunks.

3. **Traceability**:
   - You can trace each chunk back to its original source for debugging or auditing.

4. **Advanced Workflows**:
   - Metadata enables complex workflows, such as filtering chunks based on specific criteria (e.g., only use chunks from the "Conclusion" section).
