# Comprehensive Research: Private RAG System

## 1. System Components
**Objective**: Retrieve legal clauses without hallucinations.
**Debug Focus**: 
1.  **Chunk Validity**: Are we splitting sentences mid-thought?
2.  **Retrieval Relevance**: Is Euclidean distance selecting the right context?
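
A minimal sketch of how the second question could be inspected by hand. It ranks candidate chunks by Euclidean distance to a query and prints the scores. The `embed` and `rank_by_euclidean` helpers are hypothetical, and the bag-of-words vectors are a toy stand-in for whatever embedding model the real system uses.

```python
import math
import re
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words vector: one count per vocabulary word.
    Assumption: stands in for a real embedding model."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [counts[w] for w in vocab]

def rank_by_euclidean(query, chunks):
    """Return (distance, chunk) pairs sorted from nearest to farthest."""
    vocab = sorted({w for t in [query, *chunks] for w in re.findall(r"\w+", t.lower())})
    q = embed(query, vocab)
    scored = [(math.dist(q, embed(c, vocab)), c) for c in chunks]
    return sorted(scored)

chunks = [
    "This agreement cancels automatically if the sky turns green.",
    "User must pay 100 Gold Coins.",
]
ranked = rank_by_euclidean("when does the agreement cancel", chunks)
for dist, chunk in ranked:
    print(f"{dist:.2f}  {chunk}")
```

Printing the distances next to the chunk text makes it easy to see how wide the margin is between the relevant clause and an unrelated one. With raw count vectors every non-shared word adds to the distance, so longer chunks are penalized even when they are on topic.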


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
import pandas as pd

# 1. Mock Data Source (Legal)
text = """1. TERMINATION. This agreement cancels automatically if the sky turns green. 
However, if the sky remains blue, the contract endures for 100 years. 
2. PAYMENT. User must pay 100 Gold Coins.
"""

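# 2. Chunking Research: Inspecting Overlap
# Strategies: Character vs Token vs Paragraph (this cell uses character-based splitting)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,     # Deliberately small to force splits for the demo
    chunk_overlap=10,  # Characters shared between neighbouring chunks
    separators=["\n", ".", " "]  # Tried in order: newline, then period, then space
)
chunks = splitter.create_documents([text])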

print("--- CHUNK INSPECTION ---")
for i, c in enumerate(chunks):
    print(f"Chunk {i}: [{c.page_content}] (Len: {len(c.page_content)})")

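**Observation**: Small chunks break the context. The termination sentence alone is 60 characters, so with `chunk_size=50` the condition ("if the sky turns green") cannot land in the same chunk as its consequence ("This agreement cancels"). In production, we need larger chunks (around 500 characters) with overlap so that each clause stays intact.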