
# 🧩 1. Conceptual Overview

### 🔹 What Is a Text Splitter?

A **Text Splitter** divides large documents (loaded via Document Loaders) into smaller, manageable **chunks of text** that can be efficiently embedded, stored, and retrieved later.

LLMs have token limits (e.g., GPT-4 Turbo: ~128K tokens). Without chunking, you risk:

* Exceeding model limits
* Losing semantic boundaries
* Degraded retrieval accuracy

Splitters address this by applying **boundary-aware segmentation**, usually at the sentence, paragraph, or token level.

---

### 🔹 Why Splitting Matters

Splitting determines:

* How much **context** is stored in each vector.
* How **relevant** retrieved content will be.
* The **embedding quality** and **retrieval precision**.

A good splitter balances **semantic coherence** with **token efficiency**.

---

# 🧱 2. Architectural Role

LangChain's ingestion pipeline:

```
Raw Data → Document Loader → Text Splitter → Embeddings → VectorStore → Retriever → LLM
```

The **Text Splitter** ensures that:

* Each chunk fits within model/token constraints.
* Metadata is propagated and preserved.
* Context overlap avoids boundary loss.

---

# ⚙️ 3. Core Splitter Classes

LangChain offers multiple **splitter strategies**, each suited for different content types.

| **Splitter**                            | **Purpose / Logic**                                                              | **Best For**             |
| --------------------------------------- | -------------------------------------------------------------------------------- | ------------------------ |
| `CharacterTextSplitter`                 | Splits on a single separator, then merges pieces up to `chunk_size` characters   | Plain text               |
| `RecursiveCharacterTextSplitter`        | Hierarchical split: paragraphs → sentences → words                               | Structured documents     |
| `TokenTextSplitter`                     | Splits by token count                                                            | LLM-token optimized      |
| `MarkdownHeaderTextSplitter`            | Splits markdown by headers                                                       | Notebooks, documentation |
| `SentenceTransformersTokenTextSplitter` | Token-aware, using model tokenizer                                               | Fine-grained RAG         |
| `Language` Splitters                    | Language-specific separators via `RecursiveCharacterTextSplitter.from_language` (Python, JS, Go) | Code data |
| `HTMLHeaderTextSplitter`                | Splits HTML by header tags (`h1`, `h2`, ...)                                     | Web documents            |

---

# 🧠 4. Key Parameters

| **Parameter**     | **Description**                                                   |
| ----------------- | ----------------------------------------------------------------- |
| `chunk_size`      | Maximum size of a chunk (in characters or tokens).                |
| `chunk_overlap`   | Number of overlapping units between chunks (to preserve context). |
| `separators`      | Custom delimiters (`\n`, `.`, `;`) controlling boundary logic.    |
| `length_function` | Defines how "length" is calculated (characters, tokens).          |
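
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal character-window sketch. It is not LangChain's implementation (real splitters also honor `separators` and `length_function`), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap at work. A larger overlap means more repeated text and therefore more chunks to embed.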

---

# 📘 5. Practical Example: Recursive Splitter

### Example: Splitting a long PDF document

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load
loader = PyPDFLoader("data/policy.pdf")
docs = loader.load()

# Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " ", ""],  # tried in order; "" is a last-resort character split
)

chunks = splitter.split_documents(docs)
print(f"Original Docs: {len(docs)} | Split Chunks: {len(chunks)}")
print(chunks[0].page_content[:300])
```

🔹 *Why Recursive?*
It tries each separator in order (paragraph → line → sentence → word) and falls back to a finer one only when a piece is still larger than `chunk_size`, which keeps related text together.
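
The fallback logic can be sketched in plain Python. This is a simplified illustration, not LangChain's code: it ignores `chunk_overlap`, drops the separator where it cuts, and assumes every separator is a non-empty string. The name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # This piece alone is too big: retry with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer and needs splitting."
for chunk in recursive_split(sample, ["\n\n", "\n", " "], chunk_size=30):
    print(repr(chunk))
```

The short paragraph survives intact, while the long one is broken at word boundaries because no paragraph or line break was available inside it.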

---

# 📘 6. Example: Token-Based Splitting

Useful when targeting **specific token budgets** aligned with embedding or generation models.

```python
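from langchain_text_splitters import TokenTextSplitter

# chunk_size and chunk_overlap are measured in tokens here, not characters
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text("Your long technical document here...")
print(len(chunks))  # number of text chunks, not number of tokens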
```

This ensures consistent chunking aligned with **LLM tokenization rules** (using `tiktoken` or tokenizer-specific methods).
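
For budgeting, it helps to estimate how many chunks a document of a given token length will produce. The sketch below assumes a plain sliding window over tokens; the exact count from a real splitter may differ slightly by library version, and `expected_chunk_count` is a hypothetical helper.

```python
import math


def expected_chunk_count(n_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of sliding windows needed to cover n_tokens tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens covered by each extra window
    return math.ceil((n_tokens - chunk_overlap) / stride)


print(expected_chunk_count(1200, chunk_size=512, chunk_overlap=50))  # 3
```

With 512-token chunks and a 50-token overlap, each additional chunk covers 462 new tokens, so a 1,200-token document needs three chunks.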

---

# 📘 7. Example: Markdown & Structured Documents

Ideal for engineering documentation, README files, or Jupyter notebooks.

```python
from pathlib import Path

from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
)
# split_text returns Document objects whose metadata records the header path
docs = splitter.split_text(Path("docs/readme.md").read_text(encoding="utf-8"))
print(len(docs))
```

Each resulting document carries its section hierarchy in metadata, for example:

```python
{'Header 1': 'Introduction', 'Header 2': 'Setup', 'Header 3': 'Usage'}
```
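
How a header path ends up in metadata can be sketched with a small standalone parser. This is an illustration of the idea only, not the library's implementation: it does not skip `#` lines inside fenced code blocks, and `split_by_headers` is a hypothetical name.

```python
import re


def split_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Group body text under the chain of headers that encloses it."""
    sections = []
    path = {}   # header level -> current title at that level
    body = []   # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(path.items())}
            sections.append({"metadata": metadata, "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for lvl in [l for l in path if l >= level]:
                del path[lvl]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Introduction\nHello.\n## Setup\nInstall it.\n### Usage\nRun it.\n## FAQ\nNone yet."
for section in split_by_headers(doc):
    print(section["metadata"], "->", section["content"])
```

The third section prints the same three-level metadata shown above, and the `FAQ` section shows how a new `##` header resets the `###` level.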

---

# 🧩 8. Example: Code Splitter

LangChain supports **language-aware splitters** for codebases:

```python
from pathlib import Path

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=400,
    chunk_overlap=50,
)

code = Path("scripts/model_training.py").read_text(encoding="utf-8")
chunks = splitter.create_documents([code])
print(chunks[0].page_content)
```

The splitter prefers to break at class and function definitions, and falls back to finer separators only when a block still exceeds `chunk_size`.
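
The idea of language-specific separators can be illustrated with a regular expression that cuts only before top-level `class` and `def` lines. This is a deliberately simplified stand-in, not the library's separator list: it ignores `chunk_size`, decorators, and `async def`, and `split_python_top_level` is a hypothetical name.

```python
import re


def split_python_top_level(source: str) -> list[str]:
    """Split Python source before each unindented class or def statement."""
    # The lookahead keeps the keyword at the start of the following piece.
    parts = re.split(r"(?m)^(?=(?:class|def)\s)", source)
    return [part for part in parts if part.strip()]


source = (
    "import os\n\n"
    "def a():\n    return 1\n\n"
    "class B:\n    def m(self):\n        return 2\n"
)
for part in split_python_top_level(source):
    print(repr(part))
```

The indented method `m` stays inside the `class B` piece because only unindented definitions count as boundaries.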

---

# 🔄 9. Chunking Strategy Design (Best Practices)

| **Objective**              | **Recommended Approach**                                      |
| -------------------------- | ------------------------------------------------------------- |
| **General knowledge base** | `RecursiveCharacterTextSplitter`, 800–1000 chars, 150 overlap |
| **Technical docs**         | `MarkdownHeaderTextSplitter` or `TokenTextSplitter`           |
| **Legal/Policy documents** | Recursive splitter with paragraph separators                  |
| **Code repositories**      | Language-aware splitter                                       |
| **Chat history**           | Character splitter, smaller chunks (400–600 chars)            |

---

# ⚖️ 10. Choosing the Right Chunk Size

### 🔹 Considerations:

* Embedding model's **token capacity** (e.g., `text-embedding-3-large` → ~8K tokens)
* Context requirement per query
* Vector DB efficiency

| Model                              | Recommended Chunk Size (characters) | Overlap |
| ---------------------------------- | ----------------------------------- | ------- |
| `text-embedding-3-small` (OpenAI)  | 600–800                             | 100     |
| `text-embedding-3-large`           | 1000–1500                           | 200     |
| `sentence-transformers`            | 300–500                             | 50–100  |
| `gpt-4-turbo` (generation context) | ≤1500                               | 200     |
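
A quick sanity check is to convert a character budget into an approximate token count and compare it with the model limit. The sketch assumes roughly 4 characters per token, a common rule of thumb for English text and not a measured value; for exact counts, use the model's own tokenizer. Both function names are hypothetical.

```python
import math


def estimate_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed average, not exact."""
    return math.ceil(n_chars / chars_per_token)


def fits_token_limit(chunk_chars: int, token_limit: int, chars_per_token: float = 4.0) -> bool:
    """True if a chunk of chunk_chars characters likely fits within token_limit."""
    return estimate_tokens(chunk_chars, chars_per_token) <= token_limit


print(estimate_tokens(1500))          # 375
print(fits_token_limit(1500, 8000))   # True
```

Under that assumption, a 1,500-character chunk is far below an ~8K-token embedding limit, so chunk sizes in this range are chosen for retrieval quality rather than to avoid overflow.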

---

# 🧮 11. Metadata Propagation

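When you call `split_documents`, each chunk inherits the metadata of the document it came from, which keeps every chunk traceable to its source. (`split_text` returns plain strings for most splitters, so there is no metadata to carry.)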

📘 Example:

```python
print(chunks[0].metadata)
# Example output (keys depend on the loader): {'source': 'data/policy.pdf', 'page': 0}
```

You can also append or enrich metadata at the chunk level for advanced retrieval analytics.
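
A minimal sketch of chunk-level enrichment follows. The `Chunk` dataclass is a stand-in for LangChain's `Document` so the example runs on its own, and the added keys (`chunk_index`, `chunk_total`, `char_count`) are illustrative choices, not a LangChain convention.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def enrich(chunks: list[Chunk]) -> list[Chunk]:
    """Return copies of the chunks with position and size added to metadata."""
    total = len(chunks)
    return [
        Chunk(
            page_content=chunk.page_content,
            metadata={
                **chunk.metadata,           # keep what the loader provided
                "chunk_index": index,
                "chunk_total": total,
                "char_count": len(chunk.page_content),
            },
        )
        for index, chunk in enumerate(chunks)
    ]


raw = [
    Chunk("First part.", {"source": "data/policy.pdf", "page": 0}),
    Chunk("Second part, longer.", {"source": "data/policy.pdf", "page": 0}),
]
for chunk in enrich(raw):
    print(chunk.metadata)
```

Fields like these make it possible to filter by source at query time or to fetch neighboring chunks around a hit.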

---

# 🧠 12. Common Pitfalls

| **Issue**                         | **Cause**                          | **Mitigation**                     |
| --------------------------------- | ---------------------------------- | ---------------------------------- |
| Chunks too large → token overflow | Chunk size > embedding/model limit | Reduce `chunk_size`                |
| Context loss between chunks       | No overlap                         | Set `chunk_overlap` to 100–200     |
| Poor semantic alignment           | Simple character splitting         | Use Recursive or Token splitter    |
| High memory use                   | Large doc ingestion                | Process in batches                 |

---

# ⚙️ 13. Integration with VectorStores

Example full pipeline:

```python
# Requires: langchain-community, langchain-text-splitters, langchain-openai, faiss-cpu
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load & split
loader = TextLoader("data/ai_overview.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# Embed & store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
```

Each chunk is now embedded and indexed, so the store can be queried by semantic similarity.
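
To see the chunk → embed → retrieve flow without an API key, here is a toy version that swaps in word-count vectors for real embeddings and a sorted list for the vector store. It illustrates the mechanics only; bag-of-words similarity is much cruder than a learned embedding, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


chunks = [
    "text splitters divide documents into chunks",
    "vector stores index embeddings for search",
    "retrievers fetch relevant chunks for the llm",
]
print(retrieve("how do vector stores index embeddings", chunks))
# ['vector stores index embeddings for search']
```

Because similarity is computed per chunk, the way text is split directly determines what can be matched: a chunk that mixes unrelated topics scores lower against a focused query.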

---

# 💼 14. Interview Questions & Answers

### **Beginner**

**Q1. What is the purpose of a Text Splitter in LangChain?**
To break large documents into smaller, coherent text chunks for efficient embedding and retrieval.

**Q2. Why is chunk overlap important?**
To maintain context across adjacent chunks, so that a sentence or idea cut at a boundary still appears intact in at least one chunk.

**Q3. What are the main types of text splitters?**
Character-based, Recursive, Token-based, Markdown-based, and Language-aware splitters.

---

### **Intermediate**

**Q4. Difference between `CharacterTextSplitter` and `RecursiveCharacterTextSplitter`?**

* `CharacterTextSplitter` splits on one separator and merges the pieces up to `chunk_size`, regardless of structure.
* `RecursiveCharacterTextSplitter` respects logical structure by trying paragraphs, then sentences, then words.

**Q5. How would you choose chunk size for OpenAI embeddings?**
Typically 800–1200 characters with a 100–200 character overlap, to balance token efficiency and semantic completeness.

**Q6. What metadata is preserved during splitting?**
File name, page numbers, headers, or source path from the original loader.

---

### **Advanced**

**Q7. How would you handle multilingual documents?**
Detect the language first (e.g., with `langdetect`), then apply language-appropriate sentence segmentation (e.g., a `spacy` pipeline) before chunking.

**Q8. What's the trade-off between small and large chunks?**

* Small chunks → High retrieval precision but lower coherence.
* Large chunks → More context per chunk, but less precise matches and higher token cost.

**Q9. How can splitting be optimized for RAG latency?**

* Preprocess & cache chunks
* Parallelize splitting
* Tune chunk size so that fewer, well-formed chunks need to be embedded and searched

**Q10. How would you split a 100MB document for RAG?**
Load in streaming batches → split incrementally → persist chunks asynchronously to a vector store.
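
The streaming idea can be sketched with generators, so that only one batch of chunks is in memory at a time. This is a simplified illustration: it packs whole lines into chunks (a single line longer than `chunk_size` is emitted as-is), and the persistence step is left as a comment because it depends on the chosen vector store. `stream_chunks` and `batched` are hypothetical helper names.

```python
import io
from typing import Iterable, Iterator, TextIO


def stream_chunks(handle: TextIO, chunk_size: int) -> Iterator[str]:
    """Read a file line by line and yield chunks of at most chunk_size characters."""
    buffer = ""
    for line in handle:
        if buffer and len(buffer) + len(line) > chunk_size:
            yield buffer
            buffer = ""
        buffer += line
    if buffer:
        yield buffer


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Group an iterable into lists of up to batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# io.StringIO stands in for open("big_file.txt", encoding="utf-8")
handle = io.StringIO("aaaa\nbbbb\ncccc\ndddd\neeee\n")
for batch in batched(stream_chunks(handle, chunk_size=10), batch_size=2):
    # In a real pipeline, embed and persist this batch here.
    print(len(batch), batch)
```

Because both functions are lazy, the full document is never held in memory; peak usage is bounded by `chunk_size` times `batch_size`.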

---

# 🧠 15. Real-World Implementation Pattern

| **Stage**        | **Component**                 | **Description**           |
| ---------------- | ----------------------------- | ------------------------- |
| 1️⃣ Data Load     | Document Loaders              | Extract raw content       |
| 2️⃣ Data Prep     | Text Splitters                | Segment into chunks       |
| 3️⃣ Vectorization | Embeddings                    | Generate semantic vectors |
| 4️⃣ Storage       | VectorStore (FAISS, Pinecone) | Store vectors + metadata  |
| 5️⃣ Retrieval     | Retriever + Chain             | Query + generate response |

