# 11 — Text Splitters (RecursiveCharacterTextSplitter)

A **text splitter** breaks a large document into smaller, manageable **chunks**. This is essential for LLMs, which have **context window limits**.

By splitting text you can:
- Search and retrieve more precisely.
- Process data efficiently, since each call handles a small fragment instead of the whole document.
- Feed the model coherent fragments instead of arbitrary cuts.

In [None]:
# ╔══════════════════════════════════════════════════════╗
# ║ Setup: Load environment variables & initialize model ║
# ╚══════════════════════════════════════════════════════╝

import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

from langchain_openai import ChatOpenAI
# from langchain_groq import ChatGroq

chat_model = ChatOpenAI(model="gpt-4o-mini")
# chat_model = ChatGroq(model="llama-3.1-70b-versatile")

print("✅ Environment loaded and model ready.")

## RecursiveCharacterTextSplitter

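**RecursiveCharacterTextSplitter** tries a list of separators in order, **from coarsest to finest**. It splits on the first separator found in the text, merges adjacent pieces back together as long as they fit within `chunk_size`, and recursively applies the next separator to any piece that is still too long.

Typical separator order: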
1. Paragraphs (e.g., `"\n\n"`)
2. New lines (e.g., `"\n"`)
3. Sentences (regex like `"(?<=\. )"`)
4. Words (`" "`)
5. Characters (`""`)
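
The recursion described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, separators are matched as literal strings only, and there is no overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator present, recursing into oversized pieces.

    Simplified illustration: literal separators only and no overlap.
    A separator is kept between pieces merged into one chunk and
    dropped at chunk boundaries.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator that occurs in the text ("" always qualifies).
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        sep = ""  # nothing left to try: fall back to fixed-width cuts

    if sep == "":
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    remaining = separators[i + 1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: try the finer separators on it.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=8)
print(chunks)
```

Both paragraphs exceed 8 characters and contain no single newline, so each one is split again on spaces, and the words are merged back until the limit is reached.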

Key parameters:
- `chunk_size` → maximum size of each chunk (here **characters**, not tokens).
- `chunk_overlap` → maximum number of characters shared by two consecutive chunks.
- `separators` → ordered list the splitter will try.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

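second_recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    # Needed so the lookbehind is treated as a regex rather than literal text.
    is_separator_regex=True,
)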

print(second_recursive_splitter)

### Separators in detail
- `"\n\n"`: **double newline** → paragraph boundaries.
- `"\n"`: **single newline** → line breaks within lists/blocks.
- `"(?<=\. )"`: **regex lookbehind** that splits **after a period and a space** (sentence boundaries). It is only interpreted as a regex when the splitter is created with `is_separator_regex=True`; otherwise it is matched as literal text.
- `" "`: **space** → words.
- `""`: **empty string** → characters (last resort).
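
To see what the lookbehind separator does on its own, here is a small sketch using only Python's standard `re` module (Python 3.7+ is assumed, since older versions do not split on empty matches). The sample sentence is made up for illustration. The second part mimics a separator that is escaped and therefore matched as literal text.

```python
import re

sentence_boundary = r"(?<=\. )"
sample = "Chunks matter. They feed retrieval. Keep them coherent."

# The lookbehind matches the empty position right after ". ",
# so the period and the space stay attached to the preceding sentence.
parts = re.split(sentence_boundary, sample)
print(parts)

# Escaped, the same string is searched for literally and never matches.
literal_match = re.search(re.escape(sentence_boundary), sample)
print(literal_match)
```

The split produces three sentences, while the escaped version finds nothing, which is why the regex flag matters for this separator.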

## Practical example
We will split the following text using the splitter configured above.

In [None]:
text2 = (
    "Data that Speak\n"
    "LLM Applications are revolutionizing industries such as \n"
    "banking, healthcare, insurance, education, legal, tourism,\n"
    "construction, logistics, marketing, sales, customer service, \n"
    "and even public administration.\n\n"
    "The aim of our programs is for students to learn how to \n"
    "create LLM Applications in the context of a business,\n"
    "which presents a set of challenges that are important \n"
    "to consider in advance."
)

chunks = second_recursive_splitter.split_text(text2)
for i, ch in enumerate(chunks, 1):
    print(f"[{i}] len={len(ch)}\n{ch}\n---")

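You should see coherent chunks of at most 150 characters each. The splitter first tries paragraph breaks, then line breaks, then sentence boundaries, and falls back to words and characters only when a piece is still longer than `chunk_size`.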

## Tips: choosing `chunk_size` and `chunk_overlap`

- Start with **small chunks** (e.g., 200–400 characters) for demos, then adjust.
- Use a small **overlap** (e.g., 10–50 characters) when you need to preserve context between adjacent chunks (headings/sentences that span boundaries).
- Remember this splitter counts **characters**. For token-aware splitting, prefer `TokenTextSplitter`, or create the splitter with `RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)` so that `chunk_size` and `chunk_overlap` are measured in tokens.
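
The effect of `chunk_overlap` is easiest to see with a fixed-width window. The sketch below is an illustration only: `sliding_chunks` is a hypothetical helper that cuts at raw character offsets, whereas the LangChain splitter builds its overlap from separator-delimited pieces, so its real output will differ.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


no_overlap = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With an overlap of 2, every chunk starts with the last two characters of the previous one, at the cost of producing more chunks for the same text.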

## Why this matters for RAG

In RAG pipelines, you typically:
1. **Load** documents (Data Loaders).
2. **Split** them into coherent chunks (this step).
3. **Embed** and **index** those chunks in a vector store.
4. **Retrieve** the most relevant chunks for a query and feed them to the LLM.
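
The four steps above can be sketched end to end with the standard library alone. Everything here is a toy stand-in chosen for illustration: `load`, `split`, `embed`, and `retrieve` are hypothetical helpers, the "embedding" is a plain word count, and the "vector store" is a Python list. A real pipeline would use a document loader, a text splitter, an embedding model, and a vector store instead.

```python
import math
import re
from collections import Counter


def load():
    """Stand-in for a document loader: returns one small document."""
    return (
        "LLM applications are changing banking and healthcare.\n\n"
        "Text splitters break documents into chunks.\n\n"
        "Vector stores index embeddings for retrieval."
    )


def split(document):
    """Stand-in for a text splitter: one chunk per paragraph."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def embed(text):
    """Toy embedding: bag-of-words counts over lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, index, k=1):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = split(load())                         # steps 1 and 2
index = [(chunk, embed(chunk)) for chunk in chunks]  # step 3
top = retrieve("How do text splitters create chunks?", index)  # step 4
print(top)
```

Because each paragraph is its own chunk, the query matches only the chunk about text splitters; had the whole document been indexed as a single chunk, retrieval could not tell the three topics apart.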

Good splitting preserves semantic coherence and improves retrieval quality.