
---

## 🔄 Why Do We Need to Split Documents into Chunks?

**Large Language Models (LLMs)** like GPT-4 and Claude have a **context window limit**, ranging from a few thousand tokens in older models to 100K or more in newer ones. When working with long documents like PDFs, research papers, or website content, feeding the entire document to the model is often **not feasible**, and even when it fits, it adds cost and noise.

Hence, we:

* **Split documents** into smaller **overlapping chunks**,
* **Embed** each chunk individually into a vector store,
* During retrieval, fetch only **relevant chunks**.

📌 **Analogy**: Think of reading a book chapter-wise. If you’re only interested in one topic, you search within relevant chapters — not the entire book.
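
As a rough illustration of the split, embed, retrieve flow above, here is a minimal pure-Python sketch. Everything in it is a stand-in assumed for illustration: `chunk_words`, `embed`, `cosine`, and `retrieve` are hypothetical helpers, and the word-count "embedding" takes the place of a real embedding model and vector store.

```python
import math
from collections import Counter

def chunk_words(text, size=8, overlap=2):
    """Split text into windows of `size` words that share `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy 'embedding': lowercase word counts (a real system uses a model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

document = ("Cats sleep most of the day. Dogs love to play fetch in the park. "
            "Parrots can mimic human speech quite well.")
chunks = chunk_words(document)
top = retrieve("parrots mimic speech", chunks, k=1)
print(top[0])
```

Only the best-matching chunk is returned, which is the part that would be handed to the LLM instead of the whole document.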

---

## 📦 `langchain-text-splitters`: What Is It?

LangChain provides a separate package:
**`langchain-text-splitters`**

> It contains classes and utilities to split raw text or documents into manageable chunks.

💡 It supports splitting by:

* Characters
* Tokens (e.g., using tiktoken for OpenAI)
* Sentences
* Markdown headers
* Code blocks (e.g., Python-specific)

---

## 📘 RecursiveCharacterTextSplitter – The Most Commonly Used Splitter

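This is LangChain's general-purpose splitter and a sensible default for plain text. With its default separators it tries to split on:

1. Paragraphs (`\n\n`)
2. Lines (`\n`)
3. Words (` `)
4. Characters (`""`, as fallback)

Sentence boundaries such as `.` are **not** in the default list; add them through the `separators` parameter if you want sentence-level splits.

### ✅ Purpose:

* Preserve **semantic meaning** as much as possible.
* Avoid breaking in the middle of a paragraph, line, or word whenever a larger unit fits.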

---

## 🔧 Important Parameters of `RecursiveCharacterTextSplitter`

```python
RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""],
    length_function=len
)
```

| Parameter         | Description                                                                                     |
| ----------------- | ----------------------------------------------------------------------------------------------- |
| `chunk_size`      | Maximum size of each chunk, in whatever unit `length_function` returns (characters by default)  |
| `chunk_overlap`   | Maximum **overlap** between adjacent chunks, in the same unit, for context preservation         |
| `separators`      | List of strings to split on, in order of priority (default: `["\n\n", "\n", " ", ""]`)          |
| `length_function` | How the splitter measures the chunk (default: `len`)                                            |

---

## 🔍 Example: Splitting Using `RecursiveCharacterTextSplitter`

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = """LangChain is a framework for developing LLM-powered applications. 
It enables memory, chains, agents, and retrieval using vector stores."""

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
```

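### 🧾 Output:

The default separators split on whitespace, so chunks end at word boundaries rather than mid-word. The output should look like this (exact boundaries may differ slightly between versions):

```
Chunk 1: LangChain is a framework for developing
Chunk 2: LLM-powered applications.
Chunk 3: It enables memory, chains, agents, and retrieval
Chunk 4: retrieval using vector stores.
```

> 🔄 Overlap helps preserve the meaning across chunk boundaries! Here chunks 3 and 4 share the word "retrieval". Overlap is built from whole pieces, so it appears only where a piece fits within `chunk_overlap`.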

---

## 📄 `create_documents()` – What Does It Do?

Every LangChain text splitter has a `create_documents()` method that:

* Accepts a **list of raw strings** (not LangChain `Document` objects)
* Splits each string into chunks
* Wraps each chunk in a `Document`, attaching the metadata you pass through the optional `metadatas` argument (one dict per input string)

### 🧪 Example:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

texts = ["LangChain helps build LLM apps.", "It supports memory, chains, and more."]
splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)

# create_documents() is a method on the splitter, not a standalone import.
# metadatas is optional: one dict per input string, copied onto each of its chunks.
docs = splitter.create_documents(texts, metadatas=[{"source": "0"}, {"source": "1"}])
for doc in docs:
    print(doc.page_content, doc.metadata)
```

### 🔍 Output:

The output should look like this (exact boundaries may differ slightly between versions):

```python
LangChain helps {'source': '0'}
build LLM apps. {'source': '0'}
It supports memory, {'source': '1'}
chains, and more. {'source': '1'}
```

✅ **Key benefit**: every chunk carries the metadata of the string it came from, which is useful for document tracing. Without `metadatas`, each chunk gets an empty metadata dict.

---

## 📄 `split_documents()` – When to Use?

If you already have a list of LangChain `Document` objects (e.g., from PDF or web loaders), `create_documents()` is the wrong tool because it expects raw strings. Call this method on the splitter instead:

```python
splitter.split_documents(documents)
```

This method preserves the **original metadata** and splits the `page_content` of each document.

---

## ❗ Why Can't `create_documents()` Work on LangChain `Document` Types?

Because:

* `create_documents()` expects **raw strings** and generates new `Document` objects from scratch.
* But `Document` objects already include `page_content` + `metadata`, so the right way is to call:

```python
splitter.split_documents(documents)
```

✅ This keeps the source file name, page number, or URL intact!
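
To make the metadata behavior concrete, here is a simplified sketch of the idea. It is not LangChain's code: `Doc` is a hypothetical stand-in for `Document`, and `split_docs` is an assumed helper that copies each document's metadata onto every chunk produced from it.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def split_docs(docs, split_fn):
    """Split each doc's text and copy its metadata onto every chunk."""
    out = []
    for doc in docs:
        for chunk in split_fn(doc.page_content):
            # Copy the dict so chunks do not share one mutable object.
            out.append(Doc(chunk, dict(doc.metadata)))
    return out

loaded = [Doc("alpha beta gamma delta", {"source": "report.pdf", "page": 3})]
pieces = split_docs(loaded, lambda text: [text[:10], text[11:]])
for piece in pieces:
    print(piece.page_content, piece.metadata)
```

Both chunks keep the `source` and `page` of the document they came from, which is what makes tracing an answer back to its origin possible.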

---

## ⚔️ Difference Between `create_documents()` and `split_documents()`

| Feature           | `create_documents()`                         | `split_documents()`                     |
| ----------------- | -------------------------------------------- | --------------------------------------- |
| Input             | List of strings                              | List of `Document` objects              |
| Output            | List of `Document` chunks                    | List of `Document` chunks               |
| Metadata Handling | Uses the optional `metadatas` you pass (empty by default) | Preserves original metadata |
| Use Case          | From raw text / notes                        | After document loading (e.g., PDF, Web) |
| When to Use       | First ingestion from text                    | After loading docs with loaders         |

---

## 🧪 Final Example: Both in Action

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=15, chunk_overlap=5)

# From raw strings: create_documents() is a method on the splitter
raw_texts = ["LangChain is amazing.", "It allows memory and agents."]
docs_created = splitter.create_documents(raw_texts)

# From loaded documents: split_documents() keeps each document's metadata
loader = TextLoader("example.txt")  # assumes example.txt exists
loaded_docs = loader.load()
docs_split = splitter.split_documents(loaded_docs)
```

---

### ✅ Important Questions You Can Now Answer:

1. Why do we split documents into smaller parts in GenAI pipelines?
2. What is the difference between `create_documents()` and `split_documents()`?
3. What happens if we don’t add chunk overlap while splitting?
4. Can you explain what `RecursiveCharacterTextSplitter` does internally?
5. When would you NOT use `create_documents()`?

---




### ✅ **1. Why do we split documents into smaller parts in GenAI pipelines?**

**Answer**:
LLMs like GPT-4 or Claude have a **context window limit** (e.g., 8K or 32K tokens). If a document exceeds this limit, it can't be processed in one go.
So we split the document into **smaller chunks** (e.g., 500 tokens), optionally with **overlapping content** (e.g., 50 tokens), to:

* Stay within token limits,
* Preserve semantic meaning,
* Enable chunk-level embedding and retrieval.

**Example**:
A 10-page PDF is split into 30 overlapping chunks. Only the top 3 relevant chunks are passed to the LLM when answering a question, reducing noise and cost.

---

### ✅ **2. What is the difference between `create_documents()` and `split_documents()`?**

| Feature               | `create_documents()`                           | `split_documents()`                                     |
| --------------------- | ---------------------------------------------- | ------------------------------------------------------- |
| **Input**             | List of raw strings                            | List of `Document` objects                              |
| **Output**            | List of `Document` chunks                      | List of `Document` chunks                               |
| **Metadata Handling** | Uses the optional `metadatas` argument (empty by default) | Preserves original metadata from input |
| **Use Case**          | For first-time processing of raw text          | For splitting already-loaded documents (e.g. PDFs, web) |
| **Example Usage**     | When working with scraped or manual text input | When using loaders like `PyPDFLoader`, `WebBaseLoader`  |

---

### ✅ **3. What happens if we don’t add `chunk_overlap` while splitting?**

**Answer**:
If you don’t use **chunk overlap**, the end of one chunk and the start of the next will have **no shared context**, which may:

* Break sentence flow,
* Reduce accuracy in retrieval,
* Cause LLMs to lose understanding of context transitions.

**Example**:
A chunk ends with “Barack Obama was born in” and the next chunk starts with “Hawaii.” Without overlap, this fact is split awkwardly.

Overlap (e.g., 50 tokens) makes it more likely that such a fact appears intact in at least one chunk.
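
The effect is easy to see with a small word-window sketch. `window_chunks` is a hypothetical helper written for this illustration, not a LangChain function; it counts in words rather than tokens to keep the example short.

```python
def window_chunks(words, size, overlap):
    """Slide a window of `size` words, stepping by size - overlap."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

words = "Barack Obama was born in Hawaii.".split()

no_overlap = window_chunks(words, size=5, overlap=0)
with_overlap = window_chunks(words, size=5, overlap=2)

print(no_overlap)    # the fact is cut between the two chunks
print(with_overlap)  # the second chunk repeats "born in" before "Hawaii."
```

With an overlap of two words, the second chunk reads "born in Hawaii.", so the key part of the fact survives in one piece.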

---

### ✅ **4. Can you explain what `RecursiveCharacterTextSplitter` does internally?**

**Answer**:
`RecursiveCharacterTextSplitter` is a **multi-level text splitter**. It works like this:

1. It tries to split using **higher-level separators** first. The default list is:
   `["\n\n", "\n", " ", ""]`
2. Pieces that fit are merged back together up to `chunk_size`; any piece that is still too long is split **recursively** with the next separator.
3. If it reaches the character level (`""`) and the piece is still too long, it hard-splits it.

💡 The idea is to keep chunks **coherent**, preferring paragraph and line boundaries. Sentence boundaries are honored only if you add `.` to `separators`.

**Example**:
A 1,000-character paragraph is too long:

* First try splitting by `\n\n` → if not enough,
* Then by `\n` → if still too long,
* Then by spaces or characters.
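
The recursion can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = ("Para one is short.\n\n"
          "Para two is a bit longer and will not fit in one chunk.\n\n"
          "Supercalifragilisticexpialidocious")
result = recursive_split(sample, chunk_size=30)
for chunk in result:
    print(repr(chunk))
```

The short paragraph stays whole, the long paragraph is split at spaces, and the 34-character word, which has no separator at all, is hard-cut at the character level.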

---

### ✅ **5. When would you NOT use `create_documents()`?**

**Answer**:
You **should not use `create_documents()`** when you already have **loaded documents** using LangChain loaders (e.g., PDF, web, Notion, Arxiv).

Why?

* These loaders return `Document` objects with **metadata** like page number, file path, URL, etc.
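* `create_documents()` takes raw strings, so you would have to extract `page_content` and pass the metadata yourself through `metadatas`; skip that step and the metadata is lost.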

👉 Instead, use `split_documents()` to **retain all original metadata** and just split the content.

---

### ✅ **Bonus: What are some key parameters of `RecursiveCharacterTextSplitter`?**

| Parameter         | Description                                         |
| ----------------- | --------------------------------------------------- |
| `chunk_size`      | Max size of each chunk (in characters/tokens)       |
| `chunk_overlap`   | Characters/tokens shared between adjacent chunks    |
| `separators`      | List of split symbols (priority order)              |
| `length_function` | How chunk length is measured (`len` or token count) |

---

## ✅ **What is `length_function` in `RecursiveCharacterTextSplitter`?**

### 🔹 Definition:

`length_function` is a **custom function** that tells the splitter **how to measure the length** of each chunk.
By default, LangChain uses Python’s `len()` to count **characters**.

```python
length_function = len
```

But in **LLMs**, we usually care about **tokens**, not characters.
So we can pass a custom function like `tiktoken_len()` to **count tokens instead** of characters.
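
To see how swapping the measuring function changes the result, here is a small sketch. Both `pack` and `word_count` are hypothetical helpers for illustration; `word_count` is only a crude stand-in for a real tokenizer such as `tiktoken`.

```python
def pack(pieces, chunk_size, length_function=len):
    """Greedily join pieces while length_function(chunk) <= chunk_size."""
    chunks, current = [], []
    for piece in pieces:
        candidate = " ".join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks

def word_count(text):
    """Crude stand-in for a token counter: one 'token' per word."""
    return len(text.split())

pieces = "the quick brown fox jumps over the lazy dog".split()

by_chars = pack(pieces, chunk_size=15, length_function=len)
by_words = pack(pieces, chunk_size=4, length_function=word_count)
print(by_chars)
print(by_words)
```

The packing logic is identical in both calls; only the unit of `chunk_size` changes, which is exactly the role `length_function` plays in the splitter.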

---

## 🎓 Analogy:

> Imagine you're packing suitcases (chunks).
> You need to make sure **each suitcase doesn't exceed airline limits**.

* If you use a **ruler**, you're measuring **inches** (characters).
* If you use a **weighing scale**, you're measuring **weight in kg** (tokens).

🧳 In GenAI, tokens are like "kg" — they **directly affect cost and limits** of LLMs.
So `length_function` = how you want to **measure your suitcase size**.

---

## 🧪 Example 1 – Default Behavior (Character Count)

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "The quick brown fox jumps over the lazy dog. " * 10

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    length_function=len
)

chunks = splitter.split_text(text)
print(chunks[:2])
```

### 🔍 What happens?

* It builds chunks of **up to** 100 characters, breaking at spaces, with up to 10 characters of overlap.
* It doesn’t care how many **tokens** that makes — could be more or fewer than LLM limits.

---

## 🧪 Example 2 – Token-Based Splitting using `tiktoken`

```python
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Use tiktoken tokenizer (e.g., for GPT-3.5)
encoding = tiktoken.get_encoding("cl100k_base")
tiktoken_len = lambda text: len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    length_function=tiktoken_len
)

text = "The quick brown fox jumps over the lazy dog. " * 10
chunks = splitter.split_text(text)
print(chunks[:2])
```

### 🔍 What happens here?

* It splits based on **token count**, not characters.
* Useful when:

  * You want fine control over **token budgets**,
  * You're going to embed or send text to LLMs.

---

## 🧠 Why is `length_function` important?

| Scenario                    | Should You Use Token-Based?    |
| --------------------------- | ------------------------------ |
| Embedding chunks            | ✅ Yes — costs depend on tokens |
| Prompt engineering          | ✅ Yes — LLMs use token limits  |
| Simple character processing | ❌ No — `len()` is enough       |

---



### 🔹 Step 1: Sample Text

```python
text = "LangChain helps developers build applications powered by language models. " * 5
```

👉 This gives us a **repetitive** paragraph useful to illustrate chunk boundaries.

---

### 🔹 Step 2: Setup

#### 🔧 Imports + Tokenizer

```python
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
```

#### 🔧 Token-based function

```python
# For OpenAI models like gpt-3.5, gpt-4
encoding = tiktoken.get_encoding("cl100k_base")
token_count = lambda text: len(encoding.encode(text))
```

---

### 🔹 Step 3: Split using Character Length

```python
splitter_char = RecursiveCharacterTextSplitter(
    chunk_size=120,
    chunk_overlap=20,
    length_function=len  # character count
)

chunks_char = splitter_char.split_text(text)
```

#### ✅ Output (first two chunks):

Chunks break at word boundaries and stay within 120 characters (exact boundaries may differ slightly between versions):

```python
[
  "LangChain helps developers build applications powered by language models. LangChain helps developers build applications",
  "build applications powered by language models. LangChain helps developers build applications powered by language"
]
```

---

### 🔹 Step 4: Split using Token Count

```python
splitter_token = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=10,
    length_function=token_count
)

chunks_token = splitter_token.split_text(text)
```

#### ✅ Output (first two chunks):

Exact boundaries depend on the tokenizer, so print `chunks_token` to see them. Each repeated sentence is only on the order of 10 to 15 `cl100k_base` tokens, so a 40-token budget holds roughly three sentences per chunk, and the whole text fits in about two chunks. To verify, print `token_count(chunk)` for each chunk; the values should stay at or near 40.

---

### 🧠 Key Observations:

| Feature           | Character-Based                     | Token-Based                               |
| ----------------- | ----------------------------------- | ----------------------------------------- |
| Measures          | Length in characters (letters)      | Token count (used by LLMs)                |
| Chunk Consistency | Less precise for model input length | Matches LLM token budgets more accurately |
| Overlap Meaning   | Overlap = characters                | Overlap = tokens                          |
| Ideal For         | Simple local splits, text preview   | LLMs, Embeddings, Prompt contexts         |

---

### ✅ **1. What is a token in an LLM?**

A **token** is the smallest unit of text that a language model understands. Tokens can represent words, subwords, punctuation marks, spaces, or even parts of words. For example, the word `unbelievable` might be split into multiple tokens like `un`, `believ`, `able`.

Models like GPT or Claude don’t process entire sentences directly — they process **sequences of tokens**.

---

### ✅ **2. How do tokens differ from words and characters?**

* **Words** are language units separated by spaces.
* **Characters** are individual letters or symbols.
* **Tokens** are **model-specific subword units**, created by a **tokenizer**.

Example:

```text
Text: "Don't stop"
Words: ["Don't", "stop"]
Characters: ['D', 'o', 'n', "'", 't', ..., 'p']
Tokens (tiktoken, cl100k_base): ["Don", "'t", " stop"]
```

So, tokens can be **shorter or longer** than words, and may include parts of words + punctuation.
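
The three views can be compared in a few lines of Python. The regex below is only a toy pre-tokenizer assumed for illustration; real tokenizers such as `tiktoken` apply learned byte-pair merges on top of a similar first pass, so their output can differ.

```python
import re

text = "Don't stop"

words = text.split()   # split on whitespace
chars = list(text)     # every character, including the space

# Toy rule: contraction suffix, or optional space + word, or one punctuation mark.
toy_tokens = re.findall(r"'\w+| ?\w+|[^\w\s]", text)

print(words)
print(len(chars))
print(toy_tokens)
```

Note how the leading space travels with `" stop"`: many tokenizers attach whitespace to the following word instead of treating it as its own unit.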

---

### ✅ **3. Why is tokenization important in LLM pipelines?**

Tokenization is critical because:

* LLMs **only operate on tokens**.
* The **input/output token limits** of the model depend on token count, not word count.
* Tokenization defines how text is **segmented** and influences cost, latency, and performance.
* Embedding and chunking logic relies on token counts to prevent cutoff errors or data loss.

So, poor understanding of tokenization can lead to:

* Truncated prompts
* Higher cost
* Lower-quality generation

---

### ✅ **4. How do you count tokens before sending a prompt to GPT-4?**

I use OpenAI’s [`tiktoken`](https://github.com/openai/tiktoken) library to simulate GPT-4’s tokenizer:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4/3.5 encoding
text = "LangChain helps developers build LLM-powered apps."
tokens = encoding.encode(text)
print(len(tokens))  # Gives token count
```

This helps me ensure that my prompt stays within the model’s token limit.

---

### ✅ **5. What are the token limits of popular models?**

| Model             | Max Tokens |
| ----------------- | ---------- |
| GPT-3.5           | 4,096      |
| GPT-4 Turbo       | 128,000    |
| Claude 2          | 100,000    |
| Gemini 1.5        | \~1M       |
| Cohere Command R+ | \~128K     |

Note: These limits include both **prompt** and **response** tokens. Limits change between model versions, so check the provider's current documentation.
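
Because prompt and response share one window, the budget check is simple arithmetic. The sketch below uses a hypothetical helper, `fits_context`, and made-up token counts purely for illustration.

```python
def fits_context(prompt_tokens, max_response_tokens, context_window):
    """True if the prompt plus the reserved response fits in the window."""
    return prompt_tokens + max_response_tokens <= context_window

# Illustrative numbers against a 4,096-token window.
ok = fits_context(prompt_tokens=3500, max_response_tokens=500, context_window=4096)
too_big = fits_context(prompt_tokens=3800, max_response_tokens=500, context_window=4096)

print(ok)       # 3500 + 500 = 4000, which fits
print(too_big)  # 3800 + 500 = 4300, which does not
```

In practice this means reserving room for the answer before deciding how many retrieved chunks to include in the prompt.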

---

### ✅ **6. What happens if you exceed token limits?**

If you exceed token limits:

* **OpenAI/GPT models** will return an error like `context length exceeded`.
* **Claude or Gemini** APIs generally also reject over-limit requests with an error, though the exact behavior varies by provider and SDK.
* If you split documents improperly, your chunk might get **cut off mid-sentence**, leading to:

  * Poor completions
  * Loss of important context
  * Failed memory or retrieval steps in RAG pipelines

So always chunk text using **token-aware splitters** like `RecursiveCharacterTextSplitter` with a token-based `length_function`.

---

### ✅ **7. How does chunking relate to tokens? Why not use character-based chunks?**

Chunking is used to split large documents into smaller parts before feeding them into an LLM. It must be **token-aware** because:

* LLMs operate in token space.
* Two chunks with the same number of characters might have **very different token lengths**.
* Character-based splitting might **exceed token limits**, because a character budget says nothing about the token count.

That’s why in LangChain and other libraries, we use:

```python
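import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token counter: length in cl100k_base tokens instead of characters
encoding = tiktoken.get_encoding("cl100k_base")
token_count = lambda text: len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=token_count
)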
```

This keeps every chunk **within the token budget** and avoids context-length errors.

---

### ✅ **8. How do emojis and Unicode characters affect token count?**

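Emojis and non-English scripts (like Chinese, Arabic, Devanagari) often take **more than one token** per visible character.

Examples (exact counts depend on the tokenizer):

* "💡" can take several tokens, because a single emoji is 4 bytes in UTF-8
* "你好" (Chinese) often takes one or more tokens per character
* "❤️" is typically two code points (a heart plus a variation selector), so it usually costs more than one token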

So if you're processing **multilingual or emoji-rich content**, you should **always count tokens** instead of assuming based on character length.
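
One reason for this is visible without any tokenizer: byte-level tokenizers start from UTF-8 bytes, and these characters take several bytes each. The sketch below only compares code points with bytes; it does not compute real token counts, which still require the model's tokenizer.

```python
samples = {
    "bulb emoji": "\U0001F4A1",         # 💡
    "ni hao": "\u4f60\u597d",           # 你好
    "red heart": "\u2764\ufe0f",        # heart + variation selector
    "ascii word": "hello",
}

sizes = {}
for name, text in samples.items():
    # len() counts code points; encode() shows the UTF-8 byte length.
    sizes[name] = (len(text), len(text.encode("utf-8")))
    print(name, sizes[name])
```

A plain ASCII word uses one byte per character, while the emoji and Chinese text use three or four, which is why character counts underestimate their token cost.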

---

## 💬 Bonus Tip: Answer With Tools

In real interviews, **mentioning tools or best practices** shows you're practical.

Example:

> “To ensure my prompts stay under token limits, I always simulate the tokenizer using `tiktoken`. For visual inspection, I also use the OpenAI Tokenizer web tool.”

---

## ✅ Final Summary Table

| Concept        | Description                                                |
| -------------- | ---------------------------------------------------------- |
| Token          | Smallest LLM-readable unit of text                         |
| Tokenizer      | Tool to break text into tokens                             |
| Why Important? | Cost, chunking, context size, performance                  |
| Count Tokens   | Use `tiktoken` for OpenAI, online tools for others         |
| Interview Prep | Focus on LLM limits, tokenizer behaviors, chunking impacts |

---

## ✅ **1. Purpose of Text Splitters in LangChain**

Text splitters are used to divide large documents into smaller chunks that:

* **Fit within the token limits** of LLMs (e.g., GPT-4),
* Preserve **semantic meaning**, and
* Improve performance in **RAG (Retrieval-Augmented Generation)**.

---

## ✅ **2. CharacterTextSplitter – Basic Version**

### 🔹 How It Works:

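* Splits text on a **single separator** (default `\n\n`), then merges the pieces into chunks of up to `chunk_size` characters.
* It never looks for a smaller boundary: a piece longer than `chunk_size` is emitted as an oversized chunk. Set `chunk_size` and `chunk_overlap` explicitly, as below, rather than relying on the defaults.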

```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_text(text)
```

### ✅ **When to Use:**

* You want **very simple, fast** splitting.
* Your text has **clear separators** (e.g., paragraphs separated by newlines).
* You are building a **proof of concept** or minimal demo.

### ⚠️ Limitation:

* Can produce **oversized chunks** when the separator is rare, or cut at arbitrary positions if you set `separator=""`.
* Doesn't fall back to smaller boundaries to preserve context.
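
The oversized-chunk behavior can be shown with a simplified single-separator sketch. `split_on_separator` is a hypothetical function for illustration (no overlap handling), not the library's code.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then merge neighbors while they fit."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece   # may itself exceed chunk_size; nothing splits it further
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "short one\n\n" + "x" * 50 + "\n\nshort two"
parts = split_on_separator(text, "\n\n", chunk_size=20)
print([len(p) for p in parts])
```

The 50-character paragraph comes out as one chunk even though the limit is 20, because there is no second separator to fall back on.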

---

## ✅ **3. RecursiveCharacterTextSplitter – Smart Version**

### 🔹 How It Works:

* Tries to split text by **increasingly smaller logical units**:

  * First by paragraphs (`\n\n`)
  * Then by lines (`\n`); sentence separators like `.` must be added via `separators`
  * Then by words (` `)
  * Finally by characters
* Uses a **recursive strategy** to preserve as much **semantic coherence** as possible.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_text(text)
```

### ✅ **When to Use:**

* You want **intelligent chunking** that avoids breaking meaning.
* You're preparing input for **RAG**, **summarization**, or **Q\&A** tasks.
* Your text is **unstructured**, like PDFs, scraped HTML, research papers.
* You care about **sentence integrity** and **semantic boundaries**.

### 🔍 Example:

Imagine splitting this text:

```
LangChain is an open-source framework that helps developers build applications with LLMs.
It provides tools for retrieval, chaining, memory, and more.
```

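With `chunk_size=100`, **CharacterTextSplitter** with `separator=""` would cut after exactly 100 characters, **in the middle of a sentence**; with its default `\n\n` separator it would not split this text at all.
**RecursiveCharacterTextSplitter** would prefer to **split at the line break between the two sentences**.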

---

## ✅ **Summary:**

| Feature                    | CharacterTextSplitter             | RecursiveCharacterTextSplitter            |
| -------------------------- | --------------------------------- | ----------------------------------------- |
| Strategy                   | Split on one separator, then merge | Hierarchical: paragraph → line → word → character |
| Preserves Semantic Meaning | ❌ Only as well as the one separator does | ✅ Tries to preserve logical boundaries |
| Performance                | ✅ Faster                          | ⚠️ Slightly slower but smarter            |
| Use Case                   | Quick demos, structured text      | RAG, summarization, QA, unstructured docs |
| When to Use                | Minimal control needed            | Need semantic-aware chunks                |

---

## 🧠 Important Q: When would RecursiveCharacterTextSplitter fail?

**Answer**:
RecursiveCharacterTextSplitter relies on finding good separators (like `\n\n`, `\n`, or spaces). If the text has no clear structure (e.g., **binary data**, **code without line breaks**, or poorly OCR-ed documents), it falls back to the lowest level and cuts at arbitrary character positions. In that case, **custom separators or regex-based splitting** might be required.

---