#  **2. Text splitters (Chunking)**

LLMs and embedding models have a hard limit on the number of tokens they can handle. This limit is usually called the context window; for LLMs it typically applies to the combination of input and output. Context windows are measured in tokens, for instance 8,192 tokens. Tokens, as mentioned in the Preface, are a representation of text as numbers, with each token usually covering between three and four characters of English text.
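
To get a feel for these numbers, here is a rough sketch (not a real tokenizer) that estimates a token count from the character count, using the three-to-four-characters-per-token rule of thumb. The function names, the default of 4 characters per token, and the `reserved_for_output` budget are illustrative assumptions; use an actual tokenizer such as `tiktoken` when you need exact counts.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, English text only)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    text: str,
    context_window: int = 8192,       # example window size from the text above
    reserved_for_output: int = 1024,  # assumed budget kept free for the model's answer
    chars_per_token: float = 4.0,
) -> bool:
    """Check whether the estimated input size leaves room for the output."""
    return estimate_tokens(text, chars_per_token) <= context_window - reserved_for_output


sample = "word " * 10_000  # 50,000 characters
print(estimate_tokens(sample))   # 12500
print(fits_in_context(sample))   # False: the text must be chunked
```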


- https://python.langchain.com/docs/concepts/text_splitters/

## 📌 Why Do We Need to Break Text into Chunks?


## **1️⃣ LLM Token Limits**
✅ **LLMs like GPT-4 have a token limit** (e.g., 4,096 or 8,192 tokens).  
✅ If a document **exceeds this limit**, it **cannot be processed** in a single request.  
✅ **Chunking ensures text fits within token constraints**, allowing LLMs to process long documents efficiently.

🔹 **Example Use Case:**  
- A **50-page legal document** needs to be **split into smaller sections** for question-answering with GPT.

---

## **2️⃣ Improves Retrieval Accuracy in Vector Databases**
✅ **Vector search systems (e.g., ChromaDB, FAISS, Pinecone) store text embeddings**.  
✅ **If chunks are too large**, embeddings lose specificity, reducing **search accuracy**.  
✅ Chunking creates **smaller, more precise embeddings**, improving **semantic search and RAG (Retrieval-Augmented Generation).**

🔹 **Example Use Case:**  
- Searching for **a specific legal clause** is easier if the text is chunked properly.

---

## **3️⃣ Keeps Context While Preventing Data Loss**
✅ **If chunks are too small**, they lose meaning because context is missing.  
✅ **If chunks are too large**, important details might be **buried** or **not retrieved efficiently**.  
✅ **Balanced chunking** ensures **each chunk contains enough information** while staying manageable.

🔹 **Example Use Case:**  
- A chatbot answering **medical questions** needs to **retain enough context** per chunk.

---

## **🚀 Why Break Text into Chunks?**
| **Problem** | **Solution with Chunking** |
|------------|--------------------------|
| LLM token limit exceeded | Splits long text into **manageable sections** |
| Poor search accuracy | Creates **smaller, precise embeddings** |
| Mid-sentence breaks lose meaning | Preserves **logical structure** |
| Inefficient processing | Allows **efficient streaming & indexing** |

📌 **Chunking ensures LLMs handle large texts effectively, improving retrieval, search, and contextual understanding!** 🚀😊


In [None]:
with open('some_data/FDR_State_of_Union_1944.txt') as file:
    speech_text = file.read()


In [175]:
# Characters
len(speech_text)

21927

In [176]:
# Words
len(speech_text.split())

3750

## **🚀 Which Text Splitter Should You Use?**
| **Splitter** | **How It Works** | **Best For** |
|-------------|----------------|--------------|
| **CharacterTextSplitter** | Splits on a single separator; chunk length measured in characters (simple but naive) | Basic text chunking |
| **CharacterTextSplitter.from_tiktoken_encoder()** | Same splitting, but chunk length measured in OpenAI `tiktoken` tokens | GPT/OpenAI token efficiency |
| **RecursiveCharacterTextSplitter** | Smart recursive splitting (best for structured text) | Most use cases (retrieval, search, chatbots) |
| **RecursiveCharacterTextSplitter.from_tiktoken_encoder()** | Smart recursive splitting, with chunk length measured in OpenAI `tiktoken` tokens | Most use cases (retrieval, search, chatbots) with OpenAI models |

✔ **Use `CharacterTextSplitter`** → If you want a **basic and fast** approach.  
✔ **Use `CharacterTextSplitter.from_tiktoken_encoder()`** → If working with **OpenAI models** to ensure **token-aware chunking**.  
✔ **Use `RecursiveCharacterTextSplitter`** → If you need **structured, smart splitting** (recommended for most cases).  
✔ **Use `RecursiveCharacterTextSplitter.from_tiktoken_encoder()`** → If you need **structured, smart splitting** and are working with **OpenAI models**, to ensure **token-aware chunking**.  


## **1️⃣ CharacterTextSplitter (Length Based)**
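✅ Splits text on a **single separator** (default `\n\n`, i.e., paragraphs) and merges the pieces into chunks of up to `chunk_size` **characters**.  
✅ Simple and fast, good for **small texts**.  
❌ May **split mid-sentence** or **lose context** if the separator and sizes are not tuned properly; a piece that contains no separator stays as one **oversized chunk**.  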

🔹 **Best For:** Simple **text splitting when structure isn’t a concern**.

- https://python.langchain.com/docs/how_to/character_text_splitter/
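
The split-then-merge idea behind a length-based splitter can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_on_separator` is hypothetical, and overlap handling is left out to keep the logic visible.

```python
from typing import List


def split_on_separator(text: str, separator: str = "\n\n", chunk_size: int = 1000) -> List[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either the chunk is empty (take the piece even if oversized)
            # or the piece still fits into the current chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "a" * 600 + "\n\n" + "b" * 300 + "\n\n" + "c" * 500
print([len(chunk) for chunk in split_on_separator(demo)])  # [902, 500]
```

Note that a single piece longer than `chunk_size` is kept whole here, which mirrors why a one-separator splitter can return chunks larger than requested.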

---

In [177]:
from langchain_text_splitters import CharacterTextSplitter

In [41]:
text_splitter = CharacterTextSplitter(separator="\n\n",
                                       chunk_size=1000, # max number of characters
                                       chunk_overlap=100
                                       ) 

In [42]:
texts = text_splitter.create_documents([speech_text]) # speech_text is a string holding the file's contents

print(len(texts))
texts

27


[Document(page_content="This Nation in the past two years has become an active partner in the world's greatest war against human slavery. We have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule. But I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash."),
 Document(page_content='When Mr. Hull went to Moscow in October, and when I went to Cairo and Teheran in November, we knew that we 

In [43]:
print(texts[0].page_content)

This Nation in the past two years has become an active partner in the world's greatest war against human slavery. We have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule. But I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.

We are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.


In [44]:
for text in texts:
    print(len(text.page_content), len(text.page_content.split())) 

839 152
933 169
900 145
722 117
905 150
672 112
950 159
977 172
939 156
848 147
1000 169
848 144
855 149
829 144
866 147
878 149
727 139
874 154
588 93
504 80
609 104
886 144
982 172
701 121
544 88
782 137
892 166


## **2️⃣ CharacterTextSplitter.from_tiktoken_encoder() (Token-Based Splitter)**
✅ Uses **OpenAI’s `tiktoken` tokenizer** to measure chunk length in **tokens instead of characters** (the text is still split on the separator).  
✅ More efficient for **LLMs that count tokens (GPT models)**.  
✅ Helps **avoid exceeding token limits** in API requests.  

🔹 **Best For:** When using **GPT-based models** (OpenAI) to control **token counts accurately**.
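
The only thing that changes in a token-based splitter is how chunk length is measured. The sketch below makes that explicit with a pluggable `length_function`. It is an illustration, not LangChain's code: `merge_by_length` and `toy_token_count` are hypothetical names, and the whitespace "tokenizer" is a stand-in so the example runs without downloading anything. With `tiktoken` you would measure length with something like `len(encoding.encode(text))` instead.

```python
from typing import Callable, List


def merge_by_length(
    pieces: List[str],
    chunk_size: int,
    length_function: Callable[[str], int] = len,
    joiner: str = "\n\n",
) -> List[str]:
    """Greedily merge pieces while length_function(chunk) stays within chunk_size."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if not current or length_function(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def toy_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


pieces = ["one two three", "four five", "six seven eight nine"]
by_tokens = merge_by_length(pieces, chunk_size=5, length_function=toy_token_count)
by_chars = merge_by_length(pieces, chunk_size=20)  # default: count characters

print(len(by_tokens))  # 2 chunks when measuring in (toy) tokens
print(len(by_chars))   # 3 chunks when measuring in characters
```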

---

### If you encounter the following error while using `tiktoken`:

```python
SSLError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): 
Max retries exceeded with url: /encodings/cl100k_base.tiktoken 
(Caused by SSLError(SSLCertVerificationError(1, 
'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: 
unable to get local issuer certificate (_ssl.c:1125)')))
```

follow these steps:

### **Step 1: Getting the blob URL**

First, let's grab the tokenizer blob URL. Run the code below and copy the URL for `cl100k_base.tiktoken`.

In [182]:
import tiktoken_ext.openai_public
import inspect

print(dir(tiktoken_ext.openai_public))
# The encoder we want is cl100k_base, we see this as a possible function

print(inspect.getsource(tiktoken_ext.openai_public.cl100k_base))
# The URL should be in the 'load_tiktoken_bpe function call'

['ENCODING_CONSTRUCTORS', 'ENDOFPROMPT', 'ENDOFTEXT', 'FIM_MIDDLE', 'FIM_PREFIX', 'FIM_SUFFIX', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'cl100k_base', 'data_gym_to_mergeable_bpe_ranks', 'gpt2', 'load_tiktoken_bpe', 'o200k_base', 'p50k_base', 'p50k_edit', 'r50k_base']
def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
        expected_hash="223921b76ee99bde995b7ff738513eef100fb51d18c93597a113bcffe865b2a7",
    )
    special_tokens = {
        ENDOFTEXT: 100257,
        FIM_PREFIX: 100258,
        FIM_MIDDLE: 100259,
        FIM_SUFFIX: 100260,
        ENDOFPROMPT: 100276,
    }
    return {
        "name": "cl100k_base",
        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": speci

### **Step 2: Download the Tokenizer File**

Navigate to the blob URL you obtained in Step 1 and download the file.

- https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

### **Step 3: Generate Cache Key & Rename File**
Run the following code to generate the cache key, then rename the downloaded file to that key (with no file extension).

In [183]:
import hashlib

blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
print(cache_key)

9b5ad71b2ce5302211f9c61530b329a4922fc6a4


### **Step 4: Set up the tiktoken cache**

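Run the following code every time you need to use tiktoken:

In [184]:
import os
import hashlib

# Same blob URL as in Step 3, repeated so this cell can run on its own
blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()

# path to the directory where you placed the renamed file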
tiktoken_cache_dir = r"C:\Users\Seyed Barabadi\Downloads\Gen AI\tiktoken" 
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir

# validate
assert os.path.exists(os.path.join(tiktoken_cache_dir, cache_key))
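
Steps 3 and 4 can also be combined into one small helper. This is a sketch under assumptions: the helper name `install_tiktoken_blob` and the paths in the commented usage are hypothetical, and it relies only on the behavior described above (the cached file is named after the SHA-1 of the blob URL and is found through `TIKTOKEN_CACHE_DIR`).

```python
import hashlib
import os
import shutil


def install_tiktoken_blob(downloaded_file: str, cache_dir: str, blobpath: str) -> str:
    """Copy a manually downloaded tokenizer file into the cache under its cache key."""
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, cache_key)
    shutil.copyfile(downloaded_file, target)
    # Side effect: point tiktoken at the cache directory for this process.
    os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
    return target


# Example usage (hypothetical paths; adjust to your machine):
# install_tiktoken_blob(
#     downloaded_file="Downloads/cl100k_base.tiktoken",
#     cache_dir="tiktoken_cache",
#     blobpath="https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
# )
```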


---

In [33]:
with open('some_data/FDR_State_of_Union_1944.txt') as file:
    speech_text = file.read()

In [34]:
from langchain_text_splitters import CharacterTextSplitter


text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", 
    chunk_size=500, # max number of tokens
    chunk_overlap=100
)

texts = text_splitter.split_text(speech_text) # speech_text is a string holding the file's contents
len(texts)

12

In [35]:
print(texts[0])

This Nation in the past two years has become an active partner in the world's greatest war against human slavery. We have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule. But I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.

We are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.

When Mr. Hull went to Moscow in October, and when I went to Cairo and Teheran in November, we knew that we were in agreement with our allies in our common dete

In [36]:
import tiktoken

# Initialize the OpenAI tokenizer (use "cl100k_base" for GPT-4/3.5)
tokenizer = tiktoken.get_encoding("cl100k_base")

for text in texts:
    print( len(text), len(tokenizer.encode(text)) )

2330 470
2187 409
2493 487
2096 422
2213 481
2131 443
2499 473
2273 472
2412 462
2164 424
2066 414
892 192


---

## **3️⃣ RecursiveCharacterTextSplitter (Smart Splitter - Best for Most Cases)**
✅ **Breaks text recursively**, trying the most meaningful split first (**paragraphs → lines → words**).  
✅ **Preserves more context** by preferring paragraph and line boundaries over arbitrary cut points.  
✅ Works best with **structured content (e.g., articles, legal documents, research papers).**  
❌ Slightly slower than `CharacterTextSplitter`, but **more accurate**.

🔹 **Best For:** **Most real-world use cases** (retrieval-augmented generation, chunking for vector search, chatbot memory, etc.).

- https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html

---

### Split large documents into small, but **still meaningful, pieces of text**:
LangChain provides `RecursiveCharacterTextSplitter`, which does the following:

1. Take a list of separators, in order of importance. By default these are:

    a. The paragraph separator: `\n\n`

    b. The line separator: `\n`

    c. The word separator: `space` character

    d. The empty string `""`, which falls back to splitting between individual characters

2. To respect the given chunk size, for instance, 1,000 characters, start by splitting up paragraphs.

3. For any paragraph longer than the desired chunk size, split by the next separator: lines. Continue until all chunks are smaller than the desired length, or there are no additional separators to try.

4. Emit each chunk as a Document, with the metadata of the original document passed in and, if `add_start_index=True` is set, the chunk's start position in the original document (`start_index`).
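
The steps above can be sketched as a short recursive function. This is a simplified illustration of the idea, not LangChain's implementation: the name `recursive_split` is hypothetical, and chunk overlap and `Document` metadata are omitted.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split with the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to try: keep the oversized piece
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut between characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then split the long piece with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "aaaa bbbb cccc\n\ndddd eeee ffff gggg hhhh"
print(recursive_split(demo, chunk_size=14))
# ['aaaa bbbb cccc', 'dddd eeee ffff', 'gggg hhhh']
```

The first paragraph fits and is kept whole; the second is too long, so it is split again at spaces rather than at an arbitrary character position.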

In [189]:
with open('some_data/FDR_State_of_Union_1944.txt') as file:
    speech_text = file.read()

### Let's use a `TextLoader` this time

In [190]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("some_data/FDR_State_of_Union_1944.txt") # or any other loader
loaded_docs = loader.load()
loaded_docs

[Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt'}, page_content='This Nation in the past two years has become an active partner in the world\'s greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.\n\nWhen Mr. Hull went to Moscow in October, and when I went to Ca

In [191]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, # max number of characters
    chunk_overlap=200,
)
 
splitted_docs = splitter.split_documents(loaded_docs) # loaded_docs is a list of Document objects created by loader

# or:
splitted_docs_2 = splitter.create_documents([speech_text]) # speech_text is a string holding the file's contents

print(len(splitted_docs))
splitted_docs

28


[Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt'}, page_content="This Nation in the past two years has become an active partner in the world's greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash."),
 Document(metadata={'source': 'some_data/FDR_State_of_Union_194

In [192]:
splitted_docs[1].page_content

'When Mr. Hull went to Moscow in October, and when I went to Cairo and Teheran in November, we knew that we were in agreement with our allies in our common determination to fight and win this war. But there were many vital questions concerning the future peace, and they were discussed in an atmosphere of complete candor and harmony.\n\nIn the last war such discussions, such meetings, did not even begin until the shooting had stopped and the delegates began to assemble at the peace table. There had been no previous opportunities for man-to-man discussions which lead to meetings of minds. The result was a peace which was not a peace. That was a mistake which we are not repeating in this war.\n\nAnd right here I want to address a word or two to some suspicious souls who are fearful that Mr. Hull or I have made "commitments" for the future which might pledge this Nation to secret treaties, or to enacting the role of Santa Claus.'

In the preceding code, the documents created by the document loader are split into chunks of up to 1,000 characters each, with up to 200 characters of overlap between consecutive chunks to maintain some context. The result is also a list of documents, where each document is at most 1,000 characters long, split along the natural divisions of written text: paragraphs, new lines and, finally, words. This uses the structure of the text to keep each chunk a consistent, readable snippet of text.
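
To check how much two consecutive chunks actually share, a small helper can measure the longest suffix of one chunk that is also a prefix of the next. This is an illustrative sketch; `overlap_length` is a hypothetical helper, not part of LangChain.

```python
def overlap_length(previous: str, following: str) -> int:
    """Length of the longest suffix of `previous` that is a prefix of `following`."""
    max_len = min(len(previous), len(following))
    for size in range(max_len, 0, -1):
        if previous.endswith(following[:size]):
            return size
    return 0


print(overlap_length("hello world", "world peace"))  # 5
print(overlap_length("abc", "xyz"))                  # 0
```

Applied to neighbouring `page_content` values, this can return 0: `chunk_overlap` is an upper bound, and when a chunk ends with a paragraph longer than the overlap budget, nothing may be carried over into the next chunk.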

In [193]:
for doc in splitted_docs:
    print(len(doc.page_content), len(doc.page_content.split())) 

841 152
933 169
900 145
904 144
905 150
672 112
950 159
977 172
497 84
944 156
969 167
877 150
660 113
855 149
829 144
866 147
878 149
727 139
874 154
588 93
504 80
609 104
886 144
882 149
925 162
972 163
782 137
892 166




## **4️⃣ RecursiveCharacterTextSplitter.from_tiktoken_encoder() (Smart Splitter - Best for GPT)**
✅ **Breaks text recursively**, trying the most meaningful split first (**paragraphs → lines → words**).  
✅ **Preserves more context** by preferring paragraph and line boundaries over arbitrary cut points.  
✅ Works best with **GPT models.**  
❌ Slightly slower than `CharacterTextSplitter`, but **more accurate**.

🔹 **Best For:** **Most real-world use cases** (retrieval-augmented generation, chunking for vector search, chatbot memory, etc.).

---

In [198]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=1000, # chunk_size is the number of tokens
    chunk_overlap=200,
)

splitted_docs = text_splitter.split_documents(loaded_docs) # loaded_docs is a list of Document objects created by loader

# or:
# splitted_docs_2 = text_splitter.create_documents([speech_text]) # speech_text is a string holding the file's contents

print(len(splitted_docs))
splitted_docs


6


[Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt'}, page_content='This Nation in the past two years has become an active partner in the world\'s greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.\n\nWhen Mr. Hull went to Moscow in October, and when I went to Ca

In [195]:
import tiktoken


# Initialize the OpenAI tokenizer (use "cl100k_base" for GPT-4/3.5)
tokenizer = tiktoken.get_encoding("cl100k_base")


for doc in splitted_docs:
    print(len(doc.page_content), len(tokenizer.encode(doc.page_content)) ) 

4884 961
4640 917
4388 928
4694 927
4547 887
3206 646


## 📌 Optimal Chunk Size & Overlap When Using `tiktoken` 

| **Use Case** | **Recommended Chunk Size** | **Recommended Overlap** |
|-------------|--------------------------|------------------------|
| **Short Documents (1-2 pages)** | `256-512 tokens` | `50 tokens` |
| **General NLP & Chatbots** | `500-800 tokens` | `50-100 tokens` |
| **Retrieval-Augmented Generation (RAG)** | `800-1200 tokens` | `100-150 tokens` |
| **Large Documents (Reports, Books)** | `1000-1500 tokens` | `150-200 tokens` |
| **Technical or Code Processing** | `200-500 tokens` | `50-100 tokens` |
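
To see how chunk size and overlap interact, the sketch below estimates how many chunks a document produces under a fixed-stride sliding window. It is an approximation under stated assumptions: the function name `estimate_chunk_count` is hypothetical, and real splitters break at separators, so actual counts will differ somewhat.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate chunk count for a sliding window with a fixed stride."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return math.ceil((total_tokens - chunk_overlap) / stride)


print(estimate_chunk_count(1000, chunk_size=500, chunk_overlap=100))   # 3
print(estimate_chunk_count(5000, chunk_size=1000, chunk_overlap=200))  # 6
```

A larger overlap shrinks the stride, so the same document yields more chunks and more tokens to embed and store.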

---

## **Combining Document Loaders and Splitters** using `load_and_split()`

In [196]:
import os
import hashlib


blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()


# path to the directory you created to paste the file
tiktoken_cache_dir = r"C:\Users\Seyed Barabadi\Downloads\Gen AI\tiktoken" 
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir

# validate
assert os.path.exists(os.path.join(tiktoken_cache_dir, cache_key))

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


loader = PyPDFLoader('some_data/marvel_superheroes.pdf')

# Use a text splitter to break large text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

pages = loader.load_and_split(text_splitter)
pages

[Document(metadata={'source': 'some_data/marvel_superheroes.pdf', 'page': 0}, page_content='Spider-Man\nSpider-Man, also known as Peter Parker, is one of Marvel\'s most iconic superheroes. Created by\nwriter Stan Lee and artist Steve Ditko, he first appeared in Amazing Fantasy #15 in 1962. As a\nteenager, Peter Parker was bitten by a radioactive spider, which granted him extraordinary powers,\nincluding superhuman strength, agility, the ability to cling to walls, and a "spider-sense" that warns\nhim of impending danger.'),
 Document(metadata={'source': 'some_data/marvel_superheroes.pdf', 'page': 0}, page_content='him of impending danger.\nDespite his incredible abilities, Peter\'s life is fraught with hardship. After the tragic murder of his\nUncle Ben, Peter learns a valuable lesson: "With great power comes great responsibility." This\nphilosophy shapes his journey as he fights crime and protects the citizens of New York City from\nnotorious villains like the Green Goblin, Doctor Octo