## **Text Transformation using Text Splitters**

LLMs have a fixed maximum context window. If your document contains more tokens than the context window allows, the LLM won't be able to process it in a single call. So, it is important to break documents into smaller chunks.

### **Types of Text Splitters**

LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. Below is a table listing all of them, along with a few characteristics:
> **Name:** Name of the text splitter  
> **Splits On:** How this text splitter splits text  
> **Adds Metadata:** Whether or not this text splitter adds metadata about where each chunk came from.  
> **Description:** Description of the splitter, including recommendations on when to use it.

| Name | Splits On | Adds Metadata | Description |
| :--- | :--- | :---: | :--- |
| Character | A user-defined character | ❌ | Splits text on a single user-defined character. One of the simpler methods. |
| Recursive | A list of user-defined characters | ❌ | Recursively splits text, trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML-specific characters | ✅ | Splits text on HTML-specific characters. Notably, this adds information about where each chunk came from (based on the HTML). |
| Markdown | Markdown-specific characters | ✅ | Splits text on Markdown-specific characters. Notably, this adds information about where each chunk came from (based on the Markdown). |
| Code | Code-specific characters (Python, JS) | ❌ | Splits text on characters specific to programming languages. 15 different languages are available to choose from. |
| Token | Tokens | ❌ | Splits text on tokens. There are a few different ways to measure tokens. |
| [Experimental] Semantic Chunker | Sentences | ❌ | First splits on sentences, then combines adjacent ones if they are semantically similar enough. Taken from Greg Kamradt. |

**Credits:** [ThatAIGuy GitHub Repository](https://github.com/bansalkanav/Generative-AI-Scratch-2-Advance-By-ThatAIGuy)

### **Load the data**

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./example_data/text/file_1.txt")

data = loader.load()

In [6]:
doc_content = data[0].page_content

print(doc_content)

This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."


### **1. CharacterTextSplitter**

In [8]:
from langchain_text_splitters import CharacterTextSplitter

char_text_split = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=200,
    chunk_overlap=50
)

char_chunks = char_text_split.split_text(doc_content)

In [9]:
print("Type of char_chunks:", type(char_chunks))
print()
print("Type of each item in the list:", type(char_chunks[0]))
print()
print("Total number of chunks:", len(char_chunks))
print()
print("* Content of first chunk:", char_chunks[0])
print()
print("* Content of second chunk:", char_chunks[1])

Type of char_chunks: <class 'list'>

Type of each item in the list: <class 'str'>

Total number of chunks: 2

* Content of first chunk: This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.

* Content of second chunk: Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."


In [10]:
for chunk in char_chunks:
    print(len(chunk))

173
78
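
To see where the lengths 173 and 78 come from, here is a simplified pure-Python sketch of the split-then-merge idea: split on the separator, pack pieces into a chunk until `chunk_size` would be exceeded, then carry over trailing pieces that fit within `chunk_overlap`. The function name `split_and_merge` is hypothetical and this is not the library's implementation, which handles more edge cases; the sample string is the text of `file_1.txt` as printed above.

```python
def split_and_merge(text, separator="\n\n", chunk_size=200, chunk_overlap=50):
    """Split on `separator`, then greedily merge pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, window, total = [], [], 0  # total == len(separator.join(window))
    for piece in pieces:
        if window and total + len(separator) + len(piece) > chunk_size:
            chunks.append(separator.join(window))
            # Drop pieces from the front until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while window and (
                total > chunk_overlap
                or total + len(separator) + len(piece) > chunk_size
            ):
                removed = window.pop(0)
                total -= len(removed) + (len(separator) if window else 0)
        total += len(piece) + (len(separator) if window else 0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


sample = (
    "This is pretty much\nwhat's happened so far.\n\n"
    "Ross was in love\nwith Rachel since forever.\n\n"
    "Every time he tried to tell her,\nsomething got in the way\n\n"
    "Iike cats, Italian guys.\n\n"
    "And finally, Chandler was,\nlike, \"Forget about her.\""
)

chunks = split_and_merge(sample)
print([len(c) for c in chunks])  # [173, 78]
```

The first four paragraphs add up to 173 characters including separators. Adding the fifth would exceed 200, so a new chunk starts, and only the 24-character paragraph fits within the 50-character overlap, which is why it appears in both chunks.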


### **Multiple documents with Metadata**

In [11]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('example_data/text', glob="*.txt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|██████████| 2/2 [00:00<00:00, 106.94it/s]


In [12]:
print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))

Type of Data Variable:  <class 'list'>
Number of Documents: 2


In [24]:
# for doc in data:
#     print(doc.page_content)
#     print("---")

# for doc in data:
#     print(doc.metadata)
#     print("---")

In [38]:
doc_contents = [doc.page_content for doc in data]
doc_metadata = [doc.metadata for doc in data]

doc_chunks = char_text_split.create_documents(doc_contents, doc_metadata)

In [69]:
print("Total number of documents inside chunks:", len(doc_chunks))
print()
for i, chunk in enumerate(doc_chunks, start=1):
    print(f"Document {i} metadata: {chunk.metadata}")
    print(f"Document {i} chunks: {chunk.page_content}")
    print("-" * 100)

Total number of documents inside chunks: 3

Document 1 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 1 chunks: This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.
----------------------------------------------------------------------------------------------------
Document 2 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 2 chunks: Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."
----------------------------------------------------------------------------------------------------
Document 3 metadata: {'source': 'example_data\\text\\file_2.txt'}
Document 3 chunks: These were unbelievely expensive
and he'll grow out of them

in 20 minutes,
but I couldn't resist!

Look at these.
----------------------------------------------------------------------------------------------------
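
The output above shows each chunk carrying the `source` of the file it was cut from. A minimal sketch of that bookkeeping, assuming a hypothetical `Chunk` dataclass and `create_chunks` helper (illustrative names, not LangChain classes), could look like this:

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas, split_fn):
    """Split each text and attach a copy of its source metadata to every piece."""
    if len(texts) != len(metadatas):
        raise ValueError("texts and metadatas must have the same length")
    chunks = []
    for text, metadata in zip(texts, metadatas):
        for piece in split_fn(text):
            # Deep-copy so editing one chunk's metadata never affects its siblings.
            chunks.append(Chunk(piece, copy.deepcopy(metadata)))
    return chunks


texts = ["first paragraph\n\nsecond paragraph", "only paragraph"]
metadatas = [{"source": "file_1.txt"}, {"source": "file_2.txt"}]
chunks = create_chunks(texts, metadatas, lambda t: t.split("\n\n"))
for chunk in chunks:
    print(chunk.metadata["source"], "->", repr(chunk.page_content))
```

Pairing texts and metadata by position is the reason `doc_contents` and `doc_metadata` are built from the same `data` list in the same order.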


### **2. Recursively split by character**

In [63]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=50,
)

recursive_chunks = recursive_text_splitter.create_documents(doc_contents, doc_metadata)

In [68]:
print("Total number of documents inside chunks:", len(recursive_chunks))
print()
for i, chunk in enumerate(recursive_chunks, start=1):
    print(f"Document {i} metadata: {chunk.metadata}")
    print(f"Document {i} chunks: {chunk.page_content}")
    print("-" * 100)

Total number of documents inside chunks: 3

Document 1 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 1 chunks: This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.
----------------------------------------------------------------------------------------------------
Document 2 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 2 chunks: Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."
----------------------------------------------------------------------------------------------------
Document 3 metadata: {'source': 'example_data\\text\\file_2.txt'}
Document 3 chunks: These were unbelievely expensive
and he'll grow out of them

in 20 minutes,
but I couldn't resist!

Look at these.
----------------------------------------------------------------------------------------------------
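
With `chunk_size=200`, every paragraph already fits after splitting on `"\n\n"`, so the output above is identical to the `CharacterTextSplitter` result. The finer separators only matter when a piece is still too long. The sketch below illustrates that fallback with a smaller budget; `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it is a simplification rather than the library's algorithm.

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=50):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue  # skip empty pieces produced by repeated separators
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep neighbouring text together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long, so retry with finer separators.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Every time he tried to tell her,\nsomething got in the way\n\n"
    "Iike cats, Italian guys."
)
pieces = recursive_split(sample, chunk_size=50)
for piece in pieces:
    print(len(piece), repr(piece))
```

Here the first paragraph is 57 characters, so it is split again on `"\n"` into its two lines, while the second paragraph (24 characters) is kept whole.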


### **3. Split by tokens**

**i. NLTK Text Splitter**

In [61]:
# !pip install nltk

In [62]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91889\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [65]:
from langchain_text_splitters import NLTKTextSplitter

nltk_text_splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=50)

nltk_chunks = nltk_text_splitter.create_documents(doc_contents, doc_metadata)

In [None]:
print("Total number of documents inside chunks:", len(nltk_chunks))
print()
for i, chunk in enumerate(nltk_chunks, start=1):
    print(f"Document {i} metadata: {chunk.metadata}")
    print(f"Document {i} chunks: {chunk.page_content}")
    print("-" * 100)

Total number of documents inside chunks: 5

Document 1 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 1 chunks: This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.
----------------------------------------------------------------------------------------------------
Document 2 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 2 chunks: Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.
----------------------------------------------------------------------------------------------------
Document 3 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 3 chunks: And finally, Chandler was,
like, "Forget about her."
----------------------------------------------------------------------------------------------------
Document 4 metadata: {'source': 'example_data\\text\\file_2.txt'}
Document 4 chunks: These were unbelievely expensive
and he'll grow out of them

in 20 minutes,
but I could

**ii. tiktoken Text Splitter**

`tiktoken` is a fast BPE tokenizer created by OpenAI.

In [71]:
# !pip install tiktoken

In [72]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

tiktoken_text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-3.5-turbo",
    chunk_size=50,
    chunk_overlap=20
)

tiktoken_chunks = tiktoken_text_splitter.create_documents(doc_contents, doc_metadata)

In [73]:
print("Total number of documents inside chunks:", len(tiktoken_chunks))
print()
for i, chunk in enumerate(tiktoken_chunks, start=1):
    print(f"Document {i} metadata: {chunk.metadata}")
    print(f"Document {i} chunks: {chunk.page_content}")
    print("-" * 100)

Total number of documents inside chunks: 3

Document 1 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 1 chunks: This is pretty much
what's happened so far.

Ross was in love
with Rachel since forever.

Every time he tried to tell her,
something got in the way

Iike cats, Italian guys.
----------------------------------------------------------------------------------------------------
Document 2 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 2 chunks: Iike cats, Italian guys.

And finally, Chandler was,
like, "Forget about her."
----------------------------------------------------------------------------------------------------
Document 3 metadata: {'source': 'example_data\\text\\file_2.txt'}
Document 3 chunks: These were unbelievely expensive
and he'll grow out of them

in 20 minutes,
but I couldn't resist!

Look at these.
----------------------------------------------------------------------------------------------------


**iii. Sentence Transformers Token Text Splitter**

In [75]:
# !pip install sentence-transformers

In [76]:
from langchain_text_splitters.sentence_transformers import SentenceTransformersTokenTextSplitter

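# For this splitter the token window is set by tokens_per_chunk;
# chunk_size is only forwarded to the base class.
st_text_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=100,
    chunk_overlap=50,
)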

st_chunks = st_text_splitter.create_documents(doc_contents, doc_metadata)



In [77]:
print("Total number of documents inside chunks:", len(st_chunks))
print()
for i, chunk in enumerate(st_chunks, start=1):
    print(f"Document {i} metadata: {chunk.metadata}")
    print(f"Document {i} chunks: {chunk.page_content}")
    print("-" * 100)

Total number of documents inside chunks: 2

Document 1 metadata: {'source': 'example_data\\text\\file_1.txt'}
Document 1 chunks: this is pretty much what ' s happened so far. ross was in love with rachel since forever. every time he tried to tell her, something got in the way iike cats, italian guys. and finally, chandler was, like, " forget about her. "
----------------------------------------------------------------------------------------------------
Document 2 metadata: {'source': 'example_data\\text\\file_2.txt'}
Document 2 chunks: these were unbelievely expensive and he ' ll grow out of them in 20 minutes, but i couldn ' t resist! look at these.
----------------------------------------------------------------------------------------------------
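
Both files are shorter than the token window, so each one comes back as a single chunk. For longer inputs, token-based splitting can be pictured as a sliding window over the token sequence that advances by `chunk_size - chunk_overlap` tokens. The sketch below uses a hypothetical `toy_tokenize` function (a regex, not a real BPE or WordPiece tokenizer) purely to illustrate the windowing.

```python
import re


def toy_tokenize(text):
    """Hypothetical tokenizer: words and punctuation marks, not a real model vocabulary."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_on_tokens(text, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))  # "decoding" is a plain join here
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


text = "Ross was in love with Rachel since forever."
token_chunks = split_on_tokens(text, chunk_size=5, chunk_overlap=2)
for chunk in token_chunks:
    print(chunk)
```

Because chunks are rebuilt from tokens rather than sliced from the original string, the original whitespace is lost. The same effect is visible in the output above, where line breaks are gone and `what's` appears as `what ' s`.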
