# **Text Splitters** in LangChain

In [1]:
# Install the necessary packages
!pip install langchain -qU
!pip install langchain-community -qU
!pip install unstructured -qU


In [2]:
from langchain_community.document_loaders import DirectoryLoader

# Initialize the DirectoryLoader with the path to the directory for text files
loader = DirectoryLoader("/content/data", glob="**/*.txt")

# Load the text data from the directory
dataset = loader.load()

for data in dataset:
    print("------------------------")
    print(data.page_content)
    print(data.metadata)

------------------------
CodePRO LK is an educational platform offering a wide range of online courses primarily focused on programming and machine learning. All courses are provided for free and are delivered in Sinhala, catering specifically to the Sri Lankan audience. The platform includes various learning materials such as videos, quizzes, and assignments to enhance the learning experience.

Some of the featured courses on CodePRO LK include "Python GUI – Tkinter," "Machine Learning Project – Sentiment Analysis," and "Data Structures and Algorithms." These courses range from beginner to intermediate levels, making them accessible to a broad spectrum of learners.

In addition to the courses, CodePRO LK also offers projects and resources to support learners in applying their knowledge practically. The platform emphasizes providing high-quality education to empower the next generation of Sri Lankans in the tech field.
{'source': '/content/data/text2.txt'}
------------------------

In [8]:
# Count the characters in each document (a rough baseline, not a real token count)
char_counts = [len(data.page_content) for data in dataset]

print(char_counts)  # lengths in characters

[912, 1784]


### Calculate Token Count Using tiktoken

In [9]:
# Install the tiktoken package for tokenization
!pip install tiktoken -qU

In [10]:
import tiktoken

# Look up which encoding gpt-3.5-turbo uses
tokenizer_model = tiktoken.encoding_for_model('gpt-3.5-turbo')
print(tokenizer_model)

<Encoding 'cl100k_base'>


In [11]:
# Get the encoding for the tokenizer
tokenizer = tiktoken.get_encoding('cl100k_base')

# Create a function to calculate the length of text in tokens using tiktoken
def tiktoken_len(text):
    tokens = tokenizer.encode(text)
    token_length = len(tokens)
    return token_length

In [12]:
# Calculate the token count of each document in the dataset using tiktoken
token_counts = [tiktoken_len(data.page_content) for data in dataset]

print(token_counts)

[168, 286]
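
The character counts and token counts printed above can be compared directly. The short calculation below reuses only those printed numbers to estimate how many characters one token covers on average in these two documents. The variable names are illustrative and are not part of the notebook.

```python
# Counts copied from the notebook outputs above
char_counts = [912, 1784]   # len(page_content) per document
token_counts = [168, 286]   # tiktoken_len(page_content) per document

# Average number of characters covered by one cl100k_base token
chars_per_token = [c / t for c, t in zip(char_counts, token_counts)]

for chars, tokens, ratio in zip(char_counts, token_counts, chars_per_token):
    print(f"{chars} chars / {tokens} tokens = {ratio:.2f} chars per token")
```

At roughly five to six characters per token for this English text, the same numeric `chunk_size` produces very different chunk lengths depending on whether length is measured in characters or in tokens.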


### Chunking the Text Using **RecursiveCharacterTextSplitter**

In [18]:
# Install the text splitters package
!pip install langchain-text-splitters -qU



In [19]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


# Initialize the text splitter with a chunk size of 150 characters and no overlap (length_function=len counts characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    length_function=len,
    separators=['\n\n', '\n', ' ', '']
)

In [20]:
# Split the text of the second document in the dataset into chunks
chunks = text_splitter.split_text(dataset[1].page_content)

len(chunks)

14

In [21]:
print(chunks[0])
print(len(chunks[0]))
print(chunks[1])
print(len(chunks[1]))

LangChain is a robust framework designed to facilitate the creation of applications leveraging large language models (LLMs). By providing a suite of
148
tools and utilities, LangChain simplifies the process of integrating LLMs into various applications, enabling developers to build sophisticated
143
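
To make the role of the `separators` list more concrete, here is a minimal pure-Python sketch of the general idea behind recursive splitting: split on the coarsest separator first, and fall back to finer separators only for pieces that are still too long. It is a simplified illustration under stated assumptions, not LangChain's implementation. The function name `recursive_split` is hypothetical, and the sketch ignores overlap and whitespace stripping.

```python
def recursive_split(text, chunk_size, separators, length_function=len):
    """Hypothetical helper: split on the coarsest separator first and
    recurse with finer separators only for pieces that are still too long."""
    if length_function(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing finer to split on; return the oversized text unchanged
        return [text]

    separator, finer = separators[0], separators[1:]
    # An empty separator means "split into single characters"
    pieces = list(text) if separator == "" else text.split(separator)
    pieces = [piece for piece in pieces if piece]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if length_function(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer, length_function))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee"
result = recursive_split(sample, chunk_size=7, separators=["\n\n", "\n", " ", ""])
print(result)  # ['aaa bbb', 'ccc', 'ddd eee']
```

In this toy input the first paragraph is too long for one chunk, so it is broken further at spaces, while the second paragraph fits and is kept whole. That mirrors the chunks printed above, which end at word boundaries and stay under 150 characters.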


In [22]:
# Reinitialize the text splitter with a chunk size of 150 characters and an overlap of up to 20 characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=len,
    separators=['\n\n', '\n', ' ', '']
)

In [23]:
# Split the text of the second document in the dataset into chunks
chunks = text_splitter.split_text(dataset[1].page_content)

len(chunks)

15

In [24]:
print(chunks[0])
print(len(chunks[0]))
print(chunks[1])
print(len(chunks[1]))

LangChain is a robust framework designed to facilitate the creation of applications leveraging large language models (LLMs). By providing a suite of
148
a suite of tools and utilities, LangChain simplifies the process of integrating LLMs into various applications, enabling developers to build
140
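
The effect of `chunk_overlap` can be checked by measuring the text that two neighbouring chunks share. The sketch below uses a hypothetical helper, `shared_overlap`, which is not part of LangChain, and applies it to the end of the first chunk and the start of the second, shortened from the output above.

```python
def shared_overlap(previous_chunk, next_chunk):
    """Hypothetical helper: longest suffix of previous_chunk that is
    also a prefix of next_chunk."""
    max_size = min(len(previous_chunk), len(next_chunk))
    for size in range(max_size, 0, -1):
        if previous_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


# End of chunk 0 and start of chunk 1, shortened from the output above
end_of_first = "By providing a suite of"
start_of_second = "a suite of tools and utilities,"

overlap = shared_overlap(end_of_first, start_of_second)
print(repr(overlap), len(overlap))  # 'a suite of' 10
```

The shared text here is 10 characters long even though `chunk_overlap` is 20. This suggests the setting acts as an upper bound: the overlap appears to be built from whole space-separated pieces and is not padded to exactly 20 characters.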


### Chunking the Text Using tiktoken Length Function

In [25]:
# Reinitialize the text splitter; with length_function=tiktoken_len, chunk_size=150 and chunk_overlap=20 are now measured in tokens
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

In [26]:
# Split the text of the second document in the dataset into chunks
chunks = text_splitter.split_text(dataset[1].page_content)

len(chunks)

3
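
A quick sanity check on the chunk counts: dividing the document length by `chunk_size` gives a rough estimate of the fewest chunks that could hold it. The sketch below uses only the lengths of `dataset[1]` printed earlier (1784 characters, 286 tokens). It is an approximation, since the splitter drops separators and tokenizes each chunk on its own.

```python
import math

chunk_size = 150

# Lengths of dataset[1] copied from the outputs above
doc_chars = 1784
doc_tokens = 286

# Fewest chunks that could hold the document if every chunk were full
min_chunks_by_chars = math.ceil(doc_chars / chunk_size)
min_chunks_by_tokens = math.ceil(doc_tokens / chunk_size)

print(min_chunks_by_chars, min_chunks_by_tokens)  # 12 2
```

The observed counts (14 or 15 chunks with `len`, 3 with `tiktoken_len`) are somewhat higher than these estimates, presumably because chunks end at separator boundaries and are rarely filled to the limit.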

In [27]:
print(chunks[0])
print(tiktoken_len(chunks[0]))
print(chunks[1])
print(tiktoken_len(chunks[1]))

LangChain is a robust framework designed to facilitate the creation of applications leveraging large language models (LLMs). By providing a suite of tools and utilities, LangChain simplifies the process of integrating LLMs into various applications, enabling developers to build sophisticated AI-powered solutions with greater ease. Its capabilities span from natural language understanding and generation to complex data processing tasks, making it a versatile choice for developing chatbots, virtual assistants, and other AI-driven applications.
91
One of the core features of LangChain is its modular architecture, which allows developers to customize and extend the framework according to their specific needs. This modularity ensures that different components, such as model training, data preprocessing, and inference, can be independently developed and optimized. Additionally, LangChain supports seamless integration with popular machine learning libraries and frameworks, providing a compreh