# **Text Splitters** in LangChain

In [None]:
# Install the necessary packages
!pip install langchain -qU
!pip install langchain-community -qU
!pip install unstructured -qU

In [5]:
from langchain_community.document_loaders import DirectoryLoader

# Initialize the DirectoryLoader with the path to the directory for text files
loader = DirectoryLoader("D:\\Next\\gai\\tutorials\\generative_ai_tutorial\\04.text_splitters\\data", glob="**/*.txt")

# Load the text data from the directory
dataset = loader.load()

for data in dataset:
    print("------------------------")
    print(data.page_content)
    print(data.metadata)

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


------------------------
LangChain is a robust framework designed to facilitate the creation of applications leveraging large language models (LLMs). By providing a suite of tools and utilities, LangChain simplifies the process of integrating LLMs into various applications, enabling developers to build sophisticated AI-powered solutions with greater ease. Its capabilities span from natural language understanding and generation to complex data processing tasks, making it a versatile choice for developing chatbots, virtual assistants, and other AI-driven applications.

One of the core features of LangChain is its modular architecture, which allows developers to customize and extend the framework according to their specific needs. This modularity ensures that different components, such as model training, data preprocessing, and inference, can be independently developed and optimized. Additionally, LangChain supports seamless integration with popular machine learning libraries and framewor
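The `glob="**/*.txt"` pattern passed to the loader selects `.txt` files in the directory and in every subdirectory. The following is a minimal sketch of that pattern using only the standard library's `pathlib`; it does not call `DirectoryLoader`, whose matching may differ in details. The helper `find_text_files` and the file names are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def find_text_files(root):
    """Return the relative paths of all .txt files under root, recursively (hypothetical helper)."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))


# Build a throwaway directory tree so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top level")
    (root / "nested" / "b.txt").write_text("nested")
    (root / "notes.md").write_text("not a text file")
    matched = find_text_files(root)

print(matched)  # ['a.txt', 'nested/b.txt']
```

The Markdown file is skipped, while the text file inside the subdirectory is picked up because `**` descends recursively.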

In [6]:
# Count the characters in each document. Note that len() returns characters, not model tokens.
char_counts = [len(data.page_content) for data in dataset]

print(char_counts)

[1784, 912]


### Calculate Token Count Using tiktoken

In [10]:
pip install transformers



In [7]:
# Install the tiktoken package for tokenization
!pip install tiktoken -qU

In [11]:
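from transformers import AutoTokenizer

# google/gemma-7b is a gated repository: this call fails unless you have been granted
# access on Hugging Face and are authenticated (for example via `huggingface-cli login`)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")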

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/google/gemma-7b.
401 Client Error. (Request ID: Root=1-68608510-5357e3766df1a4e8266ee169;2d3f1802-ac4c-4152-a8d6-5f7312cd0b07)

Cannot access gated repo for url https://huggingface.co/google/gemma-7b/resolve/main/config.json.
Access to model google/gemma-7b is restricted. You must have access to it and be authenticated to access it. Please log in.

In [8]:
import tiktoken

# encoding_for_model only recognizes OpenAI model names, so a Gemini model name raises KeyError
tokenizer_model = tiktoken.encoding_for_model('gemini-2.5-flash')
print(tokenizer_model)

KeyError: 'Could not automatically map gemini-2.5-flash to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

In [9]:
# Load the cl100k_base encoding explicitly. It is an OpenAI encoding, so the counts
# only approximate what other model families (such as Gemini) would report.
tokenizer = tiktoken.get_encoding('cl100k_base')

# Return the length of a text in tokens, as measured by tiktoken
def tiktoken_len(text):
    return len(tokenizer.encode(text))

In [12]:
# Calculate the token count of each document in the dataset using tiktoken
token_counts = [tiktoken_len(data.page_content) for data in dataset]

print(token_counts)

[286, 168]
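The two sets of counts printed in this notebook differ by a large factor, which is why character length is a poor stand-in for token length. As a rough sketch, the numbers below are copied from the outputs above (`[1784, 912]` characters and `[286, 168]` tokens) and turned into a characters-per-token ratio. The ratio applies only to these two documents and the `cl100k_base` encoding; it is not a general constant.

```python
# Values copied from the notebook outputs above
char_counts = [1784, 912]
token_counts = [286, 168]

# Average number of characters per token for each document
ratios = [round(chars / tokens, 2) for chars, tokens in zip(char_counts, token_counts)]

print(ratios)  # [6.24, 5.43]
```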


### Chunking the Text Using **RecursiveCharacterTextSplitter**

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with a chunk size of 150 characters and no overlap; len measures length in characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    length_function=len,
    separators=['\n\n', '\n', ' ', '']
)

In [14]:
# Split the text of the second document in the dataset into chunks
chunks = text_splitter.split_text(dataset[1].page_content)

len(chunks)

7

In [15]:
print(chunks[0])
print(len(chunks[0]))
print(chunks[1])
print(len(chunks[1]))

CodePRO LK is an educational platform offering a wide range of online courses primarily focused on programming and machine learning. All courses are
148
provided for free and are delivered in Sinhala, catering specifically to the Sri Lankan audience. The platform includes various learning materials
146
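The `separators` list is tried in order: the text is first split on the coarsest separator, and any piece that is still too long is split again with the next one. The following is a simplified sketch of that idea in plain Python, not LangChain's actual implementation; the function `recursive_split` is hypothetical, ignores `chunk_overlap`, and may merge pieces differently from the library.

```python
def recursive_split(text, chunk_size, separators, length_function=len):
    """Simplified recursive splitting: try separators in order, recurse on oversized pieces."""
    if length_function(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on, so return the oversized text as-is

    sep, remaining = separators[0], separators[1:]
    # An empty separator means "split into single characters"
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if length_function(candidate) <= chunk_size:
            current = candidate  # the piece still fits, so keep growing the chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to the finer separators
            chunks.extend(recursive_split(piece, chunk_size, remaining, length_function))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
demo_chunks = recursive_split(sample, 8, ["\n\n", "\n", " ", ""])
print(demo_chunks)  # ['aaa bbb', 'ccc', 'ddd eee', 'fff ggg']

# A single long word falls through to the empty separator
long_word_chunks = recursive_split("abcdefghij", 4, [" ", ""])
print(long_word_chunks)  # ['abcd', 'efgh', 'ij']
```

Because splitting prefers paragraph and line breaks before spaces, chunks tend to end at natural boundaries, which matches the word-aligned chunks printed above.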


In [16]:
# Reinitialize the text splitter with a chunk size of 150 characters and an overlap of up to 20 characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=len,
    separators=['\n\n', '\n', ' ', '']
)

In [17]:
# Split the text of the second document in the dataset into chunks
chunks = text_splitter.split_text(dataset[1].page_content)

len(chunks)

7

In [18]:
print(chunks[0])
print(len(chunks[0]))
print(chunks[1])
print(len(chunks[1]))

CodePRO LK is an educational platform offering a wide range of online courses primarily focused on programming and machine learning. All courses are
148
All courses are provided for free and are delivered in Sinhala, catering specifically to the Sri Lankan audience. The platform includes various
143
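With `chunk_overlap=20`, the second chunk now begins with text repeated from the end of the first. The sketch below measures that shared text using the two chunks printed above; the helper `shared_overlap` is hypothetical and is not part of LangChain. The overlap it finds is 15 characters rather than 20, which is consistent with `chunk_overlap` acting as an upper bound while the splitter keeps whole words.

```python
def shared_overlap(previous, following):
    """Return the longest suffix of `previous` that is also a prefix of `following`."""
    limit = min(len(previous), len(following))
    for size in range(limit, 0, -1):
        if previous.endswith(following[:size]):
            return following[:size]
    return ""


# The two chunks printed in the cell output above
chunk_0 = ("CodePRO LK is an educational platform offering a wide range of online "
           "courses primarily focused on programming and machine learning. All courses are")
chunk_1 = ("All courses are provided for free and are delivered in Sinhala, catering "
           "specifically to the Sri Lankan audience. The platform includes various")

overlap = shared_overlap(chunk_0, chunk_1)
print(repr(overlap), len(overlap))  # 'All courses are' 15
```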


### Chunking the Text Using tiktoken Length Function

In [19]:
# Reinitialize the text splitter so that chunk_size=150 and chunk_overlap=20 are measured in tokens via tiktoken_len
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

In [20]:
# Split the text of the second document in the dataset into chunks
chunks = text_splitter.split_text(dataset[1].page_content)

len(chunks)

2

In [21]:
print(chunks[0])
print(tiktoken_len(chunks[0]))
print(chunks[1])
print(tiktoken_len(chunks[1]))

CodePRO LK is an educational platform offering a wide range of online courses primarily focused on programming and machine learning. All courses are provided for free and are delivered in Sinhala, catering specifically to the Sri Lankan audience. The platform includes various learning materials such as videos, quizzes, and assignments to enhance the learning experience.

Some of the featured courses on CodePRO LK include "Python GUI – Tkinter," "Machine Learning Project – Sentiment Analysis," and "Data Structures and Algorithms." These courses range from beginner to intermediate levels, making them accessible to a broad spectrum of learners.
121
In addition to the courses, CodePRO LK also offers projects and resources to support learners in applying their knowledge practically. The platform emphasizes providing high-quality education to empower the next generation of Sri Lankans in the tech field.
47
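The chunk count drops from 7 to 2 because `chunk_size=150` now counts tokens instead of characters. As a back-of-the-envelope check, the sketch below computes the smallest possible number of chunks for the second document from the lengths reported earlier (912 characters, 168 tokens). The helper `min_chunks` is hypothetical; it ignores overlap and separator positions, so it gives a lower bound rather than a prediction.

```python
import math


def min_chunks(total_length, chunk_size):
    """Lower bound on the number of chunks when no chunk may exceed chunk_size."""
    return math.ceil(total_length / chunk_size)


char_based = min_chunks(912, 150)   # length measured in characters
token_based = min_chunks(168, 150)  # length measured in cl100k_base tokens

print(char_based, token_based)  # 7 2
```

For this document the lower bounds happen to equal the chunk counts the splitter actually produced in both runs.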
