# **Length Based Text Splitting**

**Length Based Text Splitting** is a technique in LangChain (and other NLP frameworks) where large texts are divided into smaller chunks based on a fixed number of characters or tokens. This approach is commonly used to ensure that each chunk fits within the context window of language models, making it easier to process, embed, or retrieve relevant information.

**Advantages:**

- Simple and fast to implement, requiring minimal configuration.
- Ensures all chunks are of manageable and predictable size, which is ideal for models with strict input limits.
- Works well for generic, unstructured text where semantic boundaries are less important.

**Disadvantages:**

- May split sentences or paragraphs in unnatural places, potentially breaking context and meaning.
- Can lead to loss of coherence if important information is divided between chunks.
- Does not consider the semantic structure of the text, which may reduce the quality of downstream tasks like summarization or question answering.

**Limitation:**

Length based splitting is not suitable for documents where maintaining semantic or logical boundaries is critical, such as legal contracts or structured reports, as it may disrupt the flow and integrity of the information.
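
To make the idea concrete, here is a minimal pure-Python sketch of fixed-length splitting. It is not LangChain's implementation; the function name `split_by_length` and the sample sentence are assumptions made for illustration only.

```python
def split_by_length(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample sentence (60 characters long).
sample = "Length based splitting ignores word and sentence boundaries."
chunks = split_by_length(sample, 25)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends in the middle of the word "ignores", which is the kind of unnatural break listed under the disadvantages.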

In [None]:
# Import the CharacterTextSplitter for splitting text into chunks based on character count.

from langchain_text_splitters import CharacterTextSplitter

In [None]:
# Define a long multi-paragraph text about data science to demonstrate text splitting.

text = """
Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large data sets and applying the knowledge from that data to solve problems in other application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, and summarizing these findings. As such, it incorporates skills from computer science, mathematics, data visualization, graphic design, communication, and business.[11]

Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g., from images, text, sensors, transactions, customer information, etc.) and emphasizes prediction and action.[12] Andrew Gelman of Columbia University has described statistics as a non-essential part of data science.[13] Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. He describes data science as an applied field growing out of traditional statistics.[14]

Etymology
Early usage
In 1962, John Tukey described a field he called "data analysis", which resembles modern data science.[14] In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics.[15] Later, attendees at a 1992 statistics symposium at the University of Montpellier  II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing.[16][17]

The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name to computer science.[6] In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic.[6] However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data.[18] In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.[17]

Modern usage
In 2012, technologists Thomas H. Davenport and DJ Patil declared "Data Scientist: The Sexiest Job of the 21st Century",[19] a catchphrase that was picked up even by major-city newspapers like the New York Times[20] and the Boston Globe.[21] A decade later, they reaffirmed it, stating that "the job is more in demand than ever with employers".[22]

The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland.[23] In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science.[24]

The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in 2008.[25] Though it was used by the National Science Board in their 2005 report "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", it referred broadly to any key role in managing a digital data collection.[26]

"""

In [None]:
# Create a CharacterTextSplitter that produces chunks of at most 100 characters with no overlap.
# separator='' means there is no preferred break point, so cuts can land in the middle of a word.
splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    separator='',
)

In [None]:
# Split the sample text into chunks using the splitter and print the resulting list of chunks.

result = splitter.split_text(text)

print(result)

['Data science is an interdisciplinary field[10] focused on extracting knowledge from typically large', 'data sets and applying the knowledge from that data to solve problems in other application domains.', 'The field encompasses preparing data for analysis, formulating data science problems, analyzing dat', 'a, and summarizing these findings. As such, it incorporates skills from computer science, mathematic', 's, data visualization, graphic design, communication, and business.[11]\n\nVasant Dhar writes that sta', 'tistics emphasizes quantitative data and description. In contrast, data science deals with quantitat', 'ive and qualitative data (e.g., from images, text, sensors, transactions, customer information, etc.', ') and emphasizes prediction and action.[12] Andrew Gelman of Columbia University has described stati', 'stics as a non-essential part of data science.[13] Stanford professor David Donoho writes that data', 'science is not distinguished from statistics by the size of data
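
The run above uses `chunk_overlap = 0`, so no text is shared between neighbouring chunks. As a rough illustration of what a non-zero overlap means, the sketch below repeats the tail of each chunk at the start of the next one. The helper `split_with_overlap` is hypothetical, and LangChain's own merging logic may produce slightly different boundaries.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Overlap trades some redundancy for context: a sentence cut at a chunk boundary is at least partly visible in both neighbours.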

# **Using CharacterTextSplitter with PyPDFLoader**

In [None]:
# Import CharacterTextSplitter and PyPDFLoader for splitting PDF documents into text chunks.

from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

In [None]:
# Load a PDF file ('dl-curriculum.pdf') and convert its pages into LangChain document objects.

loader = PyPDFLoader('dl-curriculum.pdf')

docs = loader.load()

In [None]:
# Create a CharacterTextSplitter to split PDF document content into chunks of 200 characters with no overlap.

splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    separator='',
)

In [None]:
# Split the loaded PDF documents into smaller chunks and print the content of the second chunk.

result = splitter.split_documents(docs)

print(result[1].page_content)

ificialneurons
2.HistoryofNeuralNetworks
● Earlymodels(Perceptron)● BackpropagationandMLPs● The"AIWinter"andresurgenceofneuralnetworks● Emergenceofdeeplearning
3.PerceptronandMultilayerPerceptrons(MLP
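
When a list of documents is split, each document is chunked on its own and every chunk keeps the metadata of the document it came from. The sketch below illustrates that idea in plain Python. `Page` and `split_pages` are hypothetical stand-ins rather than LangChain classes, and the page texts and file name are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded document page."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_pages(pages: list[Page], chunk_size: int) -> list[Page]:
    """Split every page independently and copy its metadata onto each chunk."""
    chunks = []
    for page in pages:
        for start in range(0, len(page.page_content), chunk_size):
            piece = page.page_content[start:start + chunk_size]
            chunks.append(Page(piece, dict(page.metadata)))
    return chunks


# Made-up pages; a real loader would fill these from the PDF.
pages = [
    Page("Introduction to neural networks.", {"source": "example.pdf", "page": 0}),
    Page("History of deep learning.", {"source": "example.pdf", "page": 1}),
]
chunks = split_pages(pages, chunk_size=20)
for chunk in chunks:
    print(chunk.metadata["page"], repr(chunk.page_content))
```

Because pages are handled one at a time, a chunk never mixes text from two pages, and the page number stays available for later citation or filtering.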


# **Text Structure Based Splitting**

In [None]:
# Import RecursiveCharacterTextSplitter for splitting text based on structure (e.g., paragraphs, sentences).
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Define a multi-paragraph text about the mindset of a data scientist for demonstrating structure-based splitting.

text = """
Becoming a data scientist demands a mindset rooted in curiosity and lifelong learning. You must be endlessly inquisitive about how things work, why patterns emerge in data, and how to turn numbers into meaningful insights. It’s important to love asking questions, digging deeper into problems, and being willing to learn new tools, algorithms, and concepts as the field rapidly evolves. A good data scientist embraces uncertainty and sees each dataset as an opportunity to discover something valuable, rather than as a mere technical task to complete.

Equally crucial is resilience and a growth mindset. Much of data science involves trial and error: models fail, hypotheses don’t hold, and data can be messy or incomplete. Instead of getting frustrated, you need to see setbacks as lessons and persist until you find a solution. Communication skills and empathy are also vital — you must translate technical findings into clear, actionable insights for non-technical audiences and collaborate effectively with diverse teams. Ultimately, the mindset of a successful data scientist blends analytical rigor with creativity, adaptability, and the courage to tackle complex, ambiguous problems.

"""

In [None]:
# Create a RecursiveCharacterTextSplitter to split the text into chunks of at most 300 characters, preserving structure where possible.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    # No separators argument: the defaults ["\n\n", "\n", " ", ""] are tried in that order.
)

In [None]:
# Split the text into structured chunks and print the number of chunks and their content.
chunks = splitter.split_text(text)

print(len(chunks))
print(chunks)


5
['Becoming a data scientist demands a mindset rooted in curiosity and lifelong learning. You must be endlessly inquisitive about how things work, why patterns emerge in data, and how to turn numbers into meaningful insights. It’s important to love asking questions, digging deeper into problems, and', 'being willing to learn new tools, algorithms, and concepts as the field rapidly evolves. A good data scientist embraces uncertainty and sees each dataset as an opportunity to discover something valuable, rather than as a mere technical task to complete.', 'Equally crucial is resilience and a growth mindset. Much of data science involves trial and error: models fail, hypotheses don’t hold, and data can be messy or incomplete. Instead of getting frustrated, you need to see setbacks as lessons and persist until you find a solution. Communication skills', 'and empathy are also vital — you must translate technical findings into clear, actionable insights for non-technical audiences and colla
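
The output above shows chunks ending at word boundaries instead of mid-word. A simplified sketch of the recursive strategy behind that behaviour follows: try the coarsest separator first, pack pieces greedily up to the size limit, and fall back to finer separators only for pieces that are still too long. The function `recursive_split` is an illustrative assumption, not the library's code, and it leaves out overlap and regex separators.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text with the coarsest separator that keeps chunks within chunk_size."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    pieces = [p.strip() for p in text.split(separator)]
    pieces = [p for p in pieces if p]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph has two sentences. It is short.\n\n"
    "Second paragraph is a little longer than the first one and must be divided again."
)
chunks = recursive_split(sample, chunk_size=60)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives intact, while the longer second paragraph is divided between words, never inside one.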

# **Python Code Splitting**

In [None]:
# Import RecursiveCharacterTextSplitter and Language for code-aware splitting.

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

In [None]:
# Define a sample Python code as a string for code splitting demonstration.
text = """
class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        self.grade = grade  # Grade is a float (like 8.5 or 9.2)

    def get_details(self):
        return self.name

    def is_passing(self):
        return self.grade >= 6.0


# Example usage
student1 = Student("Aarav", 20, 8.2)
print(student1.get_details())

if student1.is_passing():
    print("The student is passing.")
else:
    print("The student is not passing.")

"""

In [None]:
# Initialize a RecursiveCharacterTextSplitter for Python code, specifying language and chunk size.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300,
    chunk_overlap=0,
)

In [None]:
# Split the Python code into logical code chunks and print the number of chunks and the first chunk.
chunks = splitter.split_text(text)

print(len(chunks))

print(chunks[0])

2
class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        self.grade = grade  # Grade is a float (like 8.5 or 9.2)

    def get_details(self):
        return self.name

    def is_passing(self):
        return self.grade >= 6.0


In [None]:
# Print the second chunk of the split Python code.
print(chunks[1])

# Example usage
student1 = Student("Aarav", 20, 8.2)
print(student1.get_details())

if student1.is_passing():
    print("The student is passing.")
else:
    print("The student is not passing.")
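
The two chunks above break between the class definition and the usage code, not at an arbitrary character count. The sketch below imitates that preference with a regular expression that cuts before each top-level `class` or `def` line. It assumes that a code-aware splitter mainly differs from the generic one in which boundaries it tries first. `split_top_level` and the sample source are hypothetical.

```python
import re


def split_top_level(source: str) -> list[str]:
    """Split Python source before each top-level 'class' or 'def' line."""
    # The lookahead only matches at column 0, so indented methods stay with their class.
    parts = re.split(r"(?m)^(?=(?:class|def) )", source)
    return [part.strip("\n") for part in parts if part.strip()]


# Made-up source used only for this demonstration.
source = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

parts = split_top_level(source)
for part in parts:
    print(part)
    print("-----")
```

Unlike the splitter used above, this sketch applies no size limit; it only shows where structure-aware boundaries fall.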


# **Markdown Splitting**

In [None]:
# Import RecursiveCharacterTextSplitter and Language for Markdown-aware splitting.
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

In [None]:
# Define a sample Markdown document as a string for splitting demonstration.
text="""
# Project Name: Smart Student Tracker

A simple Python-based project to manage and track student data, including their grades, age, and academic status.


## Features

- Add new students with relevant info
- View student details
- Check if a student is passing
- Easily extendable class-based design


## 🛠 Tech Stack

- Python 3.10+
- No external dependencies


## Getting Started

1. Clone the repo  
   ```bash
   git clone https://github.com/your-username/student-tracker.git


"""

In [None]:
# Initialize a RecursiveCharacterTextSplitter for Markdown, specifying language and chunk size.

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=200,
    chunk_overlap=0,
)

In [None]:
# Split the Markdown text into logical chunks.
chunks = splitter.split_text(text)



In [None]:
# Print the number of Markdown chunks and the first chunk.

print(len(chunks))

print(chunks[0])

3
# Project Name: Smart Student Tracker

A simple Python-based project to manage and track student data, including their grades, age, and academic status.


In [None]:
# Print the second chunk of the split Markdown document.
print(chunks[1])

## Features

- Add new students with relevant info
- View student details
- Check if a student is passing
- Easily extendable class-based design
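
Both printed chunks start at a heading, which suggests that headings act as the preferred break points for Markdown. The sketch below pairs each heading with the text under it, using only the standard library. `split_by_headings` and the sample document are assumptions for illustration, and the pattern would also match `#` lines inside fenced code, which a real splitter has to handle.

```python
import re

# A heading is one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6}) (.+)$", re.MULTILINE)


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading title, body) pairs; text before the first heading is ignored."""
    matches = list(HEADING.finditer(markdown))
    sections = []
    for index, match in enumerate(matches):
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(markdown)
        body = markdown[match.end():end].strip()
        sections.append((match.group(2).strip(), body))
    return sections


# Made-up document used only for this demonstration.
doc = "# Title\n\nIntro text.\n\n## Features\n\n- One\n- Two\n\n## Setup\n\nRun the installer.\n"
sections = split_by_headings(doc)
for title, body in sections:
    print(title, "->", repr(body))
```

Keeping the heading next to its body gives each chunk a natural label, which can help when chunks are later retrieved out of context.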


# **Semantic Meaning Based Splitting**

In [None]:
# Import SemanticChunker for semantic-based splitting and HuggingFaceEmbeddings for embeddings.
# Also import dotenv to load environment variables.
# Uncomment the pip install line if you haven't installed langchain-experimental.
# %pip install langchain-experimental

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

from dotenv import load_dotenv
load_dotenv()

True

In [None]:
# Initialize HuggingFaceEmbeddings with a specific model for semantic chunking.

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    # The model is downloaded and run locally, so no API token is passed here.
)

In [None]:
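# Create a SemanticChunker using the embeddings. With 'standard_deviation', a split is placed where the
# distance between neighbouring sentences exceeds the mean distance plus 3 standard deviations.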

text_splitter = SemanticChunker(
    embeddings, 
    breakpoint_threshold_type='standard_deviation',
    breakpoint_threshold_amount=3
)

In [None]:
# Define a sample text with multiple topics for semantic chunking demonstration.

sample = """
Farmers were working hard in the fields, preparing the soil and planting seeds for the next season. The sun was bright, and the air smelled of earth and fresh grass. The Indian Premier League (IPL) is the biggest cricket league in the world. People all over the world watch the matches and cheer for their favourite teams.


Terrorism is a big danger to peace and safety. It causes harm to people and creates fear in cities and villages. When such attacks happen, they leave behind pain and sadness. To fight terrorism, we need strong laws, alert security forces, and support from people who care about peace and safety.
"""

In [None]:
# Use the SemanticChunker to split the sample text into semantically meaningful chunks and print the results.

docs = text_splitter.create_documents([sample])

print(len(docs))

print(docs)
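
To show how a standard-deviation threshold turns embedding distances into split points, the sketch below uses hand-made 2-D vectors in place of real sentence embeddings. It is a simplified reading of the approach, not the `SemanticChunker` source: the vectors, `cosine_distance`, and `find_breakpoints` are all assumptions, and the real splitter does more work, for example in how it groups sentences before embedding them.

```python
import math
import statistics


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_breakpoints(vectors, threshold_amount):
    """Return indices i such that a new chunk should start after sentence i."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    if not distances:
        return []
    threshold = statistics.mean(distances) + threshold_amount * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy "embeddings" (made up): three sentences on one topic, then three on another.
vectors = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1), (0.0, 1.0), (0.1, 0.9), (0.1, 1.0)]

breaks = find_breakpoints(vectors, threshold_amount=1)
print(breaks)  # [2]: the topic changes after the third sentence
print(find_breakpoints(vectors, threshold_amount=3))  # []: no distance is that extreme
```

In this toy case a multiplier of 3 finds no breakpoint at all, because with only a handful of distances none can lie three standard deviations above the mean. If a short text comes back as a single chunk, a smaller `breakpoint_threshold_amount` may be worth trying.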