### Length Based

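The simplest strategy: split the text into chunks of at most `chunk_size` characters, with up to `chunk_overlap` characters shared between consecutive chunks so that context is not lost at the chunk boundaries. You can experiment with both parameters interactively here: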

[https://chunkviz.up.railway.app](https://chunkviz.up.railway.app)
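To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking in plain Python. The helper name `split_by_length` is hypothetical and this is not how LangChain implements its splitters; it only illustrates a sliding window that advances by `chunk_size - chunk_overlap` characters.

```python
def split_by_length(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_by_length(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the overlap. A purely positional cut like this can break words in half, which is why practical splitters prefer to cut at separators.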

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [2]:
texts = """One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]

You can't understand the world without understanding the concept of superlinear returns. And if you're ambitious you definitely should, because this will be the wave you surf on.

It may seem as if there are a lot of different situations with superlinear returns, but as far as I can tell they reduce to two fundamental causes: exponential growth and thresholds.

The most obvious case of superlinear returns is when you're working on something that grows exponentially. For example, growing bacterial cultures. When they grow at all, they grow exponentially. But they're tricky to grow. Which means the difference in outcome between someone who's adept at it and someone who's not is very great.

Startups can also grow exponentially, and we see the same pattern there. Some manage to achieve high growth rates. Most don't. And as a result you get qualitatively different outcomes: the companies with high growth rates tend to become immensely valuable, while the ones with lower growth rates may not even survive.

Y Combinator encourages founders to focus on growth rate rather than absolute numbers. It prevents them from being discouraged early on, when the absolute numbers are still low. It also helps them decide what to focus on: you can use growth rate as a compass to tell you how to evolve the company. But the main advantage is that by focusing on growth rate you tend to get something that grows exponentially.

YC doesn't explicitly tell founders that with growth rate "you get out what you put in," but it's not far from the truth. And if growth rate were proportional to performance, then the reward for performance p over time t would be proportional to pt.

Even after decades of thinking about this, I find that sentence startling."""

In [3]:
spliter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
spliter.split_text(texts)

["One of the most important things I didn't understand about the world when I was a child is the",
 'was a child is the degree to which the returns for performance are superlinear.',
 'Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand',
 'I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your',
 "true. If your product is only half as good as your competitor's, you don't get half as many",
 'get half as many customers. You get no customers, and you go out of business.',
 "It's obviously true that the returns for performance are superlinear in business. Some think this",
 'Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true.',
 'stop being true. But superlinear returns for performance are a feature of the world, not an',
 "the world, not an artifact of rules we've invented. We see the same pattern in fame, power,",
 'in fame, power, military victories, k

### Text-Structure Splitter

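Splits the text along its natural structure by working down a hierarchy of separators: first paragraphs, then lines or sentences, then words, and finally individual characters. A finer separator is used only when a piece is still larger than `chunk_size`. For `RecursiveCharacterTextSplitter`, the default separators are `["\n\n", "\n", " ", ""]`.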
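The hierarchy can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is left out, and the separator list is assumed to be paragraph, line, word, character.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph is short.\n\nSecond paragraph is quite a bit longer than the limit."
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives as one chunk, while the second paragraph exceeds the limit and is split again at word boundaries.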

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

spliter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)

In [5]:
spliter.split_text(texts)

["One of the most important things I didn't understand about the world when I was a child is the",
 'was a child is the degree to which the returns for performance are superlinear.',
 'Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand',
 'I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your',
 "true. If your product is only half as good as your competitor's, you don't get half as many",
 'get half as many customers. You get no customers, and you go out of business.',
 "It's obviously true that the returns for performance are superlinear in business. Some think this",
 'Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true.',
 'stop being true. But superlinear returns for performance are a feature of the world, not an',
 "the world, not an artifact of rules we've invented. We see the same pattern in fame, power,",
 'in fame, power, military victories, k

### Document-Structure Based

Sometimes the text is not plain prose but a code snippet or a structured document. In such cases, split on the structure of the document itself; for Python source, for example, on separators such as `\nclass ` and `\ndef `.
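As a rough illustration of splitting on code structure, the sketch below cuts Python source before every line that begins a `class` or `def`. The helper `split_python_source` is hypothetical, and unlike a real splitter it does not merge small pieces back together up to a chunk size, so the pieces it returns can differ from what a library produces.

```python
import re


def split_python_source(source: str) -> list[str]:
    """Split before each line that starts a class or function definition."""
    # The lookahead keeps the keyword at the start of the chunk that follows the cut.
    pieces = re.split(r"\n(?=[ \t]*(?:class|def) )", source)
    return [piece.strip() for piece in pieces if piece.strip()]


code = '''
class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"Student(name={self.name}, age={self.age})"
'''

chunks = split_python_source(code)
for chunk in chunks:
    print(repr(chunk))
```

Cutting at definition boundaries keeps each method body intact instead of slicing it at an arbitrary character position.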

In [6]:
texts = """
class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"Student(name={self.name}, age={self.age})"
"""

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

spliter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=100,
    chunk_overlap=20,
    language=Language.PYTHON,
)

# Store the result under a new name so re-running the cell does not overwrite the input.
chunks = spliter.split_text(texts)
chunks

['class Student:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age',
 'def __str__(self):\n        return f"Student(name={self.name}, age={self.age})"']

### Semantic Meaning Based

Compares the embedding of each sentence with that of the next one and keeps consecutive sentences in the same chunk while they remain similar; a new chunk starts wherever the similarity drops sharply.
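The idea can be sketched without any model. In the example below, `embed` is a toy stand-in (plain word counts) for a real embedding model, and the rule "break where the distance exceeds the mean plus `threshold_amount` standard deviations" is an assumption about how a standard-deviation threshold works, not a copy of `SemanticChunker`. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(text: str, threshold_amount: float = 1.0) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    mean = sum(distances) / len(distances)
    std = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    threshold = mean + threshold_amount * std

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # unusually large jump: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "Cats chase mice. Cats chase birds. Cats chase mice and birds. "
    "Stocks fell sharply today. Stocks fell again today."
)
chunks = semantic_chunks(sample)
for chunk in chunks:
    print(chunk)
```

With this sample, the only distance above the threshold is the jump from the sentences about cats to the sentences about stocks, so the text is split into two chunks at that point.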

In [8]:
from langchain_experimental.text_splitter import SemanticChunker

texts = """
ne of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]

You can't understand the world without understanding the concept of superlinear returns. And if you're ambitious you definitely should, because this will be the wave you surf on.
"""

In [9]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "NovaSearch/jasper_en_vision_language_v1"
model_kwargs = {
    "device": "cpu",
    "trust_remote_code": True,
}
hf_emd_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
)

In [11]:
textsplitter = SemanticChunker(
    hf_emd_model,
    breakpoint_threshold_type='standard_deviation',
    breakpoint_threshold_amount=1
)

In [12]:
textsplitter.split_text(texts)

['\nne of the most important things I didn\'t understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers.',
 "You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]\n\nYou can't understand the world without understanding the concept of superlinear ret