## Text Splitter
[Visualizer](https://chunkviz.up.railway.app/)


### 1. Length based

In [1]:
from langchain.text_splitter import CharacterTextSplitter

text = "I am susamay kumbhakar a data scientist with 4 years of experience in python"
# separator='' splits on individual characters; an overlap of about 15% of chunk_size is a reasonable starting point
splitter = CharacterTextSplitter(chunk_size=25, chunk_overlap=3, separator='')
chunks = splitter.split_text(text)
print(chunks)

# For Document objects, use splitter.split_documents(docs) instead

['I am susamay kumbhakar a', 'a data scientist with 4', '4 years of experience in', 'in python']
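To make the `chunk_size` / `chunk_overlap` arithmetic concrete, here is a minimal sliding-window sketch in plain Python. It is not LangChain's implementation; `split_by_length` is a hypothetical helper that assumes character-level splitting and strips whitespace from each chunk.

```python
def split_by_length(text, chunk_size, chunk_overlap):
    """Fixed-size character windows where each window re-reads the last
    `chunk_overlap` characters of the previous one (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "I am susamay kumbhakar a data scientist with 4 years of experience in python"
chunks = split_by_length(text, chunk_size=25, chunk_overlap=3)
print(chunks)
```

With `chunk_size=25` and `chunk_overlap=3` the window advances 22 characters at a time, which is why each chunk begins with the tail of the previous one.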


In [2]:
# TRY: load a PDF and split it


### 2. Text structure based - RecursiveCharacterTextSplitter
Tries separators in order, from coarse to fine: paragraph (`\n\n`), line (`\n`), word (`' '`), character (`''`).

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Each chunk is at most 25 characters; smaller pieces are merged to get as close to 25 as possible
splitter = RecursiveCharacterTextSplitter(chunk_size=25, chunk_overlap=3)

In [4]:
text="""
Space exploration has led to incredible scientific discoveries. From landing on the Moon to exploring Mars, humanity continues to push the boundaries of what’s possible beyond our planet.

These missions have not only expanded our knowledge of the universe but have also contributed to advancements in technology here on Earth. Satellite communications, GPS, and even certain medical imaging techniques trace their roots back to innovations driven by space programs.
"""
chunks = splitter.split_text(text)
print(chunks)

['Space exploration has', 'led to incredible', 'scientific discoveries.', 'From landing on the Moon', 'to exploring Mars,', 'humanity continues to', 'to push the boundaries', 'of what’s possible', 'beyond our planet.', 'These missions have not', 'only expanded our', 'knowledge of the', 'universe but have also', 'contributed to', 'to advancements in', 'in technology here on', 'on Earth. Satellite', 'communications, GPS, and', 'even certain medical', 'imaging techniques trace', 'their roots back to', 'to innovations driven by', 'by space programs.']
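The coarse-to-fine idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and its chunk boundaries will not always match what `RecursiveCharacterTextSplitter` produces.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first and recurse with finer
    separators only for pieces that are still too long (no overlap)."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current.strip():
            chunks.append(current.strip())
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = (
    "Space exploration has led to incredible scientific discoveries.\n\n"
    "These missions have expanded our knowledge."
)
chunks = recursive_split(sample, chunk_size=25)
print(chunks)
```

Paragraph boundaries are respected first, and words are only separated when a paragraph does not fit within `chunk_size`.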


### 3. Document structure based - RecursiveCharacterTextSplitter
A special case of the previous splitter for formatted documents such as Markdown or Python files: the separator list is chosen to match the structure of the format.

In [5]:
text = """
class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        self.grade = grade  # Grade is a float (like 8.5 or 9.2)

    def get_details(self):
        return self.name

    def is_passing(self):
        return self.grade >= 6.0


# Example usage
student1 = Student("Aarav", 20, 8.2)
print(student1.get_details())

if student1.is_passing():
    print("The student is passing.")
else:
    print("The student is not passing.")
"""

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language # can also use MarkdownTextSplitter
splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=300, chunk_overlap=0)
chunks = splitter.split_text(text)
print(chunks)

['class Student:\n    def __init__(self, name, age, grade):\n        self.name = name\n        self.age = age\n        self.grade = grade  # Grade is a float (like 8.5 or 9.2)\n\n    def get_details(self):\n        return self.name\n\n    def is_passing(self):\n        return self.grade >= 6.0', '# Example usage\nstudent1 = Student("Aarav", 20, 8.2)\nprint(student1.get_details())\n\nif student1.is_passing():\n    print("The student is passing.")\nelse:\n    print("The student is not passing.")']
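A rough way to picture language-aware splitting: it is the same recursive procedure, but the first separators tried are language constructs rather than blank lines. The sketch below only shows that first level. `PYTHON_SEPARATORS` and `split_top_level` are hypothetical names, and the two separators are an assumed subset of what the library uses for Python.

```python
import re

# Assumption: top-level class and def boundaries are tried before blank lines
PYTHON_SEPARATORS = ["\nclass ", "\ndef "]


def split_top_level(source):
    """Split just before each top-level class or def, keeping the keyword
    with the code that follows it (hypothetical helper)."""
    pattern = "(?=" + "|".join(re.escape(sep) for sep in PYTHON_SEPARATORS) + ")"
    return [piece.strip() for piece in re.split(pattern, source) if piece.strip()]


source = (
    "import math\n"
    "\n"
    "class Circle:\n"
    "    def area(self, r):\n"
    "        return math.pi * r * r\n"
    "\n"
    "def double(x):\n"
    "    return 2 * x\n"
)
pieces = split_top_level(source)
for piece in pieces:
    print(repr(piece))
```

The indented `def area` stays inside the `Circle` piece because only a `def` at the start of a line matches the separator.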


### 4. Semantic Text Splitter
Given 5 sentences: S1, S2, S3, S4, S5

Embed each sentence, then pair up consecutive sentences: (S1, S2), (S2, S3), (S3, S4), (S4, S5)

Compute the cosine similarity of each pair → 4 similarity scores

Calculate the mean and standard deviation (SD) of these scores

Wherever a similarity drops significantly (e.g., below mean − SD), insert a split between those two sentences
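The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not `SemanticChunker`'s implementation: `semantic_split` is a hypothetical helper, the 2-D vectors are made-up stand-ins for real embeddings, and the population standard deviation is used.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_split(sentences, embeddings, num_sd=1.0):
    """Start a new chunk wherever the similarity between neighbouring
    sentences falls below mean - num_sd * SD (hypothetical helper)."""
    sims = [
        cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = mean(sims) - num_sd * pstdev(sims)
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # similarity dropped sharply: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["S1", "S2", "S3", "S4", "S5"]
# Made-up embeddings: S1-S3 point one way, S4-S5 another
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
chunks = semantic_split(sentences, toy_embeddings)
print(chunks)
```

Only the (S3, S4) similarity falls below the threshold here, so the single split lands between S3 and S4.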

In [7]:
text1 = """
The ancient forest, a silent guardian of forgotten lore, whispered tales of old. Towering trees, centuries old, formed a canopy that filtered the sunlight into dappled patterns on the mossy ground. This was a place of profound peace, where time seemed to slow, and the very air hummed with a mystical energy.

Meanwhile, in the bustling city miles away, sirens wailed. Traffic snarled, and the relentless rhythm of urban life pulsed through concrete canyons. People rushed, their faces etched with the urgency of modern existence, a stark contrast to the timeless tranquility of the woods. Skyscrapers pierced the clouds, monuments to human ambition and tireless innovation.

Back in the forest, a solitary deer grazed quietly by a hidden spring. The rustle of leaves underfoot was the only sound, a gentle symphony of nature. Sunlight now painted the western sky in hues of orange and purple, casting long shadows that danced among the ancient trunks. The profound peace of the ancient forest settled once more, an enduring echo of forgotten lore.
"""

In [8]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",  # other options include "percentile"
    breakpoint_threshold_amount=1,  # split where the gap between sentences is more than 1 SD beyond the mean
)
docs = splitter.create_documents([text1])  # each text in the list is chunked independently of the others
docs

[Document(metadata={}, page_content='\nThe ancient forest, a silent guardian of forgotten lore, whispered tales of old. Towering trees, centuries old, formed a canopy that filtered the sunlight into dappled patterns on the mossy ground. This was a place of profound peace, where time seemed to slow, and the very air hummed with a mystical energy. Meanwhile, in the bustling city miles away, sirens wailed. Traffic snarled, and the relentless rhythm of urban life pulsed through concrete canyons. People rushed, their faces etched with the urgency of modern existence, a stark contrast to the timeless tranquility of the woods.'),
 Document(metadata={}, page_content='Skyscrapers pierced the clouds, monuments to human ambition and tireless innovation. Back in the forest, a solitary deer grazed quietly by a hidden spring. The rustle of leaves underfoot was the only sound, a gentle symphony of nature. Sunlight now painted the western sky in hues of orange and purple, casting long shadows that dan