# Text Splitting and txt Parsing

In [193]:
!pip install tiktoken



In [194]:
import os

text_content = """
This is a very long sentence that contains many words and goes on for a very long time to demonstrate how character text splitter works with overlap when the text is longer than the chunk size and needs to be split across multiple chunks while maintaining some overlap between them for better context preservation and understanding of the content that spans across chunk boundaries eowoeiwoefhoewhofweoifhoiweohfwoiefoweofhweohfowehofhowehfioweohifhiowehoifweiohfiohweofhwhiofhiofwhiofhoiwefhoiweohifehoiwfhoiwoheifohiewfhoiwehoifhoiewfhoiweohiefoihwefiohwohfewhoioehwfhoewohfweoifowehofewhofoweofwefoweh eiorfjoierjoferoie jerio jio ge jgoe or gergjeorjgoejr jg oejrgo eorj goer goeroigeoroj goj erjogoergijoer roefer iojgojre ejog ejorgo oergoj ejrgjeriogjoierjogerogejrgeor jgrejgjreogoergoerjoigoiergj erjgoe joegerjgieorgoerjogjergijeor giojejrgeorgoerojgejorjo ejogeoji ooegojgerogro.

Another very long sentence that also contains many words and continues for a long time to show how the overlap mechanism works when splitting text that is longer than the specified chunk size and needs to be divided into multiple parts while keeping some overlapping content between adjacent chunks for better text processing and analysis.

This is the third very long sentence that demonstrates the same concept with many words and continues for a long time to illustrate how character text splitter handles long text by splitting it into chunks with overlap when the content exceeds the specified chunk size limit.
"""
os.makedirs("data", exist_ok=True)
with open("data/some_text.txt", "w", encoding="utf-8") as file:
    file.write(text_content)

print("File some_text.txt created!")

File some_text.txt created!


In [195]:
from langchain_community.document_loaders import TextLoader
from typing import List
from langchain_core.documents import Document


def process_text(file_path: str) -> List[Document]:
    """Load a UTF-8 text file and return it as a list of Documents."""
    txt_loader = TextLoader(file_path, encoding="utf-8")
    txt_doc = txt_loader.load()

    return txt_doc
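
As the printed output further down shows, the loader returns a single document holding the whole file as `page_content`, with the path stored under `metadata['source']`. A rough pure-Python sketch of that shape is below. `SimpleDocument` and `load_text` are hypothetical names used only for illustration; they are not part of LangChain and do not reproduce everything `TextLoader` does.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text(file_path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read the whole file into one document and record where it came from."""
    content = Path(file_path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=content, metadata={"source": file_path})]
```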



In [196]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

current_docs = process_text('data/some_text.txt')

print(current_docs)
print(current_docs[0].page_content[:500])

[Document(metadata={'source': 'data/some_text.txt'}, page_content='\nThis is a very long sentence that contains many words and goes on for a very long time to demonstrate how character text splitter works with overlap when the text is longer than the chunk size and needs to be split across multiple chunks while maintaining some overlap between them for better context preservation and understanding of the content that spans across chunk boundaries eowoeiwoefhoewhofweoifhoiweohfwoiefoweofhweohfowehofhowehfioweohifhiowehoifweiohfiohweofhwhiofhiofwhiofhoiwefhoiweohifehoiwfhoiwoheifohiewfhoiwehoifhoiewfhoiweohiefoihwefiohwohfewhoioehwfhoewohfweoifowehofewhofoweofwefoweh eiorfjoierjoferoie jerio jio ge jgoe or gergjeorjgoejr jg oejrgo eorj goer goeroigeoroj goj erjogoergijoer roefer iojgojre ejog ejorgo oergoj ejrgjeriogjoierjogerogejrgeor jgrejgjreogoergoerjoigoiergj erjgoe joegerjgieorgoerjogjergijeor giojejrgeorgoerojgejorjo ejogeoji ooegojgerogro.\n\nAnother very long sentence that also 

## Character-based splitting


In [197]:
# Method 1.
text = current_docs[0].page_content

In [198]:
char_splitter = CharacterTextSplitter(
    separator=" ",        # Split on spaces (word boundaries)
    chunk_size=100,       # Max chunk length, in characters
    chunk_overlap=30,     # Characters shared between adjacent chunks
    length_function=len,  # How chunk length is measured
)

char_chunks = char_splitter.split_text(text)




In [199]:
print(char_chunks[0])
print("------------")
print(char_chunks[1])
print("------------")
print(char_chunks[2])


This is a very long sentence that contains many words and goes on for a very long time to
------------
on for a very long time to demonstrate how character text splitter works with overlap when the text
------------
with overlap when the text is longer than the chunk size and needs to be split across multiple
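
The output above shows each chunk starting with the tail of the previous one ("on for a very long time to", "with overlap when the text"). A minimal pure-Python sketch of that greedy packing with overlap follows. It is a simplified illustration, not LangChain's actual implementation, and `split_with_overlap` is a hypothetical helper name.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int,
                       separator: str = " ") -> List[str]:
    """Pack separator-delimited pieces into chunks of at most chunk_size
    characters, carrying up to chunk_overlap trailing characters of each
    chunk into the next one. Hypothetical helper for illustration."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []

    def length(parts: List[str]) -> int:
        # Length of the joined parts, counting one separator between pieces.
        return sum(len(p) for p in parts) + len(separator) * max(len(parts) - 1, 0)

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remaining tail fits the
            # overlap budget and leaves room for the incoming piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


example = split_with_overlap("one two three four five six seven",
                             chunk_size=13, chunk_overlap=5)
print(example)
# ['one two three', 'three four', 'four five six', 'six seven']
```

A piece that is longer than `chunk_size` on its own is emitted as an oversized chunk in this sketch, since there is no separator inside it to split on.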


## Recursive Text Splitter

In [200]:
# Try each separator in order, splitting further only where a piece is still too long
RecSplitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],  # Tried in order: paragraphs, lines, words
    chunk_size=250,                  # Max chunk length, in characters
    chunk_overlap=50,                # Characters shared between adjacent chunks
    length_function=len,             # How chunk length is measured
)

rec_chunks = RecSplitter.split_text(text)

In [201]:
print(rec_chunks[0])
print("------------")
print(rec_chunks[1])
print("------------")
print(rec_chunks[2])


This is a very long sentence that contains many words and goes on for a very long time to demonstrate how character text splitter works with overlap when the text is longer than the chunk size and needs to be split across multiple chunks while
------------
needs to be split across multiple chunks while maintaining some overlap between them for better context preservation and understanding of the content that spans across chunk boundaries
------------
across chunk boundaries eowoeiwoefhoewhofweoifhoiweohfwoiefoweofhweohfowehofhowehfioweohifhiowehoifweiohfiohweofhwhiofhiofwhiofhoiwefhoiweohifehoiwfhoiwoheifohiewfhoiwehoifhoiewfhoiweohiefoihwefiohwohfewhoioehwfhoewohfweoifowehofewhofoweofwefoweh
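
The idea of falling back from coarse separators to finer ones can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical helper, overlap is left out to keep it short, and it is not the algorithm LangChain ships.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator first and fall back to finer ones
    only for pieces that are still too long. Overlap is omitted.
    Hypothetical helper for illustration."""
    if len(text) <= chunk_size or not separators:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "short para\n\nthis paragraph is much longer than the limit"
print(recursive_split(sample, ["\n\n", "\n", " "], chunk_size=20))
# ['short para', 'this paragraph is', 'much longer than the', 'limit']
```

When the separator list is exhausted, the sketch returns the remaining text as one chunk even if it exceeds `chunk_size`, because there is nothing left to split on.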


## Token Text Splitter

In [202]:
# TokenTextSplitter works like the character splitter, but it measures chunk size and overlap in tokens (as counted by the model's tokenizer) instead of characters, and it does not use separators


token_splitter = TokenTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50,
    model_name="gpt-3.5-turbo"
)

token_chunks = token_splitter.split_text(text)


In [203]:
print(token_chunks[0])
print("------------")
print(token_chunks[1])
print("------------")
print(token_chunks[2])


This is a very long sentence that contains many words and goes on for a very long time to demonstrate how character text splitter works with overlap when the text is longer than the chunk size and needs to be split across multiple chunks while maintaining some overlap between them for better context preservation and understanding of the content that spans across chunk boundaries eowoeiwoefhoewhofweoifhoiweohfwoiefoweofhweohfowehofhowehfioweohifhiowehoifweiohfiohweofhwhiofhiofwhiofhoiwefhoiweohifehoiwfhoiwoheifohiewfhoiwehoifhoiewfhoiweohiefoihwefiohwohfewhoioehwfhoewohfweoifowehofewhofoweofwefoweh eiorfjoierjoferoie jerio jio ge jgoe
------------
ohiefoihwefiohwohfewhoioehwfhoewohfweoifowehofewhofoweofwefoweh eiorfjoierjoferoie jerio jio ge jgoe or gergjeorjgoejr jg oejrgo eorj goer goeroigeoroj goj erjogoergijoer roefer iojgojre ejog ejorgo oergoj ejrgjeriogjoierjogerogejrgeor jgrejgjreogoergoerjoigoiergj erjgoe joegerjgieorgoerjogjergijeor giojejrgeorgoerojgejorjo ejogeoji ooegojger
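
Because token splitting ignores separators, chunks can start and end in the middle of a word, as the second chunk above does. The windowing itself can be sketched as a fixed-size slice that advances by `chunk_size - chunk_overlap` tokens. In the sketch below, whitespace-separated words stand in for real tokenizer output, which is an assumption made to avoid depending on `tiktoken`; `token_windows` is a hypothetical helper name.

```python
from typing import List, Sequence


def token_windows(tokens: Sequence[str], chunk_size: int,
                  chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens over the sequence, advancing by
    chunk_size - chunk_overlap each step. Hypothetical helper for illustration."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the sequence
        start += step
    return windows


# Whitespace words as stand-in tokens (assumption; a real tokenizer differs).
toy_tokens = "a b c d e f g".split()
print(token_windows(toy_tokens, chunk_size=4, chunk_overlap=2))
# [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g']]
```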

In [204]:
# That's all for basic chunking. In section 3 we will discuss more advanced chunking methods, such as chunking based on semantic information