### First Technique -> How to recursively split text by character

"""
The RecursiveCharacterTextSplitter is LangChain's recommended text splitter for generic, long documents. It tries a prioritized list of separators (by default paragraph breaks, then line breaks, then spaces, then individual characters), splitting on the larger boundaries first and falling back to smaller ones only when a piece still exceeds chunk_size, so chunks retain as much context as possible.
"""
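The fallback behaviour described above can be illustrated with a small pure-Python sketch. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, chunk overlap is ignored, and the separator list is assumed to match the default order.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the coarsest separator, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
chunks = recursive_split(sample, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives intact, while the longer second paragraph is only broken at spaces once the paragraph and line separators fail to bring it under the limit.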

In [2]:
# Load the PDF; each page becomes one Document
from langchain_community.document_loaders import PyPDFLoader
pdf_loader = PyPDFLoader('sample.pdf')
pdf_document = pdf_loader.load()
print(pdf_document)

[Document(metadata={'producer': 'Powered By Crystal', 'creator': 'Crystal Reports', 'creationdate': '', 'source': 'sample.pdf', 'total_pages': 13, 'page': 0, 'page_label': '1'}, page_content='Report as of   January 05, 2026    \nTransfer Agent Report One\nTA Fund Number Price ChangeRatePriceSystem KeyUserbank CG Long CG Short QII %\n   Dimensional Funds      549\nDFA Short Term Investment Fund 493  11.57  0.00121135599380W9  0.00 \n   VICTORY      590\nSycamore Established Value 308  45.88Class R6 1602:07W7  0.40 \nSycamore Established Value 024  44.04Class C 1602:04W7  0.37 \nSycamore Established Value 001  44.75Class R 1602:03W7  0.38 \nSycamore Established Value 307  45.81Class A 1602:01W7  0.39 \nSycamore Established Value 203  45.85Class I 1602:05W7  0.40 \nSycamore Established Value 309  45.83Class Y 1602:06W7  0.39 \nSycamore Small Company Opportunity 859  46.51Class R6 1603:07W7  0.73 \nSycamore Small Company Opportunity 002  40.87Class R 1603:03W7  0.63 \nSycamore Small Compan

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
texts = text_splitter.split_documents(pdf_document)


In [4]:
print(texts[0])
print(texts[1])

page_content='Report as of   January 05, 2026    
Transfer Agent Report One
TA Fund Number Price ChangeRatePriceSystem KeyUserbank CG Long CG Short QII %
   Dimensional Funds      549' metadata={'producer': 'Powered By Crystal', 'creator': 'Crystal Reports', 'creationdate': '', 'source': 'sample.pdf', 'total_pages': 13, 'page': 0, 'page_label': '1'}
page_content='DFA Short Term Investment Fund 493  11.57  0.00121135599380W9  0.00 
   VICTORY      590
Sycamore Established Value 308  45.88Class R6 1602:07W7  0.40' metadata={'producer': 'Powered By Crystal', 'creator': 'Crystal Reports', 'creationdate': '', 'source': 'sample.pdf', 'total_pages': 13, 'page': 0, 'page_label': '1'}


In [5]:
# Load a plain-text file
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')


# loader.load() reads the whole text file and returns it as a list of LangChain Document objects
text_documents = loader.load()
print(text_documents)

[Document(metadata={'source': 'speech.txt'}, page_content='Text to speech (TTS) is the use of software to create a sound output in the form of a spoken voice. The program that is used by programs to change text on the page to an audio output of the spoken voice is normally a text to speech engine. Blind people, people who do not see well, and people with reading disabilities can rely on good text-to-speech systems. That way they can listen to pieces of the text. TTS engines are needed for an audio output of machine translation results.\n\nUp until about 2010, there was the analytic approach: This approach uses multiply steps to convert the text to speech. Usually, an input text is transformed into phonetic writing. This says how the words are pronounced, and not how they are written. In the phonetic writing, phonemes can be identified. The system can then produce speech by putting together prerecorded or synthesized diphones. A problem is to make the language flow sound natural, what l

In [None]:
speech = ""
with open('speech.txt', 'r') as file:
    speech = file.read()


text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=5)    
texts = text_splitter.split_text(speech)

In [7]:
print(texts[0])
print(texts[1])
print(texts[2])

Text to speech (TTS) is the use of software to create a sound output in the form of a spoken voice.
The program that is used by programs to change text on the page to an audio output of the spoken
voice is normally a text to speech engine. Blind people, people who do not see well, and people


### Second Technique -> "Character text splitter"

"""
The LangChain CharacterTextSplitter is a basic and fast text splitting method that divides text on a single specified separator (defaulting to a double newline, \n\n) and merges the resulting pieces into chunks of up to chunk_size characters. A piece that contains no separator is kept whole, so a chunk can end up longer than chunk_size.
"""
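To make the single-separator behaviour concrete, here is a minimal pure-Python sketch. It is an approximation under stated assumptions rather than LangChain's code: `split_on_separator` is a hypothetical name, and chunk overlap is left out for brevity.

```python
def split_on_separator(text, separator="\n\n", chunk_size=200):
    """Simplified sketch: split on one separator, then merge pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged text still fits, or this piece starts a new chunk.
            # An oversized piece is accepted as-is: there is no finer separator.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\n" + "c" * 30
pieces = split_on_separator(sample, chunk_size=12)
for piece in pieces:
    print(len(piece), repr(piece))
```

The two short paragraphs are merged into one chunk, while the 30-character paragraph is emitted whole even though it exceeds the 12-character limit. That is the main practical difference from the recursive splitter.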

In [9]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=200, chunk_overlap=20)
texts = text_splitter.split_documents(pdf_document)

In [10]:
speech = ""
with open('speech.txt', 'r') as file:
    speech = file.read()


text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)    
texts = text_splitter.split_text(speech)

print(texts[0])
print(texts[1])

Text to speech (TTS) is the use of software to create a sound output in the form of a spoken voice. The program that is used by programs to change text on the page to an audio output of the spoken voice is normally a text to speech engine. Blind people, people who do not see well, and people with reading disabilities can rely on good text-to-speech systems. That way they can listen to pieces of the text. TTS engines are needed for an audio output of machine translation results.
Up until about 2010, there was the analytic approach: This approach uses multiply steps to convert the text to speech. Usually, an input text is transformed into phonetic writing. This says how the words are pronounced, and not how they are written. In the phonetic writing, phonemes can be identified. The system can then produce speech by putting together prerecorded or synthesized diphones. A problem is to make the language flow sound natural, what linguists call prosody.


### Third Technique -> HTML header text splitter

"""
The HTMLHeaderTextSplitter is a LangChain utility that parses HTML content and segments it into documents based on a specified list of header tags. It preserves the document's hierarchy by attaching the headers above each chunk to that chunk as metadata.
"""


In [11]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <h1>Main Title</h1>
    <p>This is some introductory text under the first main header.</p>
    <div>
        <h2>Sub-section 1</h2>
        <p>This content belongs to the first sub-section.</p>
        <h3>Detail A</h3>
        <p>Specific details for sub-section 1 go here.</p>
    </div>
    <h1>Second Main Title</h1>
    <p>This text is separate and belongs to the second major heading.</p>
</body>
</html>
"""

# Map each header tag to the metadata key that will hold its text
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# The splitter works on headers only; it does not take chunk_size or chunk_overlap
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_docs = html_splitter.split_text(html_string)

for doc in html_docs:
    print(doc)


page_content='Main Title' metadata={'Header 1': 'Main Title'}
page_content='This is some introductory text under the first main header.' metadata={'Header 1': 'Main Title'}
page_content='Sub-section 1' metadata={'Header 1': 'Main Title', 'Header 2': 'Sub-section 1'}
page_content='This content belongs to the first sub-section.' metadata={'Header 1': 'Main Title', 'Header 2': 'Sub-section 1'}
page_content='Detail A' metadata={'Header 1': 'Main Title', 'Header 2': 'Sub-section 1', 'Header 3': 'Detail A'}
page_content='Specific details for sub-section 1 go here.' metadata={'Header 1': 'Main Title', 'Header 2': 'Sub-section 1', 'Header 3': 'Detail A'}
page_content='Second Main Title' metadata={'Header 1': 'Second Main Title'}
page_content='This text is separate and belongs to the second major heading.' metadata={'Header 1': 'Second Main Title'}


In [12]:

import requests
from langchain_text_splitters import HTMLHeaderTextSplitter

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Send a browser-like User-Agent; some sites reject requests that lack one
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

# Fetch the HTML manually
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

# Split the content
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_docs = html_splitter.split_text(response.text)

print(html_docs[0])



page_content='Jump to content  
Main menu  
Main menu  
move to sidebar  
hide  
Navigation  
Main page  
Contents  
Current events  
Random article  
About Wikipedia  
Contact us  
Contribute  
Help  
Learn to edit  
Community portal  
Recent changes  
Upload file  
Special pages  
Search  
Search  
Appearance  
Donate  
Create account  
Log in  
Personal tools  
Donate  
Create account  
Log in  
CentralNotice'
