##### How to split by HTML header
 
`HTMLHeaderTextSplitter` is a "structure-aware" chunker that splits text at the HTML element level and attaches metadata for every header "relevant" to a given chunk. It can return chunks element by element or combine elements that share the same metadata, with two objectives: (a) keeping related text grouped (more or less) semantically and (b) preserving the context-rich information encoded in the document structure. It can also be used with other text splitters as part of a chunking pipeline.
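
To make the idea of "header metadata per chunk" concrete, here is a minimal standard-library sketch. It is not the library's implementation: `HeaderChunker` and its attributes are hypothetical names invented for illustration, and its output shape (plain `(metadata, text)` tuples, with header text kept only in the metadata) is an assumption that differs from what `HTMLHeaderTextSplitter` returns.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Hypothetical helper: groups text under the headers currently in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.levels = sorted(self.labels)        # "h1" < "h2" < ... < "h6"
        self.active = {}        # metadata label -> header text in scope
        self.chunks = []        # collected (metadata, text) pairs
        self.buffer = []        # text gathered for the current chunk
        self.header_text = []   # text gathered for the header being read
        self.in_header = None   # tag name while inside a tracked header
        self.skip = 0           # depth inside <head>/<script>/<style>

    def _flush(self):
        # Emit the buffered text together with a copy of the active headers.
        text = " ".join(self.buffer).strip()
        if text:
            self.chunks.append((dict(self.active), text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "script", "style"):
            self.skip += 1
        elif tag in self.labels:
            self._flush()  # a tracked header closes the previous chunk
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag in ("head", "script", "style"):
            self.skip -= 1
        elif tag == self.in_header:
            # A new header replaces its own level and drops all deeper levels.
            for deeper in self.levels[self.levels.index(tag):]:
                self.active.pop(self.labels[deeper], None)
            self.active[self.labels[tag]] = " ".join(self.header_text)
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip:
            return
        if self.in_header:
            self.header_text.append(text)
        else:
            self.buffer.append(text)

    def close(self):
        super().close()
        self._flush()  # emit whatever follows the last header


sample = """
<html><head><title>Sample Document</title></head><body>
<h1>Main Title</h1><p>Intro paragraph.</p>
<h2>Section 1</h2><p>First section.</p>
<h2>Section 2</h2><p>Second section.</p>
<h3>Subsection</h3><p>Nested content.</p>
</body></html>
"""

chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])
chunker.feed(sample)
chunker.close()

for metadata, text in chunker.chunks:
    print(metadata, "->", text)
```

The key design point the sketch tries to show is the scoping rule: when a header at one level appears, it overwrites the entry for that level and clears every deeper level, so each chunk carries only the headers that actually enclose it.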

In [2]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample Document</title>
</head>
<body>
    <h1>Main Title</h1>
    <p>This is the introductory paragraph of the document.</p>
    
    <h2>Section 1</h2>
    <p>This is the first section of content.</p>
    <ul>
        <li>First item</li>
        <li>Second item</li>
    </ul>
    
    <h2>Section 2</h2>
    <p>This is the second section with more details.</p>
    <div class="special">
        <h3>Subsection</h3>
        <p>Nested content inside a div.</p>
    </div>
    
    <footer>
        <p>Copyright © 2023</p>
    </footer>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# Split the document at h1/h2/h3 boundaries; each chunk carries its headers as metadata
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)


for split in html_header_splits:
    print(f"Header: {split.metadata}")
    print(f"Content: {split.page_content[:120]}...")  # Print the first 120 characters
    print("-" * 50)

Header: {'Header 1': 'Main Title'}
Content: Main Title...
--------------------------------------------------
Header: {'Header 1': 'Main Title'}
Content: This is the introductory paragraph of the document....
--------------------------------------------------
Header: {'Header 1': 'Main Title', 'Header 2': 'Section 1'}
Content: Section 1...
--------------------------------------------------
Header: {'Header 1': 'Main Title', 'Header 2': 'Section 1'}
Content: This is the first section of content.  
First item  
Second item...
--------------------------------------------------
Header: {'Header 1': 'Main Title', 'Header 2': 'Section 2'}
Content: Section 2...
--------------------------------------------------
Header: {'Header 1': 'Main Title', 'Header 2': 'Section 2'}
Content: This is the second section with more details....
--------------------------------------------------
Header: {'Header 1': 'Main Title', 'Header 2': 'Section 2', 'Header 3': 'Subsection'}
Content: Subsection...
---------
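
Header-level chunks like the ones printed above can still be too long for a downstream model, which is where a second splitter in the pipeline comes in. The sketch below illustrates that second stage with the standard library only: `split_by_size` is a hypothetical helper, and the `(metadata, text)` tuples are an assumed stand-in for the splitter's document objects. With the library itself, the usual route would be to pass the documents to another splitter's `split_documents` method (for example `RecursiveCharacterTextSplitter`), which is not shown here.

```python
def split_by_size(chunks, max_chars):
    """Hypothetical second stage: cap chunk length while copying header metadata."""
    result = []
    for metadata, text in chunks:
        current = ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > max_chars:
                # Adding this word would exceed the cap, so close the piece.
                result.append((dict(metadata), current))
                current = word
            else:
                current = candidate
        if current:
            result.append((dict(metadata), current))
    return result


# One header-level chunk, modelled on the "Section 2" chunk from the output above.
header_chunks = [
    (
        {"Header 1": "Main Title", "Header 2": "Section 2"},
        "This is the second section with more details.",
    ),
]

pieces = split_by_size(header_chunks, max_chars=20)
for metadata, text in pieces:
    print(metadata, "->", text)
```

Every smaller piece keeps the full header metadata of the chunk it came from, so the structural context survives the size-based split.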

In [4]:
from langchain_text_splitters import HTMLHeaderTextSplitter

url = "https://onepiece.fandom.com/wiki/Story_Arcs"


headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# Fetch the page and split it at h1/h2/h3 boundaries
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)

for split in html_header_splits:
    print(f"Header: {split.metadata}")
    print(f"Content: {split.page_content}...")  # Print the full chunk text
    print("-" * 50)

Header: {}
Content: Start a Wiki  
Sign In  
Don't have an account?  
Register  
Sign In  
One Piece Wiki  
Explore  
Main Page  
Discuss  
All Pages  
Community  
Interactive Maps  
World  
Characters  
Straw Hat Pirates  
In Canon  
Non-Canon  
Pirates  
Marines  
Nobles  
By Race  
Antagonists  
Locations  
Grand Line  
East Blue  
New World  
Islands  
Towns and Cities  
Organizations  
Marines  
Pirate Crews  
Seven Warlords of the Sea  
Four Emperors  
World Government  
Revolutionary Army  
Society  
Occupations  
Races  
Technology  
Animal Species  
Plant and Fungi Species  
Media  
Manga  
Chapters and Volumes  
SBS Corners  
Story Arcs  
Cover Stories  
Volume Intros  
Books  
Anime  
Episode Guide  
Story Arcs  
One Piece Movies  
Filler  
Music  
Merchandise  
Spin-Offs  
Video Games  
Other Media  
Books  
Magazine  
Events  
Live Performances  
Live-Action Series  
Variety  
Mobile Apps  
Mythbusters  
Community  
Discord Server  
Discussions  
Forums  
Recent Blogs  
Fe