##### How to split by HTML header
HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and attaches, as metadata, every header that applies to a given chunk. It can return chunks element by element or combine elements that share the same metadata. The goals are to (a) keep semantically related text grouped together and (b) preserve the context encoded in the document's structure. It can also be combined with other text splitters as part of a chunking pipeline.
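To make the idea of header-relevant metadata concrete, here is a minimal standard-library sketch of the bookkeeping involved. It is not the library's implementation: `HeaderTracker`, `split_by_header`, and `HEADERS` are hypothetical names, the sketch assumes header tags of the form `h1` to `h6`, and it ignores DOM nesting, so text that follows a closed section keeps the most recent header.

```python
from html.parser import HTMLParser

# Hypothetical mapping of header tag -> metadata key (an assumption for this sketch).
HEADERS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}


class HeaderTracker(HTMLParser):
    """Collect (metadata, text) pairs while tracking which headers are in scope."""

    def __init__(self, headers):
        super().__init__()
        self.headers = headers    # tag -> metadata key
        self.active = {}          # tag -> header text currently in scope
        self.current_tag = None   # header tag being read, if any
        self.chunks = []          # list of (metadata dict, text)

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag is not None:
            # A new header replaces any header at the same or a deeper level.
            level = int(self.current_tag[1])
            self.active = {t: v for t, v in self.active.items() if int(t[1]) < level}
            self.active[self.current_tag] = text
        metadata = {self.headers[t]: v for t, v in sorted(self.active.items())}
        self.chunks.append((metadata, text))


def split_by_header(html, headers=HEADERS):
    """Return one (metadata, text) pair per non-empty text node."""
    parser = HeaderTracker(headers)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.chunks


sample = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>Bar text.</p><h2>Baz</h2><p>Baz text.</p>"
for metadata, text in split_by_header(sample):
    print(metadata, repr(text))
```

Each header both resets the deeper levels and is emitted as its own chunk, which mirrors the element-by-element style of output shown in this guide.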


In [2]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

# Map each header tag to the metadata key under which its text is stored.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Bar main section'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 

In [4]:
type(html_header_splits[0])

langchain_core.documents.base.Document
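The element-by-element output above can also be combined so that consecutive elements with identical metadata form a single chunk. The sketch below illustrates that grouping with plain `(metadata, text)` tuples standing in for `Document` objects. `merge_by_metadata` and its `separator` parameter are hypothetical names introduced here, and the newline separator is an assumption; the library's own merging behavior may differ.

```python
from itertools import groupby


def merge_by_metadata(chunks, separator="\n"):
    """Merge consecutive (metadata, text) pairs that share identical metadata."""
    merged = []
    # groupby compares keys with ==, so unhashable dict metadata works as a key.
    for metadata, group in groupby(chunks, key=lambda pair: pair[0]):
        merged.append((metadata, separator.join(text for _, text in group)))
    return merged


# The first four elements of the output above, written as plain tuples.
elements = [
    ({"Header 1": "Foo"}, "Foo"),
    ({"Header 1": "Foo"}, "Some intro text about Foo."),
    ({"Header 1": "Foo", "Header 2": "Bar main section"}, "Bar main section"),
    ({"Header 1": "Foo", "Header 2": "Bar main section"}, "Some intro text about Bar."),
]

for metadata, text in merge_by_metadata(elements):
    print(metadata, repr(text))
```

Only adjacent elements are merged, so document order is preserved and a header that reappears later in the page starts a new chunk rather than being folded into an earlier one.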

In [5]:
url = "https://medium.com/wise-well/to-reduce-clutter-start-with-self-awareness-e7a4216dbf42"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

[Document(metadata={}, page_content='Sitemap  \ndocument.domain = document.domain;  \nOpen in app  \nSign up  \nSign in  \nMedium Logo  \nWrite  \nSign up  \nSign in  \nWise & Well  \n·  \nFollow publication  \nScience-backed insights into health, wellness and wisdom, to help you make tomorrow a little better than today.  \nFollow publication  \nMember-only story'),
 Document(metadata={'Header 1': 'The Secret to Reducing Clutter is Deep Inside You'}, page_content='The Secret to Reducing Clutter is Deep Inside You'),
 Document(metadata={}, page_content='Most decluttering advice won’t work. What to do instead.  \nGail Post, Ph.D.  \nFollow  \n7 min read  \n·  \nJun 4, 2025  \n--  \n51  \nShare  \nImage: Pexels/RDNE  \nSome things come easily.  \nBut tidiness is not one of them.  \nI save way too many things ( ). My dining room table is known to display a horizontal pile or two of paperwork, and I have way too many dust-gathering ceramics on view.  \nafter all, what if that jacket from 20