##### How to split by HTML header
`HTMLHeaderTextSplitter` is a "structure-aware" chunker that splits text at the HTML element level and attaches to each chunk, as metadata, the headers that apply to it. It can return chunks element by element or combine elements that share the same metadata. The goals are (a) to keep semantically related text grouped together and (b) to preserve the context encoded in the document's structure. It can also be combined with other text splitters as part of a chunking pipeline.
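
To make the "element by element versus combined" distinction concrete, here is a minimal pure-Python sketch, not LangChain's implementation. The `elements` data and the `combine_adjacent` helper are hypothetical, and joining texts with a single space is an arbitrary choice. In the library itself this behaviour is, as far as its documentation indicates, selected with the `return_each_element` argument.

```python
from itertools import groupby

# Hypothetical per-element output: (metadata, text) pairs in document order.
elements = [
    ({"Header 1": "Foo"}, "First paragraph."),
    ({"Header 1": "Foo"}, "Second paragraph."),
    ({"Header 1": "Foo", "Header 2": "Bar"}, "Bar paragraph."),
]


def combine_adjacent(pairs):
    """Merge consecutive pairs whose metadata dictionaries are equal."""
    combined = []
    # groupby compares keys with ==, so plain dicts work as keys here.
    for metadata, group in groupby(pairs, key=lambda pair: pair[0]):
        merged_text = " ".join(part for _, part in group)
        combined.append((metadata, merged_text))
    return combined


combined = combine_adjacent(elements)
for metadata, merged_text in combined:
    print(metadata, "->", merged_text)
```

The two paragraphs under "Foo" end up in one chunk, while the paragraph under "Bar" stays separate because its metadata differs.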


In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """

<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>

"""



1. `headers_to_split_on`

   - A list of tuples that maps each HTML header tag to the metadata label used for it. Use it to:
     - define which headers the page content is split on
     - assign a human-readable name or level to each header (like chapter or section titles)

2. What `HTMLHeaderTextSplitter` does

   - The splitter treats headers such as h1, h2, and h3 as section markers and collects all content under each heading, up to the next heading of the same or a higher level.
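
The two points above can be sketched with the standard library's `html.parser`. This is a toy illustration of the idea, not how LangChain implements it: the `HeaderChunker` class is hypothetical, and it deliberately ignores element nesting (for example enclosing `div` elements), so it will not reproduce every detail of the real splitter's output.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker (illustrative only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata label, e.g. {"h1": "Header 1"}.
        self.labels = dict(headers_to_split_on)
        self.active = {}       # tag -> header text currently in scope
        self.in_header = None  # header tag being read, if any
        self.buffer = []       # text collected for the current chunk
        self.chunks = []       # list of (metadata dict, text) pairs

    def _flush(self):
        text = " ".join(self.buffer).strip()
        if text:
            metadata = {self.labels[t]: h for t, h in sorted(self.active.items())}
            self.chunks.append((metadata, text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self._flush()  # close the chunk that belonged to the old context
            # Keep only headers of a higher level ("h1" < "h2" as strings);
            # headers at the same or a deeper level go out of scope.
            self.active = {t: h for t, h in self.active.items() if t < tag}
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            previous = self.active.get(self.in_header, "")
            self.active[self.in_header] = f"{previous} {text}".strip()
        else:
            self.buffer.append(text)

    def close(self):
        super().close()
        self._flush()  # emit whatever text is still pending


sample = (
    "<h1>Foo</h1><p>Intro.</p>"
    "<h2>Bar</h2><p>About Bar.</p>"
    "<h2>Baz</h2><p>About Baz.</p>"
)
chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(sample)
chunker.close()
for metadata, text in chunker.chunks:
    print(metadata, "->", text)
```

When the second `h2` arrives, the first one drops out of scope, so the last chunk carries "Baz" rather than "Bar" under the `Header 2` label, while `Header 1` stays "Foo" throughout.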

In [None]:
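# Map each header tag to the metadata label attached to its chunks.
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[
        ("h1", "Header 1"),
        ("h2", "Header 2"),
        ("h3", "Header 3"),
    ]
)

# Split the HTML string defined above into header-scoped chunks.
html_header_splits = html_splitter.split_text(html_string)

html_header_splits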

| Section | Headers                                                   | Content                              |
| ------- | --------------------------------------------------------- | ------------------------------------ |
| 1       | Header 1: Foo                                             | Some intro text about Foo.           |
| 2       | Header 1: Foo, Header 2: Bar…                             | Some intro text about Bar.           |
| 3       | Header 1: Foo, Header 2: Bar…, Header 3: Bar subsection 1 | Some text about the first subtopic…  |
| 4       | Header 1: Foo, Header 2: Bar…, Header 3: Bar subsection 2 | Some text about the second subtopic… |
| 5       | Header 1: Foo, Header 2: Baz                              | Some text about Baz                  |
| 6       | Header 1: Foo                                             | Some concluding text about Foo       |

- Each chunk is tagged with the most recent header at every level above it. A new header of the same or a higher level (for example a new h1) resets the context below it.

## Example: splitting a page from a URL

In [None]:
url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]


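# Build the splitter, then fetch the page at `url` and split it by header.
# Note: split_text_from_url makes a network request.
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits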
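
Header-scoped chunks from a long page can still be large, which is where the chunking pipeline mentioned in the introduction comes in: a second, size-based splitter is applied to each chunk while the header metadata is carried along. The sketch below shows that idea in plain Python. The `Chunk` class and `split_by_size` helper are hypothetical stand-ins, not LangChain APIs; in practice a splitter such as `RecursiveCharacterTextSplitter` would usually fill this role.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk: text plus header metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_by_size(chunks, chunk_size=40):
    """Break each chunk on whitespace into pieces of roughly chunk_size
    characters, copying the header metadata to every piece.
    A single word longer than chunk_size is kept whole."""
    pieces = []
    for chunk in chunks:
        current = ""
        for word in chunk.page_content.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > chunk_size and current:
                pieces.append(Chunk(current, dict(chunk.metadata)))
                current = word
            else:
                current = candidate
        if current:
            pieces.append(Chunk(current, dict(chunk.metadata)))
    return pieces


header_chunks = [
    Chunk(
        "Some text about the first subtopic of Bar.",
        {"Header 1": "Foo", "Header 2": "Bar main section"},
    ),
]
pieces = split_by_size(header_chunks, chunk_size=20)
for piece in pieces:
    print(piece.metadata, "->", piece.page_content)
```

Every smaller piece keeps the headers of the chunk it came from, so the structural context survives the second splitting stage.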