How to split by HTML header

HTMLHeaderTextSplitter is a structure-aware chunker that splits text at the HTML element level and attaches to each chunk the metadata of every header that applies to it. It can return chunks element by element or combine elements that share the same metadata, with two goals: (a) keeping semantically related text grouped together and (b) preserving the context-rich information encoded in the document's structure. It can also be combined with other text splitters as part of a chunking pipeline.

In [1]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>HTML Text Formatting Example</title>
</head>
<body>
    <h1>HTML Text Formatting</h1>
    <p><strong>Strong:</strong> This text is important and bold.</p>
    <p><em>Emphasized:</em> This text is emphasized and italic.</p>
    <p><b>Bold:</b> This text is bold.</p>
    <p><i>Italic:</i> This text is italic.</p>
    <p><mark>Marked:</mark> This text is highlighted.</p>
    <p>The chemical formula of water is H<sub>2</sub>O.</p>
    <p>E = mc<sup>2</sup></p>
</body>
</html>"""

# Each pair maps an HTML header tag to the metadata key stored on the chunks.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'HTML Text Formatting'}, page_content='Strong: This text is important and bold.  \nEmphasized: This text is emphasized and italic.  \nBold: This text is bold.  \nItalic: This text is italic.  \nMarked: This text is highlighted.  \nThe chemical formula of water is H2O.  \nE = mc2')]
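
To make the header metadata in the output above more concrete, here is a minimal sketch of header-aware chunking built only on Python's standard-library `html.parser`. It is not LangChain's implementation: `HeaderChunker` and `split_by_headers` are hypothetical names introduced for illustration, and the sketch returns plain `(metadata, text)` tuples instead of `Document` objects. It assumes well-formed header tags of the form `h1` to `h6`.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker; a hypothetical helper, not LangChain's code."""

    SKIPPED = {"title", "script", "style"}  # text inside these tags is ignored

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}         # header tag -> header text currently in scope
        self.open_header = None  # header tag whose text is being read, if any
        self.skip_depth = 0      # > 0 while inside a skipped tag
        self.buffer = []         # text pieces of the chunk being built
        self.chunks = []         # finished (metadata, text) pairs

    def _flush(self):
        # Collapse whitespace and emit the buffered text with its header context.
        text = " ".join("".join(self.buffer).split())
        if text:
            metadata = {self.labels[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((metadata, text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.labels:
            self._flush()  # a new header closes the chunk before it
            # Keep only shallower headers; "h1" < "h2" holds as a string comparison.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = ""
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.open_header:
            self.active[tag] = self.active[tag].strip()
            self.open_header = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.open_header:
            self.active[self.open_header] += data  # text of the header itself
        else:
            self.buffer.append(data)               # body text of the chunk


def split_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser._flush()  # emit the final chunk
    return parser.chunks


sample = """
<body>
  <h1>Guide</h1>
  <p>Intro text.</p>
  <h2>Setup</h2>
  <p>Install the package.</p>
  <h2>Usage</h2>
  <p>Call the splitter.</p>
</body>
"""
chunks = split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for metadata, text in chunks:
    print(metadata, "->", text)
```

On this sample the sketch yields three chunks. The first carries only `Header 1`, while the second and third carry both `Header 1` and their own `Header 2`, which is the sense in which a header is relevant to a chunk.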

In [5]:
url = "https://plato.stanford.edu/entries/goedel/"
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
# split_text() expects HTML content, so passing the URL itself returns a single
# Document with empty metadata whose page_content is just the URL string.
# split_text_from_url() fetches the page first and then splits it by header.
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits
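
Sections of a long page can still be too large for downstream use, which is where the chunking pipeline mentioned at the top comes in: a second splitter caps the chunk size while keeping the header metadata. In LangChain this role is typically filled by a character-based splitter such as `RecursiveCharacterTextSplitter` and its `split_documents` method. The standard-library sketch below only illustrates the idea. `cap_chunk_size` is a hypothetical helper, and it works on plain `(metadata, text)` tuples, not on `Document` objects.

```python
def cap_chunk_size(chunks, max_chars):
    """Split each (metadata, text) pair on whitespace into pieces of roughly
    max_chars characters, copying the header metadata onto every piece.

    A single word longer than max_chars is kept whole, so the cap is soft.
    """
    result = []
    for metadata, text in chunks:
        piece = ""
        for word in text.split():
            candidate = f"{piece} {word}".strip()
            if len(candidate) > max_chars and piece:
                # The current piece is full: emit it and start a new one.
                result.append((dict(metadata), piece))
                piece = word
            else:
                piece = candidate
        if piece:
            result.append((dict(metadata), piece))
    return result


# Hypothetical output of a header-based first stage.
header_chunks = [({"Header 1": "Guide"}, "one two three four five six")]
pieces = cap_chunk_size(header_chunks, max_chars=13)
for metadata, text in pieces:
    print(metadata, "->", text)
```

Each resulting piece keeps the `Header 1` value of the section it came from, so the structural context survives the second split.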