# HeaderChunkedHTMLLoader
## Description and motivation
Similar in concept to the <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/markdown_header_metadata">`MarkdownHeaderTextSplitter`</a>, the `HeaderChunkedHTMLLoader` is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
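
The "combine elements with the same metadata" mode can be pictured as merging consecutive element-level chunks whose header metadata is identical. The snippet below is a minimal sketch of that idea, not the loader's actual implementation: the function name `combine_by_metadata`, the `(text, metadata)` tuple representation, and the `"\n\n"` separator are all assumptions made for illustration.

```python
from itertools import groupby


def combine_by_metadata(chunks, separator="\n\n"):
    """Merge consecutive (text, metadata) pairs that share identical metadata.

    Hypothetical helper: the name, the tuple layout, and the separator are
    illustrative assumptions, not part of HeaderChunkedHTMLLoader.
    """
    combined = []
    # groupby only groups *adjacent* items, so document order is preserved
    # and chunks separated by a different section are not merged.
    for metadata, group in groupby(chunks, key=lambda pair: pair[1]):
        text = separator.join(text for text, _ in group)
        combined.append((text, metadata))
    return combined


# Element-level chunks, in document order.
elements = [
    ("Intro, paragraph 1.", {"Header 1": "Foo"}),
    ("Intro, paragraph 2.", {"Header 1": "Foo"}),
    ("Text about Bar.", {"Header 1": "Foo", "Header 2": "Bar"}),
    ("Concluding text.", {"Header 1": "Foo"}),
]

combined = combine_by_metadata(elements)
for text, metadata in combined:
    print(metadata, "->", repr(text))
```

Only adjacent chunks are merged here, so the concluding text stays separate from the intro even though both carry the same `Header 1`.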

## Usage examples
#### 1) With an HTML string:

In [1]:
from langchain.document_loaders.html import HeaderChunkedHTMLLoader, HeaderChunkedHTMLLoaderFromString

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

header_mapping = {
    "h1": "Header 1",
    "h2": "Header 2",
    "h3": "Header 3",
}

html_header_splits = HeaderChunkedHTMLLoaderFromString([html_string], header_mapping).load()
html_header_splits

[Document(page_content='Some intro text about Foo.', metadata={'url': None, 'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'url': None, 'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'url': None, 'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'url': None, 'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Some text about Baz', metadata={'url': None, 'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'url': None, 'Header 1': 'Foo'})]
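
The metadata above is consistent with a lookup that, for each text element, scans its preceding siblings and then the preceding siblings of each ancestor, keeping the nearest header at each successively higher level. The following is a rough sketch of that lookup using only the standard library. It is not the loader's actual code: `Node`, `TreeBuilder`, `headers_for`, and `chunk_html` are hypothetical names, the parser assumes well-formed HTML, and only `<p>` elements are treated as text.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # no closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def headers_for(node, header_mapping):
    """Collect headers "above" node: prior siblings, then ancestors' prior siblings."""
    found, lowest = {}, float("inf")
    while node.parent is not None:
        siblings = node.parent.children
        for sib in reversed(siblings[: siblings.index(node)]):
            if sib.tag in header_mapping:
                level = int(sib.tag[1])  # assumes keys look like "h1".."h6"
                if level < lowest:  # only accept strictly higher-level headers
                    found[header_mapping[sib.tag]] = sib.text.strip()
                    lowest = level
        node = node.parent
    return found


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def chunk_html(html, header_mapping, text_tags=("p",)):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [
        (n.text.strip(), headers_for(n, header_mapping))
        for n in iter_nodes(builder.root)
        if n.tag in text_tags and n.text.strip()
    ]


sample = """
<div>
  <h1>Foo</h1>
  <p>Some intro text about Foo.</p>
  <div>
    <h2>Bar main section</h2>
    <p>Some intro text about Bar.</p>
    <h3>Bar subsection 1</h3>
    <p>Some text about the first subtopic of Bar.</p>
  </div>
  <div>
    <h2>Baz</h2>
    <p>Some text about Baz</p>
  </div>
  <br>
  <p>Some concluding text about Foo</p>
</div>
"""

chunks = chunk_html(sample, {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"})
for text, metadata in chunks:
    print(metadata, "->", text)
```

Because the lookup never descends into the children of a preceding sibling, the concluding paragraph picks up only the `h1` and not the `h2` headers nested inside the earlier `div` elements.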

#### 2) Pipelined to another splitter, with HTML loaded from a web URL:

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

url = ["https://plato.stanford.edu/entries/goedel/"]

header_mapping = {
    "h1": "Header 1",
    "h2": "Header 2",
    "h3": "Header 3",
    "h4": "Header 4"
}
html_splitter = HeaderChunkedHTMLLoader(url, header_mapping, return_each_element=True)

html_header_splits = html_splitter.load()

chunk_size = 1000
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)


splits = text_splitter.split_documents(html_header_splits)
for chunk in splits:
    print("metadata\n")
    print(chunk.metadata)
    print('\n\n')
    print(chunk.page_content)
    print('\n')

metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Stanford Encyclopedia of Philosophy


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Menu


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Browse


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Table of Contents
                    What's New
                    Random Entry
                    Chronological
                    Archives


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



About


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Editorial Information
                    About the SEP
                    Editorial Board
                    How to Cite the SEP
                    Special Characters
                    Advanced Tools
                    Contact


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Support SEP


metadata

{'url': 'https://plato.stanford.edu/entries/goedel/'}



Suppo

## Limitations

HTML documents vary considerably in structure, and while `HeaderChunkedHTMLLoader` attempts to attach all "relevant" headers to any given chunk, it can miss some. In particular, the algorithm assumes an informational hierarchy in which headers always sit at nodes "above" their associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the top-level headline is tagged "h1" but sits in a *distinct* subtree from the text elements that we'd expect it to be *"above"*. As a result, the "h1" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see "h2" and its associated text):


In [3]:
url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"

header_mapping = {
    "h1": "Header 1",
    "h2": "Header 2",
    "h3": "Header 3",
}

html_splitter = HeaderChunkedHTMLLoader([url], header_mapping)
html_header_splits = html_splitter.load()
html_header_splits

[Document(page_content='', metadata={'url': 'https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html', 'Header 3': 'CNN values your feedback'}),
 Document(page_content="1. How relevant is this ad to you?\n\n2. Did you encounter any technical issues?\n\nVideo player was slow to load content\n                                                                        \n                                                                        \n                                                                \n                                                                \n                                                                        Video content never loaded\n                                                                        \n                                                                        \n                                                                \n                                                                \n                            