# HTMLHeaderTextSplitter

HTMLHeaderTextSplitter is a specialized text splitter that divides HTML documents based on header tags (h1, h2, h3, etc.), making it ideal for processing structured web content while preserving the document's hierarchical organization.

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and attaches, as metadata, every header that is "relevant" to a given chunk. It can return chunks element by element or combine elements that share the same metadata. The goals are (a) to keep related text grouped in a roughly semantic way and (b) to preserve the context-rich information encoded in the document's structure. It can be used with other text splitters as part of a chunking pipeline.
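To make the idea of header metadata concrete, here is a minimal sketch of header-aware chunking built only on the standard library's `html.parser`. It is not the LangChain implementation: the class `HeaderChunker` and the function `split_by_headers` are hypothetical names invented for this illustration, and unlike the real splitter the sketch never emits header text as a chunk of its own.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker (illustration only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.levels = {tag: int(tag[1]) for tag in self.labels}
        self.active = {}       # header level -> (label, header text)
        self.in_header = None  # tag of the header currently being read
        self.header_text = []
        self.buffer = []       # body text collected since the last header
        self.chunks = []

    def _flush(self):
        # Emit the buffered body text with the headers currently in scope.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            metadata = {
                label: title for _, (label, title) in sorted(self.active.items())
            }
            self.chunks.append({"metadata": metadata, "page_content": text})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self._flush()
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            level = self.levels[tag]
            # A new header replaces any header at the same or a deeper level.
            self.active = {lvl: v for lvl, v in self.active.items() if lvl < level}
            title = " ".join("".join(self.header_text).split())
            self.active[level] = (self.labels[tag], title)
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            self.header_text.append(data)
        else:
            self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def split_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<h1>Foo</h1><p>Intro.</p>"
    "<h2>Bar</h2><p>About Bar.</p>"
    "<h2>Baz</h2><p>About Baz.</p>"
)
chunks = split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for chunk in chunks:
    print(chunk)
```

The key design point is the `active` mapping: when a header closes, every header at the same or a deeper level is dropped, so each chunk carries only the headers that actually enclose it.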


- HTML Splitting: Split an HTML string on its header tags for hierarchical document processing.

In [1]:
# Split HTML content based on header tags

from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits


[Document(metadata={}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Baz'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}, page_content='Some text about Baz'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some concluding text about Foo')]
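Header-level chunks can still be too long for a downstream model, which is where a second splitter in the pipeline comes in. The following is a minimal standard-library sketch of such a second stage, assuming chunks are represented as plain dicts with `metadata` and `page_content` keys. The helper `split_chunk` is hypothetical; in practice a library splitter such as `RecursiveCharacterTextSplitter` would likely take this role.

```python
def split_chunk(chunk, max_chars):
    """Split one header-level chunk on whitespace into pieces of at most
    max_chars characters (a single longer word is kept whole), copying the
    header metadata onto every piece."""
    pieces, current = [], ""
    for word in chunk["page_content"].split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return [
        {"metadata": dict(chunk["metadata"]), "page_content": piece}
        for piece in pieces
    ]


# Stand-in for one header-level chunk (plain dict instead of a Document).
header_chunks = [
    {
        "metadata": {"Header 1": "Foo", "Header 2": "Bar main section"},
        "page_content": "Some text about the first subtopic of Bar.",
    }
]

sub_chunks = [
    piece for chunk in header_chunks for piece in split_chunk(chunk, max_chars=20)
]
for piece in sub_chunks:
    print(piece)
```

Because the metadata is copied onto every piece, each smaller chunk still records which headers it came from, which is the main reason to split on headers first and on size second.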

- HTML URL Splitting: Fetch a web page from a URL and split its content on the header hierarchy.

In [2]:
# Split HTML content from URL using header tags

url = "https://python.langchain.com/docs/tutorials/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

[Document(metadata={}, page_content='Skip to main content  \nIntegrationsAPI Reference  \nMore  \nContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TS  \n💬  \nv0.3  \nv0.3v0.2v0.1  \nSearch  \nIntroductionSecurity Policy  \nTutorials  \nBuild a Question Answering application over a Graph DatabaseTutorialsBuild a simple LLM application with chat models and prompt templatesBuild a ChatbotBuild a Retrieval Augmented Generation (RAG) App: Part 2Build an Extraction ChainBuild an AgentTaggingBuild a Retrieval Augmented Generation (RAG) App: Part 1Build a semantic search engineBuild a Question/Answering system over SQL dataSummarize Text  \nHow-to guides  \nHow-to guidesHow to use tools in a chainHow to use a vectorstore as a retrieverHow to add memory to chatbotsHow to use example selectorsHow to add a semantic layer over graph databaseHow to invoke runnables in parallelHow to stream chat model responsesHow to add default invocation args to a RunnableHow to add ret

- HTML URL Processing: Split web content from a URL on its headers and save the resulting chunks as JSON.

In [4]:
# Process HTML content from URL and save as JSON

import json
from langchain_text_splitters import HTMLHeaderTextSplitter

# Define the URL and the headers to split on
url = "https://python.langchain.com/docs/tutorials/"
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

# Create a header-based splitter with HTMLHeaderTextSplitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

try:
    # Fetch the content from the URL and split it
    html_header_splits = html_splitter.split_text_from_url(url)

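    # Convert the Document objects into a JSON-friendly structure, labelling each
    # chunk with its deepest header ("Unknown" when the chunk has no header metadata)
    header_names = [name for _, name in headers_to_split_on]
    json_ready_data = [
        {
            "Header": next((doc.metadata[n] for n in reversed(header_names) if n in doc.metadata), "Unknown"),
            "Content": doc.page_content,
        }
        for doc in html_header_splits
    ]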

    # Save the data as JSON
    output_json = "html_header_splits.json"
    with open(output_json, mode="w", encoding="utf-8") as file:
        json.dump(json_ready_data, file, ensure_ascii=False, indent=4)

    print(f"Data successfully saved as JSON: {output_json}")

except Exception as e:
    # Report the error
    print(f"An error occurred: {e}")


Data successfully saved as JSON: html_header_splits.json
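A single "Header" field flattens the hierarchy, so it may be useful to store the full header path instead. Below is a hedged sketch using plain dicts as stand-ins for `Document` objects. The function `to_json_records` and the `HeaderPath` key are hypothetical names chosen for this example; with real documents the same logic would read `doc.metadata` and `doc.page_content`.

```python
import json


def to_json_records(chunks, header_names):
    """Build JSON-friendly records that keep the full header path,
    ordered from the outermost header to the innermost one."""
    records = []
    for chunk in chunks:
        path = [
            chunk["metadata"][name]
            for name in header_names
            if name in chunk["metadata"]
        ]
        records.append({"HeaderPath": path, "Content": chunk["page_content"]})
    return records


# Stand-ins for two chunks from the first example (plain dicts, not Documents).
chunks = [
    {"metadata": {}, "page_content": "Foo"},
    {
        "metadata": {"Header 1": "Foo", "Header 2": "Baz"},
        "page_content": "Some text about Baz",
    },
]

records = to_json_records(chunks, ["Header 1", "Header 2", "Header 3"])
print(json.dumps(records, indent=4, ensure_ascii=False))
```

Chunks without any header metadata get an empty path rather than a placeholder string, which keeps the field type consistent for later filtering.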


In [5]:
output_json

'html_header_splits.json'

- JSON Reading: Read the saved JSON file and display its content with indentation.

In [6]:
# Read and display JSON data with JSON formatting

with open("html_header_splits.json", mode="r", encoding="utf-8") as file:
    data = json.load(file)
    print(json.dumps(data, indent=4, ensure_ascii=False))

[
    {
        "Header": "Unknown",
        "Content": "Skip to main content  \nIntegrationsAPI Reference  \nMore  \nContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TS  \n💬  \nv0.3  \nv0.3v0.2v0.1  \nSearch  \nIntroductionSecurity Policy  \nTutorials  \nBuild a Question Answering application over a Graph DatabaseTutorialsBuild a simple LLM application with chat models and prompt templatesBuild a ChatbotBuild a Retrieval Augmented Generation (RAG) App: Part 2Build an Extraction ChainBuild an AgentTaggingBuild a Retrieval Augmented Generation (RAG) App: Part 1Build a semantic search engineBuild a Question/Answering system over SQL dataSummarize Text  \nHow-to guides  \nHow-to guidesHow to use tools in a chainHow to use a vectorstore as a retrieverHow to add memory to chatbotsHow to use example selectorsHow to add a semantic layer over graph databaseHow to invoke runnables in parallelHow to stream chat model responsesHow to add default invocation args to a Ru
