**How to Split Text by HTML Header**

HTMLHeaderTextSplitter is a structure-aware chunker: it splits text at the HTML element level and attaches, as metadata, every header that applies to a given chunk.

It can return chunks element by element or combine elements that share the same metadata (controlled by the `return_each_element` parameter). The objectives are:

(a) to keep related text grouped (more or less) semantically, and

(b) to preserve the context-rich information encoded in the document's structure.
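To make the idea of header metadata concrete, here is a minimal sketch built only on Python's standard `html.parser`. It is not LangChain's implementation; the `HeaderChunker` class, the `split_by_headers` helper, and the dictionary-based chunk format are hypothetical names chosen for illustration. It emits one chunk per `<p>` element and tags it with the headers currently in scope.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-name mapping, mirroring the (tag, name) pairs
# that HTMLHeaderTextSplitter accepts.
HEADERS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}


class HeaderChunker(HTMLParser):
    """Toy chunker: one chunk per <p>, tagged with the headers above it."""

    def __init__(self):
        super().__init__()
        self.active = {}     # header tag -> header text currently in scope
        self.current = None  # tag whose text is being collected, if any
        self.buffer = []
        self.chunks = []     # list of {"metadata": ..., "text": ...}

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS or tag == "p":
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        text = "".join(self.buffer).strip()
        if tag in HEADERS:
            # A new header replaces any header of the same or deeper level.
            # "h1" < "h2" < "h3" holds under plain string comparison.
            for stale in [t for t in self.active if t >= tag]:
                del self.active[stale]
            self.active[tag] = text
        else:
            metadata = {HEADERS[t]: self.active[t] for t in sorted(self.active)}
            self.chunks.append({"metadata": metadata, "text": text})
        self.current = None


def split_by_headers(html):
    parser = HeaderChunker()
    parser.feed(html)
    return parser.chunks


sample = (
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
for chunk in split_by_headers(sample):
    print(chunk["metadata"], "->", chunk["text"])
```

The sketch ignores nesting, text outside `<p>` elements, and malformed markup, all of which a production splitter has to handle.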

It can be used with other text splitters as part of a chunking pipeline.
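As a rough illustration of what a chunking pipeline means here, the sketch below adds a second, size-based stage that caps chunk length while carrying the header metadata along. The `split_further` function and the dictionary chunk format are hypothetical stand-ins; in LangChain this stage is typically done by passing the header splits to another splitter, for example `RecursiveCharacterTextSplitter.split_documents`.

```python
import textwrap


def split_further(chunks, max_chars=40):
    """Second pipeline stage: cap chunk size while keeping header metadata.

    `chunks` is a list of {"metadata": dict, "text": str} items, a
    hypothetical stand-in for the Document objects a header splitter returns.
    """
    result = []
    for chunk in chunks:
        # textwrap.wrap breaks on whitespace (and splits over-long words),
        # so each piece is at most max_chars characters long.
        for piece in textwrap.wrap(chunk["text"], width=max_chars):
            result.append({"metadata": dict(chunk["metadata"]), "text": piece})
    return result


header_chunks = [
    {
        "metadata": {"Header 1": "Welcome to My Page", "Header 2": "Section 1"},
        "text": "This is the first section of my page. "
                "It contains some basic information.",
    }
]
for piece in split_further(header_chunks):
    print(piece)
```

Every piece inherits a copy of its parent's metadata, so the header context survives the size-based split.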

In [1]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string="""
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to My Page</h1>
    <p>This is a simple paragraph under the main heading.</p>
    
    <h2>Section 1</h2>
    <p>This is the first section of my page. It contains some basic information.</p>
    <div>
    <h2>Section 2</h2>
    <p>This is the second section with more details. Here we can add various content.</p>
    </div>
    <h3>Subsection 2.1</h3>
    <p>This is a subsection with even more specific information.</p>
</body>
</html>

"""

In [2]:
# Each tuple maps an HTML header tag to the metadata key stored on the chunks.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)

html_header_splits

[Document(metadata={'Header 1': 'Welcome to My Page'}, page_content='Welcome to My Page'),
 Document(metadata={'Header 1': 'Welcome to My Page'}, page_content='This is a simple paragraph under the main heading.'),
 Document(metadata={'Header 1': 'Welcome to My Page', 'Header 2': 'Section 1'}, page_content='Section 1'),
 Document(metadata={'Header 1': 'Welcome to My Page', 'Header 2': 'Section 1'}, page_content='This is the first section of my page. It contains some basic information.'),
 Document(metadata={'Header 1': 'Welcome to My Page', 'Header 2': 'Section 2'}, page_content='Section 2'),
 Document(metadata={'Header 1': 'Welcome to My Page', 'Header 2': 'Section 2'}, page_content='This is the second section with more details. Here we can add various content.'),
 Document(metadata={'Header 1': 'Welcome to My Page', 'Header 3': 'Subsection 2.1'}, page_content='Subsection 2.1'),
 Document(metadata={'Header 1': 'Welcome to My Page', 'Header 3': 'Subsection 2.1'}, page_content='This is a

In [3]:
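from langchain_text_splitters import HTMLHeaderTextSplitter

url = "https://www.tpointtech.com/spring-boot-starters"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Fetch the page over HTTP and split it by header (requires network access).
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits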

[Document(metadata={}, page_content='Tutorials  \n×'),
 Document(metadata={'Header 3': 'Python'}, page_content='Python'),
 Document(metadata={'Header 3': 'Python'}, page_content='Python  \nDjango  \nNumpy  \nPandas  \nTkinter  \nPytorch  \nFlask  \nOpenCV'),
 Document(metadata={'Header 3': 'AI, ML and Data Science'}, page_content='AI, ML and Data Science'),
 Document(metadata={'Header 3': 'AI, ML and Data Science'}, page_content='Artificial Intelligence  \nMachine Learning  \nData Science  \nDeep Learning  \nTensorFlow  \nArtificial Neural Network  \nMatplotlib  \nPython Scipy'),
 Document(metadata={'Header 3': 'Java'}, page_content='Java'),
 Document(metadata={'Header 3': 'Java'}, page_content='Java  \nServlet  \nJSP  \nSpring Boot  \nSpring Framework  \nHibernate  \nJavaFX  \nJava Web Services'),
 Document(metadata={'Header 3': 'B.Tech and MCA'}, page_content='B.Tech and MCA'),
 Document(metadata={'Header 3': 'B.Tech and MCA'}, page_content='DBMS  \nData Structures  \nOperating Syste