You're talking about a very useful text splitter in LangChain! The `HTMLHeaderTextSplitter` is designed specifically for splitting HTML content in a way that preserves the structure and hierarchy of the information.

Here's how it works:

**1. Structure-aware splitting:** Unlike basic text splitters that simply look for separators like newlines, the `HTMLHeaderTextSplitter` understands the structure of HTML documents. It uses HTML headers (`<h1>`, `<h2>`, `<h3>`, etc.) as the primary splitting points.

**2. Metadata generation:**  It not only splits the text but also generates metadata for each chunk based on the headers. This metadata captures the hierarchical context of the chunk within the HTML document. For example, a chunk under an `<h3>` tag might have metadata indicating the text of the `<h3>` as well as the `<h2>` and `<h1>` it falls under.

**3. Flexibility:** You can configure the splitter to:

   * **Specify headers:** Choose which header levels (`<h1>`, `<h2>`, etc.) to use as splitting points.
   * **Combine elements:** Control, via the `return_each_element` parameter, whether to return each HTML element as a separate chunk or combine elements under the same header into a single chunk (the default).
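
To make the header-driven splitting and metadata accumulation described above concrete, here is a minimal sketch built only on Python's standard-library `html.parser`. It is a toy illustration of the idea, not LangChain's implementation; the class name `HeaderChunker` and the dictionary output format are assumptions made for this example.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware splitter (illustrative only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata label, e.g. {"h1": "Header 1"}
        self.labels = dict(headers_to_split_on)
        self.active = {}        # tag -> header text currently in scope
        self.in_header = None   # split header currently being read, if any
        self.header_text = []   # text pieces of that header
        self.buffer = []        # text pieces of the current chunk
        self.chunks = []        # finished chunks

    def _flush(self):
        # Collapse whitespace and emit the chunk with the headers in scope.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            metadata = {self.labels[t]: h for t, h in sorted(self.active.items())}
            self.chunks.append({"page_content": text, "metadata": metadata})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self._flush()  # a new split header closes the previous chunk
            # Keep only headers above the new one ("h1" < "h2" as strings).
            self.active = {t: h for t, h in self.active.items() if t < tag}
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.active[tag] = " ".join("".join(self.header_text).split())
            self.in_header = None

    def handle_data(self, data):
        # Header text becomes metadata; everything else becomes content.
        (self.header_text if self.in_header else self.buffer).append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever is left after the last header


sample_html = """
<h1>My Website</h1>
<p>Some introductory text.</p>
<h2>About Us</h2>
<p>Information about our company.</p>
<h3>Our Team</h3>
<p>Details about our team members.</p>
"""

chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2")])
chunker.feed(sample_html)
chunker.close()
for chunk in chunker.chunks:
    print(chunk["metadata"], "->", chunk["page_content"])
```

Because `<h3>` is not among the configured split headers here, its text stays in the chunk content instead of moving into the metadata, which mirrors the "specify headers" option.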

**Why this is powerful**

* **Preserves context:**  The metadata helps retain the hierarchical relationships between chunks, which is crucial for understanding the flow and organization of information in HTML documents.
* **Improved relevance:** When used with LLMs, the context-rich metadata can lead to more relevant and accurate responses, as the LLM has a better understanding of where the information comes from within the HTML structure.
* **Ideal for web scraping:** Perfect for processing web pages where the headers often indicate the topic or section of the content.
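
One way the metadata can carry context to an LLM is to prepend the header trail to each chunk before embedding or prompting. The sketch below assumes the metadata keys sort in hierarchy order (as labels like "Header 1", "Header 2" do); the helper name `with_header_context` is hypothetical and not part of LangChain.

```python
def with_header_context(page_content, metadata, separator=" > "):
    """Prefix a chunk with its header trail so the text carries its own context."""
    # Sorting the keys puts "Header 1" before "Header 2", and so on.
    trail = separator.join(metadata[key] for key in sorted(metadata))
    return f"{trail}\n\n{page_content}" if trail else page_content


text = with_header_context(
    "Information about our company.",
    {"Header 1": "My Website", "Header 2": "About Us"},
)
print(text)
```

A chunk with empty metadata is returned unchanged, so the helper can be applied to every chunk without special cases.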

**Example**

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

# Sample HTML content
html_content = """
<h1>My Website</h1>
<p>Some introductory text.</p>
<h2>About Us</h2>
<p>Information about our company.</p>
<h3>Our Team</h3>
<p>Details about our team members.</p>
"""

splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h2", "Header 2")])
docs = splitter.split_text(html_content)

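# 'docs' is a list of Document objects. Text before the first <h2> comes back
# as its own chunk; the chunk under the <h2> looks roughly like this
# (exact whitespace can vary by library version):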
# {
#   'page_content': 'Information about our company.\nOur Team\nDetails about our team members.',
#   'metadata': {'Header 2': 'About Us'}
# }
```

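**Important Note:** The `HTMLHeaderTextSplitter` depends on an HTML parsing package that is not always installed by default. Older releases require `lxml` (`pip install lxml`), while more recent releases appear to parse with BeautifulSoup instead (`pip install beautifulsoup4`). If the import or the first call fails with a missing-dependency error, install the package named in the message.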
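
A quick way to see which parsing packages are present before running the splitter is sketched below. The candidate list (`lxml`, and `bs4`, the import name of `beautifulsoup4`) is an assumption about what your installed version needs; the helper `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(candidates=("lxml", "bs4")):
    """Return the import names from `candidates` that are not installed."""
    return [name for name in candidates if importlib.util.find_spec(name) is None]


print("Missing:", missing_packages())
```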

If you have any more questions about how to use the `HTMLHeaderTextSplitter` with your specific HTML content or want to explore its advanced features, feel free to ask! I'm here to help you make the most of it.

In [1]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<head>
  <title>HTML Header Text Splitter Example</title>
</head>
<body>

  <h1>LangChain Text Splitter Tutorial</h1>
  <p>This is a sample page to demonstrate the HTMLHeaderTextSplitter.</p>

  <h2>What is LangChain?</h2>
  <p>LangChain is a framework for developing applications powered by language models.</p>

  <h2>Key Concepts</h2>
  <p>Let's explore some important concepts in LangChain:</p>

  <h3>1.  Document Loaders</h3>
  <p>Load data from various sources (web pages, files, etc.).</p>

  <h3>2. Text Splitters</h3>
  <p>Divide text into smaller chunks for processing.</p>
    <h4>Types of Text Splitters</h4>
    <p>There are different types of splitters like <code>CharacterTextSplitter</code> and <code>HTMLHeaderTextSplitter</code>.</p>

  <h3>3. Language Models</h3>
  <p>Use LLMs like GPT-3 to understand and generate text.</p>

  <h2>Getting Started</h2>
  <p>Follow these steps to start using LangChain:</p>
  <ol>
    <li>Install the necessary packages.</li>
    <li>Import the required modules.</li>
    <li>Load your data.</li>
    <li>Split the text into chunks.</li>
    <li>Use a language model to process the chunks.</li>
  </ol>

</body>
</html> 
"""

In [5]:
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splitter = html_splitter.split_text(html_string)

html_header_splitter

[Document(metadata={'Header 1': 'LangChain Text Splitter Tutorial'}, page_content='This is a sample page to demonstrate the HTMLHeaderTextSplitter.'),
 Document(metadata={'Header 1': 'LangChain Text Splitter Tutorial', 'Header 2': 'What is LangChain?'}, page_content='LangChain is a framework for developing applications powered by language models.'),
 Document(metadata={'Header 1': 'LangChain Text Splitter Tutorial', 'Header 2': 'Key Concepts'}, page_content="Let's explore some important concepts in LangChain:"),
 Document(metadata={'Header 1': 'LangChain Text Splitter Tutorial', 'Header 2': 'Key Concepts', 'Header 3': '1. Document Loaders'}, page_content='Load data from various sources (web pages, files, etc.).'),
 Document(metadata={'Header 1': 'LangChain Text Splitter Tutorial', 'Header 2': 'Key Concepts', 'Header 3': '2. Text Splitters'}, page_content='Divide text into smaller chunks for processing.  \nThere are different types of splitters like CharacterTextSplitter and HTMLHeaderTextSplitter

In [7]:
for split in html_header_splitter:
    print(f"page content: {split.page_content}")

page content: This is a sample page to demonstrate the HTMLHeaderTextSplitter.
page content: LangChain is a framework for developing applications powered by language models.
page content: Let's explore some important concepts in LangChain:
page content: Load data from various sources (web pages, files, etc.).
page content: Divide text into smaller chunks for processing.  
There are different types of splitters like CharacterTextSplitter and HTMLHeaderTextSplitter.
page content: Use LLMs like GPT-3 to understand and generate text.
page content: Follow these steps to start using LangChain:  
Install the necessary packages. Import the required modules. Load your data. Split the text into chunks. Use a language model to process the chunks.
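
The loop above prints only `page_content`. The metadata is what lets you regroup chunks by section, as in this small sketch. It uses plain `(metadata, page_content)` pairs as stand-ins for the `Document` objects, with values copied from the output above; the short labels `"H1"`, `"H2"`, `"H3"` are arbitrary choices for this example (the splitter uses whatever label you pair with each tag), and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

TITLE = "LangChain Text Splitter Tutorial"

# Stand-ins for the first four Document objects shown above.
chunks = [
    ({"H1": TITLE},
     "This is a sample page to demonstrate the HTMLHeaderTextSplitter."),
    ({"H1": TITLE, "H2": "What is LangChain?"},
     "LangChain is a framework for developing applications powered by language models."),
    ({"H1": TITLE, "H2": "Key Concepts"},
     "Let's explore some important concepts in LangChain:"),
    ({"H1": TITLE, "H2": "Key Concepts", "H3": "1. Document Loaders"},
     "Load data from various sources (web pages, files, etc.)."),
]


def group_by_label(chunks, label):
    """Group chunk texts by the value stored under one metadata label."""
    groups = defaultdict(list)
    for metadata, content in chunks:
        # Chunks that sit above this header level have no value for it.
        groups[metadata.get(label, "(none)")].append(content)
    return dict(groups)


sections = group_by_label(chunks, "H2")
for title, texts in sections.items():
    print(f"{title}: {len(texts)} chunk(s)")
```

With real `Document` objects the same grouping would read `doc.metadata` and `doc.page_content` instead of unpacking a tuple.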


In [10]:
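# Note: "entires" in this URL is a misspelling of "entries", so the server
# responds with its "Document Not Found" page, which is what gets split below.
url = "https://plato.stanford.edu/entires/goedel"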

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
    ("h5", "Header 5"),
    ("h6", "Header 6"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splitter = html_splitter.split_text_from_url(url)

html_header_splitter

[Document(metadata={}, page_content="Stanford Encyclopedia of Philosophy  \nMenu  \nBrowse About Support SEP  \nTable of Contents What's New Random Entry Chronological Archives  \nEditorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  \nSupport the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  \nDocument Not Found"),
 Document(metadata={'Header 1': 'Document Not Found'}, page_content="We are sorry but the document you are looking for doesn't exist on our server.  \nPlease update any bookmark that led you to this page, or inform the webmaster of sites with bad links leading to this page.  \nTo find what you were looking for, you can use the links below to search or browse the SEP.  \nSearch Search Tips  \nBrowse"),
 Document(metadata={'Header 1': 'Document Not Found', 'Header 2': 'Browse'}, page_content="Table of Contents  \nWhat's New Archives Random Entry"),
 Document(metadata={}, page_content='Browse'),
 Documen

In [11]:
for split in html_header_splitter:
    print(f"page content: {split.page_content}")

page content: Stanford Encyclopedia of Philosophy  
Menu  
Browse About Support SEP  
Table of Contents What's New Random Entry Chronological Archives  
Editorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  
Support the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  
Document Not Found
page content: We are sorry but the document you are looking for doesn't exist on our server.  
Please update any bookmark that led you to this page, or inform the webmaster of sites with bad links leading to this page.  
To find what you were looking for, you can use the links below to search or browse the SEP.  
Search Search Tips  
Browse
page content: Table of Contents  
What's New Archives Random Entry
page content: Browse
page content: Table of Contents What's New Random Entry Chronological Archives
page content: About
page content: Editorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Ad
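
The chunks printed above come from the site's "Document Not Found" page: in this run the error page was fetched and split like any other HTML. A minimal sketch of fetching the page yourself, so that an HTTP error stops the run before anything is split, could look like the following. It uses only the standard-library `urllib`; `fetch_html` is a hypothetical helper, and the URL in the commented usage is an assumption about the intended address.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or raise RuntimeError if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:
        # HTTPError (for example a 404) is a subclass of URLError.
        raise RuntimeError(f"Could not fetch {url}: {exc}") from exc


# Usage (needs network access, so it is left commented out):
# html = fetch_html("https://plato.stanford.edu/entries/goedel/")
# docs = html_splitter.split_text(html)
```

Passing the fetched string to `split_text` keeps the download and the splitting as separate steps, which makes a bad URL easy to spot.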