# 🌐 HTMLHeaderTextSplitter - Structure-Aware HTML Splitting

## What is HTMLHeaderTextSplitter?

A **"structure-aware"** splitter that understands HTML hierarchy. It splits at HTML header elements (`<h1>`, `<h2>`, etc.) and preserves the document structure in metadata.

## Why Use It? 🤔

When you scrape web pages, you get HTML with structure (headings, sections). This splitter:

1. **Preserves Hierarchy** - Knows that content under `<h2>` belongs to the previous `<h1>`
2. **Adds Context** - Each chunk's metadata includes its header chain
3. **Groups by Section** - Keeps the text that sits under a header together in one chunk

## Visual Example:

```html
<h1>Machine Learning</h1>           ← Header 1
  <p>ML is a subset of AI...</p>
  <h2>Supervised Learning</h2>      ← Header 2 (under Header 1)
    <p>Uses labeled data...</p>
    <h3>Classification</h3>         ← Header 3 (under Header 2)
      <p>Predicts categories...</p>
```

**Result**: The "Classification" chunk includes metadata:
```python
{"Header 1": "Machine Learning", "Header 2": "Supervised Learning", "Header 3": "Classification"}
```
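
To make the header chain concrete, here is a minimal standard-library sketch of the idea. It is an approximation for illustration, not the code `HTMLHeaderTextSplitter` runs: the names `HeaderChainParser`, `header_chains`, and `HEADER_KEYS` are invented here, and it only handles `<p>` text under `<h1>` to `<h3>`.

```python
from html.parser import HTMLParser

# Illustrative only: a stdlib approximation of header-chain tracking,
# not the implementation used by HTMLHeaderTextSplitter.
HEADER_KEYS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3"}


class HeaderChainParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chain = {}           # header tag -> header text currently in scope
        self.open_header = None   # header tag being read, if any
        self.in_paragraph = False
        self.chunks = []          # (paragraph text, metadata dict) pairs

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_KEYS:
            # A new header ends every header at the same or a deeper level.
            self.chain = {t: v for t, v in self.chain.items() if t < tag}
            self.open_header = tag
        elif tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_header:
            self.chain[self.open_header] = text
        elif self.in_paragraph:
            metadata = {HEADER_KEYS[t]: v for t, v in sorted(self.chain.items())}
            self.chunks.append((text, metadata))


def header_chains(html):
    """Return (text, metadata) pairs for each paragraph in the HTML."""
    parser = HeaderChainParser()
    parser.feed(html)
    return parser.chunks


sample = """
<h1>Machine Learning</h1>
<p>ML is a subset of AI...</p>
<h2>Supervised Learning</h2>
<p>Uses labeled data...</p>
<h3>Classification</h3>
<p>Predicts categories...</p>
"""

for text, metadata in header_chains(sample):
    print(metadata, "->", text)
```

The key step is in `handle_starttag`: when a new `<h2>` appears, any earlier `<h2>` or `<h3>` drops out of the chain while the enclosing `<h1>` stays.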

## Key Benefits:
- 🎯 **Better retrieval** - Search finds content with full context
- 📊 **Structured metadata** - Know exactly where content came from
- 🔗 **Preserves relationships** - Parent-child header relationships maintained

---

## 1️⃣ Basic HTML Splitting

Let's split a simple HTML document by its headers.

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

# Sample HTML document with nested headers
html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

# Define which headers to split on and their names in metadata
headers_to_split_on = [
    ("h1", "Header 1"),    # Split on <h1>, store as "Header 1"
    ("h2", "Header 2"),    # Split on <h2>, store as "Header 2"
    ("h3", "Header 3"),    # Split on <h3>, store as "Header 3"
]

# Create the splitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split the HTML
html_header_splits = html_splitter.split_text(html_string)

print(f"📊 Created {len(html_header_splits)} chunks\n")

[Document(page_content='Foo'),
 Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]
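
Because the header chain is plain dictionary metadata, picking out one section is a dictionary comparison. A minimal sketch follows; `Chunk` is a hypothetical stand-in for LangChain's `Document` so the example runs without the library, and `in_section` is a helper name invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def in_section(chunks, wanted):
    """Keep chunks whose metadata matches every key/value pair in `wanted`."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in wanted.items())
    ]


# Three of the chunks from the output above, rebuilt by hand.
chunks = [
    Chunk("Some intro text about Bar.",
          {"Header 1": "Foo", "Header 2": "Bar main section"}),
    Chunk("Some text about the first subtopic of Bar.",
          {"Header 1": "Foo", "Header 2": "Bar main section", "Header 3": "Bar subsection 1"}),
    Chunk("Some text about Baz",
          {"Header 1": "Foo", "Header 2": "Baz"}),
]

bar_chunks = in_section(chunks, {"Header 2": "Bar main section"})
for chunk in bar_chunks:
    print(chunk.page_content)
```

The same comparison works on real `Document` objects, since they expose `page_content` and `metadata` in the same way.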

In [None]:
# Examine each chunk and its metadata
for i, chunk in enumerate(html_header_splits):
    print(f"{'='*60}")
    print(f"📄 CHUNK {i+1}")
    print(f"{'='*60}")
    print(f"📋 Metadata (headers): {chunk.metadata}")
    print(f"📝 Content: {chunk.page_content}")
    print()

---

## 2️⃣ Splitting Live Web Pages

You can fetch and split a page in one call with `split_text_from_url()`.

In [None]:
# Split HTML directly from a URL
url = "https://plato.stanford.edu/entries/goedel/"

# Define headers to split on (include h4 for more granularity)
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

# Create splitter and fetch + split in one step
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)

print(f"🌐 URL: {url}")
print(f"📊 Created {len(html_header_splits)} chunks from web page")

[Document(page_content="Stanford Encyclopedia of Philosophy  \nMenu  \nBrowse About Support SEP  \nTable of Contents What's New Random Entry Chronological Archives  \nEditorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  \nSupport the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  \nEntry Navigation  \nEntry Contents Bibliography Academic Tools Friends PDF Preview Author and Citation Info Back to Top  \nKurt Gödel"),
 Document(page_content='First published Tue Feb 13, 2007; substantive revision Fri Dec 11, 2015  \nKurt Friedrich Gödel (b. 1906, d. 1978) was one of the principal founders of the modern, metamathematical era in mathematical logic. He is widely known for his Incompleteness Theorems, which are among the handful of landmark theorems in twentieth century mathematics, but his work touched every field of mathematical logic, if it was not in most cases their original stimulus. In his philosophical work

In [None]:
# Examine first few chunks from the web page
for i, chunk in enumerate(html_header_splits[:5]):
    print(f"\n{'='*60}")
    print(f"📄 CHUNK {i+1}")
    print(f"{'='*60}")
    print(f"📋 Headers: {chunk.metadata}")
    print(f"📝 Content preview: {chunk.page_content[:200]}...")

---

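## 3️⃣ Combining with Other Splitters

A single section can still be longer than your target chunk size. Pass the header-based chunks through `RecursiveCharacterTextSplitter` to enforce a size limit while keeping the header metadata.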

In [None]:
# Pipeline: HTML split → then character split for size control
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Split by HTML headers first
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_chunks = html_splitter.split_text(html_string)

# Step 2: Further split large chunks by characters
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

# Apply character splitting to HTML chunks
final_chunks = char_splitter.split_documents(html_chunks)

print(f"📊 After HTML split: {len(html_chunks)} chunks")
print(f"📊 After character split: {len(final_chunks)} chunks")
print(f"\n💡 Each chunk now has both size limit AND header metadata!")

In [None]:
# Verify metadata is preserved after secondary split
for i, chunk in enumerate(final_chunks[:4]):
    print(f"Chunk {i+1}: {chunk.metadata} → '{chunk.page_content[:50]}...'")
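
To see what `chunk_size` and `chunk_overlap` mean for the secondary split, here is a deliberately simplified standard-library sketch. It cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so treat it as an illustration of the size and overlap arithmetic only. The function names `split_with_overlap` and `split_chunk` are invented here.

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def split_chunk(page_content, metadata, chunk_size=100, chunk_overlap=20):
    """Split one header-based chunk; every piece gets its own copy of the metadata."""
    pieces = split_with_overlap(page_content, chunk_size, chunk_overlap)
    return [(piece, dict(metadata)) for piece in pieces]


long_section = "x" * 250
metadata = {"Header 1": "Foo", "Header 2": "Bar main section"}

for piece, piece_metadata in split_chunk(long_section, metadata):
    print(len(piece), piece_metadata)
```

Each piece keeps the full header chain of the section it came from, which is the property the cell above checks on the real splitter's output.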

---

## 📝 Key Takeaways

### HTMLHeaderTextSplitter Features:

| Feature | Description |
|---------|-------------|
| **Structure-aware** | Understands HTML hierarchy |
| **Metadata enriched** | Adds header chain to each chunk |
| **Two methods** | `split_text()` and `split_text_from_url()` |
| **Combinable** | Works well with other splitters |

### Headers Configuration:

```python
headers_to_split_on = [
    ("h1", "Header 1"),    # (HTML tag, metadata key name)
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
```
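
Since the configuration is a list of `(tag, key)` pairs, a quick sanity check before splitting can catch typos. The sketch below is a hypothetical helper (`check_headers` is not part of LangChain) and assumes you only want the standard heading tags `h1` to `h6`.

```python
import re

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]


def check_headers(pairs):
    """Return a list of problems found; an empty list means the config looks sane."""
    problems = []
    tags = [tag for tag, _ in pairs]
    keys = [key for _, key in pairs]
    for tag in tags:
        if not re.fullmatch(r"h[1-6]", tag):
            problems.append(f"{tag!r} is not an HTML heading tag (h1-h6)")
    if len(set(tags)) != len(tags):
        problems.append("a tag is listed more than once")
    if len(set(keys)) != len(keys):
        problems.append("two tags share a metadata key, so their values would collide")
    return problems


# The pairs also convert directly into a tag -> metadata key lookup.
tag_to_key = dict(headers_to_split_on)
print(tag_to_key["h2"])
print(check_headers(headers_to_split_on))
```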

### Best Practices 💡

1. ✅ Use for web scraping and documentation
2. ✅ Combine with RecursiveCharacterTextSplitter for size control
3. ✅ Include all relevant header levels (h1-h4)
4. ✅ Leverage metadata for better search/retrieval
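
One possible way to use the metadata for retrieval is to prepend the header chain to each chunk's text before indexing it, so a short paragraph still says which section it belongs to. This is a sketch of that idea under stated assumptions, not a LangChain feature: `with_breadcrumb` and `HEADER_ORDER` are names invented here, and the `" > "` separator is an arbitrary choice.

```python
# Metadata keys in outermost-to-innermost order (matches the config used above).
HEADER_ORDER = ("Header 1", "Header 2", "Header 3", "Header 4")


def with_breadcrumb(page_content, metadata, separator=" > "):
    """Prefix the text with its header chain, e.g. 'Foo > Baz'."""
    trail = separator.join(metadata[key] for key in HEADER_ORDER if key in metadata)
    return f"{trail}\n{page_content}" if trail else page_content


print(with_breadcrumb("Some text about Baz", {"Header 1": "Foo", "Header 2": "Baz"}))
```

Whether this improves search depends on the embedding model and the data, so it is worth comparing against indexing the raw chunk text.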

### Common Pipeline:

```
Web Page → HTMLHeaderTextSplitter → RecursiveCharacterTextSplitter → Vector Store
                (structure)                 (size control)            (storage)
```

### Next Steps 🚀
- Try **RecursiveJsonSplitter** for JSON/API data
- Combine with **WebBaseLoader** for full web scraping pipeline