How to split by HTML header

HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
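
To make the idea of header metadata concrete, here is a minimal standard-library sketch. It is not the library's implementation: `HeaderChunker` and `combine_same_metadata` are hypothetical names, the sketch assumes header tags of the form `h1`..`h6` and treats only `<p>` as a content element, and the grouping it produces need not match the library's output exactly.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware splitter (illustrative only, not the library's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "header 1"}.
        self.header_keys = dict(headers_to_split_on)
        self.active = {}          # header tag -> header text currently in scope
        self.chunks = []          # list of (metadata dict, text) pairs
        self.buffer = []          # text pieces of the element being read
        self.current_tag = None   # header or <p> element currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.header_keys or tag == "p":
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.buffer).split())
        if tag in self.header_keys:
            # A new header replaces same-level and deeper headers in scope.
            level = int(tag[1:])
            self.active = {
                t: v for t, v in self.active.items() if int(t[1:]) < level
            }
            self.active[tag] = text
        if text:
            metadata = {self.header_keys[t]: v for t, v in self.active.items()}
            self.chunks.append((metadata, text))
        self.current_tag = None


def combine_same_metadata(chunks):
    """Merge consecutive chunks whose metadata dicts are equal."""
    merged = []
    for metadata, text in chunks:
        if merged and merged[-1][0] == metadata:
            merged[-1] = (metadata, merged[-1][1] + "\n" + text)
        else:
            merged.append((metadata, text))
    return merged


sample_html = """
<h1>Sample HTML Document</h1>
<p>Intro paragraph.</p>
<h2>Introduction</h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h2>Conclusion</h2>
<p>Closing paragraph.</p>
"""

chunker = HeaderChunker([("h1", "header 1"), ("h2", "header 2")])
chunker.feed(sample_html)
chunker.close()

per_element = chunker.chunks                    # one chunk per element
combined = combine_same_metadata(per_element)   # grouped by shared headers
```

`per_element` corresponds to the "element by element" mode and `combined` to the "combine elements with the same metadata" mode described above; in both, every chunk carries the headers that were in scope when it was read.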

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """<!DOCTYPE html>
<html lang="en">
<body>
    <h1>Sample HTML Document</h1>
    <p>
        This is a demonstration of a long HTML page containing both tables and paragraphs. 
        HTML (HyperText Markup Language) is the backbone of any web page, providing the structure
        and semantics required to render content in web browsers. By combining headings, paragraphs,
        and tables, we can create rich layouts suitable for different types of documents.
    </p>

    <h2>Introduction</h2>
    <p>
        In this example, we explore how paragraphs of text and tables can be used together
        to present information in a structured and meaningful way. Tables allow us to align data 
        in rows and columns, making it easy for readers to compare values. Meanwhile, paragraphs
        provide context and explanation for the data presented.
    </p>

    <h2>Table of Student Grades</h2>
    <p>
        Below is a table that shows the scores of students in different subjects. Each student’s 
        performance is recorded across four subjects, and their average marks are calculated.
    </p>
    <table>
        <tr>
            <th>Student Name</th>
            <th>Math</th>
            <th>Science</th>
            <th>History</th>
            <th>English</th>
            <th>Average</th>
        </tr>
        <tr>
            <td>Alice Johnson</td>
            <td>88</td>
            <td>92</td>
            <td>85</td>
            <td>90</td>
            <td class="highlight">88.75</td>
        </tr>
        <tr>
            <td>Michael Smith</td>
            <td>75</td>
            <td>80</td>
            <td>70</td>
            <td>82</td>
            <td class="highlight">76.75</td>
        </tr>
        <tr>
            <td>Sophia Brown</td>
            <td>95</td>
            <td>98</td>
            <td>92</td>
            <td>94</td>
            <td class="highlight">94.75</td>
        </tr>
        <tr>
            <td>David Lee</td>
            <td>68</td>
            <td>72</td>
            <td>65</td>
            <td>70</td>
            <td class="highlight">68.75</td>
        </tr>
    </table>

    <h2>Company Employees</h2>
    <p>
        Another example of a table can represent company employee records. Information such as 
        employee ID, department, salary, and role can be organized neatly using a tabular format. 
        This allows quick access to critical data during meetings or audits.
    </p>
    <table>
        <tr>
            <th>Employee ID</th>
            <th>Name</th>
            <th>Department</th>
            <th>Role</th>
            <th>Salary</th>
        </tr>
        <tr>
            <td>EMP001</td>
            <td>John Carter</td>
            <td>IT</td>
            <td>Software Engineer</td>
            <td>$75,000</td>
        </tr>
        <tr>
            <td>EMP002</td>
            <td>Emma Davis</td>
            <td>HR</td>
            <td>HR Manager</td>
            <td>$68,500</td>
        </tr>
        <tr>
            <td>EMP003</td>
            <td>Robert Wilson</td>
            <td>Finance</td>
            <td>Accountant</td>
            <td>$59,000</td>
        </tr>
        <tr>
            <td>EMP004</td>
            <td>Linda Johnson</td>
            <td>Marketing</td>
            <td>Marketing Lead</td>
            <td>$72,300</td>
        </tr>
    </table>

    <h2>Conclusion</h2>
    <p>
        As demonstrated, combining textual paragraphs with tables creates a balanced document that 
        not only explains ideas but also supports them with structured data. Such design is commonly 
        seen in research papers, reports, dashboards, and web content.
    </p>
    <p>
        This HTML page shows how versatile and powerful simple HTML elements can be when 
        organized properly. In real-world applications, CSS and JavaScript are often added to 
        enhance visual appeal and interactivity.
    </p>
</body>
</html>
"""

# Each tuple pairs a tag to split on with the metadata key used for its text.
headers_to_split_on = [
    ("h1", "header 1"),
    ("h2", "header 2"),
    ("h3", "header 3"),
    ("p", "paragraph"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

In [10]:
headers_to_split_on = [("h1", "header 1"), ("h2", "header 2"), ("h3", "header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'header 1': 'Sample HTML Document'}, page_content='Sample HTML Document'),
 Document(metadata={'header 1': 'Sample HTML Document'}, page_content='This is a demonstration of a long HTML page containing both tables and paragraphs. \n        HTML (HyperText Markup Language) is the backbone of any web page, providing the structure\n        and semantics required to render content in web browsers. By combining headings, paragraphs,\n        and tables, we can create rich layouts suitable for different types of documents.'),
 Document(metadata={'header 1': 'Sample HTML Document', 'header 2': 'Introduction'}, page_content='Introduction'),
 Document(metadata={'header 1': 'Sample HTML Document', 'header 2': 'Introduction'}, page_content='In this example, we explore how paragraphs of text and tables can be used together\n        to present information in a structured and meaningful way. Tables allow us to align data \n        in rows and columns, making it easy for readers to
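
The introduction mentions using this splitter as one stage of a chunking pipeline. In LangChain the second stage would normally be another text splitter applied to the returned documents (for example through a `split_documents` call). The sketch below only illustrates the property that matters, namely that header metadata travels with every sub-chunk. It works on plain `(metadata, text)` tuples rather than `Document` objects, and `split_by_size` is a hypothetical helper, not a library function.

```python
def split_by_size(chunks, max_chars):
    """Split each (metadata, text) chunk on whitespace into pieces of at most
    max_chars characters, copying the metadata to every piece.

    A single word longer than max_chars is kept whole.
    """
    pieces = []
    for metadata, text in chunks:
        current = ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chars and current:
                # Adding this word would overflow: emit what we have so far.
                pieces.append((dict(metadata), current))
                current = word
            else:
                current = candidate
        if current:
            pieces.append((dict(metadata), current))
    return pieces


# One header-level chunk, written by hand in the shape of the output above.
header_chunks = [
    (
        {"header 1": "Sample HTML Document", "header 2": "Introduction"},
        "Tables allow us to align data in rows and columns, "
        "making it easy for readers to compare values.",
    ),
]

sub_chunks = split_by_size(header_chunks, max_chars=40)
for metadata, text in sub_chunks:
    print(metadata["header 2"], "|", text)
```

Splitting by headers first and by size second keeps each small piece traceable to its section, which a size-only splitter cannot do.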

In [13]:
# The splitter can also fetch a page from a URL (network access required)
# and split its text on the same headers.

url = "https://plato.stanford.edu/"
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

[Document(metadata={}, page_content="End container NOTE: Script required for drop-down button to work (mirrors).  \nEnd header wrapper End content DO NOT MODIFY THIS LINE AND BELOW End footer  \nEnd header  \nEnd navigation  \nStanford Encyclopedia of Philosophy  \nMenu  \nBrowse  \nTable of Contents  \nWhat's New  \nRandom Entry  \nChronological  \nArchives  \nAbout  \nEditorial Information  \nAbout the SEP  \nEditorial Board  \nHow to Cite the SEP  \nSpecial Characters  \nAdvanced Tools  \nContact  \nSupport SEP  \nSupport the SEP  \nPDFs for SEP Friends  \nMake a Donation  \nSEPIA for Libraries  \nDO NOT MODIFY THIS LINE AND ABOVE End search End browse End mission End operations"),
 Document(metadata={'header 2': 'Search'}, page_content='Search'),
 Document(metadata={'header 2': 'Search'}, page_content='Search Tips'),
 Document(metadata={'header 2': 'Browse'}, page_content='Browse'),
 Document(metadata={}, page_content="Table of Contents  \nWhat's New  \nArchives  \nRandom Entry  \n
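
On a real page, some chunks come back with empty metadata (navigation menus, template comments) and others merely repeat their own header text, as the output above shows. One possible post-processing step is sketched below. `Doc` is a hypothetical stand-in exposing the same `page_content` and `metadata` attributes, so the example runs without the library, and whether dropping such chunks is appropriate depends on the page.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the two attributes used in the output above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


splits = [
    Doc("Stanford Encyclopedia of Philosophy  \nMenu  \nBrowse"),
    Doc("Search", {"header 2": "Search"}),
    Doc("Search Tips", {"header 2": "Search"}),
    Doc("Browse", {"header 2": "Browse"}),
]

# Keep only chunks that sit under at least one tracked header.
under_header = [doc for doc in splits if doc.metadata]

# Additionally drop chunks whose text is just the header itself.
body_only = [
    doc for doc in under_header
    if doc.page_content not in doc.metadata.values()
]
```

The same two list comprehensions should work on the splitter's real output, since they rely only on the `metadata` and `page_content` attributes.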