
#### HTML TEXT SPLITTER

HTMLHeaderTextSplitter breaks web/HTML content into chunks based on the page’s heading tags, such as h1, h2, and h3.
Each time it finds a new heading, it starts a new section, so the text is grouped by topic instead of being cut at arbitrary lengths.
This keeps related information together (like a title with its explanation) and records the enclosing headings in each chunk's metadata, which tends to give embeddings and question-answering better context than simple character-based splitting.
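
To make the idea concrete, here is a minimal sketch of heading-based grouping that uses only Python's standard `html.parser`. It is not how HTMLHeaderTextSplitter is implemented and it does not reproduce that class's exact output; the names `HeadingSectionParser` and `split_by_headings` are made up for this illustration. It only shows the principle: a new heading closes the current section, and each section remembers the headings it sits under.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3}
SKIPPED_TAGS = {"head", "script", "style"}  # text inside these is ignored


class HeadingSectionParser(HTMLParser):
    """Toy splitter (hypothetical): groups text under the headings in scope."""

    def __init__(self):
        super().__init__()
        self.sections = []         # list of (metadata dict, text) pairs
        self.active = {}           # heading level -> heading text in scope
        self.heading_level = None  # level of the heading being read, if any
        self.heading_parts = []
        self.body_parts = []
        self.skip_depth = 0

    def _flush(self):
        # Emit the text collected so far, tagged with the headings in scope.
        text = " ".join(" ".join(self.body_parts).split())
        if text:
            metadata = {f"Header {lvl}": title
                        for lvl, title in sorted(self.active.items())}
            self.sections.append((metadata, text))
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_LEVELS:
            self._flush()  # a new heading ends the previous section
            self.heading_level = HEADING_LEVELS[tag]
            self.heading_parts = []

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADING_LEVELS and self.heading_level is not None:
            level = self.heading_level
            # A heading replaces any heading at the same or a deeper level.
            self.active = {l: t for l, t in self.active.items() if l < level}
            self.active[level] = " ".join("".join(self.heading_parts).split())
            self.heading_level = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.heading_level is not None:
            self.heading_parts.append(data)
        else:
            self.body_parts.append(data)

    def close(self):
        super().close()
        self._flush()  # emit the final section


def split_by_headings(html):
    parser = HeadingSectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


sample = "<h1>AI</h1><p>Intro text.</p><h2>History</h2><p>AI began in the 1950s.</p>"
for metadata, text in split_by_headings(sample):
    print(metadata, "->", text)
```

The cells that follow use the real splitter, which handles far more cases than this sketch.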

In [1]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<html>
<head>
    <title>AI Basics</title>
</head>
<body>

<h1>Artificial Intelligence</h1>
<p>Artificial Intelligence is the ability of machines to perform tasks that normally require human intelligence.</p>

<h2>History</h2>
<p>AI began in the 1950s with early research on machine learning and symbolic reasoning.</p>

<h2>Applications</h2>
<p>AI is widely used in healthcare, finance, and self-driving cars.</p>

<h3>Healthcare</h3>
<p>AI helps doctors detect diseases and analyze medical images.</p>

</body>
</html>
"""

# Map each heading tag to the metadata key that will hold its text.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)

In [2]:
html_header_splits

[Document(metadata={'Header 1': 'Artificial Intelligence'}, page_content='Artificial Intelligence'),
 Document(metadata={'Header 1': 'Artificial Intelligence'}, page_content='Artificial Intelligence is the ability of machines to perform tasks that normally require human intelligence.'),
 Document(metadata={'Header 1': 'Artificial Intelligence', 'Header 2': 'History'}, page_content='History'),
 Document(metadata={'Header 1': 'Artificial Intelligence', 'Header 2': 'History'}, page_content='AI began in the 1950s with early research on machine learning and symbolic reasoning.'),
 Document(metadata={'Header 1': 'Artificial Intelligence', 'Header 2': 'Applications'}, page_content='Applications'),
 Document(metadata={'Header 1': 'Artificial Intelligence', 'Header 2': 'Applications'}, page_content='AI is widely used in healthcare, finance, and self-driving cars.'),
 Document(metadata={'Header 1': 'Artificial Intelligence', 'Header 2': 'Applications', 'Header 3': 'Healthcare'}, page_content='He
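
In the output above, each heading appears as its own entry (for example `page_content='History'`) next to the entry that holds the body text. If those heading-only entries are not wanted downstream, one possible post-processing step is to drop any entry whose text is just one of its own metadata values. The sketch below is an assumption about how one might do that, not part of the library: `Doc` is a hypothetical stand-in with the same two fields seen in the output, and `drop_heading_only` is a made-up helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the two fields shown in the output above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_heading_only(docs):
    """Keep entries whose text is not just a repeat of one of their headings."""
    return [
        doc for doc in docs
        if doc.page_content.strip() not in doc.metadata.values()
    ]


docs = [
    Doc("History",
        {"Header 1": "Artificial Intelligence", "Header 2": "History"}),
    Doc("AI began in the 1950s with early research on machine learning and symbolic reasoning.",
        {"Header 1": "Artificial Intelligence", "Header 2": "History"}),
]
body_only = drop_heading_only(docs)
print(len(body_only))  # 1
```

Because the helper only reads `page_content` and `metadata`, it should also accept the objects in `html_header_splits`, though that is worth verifying against the installed version.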

In [6]:
url = "https://jalammar.github.io/illustrated-transformer/"

# Same heading-to-metadata mapping as before.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Downloads the page over the network, then splits it on the same heading tags.
html_header_splits = html_splitter.split_text_from_url(url)


In [7]:
html_header_splits

[Document(metadata={}, page_content='Google tag (gtag.js) End Google Analytics  \nJay Alammar  \nVisualizing machine learning one concept at a time. Read our book, and follow me on , , , ,  \nHands-On Large Language Models  \nLinkedIn  \nBluesky  \nSubstack  \nX  \nYouTube  \nBlog  \nAbout'),
 Document(metadata={'Header 1': 'The Illustrated Transformer'}, page_content='The Illustrated Transformer'),
 Document(metadata={'Header 1': 'The Illustrated Transformer'}, page_content="more  \nDiscussions: ,  \nHacker News (65 points, 4 comments)  \nReddit r/MachineLearning (29 points, 3 comments)  \nTranslations: , , , , , , , , , , , ,  \nArabic  \nChinese (Simplified) 1  \nChinese (Simplified) 2  \nFrench 1  \nFrench 2  \nItalian  \nJapanese  \nKorean  \nPersian  \nRussian  \nSpanish 1  \nSpanish 2  \nVietnamese  \nWatch: MIT’s lecture referencing this post  \nDeep Learning State of the Art  \nFeatured in courses at , , , , and others  \nStanford  \nHarvard  \nMIT  \nPrinceton  \nCMU  \nThis 