## How To Split HTML Data

- `HTMLHeaderTextSplitter` is a LangChain text splitter that splits HTML documents on heading tags (h1, h2, h3, ...).
- It parses the HTML with BeautifulSoup and starts a new text chunk whenever it encounters one of the specified headings.
- Each chunk keeps the headings above it as metadata, which makes the splitter useful for structuring documents into hierarchical sections before they are passed to an LLM.
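
To make the idea concrete, here is a minimal sketch of header-based splitting that uses only the standard library. It is a toy illustration, not LangChain's implementation: the class `HeaderSectionParser` and the function `split_by_headers` are hypothetical names, and the sketch tracks only the most recent heading rather than the full heading hierarchy.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Toy illustration: group visible text under the most recent heading."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections = []  # finished sections: {"header": ..., "text": ...}
        self._in_header = False
        self._current = {"header": None, "text": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A new heading closes whatever section was being collected.
            self._flush()
            self._in_header = True
            self._current = {"header": "", "text": []}

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_header:
            self._current["header"] += text
        else:
            self._current["text"].append(text)

    def _flush(self):
        # Skip the empty placeholder section that exists before any content.
        if self._current["header"] is not None or self._current["text"]:
            self.sections.append(
                {
                    "header": self._current["header"],
                    "text": " ".join(self._current["text"]),
                }
            )

    def close(self):
        super().close()
        self._flush()  # emit the last open section


def split_by_headers(html, header_tags=("h1", "h2", "h3")):
    parser = HeaderSectionParser(header_tags)
    parser.feed(html)
    parser.close()
    return parser.sections


sample = "<h1>Guide</h1><p>Intro text.</p><h2>Install</h2><p>Run pip.</p>"
sections = split_by_headers(sample)
for section in sections:
    print(section)
```

The real splitter does more than this (for example, it carries parent headings into each chunk's metadata), but the core loop is the same: walk the document and start a new chunk at every configured heading.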

In [None]:
# Text splitter for websites: splits an HTML page based on its header tags
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My First HTML Page</title>
</head>
<body style="font-family: Arial, sans-serif; text-align: center;">

    <!-- Heading -->
    <h1>Welcome to My Web Page</h1>
    <h2>Are You Join The New Amezing Seminar, For Gen-AI keep Enroll Link in The Discription</h2>

    <!-- Paragraph -->
    <p>This is my first HTML program. I am learning how to create web pages!</p>

    <!-- Image -->
    <img src="https://via.placeholder.com/200" alt="Sample Image" style="border-radius: 10px;">

    <!-- Button -->
    <br><br>
    <button onclick="alert('Hello! You clicked the button.')">Click Me</button>
</body>
</html>
"""

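# Each tuple is (HTML tag to split on, metadata key to store the heading under)
headers_to_split_on = [
    ("h1", "Header"),
    ("h2", "Subheader"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)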
html_header_splits

[Document(metadata={}, page_content='Heading Paragraph Image Button'),
 Document(metadata={'Header': 'Welcome to My Web Page'}, page_content='Welcome to My Web Page'),
 Document(metadata={'Header': 'Welcome to My Web Page', 'Subheader': 'Are You Join The New Amezing Seminar, For Gen-AI keep Enroll Link in The Discription'}, page_content='Are You Join The New Amezing Seminar, For Gen-AI keep Enroll Link in The Discription'),
 Document(metadata={'Header': 'Welcome to My Web Page', 'Subheader': 'Are You Join The New Amezing Seminar, For Gen-AI keep Enroll Link in The Discription'}, page_content='This is my first HTML program. I am learning how to create web pages!  \nClick Me')]
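
The first chunk in the output above, `'Heading Paragraph Image Button'`, matches the text of the HTML comments in the sample page, which suggests the comments are being picked up as content. If that is unwanted, one option is to strip comments before calling `split_text`. The sketch below uses a regular expression for brevity; `strip_html_comments` is a hypothetical helper, and for real-world pages a proper parser such as BeautifulSoup is likely to be more robust.

```python
import re

# Matches HTML comments such as <!-- Heading -->, including multi-line ones.
COMMENT_RE = re.compile(r"<!--.*?-->", flags=re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Return the HTML with all <!-- ... --> comments removed."""
    return COMMENT_RE.sub("", html)


snippet = (
    "<body>\n"
    "<!-- Heading -->\n"
    "<h1>Welcome to My Web Page</h1>\n"
    "<!-- Paragraph -->\n"
    "<p>This is my first HTML program.</p>\n"
    "</body>"
)
cleaned = strip_html_comments(snippet)
print(cleaned)
```

The cleaned string can then be passed to the splitter in place of the raw HTML.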

In [None]:
# Split a live web page: replace the URL with your own site; the page is fetched and split by its headers
url = "https://guides.smu.edu/GenAIResearch"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
url_header_splits = html_splitter.split_text_from_url(url)
url_header_splits

[Document(metadata={}, page_content="BEGIN: Page Header /container Hotjar Tracking Code for https://guides.smu.edu/  6 DEC 2022 END: Page Header BEGIN: Guide Info Header END: Guide Info Header BEGIN: Guide Content END: Guide Content BEGIN: Page Footer END: Page Footer scroll_top.twig !scroll_top.twig BEGIN: Custom Footer BEGIN FOOTER END: Custom Footer  \nSkip to Main Content  \nSearch Library Resources  \nMy Account  \nNavigate  \nSMU Libraries  \nFind & Borrow  \nFind  \nBorrow  \nResearch & Teaching  \nScholarship & Research  \nTeaching & Learning  \nLocations & Collections  \nBridwell Library  \nDeGolyer Library  \nDuda Family Business Library  \nFondren Library  \nHamon Arts Library  \nUnderwood Law Library  \nFort Burgwin Library  \nExhibits & Digital Collections  \nSMU Scholar  \nSpecial Collections & Archives  \nVisit  \nVisit  \nSpaces  \nAbout  \nAbout Us  \nJoin Us  \nConnect With Us  \nGet Help  \nAsk Us  \nFAQ  \nResearch Guides by Subject  \nHow Do I . . . ? Guides  \nFin

## HTMLHeaderTextSplitter With BeautifulSoup
- Cleaning the HTML first helps capture what the web page actually says, without markup noise.
- Users can then run a specific search over the sections and get a specific answer.
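
The "specific search" idea can be sketched as a simple keyword filter over the split sections. This is only an illustration: `find_sections` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the two attributes used here (`page_content` and `metadata`) so the example runs without LangChain installed.

```python
from types import SimpleNamespace


def find_sections(docs, keyword):
    """Return docs whose content or header metadata mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    matches = []
    for doc in docs:
        # Search both the chunk text and the headings stored in metadata.
        haystack = " ".join([doc.page_content, *doc.metadata.values()]).lower()
        if needle in haystack:
            matches.append(doc)
    return matches


# Stand-ins shaped like the documents a header splitter returns.
docs = [
    SimpleNamespace(
        page_content="LangChain is a framework for building LLM-powered apps.",
        metadata={"Header 1": "LangChain Guide"},
    ),
    SimpleNamespace(
        page_content="Install using pip: pip install langchain",
        metadata={"Header 1": "LangChain Guide", "Header 2": "Installation"},
    ),
    SimpleNamespace(
        page_content="Use LangChain to create chains, agents, and tools.",
        metadata={"Header 1": "LangChain Guide", "Header 2": "Usage"},
    ),
]

hits = find_sections(docs, "install")
for doc in hits:
    print(doc.metadata, "->", doc.page_content)
```

In a real pipeline this keyword match would usually be replaced by embedding-based retrieval, but the header metadata is used the same way: it tells you which section of the page an answer came from.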

In [1]:
from bs4 import BeautifulSoup
from langchain_text_splitters import HTMLHeaderTextSplitter

# 1. Sample HTML content
html_content = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>LangChain Guide</h1>
<p>LangChain is a framework for building LLM-powered apps.</p>
<h2>Installation</h2>
<p>Install using pip: pip install langchain</p>
<h2>Usage</h2>
<p>Use LangChain to create chains, agents, and tools.</p>
<h3>Example</h3>
<p>This is an example of using LangChain in Python.</p>
</body>
</html>
"""

# 2. Clean/Parse HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# (Optional) You can clean unwanted tags like script, style, etc.
for tag in soup(["script", "style"]):
    tag.decompose()

clean_html = str(soup)

# 3. Split HTML by headers
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
)

splits = html_splitter.split_text(clean_html)

# 4. Print results
for i, section in enumerate(splits, 1):
    print(f"Section {i}:")
    print("Header:", section.metadata.get("Header 1", "No H1"))
    print("Content:", section.page_content)
    print("Metadata:", section.metadata)
    print("-" * 50)


Section 1:
Header: LangChain Guide
Content: LangChain Guide
Metadata: {'Header 1': 'LangChain Guide'}
--------------------------------------------------
Section 2:
Header: LangChain Guide
Content: LangChain is a framework for building LLM-powered apps.
Metadata: {'Header 1': 'LangChain Guide'}
--------------------------------------------------
Section 3:
Header: LangChain Guide
Content: Installation
Metadata: {'Header 1': 'LangChain Guide', 'Header 2': 'Installation'}
--------------------------------------------------
Section 4:
Header: LangChain Guide
Content: Install using pip: pip install langchain
Metadata: {'Header 1': 'LangChain Guide', 'Header 2': 'Installation'}
--------------------------------------------------
Section 5:
Header: LangChain Guide
Content: Usage
Metadata: {'Header 1': 'LangChain Guide', 'Header 2': 'Usage'}
--------------------------------------------------
Section 6:
Header: LangChain Guide
Content: Use LangChain to create chains, agents, and tools.
Metadata: {