### 1. **HTMLHeaderTextSplitter**:
- Splits the document at each configured header tag (`<h1>`, `<h2>`, etc.), so every header starts a new chunk, and records the applicable headers in each chunk's metadata.
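
To make header-based splitting concrete, here is a toy sketch built only on Python's standard-library `html.parser`. It is not LangChain's implementation; the names `HeaderChunker` and `split_by_headers` are hypothetical and exist only for this illustration. It starts a new chunk whenever a tracked header tag opens and stores that header's text alongside the body text that follows.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy chunker (hypothetical): starts a new chunk at every tracked header tag."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = header_tags
        self.chunks = []              # each item: {"header": str | None, "text": str}
        self._in_header = False
        self._header_parts = []
        self._body_parts = []
        self._current_header = None

    def _flush(self):
        # Store the text gathered so far under the most recent header.
        text = " ".join(self._body_parts).strip()
        if self._current_header is not None or text:
            self.chunks.append({"header": self._current_header, "text": text})
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()             # a new header closes the previous chunk
            self._in_header = True
            self._header_parts = []

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._current_header = " ".join(self._header_parts).strip()
            self._in_header = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        target = self._header_parts if self._in_header else self._body_parts
        target.append(data)


def split_by_headers(html, header_tags=("h1", "h2", "h3")):
    """Parse an HTML string and return one dict per header-delimited chunk."""
    parser = HeaderChunker(header_tags)
    parser.feed(html)
    parser.close()
    parser._flush()                   # emit the final chunk
    return parser.chunks


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Intro text.</p><h2>About</h2><p>Details here.</p>"
    for chunk in split_by_headers(sample):
        print(chunk)
```

Passing a shorter `header_tags` tuple, for example `("h1", "h2")`, leaves lower-level headers inside the body text of their parent chunk, which is the same trade-off the `headers_to_split_on` list controls in the cells below.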

### 2. **HTMLSectionSplitter**:
- Groups related content (header + its subsections) into meaningful sections, maintaining context.


### 1. Splitting a Local HTML String using HTMLHeaderTextSplitter

In [14]:
from langchain_text_splitters import HTMLHeaderTextSplitter

file_path = "C:/Users/admin/OneDrive/Desktop/Chunking_Embedding/Dataset/simple.html"
with open(file_path, encoding='utf-8') as f:
    html_string = f.read()

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
chunks = html_header_splits

# Get the total chunks, maximum chunk size, minimum chunk size, sample chunk
chunk_sizes = [len(chunk.page_content) for chunk in chunks]
print("Total number of chunks:", len(chunks))
print("Maximum chunk size:", max(chunk_sizes))
print("Minimum chunk size:", min(chunk_sizes))

print("\nSample Chunk:\n", chunks[0].page_content)

Total number of chunks: 3
Maximum chunk size: 106
Minimum chunk size: 15

Sample Chunk:
 About This Page
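
The size statistics printed above are recomputed in every cell of this notebook, so they could be factored into a small helper. The sketch below is one possible version: `summarize_chunks` and the `FakeDoc` stand-in are hypothetical names, and the sample documents are made up for the demonstration. The helper only assumes each chunk exposes a `page_content` string, as LangChain `Document` objects do, and it also guards against an empty chunk list, where a bare `max()` would raise `ValueError`.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def summarize_chunks(chunks):
    """Return the count, maximum and minimum page_content length of the chunks."""
    if not chunks:
        # max()/min() raise ValueError on an empty sequence, so handle it explicitly.
        return {"count": 0, "max": 0, "min": 0}
    sizes = [len(chunk.page_content) for chunk in chunks]
    return {"count": len(sizes), "max": max(sizes), "min": min(sizes)}


if __name__ == "__main__":
    # Made-up documents, not the contents of simple.html.
    docs = [FakeDoc("About This Page"), FakeDoc("x" * 106), FakeDoc("y" * 40)]
    print(summarize_chunks(docs))  # {'count': 3, 'max': 106, 'min': 15}
```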


### 1. Splitting HTML from a URL using HTMLHeaderTextSplitter

In [15]:
from langchain_text_splitters import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

# Initialize HTMLHeaderTextSplitter and split HTML from URL
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
url = "https://plato.stanford.edu/entries/goedel/"
html_header_splits = html_splitter.split_text_from_url(url)

# Split the chunks further using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
chunks = text_splitter.split_documents(html_header_splits)

# Print the total number of chunks and the first chunk (content and metadata)
print("Total number of chunks:", len(chunks), "\n")
print("First chunk:", chunks[0], "\n")

# Print the maximum and minimum chunk sizes
chunk_sizes = [len(chunk.page_content) for chunk in chunks]
print("Maximum chunk size among all:", max(chunk_sizes), "\n")
print("Minimum chunk size among all:", min(chunk_sizes), "\n")

Total number of chunks: 339 

First chunk: page_content='Stanford Encyclopedia of Philosophy  
Menu  
Browse About Support SEP  
Table of Contents What's New Random Entry Chronological Archives  
Editorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  
Support the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  
Entry Navigation  
Entry Contents Bibliography Academic Tools Friends PDF Preview Author and Citation Info Back to Top  
Kurt Gödel' 

Maximum chunk size among all: 500 

Minimum chunk size among all: 5 
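
To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter breaks text on separators such as paragraphs, lines and spaces and then merges the pieces up to the size limit, so its overlaps are not exact character counts. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=30):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share exactly chunk_overlap characters.
    Simplified illustration only; separators are ignored.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                     # this window already reached the end
        start += step
    return chunks


if __name__ == "__main__":
    print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
    # ['abcd', 'defg', 'ghij']
```

Text shorter than `chunk_size` comes back as a single window, which is why the size limit only acts as an upper bound on chunk length.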



### 2. Splitting HTML content by sections

In [16]:
from langchain_text_splitters import (RecursiveCharacterTextSplitter, HTMLSectionSplitter)

# Step 1: Load the HTML document from file
file_path = "C:/Users/admin/OneDrive/Desktop/Chunking_Embedding/Dataset/simple.html"
with open(file_path, encoding='utf-8') as f:
    html_string = f.read()

# Step 2: Define headers to split on
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

# Step 3: Use HTMLSectionSplitter to split the document based on headers
html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
chunks = html_header_splits

# Step 4: Get the total chunks, maximum chunk size, minimum chunk size, and sample chunk
chunk_sizes = [len(chunk.page_content) for chunk in chunks]
print("Total number of chunks:", len(chunks))
print("Maximum chunk size:", max(chunk_sizes))
print("Minimum chunk size:", min(chunk_sizes))

print("\nSample Chunk:\n",chunks[0].page_content)
print("---------------------")

# Step 5: Use RecursiveCharacterTextSplitter to further split the chunks;
# sections already shorter than chunk_size come back as a single split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
splits = text_splitter.split_documents(html_header_splits)

# Step 6: Get the total splits, maximum split size, minimum split size, and sample split
split_sizes = [len(split.page_content) for split in splits]
print("Total number of splits:",len(splits))
print("Maximum split size: ",max(split_sizes))
print("Minimum split size: ",min(split_sizes))
# Print a sample split
print("Sample split (1st split):", splits[0].page_content)
print("---------------------")


Total number of chunks: 2
Maximum chunk size: 148
Minimum chunk size: 70

Sample Chunk:
 Welcome to My Simple HTML Page 
 
 
 Home 
 About 
 Services 
 Contact
---------------------
Total number of splits: 2
Maximum split size:  148
Minimum split size:  70
Sample split (1st split): Welcome to My Simple HTML Page 
 
 
 Home 
 About 
 Services 
 Contact
---------------------
