# **✅ Part 1: Recursive Character Text Splitter**

**📌 1. What is RecursiveCharacterTextSplitter?**

RecursiveCharacterTextSplitter is a text splitter from LangChain. It splits large texts into smaller chunks at natural boundaries (paragraphs, then lines, then words) rather than blindly cutting after a fixed number of characters.

It tries a list of separators in order (by default "\n\n", "\n", " ", ""), falling back to the next, finer separator only for pieces that are still longer than chunk_size, so logical groupings are retained where possible.
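
The coarse-to-fine fallback can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it drops the separator at each chunk boundary.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the running chunk
            continue
        if current:
            chunks.append(current)       # close the running chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, 30)
for chunk in chunks:
    print(repr(chunk))
```

Only the second paragraph exceeds 30 characters, so it alone falls through to the space separator; the other paragraphs stay whole.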

**📌 2. Why is it preferred over basic Character Splitter?**

Because it:

Keeps sentences and their surrounding context together more often.

Avoids breaking in the middle of words or sentences whenever a coarser separator fits within the chunk size.

Tries larger logical splits first, which makes chunks more readable.
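
To make the mid-word problem concrete, here is a small comparison in plain Python (no LangChain involved). The sample sentence and the width of 14 are arbitrary choices for illustration.

```python
text = "Hyderabad is the capital city of Telangana."
width = 14

# Naive approach: cut every `width` characters, wherever that lands.
naive_chunks = [text[i:i + width] for i in range(0, len(text), width)]

# Word-aware approach: pack whole words greedily, never exceeding `width`
# unless a single word is itself longer than `width`.
word_chunks, current = [], ""
for word in text.split(" "):
    candidate = word if not current else f"{current} {word}"
    if len(candidate) <= width or not current:
        current = candidate
    else:
        word_chunks.append(current)
        current = word
if current:
    word_chunks.append(current)

print(naive_chunks)  # ['Hyderabad is t', 'he capital cit', 'y of Telangana', '.']
print(word_chunks)   # ['Hyderabad is', 'the capital', 'city of', 'Telangana.']
```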

# **📌 3–5. Code to Split Text**

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample Text
text = """
Telangana is a state in southern India. Hyderabad is its capital city. Telangana
is famous for its rich heritage, cultural festivals, iconic Charminar, and
delicious Hyderabadi Biryani. It is also known for major IT hubs like HITEC City
and historical temples such as Yadadri.
"""

# Create splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,         # max size per chunk
    chunk_overlap=20        # overlap between chunks
)

# Split text
chunks = splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Chunk 1:
Telangana is a state in southern India. Hyderabad is its capital city. Telangana

Chunk 2:
is famous for its rich heritage, cultural festivals, iconic Charminar, and

Chunk 3:
delicious Hyderabadi Biryani. It is also known for major IT hubs like HITEC City

Chunk 4:
and historical temples such as Yadadri.
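
The chunks above share no text even though chunk_overlap=20. A plausible reason is that the text was split on "\n", and each resulting line is longer than 20 characters, so no whole piece fits in the overlap budget. The sketch below shows how overlap behaves when the pieces are small enough. It is a simplified stand-in, not LangChain's code, and `merge_with_overlap` is a hypothetical name.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, sep=" "):
    """Merge pieces into chunks, carrying trailing pieces into the next chunk as overlap."""
    def length(items):
        return len(sep.join(items))

    chunks, window = [], []
    for piece in pieces:
        if window and length(window + [piece]) > chunk_size:
            chunks.append(sep.join(window))
            # Drop pieces from the front until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while window and (
                length(window) > chunk_overlap
                or length(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(sep.join(window))
    return chunks


words = "one two three four five six seven eight".split()
overlapped = merge_with_overlap(words, chunk_size=15, chunk_overlap=6)
print(overlapped)
# ['one two three', 'three four five', 'five six seven', 'seven eight']
```

Each chunk starts with the last word of the previous one, because a single word fits within the 6-character overlap budget.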



# **📌 6. Where can this technique be practically applied?**

**📌 Practical Applications**

Splitting large documents (PDFs, books) for QA systems.

Preprocessing for retrieval-based chatbots.

Text chunking before passing to LLMs for summarization or embeddings.
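
As a toy illustration of the retrieval use case, the sketch below picks the chunk that shares the most words with a question. Real systems typically compare embeddings instead; the word-overlap score and the `best_chunk` helper are assumptions made only to keep the example self-contained.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_chunk(chunks, question):
    """Return the chunk sharing the most words with the question."""
    question_words = tokenize(question)
    return max(chunks, key=lambda chunk: len(question_words & tokenize(chunk)))


# The four chunks produced by the splitter earlier in this notebook.
chunks = [
    "Telangana is a state in southern India. Hyderabad is its capital city. Telangana",
    "is famous for its rich heritage, cultural festivals, iconic Charminar, and",
    "delicious Hyderabadi Biryani. It is also known for major IT hubs like HITEC City",
    "and historical temples such as Yadadri.",
]

answer_context = best_chunk(chunks, "Which IT hubs is Telangana known for?")
print(answer_context)
```

The selected chunk would then be passed to the LLM as context for answering the question.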



# **✅ Part 2: Character Text Splitter**

**📌 1. Difference from RecursiveCharacterTextSplitter**

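| Feature           | CharacterTextSplitter                                        | RecursiveCharacterTextSplitter                        |
| ----------------- | ------------------------------------------------------------ | ----------------------------------------------------- |
| Splitting logic   | Splits on one separator, then merges pieces up to chunk_size | Tries a list of separators, from coarse to fine       |
| Context awareness | ❌ Limited to the one separator                               | ✅ Keeps paragraphs, lines, and words together longer  |
| Preferred when    | Simplicity needed                                            | Coherent chunks needed                                |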
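
One practical consequence of using a single separator is that a piece containing no separator cannot be reduced further, so a chunk can exceed chunk_size. The sketch below mimics that behaviour in plain Python; `single_separator_split` is a hypothetical helper, not the library's code.

```python
def single_separator_split(text, chunk_size, separator="\n\n"):
    """Split on one separator and merge pieces greedily; oversized pieces stay whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Short intro.\n\n"
    "A long paragraph without any blank line inside it, so it cannot be divided further."
)
pieces = single_separator_split(doc, chunk_size=40)
for piece in pieces:
    print(len(piece), repr(piece))
```

The second chunk is longer than 40 characters because there is no finer separator to fall back to, which is exactly the gap the recursive splitter fills.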


# **📌 2–3. Code**

In [2]:
from langchain.text_splitter import CharacterTextSplitter

splitter_char = CharacterTextSplitter(
    separator=" ",         # split on space
    chunk_size=100,
    chunk_overlap=20
)

char_chunks = splitter_char.split_text(text)

# Display
for i, chunk in enumerate(char_chunks):
    print(f"Char Split Chunk {i+1}:\n{chunk}\n")


Char Split Chunk 1:
Telangana is a state in southern India. Hyderabad is its capital city. Telangana 
is famous for its

Char Split Chunk 2:
is famous for its rich heritage, cultural festivals, iconic Charminar, and 
delicious Hyderabadi

Char Split Chunk 3:
Hyderabadi Biryani. It is also known for major IT hubs like HITEC City 
and historical temples such

Char Split Chunk 4:
temples such as Yadadri.



**📌 4. When to prefer CharacterTextSplitter?**

When the text has one reliable delimiter (for example, one record per line) and semantic coherence within a chunk isn’t critical.

Also useful when simple, predictable chunking matters more than meaning, e.g., preparing roughly uniform inputs for language modeling.

# **✅ Part 3: HTML Header Text Splitter**

**📌 1. What is HTMLHeaderTextSplitter?**

This splitter breaks HTML content into chunks based on header tags (`<h1>`, `<h2>`, `<h3>`, etc.) and records, as metadata, the headers each chunk falls under. It’s very useful when dealing with web pages, blog posts, or knowledge bases, where headers indicate sections.
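
The idea of header-based splitting can be sketched with Python's standard `html.parser`. This is an illustration only: `HeaderSplitter` is a hypothetical class, it handles just h1 to h3, and unlike the LangChain output shown in this notebook it does not emit the header text itself as a separate chunk.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect text chunks together with the h1/h2/h3 headers they fall under."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.active = {}       # header level -> header text currently in scope
        self.in_header = None  # level of the header tag being read, if any
        self.chunks = []       # list of (text, metadata) tuples

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self.in_header = self.LEVELS[tag]

    def handle_endtag(self, tag):
        if tag in self.LEVELS:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header is not None:
            level = self.in_header
            # A new header closes every header at the same or a deeper level.
            self.active = {k: v for k, v in self.active.items() if k < level}
            self.active[level] = text
        else:
            metadata = {f"Header {k}": v for k, v in sorted(self.active.items())}
            self.chunks.append((text, metadata))


parser = HeaderSplitter()
parser.feed(
    "<h1>Main Title</h1><p>Intro.</p>"
    "<h2>Section One</h2><p>Details.</p>"
    "<h2>Section Two</h2><p>More.</p>"
)
parser.close()

for text, metadata in parser.chunks:
    print(text, metadata)
```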

# **📌 2–5. Code to Process HTML**

In [5]:
from langchain.text_splitter import HTMLHeaderTextSplitter

html_string = """
<html>
  <body>
    <h1>Main Title</h1>
    <p>This is an intro paragraph.</p>
    <h2>Section One</h2>
    <p>Details about section one.</p>
    <h3>Subsection</h3>
    <p>Further details here.</p>
    <h2>Section Two</h2>
    <p>More content here.</p>
  </body>
</html>
"""

# Define which headers to split on
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])

docs = html_splitter.split_text(html_string)

# Display chunks and metadata
for i, doc in enumerate(docs):
    print(f"\nChunk {i+1}:\nContent: {doc.page_content}\nMetadata: {doc.metadata}")



Chunk 1:
Content: Main Title
Metadata: {'Header 1': 'Main Title'}

Chunk 2:
Content: This is an intro paragraph.
Metadata: {'Header 1': 'Main Title'}

Chunk 3:
Content: Section One
Metadata: {'Header 1': 'Main Title', 'Header 2': 'Section One'}

Chunk 4:
Content: Details about section one.
Metadata: {'Header 1': 'Main Title', 'Header 2': 'Section One'}

Chunk 5:
Content: Subsection
Metadata: {'Header 1': 'Main Title', 'Header 2': 'Section One', 'Header 3': 'Subsection'}

Chunk 6:
Content: Further details here.
Metadata: {'Header 1': 'Main Title', 'Header 2': 'Section One', 'Header 3': 'Subsection'}

Chunk 7:
Content: Section Two
Metadata: {'Header 1': 'Main Title', 'Header 2': 'Section Two'}

Chunk 8:
Content: More content here.
Metadata: {'Header 1': 'Main Title', 'Header 2': 'Section Two'}



## 📌 6. Why is this useful?

Great for structuring web scraping results.

Useful for search indexing, QA systems, or summarizing webpage sections.

Maintains document hierarchy.
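
Because each chunk carries its header path, a section can be selected by metadata alone. The sketch below uses plain dictionaries shaped like the paragraph chunks printed above; `under_header` is a hypothetical helper.

```python
# Paragraph chunks and metadata copied from the output above.
html_chunks = [
    {"content": "This is an intro paragraph.",
     "metadata": {"Header 1": "Main Title"}},
    {"content": "Details about section one.",
     "metadata": {"Header 1": "Main Title", "Header 2": "Section One"}},
    {"content": "Further details here.",
     "metadata": {"Header 1": "Main Title", "Header 2": "Section One",
                  "Header 3": "Subsection"}},
    {"content": "More content here.",
     "metadata": {"Header 1": "Main Title", "Header 2": "Section Two"}},
]


def under_header(chunks, key, value):
    """Return the content of every chunk whose metadata has key == value."""
    return [c["content"] for c in chunks if c["metadata"].get(key) == value]


section_one = under_header(html_chunks, "Header 2", "Section One")
print(section_one)  # ['Details about section one.', 'Further details here.']
```

The subsection's text is included as well, since its metadata keeps the enclosing "Header 2" value.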




# **✅ Part 4: OpenAI API Integration**

**📌 1–4. Code (API key read from an environment variable)**

In [14]:
import os

from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
# SAMBANOVA_API_KEY is an assumed variable name; use whatever name you export the key under.
client = OpenAI(
    api_key=os.environ["SAMBANOVA_API_KEY"],
    base_url="https://api.sambanova.ai/v1",  # SambaNova's OpenAI-compatible endpoint
)

# Send a single-turn chat completion request
response = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What are some famous places to visit in Telangana?"}
    ],
)

# Print the text of the first returned choice
print(response.choices[0].message.content)

Telangana, a state in southern India, is known for its rich history, cultural heritage, and natural beauty. Here are some famous places to visit in Telangana:

1. **Charminar**: A iconic monument and symbol of Hyderabad, the capital city of Telangana. It's a beautiful example of Indo-Islamic architecture.
2. **Golconda Fort**: A historic fort located in Hyderabad, known for its impressive architecture, acoustic effects, and stunning views of the city.
3. **Hussain Sagar Lake**: A large artificial lake in Hyderabad, popular for boating, water sports, and a giant Buddha statue in the middle of the lake.
4. **Birla Mandir**: A beautiful Hindu temple located on a hill in Hyderabad, offering stunning views of the city and a peaceful atmosphere.
5. **Warangal Fort**: A historic fort located in Warangal, known for its impressive architecture, intricate carvings, and rich history.
6. **Kakatiya Thermal Power Station**: A popular tourist destination in Warangal, offering a glimpse into the stat

In [15]:
import os

from openai import OpenAI

# Client pointed at SambaNova's endpoint, with the key read from the environment.
# SAMBANOVA_API_KEY is an assumed variable name; use whatever name you export the key under.
client = OpenAI(
    api_key=os.environ["SAMBANOVA_API_KEY"],
    base_url="https://api.sambanova.ai/v1",  # SambaNova's OpenAI-compatible endpoint
)

# Same question as in the previous cell, this time asking for the answer in Telugu
response = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What are some famous places to visit in Telangana in telugu?"}
    ],
)

# Print the text of the first returned choice
print(response.choices[0].message.content)

తెలంగాణలో పర్యాటక ప్రదేశాలు:

1. చార్మినార్: హైదరాబాద్‌లోని ప్రసిద్ధ చారిత్రక నిర్మాణం.
2. గోల్కొండ కోట: హైదరాబాద్‌లోని చారిత్రక కోట.
3. బిర్లా మందిర్: హైదరాబాద్‌లోని ప్రసిద్ధ దేవాలయం.
4. నెహ్రూ జంతుప్రదర్శనశాల: హైదరాబాద్‌లోని ప్రసిద్ధ జంతుప్రదర్శనశాల.
5. రామోజీ ఫిల్మ్ సిటీ: హైదరాబాద్‌లోని ప్రసిద్ధ సినీ పరిశ్రమ.
6. కొండాపూర్: హైదరాబాద్‌లోని ప్రసిద్ధ పర్యాటక ప్రదేశం.
7. అనంతగిరి: వికారాబాద్ జిల్లాలోని ప్రసిద్ధ పర్యాటక ప్రదేశం.
8. నాగార్జునసాగర్: నల్గొండ జిల్లాలోని ప్రసిద్ధ పర్యాటక ప్రదేశం.
9. భద్రాచలం: ఖమ్మం జిల్లాలోని ప్రసిద్ధ పర్యాటక ప్రదేశం.
10. మెదక్ చర్చి: మెదక్ జిల్లాలోని ప్రసిద్ధ చర్చి.

ఇవి కొన్ని ప్రసిద్ధ పర్యాటక ప్రదేశాలు. తెలంగాణలో మరిన్ని పర్యాటక ప్రదేశాలు ఉన్నాయి.


# **📌 5. How Chat API Processes Requests?**

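The messages parameter holds the conversation history as a list of messages, oldest first.

Each message's role indicates whether it comes from "user", "system", or "assistant".

The API passes the conversation to the selected model (Meta-Llama-3.3-70B-Instruct here, served through SambaNova's endpoint) and returns the generated reply in response.choices[0].message.content.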
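
A multi-turn conversation is built by appending each turn to the same list. The sketch below only constructs the list and makes no API call; `add_turn` is a hypothetical helper, and the system and assistant texts are made-up placeholders.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def add_turn(messages, role, content):
    """Return a new message list with one more turn appended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return messages + [{"role": role, "content": content}]


history = []
history = add_turn(history, "system", "You are a concise travel guide.")
history = add_turn(history, "user", "What are some famous places to visit in Telangana?")
# Placeholder reply; in practice this is the model's previous answer.
history = add_turn(history, "assistant", "Charminar and Golconda Fort are popular.")
history = add_turn(history, "user", "Which of these is older?")

for message in history:
    print(f"{message['role']}: {message['content']}")

# The full list would then be sent as messages=history in
# client.chat.completions.create(...), so the model sees the earlier turns.
```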

