### Text Splitting Techniques and OpenAI API Integration

In [27]:
## Text Splitting with LangChain
# This code demonstrates how to split text into manageable chunks using LangChain's `RecursiveCharacterTextSplitter`.

# Import necessary libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Give the text to be split
text = "Telangana is a state in southern India. Hyderabad is its capital city. Telangana is famous for its rich heritage, cultural festivals, iconic Charminar, and delicious Hyderabadi Biryani. It is also known for major IT hubs like HITEC City and historical temples such as Yadadri."

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=70, chunk_overlap=20)

# Split the text into chunks
final_texts = text_splitter.split_text(text)

# Print the resulting chunks (loop variable is `chunk` so the original `text` is not overwritten)
for i, chunk in enumerate(final_texts):
    print(f"Chunk {i+1}:\n{chunk}\n")
    print("-" * 70)

Chunk 1:
Telangana is a state in southern India. Hyderabad is its capital city.

----------------------------------------------------------------------
Chunk 2:
its capital city. Telangana is famous for its rich heritage, cultural

----------------------------------------------------------------------
Chunk 3:
heritage, cultural festivals, iconic Charminar, and delicious

----------------------------------------------------------------------
Chunk 4:
and delicious Hyderabadi Biryani. It is also known for major IT hubs

----------------------------------------------------------------------
Chunk 5:
for major IT hubs like HITEC City and historical temples such as

----------------------------------------------------------------------
Chunk 6:
temples such as Yadadri.

----------------------------------------------------------------------


### Code Explanation:

### Explain what "Recursive Character Text Splitter" does.
-> The Recursive Character Text Splitter is a chunking method used in document processing for NLP. It splits large text into smaller chunks while keeping related text together as far as the chunk size allows.

-> It works by trying the coarsest separator first and falling back to finer ones. LangChain's default separator list is:
   -> Paragraphs ("\n\n")
   -> Lines ("\n")
   -> Words (" ")
   -> Individual characters ("")
-> If a piece is still larger than the chunk size, it is split again with the next separator in the list. Sentence or phrase boundaries (". ", ", ") are not in the default list, but a custom separators list can include them.
-> It ensures that each chunk:
   -> Does not exceed a specified maximum length (e.g., 1000 characters; chunk_size=70 in the code above).
   -> Overlaps with the next chunk by up to a small number of characters (e.g., 100; chunk_overlap=20 in the code above) to preserve context across boundaries.
-> This is why the output above breaks between words rather than at sentence ends: the sample text contains no newlines, so the splitter falls back to spaces.
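
The fallback behaviour described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and chunk overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so no chunk exceeds chunk_size, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so fall back to the next finer one.
        return recursive_split(text, chunk_size, finer)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging pieces into this chunk
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = ("Telangana is a state in southern India. Hyderabad is its capital city. "
          "Telangana is famous for its rich heritage, cultural festivals, iconic "
          "Charminar, and delicious Hyderabadi Biryani.")
for n, chunk in enumerate(recursive_split(sample, 70), start=1):
    print(n, repr(chunk))
```

Because overlap is omitted, the chunks will not match the notebook output exactly; the point is only to show how the splitter moves from "\n\n" to "\n" to " " and finally to single characters.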

### Why It’s Preferred Over Basic Character Splitter

| Feature | Basic Character Splitter | Recursive Character Text Splitter |
|---|---|---|
| Splitting Method | Cuts text every N characters, regardless of meaning | Tries to split on logical boundaries (paragraphs, lines, words, etc.) |
| Context Awareness | No context or structure awareness | Preserves semantic meaning as much as possible |
| Chunk Coherence | Chunks may split in the middle of sentences or words | Chunks are more natural and readable |
| Use Case Suitability | Suitable for raw text chunking only | Ideal for LLMs, embeddings, summarization tasks |

-> Example:

"Artificial intelligence is changing the world. It is being used in healthcare, education, and transportation. The impact is massive."

Basic Splitter (split every 50 characters):

"Artificial intelligence is changing the world. It "

"is being used in healthcare, education, and transp"

"ortation. The impact is massive."

Recursive Character Text Splitter (idealized output, assuming a sentence separator such as ". " is added to its separator list and the max chunk size is ~65 characters):

"Artificial intelligence is changing the world."

"It is being used in healthcare, education, and transportation."

"The impact is massive."

### Where can the "Recursive Character Text Splitter" technique be practically applied?
It’s especially useful when preparing text for:

-> Generating embeddings for vector databases (e.g., for RAG pipelines)
-> Text summarization
-> Language model input chunking
-> Maintaining logical coherence in conversational AI
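
Returning to the example sentence from the comparison above, the contrast between fixed-width cutting and boundary-aware splitting can be reproduced with the standard library alone. The helper names `fixed_width_chunks` and `sentence_chunks` are hypothetical, and the regular expression is only a rough stand-in for sentence detection.

```python
import re

text = ("Artificial intelligence is changing the world. "
        "It is being used in healthcare, education, and transportation. "
        "The impact is massive.")


def fixed_width_chunks(s, width):
    """Cut the string every `width` characters, ignoring word boundaries."""
    return [s[i:i + width] for i in range(0, len(s), width)]


def sentence_chunks(s):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", s.strip())


print("Fixed width (50 characters):")
for chunk in fixed_width_chunks(text, 50):
    print(repr(chunk))

print("Sentence boundaries:")
for chunk in sentence_chunks(text):
    print(repr(chunk))
```

The fixed-width version cuts "transportation" in half, while the boundary-aware version keeps each sentence intact at the cost of uneven chunk lengths.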

In [32]:

# Example of using CharacterTextSplitter for text splitting
# This example uses LangChain's `CharacterTextSplitter` to split a given text into
# smaller chunks based on a specified separator and chunk size. The `chunk_overlap`
# parameter allows some overlap between chunks, which helps maintain context.
from langchain.text_splitter import CharacterTextSplitter

# Define the text to be split
text = "Telangana is a state in southern India. Hyderabad is its capital city. Telangana is famous for its rich heritage, cultural festivals, iconic Charminar, and delicious Hyderabadi Biryani. It is also known for major IT hubs like HITEC City and historical temples such as Yadadri."

# Initialize the CharacterTextSplitter with a separator, chunk size, and overlap
text_splitter = CharacterTextSplitter(separator=".", chunk_size=70, chunk_overlap=20)

# Split the text into chunks
final_texts = text_splitter.split_text(text)

# Print the resulting chunks (loop variable is `chunk` so the original `text` is not overwritten)
for i, chunk in enumerate(final_texts):
    print(f"Chunk {i+1}:\n{chunk}\n")
    print("-" * 70)

Created a chunk of size 114, which is longer than the specified 70


Chunk 1:
Telangana is a state in southern India. Hyderabad is its capital city

----------------------------------------------------------------------
Chunk 2:
Telangana is famous for its rich heritage, cultural festivals, iconic Charminar, and delicious Hyderabadi Biryani

----------------------------------------------------------------------
Chunk 3:
It is also known for major IT hubs like HITEC City and historical temples such as Yadadri

----------------------------------------------------------------------


### Code Explanation

### Explain the difference between Character Text Splitter and Recursive Character Text Splitter

| Feature | CharacterTextSplitter | RecursiveCharacterTextSplitter |
|---|---|---|
| Splitting Strategy | Splits text based on a single separator | Splits text recursively using a list of separators |
| Separator Control | You define one separator (default is `"\n\n"`) | You can pass a list of separators (default: `"\n\n"`, `"\n"`, `" "`, `""`) |
| Chunk Size Respect | If a piece between separators is larger than chunk_size, it is emitted as-is with a warning | Tries to break big chunks further using smaller separators |
| Semantic Chunks | Breaks only where the separator occurs, so coherence depends on the separator chosen | Attempts to keep paragraphs, lines, and words whole |
| Fallback Logic | No fallback; only uses the one separator | Falls back through separators: `"\n\n"` → `"\n"` → `" "` → `""` |
| Use Case Fit | Best for structured text or fixed formats | Best for unstructured or natural language text |
| Output Consistency | Might produce inconsistent chunk lengths | More consistent chunk lengths |

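### Compare the output with RecursiveCharacterTextSplitter and CharacterTextSplitter

-> CharacterTextSplitter (with separator='.') breaks only at periods, so every chunk ends at a sentence boundary. The second sentence is longer than 70 characters and contains no period, so it is kept whole and triggers the "Created a chunk of size 114" warning.

-> RecursiveCharacterTextSplitter keeps every chunk within 70 characters by falling back to spaces. Its chunks break between words, often mid-sentence, and neighbouring chunks share some overlapping text (e.g., "its capital city.").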
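
The oversized chunk in the CharacterTextSplitter output follows from its split-then-merge approach. Below is a simplified sketch of that idea in plain Python; the function name `split_on_separator` is hypothetical, overlap handling is omitted, and LangChain's real implementation differs in detail.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(separator.join(current).strip())
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current).strip())
    return chunks


text = ("Telangana is a state in southern India. Hyderabad is its capital city. "
        "Telangana is famous for its rich heritage, cultural festivals, iconic "
        "Charminar, and delicious Hyderabadi Biryani. It is also known for major "
        "IT hubs like HITEC City and historical temples such as Yadadri.")

for chunk in split_on_separator(text, ".", 70):
    flag = "  <-- exceeds chunk_size" if len(chunk) > 70 else ""
    print(f"{len(chunk):3d} chars: {chunk}{flag}")
```

With only one separator available there is no way to shrink a long sentence, so it passes through unchanged. A recursive splitter would instead retry that sentence with a finer separator.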

### When should CharacterTextSplitter be preferred?

-> We want control over a specific separator
   -> If we only want to split on a known character or pattern (like a period, comma, newline, etc.).
   -> Useful when our data follows a fixed structure.

-> We’re working with clean, structured text
   -> If our input is already well-structured and doesn't require fallback logic.
   -> Texts like: Paragraphs, Lists, Code blocks, Metadata-annotated documents.

-> We want simpler, lighter processing for large-scale data
   -> CharacterTextSplitter has less logic than recursive splitting.
   -> For bulk processing (e.g., thousands of documents), this may make it somewhat faster, though the gain is worth measuring on our own data.

-> We don’t need sentence/word-level fallback logic
   -> We’re fine if some chunks exceed chunk_size, as the 114-character chunk in the output above does.
   -> We care more about simplicity than precise control.

In [33]:
# Example of using HTMLHeaderTextSplitter for splitting HTML content
from langchain_text_splitters import HTMLHeaderTextSplitter

# Example HTML text to split

html_text = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Document</title>
</head>
<body>
    <h1>Welcome to Telangana</h1>
    <p>Telangana is a state in southern India. Hyderabad is its capital city.</p>

    <h2>Culture and Heritage</h2>
    <p>Telangana is famous for its rich heritage, cultural festivals, iconic Charminar, and delicious Hyderabadi Biryani.</p>

    <h3>Technology and History</h3>
    <p>It is also known for major IT hubs like HITEC City and historical temples such as Yadadri.</p>
</body>
</html>
"""
# Define the headers to split on as (tag, metadata key) pairs
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

# Create an instance of HTMLHeaderTextSplitter with the specified headers
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split the HTML text into chunks based on the specified headers
html_header_splits = html_splitter.split_text(html_text)

# Display the split HTML chunks (each item is a Document with page_content and metadata)
for i, doc in enumerate(html_header_splits):
    print(f"HTML Chunk {i+1}:\n{doc}")
    # print(doc.metadata)
    print("-" * 70)

HTML Chunk 1:
page_content='Welcome to Telangana' metadata={'Header 1': 'Welcome to Telangana'}
----------------------------------------------------------------------
HTML Chunk 2:
page_content='Telangana is a state in southern India. Hyderabad is its capital city.' metadata={'Header 1': 'Welcome to Telangana'}
----------------------------------------------------------------------
HTML Chunk 3:
page_content='Culture and Heritage' metadata={'Header 1': 'Welcome to Telangana', 'Header 2': 'Culture and Heritage'}
----------------------------------------------------------------------
HTML Chunk 4:
page_content='Telangana is famous for its rich heritage, cultural festivals, iconic Charminar, and delicious Hyderabadi Biryani.' metadata={'Header 1': 'Welcome to Telangana', 'Header 2': 'Culture and Heritage'}
----------------------------------------------------------------------
HTML Chunk 5:
page_content='Technology and History' metadata={'Header 1': 'Welcome to Telangana', 'Header 2': 'Cultu

### Code Explanation

### Explain what HTML Header Text Splitter does.

-> The HTMLHeaderTextSplitter is a specialized text splitter in LangChain designed for splitting HTML documents into meaningful chunks based on HTML header tags.

-> Instead of splitting text purely by character count or separators, it uses the structure of the HTML document, relying on the header tags to organize and group related content.

-> How It Works
   -> It parses the HTML content.
   -> It identifies the header tags we pass to the splitter (h1, h2, and h3 in the code above).
   -> It groups the text that follows each header into a chunk and records the enclosing headers in that chunk's metadata. In the output above, each header's own text also appears as a separate chunk.
   -> We can configure which headers to consider and how deep the hierarchy goes.
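
The steps above can be sketched with Python's built-in `html.parser`. This is a minimal illustration under stated assumptions, not LangChain's implementation: `HeaderChunker` and `split_html_by_headers` are hypothetical names, only text sitting directly inside `<p>` tags is collected, and header text is not emitted as its own chunk.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Collect paragraph text and tag it with the headers it sits under."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}          # header tag -> text of the current header
        self.current_tag = None   # tag whose text is being read
        self.chunks = []          # list of (text, metadata) pairs

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        tag = self.current_tag
        if tag in self.labels:
            # A new header replaces headers at the same or a deeper level.
            level = int(tag[1])
            self.active = {t: v for t, v in self.active.items() if int(t[1]) < level}
            self.active[tag] = text
        elif tag == "p":
            metadata = {self.labels[t]: v for t, v in self.active.items()}
            self.chunks.append((text, metadata))


def split_html_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample_html = """
<h1>Welcome to Telangana</h1>
<p>Telangana is a state in southern India.</p>
<h2>Culture and Heritage</h2>
<p>Telangana is famous for its rich heritage.</p>
"""
for text, metadata in split_html_by_headers(sample_html, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(metadata, "->", text)
```

Each paragraph inherits the headers that are active when it is read, which is what makes the metadata useful for section-level filtering later.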

### Explain the benefits of using this for parsing web documents.

-> Preserves Semantic Structure
   -> HTML documents often follow a hierarchical structure using header tags.
   -> HTMLHeaderTextSplitter retains this structure in each chunk.
   -> Each header and its related text become a logically grouped chunk.

-> Improves Context for Retrieval (RAG/NLP Systems)
   -> When we chunk based on headers, each piece contains meaningful context tied to a topic or subtopic.
   -> This makes search or retrieval in RAG (Retrieval-Augmented Generation) systems more accurate.

-> Cleaner Metadata for Filtering
   -> The splitter automatically adds header metadata (e.g., {'Header 1': 'Welcome to Telangana', 'Header 2': 'Culture and Heritage'}).
   -> This allows filtering results or answers based on sections, like “only from the 'Culture and Heritage' section”.

-> Avoids Arbitrary Breaks
   -> Other splitters (character-based, sentence-based) may split chunks mid-topic, harming coherence.
   -> HTMLHeaderTextSplitter respects natural topic boundaries, improving the usefulness of each chunk.

-> Ideal for Technical/Documentation Sites
   -> Websites like MDN, Wikipedia, ReadTheDocs, blogs, etc., use HTML headers heavily.
   -> We can extract each section like "Overview", "Usage", "Examples", etc., as separate, clean chunks.

-> Enables Hierarchical Search and Summarization
   -> We can perform search or summarization within specific sections.
   -> Great for FAQ bots, documentation assistants, or chapter-wise summarizers.
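
As a small illustration of the metadata filtering benefit listed above, the sketch below filters chunks by section. Plain dictionaries stand in for LangChain `Document` objects, and `filter_by_header` is a hypothetical helper. The two sample chunks are copied from the notebook output.

```python
# Plain dicts standing in for Document objects (page_content + metadata).
chunks = [
    {
        "page_content": "Telangana is a state in southern India. Hyderabad is its capital city.",
        "metadata": {"Header 1": "Welcome to Telangana"},
    },
    {
        "page_content": ("Telangana is famous for its rich heritage, cultural festivals, "
                         "iconic Charminar, and delicious Hyderabadi Biryani."),
        "metadata": {"Header 1": "Welcome to Telangana", "Header 2": "Culture and Heritage"},
    },
]


def filter_by_header(chunks, header_key, header_value):
    """Keep only chunks whose metadata places them under the given header."""
    return [c for c in chunks if c["metadata"].get(header_key) == header_value]


for chunk in filter_by_header(chunks, "Header 2", "Culture and Heritage"):
    print(chunk["page_content"])
```

With real `Document` objects the same check would read `doc.metadata.get(header_key)` instead of the dictionary lookup.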

In [34]:
# Example of calling a chat model through the OpenAI-compatible client
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables (including API_KEY) from the .env file
load_dotenv()

# Initialize the OpenAI client, pointed at SambaNova's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url="https://api.sambanova.ai/v1",  # SambaNova's endpoint
)

# Create a chat completion request
response = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What are some famous places to visit in Telangana?"}
    ],
)

# Print the text of the first choice in the response
print(response.choices[0].message.content)

Telangana, a state in southern India, is known for its rich history, cultural heritage, and natural beauty. Here are some famous places to visit in Telangana:

1. **Charminar**: A iconic monument and symbol of Hyderabad, the capital city of Telangana. It's a beautiful example of Indo-Islamic architecture.
2. **Golconda Fort**: A historic fort located in Hyderabad, known for its impressive architecture, acoustic effects, and stunning views of the city.
3. **Hussain Sagar Lake**: A large artificial lake in Hyderabad, popular for boating, water sports, and a giant Buddha statue in the middle of the lake.
4. **Birla Mandir**: A beautiful Hindu temple located on a hill in Hyderabad, offering stunning views of the city and a peaceful atmosphere.
5. **Warangal Fort**: A historic fort located in Warangal, known for its impressive architecture, intricate carvings, and rich history.
6. **Kakatiya Thermal Power Station**: A popular tourist destination in Warangal, offering a glimpse into the stat

In [35]:
# Example of requesting a response in Telugu through the same OpenAI-compatible client
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables (including API_KEY) from the .env file
load_dotenv()

# Initialize the OpenAI client, pointed at SambaNova's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url="https://api.sambanova.ai/v1",  # SambaNova's endpoint
)

# Create a chat completion request
response = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What are some famous places to visit in Telangana in telugu and also give an explanation about it in Telugu?"}
    ],
)

# Print the text of the first choice in the response
print(response.choices[0].message.content)

తెలంగాణలో పర్యాటక ప్రదేశాలు:

1. **చార్మినార్**: హైదరాబాద్‌లోని ప్రసిద్ధ చారిత్రక నిర్మాణం. ఈ మసీదు 1591లో మహమ్మద్ కులీ కుతుబ్ షా నిర్మించాడు. దీని నిర్మాణం ప్రత్యేకమైనది, నాలుగు మీనార్‌లతో ఉంటుంది.

2. **గోల్కొండ కోట**: హైదరాబాద్‌లోని చారిత్రక కోట. ఈ కోట 16వ శతాబ్దంలో కుతుబ్ షాహీ రాజులు నిర్మించారు. దీని రక్షణ గోడలు, బురుజులు ప్రసిద్ధి చెందాయి.

3. **బిర్లా మందిర్**: హైదరాబాద్‌లోని ఒక ప్రసిద్ధ దేవాలయం. ఈ దేవాలయం 1976లో బిర్లా కుటుంబం నిర్మించింది. దీని వాస్తుశిల్పం ప్రత్యేకమైనది.

4. **హుస్సేన్ సాగర్**: హైదరాబాద్‌లోని ఒక ప్రసిద్ధ సరస్సు. ఈ సరస్సు 1562లో ఇబ్రహీం కుతుబ్ షా నిర్మించాడు. దీని మధ్యలో బుద్ధ విగ్రహం ఉంది.

5. **రామోజీ ఫిల్మ్ సిటీ**: హైదరాబాద్‌లోని ఒక ప్రసిద్ధ సినీ నగరం. ఈ సినీ నగరం 1996లో రామోజీ రావు స్థాపించాడు. దీనిలో వివిధ రకాల సెట్లు, థీమ్ పార్కులు ఉన్నాయి.

6. **సాలార్‌జంగ్ మ్యూజియం**: హైదరాబాద్‌లోని ఒక ప్రసిద్ధ మ్యూజియం. ఈ మ్యూజియం 1968లో స్థాపించబడింది. దీనిలో వివిధ రకాల కళాఖండాలు, చారిత్రక వస్తువులు ఉన్నాయి.

7. **నెహ్రూ జంతుప్రదర్శనశాల**: హైదరాబాద్‌లోని ఒక ప్రసిద్ధ 

### Code Explanation

### Explain how the Chat API processes your requests

1. We Send a Request
   -> We send an HTTP request using a Python client (openai.OpenAI) with parameters such as:
      -> model: which model to use (e.g., "Meta-Llama-3.3-70B-Instruct")
      -> messages: the chat history (user, assistant roles)

2. The Request Hits the SambaNova Endpoint
   -> Because base_url points to https://api.sambanova.ai/v1, the request goes to SambaNova's hosted API. It accepts the same request format as OpenAI's API but serves open-weight models such as the Llama model used here.

3. Model Selection & Tokenization
   -> The API checks the requested model name (Meta-Llama-3.3-70B-Instruct) and routes the request to that model.
   -> The input message is tokenized (converted to subword tokens) using the model's tokenizer.
   -> Example: Telugu characters, punctuation, etc., are converted into a sequence of numeric token IDs.

4. Inference (Generating a Response)
   -> The model generates a response token by token, choosing each next token from the probabilities it predicts.
   -> This process is controlled by optional settings such as:
      -> temperature (controls randomness)
      -> max_tokens (limit on output size)
   -> The code above sets neither, so the API's defaults apply.
   -> Internally, this runs on accelerator hardware managed by SambaNova.

5. Detokenization & Packaging
   -> Once the model finishes generating, the token IDs are detokenized into a human-readable string (Telugu in the second example).
   -> The API wraps this into a structured JSON response.

6. Client Receives the Response
   -> Our Python code receives this response, and we extract the generated text with response.choices[0].message.content.
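
To make the request and response shapes concrete, the sketch below builds a request body and reads the reply out of a response body without calling any API. The `temperature` and `max_tokens` values are illustrative assumptions (the notebook calls do not set them), the response body is a hypothetical, heavily abbreviated placeholder, and `extract_reply` is a hypothetical helper.

```python
import json

# Request body in the chat-completions format used by the calls above.
request_payload = {
    "model": "Meta-Llama-3.3-70B-Instruct",
    "messages": [
        {"role": "user", "content": "What are some famous places to visit in Telangana?"}
    ],
    # Optional sampling settings; illustrative values, not set in the notebook.
    "temperature": 0.7,
    "max_tokens": 256,
}

# Hypothetical, abbreviated response body; a real one carries more fields.
raw_response = json.dumps({
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<generated text>"},
            "finish_reason": "stop",
        }
    ]
})


def extract_reply(raw_json):
    """Return the assistant text of the first choice."""
    data = json.loads(raw_json)
    return data["choices"][0]["message"]["content"]


print(json.dumps(request_payload, indent=2))
print(extract_reply(raw_response))
```

The Python client does this parsing for us, which is why the notebook can use attribute access (`response.choices[0].message.content`) rather than dictionary lookups.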


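### (Bonus): Suggest a project idea where text splitting and the OpenAI API can be combined for document analysis.

-> Smart Document Summarizer and Insight Extractor: split a long document into chunks, summarize each chunk with the chat API, then combine the partial summaries into one overview.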
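
One possible shape for such a summarizer is sketched below. All names (`split_into_chunks`, `summarize_document`, the `chat` callable) are hypothetical, the word-based chunker is a stand-in for a LangChain splitter, and the prompts are illustrative. The model call is passed in as a function, so the pipeline can be tried without an API key.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 500) -> List[str]:
    """Greedy word-based chunking; a stand-in for a LangChain text splitter."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_document(text: str, chat: Callable[[str], str], chunk_size: int = 500) -> str:
    """Summarize each chunk, then combine the partial summaries."""
    partial = [
        chat(f"Summarize this passage in one sentence:\n\n{chunk}")
        for chunk in split_into_chunks(text, chunk_size)
    ]
    if not partial:
        return ""
    if len(partial) == 1:
        return partial[0]
    return chat("Combine these partial summaries into one short summary:\n\n" + "\n".join(partial))


# With the notebook's client, `chat` could be wired up roughly like this:
# def chat(prompt):
#     response = client.chat.completions.create(
#         model="Meta-Llama-3.3-70B-Instruct",
#         messages=[{"role": "user", "content": prompt}],
#     )
#     return response.choices[0].message.content
```

Passing `chat` as a parameter keeps the splitting and combining logic independent of any one provider, and makes it easy to swap in a stub while testing.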