#Nihar Lohar
#BTech AI
Generative AI

Lab 9


# Aim: 
To implement a Retrieval-Augmented Generation (RAG) pipeline using a vector database, with emphasis on chunking techniques and the HNSW algorithm for approximate nearest-neighbor search.

Task 1: Dataset Preparation

1. Select a text dataset (e.g., 2–3 Wikipedia articles, course notes, or research papers).

2. Apply different chunking strategies:

a. Fixed-size Chunking

i. Split text into chunks of N tokens/characters.

ii. Simple but may break sentences mid-way.

iii. Example: 200–300 tokens per chunk.

b. Sentence-based Chunking

i. Split text into individual sentences or groups of 2–3 sentences.

ii. Maintains semantic meaning per chunk.

iii. Good for FAQs or structured text.

c. Paragraph-based Chunking (Semantic Chunking)

i. Split based on paragraph boundaries.

ii. Preserves natural structure and coherence.

iii. Best for articles, blogs, or research papers.

d. Sliding Window (Overlapping Chunks)

i. Use overlaps between chunks (e.g., 200 tokens with 50-token overlap).

ii. Ensures that information spanning chunk boundaries isn’t lost.

iii. Useful in tasks like Q&A where context may cross chunk boundaries.

e. Heading-based Chunking

i. Split text by document headings / subheadings.


ii. Works well with structured data (textbooks, research papers, Wikipedia).

iii. Ensures each chunk represents a logical section.

f. Semantic/Embedding-based Chunking

i. Use sentence embeddings + similarity to merge closely related sentences into one chunk.

ii. Requires clustering or similarity threshold.

iii. Produces contextually meaningful chunks.

g. Hybrid Chunking

i. Combine methods (e.g., fixed-size with overlap + heading detection).

ii. Gives balance between coverage and coherence.

3. Store the resulting chunks and note their average length.

4. Compare with sample queries:

i. Which strategy retrieves more relevant chunks?

ii. Which one leads to fewer “broken thoughts”?
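The sliding-window strategy in 2(d) comes down to simple arithmetic: each window starts `chunk_size - overlap` tokens after the previous one. The sketch below is a minimal illustration of that arithmetic, not the implementation used later in this notebook. The helper name `window_spans` is hypothetical, and the 930-token length is the word count implied by the fixed-size output further down (3 × 250 + 180).

```python
from typing import List, Tuple


def window_spans(n_tokens: int, size: int = 200, overlap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows that cover n_tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each window advances
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the text is fully covered, stop here
            break
        start += step
    return spans


spans = window_spans(930, size=200, overlap=50)
print(len(spans), spans[0], spans[-1])
```

Stopping as soon as a window reaches the end of the text avoids emitting a short trailing window whose content is already inside the previous one.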

Task 2: Embedding & Vector Storage

1. Use a pre-trained embedding model (all-MiniLM-L6-v2 or similar).

2. Store embeddings in a vector database (Chroma / FAISS / Weaviate).

3. Ensure the DB uses an HNSW index.

4. Insert chunks with metadata (chunk id, source text).

Task 3: Retrieval with HNSW

1. Implement top-k retrieval using HNSW search.

2. Compare results for different efSearch (query-time search breadth) values (e.g., 20, 100, 200).

3. Build indexes with efConstruction (indexing quality) = 50, 200, 500 and compare retrieval results.

4. Try M (max connections per node) = 8, 16, 32 and compare accuracy vs memory usage.

5. Query example: “What is the role of HNSW in vector search?”

6. Observe how different chunking strategies affect retrieved results.
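One way to make the efSearch, efConstruction, and M comparisons in steps 2–4 measurable is recall@k: the fraction of the exact (brute-force) top-k neighbours that the HNSW search also returns. The sketch below is a minimal, pure-Python illustration under stated assumptions. The function names are hypothetical, the toy vectors are made up, and `approx` stands in for the ids an HNSW search would return. The memory figure follows the commonly quoted rule of thumb for a flat HNSW index of roughly `d * 4 + M * 2 * 4` bytes per vector; treat it as an estimate rather than an exact measurement.

```python
from typing import List, Sequence


def exact_top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Brute-force ground truth: ids of the k vectors closest to query (squared L2)."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in scored[:k]]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact neighbours that also appear in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


def hnsw_flat_bytes_per_vector(dim: int, M: int) -> int:
    """Rough estimate: float32 vector storage plus 2*M four-byte neighbour links."""
    return dim * 4 + M * 2 * 4


# Toy data (hypothetical values, not real embeddings).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query = [0.1, 0.0]
truth = exact_top_k(query, vectors, k=2)
approx = [0, 2]  # stand-in for the ids returned by an HNSW search
print("recall@2:", recall_at_k(approx, truth))

for M in (8, 16, 32):
    print("M =", M, "->", hnsw_flat_bytes_per_vector(384, M), "bytes per vector (estimate)")
```

Running the same queries at each efSearch value and averaging recall@k gives a single number per setting, which is easier to compare than inspecting retrieved chunks by eye.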

Task 4: RAG Pipeline

1. Implement a simple RAG loop:

a. Take a user query.

b. Retrieve top-k chunks using HNSW.

c. Pass retrieved context + query into an LLM (Gemini, Mistral or any other free API model).

d. Display generated answer.

In [None]:
pip install numpy
pip install nltk
pip install sentence-transformers
pip install faiss-cpu
pip install requests


In [None]:
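# Optional: GPU build of FAISS. Install it instead of faiss-cpu, not alongside it,
# since both packages provide the same `faiss` module.
pip install faiss-gpu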


In [None]:
import nltk
nltk.download('punkt')


In [28]:
import os
import re
import time
from dataclasses import dataclass
from collections import defaultdict
from typing import List, Dict, Tuple, Any

import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
import faiss
import requests

In [29]:
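# sent_tokenize needs the Punkt sentence tokenizer; newer NLTK releases
# load it from the 'punkt_tab' resource, so check for both.
for resource in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{resource}")
    except LookupError:
        nltk.download(resource)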


In [30]:
@dataclass
class Chunk:
    id: str
    text: str
    source: str
    chunk_type: str
    start_char: int
    end_char: int
    token_count: int

In [31]:
# -----------------------------
# Chunking Strategies
# -----------------------------
class TextChunker:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    def count_tokens(self, text: str) -> int:
        return len(text.split())

    def fixed_size_chunking(self, text: str, chunk_size: int = 250, source: str = "") -> List[Chunk]:
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size):
            chunk_words = words[i:i + chunk_size]
            chunks.append(Chunk(
                id=f"fixed_{source}_{i//chunk_size}",
                text=" ".join(chunk_words),
                source=source,
                chunk_type="fixed_size",
                start_char=len(" ".join(words[:i])),
                end_char=len(" ".join(words[:i + len(chunk_words)])),
                token_count=len(chunk_words)
            ))
        return chunks

    def sentence_based_chunking(self, text: str, sentences_per_chunk: int = 3, source: str = "") -> List[Chunk]:
        sentences = sent_tokenize(text)
        chunks = []
        for i in range(0, len(sentences), sentences_per_chunk):
            chunk_sentences = sentences[i:i+sentences_per_chunk]
            chunks.append(Chunk(
                id=f"sentence_{source}_{i//sentences_per_chunk}",
                text=" ".join(chunk_sentences),
                source=source,
                chunk_type="sentence_based",
                start_char=0,
                end_char=len(" ".join(chunk_sentences)),
                token_count=self.count_tokens(" ".join(chunk_sentences))
            ))
        return chunks

    def paragraph_based_chunking(self, text: str, source: str = "") -> List[Chunk]:
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        chunks = []
        for i, paragraph in enumerate(paragraphs):
            if len(paragraph) > 50:
                chunks.append(Chunk(
                    id=f"paragraph_{source}_{i}",
                    text=paragraph,
                    source=source,
                    chunk_type="paragraph_based",
                    start_char=0,
                    end_char=len(paragraph),
                    token_count=self.count_tokens(paragraph)
                ))
        return chunks

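    def sliding_window_chunking(self, text: str, chunk_size: int = 200, overlap: int = 50, source: str = "") -> List[Chunk]:
        # A step of zero or less would make range() fail or never advance.
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        words = text.split()
        step = chunk_size - overlap
        chunks = []
        for i in range(0, len(words), step):
            chunk_words = words[i:i+chunk_size]
            # A trailing window no longer than the overlap is already fully contained
            # in the previous window. The first window is always kept so that short
            # documents still produce a chunk.
            if i > 0 and len(chunk_words) <= overlap:
                break
            chunks.append(Chunk(
                id=f"sliding_{source}_{i//step}",
                text=" ".join(chunk_words),
                source=source,
                chunk_type="sliding_window",
                start_char=0,
                end_char=len(" ".join(chunk_words)),
                token_count=len(chunk_words)
            ))
        return chunks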

    def heading_based_chunking(self, text: str, source: str = "") -> List[Chunk]:
        sections = re.split(r'\n(?=#+\s)', text)
        chunks = []
        for i, section in enumerate(sections):
            section = section.strip()
            if len(section) > 100:
                chunks.append(Chunk(
                    id=f"heading_{source}_{i}",
                    text=section,
                    source=source,
                    chunk_type="heading_based",
                    start_char=0,
                    end_char=len(section),
                    token_count=self.count_tokens(section)
                ))
        return chunks

    def semantic_chunking(self, text: str, similarity_threshold: float = 0.7, source: str = "") -> List[Chunk]:
        sentences = sent_tokenize(text)
        if len(sentences) < 2:
            return [Chunk(
                id=f"semantic_{source}_0",
                text=text,
                source=source,
                chunk_type="semantic",
                start_char=0,
                end_char=len(text),
                token_count=self.count_tokens(text)
            )]
        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk_sentences = [sentences[0]]
        current_embedding = embeddings[0:1]
        for i in range(1, len(sentences)):
            centroid = np.mean(current_embedding, axis=0)
            similarity = np.dot(centroid, embeddings[i]) / (np.linalg.norm(centroid)*np.linalg.norm(embeddings[i]))
            if similarity > similarity_threshold:
                current_chunk_sentences.append(sentences[i])
                current_embedding = np.vstack([current_embedding, embeddings[i:i+1]])
            else:
                chunks.append(Chunk(
                    id=f"semantic_{source}_{len(chunks)}",
                    text=" ".join(current_chunk_sentences),
                    source=source,
                    chunk_type="semantic",
                    start_char=0,
                    end_char=len(" ".join(current_chunk_sentences)),
                    token_count=self.count_tokens(" ".join(current_chunk_sentences))
                ))
                current_chunk_sentences = [sentences[i]]
                current_embedding = embeddings[i:i+1]
        if current_chunk_sentences:
            chunks.append(Chunk(
                id=f"semantic_{source}_{len(chunks)}",
                text=" ".join(current_chunk_sentences),
                source=source,
                chunk_type="semantic",
                start_char=0,
                end_char=len(" ".join(current_chunk_sentences)),
                token_count=self.count_tokens(" ".join(current_chunk_sentences))
            ))
        return chunks

    def hybrid_chunking(self, text: str, source: str = "") -> List[Chunk]:
        heading_chunks = self.heading_based_chunking(text, source)
        if len(heading_chunks) > 1:
            return heading_chunks
        else:
            return self.sliding_window_chunking(text, source=source)




In [32]:
# -----------------------------
# Vector Store + HNSW
# -----------------------------
class VectorStore:
    def __init__(self, embedding_dim: int = 384):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.embedding_dim = embedding_dim
        self.chunks: List[Chunk] = []
        self.index = None

    def create_hnsw_index(self, M: int = 16, ef_construction: int = 200):
        self.index = faiss.IndexHNSWFlat(self.embedding_dim, M)
        self.index.hnsw.efConstruction = ef_construction

    def add_chunks(self, chunks: List[Chunk]):
        if not chunks:
            return
        self.chunks.extend(chunks)
        texts = [chunk.text for chunk in chunks]
        embeddings = self.embedding_model.encode(texts).astype('float32')
        if self.index is None:
            self.create_hnsw_index()
        self.index.add(embeddings)

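    def search(self, query: str, k: int = 5, ef_search: int = 50) -> List[Tuple[Chunk, float]]:
        if self.index is None or len(self.chunks) == 0:
            return []
        self.index.hnsw.efSearch = ef_search
        query_emb = self.embedding_model.encode([query]).astype('float32')
        # IndexHNSWFlat uses the L2 metric by default, so these "scores" are
        # squared distances: smaller means more similar.
        scores, indices = self.index.search(query_emb, k)
        # FAISS pads missing results with index -1; skip those instead of letting
        # -1 silently select the last chunk.
        results = [(self.chunks[i], float(score)) for i, score in zip(indices[0], scores[0]) if 0 <= i < len(self.chunks)]
        return results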

    def get_stats(self) -> Dict[str, Any]:
        token_counts = [c.token_count for c in self.chunks]
        chunk_types = defaultdict(int)
        for c in self.chunks:
            chunk_types[c.chunk_type] += 1
        return {
            "total_chunks": len(self.chunks),
            "chunk_types": dict(chunk_types),
            "avg_tokens": np.mean(token_counts) if token_counts else 0,
            "min_tokens": np.min(token_counts) if token_counts else 0,
            "max_tokens": np.max(token_counts) if token_counts else 0
        }
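The values returned by `VectorStore.search` are squared L2 distances, which are harder to read than cosine similarities. If the embeddings are unit-length, the two are related by cos = 1 − d²/2. The sketch below illustrates that conversion on toy vectors. It assumes unit-normalised embeddings, which is worth verifying for the chosen model (for example by checking the vector norms) before relying on it. The helper names are hypothetical.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, the quantity an L2 index reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(squared_distance: float) -> float:
    """Cosine similarity of two UNIT vectors, given their squared L2 distance.

    For unit vectors: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 cos, so cos = 1 - d^2 / 2.
    """
    return 1.0 - squared_distance / 2.0


# Two unit-length toy vectors (hypothetical values, not real embeddings).
a = (1.0, 0.0)
b = (0.6, 0.8)
d2 = squared_l2(a, b)
print("squared L2:", round(d2, 3), "cosine:", round(cosine_from_squared_l2(d2), 3))
```

With this conversion, a distance of 0 maps to a similarity of 1 (identical direction) and a distance of 2 maps to 0 (orthogonal vectors).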





In [33]:
# -----------------------------
# LLM Client
# -----------------------------
class LLMClient:
    def __init__(self, api_key: str = None):
        self.model_name = 'qwen/qwen3-32b'
        self.api_key = api_key or os.environ.get("GROQ_API_KEY")

    def set_api_key_interactive(self):
        import getpass
        try:
            self.api_key = getpass.getpass("Enter your GROQ API Key: ")
        except Exception:
            self.api_key = input("Enter your GROQ API Key: ")

    def call_llm(self, prompt: str) -> str:
        import requests
        if not self.api_key:
            return "Error: API key not set. Please call set_api_key_interactive() first."
        headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
        data = {
            "model": self.model_name,
            "messages": [
                {"role": "system", "content": "You are a helpful assistant. Answer the question using the provided context."},
                {"role": "user", "content": prompt}
            ],
            "max_tokens": 1000,
            "temperature": 0.7
        }
        try:
            response = requests.post(
                "https://api.groq.com/openai/v1/chat/completions",
                headers=headers,
                json=data,
                timeout=30
            )
            response.raise_for_status()
            return response.json()["choices"][0]["message"]["content"]
        except Exception as e:
            return f"LLM API Error: {e}"


In [34]:
llm = LLMClient()
llm.set_api_key_interactive()


Enter your GROQ API Key: ··········


In [35]:
# -----------------------------
# RAG Pipeline
# -----------------------------
class RAGPipeline:
    def __init__(self):
        self.chunker = TextChunker()
        self.vector_stores: Dict[str, VectorStore] = {}
        self.llm = LLMClient()

    def process_documents(self, docs: Dict[str, str]):
        methods = {
            "fixed_size": self.chunker.fixed_size_chunking,
            "sentence_based": self.chunker.sentence_based_chunking,
            "paragraph_based": self.chunker.paragraph_based_chunking,
            "sliding_window": self.chunker.sliding_window_chunking,
            "heading_based": self.chunker.heading_based_chunking,
            "semantic": self.chunker.semantic_chunking,
            "hybrid": self.chunker.hybrid_chunking
        }
        for name, func in methods.items():
            print(f"\nProcessing: {name}")
            store = VectorStore()
            all_chunks = []
            for source, text in docs.items():
                chunks = func(text, source=source)
                all_chunks.extend(chunks)
            store.add_chunks(all_chunks)
            stats = store.get_stats()
            print(f"  {stats['total_chunks']} chunks, avg tokens: {stats['avg_tokens']:.1f}")
            for chunk in store.chunks:
                preview = chunk.text[:100] + "..." if len(chunk.text) > 100 else chunk.text
                print(f"    {chunk.id} | Tokens: {chunk.token_count} | Preview: {preview}")
            self.vector_stores[name] = store

    def rag_loop(self, method: str = 'semantic', k: int = 3):
        print(f"\nStarting RAG loop with '{method}' chunking")
        while True:
            query = input(" Query (type 'quit' to exit): ").strip()
            if query.lower() == 'quit':
                break
            if method not in self.vector_stores:
                print(f"Method {method} not available")
                continue
            store = self.vector_stores[method]
            retrieved = store.search(query, k=k)
            if not retrieved:
                print("No relevant chunks found")
                continue
            context = "\n\n".join([f"{i+1}. {chunk.text}" for i, (chunk, _) in enumerate(retrieved)])
            prompt = f"CONTEXT:\n{context}\n\nQUESTION: {query}\nANSWER:"
            answer = self.llm.call_llm(prompt)
            print("\n Answer:")
            print(answer)
            print("\n Retrieved Chunks:")
            for i, (chunk, score) in enumerate(retrieved):
                preview = chunk.text[:150] + "..." if len(chunk.text) > 150 else chunk.text
                print(f"{i+1}. L2 distance (lower is closer): {score:.3f}, Tokens: {chunk.token_count}, Preview: {preview}")
            print("-"*60)

In [38]:
llm = LLMClient()
llm.set_api_key_interactive()

rag = RAGPipeline()
rag.llm = llm


Enter your GROQ API Key: ··········


In [41]:
if __name__ == "__main__":
    docs = {
        "ai_doc": """

        # Artificial Intelligence

Artificial Intelligence (AI) is a multidisciplinary field of computer science and engineering focused on creating systems and machines capable of performing tasks that typically require human intelligence. These tasks include reasoning, learning, problem-solving, perception, language understanding, decision-making, and interaction with the environment. AI systems leverage large amounts of data, advanced algorithms, and computational power to mimic cognitive functions such as pattern recognition, planning, and adaptation.

AI is applied across a wide range of domains including healthcare, finance, education, transportation, entertainment, and scientific research. The ultimate goal of AI is to build machines that can perform complex tasks autonomously and intelligently, while also augmenting human abilities.

## Machine Learning

Machine learning (ML) is a core subset of AI that focuses on creating algorithms that allow computers to learn from data and improve their performance over time without being explicitly programmed. ML techniques are broadly categorized into:

- **Supervised Learning**: Algorithms learn from labeled datasets to make predictions or classify data. Common applications include spam detection, fraud detection, and medical diagnosis.
- **Unsupervised Learning**: Algorithms identify patterns or structures in unlabeled data. Examples include clustering customers based on purchasing behavior or anomaly detection in network security.
- **Reinforcement Learning**: Agents learn by interacting with an environment and receiving feedback in the form of rewards or penalties. This approach is widely used in robotics, game AI, and autonomous driving.
- **Deep Learning**: A subfield of ML based on neural networks with many layers (deep neural networks), capable of learning complex representations from large-scale data. It powers modern AI applications like speech recognition, image classification, and natural language understanding.

Machine learning models are trained using optimization techniques to minimize errors and maximize predictive accuracy. Key challenges include overfitting, underfitting, interpretability, and scalability.

## Natural Language Processing

Natural Language Processing (NLP) is a branch of AI that enables computers to understand, interpret, and generate human language. NLP combines computational linguistics, machine learning, and statistical methods to process text, speech, and other forms of language data. Common applications include:

- **Chatbots and Virtual Assistants**: AI-powered conversational agents like virtual assistants, customer service bots, and interactive agents.
- **Sentiment Analysis**: Determining opinions or emotions in text for marketing, social media monitoring, and customer feedback.
- **Language Translation**: Translating text or speech between languages using AI models like neural machine translation.
- **Text Summarization**: Automatically creating concise summaries of large documents or articles.
- **Speech Recognition and Generation**: Converting spoken language to text and vice versa for accessibility, transcription, and interactive applications.

NLP also involves challenges such as understanding context, idioms, sarcasm, and ambiguity in human language, as well as addressing biases present in language data.

## Robotics

Robotics is an interdisciplinary field that integrates AI with mechanical, electrical, and software engineering to create autonomous and semi-autonomous machines. Robotics systems can sense their environment, process information, make decisions, and perform physical actions. Key areas include:

- **Industrial Robotics**: Robots used in manufacturing, assembly lines, packaging, and precision tasks.
- **Service Robotics**: Robots designed for healthcare, domestic help, hospitality, and customer service.
- **Autonomous Vehicles**: Self-driving cars, drones, and delivery robots that navigate without human intervention.
- **Humanoid Robots**: Robots designed to resemble humans and interact in human-like ways.
- **Medical Robotics**: Surgical robots, rehabilitation systems, and assistive devices for patients.

Challenges in robotics include real-time perception, mobility in unstructured environments, manipulation of objects, human-robot interaction, and safety.

## Computer Vision

Computer vision is a branch of AI focused on enabling machines to interpret, understand, and respond to visual data from the world. It combines image processing, pattern recognition, and machine learning techniques to extract meaningful information from images and videos. Common tasks and applications include:

- **Object Detection**: Identifying and localizing objects within images or videos.
- **Image Classification**: Categorizing images into predefined classes for recognition purposes.
- **Facial Recognition**: Identifying or verifying individuals based on facial features.
- **Medical Imaging Analysis**: Detecting diseases and anomalies in X-rays, MRIs, and CT scans.
- **Autonomous Navigation**: Allowing vehicles and robots to perceive their surroundings for safe navigation.
- **Augmented Reality and Computer Graphics**: Enhancing real-world environments with computer-generated imagery.

Computer vision systems face challenges such as variations in lighting, occlusion, background clutter, and the need for large annotated datasets.

## Ethics and Bias

Ethics in AI is a critical aspect that ensures AI technologies are developed and deployed responsibly. Ethical considerations include:

- **Fairness**: Avoiding discrimination and ensuring equitable treatment across different groups of people.
- **Transparency**: Making AI systems understandable and explainable so users can trust their decisions.
- **Privacy**: Protecting sensitive personal data and preventing misuse.
- **Bias Mitigation**: Identifying and reducing biases in training data and AI models.
- **Accountability**: Determining responsibility for AI-driven decisions and actions.
- **Safety and Security**: Preventing malicious use of AI and ensuring reliability in critical applications.

Ethical AI requires collaboration between technologists, policymakers, ethicists, and society to ensure that AI benefits humanity while minimizing potential harms.

## Future Directions

AI continues to evolve rapidly, with research focused on areas such as:

- **General AI**: Creating machines with human-level cognitive abilities across a broad range of tasks.
- **Explainable AI**: Developing models whose decision-making processes can be understood by humans.
- **AI in Healthcare**: Improving diagnostics, personalized medicine, and drug discovery.
- **AI for Climate and Sustainability**: Leveraging AI for environmental monitoring, resource management, and climate modeling.
- **Human-AI Collaboration**: Designing systems that augment human capabilities rather than replace them.

The future of AI promises to transform industries, enhance human potential, and address global challenges, but it also requires careful consideration of ethical, social, and economic impacts.
...
    """}
    llm = LLMClient()
    llm.set_api_key_interactive()

    rag = RAGPipeline()
    rag.llm = llm

    rag.process_documents(docs)

    rag.rag_loop(method='semantic', k=3)

Enter your GROQ API Key: ··········

Processing: fixed_size
  4 chunks, avg tokens: 232.5
    fixed_ai_doc_0 | Tokens: 250 | Preview: # Artificial Intelligence Artificial Intelligence (AI) is a multidisciplinary field of computer scie...
    fixed_ai_doc_1 | Tokens: 250 | Preview: representations from large-scale data. It powers modern AI applications like speech recognition, ima...
    fixed_ai_doc_2 | Tokens: 250 | Preview: domestic help, hospitality, and customer service. - **Autonomous Vehicles**: Self-driving cars, dron...
    fixed_ai_doc_3 | Tokens: 180 | Preview: trust their decisions. - **Privacy**: Protecting sensitive personal data and preventing misuse. - **...

Processing: sentence_based
  19 chunks, avg tokens: 48.9
    sentence_ai_doc_0 | Tokens: 68 | Preview: 

        # Artificial Intelligence

Artificial Intelligence (AI) is a multidisciplinary field of co...
    sentence_ai_doc_1 | Tokens: 74 | Preview: AI is applied across a wide range of domains including healthcar

#Inference:

Fixed Size:
Very few, large chunks (4 chunks averaging 232.5 tokens here). A chunk may span several topics and can start or end mid-sentence, so the context is mixed, but there are fewer chunks to search.

Sentence Based:
Small, precise chunks of up to three sentences each (19 chunks averaging 48.9 tokens here). Sentences are never split, but a complex query may need several chunks combined.

Paragraph Based:
Each paragraph forms a chunk. Preserves local context while staying fairly small. Good for moderate granularity.

Sliding Window:
Large overlapping chunks. The overlap preserves context across chunk boundaries. Balanced between context and search efficiency.

Heading Based:
Each heading/section forms a chunk. Preserves semantic grouping with a moderate number of chunks, and usually keeps thoughts intact.

Semantic:
Very fine-grained: consecutive sentences are merged only while they stay similar to the current chunk. Most retrieved chunks are highly relevant, but the chunks can be fragmented (single sentences or small clusters).

Hybrid:
Uses heading-based chunks when the document has headings and falls back to a sliding window otherwise. Context-preserving with a reasonable chunk count.




#Which strategy retrieves more relevant chunks:
Semantic chunking retrieved 3 highly relevant chunks (definition, deep learning, optimization). Heading-based and Hybrid retrieve 1–2 chunks covering the main section. Fixed Size and Sliding Window may retrieve large chunks that contain the relevant information plus some noise. Sentence-based and Paragraph-based retrieve many small chunks, which may need to be combined to answer fully.

#Which leads to fewer “broken thoughts”:
Heading-based, Hybrid, and Sliding Window maintain coherent ideas with fewer interruptions. Fixed Size chunks are large enough to preserve context, but they may mix multiple topics and can begin mid-sentence (fixed_ai_doc_1 above starts with "representations from large-scale data"). Sentence-based and Semantic chunking can fragment information into very small pieces, splitting one idea across chunks.
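The "broken thoughts" comparison can be given a rough number instead of being judged by eye. The sketch below is a crude heuristic, not a validated metric: it flags a chunk as broken if it starts with a lowercase letter (likely mid-sentence) or does not end with sentence-final punctuation. The function names and the punctuation set are assumptions, and chunks that legitimately end with a colon or a list item will be over-counted.

```python
from typing import Iterable

SENTENCE_END = ".!?"  # assumed set of sentence-final characters


def is_broken(chunk_text: str) -> bool:
    """Heuristic: True if the chunk appears to start or end mid-sentence."""
    text = chunk_text.strip()
    if not text:
        return False
    starts_mid_sentence = text[0].islower()
    ends_mid_sentence = text[-1] not in SENTENCE_END
    return starts_mid_sentence or ends_mid_sentence


def broken_fraction(chunk_texts: Iterable[str]) -> float:
    """Share of chunks flagged by is_broken (0.0 for an empty collection)."""
    texts = list(chunk_texts)
    if not texts:
        return 0.0
    return sum(is_broken(t) for t in texts) / len(texts)


samples = [
    "representations from large-scale data. It powers modern AI applications.",
    "Robotics is an interdisciplinary field.",
    "Key challenges include overfitting, underfitting, and",
]
print("broken fraction:", round(broken_fraction(samples), 2))
```

Applying `broken_fraction` to the chunk texts of each strategy would give one comparable figure per strategy, alongside the average chunk length.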


#Conclusion:
Use Semantic for high precision and maximum relevance.

Use Heading or Hybrid for coherent chunks and natural flow of ideas.

Use Sliding Window for a balanced approach, retaining context without too much fragmentation.