# IIT Madras BS Chat Bot - RAG Pipeline

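This notebook implements a Retrieval-Augmented Generation (RAG) system that answers questions about the IIT Madras B.S. program. It uses:
- PDF documents (currently `Data/handbook.pdf`)
- Web content from study.iitm.ac.in and bsinsider.in
- Hugging Face sentence embeddings (`all-MiniLM-L6-v2`) stored in a Pinecone vector database
- A cross-encoder re-ranker (`ms-marco-MiniLM-L-6-v2`)
- An OpenAI model (`openai/gpt-4o-mini`) accessed through AI Pipe's OpenRouter endpoint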

## 1. Setup and Imports

In [1]:
import os
from dotenv import load_dotenv

# Change to project root directory
os.chdir('../')

# Load environment variables
load_dotenv()
print("Environment setup complete")

Environment setup complete


In [2]:
# Import required libraries
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import Pinecone as LangchainPinecone
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from sentence_transformers import CrossEncoder
import numpy as np

print("All libraries imported successfully")

USER_AGENT environment variable not set, consider setting it to identify your requests.


All libraries imported successfully
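
The `USER_AGENT` warning in the output above is emitted when the web loader module is imported. A minimal sketch of one way to silence it is to set the variable before the imports in cell 2 run. The identifier string below is a made-up placeholder, not a value used by this project.

```python
import os

# Hypothetical identifier; replace it with a string that describes your own client.
# setdefault keeps any value that is already present in the environment or .env file.
os.environ.setdefault("USER_AGENT", "iitm-bs-chatbot-notebook/0.1")

# Import the web loaders only after this point so the variable is already set.
print(os.environ["USER_AGENT"])
```

Adding `USER_AGENT=...` to the `.env` file read by `load_dotenv()` in cell 1 should have the same effect, since that cell runs before the imports.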


## 2. Data Loading and Processing

In [3]:
# Load PDF documents with error handling (os is already imported in cell 1)
from pathlib import Path

# Check current directory and Data folder
print(f"Current directory: {os.getcwd()}")
data_path = Path("Data")
if data_path.exists():
    print(f"Data folder contents: {list(data_path.glob('*.pdf'))}")
else:
    print("Data folder not found!")

try:
    pdf_path = "Data/handbook.pdf"
    if Path(pdf_path).exists():
        loader = PyPDFLoader(pdf_path)
        pdf_documents = loader.load()
        print(f"Loaded {len(pdf_documents)} pages from PDF")
    else:
        print(f"PDF file not found at {pdf_path}")
        # Try to find any PDF in Data folder
        pdf_files = list(Path("Data").glob("*.pdf"))
        if pdf_files:
            print(f"Found PDFs: {[str(p) for p in pdf_files]}")
            print("Please update the path or add handbook.pdf to the Data folder")
        pdf_documents = []
except Exception as e:
    print(f"Error loading PDF: {e}")
    pdf_documents = []

invalid pdf header: b' %PDF'


Current directory: /home/niloy/IIT-Madras-BS-Chat-Bot
Data folder contents: [PosixPath('Data/handbook.pdf')]
Loaded 50 pages from PDF


In [4]:
# Load web content from multiple pages with error handling
urls = [
    "https://study.iitm.ac.in/ds/academics.html",
    "https://study.iitm.ac.in/ds/admissions.html",
    "https://bsinsider.in/"
]

web_documents = []
for url in urls:
    try:
        web_loader = WebBaseLoader(url)
        docs = web_loader.load()
        web_documents.extend(docs)
        print(f"Loaded {url}")
    except Exception as e:
        print(f"Failed to load {url}: {e}")

print(f"Total web documents: {len(web_documents)}")

Loaded https://study.iitm.ac.in/ds/academics.html
Loaded https://study.iitm.ac.in/ds/admissions.html
Loaded https://bsinsider.in/
Total web documents: 3


In [5]:
# Check for cached chunks
import pickle
from pathlib import Path

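# NOTE: the cache is never invalidated; delete cache/text_chunks.pkl after changing the source documents
cache_file = Path("cache/text_chunks.pkl")
cache_file.parent.mkdir(parents=True, exist_ok=True)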

# Try to load from cache
if cache_file.exists():
    print("Loading chunks from cache...")
    with open(cache_file, "rb") as f:
        text_chunks = pickle.load(f)
    print(f"Loaded {len(text_chunks)} chunks from cache")
else:
    print("Processing documents...")
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    
    # Combine all documents
    all_documents = pdf_documents + web_documents
    
    # Split into chunks
    text_chunks = text_splitter.split_documents(all_documents)
    
    # Save to cache
    with open(cache_file, "wb") as f:
        pickle.dump(text_chunks, f)
    print(f"Processed and cached {len(text_chunks)} chunks")

Loading chunks from cache...
Loaded 192 chunks from cache
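
To illustrate what `chunk_size=1000` and `chunk_overlap=200` mean, here is a simplified sketch of overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper that only shows the size and overlap arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2500-character string, used only to make the window boundaries visible.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

The tail of each window is repeated at the head of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.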


## 3. Embeddings and Vector Database

In [6]:
# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print(f"Embedding model loaded (dimension: {len(embedding_model.embed_query('test'))})")


Embedding model loaded (dimension: 384)


In [7]:
# Initialize Pinecone
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = "iit-madras-bs-chat-bot"

# Check if index exists
existing_indexes = [index.name for index in pc.list_indexes()]
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
    print(f"Created new index: {index_name}")
else:
    print(f"Using existing index: {index_name}")

Using existing index: iit-madras-bs-chat-bot
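
The index is created with `metric="cosine"`, so stored vectors are compared by the angle between them rather than by their length. A minimal pure-Python sketch of that measure follows; the function name is an assumption for illustration, and Pinecone computes this on the server side.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score 1 regardless of magnitude, unrelated ones score 0, and opposite ones score -1.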


In [8]:
# Create or update vector store with caching
try:
    # Try to connect to existing index first
    docsearch = LangchainPinecone.from_existing_index(
        index_name=index_name,
        embedding=embedding_model
    )
    
    # Check if index has data
    index_stats = pc.Index(index_name).describe_index_stats()
    vector_count = index_stats.get('total_vector_count', 0)
    
    if vector_count == 0:
        print("Index exists but is empty. Uploading documents...")
        docsearch = LangchainPinecone.from_documents(
            documents=text_chunks,
            index_name=index_name,
            embedding=embedding_model,
        )
        print(f"Uploaded {len(text_chunks)} chunks to vector store")
    else:
        print(f"Using existing vector store with {vector_count} vectors")
        
except Exception as e:
    print(f"Error with vector store: {e}")
    print("Creating new vector store...")
    docsearch = LangchainPinecone.from_documents(
        documents=text_chunks,
        index_name=index_name,
        embedding=embedding_model,
    )
    print(f"Created vector store with {len(text_chunks)} chunks")

Using existing vector store with 82 vectors
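
The outputs above report 192 cached chunks but 82 vectors in the index, and the cell uploads only when the index is empty. A small check along the lines of the sketch below could flag such a mismatch. `index_looks_stale` is a hypothetical helper, and a difference in counts does not by itself prove the index is wrong; it may simply have been built from an earlier chunking run.

```python
def index_looks_stale(chunk_count: int, vector_count: int) -> bool:
    """Return True when the index holds a different number of vectors than there are chunks."""
    return vector_count != chunk_count


# Counts taken from the notebook outputs above.
stale = index_looks_stale(chunk_count=192, vector_count=82)
print(f"Index may need re-uploading: {stale}")
```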


## 4. RAG Chain Setup

### 4.1 Query Classification

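Classify each question by keyword so it can be routed to a prompt tailored to that question type. Categories are checked in a fixed order (eligibility, comparative, procedural, factual), so a question that matches several keyword lists gets the first match; anything that matches none falls back to `general`.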

In [9]:
# Query classifier to route questions to specialized prompts
def classify_query(question):
    """
    Classify question type:
    - factual: What/When/Where questions (dates, facts, definitions)
    - procedural: How-to questions (steps, processes, procedures)
    - comparative: Comparison questions (differences, alternatives)
    - eligibility: Admission/qualification questions
    """
    question_lower = question.lower()
    
    # Keywords for each category
    factual_keywords = ['what is', 'when is', 'where is', 'who is', 'define', 'meaning']
    procedural_keywords = ['how to', 'how do', 'how can', 'steps to', 'process', 'procedure']
    comparative_keywords = ['difference', 'compare', 'vs', 'versus', 'better', 'between']
    eligibility_keywords = ['eligible', 'qualify', 'admission', 'requirement', 'criteria', 'apply']
    
    # Check keywords in priority order; the first matching category wins (substring match)
    if any(kw in question_lower for kw in eligibility_keywords):
        return 'eligibility'
    elif any(kw in question_lower for kw in comparative_keywords):
        return 'comparative'
    elif any(kw in question_lower for kw in procedural_keywords):
        return 'procedural'
    elif any(kw in question_lower for kw in factual_keywords):
        return 'factual'
    else:
        return 'general'

# Test the classifier
test_questions = [
    "What is the duration of the B.S. program?",
    "How do I apply for admission?",
    "What is the difference between diploma and degree level?",
    "Am I eligible for the B.S. program?"
]

print("Query Classification Examples:")
for q in test_questions:
    print(f"  '{q}' → {classify_query(q)}")

print("\nQuery classifier ready")

Query Classification Examples:
  'What is the duration of the B.S. program?' → factual
  'How do I apply for admission?' → eligibility
  'What is the difference between diploma and degree level?' → comparative
  'Am I eligible for the B.S. program?' → eligibility

Query classifier ready
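
Because `classify_query` uses plain substring checks, a keyword can match inside a longer, unrelated word (for example `define` inside "predefined"). The sketch below shows one possible refinement, assuming a leading word boundary is acceptable: it still lets `requirement` match "requirements" but rejects matches that start mid-word. `contains_keyword` is a hypothetical helper, not part of the notebook.

```python
import re


def contains_keyword(text: str, keywords: list[str]) -> bool:
    """True if any keyword occurs in text starting at a word boundary."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")"
    return re.search(pattern, text.lower()) is not None


factual_keywords = ['what is', 'when is', 'where is', 'who is', 'define', 'meaning']
eligibility_keywords = ['eligible', 'qualify', 'admission', 'requirement', 'criteria', 'apply']

question = "Are there predefined electives?"
substring_hit = any(kw in question.lower() for kw in factual_keywords)  # matches 'define'
boundary_hit = contains_keyword(question, factual_keywords)             # no mid-word match
plural_hit = contains_keyword("What are the requirements?", eligibility_keywords)
print(substring_hit, boundary_hit, plural_hit)
```

A trailing boundary would be stricter still, at the cost of missing plurals and inflected forms such as "qualifying".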


### 4.2 Re-ranking with Cross-Encoder

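The retriever ranks candidates by embedding similarity alone. A cross-encoder then scores each (question, chunk) pair jointly, and only the highest-scoring chunks are passed to the LLM.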

In [10]:
# Initialize cross-encoder for re-ranking
print("Loading cross-encoder model (this may take a moment)...")
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("Cross-encoder model loaded")

def rerank_documents(question, documents, top_k=5):
    """
    Re-rank documents using cross-encoder for better relevance.
    Returns top_k documents sorted by relevance score.
    """
    if not documents:
        return documents
    
    # Create pairs of (question, document_content)
    pairs = [[question, doc.page_content] for doc in documents]
    
    # Get relevance scores
    scores = reranker.predict(pairs)
    
    # Sort documents by score
    doc_score_pairs = list(zip(documents, scores))
    doc_score_pairs.sort(key=lambda x: x[1], reverse=True)
    
    # Return top_k documents
    reranked_docs = [doc for doc, score in doc_score_pairs[:top_k]]
    
    # Print scores for debugging
    print(f"  Re-ranking scores: {[f'{score:.3f}' for _, score in doc_score_pairs[:top_k]]}")
    
    return reranked_docs

print("Re-ranking function ready")

Loading cross-encoder model (this may take a moment)...
Cross-encoder model loaded
Re-ranking function ready
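
The retrieve-then-re-rank pattern can be tried without downloading a model by swapping in a stand-in scorer. The sketch below uses a naive word-overlap score in place of the cross-encoder; `rerank` and `overlap_score` are hypothetical names, and the passages are toy strings unrelated to the program.

```python
def overlap_score(question: str, passage: str) -> int:
    """Stand-in relevance score: number of distinct words shared by both texts."""
    return len(set(question.lower().split()) & set(passage.lower().split()))


def rerank(question, passages, score_fn, top_k=5):
    """Score every passage against the question and keep the top_k (passage, score) pairs."""
    scored = [(p, score_fn(question, p)) for p in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


passages = ["green pear tart", "apple pie", "red apple pie with cinnamon"]
ranked = rerank("red apple pie recipe", passages, overlap_score, top_k=2)
print(ranked)
```

In the notebook the scorer is `reranker.predict`. The scores it prints are not confined to a fixed range (several are negative), so they are best read as a relative ordering rather than as probabilities.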


### 4.3 Specialized Prompts

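Each question type gets its own prompt template. Every template takes `{context}` and `{question}` and tells the model to answer "I don't know" when the context does not contain the answer.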

In [11]:
# Specialized prompts for different question types
PROMPTS = {
    'factual': """You are an expert on the IIT Madras B.S. program. Provide accurate, factual information.

Context: {context}

Question: {question}

Provide a clear, concise answer with specific facts, dates, and definitions. If not in context, say "I don't know".""",
    
    'procedural': """You are a helpful guide for the IIT Madras B.S. program. Explain processes step-by-step.

Context: {context}

Question: {question}

Provide a clear step-by-step answer. Use numbered steps when appropriate. If not in context, say "I don't know".""",
    
    'comparative': """You are an expert on the IIT Madras B.S. program. Compare and contrast clearly.

Context: {context}

Question: {question}

Highlight key differences and similarities. Use bullet points for clarity. If not in context, say "I don't know".""",
    
    'eligibility': """You are an admissions advisor for the IIT Madras B.S. program. Provide clear eligibility guidance.

Context: {context}

Question: {question}

List requirements clearly. Be specific about qualifications, criteria, and any exceptions. If not in context, say "I don't know".""",
    
    'general': """You are a helpful assistant for the IIT Madras B.S. program.

Context: {context}

Question: {question}

Provide accurate, concise information. If not in context, say "I don't know"."""
}

print("Specialized prompts configured")
print(f"  Available prompt types: {list(PROMPTS.keys())}")

Specialized prompts configured
  Available prompt types: ['factual', 'procedural', 'comparative', 'eligibility', 'general']
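
The lookup-with-fallback used later (`PROMPTS.get(query_type, PROMPTS['general'])`) can be sketched with plain string formatting. The two shortened templates and the `build_prompt` helper below are illustrative assumptions; the notebook itself builds the prompt with `ChatPromptTemplate.from_template`.

```python
# Shortened stand-in templates; the notebook's PROMPTS dict has five longer ones.
TEMPLATES = {
    "factual": "Facts only.\nContext: {context}\nQuestion: {question}",
    "general": "Be helpful.\nContext: {context}\nQuestion: {question}",
}


def build_prompt(query_type: str, context: str, question: str) -> str:
    """Pick the template for query_type, falling back to 'general' for unknown types."""
    template = TEMPLATES.get(query_type, TEMPLATES["general"])
    return template.format(context=context, question=question)


known = build_prompt("factual", "some context", "some question")
fallback = build_prompt("unlisted-type", "some context", "some question")
print(fallback)
```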


In [12]:
# Create retriever - retrieve more initially for re-ranking
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k": 10})
print("Retriever configured (retrieving top 10 for re-ranking)")

Retriever configured (retrieving top 10 for re-ranking)


In [13]:
# Initialize LLM (using AI Pipe with OpenRouter)
llm = ChatOpenAI(
    model="openai/gpt-4o-mini",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_base="https://aipipe.org/openrouter/v1",
    temperature=0,
    max_tokens=500
)
print("LLM configured")

LLM configured


In [14]:
# Enhanced RAG function with classification and re-ranking
def format_docs_with_sources(docs):
    """Format documents with source information"""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', 'N/A')
        formatted.append(f"[Source {i}: {source}, Page {page}]\n{doc.page_content}")
    return "\n\n".join(formatted)

def get_answer_with_sources(question, use_reranking=True, use_classification=True):
    """
    Enhanced answer generation with:
    - Query classification for specialized prompts
    - Re-ranking for better relevance
    """
    print(f"\nProcessing: '{question}'")
    
    # Step 1: Classify the question
    if use_classification:
        query_type = classify_query(question)
        print(f"Question type: {query_type}")
        prompt_template = PROMPTS.get(query_type, PROMPTS['general'])
    else:
        query_type = 'general'
        prompt_template = PROMPTS['general']
    
    # Step 2: Retrieve documents
    print("Retrieving documents...")
    docs = retriever.invoke(question)
    print(f"Retrieved {len(docs)} initial documents")
    
    # Step 3: Re-rank documents
    if use_reranking and docs:
        print("Re-ranking documents...")
        docs = rerank_documents(question, docs, top_k=5)
        # Report the actual count; fewer than 5 documents may have been retrieved
        print(f"Using top {len(docs)} re-ranked documents")
    
    # Step 4: Format context
    context = format_docs_with_sources(docs)
    
    # Step 5: Create prompt and get answer
    prompt = ChatPromptTemplate.from_template(prompt_template)
    print("Generating answer...")
    
    answer = (prompt | llm | StrOutputParser()).invoke({
        "context": context,
        "question": question
    })
    
    return {
        "answer": answer,
        "sources": docs,
        "query_type": query_type
    }

print("Enhanced RAG system ready with classification & re-ranking")

Enhanced RAG system ready with classification & re-ranking
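
To see the context string that `format_docs_with_sources` hands to the prompt, the function can be exercised with stand-in documents. `FakeDoc` is a hypothetical substitute for LangChain's document class, and the page number is made up. The second document has no `page` key, which is the likely situation for web pages, so it falls back to `N/A`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Minimal stand-in exposing the two attributes the formatter reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs):
    """Same logic as the notebook function: prefix each chunk with its source and page."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', 'N/A')
        formatted.append(f"[Source {i}: {source}, Page {page}]\n{doc.page_content}")
    return "\n\n".join(formatted)


docs = [
    FakeDoc("PDF text", {"source": "Data/handbook.pdf", "page": 3}),
    FakeDoc("Web text", {"source": "https://study.iitm.ac.in/ds/academics.html"}),
]
context = format_docs_with_sources(docs)
print(context)
```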


## 5. Test the Chatbot

In [15]:
# Helper function to display results
def ask_question(question):
    """Ask a question and display answer with sources"""
    try:
        result = get_answer_with_sources(question)
        print(f"Q: {question}")
        print(f"Type: {result['query_type'].upper()}")
        print(f"\nA: {result['answer']}")
        print("Sources:")
        for i, doc in enumerate(result['sources'][:3], 1):
            source = doc.metadata.get('source', 'Unknown')
            print(f"  {i}. {source}")
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()

# Test query 1: factual question (classified as 'general' because it matches none of the keyword phrases, e.g. 'what is')
ask_question("What courses are offered in the foundation level?")


Processing: 'What courses are offered in the foundation level?'
Question type: general
Retrieving documents...
Retrieved 10 initial documents
Re-ranking documents...
  Re-ranking scores: ['8.537', '6.060', '5.686', '5.299', '3.269']
Using top 5 re-ranked documents
Generating answer...
Q: What courses are offered in the foundation level?
Type: GENERAL

A: The courses offered in the Foundation Level are:

1. Mathematics
2. Statistics
3. Basics of Programming and Python
4. English

There are a total of 8 courses in the Foundation Level, but the specific courses listed above are part of the core offerings.
Sources:
  1. https://study.iitm.ac.in/ds/academics.html
  2. https://study.iitm.ac.in/ds/admissions.html
  3. https://study.iitm.ac.in/ds/academics.html
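
The source list above names `academics.html` twice because two of the top chunks come from the same page. If distinct sources are preferred, order-preserving deduplication is one option. `unique_sources` is a hypothetical helper, not part of the notebook.

```python
def unique_sources(sources: list[str], limit: int = 3) -> list[str]:
    """Drop repeated sources while keeping first-seen order, then keep at most `limit`."""
    return list(dict.fromkeys(sources))[:limit]


# The three sources printed for test query 1.
printed = [
    "https://study.iitm.ac.in/ds/academics.html",
    "https://study.iitm.ac.in/ds/admissions.html",
    "https://study.iitm.ac.in/ds/academics.html",
]
distinct = unique_sources(printed)
print(distinct)
```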


In [16]:
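# Test query 2: procedural question (classified as 'eligibility' because 'apply' is an eligibility keyword and that category is checked first)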
ask_question("How do I apply for the B.S. program?")


Processing: 'How do I apply for the B.S. program?'
Question type: eligibility
Retrieving documents...
Retrieved 10 initial documents
Re-ranking documents...
  Re-ranking scores: ['0.565', '-2.163', '-4.144', '-6.712', '-6.853']
Using top 5 re-ranked documents
Generating answer...
Q: How do I apply for the B.S. program?
Type: ELIGIBILITY

A: To apply for the B.S. program in Data Science and Applications at IIT Madras, you need to meet the following eligibility requirements:

### Eligibility Criteria:

1. **Educational Qualification:**
   - You must have passed Class 12 or an equivalent examination. This applies to all applicants, regardless of age or academic background.
   - A list of accepted Class 12 equivalents is available on the admissions page.

2. **Current Students:**
   - If you are a school student who has appeared for your Class 11 final exams, you can also apply. You must complete Class 12 before you can join the program.

3. **Mathematics and English:**
   - It is expecte

In [17]:
# Test query 3: Comparative question
ask_question("What is the difference between diploma and degree level?")


Processing: 'What is the difference between diploma and degree level?'
Question type: comparative
Retrieving documents...
Retrieved 10 initial documents
Re-ranking documents...
  Re-ranking scores: ['2.882', '-0.387', '-0.415', '-0.816', '-2.498']
Using top 5 re-ranked documents
Generating answer...
Q: What is the difference between diploma and degree level?
Type: COMPARATIVE

A: Here are the key differences and similarities between the Diploma Level and Degree Level in the IIT Madras B.S. program:

### Key Differences:

- **Course Structure:**
  - **Diploma Level:**
    - Comprises 6 core courses, 2 projects, and 1 skill enhancement course.
    - Two specific diplomas: Diploma in Programming and Diploma in Data Science.
  - **Degree Level:**
    - BSc Degree Level consists of 28 credits, with a focus on a broader range of subjects beyond just core courses.

- **Credit Requirements:**
  - **Diploma Level:**
    - Requires a total of 27 credits for each diploma (Programming and Data Sc

In [18]:
# Test query 4: Eligibility question
ask_question("Am I eligible for the B.S. program if I have a 12th grade certificate?")


Processing: 'Am I eligible for the B.S. program if I have a 12th grade certificate?'
Question type: eligibility
Retrieving documents...
Retrieved 10 initial documents
Re-ranking documents...
  Re-ranking scores: ['-4.261', '-6.986', '-8.760', '-8.889', '-9.108']
Using top 5 re-ranked documents
Generating answer...
Q: Am I eligible for the B.S. program if I have a 12th grade certificate?
Type: ELIGIBILITY

A: Yes, you are eligible to apply for the B.S. program in Data Science and Applications at IIT Madras if you have a 12th grade certificate. Here are the specific eligibility requirements:

1. **Educational Qualification**: 
   - You must have passed Class 12 or an equivalent examination. This applies regardless of your age or academic background.

2. **Mathematics and English**: 
   - It is expected that you have studied Mathematics and English in Class 10.

3. **Class 11 Students**: 
   - If you are currently a school student who has appeared for your Class 11 final exams, you can a