<a href="https://colab.research.google.com/github/Binaz/rag-chatbot/blob/main/rag_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Energy & Sustainability RAG Chatbot – Development Notebook
This notebook demonstrates the development process of a Retrieval-Augmented Generation (RAG) chatbot that answers questions using Energy & Sustainability research papers.

**Contents:**
1. Loading and extracting text from PDFs
2. Chunking text for embedding
3. Creating embeddings and FAISS index
4. Loading the LLM and setting up the RAG pipeline
5. Querying and generating answers
6. Testing the Gradio interface


## Install Required Packages
We use `PyMuPDF` for PDF extraction, `sentence-transformers` for embeddings, `FAISS` for similarity search, and Hugging Face Transformers for LLMs.

In [None]:
!pip install pymupdf sentence-transformers faiss-cpu transformers accelerate bitsandbytes gradio torch

## Import Libraries
Import the libraries needed for PDF extraction, embeddings, retrieval and generation, and the Gradio interface.


In [None]:
import fitz, os
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import gradio as gr

## Load PDFs
We load research paper PDFs from the `data/pdfs/` folder and extract their text.


In [None]:
# Path to the folder containing the PDFs
pdf_folder = "data/pdfs"
pdf_texts = []
for file in sorted(os.listdir(pdf_folder)):  # sorted for a reproducible chunk order
    if file.lower().endswith(".pdf"):
        path = os.path.join(pdf_folder, file)
        # Open with a context manager so the file handle is released after reading
        with fitz.open(path) as doc:
            text = "".join(page.get_text() for page in doc)
        pdf_texts.append({"filename": file, "text": text})

print(f"Loaded {len(pdf_texts)} PDFs")


## Chunk Text
Split long PDF texts into smaller chunks (~500 words each) for embeddings.
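
Fixed windows can cut a sentence or an argument in half at a chunk boundary. One common mitigation is to let consecutive chunks share some words. The sketch below is illustrative only: `chunk_with_overlap` and its `overlap` parameter are hypothetical names that do not appear in this notebook, and the values are assumptions rather than tuned settings.

```python
def chunk_with_overlap(text, max_words=500, overlap=50):
    """Split text into word windows where neighbours share `overlap` words.

    Hypothetical helper for illustration; not part of the notebook's pipeline.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example: 12 words, windows of 5 words, 2 words shared between neighbours
sample = " ".join(f"w{i}" for i in range(12))
demo_chunks = chunk_with_overlap(sample, max_words=5, overlap=2)
for c in demo_chunks:
    print(c)
```

With overlap, a sentence that straddles a boundary appears whole in at least one chunk, at the cost of a somewhat larger index.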


In [None]:
# Chunk text into fixed-size windows of whitespace-separated words

def chunk_text(text, max_words=500):
    # Size is counted in words, not model tokens. 500 to 1000 words per chunk tends to work well for RAG.
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_words):
        chunks.append(" ".join(words[i:i + max_words]))
    return chunks

chunks = []
for pdf in pdf_texts:
    pdf_chunks = chunk_text(pdf["text"])
    for c in pdf_chunks:
        chunks.append({"filename": pdf["filename"], "chunk": c})

print(f"Created {len(chunks)} text chunks")


## Create Embeddings and FAISS Index
Use `all-MiniLM-L6-v2` to encode chunks into vectors and build a FAISS index for semantic search.
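
`IndexFlatL2` performs exact (brute-force) search: it compares the query with every stored vector and ranks the results by squared Euclidean distance. The dependency-free sketch below mirrors that idea on toy 2-dimensional vectors. `flat_l2_search` is a hypothetical name, and the code illustrates the ranking only; it is not how FAISS is implemented.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to `query`.

    Distances are squared Euclidean distances, ranked smallest first.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
dists, ids = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(ids, dists)
```

Because every vector is compared, results are exact, and search time grows linearly with the number of chunks.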


In [None]:
# Uses the 'all-MiniLM-L6-v2' model from Hugging Face.
# It is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space, suitable for tasks like clustering or semantic search.
# FAISS, developed by Meta, is a library for efficient similarity search and clustering of dense vectors.

!pip install faiss-cpu sentence-transformers

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Create embeddings
texts = [c["chunk"] for c in chunks]
embeddings = model.encode(texts, convert_to_numpy=True)

# Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)

print(f"FAISS index built with {index.ntotal} vectors")


## Retrieve Relevant Chunks and Generate Answer
Define a function that queries the FAISS index for the chunks most similar to a question. A sample query then prints the first 650 characters of each retrieved chunk. Answer generation is set up in the next section.
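
`index.search` returns two arrays of shape (number of queries, `top_k`): distances and row indices. When the index holds fewer than `top_k` vectors, FAISS fills the unused slots with the index -1. The sketch below shows one way to pair each valid index with its chunk and distance. `collect_hits` and the toy data are hypothetical, and plain lists stand in for the NumPy arrays FAISS returns.

```python
def collect_hits(distances, indices, chunk_records):
    """Pair each valid FAISS result with its chunk record and distance."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx == -1:  # padding slot: no vector was found for this rank
            continue
        record = chunk_records[idx]
        hits.append({"filename": record["filename"],
                     "chunk": record["chunk"],
                     "distance": float(dist)})
    return hits


toy_chunks = [
    {"filename": "a.pdf", "chunk": "solar capacity grew"},
    {"filename": "b.pdf", "chunk": "wind adoption in Europe"},
]
# One query with top_k=3, but only two vectors are indexed, so the last slot is padded.
# The distance in the padded slot is a placeholder and carries no meaning.
toy_D = [[0.12, 0.57, 3.4e38]]
toy_I = [[1, 0, -1]]
hits = collect_hits(toy_D[0], toy_I[0], toy_chunks)
print([h["filename"] for h in hits])
```

Keeping the distance next to each chunk makes it easier to inspect how close a retrieved passage actually is to the question.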


In [None]:
# Query / Retrieval Function

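def retrieve(query, top_k=5):
    # Embed the query with the same model used for the chunks
    query_emb = model.encode([query], convert_to_numpy=True)
    # D holds squared L2 distances, I holds the matching row indices into `chunks`
    D, I = index.search(query_emb, top_k)
    # FAISS pads with -1 when the index holds fewer than top_k vectors
    results = [chunks[i] for i in I[0] if i != -1]
    return results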

# Test query
query = "Which countries are leading in renewable energy adoption?"
results = retrieve(query)
for r in results:
    print(r["filename"], r["chunk"][:650], "...\n")


## Load LLM (microsoft/phi-2)
We use a small language model from Hugging Face for text generation: `microsoft/phi-2` (2.7B parameters).

> Initially this notebook used "mistralai/Mistral-7B-Instruct-v0.2", but that model needs ≈13–15 GB of VRAM. A free Colab T4 provides ~15 GB in total, and Colab also uses part of it for the notebook kernel.

> As a result, the model often ends up on the CPU silently, which makes inference very slow (hundreds of seconds) or produces no output if the request times out.

> To resolve this, the notebook uses "microsoft/phi-2", which is lighter and faster.

> *Average inference time is 13 seconds.*
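
The VRAM figures above follow from a simple estimate: in 16-bit precision each parameter takes 2 bytes, so the weights alone need roughly parameters × 2 bytes, before activations and the KV cache are counted. The sketch below works that out. The parameter counts are rounded approximations (about 7.2 billion for Mistral-7B, about 2.7 billion for phi-2), and `weights_gib` is a hypothetical helper.

```python
def weights_gib(n_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Rounded parameter counts (assumed approximations, not exact figures)
mistral_7b = weights_gib(7.2e9)   # 16-bit weights
phi_2 = weights_gib(2.7e9)        # 16-bit weights
print(f"Mistral-7B: ~{mistral_7b:.1f} GiB, phi-2: ~{phi_2:.1f} GiB")
```

On a T4 with ~15 GB this leaves little headroom for a 7B model in 16-bit precision, while phi-2 fits with room to spare.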






In [None]:
!pip install transformers accelerate bitsandbytes

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load a small (2.7B-parameter) language model
model_name = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

rag_pipeline = pipeline(
    "text-generation",
    model=llm,
    tokenizer=tokenizer,
    max_new_tokens=350,
    do_sample=True,         # sample instead of greedy decoding; needed for temperature/top_p to take effect
    temperature=0.4,        # higher = more diverse
    top_p=0.9,              # nucleus sampling
    return_full_text=False  # return only the newly generated text, not the prompt
)


def generate_answer(query, retrieved_chunks, max_chunks=3, max_chunk_tokens=500):
    """
    Generate an answer using retrieved chunks from FAISS.

    Args:
        query (str): User question
        retrieved_chunks (list): List of dicts with 'chunk' and 'filename'
        max_chunks (int): Maximum number of chunks to use
        max_chunk_tokens (int): Maximum tokens per chunk to prevent context overflow

    Returns:
        str: Generated answer
    """
    # Keep only top N chunks
    retrieved_chunks = retrieved_chunks[:max_chunks]

    # Truncate each chunk to max_chunk_tokens tokens
    truncated_chunks = []
    for c in retrieved_chunks:
        tokens = tokenizer(c["chunk"], truncation=True, max_length=max_chunk_tokens)["input_ids"]
        truncated_text = tokenizer.decode(tokens, skip_special_tokens=True)
        truncated_chunks.append(truncated_text)

    # Combine into context
    context = "\n\n".join(truncated_chunks)

    # Build prompt
    prompt = f"""You are a helpful research assistant focusing on areas in Energy and Sustainability.
Use the context below to answer the question in 6-7 complete sentences.

Context:
{context}

Question: {query}

Answer:"""

    # Generate answer using your LLM pipeline
    response = rag_pipeline(prompt)[0]["generated_text"]

    # Strip prompt repetition if any
    if "Answer:" in response:
        answer = response.split("Answer:")[-1].strip()
    else:
        answer = response.strip()

    return answer


## Launch Chatbot Interface
Use Gradio to interact with the RAG chatbot.

## Test Queries
Try asking some questions about the Energy & Sustainability papers.

In [None]:
!pip install gradio

import gradio as gr

def chat_fn(user_input):
    # Retrieve top chunks related to the question
    retrieved = retrieve(user_input)
    print(f"Retrieved {len(retrieved)} chunks")  # Debug output in Colab console

    # Generate answer using the language model
    answer = generate_answer(user_input, retrieved)
    print("Answer:", answer)  # Debug output in Colab console

    return answer  # Display this in Gradio UI

demo = gr.Interface(
    fn=chat_fn,
    inputs=gr.Textbox(
        lines=2,
        placeholder="Ask about the research papers on Energy and Sustainability"
    ),
    outputs=gr.Textbox(
        lines=20,        # height of the output box
        max_lines=30     # optional, allows scrolling
    ),
    title="Research Paper RAG Chatbot"
)

demo.launch(debug=True)


In [None]:
!pip freeze > requirements.txt