<a href="https://colab.research.google.com/github/ShaunakSoni28/RAG_Systems/blob/main/RAG_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [31]:
# from google.colab import drive
# drive.mount('/content/drive')

# Creating project directory structure
import os
project_dir = '/content/drive/MyDrive/RAG_Project/'
os.makedirs(project_dir, exist_ok=True)
# project_dir already ends with '/', so no extra separator is needed
os.makedirs(f'{project_dir}papers', exist_ok=True)
os.makedirs(f'{project_dir}data', exist_ok=True)
os.makedirs(f'{project_dir}results', exist_ok=True)
os.makedirs(f'{project_dir}evaluation', exist_ok=True)

print(f"✅ Project directory: {project_dir}")
print("✅ All work will be saved to Google Drive!")
print("✅ Safe from disconnects!")

✅ Project directory: /content/drive/MyDrive/RAG_Project/
✅ All work will be saved to Google Drive!
✅ Safe from disconnects!


In [32]:
# Installing required libraries
!pip install -q transformers accelerate sentence-transformers faiss-cpu pypdf langchain huggingface_hub

# arXiv API client, used to download papers directly in Colab
!pip install -q arxiv


# Importing basic libraries
import os
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import arxiv

print("✅ Setup complete!")
print(f"CUDA available: {torch.cuda.is_available()}")

project_dir = '/content/drive/MyDrive/RAG_Project/'


✅ Setup complete!
CUDA available: True


In [33]:

import arxiv
import os

# Creating the directory if it doesn't exist
os.makedirs("/content/drive/MyDrive/RAG_Project/papers", exist_ok=True)

# Searching for NLP papers
search = arxiv.Search(
    query="cat:cs.CL",  # Computer Science - Computation and Language
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

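# Search.results() is deprecated in recent arxiv releases; iterate through a Client instead
client = arxiv.Client()
papers = []
for result in client.results(search):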
    papers.append({
        'title': result.title,
        'pdf_url': result.pdf_url,
        'summary': result.summary,
        'authors': [author.name for author in result.authors]
    })
    # Downloading PDF
    result.download_pdf(filename=f"/content/drive/MyDrive/RAG_Project/papers/{result.get_short_id()}.pdf")

print(f"✅ Downloaded {len(papers)} papers!")



✅ Downloaded 50 papers!
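
The loop above collects titles, URLs, summaries and authors in `papers`, but only the PDFs are written to disk. A minimal sketch of persisting that metadata as JSON follows; the helper names, the file name `papers_metadata.json` and the example record are assumptions for illustration, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

def save_metadata(papers, path):
    """Write a list of paper metadata dicts to a UTF-8 JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(papers, f, ensure_ascii=False, indent=2)
    return path

def load_metadata(path):
    """Read the metadata list back from disk."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical record with the same keys as the dicts built in the download loop.
example_papers = [{
    "title": "An Example Paper",
    "pdf_url": "https://example.org/paper.pdf",
    "summary": "Placeholder abstract.",
    "authors": ["A. Author", "B. Author"],
}]

# A temporary directory stands in for the Drive folder here.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_metadata(example_papers, Path(tmp) / "data" / "papers_metadata.json")
    restored = load_metadata(saved)
```

In the notebook itself the target path would presumably sit under `project_dir`, next to the pickled paper texts.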


In [34]:
!pip install -q pyPDF2

from PyPDF2 import PdfReader
import pickle
from tqdm import tqdm

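def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of all pages, or "" if the PDF cannot be read."""
    try:
        reader = PdfReader(pdf_path)
        # Guard against pages that yield no extractable text
        pages = [page.extract_text() or "" for page in reader.pages]
        return "\n".join(pages).strip()
    except Exception as e:
        print(f"Error with {pdf_path}: {e}")
        return ""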

print("\nProcessing 50 Downloaded Papers!")
all_papers = []

papers_dir = os.path.join(project_dir, "papers")
paper_files = [f for f in os.listdir(papers_dir) if f.endswith(".pdf") and not f.startswith("distractor_")]

for pdf_file in tqdm(paper_files, desc="Processing PDFs"):
    pdf_path = os.path.join(papers_dir, pdf_file)
    text = extract_text_from_pdf(pdf_path)
    word_count = len(text.split())

    # Skip PDFs that yielded little or no text
    if word_count > 100:
        all_papers.append({
            'filename': pdf_file,
            'text': text,
            'word_count': word_count,
            'is_distractor': False
        })

# Report and save once, after the loop, rather than on every iteration
print(f"Successfully processed {len(all_papers)} papers!")
if all_papers:
    print(f"Average words per paper: {sum(p['word_count'] for p in all_papers) // len(all_papers)}")

# Saving the extracted texts to Drive
with open(f'{project_dir}data/main_papers.pkl', 'wb') as f:
    pickle.dump(all_papers, f)
print(f"Saved in Google Drive: {project_dir}data/main_papers.pkl")


Processing 50 Downloaded Papers!

Processing PDFs: 100%|██████████| 50/50 [00:39<00:00,  1.26it/s]

Successfully processed 50 papers!
Average words per paper: 9812
Saved in Google Drive: /content/drive/MyDrive/RAG_Project/data/main_papers.pkl
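
Text extracted from PDFs usually carries layout debris such as words hyphenated across line breaks and irregular whitespace, which then ends up inside the chunks. A minimal cleanup sketch follows; `clean_extracted_text` is a hypothetical helper, and its two regex rules are assumptions about what the extracted text looks like rather than something the notebook applies.

```python
import re

def clean_extracted_text(text):
    """Normalize raw PDF text before chunking."""
    # Rejoin words split by a hyphen at a line break: "Trans-\nformer" -> "Transformer"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace (including newlines) into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "Trans-\nformer models   use\nself-attention.\n\n"
cleaned = clean_extracted_text(raw)
print(cleaned)
```

The hyphen rule is deliberately naive: a genuine compound that happens to break at a line end (for example "self-" followed by "attention" on the next line) would be merged as well, so it is worth spot-checking a few papers before applying it to the whole corpus.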





In [35]:
print("Downloading 100 distractor papers from broader AI topics...")

# Distractor papers: recent submissions from neighbouring arXiv categories

distractor_queries = [
    "cat:cs.AI",  # Artificial Intelligence
    "cat:cs.LG",  # Machine Learning
    "cat:cs.CV",  # Computer Vision
]

distractor_count = 0
target_distractor = 100
downloads_ids = set()  # IDs already handled, so a paper returned by two queries is counted once

client = arxiv.Client()

for query in distractor_queries:
    if distractor_count >= target_distractor:
        break

    print("Searching...")

    search = arxiv.Search(
        query=query,
        max_results=40,
        sort_by=arxiv.SortCriterion.SubmittedDate,  # sort by submission date
    )

    for result in client.results(search):
        if distractor_count >= target_distractor:
            break

        paper_id = result.get_short_id()  # e.g. the arXiv ID with its version suffix

        if paper_id in downloads_ids:  # already handled in an earlier query, skip it
            continue

        try:
            filename = f"/content/drive/MyDrive/RAG_Project/papers/distractor_{paper_id}.pdf"

            # Reuse a PDF left over from a previous run instead of downloading again
            if os.path.exists(filename):
                distractor_count += 1
                downloads_ids.add(paper_id)
                continue

            result.download_pdf(filename=filename)
            downloads_ids.add(paper_id)
            distractor_count += 1

            if distractor_count % 10 == 0:
                print(f"Downloaded {distractor_count}/{target_distractor}")

        except Exception as e:
            print(f"Failed to download {paper_id}: {e}")
            continue

print(f"Total papers downloaded: {distractor_count}")

Downloading 100 distractor papers from broader AI topics...
Searching...




Searching...
Searching...
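
Because arXiv papers are often cross-listed in several categories, the three queries can return the same paper more than once. The sketch below shows ID-based de-duplication with a set; `filter_new_ids` and the IDs are made up for illustration. Note that the check has to be a membership test (`in`), since comparing a string to a set with `==` is always False.

```python
def filter_new_ids(candidate_ids, seen_ids):
    """Return the IDs not seen before (order preserved) and record them in seen_ids."""
    fresh = []
    for paper_id in candidate_ids:
        if paper_id in seen_ids:  # membership test against the set
            continue
        seen_ids.add(paper_id)
        fresh.append(paper_id)
    return fresh

seen = set()
# Hypothetical short IDs; the second query repeats one paper from the first.
first_batch = filter_new_ids(["2401.00001v1", "2401.00002v1"], seen)
second_batch = filter_new_ids(["2401.00002v1", "2401.00003v1"], seen)

print(first_batch)
print(second_batch)
print("2401.00001v1" == seen)  # a string never equals a set
```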


In [36]:
# 1. SETUP: Load embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Embedding model loaded!")

# 2. PREPARE: Chunk your documents
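def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping word-based chunks"""
    if overlap >= chunk_size:
        # A non-positive step would make range() fail or never advance
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []

    # Each chunk starts (chunk_size - overlap) words after the previous one
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk.strip():  # Only add non-empty chunks
            chunks.append(chunk)

    # No chunks means the text had no words at all; fall back to the stripped text
    if len(chunks) == 0:
        chunks = [text.strip()]

    return chunks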

# Example with one paper - LONGER sample text
sample_text = """
Attention mechanisms have become integral to sequence modeling tasks in natural language processing.
The Transformer architecture, introduced in the paper Attention is All You Need, relies entirely on self-attention mechanisms
to compute representations of input and output sequences without using recurrent or convolutional layers.
BERT uses bidirectional transformers for language understanding and has achieved state-of-the-art results on many NLP benchmarks.
The key innovation of transformers is the multi-head attention mechanism which allows the model to jointly attend to information
from different representation subspaces at different positions. This enables the model to capture long-range dependencies more
effectively than traditional RNNs or LSTMs. GPT models use a decoder-only transformer architecture and are trained using
a language modeling objective. These models have shown impressive performance on various downstream tasks through fine-tuning
or few-shot learning. Recent work has focused on making transformers more efficient through techniques like sparse attention,
linear attention mechanisms, and improved positional encodings. The scalability of transformers has enabled training of very
large language models with billions of parameters that demonstrate emergent capabilities on complex reasoning tasks.
"""

# Use smaller chunk size for this demo
chunks = chunk_text(sample_text, chunk_size=50, overlap=10)
print(f"✅ Created {len(chunks)} chunks")
print(f"Sample chunk 1: {chunks[0][:100]}...")
if len(chunks) > 1:
    print(f"Sample chunk 2: {chunks[1][:100]}...")

# 3. INDEX: Create FAISS vector database
print("\nCreating embeddings...")
chunk_embeddings = embedding_model.encode(chunks)
print(f"✅ Embeddings shape: {chunk_embeddings.shape}")

# Build FAISS index
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(chunk_embeddings)
print(f"✅ FAISS index created with {index.ntotal} vectors")

# 4. RETRIEVE: Search function
def retrieve_relevant_chunks(query, top_k=3):
    """Retrieve most relevant chunks for a query"""
    query_embedding = embedding_model.encode([query])

    # Make sure we don't ask for more chunks than we have
    top_k = min(top_k, len(chunks))

    distances, indices = index.search(query_embedding, top_k)

    results = []
    for idx, dist in zip(indices[0], distances[0]):
        results.append({
            'chunk': chunks[idx],
            'distance': float(dist),
            'chunk_id': int(idx)
        })

    return results

# Test retrieval
test_query = "What are attention mechanisms?"
results = retrieve_relevant_chunks(test_query, top_k=3)

print(f"\n{'='*60}")
print(f"✅ RETRIEVAL TEST")
print(f"{'='*60}")
print(f"Query: {test_query}\n")
for i, result in enumerate(results, 1):
    print(f"{i}. Distance: {result['distance']:.4f}")
    print(f"   Chunk: {result['chunk'][:150]}...")
    print()

Loading embedding model...



✅ Embedding model loaded!
✅ Created 5 chunks
Sample chunk 1: Attention mechanisms have become integral to sequence modeling tasks in natural language processing....
Sample chunk 2: recurrent or convolutional layers. BERT uses bidirectional transformers for language understanding a...

Creating embeddings...
✅ Embeddings shape: (5, 384)
✅ FAISS index created with 5 vectors

✅ RETRIEVAL TEST
Query: What are attention mechanisms?

1. Distance: 0.9611
   Chunk: Attention mechanisms have become integral to sequence modeling tasks in natural language processing. The Transformer architecture, introduced in the p...

2. Distance: 1.3043
   Chunk: impressive performance on various downstream tasks through fine-tuning or few-shot learning. Recent work has focused on making transformers more effic...

3. Distance: 1.3635
   Chunk: recurrent or convolutional layers. BERT uses bidirectional transformers for language understanding and has achieved state-of-the-art results on many N...
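
The distances printed above come from `IndexFlatL2`, which reports squared Euclidean distances. If the embeddings are unit-length (an assumption here; it can be enforced by passing `normalize_embeddings=True` to `encode`), a squared distance d relates to cosine similarity as cos = 1 - d/2. The pure-Python sketch below checks that identity on two made-up vectors; all function names are illustrative.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2_to_cosine(dist_sq):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - dist_sq / 2.0

# Arbitrary example vectors, normalized to unit length.
a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 0.5, 1.0])

dist_sq = squared_l2(a, b)
cos_direct = dot(a, b)
cos_from_dist = squared_l2_to_cosine(dist_sq)
print(abs(cos_direct - cos_from_dist) < 1e-9)
```

Under the same unit-length assumption, the top distance of 0.9611 above would correspond to a cosine similarity of roughly 0.52.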



In [37]:
# 5. GENERATE: Use Flan-T5 (works immediately, no authentication needed)
from transformers import pipeline

print("Loading Flan-T5 model...")
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    device_map="auto"
)
print("✅ Flan-T5 loaded!")

def generate_answer(query, retrieved_chunks):
    """Generate answer using retrieved context"""

    # Combine retrieved chunks into context
    context = "\n\n".join([chunk['chunk'] for chunk in retrieved_chunks])

    # Create prompt - Flan-T5 uses simpler format
    prompt = f"""Answer the question based on the context below.

Context:
{context}

Question: {query}

Answer:"""

    # Generate
    response = generator(
        prompt,
        max_new_tokens=256,  # limit on generated tokens; avoids the max_length/max_new_tokens conflict
        temperature=0.7,
        do_sample=True
    )

    return response[0]['generated_text']

# Test the full pipeline
test_query = "What are attention mechanisms?"
print(f"\n🔍 Query: {test_query}")

# Retrieve
retrieved = retrieve_relevant_chunks(test_query, top_k=3)
print(f"\n📚 Retrieved {len(retrieved)} chunks")

# Generate
answer = generate_answer(test_query, retrieved)
print(f"\n💡 Answer: {answer}")

Loading Flan-T5 model...


Device set to use cuda:0


✅ Flan-T5 loaded!

🔍 Query: What are attention mechanisms?

📚 Retrieved 3 chunks





💡 Answer: to compute representations of input and output sequences without using recurrent or convolutional layers
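
With three 50-word chunks the prompt stays short, but the same pipeline with 500-word chunks could easily exceed the input length Flan-T5 is normally used with (its tokenizer is typically configured for 512 tokens). A rough sketch of capping the context follows; `build_prompt` and the 300-word default are assumptions, and a word count is only a crude proxy for the tokenizer's real token count.

```python
def build_prompt(query, retrieved_chunks, max_context_words=300):
    """Build the QA prompt, keeping at most max_context_words of retrieved text."""
    context_parts = []
    used = 0
    for item in retrieved_chunks:
        remaining = max_context_words - used
        if remaining <= 0:
            break
        words = item["chunk"].split()
        # Keep the best-ranked chunks whole and truncate the one that hits the budget
        context_parts.append(" ".join(words[:remaining]))
        used += min(len(words), remaining)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

# Toy chunks in the same dict shape that the retrieval step returns.
demo_chunks = [
    {"chunk": "one two three four five"},
    {"chunk": "six seven eight nine ten"},
]
prompt = build_prompt("What comes after five?", demo_chunks, max_context_words=7)
print(prompt)
```

Because chunks arrive sorted by distance, truncating from the end drops the least relevant text first.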


In [38]:
# COMPLETE DEMO FUNCTION
def rag_qa_system(question):
    """Complete RAG QA pipeline"""
    print("="*60)
    print(f"QUESTION: {question}")
    print("="*60)

    # Step 1: Retrieve
    print("\n🔍 RETRIEVING relevant documents...")
    retrieved_chunks = retrieve_relevant_chunks(question, top_k=3)

    for i, chunk in enumerate(retrieved_chunks, 1):
        # 1/(1+d) maps the FAISS L2 distance into (0, 1]; a display heuristic, not cosine similarity
        print(f"\n  [{i}] Similarity: {1/(1+chunk['distance']):.3f}")
        print(f"      {chunk['chunk'][:100]}...")

    # Step 2: Generate
    print("\n\n💭 GENERATING answer...")
    answer = generate_answer(question, retrieved_chunks)

    print("\n" + "="*60)
    print("ANSWER:")
    print("="*60)
    print(answer)
    print("="*60)

    return answer

# Demo questions
demo_questions = [
    "What are attention mechanisms in transformers?",
    "How does BERT work?",
    "What is the difference between GPT and BERT?"
]

for q in demo_questions:
    rag_qa_system(q)
    print("\n\n")



QUESTION: What are attention mechanisms in transformers?

🔍 RETRIEVING relevant documents...

  [1] Similarity: 0.532
      Attention mechanisms have become integral to sequence modeling tasks in natural language processing....

  [2] Similarity: 0.500
      impressive performance on various downstream tasks through fine-tuning or few-shot learning. Recent ...

  [3] Similarity: 0.455
      recurrent or convolutional layers. BERT uses bidirectional transformers for language understanding a...


💭 GENERATING answer...





ANSWER:
self-attention mechanisms to compute representations of input and output sequences without using recurrent or convolutional layers



QUESTION: How does BERT work?

🔍 RETRIEVING relevant documents...

  [1] Similarity: 0.515
      recurrent or convolutional layers. BERT uses bidirectional transformers for language understanding a...

  [2] Similarity: 0.487
      Attention mechanisms have become integral to sequence modeling tasks in natural language processing....

  [3] Similarity: 0.398
      very large language models with billions of parameters that demonstrate emergent capabilities on com...


💭 GENERATING answer...





ANSWER:
BERT uses bidirectional transformers for language understanding and has achieved state-of-the-art results on many NLP benchmarks



QUESTION: What is the difference between GPT and BERT?

🔍 RETRIEVING relevant documents...

  [1] Similarity: 0.457
      different representation subspaces at different positions. This enables the model to capture long-ra...

  [2] Similarity: 0.431
      recurrent or convolutional layers. BERT uses bidirectional transformers for language understanding a...

  [3] Similarity: 0.411
      Attention mechanisms have become integral to sequence modeling tasks in natural language processing....


💭 GENERATING answer...

ANSWER:
GPT models use a decoder-only transformer architecture and are trained using a language modeling objective
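
The demo above indexes only `sample_text`. One possible way to extend it to the downloaded papers is to keep provenance with every chunk, so that retrieved results can be traced back to a file and checked against the distractor set. The sketch below assumes paper dicts with the keys `filename`, `text` and `is_distractor`; the helper names and the toy data are hypothetical.

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Word-based chunks where consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def build_chunk_records(papers, chunk_size=500, overlap=50):
    """Chunk every paper and attach its filename and distractor flag to each chunk."""
    records = []
    for paper in papers:
        for position, chunk in enumerate(chunk_words(paper["text"], chunk_size, overlap)):
            records.append({
                "chunk": chunk,
                "filename": paper["filename"],
                "is_distractor": paper["is_distractor"],
                "position": position,
            })
    return records

def distractor_rate(retrieved_records):
    """Fraction of retrieved chunks that come from distractor papers."""
    if not retrieved_records:
        return 0.0
    return sum(r["is_distractor"] for r in retrieved_records) / len(retrieved_records)

# Toy stand-ins for the pickled paper dicts.
toy_papers = [
    {"filename": "main.pdf", "text": "a b c d e f g h", "is_distractor": False},
    {"filename": "distractor_x.pdf", "text": "p q r s", "is_distractor": True},
]
records = build_chunk_records(toy_papers, chunk_size=4, overlap=1)
print(len(records), distractor_rate(records))
```

The position of each record in `records` would line up with the row order of the vectors added to the FAISS index, so an index returned by `index.search` can be used to look up the source file directly.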



