# 🚀 Hybrid RAG System for CASML Generative AI Hackathon 🚀

## Overview
This repository presents my solution for the **CASML Generative AI Hackathon**. The goal was to build a **Retrieval-Augmented Generation (RAG)** system to answer questions from a psychology textbook while providing verifiable references to sections and pages.

## Solution Highlights
1. **Hybrid Retriever**:
   - Combines **Dense Retrieval** (FAISS + Sentence Embeddings) and **Sparse Retrieval** (BM25).
   - **Re-Ranking**: Uses a **Cross-Encoder** to improve the relevance of retrieved contexts.

2. **Generative Model**:
   - Generates concise, context-aware answers using **Qwen2.5-0.5B**.

3. **Reference Extraction**:
   - Provides supporting sections and page numbers for each generated answer.

## Implementation

### Hybrid Retrieval
- **Dense Retrieval**: Efficient similarity search with **FAISS** and SentenceTransformer embeddings.
- **Sparse Retrieval**: Keyword-based search using **BM25**.
- **Hybrid Scoring**: Weighted combination of dense and sparse scores:
   ```python
   hybrid_score = (self.dense_weight * dense_score) + ((1 - self.dense_weight) * sparse_score)
   ```
- **Re-Ranking**: Top candidates are refined using a Cross-Encoder.
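
Dense inner-product scores and BM25 scores live on different scales, so a raw weighted sum can be dominated by whichever score is numerically larger. The sketch below shows one common remedy, min-max normalizing each score list before blending. The notebook itself blends raw scores; the helpers `min_max` and `blend` and the sample numbers are illustrative assumptions.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(dense: List[float], sparse: List[float], dense_weight: float = 0.5) -> List[float]:
    """Weighted sum of normalized dense and sparse scores, one value per document."""
    d, s = min_max(dense), min_max(sparse)
    return [dense_weight * di + (1 - dense_weight) * si for di, si in zip(d, s)]


# Hypothetical scores for three pages: cosine-like dense scores and raw BM25 scores.
dense_scores = [0.2, 0.6, 0.4]
sparse_scores = [12.0, 3.0, 9.0]

hybrid = blend(dense_scores, sparse_scores, dense_weight=0.7)
# Document indices ordered from best to worst hybrid score.
ranking = sorted(range(len(hybrid)), key=lambda i: hybrid[i], reverse=True)
```

With `dense_weight=0.7` the page with the best dense score ranks first even though its BM25 score is the lowest, which is the behaviour the weight is meant to control.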

### RAG Pipeline
1. **Context Retrieval**: Fetches relevant sections using the hybrid retriever.
2. **Answer Generation**: Generates answers using Qwen2.5-0.5B with retrieved context.
3. **Reference Extraction**: Outputs relevant page and section references.
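
The notebook's `rag_pipeline` also widens each retrieved page by `window_size` neighbouring pages before building the context. A minimal sketch of that expansion step, assuming pages are numbered from 1 and that the last page number is known; `expand_pages` and `last_page` are hypothetical names used only for illustration.

```python
from typing import Iterable, List


def expand_pages(hits: Iterable[int], window_size: int, last_page: int) -> List[int]:
    """Return sorted, de-duplicated page numbers within window_size of any hit."""
    pages = set()
    for page in hits:
        start = max(1, page - window_size)
        end = min(last_page, page + window_size)
        pages.update(range(start, end + 1))  # inclusive on both sides
    return sorted(pages)


# Overlapping windows around pages 8 and 10 merge; page 630 is clipped at the end of the book.
context_pages = expand_pages([8, 10, 630], window_size=2, last_page=631)
```

Collecting pages in a set means overlapping windows contribute each page only once, so the context does not repeat text.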

### Fine-Tuning (Optional)
- **Sentence Embeddings** can be fine-tuned for domain-specific data using **MultipleNegativesRankingLoss** from SentenceTransformers.
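
To make the idea behind this loss concrete, the sketch below computes an in-batch ranking loss by hand: each question should score highest against its own passage, and every other passage in the batch acts as a negative. It is a plain-Python illustration rather than the library's implementation; the function names, the toy vectors, and the `scale` of 20 (which I believe mirrors the library default) are assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_ranking_loss(anchors: List[Vector], positives: List[Vector], scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-D "embeddings": two questions and two candidate passages.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # passage i matches question i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # passage i matches the other question

loss_aligned = in_batch_ranking_loss(questions, aligned)
loss_swapped = in_batch_ranking_loss(questions, swapped)
```

The loss is near zero when each question is closest to its own passage and large when the pairs are swapped, which is the signal fine-tuning uses to pull matching question/passage embeddings together.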

## Workflow
1. Load and preprocess the textbook data.
2. Perform **Hybrid Retrieval** with the `HybridRetriever` class.
3. Generate answers with context using a generative model.
4. Output a CSV file (`submission.csv`) in the required format:
   - **ID**: Query ID
   - **Context**: Retrieved sections
   - **Answer**: Model-generated answer
   - **References**: JSON with section and page details
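
A minimal sketch of one submission row, assuming the column names used in the notebook's `main()` (`ID`, `context`, `answer`, `references`). The notebook writes the file with pandas; this example swaps in the standard-library `csv` module and an in-memory buffer so it runs on its own, and the sample values are made up.

```python
import csv
import io
import json

# Column names as used by the notebook's main(); the row content is illustrative.
fieldnames = ["ID", "context", "answer", "references"]
row = {
    "ID": 1,
    "context": "Psychology is the scientific study of mind and behavior.",
    "answer": "The scientific study of mind and behavior.",
    "references": json.dumps(
        {"sections": ["Introduction to Psychology/What Is Psychology?"], "pages": ["8"]}
    ),
}

buffer = io.StringIO()  # stands in for open("submission.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)

# Reading the row back shows the references survive CSV quoting as valid JSON.
buffer.seek(0)
parsed = next(csv.DictReader(buffer))
references = json.loads(parsed["references"])
```

Serializing the references with `json.dumps` keeps the nested structure in a single CSV cell, and the CSV writer handles the quoting of the embedded double quotes.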

## References
- **Textbook**: OpenStax Psychology (2e), CC BY 4.0.
- **Models**:
   - Sentence Embeddings: `all-mpnet-base-v2`
   - Cross-Encoder: `cross-encoder/ms-marco-MiniLM-L-12-v2`
   - Generative Model: `Qwen/Qwen2.5-0.5B`


## Necessary libraries

In [1]:
!pip install sentence-transformers faiss-cpu wordcloud transformers pypdf datasets rank_bm25 adapters






In [1]:
import json
import os
import torch
import faiss
import numpy as np
import pandas as pd
from rank_bm25 import BM25Okapi
from pypdf import PdfReader
from datasets import Dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sentence_transformers.cross_encoder import CrossEncoder
from adapters import AdapterConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
import re
from typing import List, Dict



# Creation of a JSON file for the sections and pages

In [2]:
def create_sections_json(output_file=r"D:\Users\hadro\Desktop\RAG_Project\RAG_Project\sections.json"):
    """
    Generates the sections.json file with the hierarchical structure of the book.
    """
    data = {
        "Introduction to Psychology": {
            "page_start": 7,
            "page_end": 34,
            "subsections": {
                "What Is Psychology?": {
                    "page_start": 8,
                    "page_end": 8
                },
                "History of Psychology": {
                    "page_start": 9,
                    "page_end": 17
                },
                "Contemporary Psychology": {
                    "page_start": 18,
                    "page_end": 25
                },
                "Careers in Psychology": {
                    "page_start": 26,
                    "page_end": 29
                }
            }
        },
        "Psychological Research": {
            "page_start": 35,
            "page_end": 70,
            "subsections": {
                "Why Is Research Important?": {
                    "page_start": 36,
                    "page_end": 40
                },
                "Approaches to Research": {
                    "page_start": 41,
                    "page_end": 47
                },
                "Analyzing Findings": {
                    "page_start": 48,
                    "page_end": 58
                },
                "Ethics": {
                    "page_start": 59,
                    "page_end": 62
                }
            }
        },
        "Biopsychology": {
            "page_start": 71,
            "page_end": 108,
            "subsections": {
                "Human Genetics": {
                    "page_start": 72,
                    "page_end": 77
                },
                "Cells of the Nervous System": {
                    "page_start": 78,
                    "page_end": 83
                },
                "Parts of the Nervous System": {
                    "page_start": 84,
                    "page_end": 85
                },
                "The Brain and Spinal Cord": {
                    "page_start": 86,
                    "page_end": 96
                },
                "The Endocrine System": {
                    "page_start": 97,
                    "page_end": 99
                }
            }
        },
        "States of Consciousness": {
            "page_start": 109,
            "page_end": 143,
            "subsections": {
                "What Is Consciousness?": {
                    "page_start": 110,
                    "page_end": 113
                },
                "Sleep and Why We Sleep": {
                    "page_start": 114,
                    "page_end": 116
                },
                "Stages of Sleep": {
                    "page_start": 117,
                    "page_end": 120
                },
                "Sleep Problems and Disorders": {
                    "page_start": 121,
                    "page_end": 125
                },
                "Substance Use and Abuse": {
                    "page_start": 126,
                    "page_end": 133
                },
                "Other States of Consciousness": {
                    "page_start": 134,
                    "page_end": 136
                }
            }
        },
        "Sensation and Perception": {
            "page_start": 145,
            "page_end": 179,
            "subsections": {
                "Sensation versus Perception": {
                    "page_start": 146,
                    "page_end": 148
                },
                "Waves and Wavelengths": {
                    "page_start": 149,
                    "page_end": 152
                },
                "Vision": {
                    "page_start": 153,
                    "page_end": 160
                },
                "Hearing": {
                    "page_start": 161,
                    "page_end": 163
                },
                "The Other Senses": {
                    "page_start": 164,
                    "page_end": 167
                },
                "Gestalt Principles of Perception": {
                    "page_start": 168,
                    "page_end": 171
                }
            }
        },
        "Learning": {
            "page_start": 181,
            "page_end": 211,
            "subsections": {
                "What Is Learning?": {
                    "page_start": 182,
                    "page_end": 182
                },
                "Classical Conditioning": {
                    "page_start": 183,
                    "page_end": 191
                },
                "Operant Conditioning": {
                    "page_start": 192,
                    "page_end": 202
                },
                "Observational Learning (Modeling)": {
                    "page_start": 203,
                    "page_end": 206
                }
            }
        },
        "Thinking and Intelligence": {
            "page_start": 213,
            "page_end": 246,
            "subsections": {
                "What Is Cognition?": {
                    "page_start": 214,
                    "page_end": 217
                },
                "Language": {
                    "page_start": 218,
                    "page_end": 221
                },
                "Problem Solving": {
                    "page_start": 222,
                    "page_end": 227
                },
                "What Are Intelligence and Creativity?": {
                    "page_start": 228,
                    "page_end": 230
                },
                "Measures of Intelligence": {
                    "page_start": 231,
                    "page_end": 236
                },
                "The Source of Intelligence": {
                    "page_start": 237,
                    "page_end": 240
                }
            }
        },
        "Memory": {
            "page_start": 247,
            "page_end": 277,
            "subsections": {
                "How Memory Functions": {
                    "page_start": 248,
                    "page_end": 254
                },
                "Parts of the Brain Involved with Memory": {
                    "page_start": 255,
                    "page_end": 258
                },
                "Problems with Memory": {
                    "page_start": 259,
                    "page_end": 268
                },
                "Ways to Enhance Memory": {
                    "page_start": 269,
                    "page_end": 272
                }
            }
        },
        "Lifespan Development": {
            "page_start": 279,
            "page_end": 320,
            "subsections": {
                "What Is Lifespan Development?": {
                    "page_start": 280,
                    "page_end": 283
                },
                "Lifespan Theories": {
                    "page_start": 284,
                    "page_end": 291
                },
                "Stages of Development": {
                    "page_start": 292,
                    "page_end": 312
                },
                "Death and Dying": {
                    "page_start": 313,
                    "page_end": 314
                }
            }
        },
        "Emotion and Motivation": {
            "page_start": 321,
            "page_end": 357,
            "subsections": {
                "Motivation": {
                    "page_start": 322,
                    "page_end": 327
                },
                "Hunger and Eating": {
                    "page_start": 328,
                    "page_end": 333
                },
                "Sexual Behavior": {
                    "page_start": 334,
                    "page_end": 341
                },
                "Emotion": {
                    "page_start": 342,
                    "page_end": 352
                }
            }
        },
        "Personality": {
            "page_start": 359,
            "page_end": 396,
            "subsections": {
                "What Is Personality?": {
                    "page_start": 360,
                    "page_end": 361
                },
                "Freud and the Psychodynamic Perspective": {
                    "page_start": 362,
                    "page_end": 367
                },
                "Neo-Freudians: Adler, Erikson, Jung, and Horney": {
                    "page_start": 368,
                    "page_end": 372
                },
                "Learning Approaches": {
                    "page_start": 373,
                    "page_end": 376
                },
                "Humanistic Approaches": {
                    "page_start": 377,
                    "page_end": 377
                },
                "Biological Approaches": {
                    "page_start": 378,
                    "page_end": 378
                },
                "Trait Theorists": {
                    "page_start": 379,
                    "page_end": 383
                },
                "Cultural Understandings of Personality": {
                    "page_start": 384,
                    "page_end": 385
                },
                "Personality Assessment": {
                    "page_start": 386,
                    "page_end": 390
                }
            }
        },
        "Social Psychology": {
            "page_start": 399,
            "page_end": 445,
            "subsections": {
                "What Is Social Psychology?": {
                    "page_start": 400,
                    "page_end": 405
                },
                "Self-presentation": {
                    "page_start": 406,
                    "page_end": 408
                },
                "Attitudes and Persuasion": {
                    "page_start": 409,
                    "page_end": 414
                },
                "Conformity, Compliance, and Obedience": {
                    "page_start": 415,
                    "page_end": 421
                },
                "Prejudice and Discrimination": {
                    "page_start": 422,
                    "page_end": 428
                },
                "Aggression": {
                    "page_start": 429,
                    "page_end": 431
                },
                "Prosocial Behavior": {
                    "page_start": 432,
                    "page_end": 436
                }
            }
        },
        "Industrial-Organizational Psychology": {
            "page_start": 447,
            "page_end": 483,
            "subsections": {
                "What Is Industrial and Organizational Psychology?": {
                    "page_start": 448,
                    "page_end": 455
                },
                "Industrial Psychology: Selecting and Evaluating Employees": {
                    "page_start": 456,
                    "page_end": 466
                },
                "Organizational Psychology: The Social Dimension of Work": {
                    "page_start": 467,
                    "page_end": 476
                },
                "Human Factors Psychology and Workplace Design": {
                    "page_start": 477,
                    "page_end": 479
                }
            }
        },
        "Stress, Lifestyle, and Health": {
            "page_start": 485,
            "page_end": 535,
            "subsections": {
                "What Is Stress?": {
                    "page_start": 486,
                    "page_end": 495
                },
                "Stressors": {
                    "page_start": 496,
                    "page_end": 501
                },
                "Stress and Illness": {
                    "page_start": 502,
                    "page_end": 513
                },
                "Regulation of Stress": {
                    "page_start": 514,
                    "page_end": 520
                },
                "The Pursuit of Happiness": {
                    "page_start": 521,
                    "page_end": 528
                }
            }
        },
        "Psychological Disorders": {
            "page_start": 537,
            "page_end": 597,
            "subsections": {
                "What Are Psychological Disorders?": {
                    "page_start": 538,
                    "page_end": 541
                },
                "Diagnosing and Classifying Psychological Disorders": {
                    "page_start": 542,
                    "page_end": 544
                },
                "Perspectives on Psychological Disorders": {
                    "page_start": 545,
                    "page_end": 547
                },
                "Anxiety Disorders": {
                    "page_start": 548,
                    "page_end": 553
                },
                "Obsessive-Compulsive and Related Disorders": {
                    "page_start": 554,
                    "page_end": 557
                },
                "Posttraumatic Stress Disorder": {
                    "page_start": 558,
                    "page_end": 559
                },
                "Mood and Related Disorders": {
                    "page_start": 560,
                    "page_end": 569
                },
                "Schizophrenia": {
                    "page_start": 570,
                    "page_end": 573
                },
                "Dissociative Disorders": {
                    "page_start": 574,
                    "page_end": 575
                },
                "Disorders in Childhood": {
                    "page_start": 576,
                    "page_end": 581
                },
                "Personality Disorders": {
                    "page_start": 582,
                    "page_end": 588
                }
            }
        },
        "Therapy and Treatment": {
            "page_start": 599,
            "page_end": 631,
            "subsections": {
                "Mental Health Treatment: Past and Present": {
                    "page_start": 600,
                    "page_end": 604
                },
                "Types of Treatment": {
                    "page_start": 605,
                    "page_end": 616
                },
                "Treatment Modalities": {
                    "page_start": 617,
                    "page_end": 620
                },
                "Substance-Related and Addictive Disorders: A Special Case": {
                    "page_start": 621,
                    "page_end": 622
                },
                "The Sociocultural Model and Therapy Utilization": {
                    "page_start": 623,
                    "page_end": 626
                }
            }
        }
    }

    with open(output_file, "w") as f:
        json.dump(data, f, indent=4)

create_sections_json()
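
Because the page ranges above are entered by hand, a quick consistency check can catch typos before they reach the page-to-section mapping. A minimal sketch, assuming the same dictionary layout (`page_start`, `page_end`, optional `subsections`); `find_range_problems` and the deliberately broken sample entry are hypothetical.

```python
from typing import Dict, List


def find_range_problems(sections: Dict[str, dict]) -> List[str]:
    """Report inverted page ranges and subsections that fall outside their parent's range."""
    problems = []
    for name, info in sections.items():
        if info["page_start"] > info["page_end"]:
            problems.append(f"{name}: start after end")
        for sub_name, sub in info.get("subsections", {}).items():
            if sub["page_start"] > sub["page_end"]:
                problems.append(f"{name}/{sub_name}: start after end")
            if sub["page_start"] < info["page_start"] or sub["page_end"] > info["page_end"]:
                problems.append(f"{name}/{sub_name}: outside parent range")
    return problems


sample = {
    "Learning": {
        "page_start": 181,
        "page_end": 211,
        "subsections": {
            "What Is Learning?": {"page_start": 182, "page_end": 182},
            "Broken Example": {"page_start": 210, "page_end": 215},  # deliberately invalid
        },
    }
}
problems = find_range_problems(sample)
```

Running the same check on the dictionary loaded back from the JSON file would flag any subsection whose pages spill past its chapter.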

In [11]:
# --- HybridRetriever Class ---
class HybridRetriever:
    def __init__(self, book_data, embedding_model_name='all-mpnet-base-v2', cross_encoder_model_name='cross-encoder/ms-marco-MiniLM-L-12-v2', adapter=None, dense_weight=0.5):
        """
        Initialize hybrid retriever with dense, sparse, and cross-encoder re-ranking capabilities.

        Args:
            book_data (List[Dict]): List of book sections
            embedding_model_name (str): Sentence transformer model name for dense retrieval
            cross_encoder_model_name (str): Cross-encoder model name for re-ranking
            adapter (str): Path to adapter model or None
            dense_weight (float): Weight for dense semantic search (0-1)
        """
        # Device configuration
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.dense_weight = dense_weight
        
        # Load embedding model with optional adapter
        self.embedding_model = SentenceTransformer(embedding_model_name).to(self.device)
        if adapter:
            print(f"Loading adapter from {adapter}...")
            # map_location lets weights saved on GPU load on a CPU-only machine
            self.embedding_model.load_state_dict(torch.load(adapter, map_location=self.device))
        
        # Cross-encoder model for re-ranking
        self.cross_encoder_model = CrossEncoder(cross_encoder_model_name)
        
        # Prepare book data
        self.book_data = book_data
        self.texts = [item['text'] for item in book_data]
        
        # Dense retrieval (FAISS index)
        self._create_dense_index()
        
        # Sparse retrieval (BM25)
        self._create_bm25_index()
    
    def _create_dense_index(self):
        """Create FAISS index for dense semantic search"""
        embeddings = self.embedding_model.encode(
            self.texts, 
            show_progress_bar=True, 
            convert_to_tensor=True
        ).cpu().numpy()
        
        self.dimension = embeddings.shape[1]
        self.faiss_index = faiss.IndexFlatIP(self.dimension)
        self.faiss_index.add(embeddings)
        self.embeddings = embeddings
    
    def _create_bm25_index(self):
        """Create BM25 index for sparse keyword-based retrieval"""
        self.tokenized_texts = [text.lower().split() for text in self.texts]
        self.bm25 = BM25Okapi(self.tokenized_texts)
    
    def fine_tune_embeddings(self, train_examples: List[InputExample], epochs=3, batch_size=8, learning_rate=3e-5):
        """
        Fine-tune SentenceTransformer model using domain-specific data.

        Args:
            train_examples (List[InputExample]): Training data
            epochs (int): Number of fine-tuning epochs
            batch_size (int): Batch size
            learning_rate (float): Learning rate for optimization
        """
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
        train_loss = losses.MultipleNegativesRankingLoss(self.embedding_model)

        self.embedding_model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=epochs,
            warmup_steps=100,
            optimizer_params={'lr': learning_rate},
            show_progress_bar=True
        )
    
    def save_adapter(self, save_path):
        """
        Save the fine-tuned embedding adapter.
        
        Args:
            save_path (str): Path to save the adapter
        """
        print(f"Saving adapter to {save_path}...")
        torch.save(self.embedding_model.state_dict(), save_path)

    def retrieve_sections(self, query: str, k_initial: int = 10, k_final: int = 3, window_size=2):
        """
        Hybrid retrieval combining dense and sparse search with cross-encoder re-ranking
        
        Args:
            query (str): Search query
            k_initial (int): Number of top sections to retrieve initially
            k_final (int): Number of top sections to return after re-ranking
            window_size (int): Window size for context enrichment (not used in this version, but could be added later)
            
        
        Returns:
            List of top k_final retrieved sections with re-ranked scoring
        """
        query_embedding = self.embedding_model.encode([query], convert_to_tensor=True).cpu().numpy()
        dense_distances, dense_indices = self.faiss_index.search(query_embedding, k_initial)
        
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        
        
        hybrid_scores = []
        for i in range(len(self.texts)):
            dense_score = 0
            if i in dense_indices[0]:
                dense_idx = np.where(dense_indices[0] == i)[0]
                if len(dense_idx) > 0:
                    dense_score = dense_distances[0][dense_idx[0]]
            sparse_score = bm25_scores[i]
            hybrid_score = (self.dense_weight * dense_score) + ((1 - self.dense_weight) * sparse_score) 
            hybrid_scores.append((hybrid_score, i))
        
        hybrid_scores.sort(reverse=True, key=lambda x: x[0])
        top_k_initial_indices = [idx for _, idx in hybrid_scores[:k_initial]]
        
        cross_encoder_pairs = [[query, self.texts[idx]] for idx in top_k_initial_indices]
        cross_encoder_scores = self.cross_encoder_model.predict(cross_encoder_pairs)
        
        final_scores = []
        for i, idx in enumerate(top_k_initial_indices):
            initial_score = hybrid_scores[i][0]
            re_ranked_score = cross_encoder_scores[i]
            combined_score = 0.6 * initial_score + 0.4 * re_ranked_score
            final_scores.append((combined_score, idx))
        
        final_scores.sort(reverse=True, key=lambda x: x[0])
        top_k_final_indices = [idx for _, idx in final_scores[:k_final]]
        
        results = [
            {
                "score": final_scores[i][0],
                "page": self.book_data[idx]['page'],
                "text": self.book_data[idx]['text']
            } 
            for i, idx in enumerate(top_k_final_indices)
        ]
        
        return results

# --- Helper Functions ---
def load_book(pdf_path, sections_metadata_path=r"D:\Users\hadro\Desktop\RAG_Project\RAG_Project\sections.json"):
    """Load and preprocess the book from PDF, including section metadata."""
    reader = PdfReader(pdf_path)
    sections_data = []

    # Load section metadata
    with open(sections_metadata_path, "r") as f:
        sections_metadata = json.load(f)

    # Create a flattened page-to-section mapping for easier lookup
    page_to_section_map = {}

    def map_sections(section_name, section_data, parent_path=""):
        current_path = f"{parent_path}/{section_name}" if parent_path else section_name
        for page_num in range(section_data["page_start"], section_data["page_end"] + 1):
            page_to_section_map[page_num] = current_path

        if "subsections" in section_data:
            for subsection_name, subsection_data in section_data["subsections"].items():
                map_sections(subsection_name, subsection_data, current_path)

    for section_name, section_data in sections_metadata.items():
        map_sections(section_name, section_data)

    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if text.strip():
            section_path = page_to_section_map.get(page_num + 1, "Unknown")
            sections_data.append({"page": page_num + 1, "text": text.strip(), "section": section_path})

    return sections_data, page_to_section_map

def generate_answer(context, question, device):
    """Generate an answer using the generative model."""
    generator_model_name = "Qwen/Qwen2.5-0.5B"
    generator_tokenizer = AutoTokenizer.from_pretrained(generator_model_name)
    generator_model = AutoModelForCausalLM.from_pretrained(generator_model_name).to(device)

    # Refine the prompt
    input_text = f"""
    Answer VERY concisely from the provided context only. Your answer length: less or equal 150 tokens. Plain text, no Markdown. Answer only, no other questions.
    Context: {context}

    Question: {question}

    Answer:
    """
    
    try:
        # Set pad_token_id before tokenizing, because padding=True requires a pad token
        if generator_tokenizer.pad_token_id is None:
            generator_tokenizer.pad_token_id = generator_tokenizer.eos_token_id

        # Tokenize with padding and attention mask.
        # Note: truncation keeps only the first max_length tokens, so a long
        # context can push the question out of the prompt.
        inputs = generator_tokenizer(
            input_text,
            return_tensors="pt",
            padding=True,  # Enable padding
            truncation=True,
            max_length=512  # Adjust max length as needed
        )
        input_ids = inputs["input_ids"].to(device)
        attention_mask = inputs["attention_mask"].to(device)

        # Generate answer with attention mask and pad_token_id
        output_ids = generator_model.generate(
            input_ids,
            attention_mask=attention_mask,
            pad_token_id=generator_tokenizer.pad_token_id,
            num_return_sequences=1,
            #top_p=0.7,
            #top_k=20,
            num_beams=5,
            no_repeat_ngram_size=2,
            #temperature=0.1,
            max_new_tokens=512
        )
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = output_ids[0][input_ids.shape[1]:]
        return generator_tokenizer.decode(new_tokens, skip_special_tokens=True)

    except Exception as e:
        print(f"Error in answer generation: {e}")
        return "Unable to generate an answer."

def clean_answer(answer):
    """Cleans up the generated answer with more targeted operations."""

    # 1. Remove excessive newlines (but keep single newlines)
    answer = answer.replace("\n\n\n", "\n\n").replace("\n\n###", "###").replace("\n\n-", "-")

    # 2. Handle LaTeX-like formatting VERY selectively (example):
    answer = re.sub(r'\\\(.*?\\\)', '', answer)  # Remove only \( ... \)
    answer = re.sub(r'\\\[.*?\\\]', '', answer)  # Remove only \[ ... \]

    # 3. Remove stray backslashes and unnecessary commands 
    answer = re.sub(r'\\(text|quad|boxed|textit|textbf){', '', answer)
    answer = re.sub(r'(?<!\\)}', '', answer)  # Remove } that is not preceded by a backslash
    answer = re.sub(r'\\(?![a-zA-Z])', '', answer)  # Remove \ followed by non-letters (likely stray)
    answer = re.sub(r'\\\$', '$', answer)  # Replace \$ with $
    answer = re.sub(r'\\%', '%', answer)  # Replace \% with %

    # 4. Remove Markdown bold formatting (**)
    answer = answer.replace("**", "")

    # 5. Remove multiple spaces, but keep single spaces
    answer = ' '.join(answer.split())

    return answer

def load_queries(json_path):
    """Load questions from the queries JSON file."""
    try:
        with open(json_path, "r", encoding="utf-8") as f:
            queries = json.load(f)
        return queries
    except Exception as e:
        print(f"Error loading queries: {e}")
        return []

def rag_pipeline(question, hybrid_retriever, device, page_to_section_map, k=3, window_size=2):
    """
    Updated RAG pipeline using hybrid retrieval with context window expansion.

    Args:
        question (str): The question to answer.
        hybrid_retriever (HybridRetriever): The hybrid retriever object.
        device (torch.device): The device to use for computation (CPU or CUDA).
        page_to_section_map (dict): Mapping of page numbers to section paths.
        k (int): Number of top sections to retrieve after re-ranking.
        window_size (int): Number of pages to include on either side of the retrieved page.

    Returns:
        dict: A dictionary containing the context, answer, and references.
    """
    try:
        retrieved_pages = hybrid_retriever.retrieve_sections(question, k_initial=10, k_final=k)  # Increased k_initial

        if not retrieved_pages:
            return {
                "context": "",
                "answer": "No relevant context found.",
                "references": json.dumps({"sections": [], "pages": []})
            }

        # Group pages by section path
        section_page_mapping = {}
        for page_data in retrieved_pages:
            page_num = page_data['page']
            section_path = page_to_section_map.get(page_num, "Unknown")
            if section_path not in section_page_mapping:
                section_page_mapping[section_path] = []
            section_page_mapping[section_path].append(page_num)

        # Sort pages within each section
        for section_path, pages in section_page_mapping.items():
            section_page_mapping[section_path] = sorted(pages)

        # Construct context with window_size
        page_texts = {p['page']: p['text'] for p in hybrid_retriever.book_data}
        last_page = max(page_texts)  # Highest page number that has extracted text
        context_pages = set()
        for pages in section_page_mapping.values():
            for page in pages:
                # Expand context to include pages within the window (inclusive on both sides)
                context_pages.update(range(max(1, page - window_size), min(last_page, page + window_size) + 1))
        context_pages = sorted(context_pages)  # Remove duplicates and sort

        # Retrieve the text for the context pages
        context = ""
        for page_num in context_pages:
            if page_num in page_texts:
                context += page_texts[page_num] + "\n"

        # Generate the answer using the context and question
        answer = generate_answer(context, question, device)

        # Clean the generated answer
        cleaned_answer = clean_answer(answer)

        references = {
            "sections": list(section_page_mapping.keys()),
            "pages": [str(page) for pages in section_page_mapping.values() for page in pages]
        }

        return {
            "context": context,
            "answer": cleaned_answer,
            "references": json.dumps(references)
        }

    except Exception as e:
        print(f"Error in RAG pipeline: {e}")
        return {
            "context": "",
            "answer": "An error occurred during processing.",
            "references": json.dumps({"sections": [], "pages": []})
        }

def main():
    # Define the dataset directory (local path; adjust for your environment)
    dataset_dir = r"D:\Users\hadro\Desktop\RAG_Project\RAG_Project"
    pdf_path = os.path.join(dataset_dir, "book.pdf")
    queries_path = os.path.join(dataset_dir, "queries.json")

    # Device configuration
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the book along with section metadata
    book_data, page_to_section_map = load_book(pdf_path)

    # Initialize hybrid retriever
    hybrid_retriever = HybridRetriever(book_data)

    # Load queries
    queries = load_queries(queries_path)

    # Generate submission data
    submission_data = []
    for query in queries:
        query_id = query.get("query_id", "unknown")
        question = query.get("question", "")

        if not question:
            print(f"Skipping query with ID {query_id} because the question is empty.")
            continue

        try:
            response = rag_pipeline(question, hybrid_retriever, device, page_to_section_map, window_size=2)
            print(f"Response for query ID {query_id}: {response}")

            submission_data.append({
                "ID": query_id,
                "context": response.get("context", ""),
                "answer": response.get("answer", ""),
                "references": response.get("references", "{}")
            })
        except Exception as e:
            print(f"Error processing query {query_id}: {e}")
            submission_data.append({
                "ID": query_id,
                "context": "",
                "answer": "Error processing query",
                "references": "{}"
            })

    # Create submission CSV
    submission_df = pd.DataFrame(submission_data)
    submission_df.to_csv("submission.csv", index=False)  # Saved to the current working directory

    print("Submission file 'submission.csv' has been generated successfully!")
    print("\nSubmission Statistics:")
    print(f"Total rows: {len(submission_df)}")
    print(f"Average context length: {submission_df['context'].str.len().mean():.0f}")
    
if __name__ == "__main__":
    main()

Batches: 100%|██████████| 24/24 [17:09<00:00, 42.88s/it]


Response for query ID 1: {'context': 'Christine Selby, Husson University\nSally B. Seraphin, Centre College\nBrian Sexton, Kean University\nNancy Simpson, Trident Technical College\nJason M. Smith, Federal Bureau of Prisons – FCC Hazelton\nRobert Stennett, University of Georgia\nJennifer Stevenson, Ursinus College\nEric Weiser, Curry College\nJay L. Wenger, Harrisburg Area Community College\nAlan Whitehead, Southern Virginia University\nValjean Whitlow, American Public University\nRachel Wu, University of California, Riverside\nAlexandra Zelin, University of Tennessee at Chattanooga\n6 Preface\nAccess for free at openstax.org\nFIGURE 1.1 Psychology is the scientific study of mind and behavior. (credit "background": modification of work by\nNattachai Noogure; credit "top left": modification of work by Peter Shanks; credit "top middle": modification of work\nby "devinf"/Flickr; credit "top right": modification of work by Alejandra Quintero Sinisterra; credit "bottom left":\nmodification 