# Group 99 - Conversational AI Assignment 2: Comparative Financial QA System

### Group Member Names



| Name                      | ID            |
| :------------------------ | :------------ |
| APARNARAM KASIREDDY       | 2023AC05145   |
| K NIRANJAN BABU           | 2023AC05464   |
| LAKSHMI MRUDULA MADDI     | 2023AC05138   |
| MURALIKRISHNA RAPARTHI    | 2023AC05208   |
| RAJAMOHAN NAIDU           | 2023AC05064   |




This Jupyter Notebook implements and compares two systems for answering questions based on HDFC Bank's financial statements:
1. Retrieval-Augmented Generation (RAG) Chatbot
2. Retrieval-Augmented Fine-Tuned (RAFT) Chatbot

Note: Ensure you have a Google Colab Pro or high-RAM Colab instance for running 7B models.
Make sure to upload your HDFC Bank PDF files and the Q&A JSON file to the Colab environment.



# **Phase 1: Data Preparation:**
This is the foundational stage for both systems. Raw PDF documents are cleaned, broken into manageable chunks, and then converted into two types of searchable indexes: a FAISS index for understanding the meaning (semantic search) and a BM25 index for finding exact keywords (lexical search).

# **Phase 2: Model Pipelines:**
This is where the two approaches diverge.
*   RAG Pipeline: At the time of a query, the system performs a Hybrid Retrieval to find the most relevant text chunks. These chunks are then fed directly to the base Mistral-7B model along with the user's question to generate an answer on the fly.

*   RAFT Pipeline: This involves an offline Fine-Tuning step. Before the user ever asks a question, the system uses the hybrid retriever to find context for a pre-made set of training questions. This "augmented" dataset is then used to train a specialized version of the Mistral-7B model. This new model is an expert at answering financial questions based on provided context.

# **Phase 3: Deployment & Interaction:**
After the models are ready, they remain on the GPU, which is crucial for handling the demanding requirements of inference. The models are not moved to the CPU; they continue to leverage the GPU's processing power for both training and execution. This setup is used for the Automated Evaluation and the final interactive interfaces—the Command-Line Interface (CLI) and the Streamlit web application—where the performance of both the RAG and specialized RAFT models can be compared side-by-side.

# Setup & Installations



In [None]:
!pip install -q PyPDF2 sentence-transformers faiss-cpu rank_bm25 langchain datasets scikit-learn pandas
!pip install -q optimum
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/trl.git

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.4/31.4 MB[0m [31m47.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m425.8/425.8 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.3/61.3 MB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for accelerate (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies

# Startup: Auto ZIP Restore / Skip Logic

In [None]:
import os, shutil, json, faiss, zipfile
from google.colab import files

# Include all key RAG and RAFT artifacts below
SAVE_ITEMS = [
    "mistral7b_lora_ft",   # folder containing model adapters and config
    "mistral7b_lora_ft_tokenizer", # folder containing tokenizer files
    "raft_model_adapters",
    "results",
    "tokenizer_ft",
    "vector_store",        # folder for FAISS and other vector indices
    "rag_chunks_data.json",# document chunks for retrieval
    "faiss_index_small.bin",
    "bm25_index_small.pkl",
    "faiss_index_large.bin",
    "bm25_index_large.pkl",
    "raft_hyperparams.json",
    "financial_qa_dataset.json",
    "HDFC_Bank_IAR_FY23-24.pdf",
    "HDFC_Bank_IAR_FY24-25.pdf",
    "utils.py",
    "app.py"
    ]
ZIP_NAME = "project_state_artifacts.zip"

def save_project_state():
    # --- Save FAISS index in correct location ---
    # (assuming you created/updated faiss_index.bin for RAG)
    if not os.path.exists("vector_store"):
        os.makedirs("vector_store")
    # Move (or overwrite) faiss_index.bin into vector_store
    if os.path.exists("faiss_index_small.bin"):
        shutil.copy2("faiss_index_small.bin", "vector_store/faiss_index_small.bin")
    if os.path.exists("faiss_index_large.bin"):
        shutil.copy2("faiss_index_large.bin", "vector_store/faiss_index_large.bin")
    # Remove existing zip if present
    if os.path.exists(ZIP_NAME):
        os.remove(ZIP_NAME)

    items_to_zip = [item for item in SAVE_ITEMS if os.path.exists(item)]
    if not items_to_zip:
        print("No items found to save.")
        return

    # Create ZIP archive manually (handles files and directories)
    with zipfile.ZipFile(ZIP_NAME, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for item in items_to_zip:
            if os.path.isfile(item):
                zipf.write(item)
            else:
                for root, _, files_in_dir in os.walk(item):
                    for file in files_in_dir:
                        file_path = os.path.join(root, file)
                        arcname = os.path.relpath(file_path, ".") # Store paths relative to current directory
                        zipf.write(file_path, arcname=arcname)

    print(f"Project state saved as {ZIP_NAME}")
    # Colab download trigger
    try:
        files.download(ZIP_NAME)
    except Exception:
        print("Could not trigger download. You can download it manually from the file explorer.")

def restore_project_state():
    if os.path.exists(ZIP_NAME):
        print(f"Restoring from {ZIP_NAME}...")
        shutil.unpack_archive(ZIP_NAME, '.')
        print("Restore complete!")
        return True
    print("No saved ZIP found.")
    return False

RESTORED = restore_project_state()
if RESTORED:
    print("Resuming from saved state — skipping heavy steps.")
else:
    print("Starting fresh — heavy steps will run.")

No saved ZIP found.
Starting fresh — heavy steps will run.


## Imports & Core Setup

In [None]:
import gc
import time
import re
import numpy as np
import torch
import glob
import pickle
import random
import os, shutil, json, faiss, zipfile
import multiprocessing
import copy
import warnings
warnings.filterwarnings('ignore')
from typing import List
from tqdm import tqdm
from google.colab import files
from trl import SFTConfig
from PyPDF2 import PdfReader
from collections import Counter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from faiss import IndexFlatIP
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from huggingface_hub import login
from google.colab import userdata

# GPU Check
if torch.cuda.is_available():
    print(f"CUDA is available. GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected. Running on CPU.")

CUDA is available. GPU: Tesla T4


# 1. Data Collection & Preprocessing

## 1.1 Data Acquisition (Dynamic and Content-Aware)

In [None]:
def find_pdfs_in_colab():
    """Scans the /content/ directory for all .pdf files."""
    return glob.glob(r"/content/*.pdf")  # raw string for consistency

import re
from collections import Counter

def extract_year_from_text(text):
    """
    Robustly searches text for the most common financial year pattern.
    Prioritizes full year ranges like '2023-24' or '2024-25'.
    Ensures a consistent string output.
    """
    # Pattern 1: Matches '2023-24', '2023-2024', or '2023 - 24'. Captures the first year.
    pattern1_matches = re.findall(r'\b(20\d{2})[-–]\s?\d{2,4}\b', text)
    if pattern1_matches:
        # CORRECT: Returns just the string, e.g., '2023'
        return Counter(pattern1_matches).most_common(1)[0][0]

    # Pattern 2: Fallback for 'FY24', 'FY 24', 'Financial Year 2024'
    pattern2_matches = re.findall(r'\b(?:FY|Financial Year)\s?(\d{2,4})\b', text, re.IGNORECASE)
    if pattern2_matches:
        years = [f"20{y}" if len(y) == 2 else y for y in pattern2_matches]
        # FIX: Added [0][0] to return only the year string
        return Counter(years).most_common(1)[0][0]

    # Pattern 3: Last resort fallback for standalone four-digit year '2023' etc.
    pattern3_matches = re.findall(r'\b(20\d{2})\b', text)
    if pattern3_matches:
        # FIX: Added [0][0] to return only the year string
        return Counter(pattern3_matches).most_common(1)[0][0]

    return None

def extract_text_from_pdf(pdf_path):
    """Extracts all text from a PDF file as a single string."""
    text = ""
    try:
        with open(pdf_path, 'rb') as file:
            reader = PdfReader(file)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    return text

# --- Main Dynamic Logic ---
financial_texts = {}
pdf_paths = find_pdfs_in_colab()

if not pdf_paths:
    print("\nWARNING: No PDF files found in /content/. Please upload your reports.\n")
else:
    print(f"\nFound {len(pdf_paths)} PDF files to process: {pdf_paths}\n")

for path in pdf_paths:
    print(f"\n--- Processing: {os.path.basename(path)} ---\n")

    full_text = extract_text_from_pdf(path)
    if not full_text:
        print(f"Could not extract text from {path}. Skipping.\n")
        continue

    # Content-first approach to determine FY
    financial_year = extract_year_from_text(full_text)

    # Fallback: try filename if content search fails
    if not financial_year:
        print("Could not determine year from content. Falling back to filename...\n")
        financial_year = extract_year_from_text(os.path.basename(path))

    if financial_year:
        print(f"Successfully identified Financial Year: {financial_year}\n")
        financial_texts[financial_year] = full_text
    else:
        print(f"Could not determine a financial year for {path}. Storing under unknown key.\n")
        financial_texts[f"Unknown_FY_{os.path.basename(path)}"] = full_text


Found 2 PDF files to process: ['/content/HDFC_Bank_IAR_FY24-25.pdf', '/content/HDFC_Bank_IAR_FY23-24.pdf']


--- Processing: HDFC_Bank_IAR_FY24-25.pdf ---

Successfully identified Financial Year: 2024


--- Processing: HDFC_Bank_IAR_FY23-24.pdf ---

Successfully identified Financial Year: 2023



## 1.2 Text Cleaning

In [None]:
## 1.2 Text Cleaning
def clean_financial_text(text):
    """
    Cleans raw financial report text by removing common boilerplate and formatting issues:
    - Removes page numbering like 'Page 12 of 40'
    - Removes 'Annual Report 2023' or similar year mentions
    - Collapses multiple newlines to a single newline
    - Replaces multiple spaces with a single space
    """
    text = re.sub(r'Page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'Annual Report \d{4}(-\d{2,4})?', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\n{2,}', '\n', text)      # collapse multiple newlines
    text = re.sub(r'\s{2,}', ' ', text)       # collapse multiple spaces
    return text.strip()

# Apply cleaning to all extracted financial report texts by financial year
cleaned_financial_texts = {fy: clean_financial_text(text) for fy, text in financial_texts.items()}

# --- Display a snippet of cleaned text for verification ---
if cleaned_financial_texts:
    # Get the latest year available to show the most recent cleaned report snippet
    latest_year = sorted(cleaned_financial_texts.keys())[-1]
    print(f"\n--- Snippet of Cleaned Text (FY: {latest_year}) ---")
    print(cleaned_financial_texts[latest_year][:500])  # Print first 500 chars as preview
else:
    print("\n--- No cleaned text available to display. ---")


--- Snippet of Cleaned Text (FY: 2024) ---
Powering Progress Together
Banking today is more than managing money — it’s about fulfilling aspirations, and shaping futures. At HDFC Bank, we see ourselves not just as custodians of capital but as partners in progress for individuals, businesses, and communities alike. We aim to simplify complexities, unlock opportunities and help dreams take flight. Our ability to evolve with changing times—
powered by innovation, a strong performance culture and a deep sense of responsibility—has made us a t


## 1.3 Text Segmentation (Dual Strategy)

In [None]:
## --- Section 1.3: Text Segmentation (Dual Strategy) ---
def create_chunks(cleaned_texts_dict, chunk_size, chunk_overlap):
    """Create text chunks from cleaned texts using RecursiveCharacterTextSplitter."""
    print(f"--- Creating chunks with size={chunk_size}, overlap={chunk_overlap} ---")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    all_chunks = []
    for fy, text in cleaned_texts_dict.items():
        # Create a Document object with page content and metadata
        doc = Document(page_content=text, metadata={"source": f"HDFC_Bank_IAR_FY{fy}.pdf", "year": fy})
        chunks = text_splitter.split_documents([doc])
        all_chunks.extend(chunks)
    print(f"Created {len(all_chunks)} chunks.")
    return all_chunks


# --- Main Logic for Dual Chunking ---

# Strategy 1: Small chunks for specific fact retrieval
rag_chunks_small = create_chunks(cleaned_financial_texts, chunk_size=512, chunk_overlap=50)

# Strategy 2: Large chunks for broader context understanding
rag_chunks_large = create_chunks(cleaned_financial_texts, chunk_size=2048, chunk_overlap=200)


# Save both chunk sets to JSON for persistence / later restoration
rag_data_to_save = {
    "small_chunks": [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in rag_chunks_small],
    "large_chunks": [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in rag_chunks_large]
}
with open("rag_chunks_data.json", "w") as f:
    json.dump(rag_data_to_save, f, indent=2)
print("\nSaved both small and large RAG chunks to rag_chunks_data.json")


# Set default chunk set to small chunks for further processing / indexing
rag_chunks = rag_chunks_small
print("rag_chunks is set to default 'small chunks' for now")

--- Creating chunks with size=512, overlap=50 ---
Created 8772 chunks.
--- Creating chunks with size=2048, overlap=200 ---
Created 1974 chunks.

Saved both small and large RAG chunks to rag_chunks_data.json
rag_chunks is set to default 'small chunks' for now


## 1.4 Q/A Pair Construction (Dynamic)

In [None]:
## 1.4 Q/A Pair Construction (Dynamic)
def find_and_load_qa_pairs(directory="/content/"):
    """
    Finds the first .json file in the specified directory, validates it, and loads the Q/A pairs.
    """
    # Find all json files in the directory
    json_files = glob.glob(os.path.join(directory, "*.json"))

    # Exclude the notebook's own checkpoint files and previously saved chunks
    json_files = [f for f in json_files if 'ipynb_checkpoints' not in f and 'rag_chunks_data' not in f]

    if not json_files:
        print(" WARNING: No Q/A JSON file found in the /content/ directory. Skipping Q/A pair loading.")
        return []

    # Use the first valid JSON file found
    # qa_json_path = json_files[0]
    qa_json_path = '/content/financial_qa_dataset.json'
    print(f"Found Q&A file: {qa_json_path}")

    try:
        with open(qa_json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # --- Validation Step ---
        # Check if the file contains a list of dictionaries with the required keys
        if isinstance(data, list) and data and isinstance(data[0], dict) and 'question' in data[0] and 'answer' in data[0]:
            print(f"Successfully loaded and validated {len(data)} Q/A pairs.")
            return data
        else:
            print(f" WARNING: The file {qa_json_path} is not a valid Q/A dataset. It must be a list of objects with 'question' and 'answer' keys.")
            return []

    except Exception as e:
        print(f"Error loading or parsing Q/A file {qa_json_path}: {e}")
        return []

# --- Main Logic ---
# Dynamically find and load the QA pairs
qa_pairs = find_and_load_qa_pairs()

if qa_pairs:
    train_qa_pairs, test_qa_pairs = train_test_split(qa_pairs, test_size=0.2, random_state=42)
    print(f"Split Q/A pairs: {len(train_qa_pairs)} for training, {len(test_qa_pairs)} for testing.")
else:
    # Ensure these variables exist even if loading fails to prevent downstream errors
    train_qa_pairs, test_qa_pairs = [], []

Found Q&A file: /content/financial_qa_dataset.json
Successfully loaded and validated 50 Q/A pairs.
Split Q/A pairs: 40 for training, 10 for testing.


# 2. RAG System Implementation

## 2.1 Chunks Info

In [None]:
# Print a few random sample chunks from the small chunks list
print("\n--- Printing a few random sample chunks from the SMALL list ---")
print(f"Total Small Chunks: {len(rag_chunks_small)}")
num_samples = 3
if len(rag_chunks_small) > num_samples:
    random_small_chunks = random.sample(rag_chunks_small, num_samples)
    for i, chunk in enumerate(random_small_chunks):
        print(f"\n--- Small Chunk Sample {i + 1} ---")
        print(f"Content: {chunk.page_content[:250]}...")
        print(f"Metadata: {chunk.metadata}")

# Print a few random sample chunks from the large chunks list
print("\n--- Printing a few random sample chunks from the LARGE list ---")
print(f"Total Large Chunks: {len(rag_chunks_large)}")
if len(rag_chunks_large) > num_samples:
    random_large_chunks = random.sample(rag_chunks_large, num_samples)
    for i, chunk in enumerate(random_large_chunks):
        print(f"\n--- Large Chunk Sample {i + 1} ---")
        print(f"Content: {chunk.page_content[:250]}...")
        print(f"Metadata: {chunk.metadata}")


--- Printing a few random sample chunks from the SMALL list ---
Total Small Chunks: 8772

--- Small Chunk Sample 1 ---
Content: HDFC Capital Advisors Limited HDFC Trustee Company Limited HDFC Sales Private LimitedHDFC Education and Development Services Private LimitedGriha InvestmentsGriha Pte Limited
W
elfare Trust of the Bank
H
DB Employees Welfare Trust K...
Metadata: {'source': 'HDFC_Bank_IAR_FY2023.pdf', 'year': '2023'}

--- Small Chunk Sample 2 ---
Content: Experience adjustment (gain) / loss on plan liabilities 17.40 16.83 3.32 6.44 31.41 Provident fund The guidance note on AS-15 “Employee Benefits”, states that employer established provident funds, where interest is guaranteed are to be considered as ...
Metadata: {'source': 'HDFC_Bank_IAR_FY2024.pdf', 'year': '2024'}

--- Small Chunk Sample 3 ---
Content: SCHEDULES TO THE STANDALONE FINANCIAL STATEMENTS
For the year ended March 31, 2024T he nature and terms of foreign currency IRS as at March 31, 2024 are set out below: (` cr

## 2.2 Embedding & Indexing

### **Embedding Model Selection:**
*   Using a good performing small open-source sentence embedding model.
*   `all-MiniLM-L6-v2` is a good balance of size and performance. `e5-small-v2` is also an option.

In [None]:
## --- Section 2.2: Embedding & Indexing (Dual Strategy) ---
def build_and_save_indexes(chunks, strategy_name):
    """Helper function to embed, index, and save data for a given chunk strategy."""
    print(f"--- Building indexes for strategy: {strategy_name} ---")
    if not chunks:
        print(f" No chunks provided for {strategy_name}. Skipping.")
        return None, None

    chunk_texts = [doc.page_content for doc in chunks]
    # Sanitize: Remove any empty strings
    chunk_texts = [t if t.strip() else "<EMPTY>" for t in chunk_texts]
    chunk_embeddings = embedding_model.encode(chunk_texts, show_progress_bar=True)
    dimension = chunk_embeddings.shape[1]

    # Build Dense Index (FAISS, cosine similarity via normalized vectors)
    normalized_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    assert not np.isnan(normalized_embeddings).any(), "NaNs in normalized embeddings!"
    faiss_index = faiss.IndexFlatIP(dimension)
    faiss_index.add(np.array(normalized_embeddings).astype('float32'))
    faiss.write_index(faiss_index, f"faiss_index_{strategy_name}.bin")
    print(f"FAISS index for {strategy_name} built and saved.")

    # Build Sparse Index (BM25)
    tokenized_corpus = [text.split(" ") for text in chunk_texts]
    bm25_index = BM25Okapi(tokenized_corpus)
    with open(f"bm25_index_{strategy_name}.pkl", "wb") as f:
        pickle.dump(bm25_index, f)
    print(f"BM25 index for {strategy_name} built and saved.")

    return faiss_index, bm25_index

print("Loading Sentence Transformer model globally...")
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print("Embedding model loaded.")

# Build indexes for both strategies
faiss_index_small, bm25_index_small = build_and_save_indexes(rag_chunks_small, "small")
faiss_index_large, bm25_index_large = build_and_save_indexes(rag_chunks_large, "large")

# Collect in a dictionary for dual strategy
index_sets = {
    "small": {"faiss": faiss_index_small, "bm25": bm25_index_small, "chunks": rag_chunks_small},
    "large": {"faiss": faiss_index_large, "bm25": bm25_index_large, "chunks": rag_chunks_large}
}

# --- File Existence Verification ---
print("\n--- Verifying existence of all generated files ---")

file_paths = {
    "small_faiss_index": "faiss_index_small.bin",
    "small_bm25_index": "bm25_index_small.pkl",
    "large_faiss_index": "faiss_index_large.bin",
    "large_bm25_index": "bm25_index_large.pkl"
}

all_files_exist = True
for name, path in file_paths.items():
    if os.path.exists(path):
        print(f" {name} exists at: {path}")
    else:
        print(f" {name} is MISSING at: {path}")
        all_files_exist = False

if all_files_exist:
    print("\n All essential files were successfully created.")
else:
    print("\n One or more essential files are missing. Please check the previous steps.")

Loading Sentence Transformer model globally...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Embedding model loaded.
--- Building indexes for strategy: small ---


Batches:   0%|          | 0/275 [00:00<?, ?it/s]

FAISS index for small built and saved.
BM25 index for small built and saved.
--- Building indexes for strategy: large ---


Batches:   0%|          | 0/62 [00:00<?, ?it/s]

FAISS index for large built and saved.
BM25 index for large built and saved.

--- Verifying existence of all generated files ---
 small_faiss_index exists at: faiss_index_small.bin
 small_bm25_index exists at: bm25_index_small.pkl
 large_faiss_index exists at: faiss_index_large.bin
 large_bm25_index exists at: bm25_index_large.pkl

 All essential files were successfully created.


## Section 2.3: Hybrid Retrieval Pipeline

In [None]:
## --- Section 2.3: Hybrid Retrieval Pipeline (Dual Strategy Ready) ---
def preprocess_query(query):
    """Helper function to clean and normalize a query."""
    return query.lower().strip()

def hybrid_retrieval(query, strategy_name="small", top_k=5):
    """
    Performs hybrid retrieval using the specified chunking strategy ('small' or 'large').
    It combines dense (FAISS) and sparse (BM25) search results using Reciprocal Rank Fusion (RRF).
    """
    # 1. Select the correct set of indexes and chunks based on the chosen strategy
    if strategy_name not in index_sets:
        print(f" Warning: Strategy '{strategy_name}' not found. Defaulting to 'small'.")
        strategy_name = "small"

    strategy = index_sets[strategy_name]
    faiss_index = strategy.get("faiss")
    bm25_index = strategy.get("bm25")
    chunks = strategy.get("chunks")

    if faiss_index is None or bm25_index is None:
        print("Indexes are not available. Skipping retrieval.")
        return [], [], []

    # 2. Preprocess query
    cleaned_query = preprocess_query(query)

    # 3. Dense retrieval (FAISS)
    query_emb = embedding_model.encode([cleaned_query])[0]
    query_emb_norm = query_emb / np.linalg.norm(query_emb)
    distances, dense_indices = faiss_index.search(np.array([query_emb_norm]).astype('float32'), top_k)
    dense_results = [{'doc_index': i, 'score': s} for i, s in zip(dense_indices[0], distances[0]) if i != -1]

    # 4. Sparse retrieval (BM25)
    tokenized_query = cleaned_query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    sparse_top_indices = np.argsort(bm25_scores)[::-1][:top_k]
    sparse_results = [{'doc_index': i, 'score': bm25_scores[i]} for i in sparse_top_indices]

    # 5. Reciprocal Rank Fusion (RRF)
    fused_scores = {}
    k_rrf = 60  # Tunable parameter

    for rank, res in enumerate(dense_results):
        idx = res['doc_index']
        fused_scores[idx] = fused_scores.get(idx, 0) + 1 / (k_rrf + rank + 1)

    for rank, res in enumerate(sparse_results):
        idx = res['doc_index']
        fused_scores[idx] = fused_scores.get(idx, 0) + 1 / (k_rrf + rank + 1)

    # 6. Sort documents by fused score
    sorted_indices = sorted(fused_scores.keys(), key=lambda x: fused_scores[x], reverse=True)
    top_indices = sorted_indices[:top_k]

    # 7. Retrieve documents, add fused_score in metadata
    final_docs = []
    seen_indices = set()
    for idx in top_indices:
        if idx not in seen_indices:
            doc = chunks[idx]
            doc.metadata['fused_score'] = fused_scores[idx]
            final_docs.append(doc)
            seen_indices.add(idx)

    retrieved_texts = [doc.page_content for doc in final_docs]

    return final_docs, retrieved_texts, None

# Example usage - testing small and large strategies
print("--- Testing hybrid retrieval - SMALL ---")
test_query = "What was the total deposits growth in FY 2024?"
docs_small, texts_small, _ = hybrid_retrieval(test_query, strategy_name="small")
if docs_small:
    print(f"Top SMALL chunk content:\n{texts_small[0][:300]}...\n")

print("--- Testing hybrid retrieval - LARGE ---")
docs_large, texts_large, _ = hybrid_retrieval(test_query, strategy_name="large")
if docs_large:
    print(f"Top LARGE chunk content:\n{texts_large[:300]}...\n")

--- Testing hybrid retrieval - SMALL ---
Top SMALL chunk content:
Particulars March 31, 2025 March 31, 2024
Total deposits of twenty largest depositors 117,366.81 79,156.47 Percentage of deposits of twenty largest depositors to total deposits of the Bank 4.32% 3.33%
b) Concentration of advances* (` crore, except percentages)
Particulars March 31, 2025 March 31, 20...

--- Testing hybrid retrieval - LARGE ---
Top LARGE chunk content:
['From the supply side, manufacturing growth slowed in 2024-\n25 with a rise in input costs and slower volume growth while service sector growth broadly held up above 7 per cent. Elsewhere, growth in the construction sector remained healthy at 9.4 per cent. The biggest support to growth came from above trend growth in the agriculture sector, as favourable monsoon conditions supported kharif output while healthy reservoir levels and soil moisture conditions supported rabi crops. On the external front, global growth stood at 3.3 per cent in 2024 - below the h

---
## **2.4 Advanced RAG Technique: Hybrid Search (Sparse + Dense Retrieval)**
As implemented, this technique combines the strengths of two distinct search methods:

*   BM25, a sparse retrieval method, is excellent at keyword matching and finding documents with specific facts.

*   Dense vector search, which relies on embeddings, is highly effective at capturing semantic similarity and is better for nuanced or conceptual queries.

To combine their results, Reciprocal Rank Fusion (RRF) is used. This algorithm merges the ranked lists from both methods into a single, robust final ranking. RRF gives a higher score to documents that are ranked well by both dense and sparse methods, thereby leveraging both lexical and semantic signals.

This hybrid approach significantly improves both recall (the ability to find all relevant documents) and precision (the ability to return only relevant documents). This is particularly useful for financial documents where both exact terms (e.g., "revenue," "EBITDA") and a conceptual understanding (e.g., "financial performance," "market share") are critical for accurate retrieval.

---

## 2.5 Response Generation (RAG)

### **Generative Model Selection (Mistral-7B / Llama-2-7B):**
*   For larger models like Mistral/Llama-2, we need quantization (4-bit) to fit into Colab's free GPU memory.
*   This significantly reduces memory usage with minimal performance impact.
*   Model Link :https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
Logged in and generated a read token for sleamless authentication



In [None]:
# # --- Section 2.5: RAG Response ---

huggingface = userdata.get("huggingface")
login(token=huggingface)

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name_rag = "mistralai/Mistral-7B-Instruct-v0.1"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer_rag = AutoTokenizer.from_pretrained(model_name_rag)
model_rag = AutoModelForCausalLM.from_pretrained(model_name_rag, quantization_config=bnb_config, device_map="auto")

rag_pipeline = pipeline("text-generation", model=model_rag, tokenizer=tokenizer_rag, max_new_tokens=256, temperature=0.1)

def generate_rag_response(query, retrieved_texts):
    context = "\n".join(retrieved_texts)
    messages = [
        {"role": "system", "content": "You are a helpful financial assistant. Answer ONLY based on the provided context. If the answer is not in the context, say 'Information not available in the provided context.'"},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{query}\n\nDirect Answer:"}
    ]
    prompt = tokenizer_rag.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = rag_pipeline(prompt)
    return response[0]['generated_text'].strip()

tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Device set to use cuda:0


## 2.6 Guardrail Implementation (RAG - Input-side)

In [None]:
# RAG Input Guardrail Function
def is_query_relevant_and_safe(query, threshold=0.5):
    query_lower = query.lower()

    # Block queries containing these banned keywords outright
    banned_keywords = [
        "politics", "weather", "sports", "recipe", "personal opinion",
        "violence", "capital of"
    ]

    if any(banned_word in query_lower for banned_word in banned_keywords):
        return False, "Query contains irrelevant or disallowed keywords."

    # Whitelist of direct financial keywords (can be expanded)
    financial_keywords = [
        "revenue", "profit", "income", "total income", "net income", "asset",
        "liability", "dividend", "loan", "deposit", "balance sheet", "equity",
        "share", "financial year", "rupees", "inr", "crore"
    ]

    # Direct match check
    if any(keyword in query_lower for keyword in financial_keywords):
        return True, "Query is relevant and allowed."

    # Semantic similarity fallback using embeddings
    try:
        # Assume embedding_model is globally initialized and available
        query_emb = embedding_model.encode([query_lower])[0]
        financial_embs = embedding_model.encode(financial_keywords)

        # Compute cosine similarity
        similarities = np.dot(query_emb, financial_embs.T) / (
            np.linalg.norm(query_emb) * np.linalg.norm(financial_embs, axis=1)
        )

        if similarities.max() < threshold:
            return False, "Query is not sufficiently related to financial topics."

        return True, "Query is relevant and allowed by semantic similarity."

    except Exception as e:
        # Safe fallback: block query if embeddings fail
        return False, f"Embedding error, rejecting query: {str(e)}"


## Postprocess generated text

In [None]:
def postprocess_generated_text(raw_text, prompt_marker, question=None, years_found=None):
    if not raw_text:
        return ""

    # --- Clean model output ---
    # Only split if a valid prompt_marker is provided and exists in the text
    if prompt_marker and prompt_marker in raw_text:
        parts = raw_text.split(prompt_marker)
        raw_text = parts[-1]

    tokens_to_remove = ["[INST]", "[/INST]", "INST", "</s>", "<s>"]
    for token in tokens_to_remove:
        raw_text = raw_text.replace(token, "")

    raw_text = " ".join(raw_text.split()).strip()

    raw_text = re.sub(r'[J`‘´’]\s*([\d,]+\.?\d*)', r'₹ \1', raw_text)

    # If the text is just a number, it's likely from direct extraction, so return as is.
    if re.match(r'^[\d,.]+$', raw_text):
        return raw_text

    # (The rest of the function remains the same)
    words = raw_text.split()
    while words and len(words[-1]) < 2:
        words.pop()
    raw_text = " ".join(words)

    if len(raw_text) < 3:
        return "Information not available"

    # --------- STEP 2: Financial formatting ---------
    formatted = raw_text
    if question:
        q = question.lower()
        num_match = re.search(r"([\d,]+(?:\.\d+)?)\s*(crore|lakh|million|billion|trillion)", formatted, re.IGNORECASE)

        if num_match:
            value, unit = num_match.groups()
            formatted = f"₹{value} {unit}"

        # Dynamic year handling
        start_year, end_year = None, None
        if years_found and isinstance(years_found, list):
            years_found = sorted(list(set(years_found))) # Ensure unique and sorted
            if len(years_found) == 1:
                start_year = years_found[0]
            elif len(years_found) >= 2:
                start_year, end_year = years_found[0], years_found[1]

        # Auto-label based on question type
        if "revenue" in q:
            if end_year:
                formatted = f"Change in Total Revenue from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Total Revenue for FY {start_year} was {formatted}."
        elif "profit" in q:
            if end_year:
                formatted = f"Change in Net Profit after Tax from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Net Profit after Tax for FY {start_year} was {formatted}."
        elif "asset" in q:
            if end_year:
                formatted = f"Change in Total Assets from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Total Assets for FY {start_year} were {formatted}."
        elif "liabilit" in q:
            if end_year:
                formatted = f"Change in Total Liabilities from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Total Liabilities for FY {start_year} were {formatted}."

    formatted = re.sub(r'[J`‘´’]\s*([\d,]+\.?\d*)', r'₹ \1', formatted)

    return formatted

### Section 2.7: RAG System Sanity Check

In [None]:
print("--- Running RAG System Sanity Check ---")

# Sample financial query for testing
# sample_query_rag = "What was the revenue in 2024?"
sample_query_rag = "What was the dividend per share declared in 2024?"

print(f" Test Query: '{sample_query_rag}'")

# Step 1: Input guardrail check
is_valid, guardrail_msg = is_query_relevant_and_safe(sample_query_rag)
if not is_valid:
    print(f" Guardrail Blocked Query: {guardrail_msg}")
else:
    print(" Guardrail Passed.")

    # Step 2: Perform hybrid retrieval
    print("\n Retrieving context...")
    retrieved_docs, retrieved_texts, _ = hybrid_retrieval(sample_query_rag, top_k=5)

    if retrieved_texts:
        print(f"   - Retrieved {len(retrieved_texts)} chunks.")

        # Step 3: Generate RAG response
        print("\n Generating RAG answer...")
        start_time = time.time()
        raw_rag_answer = generate_rag_response(sample_query_rag, retrieved_texts)
        end_time = time.time()

        # Step 4: Clean output
        final_rag_answer = postprocess_generated_text(raw_rag_answer, "[/INST]", sample_query_rag)

        # Step 5: Display results
        print("\n--- RAG Test Result ---")
        print(f" Answer: {final_rag_answer}")
        print(f"⏱ Response Time: {end_time - start_time:.2f} seconds")
    else:
        print(" No relevant context found for the test query.")

print("\n--- RAG Sanity Check Complete ---")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


--- Running RAG System Sanity Check ---
 Test Query: 'What was the dividend per share declared in 2024?'
 Guardrail Passed.

 Retrieving context...
   - Retrieved 5 chunks.

 Generating RAG answer...

--- RAG Test Result ---
 Answer: The dividend per share declared in 2024 was ₹ 19.50.
⏱ Response Time: 9.04 seconds

--- RAG Sanity Check Complete ---


# 3. RAFT Model System Implementation

## 3.1 Q/A Dataset Preparation for RAFT

In [None]:
# # --- Section 3.1: Q/A Dataset Preparation for RAFT  ---
from datasets import Dataset

raft_dataset_items = []
print("\nPreparing RAFT dataset with retrieved context...")

if train_qa_pairs:
    for qa in train_qa_pairs:
        _, retrieved_texts_raft, _ = hybrid_retrieval(qa['question'])
        if not retrieved_texts_raft:
            retrieved_texts_raft = [""]  # fallback empty context

        context_str = "\n".join(retrieved_texts_raft)
        # Optionally truncate context to max tokens/characters here if needed
        instruction = f"Given the context:\n{context_str}\n\nAnswer the question:\n{qa['question']}"
        raft_dataset_items.append({"instruction": instruction, "output": qa['answer']})

    raft_dataset = Dataset.from_list(raft_dataset_items)
    print(f"Prepared {len(raft_dataset_items)} RAFT training examples.")

    print("\n--- Example RAFT Training Item ---")
    if len(raft_dataset_items) > 0:
        print(raft_dataset_items[0])
else:
    raft_dataset = None
    print("Warning: No training Q/A pairs loaded. Skipping fine-tuning.")



Preparing RAFT dataset with retrieved context...
Prepared 40 RAFT training examples.

--- Example RAFT Training Item ---
{'instruction': 'Given the context:\nto deliver value to shareholders, with a ROE of 14.6 per cent. The Earnings Per Share (EPS) increased by 2.9 per cent to J 88.3, while dividend per share rose by 12.8 per cent to J 22.0 in FY25.23,79,786 18,83,395 27,14,715 FY24FY25\nT he Board of Directors at its meeting held on April 20, 2024, proposed a dividend of ` 19.50 per equity share (previous year: ` 19.00 per equity share) aggregating to ` 14,813.98 crore subject to the approval of shareholders at the ensuing Annual General Meeting. During the year ended March 31, 2024, the dividend paid by the Bank in respect of the previous year ended March 31, 2023 was ` 8,404.42 crore. No dividend was paid in respect of equity shares that were cancelled upon the Scheme becoming\nHDFC Bank Limited212\nDIRECTORS’ REPORTDividend\nThe Board of Directors of the Bank, at its meeting held

## 3.2 Model Selection for Fine-Tuning (Mistral-7B / Llama-2-7B)

*   Using the same base model as RAG for consistency and comparison.
*   This ensures we're comparing RAG vs. Fine-Tuning performance of the same underlying model.

In [None]:
# Auto-detect device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

print("\nLoading Mistral-7B for RAG Response Generation...")

# Define paths and model name
local_model_dir = "mistral7b_lora_ft"
local_tokenizer_dir = "mistral7b_lora_ft_tokenizer"
model_name_ft = "mistralai/Mistral-7B-Instruct-v0.1"

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# --- Loading Model: local if exists, else DL from HF Hub ---
if os.path.isdir(local_model_dir) and os.path.isdir(local_tokenizer_dir):
    print(f" Loading fine-tuned model from local checkpoint: {local_model_dir}")
    print("   Configuring for INFERENCE (cache enabled).")

    tokenizer_ft = AutoTokenizer.from_pretrained(local_tokenizer_dir)
    model_ft = AutoModelForCausalLM.from_pretrained(
        local_model_dir,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype="bfloat16",
        low_cpu_mem_usage=True
    )

    # CHANGE 1: Set use_cache to True for the loaded inference model.
    model_ft.config.use_cache = True

else:
    print(" Local checkpoint NOT found, downloading from HuggingFace Hub...")
    print("   Configuring for TRAINING (cache disabled).")

    tokenizer_ft = AutoTokenizer.from_pretrained(model_name_ft)
    model_ft = AutoModelForCausalLM.from_pretrained(
        model_name_ft,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True
    )

    # CHANGE 2: Moved all training-preparation logic inside this 'else' block.
    # This code will now ONLY run when preparing a fresh model for fine-tuning.

    # Disable cache for training
    model_ft.config.use_cache = False
    model_ft.config.pretraining_tp = 1

    # Prepare for k-bit/quantized training (PEFT utilities)
    model_ft = prepare_model_for_kbit_training(model_ft)

    # Configure LoRA
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model_ft = get_peft_model(model_ft, lora_config)

    print("Mistral-7B for Fine-Tuning loaded and prepared with LoRA.")
    model_ft.print_trainable_parameters()

# Set pad token (this is safe to do for both cases)
tokenizer_ft.pad_token = tokenizer_ft.eos_token

Using device: cuda

Loading Mistral-7B for RAG Response Generation...
 Local checkpoint NOT found, downloading from HuggingFace Hub...
   Configuring for TRAINING (cache disabled).


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Mistral-7B for Fine-Tuning loaded and prepared with LoRA.
trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758


## 3.3 Baseline Benchmarking (Pre-Fine-Tuning)

In [None]:
# Evaluate the un-fine-tuned model on a subset of test questions.
print("\n--- Baseline Benchmarking (Pre-Fine-Tuning Mistral-7B) ---")

def clean_answer(text):
    """
    Cleans up redundant phrases and duplicate factual sentences.
    - Removes repeated sentences with the same numeric data.
    - Trims leading/trailing spaces.
    - Optional: Remove generic disclaimers if fact already present.
    """
    # Split into sentences
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())

    # Remove duplicates (preserve order)
    seen = set()
    cleaned_sentences = []
    for s in sentences:
        if s not in seen:
            seen.add(s)
            cleaned_sentences.append(s)

    # Optional: Remove generic disclaimer if fact is already stated
    disclaimers = [
        "it is not possible to determine",
        "based on the given information",
        "insufficient data"
    ]
    if any(any(d in s.lower() for d in disclaimers) for s in cleaned_sentences):
        facts_only = [s for s in cleaned_sentences if not any(d in s.lower() for d in disclaimers)]
        if facts_only:
            cleaned_sentences = facts_only

    return " ".join(cleaned_sentences)

model_ft = model_ft.to(device, dtype=torch.bfloat16) #Converting to Bfloat16

def get_base_model_answer(query, context_str=""):
    messages = [
        {"role": "system", "content": "You are a general AI assistant. Provide concise answers."},
        {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion: {query}"}
    ]
    prompt = tokenizer_ft.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    start_time = time.time()
    inputs = tokenizer_ft(prompt, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(model_ft.device)
    input_length = inputs.input_ids.shape[1] # Get the length of the input tokens

    with torch.no_grad():
        outputs = model_ft.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.1,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer_ft.eos_token_id
        )

    # Decode only the newly generated tokens, starting after the input prompt
    generated_tokens = outputs[0][input_length:]
    generated_text = tokenizer_ft.decode(generated_tokens, skip_special_tokens=True)

    end_time = time.time()
    return generated_text, (end_time - start_time)

baseline_results = []
baseline_test_subset = test_qa_pairs[:10] if len(test_qa_pairs) >= 10 else test_qa_pairs

print(f"Evaluating base model on {len(baseline_test_subset)} questions...")
for i, qa in enumerate(baseline_test_subset):
    query = qa['question']
    ground_truth = qa['answer']

    # Retrieve context
    _, retrieved_texts_baseline, _ = hybrid_retrieval(query, top_k=5)
    context_for_baseline = "\n".join(retrieved_texts_baseline)

    model_answer, inf_time = get_base_model_answer(query, context_for_baseline)
    model_answer = clean_answer(model_answer)

    is_correct = "Y" if ground_truth.lower() in model_answer.lower() or model_answer.lower() in ground_truth.lower() else "N"

    baseline_results.append({
        "Question": query,
        "Method": "Base Model",
        "Answer": model_answer,
        "Confidence": "N/A",
        "Time (s)": inf_time,
        "Correct (Y/N)": is_correct
    })

    print(f"Q{i+1}: {query}\nBase Answer: {model_answer}\nCorrect: {is_correct}\n")

baseline_df = pd.DataFrame(baseline_results)
print("\n--- Baseline Benchmarking Results (Pre-Fine-Tuning) ---")
print(baseline_df)


--- Baseline Benchmarking (Pre-Fine-Tuning Mistral-7B) ---
Evaluating base model on 10 questions...
Q1: What was the Bank’s Total Deposits as of March 31, 2025?
Base Answer: The Bank's Total Deposits as of March 31, 2025 were 117,366,810,000.
Correct: N

Q2: What was HDFC Bank’s digital engagement volume in FY25?
Base Answer: HDFC Bank's digital engagement volume in FY25 is not provided in the given context.
Correct: N

Q3: How did employee costs change in FY25?
Base Answer: Employee costs increased by 9.27% in FY 2024-25, primarily due to annual fixed pay increases and promotions. This includes front-line sales and overseas staff. The average percentage increase in salaries was 7.11%, inclusive of fixed pay increase and promotions. However, the calculation does not include whole-time directors.
Correct: N

Q4: How much of HDFC Bank’s electricity mix came from renewable sources in FY25?
Base Answer: In FY25, HDFC Bank's electricity mix came from renewable sources at 3.22%.
Correct: N


## 3.4 Fine-Tuning (RAFT)

In [None]:
# # --- Section 3.4: Fine-Tuning (RAFT)  ---
def formatting_prompts_func(example):
    return "\n".join(
        f"<s>[INST] {instr.strip()} [/INST] {outp.strip()}</s>"
        for instr, outp in zip(example["instruction"], example["output"])
    )

# This entire section is wrapped to be skipped if state is restored
if not RESTORED and raft_dataset:
    print("\n--- Running: Model Fine-Tuning (RAFT) ---")

    sft_config_args = SFTConfig(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_8bit",
        learning_rate=2e-4,
        bf16=True,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="cosine",
        report_to="none"
    )

    trainer = SFTTrainer(
        model=model_ft,
        train_dataset=raft_dataset,
        peft_config=lora_config,
        formatting_func=formatting_prompts_func,
        args=sft_config_args,
    )

    trainer.train()
    print("Fine-Tuning complete!")
    trainer.save_model("raft_model_adapters")
    tokenizer_ft.save_pretrained("tokenizer_ft") # Save tokenizer with adapters
    print("RAFT model adapters and tokenizer saved.")

    # Save hyperparameters
    hyperparams = {
        "learning_rate": sft_config_args.learning_rate,
        "epochs": sft_config_args.num_train_epochs,
        "device": device
    }
    with open("raft_hyperparams.json", "w") as f:
        json.dump(hyperparams, f, indent=4)
    print("Hyperparameters saved.")

    # Call the save_project_state function after all artifacts are created
    print("\n--- Saving Final Project State ---")
    save_project_state()

    # Deactivate gradient checkpointing and activate caching for inference
    print("\\nReconfiguring model for inference...")
    model_ft.gradient_checkpointing_disable()
    model_ft.config.use_cache = True

    # Free up memory
    del trainer
    torch.cuda.empty_cache()
    gc.collect()

    print("Model is now in inference mode.")

else:
    print("\n--- Skipping: Fine-Tuning (Using Restored Adapters) ---")


--- Running: Model Fine-Tuning (RAFT) ---


Applying formatting function to train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Step,Training Loss
10,1.7335
20,0.3793
30,0.2656


Fine-Tuning complete!
RAFT model adapters and tokenizer saved.
Hyperparameters saved.

--- Saving Final Project State ---
Project state saved as project_state_artifacts.zip


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

\nReconfiguring model for inference...
Model is now in inference mode.


# Storing the model to Hugging Face

In [1]:
import zipfile
import os

# The full path to your zip file in Colab
zip_file_path = "/content/raft_model_adapters.zip"

# The destination is the main /content/ directory
extraction_path = "/content/"

# Check if the zip file exists
if os.path.exists(zip_file_path):
    # Open the zip file in read mode
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        # Extract all contents directly into the /content/ directory
        zip_ref.extractall(extraction_path)

    print(f" Successfully extracted '{zip_file_path}' into the '{extraction_path}' directory.")
else:
    print(f" Error: The file '{zip_file_path}' was not found.")
    print("Please ensure the file has been uploaded to your Colab session.")

 Successfully extracted '/content/raft_model_adapters.zip' into the '/content/' directory.


In [2]:
from huggingface_hub import HfApi, HfFolder, create_repo, login
from google.colab import userdata
huggingface_write = userdata.get ("huggingface_write")
login(huggingface_write)

Hugging Face Repo https://huggingface.co/Muralikrishnaraparthi/Mistral-7B-HDFC-Finance-RAFT

In [3]:
from huggingface_hub import HfApi

# Define a name for your new model on the Hub
new_model_name = "Mistral-7B-HDFC-Finance-RAFT" # Our Model name
your_hf_username = "muralikrishnaraparthi" # HF username

api = HfApi()

# Upload the entire folder of model adapters
api.upload_folder(
    folder_path="raft_model_adapters",
    repo_id=f"{your_hf_username}/{new_model_name}",
    path_in_repo="." # Upload to the root of the repository
)

# Upload the entire folder containing the tokenizer
api.upload_folder(
    folder_path="tokenizer_ft",
    repo_id=f"{your_hf_username}/{new_model_name}",
    path_in_repo="." # Upload to the root of the repository
)

print(f"Successfully uploaded adapters and tokenizer to '{new_model_name}'.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...raft_model_adapters/tokenizer.model: 100%|##########|  493kB /  493kB            

  ..._adapters/adapter_model.safetensors:   0%|          |  564kB /  168MB            

  ...ft_model_adapters/training_args.bin:   2%|1         |   102B / 6.10kB            

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  /content/tokenizer_ft/tokenizer.model : 100%|##########|  493kB /  493kB            

Successfully uploaded adapters and tokenizer to 'Mistral-7B-HDFC-Finance-RAFT'.




---


## **3.5 Advanced Fine-Tuning Technique (Group 99: Retrieval-Augmented Fine-Tuning - RAFT)**

1.   As implemented in 3.1, this technique focuses on augmenting the fine-tuning dataset with retrieved relevant contexts for each Q/A pair.
2.   The model learns to incorporate and leverage external information during its generative process.
3.   This helps the fine-tuned model:
  *   **Reduce Hallucination:** By explicitly training on contexts, it learns to ground its answers.
  *   **Improve Factual Accuracy:** It's guided to generate answers that are directly supported by the context.
  *   **Enhance Domain-Specific Understanding:** It learns to extract and synthesize information from financial documents more effectively.
4.   This approach bridges the gap between traditional fine-tuning (which can lead to memorization or hallucination) and RAG (which relies solely on retrieval at inference time).


---


## 3.6 Guardrail Implementation (RAFT - Output-side)

In [None]:
# # --- Section 3.6: Guardrail Implementation (RAFT - Output-side)  ---
# RAFT Output Guardrail Function

import re

def normalize_number(s):
    """
    Normalize numeric strings by removing currency symbols, commas, and whitespace,
    and converting to lowercase for uniform matching.
    """
    return s.lower().replace("₹", "").replace("inr", "").replace("rs.", "").replace(",", "").strip()

def check_raft_output_for_hallucination(generated_answer, original_query, relevant_context):
    """
    Returns (is_hallucinated: bool, message: str)
    Checks if numeric facts in generated_answer are present in the relevant_context,
    after normalization, to detect hallucinations.
    Also checks for irrelevant non-financial keywords in the output.

    Arguments:
    - generated_answer: the string response from the model
    - original_query: the user query string (not used here but may be useful)
    - relevant_context: the retrieved context string used as evidence
    """
    generated_lower = generated_answer.lower()
    context_lower = relevant_context.lower() if relevant_context else ""

    # Pattern to match numbers with optional INR symbols and suffixes like million, crore, etc.
    financial_figures_pattern = r'\b(?:₹|inr|rs\.?\s*)?[\d,\.]+\s*(?:million|billion|trillion|lakh|crore|cr)?\b'

    # Extract and normalize all numbers from the answer and context
    extracted_answer_figs = set(normalize_number(x) for x in re.findall(financial_figures_pattern, generated_lower, re.IGNORECASE))
    extracted_context_figs = set(normalize_number(x) for x in re.findall(financial_figures_pattern, context_lower, re.IGNORECASE))

    # Check if all numeric facts in the answer are present in the context
    if extracted_answer_figs and not extracted_answer_figs.issubset(extracted_context_figs):
        return True, "Potential hallucination: Numeric facts not grounded in context."

    # Check for presence of obvious non-financial keywords unrelated to financial context
    non_financial_keywords = ["weather", "movie", "sports", "recipe", "paris", "politics", "weather", "sports", "recipe", "personal opinion",
        "violence", "capital of" ]
    if any(k in generated_lower for k in non_financial_keywords):
        return True, "Potential hallucination: Output contains non-financial topics."

    return False, "Output appears factual and grounded."



## 3.7 Fine-Tuned Model Inference Function - Response Generation (RAFT)

In [None]:
##########################
# Step 1: Extend patterns #
##########################

RAFT_PATTERNS = {
    # --- Income Statement ---
    "revenue": [
        r"total income.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "profit after tax": [
        r"profit after tax.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore",
        r"net profit.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "earnings per share": [
        r"earnings per.*?share.*?(`|₹)\s*([\d\.]+)"
    ],

    # --- Balance Sheet ---
    "total assets": [
        r"total assets.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "total liabilities": [
        r"total liabilities.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "deposits": [
        r"total deposits.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "loans": [
        r"advances.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "total borrowings": [
        r"borrowings.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],

    # --- Dividends ---
    "dividend per share": [
        r"dividend of ` ([\d\.]+) per equity share",
        r"dividend of ₹ ([\d\.]+) per equity share",
        r"dividend of ([\d\.]+) per equity share",
        r"dividend.*?([\d\.]+) per share"
    ],
    "total dividend": [
        r"dividend paid by the Bank.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore",
        r"aggregating to.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],

    # --- Ratios & Metrics (usually percentages) ---
    "capital adequacy ratio": [
        r"capital adequacy ratio.*?([\d\.]+)%"
    ],
    "gross npa": [
        r"gross npa.*?([\d\.]+)%"
    ],
    "net npa": [
        r"net npa.*?([\d\.]+)%"
    ],
    "return on assets": [
        r"return on assets.*?([\d\.]+)%"
    ],
    "return on equity": [
        r"return on equity.*?([\d\.]+)%"
    ]
}

# Example alias mapping - define as per your use case
metric_aliases = {
    # --- Income Statement Items ---
    "revenue": ["revenue", "total income", "top line", "turnover"],
    "profit after tax": ["profit after tax", "pat", "net profit", "net earnings", "profit for the period"],
    "earnings per share": ["earnings per share", "eps"],

    # --- Balance Sheet Items ---
    "total assets": ["total assets"],
    "total liabilities": ["total liabilities"],
    "deposits": ["deposits", "total deposits"],
    "loans": ["advances", "loans"],
    "total borrowings": ["borrowings", "total borrowings"],

    # --- Dividend-Related ---
    "dividend per share": ["dividend per share"],
    "total dividend": ["dividend paid", "dividend aggregating", "dividend"],

    # --- Key Banking Ratios & Metrics ---
    "capital adequacy ratio": ["capital adequacy ratio", "car"],
    "gross npa": ["gross npa", "gross non-performing assets"],
    "net npa": ["net npa", "net non-performing assets"],
    "return on assets": ["return on assets", "roa"],
    "return on equity": ["return on equity", "roe"]
}

####################################
# Utility: Select top relevant paragraphs
####################################

def select_top_paragraphs(query: str, paragraphs: List[str], top_n: int = 3) -> List[str]:
    query_terms = set(query.lower().split())
    scored = []
    for para in paragraphs:
        para_terms = set(para.lower().split())
        score = len(query_terms & para_terms)
        scored.append((score, para))
    scored.sort(key=lambda x: (-x[0], -len(x[1])))  # Prioritize match count then length
    selected = [para for _, para in scored[:top_n]]
    return selected

#############################
# Convert to number function #
#############################

def convert_to_number(value_str, unit=None):
    """
    Convert a string + unit (lakh, crore, million, etc.) into a float in base units.
    Example: ("1,099.42", "crore") -> 10994200000.0
    """
    num = value_str.replace(",", "")
    try:
        base_val = float(num)
    except ValueError:
        return None

    if unit:
        unit = unit.lower()
        if unit in ["lakh", "lakhs"]:
            base_val *= 1e5
        elif unit in ["crore", "cr", "crores"]:
            base_val *= 1e7
        elif unit == "million":
            base_val *= 1e6
        elif unit == "billion":
            base_val *= 1e9
        elif unit == "trillion":
            base_val *= 1e12
    return base_val


#############################
# Normalize the year #
#############################

def normalize_years_in_query(query: str) -> list[str]:
    """
    Finds years in various formats (e.g., 2024, FY25, 23-24) in a query
    and returns them as a sorted list of unique four-digit strings.
    """
    found_years = set()

    # Pattern 1: Find four-digit years like 2023, 2024
    for match in re.findall(r"\b(20\d{2})\b", query):
        found_years.add(match)

    # Pattern 2: Find formats like FY24, FY 25, Financial Year 24
    for match in re.findall(r"\b(?:FY|Financial Year)\s*(\d{2,4})\b", query, re.IGNORECASE):
        year = f"20{match}" if len(match) == 2 else match
        found_years.add(year)

    # Pattern 3: Find year ranges like 23-24 or 2023-24
    for match in re.findall(r"\b(\d{2,4})[-–](\d{2,4})\b", query):
        start, end = match
        start_full = f"20{start}" if len(start) == 2 else start
        end_full = f"20{end}" if len(end) == 2 else end
        found_years.add(start_full)
        found_years.add(end_full)

    return sorted(list(found_years))

###########################################################
# FINAL CORRECTED HELPER FUNCTION
###########################################################
def extract_comparative_values(context, metric_aliases):
    """
    Finds sentences with a "value A compared to value B" structure.
    This robust version finds all numbers and years on a line and pairs them.
    Returns a tuple: (later_year_value, earlier_year_value)
    """
    # Helper to convert string like "` 500 crore" to a numeric value
    def to_float(s):
        try:
            num_part = re.search(r'[\d,\.]+', s).group()
            num = float(num_part.replace(",", ""))
            if "crore" in s.lower():
                num *= 10_000_000
            elif "lakh" in s.lower():
                num *= 100_000
            return num
        except (AttributeError, ValueError):
            return None

    for line in context.split('\n'):
        # 1. Check for prerequisite keywords to ensure the line is relevant
        if not any(alias in line.lower() for alias in metric_aliases):
            continue
        if not any(comp in line.lower() for comp in ["as compared to", "from"]):
            continue

        # 2. Find all number-like and year-like strings on the line
        num_pattern = r"[`₹]?\s*[\d,]+\.\d{2}(?:\s*crore|\s*lakh)?"
        year_pattern = r"\b20\d{2}\b"

        numbers_found = re.findall(num_pattern, line)
        years_found = re.findall(year_pattern, line)

        # 3. If we find exactly two of each, we have a very strong candidate
        if len(numbers_found) == 2 and len(years_found) == 2:
            try:
                val1 = to_float(numbers_found[0])
                val2 = to_float(numbers_found[1])
                year1 = int(years_found[0])
                year2 = int(years_found[1])

                if val1 is None or val2 is None:
                    continue

                # 4. Associate and return in the correct order (later year, earlier year)
                if year1 > year2:
                    return (val1, val2)
                else:
                    return (val2, val1)
            except (ValueError, TypeError, IndexError):
                # If anything goes wrong with this line, just move to the next
                continue

    # If no suitable line was found in the entire context, return None
    return None

############################################
# Post-process descriptive answers (truncation or summarization)
############################################

def postprocess_descriptive_answer(answer, max_sentences=3):
    # Simple truncation at sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', answer.strip())
    if len(sentences) > max_sentences:
        truncated = ' '.join(sentences[:max_sentences]) + " ..."
        return truncated
    return answer


#############################
# clean generated answer
#############################

def clean_generated_answer(text: str) -> str:
    # Find the text that comes after the last instance of the instruction marker
    if "[/INST]" in text:
        text = text.split("[/INST]")[-1]

    # Remove any remaining common artifacts and whitespace
    text = text.replace("</s>", "").strip()
    return text



#############################
# Ground Truth Checking
#############################

def is_answer_correct(predicted: str, ground_truth: str, guard_msg: str = None) -> bool:
    """
    Returns True if the predicted answer matches ground_truth.
    Uses a more robust regex to handle numbers and avoid errors.
    """
    # FIX: A more robust regex that ensures a number starts with a digit.
    number_pattern = r"\d[\d,]*\.?\d*"

    # Extract numbers from both the predicted answer and the ground truth
    pred_nums = re.findall(number_pattern, predicted)
    truth_nums = re.findall(number_pattern, ground_truth)

    # Clean the numbers by removing commas before converting to float
    pred_nums_set = {float(n.replace(",", "")) for n in pred_nums}
    truth_nums_set = {float(n.replace(",", "")) for n in truth_nums}

    if not truth_nums_set: # If ground truth has no numbers, fall back to simple string comparison
        return predicted.strip().lower() == ground_truth.strip().lower()

    # Check if any of the key numbers from the ground truth are present in the prediction
    return bool(pred_nums_set & truth_nums_set)

#############################
# Create generation pipeline
#############################

pipeline_raft = pipeline("text-generation", model=model_ft, tokenizer=tokenizer_ft, max_new_tokens=256, temperature=0.1)
model_ft = model_ft.to(device).to(dtype=torch.bfloat16)


###########################################################
# FINAL, ROBUST RAFT INFERENCE FUNCTION
###########################################################
def generate_raft_response(query, top_k=10):
    _, retrieved_texts, _ = hybrid_retrieval(query, top_k=top_k)
    context_str = "\n".join(retrieved_texts)

    query_lower = query.lower()
    metric_key = next((k for k, aliases in metric_aliases.items() if any(alias in query_lower for alias in aliases)), None)
    years_found = normalize_years_in_query(query)
    is_change_query = any(w in query_lower for w in ["change", "difference", "increase", "decrease", "compare", "between", "year-on-year"])

    # --- Path A: Advanced Direct Extraction ---
    if metric_key and len(years_found) == 2 and is_change_query:
        values = extract_comparative_values(context_str, metric_aliases[metric_key])

        if values:
            later_val, earlier_val = values
            change = later_val - earlier_val
            return f"₹{change:,.2f}", context_str, None, 0.98

    # --- Path B: LLM Fallback (only if Path A fails) ---
    messages = [
        {"role": "system", "content": "You are a financial analyst. Answer using ONLY the provided context. If the question requires a calculation (like a change or difference), you MUST perform the calculation and provide only the final numerical result."},
        {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion:\n{query}\n\nDirect Numerical Answer:"}
    ]
    prompt = tokenizer_ft.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer_ft(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048).to(device)

    try:
        with torch.no_grad():
            outputs = model_ft.generate(**inputs, max_new_tokens=64, do_sample=False, pad_token_id=tokenizer_ft.eos_token_id)

        generated_text = tokenizer_ft.decode(outputs[0], skip_special_tokens=True).strip()
        final_answer = clean_generated_answer(generated_text)
        return final_answer, context_str, prompt, 0.7
    except Exception as e:
        print(f"Error during generation: {e}")
        return "Information not available", context_str, None, 0.0

Device set to use cuda:0


### Section 3.8: RAFT System Sanity Check

In [None]:
# --- Section 3.8: RAFT System Sanity Check ---
# Assume check_raft_output_for_hallucination and generate_raft_response are defined elsewhere

import time

print("--- Running RAFT System Sanity Check ---")

# Sample query for testing
# sample_query_raft = "What was the change in Profit After Tax from FY 2024 to FY 2025?"
sample_query_raft = "What was the revenue in 2024?"

print(f" Test Query: '{sample_query_raft}'")

print("\n Generating RAFT answer...")
start_time = time.time()

raw_answer, context_used, prompt, confidence = generate_raft_response(sample_query_raft)

end_time = time.time()

print("\n Checking output with guardrail...")
is_hallucinated, guardrail_msg = check_raft_output_for_hallucination(raw_answer, sample_query_raft, context_used)
if is_hallucinated:
    print(f"   -  Guardrail Flagged: {guardrail_msg}")
else:
    print("   -  Guardrail Passed.")

print("\n--- RAFT Test Result ---")
print(f" Answer: {raw_answer}")
print(f" Confidence: {confidence}")
print(f" Response Time: {end_time - start_time:.2f} seconds")
print(f"\nContext Used:\n{context_used[:500]}...")  # Show snippet of context

if prompt:
    print(f"\nLLM Prompt (if used):\n{prompt[:500]}...")

print("\n--- RAFT Sanity Check Complete ---")


--- Running RAFT System Sanity Check ---
 Test Query: 'What was the revenue in 2024?'

 Generating RAFT answer...

 Checking output with guardrail...
   -  Guardrail Passed.

--- RAFT Test Result ---
 Answer: The revenue in 2024 was over 9.32 crore.
 Confidence: 0.7
 Response Time: 9.54 seconds

Context Used:
31, 2025, and its profit and its cash flows for the year ended on that date.
This area was considered a key audit matter because of the significant concentration of revenue during the last quarter of financial period (including cut-off at the Balance sheet date). Due to the nature of the industry, revenue is skewed towards the balance sheet date. Hence, there is possibility that policy sales of the next financial year are accounted in the current period.Our procedures included the following: • U...

LLM Prompt (if used):
<s> [INST] You are a financial analyst. Answer using ONLY the provided context. If the question requires a calculation (like a change or difference), you MUST per

# 4. Testing, Evaluation & Comparison

In [None]:
# Unified function for both RAG and RAFT (for evaluation table)
def get_model_response_for_eval(query, model_type):
    answer, confidence, response_time = "", 0.0, 0.0
    guardrail_status, guardrail_message, context_used = "N/A", "", ""

    start_time = time.time()

    # Find and normalize years from the query to ensure consistent processing
    years_found = normalize_years_in_query(query)

    try:
        if model_type == "RAG":
            # (The logic to get raw_answer remains the same)
            retrieved_docs, retrieved_texts, _ = hybrid_retrieval(query, top_k=5)
            context_used = "\n".join(retrieved_texts)
            if retrieved_texts:
                raw_answer = generate_rag_response(query, retrieved_texts)
                # Pass the 'years_found' variable to the post-processor
                answer = postprocess_generated_text(raw_answer, "[/INST]", query, years_found=years_found)
                scores = [doc.metadata.get('fused_score', 0) for doc in retrieved_docs]
                confidence = np.mean(scores) if scores else 0.0

        elif model_type == "Fine-Tuned(RAFT)":
            raw_answer, context_used, prompt, confidence = generate_raft_response(query)
            # Pass the 'years_found' variable to the post-processor
            answer = postprocess_generated_text(raw_answer, prompt, query, years_found=years_found)
            is_hallucinated, output_guardrail_msg = check_raft_output_for_hallucination(
                answer, query, context_used
            )
            if is_hallucinated:
                guardrail_status, guardrail_message, confidence = "Flagged (Output)", output_guardrail_msg, 0.4
            else:
                guardrail_status = "Passed"

    except Exception as e:
        # This block now correctly handles any errors during the process
        print(f"An error occurred during evaluation for query '{query}': {e}")
        answer = f"Error during generation"
        confidence = 0.0
        guardrail_status = "Error"

    response_time = time.time() - start_time
    return answer, confidence, response_time, guardrail_status, guardrail_message, context_used

## 4.1 & 4.2 Test Questions & Extended Evaluation

In [None]:
import time
import numpy as np
import pandas as pd

# Assume these functions are defined elsewhere in your code:
# - get_model_response_for_eval(query, model_type)
# - is_answer_correct(predicted_answer, ground_truth, guardrail_message)
# - test_qa_pairs: list of dictionaries with 'question' and 'answer'

def run_evaluation(test_qa_pairs):
    if not test_qa_pairs:
        print("No test QA pairs provided.")
        return None

    print("\n--- Starting Evaluation on Dynamic Test Split ---")

    evaluation_results = []

    for qa in test_qa_pairs:
        query = qa['question']
        ground_truth = qa.get('answer', '')

        print(f"\nEvaluating query: '{query}'")

        # Evaluate RAG model
        answer_rag, conf_rag, time_rag, guard_stat_rag, guard_msg_rag, _ = get_model_response_for_eval(query, "RAG")
        correct_rag = 'Y' if is_answer_correct(answer_rag, ground_truth, guard_msg_rag) else 'N'
        evaluation_results.append({
            "Question": query,
            "Method": "RAG",
            "Answer": answer_rag,
            "Confidence": float(f"{conf_rag:.2f}"),
            "Time (s)": float(f"{time_rag:.2f}"),
            "Correct (Y/N)": correct_rag
        })

        # Evaluate Fine-Tuned RAFT model
        answer_raft, conf_raft, time_raft, guard_stat_raft, guard_msg_raft, _ = get_model_response_for_eval(query, "Fine-Tuned(RAFT)")
        correct_raft = 'Y' if is_answer_correct(answer_raft, ground_truth, guard_msg_raft) else 'N'
        evaluation_results.append({
            "Question": query,
            "Method": "Fine-Tune",
            "Answer": answer_raft,
            "Confidence": float(f"{conf_raft:.2f}"),
            "Time (s)": float(f"{time_raft:.2f}"),
            "Correct (Y/N)": correct_raft
        })

        # Optional: detailed log
        print(f"RAG Answer: {answer_rag} | Correct: {correct_rag} | Confidence: {conf_rag:.2f} | Time: {time_rag:.2f}s")
        print(f"RAFT Answer: {answer_raft} | Correct: {correct_raft} | Confidence: {conf_raft:.2f} | Time: {time_raft:.2f}s")

    evaluation_df = pd.DataFrame(evaluation_results)
    print("\n--- Dynamic Test Split Evaluation Complete ---")

    # Calculate aggregate metrics
    avg_speed_rag = evaluation_df[evaluation_df['Method'] == 'RAG']['Time (s)'].mean()
    avg_accuracy_rag = evaluation_df[evaluation_df['Method'] == 'RAG']['Correct (Y/N)'].value_counts(normalize=True).get('Y', 0) * 100

    avg_speed_raft = evaluation_df[evaluation_df['Method'] == 'Fine-Tune']['Time (s)'].mean()
    avg_accuracy_raft = evaluation_df[evaluation_df['Method'] == 'Fine-Tune']['Correct (Y/N)'].value_counts(normalize=True).get('Y', 0) * 100

    print(f"\n--- Performance Summary ---")
    print(f"Average Inference Speed (RAG): {avg_speed_rag:.4f} seconds")
    print(f"Average Accuracy (RAG): {avg_accuracy_rag:.2f}%")
    print(f"Average Inference Speed (Fine-Tuned RAFT): {avg_speed_raft:.4f} seconds")
    print(f"Average Accuracy (Fine-Tuned RAFT): {avg_accuracy_raft:.2f}%")

    return evaluation_df

# Example usage:
evaluation_df = run_evaluation(test_qa_pairs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Starting Evaluation on Dynamic Test Split ---

Evaluating query: 'What was the Bank’s Total Deposits as of March 31, 2025?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: The Bank's Total Deposits as of March 31, 2025 were 117,366,810,000. | Correct: Y | Confidence: 0.02 | Time: 9.90s
RAFT Answer: ₹117,366.81 crore | Correct: N | Confidence: 0.40 | Time: 18.90s

Evaluating query: 'What was HDFC Bank’s digital engagement volume in FY25?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Information not available in the provided context. | Correct: N | Confidence: 0.02 | Time: 4.30s
RAFT Answer: HDFC Bank's digital engagement volume in FY25 is not provided in the context. | Correct: Y | Confidence: 0.70 | Time: 11.16s

Evaluating query: 'How did employee costs change in FY25?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Information not available in the provided context. | Correct: N | Confidence: 0.02 | Time: 4.44s
RAFT Answer: The percentage increase in the median remuneration of employees in the FY 2024-25 was 9.27%. | Correct: Y | Confidence: 0.70 | Time: 14.47s

Evaluating query: 'How much of HDFC Bank’s electricity mix came from renewable sources in FY25?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: 3.22% of HDFC Bank's electricity mix came from renewable sources in FY25. | Correct: Y | Confidence: 0.02 | Time: 6.07s
RAFT Answer: 3.22% | Correct: Y | Confidence: 0.70 | Time: 9.40s

Evaluating query: 'What was the value of Total Borrowings as of March 31, 2025?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Information not available in the provided context. The context only provides details of loans acquired during the year ended March 31, 2024 and March 31, 2023, as well as expected rate of return on investments and major categories of plan assets as a percentage of fair value of total plan assets as at March 31, 2024 and March 31, 2023. There is no information provided about the value of total borrowings as of March 31, 2025. | Correct: Y | Confidence: 0.02 | Time: 16.09s
RAFT Answer: ₹29,326.76 crore | Correct: N | Confidence: 0.40 | Time: 16.56s

Evaluating query: 'What was HDFC Bank’s total customer base in FY24?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Information not available in the provided context. | Correct: N | Confidence: 0.02 | Time: 3.29s
RAFT Answer: ₹53.82 lakh | Correct: N | Confidence: 0.70 | Time: 7.97s

Evaluating query: 'What was the Profit on Sale of Investments in FY24?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Net Profit after Tax for FY 2024 was ₹2,461 crore. | Correct: N | Confidence: 0.02 | Time: 8.36s
RAFT Answer: Net Profit after Tax for FY 2024 was ₹1,959 crore. | Correct: N | Confidence: 0.70 | Time: 13.51s

Evaluating query: 'How much did HDFC Bank earn from Fees and Commissions in FY25?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Information not available in the provided context. | Correct: N | Confidence: 0.02 | Time: 3.50s
RAFT Answer: ₹28,160.7 crore | Correct: N | Confidence: 0.70 | Time: 10.13s

Evaluating query: 'What was the change in RoE between FY24 and FY25?'


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RAG Answer: Information not available in the provided context. | Correct: N | Confidence: 0.02 | Time: 4.24s
RAFT Answer: The change in RoE between FY24 and FY25 is not provided in the context. | Correct: Y | Confidence: 0.70 | Time: 9.43s

Evaluating query: 'What was the total Net Revenue for FY23–24 and FY24–25?'
RAG Answer: Change in Total Revenue from FY 2023 to FY 2024 was Information not available in the provided context.. | Correct: N | Confidence: 0.02 | Time: 5.86s
RAFT Answer: Change in Total Revenue from FY 2023 to FY 2024 was ₹78,409.16 crore. | Correct: N | Confidence: 0.70 | Time: 16.73s

--- Dynamic Test Split Evaluation Complete ---

--- Performance Summary ---
Average Inference Speed (RAG): 6.6050 seconds
Average Accuracy (RAG): 30.00%
Average Inference Speed (Fine-Tuned RAFT): 12.8260 seconds
Average Accuracy (Fine-Tuned RAFT): 40.00%


## 4.2 Results Table

In [None]:
# # # --- Section 4.3: Results Table  ---
print("\n\n--- Final Evaluation Results Table ---")
# Reorder columns to match the required format exactly
final_columns = ['Question', 'Method', 'Answer', 'Confidence', 'Time (s)', 'Correct (Y/N)']
print(evaluation_df[final_columns].to_string())



--- Final Evaluation Results Table ---
                                                                        Question     Method                                                                                                                                                                                                                                                                                                                                                                                                                                        Answer  Confidence  Time (s) Correct (Y/N)
0                       What was the Bank’s Total Deposits as of March 31, 2025?        RAG                                                                                                                                                                                                                                                                                                                      

# 5. User Interface (CLI)

Here are five robust questions used to test Financial Chatbot in CLI.

1.  What was the revenue from operations for the year ended March 31, 2024?
2.  What was the change in Profit After Tax from FY 2024 to FY 2025?
3.  How did the cost-to-income ratio of the bank change over the last two years?
4.  What was the dividend paid for the financial year 2024?
5.  What was the dividend PER share for the financial year 2025?

In [None]:
def run_cli_notebook():
    print("=== Group 99 Financial QA CLI ===")
    print("Type your question and choose mode (RAG / RAFT). Type 'exit' to quit.\n")

    while True:
        query = input("Enter your question: ").strip()
        if query.lower() in ["exit", "quit"]:
            print("Goodbye!")
            break
        if not query:
            continue

        valid, guard_msg = is_query_relevant_and_safe(query)
        if not valid:
            print(f" Query blocked: {guard_msg}\n")
            continue

        years_found = normalize_years_in_query(query)
        mode = input("Choose mode [RAG / RAFT] (default=RAFT): ").strip().upper() or "RAFT"

        if mode not in ["RAG", "RAFT"]:
            print(" Invalid mode. Please enter RAG or RAFT.\n")
            continue

        start_time = time.time()
        raw_answer = None # Initialize a single answer variable to None
        context_used = ""
        model_confidence = 0.0

        try:
            retrieved_docs, retrieved_texts, _ = hybrid_retrieval(query, top_k=5)
            context_used = "\n".join(retrieved_texts)
            scores = [doc.metadata.get('fused_score', 0) for doc in retrieved_docs]
            retrieval_confidence = np.mean(scores) if scores else 0.0

            if not retrieved_texts:
                print(" No relevant context found.\n")
                continue

            # --- UNIFIED HYBRID LOGIC ---
            query_lower = query.lower()
            metric_key = next((k for k, aliases in metric_aliases.items() if any(alias in query_lower for alias in aliases)), None)
            is_change_query = any(w in query_lower for w in ["change", "difference", "compare", "between", "year-on-year"])

            # 1. Attempt Direct Extraction First (for both RAG and RAFT)
            if metric_key and len(years_found) == 2 and is_change_query:
                values = extract_comparative_values(context_used, metric_aliases[metric_key])
                if values:
                    later_val, earlier_val = values
                    change = later_val - earlier_val
                    raw_answer = f"₹{change:,.2f}"
                    model_confidence = 0.98

            # 2. If Direct Extraction failed (raw_answer is still None), use the selected model
            if raw_answer is None:
                if mode == "RAG":
                    raw_answer = generate_rag_response(query, retrieved_texts)
                    model_confidence = retrieval_confidence
                else: # RAFT mode
                    # The RAFT generation function already contains its own internal extraction logic
                    raw_answer, context_used, _, model_confidence = generate_raft_response(query)

            # 3. Post-process the final result for consistent formatting
            final_answer = postprocess_generated_text(raw_answer, "[/INST]", query, years_found=years_found)

            # 4. Apply RAFT-specific guardrail if that mode was used
            if mode == "RAFT":
                flagged, flag_msg = check_raft_output_for_hallucination(final_answer, query, context_used)
                if flagged:
                    print(f" [!] Warning: RAFT output guardrail flagged a potential issue ({flag_msg})")
                    model_confidence = max(0.3, model_confidence - 0.4)

        except Exception as e:
            print(f" Error during answer generation: {str(e)}\n")
            continue

        elapsed = time.time() - start_time
        print("\n--- Result ---")
        print(f"Method Used: {mode}")
        print(f"Answer: {final_answer}")
        print(f"Retrieval Confidence: {retrieval_confidence:.4f}")
        print(f"Final Confidence: {model_confidence:.4f}")
        print(f"Response Time: {elapsed:.2f} seconds")
        print(f"\n--- Context Snippet --- \n{context_used[:500]}...")
        print("\n" + "=" * 40 + "\n")

#To run the CLI, you would call the function:
if __name__ == "__main__":
    run_cli_notebook()


=== Group 99 Financial QA CLI ===
Type your question and choose mode (RAG / RAFT). Type 'exit' to quit.

Enter your question: What was the dividend per share for the financial year 2025?
Choose mode [RAG / RAFT] (default=RAFT): RAFT

--- Result ---
Method Used: RAFT
Answer: The dividend per share for the financial year 2025 was ₹ 22.00.
Retrieval Confidence: 0.0287
Final Confidence: 0.7000
Response Time: 16.32 seconds

--- Context Snippet --- 
T he Board of Directors at its meeting held on April 20, 2024 proposed a dividend of ` 19.50 per equity share (previous year: ` 19.00 per equity share) aggregating to ` 14,813.98 crore subject to the approval of shareholders at the ensuing Annual General Meeting. During the year ended March 31, 2024, the dividend paid by the Bank in respect of the previous year ended March 31, 2023 was ` 8,404.42 crore. No dividend was paid in respect of equity shares that were cancelled upon the Scheme becoming...


Enter your question: What was the dividend per

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Result ---
Method Used: RAG
Answer: Information not available in the provided context.
Retrieval Confidence: 0.0287
Final Confidence: 0.0287
Response Time: 6.33 seconds

--- Context Snippet --- 
T he Board of Directors at its meeting held on April 20, 2024 proposed a dividend of ` 19.50 per equity share (previous year: ` 19.00 per equity share) aggregating to ` 14,813.98 crore subject to the approval of shareholders at the ensuing Annual General Meeting. During the year ended March 31, 2024, the dividend paid by the Bank in respect of the previous year ended March 31, 2023 was ` 8,404.42 crore. No dividend was paid in respect of equity shares that were cancelled upon the Scheme becoming...


Enter your question: What is the weather like in Mumbai?
 Query blocked: Query contains irrelevant or disallowed keywords.

Enter your question: What was the revenue from operations for the year ended March 31, 2025?
Choose mode [RAG / RAFT] (default=RAFT): RAG


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Result ---
Method Used: RAG
Answer: Total Revenue for FY 2025 was ₹346,149.32 crore.
Retrieval Confidence: 0.0193
Final Confidence: 0.0193
Response Time: 6.62 seconds

--- Context Snippet --- 
Statutory Reports and Financial StatementsSegment reporting for the year ended March 31, 2025 is given below: Business segments:
(` crore) Sr. No.Particulars Treasury Retail banking Wholesale bankingOther banking OperationsTotal
Digital banking #Non-Digital Banking
1Segment revenue 62,227.48 8.59 283,426.20 191,964.51 35,449.05 573,075.83
2Unallocated revenue -
3Less: Inter-segment revenue 226,926.51
4Income from operations (1) + (2) - (3) 346,149.32
31, 2025, and its profit and its cash flows f...


Enter your question: What was the revenue from operations for the year ended March 31, 2025?
Choose mode [RAG / RAFT] (default=RAFT): RAFT

--- Result ---
Method Used: RAFT
Answer: Total Revenue for FY 2025 was ₹346,149.32 crore.
Retrieval Confidence: 0.0193
Final Confidence: 0.3000
Response Tim

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Result ---
Method Used: RAG
Answer: Change in Net Profit after Tax from FY 2024 to FY 2025 was ₹500 crore.
Retrieval Confidence: 0.0162
Final Confidence: 0.0162
Response Time: 8.91 seconds

--- Context Snippet --- 
Profit After Tax for the year ended March 31, 2025 stood at ` 500 crore as compared to ` 438 crore for the year ended March 31, 2024.
Distribution Network
For the year ended March 31, 2024
Integrated 417
CC in crore
Year ended March 31, 2024 Year ended March 31, 2023
Cash flows from operating activities:
Consolidated profit before income tax and after minority interest 75,184.14 61,346.80 Adjustments for:
Depreciation on fixed assets 3,092.08 2,345.47 (Profit) / loss on revaluation of investment...


Enter your question: What was the change in Profit After Tax from FY 2024 to FY 2025?
Choose mode [RAG / RAFT] (default=RAFT): RAFT

--- Result ---
Method Used: RAFT
Answer: Change in Net Profit after Tax from FY 2024 to FY 2025 was ₹16,534.93 crore.
Retrieval Confidence: 0

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Result ---
Method Used: RAG
Answer: The cost-to-income ratio of the bank decreased from 40.4% in FY23 to 40.2% in FY24.
Retrieval Confidence: 0.0193
Final Confidence: 0.0193
Response Time: 6.25 seconds

--- Context Snippet --- 
Given below is a table of run-off factors and the average LCR maintained by the Bank quarter-wise over the past two years:
For more details, please refer to the Macroeconomic and Industry section on page no. 214.
In this changing environment, your Bank continued to prioritise growth while strengthening its focus on governance, sustainability and inclusive development.
Financial Parameters
was not enabled for part of the year for certain masters in two accounting software and two databases and th...


Enter your question: How did the cost-to-income ratio of the bank change over the last two years?
Choose mode [RAG / RAFT] (default=RAFT): RAFT

--- Result ---
Method Used: RAFT
Answer: The cost-to-income ratio of the bank decreased from 40.4% in FY23 to 40.2% 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Result ---
Method Used: RAG
Answer: ₹14,813.98 crore
Retrieval Confidence: 0.0287
Final Confidence: 0.0287
Response Time: 7.27 seconds

--- Context Snippet --- 
T he Board of Directors at its meeting held on April 20, 2024 proposed a dividend of ` 19.50 per equity share (previous year: ` 19.00 per equity share) aggregating to ` 14,813.98 crore subject to the approval of shareholders at the ensuing Annual General Meeting. During the year ended March 31, 2024, the dividend paid by the Bank in respect of the previous year ended March 31, 2023 was ` 8,404.42 crore. No dividend was paid in respect of equity shares that were cancelled upon the Scheme becoming...


Enter your question: What was the dividend paid for the financial year 2024?
Choose mode [RAG / RAFT] (default=RAFT): RAFT

--- Result ---
Method Used: RAFT
Answer: ₹14,813.98 crore
Retrieval Confidence: 0.0287
Final Confidence: 0.7000
Response Time: 13.67 seconds

--- Context Snippet --- 
T he Board of Directors at its meetin

# **Streamlit App Hosting the RAG & RAFT System**

Here are five robust questions used to test Financial Chatbot in Streamlit.

1.  What was the revenue from operations for the year ended March 31, 2024?
2.  What was the change in Profit After Tax from FY 2024 to FY 2025?
3.  How did the cost-to-income ratio of the bank change over the last two years?
4.  What was the dividend paid for the financial year 2024?
5.  What was the dividend PER share for the financial year 2025?

In [None]:
!pip install streamlit -q
!pip install pyngrok -q

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.9/9.9 MB[0m [31m79.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.9/6.9 MB[0m [31m100.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
%%writefile utils.py
import streamlit as st
import re
import numpy as np
import pickle
import faiss
from typing import List
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
import torch
import json

# This is the most important function. It loads all heavy components and caches them.
@st.cache_resource
def load_resources():
    """Loads all necessary models, tokenizers, and data indexes."""
    embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # Load FAISS and BM25 indexes (ensure these files are uploaded to Colab)
    with open("bm25_index_small.pkl", "rb") as f:
        bm25_index_small = pickle.load(f)
    faiss_index_small = faiss.read_index("faiss_index_small.bin")

    # Load RAG Model from Hugging Face Hub
    model_name_rag = "mistralai/Mistral-7B-Instruct-v0.1"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
    )
    tokenizer_rag = AutoTokenizer.from_pretrained(model_name_rag)
    tokenizer_rag.pad_token = tokenizer_rag.eos_token

    # FIX: Removed the 'device_map="auto"' argument
    model_rag = AutoModelForCausalLM.from_pretrained(
        model_name_rag,
        quantization_config=bnb_config
    )

    rag_pipeline = pipeline("text-generation", model=model_rag, tokenizer=tokenizer_rag, max_new_tokens=256, temperature=0.1)

    # For simplicity, we'll use the base RAG model for the RAFT path in this deployment.
    model_ft = model_rag
    tokenizer_ft = tokenizer_rag
    tokenizer_ft.pad_token = tokenizer_ft.eos_token # Fix padding token error


    # Load chunk data (ensure rag_chunks_data.json is uploaded)
    with open("rag_chunks_data.json", "r") as f:
        rag_data = json.load(f)
    rag_chunks_small = [Document(page_content=d['page_content'], metadata=d['metadata']) for d in rag_data['small_chunks']]

    index_sets = {"small": {"faiss": faiss_index_small, "bm25": bm25_index_small, "chunks": rag_chunks_small}}

    # Return all the loaded resources
    return embedding_model, index_sets, rag_pipeline, model_ft, tokenizer_ft


# ==============================================================================
# HELPER FUNCTIONS BELOW THIS LINE
# ==============================================================================

##########################
# Step 1: Extend patterns #
##########################

RAFT_PATTERNS = {
    # --- Income Statement ---
    "revenue": [
        r"total income.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "profit after tax": [
        r"profit after tax.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore",
        r"net profit.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "earnings per share": [
        r"earnings per.*?share.*?(`|₹)\s*([\d\.]+)"
    ],

    # --- Balance Sheet ---
    "total assets": [
        r"total assets.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "total liabilities": [
        r"total liabilities.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "deposits": [
        r"total deposits.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "loans": [
        r"advances.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],
    "total borrowings": [
        r"borrowings.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],

    # --- Dividends ---
    "dividend per share": [
        r"dividend of ` ([\d\.]+) per equity share",
        r"dividend of ₹ ([\d\.]+) per equity share",
        r"dividend of ([\d\.]+) per equity share",
        r"dividend.*?([\d\.]+) per share"
    ],
    "total dividend": [
        r"dividend paid by the Bank.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore",
        r"aggregating to.*?(`|₹)\s*([\d,]+\.\d{2})\s*crore"
    ],

    # --- Ratios & Metrics (usually percentages) ---
    "capital adequacy ratio": [
        r"capital adequacy ratio.*?([\d\.]+)%"
    ],
    "gross npa": [
        r"gross npa.*?([\d\.]+)%"
    ],
    "net npa": [
        r"net npa.*?([\d\.]+)%"
    ],
    "return on assets": [
        r"return on assets.*?([\d\.]+)%"
    ],
    "return on equity": [
        r"return on equity.*?([\d\.]+)%"
    ]
}

# Example alias mapping - define as per your use case
metric_aliases = {
    # --- Income Statement Items ---
    "revenue": ["revenue", "total income", "top line", "turnover"],
    "profit after tax": ["profit after tax", "pat", "net profit", "net earnings", "profit for the period"],
    "earnings per share": ["earnings per share", "eps"],

    # --- Balance Sheet Items ---
    "total assets": ["total assets"],
    "total liabilities": ["total liabilities"],
    "deposits": ["deposits", "total deposits"],
    "loans": ["advances", "loans"],
    "total borrowings": ["borrowings", "total borrowings"],

    # --- Dividend-Related ---
    "dividend per share": ["dividend per share"],
    "total dividend": ["dividend paid", "dividend aggregating", "dividend"],

    # --- Key Banking Ratios & Metrics ---
    "capital adequacy ratio": ["capital adequacy ratio", "car"],
    "gross npa": ["gross npa", "gross non-performing assets"],
    "net npa": ["net npa", "net non-performing assets"],
    "return on assets": ["return on assets", "roa"],
    "return on equity": ["return on equity", "roe"]
}

####################################
# Utility: Select top relevant paragraphs
####################################

def select_top_paragraphs(query: str, paragraphs: List[str], top_n: int = 3) -> List[str]:
    query_terms = set(query.lower().split())
    scored = []
    for para in paragraphs:
        para_terms = set(para.lower().split())
        score = len(query_terms & para_terms)
        scored.append((score, para))
    scored.sort(key=lambda x: (-x[0], -len(x[1])))  # Prioritize match count then length
    selected = [para for _, para in scored[:top_n]]
    return selected

#############################
# Convert to number function #
#############################

def convert_to_number(value_str, unit=None):
    """
    Convert a string + unit (lakh, crore, million, etc.) into a float in base units.
    Example: ("1,099.42", "crore") -> 10994200000.0
    """
    num = value_str.replace(",", "")
    try:
        base_val = float(num)
    except ValueError:
        return None

    if unit:
        unit = unit.lower()
        if unit in ["lakh", "lakhs"]:
            base_val *= 1e5
        elif unit in ["crore", "cr", "crores"]:
            base_val *= 1e7
        elif unit == "million":
            base_val *= 1e6
        elif unit == "billion":
            base_val *= 1e9
        elif unit == "trillion":
            base_val *= 1e12
    return base_val


############################################
# Pre-process Query
############################################

def preprocess_query(query: str) -> str:
    """
    Helper function to clean and normalize a query.
    """
    return query.lower().strip()

############################################
# Post-process descriptive answers (truncation or summarization)
############################################

def postprocess_descriptive_answer(answer, max_sentences=3):
    # Simple truncation at sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', answer.strip())
    if len(sentences) > max_sentences:
        truncated = ' '.join(sentences[:max_sentences]) + " ..."
        return truncated
    return answer


#############################
# clean generated answer
#############################

def clean_generated_answer(text: str) -> str:
    # Find the text that comes after the last instance of the instruction marker
    if "[/INST]" in text:
        text = text.split("[/INST]")[-1]

    # Remove any remaining common artifacts and whitespace
    text = text.replace("</s>", "").strip()
    return text

#############################
# Ground Truth Checking
#############################

def is_answer_correct(predicted: str, ground_truth: str, guard_msg: str = None) -> bool:
    """
    Returns True if the predicted answer matches ground_truth.
    Uses a more robust regex to handle numbers and avoid errors.
    """
    # FIX: A more robust regex that ensures a number starts with a digit.
    number_pattern = r"\d[\d,]*\.?\d*"

    # Extract numbers from both the predicted answer and the ground truth
    pred_nums = re.findall(number_pattern, predicted)
    truth_nums = re.findall(number_pattern, ground_truth)

    # Clean the numbers by removing commas before converting to float
    pred_nums_set = {float(n.replace(",", "")) for n in pred_nums}
    truth_nums_set = {float(n.replace(",", "")) for n in truth_nums}

    if not truth_nums_set: # If ground truth has no numbers, fall back to simple string comparison
        return predicted.strip().lower() == ground_truth.strip().lower()

    # Check if any of the key numbers from the ground truth are present in the prediction
    return bool(pred_nums_set & truth_nums_set)


# ==============================================================================
# Main FUNCTIONS BELOW THIS LINE
# ==============================================================================


#############################
# hybrid Retrieval
#############################
def hybrid_retrieval(query, index_sets, embedding_model, strategy_name="small", top_k=5):
    """
    Performs hybrid retrieval using the specified chunking strategy ('small' or 'large').
    It combines dense (FAISS) and sparse (BM25) search results using Reciprocal Rank Fusion (RRF).
    """
    # 1. Select the correct set of indexes and chunks based on the chosen strategy
    if strategy_name not in index_sets:
        print(f" Warning: Strategy '{strategy_name}' not found. Defaulting to 'small'.")
        strategy_name = "small"

    strategy = index_sets[strategy_name]
    faiss_index = strategy.get("faiss")
    bm25_index = strategy.get("bm25")
    chunks = strategy.get("chunks")

    if faiss_index is None or bm25_index is None:
        print("Indexes are not available. Skipping retrieval.")
        return [], [], []

    # 2. Preprocess query
    cleaned_query = preprocess_query(query)

    # 3. Dense retrieval (FAISS)
    query_emb = embedding_model.encode([cleaned_query])[0]
    query_emb_norm = query_emb / np.linalg.norm(query_emb)
    distances, dense_indices = faiss_index.search(np.array([query_emb_norm]).astype('float32'), top_k)
    dense_results = [{'doc_index': i, 'score': s} for i, s in zip(dense_indices[0], distances[0]) if i != -1]

    # 4. Sparse retrieval (BM25)
    tokenized_query = cleaned_query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    sparse_top_indices = np.argsort(bm25_scores)[::-1][:top_k]
    sparse_results = [{'doc_index': i, 'score': bm25_scores[i]} for i in sparse_top_indices]

    # 5. Reciprocal Rank Fusion (RRF)
    fused_scores = {}
    k_rrf = 60  # Tunable parameter

    for rank, res in enumerate(dense_results):
        idx = res['doc_index']
        fused_scores[idx] = fused_scores.get(idx, 0) + 1 / (k_rrf + rank + 1)

    for rank, res in enumerate(sparse_results):
        idx = res['doc_index']
        fused_scores[idx] = fused_scores.get(idx, 0) + 1 / (k_rrf + rank + 1)

    # 6. Sort documents by fused score
    sorted_indices = sorted(fused_scores.keys(), key=lambda x: fused_scores[x], reverse=True)
    top_indices = sorted_indices[:top_k]

    # 7. Retrieve documents, add fused_score in metadata
    final_docs = []
    seen_indices = set()
    for idx in top_indices:
        if idx not in seen_indices:
            doc = chunks[idx]
            doc.metadata['fused_score'] = fused_scores[idx]
            final_docs.append(doc)
            seen_indices.add(idx)

    retrieved_texts = [doc.page_content for doc in final_docs]

    return final_docs, retrieved_texts, None

#############################
# Normalize Years
#############################
def normalize_years_in_query(query: str) -> list[str]:
    """
    Finds years in various formats (e.g., 2024, FY25, 23-24) in a query
    and returns them as a sorted list of unique four-digit strings.
    """
    found_years = set()

    # Pattern 1: Find four-digit years like 2023, 2024
    for match in re.findall(r"\b(20\d{2})\b", query):
        found_years.add(match)

    # Pattern 2: Find formats like FY24, FY 25, Financial Year 24
    for match in re.findall(r"\b(?:FY|Financial Year)\s*(\d{2,4})\b", query, re.IGNORECASE):
        year = f"20{match}" if len(match) == 2 else match
        found_years.add(year)

    # Pattern 3: Find year ranges like 23-24 or 2023-24
    for match in re.findall(r"\b(\d{2,4})[-–](\d{2,4})\b", query):
        start, end = match
        start_full = f"20{start}" if len(start) == 2 else start
        end_full = f"20{end}" if len(end) == 2 else end
        found_years.add(start_full)
        found_years.add(end_full)

    return sorted(list(found_years))

#############################
# Extract comparative values
#############################
def extract_comparative_values(context, metric_aliases):
    """
    Finds sentences with a "value A compared to value B" structure.
    This robust version finds all numbers and years on a line and pairs them.
    Returns a tuple: (later_year_value, earlier_year_value)
    """
    # Helper to convert string like "` 500 crore" to a numeric value
    def to_float(s):
        try:
            num_part = re.search(r'[\d,\.]+', s).group()
            num = float(num_part.replace(",", ""))
            if "crore" in s.lower():
                num *= 10_000_000
            elif "lakh" in s.lower():
                num *= 100_000
            return num
        except (AttributeError, ValueError):
            return None

    for line in context.split('\n'):
        # 1. Check for prerequisite keywords to ensure the line is relevant
        if not any(alias in line.lower() for alias in metric_aliases):
            continue
        if not any(comp in line.lower() for comp in ["as compared to", "from"]):
            continue

        # 2. Find all number-like and year-like strings on the line
        num_pattern = r"[`₹]?\s*[\d,]+\.\d{2}(?:\s*crore|\s*lakh)?"
        year_pattern = r"\b20\d{2}\b"

        numbers_found = re.findall(num_pattern, line)
        years_found = re.findall(year_pattern, line)

        # 3. If we find exactly two of each, we have a very strong candidate
        if len(numbers_found) == 2 and len(years_found) == 2:
            try:
                val1 = to_float(numbers_found[0])
                val2 = to_float(numbers_found[1])
                year1 = int(years_found[0])
                year2 = int(years_found[1])

                if val1 is None or val2 is None:
                    continue

                # 4. Associate and return in the correct order (later year, earlier year)
                if year1 > year2:
                    return (val1, val2)
                else:
                    return (val2, val1)
            except (ValueError, TypeError, IndexError):
                # If anything goes wrong with this line, just move to the next
                continue

    # If no suitable line was found in the entire context, return None
    return None

#############################
# RAG Generative
#############################
def generate_rag_response(query, retrieved_texts, rag_pipeline, tokenizer_rag):
    context = "\n".join(retrieved_texts)
    messages = [
        {"role": "system", "content": "You are a helpful financial assistant. Answer ONLY based on the provided context. Be concise. If the answer is not in the context, say 'Information not available.'"},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{query}\n\nDirect Answer:"}
    ]
    prompt = tokenizer_rag.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = rag_pipeline(prompt)
    return response[0]['generated_text'].strip()

#############################
# RAFT Generative
#############################
def generate_raft_response(query, retrieved_texts, model_ft, tokenizer_ft, extract_comparative_values_func, metric_aliases, years_found):
    """
    Generates a response using the RAFT pipeline.
    This corrected version uses the arguments passed into it, avoiding redundant work.
    """
    # 1. Use the retrieved_texts that are passed in directly.
    context_str = "\n".join(retrieved_texts)

    # 2. Perform direct extraction using the provided context and helper functions.
    query_lower = query.lower()
    metric_key = next((k for k, aliases in metric_aliases.items() if any(alias in query_lower for alias in aliases)), None)
    is_change_query = any(w in query_lower for w in ["change", "difference", "compare", "between", "year-on-year"])

    # Attempt direct extraction for comparative questions.
    if metric_key and len(years_found) == 2 and is_change_query:
        # Use the function that was passed in as an argument.
        values = extract_comparative_values_func(context_str, metric_aliases[metric_key])
        if values:
            later_val, earlier_val = values
            change = later_val - earlier_val
            # Return the calculated answer and high confidence.
            return f"₹{change:,.2f}", context_str, None, 0.98

    # 3. If direct extraction fails, fall back to the LLM using the provided model and tokenizer.
    messages = [
        {"role": "system", "content": "You are a financial analyst. Answer using ONLY the provided context. If a calculation is needed, perform it and provide only the final numerical result."},
        {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion:\n{query}\n\nDirect Numerical Answer:"}
    ]
    prompt = tokenizer_ft.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer_ft(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048).to(model_ft.device)

    try:
        with torch.no_grad():
            outputs = model_ft.generate(**inputs, max_new_tokens=64, do_sample=False, pad_token_id=tokenizer_ft.eos_token_id)

        generated_text = tokenizer_ft.decode(outputs[0], skip_special_tokens=True).strip()
        # Assumes a 'clean_generated_answer' function is available.
        final_answer = clean_generated_answer(generated_text)
        return final_answer, context_str, prompt, 0.7
    except Exception as e:
        print(f"Error during RAFT generation: {e}")
        return "Information not available", context_str, None, 0.0

#############################
# Post Processing the generated text
#############################
def postprocess_generated_text(raw_text, prompt_marker, question=None, years_found=None):
    if not raw_text:
        return ""

    # --- Clean model output ---
    # Only split if a valid prompt_marker is provided and exists in the text
    if prompt_marker and prompt_marker in raw_text:
        parts = raw_text.split(prompt_marker)
        raw_text = parts[-1]

    tokens_to_remove = ["[INST]", "[/INST]", "INST", "</s>", "<s>"]
    for token in tokens_to_remove:
        raw_text = raw_text.replace(token, "")

    raw_text = " ".join(raw_text.split()).strip()

    raw_text = re.sub(r'[J`‘´’]\s*([\d,]+\.?\d*)', r'₹ \1', raw_text)

    # If the text is just a number, it's likely from direct extraction, so return as is.
    if re.match(r'^[\d,.]+$', raw_text):
        return raw_text

    # (The rest of the function remains the same)
    words = raw_text.split()
    while words and len(words[-1]) < 2:
        words.pop()
    raw_text = " ".join(words)

    if len(raw_text) < 3:
        return "Information not available"

    # --------- STEP 2: Financial formatting ---------
    formatted = raw_text
    if question:
        q = question.lower()
        num_match = re.search(r"([\d,]+(?:\.\d+)?)\s*(crore|lakh|million|billion|trillion)", formatted, re.IGNORECASE)
        if num_match:
            value, unit = num_match.groups()
            formatted = f"₹{value} {unit}"

        # Dynamic year handling
        start_year, end_year = None, None
        if years_found and isinstance(years_found, list):
            years_found = sorted(list(set(years_found))) # Ensure unique and sorted
            if len(years_found) == 1:
                start_year = years_found[0]
            elif len(years_found) >= 2:
                start_year, end_year = years_found[0], years_found[1]

        # Auto-label based on question type
        if "revenue" in q:
            if end_year:
                formatted = f"Change in Total Revenue from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Total Revenue for FY {start_year} was {formatted}."
        elif "profit" in q:
            if end_year:
                formatted = f"Change in Net Profit after Tax from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Net Profit after Tax for FY {start_year} was {formatted}."
        elif "asset" in q:
            if end_year:
                formatted = f"Change in Total Assets from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Total Assets for FY {start_year} were {formatted}."
        elif "liabilit" in q:
            if end_year:
                formatted = f"Change in Total Liabilities from FY {start_year} to FY {end_year} was {formatted}."
            elif start_year:
                formatted = f"Total Liabilities for FY {start_year} were {formatted}."

    return formatted

Overwriting utils.py


In [None]:
%%writefile app.py
import streamlit as st
import time
import numpy as np
from utils import (
    load_resources,
    hybrid_retrieval,
    normalize_years_in_query,
    extract_comparative_values,
    generate_rag_response,
    generate_raft_response,
    postprocess_generated_text,
    metric_aliases
)

# --- App Configuration ---
st.set_page_config(
    page_title="Group 99 | Financial QA",
    layout="wide",
    initial_sidebar_state="expanded"
)

# --- Load Models and Data ---
with st.spinner("Loading financial models and data... This may take a moment on first startup."):
    embedding_model, index_sets, rag_pipeline, model_ft, tokenizer_ft = load_resources()

# --- UI Components ---
st.title("🤖 Group 99: Financial QA Chatbot")
st.sidebar.header("Query Options")

# FIXED: Removed the unsupported 'caption' argument
mode_selection = st.sidebar.radio(
    "Choose a Model:",
    ("RAFT (Fine-Tuned)", "RAG (Base Model)")
)
mode = "RAFT" if "RAFT" in mode_selection else "RAG"

st.sidebar.markdown("---")
with st.sidebar.expander("Project Group Members"):
    st.markdown("""
    | Name                      | ID            |
    | :------------------------ | :------------ |
    | APARNARAM KASIREDDY       | 2023AC05145   |
    | K NIRANJAN BABU           | 2023AC05464   |
    | LAKSHMI MRUDULA MADDI     | 2023AC05138   |
    | MURALIKRISHNA RAPARTHI    | 2023AC05208   |
    | RAJAMOHAN NAIDU           | 2023AC05064   |
    """)

query = st.text_input("Ask a question about HDFC Bank's financial reports:", "")

if st.button("Get Answer"):
    if not query.strip():
        st.warning("Please enter a question.")
    else:
        with st.spinner("Finding relevant documents and generating an answer..."):
            start_time = time.time()

            retrieved_docs, retrieved_texts, _ = hybrid_retrieval(query, index_sets, embedding_model, top_k=5)
            context_used = "\n".join(retrieved_texts)

            if not retrieved_texts:
                st.error("Could not find relevant context to answer this question.")
            else:
                years_found = normalize_years_in_query(query)
                raw_answer = None
                model_confidence = 0.0

                if mode == "RAFT":
                    # FIXED: Pass correct parameters to generate_raft_response
                    raw_answer, _, _, model_confidence = generate_raft_response(
                        query,
                        retrieved_texts,
                        model_ft,          # FIXED: Use model_ft instead of rag_pipeline
                        tokenizer_ft,      # FIXED: Use tokenizer_ft consistently
                        extract_comparative_values,
                        metric_aliases,
                        years_found
                    )
                else: # RAG (Base Model)
                    # FIXED: Use correct tokenizer variable name and improved function
                    raw_answer = generate_rag_response(query, retrieved_texts, rag_pipeline, tokenizer_ft)

                    # Calculate confidence from retrieval scores
                    scores = [doc.metadata.get('fused_score', 0) for doc in retrieved_docs]
                    model_confidence = np.mean(scores) if scores else 0.0

                final_answer = postprocess_generated_text(raw_answer, "[/INST]", query, years_found=years_found)
                end_time = time.time()

                st.success("Answer")
                st.markdown(f"### {final_answer}")

                col1, col2 = st.columns(2)
                col1.metric(label="Response Time", value=f"{end_time - start_time:.2f}s")
                col2.metric(label="Context Relevance Score", value=f"{model_confidence:.2%}")

                with st.expander("Show Retrieved Context"):
                    st.text(context_used)

Overwriting app.py


In [None]:
from pyngrok import ngrok

#kill if any ngrok running
ngrok.kill()

from google.colab import userdata
userdata.get('NGROK_AUTH_TOKEN')

# --- Securely Access Your Ngrok Authtoken ---
# This gets the token from Colab's Secrets manager
NGROK_AUTH_TOKEN = userdata.get('NGROK_AUTH_TOKEN')
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# --- Run Streamlit in Background & Create Public URL ---
!streamlit run app.py &>/dev/null&
public_url = ngrok.connect(8501)
print(f"✅ Your app is live! Click the link to open: {public_url}")

✅ Your app is live! Click the link to open: NgrokTunnel: "https://3b43f36f3d2e.ngrok-free.app" -> "http://localhost:8501"


# **Summary of Key Takeaways from Building and Testing the RAFT System**

Our RAFT (Retrieval-Augmented Fine-Tuning) system, employing a hybrid approach, offers a powerful solution for tasks requiring both high precision on specific data points and intelligent reasoning for broader, general summaries. The codebase exemplifies this advanced hybrid methodology to financial question answering.

### Key takeaways:

- **Dual-Strategy Chunking (Hierarchical Chunking):**  
  We implemented a dual-chunking strategy instead of relying on a single uniform chunk size, allowing us to:  
  - Create **small chunks (512 tokens):** Optimized for retrieval, enabling BM25 and dense vector search to quickly and accurately identify specific, factual information.  
  - Create **large chunks (2048 tokens):** Optimized for generation, providing rich, full context once relevant facts are located, so that the fine-tuned LLM can synthesize comprehensive, coherent answers.

- **Hybrid Retrieval:**  
  Our approach combines sparse and dense retrieval methods for robust results:  
  - **BM25 (Sparse):** Effective for exact keyword matching, essential for identifying specific financial terminology such as "revenue" or "profit."  
  - **FAISS (Dense):** Excels at semantic similarity search, enabling discovery of related information even when exact keywords differ.  
  The use of **Reciprocal Rank Fusion (RRF)** to merge these results produces a highly relevant, prioritized set of documents.

- **Hybrid Direct Extraction (RAFT with Rule-Based Logic):**  
  For factual, numeric queries (e.g., “What was the dividend per share?”), the system employs a hybrid extraction pipeline:  
  - **Direct Extraction:** Uses regex and rule-based logic to pull numbers directly from retrieved context, providing fast, precise, and hallucination-free answers.  
  - **LLM Fallback:** If direct extraction fails, the fine-tuned LLM generates an answer, making the system robust to different phrasings or formatting variations.

- **Context Window Optimization:**  
  By limiting retrieved chunks to the most relevant top results from the hybrid retriever, the system efficiently fills the LLM’s context window with high-value information. This ensures better answer quality while respecting token limits for generation.  
  ***This particular approach was implemented based on feedback from Bhagath sir to achieve optimal model performance and result quality.***
