# Financial Document Analysis with Retrieval-Augmented Generation (RAG) and Finetuning

This Jupyter Notebook demonstrates a complete workflow for building a Q&A system from financial reports using a Retrieval-Augmented Generation (RAG) architecture. We will process Kyndryl's annual reports, build a sophisticated retrieval system, and re-rank the results for higher accuracy.

### Part 1: Data Collection & Preprocessing

**1.1: Obtain and Ingest Financial Statements**

First, we'll install the necessary libraries for our project. Then, we will download the annual reports for the last two fiscal years from the provided URLs and store them locally. This step ensures we have the raw data ready for processing.

In [1]:
# !pip install \
#   accelerate>=1.10.0 \
#   datasets>=4.0.0 \
#   docling>=2.47.0 \
#   faiss-cpu>=1.12.0 \
#   huggingface-hub>=0.34.4 \
#   ipykernel>=6.30.1 \
#   pandas>=2.3.2 \
#   rank-bm25>=0.2.2 \
#   sentence-transformers>=5.1.0 \
#   sentencepiece>=0.2.1 \
#   tableformatter>=0.1.6 \
#   tabula>=1.0.5 \
#   torch>=2.8.0 \
#   transformers>=4.55.4


In [2]:
import logging
import time
from pathlib import Path
import pandas as pd
from docling.document_converter import DocumentConverter
_log = logging.getLogger(__name__)

# Download financial reports for KYNDRYL HOLDINGS, INC.

In [3]:
import requests
import os

# Create a directory to store the PDFs
if not os.path.exists("annual_reports"):
    os.makedirs("annual_reports")

pdf_urls = [
    "https://investors.kyndryl.com/static-files/0205a5a1-2f59-4ab7-b892-58615604423a",  # 2024 Annual Report
    "https://investors.kyndryl.com/static-files/1488970a-672b-4caa-ad23-00c77e2b2434",  # 2023 Annual Report
]
pdf_paths = []
for url in pdf_urls:
    file_name = os.path.join("annual_reports", url.split("/")[-1]) + ".pdf"
    if not os.path.exists(file_name):
        print(f"Downloading {file_name}...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(file_name, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
    else:
        print(f"{file_name} already exists.")
    pdf_paths.append(file_name)

Downloading annual_reports/0205a5a1-2f59-4ab7-b892-58615604423a.pdf...
Downloading annual_reports/1488970a-672b-4caa-ad23-00c77e2b2434.pdf...


**1.2: Convert Documents to Plain Text and Clean**

We will use the `Docling` library to parse the downloaded PDF files and extract raw text. We will then apply a basic cleaning function to remove common artifacts like headers, footers, and extra whitespace, which are irrelevant for our Q&A task.

The 2024 financial report contains the data for both the year 2024 and 2023.

In [4]:
def parse_pdf(pdf_path, start, end):
    print(f"Parsing {pdf_path} from page {start} to {end}")
    res = []
    logging.basicConfig(level=logging.INFO)

    input_doc_path = pdf_path
    output_dir = Path("scratch")

    doc_converter = DocumentConverter()

    start_time = time.time()

    conv_res = doc_converter.convert(input_doc_path, page_range=(start, end))

    output_dir.mkdir(parents=True, exist_ok=True)

    doc_filename = conv_res.input.file.stem

    # Export tables
    for table_ix, table in enumerate(conv_res.document.tables):
        table_df: pd.DataFrame = table.export_to_dataframe()
        print(f"## Table {table_ix}")
        res.append(table_df)
        print(table_df.to_markdown())

        # Save the table as csv
        element_csv_filename = output_dir / f"{doc_filename}-table-{table_ix + 1}.csv"
        _log.info(f"Saving CSV table to {element_csv_filename}")
        table_df.to_csv(element_csv_filename)

    end_time = time.time() - start_time

    _log.info(f"Document converted and tables exported in {end_time:.2f} seconds.")
    return res

# let's parse pdf from the pages 57 to 59. The pages that contain financial tables.
# The 2024 financial report contains the data for both the year 2024 and 2023.
raw_documents = parse_pdf(pdf_paths[0], 57, 59)


Parsing annual_reports/0205a5a1-2f59-4ab7-b892-58615604423a.pdf from page 57 to 59


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Table 0
|    |                                                                      | Notes   | Year Ended.2024   | March 31,.2023   | Three Months Ended March 31,.2022   | Year Ended December 31,.2021   |
|---:|:---------------------------------------------------------------------|:--------|:------------------|:-----------------|:------------------------------------|:-------------------------------|
|  0 | Revenues * . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | 3       | $ 16,052          | $ 17,026         | $ 4,431                             | $ 18,657                       |
|  1 | Cost of services ** . . . . . . . . . . . . . . . . . . . . . . .    | 3       | $ 13,189          | $ 14,498         | $ 3,824                             | $ 16,550                       |
|  2 | Selling, general and administrative expenses . .                     |         | 2,773             | 2,914            | 690                                 | 2,776                          |

### Clean text and Segment reports into logical sections

The above parsed document has financial data for 2024, 2023, 2022 and 2021. Let's clean up and extract for 2024 and 2023.


#### Process INCOME statement
1. KYNDRYL HOLDINGS, INC. CONSOLIDATED INCOME STATEMENT
2. KYNDRYL HOLDINGS, INC. CONSOLIDATED STATEMENT OF COMPREHENSIVE INCOME (LOSS)

In [5]:
from itertools import chain

qa_documents = {}

questions_2023 = []
questions_2024 = []

# iterate over income statement tables and create simple Q&A pairs
for row in chain(raw_documents[0].itertuples(), raw_documents[1].itertuples()):
    if not row[3] and not row[4]:
        continue
    question_suffix = "?"
    question_prefix = f'{row[1].strip(".* :").strip()} of'
    answer_suffix = "." if "per share" in question_prefix.lower() else " million."

    question = question_prefix + " 2024" + question_suffix
    answer = row[3]
    answer = "Not applicable" if answer == "-" else (answer + answer_suffix)
    questions_2024.append((question, answer))

    question = question_prefix + " 2023" + question_suffix
    answer = row[4]
    answer = "Not applicable" if answer == "-" else (answer + answer_suffix)
    questions_2023.append((question, answer))

print("Sample Questions and Answers")
print(questions_2024[:2])
print(questions_2023[1:3])

qa_documents['Income Statement 2024'] = questions_2024
qa_documents['Income Statement 2023'] = questions_2023

Sample Questions and Answers
[('Revenues of 2024?', '$ 16,052 million.'), ('Cost of services of 2024?', '$ 13,189 million.')]
[('Cost of services of 2023?', '$ 14,498 million.'), ('Selling, general and administrative expenses of 2023?', '2,914 million.')]


#### Process KYNDRYL HOLDINGS, INC. CONSOLIDATED BALANCE SHEET
(In millions, except per share amounts)

In [6]:
questions_2023 = []
questions_2024 = []
common_prefix = ""
skip = False

# iterate over the balance sheet table and create simple Q&A pairs
for row in raw_documents[2].itertuples():
    question_suffix = "?"
    question_prefix = f'{row[1].strip(".* :").strip()}'
    if question_prefix.endswith("Assets") and not row[3]:
        common_prefix = " assets"
    elif question_prefix.endswith("Liabilities") and not row[3]:
        common_prefix = " liabilities"
    elif question_prefix.endswith("Equity") and not row[3]:
        break
    if not row[3] and not row[4]:
        continue

    if not question_prefix:
        skip = True
        continue
    elif skip:
        skip = False
        continue

    answer_suffix = "." if "per share" in question_prefix.lower() else " million."

    question = (
        (
            question_prefix
            if question_prefix.endswith(common_prefix)
            else f"{question_prefix}{common_prefix}"
        )
        + " of 2024"
        + question_suffix
    )
    answer = row[3] + answer_suffix
    questions_2024.append((question, answer))
    question = (
        (
            question_prefix
            if question_prefix.endswith(common_prefix)
            else f"{question_prefix}{common_prefix}"
        )
        + " of 2023"
        + question_suffix
    )
    answer = row[4] + answer_suffix
    questions_2023.append((question, answer))

print("Sample Questions and Answers")
print(questions_2024[-2:])
print(questions_2023[-1:])

print(f"Total 2024 Q&A pairs: {len(questions_2024)}")
print(f"Total 2023 Q&A pairs: {len(questions_2023)}")

qa_documents['Balance Sheet 2024'] = questions_2024
qa_documents['Balance Sheet 2023'] = questions_2023

Sample Questions and Answers
[('Other noncurrent liabilities of 2024?', '332 million.'), ('Total liabilities of 2024?', '$ 9,468 million.')]
[('Total liabilities of 2023?', '$ 10,002 million.')]
Total 2024 Q&A pairs: 26
Total 2023 Q&A pairs: 26


#### Q/A Pairs

In [7]:
for doc, questions in qa_documents.items():
    print(f"\nDocument: {doc}")
    for q, a in questions:
        print(f"Q: {q}\nA: {a}\n")


Document: Income Statement 2024
Q: Revenues of 2024?
A: $ 16,052 million.

Q: Cost of services of 2024?
A: $ 13,189 million.

Q: Selling, general and administrative expenses of 2024?
A: 2,773 million.

Q: Workforce rebalancing charges of 2024?
A: 138 million.

Q: Transaction-related costs (benefits) of 2024?
A: (46) million.

Q: Impairment expense of 2024?
A: Not applicable

Q: Interest expense of 2024?
A: 122 million.

Q: Other expense of 2024?
A: 45 million.

Q: Total costs and expenses of 2024?
A: $ 16,221 million.

Q: Income (loss) before income taxes of 2024?
A: $ (168) million.

Q: Provision for income taxes of 2024?
A: $ 172 million.

Q: Net income (loss) of 2024?
A: $ (340) million.

Q: Basic earnings (loss) per share of 2024?
A: $ (1.48).

Q: Diluted earnings (loss) per share of 2024?
A: (1.48).

Q: Weighted-average basic shares outstanding of 2024?
A: 229.2 million.

Q: Weighted-average diluted shares outstanding of 2024?
A: 229.2 million.

Q: Net income (loss) of 2024?
A: $

### Part 2: Retrieval-Augmented Generation (RAG) System Implementation

**2.1 Data Processing: Chunking**

To prepare the text for retrieval, we'll split it into smaller, manageable chunks. This allows the model to find more specific and relevant passages. We will create two sets of chunks with different sizes (100 and 400 tokens) to analyze the impact of chunk size on retrieval performance. Each chunk will be assigned a unique ID and metadata indicating its source document and chunk size.

The metadata will have info whether the chunk is courced from balance sheet or income statement segments.

In [8]:
from transformers import AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def chunk_text(qa_list, chunk_size, source_key):
    chunks = []
    for idx, (q, a) in enumerate(qa_list):
        text_block = f"Q: {q}\nA: {a}"
        tokens = tokenizer.encode(text_block)
        for i in range(0, len(tokens), chunk_size):
            chunk_tokens = tokens[i:i + chunk_size]
            chunk_str = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
            chunks.append({
                "id": f"{source_key}_{idx}_size_{chunk_size}_chunk_{len(chunks)}",
                "text": chunk_str,
                "metadata": {
                    "segment": source_key,
                    "qa_index": idx,
                    "chunk_size": chunk_size
                }
            })
    return chunks

chunk_sizes = [100, 400]
all_chunks = []

for size in chunk_sizes:
    for key in qa_documents:
        all_chunks.extend(chunk_text(qa_documents[key], size, key))

print(f"Total number of chunks created: {len(all_chunks)}")
print("Sample chunk:", all_chunks[0])

Total number of chunks created: 236
Sample chunk: {'id': 'Income Statement 2024_0_size_100_chunk_0', 'text': 'q : revenues of 2024? a : $ 16, 052 million.', 'metadata': {'segment': 'Income Statement 2024', 'qa_index': 0, 'chunk_size': 100}}


**2.2 Embedding & Indexing**

Next, we will convert the text chunks into numerical vectors (embeddings) using the `all-MiniLM-L6-v2` model. These embeddings capture the semantic meaning of the text.

We will build two types of indexes:
1.  **Dense Vector Store (FAISS):** For fast semantic similarity search.
2.  **Sparse Index (BM25):** For efficient keyword-based retrieval.

In [9]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from rank_bm25 import BM25Okapi

# 1. Embed Chunks
embedding_model = SentenceTransformer(model_name)
chunk_texts = [chunk['text'] for chunk in all_chunks]
embeddings = embedding_model.encode(chunk_texts, convert_to_numpy=True, show_progress_bar=True)

# 2. Build Dense Vector Store (FAISS)
embedding_dim = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(embedding_dim)
faiss_index.add(np.array(embeddings, dtype=np.float32))
print(f"FAISS index built with {faiss_index.ntotal} vectors.")

# 3. Build Sparse Index (BM25)
tokenized_corpus = [doc.lower().split(" ") for doc in chunk_texts]
bm25 = BM25Okapi(tokenized_corpus)
print("BM25 index built.")

Batches:   0%|          | 0/8 [00:00<?, ?it/s]

FAISS index built with 236 vectors.
BM25 index built.


**2.3 Hybrid Retrieval Pipeline**

Our retrieval pipeline will combine the strengths of both dense and sparse methods. For a given query, we will:
1. Preprocess the query.
2. Retrieve the top N chunks from FAISS based on vector similarity.
3. Retrieve the top N chunks from BM25 based on keyword matching.
4. Combine the results using a simple union to create a comprehensive list of candidate chunks.

In [10]:
def preprocess_query(query):
    # Normalize
    query = query.lower()

    # All the queries are about Kyndryl. So, let's remove that, it might bias the retriever
    # Remove redundant keywords
    stopwords = ["kyndryl", "IBM", "company", "inc", "inc.", "corporation", "corp", "corp.", "ltd", "ltd.", "plc", "the"]
    for w in stopwords:
        query = query.replace(w.lower(), "")
    return query.strip()

def hybrid_retrieve(query, top_n=5):
    # 1. Preprocess query
    clean_query = preprocess_query(query)

    # 2. Dense Retrieval (FAISS)
    query_embedding = embedding_model.encode([clean_query])
    _, dense_indices = faiss_index.search(np.array(query_embedding, dtype=np.float32), top_n)
    dense_results = [all_chunks[i] for i in dense_indices[0]]

    # 3. Sparse Retrieval (BM25)
    tokenized_query = clean_query.split(" ")
    bm25_scores = bm25.get_scores(tokenized_query)
    sparse_indices = np.argsort(bm25_scores)[::-1][:top_n]
    sparse_results = [all_chunks[i] for i in sparse_indices]

    # 4. Combine results
    combined_results_dict = {chunk['id']: chunk for chunk in dense_results + sparse_results}

    print(f"Retrieved {len(dense_results)} chunks from dense search.")
    print(f"Retrieved {len(sparse_results)} chunks from sparse search.")
    print(f"Combined to {len(combined_results_dict)} unique chunks.")

    return list(combined_results_dict.values())

In [11]:
# Example Usage
test_query = "What was the company's revenue in 2024?"
retrieved_chunks = hybrid_retrieve(test_query)

print(f"\n--- Top Retrieved Chunk for '{test_query}' ---")
print(retrieved_chunks[0]['text'])

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.

--- Top Retrieved Chunk for 'What was the company's revenue in 2024?' ---
q : revenues of 2024? a : $ 16, 052 million.


**2.4 Advanced RAG Technique: Cross-Encoder Re-ranking**

The initial retrieval might return chunks that are only partially relevant. To refine our results, we'll use a Cross-Encoder model. Unlike the embedding model which computes vectors independently, a Cross-Encoder takes both the query and a candidate chunk as input to produce a more accurate relevance score. We will use this to re-rank the top chunks retrieved from our hybrid pipeline.

In [12]:
from sentence_transformers.cross_encoder import CrossEncoder

# Load a cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_with_cross_encoder(query, chunks):
    # Create pairs of [query, chunk_text] for scoring
    query_chunk_pairs = [[query, chunk['text']] for chunk in chunks]

    # Get scores from the cross-encoder
    scores = cross_encoder.predict(query_chunk_pairs, show_progress_bar=False)

    # Add scores to chunks and sort
    for i in range(len(chunks)):
        chunks[i]['relevance_score'] = scores[i]

    reranked_chunks = sorted(chunks, key=lambda x: x['relevance_score'], reverse=True)
    return reranked_chunks


def advanced_retrieve(query):
    retrieved_chunks = hybrid_retrieve(query)
    reranked_results = rerank_with_cross_encoder(query, retrieved_chunks)
    return reranked_results

In [13]:
# Example Usage with the same test query
test_query = "What was the company's revenue in 2024?"
reranked_results = advanced_retrieve(test_query)
print(f"\n--- Top Re-ranked Chunk for '{test_query}' ---")
print(f"Relevance Score: {reranked_results[0]['relevance_score']:.4f}")
print(reranked_results[0]['text'])

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.

--- Top Re-ranked Chunk for 'What was the company's revenue in 2024?' ---
Relevance Score: 8.9147
q : revenues of 2024? a : $ 16, 052 million.


**2.5 Response Generation**

After retrieving and re-ranking the most relevant document chunks, the final step is to generate a coherent, human-readable answer. We will use a generative language model to synthesize the information from the retrieved passages into a direct response to the user's query.

For this demonstration, we'll use `Flan T5`, a smaller and more efficient version, which is suitable for tasks where resource constraints are a consideration. The retrieved text and the original query are combined into a carefully crafted prompt to guide the model in generating a factual answer grounded in the provided context.

In [14]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# --- Configuration ---
model_name = "google/flan-t5-base"
# Explicitly set the device to ensure all tensors are on the same hardware
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# --- Model and Tokenizer Loading ---
try:
    gen_tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Ensure the model is moved to the correct device upon loading
    gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure you have a stable internet connection.")
    exit()

# Note: The 'pipeline' object is no longer used in the generation function.

def generate_answer(query, reranked_chunks, max_new_tokens=100):
    """
    Generates an answer using the model directly and calculates its confidence score.
    """
    # 1. Prepare the context
    context = "\n\n".join([chunk.get("text", "") for chunk in reranked_chunks[:3]])

    # 2. Create the prompt
    prompt = f"""
Based on the following context, please answer the question.

Context:
{context}

Question:
{query}

Answer:
"""

    # 3. Manually tokenize the input and move it to the model's device
    inputs = gen_tokenizer(prompt, return_tensors="pt").to(device)

    # 4. Use model.generate() directly. This returns the raw output object.
    generated_output = gen_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        num_beams=4,  # Using beam search
        return_dict_in_generate=True,
        output_scores=True,
    )

    # --- 5. Calculate Confidence Score ---
    # We now access the attributes directly from the raw output object.
    # These attributes are guaranteed to be the correct tensor types.
    transition_scores = gen_model.compute_transition_scores(
        sequences=generated_output.sequences,
        scores=generated_output.scores,
        beam_indices=generated_output.beam_indices,
        normalize_logits=True
    )

    # To get the final probability, exponentiate the sum of the log-probabilities
    confidence = torch.exp(transition_scores.sum()).item()

    # 6. Manually decode the output token IDs to get the answer text
    # We take the first (and only) generated sequence.
    output_sequence = generated_output.sequences[0]
    answer = gen_tokenizer.decode(output_sequence, skip_special_tokens=True)

    # 7. Return both the answer and its confidence score
    return answer.strip(), confidence


Using device: cuda


In [15]:
def query_finance_system_rag(query, debug = False):
    reranked_results = advanced_retrieve(query)

    if debug:
        for r in reranked_results:
            print('-----')
            print(r['text'])

    # Example Usage
    start_time = time.time()
    final_answer, confidence = generate_answer(query, reranked_results)
    end_time = time.time()
    latency = end_time - start_time

    return final_answer, latency, confidence

test_query = "What is fiscal year 2024 revenue?"
print(f"Question: {test_query}")
final_answer, _, _ = query_finance_system_rag(test_query)
print(f"\nGenerated Answer:\n{final_answer}")

Question: What is fiscal year 2024 revenue?
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.

Generated Answer:
$ 16, 052 million


**2.6 Guardrail Implementation**

To ensure the reliability and safety of our RAG system, we need to implement guardrails. These are checks and balances that prevent the system from processing inappropriate queries or generating harmful, irrelevant, or factually incorrect answers. We will implement two basic guardrails:

1.  **Input Guardrail:** A simple filter to block off-topic or nonsensical questions.
2.  **Output Guardrail:** A check to ensure the generated answer is grounded in the retrieved context and not a hallucination.

In [None]:
# --- Input Guardrail ---

def validate_query(query):
    """
    Validates the input query to filter out irrelevant or harmful inputs.
    """
    query = query.lower()

    # Check for minimum length
    if len(query.split()) < 3:
        return False, "Query is too short. Please ask a more specific question."

    # Basic check for off-topic keywords
    off_topic_keywords = ['recipe', 'poem', 'joke', 'movie', 'celebrity', 'sports', 'weather', 'travel', 'music', 'game', 'politics', 'history', 'science fiction']
    if any(keyword in query for keyword in off_topic_keywords):
        return False, "Query is off-topic. This system is for financial document analysis."

    return True, "Query is valid."

# --- Output Guardrail ---

def validate_output(answer, retrieved_chunks):
    """
    Validates the generated output to flag potential hallucinations.
    This is a simple check based on keyword overlap.
    """
    context = " ".join([chunk['text'].lower() for chunk in retrieved_chunks])
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.split())

    if not answer_tokens:
        return False, "Generated answer is empty."

    # Calculate the percentage of answer tokens that are present in the context
    overlap = answer_tokens.intersection(context_tokens)
    overlap_ratio = len(overlap) / len(answer_tokens)

    # If overlap is less than a certain threshold (e.g., 30%), flag it as potentially ungrounded.
    if overlap_ratio < 0.3:
        return False, f"Potential hallucination detected. Answer may not be grounded in context (Overlap: {overlap_ratio:.2%})."

    return True, f"Answer appears grounded in context (Overlap: {overlap_ratio:.2%})."


# --- Example Usage of Guardrails ---

# 1. Input Guardrail Examples
print("--- Input Guardrail Tests ---")
valid_query = "What were the total assets of Kyndryl as of March 31, 2024?"
invalid_query_short = "revenue?"
invalid_query_topic = "Can you give me a recipe for a cake?"

is_valid, reason = validate_query(valid_query)
print(f"Query: '{valid_query}' -> Valid: {is_valid}, Reason: {reason}")

is_valid, reason = validate_query(invalid_query_short)
print(f"Query: '{invalid_query_short}' -> Valid: {is_valid}, Reason: {reason}")

is_valid, reason = validate_query(invalid_query_topic)
print(f"Query: '{invalid_query_topic}' -> Valid: {is_valid}, Reason: {reason}")

print("\n" + "="*50 + "\n")

# 2. Output Guardrail Example
print("--- Output Guardrail Tests ---")
# Use the previously generated answer
is_grounded, reason = validate_output(final_answer, reranked_results)
print(f"Answer: '{final_answer[:100]}...' -> Grounded: {is_grounded}, Reason: {reason}")

# Example of a potentially hallucinated answer
hallucinated_answer = "Kyndryl announced a partnership with SpaceX to build datacenters on Mars."
is_grounded, reason = validate_output(hallucinated_answer, reranked_results)
print(f"Answer: '{hallucinated_answer}' -> Grounded: {is_grounded}, Reason: {reason}")

--- Input Guardrail Tests ---
Query: 'What were the total assets of Kyndryl as of March 31, 2024?' -> Valid: True, Reason: Query is valid.
Query: 'revenue?' -> Valid: False, Reason: Query is too short. Please ask a more specific question.
Query: 'Can you give me a recipe for a cake?' -> Valid: False, Reason: Query is off-topic. This system is for financial document analysis.


--- Output Guardrail Tests ---
Answer: '$ 16, 052 million...' -> Grounded: True, Reason: Answer appears grounded in context (Overlap: 75.00%).
Answer: 'Kyndryl announced a partnership with SpaceX to build datacenters on Mars.' -> Grounded: False, Reason: Potential hallucination detected. Answer may not be grounded in context (Overlap: 9.09%).


2.7 Interface Development

UI is built at the end together with finetuning

### 3. Fine-Tuned Model System Implementation


#### 3.1 Q/A Dataset Preparation

In [17]:
from datasets import Dataset

# Convert to a list of dictionaries for the dataset
fine_tuning_data = []
for doc_id, questions in qa_documents.items():
    fine_tuning_data.extend(
        [
            {"text": f"question: {question} answer: {answer}"}
            for question, answer in questions
        ]
    )

df = pd.DataFrame(fine_tuning_data)
dataset = Dataset.from_pandas(df)


print(dataset[-1])

{'text': 'question: Total liabilities of 2023? answer: $ 10,002 million.'}


#### 3.2 Model Selection

we will use the google/flan-t5-base model as RAG setup. This is a versatile and powerful model that is well-suited for a variety of NLP tasks, including question answering and demands less hardware.


In [18]:
model_name = 'google/flan-t5-base'

### 3.3 Baseline Benchmarking (Pre-Fine-Tuning)



In [33]:
import torch


def semantic_similarity(a, b):
    emb_a = embedding_model.encode([a])
    emb_b = embedding_model.encode([b])
    return float(
        torch.cosine_similarity(torch.tensor(emb_a), torch.tensor(emb_b)).item()
    )


test_questions_10 = [
    {
        "question": "What were the total revenues in fiscal 2024?",
        "ground_truth": "$ 16,052 million.",
    },
    {
        "question": "What were the total score in fiscal 2023?",
        "ground_truth": "Not applicable",
    },
    {
        "question": "Cost of services of 2024?",
        "ground_truth": "$ 13,189 million.",
    },
    {
        "question": "What was the cost of services in fiscal 2023?",
        "ground_truth": "$ 14,498 million.",
    },
    {
        "question": "What was the net income (loss) reported in fiscal 2024?",
        "ground_truth": "$ (340) million.",
    },
    {
        "question": "What was the net income (loss) reported in fiscal 2023?",
        "ground_truth": "$ (1,374) million.",
    },
    {
        "question": "Cookie recipe?",
        "ground_truth": "Not Applicable",
    },
    {
        "question": "How much were the total liabilities in fiscal 2023?",
        "ground_truth": "$ 10,002 million.",
    },
    {
        "question": "What were the cash and cash equivalents at the end of fiscal 2024?",
        "ground_truth": "$ 1,553 million.",
    },
    {
        "question": "What were the cash and cash equivalents at the end of fiscal 2023?",
        "ground_truth": "$ 1,847 million.",
    },
]

def evaluate(test_questions, rag=True, fine_tuning=False):
    evaluations = []
    for test_question in test_questions:
        query = test_question["question"]
        truth = test_question["ground_truth"]

        if rag:
            answer, latency, confidence = query_finance_system_rag(query)
            evaluations.append({
                "Method": "RAG",
                "Query": query,
                "Response": answer,
                "Ground Truth": truth,
                "Accuracy": round(
                    1 if answer.strip() == truth.strip()
                    else semantic_similarity(answer, truth), 3
                ),
                "Latency (s)": round(latency, 3),
                "Confidence": confidence
            })

        if fine_tuning:
            answer, latency, confidence = query_finance_system_finetune(query)
            evaluations.append({
                "Method": "Fine-Tune",
                "Query": query,
                "Response": answer,
                "Ground Truth": truth,
                "Accuracy": round(
                    1 if answer.strip() == truth.strip()
                    else semantic_similarity(answer, truth), 3
                ),
                "Latency (s)": round(latency, 3),
                "Confidence": confidence
            })

    df = pd.DataFrame(evaluations)
    print(df.to_markdown())
    return df


# evaluate pre-finetuned system
evaluate(test_questions_10, True, False)
print('RAG with response generation evaluated.')

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 5 unique chunks.
Retrieved 5 chunks from dense search.
Retr

### 3.4 Fine-Tuning
Now we will fine-tune the selected model on our prepared Q/A dataset. We will use the transformers library from Hugging Face for this task. We'll also log the hyperparameters used in the training process.



In [20]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Preprocess the dataset
def preprocess_function(examples):
    inputs = [ex for ex in examples["text"]]
    # The model expects 'labels' for the target text
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")

    # The T5 model needs the decoder input_ids to be created from the labels
    # The Trainer does this automatically if the 'labels' field is present.
    # We just need to make sure our tokenized outputs have a 'labels' key.
    # For T5, the input and output are the same text sequence for this task.
    model_inputs["labels"] = model_inputs["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Define the training arguments
# NOTE: The 'evaluation_strategy' and 'device' arguments have been removed.
# 'do_eval=True' enables evaluation, which defaults to the end of each epoch.
# The Trainer will automatically use the GPU if it's available.
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    do_eval=True,  # Enable evaluation
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

# Log the hyperparameters
print("Hyperparameters:")
print(f"  Learning Rate: {training_args.learning_rate}")
print(f"  Batch Size: {training_args.per_device_train_batch_size}")
print(f"  Number of Epochs: {training_args.num_train_epochs}")

# Determine compute setup and log it
compute_device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"  Compute Setup: {compute_device}")


# Create the data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start fine-tuning
trainer.train()

Map:   0%|          | 0/118 [00:00<?, ? examples/s]

Hyperparameters:
  Learning Rate: 2e-05
  Batch Size: 8
  Number of Epochs: 3
  Compute Setup: cuda


  trainer = Seq2SeqTrainer(
  | |_| | '_ \/ _` / _` |  _/ -_)


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjawaharbtech[0m ([33mjawahar-s[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss


TrainOutput(global_step=45, training_loss=25.108327907986112, metrics={'train_runtime': 239.4901, 'train_samples_per_second': 1.478, 'train_steps_per_second': 0.188, 'total_flos': 60601025691648.0, 'train_loss': 25.108327907986112, 'epoch': 3.0})

In [21]:

print("\n--- Testing the Fine-Tuned Model from Section 3.3 ---")

# Save the final model
trainer.save_model("./results_s3_3/final_model")

# Load the fine-tuned model from the checkpoint
fine_tuned_model_3_3 = AutoModelForSeq2SeqLM.from_pretrained("./results_s3_3/final_model")
fine_tuned_model_3_3.to(compute_device)

def ask_simple_finetuned_model(question):
    # We must match the prompt format that the model was trained on
    prompt = f"question: {question} answer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(compute_device)

    with torch.no_grad():
        outputs = fine_tuned_model_3_3.generate(**inputs, max_new_tokens=50)

    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# --- Prediction and Analysis ---
test_question = "Total assets of 2023?"
predicted_answer = ask_simple_finetuned_model(test_question)

print(f"\nQ: {test_question}")
print(f"A: {predicted_answer}")


--- Testing the Fine-Tuned Model from Section 3.3 ---

Q: Total assets of 2023?
A: £2 billion


#### 3.5 Advanced Fine-Tuning Technique (Mixture-of-Experts Fine-Tuning)

We will create two specialized experts:

Expert A: The "Balance Sheet" Expert: This expert will be a specialist on questions related to a company's assets and liabilities. These items reflect a company's financial position at a single point in time (e.g., "as of March 31, 2023").

Expert B: The "Income Statement" Expert: This expert will specialize in questions about revenues, costs, and expenses. These items reflect a company's performance over a period of time (e.g., "for the year 2024").


##### Step 1: Prepare and Split the Dataset


In [22]:
import pandas as pd
from datasets import Dataset

# Assuming qa_documents is defined as before (with list of tuples)

# --- Data Preparation ---
balance_sheet_q = []
balance_sheet_a = []
income_statement_q = []
income_statement_a = []

# Iterate and populate separate lists for questions and answers
for key, qa_list_of_tuples in qa_documents.items():
    for question,answer  in qa_list_of_tuples:
        if 'Balance Sheet' in key:
            balance_sheet_q.append(question)
            balance_sheet_a.append(answer)
        elif 'Income Statement' in key:
            income_statement_q.append(question)
            income_statement_a.append(answer)

# Create DataFrames with separate columns
df_balance_sheet = pd.DataFrame({'question': balance_sheet_q, 'answer': balance_sheet_a})
df_income_statement = pd.DataFrame({'question': income_statement_q, 'answer': income_statement_a})

# Create Hugging Face Datasets
balance_sheet_dataset = Dataset.from_pandas(df_balance_sheet)
income_statement_dataset = Dataset.from_pandas(df_income_statement)

print("--- Balance Sheet Dataset (Expert A) ---")
print(balance_sheet_dataset[0])


# --- Corrected Preprocessing Function ---
# This is the most critical change.
def preprocess_seq2seq(examples):
    # Tokenize the questions (inputs)
    model_inputs = tokenizer(examples['question'], max_length=128, truncation=True, padding="max_length")

    # Tokenize the answers (labels)
    labels = tokenizer(text_target=examples['answer'], max_length=50, truncation=True, padding="max_length")

    # The 'labels' field is what the model will learn to predict
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Apply the new preprocessing function
# Assumes 'tokenizer' is already loaded
tokenized_balance_sheet_dataset = balance_sheet_dataset.map(preprocess_seq2seq, batched=True)
tokenized_income_statement_dataset = income_statement_dataset.map(preprocess_seq2seq, batched=True)

--- Balance Sheet Dataset (Expert A) ---
{'question': 'Cash and cash equivalents assets of 2024?', 'answer': '$ 1,553 million.'}


Map:   0%|          | 0/52 [00:00<?, ? examples/s]

Map:   0%|          | 0/66 [00:00<?, ? examples/s]

##### Step 2: Train the Two Expert LoRA Adapters


In [23]:
import os
import shutil

dir_balance_sheet = './lora_expert_balance_sheet'
dir_income_statement = './lora_expert_income_statement'

if os.path.exists(dir_balance_sheet):
    shutil.rmtree(dir_balance_sheet)
    print(f"Deleted old directory: {dir_balance_sheet}")

if os.path.exists(dir_income_statement):
    shutil.rmtree(dir_income_statement)
    print(f"Deleted old directory: {dir_income_statement}")

In [24]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)
from peft import LoraConfig, get_peft_model, TaskType

# --- Reusable Setup ---
model_name = "google/flan-t5-base"
# Assumes 'tokenizer' is loaded from data prep step

# We give the model more capacity to learn the facts.
lora_config = LoraConfig(
    r=32,  # <-- INCREASED RANK from 16 to 32
    lora_alpha=64,  # <-- INCREASED ALPHA from 32 to 64 (common practice is 2*r)
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)

# --- Train Expert A (Balance Sheet) ---
print("\n--- Starting INTENSIVE Training for Expert A (Balance Sheet) ---")
balance_sheet_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
balance_sheet_peft_model = get_peft_model(balance_sheet_model, lora_config)
balance_sheet_peft_model.print_trainable_parameters()

balance_sheet_training_args = Seq2SeqTrainingArguments(
    output_dir="./lora_expert_balance_sheet",
    num_train_epochs=200,
    learning_rate=1e-4,  # <-- Adjusted Learning Rate
    per_device_train_batch_size=4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",  # <-- Added a learning rate scheduler for stability
    report_to="none",
)

balance_sheet_trainer = Seq2SeqTrainer(
    model=balance_sheet_peft_model,
    args=balance_sheet_training_args,
    train_dataset=tokenized_balance_sheet_dataset,  # Assumes this is correctly loaded
    tokenizer=tokenizer,
    data_collator=data_collator,
)
balance_sheet_trainer.train()
balance_sheet_trainer.model.save_pretrained("./lora_expert_balance_sheet")
print("Expert A (Balance Sheet) adapter saved.")


# --- Train Expert B (Income Statement) ---
print("\n--- Starting INTENSIVE Training for Expert B (Income Statement) ---")
income_statement_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
income_statement_peft_model = get_peft_model(income_statement_model, lora_config)

# Use the same aggressive arguments for the second expert
income_statement_training_args = Seq2SeqTrainingArguments(
    output_dir="./lora_expert_income_statement",
    num_train_epochs=200,
    learning_rate=1e-4,  # <-- Adjusted Learning Rate
    per_device_train_batch_size=4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",  # <-- Added a learning rate scheduler
    report_to="none",
)

income_statement_trainer = Seq2SeqTrainer(
    model=income_statement_peft_model,
    args=income_statement_training_args,
    train_dataset=tokenized_income_statement_dataset,  # Assumes this is correctly loaded
    tokenizer=tokenizer,
    data_collator=data_collator,
)
income_statement_trainer.train()
income_statement_trainer.model.save_pretrained("./lora_expert_income_statement")
print("Expert B (Income Statement) adapter saved.")


--- Starting INTENSIVE Training for Expert A (Balance Sheet) ---
trainable params: 3,538,944 || all params: 251,116,800 || trainable%: 1.4093


  balance_sheet_trainer = Seq2SeqTrainer(


Step,Training Loss
500,4.8597
1000,0.247
1500,0.1939
2000,0.1728
2500,0.1645


Expert A (Balance Sheet) adapter saved.

--- Starting INTENSIVE Training for Expert B (Income Statement) ---


  income_statement_trainer = Seq2SeqTrainer(


Step,Training Loss
500,5.3745
1000,0.2473
1500,0.1751


Step,Training Loss
500,5.3745
1000,0.2473
1500,0.1751
2000,0.1429
2500,0.1267
3000,0.116


Expert B (Income Statement) adapter saved.


##### Step 3: Implement the Router and Perform Inference

In our implementation, we have created a system-level Mixture-of-Experts (MoE). Instead of a single monolithic model, we use two smaller, specialized "expert" models, each fine-tuned on a distinct subset of financial data—one for the Balance Sheet and one for the Income Statement.
Our route_to_financial_expert function serves as the gating network or router. This critical component analyzes the user's question and intelligently directs it to the appropriate expert, ensuring that the query is handled by the model with the most relevant training for that specific domain.


In [25]:
import time
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel
import torch.nn.functional as F

# --- 1. The Financial Router ---
def route_to_financial_expert(question):
    q_lower = question.lower()
    if 'assets' in q_lower or 'liabilities' in q_lower:
        return 'balance_sheet'
    return 'income_statement'


# --- 2. Load Base + Experts ---
model_name = "google/flan-t5-base"
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

expert_balance_sheet = PeftModel.from_pretrained(base_model, "./lora_expert_balance_sheet")
expert_income_statement = PeftModel.from_pretrained(base_model, "./lora_expert_income_statement")

device = "cuda" if torch.cuda.is_available() else "cpu"
expert_balance_sheet.to(device)
expert_income_statement.to(device)


# === 3. INFERENCE FUNCTION ===
def query_finance_system_finetune(question):
    expert_type = route_to_financial_expert(question)
    print(f"Routing to... Expert '{expert_type.upper()}'")

    model_to_use = expert_balance_sheet if expert_type == 'balance_sheet' else expert_income_statement

    # Tokenize input
    inputs = tokenizer(question, return_tensors="pt").to(device)

    # Measure latency
    start_time = time.time()
    with torch.no_grad():
        outputs = model_to_use.generate(**inputs, max_new_tokens=50, output_scores=True, return_dict_in_generate=True)
    latency = time.time() - start_time

    # Decode text
    answer = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)

    # --- Confidence estimation ---
    # Take first token probability as a simple proxy
    if hasattr(outputs, "scores") and len(outputs.scores) > 0:
        probs = F.softmax(outputs.scores[0], dim=-1)
        first_token_id = outputs.sequences[0][1].item()  # token after <s>
        confidence = probs[0, first_token_id].item()
    else:
        confidence = 0.0

    return answer, latency, confidence


# --- 4. Test the System ---
print("\n--- Testing the Correctly Trained Expert System ---")

question1 = "Total liabilities of 2023?"
answer1 = query_finance_system_finetune(question1)
print(f"Q: {question1}\nA: {answer1}\n")

question2 = "Cost of services of 2024?"
answer2 = query_finance_system_finetune(question2)
print(f"Q: {question2}\nA: {answer2}\n")





--- Testing the Correctly Trained Expert System ---
Routing to... Expert 'BALANCE_SHEET'
Q: Total liabilities of 2023?
A: ('$ 17,026 million.', 0.31064701080322266, 0.9783398509025574)

Routing to... Expert 'INCOME_STATEMENT'
Q: Cost of services of 2024?
A: ('$ 17,026 million.', 0.27557992935180664, 0.9995242357254028)



#### 3.6 Guardrail Implementation

Input side - we are validating the input query to filter out irrelevant or harmful inputs.

In [26]:

question = "Assets of 2024?"
answer2 = query_finance_system_finetune(question)
print(f"Q: {question}\nA: {answer2}\n")

validate_query(question)

Routing to... Expert 'BALANCE_SHEET'
Q: Assets of 2024?
A: ('$ 16,221 million.', 0.2823517322540283, 0.987553596496582)



(True, 'Query is valid.')

3.7 Interface Development

In [34]:
# implemented at the bottom of the notebook.

### 4. Testing, Evaluation & Comparison
#### 4.1 Test Questions

For both systems, ask three official questions:

In [30]:
test_questions = {
    "Relevant, high-confidence": "What was the total revenue in 2024?",
    "Relevant, low-confidence": "What was the year-over-year (reported) revenue change for Kyndryl in fiscal year 2024 versus fiscal year 2023?",
    "Irrelevant": "What is the recipes of cookie?",
}

for category, test_question in test_questions.items():
    print(f"{category}")
    print(f"   Question: {test_question}")
    print("RAG System:")
    result = query_finance_system_rag(test_question)
    print(f"   Result: {result}")
    print("Fine-Tuned System:")
    result = query_finance_system_finetune(test_question)
    print(f"   Result: {result}")

Relevant, high-confidence
   Question: What was the total revenue in 2024?
RAG System:
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
   Result: ('$ 16, 052 million', 0.8741888999938965, 0.5615305304527283)
Fine-Tuned System:
Routing to... Expert 'INCOME_STATEMENT'
   Result: ('$ 17,026 million.', 1.2007906436920166, 0.9929532408714294)
Relevant, low-confidence
   Question: What was the year-over-year (reported) revenue change for Kyndryl in fiscal year 2024 versus fiscal year 2023?
RAG System:
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
   Result: ('$ 16, 052 million', 0.9512648582458496, 0.4123569428920746)
Fine-Tuned System:
Routing to... Expert 'INCOME_STATEMENT'
   Result: ('229.2 million.', 0.7766199111938477, 0.17903085052967072)
Irrelevant
   Question: What is the recipes of cookie?
RAG System:
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from s

#### 4.2 & 4.3 Extended Evaluation with results

Evaluate both systems on at least 10 different financial questions.

In [36]:
evaluate(test_questions_10, True, True)
print('Evaluated both RAG and Fine Tuning.')

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Routing to... Expert 'INCOME_STATEMENT'
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Routing to... Expert 'INCOME_STATEMENT'
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Routing to... Expert 'INCOME_STATEMENT'
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 6 unique chunks.
Routing to... Expert 'INCOME_STATEMENT'
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Routing to... Expert 'INCOME_STATEMENT'
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Routing to... Expert 'INCOME_STATEMENT'
Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.
Routing to... 

RAG consistently gave more factually accurate answers aligned with the financial ground truth, showing higher accuracy across most queries but with lower confidence scores.
Fine-tuning on small number of data produced fluent outputs with very high confidence, but it often overfit to repeated patterns (e.g., $17,026 million), leading to weaker factual reliability compared to RAG.

### 4.4 Analysis

The amount of data is too less for fine-tuning the model. This financial data helps RAG system excel with the grounding facts.

#### RAG (Retrieval-Augmented Generation)

Pros

Adaptability – Easy to update answers when new 10-Ks or balance sheets come out.

Factual grounding – Pulls numbers directly from source documents, reducing hallucinations.

Cons

Retrieval errors – With noisy queries, it may fetch irrelevant sections, hurting accuracy.

Latency & complexity – Needs a search index + retriever model, which adds infra overhead.

#### Fine-Tuning (with 50 Q&A)

Pros

Fluency – Model learns consistent, natural financial reporting style.

Efficiency – At inference, no retrieval step, so answers are faster once trained.

Cons (especially with just 50 Q&A)

Overfitting – The model memorizes those 50 examples, failing to generalize to new financial phrasing.

Catastrophic forgetting – May lose some general language ability since so little data is steering it.

#### Robustness to Irrelevant Queries

RAG: Safer — it can say “no relevant context found,” though still depends on retriever quality.

Fine-Tuning (50 Q&A): Riskier — with so little data, the model will often hallucinate plausible but wrong answers.

#### Practical Trade-offs

RAG:
✅ Cheaper to maintain knowledge freshness.
❌ Needs extra infra (vector DB, retriever).

Fine-Tuning (50 Q&A):
✅ Simpler runtime (just the model).
❌ Very weak generalization with so little data; retraining required when new reports come.



## UI Implementation using Gradio

In [None]:
import os
import time
import pandas as pd
import gradio as gr
# ---- Router to your existing backends ----


def _run_backend(question: str, mode: str) -> str:
    q = (question or "").strip()
    if not q:
        return "Please enter a question."

    # Optional guardrail if defined earlier
    try:
        is_valid, reason = validate_query(q)
        if not is_valid:
            return f"[Guardrail] {reason}"
    except NameError:
        pass
    if mode == "RAG":
        return query_finance_system_rag(q)
    else:
        return query_finance_system_finetune(q)


# ---- Main handler ----
def answer_and_log(question: str, mode: str, history: list):
    """
    Returns: answer_text, updated_history, chatbot_pairs, history_df
    """
    answer_text = _run_backend(question, mode)
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    entry = {
        "time": ts,
        "method": mode,
        "question": (question or "").strip(),
        "answer": answer_text,
    }
    history = (history or []) + [entry]
    chatbot_pairs = [(h["question"], f"[{h['method']}] {h['answer']}") for h in history]
    hist_df = pd.DataFrame(history, columns=["time", "method", "question", "answer"])
    return f"[{mode}] {answer_text}", history, chatbot_pairs, hist_df


def clear_history():
    return "", [], [], pd.DataFrame(columns=["time", "method", "question", "answer"])


# ---- UI (Blocks) ----
with gr.Blocks(title="Financial QA — RAG vs Fine-Tuned") as demo:
    gr.Markdown(
        "## Financial QA — RAG vs Fine-Tuned\nAsk a question about the Kyndryl financial reports and see history below."
    )
    state_history = gr.State([])
    with gr.Row():

        with gr.Column(scale=5):

            question_in = gr.Textbox(
                label="Question",
                placeholder="e.g., What were the total revenues in fiscal 2024?",
                lines=2,
            )

            mode_in = gr.Radio(
                choices=["RAG", "Fine-Tune"],
                value="RAG",
                label="Method",
            )

            with gr.Row():
                ask_btn = gr.Button("Answer", variant="primary")
                clear_btn = gr.Button("Clear History")

        with gr.Column(scale=5):
            answer_out = gr.Markdown(label="Answer")
            chat_out = gr.Chatbot(
                label="Q/A History (Chat view)", bubble_full_width=False
            )
            table_out = gr.Dataframe(label="History (Table view)", interactive=False)

    # Button wiring
    ask_btn.click(
        fn=answer_and_log,
        inputs=[question_in, mode_in, state_history],
        outputs=[answer_out, state_history, chat_out, table_out],
    )

    clear_btn.click(
        fn=clear_history,
        inputs=None,
        outputs=[answer_out, state_history, chat_out, table_out],
    )

if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0",
        server_port=int(os.getenv("PORT", 7861)),
        # share=True,  # enable locally if you want a public gradio.live link
    )