# Financial Document Analysis with Retrieval-Augmented Generation (RAG)

This Jupyter Notebook demonstrates a complete workflow for building a Q&A system from financial reports using a Retrieval-Augmented Generation (RAG) architecture. We will process Kyndryl's annual reports, build a sophisticated retrieval system, and re-rank the results for higher accuracy.

### Part 1: Data Collection & Preprocessing

**1.1: Obtain and Ingest Financial Statements**

First, we'll install the necessary libraries for our project. Then, we will download the annual reports for the last two fiscal years from the provided URLs and store them locally. This step ensures we have the raw data ready for processing.

In [1]:
# !pip install PyMuPDF sentence-transformers faiss-cpu rank_bm25 transformers pandas -q

In [2]:
import requests
import os

# Create a directory to store the PDFs
if not os.path.exists("annual_reports"):
    os.makedirs("annual_reports")

# pdf_urls = [
#     "https://investors.kyndryl.com/static-files/7209882b-033b-42ee-8c99-140c8deb9390", # 2024 Annual Report
#     "https://investors.kyndryl.com/static-files/0205a5a1-2f59-4ab7-b892-58615604423a"  # 2023 Annual Report
# ]
pdf_urls = [
    "https://investors.kyndryl.com/static-files/8cd74aec-d775-4c72-8153-7e614f42cbee",  # 2024 Annual Report
    "https://investors.kyndryl.com/static-files/f63144d6-bdc5-458b-9334-08c7c43a2f4f"  # 2023 Annual Report
]
pdf_paths=[]
for url in pdf_urls:
    file_name = os.path.join("annual_reports", url.split("/")[-1]) +'.pdf'
    if not os.path.exists(file_name):
        print(f"Downloading {file_name}...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(file_name, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
    else:
        print(f"{file_name} already exists.")
    pdf_paths.append(file_name)


annual_reports/8cd74aec-d775-4c72-8153-7e614f42cbee.pdf already exists.
annual_reports/f63144d6-bdc5-458b-9334-08c7c43a2f4f.pdf already exists.


**1.2: Convert Documents to Plain Text and Clean**

We will use the `PyMuPDF` library to parse the downloaded PDF files and extract raw text. We will then apply a basic cleaning function to remove common artifacts like headers, footers, and extra whitespace, which are irrelevant for our Q&A task.

In [3]:
import fitz  # PyMuPDF
import re

def extract_and_clean_text(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        text = page.get_text()
        # Simple cleaning: remove page numbers that are standalone lines
        text = re.sub(r'^\d+\s*$', '', text, flags=re.MULTILINE)
        # Remove excessive newlines
        text = re.sub(r'\n{3,}', '\n\n', text)
        full_text += text
    doc.close()
    return full_text

documents = [extract_and_clean_text(path) for path in pdf_paths]
print(f"Extracted text from {len(documents)} documents.")
print("Sample of extracted text (first 500 characters):\n", documents[0][:500])

Extracted text from 2 documents.
Sample of extracted text (first 500 characters):
  
 
 
 
KYNDRYL REPORTS FOURTH QUARTER AND FULL-YEAR 2024 RESULTS 
• 
Revenues for the quarter ended March 31, 2024 total $3.8 billion, pretax loss is $4 
million, net loss is $45 million, adjusted EBITDA is $566 million, and adjusted pretax 
income is $30 million 
• 
Fiscal year 2024 revenues total $16.1 billion, pretax loss is $168 million, net loss is 
$340 million, adjusted EBITDA is $2.4 billion, and adjusted pretax income is $165 
million 
• 
Company expects to return to positive constant-


In [4]:
with open("annual_reports/combined_reports.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc + "\n\n--- End of Document ---\n\n")
print("Combined text saved to annual_reports/combined_reports.txt")

Combined text saved to annual_reports/combined_reports.txt


**1.3 Construct Q/A Pairs**

While programmatic segmentation into financial sections like 'Income Statement' is complex, our subsequent chunking strategy will serve a similar purpose by breaking the document into manageable pieces. Below, we load the 50 manually constructed Q/A pairs into a pandas DataFrame. This will serve as our evaluation set to test the RAG system's performance.

In [5]:
import pandas as pd

# Data extracted from the Kyndryl financial reports PDF using Gemini 2.5 model
qa_data = [
    # Questions from 2024 Report
    (1, "What were Kyndryl's revenues for the quarter ended March 31, 2024?", "$3.8 billion."),
    (2, "What was the net loss for the fourth quarter of fiscal year 2024?", "$45 million."),
    (3, "What were the total revenues for the full fiscal year 2024?", "$16.1 billion."),
    (4, "What was the net loss for the full fiscal year 2024?", "$340 million."),
    (5, "What is Kyndryl's outlook for adjusted pretax income in fiscal year 2025?", "At least $435 million."),
    (6, "When does Kyndryl expect to return to positive constant-currency revenue growth?", "In the fourth quarter of the fiscal year 2025."),
    (7, "Who is the Chairman and Chief Executive Officer of Kyndryl?", "Martin Schroeter."),
    (8, "What was the reported year-over-year revenue decline for the fourth quarter of 2024?", "10% (and 9% in constant currency)."),
    (9, "What was the adjusted EBITDA for the fourth quarter ended March 31, 2024?", "$566 million."),
    (10, "What was the adjusted pretax income for the full fiscal year 2024?", "$165 million, compared to an adjusted pretax loss of $217 million in the prior year."),
    (11, "How much revenue was recognized in fiscal year 2024 from cloud hyperscaler alliances?", "More than $500 million, which was triple the prior-year amount."),
    (12, "How many delivery professionals has the Advanced Delivery initiative helped Kyndryl redeploy as of the 2024 report?", "More than 9,500."),
    (13, "What are the annualized savings generated by the Advanced Delivery initiative as of year-end 2024?", "Approximately $575 million."),
    (14, "What was the year-over-year revenue growth for Kyndryl Consult in the fourth quarter of 2024?", "13% year-over-year and 15% in constant currency."),
    (15, "What is the projected revenue range for fiscal year 2025?", "The outlook implies fiscal 2025 revenue of $15.2 to $15.5 billion."),
    (16, "What is the forecasted adjusted EBITDA margin for fiscal year 2025?", "At least 16.2%, an increase of at least 150 basis points compared to 14.7% in fiscal 2024."),
    (17, "On what date did Kyndryl release its 2024 financial results?", "May 7, 2024."),
    (18, "What was the cash flow from operations for the fiscal year ended March 31, 2024?", "$454 million."),
    (19, "What was the adjusted free cash flow for fiscal year 2024?", "$291 million."),
    (20, "What are the 'three-A initiatives' mentioned as primary drivers for growth in the 2024 report?", "Alliances, Advanced Delivery and Accounts."),
    (21, "What was the total revenue for the United States segment for the year ended March 31, 2024?", "$4,295 million (or $4.295 billion)."),
    (22, "What was the total amount of debt (short-term and long-term) as of March 31, 2024?", "$3,238 million (or $3.238 billion)."),
    (23, "What were the capital expenditures for the year ended March 31, 2024?", "$651 million."),
    (24, "What was the value of 'Signings' for the full year ended March 31, 2024?", "$12.5 billion."),
    (25, "According to CEO Martin Schroeter in the 2024 report, what is the company's new focus for the fiscal year?", "The company has pivoted from transformation to growth."),

    # Questions from 2023 Report
    (26, "What were Kyndryl's revenues for the quarter ended March 31, 2023?", "$4.3 billion. [1]"),
    (27, "What was the net loss for the fourth quarter of fiscal year 2023?", "$737 million. [1]"),
    (28, "What were the total revenues for the full fiscal year 2023?", "$17.0 billion. [1]"),
    (29, "What was the net loss for the full fiscal year 2023?", "$1.4 billion. [1]"),
    (30, "What was the adjusted pretax loss for the quarter ended March 31, 2023?", "$61 million. [1]"),
    (31, "What was the adjusted EBITDA for the full fiscal year 2023?", "$2.0 billion. [1]"),
    (32, "What was the value of contracts tied to cloud hyperscaler alliances signed in fiscal year 2023?", "An aggregate value of $1.2 billion, exceeding the $1 billion target. [2]"),
    (33, "How many hyperscaler certifications did Kyndryl employees have at the end of fiscal year 2023?", "35,000, a 100% year-over-year increase. [2]"),
    (34, "How many delivery professionals were redeployed as part of the Advanced Delivery initiative in fiscal year 2023?", "More than 5,500. [2]"),
    (35, "What were the annualized savings from the Advanced Delivery initiative in fiscal year 2023?", "Approximately $275 million. [2]"),
    (36, "What was the total impact of the Accounts initiative in annualized benefits for fiscal year 2023?", "Approximately $210 million of annualized benefits. [2]"),
    (37, "What were the transaction-related costs for the full fiscal year 2023?", "$264 million. [2]"),
    (38, "What were the workforce rebalancing charges recorded in the quarter ended March 31, 2023?", "$55 million. [2]"),
    (39, "What is the implied fiscal 2024 revenue outlook based on the 2023 report?", "$16.0 to $16.4 billion. [2]"),
    (40, "What was the outlook for the adjusted EBITDA margin for fiscal year 2024?", "12% to 13%. [2]"),
    (41, "What was the adjusted free cash flow for the fiscal year ended March 31, 2023?", "$352 million. [1]"),
    (42, "On what date did Kyndryl release its 2023 financial results?", "May 16, 2023. [1]"),
    (43, "Who is the Chief Financial Officer of Kyndryl mentioned in the 2023 report?", "David Wyshner. [2]"),
    (44, "What was the cash flow from operations for the fiscal year ended March 31, 2023?", "$781 million. [1]"),
    (45, "What was the adjusted pretax loss for the full fiscal year 2023?", "$217 million. [1]"),
    (46, "What were the expected cumulative annualized benefits from the Accounts initiative by March 2024?", "Approximately $400 million."),
    (47, "What was the year-over-year decline in revenue for fiscal year 2023?", "A decline of 7%, or unchanged in constant currency. [1]"),
    (48, "What was the total amount of debt (short-term and long-term) as of March 31, 2023?", "$3,221 million (or $3.221 billion)."),
    (49, "What were the capital expenditures for the year ended March 31, 2023?", "$865 million."),
    (50, "What was the value of 'Signings' for the year ended March 31, 2023?", "$12.2 billion.")
]

# Create the DataFrame
df = pd.DataFrame(qa_data, columns=['No', 'Question', 'Answer'])
df.to_csv("annual_reports/kyndryl_qa_data.csv", index=False)
# Display the DataFrame
df.head()

Unnamed: 0,No,Question,Answer
0,1,What were Kyndryl's revenues for the quarter e...,$3.8 billion.
1,2,What was the net loss for the fourth quarter o...,$45 million.
2,3,What were the total revenues for the full fisc...,$16.1 billion.
3,4,What was the net loss for the full fiscal year...,$340 million.
4,5,What is Kyndryl's outlook for adjusted pretax ...,At least $435 million.


### Part 2: Retrieval-Augmented Generation (RAG) System Implementation

**2.1 Data Processing: Chunking**

To prepare the text for retrieval, we'll split it into smaller, manageable chunks. This allows the model to find more specific and relevant passages. We will create two sets of chunks with different sizes (100 and 400 tokens) to analyze the impact of chunk size on retrieval performance. Each chunk will be assigned a unique ID and metadata indicating its source document and chunk size.

In [6]:
from transformers import AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def chunk_text(text, chunk_size, doc_id):
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunks.append({
            "id": f"{doc_id}_chunk_{len(chunks)}",
            "text": chunk_text,
            "metadata": {"doc_id": doc_id, "chunk_size": chunk_size}
        })
    return chunks

chunk_sizes = [100, 400]
all_chunks = []
for size in chunk_sizes:
    for i, doc_text in enumerate(documents):
        all_chunks.extend(chunk_text(doc_text, size, f"doc_{i}_size_{size}"))

print(f"Total number of chunks created: {len(all_chunks)}")
print("Sample chunk:", all_chunks[0])

  from .autonotebook import tqdm as notebook_tqdm
Token indices sequence length is longer than the specified maximum sequence length for this model (6349 > 512). Running this sequence through the model will result in indexing errors


Total number of chunks created: 148
Sample chunk: {'id': 'doc_0_size_100_chunk_0', 'text': 'kyndryl reports fourth quarter and full - year 2024 results • revenues for the quarter ended march 31, 2024 total $ 3. 8 billion, pretax loss is $ 4 million, net loss is $ 45 million, adjusted ebitda is $ 566 million, and adjusted pretax income is $ 30 million • fiscal year 2024 revenues total $ 16. 1 billion, pretax loss is $ 168 million, net loss is $ 340 million, adjusted ebit', 'metadata': {'doc_id': 'doc_0_size_100', 'chunk_size': 100}}


**2.2 Embedding & Indexing**

Next, we will convert the text chunks into numerical vectors (embeddings) using the `all-MiniLM-L6-v2` model. These embeddings capture the semantic meaning of the text.

We will build two types of indexes:
1.  **Dense Vector Store (FAISS):** For fast semantic similarity search.
2.  **Sparse Index (BM25):** For efficient keyword-based retrieval.

In [7]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from rank_bm25 import BM25Okapi

# 1. Embed Chunks
embedding_model = SentenceTransformer(model_name)
chunk_texts = [chunk['text'] for chunk in all_chunks]
embeddings = embedding_model.encode(chunk_texts, convert_to_numpy=True, show_progress_bar=True)

# 2. Build Dense Vector Store (FAISS)
embedding_dim = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(embedding_dim)
faiss_index.add(np.array(embeddings, dtype=np.float32))
print(f"FAISS index built with {faiss_index.ntotal} vectors.")

# 3. Build Sparse Index (BM25)
tokenized_corpus = [doc.lower().split(" ") for doc in chunk_texts]
bm25 = BM25Okapi(tokenized_corpus)
print("BM25 index built.")

Batches:   0%|          | 0/5 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Batches: 100%|██████████| 5/5 [00:00<00:00,  6.50it/s]

FAISS index built with 148 vectors.
BM25 index built.





**2.3 Hybrid Retrieval Pipeline**

Our retrieval pipeline will combine the strengths of both dense and sparse methods. For a given query, we will:
1. Preprocess the query.
2. Retrieve the top N chunks from FAISS based on vector similarity.
3. Retrieve the top N chunks from BM25 based on keyword matching.
4. Combine the results using a simple union to create a comprehensive list of candidate chunks.

In [8]:
def preprocess_query(query):
    return query.lower()

def hybrid_retrieve(query, top_n=5):
    # 1. Preprocess query
    clean_query = preprocess_query(query)
    
    # 2. Dense Retrieval (FAISS)
    query_embedding = embedding_model.encode([clean_query])
    _, dense_indices = faiss_index.search(np.array(query_embedding, dtype=np.float32), top_n)
    dense_results = [all_chunks[i] for i in dense_indices[0]]
    
    # 3. Sparse Retrieval (BM25)
    tokenized_query = clean_query.split(" ")
    bm25_scores = bm25.get_scores(tokenized_query)
    sparse_indices = np.argsort(bm25_scores)[::-1][:top_n]
    sparse_results = [all_chunks[i] for i in sparse_indices]
    
    # 4. Combine results
    combined_results_dict = {chunk['id']: chunk for chunk in dense_results + sparse_results}
    
    print(f"Retrieved {len(dense_results)} chunks from dense search.")
    print(f"Retrieved {len(sparse_results)} chunks from sparse search.")
    print(f"Combined to {len(combined_results_dict)} unique chunks.")
    
    return list(combined_results_dict.values())

In [9]:
# Example Usage
test_query = "What was the company's revenue in 2024?"
retrieved_chunks = hybrid_retrieve(test_query)

print(f"\n--- Top Retrieved Chunk for '{test_query}' ---")
print(retrieved_chunks[0]['text'])

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.

--- Top Retrieved Chunk for 'What was the company's revenue in 2024?' ---
to march 2024 : revenue growth of ( 6 % ) to ( 8 % ) in constant currency compared to revenue of $ 17. 0 billion in fiscal 2023, with most of the decline due to accelerated actions by kyndryl to reduce certain low - margin revenue streams. based on recent exchange rates, the company ’ s outlook implies fiscal 2024 revenue of $ 16. 0 to $ 16. 4 billion. the company continues to expect to return to positive revenue growth in calendar year 2025


**2.4 Advanced RAG Technique: Cross-Encoder Re-ranking**

The initial retrieval might return chunks that are only partially relevant. To refine our results, we'll use a Cross-Encoder model. Unlike the embedding model which computes vectors independently, a Cross-Encoder takes both the query and a candidate chunk as input to produce a more accurate relevance score. We will use this to re-rank the top chunks retrieved from our hybrid pipeline.

In [23]:
from sentence_transformers.cross_encoder import CrossEncoder

# Load a cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_with_cross_encoder(query, chunks):
    # Create pairs of [query, chunk_text] for scoring
    query_chunk_pairs = [[query, chunk['text']] for chunk in chunks]
    
    # Get scores from the cross-encoder
    scores = cross_encoder.predict(query_chunk_pairs, show_progress_bar=False)
    
    # Add scores to chunks and sort
    for i in range(len(chunks)):
        chunks[i]['relevance_score'] = scores[i]
        
    reranked_chunks = sorted(chunks, key=lambda x: x['relevance_score'], reverse=True)
    return reranked_chunks


def advanced_retrieve(query):
    retrieved_chunks = hybrid_retrieve(query)
    reranked_results = rerank_with_cross_encoder(query, retrieved_chunks)
    return reranked_results

In [24]:
# Example Usage with the same test query
test_query = "What was the company's revenue in 2024?"
reranked_results = advanced_retrieve(test_query)
print(f"\n--- Top Re-ranked Chunk for '{test_query}' ---")
print(f"Relevance Score: {reranked_results[0]['relevance_score']:.4f}")
print(reranked_results[0]['text'])

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.

--- Top Re-ranked Chunk for 'What was the company's revenue in 2024?' ---
Relevance Score: 8.2950
to march 2024 : revenue growth of ( 6 % ) to ( 8 % ) in constant currency compared to revenue of $ 17. 0 billion in fiscal 2023, with most of the decline due to accelerated actions by kyndryl to reduce certain low - margin revenue streams. based on recent exchange rates, the company ’ s outlook implies fiscal 2024 revenue of $ 16. 0 to $ 16. 4 billion. the company continues to expect to return to positive revenue growth in calendar year 2025


**2.5 Response Generation**

After retrieving and re-ranking the most relevant document chunks, the final step is to generate a coherent, human-readable answer. We will use a generative language model to synthesize the information from the retrieved passages into a direct response to the user's query. 

For this demonstration, we'll use `DistilGPT-2`, a smaller and more efficient version of GPT-2, which is suitable for tasks where resource constraints are a consideration. The retrieved text and the original query are combined into a carefully crafted prompt to guide the model in generating a factual answer grounded in the provided context.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# --- Configuration ---
# Using google/flan-t5-base is a good balance for an 8GB M2 Mac.
# 'small' is faster but less accurate; 'large' may run out of memory.
model_name = "google/flan-t5-base"

# --- Model and Tokenizer Loading ---
# Load the model and tokenizer, sending the model to the specified device
try:
    gen_tokenizer = AutoTokenizer.from_pretrained(model_name)
    gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure you have a stable internet connection.")
    exit()

# --- Initialize Pipeline ---
# Initialize the pipeline once for efficiency, specifying the correct task and device
# For MPS, device=0 is the standard way to reference the first MPS device.
nlp_pipeline = pipeline(
    "text2text-generation",
    model=gen_model,
    tokenizer=gen_tokenizer,
)

def generate_answer(query, reranked_chunks, max_new_tokens=100):
    """
    Generates an answer using the Flan-T5 model based on the retrieved chunks.
    """
    # 1. Prepare the context from the top 3 reranked chunks
    context = "\n\n".join([chunk.get("text", "") for chunk in reranked_chunks[:3]])

    # 2. Create a prompt suitable for Flan-T5
    # This instruction-based format works well for this model family.
    prompt = f"""
Based on the following context, please answer the question.

Context:
{context}

Question:
{query}

Answer:
"""

    # 3. Generate the answer using the pre-initialized pipeline
    # The pipeline handles tokenization, truncation, and moving data to the MPS device.
    generated_output = nlp_pipeline(
        prompt,
        max_length=512,  # Set a max_length for the input+output
        max_new_tokens=max_new_tokens,  # Control the max length of the generated answer
        num_return_sequences=1,
    )

    # 4. Extract and clean the answer text
    # The output from a 'text2text-generation' pipeline is cleaner.
    answer = generated_output[0]["generated_text"].strip()

    return answer

Device set to use mps (Apple Silicon GPU)


Device set to use mps:0


In [31]:
test_query = "What is fiscal year 2024 revenue?"

reranked_results = advanced_retrieve(test_query)


for r in reranked_results:
    print('-----')
    print(r['text'])

# Example Usage
final_answer = generate_answer(test_query, reranked_results)

print(f"Question: {test_query}")
print(f"\nGenerated Answer:\n{final_answer}")

Retrieved 5 chunks from dense search.
Retrieved 5 chunks from sparse search.
Combined to 10 unique chunks.


Both `max_new_tokens` (=100) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


-----
to march 2024 : revenue growth of ( 6 % ) to ( 8 % ) in constant currency compared to revenue of $ 17. 0 billion in fiscal 2023, with most of the decline due to accelerated actions by kyndryl to reduce certain low - margin revenue streams. based on recent exchange rates, the company ’ s outlook implies fiscal 2024 revenue of $ 16. 0 to $ 16. 4 billion. the company continues to expect to return to positive revenue growth in calendar year 2025
-----
kyndryl reports fourth quarter and full - year 2024 results • revenues for the quarter ended march 31, 2024 total $ 3. 8 billion, pretax loss is $ 4 million, net loss is $ 45 million, adjusted ebitda is $ 566 million, and adjusted pretax income is $ 30 million • fiscal year 2024 revenues total $ 16. 1 billion, pretax loss is $ 168 million, net loss is $ 340 million, adjusted ebit
-----
to march 2025 : • revenue growth of ( 2 % ) to ( 4 % ) in constant currency compared to revenue of $ 16. 1 billion in fiscal 2024, which reflects actions

**2.6 Guardrail Implementation**

To ensure the reliability and safety of our RAG system, we need to implement guardrails. These are checks and balances that prevent the system from processing inappropriate queries or generating harmful, irrelevant, or factually incorrect answers. We will implement two basic guardrails:

1.  **Input Guardrail:** A simple filter to block off-topic or nonsensical questions.
2.  **Output Guardrail:** A check to ensure the generated answer is grounded in the retrieved context and not a hallucination.

In [14]:
# --- Input Guardrail ---

def validate_query(query):
    """
    Validates the input query to filter out irrelevant or harmful inputs.
    """
    query = query.lower()
    
    # Check for minimum length
    if len(query.split()) < 3:
        return False, "Query is too short. Please ask a more specific question."
        
    # Basic check for off-topic keywords
    off_topic_keywords = ['recipe', 'poem', 'joke', 'movie', 'celebrity']
    if any(keyword in query for keyword in off_topic_keywords):
        return False, "Query is off-topic. This system is for financial document analysis."

    return True, "Query is valid."

# --- Output Guardrail ---

def validate_output(answer, retrieved_chunks):
    """
    Validates the generated output to flag potential hallucinations.
    This is a simple check based on keyword overlap.
    """
    context = " ".join([chunk['text'].lower() for chunk in retrieved_chunks])
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.split())
    
    if not answer_tokens:
        return False, "Generated answer is empty."

    # Calculate the percentage of answer tokens that are present in the context
    overlap = answer_tokens.intersection(context_tokens)
    overlap_ratio = len(overlap) / len(answer_tokens)
    
    # If overlap is less than a certain threshold (e.g., 30%), flag it as potentially ungrounded.
    if overlap_ratio < 0.3:
        return False, f"Potential hallucination detected. Answer may not be grounded in context (Overlap: {overlap_ratio:.2%})."
        
    return True, f"Answer appears grounded in context (Overlap: {overlap_ratio:.2%})."


# --- Example Usage of Guardrails ---

# 1. Input Guardrail Examples
print("--- Input Guardrail Tests ---")
valid_query = "What were the total assets of Kyndryl as of March 31, 2024?"
invalid_query_short = "revenue?"
invalid_query_topic = "Can you give me a recipe for a cake?"

is_valid, reason = validate_query(valid_query)
print(f"Query: '{valid_query}' -> Valid: {is_valid}, Reason: {reason}")

is_valid, reason = validate_query(invalid_query_short)
print(f"Query: '{invalid_query_short}' -> Valid: {is_valid}, Reason: {reason}")

is_valid, reason = validate_query(invalid_query_topic)
print(f"Query: '{invalid_query_topic}' -> Valid: {is_valid}, Reason: {reason}")

print("\n" + "="*50 + "\n")

# 2. Output Guardrail Example
print("--- Output Guardrail Tests ---")
# Use the previously generated answer
is_grounded, reason = validate_output(final_answer, reranked_results)
print(f"Answer: '{final_answer[:100]}...' -> Grounded: {is_grounded}, Reason: {reason}")

# Example of a potentially hallucinated answer
hallucinated_answer = "Kyndryl announced a partnership with SpaceX to build datacenters on Mars."
is_grounded, reason = validate_output(hallucinated_answer, reranked_results)
print(f"Answer: '{hallucinated_answer}' -> Grounded: {is_grounded}, Reason: {reason}")

--- Input Guardrail Tests ---
Query: 'What were the total assets of Kyndryl as of March 31, 2024?' -> Valid: True, Reason: Query is valid.
Query: 'revenue?' -> Valid: False, Reason: Query is too short. Please ask a more specific question.
Query: 'Can you give me a recipe for a cake?' -> Valid: False, Reason: Query is off-topic. This system is for financial document analysis.


--- Output Guardrail Tests ---
Answer: 'answer the question based on the context.

context: to march 2024 : revenue growth of ( 6 % ) to ( 8...' -> Grounded: True, Reason: Answer appears grounded in context (Overlap: 80.29%).
Answer: 'Kyndryl announced a partnership with SpaceX to build datacenters on Mars.' -> Grounded: True, Reason: Answer appears grounded in context (Overlap: 45.45%).
