# Assignment 2 – Comparative Financial QA System: RAG vs Fine-Tuning

Group No 16

## Group Member Names:
1. | Anup Jindal        | 2023ac05472 |100%
2. | Yogesh Chaturvedi  | 2023ac05167 |100%
3. | HRISHIKESH MALAKAR | 2023Ac05058 |100%
4. | Anit Nair          | 2023ac05503 |100%
5. | DEBASISH ACHARYA   | 2023ac05417 |100%


### Objective
Develop and compare two systems for answering questions based on company financial statements (last two years):

- Retrieval-Augmented Generation (RAG) Chatbot: Combines document retrieval and generative response.
- Fine-Tuned Language Model (FT) Chatbot: Directly fine-tunes a small open-source language model on financial Q&A.

Use the same financial data for both methods and perform a detailed comparison on accuracy, speed, and robustness.

In [None]:
%pip install -q -r requirements.txt

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# imports
import zipfile
import os
from bs4 import BeautifulSoup

## 1. Data Collection & Preprocessing
#### This assignment uses GE Healthcare's financial statements filed with the US SEC. The raw data can be downloaded from the URL below.
- The financial statements of GE Healthcare were downloaded from the United States Securities and Exchange Commission; see the [source of the data](https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001932393&type=10-Q&dateb=&owner=include&count=40&search_text=).
- The downloaded file 'gehc-annual-report-2023-2024.zip' is available in the data folder.

### 1.1 Extract the data and convert them to plain text. (Source data is html files)
- Use BeautifulSoup to parse HTML and extract text
- After conversion, save the plain text files in the ./gehc_fin_plain_text folder

In [None]:
zip_file_path = '../../data/gehc-annual-report-2023-2024.zip'
extracted_dir_path = '../../data/content/gehc_fin_extracted'

# zip_file_path = './gehc-annual-report-2023-2024.zip'
# extracted_dir_path = './gehc_fin_extracted'


# Create the extraction directory if it doesn't exist
os.makedirs(extracted_dir_path, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_dir_path)

print(f"Extracted {zip_file_path} to {extracted_dir_path}")

In [None]:
extracted_dir_path = '../../data/content/gehc_fin_extracted'
plain_text_dir_path = './gehc_fin_plain_text'

#extracted_dir_path = './gehc_fin_extracted'
#plain_text_dir_path = './gehc_fin_plain_text'

# Create the directory for plain text files if it doesn't exist
os.makedirs(plain_text_dir_path, exist_ok=True)

html_files = []
for root, _, files in os.walk(extracted_dir_path):
    for file in files:
        if file.lower().endswith((".html", ".htm")):  # also matches upper-case extensions
            html_files.append(os.path.join(root, file))

print(f"Found {len(html_files)} HTML files.")

for html_file_path in html_files:
    try:
        # Try reading with utf-8 first, then latin-1
        try:
            with open(html_file_path, 'r', encoding='utf-8') as f:
                html_content = f.read()
        except UnicodeDecodeError:
            with open(html_file_path, 'r', encoding='latin-1') as f:
                html_content = f.read()


        # Use BeautifulSoup to parse HTML and extract text
        soup = BeautifulSoup(html_content, 'html.parser')
        plain_text = soup.get_text(separator='\n')

        # Create a corresponding plain text file path
        relative_path = os.path.relpath(html_file_path, extracted_dir_path)
        plain_text_file_path = os.path.join(plain_text_dir_path, relative_path + ".txt")

        # Create directories for the plain text file if they don't exist
        os.makedirs(os.path.dirname(plain_text_file_path), exist_ok=True)

        with open(plain_text_file_path, 'w', encoding='utf-8') as f:
            f.write(plain_text)

        print(f"Converted {html_file_path} to plain text and saved to {plain_text_file_path}")

    except Exception as e:
        print(f"Error processing {html_file_path}: {e}")

print("Finished converting HTML files to plain text.")


### 1.2 Walk through the text files and load each one into a list as a string.

In [None]:
plain_text_data = []
# Walk through the directory and read all .txt files
for root, _, files in os.walk(plain_text_dir_path):
    for file in files:
        if file.endswith(".txt"):
            file_path = os.path.join(root, file)
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    plain_text_data.append(f.read())
                print(f"Loaded {file_path}")
            except Exception as e:
                print(f"Error reading {file_path}: {e}")

print(f"Loaded {len(plain_text_data)} plain text files.")


### 1.3 Clean text by removing noise like headers, footers, and page numbers.

In [None]:
import re

cleaned_text_data = []

# Function to clean text
def clean_text(text):
    # Remove common headers/footers (example patterns, adjust as needed)
    text = re.sub(r'\[\s*\d+\s*\]', '', text) # Remove numbers in brackets like [ 1 ]
    text = re.sub(r'Page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE) # Remove "Page X of Y"
    text = re.sub(r'Exhibit\s+\d+\.\d+', '', text, flags=re.IGNORECASE) # Remove "Exhibit X.Y"
    text = re.sub(r'\n\s*\n', '\n', text) # Remove excessive newlines 

    return text

# Apply cleaning to each document
for text in plain_text_data:
    cleaned_text_data.append(clean_text(text))
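
To see what the cleaning rules do, the sketch below applies the same four substitutions to a made-up snippet. The sample text is invented for illustration and is not taken from the filings; the function is repeated only so that the block runs on its own.

```python
import re

def clean_text(text):
    """Same four substitutions as in the cell above."""
    text = re.sub(r'\[\s*\d+\s*\]', '', text)                               # bracketed numbers such as [ 1 ]
    text = re.sub(r'Page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE)  # "Page X of Y"
    text = re.sub(r'Exhibit\s+\d+\.\d+', '', text, flags=re.IGNORECASE)     # "Exhibit X.Y"
    text = re.sub(r'\n\s*\n', '\n', text)                                   # collapse runs of blank lines
    return text

# Hypothetical sample, not real filing text
sample = "Total revenues [ 1 ]\n\n\nPage 3 of 120\nExhibit 10.1 Net income"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the patterns remove the markers but leave the surrounding spaces in place, so a later whitespace normalization step may still be useful.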

### 1.4 Segment reports into logical sections (e.g., income statement, balance sheet).

In [None]:
segmented_financial_statements = []

# Define the financial statement segments and their potential headings
# Using a dictionary to map a user-friendly name to a list of potential regex patterns.
# This allows for variations in how headings might appear.
financial_segment_patterns = {
    "Statements of Operations / Income": [
        r"CONSOLIDATED STATEMENTS OF OPERATIONS\s*\n(.*?)(?=\n(?:Statements of Financial Position|Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
        r"Statements of Income\s*\n(.*?)(?=\n(?:Statements of Financial Position|Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
    ],
    "Statements of Financial Position / Balance Sheet": [
        r"Statements of Financial Position\s*\n(.*?)(?=\n(?:Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
        r"balance sheet\s*\n(.*?)(?=\n(?:Statements of Comprehensive Income|Statements of Changes in Equity|cash flows)|\Z)",
    ],
    "Statements of Cash Flows": [
        r"cash flows\s*\n(.*?)(?=\Z)",
    ]
}

# Iterate through each cleaned document
for doc_text in cleaned_text_data:
    doc_segments = {}
    remaining_text = doc_text

    # Iterate through each financial segment and try to find its content using the defined patterns
    for segment_name, patterns in financial_segment_patterns.items():
        found_segment = False
        for pattern in patterns:
            match = re.search(pattern, remaining_text, re.DOTALL | re.IGNORECASE) # Use IGNORECASE for flexibility
            if match:
                doc_segments[segment_name] = match.group(1).strip()
                # Update remaining_text to be the part after the found segment if a match is found
                remaining_text = remaining_text[match.end():]
                found_segment = True
                break # Move to the next segment after finding a match

        if not found_segment:
            doc_segments[segment_name] = "Segment not found."  # Marker used when no pattern matched


    segmented_financial_statements.append(doc_segments)

print(f"Segmented {len(segmented_financial_statements)} documents into financial statements.")

# You can inspect the first segmented financial statements to see the results
import json
print(json.dumps(segmented_financial_statements[0], indent=2))
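
The segmentation relies on a lazy capture group that stops at a lookahead for the next known heading. The sketch below shows that mechanism on a toy document; the headings and figures are invented for illustration, and the pattern is a simplified version of the ones defined above.

```python
import re

# Toy document with invented headings and figures (not real filing text)
toy_doc = (
    "Statements of Income\n"
    "Total revenues 100\n"
    "Net income 10\n"
    "Statements of Financial Position\n"
    "Total assets 500\n"
    "Statements of Cash Flows\n"
    "Cash from operating activities 20\n"
)

# Capture everything after the heading, lazily, up to the next known heading or the end of the text
pattern = r"Statements of Income\s*\n(.*?)(?=\nStatements of Financial Position|\Z)"
match = re.search(pattern, toy_doc, re.DOTALL | re.IGNORECASE)
income_segment = match.group(1).strip() if match else "Segment not found."
print(income_segment)
```

Because the search is case-insensitive, a short terminator such as `cash flows` may also match inside running text and end a segment early, so it is worth inspecting the printed segments before relying on them.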

### 1.5 From the segmented_financial_statements, create a data structure as below to store the information:

```json
{
    document: number,  // document id
    segment: string,   // financial segment, e.g. Operations, Financial Position / Balance Sheet, Comprehensive Income, Cash Flow
    line_item: string, // line item description
    2024: number,      // value in each year
    2023: number,
    2022: number
}
```

In [None]:
financial_data = []
# Regex to find line items and their values for 2024, 2023 and 2022
line_item_pattern = re.compile(
    r"^(.*?)\s+"  # Capture the line item description
    r"\$\s*([\d,]+)\s+"  # Capture the 2024 value
    r"\$\s*([\d,]+)\s+"  # Capture the 2023 value
    r"\$\s*([\d,]+)\s+", # Capture the 2022 value
    re.MULTILINE  # Let ^ match at the start of every line, not only at the start of the text
)

for i, doc_segments in enumerate(segmented_financial_statements):
    for segment_name, content in doc_segments.items():
        if content != "Segment not found.":
            # Find all matches in the content
            matches = line_item_pattern.finditer(content)
            for match in matches:
                line_item = match.group(1).strip()
                value_2024 = match.group(2).replace(',', '')
                value_2023 = match.group(3).replace(',', '')
                value_2022 = match.group(4).replace(',', '')
                # Add to our structured data list
                financial_data.append({
                    "document": i + 1,
                    "segment": segment_name,
                    "line_item": line_item,
                    "2024": int(value_2024),
                    "2023": int(value_2023),
                    "2022": int(value_2022)
                })

# Print the first 5 extracted key-value pairs
for item in financial_data[:5]:
    print(item)
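
The sketch below runs the same line-item pattern on a small invented sample, to show which rows it picks up and which it skips. The figures are made up and only mimic the layout the pattern expects (a label followed by three `$` amounts).

```python
import re

line_item_pattern = re.compile(
    r"^(.*?)\s+"          # line item description
    r"\$\s*([\d,]+)\s+"   # 2024 value
    r"\$\s*([\d,]+)\s+"   # 2023 value
    r"\$\s*([\d,]+)\s+",  # 2022 value
    re.MULTILINE,
)

# Invented figures; the third row has no "$" signs and is therefore not matched
sample = (
    "Total revenues $ 1,000 $ 900 $ 800 \n"
    "Net income $ 120 $ 110 $ 95 \n"
    "Cash, end of period 50 40 30\n"
)

rows = [
    {
        "line_item": m.group(1).strip(),
        "2024": int(m.group(2).replace(",", "")),
        "2023": int(m.group(3).replace(",", "")),
        "2022": int(m.group(4).replace(",", "")),
    }
    for m in line_item_pattern.finditer(sample)
]
print(rows)
```

Rows without `$` signs, and amounts written in parentheses, do not match `[\d,]+` after a `$` and are silently skipped, so the extracted list may cover only part of each statement.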

### 1.6 Formulate at least 50 question/answer (Q/A) pairs

In [None]:
num_question_pair=200

In [None]:
generated_questions = []

# Question templates, in the order they are generated for each line item
question_templates = [
    # Question type 1: Value in a specific year
    "What was the value of '{line_item}' in 2024 according to the {segment}?",
    "Find the value for '{line_item}' in 2023 from the {segment}.",
    "Could you provide the figure for '{line_item}' in 2022 as reported in the {segment}?",
    # Question type 2: Change between two years
    "How much did the '{line_item}' change from 2023 to 2024 based on the {segment}?",
    "What was the difference in '{line_item}' between 2022 and 2023 according to the {segment}?",
    # Question type 3: Values across all three years
    "What were the values for '{line_item}' for the years 2024, 2023, and 2022 in the {segment}?",
]

# Iterate through the extracted financial data until enough questions exist
for item in financial_data:
    if len(generated_questions) >= num_question_pair:
        break
    for template in question_templates:
        generated_questions.append({
            "based_on_data_item": item,
            "question": template.format(line_item=item["line_item"], segment=item["segment"]),
        })
        if len(generated_questions) == num_question_pair:
            break

# Print the first 10 generated questions to inspect
for q in generated_questions[:10]:
    print(q)

print(f"\nGenerated {len(generated_questions)} questions.")

#### Generate an answer for each question formulated above.

In [None]:
# Iterate through the generated questions
for q in generated_questions:
    item = q["based_on_data_item"]
    line_item = item["line_item"]
    value_2024 = item["2024"]
    value_2023 = item["2023"]
    value_2022 = item["2022"]
    segment = item["segment"]
    question_text = q["question"]
    answer = ""

    # Determine the type of question and extract/calculate the answer
    if f"in {2024}" in question_text and f"{2023}, and {2022}" not in question_text:
        answer = f"The value of '{line_item}' in 2024 was {value_2024} millions of dollars."
    elif f"in {2023}" in question_text and f"{2024}, and {2022}" not in question_text:
        answer = f"The value of '{line_item}' in 2023 was {value_2023} millions of dollars."
    elif f"in {2022}" in question_text and f"{2024}, and {2023}" not in question_text:
        answer = f"The value of '{line_item}' in 2022 was {value_2022} millions of dollars."
    elif f"change from {2023} to {2024}" in question_text:
        change = value_2024 - value_2023
        answer = f"The change in '{line_item}' from 2023 to 2024 was {change} millions of dollars."
    elif f"difference in '{line_item}' between {2022} and {2023}" in question_text:
        difference = value_2023 - value_2022
        answer = f"The difference in '{line_item}' between 2022 and 2023 was {difference} millions of dollars."
    elif f"for the years {2024}, {2023}, and {2022}" in question_text:
         answer = f"The values for '{line_item}' for the years 2024, 2023, and 2022 were {value_2024}, {value_2023}, and {value_2022} millions of dollars, respectively."
    else:
        # Handle any unexpected question formats or if the pattern doesn't match
        answer = "Could not determine the specific answer based on the question format."


    q['answer'] = answer

# Print the first 10 question-answer pairs
print("First 10 Generated Q/A Pairs:")
for q in generated_questions[:10]:
    print(f"Question: {q['question']}")
    print(f"Answer: {q['answer']}")
    print("-" * 20)

generated_questions_answer = generated_questions.copy()  # Shallow copy: the dicts are shared with generated_questions
# Print the total number of Q/A pairs generated
print(f"\nTotal Q/A pairs generated: {len(generated_questions)}")
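
The answer step above infers the question type by searching the question text. One possible alternative, sketched below, pairs each question template with a function that builds its answer, so the two cannot drift apart if the wording changes. `QUESTION_SPECS`, `build_qa_pairs`, and `example_item` are hypothetical names introduced for this illustration, and only two of the six templates are shown.

```python
# Each spec pairs a question template with a function that builds the answer from a data item
QUESTION_SPECS = [
    (
        "What was the value of '{line_item}' in 2024 according to the {segment}?",
        lambda d: f"The value of '{d['line_item']}' in 2024 was {d['2024']} millions of dollars.",
    ),
    (
        "How much did the '{line_item}' change from 2023 to 2024 based on the {segment}?",
        lambda d: f"The change in '{d['line_item']}' from 2023 to 2024 was {d['2024'] - d['2023']} millions of dollars.",
    ),
]

def build_qa_pairs(items, limit):
    """Generate question/answer pairs together, stopping once `limit` pairs exist."""
    pairs = []
    for item in items:
        for template, answer_fn in QUESTION_SPECS:
            if len(pairs) >= limit:
                return pairs
            pairs.append({
                "question": template.format(line_item=item["line_item"], segment=item["segment"]),
                "answer": answer_fn(item),
            })
    return pairs

# Invented data item in the same shape as the entries of financial_data
example_item = {
    "document": 1, "segment": "Statements of Cash Flows", "line_item": "Net income",
    "2024": 120, "2023": 110, "2022": 95,
}
pairs = build_qa_pairs([example_item], limit=50)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```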

## 2. Retrieval-Augmented Generation (RAG) System Implementation

### 2.1 Data Processing

 - Split the cleaned text into chunks suitable for retrieval with at least two chunk sizes (e.g., 100 and 400 tokens).
 - Assign unique IDs and metadata to chunks.


In [None]:
# Assuming 'cleaned_text_data' is available from the previous data cleaning step

def chunk_text(text, chunk_size=100, overlap=20):
    """
    Splits text into overlapping chunks.

    Args:
        text (str): The input text.
        chunk_size (int): The desired size of each chunk (in words or tokens, depending on how you split).
        overlap (int): The number of words/tokens to overlap between chunks.

    Returns:
        list: A list of text chunks.
    """
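    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    # Advance by (chunk_size - overlap) words so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks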

# Define chunk sizes
chunk_sizes = [100, 400]
chunked_data = {}

# Process each cleaned document and create chunks of different sizes
for doc_id, cleaned_text in enumerate(cleaned_text_data):
    for size in chunk_sizes:
        chunks = chunk_text(cleaned_text, chunk_size=size)
        if f'chunks_{size}' not in chunked_data:
            chunked_data[f'chunks_{size}'] = []

        for i, chunk in enumerate(chunks):
            chunked_data[f'chunks_{size}'].append({
                'id': f'doc_{doc_id}_chunk_{i}_size_{size}',
                'content': chunk,
                'metadata': {
                    'document_id': doc_id,
                    'chunk_id': i,
                    'chunk_size': size
                }
            })

# Print some information about the generated chunks
for key, chunks in chunked_data.items():
    size = key.split('_')[-1]  # keys look like 'chunks_100'
    print(f"Generated {len(chunks)} chunks of size {size}.")
    if len(chunks) > 0:
        print(f"First chunk ({size}): {chunks[0]['content'][:200]}...") # Print first 200 characters of the first chunk
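
A small worked example makes the sliding window easier to follow. The sketch below uses the same word-based splitting as `chunk_text`, with tiny sizes chosen only so the overlap is visible; the function is repeated in compact form so that the block runs on its own.

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Word-based sliding window: each step advances by chunk_size - overlap words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Ten placeholder words: w0 ... w9
text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(text, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With these settings the window starts at words 0, 3, 6, and 9, so each chunk repeats the last word of the previous one and the final chunk contains only `w9`. Sizes here are counted in whitespace-separated words, which only approximates the token counts named in the task.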

### 2.2 Embedding & Indexing

#### 2.2.1 Embed the chunks using the all-MiniLM-L6-v2 model

In [None]:
from sentence_transformers import SentenceTransformer

# Load the sentence embedding model
try:
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Sentence embedding model 'all-MiniLM-L6-v2' loaded successfully.")
except Exception as e:
    print(f"Error loading sentence embedding model: {e}")
    print("Please ensure you have an active internet connection to download the model.")
    embedding_model = None # Set to None if loading fails

# Assuming 'chunked_data' is available from the previous chunking step
# Embed chunks of size 100 and 400

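# Defined before the check so later cells can still call embedded_chunks.get(...) if the model failed to load
embedded_chunks = {}
if embedding_model is not None: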
    for size, chunks in chunked_data.items():
        print(f"Embedding {len(chunks)} chunks of size {size.split('_')[-1]}...")
        # Extract the content of the chunks to embed
        chunks_content = [chunk['content'] for chunk in chunks]

        # Generate embeddings
        try:
            embeddings = embedding_model.encode(chunks_content, show_progress_bar=True)
            embedded_chunks[size] = {
                'chunks': chunks, # Keep the original chunk data
                'embeddings': embeddings
            }
            print(f"Finished embedding chunks of size {size.split('_')[-1]}. Shape of embeddings: {embeddings.shape}")
        except Exception as e:
            print(f"Error during embedding for chunk size {size.split('_')[-1]}: {e}")
            embedded_chunks[size] = None # Indicate if embedding failed


else:
    print("Embedding model not loaded, skipping embedding step.")

# Inspect the shape of the embeddings for one chunk size, e.g., size 100
# if embedded_chunks.get('chunks_100') and embedded_chunks['chunks_100']['embeddings'] is not None:
#     print(f"\nShape of embeddings for chunk size 100: {embedded_chunks['chunks_100']['embeddings'].shape}")
#     print(f"Shape of embeddings for chunk size 400: {embedded_chunks['chunks_400']['embeddings'].shape}")

#### 2.2.2 Build dense vector store to capture semantic relation using ChromaDB

In [None]:
import chromadb

# Initialize ChromaDB client
# By default, it will use an in-memory database. You can configure it for persistent storage if needed.
try:
    client = chromadb.Client()
    print("ChromaDB client initialized.")
except Exception as e:
    print(f"Error initializing ChromaDB client: {e}")
    client = None # Set to None if initialization fails


# Create or get a collection for our chunks
# A collection is like a table in a traditional database.
collection_name = "financial_report_chunks"
try:
    collection = client.get_or_create_collection(name=collection_name)
    print(f"ChromaDB collection '{collection_name}' created or retrieved.")
except Exception as e:
    print(f"Error getting or creating ChromaDB collection: {e}")
    collection = None # Set to None if collection creation fails

# Add the embedded chunks to the collection
# We'll add the chunks from one of the sizes, for example, size 100, to the dense vector store.
# You could potentially add both sizes to separate collections or experiment with different strategies.
if collection is not None and embedded_chunks.get('chunks_100') and embedded_chunks['chunks_100']['embeddings'] is not None:
    chunks_to_add = embedded_chunks['chunks_100']['chunks']
    embeddings_to_add = embedded_chunks['chunks_100']['embeddings']

    # Prepare data for ChromaDB
    ids = [chunk['id'] for chunk in chunks_to_add]
    documents = [chunk['content'] for chunk in chunks_to_add]
    metadatas = [chunk['metadata'] for chunk in chunks_to_add]


    # Add to ChromaDB in batches to avoid potential issues with large numbers of documents
    batch_size = 100  # Adjust batch size as needed
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i:i + batch_size]
        batch_documents = documents[i:i + batch_size]
        batch_embeddings = embeddings_to_add[i:i + batch_size]
        batch_metadatas = metadatas[i:i+ batch_size]

        try:
            collection.add(
                embeddings=batch_embeddings.tolist(), # ChromaDB expects a list of lists
                documents=batch_documents,
                metadatas=batch_metadatas,
                ids=batch_ids
            )
            print(f"Added batch {i//batch_size + 1} to ChromaDB.")
        except Exception as e:
            print(f"Error adding batch {i//batch_size + 1} to ChromaDB: {e}")

    print(f"Finished adding {len(ids)} chunks to ChromaDB collection '{collection_name}'.")

# You can verify the count of items in the collection
if collection is not None:
    try:
        count = collection.count()
        print(f"Total items in ChromaDB collection '{collection_name}': {count}")
    except Exception as e:
        print(f"Error getting count from ChromaDB collection: {e}")
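
Conceptually, a dense vector store ranks stored embeddings by their closeness to the query embedding. The sketch below shows that idea as a brute-force search with cosine similarity over three-dimensional toy vectors. It illustrates the principle only and is not how ChromaDB is implemented; the vectors, chunk ids, and the helper `top_k` are invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions
store = {
    "chunk_revenue": [0.9, 0.1, 0.0],
    "chunk_cash":    [0.1, 0.9, 0.1],
    "chunk_legal":   [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]
results = top_k(query_vec, store, k=2)
print(results)
```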

#### 2.2.3 Create a sparse index (BM25 or TF-IDF) for keyword retrieval

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming 'embedded_chunks' is available from the embedding step
# We will use the chunks of size 100 for building the TF-IDF index

# Extract the content of the chunks (the entry is None if embedding failed)
if embedded_chunks.get('chunks_100'):
    chunks_to_embed = embedded_chunks['chunks_100']['chunks']
    chunks_content = [chunk['content'] for chunk in chunks_to_embed]

    # Initialize TF-IDF Vectorizer
    # You can adjust parameters like max_features, min_df, max_df, ngram_range
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

    # Fit the vectorizer to the chunk content and transform the chunks
    try:
        tfidf_matrix = tfidf_vectorizer.fit_transform(chunks_content)
        print("TF-IDF vectorizer fitted and matrix created successfully.")
        print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
    except Exception as e:
        print(f"Error creating TF-IDF matrix: {e}")
        tfidf_vectorizer = None # Set to None if fitting fails
        tfidf_matrix = None # Set to None if fitting fails

else:
    print("Chunks of size 100 not found in embedded_chunks. Cannot build TF-IDF index.")
    tfidf_vectorizer = None
    tfidf_matrix = None

# The tfidf_matrix now represents the sparse index of our chunks.
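
The heading names BM25 as an alternative to TF-IDF, while the cell above builds a TF-IDF index. For comparison, here is a minimal sketch of Okapi BM25 scoring in plain Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; these values, the function name `bm25_scores`, and the sample documents are assumptions for illustration and do not come from the notebook.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Invented mini-corpus
docs = [
    "total revenues increased in 2024",
    "cash flows from operating activities",
    "total assets and total liabilities",
]
scores = bm25_scores("total revenues", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Unlike plain TF-IDF, BM25 dampens repeated terms and penalizes long documents, which may matter when chunks differ in length.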

### 2.3 Hybrid Retrieval Pipeline

#### 2.3.1 Preprocess the query (clean, lowercase, remove stopwords)

In [None]:
import re
from nltk.corpus import stopwords
import nltk

# Download stopwords if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError:  # Raised by nltk.data.find when the resource is missing
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_query(query):
    """
    Cleans, lowercases, and removes stopwords from a query.

    Args:
        query (str): The input query string.

    Returns:
        str: The preprocessed query string.
    """
    # Convert to lowercase
    query = query.lower()
    # Remove special characters and punctuation
    query = re.sub(r'[^a-z0-9\s]', '', query)
    # Remove stopwords
    query = ' '.join([word for word in query.split() if word not in stop_words])
    return query
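
The sketch below shows the effect of these three steps on the example query used later in the notebook. To keep it self-contained it uses a small hard-coded stopword set instead of NLTK's English list, so `STOP_WORDS` is an assumption for illustration only.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's full English list instead
STOP_WORDS = {"what", "was", "the", "in", "for", "of", "a", "an"}

def preprocess_query(query, stop_words=STOP_WORDS):
    """Lowercase, strip everything except letters, digits and whitespace, drop stopwords."""
    query = query.lower()
    query = re.sub(r'[^a-z0-9\s]', '', query)
    return ' '.join(word for word in query.split() if word not in stop_words)

processed = preprocess_query("What was the total revenues in 2024 for GE Healthcare?")
print(processed)
```

Because every character outside `a-z`, `0-9`, and whitespace is deleted, a figure such as `10.5%` would become `105`, which is worth keeping in mind for queries that contain decimals.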


#### 2.3.2 Generate query embedding.

In [None]:
# Assuming 'preprocess_query' and 'embedding_model' are available from previous steps.

def generate_query_embedding(query, embedding_model):
    """
    Generates the embedding for a preprocessed query.

    Args:
        query (str): The preprocessed query string.
        embedding_model: The sentence embedding model.

    Returns:
        numpy.ndarray: The query embedding.
    """
    if embedding_model is None:
        print("Embedding model is not loaded. Cannot generate query embedding.")
        return None
    try:
        # Encode the query to get its embedding
        query_embedding = embedding_model.encode(query)
        print("Query embedding generated successfully.")
        return query_embedding
    except Exception as e:
        print(f"Error generating query embedding: {e}")
        return None

# Example Usage (assuming 'user_query' is defined):
# preprocessed_user_query = preprocess_query(user_query)
# query_embedding = generate_query_embedding(preprocessed_user_query, embedding_model)

# if query_embedding is not None:
#     print(f"Shape of query embedding: {query_embedding.shape}")

#### 2.3.3 Retrieve top-N chunks from:
- Dense retrieval (vector similarity).
- Sparse retrieval (TF-IDF keyword similarity).

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Assuming 'collection', 'embedding_model', 'tfidf_vectorizer', 'tfidf_matrix',
# and 'chunked_data' are available from previous steps.

def dense_retrieve(query, collection, embedding_model, n_results=5):
    """
    Retrieves top-N relevant chunks using dense vector similarity with ChromaDB.

    Args:
        query (str): The user query.
        collection (chromadb.Collection): The ChromaDB collection.
        embedding_model: The sentence embedding model.
        n_results (int): The number of results to retrieve.

    Returns:
        list: A list of dictionaries, where each dictionary contains the 'id',
              'content', and 'metadata' of a retrieved chunk. Returns an empty
              list if retrieval fails or no results are found.
    """
    if collection is None or embedding_model is None:
        print("ChromaDB collection or embedding model not loaded. Cannot perform dense retrieval.")
        return []

    try:
        # Generate embedding for the query
        query_embedding = embedding_model.encode([query]).tolist() # ChromaDB expects a list of lists

        # Query ChromaDB
        results = collection.query(
            query_embeddings=query_embedding,
            n_results=n_results,
            include=['documents', 'metadatas'] # Request documents (content) and metadatas
        )

        # Process the results
        retrieved_chunks = []
        if results and results['ids'] and results['documents'] and results['metadatas']:
            # Assuming the structure of results is as expected from ChromaDB query with include=['documents', 'metadatas']
            # results['ids'][0] is a list of ids for the first query (since we queried with a list of one embedding)
            # results['documents'][0] is a list of document contents for the first query
            # results['metadatas'][0] is a list of metadatas for the first query

            for i in range(len(results['ids'][0])):
                 retrieved_chunks.append({
                    'id': results['ids'][0][i],
                    'content': results['documents'][0][i],
                    'metadata': results['metadatas'][0][i]
                })

        print(f"Dense retrieval found {len(retrieved_chunks)} results.")
        return retrieved_chunks

    except Exception as e:
        print(f"Error during dense retrieval: {e}")
        return []


def sparse_retrieve_tfidf(query, tfidf_vectorizer, tfidf_matrix, chunks, n_results=5):
    """
    Retrieves top-N relevant chunks using sparse keyword similarity (TF-IDF).

    Args:
        query (str): The user query.
        tfidf_vectorizer (TfidfVectorizer): The fitted TF-IDF vectorizer.
        tfidf_matrix (sparse matrix): The TF-IDF matrix of the chunks.
        chunks (list): A list of chunk dictionaries (e.g., from embedded_chunks['chunks_100']['chunks']).
        n_results (int): The number of results to retrieve.

    Returns:
        list: A list of dictionaries, where each dictionary contains the 'id',
              'content', and 'metadata' of a retrieved chunk. Returns an empty
              list if retrieval fails or no results are found.
    """
    if tfidf_vectorizer is None or tfidf_matrix is None or not chunks:
        print("TF-IDF vectorizer, matrix, or chunks not available. Cannot perform sparse retrieval.")
        return []

    try:
        # Transform the query using the same TF-IDF vectorizer
        query_tfidf = tfidf_vectorizer.transform([query])

        # Calculate cosine similarity between the query TF-IDF and chunk TF-IDF matrix
        cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

        # Never ask for more results than there are chunks; argpartition raises otherwise
        n_results = min(n_results, len(chunks))

        # argpartition finds the top-N indices without sorting the whole array
        top_n_indices = np.argpartition(cosine_similarities, -n_results)[-n_results:]

        # Order the top-N indices by descending similarity score
        sorted_indices = top_n_indices[np.argsort(cosine_similarities[top_n_indices])[::-1]]

        # Retrieve the actual chunks based on the indices
        retrieved_chunks = []


        for idx in sorted_indices:
             # Explicitly cast idx to int just in case
             int_idx = int(idx)
             retrieved_chunks.append({
                'id': chunks[int_idx]['id'],
                'content': chunks[int_idx]['content'],
                'metadata': chunks[int_idx]['metadata']
            })


        print(f"Sparse retrieval found {len(retrieved_chunks)} results.")
        return retrieved_chunks

    except Exception as e:
        print(f"Error during sparse retrieval: {e}")
        return []

# Example Usage (assuming 'preprocessed_user_query' is defined):
# user_query = "What was the total revenues in 2024?"
# preprocessed_user_query = preprocess_query(user_query) # Make sure preprocess_query is run first

# dense_results = dense_retrieve(preprocessed_user_query, collection, embedding_model, n_results=5)
# print("\nDense Retrieval Results:")
# for chunk in dense_results:
#     print(f"- ID: {chunk['id']}, Content: {chunk['content'][:100]}...") # Print first 100 chars of content

# sparse_results = sparse_retrieve_tfidf(preprocessed_user_query, tfidf_vectorizer, tfidf_matrix, embedded_chunks['chunks_100']['chunks'], n_results=5)
# print("\nSparse Retrieval Results:")
# for chunk in sparse_results:
#      print(f"- ID: {chunk['id']}, Content: {chunk['content'][:100]}...") # Print first 100 chars of content

In [None]:
def combine_retrieval_results(dense_results, sparse_results):
    """
    Combines the results from dense and sparse retrieval.

    Args:
        dense_results (list): List of chunks from dense retrieval.
        sparse_results (list): List of chunks from sparse retrieval.

    Returns:
        list: A list of unique retrieved chunks.
    """
    combined_chunks = {}

    # Add dense retrieval results
    for chunk in dense_results:
        combined_chunks[chunk['id']] = chunk # Use chunk ID to handle potential duplicates

    # Add sparse retrieval results
    for chunk in sparse_results:
        combined_chunks[chunk['id']] = chunk # Overwrite if already exists (optional, depending on desired behavior)

    # Convert the dictionary values back to a list
    return list(combined_chunks.values())

# Example Usage (assuming 'dense_results' and 'sparse_results' are defined from previous steps):
# combined_results = combine_retrieval_results(dense_results, sparse_results)
# print(f"\nCombined Retrieval Results: Found {len(combined_results)} unique chunks.")
# for chunk in combined_results:
#      print(f"- ID: {chunk['id']}, Content: {chunk['content'][:100]}...") # Print first 100 chars of content

#### 2.3.4 Advanced RAG Technique: Hybrid Retrieval with Cross-Encoder Re-Ranking



In [None]:
# Define the number of initial candidates for broad retrieval
n_broad_dense = 10 # Retrieve more candidates from dense retrieval
n_broad_sparse = 10 # Retrieve more candidates from sparse retrieval

# Example User Query
user_query = "What was the total revenues in 2024 for GE Healthcare?"
# Preprocess the query
preprocessed_user_query = preprocess_query(user_query) # Make sure preprocess_query is run first

# Perform broad dense retrieval
broad_dense_results = dense_retrieve(preprocessed_user_query, collection, embedding_model, n_results=n_broad_dense)
print(f"\nBroad Dense Retrieval found {len(broad_dense_results)} candidates.")

# Perform broad sparse retrieval
# Assumes this is the same chunk list the TF-IDF matrix was built from
chunks_to_embed = embedded_chunks['chunks_100']['chunks']
broad_sparse_results = sparse_retrieve_tfidf(preprocessed_user_query, tfidf_vectorizer, tfidf_matrix, chunks_to_embed, n_results=n_broad_sparse)
print(f"Broad Sparse Retrieval found {len(broad_sparse_results)} candidates.")

# The next step will be to combine these broad results.

In [None]:
# Step 1: Combine broad retrieval results (already implemented in a previous cell)
# Assuming 'broad_dense_results' and 'broad_sparse_results' are available from the previous execution

combined_results = combine_retrieval_results(broad_dense_results, broad_sparse_results)
print(f"Combined retrieval results: Found {len(combined_results)} unique chunks.")

# Step 2: Load a cross-encoder model for reranking (requires the sentence-transformers library)

try:
    from sentence_transformers import CrossEncoder
    # ms-marco-MiniLM-L-6-v2 is a pre-trained, general-domain reranker; other cross-encoders can be substituted
    cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    print("Cross-encoder model 'cross-encoder/ms-marco-MiniLM-L-6-v2' loaded successfully.")
except Exception as e:
    print(f"Error loading cross-encoder model: {e}")
    cross_encoder_model = None # Set to None if loading fails
    print("Please ensure you have an active internet connection to download the model.")

### 2.5 Response generation

#### Guardrail function definition

In [None]:
FINANCIAL_KEYWORDS = [
    'capex', 'customers', 'balance sheet', 'change', 'unit', 'income', 'difference', 'products', 'forecast', 'fy', 'operations',
  'inventory', 'value', 'price', 'apbo', 'q3', 'year', 'ge healthcare', 'sales', 'backlog', 'margin', 'q4', 'growth', 'operating', 
  'cost', 'guidance', 'expense', 'opex', 'revenue', 'quarter', 'q2', 'q1', 
 'ebitda', 'profit', 'product', 'loss', 'segment', 'financial', 'ebit', 'cash', 'pbo', 'stockholders'
 ]

def is_relevantRAG(question):
    """Checks if the question contains any financial keywords."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

# Example Usage
print(f"'What is the value of sales in 2024?' is relevant: {is_relevantRAG('What is the value of sales in 2024?')}")
print(f"'What is the capital of France?' is relevant: {is_relevantRAG('What is the capital of France?')}")

In [None]:
# Step 3: Rerank the combined results using a cross-encoder model
# Assuming 'combined_results' and 'cross_encoder_model' are available from previous steps.
reranked_results = [] 
if cross_encoder_model is not None and combined_results:
    print("\nReranking combined results...")
    # Prepare sentence pairs for the cross-encoder: [query, document]
    sentence_pairs = [[preprocessed_user_query, chunk['content']] for chunk in combined_results]

    # Get scores from the cross-encoder
    try:
        reranking_scores = cross_encoder_model.predict(sentence_pairs)

        # Combine the original chunks with their reranking scores
        scored_results = []
        for i, chunk in enumerate(combined_results):
            scored_results.append({
                'chunk': chunk,
                'score': reranking_scores[i]
            })

        # Sort the results by reranking score in descending order
        reranked_results = sorted(scored_results, key=lambda x: x['score'], reverse=True)

        print(f"Finished reranking. Top score: {reranked_results[0]['score'] if reranked_results else 'N/A'}")

    except Exception as e:
        print(f"Error during reranking: {e}")
        reranked_results = [] # Set to empty list if reranking fails

else:
    print("\nSkipping reranking due to missing cross-encoder model or combined results.")
    reranked_results = [] # Set to empty list if prerequisites are not met


# Step 4: Select top-k chunks for response generation
k = 3  # Define the number of top chunks to use as context
top_k_chunks = [item['chunk'] for item in reranked_results[:k]]

print(f"\nSelected top {len(top_k_chunks)} chunks for response generation.")
for chunk in top_k_chunks:
    print(f"- ID: {chunk['id']}, Content: {chunk['content'][:150]}...")

# Step 5: Generate Answer using a small generative model (GPT-2 Small)
# Install transformers library if not already installed
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 Small model and tokenizer
try:
    model_name = "gpt2" # Using the base gpt2 model which is equivalent to gpt2-small
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    print(f"\nLoaded generative model: {model_name}")
except Exception as e:
    print(f"Error loading generative model {model_name}: {e}")
    tokenizer = None
    model = None

In [None]:
import time

def getResponseRag(user_query, rag_tokenizer=tokenizer, rag_model=model, k=3, max_new_tokens=100):
    """Answer a query with the RAG pipeline; returns (answer, confidence, inference_time, method).

    The tokenizer and model are bound as defaults at definition time, so later cells that
    reassign the global names `tokenizer` and `model` do not change the generator used here.
    """
    if not is_relevantRAG(user_query):
        return "Not applicable", 1.0, 0.0, "Guardrail (Irrelevant)"
    if rag_tokenizer is None or rag_model is None or cross_encoder_model is None:
        print("\nSkipping answer generation due to missing model, tokenizer, or reranker.")
        return "Not applicable", 0.0, 0.0, "Missing Model/Chunks"

    start_time = time.time()
    # Retrieve and rerank for this query instead of reusing the chunks of the example query
    query = preprocess_query(user_query)
    candidates = combine_retrieval_results(
        dense_retrieve(query, collection, embedding_model, n_results=n_broad_dense),
        sparse_retrieve_tfidf(query, tfidf_vectorizer, tfidf_matrix, chunks_to_embed, n_results=n_broad_sparse),
    )
    if not candidates:
        print("\nSkipping answer generation: no chunks were retrieved.")
        return "Not applicable", 0.0, 0.0, "Missing Model/Chunks"
    scores = cross_encoder_model.predict([[query, chunk['content']] for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:k]
    confidence = float(ranked[0][0])  # raw cross-encoder score of the best chunk, not a probability

    context = "\n".join(chunk['content'] for _, chunk in ranked)
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}\n\nAnswer:"
    # Leave room for the generated tokens inside the model's context window
    inputs = rag_tokenizer(prompt, max_length=rag_tokenizer.model_max_length - max_new_tokens,
                           truncation=True, return_tensors="pt")
    final_answer = "No answer found."
    try:
        output_sequences = rag_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            pad_token_id=rag_tokenizer.eos_token_id,
        )
        # Decode only the newly generated tokens so the prompt is not echoed into the answer
        new_tokens = output_sequences[0][inputs["input_ids"].shape[1]:]
        final_answer = rag_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    except Exception as e:
        print(f"Error during answer generation: {e}")
    inference_time = time.time() - start_time
    return final_answer, confidence, inference_time, 'RAG'

## 3 Fine-Tuning a Language Model for Financial Q&A with SFTTrainer 

This section of the notebook walks through fine-tuning a small, open-source language model to answer questions about the same financial dataset. We reuse the data and the generated questions from the RAG step: the model is trained on roughly 200 question/answer pairs generated in step 1.
We cover data preparation, model selection, baseline benchmarking, and evaluation.

This version uses the **SFTTrainer** from the TRL (Transformer Reinforcement Learning) library, which simplifies supervised fine-tuning on instruction-style datasets.

### 3.1.1 Setup and Dependencies ⚙️

- Install the necessary libraries. 
- We'll use `transformers` for the language model, `datasets` to handle our data, `torch` as the backend, and `trl` for the `SFTTrainer`.

In [None]:
%pip install -q transformers[torch] datasets pandas trl peft bitsandbytes

### 3.1.2. Q/A Dataset Preparation 📄

- Create a pandas DataFrame from generated question/answers.
- Create a Hugging Face `Dataset` object.
- Split Dataset into train and eval datasets

In [None]:
import pandas as pd
import io
import time
import torch

# Clean and parse the data
data = []
for q in generated_questions:
    data.append({"question": q['question'], "answer": q['answer']})

qna_df = pd.DataFrame(data)
print(qna_df.head())
print(f"\nTotal Q&A pairs: {len(qna_df)}")

In [None]:
from datasets import Dataset

# Convert to Hugging Face Dataset and split
full_dataset = Dataset.from_pandas(qna_df)
train_test_split = full_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

### 3.2 Model Selection and Baseline Benchmarking 📊

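- We use **gpt2** with an extractive question-answering head as the baseline, to see how the model performs *before* any fine-tuning and to quantify the improvement from fine-tuning. The `gpt2` checkpoint does not include weights for this head, so `AutoModelForQuestionAnswering` initializes it randomly and the baseline's spans and confidences are close to arbitrary.
- For fine-tuning, we also select **gpt2**, a decoder-only causal language model, and train it on instruction-style question/answer text.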
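
For intuition about how the baseline cell turns the head's outputs into an answer span and a confidence, here is a minimal, framework-free sketch of the same arithmetic: pick the most likely start and end positions and average their softmax probabilities. The logits are made-up values and `pick_span` is a hypothetical helper, not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_span(start_logits, end_logits):
    """Return (start, end, confidence) using independent argmaxes, as in the baseline cell."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    # Confidence is the mean of the two winning probabilities
    confidence = (start_probs[start] + end_probs[end]) / 2
    return start, end, confidence

# Illustrative logits for a four-token input
start, end, confidence = pick_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.2, 3.0, 0.1])
print(f"span=({start}, {end}), confidence={confidence:.3f}")
```

Because the two positions are chosen independently, `end` can come before `start`; in that case the token slice `start : end + 1` is empty and the decoded answer is an empty string.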

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

baseline_model_name = "gpt2"
baseline_tokenizer = AutoTokenizer.from_pretrained(baseline_model_name)
baseline_model = AutoModelForQuestionAnswering.from_pretrained(baseline_model_name)

#### 3.2.1 GPT2 Model Baseline Benchmarking

In [None]:
def get_baseline_model_answer(question, context):
    inputs = baseline_tokenizer(question, context, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        start_time = time.time()
        outputs = baseline_model(**inputs)
        inference_time = time.time() - start_time

    answer_start_index = torch.argmax(outputs.start_logits)
    answer_end_index = torch.argmax(outputs.end_logits)

    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    answer = baseline_tokenizer.decode(predict_answer_tokens)

    start_prob = torch.nn.functional.softmax(outputs.start_logits, dim=-1)[0, answer_start_index].item()
    end_prob = torch.nn.functional.softmax(outputs.end_logits, dim=-1)[0, answer_end_index].item()
    confidence = (start_prob + end_prob) / 2

    return answer, confidence, inference_time

# Create a single context from all answers for the baseline model
context = " ".join(qna_df['answer'].tolist())

test_questions = qna_df.sample(10, random_state=42)

print("--- Baseline Model Evaluation ---")
for _, row in test_questions.iterrows():
    question = row['question']
    real_answer = row['answer']
    model_answer, confidence, inference_time = get_baseline_model_answer(question, context)
    print(f"Q: {question}")
    print(f"Predicted A: {model_answer} (Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)")
    print(f"Real A: {real_answer}\n")

### 3.4. Fine-Tuning with SFTTrainer 🚀

- Now we'll fine-tune the gpt2 model on our Q&A dataset. The `SFTTrainer` handles the complexities of formatting, tokenizing, and training the model on our instruction-style data.

#### 3.4.1. Advanced Fine-Tuning Technique: Supervised Instruction Fine-Tuning

- We will provide a formatting function to `SFTTrainer` that structures our data as `"question: {question} answer: {answer}"`. This teaches the model to follow instructions and provide a direct answer.
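
A minimal sketch of that template, assuming the same `question: ... answer: ...` layout at training and inference time. It shows that the inference prompt is simply the training text cut off right after `answer:`, so the model's continuation is the answer. `format_example` and `build_prompt` are hypothetical names used only for illustration; the sample pair is taken from the evaluation questions later in the notebook.

```python
def format_example(example):
    """Training text: question and answer joined by the fixed template."""
    return f"question: {example['question']} answer: {example['answer']}"

def build_prompt(question):
    """Inference prompt: the same template, stopping right before the answer."""
    return f"question: {question} answer:"

sample = {
    "question": "Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income.",
    "answer": "The value of 'Sales of products' in 2023 was 12044 millions of dollars.",
}

training_text = format_example(sample)
prompt = build_prompt(sample["question"])

# The text the model is expected to produce after the prompt
expected_completion = training_text[len(prompt):].strip()
print(prompt)
print(expected_completion)
```

Keeping the two templates identical matters: a mismatch in casing, spacing, or the `answer:` marker between training and inference would present the model with a prefix it never saw during fine-tuning.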

### Why GPT-2 and SFTTrainer are a good combination

- GPT-2 is a small, openly available decoder-only transformer that can be fine-tuned for downstream tasks such as question answering. `SFTTrainer` is designed for supervised fine-tuning on instruction-style datasets and handles prompt formatting, tokenization, and the training loop. Together they let us build a specialized financial Q&A model with little custom training code.

In [None]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import torch

In [None]:
model_name = "gpt2"

# Raise the three dropout rates from GPT-2's default of 0.1 to 0.2 as extra regularization
cfg = AutoConfig.from_pretrained(model_name)
cfg.attn_pdrop = 0.2
cfg.embd_pdrop = 0.2
cfg.resid_pdrop = 0.2

In [None]:

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, config=cfg)

In [None]:
# Set padding token for GPT-2
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# SFTTrainer requires a formatting function to structure the data
def formatting_prompts_func(example):
    text = f"question: {example['question']} answer: {example['answer']}"
    return text

In [None]:
from transformers import EarlyStoppingCallback
early_stopping = EarlyStoppingCallback(early_stopping_patience=4)

In [None]:
lr_rate=2e-5
no_train_epochs=50 #100
weight_decay = 0.01
batch_size=4
eval_steps=20

In [None]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results_sft",
    num_train_epochs=no_train_epochs,
    learning_rate=lr_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=weight_decay,
    gradient_accumulation_steps=2,   # effective per-device train batch size = batch_size * 2
    warmup_ratio=0.1,                # warm up over the first 10% of training steps
    lr_scheduler_type="cosine",      # cosine decay after warmup
    max_grad_norm=1.0,               # gradient clipping
    eval_strategy="steps",
    eval_steps=eval_steps,
    save_strategy="steps",           # checkpoint on the evaluation schedule so the best
    save_steps=eval_steps,           # checkpoint can be tracked and restored at the end
    save_total_limit=3,
    logging_steps=eval_steps,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=torch.cuda.is_available(),  # use mixed precision if a GPU is available
    report_to='none'                 # disable external experiment loggers such as Weights & Biases
)

In [None]:
# Instantiate the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
    callbacks=[early_stopping], 
)

In [None]:
# Log hyperparameters
print("--- Fine-Tuning Hyperparameters ---")
print(f"Model: {model_name}")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Batch Size: {training_args.per_device_train_batch_size}")
print(f"Number of Epochs: {training_args.num_train_epochs}")
print(f"Compute Setup: {'GPU' if training_args.fp16 else 'CPU'}")

# Start fine-tuning
trainer.train()

In [None]:
logs = trainer.state.log_history
# Filter the logs to find entries with 'eval_loss'
eval_logs = [log for log in logs if 'eval_loss' in log]

# Print the evaluation loss from each entry
for log in eval_logs:
    print(f"Step {log['step']}: Evaluation Loss = {log['eval_loss']}")

### 3.5. Guardrail Implementation

- We will implement a simple input-side guardrail that checks if a question is relevant to the financial domain. This is done by looking for a list of predefined keywords. If a question is deemed irrelevant, the model will return a standard response instead of attempting to answer.
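
One caveat of keyword lookup by plain substring is that short keywords such as `unit` or `fy` also match inside unrelated words. The sketch below contrasts substring matching with a whole-word variant built on a regular expression. It is an illustrative alternative, not the notebook's guardrail: `KEYWORDS` is a shortened subset of the list, and `is_relevant_substring` and `is_relevant_whole_word` are hypothetical names.

```python
import re

# Shortened, illustrative subset of the keyword list
KEYWORDS = ['unit', 'fy', 'revenue', 'sales', 'balance sheet', 'cash']

# One alternation pattern with word boundaries around each keyword
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_relevant_substring(question):
    """Substring check: a keyword may match inside a longer word."""
    return any(k in question.lower() for k in KEYWORDS)

def is_relevant_whole_word(question):
    """Whole-word check: a keyword must appear as a standalone word or phrase."""
    return _PATTERN.search(question) is not None

for q in ["What were total sales in 2024?", "Is this a good opportunity?"]:
    print(q, is_relevant_substring(q), is_relevant_whole_word(q))
```

The second question passes the substring check only because "opportunity" contains "unit". The trade-off is that whole-word matching no longer accepts inflected forms such as "revenues" unless they are added to the list, so which variant fits better depends on how the questions are phrased.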

In [None]:
FINANCIAL_KEYWORDS = [
    'capex', 'customers', 'balance sheet', 'change', 'unit', 'income', 'difference', 'products', 'forecast', 'fy', 'operations',
  'inventory', 'value', 'price', 'apbo', 'q3', 'year', 'ge healthcare', 'sales', 'backlog', 'margin', 'q4', 'growth', 'operating', 
  'cost', 'guidance', 'expense', 'opex', 'revenue', 'quarter', 'q2', 'q1', 
 'ebitda', 'profit', 'product', 'loss', 'segment', 'financial', 'ebit', 'cash', 'pbo', 'stockholders'
 ]

def is_relevant(question):
    """Checks if the question contains any financial keywords."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

# Example Usage
print(f"'What is the value of sales in 2024?' is relevant: {is_relevant('What is the value of sales in 2024?')}")
print(f"'What is the capital of France?' is relevant: {is_relevant('What is the capital of France?')}")

### 3.6. Response generation for the fine-tuned model
- Set the model to evaluation/inference mode.
- Define a function that generates an answer and reports a confidence score and the inference time.
- Apply the guardrail before generation, then evaluate the model on the specified test questions.
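
The confidence reported for the fine-tuned model is the exponential of the mean token log-probability, which equals the geometric mean of the per-token probabilities. A small worked sketch with made-up probabilities; `sequence_confidence` is a hypothetical helper name, and the three probabilities are illustrative only.

```python
import math

def sequence_confidence(token_log_probs):
    """exp(mean(log p)): the geometric mean of the token probabilities."""
    if not token_log_probs:
        return 0.0
    return math.exp(sum(token_log_probs) / len(token_log_probs))

# Illustrative per-token probabilities of a three-token answer
token_probs = [0.9, 0.8, 0.5]
log_probs = [math.log(p) for p in token_probs]

confidence = sequence_confidence(log_probs)
geometric_mean = (0.9 * 0.8 * 0.5) ** (1 / 3)
arithmetic_mean = sum(token_probs) / len(token_probs)
print(f"confidence={confidence:.4f}, geometric={geometric_mean:.4f}, arithmetic={arithmetic_mean:.4f}")
```

The geometric mean never exceeds the arithmetic mean, so a single low-probability token pulls this score down more than a plain average would. It is also length-normalized, which keeps long and short answers roughly comparable.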

In [None]:
finetuned_model = trainer.model # Get the fine-tuned model from the trainer
finetuned_model.eval() # Set the model to evaluation mode
finetuned_model_tokenizer = tokenizer # Tokenizer loaded for fine-tuning; `trainer.tokenizer` is deprecated in recent transformers releases

In [None]:
def get_finetuned_answer(question):
    # --- Guardrail Check ---
    if not is_relevant(question):
        return "Not applicable", 1.0, 0.0, "Guardrail (Irrelevant)"

    # Format the input for the GPT-2 model
    prompt = f"question: {question} answer:"
     
    inputs = finetuned_model_tokenizer(prompt, return_tensors="pt").to(finetuned_model.device)

    start_time = time.time()
    outputs = finetuned_model.generate(
        **inputs,
        max_length=128 + inputs.input_ids.shape[1], # Increase max_length to include prompt
        return_dict_in_generate=True,
        output_scores=True # Keep output_scores to calculate confidence
    )
    inference_time = time.time() - start_time

    # Decode the generated answer
    generated_sequence = outputs.sequences[0]
    # Get the length of the input prompt's token IDs
    prompt_length = inputs.input_ids.shape[1]
    # Slice the generated sequence to get only the generated answer part
    answer_ids = generated_sequence[prompt_length:]
    decoded_answer = finetuned_model_tokenizer.decode(answer_ids, skip_special_tokens=True).strip()


    # Confidence: geometric mean of the generated tokens' probabilities.
    # compute_transition_scores returns the score of each generated token;
    # with normalize_logits=True these are log-probabilities.
    transition_scores = finetuned_model.compute_transition_scores(outputs.sequences, outputs.scores, normalize_logits=True)
    # Average the log-probabilities, then exponentiate to get a probability-like score
    avg_log_prob = transition_scores.mean().item()
    confidence = torch.exp(torch.tensor(avg_log_prob)).item()


    return decoded_answer, confidence, inference_time, "Fine-Tune"

## 4. Testing and Evaluation of SFT and RAG Implementation
- Prepare Test Questions for SFT and RAG 
    - Relevant, high-confidence: Clear fact in data.
    - Relevant, low-confidence: Ambiguous or sparse information.
    - Irrelevant: Example: "What is the capital of France?"

In [None]:
official_questions = [
    {
        "question": "What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income?",
        "type": "Relevant, high-confidence"
    },
    {
        "question": "What was the trend in net income?",
        "type": "Relevant, low-confidence (ambiguous)"
    },
    {
        "question": "What is the capital of France?",
        "type": "Irrelevant"
    }
]

### 4.1.1 Mandatory evaluation for SFT on Test Questions

In [None]:
print("--- Official Test Questions ---")
for q in official_questions:
    answer, confidence, inference_time, method = get_finetuned_answer(q['question'])
    print(f"Q: {q['question']} ({q['type']})")
    print(f"A: {answer}")
    print(f"Metrics: (Method: {method}, Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)\n")

### 4.1.2 Mandatory evaluation for RAG on Test Questions

In [None]:
print("--- Official Test Questions ---")
for q in official_questions:
    answer, confidence, inference_time, method = getResponseRag(q['question'])
    print(f"Q: {q['question']} ({q['type']})")
    print(f"A: {answer}")
    print(f"Metrics: (Method: {method}, Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)\n")

### 4.2 Extended Evaluation for both RAG and Finetuned Systems

In [None]:
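def createReport(model_answer, confidence, inference_time, method):
    """Append one evaluation row to the global `results` list.

    Reads the globals `question` and `real_answer` set by the evaluation loop. An answer
    counts as correct when every integer in the reference answer also appears in it.
    """
    numbers_in_real_answer = set(re.findall(r'-?\d+', real_answer))
    numbers_in_model_answer = set(re.findall(r'-?\d+', model_answer))
    correct = 'Y' if numbers_in_real_answer and numbers_in_real_answer.issubset(numbers_in_model_answer) else 'N'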

    if "not in data" in real_answer.lower() and method == "Guardrail (Irrelevant)":
        correct = 'Y'
        model_answer = "Not applicable"

    results.append({
        "Question": question,
        "Method": method,
        "Answer": model_answer,
        "Confidence": f"{confidence:.2f}",
        "Time (s)": f"{inference_time:.2f}",
        "Correct (Y/N)": correct
    })

In [None]:
from IPython.display import display
import re

extended_eval_questions = [
    {"question": "Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income.", "real_answer": "The value of 'Sales of products' in 2023 was 12044 millions of dollars."},
    {"question": "How much did the 'Net income' change from 2023 to 2024 based on the Statements of Operations / Income?", "real_answer": "The change in 'Net income' from 2023 to 2024 was -353 millions of dollars."},
    {"question": "What was the value of 'Comprehensive income attributable to GE HealthCare' in 2022?", "real_answer": "The value of 'Comprehensive income attributable to GE HealthCare' in 2022 was 2049 millions of dollars."},
    {"question": "Could you provide the figure for '455' in 2022 as reported in the Statements of Operations / Income?", "real_answer": "The value of '455' in 2022 was 1326 millions of dollars."},
    {"question": "What was the difference in 'Impact on PBO/APBO at December 31, 2023' between 2022 and 2023 according to the Statements of Operations / Income?", "real_answer": "The difference in 'Impact on PBO/APBO at December 31, 2023' between 2022 and 2023 was 278 millions of dollars."},
    {"question": "What was the value of 'Net income from continuing operations' in 2024?", "real_answer": "The value of 'Net income from continuing operations' in 2024 was 1618 millions of dollars."},
    {"question": "What were the values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022?", "real_answer": "The values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022 were 1568, 1916, and 2247 millions of dollars, respectively."},
    {"question": "What is the company's stock ticker?", "real_answer": "Not in data"},
    {"question": "What was the service cost in 2023?", "real_answer": "The value of 'Service cost – Operating' in 2023 was 23 millions of dollars."},
    {"question": "Who is the CEO of the company?", "real_answer": "Not in data"}
]

results = []
for item in extended_eval_questions:
    question = item['question']
    real_answer = item['real_answer']
    model_answer, confidence, inference_time, method = get_finetuned_answer(question)
    createReport(model_answer, confidence, inference_time, method)
    #model_answer, confidence, inference_time, method = getResponseRag(question)
    #createReport(model_answer, confidence, inference_time, method)


In [None]:
# Build the results table from the collected evaluation rows
results_df = pd.DataFrame(results)
with pd.option_context('display.max_colwidth', None):
    display(results_df)

### 5.1 Save the fine-tuned model for inferencing

In [None]:
# 'finetuned_model' is the fine-tuned model instance
output_dir = "../../model/gpt2-finetuned-model"

# Save the model weights and configuration
finetuned_model.save_pretrained(output_dir)

# Save the tokenizer's vocabulary and settings
finetuned_model_tokenizer.save_pretrained(output_dir)

### 5.2 Push the model to the Hugging Face Hub, as Git cannot store a 400+ MB file

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
repo_name = "gpt2-finetuned-model-v0.1"

# Push the model to the Hub
finetuned_model.push_to_hub(repo_name)

# Push the tokenizer to the Hub
finetuned_model_tokenizer.push_to_hub(repo_name)

## 6. Summary and Conclusion 

Based on the baseline and fine-tuned model evaluations, we can summarize the findings and draw conclusions about the effectiveness of fine-tuning GPT-2 with SFTTrainer on this financial Q&A dataset and the impact of the implemented guardrail.

Evaluation results:

*   **Baseline Model:** The baseline GPT-2 model, without fine-tuning on this specific dataset, performed poorly on the financial Q&A task, often providing irrelevant or incomplete answers with low confidence scores. This highlights the need for domain-specific fine-tuning.
*   **Fine-Tuned Model:** The fine-tuned GPT-2 model with SFTTrainer shows significant improvement. It is able to provide relevant answers to financial questions from the dataset with higher confidence scores. While not perfect (some answers may still contain inaccuracies or require further refinement), it demonstrates the effectiveness of supervised instruction fine-tuning for this task.
*   **Guardrail:** The implemented guardrail successfully identified and flagged irrelevant questions (e.g., "What is the company's stock ticker?" and "Who is the CEO of the company?"), returning a "Not applicable" response with high confidence. This is crucial for ensuring the model stays within its intended domain and doesn't provide misleading information for out-of-scope queries.

**Conclusion:**

Fine-tuning a pre-trained language model like GPT-2 on a domain-specific dataset using SFTTrainer is an effective approach for building a question-answering system for that domain. The addition of a simple guardrail significantly improves the system's robustness by handling irrelevant queries gracefully. Further improvements could involve expanding the training dataset, experimenting with different model architectures, or implementing more sophisticated guardrail mechanisms.