# Assignment 2 – Comparative Financial QA System: RAG vs Fine-Tuning

Group No 16

## Group Member Names:
1. | Anup Jindal        | 2023ac05472 |100%
2. | Yogesh Chaturvedi  | 2023ac05167 |100%
3. | HRISHIKESH MALAKAR | 2023Ac05058 |100%
4. | Anit Nair          | 2023ac05503 |100%
5. | DEBASISH ACHARYA   | 2023ac05417 |100%


### Objective
Develop and compare two systems for answering questions based on company financial statements (last two years):

- Retrieval-Augmented Generation (RAG) Chatbot: Combines document retrieval and generative response.
- Fine-Tuned Language Model (FT) Chatbot: Directly fine-tunes a small open-source language model on financial Q&A.

Use the same financial data for both methods and perform a detailed comparison on accuracy, speed, and robustness.

In [2]:
%pip install -r requirements.txt


In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
# imports
import zipfile
import os
from bs4 import BeautifulSoup
import math

## 1. Data Collection & Preprocessing
#### In this assignment we will use GE HealthCare's financial statements submitted to the US SEC. The raw data can be downloaded from the URL below.
- Financial statements of GE HealthCare, downloaded from the United States Securities and Exchange Commission: click [here](https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001932393&type=10-Q&dateb=&owner=include&count=40&search_text=) for the data source.
- The downloaded 'gehc-annual-report-2023-2024.zip' file is available under the data folder.

### 1.1 Extract the data and convert it to plain text (the source data is HTML files)
- Use BeautifulSoup to parse HTML and extract text
- After conversion, save the plain text files in the ./gehc_fin_plain_text folder

In [None]:
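# Repository layout (zip under the data folder):
# zip_file_path = '../../data/gehc-annual-report-2023-2024.zip'
# extracted_dir_path = '../../data/content/gehc_fin_extracted'

# Notebook-local layout (matches the recorded output and the next cell):
zip_file_path = './gehc-annual-report-2023-2024.zip'
extracted_dir_path = './gehc_fin_extracted'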


# Create the extraction directory if it doesn't exist
os.makedirs(extracted_dir_path, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_dir_path)

print(f"Extracted {zip_file_path} to {extracted_dir_path}")

Extracted ./gehc-annual-report-2023-2024.zip to ./gehc_fin_extracted


In [6]:
# extracted_dir_path = '../../data/content/gehc_fin_extracted'
# plain_text_dir_path = './gehc_fin_plain_text'

extracted_dir_path = './gehc_fin_extracted'
plain_text_dir_path = './gehc_fin_plain_text'

# Create the directory for plain text files if it doesn't exist
os.makedirs(plain_text_dir_path, exist_ok=True)

html_files = []
for root, _, files in os.walk(extracted_dir_path):
    for file in files:
        if file.endswith(".html") or file.endswith(".htm"):
            html_files.append(os.path.join(root, file))

print(f"Found {len(html_files)} HTML files.")

for html_file_path in html_files:
    try:
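        # Try reading with utf-8 first, then latin-1.
        # latin-1 never fails, but it likely maps Windows-1252 punctuation (curly quotes,
        # dashes) to control characters such as \x93, \x94 and \x96 seen in later outputs.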
        try:
            with open(html_file_path, 'r', encoding='utf-8') as f:
                html_content = f.read()
        except UnicodeDecodeError:
            with open(html_file_path, 'r', encoding='latin-1') as f:
                html_content = f.read()


        # Use BeautifulSoup to parse HTML and extract text
        soup = BeautifulSoup(html_content, 'html.parser')
        plain_text = soup.get_text(separator='\n')

        # Create a corresponding plain text file path
        relative_path = os.path.relpath(html_file_path, extracted_dir_path)
        plain_text_file_path = os.path.join(plain_text_dir_path, relative_path + ".txt")

        # Create directories for the plain text file if they don't exist
        os.makedirs(os.path.dirname(plain_text_file_path), exist_ok=True)

        with open(plain_text_file_path, 'w', encoding='utf-8') as f:
            f.write(plain_text)

        print(f"Converted {html_file_path} to plain text and saved to {plain_text_file_path}")

    except Exception as e:
        print(f"Error processing {html_file_path}: {e}")

print("Finished converting HTML files to plain text.")

Found 2 HTML files.
Converted ./gehc_fin_extracted/gehc-20231231.html to plain text and saved to ./gehc_fin_plain_text/gehc-20231231.html.txt
Converted ./gehc_fin_extracted/gehc-20241231.html to plain text and saved to ./gehc_fin_plain_text/gehc-20241231.html.txt
Finished converting HTML files to plain text.



### 1.2 Walk through each text file and load its contents into a list of strings.

In [7]:
plain_text_data = []
# Walk through the directory and read all .txt files
for root, _, files in os.walk(plain_text_dir_path):
    for file in files:
        if file.endswith(".txt"):
            file_path = os.path.join(root, file)
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    plain_text_data.append(f.read())
                print(f"Loaded {file_path}")
            except Exception as e:
                print(f"Error reading {file_path}: {e}")

print(f"Loaded {len(plain_text_data)} plain text files.")


Loaded ./gehc_fin_plain_text/gehc-20241231.html.txt
Loaded ./gehc_fin_plain_text/gehc-20231231.html.txt
Loaded 2 plain text files.


### 1.3 Clean text by removing noise like headers, footers, and page numbers.

In [8]:
import re

cleaned_text_data = []

# Function to clean text
def clean_text(text):
    # Remove common headers/footers (example patterns, adjust as needed)
    text = re.sub(r'\[\s*\d+\s*\]', '', text) # Remove numbers in brackets like [ 1 ]
    text = re.sub(r'Page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE) # Remove "Page X of Y"
    text = re.sub(r'Exhibit\s+\d+\.\d+', '', text, flags=re.IGNORECASE) # Remove "Exhibit X.Y"
    text = re.sub(r'\n\s*\n', '\n', text) # Remove excessive newlines

    return text

# Apply cleaning to each document
for text in plain_text_data:
    cleaned_text_data.append(clean_text(text))
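
To make the effect of these substitutions concrete, here is a minimal sketch that re-declares the same `clean_text` function and runs it on a made-up sample string (the sample is an assumption for illustration, not text from the filings).

```python
import re

def clean_text(text):
    # Same four substitutions as in the notebook cell
    text = re.sub(r'\[\s*\d+\s*\]', '', text)                               # "[ 1 ]"
    text = re.sub(r'Page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE)  # "Page X of Y"
    text = re.sub(r'Exhibit\s+\d+\.\d+', '', text, flags=re.IGNORECASE)     # "Exhibit X.Y"
    text = re.sub(r'\n\s*\n', '\n', text)                                   # blank lines
    return text

# Hypothetical sample, not taken from the GE HealthCare reports
sample = "Revenue grew [ 1 ] this year.\n\n\nPage 3 of 120\nSee Exhibit 10.1 for details."
cleaned = clean_text(sample)
print(repr(cleaned))
# → 'Revenue grew  this year.\nSee  for details.'
```

Two side effects are visible: removed tokens leave doubled spaces behind, and inline references such as "See Exhibit 10.1" lose the exhibit number. The doubled spaces are harmless for the word-based chunking used later, since `str.split()` collapses whitespace.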

### 1.4 Segment reports into logical sections (e.g., income statement, balance sheet).
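
The segmentation cell in this section relies on one regex idea: match a section heading, lazily capture everything after it, and stop at a lookahead for the next known heading (or the end of the text). The sketch below isolates that idea on a toy report. The helper name `extract_section` and the toy values are assumptions for illustration only.

```python
import re

def extract_section(text, heading, next_headings):
    """Return (section_body, end_offset), or (None, 0) if the heading is absent."""
    stop = "|".join(re.escape(h) for h in next_headings)
    # Stop before the next heading if one is given, otherwise at the end of the text
    lookahead = rf"(?=\n(?:{stop})|\Z)" if stop else r"(?=\Z)"
    pattern = rf"{re.escape(heading)}\s*\n(.*?){lookahead}"
    match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None, 0
    return match.group(1).strip(), match.end()

# Toy report with made-up numbers
toy_report = (
    "CONSOLIDATED STATEMENTS OF OPERATIONS\n"
    "Total revenues\n100\n"
    "Statements of Financial Position\n"
    "Total assets\n200\n"
)

income, end = extract_section(
    toy_report,
    "CONSOLIDATED STATEMENTS OF OPERATIONS",
    ["Statements of Financial Position", "cash flows"],
)
# Search for the next section only in the text after the previous match
balance, _ = extract_section(toy_report[end:], "Statements of Financial Position", ["cash flows"])
print(repr(income), repr(balance))
```

Because the match is case-insensitive and takes the first occurrence, a phrase such as "cash flows" appearing in running text before the actual statement would also be treated as a heading, so the captured sections are worth inspecting by eye.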

In [9]:
segmented_financial_statements = []

# Define the financial statement segments and their potential headings
# Using a dictionary to map a user-friendly name to a list of potential regex patterns.
financial_segment_patterns = {
    "Statements of Operations / Income": [
        r"CONSOLIDATED STATEMENTS OF OPERATIONS\s*\n(.*?)(?=\n(?:Statements of Financial Position|Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
        r"Statements of Income\s*\n(.*?)(?=\n(?:Statements of Financial Position|Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
    ],
    "Statements of Financial Position / Balance Sheet": [
        r"Statements of Financial Position\s*\n(.*?)(?=\n(?:Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
        r"balance sheet\s*\n(.*?)(?=\n(?:Statements of Comprehensive Income|Statements of Changes in Equity|cash flows)|\Z)",
    ],
    "Statements of Cash Flows": [
        r"cash flows\s*\n(.*?)(?=\Z)",
    ]
}

# Iterate through each cleaned document
for doc_text in cleaned_text_data:
    doc_segments = {}
    remaining_text = doc_text

    # Iterate through each financial segment and try to find its content using the defined patterns
    for segment_name, patterns in financial_segment_patterns.items():
        found_segment = False
        for pattern in patterns:
            match = re.search(pattern, remaining_text, re.DOTALL | re.IGNORECASE) # Use IGNORECASE for flexibility
            if match:
                doc_segments[segment_name] = match.group(1).strip()
                # Update remaining_text to be the part after the found segment if a match is found
                remaining_text = remaining_text[match.end():]
                found_segment = True
                break # Move to the next segment after finding a match

        if not found_segment:
            doc_segments[segment_name] = "Segment not found."  # Indicate if a segment is not found after trying all patterns


    segmented_financial_statements.append(doc_segments)

print(f"Segmented {len(segmented_financial_statements)} documents into financial statements.")

# You can inspect the first segmented financial statements to see the results
import json
print(json.dumps(segmented_financial_statements[0], indent=2))

Segmented 2 documents into financial statements.
{
  "Statements of Operations / Income": "For the years ended December 31\n2024\n2023\n2022\nSales of products\n$\n13,075\n$\n13,127\n$\n12,044\nSales of services\n6,597\n6,425\n6,297\nTotal revenues\n19,672\n19,552\n18,341\nCost of products\n8,271\n8,465\n7,975\nCost of services\n3,196\n3,165\n3,187\nGross profit\n8,205\n7,922\n7,179\nSelling, general, and administrative\n4,269\n4,282\n3,631\nResearch and development\n1,311\n1,205\n1,026\nTotal operating expenses\n5,580\n5,487\n4,657\nOperating income\n2,625\n2,435\n2,522\nInterest and other financial charges \u0096 net\n504\n542\n77\nNon-operating benefit (income) costs\n(406)\n(382)\n(5)\nOther (income) expense \u0096 net\n(55)\n(86)\n(62)\nIncome from continuing operations before income taxes\n2,581\n2,361\n2,512\nBenefit (provision) for income taxes\n(531)\n(743)\n(563)\nNet income from continuing operations\n2,050\n1,618\n1,949\nIncome (loss) from discontinued operations, net of ta

### 1.5 From the segmented_financial_statements, create a data structure as below to store the information:

```json
{
    document: number, // document id
    segment: string,  // financial segment like Operations, Financial Position / Balance Sheet, Comprehensive Income, Cash Flow
    line_item: string, // line item description
    2024: number, // value in each year
    2023: number,
    2022: number,
}
```

In [10]:
financial_data = []
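# Regex to find line items and their values for 2024, 2023 and 2022.
# Only rows where every value is preceded by a '$' sign match; negative values
# written in parentheses, e.g. (406), are not captured.
# Note: the year labels assume the column order of the FY2024 report; the FY2023
# report's columns would presumably be 2023, 2022 and 2021.
line_item_pattern = re.compile(
    r"^(.*?)\s+"  # Capture the line item description
    r"\$\s*([\d,]+)\s+"  # Capture the 2024 value
    r"\$\s*([\d,]+)\s+"  # Capture the 2023 value
    r"\$\s*([\d,]+)\s+", # Capture the 2022 value
    re.MULTILINE  # Let ^ match at the start of every line
)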

for i, doc_segments in enumerate(segmented_financial_statements):
    for segment_name, content in doc_segments.items():
        if content != "Segment not found.":
            # Find all matches in the content
            matches = line_item_pattern.finditer(content)
            for match in matches:
                line_item = match.group(1).strip()
                value_2024 = match.group(2).replace(',', '')
                value_2023 = match.group(3).replace(',', '')
                value_2022 = match.group(4).replace(',', '')
                # Add to our structured data list
                financial_data.append({
                    "document": i + 1,
                    "segment": segment_name,
                    "line_item": line_item,
                    "2024": int(value_2024),
                    "2023": int(value_2023),
                    "2022": int(value_2022)
                })

# Print the first 5 extracted key-value pairs
for item in financial_data[:5]:
    print(item)

{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Sales of products', '2024': 13075, '2023': 13127, '2022': 12044}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Net income attributable to GE HealthCare', '2024': 1993, '2023': 1568, '2022': 1916}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Imaging', '2024': 8855, '2023': 8944, '2022': 8395}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Total revenues', '2024': 19672, '2023': 19552, '2022': 18341}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'United States and Canada (\x93USCAN\x94)', '2024': 8981, '2023': 8551, '2022': 8130}
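
The output above contains only some of the line items. A minimal sketch of why: the same pattern is run here on a short excerpt of the income statement text shown in section 1.4, and only the row whose values carry `$` signs matches. The variable names `sample` and `rows` are introduced for illustration.

```python
import re

line_item_pattern = re.compile(
    r"^(.*?)\s+"          # line item description
    r"\$\s*([\d,]+)\s+"   # first value column
    r"\$\s*([\d,]+)\s+"   # second value column
    r"\$\s*([\d,]+)\s+",  # third value column
    re.MULTILINE,
)

# Excerpt in the same one-token-per-line layout as the extracted text
sample = (
    "Sales of products\n$\n13,075\n$\n13,127\n$\n12,044\n"
    "Sales of services\n6,597\n6,425\n6,297\n"
)

rows = [
    {
        "line_item": m.group(1).strip(),
        "2024": int(m.group(2).replace(",", "")),
        "2023": int(m.group(3).replace(",", "")),
        "2022": int(m.group(4).replace(",", "")),
    }
    for m in line_item_pattern.finditer(sample)
]
print(rows)
# → [{'line_item': 'Sales of products', '2024': 13075, '2023': 13127, '2022': 12044}]
```

"Sales of services" is skipped because its values have no `$` prefix, which is typical for all but the first and total rows of a statement. Making the `\$` optional would capture more rows, at the cost of more false positives.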


### 1.6 Formulate at least 50 question/answer (Q/A) pairs

In [11]:
num_question_pair = 50  # use only 50 Q/A pairs for both RAG and fine-tuning

In [12]:
def generateQuestions(financial_data, num_question_pair):
    """Build up to `num_question_pair` templated questions from the extracted line items."""
    generated_questions = []
    for item in financial_data:
        line_item = item["line_item"]
        segment = item["segment"]
        questions = [
            # Question type 1: value in a specific year
            f"What was the value of '{line_item}' in 2024 according to the {segment}?",
            f"Find the value for '{line_item}' in 2023 from the {segment}.",
            f"Could you provide the figure for '{line_item}' in 2022 as reported in the {segment}?",
            # Question type 2: change between two years
            f"How much did the '{line_item}' change from 2023 to 2024 based on the {segment}?",
            f"What was the difference in '{line_item}' between 2022 and 2023 according to the {segment}?",
            # Question type 3: values across multiple years
            f"What were the values for '{line_item}' for the years 2024, 2023, and 2022 in the {segment}?",
        ]
        for question in questions:
            generated_questions.append({"based_on_data_item": item, "question": question})
            # Stop as soon as the requested number of questions is reached
            if len(generated_questions) == num_question_pair:
                return generated_questions
    return generated_questions

In [13]:
generated_questions = generateQuestions(financial_data, num_question_pair)
# Print the first 10 generated questions to inspect
for q in generated_questions[:10]:
  print(q)

print(f"\nGenerated {len(generated_questions)} questions.")

{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Sales of products', '2024': 13075, '2023': 13127, '2022': 12044}, 'question': "What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income?"}
{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Sales of products', '2024': 13075, '2023': 13127, '2022': 12044}, 'question': "Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income."}
{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Sales of products', '2024': 13075, '2023': 13127, '2022': 12044}, 'question': "Could you provide the figure for 'Sales of products' in 2022 as reported in the Statements of Operations / Income?"}
{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Sales of products', '2024': 13075, '2023': 13

#### For each question formulated above, generate the answer.

In [55]:
# Generate an answer for every question from its underlying data item
def generate_answers(generated_questions):
    for q in generated_questions:
        item = q["based_on_data_item"]
        line_item = item["line_item"]
        value_2024 = item["2024"]
        value_2023 = item["2023"]
        value_2022 = item["2022"]
        question_text = q["question"]

        # Determine the type of question and extract/calculate the answer
        if "in 2024" in question_text and "2023, and 2022" not in question_text:
            answer = f"The value of '{line_item}' in 2024 was {value_2024} millions of dollars."
        elif "in 2023" in question_text and "2024, and 2022" not in question_text:
            answer = f"The value of '{line_item}' in 2023 was {value_2023} millions of dollars."
        elif "in 2022" in question_text and "2024, and 2023" not in question_text:
            answer = f"The value of '{line_item}' in 2022 was {value_2022} millions of dollars."
        elif "change from 2023 to 2024" in question_text:
            change = value_2024 - value_2023
            answer = f"The change in '{line_item}' from 2023 to 2024 was {change} millions of dollars."
        elif f"difference in '{line_item}' between 2022 and 2023" in question_text:
            difference = value_2023 - value_2022
            answer = f"The difference in '{line_item}' between 2022 and 2023 was {difference} millions of dollars."
        elif "for the years 2024, 2023, and 2022" in question_text:
            answer = f"The values for '{line_item}' for the years 2024, 2023, and 2022 were {value_2024}, {value_2023}, and {value_2022} millions of dollars, respectively."
        else:
            answer = "Could not determine the specific answer based on the question format."
        q['answer'] = answer
    return generated_questions

# Attach the answers, then print the first 10 question-answer pairs
generated_questions = generate_answers(generated_questions)
print("First 10 Generated Q/A Pairs:")
for q in generated_questions[:10]:
    print(f"Question: {q['question']}")
    print(f"Answer: {q['answer']}")
    print("-" * 20)

# Shallow copy of the Q/A pairs for later use
generated_questions_answer = generated_questions.copy()
# Print the total number of Q/A pairs generated
print(f"\nTotal Q/A pairs generated: {len(generated_questions)}")

First 10 Generated Q/A Pairs:
Question: What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income?
Answer: The value of 'Sales of products' in 2024 was 13075 millions of dollars.
--------------------
Question: Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income.
Answer: The value of 'Sales of products' in 2023 was 13127 millions of dollars.
--------------------
Question: Could you provide the figure for 'Sales of products' in 2022 as reported in the Statements of Operations / Income?
Answer: The value of 'Sales of products' in 2022 was 12044 millions of dollars.
--------------------
Question: How much did the 'Sales of products' change from 2023 to 2024 based on the Statements of Operations / Income?
Answer: The change in 'Sales of products' from 2023 to 2024 was -52 millions of dollars.
--------------------
Question: What was the difference in 'Sales of products' between 2022 and 2023 according to the Sta

## 2. Retrieval-Augmented Generation (RAG) System Implementation

### 2.1 Data Processing

 - Split the cleaned text into chunks suitable for retrieval, using at least two chunk sizes (e.g., 100 and 400 tokens; the code below counts whitespace-separated words as an approximation of tokens).
 - Assign unique IDs and metadata to chunks.


In [15]:
def chunk_text(text, chunk_size=100, overlap=20):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Define chunk sizes
chunk_sizes = [100, 400]
chunked_data = {}

# Process each cleaned document and create chunks of different sizes
for doc_id, cleaned_text in enumerate(cleaned_text_data):
    for size in chunk_sizes:
        chunks = chunk_text(cleaned_text, chunk_size=size)
        if f'chunks_{size}' not in chunked_data:
            chunked_data[f'chunks_{size}'] = []

        for i, chunk in enumerate(chunks):
            chunked_data[f'chunks_{size}'].append({
                'id': f'doc_{doc_id}_chunk_{i}_size_{size}',
                'content': chunk,
                'metadata': {
                    'document_id': doc_id,
                    'chunk_id': i,
                    'chunk_size': size
                }
            })

# Print some information about the generated chunks
for size, chunks in chunked_data.items():
    print(f"Generated {len(chunks)} chunks of size {size}.")
    if len(chunks) > 0:
        print(f"First chunk ({size}): {chunks[0]['content'][:200]}...") # Print first 200 characters of the first chunk

Generated 1997 chunks of size chunks_100.
First chunk (chunks_100): gehc-20241231 false 0001932393 FY 2024 http://www.gehealthcare.com/20241231#PropertyPlantAndEquipmentAndOperatingLeaseRightOfUseAssetAfterAccumulatedDepreciationAndAmortization http://www.gehealthcare...
Generated 421 chunks of size chunks_400.
First chunk (chunks_400): gehc-20241231 false 0001932393 FY 2024 http://www.gehealthcare.com/20241231#PropertyPlantAndEquipmentAndOperatingLeaseRightOfUseAssetAfterAccumulatedDepreciationAndAmortization http://www.gehealthcare...
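
The overlap logic is easier to see on a tiny input. The sketch below re-declares the same `chunk_text` function and runs it on ten placeholder words with a hypothetical `chunk_size=4` and `overlap=1`, so the window advances by 3 words each step.

```python
def chunk_text(text, chunk_size=100, overlap=20):
    words = text.split()
    chunks = []
    # The window start advances by (chunk_size - overlap) words per step
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

# Ten placeholder words: w0 ... w9
toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_text(toy_text, chunk_size=4, overlap=1)
for c in toy_chunks:
    print(c)
# → w0 w1 w2 w3
# → w3 w4 w5 w6
# → w6 w7 w8 w9
# → w9
```

Each chunk repeats the last word of the previous one, and the final chunk (`w9`) is already fully contained in its predecessor. With the notebook's settings the stride is 80 words for `chunk_size=100` and 380 words for `chunk_size=400`, since the default `overlap=20` is used for both sizes. Note that `overlap` must stay smaller than `chunk_size`, otherwise `range` receives a non-positive step.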


### 2.2 Embedding & Indexing

#### 2.2.1 Embed the chunks using the all-MiniLM-L6-v2

In [16]:
from sentence_transformers import SentenceTransformer

# Load the sentence embedding model
try:
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Sentence embedding model 'all-MiniLM-L6-v2' loaded successfully.")
except Exception as e:
    print(f"Error loading sentence embedding model: {e}")
    print("Please ensure you have an active internet connection to download the model.")
    embedding_model = None

# Embed chunks of size 100 and 400
embedded_chunks = {}  # defined up front so the checks below work even if the model failed to load
if embedding_model is not None:
    for size, chunks in chunked_data.items():
        print(f"Embedding {len(chunks)} chunks of size {size.split('_')[-1]}...")
        # Extract the content of the chunks to embed
        chunks_content = [chunk['content'] for chunk in chunks]

        # Generate embeddings
        try:
            embeddings = embedding_model.encode(chunks_content, show_progress_bar=True)
            embedded_chunks[size] = {
                'chunks': chunks,
                'embeddings': embeddings
            }
            print(f"Finished embedding chunks of size {size.split('_')[-1]}. Shape of embeddings: {embeddings.shape}")
        except Exception as e:
            print(f"Error during embedding for chunk size {size.split('_')[-1]}: {e}")
            embedded_chunks[size] = None

else:
    print("Embedding model not loaded, skipping embedding step.")

print("Inspect the shape of the embeddings for one chunk size, e.g., size 100")
if embedded_chunks.get('chunks_100') and embedded_chunks['chunks_100']['embeddings'] is not None:
    print(f"\nShape of embeddings for chunk size 100: {embedded_chunks['chunks_100']['embeddings'].shape}")
    print(f"Shape of embeddings for chunk size 400: {embedded_chunks['chunks_400']['embeddings'].shape}")

Sentence embedding model 'all-MiniLM-L6-v2' loaded successfully.
Embedding 1997 chunks of size 100...


Finished embedding chunks of size 100. Shape of embeddings: (1997, 384)
Embedding 421 chunks of size 400...


Finished embedding chunks of size 400. Shape of embeddings: (421, 384)
Inspect the shape of the embeddings for one chunk size, e.g., size 100

Shape of embeddings for chunk size 100: (1997, 384)
Shape of embeddings for chunk size 400: (421, 384)


#### 2.2.2 Build dense vector store to capture semantic relation using ChromaDB

In [17]:
import chromadb

# Initialize ChromaDB client
try:
    client = chromadb.Client()
    print("ChromaDB client initialized.")
except Exception as e:
    print(f"Error initializing ChromaDB client: {e}")
    client = None

# Create or get a collection for our chunks
collection_name = "financial_report_chunks"
try:
    collection = client.get_or_create_collection(name=collection_name)
    print(f"ChromaDB collection '{collection_name}' created or retrieved.")
except Exception as e:
    print(f"Error getting or creating ChromaDB collection: {e}")
    collection = None

# Add the embedded chunks to the collection
# We'll add the chunks from one of the sizes, for example, size 100, to the dense vector store.
if collection is not None and embedded_chunks.get('chunks_100') and embedded_chunks['chunks_100']['embeddings'] is not None:
    chunks_to_add = embedded_chunks['chunks_100']['chunks']
    embeddings_to_add = embedded_chunks['chunks_100']['embeddings']

    # Prepare data for ChromaDB
    ids = [chunk['id'] for chunk in chunks_to_add]
    documents = [chunk['content'] for chunk in chunks_to_add]
    metadatas = [chunk['metadata'] for chunk in chunks_to_add]


    # Add to ChromaDB in batches to avoid potential issues with large numbers of documents
    batch_size = 100  # Adjust batch size as needed
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i:i + batch_size]
        batch_documents = documents[i:i + batch_size]
        batch_embeddings = embeddings_to_add[i:i + batch_size]
        batch_metadatas = metadatas[i:i+ batch_size]

        try:
            collection.add(
                embeddings=batch_embeddings.tolist(),
                documents=batch_documents,
                metadatas=batch_metadatas,
                ids=batch_ids
            )
            print(f"Added batch {i//batch_size + 1} to ChromaDB.")
        except Exception as e:
            print(f"Error adding batch {i//batch_size + 1} to ChromaDB: {e}")

    print(f"Finished adding {len(ids)} chunks to ChromaDB collection '{collection_name}'.")

# Verify the count of items in the collection
if collection is not None:
    try:
        count = collection.count()
        print(f"Total items in ChromaDB collection '{collection_name}': {count}")
    except Exception as e:
        print(f"Error getting count from ChromaDB collection: {e}")

ChromaDB client initialized.
ChromaDB collection 'financial_report_chunks' created or retrieved.
Added batch 1 to ChromaDB.
Added batch 2 to ChromaDB.
Added batch 3 to ChromaDB.
Added batch 4 to ChromaDB.
Added batch 5 to ChromaDB.
Added batch 6 to ChromaDB.
Added batch 7 to ChromaDB.
Added batch 8 to ChromaDB.
Added batch 9 to ChromaDB.
Added batch 10 to ChromaDB.
Added batch 11 to ChromaDB.
Added batch 12 to ChromaDB.
Added batch 13 to ChromaDB.
Added batch 14 to ChromaDB.
Added batch 15 to ChromaDB.
Added batch 16 to ChromaDB.
Added batch 17 to ChromaDB.
Added batch 18 to ChromaDB.
Added batch 19 to ChromaDB.
Added batch 20 to ChromaDB.
Finished adding 1997 chunks to ChromaDB collection 'financial_report_chunks'.
Total items in ChromaDB collection 'financial_report_chunks': 1997


#### 2.2.3 Create sparse index (BM25 or TF-IDF) for keyword retrieval

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

# We will use the chunks of size 100 for building the TF-IDF index

# Extract the content of the chunks
if embedded_chunks.get('chunks_100'):  # also skips the case where embedding failed and the entry is None
    chunks_to_embed = embedded_chunks['chunks_100']['chunks']
    chunks_content = [chunk['content'] for chunk in chunks_to_embed]

    # Initialize TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

    # Fit the vectorizer to the chunk content and transform the chunks
    try:
        tfidf_matrix = tfidf_vectorizer.fit_transform(chunks_content)
        print("TF-IDF vectorizer fitted and matrix created successfully.")
        print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
    except Exception as e:
        print(f"Error creating TF-IDF matrix: {e}")
        tfidf_vectorizer = None # Set to None if fitting fails
        tfidf_matrix = None # Set to None if fitting fails

else:
    print("Chunks of size 100 not found in embedded_chunks. Cannot build TF-IDF index.")
    tfidf_vectorizer = None
    tfidf_matrix = None

TF-IDF vectorizer fitted and matrix created successfully.
Shape of TF-IDF matrix: (1997, 5000)
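
To show what the rows of such a matrix contain, here is a pure-Python sketch of TF-IDF weighting with the smoothed idf and L2 normalisation that scikit-learn applies by default. It is a simplified stand-in, not the `TfidfVectorizer` implementation: stop-word removal and the `max_features` cap are left out, and the function names and toy chunks are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased words of two or more characters, similar to scikit-learn's default token pattern
    return re.findall(r"\b\w\w+\b", text.lower())

def weigh(tokens, idf):
    # Raw term counts times idf, then L2 normalisation; terms unseen at fit time are dropped
    weights = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def fit_tfidf(docs):
    """Return (unit-length TF-IDF vectors as dicts, idf weights) for a list of strings."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [weigh(tokens, idf) for tokens in tokenized], idf

def cosine(u, v):
    # Both vectors are unit length, so the dot product equals the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Made-up chunks for illustration
toy_chunks = [
    "total revenues increased",
    "total assets decreased",
    "cash flows from operating activities",
]
chunk_vectors, idf = fit_tfidf(toy_chunks)
query_vector = weigh(tokenize("revenues"), idf)
scores = [cosine(query_vector, vec) for vec in chunk_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The term "total" appears in two chunks and therefore gets a lower idf than "revenues", which appears in one. Only the first chunk shares a term with the query, so it is the only one with a non-zero score.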


### 2.3 Hybrid Retrieval Pipeline

#### 2.3.1 Preprocess the query
Lowercase the query, strip punctuation and special characters, and remove stopwords.

In [19]:
import re
from nltk.corpus import stopwords
import nltk

# Download stopwords if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError:  # raised by nltk.data.find when the corpus is missing
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_query(query):
    # Convert to lowercase
    query = query.lower()
    # Remove special characters and punctuation
    query = re.sub(r'[^a-z0-9\s]', '', query)
    # Remove stopwords
    query = ' '.join([word for word in query.split() if word not in stop_words])
    return query


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
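
A quick sketch of what this preprocessing does to typical questions. To keep it runnable without the NLTK download, a small hand-picked stopword set (an assumption, a subset of NLTK's English list) is passed in through a hypothetical `stop_words` parameter.

```python
import re

# Assumption: tiny stand-in for NLTK's English stopword list
toy_stop_words = {"what", "was", "the", "of", "in", "for"}

def preprocess_query(query, stop_words=toy_stop_words):
    query = query.lower()                      # lowercase
    query = re.sub(r'[^a-z0-9\s]', '', query)  # drop punctuation and special characters
    return ' '.join(word for word in query.split() if word not in stop_words)

q1 = preprocess_query("What was the value of 'Total revenues' in 2024?")
q2 = preprocess_query("Net income of $1,993")
print(q1)  # → value total revenues 2024
print(q2)  # → net income 1993
```

Stripping punctuation also removes `$`, thousands separators and decimal points, so `1,993` becomes `1993` and a value such as `10.5` would become `105`. That is worth keeping in mind when a query contains numbers.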


#### 2.3.2 Generate query embedding.

In [20]:
def generate_query_embedding(query, embedding_model):
    if embedding_model is None:
        print("Embedding model is not loaded. Cannot generate query embedding.")
        return None
    try:
        # Encode the query to get its embedding
        query_embedding = embedding_model.encode(query)
        print("Query embedding generated successfully.")
        return query_embedding
    except Exception as e:
        print(f"Error generating query embedding: {e}")
        return None


#### 2.3.3 Retrieve top-N chunks from:
- Dense retrieval (vector similarity).
    - Retrieves top-N relevant chunks using dense vector similarity with ChromaDB
- Sparse retrieval (TF-IDF is used here in place of BM25).
    - Retrieves top-N relevant chunks using sparse keyword similarity (TF-IDF).
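
Both retrievers come down to scoring every chunk against the query and keeping the N best. A minimal sketch of that step, using cosine similarity (the measure applied to the TF-IDF vectors in this notebook) on made-up 3-dimensional vectors. The helper names and vectors are assumptions. For the dense case ChromaDB performs the ranking internally over 384-dimensional embeddings.

```python
import heapq
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)

def top_n(query_vec, chunk_vecs, n_results=5):
    """Return (chunk_id, score) pairs for the n most similar chunks, best first."""
    scored = ((cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items())
    # heapq.nlargest returns at most n items, already sorted by descending score
    return heapq.nlargest(n_results, scored, key=lambda pair: pair[1])

# Toy vectors, not real embeddings
toy_chunk_vectors = {
    "chunk_a": [1.0, 0.0, 1.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [1.0, 1.0, 0.0],
}
results = top_n([1.0, 0.0, 1.0], toy_chunk_vectors, n_results=2)
print([cid for cid, _ in results])
# → ['chunk_a', 'chunk_c']
```

`heapq.nlargest` returns results in score order and tolerates `n_results` larger than the number of chunks. NumPy's `argpartition`, by contrast, returns the top indices in no particular order, so a separate sort is needed when ranked output matters.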


In [21]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def dense_retrieve(query, collection, embedding_model, n_results=5):

    if collection is None or embedding_model is None:
        print("ChromaDB collection or embedding model not loaded. Cannot perform dense retrieval.")
        return []

    try:
        # Generate embedding for the query
        query_embedding = embedding_model.encode([query]).tolist() # ChromaDB expects a list of lists

        # Query ChromaDB
        results = collection.query(
            query_embeddings=query_embedding,
            n_results=n_results,
            include=['documents', 'metadatas'] # Request documents (content) and metadatas
        )

        # Process the results
        retrieved_chunks = []
        if results and results['ids'] and results['documents'] and results['metadatas']:
            for i in range(len(results['ids'][0])):
                 retrieved_chunks.append({
                    'id': results['ids'][0][i],
                    'content': results['documents'][0][i],
                    'metadata': results['metadatas'][0][i]
                })

        print(f"Dense retrieval found {len(retrieved_chunks)} results.")
        return retrieved_chunks

    except Exception as e:
        print(f"Error during dense retrieval: {e}")
        return []


def sparse_retrieve_tfidf(query, tfidf_vectorizer, tfidf_matrix, chunks, n_results=5):
    if tfidf_vectorizer is None or tfidf_matrix is None or not chunks:
        print("TF-IDF vectorizer, matrix, or chunks not available. Cannot perform sparse retrieval.")
        return []

    try:
        # Transform the query using the same TF-IDF vectorizer
        query_tfidf = tfidf_vectorizer.transform([query])

        # Calculate cosine similarity between the query TF-IDF and chunk TF-IDF matrix
        cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

        # Get the indices of the top-N most similar chunks.
        # Clamp N to the number of scored chunks: np.argpartition raises a
        # ValueError when asked for more elements than the array holds.
        n_top = min(n_results, cosine_similarities.shape[0])
        # argpartition avoids a full sort when N is much smaller than the number
        # of chunks; the selected indices are ordered by similarity further down.
        top_n_indices = np.argpartition(cosine_similarities, -n_top)[-n_top:]

        # Filter out indices that might be out of bounds if n_results is larger than available chunks
        top_n_indices = top_n_indices[top_n_indices < len(chunks)]

        # Retrieve the actual chunks based on the indices
        retrieved_chunks = []
        # Sort by similarity score (optional, but good for presentation)
        # Sorting indices by similarity score in descending order before picking top-N
        sorted_indices = top_n_indices[np.argsort(cosine_similarities[top_n_indices])][::-1]


        for idx in sorted_indices:
             # Explicitly cast idx to int just in case
             int_idx = int(idx)
             retrieved_chunks.append({
                'id': chunks[int_idx]['id'],
                'content': chunks[int_idx]['content'],
                'metadata': chunks[int_idx]['metadata']
            })


        print(f"Sparse retrieval found {len(retrieved_chunks)} results.")
        return retrieved_chunks

    except Exception as e:
        print(f"Error during sparse retrieval: {e}")
        return []

#### Combines the results from dense and sparse retrieval.


In [22]:
def combine_retrieval_results(dense_results, sparse_results):
    combined_chunks = {}

    # Add dense retrieval results
    for chunk in dense_results:
        combined_chunks[chunk['id']] = chunk
    # Add sparse retrieval results
    for chunk in sparse_results:
        combined_chunks[chunk['id']] = chunk

    # Convert the dictionary values back to a list
    return list(combined_chunks.values())
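
To make the merge behaviour concrete, here is a minimal sketch with made-up records (the `doc_x_chunk_*` IDs and the `source` field are hypothetical and exist only for this illustration). It repeats the same merge logic and shows three things: a chunk returned by both retrievers is kept once, the sparse copy overwrites the dense copy, and the output order follows first insertion rather than relevance.

```python
def combine_retrieval_results(dense_results, sparse_results):
    """Union of both result lists, keyed by chunk ID (same logic as the cell above)."""
    combined_chunks = {}
    for chunk in dense_results:
        combined_chunks[chunk['id']] = chunk
    for chunk in sparse_results:
        # A chunk already seen in the dense results is overwritten, not duplicated
        combined_chunks[chunk['id']] = chunk
    return list(combined_chunks.values())


# Hypothetical toy results: chunk_2 is returned by both retrievers
dense = [
    {'id': 'doc_x_chunk_1', 'content': 'revenue table', 'source': 'dense'},
    {'id': 'doc_x_chunk_2', 'content': 'net income', 'source': 'dense'},
]
sparse = [
    {'id': 'doc_x_chunk_2', 'content': 'net income', 'source': 'sparse'},
    {'id': 'doc_x_chunk_3', 'content': 'segment notes', 'source': 'sparse'},
]

merged = combine_retrieval_results(dense, sparse)
for chunk in merged:
    print(chunk['id'], chunk['source'])
```

Because the merged list carries no relevance ordering, a downstream scoring step is needed before picking the top-k chunks. This also explains why 10 dense plus 10 sparse candidates can yield fewer than 20 unique chunks.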

#### 2.3.4 Advanced RAG Technique: Re-Ranking with a Cross-Encoder



In [23]:
# Define the number of initial candidates for broad retrieval
n_broad_dense = 10 # Retrieve more candidates from dense retrieval
n_broad_sparse = 10 # Retrieve more candidates from sparse retrieval

# Example User Query
user_query = "What was the total revenues in 2024 for GE Healthcare?"
# Preprocess the query
preprocessed_user_query = preprocess_query(user_query) # Make sure preprocess_query is run first

# Perform broad dense retrieval
broad_dense_results = dense_retrieve(preprocessed_user_query, collection, embedding_model, n_results=n_broad_dense)
print(f"\nBroad Dense Retrieval found {len(broad_dense_results)} candidates.")

# Perform broad sparse retrieval
# The chunk list passed here must be the same one the TF-IDF matrix was built from
# (assumed to be the size-100 chunks stored in embedded_chunks)
chunks_to_embed = embedded_chunks['chunks_100']['chunks']
broad_sparse_results = sparse_retrieve_tfidf(preprocessed_user_query, tfidf_vectorizer, tfidf_matrix, chunks_to_embed, n_results=n_broad_sparse)
print(f"Broad Sparse Retrieval found {len(broad_sparse_results)} candidates.")


Dense retrieval found 10 results.

Broad Dense Retrieval found 10 candidates.
Sparse retrieval found 10 results.
Broad Sparse Retrieval found 10 candidates.


In [24]:
# Step 1: Combine broad retrieval result

combined_results = combine_retrieval_results(broad_dense_results, broad_sparse_results)
print(f"Combined retrieval results: Found {len(combined_results)} unique chunks.")

# Step 2: Load a cross-encoder model for reranking

try:
    from sentence_transformers import CrossEncoder
    # Load a pre-trained cross-encoder model suitable for reranking
    cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    print("Cross-encoder model 'cross-encoder/ms-marco-MiniLM-L-6-v2' loaded successfully.")
except Exception as e:
    print(f"Error loading cross-encoder model: {e}")
    cross_encoder_model = None # Set to None if loading fails
    print("Please ensure you have an active internet connection to download the model.")

Combined retrieval results: Found 19 unique chunks.



Cross-encoder model 'cross-encoder/ms-marco-MiniLM-L-6-v2' loaded successfully.


### 2.5 Response generation

#### Guardrail function definition

In [25]:
FINANCIAL_KEYWORDS = [
    'capex', 'customers', 'balance sheet', 'change', 'unit', 'income', 'difference', 'products', 'forecast', 'fy', 'operations',
  'inventory', 'value', 'price', 'apbo', 'q3', 'year', 'ge healthcare', 'sales', 'backlog', 'margin', 'q4', 'growth', 'operating',
  'cost', 'guidance', 'expense', 'opex', 'revenue', 'quarter', 'q2', 'q1',
 'ebitda', 'profit', 'product', 'loss', 'segment', 'financial', 'ebit', 'cash', 'pbo', 'stockholders'
 ]

def is_relevantRAG(question):
    """Checks if the question contains any financial keywords."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

# Example Usage
print(f"'What is the value of sales in 2024?' is relevant: {is_relevantRAG('What is the value of sales in 2024?')}")
print(f"'What is the capital of France?' is relevant: {is_relevantRAG('What is the capital of France?')}")

'What is the value of sales in 2024?' is relevant: True
'What is the capital of France?' is relevant: False


In [26]:
# Step 3: Rerank the combined results using a cross-encoder model
reranked_results = []
if cross_encoder_model is not None and combined_results:
    print("\nReranking combined results...")
    # Prepare sentence pairs for the cross-encoder: [query, document]
    sentence_pairs = [[preprocessed_user_query, chunk['content']] for chunk in combined_results]

    # Get scores from the cross-encoder
    try:
        reranking_scores = cross_encoder_model.predict(sentence_pairs)

        # Combine the original chunks with their reranking scores
        scored_results = []
        for i, chunk in enumerate(combined_results):
            scored_results.append({
                'chunk': chunk,
                'score': reranking_scores[i]
            })

        # Sort the results by reranking score in descending order
        reranked_results = sorted(scored_results, key=lambda x: x['score'], reverse=True)

        print(f"Finished reranking. Top score: {reranked_results[0]['score'] if reranked_results else 'N/A'}")

    except Exception as e:
        print(f"Error during reranking: {e}")
        reranked_results = [] # Set to empty list if reranking fails

else:
    print("\nSkipping reranking due to missing cross-encoder model or combined results.")
    reranked_results = [] # Set to empty list if prerequisites are not met


# Step 4: Select top-k chunks for response generation
k = 3  # Define the number of top chunks to use as context
top_k_chunks = [item['chunk'] for item in reranked_results[:k]]

print(f"\nSelected top {k} chunks for response generation.")
for chunk in top_k_chunks:
    print(f"- ID: {chunk['id']}, Content: {chunk['content'][:150]}...")

# Step 5: Generate Answer using gpt2 model
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 Small and tokenizer
try:
    model_name = "gpt2" # Using the base gpt2 model
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    print(f"\nLoaded generative model: {model_name}")
except Exception as e:
    print(f"Error loading generative model {model_name}: {e}")
    tokenizer = None
    model = None


Reranking combined results...
Finished reranking. Top score: 8.195191383361816

Selected top 3 chunks for response generation.
- ID: doc_1_chunk_517_size_100, Content: operations 1,618 1,949 Income (loss) from discontinued operations, net of taxes (4) 18 Net income 1,614 1,967 Net (income) loss attributable to noncon...
- ID: doc_0_chunk_445_size_100, Content: World revenues were $3,158 million, growing 5% or $162 million due to growth in PDx, Imaging, and AVS revenues, partially offset by unfavorable foreig...
- ID: doc_0_chunk_434_size_100, Content: 2,361 2,512 Benefit (provision) for income taxes (531) (743) (563) Net income from continuing operations 2,050 1,618 1,949 Income (loss) from disconti...




Loaded generative model: gpt2


In [84]:
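def getResponseRag(user_query):
    """Generate an answer with GPT-2 from the reranked context.

    Note: the context and the confidence come from the module-level
    `top_k_chunks` and `reranked_results` built in the cells above, not from a
    fresh retrieval for `user_query`. Re-run retrieval and reranking before
    asking a different question.
    """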
    if not is_relevantRAG(user_query):
        return "Not applicable", 1.0, 0.0, "Guardrail (Irrelevant)"
    elif tokenizer is not None and model is not None and top_k_chunks:
        import time
        start_time = time.time()
        context = "\n".join([chunk['content'] for chunk in top_k_chunks])
        prompt = f"Context:\n{context}\n\nQuestion: {user_query}\n\nAnswer:"
        max_model_input_length = tokenizer.model_max_length
        max_prompt_length = max_model_input_length - 100
        encoded_prompt = tokenizer.encode(prompt, max_length=max_prompt_length, truncation=True, return_tensors="pt")

        # Fix device mismatch: Ensure model and tensors are on the same device
        device = next(model.parameters()).device  # Get the device the model is on
        encoded_prompt = encoded_prompt.to(device)  # Move input tensors to the same device

        attention_mask = (encoded_prompt != tokenizer.pad_token_id).long() if tokenizer.pad_token_id is not None else None
        if attention_mask is not None:
            attention_mask = attention_mask.to(device)  # Move attention mask to the same device

        # Improved confidence calculation using cross-encoder score
        if reranked_results:
            # Get the raw cross-encoder score
            raw_score = float(reranked_results[0]['score'])
            # Normalize cross-encoder score to confidence range [0.0, 1.0]
            # Cross-encoder scores typically range from -10 to +10, with positive being more relevant
            # We'll map this to a confidence score using a more appropriate transformation
            if raw_score >= 0:
                # For positive scores, map [0, 10] -> [0.5, 0.95]
                confidence = 0.5 + (min(raw_score, 10.0) / 10.0) * 0.45
            else:
                # For negative scores, map [-10, 0] -> [0.05, 0.5]
                confidence = 0.05 + (max(raw_score, -10.0) + 10.0) / 10.0 * 0.45
        else:
            confidence = 0.0

        final_answer = "No answer found."
        method = 'RAG'
        max_length = 100
        try:
            output_sequences = model.generate(
                encoded_prompt,
                max_length=encoded_prompt.shape[1] + max_length,
                num_return_sequences=1,
                no_repeat_ngram_size=2,
                top_k=50,  # only takes effect with do_sample=True; decoding here is greedy
                pad_token_id=tokenizer.eos_token_id,
                attention_mask=attention_mask
            )
            generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
            answer_start = generated_text.find("Answer:")
            if answer_start != -1:
                final_answer = generated_text[answer_start + len("Answer:"):].strip()
            else:
                final_answer = generated_text.strip()

            # Limit to 50 words
            final_answer = ' '.join(final_answer.split()[:50])

        except Exception as e:
            print(f"Error during answer generation: {e}")
        inference_time = time.time() - start_time

        return final_answer, confidence, inference_time, method
    else:
        print("\nSkipping answer generation due to missing model, tokenizer, or chunks.")
        return "Not applicable", 0.0, 0.0, "Missing Model/Chunks"
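
The confidence heuristic inside `getResponseRag` can be isolated as a small function, which makes it easier to check on known scores. The sketch below is a restatement under assumptions: the function names are hypothetical, and the score range of roughly -10 to +10 is the rule of thumb stated in the code comments, not a guarantee of the cross-encoder. A logistic squashing is included only for comparison.

```python
import math


def score_to_confidence(raw_score: float) -> float:
    """Piecewise-linear map of a cross-encoder score to [0.05, 0.95].

    Scores are clipped to [-10, 10]; 0 maps to 0.5.
    """
    if raw_score >= 0:
        # [0, 10] -> [0.5, 0.95]
        return 0.5 + (min(raw_score, 10.0) / 10.0) * 0.45
    # [-10, 0) -> [0.05, 0.5)
    return 0.05 + (max(raw_score, -10.0) + 10.0) / 10.0 * 0.45


def sigmoid_confidence(raw_score: float) -> float:
    """Alternative for comparison: logistic squashing of the raw score."""
    return 1.0 / (1.0 + math.exp(-raw_score))


top_score = 8.195191383361816  # top reranking score printed earlier in this notebook
print(f"piecewise-linear: {score_to_confidence(top_score):.4f}")
print(f"sigmoid:          {sigmoid_confidence(top_score):.4f}")
```

For the top score of about 8.195, the piecewise-linear map gives roughly 0.87, while the logistic version gives roughly 0.9997. The latter equals the RAG confidence printed in Section 4.1.2, so those recorded outputs may come from a different version of the confidence code than the one listed above. Either way, the value measures retrieval relevance for the top chunk, not the correctness of the generated answer.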

## 3 Fine-Tuning a Language Model for Financial Q&A with SFTTrainer

This section of the notebook walks through fine-tuning a small, open-source language model to answer questions based on the provided financial dataset. We reuse the same data and the questions generated in the RAG step. The model is trained on the question/answer pairs generated in Step 1 (50 pairs in the run shown here, split 40/10 into train and eval sets).
We cover data preparation, model selection, baseline benchmarking, and evaluation.

This version uses the **SFTTrainer** from the TRL (Transformer Reinforcement Learning) library, which simplifies supervised fine-tuning on instruction-style datasets.

### 3.1.1 Setup and Dependencies ⚙️

- Install the necessary libraries.
- We'll use `transformers` for the language model, `datasets` to handle our data, `torch` as the backend, and `trl` for the `SFTTrainer`.

In [28]:
%pip install -q transformers[torch] datasets pandas trl peft bitsandbytes


### 3.1.2. Q/A Dataset Preparation 📄

- Create a pandas DataFrame from generated question/answers.
- Create a Hugging Face `Dataset` object.
- Split Dataset into train and eval datasets

In [29]:
import pandas as pd
import io
import time
import torch

# Clean and parse the data
data = []
for q in generated_questions_answer:
    data.append({"question": q['question'], "answer": q['answer']})

qna_df = pd.DataFrame(data)
print(qna_df.head())
print(f"\nTotal Q&A pairs: {len(qna_df)}")

                                            question  \
0  What was the value of 'Sales of products' in 2...   
1  Find the value for 'Sales of products' in 2023...   
2  Could you provide the figure for 'Sales of pro...   
3  How much did the 'Sales of products' change fr...   
4  What was the difference in 'Sales of products'...   

                                              answer  
0  The value of 'Sales of products' in 2024 was 1...  
1  The value of 'Sales of products' in 2023 was 1...  
2  The value of 'Sales of products' in 2022 was 1...  
3  The change in 'Sales of products' from 2023 to...  
4  The difference in 'Sales of products' between ...  

Total Q&A pairs: 50


In [30]:
from datasets import Dataset

# Convert to Hugging Face Dataset and split
full_dataset = Dataset.from_pandas(qna_df)
train_test_split = full_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

### 3.2 Model Selection and Baseline Benchmarking 📊

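- We use **gpt2** with a question-answering head as the baseline, to see how the model performs *before* any fine-tuning. The QA head is newly initialized on top of the pretrained weights (see the warning below), so this baseline mainly provides a reference point for quantifying the improvement from fine-tuning.
- For fine-tuning, we also select **gpt2**, a decoder-only causal language model, loaded with a language-modeling head so that it can generate answers for our instruction-style prompts.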

In [31]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

baseline_model_name = "gpt2"
baseline_tokenizer = AutoTokenizer.from_pretrained(baseline_model_name)
baseline_model = AutoModelForQuestionAnswering.from_pretrained(baseline_model_name)

Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at gpt2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### 3.2.1 GPT2 Model Baseline Benchmarking

In [32]:
def get_baseline_model_answer(question, context):
    inputs = baseline_tokenizer(question, context, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        start_time = time.time()
        outputs = baseline_model(**inputs)
        inference_time = time.time() - start_time

    answer_start_index = torch.argmax(outputs.start_logits)
    answer_end_index = torch.argmax(outputs.end_logits)

    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    answer = baseline_tokenizer.decode(predict_answer_tokens)

    start_prob = torch.nn.functional.softmax(outputs.start_logits, dim=-1)[0, answer_start_index].item()
    end_prob = torch.nn.functional.softmax(outputs.end_logits, dim=-1)[0, answer_end_index].item()
    confidence = (start_prob + end_prob) / 2

    return answer, confidence, inference_time

# Create a single context from all answers for the baseline model
context = " ".join(qna_df['answer'].tolist())

test_questions = qna_df.sample(10, random_state=42)

print("--- Baseline Model Evaluation ---")
for _, row in test_questions.iterrows():
    question = row['question']
    real_answer = row['answer']
    model_answer, confidence, inference_time = get_baseline_model_answer(question, context)
    print(f"Q: {question}")
    print(f"Predicted A: {model_answer} (Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)")
    print(f"Predicted A=> (Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)")
    print(f"Real A: {real_answer}\n")

--- Baseline Model Evaluation ---
Q: Find the value for 'Imaging' in 2023 from the Statements of Operations / Income.
Predicted A:  values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022 were 1993, 1568, and 1916 millions of dollars, respectively. The value of 'Imaging' in 2024 was 8855 millions of dollars. The value of 'Imaging' in 2023 was 8944 millions of dollars. The value of 'Imaging' in 2022 was 8395 millions of dollars. The change in 'Imaging' from 20 (Confidence: 0.0180, Time: 1.5198s)
Predicted A=> (Confidence: 0.0180, Time: 1.5198s)
Real A: The value of 'Imaging' in 2023 was 8944 millions of dollars.

Q: How much did the 'Net income attributable to GE HealthCare' change from 2023 to 2024 based on the Statements of Operations / Income?
Predicted A:  millions of dollars, respectively. The value of 'Imaging' in 2024 was 8855 millions of dollars. The value of 'Imaging' in 2023 was 8944 millions of dollars. The value of 'Imaging' in 2022 was 8395 

### 3.4. Fine-Tuning with SFTTrainer 🚀

- Now we'll fine-tune the gpt2 model on our Q&A dataset. The `SFTTrainer` handles the complexities of formatting, tokenizing, and training the model on our instruction-style data.

#### 3.4.1. Advanced Fine-Tuning Technique: Supervised Instruction Fine-Tuning

- We will provide a formatting function to `SFTTrainer` that structures our data as `"question: {question} answer: {answer}"`. This teaches the model to follow instructions and provide a direct answer.

### Why GPT-2 and SFTTrainer are a good combination

- GPT-2 (here the 124M-parameter `gpt2` checkpoint) is small enough to fine-tune quickly on a single GPU and can be adapted to downstream tasks such as question answering. `SFTTrainer` is designed for supervised fine-tuning of transformer models on instruction-style datasets: it handles formatting, tokenization, and the training loop, which keeps the code for fine-tuning GPT-2 on our financial Q&A dataset short.

In [33]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import torch

In [34]:
model_name = "gpt2"

cfg = AutoConfig.from_pretrained("gpt2")
cfg.attn_pdrop = 0.2
cfg.embd_pdrop = 0.2
cfg.resid_pdrop = 0.2

In [35]:

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, config=cfg)

In [36]:
# Set padding token for GPT-2
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# SFTTrainer requires a formatting function to structure the data
def formatting_prompts_func(example):
    text = f"question: {example['question']} answer: {example['answer']}"
    return text

In [62]:
from transformers import EarlyStoppingCallback
early_stopping = EarlyStoppingCallback(early_stopping_patience=4)

In [38]:
lr_rate=2e-5
no_train_epochs=50 #100
weight_decay = 0.01
batch_size=4
eval_steps=20

In [39]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results_sft",
    num_train_epochs=no_train_epochs,
    eval_strategy="steps",
    learning_rate=lr_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=weight_decay,
    gradient_accumulation_steps=2,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    save_total_limit=3,
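    eval_steps=eval_steps,
    # Save a checkpoint at every evaluation. With the default save_steps=500 a short
    # run would likely never write one, leaving load_best_model_at_end nothing to restore.
    save_strategy="steps",
    save_steps=eval_steps,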
    logging_steps=eval_steps,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=torch.cuda.is_available(),
    report_to='none'
)

In [65]:
# Instantiate the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
    callbacks=[early_stopping]
)

Applying formatting function to train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/40 [00:00<?, ? examples/s]

Applying formatting function to eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

In [66]:
# Log hyperparameters
print("--- Fine-Tuning Hyperparameters ---")
print(f"Model: {model_name}")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Batch Size: {training_args.per_device_train_batch_size}")
print(f"Number of Epochs: {training_args.num_train_epochs}")
print(f"Compute Setup: {'GPU' if training_args.fp16 else 'CPU'}")

# Start fine-tuning
trainer.train()

--- Fine-Tuning Hyperparameters ---
Model: gpt2
Learning Rate: 2e-05
Batch Size: 4
Number of Epochs: 50
Compute Setup: GPU


Step,Training Loss,Validation Loss
20,0.239,0.310136
40,0.2194,0.302125
60,0.1987,0.30037
80,0.1781,0.312712
100,0.1604,0.321349
120,0.1616,0.306337
140,0.1523,0.313321


TrainOutput(global_step=140, training_loss=0.18707729237420218, metrics={'train_runtime': 23.3786, 'train_samples_per_second': 85.548, 'train_steps_per_second': 10.694, 'total_flos': 35980729344000.0, 'train_loss': 0.18707729237420218})

In [67]:
logs = trainer.state.log_history
# Filter the logs to find entries with 'eval_loss'
eval_logs = [log for log in logs if 'eval_loss' in log]

# Print the evaluation loss from each entry
for log in eval_logs:
    print(f"Step {log['step']}: Evaluation Loss = {log['eval_loss']}")

Step 20: Evaluation Loss = 0.3101358413696289
Step 40: Evaluation Loss = 0.30212482810020447
Step 60: Evaluation Loss = 0.3003695607185364
Step 80: Evaluation Loss = 0.3127118945121765
Step 100: Evaluation Loss = 0.3213493227958679
Step 120: Evaluation Loss = 0.3063368499279022
Step 140: Evaluation Loss = 0.3133213222026825
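
The stop at step 140 follows from the training arguments and the early-stopping patience. The sketch below reproduces the arithmetic under stated assumptions: a single device, 40 training examples, and a simplified early-stopping rule in which an evaluation counts as an improvement only if the loss is strictly lower than the best so far. The helper names are hypothetical and this is not the `transformers` implementation.

```python
import math


def optimizer_steps_per_epoch(num_examples, batch_size, grad_accum_steps):
    """Number of optimizer updates per epoch (single device assumed)."""
    batches = math.ceil(num_examples / batch_size)
    return max(batches // grad_accum_steps, 1)


def early_stopping_step(eval_losses, eval_every, patience):
    """Return (stop_step, best_step); stop_step is None if patience is never exhausted."""
    best_loss = float('inf')
    best_step = None
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses, start=1):
        step = i * eval_every
        if loss < best_loss:
            best_loss, best_step = loss, step
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return step, best_step
    return None, best_step


steps_per_epoch = optimizer_steps_per_epoch(num_examples=40, batch_size=4, grad_accum_steps=2)
planned_steps = steps_per_epoch * 50  # num_train_epochs = 50

# Validation losses logged every 20 steps in the run above
losses = [0.310136, 0.302125, 0.30037, 0.312712, 0.321349, 0.306337, 0.313321]
stop_step, best_step = early_stopping_step(losses, eval_every=20, patience=4)

print(steps_per_epoch, planned_steps, stop_step, best_step)
```

With these inputs there are 5 optimizer steps per epoch and 250 planned steps. The best validation loss occurs at step 60, and the four following evaluations (steps 80 to 140) do not improve on it, so training stops at step 140, around epoch 28 of 50. The widening gap between training and validation loss after step 60 suggests the model starts to overfit the small dataset.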


### 3.5. Guardrail Implementation

- We will implement a simple input-side guardrail that checks if a question is relevant to the financial domain. This is done by looking for a list of predefined keywords. If a question is deemed irrelevant, the model will return a standard response instead of attempting to answer.

In [68]:
FINANCIAL_KEYWORDS = [
    'capex', 'customers', 'balance sheet', 'change', 'unit', 'income', 'difference', 'products', 'forecast', 'fy', 'operations',
  'inventory', 'value', 'price', 'apbo', 'q3', 'year', 'ge healthcare', 'sales', 'backlog', 'margin', 'q4', 'growth', 'operating',
  'cost', 'guidance', 'expense', 'opex', 'revenue', 'quarter', 'q2', 'q1',
 'ebitda', 'profit', 'product', 'loss', 'segment', 'financial', 'ebit', 'cash', 'pbo', 'stockholders'
 ]

def is_relevant(question):
    """Checks if the question contains any financial keywords."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

# Example Usage
print(f"'What is the value of sales in 2024?' is relevant: {is_relevant('What is the value of sales in 2024?')}")
print(f"'What is the capital of France?' is relevant: {is_relevant('What is the capital of France?')}")

'What is the value of sales in 2024?' is relevant: True
'What is the capital of France?' is relevant: False
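
One limitation of the substring check is that short keywords also match inside unrelated words, for example `unit` in "United" or `fy` in "identify". A possible refinement, sketched below as an assumption and not as part of the submitted system, matches whole words only and allows an optional plural `s`. The keyword list is a small subset chosen for illustration, and both function names are hypothetical.

```python
import re

# Subset of the keyword list, for illustration only
KEYWORDS = ['unit', 'fy', 'sales', 'year', 'balance sheet']


def is_relevant_substring(question):
    """Original behaviour: plain substring match."""
    return any(keyword in question.lower() for keyword in KEYWORDS)


# Longest keywords first so multi-word phrases are tried before shorter ones
_KEYWORD_PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True)) + r')s?\b'
)


def is_relevant_whole_word(question):
    """Stricter variant: keyword must appear as a whole word (optional plural 's')."""
    return _KEYWORD_PATTERN.search(question.lower()) is not None


for q in [
    "How many units were sold this year?",
    "Which countries are in the United Nations?",
    "Can you identify this bird?",
]:
    print(f"{q!r}: substring={is_relevant_substring(q)}, whole_word={is_relevant_whole_word(q)}")
```

The whole-word variant rejects the last two questions, which the substring check would let through. The trade-off is that inflected forms other than a simple plural are no longer matched, so the keyword list may need to be extended.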


### 3.6. Response generation for fine-tuned model
- Set the model to evaluation/inference mode.
- Now we'll test our fine-tuned model.
- We'll define a function to get predictions and then evaluate it on our specified test questions, including the guardrail logic.

In [69]:
finetuned_model = trainer.model # Get the fine-tuned model from the trainer
finetuned_model.eval() # Set the model to evaluation mode
finetuned_model_tokenizer = trainer.tokenizer

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


In [70]:
def get_finetuned_answer(question):
    # --- Guardrail Check ---
    if not is_relevant(question):
        return "Not applicable", 1.0, 0.0, "Guardrail (Irrelevant)"

    # Format the input for the GPT-2 model
    prompt = f"question: {question} answer:"

    inputs = finetuned_model_tokenizer(prompt, return_tensors="pt").to(finetuned_model.device)

    start_time = time.time()
    outputs = finetuned_model.generate(
        **inputs,
        max_length=128 + inputs.input_ids.shape[1], # Increase max_length to include prompt
        return_dict_in_generate=True,
        output_scores=True # Keep output_scores to calculate confidence
    )
    inference_time = time.time() - start_time

    # Decode the generated answer
    generated_sequence = outputs.sequences[0]
    # Get the length of the input prompt's token IDs
    prompt_length = inputs.input_ids.shape[1]
    # Slice the generated sequence to get only the generated answer part
    answer_ids = generated_sequence[prompt_length:]
    decoded_answer = finetuned_model_tokenizer.decode(answer_ids, skip_special_tokens=True).strip()


    # Calculate confidence score from the transition scores of the generated tokens
    # We calculate the average probability of the generated tokens
    # The scores are the logits of the next token predicted
    transition_scores = finetuned_model.compute_transition_scores(outputs.sequences, outputs.scores, normalize_logits=True)
    # Calculate the average log probability across generated tokens
    avg_log_prob = transition_scores.mean().item()
    # Exponentiate the average log probability to get a probability-like score
    confidence = torch.exp(torch.tensor(avg_log_prob)).item()


    return decoded_answer, confidence, inference_time, "Fine-Tune"
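
The confidence returned by `get_finetuned_answer` is the exponential of the mean token log-probability, which is the geometric mean of the per-token probabilities. The following is a minimal sketch in plain Python with made-up token probabilities (the numbers and the helper name are hypothetical) to show what that score does and does not capture.

```python
import math


def sequence_confidence(token_log_probs):
    """exp(mean log-probability) = geometric mean of the token probabilities."""
    return math.exp(sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities for a three-token answer
token_probs = [0.9, 0.8, 0.5]
token_log_probs = [math.log(p) for p in token_probs]

confidence = sequence_confidence(token_log_probs)
arithmetic_mean = sum(token_probs) / len(token_probs)

print(f"geometric mean:  {confidence:.4f}")  # about 0.711
print(f"arithmetic mean: {arithmetic_mean:.4f}")
```

The geometric mean is pulled down more strongly by a single uncertain token than the arithmetic mean is. It measures how fluent the model finds its own output, not whether the stated figure is correct, so a confidently generated wrong number can still score high.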

## 4. Testing and Evaluation of SFT and RAG Implementation
- Prepare Test Questions for SFT and RAG
    - Relevant, high-confidence: Clear fact in data.
    - Relevant, low-confidence: Ambiguous or sparse information.
    - Irrelevant: Example: "What is the capital of France?"

In [71]:
official_questions = [
    {
        "question": "What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income?",
        "type": "Relevant, high-confidence"
    },
    {
        "question": "What was the trend in net income?",
        "type": "Relevant, low-confidence (ambiguous)"
    },
    {
        "question": "What is the capital of France?",
        "type": "Irrelevant"
    }
]

### 4.1.1 Mandatory evaluation for SFT on Test Questions

In [72]:
print("--- Official Test Questions ---")
for q in official_questions:
    answer, confidence, inference_time, method = get_finetuned_answer(q['question'])
    print(f"Q: {q['question']} ({q['type']})")
    print(f"A: {answer}")
    print(f"Metrics: (Method: {method}, Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Official Test Questions ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Q: What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income? (Relevant, high-confidence)
A: The value of 'Sales of products' in 2024 was 19672 millions of dollars.
Metrics: (Method: Fine-Tune, Confidence: 0.9379, Time: 0.2420s)

Q: What was the trend in net income? (Relevant, low-confidence (ambiguous))
A: The change in 'Net income attributable to GE HealthCare' from 2023 to 2024 was -348 millions of dollars.
Metrics: (Method: Fine-Tune, Confidence: 0.9485, Time: 0.3524s)

Q: What is the capital of France? (Irrelevant)
A: Not applicable
Metrics: (Method: Guardrail (Irrelevant), Confidence: 1.0000, Time: 0.0000s)



### 4.1.2 Mandatory evaluation for RAG on Test Questions

In [73]:
print("--- Official Test Questions ---")
for q in official_questions:
    answer, confidence, inference_time, method = getResponseRag(q['question'])
    print(f"Q: {q['question']} ({q['type']})")
    print(f"A: {answer}")
    print(f"Metrics: (Method: {method}, Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)\n")

--- Official Test Questions ---
Q: What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income? (Relevant, high-confidence)
A: The value (USCAN)' in 'Products' was 1993 millions of dollars.
Metrics: (Method: RAG, Confidence: 0.9997, Time: 0.3311s)

Q: What was the trend in net income? (Relevant, low-confidence (ambiguous))
A: The change in 'Net income' from 2021 to 2024 was -348 millions of dollars.
Metrics: (Method: RAG, Confidence: 0.9997, Time: 0.2811s)

Q: What is the capital of France? (Irrelevant)
A: Not applicable
Metrics: (Method: Guardrail (Irrelevant), Confidence: 1.0000, Time: 0.0000s)



### 4.2 Extended Evaluation for both RAG and Finetuned Systems

In [74]:
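def createReport(model_answer, confidence, inference_time, method):
    """Append one evaluation row to the global `results` list.

    Relies on the globals `question`, `real_answer`, and `results` being set by
    the caller. An answer counts as correct when every number in the reference
    answer also appears in the model answer.
    """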
    numbers_in_real_answer = set(re.findall(r'-?\d+', real_answer))
    numbers_in_model_answer = set(re.findall(r'-?\d+', model_answer))
    correct = 'Y' if numbers_in_real_answer and numbers_in_real_answer.issubset(numbers_in_model_answer) else 'N'

    if "not in data" in real_answer.lower() and method == "Guardrail (Irrelevant)":
        correct = 'Y'
        model_answer = "Not applicable"

    results.append({
        "Question": question,
        "Method": method,
        "Answer": model_answer,
        "Confidence": f"{confidence:.2f}",
        "Time (s)": f"{inference_time:.2f}",
        "Correct (Y/N)": correct
    })
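
The correctness rule in `createReport` compares only the integers found in the two answers. The following is a minimal standalone sketch of that rule, useful for seeing what it accepts and rejects. The helper name `numbers_match` is illustrative and not part of the notebook; the example strings are taken from the evaluation questions and outputs in this section.

```python
import re

def numbers_match(real_answer, model_answer):
    """Return True when every integer in the reference appears in the model answer."""
    expected = set(re.findall(r'-?\d+', real_answer))
    found = set(re.findall(r'-?\d+', model_answer))
    # An empty reference set (no numbers at all) never counts as a match.
    return bool(expected) and expected.issubset(found)

reference = "The value of 'Total revenues' in 2024 was 19672 millions of dollars."
wrong_label = "The value of 'Net income' for the Years 2024, 2021, & 2022 was 19672 millions of dollars."
wrong_value = "The value of 'Total revenues' in 2024 was 1993 millions of dollars."

print(numbers_match(reference, reference))    # True
print(numbers_match(reference, wrong_label))  # True, although the line item differs
print(numbers_match(reference, wrong_value))  # False
```

Because years are matched like any other integer and labels are ignored, a "Y" from this rule means the expected figures are present, not that the answer names the right line item.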

In [85]:
from IPython.display import display
import re

extended_eval_questions = [
    {"question": "Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income.", "real_answer": "The value of 'Sales of products' in 2023 was 13127 millions of dollars."},
    {"question": "How much did the 'Net income attributable to GE HealthCare' change from 2023 to 2024 based on the Statements of Operations / Income?", "real_answer": "The change in 'Net income attributable to GE HealthCare' from 2023 to 2024 was 425 millions of dollars."},
    {"question": "What was the value of 'Imaging' in 2022 according to the Statements of Operations / Income?", "real_answer": "The value of 'Imaging' in 2022 was 8395 millions of dollars."},
    {"question": "Could you provide the figure for 'Total revenues' in 2024 as reported in the Statements of Operations / Income?", "real_answer": "The value of 'Total revenues' in 2024 was 19672 millions of dollars."},
    {"question": "What was the difference in 'United States and Canada (USCAN)' between 2022 and 2023 according to the Statements of Operations / Income?", "real_answer": "The difference in 'United States and Canada (USCAN)' between 2022 and 2023 was 421 millions of dollars."},
    {"question": "What was the value of 'EBIT*' in 2023 according to the Statements of Operations / Income?", "real_answer": "The value of 'EBIT*' in 2023 was 2521 millions of dollars."},
    {"question": "What were the values for 'Sales of products' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?", "real_answer": "The values for 'Sales of products' for the years 2024, 2023, and 2022 were 13075, 13127, and 12044 millions of dollars, respectively."},
    {"question": "What is the company's stock ticker?", "real_answer": "Not in data"},
    {"question": "What was the service cost in 2023?", "real_answer": "The value of 'Service cost – Operating' in 2023 was 23 millions of dollars."},
    {"question": "Who is the CEO of the company?", "real_answer": "Not in data"}
]

results = []
for item in extended_eval_questions:
    question = item['question']
    real_answer = item['real_answer']
    model_answer, confidence, inference_time, method = get_finetuned_answer(question)
    createReport(model_answer, confidence, inference_time, method)
    model_answer, confidence, inference_time, method = getResponseRag(question)
    createReport(model_answer, confidence, inference_time, method)

results_df = pd.DataFrame(results)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [86]:
with pd.option_context('display.max_colwidth', None):
  display(results_df)

| | Question | Method | Answer | Confidence | Time (s) | Correct (Y/N) |
|---|---|---|---|---|---|---|
| 0 | Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income. | Fine-Tune | The value of 'Sales of products' in 2023 was 1568 millions of dollars. | 0.89 | 0.4 | N |
| 1 | Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income. | RAG | The value of 'Products' was 19672 millions of dollars. | 0.87 | 0.49 | N |
| 2 | How much did the 'Net income attributable to GE HealthCare' change from 2023 to 2024 based on the Statements of Operations / Income? | Fine-Tune | The change in 'Net income attributable to GE HealthCare' from 2023 to 2024 was -348 millions of dollars. | 0.98 | 1.65 | N |
| 3 | How much did the 'Net income attributable to GE HealthCare' change from 2023 to 2024 based on the Statements of Operations / Income? | RAG | The change in 'Total revenues' from 2024 to 2021 was -348 millions of dollars. | 0.87 | 1.35 | N |
| 4 | What was the value of 'Imaging' in 2022 according to the Statements of Operations / Income? | Fine-Tune | The value of 'Imaging' in 2022 was 8130 millions of dollars. | 0.88 | 0.64 | N |
| 5 | What was the value of 'Imaging' in 2022 according to the Statements of Operations / Income? | RAG | The value (US$) of the 'Total revenues' was 1993 millions of dollars. | 0.87 | 0.48 | N |
| 6 | Could you provide the figure for 'Total revenues' in 2024 as reported in the Statements of Operations / Income? | Fine-Tune | The value of 'Total revenues' in 2024 was 19672 millions of dollars. | 0.97 | 0.37 | Y |
| 7 | Could you provide the figure for 'Total revenues' in 2024 as reported in the Statements of Operations / Income? | RAG | The value of 'Net income' for the Years 2024, 2021, & 2022 was 19672 millions of dollars. | 0.87 | 0.55 | Y |
| 8 | What was the difference in 'United States and Canada (USCAN)' between 2022 and 2023 according to the Statements of Operations / Income? | Fine-Tune | The difference in 'United States and Canada (USCAN)' between 2022 and 2023 was -348 millions of dollars. | 0.96 | 0.51 | N |
| 9 | What was the difference in 'United States and Canada (USCAN)' between 2022 and 2023 according to the Statements of Operations / Income? | RAG | The difference was -348 millions of dollars. | 0.87 | 0.23 | N |


### 5.1 Save the fine-tuned model for inference

In [52]:
# 'finetuned_model' is the fine-tuned model instance
output_dir = "../../model/gpt2-finetuned-model"

# Save the model weights and configuration
finetuned_model.save_pretrained(output_dir)

# Save the tokenizer's vocabulary and settings
finetuned_model_tokenizer.save_pretrained(output_dir)

('../../model/gpt2-finetuned-model/tokenizer_config.json',
 '../../model/gpt2-finetuned-model/special_tokens_map.json',
 '../../model/gpt2-finetuned-model/vocab.json',
 '../../model/gpt2-finetuned-model/merges.txt',
 '../../model/gpt2-finetuned-model/added_tokens.json',
 '../../model/gpt2-finetuned-model/tokenizer.json')

### 5.2 Push the model to Hugging Face, as Git cannot store a 400+ MB file

In [87]:
from huggingface_hub import notebook_login

notebook_login()


In [88]:
repo_name = "gpt2-finetuned-model-v0.1"

# Push the model to the Hub
finetuned_model.push_to_hub(repo_name)

# Push the tokenizer to the Hub
finetuned_model_tokenizer.push_to_hub(repo_name)

README.md: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/Anup77Jindal/gpt2-finetuned-model-v0.1/commit/de127700c18236df44fe815176b6757f28f6cff7', commit_message='Upload tokenizer', commit_description='', oid='de127700c18236df44fe815176b6757f28f6cff7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Anup77Jindal/gpt2-finetuned-model-v0.1', endpoint='https://huggingface.co', repo_type='model', repo_id='Anup77Jindal/gpt2-finetuned-model-v0.1'), pr_revision=None, pr_num=None)

## 7. Summary and Conclusion 📝

This section summarizes the baseline, fine-tuned, and RAG evaluations and draws conclusions about fine-tuning GPT-2 with SFTTrainer on this financial Q&A dataset, the RAG pipeline, and the impact of the implemented guardrail.


### Evaluation results:


*   **Baseline Model:** The baseline GPT-2 model, without fine-tuning on this specific dataset, performed poorly on the financial Q&A task, often providing irrelevant or incomplete answers with low confidence scores. This highlights the need for domain-specific fine-tuning.
*   **Fine-Tuned Model:** The GPT-2 model fine-tuned with SFTTrainer improves markedly on the baseline: it produces well-formed, on-topic answers with high confidence scores (0.88 to 0.98 in the extended evaluation rows shown above). High confidence did not imply correctness, however: only one of the five displayed fine-tuned answers contained the reference figure, and two of them repeated the value -348. Supervised instruction fine-tuning therefore taught the answer format more reliably than the underlying numbers.
*   **RAG Model:** RAG grounds its answers in passages retrieved from the source data, so it can adapt to new information without retraining. In the displayed extended evaluation rows, however, its answers often named the wrong line item (for example, a 'Total revenues' figure in response to the 'Imaging' question), and only one of five contained the reference figure. Retrieval quality, not just generation, limits accuracy here. The retrieval and reranking steps can also add latency.
*   **Guardrail:** The implemented guardrail successfully identified and flagged irrelevant questions (e.g., "What is the company's stock ticker?" and "Who is the CEO of the company?"), returning a "Not applicable" response with high confidence. This is crucial for ensuring the model stays within its intended domain and doesn't provide misleading information for out-of-scope queries.
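
To illustrate the guardrail idea, here is a minimal sketch of a keyword-based scope check. It is an assumption for illustration only: the term list `FINANCE_TERMS` and the function `guardrail` are hypothetical, and the guardrail actually used in this notebook may apply a different rule. The returned tuple mirrors the `(answer, confidence, inference_time, method)` shape seen in the test output.

```python
import re

# Hypothetical in-scope vocabulary; a real guardrail would need a fuller list
# or a learned relevance score.
FINANCE_TERMS = {
    "revenue", "revenues", "income", "sales", "cost", "costs",
    "ebit", "imaging", "earnings", "cash", "assets", "liabilities",
}

def guardrail(question):
    """Return a canned response for out-of-scope questions, else None."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    if tokens & FINANCE_TERMS:
        return None  # in scope: pass the question on to the model
    return ("Not applicable", 1.0, 0.0, "Guardrail (Irrelevant)")

print(guardrail("What is the capital of France?"))
print(guardrail("What was the trend in net income?"))
```

A caller would run this check first and only invoke the fine-tuned model or the RAG pipeline when it returns `None`. A pure keyword rule is brittle (it misses paraphrases and can be fooled by incidental finance words), which is one reason to consider more sophisticated guardrails.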


### Comparison of Average Inference Speed and Accuracy


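- **Inference Speed:** Fine-tuned models are generally expected to be faster at inference because they generate answers directly, while RAG adds retrieval and reranking steps. The ten extended-evaluation rows displayed above do not show this clearly: the fine-tuned model was faster on two of the five questions, and its mean time was 0.71 s against 0.62 s for RAG.
- **Accuracy:** RAG is expected to be more accurate on fact-based questions because it grounds answers in retrieved context, while fine-tuned models can hallucinate figures when training data is limited. In the displayed rows, each method was marked correct on one of five questions, so this evaluation does not separate them. The correctness check also compares only the numbers in an answer: the RAG answer marked correct for 'Total revenues' in 2024 contains the right figure (19672) but attributes it to 'Net income'.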
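
Per-method averages can be computed directly from rows shaped like the entries of `results`. The sketch below uses only the standard library; the function name `summarize` is illustrative, and the `rows` list copies the method, time, and correctness values from the ten displayed extended-evaluation rows (the remaining rows are not shown in the output above, so they are not included).

```python
from collections import defaultdict

# (Method, Time (s), Correct (Y/N)) for displayed rows 0-9.
rows = [
    {"Method": "Fine-Tune", "Time (s)": "0.40", "Correct (Y/N)": "N"},
    {"Method": "RAG",       "Time (s)": "0.49", "Correct (Y/N)": "N"},
    {"Method": "Fine-Tune", "Time (s)": "1.65", "Correct (Y/N)": "N"},
    {"Method": "RAG",       "Time (s)": "1.35", "Correct (Y/N)": "N"},
    {"Method": "Fine-Tune", "Time (s)": "0.64", "Correct (Y/N)": "N"},
    {"Method": "RAG",       "Time (s)": "0.48", "Correct (Y/N)": "N"},
    {"Method": "Fine-Tune", "Time (s)": "0.37", "Correct (Y/N)": "Y"},
    {"Method": "RAG",       "Time (s)": "0.55", "Correct (Y/N)": "Y"},
    {"Method": "Fine-Tune", "Time (s)": "0.51", "Correct (Y/N)": "N"},
    {"Method": "RAG",       "Time (s)": "0.23", "Correct (Y/N)": "N"},
]

def summarize(rows):
    """Group rows by method and compute accuracy and mean inference time."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Method"]].append(row)
    summary = {}
    for method, items in grouped.items():
        n = len(items)
        summary[method] = {
            "n": n,
            "accuracy": sum(r["Correct (Y/N)"] == "Y" for r in items) / n,
            "mean_time_s": sum(float(r["Time (s)"]) for r in items) / n,
        }
    return summary

for method, stats in summarize(rows).items():
    print(f"{method}: accuracy={stats['accuracy']:.2f}, mean time={stats['mean_time_s']:.3f}s")
# Fine-Tune: accuracy=0.20, mean time=0.714s
# RAG: accuracy=0.20, mean time=0.620s
```

When the full `results` list is used, guardrail responses appear under their own method label ("Guardrail (Irrelevant)") and should be reported separately, since their near-zero latency would otherwise distort the averages.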


### Strengths of Each Approach


- **RAG Strengths:**
    - Adaptability to new data without retraining.
    - Factual grounding from source documents.
    - Robustness to irrelevant queries due to retrieval and guardrail logic.
- **Fine-Tuning Strengths:**
    - Fluency and natural language generation.
    - Efficiency in inference speed.
    - Can generalize well within the domain if trained on sufficient data.


### Robustness to Irrelevant Queries


Both approaches benefit from the guardrail logic, but RAG is inherently more robust due to its reliance on retrieval. Fine-tuned models may attempt to answer any question, but with a guardrail, they can gracefully handle out-of-domain queries.


### Practical Trade-Offs


- **RAG:** Best for scenarios where factual accuracy and adaptability to new data are critical, but with higher computational cost and slower inference.
- **Fine-Tuning:** Preferred for applications requiring fast, fluent responses within a well-defined domain, but may require frequent retraining to stay up-to-date.


**Conclusion:**


Fine-tuning a pre-trained language model such as GPT-2 on a domain-specific dataset with SFTTrainer produces a question-answering system that responds fluently in the target format, and a simple guardrail improves robustness by handling irrelevant queries gracefully. In principle, RAG offers stronger factual grounding and adaptability, while fine-tuning offers speed and fluency. In the extended evaluation rows shown above, however, both systems were marked correct on only one of five questions and their inference times were comparable, so neither advantage was clearly demonstrated. The choice between these approaches depends on the application's requirements for accuracy, speed, and domain coverage. Further improvements could involve expanding the training dataset, improving retrieval so that the correct line item and year are selected, experimenting with different model architectures, or implementing more sophisticated guardrail mechanisms.