# Fine-Tuning a Language Model for Financial Q&A with SFTTrainer

This notebook walks through fine-tuning a small, open-source language model to answer questions based on a provided financial dataset. It covers data preparation, model selection, baseline benchmarking, fine-tuning, a simple input guardrail, and evaluation.

This version uses the **SFTTrainer** from the TRL (Transformer Reinforcement Learning) library, which simplifies supervised fine-tuning on instruction-style datasets.

## 1. Setup and Dependencies ⚙️

First, we need to install the necessary libraries. We'll use `transformers` for the language model, `datasets` to handle our data, `torch` as the backend, and `trl` for the `SFTTrainer`.

In [40]:
!pip install -q transformers[torch] datasets pandas trl peft bitsandbytes

Access is denied.


## 2. Data Preparation 📄

We start from the contents of the provided `fin-qna.txt` file, embedded below as a pipe-delimited string, and convert them into a structured format. We'll create a pandas DataFrame and then a Hugging Face `Dataset` object.

In [41]:
import pandas as pd
import io
import time
import torch

file_content = """
Question|Answer
What was the value of 'Impact on PBO/APBO at December 31, 2023' in 2024 according to the Statements of Operations / Income?|The value of 'Impact on PBO/APBO at December 31, 2023' in 2024 was 940 millions of dollars.
Find the value for 'Impact on PBO/APBO at December 31, 2023' in 2023 from the Statements of Operations / Income.|The value of 'Impact on PBO/APBO at December 31, 2023' in 2023 was 318 millions of dollars.
Could you provide the figure for 'Impact on PBO/APBO at December 31, 2023' in 2022 as reported in the Statements of Operations / Income?|The value of 'Impact on PBO/APBO at December 31, 2023' in 2022 was 40 millions of dollars.
How much did the 'Impact on PBO/APBO at December 31, 2023' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Impact on PBO/APBO at December 31, 2023' from 2023 to 2024 was 622 millions of dollars.
What was the difference in 'Impact on PBO/APBO at December 31, 2023' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Impact on PBO/APBO at December 31, 2023' between 2022 and 2023 was 278 millions of dollars.
What were the values for 'Impact on PBO/APBO at December 31, 2023' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Impact on PBO/APBO at December 31, 2023' for the years 2024, 2023, and 2022 were 940, 318, and 40 millions of dollars, respectively.
What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income?|The value of 'Sales of products' in 2024 was 13127 millions of dollars.
Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income.|The value of 'Sales of products' in 2023 was 12044 millions of dollars.
Could you provide the figure for 'Sales of products' in 2022 as reported in the Statements of Operations / Income?|The value of 'Sales of products' in 2022 was 11165 millions of dollars.
How much did the 'Sales of products' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Sales of products' from 2023 to 2024 was 1083 millions of dollars.
What was the difference in 'Sales of products' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Sales of products' between 2022 and 2023 was 879 millions of dollars.
What were the values for 'Sales of products' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Sales of products' for the years 2024, 2023, and 2022 were 13127, 12044, and 11165 millions of dollars, respectively.
What was the value of 'Net income attributable to GE HealthCare common stockholders' in 2024 according to the Statements of Operations / Income?|The value of 'Net income attributable to GE HealthCare common stockholders' in 2024 was 1385 millions of dollars.
Find the value for 'Net income attributable to GE HealthCare common stockholders' in 2023 from the Statements of Operations / Income.|The value of 'Net income attributable to GE HealthCare common stockholders' in 2023 was 1916 millions of dollars.
Could you provide the figure for 'Net income attributable to GE HealthCare common stockholders' in 2022 as reported in the Statements of Operations / Income?|The value of 'Net income attributable to GE HealthCare common stockholders' in 2022 was 2247 millions of dollars.
How much did the 'Net income attributable to GE HealthCare common stockholders' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Net income attributable to GE HealthCare common stockholders' from 2023 to 2024 was -531 millions of dollars.
What was the difference in 'Net income attributable to GE HealthCare common stockholders' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Net income attributable to GE HealthCare common stockholders' between 2022 and 2023 was -331 millions of dollars.
What were the values for 'Net income attributable to GE HealthCare common stockholders' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Net income attributable to GE HealthCare common stockholders' for the years 2024, 2023, and 2022 were 1385, 1916, and 2247 millions of dollars, respectively.
What was the value of 'Net income attributable to GE HealthCare' in 2024 according to the Statements of Operations / Income?|The value of 'Net income attributable to GE HealthCare' in 2024 was 1568 millions of dollars.
Find the value for 'Net income attributable to GE HealthCare' in 2023 from the Statements of Operations / Income.|The value of 'Net income attributable to GE HealthCare' in 2023 was 1916 millions of dollars.
Could you provide the figure for 'Net income attributable to GE HealthCare' in 2022 as reported in the Statements of Operations / Income?|The value of 'Net income attributable to GE HealthCare' in 2022 was 2247 millions of dollars.
How much did the 'Net income attributable to GE HealthCare' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Net income attributable to GE HealthCare' from 2023 to 2024 was -348 millions of dollars.
What was the difference in 'Net income attributable to GE HealthCare' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Net income attributable to GE HealthCare' between 2022 and 2023 was -331 millions of dollars.
What were the values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022 were 1568, 1916, and 2247 millions of dollars, respectively.
What was the value of 'Comprehensive income attributable to GE HealthCare' in 2024 according to the Statements of Operations / Income?|The value of 'Comprehensive income attributable to GE HealthCare' in 2024 was 755 millions of dollars.
Find the value for 'Comprehensive income attributable to GE HealthCare' in 2023 from the Statements of Operations / Income.|The value of 'Comprehensive income attributable to GE HealthCare' in 2023 was 1073 millions of dollars.
Could you provide the figure for 'Comprehensive income attributable to GE HealthCare' in 2022 as reported in the Statements of Operations / Income?|The value of 'Comprehensive income attributable to GE HealthCare' in 2022 was 2049 millions of dollars.
How much did the 'Comprehensive income attributable to GE HealthCare' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Comprehensive income attributable to GE HealthCare' from 2023 to 2024 was -318 millions of dollars.
What was the difference in 'Comprehensive income attributable to GE HealthCare' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Comprehensive income attributable to GE HealthCare' between 2022 and 2023 was -976 millions of dollars.
What were the values for 'Comprehensive income attributable to GE HealthCare' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Comprehensive income attributable to GE HealthCare' for the years 2024, 2023, and 2022 were 755, 1073, and 2049 millions of dollars, respectively.
What was the value of '455' in 2024 according to the Statements of Operations / Income?|The value of '455' in 2024 was 5 millions of dollars.
Find the value for '455' in 2023 from the Statements of Operations / Income.|The value of '455' in 2023 was 6493 millions of dollars.
Could you provide the figure for '455' in 2022 as reported in the Statements of Operations / Income?|The value of '455' in 2022 was 1326 millions of dollars.
How much did the '455' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in '455' from 2023 to 2024 was -6488 millions of dollars.
What was the difference in '455' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in '455' between 2022 and 2023 was 5167 millions of dollars.
What were the values for '455' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for '455' for the years 2024, 2023, and 2022 were 5, 6493, and 1326 millions of dollars, respectively.
What was the value of 'Net income' in 2024 according to the Statements of Operations / Income?|The value of 'Net income' in 2024 was 1614 millions of dollars.
Find the value for 'Net income' in 2023 from the Statements of Operations / Income.|The value of 'Net income' in 2023 was 1967 millions of dollars.
Could you provide the figure for 'Net income' in 2022 as reported in the Statements of Operations / Income?|The value of 'Net income' in 2022 was 2293 millions of dollars.
How much did the 'Net income' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Net income' from 2023 to 2024 was -353 millions of dollars.
What was the difference in 'Net income' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Net income' between 2022 and 2023 was -326 millions of dollars.
What were the values for 'Net income' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Net income' for the years 2024, 2023, and 2022 were 1614, 1967, and 2293 millions of dollars, respectively.
What was the value of 'Net income from continuing operations' in 2024 according to the Statements of Operations / Income?|The value of 'Net income from continuing operations' in 2024 was 1618 millions of dollars.
Find the value for 'Net income from continuing operations' in 2023 from the Statements of Operations / Income.|The value of 'Net income from continuing operations' in 2023 was 1949 millions of dollars.
Could you provide the figure for 'Net income from continuing operations' in 2022 as reported in the Statements of Operations / Income?|The value of 'Net income from continuing operations' in 2022 was 2275 millions of dollars.
How much did the 'Net income from continuing operations' change from 2023 to 2024 based on the Statements of Operations / Income?|The change in 'Net income from continuing operations' from 2023 to 2024 was -331 millions of dollars.
What was the difference in 'Net income from continuing operations' between 2022 and 2023 according to the Statements of Operations / Income?|The difference in 'Net income from continuing operations' between 2022 and 2023 was -326 millions of dollars.
What were the values for 'Net income from continuing operations' for the years 2024, 2023, and 2022 in the Statements of Operations / Income?|The values for 'Net income from continuing operations' for the years 2024, 2023, and 2022 were 1618, 1949, and 2275 millions of dollars, respectively.
What was the value of 'Service cost  Operating' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of 'Service cost  Operating' in 2024 was 32 millions of dollars.
Find the value for 'Service cost  Operating' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of 'Service cost  Operating' in 2023 was 23 millions of dollars.
Could you provide the figure for 'Service cost  Operating' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of 'Service cost  Operating' in 2022 was 19 millions of dollars.
How much did the 'Service cost  Operating' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in 'Service cost  Operating' from 2023 to 2024 was 9 millions of dollars.
What was the difference in 'Service cost  Operating' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in 'Service cost  Operating' between 2022 and 2023 was 4 millions of dollars.
What were the values for 'Service cost  Operating' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for 'Service cost  Operating' for the years 2024, 2023, and 2022 were 32, 23, and 19 millions of dollars, respectively.
What was the value of ')' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of ')' in 2024 was 44 millions of dollars.
Find the value for ')' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of ')' in 2023 was 9 millions of dollars.
Could you provide the figure for ')' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of ')' in 2022 was 25 millions of dollars.
How much did the ')' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in ')' from 2023 to 2024 was 35 millions of dollars.
What was the difference in ')' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in ')' between 2022 and 2023 was -16 millions of dollars.
What were the values for ')' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for ')' for the years 2024, 2023, and 2022 were 44, 9, and 25 millions of dollars, respectively.
What was the value of '2024' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of '2024' in 2024 was 1277 millions of dollars.
Find the value for '2024' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of '2024' in 2023 was 226 millions of dollars.
Could you provide the figure for '2024' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of '2024' in 2022 was 130 millions of dollars.
How much did the '2024' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in '2024' from 2023 to 2024 was 1051 millions of dollars.
What was the difference in '2024' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in '2024' between 2022 and 2023 was 96 millions of dollars.
What were the values for '2024' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for '2024' for the years 2024, 2023, and 2022 were 1277, 226, and 130 millions of dollars, respectively.
What was the value of '' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of '' in 2024 was 1655 millions of dollars.
Find the value for '' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of '' in 2023 was 467 millions of dollars.
Could you provide the figure for '' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of '' in 2022 was 51 millions of dollars.
How much did the '' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in '' from 2023 to 2024 was 1188 millions of dollars.
What was the difference in '' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in '' between 2022 and 2023 was 416 millions of dollars.
What were the values for '' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for '' for the years 2024, 2023, and 2022 were 1655, 467, and 51 millions of dollars, respectively.
What was the value of 'Fair value of plan assets' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of 'Fair value of plan assets' in 2024 was 14700 millions of dollars.
Find the value for 'Fair value of plan assets' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of 'Fair value of plan assets' in 2023 was 1820 millions of dollars.
Could you provide the figure for 'Fair value of plan assets' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of 'Fair value of plan assets' in 2022 was 6318 millions of dollars.
How much did the 'Fair value of plan assets' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in 'Fair value of plan assets' from 2023 to 2024 was 12880 millions of dollars.
What was the difference in 'Fair value of plan assets' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in 'Fair value of plan assets' between 2022 and 2023 was -4498 millions of dollars.
What were the values for 'Fair value of plan assets' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for 'Fair value of plan assets' for the years 2024, 2023, and 2022 were 14700, 1820, and 6318 millions of dollars, respectively.
What was the value of '595' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of '595' in 2024 was 5967 millions of dollars.
Find the value for '595' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of '595' in 2023 was 4518 millions of dollars.
Could you provide the figure for '595' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of '595' in 2022 was 290 millions of dollars.
How much did the '595' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in '595' from 2023 to 2024 was 1449 millions of dollars.
What was the difference in '595' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in '595' between 2022 and 2023 was 4228 millions of dollars.
What were the values for '595' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for '595' for the years 2024, 2023, and 2022 were 5967, 4518, and 290 millions of dollars, respectively.
What was the value of 'Fair value of plan assets' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of 'Fair value of plan assets' in 2024 was 425 millions of dollars.
Find the value for 'Fair value of plan assets' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of 'Fair value of plan assets' in 2023 was 60 millions of dollars.
Could you provide the figure for 'Fair value of plan assets' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of 'Fair value of plan assets' in 2022 was 157 millions of dollars.
How much did the 'Fair value of plan assets' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in 'Fair value of plan assets' from 2023 to 2024 was 365 millions of dollars.
What was the difference in 'Fair value of plan assets' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in 'Fair value of plan assets' between 2022 and 2023 was -97 millions of dollars.
What were the values for 'Fair value of plan assets' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for 'Fair value of plan assets' for the years 2024, 2023, and 2022 were 425, 60, and 157 millions of dollars, respectively.
What was the value of 'U.S. income' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of 'U.S. income' in 2024 was 816 millions of dollars.
Find the value for 'U.S. income' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of 'U.S. income' in 2023 was 1090 millions of dollars.
Could you provide the figure for 'U.S. income' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of 'U.S. income' in 2022 was 1587 millions of dollars.
How much did the 'U.S. income' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in 'U.S. income' from 2023 to 2024 was -274 millions of dollars.
What was the difference in 'U.S. income' between 2022 and 2023 according to the Statements of Financial Position / Balance Sheet?|The difference in 'U.S. income' between 2022 and 2023 was -497 millions of dollars.
What were the values for 'U.S. income' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?|The values for 'U.S. income' for the years 2024, 2023, and 2022 were 816, 1090, and 1587 millions of dollars, respectively.
What was the value of 'Total' in 2024 according to the Statements of Financial Position / Balance Sheet?|The value of 'Total' in 2024 was 2361 millions of dollars.
Find the value for 'Total' in 2023 from the Statements of Financial Position / Balance Sheet.|The value of 'Total' in 2023 was 2512 millions of dollars.
Could you provide the figure for 'Total' in 2022 as reported in the Statements of Financial Position / Balance Sheet?|The value of 'Total' in 2022 was 2875 millions of dollars.
How much did the 'Total' change from 2023 to 2024 based on the Statements of Financial Position / Balance Sheet?|The change in 'Total' from 2023 to 2024 was -151 millions of dollars.
"""

# Clean and parse the data
data = []
for line in file_content.strip().split('\n'):
    if 'Question|Answer' in line:
        continue
    # Split each line into a question and an answer on the '|' delimiter
    parts = line.split('|')
    if len(parts) == 2:
        data.append({"question": parts[0], "answer": parts[1]})

df = pd.DataFrame(data)
print(df.head())
print(f"\nTotal Q&A pairs: {len(df)}")

                                            question  \
0  What was the value of 'Impact on PBO/APBO at D...   
1  Find the value for 'Impact on PBO/APBO at Dece...   
2  Could you provide the figure for 'Impact on PB...   
3  How much did the 'Impact on PBO/APBO at Decemb...   
4  What was the difference in 'Impact on PBO/APBO...   

                                              answer  
0  The value of 'Impact on PBO/APBO at December 3...  
1  The value of 'Impact on PBO/APBO at December 3...  
2  The value of 'Impact on PBO/APBO at December 3...  
3  The change in 'Impact on PBO/APBO at December ...  
4  The difference in 'Impact on PBO/APBO at Decem...  

Total Q&A pairs: 100
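
Some rows in the raw data have a metric label that looks like extraction debris (for example `''`, `')'`, or a bare number such as `'455'`), and the 'Fair value of plan assets' questions appear twice with different answers. The following is a minimal sketch of a sanity check for such rows, assuming the label is the first single-quoted string in each question. The helper names `extract_label` and `find_suspicious_rows` are hypothetical and not part of the original notebook, and the sample rows are abbreviated versions of the data above.

```python
import re
from collections import defaultdict

def extract_label(question):
    """Return the text inside the first pair of single quotes, or None."""
    match = re.search(r"'([^']*)'", question)
    return match.group(1) if match else None

def find_suspicious_rows(rows):
    """Flag rows with debris-like labels and questions that map to several answers."""
    bad_labels = []
    answers_by_question = defaultdict(set)
    for row in rows:
        label = extract_label(row["question"])
        # Assumption: a usable metric label contains at least one letter.
        if label is None or not re.search(r"[A-Za-z]", label):
            bad_labels.append(row["question"])
        answers_by_question[row["question"]].add(row["answer"])
    conflicts = [q for q, answers in answers_by_question.items() if len(answers) > 1]
    return bad_labels, conflicts

# Abbreviated sample rows modeled on the dataset above.
sample_rows = [
    {"question": "What was the value of 'Sales of products' in 2024?", "answer": "13127"},
    {"question": "What was the value of '' in 2024?", "answer": "1655"},
    {"question": "What was the value of '455' in 2024?", "answer": "5"},
    {"question": "What was the value of 'Fair value of plan assets' in 2024?", "answer": "14700"},
    {"question": "What was the value of 'Fair value of plan assets' in 2024?", "answer": "425"},
]

bad_labels, conflicts = find_suspicious_rows(sample_rows)
print(f"Rows with debris-like labels: {len(bad_labels)}")        # 2
print(f"Questions with conflicting answers: {len(conflicts)}")   # 1
```

In the notebook, `df.to_dict("records")` could be passed as `rows` to run the same check on all 100 pairs.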


In [43]:
from datasets import Dataset

# Convert to Hugging Face Dataset and split
full_dataset = Dataset.from_pandas(df)
train_test_split = full_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

## 3. Model Selection and Baseline Benchmarking 📊

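We will use **gpt2** with a question-answering head as a baseline to see how a model performs *before* any fine-tuning. This helps us quantify the improvement from our fine-tuning process. Note that the `gpt2` checkpoint does not include a trained QA head, so `AutoModelForQuestionAnswering` initializes that head randomly and the baseline answers are expected to be close to arbitrary.

For fine-tuning, we'll also use **gpt2**, a decoder-only causal language model that can learn our instruction-style format by continuing a prompt with the answer.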
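
An extractive question-answering head produces one start score and one end score per token, and the answer is the span between the chosen start and end positions. Picking each position with an independent argmax can produce an end that comes before the start. A common alternative is to search for the best valid pair. The sketch below illustrates this in plain Python with made-up scores; `best_span` and `max_answer_len` are hypothetical names, not part of the notebook or of `transformers`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best_score = float("-inf")
    best_pair = (0, 0)
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best_pair = (s, e)
    return best_pair

# Made-up scores for a five-token input.
start_logits = [0.1, 0.2, 0.3, 2.5, 0.0]
end_logits = [3.0, 0.1, 0.2, 0.4, 1.5]

independent = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print("Independent argmax:", independent)                        # (3, 0): end before start
print("Best valid span:", best_span(start_logits, end_logits))   # (3, 4)
```

With an untrained head, neither strategy yields meaningful answers; the constraint only guarantees that the decoded span is well-formed.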

In [44]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

baseline_model_name = "gpt2"
baseline_tokenizer = AutoTokenizer.from_pretrained(baseline_model_name)
baseline_model = AutoModelForQuestionAnswering.from_pretrained(baseline_model_name)

def get_baseline_model_answer(question, context):
    inputs = baseline_tokenizer(question, context, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        start_time = time.time()
        outputs = baseline_model(**inputs)
        inference_time = time.time() - start_time

    answer_start_index = torch.argmax(outputs.start_logits)
    answer_end_index = torch.argmax(outputs.end_logits)

    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    answer = baseline_tokenizer.decode(predict_answer_tokens)

    start_prob = torch.nn.functional.softmax(outputs.start_logits, dim=-1)[0, answer_start_index].item()
    end_prob = torch.nn.functional.softmax(outputs.end_logits, dim=-1)[0, answer_end_index].item()
    confidence = (start_prob + end_prob) / 2

    return answer, confidence, inference_time

# Create a single context from all answers for the baseline model
context = " ".join(df['answer'].tolist())

test_questions = df.sample(10, random_state=42)

print("--- Baseline Model Evaluation ---")
for _, row in test_questions.iterrows():
    question = row['question']
    real_answer = row['answer']
    model_answer, confidence, inference_time = get_baseline_model_answer(question, context)
    print(f"Q: {question}")
    print(f"A: {model_answer} (Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)")
    print(f"Real A: {real_answer}\n")

Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at gpt2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- Baseline Model Evaluation ---
Q: What were the values for '595' for the years 2024, 2023, and 2022 in the Statements of Financial Position / Balance Sheet?
A: /APBO at December 31, 2023' between 2022 and 2023 was 278 millions of dollars. The values for 'Impact on PBO/APBO at December 31, 2023' for the years 2024, 2023, and 2022 were 940, 318, and 40 millions of dollars, respectively. The value of 'Sales of products' in 2024 was 13127 millions of dollars. The value of 'Sales of products' in 2023 was 12044 millions of dollars. The value of 'Sales of products' in 2022 was 11165 millions of dollars. The change in 'Sales of products' from 2023 to 2024 was 1083 millions of dollars. The difference in 'Sales of (Confidence: 0.0221, Time: 0.3180s)
Real A: The values for '595' for the years 2024, 2023, and 2022 were 5967, 4518, and 290 millions of dollars, respectively.

Q: What were the values for 'Service cost  Operating' for the years 2024, 2023, and 2022 in the Statements of Financial P

## 4. Fine-Tuning with SFTTrainer 🚀

Now we'll fine-tune the gpt2 model on our Q&A dataset. The `SFTTrainer` handles the complexities of formatting, tokenizing, and training the model on our instruction-style data.

### 4.1. Advanced Fine-Tuning Technique: Supervised Instruction Fine-Tuning

We will provide a formatting function to `SFTTrainer` that structures our data as `"question: {question} answer: {answer}"`. This teaches the model to follow instructions and provide a direct answer.
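
For this format to work at inference time, the prompt we give the model must be an exact prefix of the text it saw during training. The sketch below makes that relationship explicit. The names `PROMPT_TEMPLATE`, `build_prompt`, and `build_training_text` are hypothetical helpers introduced for illustration; the notebook itself builds these strings inline.

```python
PROMPT_TEMPLATE = "question: {question} answer:"

def build_prompt(question):
    """Prompt given to the model at inference time (no answer yet)."""
    return PROMPT_TEMPLATE.format(question=question)

def build_training_text(question, answer):
    """Full text seen during training: the prompt followed by the target answer."""
    return f"{build_prompt(question)} {answer}"

example = {
    "question": "What was the value of 'Sales of products' in 2024 "
                "according to the Statements of Operations / Income?",
    "answer": "The value of 'Sales of products' in 2024 was 13127 millions of dollars.",
}

training_text = build_training_text(example["question"], example["answer"])
prompt = build_prompt(example["question"])

# The inference prompt must be an exact prefix of the training text.
assert training_text.startswith(prompt)
print(training_text)
```

Keeping both strings derived from one template avoids subtle mismatches, such as a missing space or a different label, between training and generation.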

### Why GPT-2 and SFTTrainer are a good combination

GPT-2 is a small causal language model (about 124M parameters for the base `gpt2` checkpoint), so it can be fine-tuned on a dataset of this size even on a CPU. `SFTTrainer` is designed for supervised fine-tuning on instruction-style data: it applies the formatting function, appends the EOS token, tokenizes, and truncates the examples before training, so we do not have to write that preprocessing ourselves. Together they give a lightweight way to specialize a general-purpose model for our financial Q&A task.

In [45]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import torch

In [46]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [47]:
# Set padding token for GPT-2
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# SFTTrainer requires a formatting function to structure the data
def formatting_prompts_func(example):
    text = f"question: {example['question']} answer: {example['answer']}"
    return text


In [49]:
lr_rate = 2e-5
no_train_epochs = 35
weight_decay = 0.01

In [50]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results_sft",
    eval_strategy="steps",  # Evaluate every eval_steps steps (defaults to logging_steps)
    learning_rate=lr_rate,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=no_train_epochs,
    weight_decay=weight_decay,
    save_total_limit=3,
    logging_steps=10,
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(), # Use mixed precision if GPU is available
    report_to='none' # Disable Weights & Biases logging
)

# Instantiate the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
)

Applying formatting function to train dataset: 100%|██████████| 80/80 [00:00<00:00, 13262.10 examples/s]
Adding EOS to train dataset: 100%|██████████| 80/80 [00:00<00:00, 14836.59 examples/s]
Tokenizing train dataset: 100%|██████████| 80/80 [00:00<00:00, 5864.62 examples/s]
Truncating train dataset: 100%|██████████| 80/80 [00:00<00:00, 8029.10 examples/s]
Applying formatting function to eval dataset: 100%|██████████| 20/20 [00:00<00:00, 3008.29 examples/s]
Adding EOS to eval dataset: 100%|██████████| 20/20 [00:00<00:00, 7610.10 examples/s]
Tokenizing eval dataset: 100%|██████████| 20/20 [00:00<00:00, 2812.99 examples/s]
Truncating eval dataset: 100%|██████████| 20/20 [00:00<?, ? examples/s]


In [51]:
# Log hyperparameters
print("--- Fine-Tuning Hyperparameters ---")
print(f"Model: {model_name}")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Batch Size: {training_args.per_device_train_batch_size}")
print(f"Number of Epochs: {training_args.num_train_epochs}")
print(f"Compute Setup: {'GPU' if training_args.fp16 else 'CPU'}")

# Start fine-tuning
trainer.train()

--- Fine-Tuning Hyperparameters ---
Model: gpt2
Learning Rate: 2e-05
Batch Size: 4
Number of Epochs: 35
Compute Setup: CPU




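| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10 | 2.9604 | 2.231904 |
| 20 | 1.8924 | 1.39873 |
| 30 | 1.2862 | 1.009726 |
| 40 | 0.8886 | 0.799438 |
| 50 | 0.7138 | 0.665387 |
| 60 | 0.592 | 0.579596 |
| 70 | 0.5497 | 0.525075 |
| 80 | 0.47 | 0.474241 |
| 90 | 0.4269 | 0.466815 |
| 100 | 0.4354 | 0.40519 |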




TrainOutput(global_step=700, training_loss=0.3495261582306453, metrics={'train_runtime': 749.5669, 'train_samples_per_second': 3.735, 'train_steps_per_second': 0.934, 'total_flos': 93464976384000.0, 'train_loss': 0.3495261582306453})
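
The `global_step` and throughput figures in the `TrainOutput` above can be checked by hand from the dataset size and the training arguments. This is a small worked calculation, assuming a single device and the default of one gradient-accumulation step; the variable names are illustrative.

```python
import math

num_train_examples = 80      # size of the training split (from the progress bars)
batch_size = 4               # per_device_train_batch_size
num_epochs = 35              # num_train_epochs
train_runtime_s = 749.5669   # train_runtime reported by the trainer

# Assumption: one device and gradient_accumulation_steps=1 (the default).
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(f"Steps per epoch: {steps_per_epoch}")   # 20
print(f"Total steps: {total_steps}")           # 700
print(f"Samples per second: {samples_per_second:.3f}")
print(f"Steps per second: {steps_per_second:.3f}")
```

The total of 700 steps matches `global_step=700`, which also shows that the loss table above covers only the first 100 of those steps.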

## 5. Guardrail Implementation 🛡️

We will implement a simple input-side guardrail that checks if a question is relevant to the financial domain. This is done by looking for a list of predefined keywords. If a question is deemed irrelevant, the model will return a standard response instead of attempting to answer.

In [52]:
FINANCIAL_KEYWORDS = [
    'value', 'sales', 'income', 'cost', 'pbo', 'apbo', 'operations',
    'financial', 'stockholders', 'change', 'difference', 'revenue', 'products'
]

def is_relevant(question):
    """Checks if the question contains any financial keywords."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

# Example Usage
print(f"'What is the value of sales in 2024?' is relevant: {is_relevant('What is the value of sales in 2024?')}")
print(f"'What is the capital of France?' is relevant: {is_relevant('What is the capital of France?')}")

'What is the value of sales in 2024?' is relevant: True
'What is the capital of France?' is relevant: False
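
Because `is_relevant` uses substring matching, a keyword can fire inside an unrelated word: `'cost'` matches "Costa Rica", for example. The following is a minimal sketch of a stricter variant, assuming whole-word matching is acceptable for this dataset. The names `is_relevant_substring`, `is_relevant_strict`, and `KEYWORD_PATTERN` are hypothetical and not part of the original notebook.

```python
import re

FINANCIAL_KEYWORDS = [
    'value', 'sales', 'income', 'cost', 'pbo', 'apbo', 'operations',
    'financial', 'stockholders', 'change', 'difference', 'revenue', 'products'
]

# Hypothetical stricter guardrail: match keywords only as whole words.
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in FINANCIAL_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_relevant_substring(question):
    """Original behaviour: substring match on the lowercased question."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

def is_relevant_strict(question):
    """Whole-word match, so 'cost' no longer fires inside 'Costa'."""
    return KEYWORD_PATTERN.search(question) is not None

for q in [
    "What is the value of sales in 2024?",
    "What is the capital of Costa Rica?",
    "What was the impact on PBO/APBO in 2023?",
]:
    print(f"{q!r}: substring={is_relevant_substring(q)}, strict={is_relevant_strict(q)}")
```

The trade-off is that whole-word matching misses inflected forms such as "costs" or "changed", so the keyword list would need to include them explicitly.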


## 6. Testing and Evaluation ✅

Now we'll test our fine-tuned model. We'll define a function to get predictions and then evaluate it on our specified test questions, including the guardrail logic.

In [53]:
finetuned_model = trainer.model # Get the fine-tuned model from the trainer
finetuned_model.eval() # Set the model to evaluation mode
# Trainer.tokenizer is deprecated; Trainer.processing_class is its replacement
finetuned_model_tokenizer = trainer.processing_class


In [54]:
def get_finetuned_answer(question):
    """Return (answer, confidence, inference_time_s, method) for a question."""
    # --- Guardrail Check ---
    # Irrelevant questions never reach the model; their confidence is hard-coded to 1.0
    if not is_relevant(question):
        return "Not applicable", 1.0, 0.0, "Guardrail (Irrelevant)"

    # Format the input for the GPT-2 model
    prompt = f"question: {question} answer:"
    inputs = finetuned_model_tokenizer(prompt, return_tensors="pt").to(finetuned_model.device)
    # Number of prompt tokens, used below to separate prompt from answer
    prompt_length = inputs.input_ids.shape[1]

    start_time = time.time()
    outputs = finetuned_model.generate(
        **inputs,
        max_new_tokens=128,  # Cap on generated tokens, excluding the prompt
        return_dict_in_generate=True,
        output_scores=True  # Keep per-step scores to calculate confidence
    )
    inference_time = time.time() - start_time

    # Slice off the prompt tokens and decode only the generated answer
    answer_ids = outputs.sequences[0][prompt_length:]
    decoded_answer = finetuned_model_tokenizer.decode(answer_ids, skip_special_tokens=True).strip()

    # With normalize_logits=True, the transition scores are the log-probabilities
    # of the tokens that were actually generated
    transition_scores = finetuned_model.compute_transition_scores(outputs.sequences, outputs.scores, normalize_logits=True)
    # Mean log-probability across the generated tokens
    avg_log_prob = transition_scores.mean().item()
    # Exponentiating gives the geometric mean of the token probabilities
    confidence = torch.exp(torch.tensor(avg_log_prob)).item()

    return decoded_answer, confidence, inference_time, "Fine-Tune"
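
The confidence score in `get_finetuned_answer` is the exponential of the mean token log-probability, which is the geometric mean of the per-token probabilities. A small self-contained illustration follows; the token probabilities are made up for the example and are not taken from the model.

```python
import math

# Hypothetical per-token probabilities for a five-token answer.
token_probs = [0.99, 0.95, 0.90, 0.60, 0.98]

# Same recipe as the notebook: average the log-probabilities, then exponentiate.
log_probs = [math.log(p) for p in token_probs]
confidence = math.exp(sum(log_probs) / len(log_probs))

# Equivalent closed form: the n-th root of the product of the probabilities.
geometric_mean = math.prod(token_probs) ** (1 / len(token_probs))

# The arithmetic mean is shown only for comparison; it is never smaller.
arithmetic_mean = sum(token_probs) / len(token_probs)

print(f"Confidence (exp of mean log-prob): {confidence:.4f}")
print(f"Geometric mean of probabilities:   {geometric_mean:.4f}")
print(f"Arithmetic mean, for comparison:   {arithmetic_mean:.4f}")
```

Since every generated token carries equal weight, a long run of predictable template tokens can keep the score high even when the model is unsure about the few tokens that encode the figure.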

### 6.1. Official Test Questions

In [55]:
official_questions = [
    {
        "question": "What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income?",
        "type": "Relevant, high-confidence"
    },
    {
        "question": "What was the trend in net income?",
        "type": "Relevant, low-confidence (ambiguous)"
    },
    {
        "question": "What is the capital of France?",
        "type": "Irrelevant"
    }
]

print("--- Official Test Questions ---")
for q in official_questions:
    answer, confidence, inference_time, method = get_finetuned_answer(q['question'])
    print(f"Q: {q['question']} ({q['type']})")
    print(f"A: {answer} (Method: {method}, Confidence: {confidence:.4f}, Time: {inference_time:.4f}s)\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Official Test Questions ---


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Q: What was the value of 'Sales of products' in 2024 according to the Statements of Operations / Income? (Relevant, high-confidence)
A: The value of 'Sales of products' in 2024 was 14700 millions of dollars. (Method: Fine-Tune, Confidence: 0.9229, Time: 0.4278s)

Q: What was the trend in net income? (Relevant, low-confidence (ambiguous))
A: The change in 'Net income' from 2023 to 2024 was -331 millions of dollars. (Method: Fine-Tune, Confidence: 0.8765, Time: 0.4660s)

Q: What is the capital of France? (Irrelevant)
A: Not applicable (Method: Guardrail (Irrelevant), Confidence: 1.0000, Time: 0.0000s)



### 6.2. Extended Evaluation and Results Table

In [None]:
from IPython.display import display
import re

results = []
for item in extended_eval_questions:
    question = item['question']
    real_answer = item['real_answer']

    # Fine-tuned model result
    ft_answer, ft_confidence, ft_inference_time, ft_method = get_finetuned_answer(question)

    # RAG model result
    rag_answer, rag_confidence, rag_inference_time, rag_method = getResponseRag(question)

    # Correctness check for fine-tuned model
    numbers_in_real_answer = set(re.findall(r'-?\d+', real_answer))
    numbers_in_ft_answer = set(re.findall(r'-?\d+', ft_answer))
    ft_correct = 'Y' if numbers_in_real_answer and numbers_in_real_answer.issubset(numbers_in_ft_answer) else 'N'

    if "not in data" in real_answer.lower() and ft_method == "Guardrail (Irrelevant)":
        ft_correct = 'Y'
        ft_answer = "Not applicable"

    # Correctness check for RAG model
    numbers_in_rag_answer = set(re.findall(r'-?\d+', rag_answer))
    rag_correct = 'Y' if numbers_in_real_answer and numbers_in_real_answer.issubset(numbers_in_rag_answer) else 'N'

    if "not in data" in real_answer.lower() and rag_method == "Guardrail (Irrelevant)":
        rag_correct = 'Y'
        rag_answer = "Not applicable"

    results.append({
        "Question": question,
        "Real Answer": real_answer,
        "Fine-Tuned Answer": ft_answer,
        "Fine-Tuned Method": ft_method,
        "Fine-Tuned Confidence": f"{ft_confidence:.2f}",
        "Fine-Tuned Time (s)": f"{ft_inference_time:.2f}",
        "Fine-Tuned Correct (Y/N)": ft_correct,
        "RAG Answer": rag_answer,
        "RAG Method": rag_method,
        "RAG Confidence": f"{rag_confidence:.2f}",
        "RAG Time (s)": f"{rag_inference_time:.2f}",
        "RAG Correct (Y/N)": rag_correct
    })

results_df = pd.DataFrame(results)
display(results_df)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


| # | Question | Method | Answer | Confidence | Time (s) | Correct (Y/N) |
|---|----------|--------|--------|------------|----------|---------------|
| 0 | Find the value for 'Sales of products' in 2023... | Fine-Tune | The value of 'Sales of products' in 2023 was 1... | 0.92 | 0.41 | N |
| 1 | How much did the 'Net income' change from 2023... | Fine-Tune | The change in 'Net income' from 2023 to 2024 w... | 0.96 | 0.48 | N |
| 2 | What was the value of 'Comprehensive income at... | Fine-Tune | The value of 'Comprehensive income attributabl... | 0.91 | 0.56 | N |
| 3 | Could you provide the figure for '455' in 2022... | Fine-Tune | The value of '455' in 2022 was 5 millions of d... | 0.89 | 0.37 | N |
| 4 | What was the difference in 'Impact on PBO/APBO... | Fine-Tune | The difference in 'Impact on PBO/APBO at Decem... | 0.95 | 0.73 | N |
| 5 | What was the value of 'Net income from continu... | Fine-Tune | The value of 'Net income from continuing opera... | 0.93 | 0.46 | N |
| 6 | What were the values for 'Net income attributa... | Fine-Tune | The values for 'Net income attributable to GE ... | 0.9 | 0.95 | N |
| 7 | What is the company's stock ticker? | Guardrail (Irrelevant) | Not applicable | 1.0 | 0.0 | Y |
| 8 | What was the service cost in 2023? | Fine-Tune | The value of 'Service cost  Operating' in 202... | 0.86 | 0.49 | N |
| 9 | Who is the CEO of the company? | Guardrail (Irrelevant) | Not applicable | 1.0 | 0.0 | Y |


In [57]:
with pd.option_context('display.max_colwidth', None):
  display(results_df)

| # | Question | Method | Answer | Confidence | Time (s) | Correct (Y/N) |
|---|----------|--------|--------|------------|----------|---------------|
| 0 | Find the value for 'Sales of products' in 2023 from the Statements of Operations / Income. | Fine-Tune | The value of 'Sales of products' in 2023 was 1949 millions of dollars. | 0.92 | 0.41 | N |
| 1 | How much did the 'Net income' change from 2023 to 2024 based on the Statements of Operations / Income? | Fine-Tune | The change in 'Net income' from 2023 to 2024 was -331 millions of dollars. | 0.96 | 0.48 | N |
| 2 | What was the value of 'Comprehensive income attributable to GE HealthCare' in 2022? | Fine-Tune | The value of 'Comprehensive income attributable to GE HealthCare' in 2022 was 2293 millions of dollars. | 0.91 | 0.56 | N |
| 3 | Could you provide the figure for '455' in 2022 as reported in the Statements of Operations / Income? | Fine-Tune | The value of '455' in 2022 was 5 millions of dollars. | 0.89 | 0.37 | N |
| 4 | What was the difference in 'Impact on PBO/APBO at December 31, 2023' between 2022 and 2023 according to the Statements of Operations / Income? | Fine-Tune | The difference in 'Impact on PBO/APBO at December 31, 2023' between 2022 and 2023 was -326 millions of dollars. | 0.95 | 0.73 | N |
| 5 | What was the value of 'Net income from continuing operations' in 2024? | Fine-Tune | The value of 'Net income from continuing operations' in 2024 was 1614 millions of dollars. | 0.93 | 0.46 | N |
| 6 | What were the values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022? | Fine-Tune | The values for 'Net income attributable to GE HealthCare' for the years 2024, 2023, and 2022 were 1614, 1949, and 2275 millions of dollars, respectively. | 0.9 | 0.95 | N |
| 7 | What is the company's stock ticker? | Guardrail (Irrelevant) | Not applicable | 1.0 | 0.0 | Y |
| 8 | What was the service cost in 2023? | Fine-Tune | The value of 'Service cost  Operating' in 2023 was 2023 millions of dollars. | 0.86 | 0.49 | N |
| 9 | Who is the CEO of the company? | Guardrail (Irrelevant) | Not applicable | 1.0 | 0.0 | Y |
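
The `Correct (Y/N)` column comes from the numeric subset check in the evaluation loop: every integer found in the reference answer must also appear in the model's answer. Below is a minimal standalone sketch of that check. `numbers_match` is a hypothetical helper name, and the reference string is invented purely for illustration (it is not a value from the dataset).

```python
import re

def numbers_match(real_answer, predicted_answer):
    """Return 'Y' if every integer in the reference also appears in the prediction."""
    real_numbers = set(re.findall(r'-?\d+', real_answer))
    predicted_numbers = set(re.findall(r'-?\d+', predicted_answer))
    return 'Y' if real_numbers and real_numbers.issubset(predicted_numbers) else 'N'

# Invented reference answer (NOT from the dataset) versus a prediction from the table.
reference = "The value of 'Sales of products' in 2023 was 12345 millions of dollars."
prediction = "The value of 'Sales of products' in 2023 was 1949 millions of dollars."

# The year matches but the figure does not, so the check fails.
print(numbers_match(reference, prediction))
# Identical text always passes.
print(numbers_match(reference, reference))
# Caveat: a thousands separator splits one figure into two separate tokens.
print(re.findall(r'-?\d+', "14,700"))
```

The check is strict about digits but blind to formatting, so reference answers and predictions need a shared number format (no thousands separators or decimals) for the `Y`/`N` labels to be meaningful.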


### 6.3. Save the fine-tuned model for inference

In [58]:
# 'finetuned_model' and 'finetuned_model_tokenizer' were taken from the trainer in Section 6
output_dir = "./gpt2-finetuned-model"

# Save the model weights and configuration
finetuned_model.save_pretrained(output_dir)

# Save the tokenizer's vocabulary and settings
finetuned_model_tokenizer.save_pretrained(output_dir)

('./gpt2-finetuned-model\\tokenizer_config.json',
 './gpt2-finetuned-model\\special_tokens_map.json',
 './gpt2-finetuned-model\\vocab.json',
 './gpt2-finetuned-model\\merges.txt',
 './gpt2-finetuned-model\\added_tokens.json',
 './gpt2-finetuned-model\\tokenizer.json')

### 6.4. Push the model to the Hugging Face Hub (too large for a regular Git repository at 400+ MB)

In [66]:
from huggingface_hub import notebook_login

notebook_login()


In [67]:
repo_name = "gpt2-finetuned-model"

# Push the model to the Hub
finetuned_model.push_to_hub(repo_name)

# Push the tokenizer to the Hub
finetuned_model_tokenizer.push_to_hub(repo_name)

model.safetensors: 100%|██████████| 498M/498M [02:36<00:00, 3.19MB/s] 
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


CommitInfo(commit_url='https://huggingface.co/Anup77Jindal/gpt2-finetuned-model/commit/45efd85b04b5914234d88d4caaf98520cf5310ea', commit_message='Upload tokenizer', commit_description='', oid='45efd85b04b5914234d88d4caaf98520cf5310ea', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Anup77Jindal/gpt2-finetuned-model', endpoint='https://huggingface.co', repo_type='model', repo_id='Anup77Jindal/gpt2-finetuned-model'), pr_revision=None, pr_num=None)

## 7. Summary and Conclusion 📝

This section summarizes the baseline, fine-tuned, and RAG evaluations and draws conclusions about the effectiveness of fine-tuning GPT-2 with SFTTrainer on this financial Q&A dataset and about the impact of the guardrail.


### Evaluation results:


*   **Baseline Model:** The baseline GPT-2 model, without fine-tuning on this specific dataset, performed poorly on the financial Q&A task, often providing irrelevant or incomplete answers with low confidence scores. This highlights the need for domain-specific fine-tuning.
*   **Fine-Tuned Model:** The GPT-2 model fine-tuned with SFTTrainer produces fluent answers in the dataset's own phrasing, with confidence scores between 0.86 and 0.96 on the relevant questions in Section 6.2. Those scores overstate its reliability: all eight fine-tuned answers in that table are marked `N` by the numeric correctness check. Supervised instruction fine-tuning taught the model the answer format, but on this evidence it did not reliably learn the underlying figures.
*   **RAG Model:** The RAG approach leverages retrieval from the source data, providing answers that are more factually grounded and adaptable to new information. RAG is robust to out-of-domain queries due to its retrieval component, but may be slower due to the retrieval and reranking steps.
*   **Guardrail:** The keyword guardrail flagged both irrelevant questions in the extended evaluation ("What is the company's stock ticker?" and "Who is the CEO of the company?") and returned "Not applicable" with a hard-coded confidence of 1.0. This keeps the model within its intended domain and prevents it from giving misleading answers to out-of-scope queries.


### Comparison of Average Inference Speed and Accuracy


- **Inference Speed:** Fine-tuned models are generally faster at inference since they generate answers directly, while RAG models require retrieval and reranking, which adds latency. In our results, the fine-tuned model answered each relevant question in under one second (0.37 to 0.95 s) and was consistently quicker than RAG.
- **Accuracy:** RAG models tend to be more accurate for fact-based questions, as they ground their answers in retrieved context. Fine-tuned models may be more fluent but can hallucinate or provide less factual answers if the training data is limited.
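
Averages like these can be computed directly from the results table. The sketch below hard-codes the `Method`, `Time (s)`, and `Correct (Y/N)` columns from the table displayed in Section 6.2, which contains only fine-tuned and guardrail rows; RAG rows could be aggregated the same way once they are present. The variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (Method, Time (s), Correct (Y/N)) copied from the results table in Section 6.2.
rows = [
    ("Fine-Tune", 0.41, "N"), ("Fine-Tune", 0.48, "N"),
    ("Fine-Tune", 0.56, "N"), ("Fine-Tune", 0.37, "N"),
    ("Fine-Tune", 0.73, "N"), ("Fine-Tune", 0.46, "N"),
    ("Fine-Tune", 0.95, "N"), ("Guardrail (Irrelevant)", 0.0, "Y"),
    ("Fine-Tune", 0.49, "N"), ("Guardrail (Irrelevant)", 0.0, "Y"),
]

# Group the rows by method, converting Y/N into booleans.
by_method = defaultdict(list)
for method, seconds, correct in rows:
    by_method[method].append((seconds, correct == "Y"))

# Per-method question count, fraction correct, and mean latency.
summary = {
    method: {
        "questions": len(items),
        "accuracy": mean(1.0 if ok else 0.0 for _, ok in items),
        "avg_time_s": mean(seconds for seconds, _ in items),
    }
    for method, items in by_method.items()
}

for method, stats in summary.items():
    print(f"{method}: n={stats['questions']}, "
          f"accuracy={stats['accuracy']:.2f}, avg time={stats['avg_time_s']:.3f}s")
```

Reporting the guardrail rows separately keeps their zero latency and hard-coded outcome from skewing the fine-tuned model's averages.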


### Strengths of Each Approach


- **RAG Strengths:**
    - Adaptability to new data without retraining.
    - Factual grounding from source documents.
    - Robustness to irrelevant queries due to retrieval and guardrail logic.
- **Fine-Tuning Strengths:**
    - Fluency and natural language generation.
    - Efficiency in inference speed.
    - Can generalize well within the domain if trained on sufficient data.


### Robustness to Irrelevant Queries


Both approaches benefit from the guardrail logic, but RAG is inherently more robust due to its reliance on retrieval. Fine-tuned models may attempt to answer any question, but with a guardrail, they can gracefully handle out-of-domain queries.


### Practical Trade-Offs


- **RAG:** Best for scenarios where factual accuracy and adaptability to new data are critical, but with higher computational cost and slower inference.
- **Fine-Tuning:** Preferred for applications requiring fast, fluent responses within a well-defined domain, but may require frequent retraining to stay up-to-date.


**Conclusion:**


Fine-tuning a pre-trained language model like GPT-2 on a domain-specific dataset using SFTTrainer quickly teaches the model the domain's question-and-answer format, but the extended evaluation shows that fluent, high-confidence output does not guarantee correct figures. The simple keyword guardrail improves the system's robustness by deflecting clearly irrelevant queries before they reach the model. RAG offers stronger factual grounding and adaptability, while fine-tuning excels in speed and fluency. The choice between these approaches depends on the application's requirements for accuracy, speed, and domain coverage. Further improvements could involve expanding the training dataset, experimenting with different model architectures, or implementing more sophisticated guardrail mechanisms.