# Fine-Tuning distilgpt2 on Synthetic Banking Data

This notebook demonstrates how to fine-tune the small language model `distilgpt2` using a synthetic banking Q&A dataset. The workflow includes data loading, preprocessing, model setup, training, evaluation, and documentation of each step.

## 1. Load and Prepare Dataset

We load the synthetic banking Q&A dataset and prepare it for training. The dataset is a JSON list of records, each with an `instruction` and a `response` field, which we join into a single text per example for conversational fine-tuning.

In [None]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Load synthetic banking dataset
with open('banking_synthetic.json', 'r') as f:
    data = json.load(f)

# Convert to a DataFrame for quick inspection (not used by the training code below)
df = pd.DataFrame(data)

# Combine instruction and response for conversational fine-tuning
# For distilgpt2, we concatenate instruction and response as a single text
texts = [f"Instruction: {row['instruction']}\nResponse: {row['response']}" for row in data]

# Split into train and test sets
train_texts, test_texts = train_test_split(texts, test_size=0.2, random_state=42)

print(f"Training samples: {len(train_texts)}")
print(f"Test samples: {len(test_texts)}")

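The loading cell assumes that the file exists at the given path and that every record carries both an `instruction` and a `response` key. A minimal sketch of a more defensive loader is shown here. `load_pairs` and `format_pair` are hypothetical helper names, and the demo records are made up for illustration rather than taken from the dataset.

```python
import json
import tempfile
from pathlib import Path


def format_pair(record):
    """Join one record into the prompt format used in this notebook."""
    return f"Instruction: {record['instruction']}\nResponse: {record['response']}"


def load_pairs(path):
    """Load records from a JSON list, keeping only complete pairs."""
    path = Path(path)
    if not path.is_file():
        # Fail early with the resolved path so a wrong working directory is obvious.
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    with path.open("r", encoding="utf-8") as f:
        records = json.load(f)
    return [
        r for r in records
        if isinstance(r, dict) and r.get("instruction") and r.get("response")
    ]


# Demo with made-up records written to a temporary file.
demo_records = [
    {"instruction": "How do I reset my PIN?", "response": "Use the mobile app settings."},
    {"instruction": "Missing response field"},  # dropped by load_pairs
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json"
    demo_path.write_text(json.dumps(demo_records), encoding="utf-8")
    demo_pairs = load_pairs(demo_path)

demo_texts = [format_pair(r) for r in demo_pairs]
print(demo_texts[0])
```

With the real file, `texts = [format_pair(r) for r in load_pairs('banking_synthetic.json')]` would stand in for the list comprehension in the cell above.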

## 2. Define Model Architecture

We use the Hugging Face `transformers` library to load the pre-trained `distilgpt2` model and its tokenizer. `distilgpt2` is a distilled version of GPT-2, a causal language model, so it fits text generation tasks such as producing a response after an instruction prompt.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example tokenization
inputs = tokenizer(train_texts[0], return_tensors="pt")
print(inputs.input_ids.shape)
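The training setup later in this notebook packs text into 128-token blocks, so it helps to know how long the formatted examples are once tokenized. The sketch below assumes a hypothetical helper, `length_stats`, that accepts any `encode` callable. A whitespace splitter stands in for the real tokenizer so the example runs without downloading a model, and the sample texts are invented.

```python
def length_stats(texts, encode, block_size=128):
    """Summarize tokenized lengths for a list of texts.

    `encode` is any callable mapping a string to a sequence of tokens,
    for example `tokenizer.encode` from the cell above.
    """
    lengths = [len(encode(t)) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "over_block": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "over_block": sum(1 for n in lengths if n > block_size),
    }


# Stand-in encoder: whitespace splitting is NOT how GPT-2 tokenizes,
# it only illustrates the call shape.
sample_texts = ["Instruction: a b c\nResponse: d e", "Instruction: x\nResponse: y"]
stats = length_stats(sample_texts, str.split, block_size=4)
print(stats)
```

With the real tokenizer the call would be `length_stats(train_texts, tokenizer.encode)`.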

## 3. Configure Training Parameters

Set up training parameters such as learning rate, batch size, and number of epochs. The optimizer is not configured explicitly, so the `Trainer` default (AdamW) applies. We use the `Trainer` API from Hugging Face for simplicity.

In [None]:
from transformers import Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling
import torch

# Save train and test texts to files for TextDataset
# (TextDataset is marked deprecated in recent transformers releases)
with open("train.txt", "w") as f:
    for line in train_texts:
        f.write(line + "\n")
with open("test.txt", "w") as f:
    for line in test_texts:
        f.write(line + "\n")

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train.txt",
    block_size=128
)
test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="test.txt",
    block_size=128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_steps=10,
)
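`TextDataset` does not keep one example per Q&A pair. It tokenizes the whole file as one stream and cuts it into fixed-size blocks, discarding a trailing partial block. The plain-Python sketch below mirrors that chunking to make the behavior visible. `chunk_token_ids` is a hypothetical name, and the ids are placeholders rather than real tokenizer output.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split one long token sequence into full blocks of `block_size`.

    A trailing remainder shorter than `block_size` is dropped, so every
    block has the same length and no padding is required.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Stand-in for the concatenated token ids of a corpus (300 fake ids).
fake_ids = list(range(300))
blocks = chunk_token_ids(fake_ids, block_size=128)
print(len(blocks), [len(b) for b in blocks])  # 2 [128, 128]
```

One consequence is that a test file with fewer than 128 tokens produces zero blocks, which would leave evaluation with nothing to run on.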


## 4. Train the Model

We use the Hugging Face `Trainer` to fine-tune the model on our dataset. With the arguments above, training loss is logged every 10 steps, and evaluation and checkpointing run at the end of each epoch.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)

trainer.train()

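# Save the trained model
trainer.save_model("./output/distilgpt2-banking-finetuned")
# The tokenizer was not passed to the Trainer, so save it alongside the model
tokenizer.save_pretrained("./output/distilgpt2-banking-finetuned")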
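To get a feel for how long `trainer.train()` will run, the number of optimizer steps can be estimated from the dataset size and the training arguments. The sketch below assumes a single device, no gradient accumulation, and no dropped final batch. `total_optimizer_steps` is a hypothetical helper, and the figure of 1000 training blocks is an invented example, not a measurement from this dataset.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# num_examples counts 128-token blocks, not Q&A pairs.
example_blocks = 1000  # hypothetical value
steps = total_optimizer_steps(example_blocks, batch_size=8, epochs=3)
log_events = steps // 10  # logging_steps=10 in the training arguments
print(steps, log_events)  # 375 37
```

The actual count is available as `len(train_dataset)` once the dataset has been built.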

## 5. Evaluate Model Performance

After training, we evaluate the model on the test set by generating responses for a few test prompts, so the outputs can be compared by eye with the expected responses.
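A quantitative complement to reading samples is perplexity, the exponential of the mean cross-entropy loss on the test set. `trainer.evaluate()` returns a dictionary that includes `eval_loss`, so the conversion is a single line. The sketch below uses an illustrative loss value, not a result from this notebook, and `perplexity_from_loss` is a hypothetical helper name.

```python
import math


def perplexity_from_loss(eval_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


# Illustrative value only, not a measured result.
example_metrics = {"eval_loss": 2.0}
example_ppl = perplexity_from_loss(example_metrics["eval_loss"])
print(round(example_ppl, 2))  # 7.39
```

In the notebook this would be `perplexity_from_loss(trainer.evaluate()["eval_loss"])`. Lower values mean the model assigns higher probability to the held-out text.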

In [None]:
# Generate sample outputs from the fine-tuned model
for i in range(3):
    input_text = test_texts[i].split('Response:')[0] + 'Response:'
    # Move inputs to the model's device; the Trainer may have placed the model on a GPU
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    # GPT-2 has no pad token, so reuse EOS to avoid a generate() warning
    output_ids = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True,
                                pad_token_id=tokenizer.eos_token_id)
    print(f"Input: {input_text}")
    print(f"Generated: {tokenizer.decode(output_ids[0], skip_special_tokens=True)}\n")
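The generation cell prints outputs but does not score them against the expected responses. One simple way to do that, sketched below under stated assumptions, is to strip the prompt from the decoded text and compute a token-overlap F1 against the reference. This metric is swapped in here for illustration and is not part of the original notebook. `extract_response` and `token_f1` are hypothetical helpers, and the strings are invented examples.

```python
from collections import Counter


def extract_response(generated_text):
    """Return the text after the first 'Response:' marker, stripped."""
    _, sep, tail = generated_text.partition("Response:")
    return tail.strip() if sep else generated_text.strip()


def token_f1(prediction, reference):
    """Whitespace-token F1 between a prediction and a reference (case-insensitive)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a perfect match, otherwise no match.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented strings standing in for a decoded generation and its reference.
generated = "Instruction: How do I reset my PIN?\nResponse: Open the app and choose reset PIN."
expected = "Open the app and select reset PIN."
score = token_f1(extract_response(generated), expected)
print(round(score, 3))  # 0.857
```

Token overlap rewards shared wording rather than correctness, so it is best read alongside the printed samples rather than instead of them.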

## 6. Document the Training Process

This notebook walks through the workflow for fine-tuning a small LLM (`distilgpt2`) on a synthetic banking dataset. Each section is documented with explanations and code comments. You can adjust hyperparameters, dataset, and model as needed for your own experiments.

**Workflow Summary:**
- Load and preprocess data
- Define and load model architecture
- Configure training parameters
- Train the model using Hugging Face Trainer
- Evaluate model performance with sample outputs
- Document each step for reproducibility and understanding
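For the reproducibility point in the summary, one lightweight option is to write the run's settings to a JSON file next to the model output. The sketch below assumes a hypothetical helper, `write_run_record`, and a hypothetical file name `run_record.json`. The hyperparameter values are the ones used in this notebook.

```python
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_run_record(path, hyperparameters, notes=""):
    """Write hyperparameters and basic environment info to a JSON file."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "hyperparameters": hyperparameters,
        "notes": notes,
    }
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return record


# Values copied from the cells in this notebook.
hyperparameters = {
    "model_name": "distilgpt2",
    "block_size": 128,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 5e-5,
    "test_size": 0.2,
    "random_state": 42,
}

# A temporary directory keeps the demo free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    record_path = Path(tmp) / "run_record.json"
    write_run_record(record_path, hyperparameters, notes="demo")
    reloaded = json.loads(record_path.read_text(encoding="utf-8"))

print(reloaded["hyperparameters"]["learning_rate"])
```

In a real run the path could point into `./output` so the record travels with the saved checkpoints.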