# Getting better results with fine-tuning

## What Is Fine-Tuning?

**Fine-tuning** is a transfer learning technique where a pre-trained model (like BERT, GPT, or ResNet) is further trained on a new, task-specific dataset to adapt it to a particular application. It leverages the general knowledge the model has already learned while specializing it for a specific use case.



## Why Fine-Tuning?

**Key Benefits:**

- **Requires Less Data:** Pre-trained models already understand general patterns (e.g., language structure for NLP, visual features for CV), so fine-tuning adapts them to a new task with far fewer examples than training from scratch.

- **Faster Training:** The model starts from learned weights, so it usually converges faster than training from random initialization.

- **Better Performance:** Fine-tuned models often outperform models trained only on the target dataset.

## How Fine-Tuning Works

**Step-by-Step Process:**

- **Start with a Pre-trained Model:** For example, bert-base-uncased (a general-purpose language model).

- **Modify the Model Head:** Replace the final layer to fit the new task (e.g., classification or regression).

- **Train on New Data:** Keep most of the model frozen (weights fixed) and update only the last few layers, or use LoRA for parameter-efficient tuning.

- **Evaluate & Deploy:** Test on a held-out dataset, then save the fine-tuned model for inference.

### Core Trainer Setup

#### Basic Instantiation

**model**: The PyTorch model to train (must be nn.Module)

**args**: Configured TrainingArguments object

**train_dataset**: Preprocessed training data (Dataset object)

**eval_dataset**: Optional validation data

In [None]:
# Do not run = example only
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    eval_strategy="epoch"  # Named evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

## Essential Training Arguments

Pro Tip: Use fp16=True for automatic mixed-precision training on GPUs.



In [None]:
# Do not run = example only
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./saved_models",  # Save directory
    num_train_epochs=3,          # Total epochs
    per_device_train_batch_size=16,  # Batch size
    learning_rate=2e-5,          # Initial LR
    eval_strategy="steps",       # When to evaluate (evaluation_strategy in older releases)
    eval_steps=500,              # Evaluate every N steps
    save_strategy="steps",       # Must match eval_strategy when load_best_model_at_end=True
    save_steps=500,              # Save every N steps (a multiple of eval_steps)
    load_best_model_at_end=True, # Reload the best checkpoint when training ends
    metric_for_best_model="accuracy"
)

## Adding Custom Metrics

**Supported Metrics:**

**Classification**: Accuracy, F1, Precision/Recall

**Regression**: MSE, MAE

**Custom**: Any function that accepts a single eval_pred argument (a predictions/labels pair) and returns a dict mapping metric names to values

In [None]:
# Do not run = example only

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        # custom_function is a placeholder for your own metric function
        "custom_metric": custom_function(logits, labels)
    }

trainer = Trainer(
    ...,
    compute_metrics=compute_metrics
)
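
As an illustration of the custom entry above, here is a minimal pure-Python sketch of a metric function with the same one-argument shape. The macro_f1 helper and the sample logits are hypothetical and not part of transformers; in a real run eval_pred would hold NumPy arrays, which iterate the same way.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def macro_f1(labels, predictions, num_classes):
    """Unweighted mean of per-class F1 (hypothetical helper)."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return {
        "accuracy": correct / len(labels),
        "macro_f1": macro_f1(labels, predictions, num_classes=len(logits[0])),
    }

# Made-up logits for four examples over three classes
sample_logits = [[2.3, 1.1, -0.4], [0.2, 1.5, 0.1], [0.0, 0.3, 2.0], [1.9, 0.4, 0.2]]
sample_labels = [0, 1, 2, 1]
metrics = compute_metrics((sample_logits, sample_labels))
print(metrics)
```

Macro F1 weights every class equally, so it can flag a class the model never predicts even when plain accuracy looks acceptable.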

## Callbacks & Extensions

In [None]:
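# Do not run = example only

from transformers import EarlyStoppingCallback
from transformers.integrations import TensorBoardCallback

trainer = Trainer(
    ...,
    callbacks=[
        # Needs load_best_model_at_end=True and metric_for_best_model in TrainingArguments
        EarlyStoppingCallback(early_stopping_patience=2),
        # Log directory is taken from TrainingArguments(logging_dir="./logs")
        TensorBoardCallback()
    ]
)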
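
To make the patience setting concrete, the sketch below mimics the counting logic in plain Python. It is a simplified illustration, not the library's implementation (the real callback also supports an improvement threshold); stopping_step and the accuracy history are hypothetical.

```python
def stopping_step(metric_history, patience=2, greater_is_better=True):
    """Return the index of the evaluation at which training would stop,
    or None if patience is never exhausted."""
    best = None
    bad_evals = 0
    for i, value in enumerate(metric_history):
        improved = best is None or (value > best if greater_is_better else value < best)
        if improved:
            best = value
            bad_evals = 0      # any improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Hypothetical accuracy after each evaluation
history = [0.50, 0.62, 0.61, 0.60, 0.70]
print(stopping_step(history, patience=2))
```

Here training would stop at the fourth evaluation (index 3), so the later jump to 0.70 is never reached. A larger patience trades extra compute for a lower chance of stopping too early.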

## Training Execution

In [None]:
trainer.train()  # Runs full training loop


## Evaluation & Prediction

In [None]:
eval_results = trainer.evaluate()
print(eval_results)

predictions = trainer.predict(test_dataset)
print(predictions.metrics)  # Test set metrics
print(predictions.predictions)  # Raw logits

**Key Packages:**

**transformers**: Provides pretrained models (BERT, DistilBERT, etc.) and training utilities

**datasets**: Handles dataset loading and preprocessing

**peft**: Enables Parameter-Efficient Fine-Tuning (LoRA)

**accelerate**: Optimizes training for GPUs/TPUs

**evaluate**: Standardized metric computation

**Why These Versions?**

transformers>=4.40.0: Ensures compatibility with LoRA implementations

peft>=0.10.0: Contains stable LoRA implementations

## GPU Verification

**torch.cuda.is_available():** Checks for NVIDIA GPU

**device variable:** Sets tensors to GPU (CUDA) or CPU

**Why Important?**

GPUs can speed up deep learning training dramatically (the ~50x figure is a rough ballpark; actual gains depend on the model and hardware)

Automatic fallback to CPU ensures code runs without GPU

In [16]:
# [Cell 1] - Package Installation
# -------------------------------
# Install necessary libraries with specific versions
# transformers: Main library for pretrained models
# datasets: For dataset handling
# peft: Parameter-Efficient Fine-Tuning
# accelerate: For distributed training
# evaluate: For metrics calculation
!pip install -q "transformers>=4.40.0" datasets "peft>=0.10.0" accelerate evaluate

## Stance Classification Dataset

Creates a labeled dataset for stance classification.

**Structure**:

**text**: Example sentences

**label**: 0 (Support), 1 (Neutral), 2 (Oppose)

**Key Operations:**

.train_test_split(test_size=0.25): 75/25 train-test split

seed=42: Ensures reproducible splits

**Dataset Size**:

6 training examples

2 test examples (minimum for demonstration)
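
The 6/2 sizes follow directly from the split fraction. Below is a standard-library sketch of a seeded split, assuming the test size is rounded up; split_indices is a hypothetical helper, and the specific examples it selects will differ from what the datasets library picks because the shuffling algorithm is not the same.

```python
import math
import random

def split_indices(n_examples, test_size=0.25, seed=42):
    """Shuffle indices with a fixed seed and carve off a test fraction."""
    n_test = math.ceil(n_examples * test_size)
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(8)
print(len(train_idx), len(test_idx))  # 6 2

# Re-running with the same seed reproduces the same split
print(split_indices(8) == (train_idx, test_idx))  # True
```

With only 2 test examples, a class can easily be missing from the test split, which is worth keeping in mind when reading per-class metrics later.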



In [17]:
# [Cell 2] - GPU Verification
# ---------------------------
# Check for GPU availability and set device
# CUDA enables much faster training on NVIDIA GPUs
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if device == 'cpu':
    print("Warning: Training will be very slow without GPU!")

Using device: cuda


## Dataset Preparation

Builds the small stance dataset described above and splits it into training and test sets.

In [18]:
# [Cell 3] - Dataset Preparation
# -----------------------------
# Create a small labeled dataset for stance classification
# 0=Support, 1=Neutral, 2=Oppose
# Using 25% test split gives 6 train, 2 test examples
from datasets import Dataset

examples = {
    "text": [
        "The policy is needed to combat climate change.",  # Support
        "This environmental initiative is crucial",        # Support
        "I'm undecided about this policy",                # Neutral
        "Need more information to decide",                # Neutral
        "This infringes on personal rights",              # Oppose
        "Government overreach must be stopped",           # Oppose
        "This legislation will save lives",               # Support
        "The proposal has some merits but needs work"     # Neutral
    ],
    "label": [0, 0, 1, 1, 2, 2, 0, 1]
}

dataset = Dataset.from_dict(examples).train_test_split(
    test_size=0.25,  # 25% for testing
    seed=42          # Fixed random seed for reproducibility
)
print(f"Train samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")

Train samples: 6
Test samples: 2


## Model

Initializes a pretrained model with LoRA adaptation.

**Key Components:**

**Base Model:** distilbert-base-uncased

Lightweight version of BERT

~66M parameters (vs BERT's 110M)


**LoRA Configuration:**

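r=8: Rank of the low-rank update matrices (smaller = fewer trainable parameters)

lora_alpha=16: Scaling factor; the LoRA update is multiplied by lora_alpha / r (here 16 / 8 = 2)

target_modules: Applies LoRA to the attention query/value projections (q_lin and v_lin in DistilBERT), a common default choice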


**Classification Head:**

num_labels=3: Matches our 3-class task
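
To show what the rank and scaling factor do, here is a toy pure-Python sketch of the LoRA update W + (lora_alpha / r) * B @ A. The 3x3 sizes and all values are made up for readability; the rank-1, alpha-2 pair gives the same scale of 2 as r=8, lora_alpha=16.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, lora_a, lora_b, r, lora_alpha):
    """Effective weight W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes (hypothetical): a 3x3 frozen weight with rank-1 adapters
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[0.5, -0.5, 1.0]]             # shape (r, d_in) = (1, 3)
lora_b_init = [[0.0], [0.0], [0.0]]     # shape (d_out, r) = (3, 1), zeros at init
lora_b_trained = [[1.0], [0.0], [2.0]]  # made-up values after some training

# With B = 0 the adapter is a no-op, so training starts from the pretrained behavior
print(lora_weight(w, lora_a, lora_b_init, r=1, lora_alpha=2) == w)  # True

# After training, B @ A adds a rank-1 correction scaled by lora_alpha / r = 2
print(lora_weight(w, lora_a, lora_b_trained, r=1, lora_alpha=2))
```

Only A and B receive gradients; the frozen weight W is never modified, which is why the adapter can be saved and loaded separately from the base model.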



In [19]:
# [Cell 4] - Model Initialization
# -------------------------------
# Using DistilBERT for efficiency with LoRA adaptation
from peft import LoraConfig, get_peft_model

model_name = "distilbert-base-uncased"  # Lightweight BERT variant
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize with 3 output classes (Support/Neutral/Oppose)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # Must match our label space
).to(device)

# LoRA Configuration:
# r=8: Rank of low-rank adaptation matrices
# lora_alpha=16: Scaling factor for LoRA weights
# target_modules: Apply to query and value layers
peft_config = LoraConfig(
    task_type="SEQ_CLS",  # Sequence Classification
    r=8,                 # LoRA rank
    lora_alpha=16,       # LoRA alpha
    lora_dropout=0.1,    # Dropout for LoRA layers
    target_modules=["q_lin", "v_lin"]  # Apply to query/value weights
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # Show % of trainable params

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 740,355 || all params: 67,696,134 || trainable%: 1.0936
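
The trainable count printed above can be reproduced by hand. The arithmetic below assumes DistilBERT's hidden size of 768 and 6 transformer layers, and that the classification head (pre_classifier and classifier) stays fully trainable alongside the LoRA matrices, which is consistent with the printed total.

```python
hidden = 768        # DistilBERT hidden size
layers = 6          # DistilBERT transformer layers
r = 8               # LoRA rank from the config
num_labels = 3

# Each adapted linear layer gets A (r x hidden) and B (hidden x r)
lora_per_module = r * hidden + hidden * r
lora_total = lora_per_module * 2 * layers   # q_lin and v_lin in every layer

# The classification head is trained in full (weights + biases)
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

trainable = lora_total + pre_classifier + classifier
print(f"{trainable:,}")                       # 740,355
print(f"{100 * trainable / 67_696_134:.4f}")  # 1.0936
```

Most of the trainable parameters (about 593K of 740K) sit in the classification head; the LoRA matrices themselves add only 147,456.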


## Tokenization

Converts text to numeric tokens the model understands.

**Tokenization Process:**

**Padding**: Fills shorter sequences with the padding token ([PAD], ID 0) up to max_length

**Truncation**: Cuts sequences longer than 512 tokens

**Special Tokens**: Adds [CLS], [SEP] automatically


**Why Important?**

Sequences in a batch must share one length so they can be stacked into a single numeric tensor

Batched processing improves speed
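
The padding and truncation rules can be sketched without the tokenizer. The snippet below is an illustration only: encode is a hypothetical helper, the word-piece IDs are made up, and max_length is set to 6 instead of 512 to keep the output readable. The special token IDs (101, 102, 0) are those of the BERT uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in the BERT uncased vocabulary

def encode(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad with PAD_ID."""
    body = token_ids[: max_length - 2]          # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 = real token
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding, ignored by attention
    }

# Hypothetical word-piece IDs for one short and one long sentence
short_seq = encode([2023, 3343], max_length=6)
long_seq = encode([2023, 3343, 2003, 2734, 2000, 4337], max_length=6)
print(short_seq)
print(long_seq)
```

Both results have exactly 6 entries, so they can be stacked into one batch; the attention mask tells the model which positions are padding.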

In [20]:
# [Cell 5] - Tokenization
# -----------------------
# Convert text to model-compatible token IDs
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",  # Pad to max length
        truncation=True,       # Truncate long sequences
        max_length=512         # BERT's max input size
    )

# Apply to both train and test sets
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/6 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

## Training Configuration

**Key Parameters Explained:**

Batch size 8 fits comfortably in 16GB GPU memory

Step-based eval/save is more flexible than epoch-based

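2e-4 LR is a common starting point for LoRA; because the pretrained weights stay frozen and only the adapters and head are updated, the risk of catastrophic forgetting is limited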

Disabled W&B reporting (report_to="none") for simplicity
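
It helps to check how many optimizer steps these settings produce on the toy dataset. The arithmetic below assumes a single device and no gradient accumulation, and uses the 6 training examples from the split above.

```python
import math

train_examples = 6
batch_size = 8
epochs = 3
eval_steps = 20
save_steps = 40

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)                 # 3
print(total_steps // eval_steps)   # 0 evaluations triggered
print(total_steps // save_steps)   # 0 step-based checkpoints
```

With only 3 steps in total, step-based evaluation, checkpointing, and logging never fire, so early stopping and best-model selection have nothing to act on. On a dataset this small, much lower eval_steps and save_steps values (or epoch-based strategies) would be needed to see them in action.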



In [21]:
# [Cell 6] - Training Configuration
# ---------------------------------
# Critical hyperparameters for the training process
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",          # Save directory
    per_device_train_batch_size=8,  # Batch size per GPU
    per_device_eval_batch_size=8,   # Eval batch size
    eval_strategy="steps",          # Evaluate every N steps
    eval_steps=20,                  # Evaluate every 20 steps
    save_strategy="steps",          # Save checkpoints by steps
    save_steps=40,                  # Save every 40 steps
    logging_steps=10,               # Log every 10 steps
    learning_rate=2e-4,             # Common starting LR for LoRA adapters
    num_train_epochs=3,             # Total training epochs
    load_best_model_at_end=True,    # Keep best checkpoint
    metric_for_best_model="accuracy",
    report_to="none",               # Disable external logging
    remove_unused_columns=False     # Required for some versions
)

## Metrics Setup

- **Raw Logits (Model Output)**

After input text passes through the model, it outputs logits: raw, unnormalized prediction scores for each class (e.g., Support, Neutral, Oppose).

Example: [2.3, 1.1, -0.4] → higher values suggest higher model confidence.

- **Argmax (Pick the Top Score)**

We use argmax to find the index of the highest score → this becomes the predicted label.

In our example: argmax([2.3, 1.1, -0.4]) = 0 → label 0 = Support

- **Compare to Ground Truth**

We compare each predicted label to the true label from the dataset.

If predicted label == true label → it’s counted as a correct prediction.

- **Calculate Accuracy**

Accuracy is computed as:

Accuracy = Number of Correct Predictions / Total Number of Predictions

This gives a simple metric of model performance: "What fraction of the predictions were correct?"
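
The steps above can be traced on a small batch in plain Python. The first row of logits is the example from the text; the remaining rows and the true labels are made up for illustration.

```python
label_names = ["Support", "Neutral", "Oppose"]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

# First row is the example from the text; the others are made up
logits = [
    [2.3, 1.1, -0.4],
    [0.1, 0.2, 1.7],
    [0.9, 1.4, 0.3],
    [1.2, 0.8, 0.1],
]
true_labels = [0, 2, 1, 2]

predicted = [argmax(row) for row in logits]                   # pick the top score per row
correct = sum(p == t for p, t in zip(predicted, true_labels)) # compare to ground truth
accuracy = correct / len(true_labels)

print([label_names[p] for p in predicted])
print(accuracy)
```

Three of the four predictions match the labels, giving an accuracy of 0.75; the last row is predicted as Support while its label is Oppose.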

In [22]:
# [Cell 7] - Metrics Setup
# ------------------------
# Configure evaluation metrics
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Convert logits to predictions
    return metric.compute(
        predictions=predictions,
        references=labels
    )

## Trainer setup

[Trainer]

├── Model

├── Training Args

├── Train Data

├── Eval Data

├── Metrics

└── Callbacks

Trainer class handles the entire training loop

Early stopping prevents overfitting (patience=2 evaluations)

Callbacks can add TensorBoard, checkpointing, etc.

All components from previous cells come together here

In [23]:
# [Cell 8] - Trainer Initialization
# --------------------------------
# Setup the complete training pipeline
from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=2  # Stop if no improvement for 2 evals
        )
    ]
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## Training Execution

- Automatically handles:

  - Gradient accumulation

  - Mixed precision training

  - Checkpoint saving

- Progress bars show real-time metrics

- Training loss should trend downward, though it can fluctuate from step to step

- Validation metrics may fluctuate

In [24]:
# [Cell 9] - Model Training
# -------------------------
# Execute the training process
trainer.train()  # This runs the full training loop

Step,Training Loss,Validation Loss


TrainOutput(global_step=3, training_loss=1.059674898783366, metrics={'train_runtime': 1.3207, 'train_samples_per_second': 13.63, 'train_steps_per_second': 2.272, 'total_flos': 2425394368512.0, 'train_loss': 1.059674898783366, 'epoch': 3.0})

## Prediction Function



In [34]:
# [Cell 10] - Prediction Function
# ------------------------------
# Create a reusable prediction function
def predict(text):
    # 1. Tokenization
    inputs = tokenizer(
        text,
        return_tensors="pt",           # Return PyTorch tensors
        truncation=True,               # Cut excess tokens
        max_length=512                 # BERT's max capacity
    ).to(device)                       # GPU/CPU placement

    # 2. Inference Mode
    model.eval()                       # Disable dropout so output is deterministic
    with torch.no_grad():              # Disable gradient tracking
        outputs = model(**inputs)      # Forward pass

    # 3. Probability Conversion
    probs = torch.nn.functional.softmax(
        outputs.logits,                # Raw model outputs
        dim=-1                         # Normalize across classes
    )

    # 4. Result Formatting
    return {
        "text": text,
        "prediction": ["Support", "Neutral", "Oppose"][torch.argmax(probs).item()],
        "confidence": torch.max(probs).item()  # Highest class probability
    }

# Test prediction
print(predict("This policy is fundamentally flawed"))

{'text': 'This policy is fundamentally flawed', 'prediction': 'Support', 'confidence': 0.42485958337783813}


## Model Saving & Loading

safe_serialization=True: Prevents pickle vulnerabilities (uses the safetensors format)

PeftConfig: Preserves the adapter architecture; contains the LoRA rank and target modules

num_labels=3: Must match the training setup; critical for correct head initialization

**Best Practice:** Always save both model AND tokenizer to ensure compatible inference later.

When **safe_serialization=True** is set in save_pretrained(), the model weights are saved in the safetensors format instead of Python's native pickle format. Loading a pickle file can execute arbitrary code, whereas safetensors stores only tensor data.

In [26]:
# [Cell 11] - Model Saving/Loading
# -------------------------------
# Save the fine-tuned model for later use
model.save_pretrained("./policy-stance-lora", safe_serialization=True)
tokenizer.save_pretrained("./policy-stance-lora")

# Proper loading procedure
from peft import PeftModel, PeftConfig

# 1. Load config to get base model info
config = PeftConfig.from_pretrained("./policy-stance-lora")

# 2. Initialize base model with correct class count
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=3  # Must match training setup
)

# 3. Load the PEFT adapter weights
loaded_model = PeftModel.from_pretrained(base_model, "./policy-stance-lora")
loaded_model = loaded_model.to(device)  # Move to GPU if available

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Evaluation Setup

**Class-wise Metrics**:  Precision/Recall/F1

**Confidence Tracking**: Average prediction certainty

**Label Alignment**: Exact match accuracy

**Note**: Uses scikit-learn's report for comprehensive metrics beyond basic accuracy.



In [29]:
# [Cell 12] - Evaluation Setup
# ---------------------------
# Install additional evaluation tools
!pip install -q scikit-learn

from sklearn.metrics import classification_report
import numpy as np

def evaluate_model(model, tokenizer, dataset):
    label_names = ["Support", "Neutral", "Oppose"]
    true_labels = dataset["label"]
    preds = []
    confs = []

    for text in dataset["text"]:
        result = predict(model, tokenizer, text)
        preds.append(label_names.index(result["prediction"]))
        confs.append(result["confidence"])

    print("Classification Report:")
    print(classification_report(true_labels, preds, target_names=label_names))
    print(f"Average Confidence: {np.mean(confs):.1%}")
    return preds

In [30]:
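# Cell 13: Prediction Function
# Redefines predict() from Cell 10 to take the model and tokenizer explicitly
def predict(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    model.eval()  # Disable dropout so predictions are deterministic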
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    return {
        "text": text,
        "prediction": ["Support", "Neutral", "Oppose"][torch.argmax(probs).item()],
        "confidence": torch.max(probs).item()
    }

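## Pre-Training Evaluation
Baseline Assessment

Note: Cell 9 above has already called trainer.train() once, so the numbers below are not a true untrained baseline. For a clean comparison, run this evaluation before any training.

**Interpretation**: A model with a randomly initialized classification head typically shows:

~33% accuracy for 3-class problems

High recall on frequent classes

Zero precision where no predictions hit the class

With only 2 test examples, accuracy can only be 0%, 50%, or 100%, so these numbers are very noisy.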

In [31]:
# Cell 14: Run Evaluations
print("=== Before Training ===")
baseline_preds = evaluate_model(model, tokenizer, dataset["test"])

=== Before Training ===
Classification Report:
              precision    recall  f1-score   support

     Support       0.00      0.00      0.00       0.0
     Neutral       0.00      0.00      0.00       1.0
      Oppose       0.00      0.00      0.00       1.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0

Average Confidence: 38.7%




## Training Execution

**Behind the Scenes:**

**Forward Pass**: Computes logits

**Loss Calculation**: Cross-entropy between logits/labels

**Backward Pass**: Gradients are computed only for the trainable parameters (the LoRA matrices and the classification head)

**Weight Update**: AdamW optimizer step

**Key Monitoring Metrics:**

**Training Loss**: Should trend downward, though not necessarily at every step

**Eval Accuracy**: Should increase with oscillations

**GPU Utilization:**  Should be >80% for efficient training
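
The loss calculation step can be worked through by hand. Below is a minimal pure-Python sketch of softmax followed by cross-entropy for a single example; the helper names are hypothetical, and the logits reuse the illustrative values from the metrics section.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

# A model with no preference among 3 classes scores ln(3)
print(round(cross_entropy([0.0, 0.0, 0.0], label=1), 4))   # 1.0986

# Confident and correct gives a small loss; confident and wrong gives a large one
print(round(cross_entropy([2.3, 1.1, -0.4], label=0), 4))
print(round(cross_entropy([2.3, 1.1, -0.4], label=2), 4))
```

The training losses reported in this notebook (about 1.06 and 1.00) are close to ln(3) ≈ 1.0986, which suggests that after 3 steps the model is still predicting near-uniform probabilities across the three classes.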



In [32]:
print("\n=== Training ===")
trainer.train()



=== Training ===


Step,Training Loss,Validation Loss


TrainOutput(global_step=3, training_loss=0.9966476758321127, metrics={'train_runtime': 1.4745, 'train_samples_per_second': 12.208, 'train_steps_per_second': 2.035, 'total_flos': 2425394368512.0, 'train_loss': 0.9966476758321127, 'epoch': 3.0})

## Post-Training Evaluation

**Analysis Framework:**

**Quantitative Improvement:**

Absolute accuracy delta (e.g., +45%)

Confidence increase (e.g., 33% → 85%)

**Qualitative Checks:**

Misclassification patterns

Class-wise performance gaps



In [33]:
print("\n=== After Training ===")
trained_preds = evaluate_model(model, tokenizer, dataset["test"])

# Improvement Analysis
improvement = (np.array(trained_preds) == np.array(dataset["test"]["label"])).mean() - \
              (np.array(baseline_preds) == np.array(dataset["test"]["label"])).mean()
print(f"\nAbsolute Accuracy Improvement: {improvement:.1%}")


=== After Training ===
Classification Report:
              precision    recall  f1-score   support

     Support       0.00      0.00      0.00       0.0
     Neutral       0.00      0.00      0.00       1.0
      Oppose       0.00      0.00      0.00       1.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0

Average Confidence: 43.2%

Absolute Accuracy Improvement: 0.0%


