# Lightweight Fine-Tuning Project
This notebook fine-tunes DistilBERT on GLUE/SST-2 with LoRA and compares the adapted model against the untuned baseline.

* PEFT technique: LoRA applied to the attention query/value projections
* Model: distilbert-base-uncased with a sequence classification head
* Evaluation approach: Hugging Face Trainer with the accuracy metric on the SST-2 validation set
* Fine-tuning dataset: GLUE/SST-2 with 1,000 shuffled training samples and the full validation split
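
To make the LoRA bullet concrete, here is a minimal NumPy sketch of a low-rank update on a single 768x768 projection, using the rank and scaling values that appear later in this notebook. The weight matrices are random stand-ins, the initialization is simplified, and dropout is omitted, so this illustrates the idea rather than reproducing PEFT's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 768, 768    # size of one DistilBERT attention projection
r, lora_alpha = 16, 32    # rank and scaling used later in this notebook

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable down-projection (simplified init)
B = np.zeros((d_out, r))                      # trainable up-projection, zero at the start


def lora_forward(x):
    """Frozen projection plus the scaled low-rank update (bias and dropout omitted)."""
    return x @ W.T + (lora_alpha / r) * (x @ A.T @ B.T)


x = rng.normal(size=(4, d_in))
base_out = x @ W.T

# With B == 0 the adapter contributes nothing, so training starts from the base model.
starts_as_noop = bool(np.allclose(lora_forward(x), base_out))

lora_params = A.size + B.size   # parameters that would be trained
full_params = W.size            # parameters that stay frozen
print(starts_as_noop, lora_params, full_params)
```

Only `A` and `B` would receive gradients, which is why the adapter for one projection needs far fewer parameters than the projection itself.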


## Loading and Evaluating a Foundation Model

The cells below load the pre-trained Hugging Face model, its tokenizer, and the dataset, then evaluate the model before any fine-tuning.

In [2]:
import os
import random
import numpy as np
import torch
from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    AutoPeftModelForSequenceClassification,
)

BASELINE_OUTPUT_DIR = "/tmp/distilbert-baseline"
PEFT_TRAINING_OUTPUT_DIR = "/tmp/distilbert-lora"
PEFT_ADAPTER_OUTPUT_DIR = "/tmp/distilbert-sst2-lora"


def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cpu')

In [3]:
MAX_TRAIN_SAMPLES = 1000

raw_datasets = load_dataset("glue", "sst2")
train_dataset = raw_datasets["train"].shuffle(seed=42).select(range(MAX_TRAIN_SAMPLES))
eval_dataset = raw_datasets["validation"]

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")


Training samples: 1000
Validation samples: 872


In [4]:
checkpoint_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)


def preprocess_function(batch):
    return tokenizer(batch["sentence"], truncation=True)


columns_to_remove = [col for col in ["sentence", "idx"] if col in train_dataset.column_names]
train_tokenized = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=columns_to_remove,
)
columns_to_remove_eval = [col for col in ["sentence", "idx"] if col in eval_dataset.column_names]
eval_tokenized = eval_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=columns_to_remove_eval,
)

train_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
eval_tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
accuracy_metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
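
As a rough illustration of what `compute_metrics` returns, the sketch below computes accuracy directly with NumPy on a tiny hand-made batch. The helper name `accuracy_from_logits` and the toy values are hypothetical; the notebook itself relies on the `evaluate` accuracy metric.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the labels."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))


# Toy values for illustration only (not taken from SST-2).
toy_logits = np.array([
    [2.0, -1.0],   # predicts class 0
    [0.1, 0.3],    # predicts class 1
    [-0.5, 0.4],   # predicts class 1
    [1.2, 0.9],    # predicts class 0
])
toy_labels = np.array([0, 1, 0, 0])

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```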


In [5]:
baseline_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint_name, num_labels=2
)

baseline_args = TrainingArguments(
    output_dir=BASELINE_OUTPUT_DIR,
    per_device_eval_batch_size=32,
    report_to="none",
    do_train=False,
    do_eval=True,
    seed=42,
)

baseline_trainer = Trainer(
    model=baseline_model,
    args=baseline_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
baseline_metrics = baseline_trainer.evaluate()
print("Baseline metrics:", baseline_metrics)
baseline_metrics




Baseline metrics: {'eval_loss': 0.6972277164459229, 'eval_model_preparation_time': 0.0023, 'eval_accuracy': 0.43463302752293576, 'eval_runtime': 15.2714, 'eval_samples_per_second': 57.1, 'eval_steps_per_second': 1.833}


{'eval_loss': 0.6972277164459229,
 'eval_model_preparation_time': 0.0023,
 'eval_accuracy': 0.43463302752293576,
 'eval_runtime': 15.2714,
 'eval_samples_per_second': 57.1,
 'eval_steps_per_second': 1.833}
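
One way to read the baseline numbers: the classification head is newly initialized (see the warning above), so the model is effectively guessing. The short check below compares the reported loss with the cross-entropy of a uniform guess over two classes. The reported value is copied from the output above; everything else is plain arithmetic.

```python
import math

num_labels = 2

# Cross-entropy of a uniform guess over K classes is -log(1 / K) = log(K).
chance_loss = math.log(num_labels)

reported_eval_loss = 0.6972277164459229   # copied from the baseline output above
gap = reported_eval_loss - chance_loss

print(f"chance-level loss: {chance_loss:.4f}")
print(f"reported loss:     {reported_eval_loss:.4f} (gap {gap:+.4f})")
```

The small gap suggests the untuned model carries almost no task signal, which fits an accuracy below 0.5.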

## Performing Parameter-Efficient Fine-Tuning

The cells below wrap the loaded model in a LoRA adapter, run the training loop, and save the PEFT adapter weights.

In [7]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_lin", "v_lin"],
)

peft_base_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint_name, num_labels=2
)
peft_model = get_peft_model(peft_base_model, lora_config)
peft_model.print_trainable_parameters()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 887,042 || all params: 67,842,052 || trainable%: 1.3075
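
The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard DistilBERT-base shapes (hidden size 768, 6 layers) and assumes that, for a sequence classification task, PEFT also keeps the `pre_classifier` and `classifier` layers trainable. Under those assumptions the arithmetic matches the reported figure.

```python
hidden = 768            # DistilBERT hidden size
num_layers = 6          # transformer blocks
targets_per_layer = 2   # q_lin and v_lin
r = 16                  # LoRA rank from lora_config
num_labels = 2

# Each adapted projection gets A (r x hidden) and B (hidden x r).
lora_per_module = r * hidden + hidden * r
lora_total = lora_per_module * targets_per_layer * num_layers

# Classification head: weights plus biases (assumed trainable alongside LoRA).
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels
head_total = pre_classifier + classifier

trainable = lora_total + head_total
all_params = 67_842_052   # copied from the output above
trainable_pct = 100 * trainable / all_params

print(f"LoRA: {lora_total:,}  head: {head_total:,}  trainable: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

Note that about two thirds of the trainable parameters come from the classification head rather than from the LoRA matrices.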


In [8]:
peft_training_args = TrainingArguments(
    output_dir=PEFT_TRAINING_OUTPUT_DIR,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    num_train_epochs=3,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    report_to="none",
    seed=42,
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)




In [9]:
peft_train_result = peft_trainer.train()
peft_train_result




Epoch,Training Loss,Validation Loss,Accuracy
1,0.6633,0.658708,0.510321
2,0.552,0.505485,0.822248
3,0.4206,0.434994,0.823394




TrainOutput(global_step=189, training_loss=0.5651135230190539, metrics={'train_runtime': 163.31, 'train_samples_per_second': 18.37, 'train_steps_per_second': 1.157, 'total_flos': 27203244063360.0, 'train_loss': 0.5651135230190539, 'epoch': 3.0})
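
The step count and throughput in `TrainOutput` follow from the training arguments. The sketch below redoes that arithmetic using values copied from the cells above. The warmup line assumes the warmup ratio is rounded up to a whole number of steps, which is an assumption about Trainer's behavior rather than something shown in the output.

```python
import math

train_samples = 1000
batch_size = 16
grad_accum = 1
epochs = 3
warmup_ratio = 0.1
train_runtime = 163.31   # seconds, copied from TrainOutput above

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs

# Assumption: the warmup ratio is converted to steps by rounding up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

samples_per_second = train_samples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps, warmup_steps)
print(f"{samples_per_second:.2f} samples/s, {steps_per_second:.3f} steps/s")
```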

In [9]:
finetuned_metrics = peft_trainer.evaluate()
print("Fine-tuned metrics:", finetuned_metrics)
finetuned_metrics




Fine-tuned metrics: {'eval_loss': 0.43499407172203064, 'eval_accuracy': 0.823394495412844, 'eval_runtime': 14.578, 'eval_samples_per_second': 59.816, 'eval_steps_per_second': 1.921, 'epoch': 3.0}


{'eval_loss': 0.43499407172203064,
 'eval_accuracy': 0.823394495412844,
 'eval_runtime': 14.578,
 'eval_samples_per_second': 59.816,
 'eval_steps_per_second': 1.921,
 'epoch': 3.0}

### Storage note

All model artifacts are written under /tmp to keep the workspace footprint small.


In [10]:
os.makedirs(PEFT_ADAPTER_OUTPUT_DIR, exist_ok=True)
peft_model.save_pretrained(PEFT_ADAPTER_OUTPUT_DIR)
tokenizer.save_pretrained(PEFT_ADAPTER_OUTPUT_DIR)
print(f"Saved LoRA adapters and tokenizer to {PEFT_ADAPTER_OUTPUT_DIR}")


Saved LoRA adapters and tokenizer to /tmp/distilbert-sst2-lora


## Performing Inference with a PEFT Model

The cells below reload the saved PEFT adapter weights, evaluate the reloaded model, and compare the results with the pre-fine-tuning baseline.

In [11]:
reloaded_model = AutoPeftModelForSequenceClassification.from_pretrained(
    PEFT_ADAPTER_OUTPUT_DIR
)
reloaded_model


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): DistilBertSdpaAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=76

In [12]:
peft_eval_args = TrainingArguments(
    output_dir="/tmp/distilbert-lora-eval",
    per_device_eval_batch_size=32,
    report_to="none",
)

reloaded_trainer = Trainer(
    model=reloaded_model,
    args=peft_eval_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
reloaded_metrics = reloaded_trainer.evaluate()
reloaded_metrics




{'eval_loss': 0.43499407172203064,
 'eval_model_preparation_time': 0.0022,
 'eval_accuracy': 0.823394495412844,
 'eval_runtime': 14.3972,
 'eval_samples_per_second': 60.567,
 'eval_steps_per_second': 1.945}

In [13]:
baseline_acc = baseline_metrics["eval_accuracy"]
lora_acc = reloaded_metrics["eval_accuracy"]
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"LoRA accuracy: {lora_acc:.4f}")
print(f"Absolute improvement: {lora_acc - baseline_acc:.4f}")
{"baseline": baseline_metrics, "lora": reloaded_metrics}


Baseline accuracy: 0.4346
LoRA accuracy: 0.8234
Absolute improvement: 0.3888


{'baseline': {'eval_loss': 0.6972277164459229,
  'eval_model_preparation_time': 0.0023,
  'eval_accuracy': 0.43463302752293576,
  'eval_runtime': 13.5136,
  'eval_samples_per_second': 64.528,
  'eval_steps_per_second': 2.072},
 'lora': {'eval_loss': 0.43499407172203064,
  'eval_model_preparation_time': 0.0022,
  'eval_accuracy': 0.823394495412844,
  'eval_runtime': 14.3972,
  'eval_samples_per_second': 60.567,
  'eval_steps_per_second': 1.945}}
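
To put the accuracy gap in more tangible terms, the sketch below converts both accuracies into counts of correctly classified validation sentences and computes the share of baseline errors that fine-tuning removed. All inputs are copied from the outputs above; the variable names are illustrative.

```python
n_val = 872
baseline_acc = 0.43463302752293576
lora_acc = 0.823394495412844

# Accuracy times the number of examples recovers the number of correct predictions.
baseline_correct = round(baseline_acc * n_val)
lora_correct = round(lora_acc * n_val)
gained = lora_correct - baseline_correct

abs_gain = lora_acc - baseline_acc
# Fraction of the baseline's errors that the adapted model no longer makes.
error_reduction = abs_gain / (1 - baseline_acc)

print(f"correct: {baseline_correct} -> {lora_correct} (+{gained} of {n_val})")
print(f"absolute gain: {abs_gain:.4f}, relative error reduction: {error_reduction:.4f}")
```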

In [14]:
label_names = raw_datasets["train"].features["label"].names
sample_sentences = [
    "The movie was unexpectedly delightful and heartwarming.",
    "The plot was incoherent and the acting was terrible.",
]

reloaded_model.eval()
for text in sample_sentences:
    encoded = tokenizer(text, return_tensors="pt")
    encoded = {k: v.to(reloaded_model.device) for k, v in encoded.items()}
    with torch.no_grad():
        outputs = reloaded_model(**encoded)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
    print(f"Text: {text}")
    print(f"Predicted label: {label_names[prediction]}")
    print("-" * 40)


Text: The movie was unexpectedly delightful and heartwarming.
Predicted label: positive
----------------------------------------
Text: The plot was incoherent and the acting was terrible.
Predicted label: negative
----------------------------------------
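
The loop above reports only the arg-max label. If a confidence score is also useful, the logits can be passed through a softmax. The sketch below shows the idea in plain Python with made-up logits; the numbers are hypothetical and are not outputs of the fine-tuned model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ["negative", "positive"]   # same order as the SST-2 label feature
hypothetical_logits = [-1.2, 2.3]        # made-up values for illustration

probs = softmax(hypothetical_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{label_names[best]} (p={probs[best]:.3f})")
```

With the real model, `torch.softmax(outputs.logits, dim=-1)` would play the same role as the helper here.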


In [15]:
# `os` is already imported in the first cell.
os.listdir(PEFT_ADAPTER_OUTPUT_DIR)


['adapter_config.json',
 'adapter_model.safetensors',
 'README.md',
 'special_tokens_map.json',
 'tokenizer.json',
 'tokenizer_config.json',
 'vocab.txt']

## Conclusion

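LoRA fine-tuning on the 1,000-sample SST-2 subset raised DistilBERT validation accuracy from **0.43** (untrained, randomly initialized classification head) to **0.82**. The adapter weights and tokenizer files are stored in `/tmp/distilbert-sst2-lora`. Training ran on CPU: the logged training runtime for three epochs was about 163 seconds, and each standalone evaluation pass took roughly 15 seconds.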

Next steps if you revisit the project:
- Try a larger training subset or extra epochs to squeeze out a few more points.
- Experiment with QLoRA or gradient checkpointing to reduce memory while exploring bigger models.
- Add metrics like F1 or MCC if you want richer comparisons for imbalanced datasets.
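
For the last item, here is a minimal sketch of F1 and the Matthews correlation coefficient computed from confusion-matrix counts. The helper name and the counts are hypothetical. In practice a library metric could be plugged into `compute_metrics` instead.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Made-up counts for illustration only.
f1, mcc = f1_and_mcc(tp=40, fp=10, fn=10, tn=40)
print(f"F1={f1:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC stays near zero for a classifier that always predicts the majority class, which is why it is often suggested for imbalanced data.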
