# Fine-Tuning LLaMA-2-7B with LoRA on Dolly-15K Dataset

## Purpose
This notebook implements **Parameter-Efficient Fine-Tuning (PEFT)** using **LoRA (Low-Rank Adaptation)** to fine-tune the **meta-llama/Llama-2-7b** model on the **databricks/databricks-dolly-15k** dataset. The goal is to transform the base model into a helpful conversational assistant that follows instructions effectively.

## Background
Instruction fine-tuning adapts a pretrained LLM to follow natural-language instructions instead of generic next-token prediction. This converts a raw pretrained model into a usable conversational assistant that can:
- Respond concisely to user instructions
- Follow specified formats consistently  
- Avoid generating unnecessary tokens
- Maintain high conversational quality

## Dataset: Dolly-15K
- **Source**: [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- **Size**: 15,011 high-quality instruction-following examples
- **Format**: Each entry contains `instruction`, `context`, and `response` fields
- **Categories**: 7 different task categories (creative writing, information extraction, etc.)
- **Splits**: 80% train / 10% validation / 10% test

## Technical Approach
- **Base Model**: `meta-llama/Llama-2-7b-hf`
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Hardware**: Optimized for Colab with FSDP if VRAM is restricted
- **Training**: Track training and validation loss for convergence monitoring



## Expected Outcomes
- Upload fine-tuned weights to Colab for reuse and evaluation

## Workflow
1. **Environment Setup**: Install dependencies and mount Google Drive
2. **Data Preprocessing**: Load and format Dolly-15K dataset
3. **Model Configuration**: Set up LoRA parameters and training config
4. **Training**: Fine-tune with loss monitoring and checkpointing
5. **Model Saving**: Upload weights to Colab for later evaluation
6. **Documentation**: Record hyperparameters and training metrics

---
**Note**: This fine-tuning process is designed to demonstrate the effectiveness of LoRA for instruction-following capabilities while maintaining computational efficiency.


In [1]:
!pip install -U transformers peft bitsandbytes accelerate trl datasets



In [2]:
from datasets import load_dataset
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer, SFTConfig
import os
from peft import LoraConfig, get_peft_model

In [3]:
from huggingface_hub import login
login(new_session=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)



print(f"✓ Model {model_id} loaded in 4-bit (QLoRA)")


Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

✓ Model meta-llama/Llama-2-7b-hf loaded in 4-bit (QLoRA)


In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(model)


tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMS

In [6]:
#LoRA Config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ]
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 39,976,960 || all params: 6,778,392,576 || trainable%: 0.5898


In [8]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [9]:
drive_path = '/content/drive/MyDrive/LLaMA2-Dolly-Training/data'
os.makedirs(drive_path, exist_ok=True)
train_path = os.path.join(drive_path, 'train.parquet')
val_path = os.path.join(drive_path, 'val.parquet')
test_path = os.path.join(drive_path, 'test.parquet')


train_dataset_hf = load_dataset('parquet', data_files={'train': train_path})['train']
val_dataset_hf = load_dataset('parquet', data_files={'validation': val_path})['validation']
test_dataset_hf = load_dataset('parquet', data_files={'test': test_path})['test']


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [10]:
# TRAINING CONFIGS

output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
learning_rate = 2e-4
max_grad_norm = 0.3
num_train_epochs = 1
warmup_ratio = 0.03
lr_scheduler_type = "constant"
save_strategy = "steps"
save_steps = 100
logging_strategy = "steps"
logging_steps = 10
eval_steps = 50
max_seq_length = 1024

# --- Create SFTConfig ---
sft_config = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_strategy=save_strategy,
    save_steps=save_steps,
    logging_strategy=logging_strategy,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_length=max_seq_length,
    num_train_epochs=num_train_epochs,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    do_eval=True,                # Enable evaluation
    eval_steps=eval_steps,       # Evaluation frequency
    save_total_limit=2,
    report_to="none",
    dataset_text_field="text",
    packing=False,
)

In [11]:
trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset_hf,
        eval_dataset=val_dataset_hf,
        peft_config=peft_config,
        processing_class=tokenizer,
        args=sft_config,
  )



Adding EOS to train dataset:   0%|          | 0/12008 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/12008 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/12008 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/1501 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/1501 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/1501 [00:00<?, ? examples/s]

In [12]:
print("\nStarting QLoRA fine-tuning...")
start_time = time.time()

trainer.train()

end_time = time.time()
training_duration_minutes = (end_time - start_time) / 60
print(f"Training finished in: {training_duration_minutes:.2f} minutes")

final_adapter_path = os.path.join(output_dir, "final_lora_adapter")
print(f"\nSaving final LoRA adapter state to: {final_adapter_path}")
trainer.model.save_pretrained(final_adapter_path)
tokenizer.save_pretrained(final_adapter_path)
print("Final adapter and tokenizer saved.")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.



Starting QLoRA fine-tuning...


  return fn(*args, **kwargs)


Step,Training Loss
10,1.451
20,1.5325
30,1.4798
40,1.3384
50,1.2475
60,1.3103
70,1.3935
80,1.4745
90,1.336
100,1.2042


  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)


Training finished in: 194.59 minutes

Saving final LoRA adapter state to: /content/drive/MyDrive/LLaMA2-Dolly-Training/results/final_lora_adapter
Final adapter and tokenizer saved.


In [13]:
train_metrics = trainer.state.log_history
print("\nTraining Metrics History:")
# print(train_metrics) # This can be very long, maybe just show the last few?
if len(train_metrics) > 5:
    print("Last 5 log entries:")
    for entry in train_metrics[-5:]:
        print(entry)
else:
    print(train_metrics)


Training Metrics History:
Last 5 log entries:
{'loss': 1.3499, 'grad_norm': 0.32833924889564514, 'learning_rate': 0.0002, 'entropy': 1.368898132443428, 'num_tokens': 2410210.0, 'mean_token_accuracy': 0.6754716664552689, 'epoch': 0.9593604263824117, 'step': 720}
{'loss': 1.4122, 'grad_norm': 0.3472752571105957, 'learning_rate': 0.0002, 'entropy': 1.4421674251556396, 'num_tokens': 2431479.0, 'mean_token_accuracy': 0.6661443382501602, 'epoch': 0.9726848767488341, 'step': 730}
{'loss': 1.2016, 'grad_norm': 0.4197086989879608, 'learning_rate': 0.0002, 'entropy': 1.2403857663273812, 'num_tokens': 2444426.0, 'mean_token_accuracy': 0.7106944054365159, 'epoch': 0.9860093271152565, 'step': 740}
{'loss': 1.0311, 'grad_norm': 0.9006634950637817, 'learning_rate': 0.0002, 'entropy': 1.0176764741539954, 'num_tokens': 2451080.0, 'mean_token_accuracy': 0.7644517034292221, 'epoch': 0.9993337774816788, 'step': 750}
{'train_runtime': 11673.4494, 'train_samples_per_second': 1.029, 'train_steps_per_second'