# GPU Efficient LLM fine-tuning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github//velebit-ai/research-llm-development/blob/master/GPU-efficient-LLM-fine-tuning.ipynb)

We'll walk through an example of parameter-efficient fine-tuning on a single T4 GPU using Google Colab.

## Package setup

In [None]:
%pip install transformers -q
%pip install bitsandbytes -q
%pip install datasets -q
%pip install accelerate -q
%pip install peft -q
%pip install trl -q
%pip install einops -q
%pip install tensorboard -q

%pip install watermark -q # version checks

In [None]:
%load_ext watermark

## Loading the model

We will load the model by using the `transformers` library from Hugging Face.

To fit a really large model on a single GPU, you can load it in half precision. Most LLMs are trained in half precision (float16 or bfloat16) anyway, with almost no performance loss compared with full-precision (float32) training.

If that is not enough, you can quantize the model weights to 8-bit or even 4-bit. This requires the `accelerate` and `bitsandbytes` libraries.
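
To see why quantization matters on a T4 (16 GB of memory), here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores activations, optimizer state, and quantization constants, so treat the numbers as lower bounds rather than measurements.

```python
# Rough weight-only memory estimate for a ~7B-parameter model.
# N_PARAMS is a nominal assumption, not the exact count of any checkpoint.
N_PARAMS = 7_000_000_000

BITS_PER_WEIGHT = {
    "float32": 32,
    "float16/bfloat16": 16,
    "8-bit": 8,
    "4-bit": 4,
}

def weight_memory_gb(n_params, bits):
    """Memory for the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

footprint = {name: weight_memory_gb(N_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in footprint.items():
    print(f"{name:>18}: {gb:5.1f} GB")
```

Under these assumptions only the 8-bit and 4-bit variants leave meaningful headroom on a 16 GB card once activations and adapter gradients are added.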

In [None]:
# Step 2: Import libraries
import os
import torch
#from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

In [None]:
# Matrix multiplications run in float16 (the T4 has no native bfloat16 support)
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
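
To build intuition for what 4-bit storage does to the weights, the following is a deliberately simplified sketch of block-wise absmax quantization to signed 4-bit integers. It is not the NF4 scheme selected above (NF4 uses 16 non-uniformly spaced levels rather than a uniform grid), and the function names and sample weights are made up for illustration.

```python
# Simplified block-wise absmax quantization (uniform grid, NOT bitsandbytes' NF4).
def quantize_block(block, n_bits=4):
    """Map floats to signed integer codes using one scale per block."""
    levels = 2 ** (n_bits - 1) - 1          # 7 for 4 bits
    absmax = max(abs(x) for x in block)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical weight block, for illustration only
weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.0, 0.48, -0.91]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("max error:", max_error)
```

Each weight is stored as a small integer plus one shared scale per block, and the rounding error is bounded by half a quantization step.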

In [None]:
import bitsandbytes
print(bitsandbytes.__version__)

0.49.0


In [None]:
# Step 3: Load model with 4-bit quantization (essential for Colab free GPU)
# Model from Hugging Face hub
# Read your Hugging Face token from the HF_TOKEN environment variable
# (or paste your own token here); public models also load without one.
access_token = os.environ.get("HF_TOKEN")
base_model = "NousResearch/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    token=access_token,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # use the EOS token for padding
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# STEP 5: Configure LoRA (Low-Rank Adaptation)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
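
As a reminder of what these settings control, here is a minimal pure-Python sketch of a LoRA layer's forward pass: the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added on top. The tiny matrices are made up for illustration (the toy adapter has rank 1, while the scale reuses the values from the config above), and dropout is omitted.

```python
# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x). Illustrative only.
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale):
    base = matvec(W, x)                 # frozen pretrained projection
    update = matvec(B, matvec(A, x))    # low-rank adapter path
    return [b + scale * u for b, u in zip(base, update)]

lora_alpha, r = 16, 64
scale = lora_alpha / r                  # 0.25 with the config above

W = [[1, 0, 0], [0, 1, 0]]              # 2 x 3 frozen weight
A = [[1, 1, 1]]                         # 1 x 3 down-projection (trainable)
x = [1, 2, 3]

# B starts at zero, so the adapted layer initially equals the base layer.
out_init = lora_forward(W, A, [[0], [0]], x, scale)
# After training, B is non-zero and shifts the output.
out_trained = lora_forward(W, A, [[4], [0]], x, scale)

print(out_init, out_trained)
```

Only `A` and `B` receive gradients, which is why the number of trainable parameters stays small even when every projection in `target_modules` gets an adapter.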

In [None]:
# STEP 6: Prepare Dataset
from datasets import load_dataset

# Using SQuAD dataset
train_raw = load_dataset("squad", split="train[:2000]")  # Adjust size as needed
eval_raw = load_dataset("squad", split="validation[500:1000]")

def build_prompt(example):
    context = example["context"]
    question = example["question"]
    answers = example["answers"]["text"]

    target = answers[0] if len(answers) > 0 else ""

    prompt = f"""Task: Extracting Answers from Contexts
Instructions:
– Extract phrases from the passage that answer the question.
– The answer must be a literal part from the passage.
– Do not write any additional explanation or interpretation.

Answer the following passage:
Context:
{context}

Question:
{question}

Answer:
"""

    return {
        "prompt": prompt,
        "target": target
    }

# Format datasets
train_dataset = train_raw.map(build_prompt)
eval_dataset = eval_raw.map(build_prompt)

print(f"✓ Loaded {len(train_dataset)} training examples")
print(f"✓ Loaded {len(eval_dataset)} evaluation examples")

✓ Loaded 2000 training examples
✓ Loaded 500 evaluation examples


In [None]:
def tokenize(example):
    prompt_ids = tokenizer(
        example["prompt"],
        truncation=True,
        max_length=1024,
        add_special_tokens=False
    )["input_ids"]

    answer_ids = tokenizer(
        example["target"],
        truncation=True,
        max_length=128,
        add_special_tokens=False
    )["input_ids"]

    input_ids = prompt_ids + answer_ids + [tokenizer.eos_token_id]

    labels = [-100] * len(prompt_ids) + answer_ids + [tokenizer.eos_token_id]

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": [1] * len(input_ids),
    }

tokenized_train = train_dataset.map(tokenize, remove_columns=train_dataset.column_names)
tokenized_eval = eval_dataset.map(tokenize, remove_columns=eval_dataset.column_names)
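
The effect of the `-100` labels is easiest to see on a toy example. The sketch below mirrors the concatenation logic of `tokenize` with made-up token ids (no real tokenizer is involved): positions labelled `-100` are skipped by the cross-entropy loss, so only the answer tokens and the EOS token are supervised.

```python
# Toy illustration of prompt masking; the token ids are invented.
IGNORE_INDEX = -100  # label value that the loss function ignores

def build_example(prompt_ids, answer_ids, eos_id):
    input_ids = prompt_ids + answer_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [eos_id]
    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": [1] * len(input_ids),
    }

example = build_example(prompt_ids=[11, 12, 13, 14], answer_ids=[21, 22], eos_id=2)

# Tokens that actually contribute to the loss
supervised = [tok for tok, lab in zip(example["input_ids"], example["labels"])
              if lab != IGNORE_INDEX]

print(example["labels"])
print(supervised)
```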

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
    label_pad_token_id=-100,
)
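
Conceptually, the collator pads every sequence in a batch to the length of the longest one. The following plain-Python sketch shows that idea under the settings used here (right padding, labels padded with `-100`). The helper `pad_batch` is hypothetical and returns lists, whereas the real collator returns tensors.

```python
# Hypothetical helper that mimics dynamic right-padding of a batch.
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padded label positions are ignored by the loss
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch

features = [
    {"input_ids": [11, 12, 21, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 21, 2]},
    {"input_ids": [13, 22, 2], "attention_mask": [1, 1, 1], "labels": [-100, 22, 2]},
]
batch = pad_batch(features, pad_token_id=2)
print(batch)
```

Because the padding token is the EOS token, the attention mask and the `-100` labels are what keep padded positions out of both attention and the loss.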

In [None]:
from peft import get_peft_model

if hasattr(model, "peft_config"):
    model = model.unload()

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 159,907,840 || all params: 6,898,323,456 || trainable%: 2.3181
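
The trainable-parameter count printed above can be checked by hand: a LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below assumes the usual Llama-2-7B dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers), which are not stated in this notebook.

```python
# Reproduce the LoRA parameter count; the model dimensions are assumptions.
HIDDEN, INTERMEDIATE, N_LAYERS, R = 4096, 11008, 32, 64

# (in_features, out_features) of each targeted projection in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """A is r x in_features and B is out_features x r."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
trainable = per_layer * N_LAYERS

all_params = 6_898_323_456  # "all params" as printed by print_trainable_parameters()
trainable_pct = 100 * trainable / all_params

print(f"{trainable:,} trainable parameters ({trainable_pct:.4f}%)")
```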


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-lora-squad",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=True,
    logging_strategy="steps",
    logging_steps=25,
    save_strategy="steps",
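    eval_strategy="steps",
    # NOTE: 2000 examples with an effective batch size of 8 give only 250
    # optimizer steps, so eval_steps=500 is never reached in this one-epoch run.
    # Lower it (for example to 50) to track validation loss during training.
    eval_steps=500,
    save_steps=500,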
    save_total_limit=2,
    report_to="none",
)
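
These arguments imply a specific number of optimizer steps and a specific learning-rate curve. The sketch below works them out for the 2000 training examples loaded earlier, using the common formulation of linear warmup followed by cosine decay. It is an illustration of the arithmetic, not the library's scheduler code, and the helper names are hypothetical.

```python
import math

# Values taken from the TrainingArguments above and the training set size
n_examples, per_device_batch, grad_accum, epochs = 2000, 1, 8, 1
peak_lr, warmup_ratio = 2e-4, 0.1

effective_batch = per_device_batch * grad_accum           # examples per optimizer step
total_steps = math.ceil(n_examples / effective_batch) * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"warmup steps:         {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

The resulting step count can be compared with the `global_step` value in the training output.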

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument (transformers >= 4.46)
    data_collator=data_collator,
)

trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.



TrainOutput(global_step=250, training_loss=0.24740070915222168, metrics={'train_runtime': 691.8269, 'train_samples_per_second': 2.891, 'train_steps_per_second': 0.361, 'total_flos': 2.561431745101824e+16, 'train_loss': 0.24740070915222168, 'epoch': 1.0})

In [None]:
# Held-out test split: disjoint from the validation[500:1000] slice used as eval_dataset
test_data = load_dataset("squad", split="validation[1000:1500]")
print(f"✓ Loaded {len(test_data)} test examples")

✓ Loaded 500 test examples


In [None]:
print("\n" + "="*60)
print("TESTING THE BASELINE MODEL")
print("="*60 + "\n")

def generate_answer(context, question, max_length=100):
    """Generate answer from context and question"""
    prompt = f"""Task: Extracting Answers from Contexts
Instructions:
– Extract phrases from the passage that answer the question.
– The answer must be a literal part from the passage.
– Do not write any additional explanation or interpretation.

Answer the following passage:
Context:
{context}

Question:
{question}

Answer:
"""

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_length,
        temperature=0.1,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the answer part
    if "Answer:" in response:
        answer = response.split("Answer:")[-1].strip()
    else:
        answer = response

    answer = answer.split('\n')[0].strip()  # Keep only the first line of the generated answer

    return answer


TESTING THE BASELINE MODEL
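
The post-processing at the end of `generate_answer` can be tried in isolation. The sketch below repeats the same string logic in a standalone helper (the name `extract_answer` and the sample strings are made up) and shows what it returns when the model keeps generating after the answer.

```python
# Standalone version of the answer clean-up step, for illustration.
def extract_answer(response: str) -> str:
    """Keep the text after the last 'Answer:' marker, cut at the first newline."""
    if "Answer:" in response:
        answer = response.split("Answer:")[-1].strip()
    else:
        answer = response
    return answer.split("\n")[0].strip()

# Hypothetical decoded output: the prompt followed by the generated continuation
decoded = (
    "Question:\nWhich NFL team won Super Bowl 50?\n\n"
    "Answer:\nDenver Broncos\nContext: some extra generated text"
)
clean = extract_answer(decoded)
no_marker = extract_answer("Denver Broncos\nextra line")

print(clean)
```

One caveat of this approach is that it splits on the last occurrence of `Answer:`, so a continuation that itself contains `Answer:` would change which text is kept.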



In [None]:
# Baseline Model Evaluation on Held-out Test Set
# =============================================================================
print("\n" + "="*60)
print("TEST 2: Held-out Test Set Evaluation")
print("-" * 60)

import re
from collections import Counter

def normalize_text(s):
    """Lowercase, drop articles and punctuation, and collapse whitespace."""
    s = s.lower()
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    s = re.sub(r'[^\w\s]', '', s)
    return ' '.join(s.split())

def compute_exact_match(prediction, ground_truth):
    """Return 1.0 if the prediction matches any reference answer, ignoring spaces."""
    pred = normalize_text(prediction).replace(" ", "")
    return max(float(pred == normalize_text(gt).replace(" ", "")) for gt in ground_truth)

def compute_f1(prediction, ground_truth):
    """Return the best token-level F1 between the prediction and any reference answer."""
    pred_tokens = normalize_text(prediction).split()
    scores = []
    for gt in ground_truth:
        truth_tokens = normalize_text(gt).split()

        # If either side is empty, F1 is 1 only when both are empty
        if not pred_tokens or not truth_tokens:
            scores.append(float(pred_tokens == truth_tokens))
            continue

        # Multiset overlap, so repeated tokens are counted correctly
        num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
        if num_common == 0:
            scores.append(0.0)
            continue

        precision = num_common / len(pred_tokens)
        recall = num_common / len(truth_tokens)
        scores.append(2 * precision * recall / (precision + recall))
    return max(scores)

# Evaluate on test set (limited to save time)
num_test_samples = min(500, len(test_data))  # Adjust as needed
print(f"Evaluating on {num_test_samples} test samples...")

exact_matches = []
f1_scores = []
rows = []

for i in range(num_test_samples):
    example = test_data[i]
    context = example['context']
    question = example['question']
    ground_truth = example['answers']['text']

    # Generate prediction
    prediction = generate_answer(context, question, max_length=50)

    # Compute metrics
    em = compute_exact_match(prediction, ground_truth)
    f1 = compute_f1(prediction, ground_truth)

    exact_matches.append(em)
    f1_scores.append(f1)

    rows.append({
        "prediction": prediction,
        "ground_truth": ground_truth,
        "exact_match": em,
        "f1": f1
    })

    # Show some examples
    if i < 5:
        print(f"\nExample {i+1}:")
        print(f"Question: {question}")
        print(f"Ground Truth: {ground_truth}")
        print(f"Prediction: {prediction}")
        print(f"Exact Match: {em} | F1: {f1:.3f}")

# Print overall metrics
print("\n" + "="*60)
print("OVERALL TEST RESULTS")
print("="*60)
print(f"Exact Match Accuracy: {sum(exact_matches)/len(exact_matches)*100:.2f}%")
print(f"Average F1 Score: {sum(f1_scores)/len(f1_scores)*100:.2f}%")
print(f"Samples Evaluated: {num_test_samples}")


TEST 2: Held-out Test Set Evaluation
------------------------------------------------------------
Evaluating on 500 test samples...

Example 1:
Question: Which NFL team represented the AFC at Super Bowl 50?
Ground Truth: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
Prediction: Denver Broncos
Exact Match: 1.0 | F1: 1.000

Example 2:
Question: Which NFL team represented the NFC at Super Bowl 50?
Ground Truth: ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers']
Prediction: Carolina Panthers
Exact Match: 1.0 | F1: 1.000

Example 3:
Question: Where did Super Bowl 50 take place?
Ground Truth: ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."]
Prediction: Levi's Stadium
Exact Match: 1.0 | F1: 1.000

Example 4:
Question: Which NFL team won Super Bowl 50?
Ground Truth: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
Prediction: Denver Broncos
Exact Match: 1.0 | F1: 1.000

Example 5:
Question: Wh
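
All examples shown above are perfect matches, so it may help to see how the two metrics behave on a partially correct prediction. The standalone sketch below uses the common SQuAD-style definitions (normalized string equality for exact match, token-overlap F1). The prediction strings are invented, and the helpers are simplified single-reference versions rather than the notebook's functions.

```python
import re
from collections import Counter

def normalize(s):
    """Lowercase, remove articles and punctuation, collapse whitespace."""
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(s.split())

def exact_match(pred, truth):
    return float(normalize(pred) == normalize(truth))

def token_f1(pred, truth):
    pred_tokens, truth_tokens = normalize(pred).split(), normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_tokens), common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# Articles, case, and punctuation do not affect exact match
em_same = exact_match("The Denver Broncos.", "Denver Broncos")

# A verbose prediction: 2 of its 5 tokens overlap with the 2-token reference
em_verbose = exact_match("Levi's Stadium in Santa Clara", "Levi's Stadium")
f1_verbose = token_f1("Levi's Stadium in Santa Clara", "Levi's Stadium")

print(em_same, em_verbose, round(f1_verbose, 3))
```

Here precision is 2/5 and recall is 2/2, so the verbose prediction gets partial F1 credit while exact match stays at zero.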

In [None]:
# STEP 10: Save the Model

output_dir = "./llama2-qa-lora"
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✓ Model saved to {output_dir}")

✓ Model saved to ./llama2-qa-lora


In [None]:
# Save Test Results
import json

test_results = {
    "num_samples": num_test_samples,
    "exact_match_accuracy": sum(exact_matches)/len(exact_matches)*100,
    "average_f1_score": sum(f1_scores)/len(f1_scores)*100,
    "individual_scores": [
        {"exact_match": em, "f1": f1}
        for em, f1 in zip(exact_matches, f1_scores)
    ]
}

with open("./llama2-qa-lora/test_results.json", "w") as f:
    json.dump(test_results, f, indent=2)

print("\n✓ Test results saved to ./llama2-qa-lora/test_results.json")


✓ Test results saved to ./llama2-qa-lora/test_results.json
