## Fine-tuning a GPT-style Model Using LoRA

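1. You can change model_name to a smaller model such as EleutherAI/pythia-70m if needed, or swap in another causal LM such as Llama 2. If you switch architectures, update target_modules in the LoRA config to match that model's layer names (for example, Llama models use q_proj and v_proj rather than query_key_value).

2. Since we're running this in Colab, we should use lightweight models for practice.

3. This example uses QLoRA: the base model is loaded in 4-bit (load_in_4bit=True) and LoRA adapters are trained on top of it.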
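
Recent transformers releases deprecate passing load_in_4bit straight to from_pretrained and ask for a BitsAndBytesConfig object in the quantization_config argument instead. A minimal sketch of that form is shown here. The helper name load_4bit_causal_lm is hypothetical, and the NF4 quantization type is an assumption (it is the type used in the QLoRA paper), not something this notebook sets.

```python
def load_4bit_causal_lm(model_name: str):
    """Load a causal LM with 4-bit weights via BitsAndBytesConfig (sketch)."""
    # Imported lazily so this helper can be defined before the packages are installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store linear-layer weights in 4-bit
        bnb_4bit_quant_type="nf4",             # assumption: NF4 rather than the default type
        bnb_4bit_compute_dtype=torch.float16,  # run matmuls in half precision
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
```

Calling load_4bit_causal_lm("tiiuae/falcon-rw-1b") would stand in for the from_pretrained call in Step 2; it needs a CUDA GPU with bitsandbytes installed.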

In [1]:
!pip install transformers datasets peft accelerate bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl (61.3 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.47.0


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset

# =====================
# Step 1: Dataset
# =====================
data = {
    "text": [
        "Question: What is the capital of France? Answer: Paris.",
        "Question: Who wrote Hamlet? Answer: William Shakespeare.",
        "Question: What is 2 + 2? Answer: 4.",
    ]
}
dataset = Dataset.from_dict(data)

# =====================
# Step 2: Load Model + Tokenizer with qLoRA
# =====================
model_name = "tiiuae/falcon-rw-1b"   # alternative: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare_model_for_kbit_training readies a model that was loaded in 4-bit or 8-bit
# (quantized) form for training:
# - most weights stay packed in 4-bit to save GPU memory;
# - some parts (such as layer norms and the output head) are kept in float32 or
#   float16 for training stability;
# - gradient checkpointing is enabled to reduce activation memory.
model = prepare_model_for_kbit_training(model)

# =====================
# Step 3: Apply LoRA
# =====================
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,   # scaling factor: the LoRA update is multiplied by lora_alpha / r (here 16 / 8 = 2), so a higher alpha gives the adapter more influence on the frozen weights and a lower alpha gives it less
    target_modules=["query_key_value"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# =====================
# Step 4: Tokenize Dataset
# =====================
def tokenize(sample):
    return tokenizer(sample["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize)

# =====================
# Step 5: Training
# =====================
training_args = TrainingArguments(
    output_dir="./dummy",  # placeholder path; no checkpoints are written because save_strategy="no"
    per_device_train_batch_size=4, # each GPU trains with 4 examples at once
    num_train_epochs=3,
    learning_rate=2e-4,  # 0.0002
    fp16=True,          # Use half-precision floating point (16 bits instead of 32). Saves GPU memory and speeds things up
    logging_steps=10,
    save_strategy="no",
    report_to="none"
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...



TrainOutput(global_step=3, training_loss=2.0584044456481934, metrics={'train_runtime': 5.8433, 'train_samples_per_second': 1.54, 'train_steps_per_second': 0.513, 'total_flos': 8364732973056.0, 'train_loss': 2.0584044456481934, 'epoch': 3.0})

A data collator takes tokenized examples from your dataset and turns them into mini-batches that the model can consume. It handles:

1. Padding sequences in the batch so they’re the same length.

2. Creating the labels for language modeling. For causal LM the labels are a copy of the input IDs with padding positions set to -100 so the loss ignores them; the model shifts them by one position internally.

3. Optionally applying masking (MLM).

mlm=True → Masked Language Modeling (MLM), as in BERT training. Tokens are randomly masked (e.g., “Paris” → “[MASK]”) and the model predicts them from the rest of the sentence.

mlm=False → Causal Language Modeling (CLM), as in GPT/Falcon training. The model predicts the next token in the sequence, with no random masking.

Since Falcon is a causal LM, we set mlm=False.
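
For the mlm=False case, the labels can be built as a copy of the input IDs with padded positions set to -100 (the index the loss ignores); the one-position shift between inputs and targets then happens inside the model's loss computation. The following pure-Python sketch illustrates that idea. It is not the Hugging Face implementation: the function name collate_causal_lm is hypothetical, and it masks by position, whereas the library masks every token equal to the pad token ID.

```python
def collate_causal_lm(batch, pad_token_id):
    """Pad a batch of token-id lists and build causal-LM labels (illustration only)."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; padded positions get -100 so the loss skips them.
        labels.append(ids + [-100] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```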

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # mlm=False: causal LM, no masked-language-model masking

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

In [13]:
# =====================
# Step 6: Generate Output
# =====================
model.eval()
# prompt = "What is 18 + 2? Answer:"
prompt = "What is the capital of France? Answer:"
# prompt = "Who wrote Hamlet? Answer:"
# prompt = "What is black hole?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=True,   # sample from the distribution instead of greedy decoding (greedy always picks the most likely token), so answers vary between runs
        top_k=50,         # restrict sampling to the 50 most likely next tokens at each step
        top_p=0.9,        # nucleus sampling: of those, keep the smallest set whose probabilities sum to at least 0.9 (3 tokens if 3 already cover 90%, 20 if it takes 20)
        temperature=0.7   # <1 sharpens the distribution (more deterministic); >1 flattens it (more random)
    )


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [12]:
print("\n=== Model Output ===")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


=== Model Output ===
What is the capital of France? Answer: Paris.
What is the capital of Italy?
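
The sampling arguments passed to generate (temperature, top_k, top_p) can be illustrated on a toy distribution. The sketch below is a simplified pure-Python version of the three filters applied in that order; it is not the transformers implementation, and the function name and toy logits are made up for illustration.

```python
import math
import random


def filter_next_token_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return the renormalized candidates after temperature, top-k and top-p."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalize.
    probs = probs[:top_k]
    k_total = sum(p for _, p in probs)
    probs = [(tok, p / k_total) for tok, p in probs]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"Paris": 5.0, "Lyon": 2.0, "London": 1.0, "banana": -3.0}
candidates = filter_next_token_probs(toy_logits)
next_token = random.choices(list(candidates), weights=list(candidates.values()))[0]
print(candidates, next_token)

# A higher temperature flattens the distribution, so more tokens survive top-p.
flatter = filter_next_token_probs(toy_logits, temperature=2.0)
print(list(flatter))
```

With temperature=0.7 the top token alone already covers more than 90% of the probability mass, so it is the only candidate left; at temperature=2.0 three tokens survive the same top_p cutoff.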


# Evaluating the Model Using BLEU and ROUGE Scores

In [17]:
# =====================
# Step 7: Evaluation with BLEU & ROUGE
# =====================
model.eval()

# Define test prompts and references
eval_data = [
    {"prompt": "What is the capital of France? Answer:", "reference": "Paris."},
    {"prompt": "Who wrote Hamlet? Answer:", "reference": "William Shakespeare."},
    {"prompt": "What is 2 + 2? Answer:", "reference": "4."},
]

predictions = []
references = []

for item in eval_data:
    inputs = tokenizer(item["prompt"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=True,
            top_k=50,
            top_p=0.9,
            temperature=0.7
        )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the answer part (after "Answer:")
    if "Answer:" in decoded:
        decoded = decoded.split("Answer:")[-1].strip()
    predictions.append(decoded)
    references.append(item["reference"])

print("\n=== Predictions vs References ===")
for p, r in zip(predictions, references):
    print(f"Pred: {p} | Ref: {r}")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



=== Predictions vs References ===
Pred: Paris. And if you’re looking for | Ref: Paris.
Pred: William Shakespeare.
This is one of the most | Ref: William Shakespeare.
Pred: 4.
That’s right, there | Ref: 4.


In [18]:
predictions

['Paris. And if you’re looking for',
 'William Shakespeare.\nThis is one of the most',
 '4.\nThat’s right, there']

In [19]:
references

['Paris.', 'William Shakespeare.', '4.']
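
Each prediction starts with the reference answer and then keeps going, because generation continues until an end-of-sequence token or max_new_tokens is reached. One way to score only the answer is to trim each prediction before comparing. The sketch below uses a naive heuristic (first line, up to the first period) and an exact-match rate; the helper name first_sentence is hypothetical, and the heuristic would break on answers that contain a period, such as 3.14.

```python
def first_sentence(text):
    """Keep the text up to the first period on the first line (naive heuristic)."""
    first_line = text.strip().split("\n")[0]
    end = first_line.find(".")
    return first_line if end == -1 else first_line[: end + 1]


# Predictions and references copied from the cell outputs.
raw_predictions = [
    "Paris. And if you’re looking for",
    "William Shakespeare.\nThis is one of the most",
    "4.\nThat’s right, there",
]
gold = ["Paris.", "William Shakespeare.", "4."]

trimmed = [first_sentence(p) for p in raw_predictions]
exact_match = sum(t == g for t, g in zip(trimmed, gold)) / len(gold)
print(trimmed)      # ['Paris.', 'William Shakespeare.', '4.']
print(exact_match)  # 1.0
```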

In [21]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5


In [23]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=4ec8846660f98a454ee1d223e65df73b50b8bb2c8d6a4629591a94f41a9e9e41
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [24]:
# =====================
# Compute BLEU & ROUGE
# =====================
import evaluate
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_score = rouge.compute(predictions=predictions, references=references)

print("\n=== Evaluation Scores ===")
print("BLEU:", bleu_score)
print("ROUGE:", rouge_score)


=== Evaluation Scores ===
BLEU: {'bleu': 0.0, 'precisions': [0.3181818181818182, 0.21052631578947367, 0.0625, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 3.142857142857143, 'translation_length': 22, 'reference_length': 7}
ROUGE: {'rouge1': np.float64(0.3277777777777778), 'rouge2': np.float64(0.08333333333333333), 'rougeL': np.float64(0.3277777777777778), 'rougeLsum': np.float64(0.3277777777777778)}
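
The BLEU score is exactly 0.0 even though the 1- to 3-gram precisions are non-zero. BLEU multiplies the brevity penalty by the geometric mean of the n-gram precisions, so a single order with no matches zeroes the whole score. Here the longest reference is only three tokens long, so no 4-gram can ever match. The sketch below reproduces that combination step from the reported precisions; the function name is hypothetical, and it is a simplified illustration rather than the evaluate library's code.

```python
import math


def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Combine n-gram precisions the way BLEU does: BP * geometric mean."""
    if min(precisions) == 0:
        return 0.0  # any n-gram order with zero matches zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


# Precisions copied from the BLEU output.
reported = [0.3181818181818182, 0.21052631578947367, 0.0625, 0.0]
print(bleu_from_precisions(reported))      # 0.0
print(bleu_from_precisions(reported[:3]))  # geometric mean of the 1- to 3-gram precisions
```

For short answers like these, a lower maximum n-gram order or an exact-match metric is likely to be more informative than 4-gram BLEU.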


# Understanding the Parameter Changes from LoRA Adapters

In [25]:
# =====================
# Check Trainable Parameters
# =====================
def print_trainable_parameters(model):
    trainable = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"Trainable params: {trainable:,}")
    print(f"All params: {all_params:,}")
    print(f"Trainable%: {100 * trainable / all_params:.4f}%")

print_trainable_parameters(model)


Trainable params: 1,572,864
All params: 709,218,304
Trainable%: 0.2218%


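**Understanding the results:**

All params: 709,218,304 → every tensor element in the loaded model (frozen base plus LoRA adapters). Because bitsandbytes packs two 4-bit weights into each stored byte, the quantized layers are counted at half their real size; the unquantized Falcon-RW-1B has roughly 1.3B parameters.

Trainable params: 1,572,864 → LoRA adapter weights only.

Trainable %: 0.2218% → about 1 in 451 of the counted elements is updated, instead of all of them as in full fine-tuning. Measured against the real ~1.3B parameters, the share is closer to 0.12%.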
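
The trainable count can be reproduced by hand. Each adapted layer gets two small matrices: A with shape (r × in_features) and B with shape (out_features × r). The sketch below assumes Falcon-RW-1B has a hidden size of 2048 and 24 decoder layers, and that query_key_value is a fused projection with 3 × hidden_size outputs; these values are assumptions that are not stated in the notebook, and the function name is hypothetical.

```python
def lora_param_count(in_features, out_features, r, num_layers):
    """Count LoRA weights: A is (r x in_features), B is (out_features x r), per layer."""
    per_layer = r * in_features + out_features * r
    return per_layer * num_layers


hidden_size = 2048  # assumption: Falcon-RW-1B hidden size
num_layers = 24     # assumption: Falcon-RW-1B decoder layers
trainable = lora_param_count(
    in_features=hidden_size,
    out_features=3 * hidden_size,  # fused query/key/value projection
    r=8,
    num_layers=num_layers,
)
print(f"{trainable:,}")  # 1,572,864
```

Under those assumptions the result matches the trainable count printed by print_trainable_parameters, and it shows that doubling r would double the number of trainable weights.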