Useful links:

* https://huggingface.co/docs/trl/en/sft_trainer
* https://huggingface.co/docs/peft/main/en/install
* https://huggingface.co/docs/diffusers/optimization/mps



In [1]:
#bitsandbytes guide: https://huggingface.co/docs/bitsandbytes/main/en/installation
!pip install bitsandbytes
!pip install trl



In [2]:
# Import necessary libraries
import json
import os
import pprint

import torch
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model
from safetensors.torch import load_file
# BitsAndBytesConfig is used for quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer, setup_chat_format

# Mount Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


adapted from: https://colab.research.google.com/github/huggingface/notebooks/blob/main/course/en/chapter11/section4.ipynb#scrollTo=FLGpCGYrf-kM

In [3]:
# NOTE: Why adapter_model.safetensors was 1.18 GB with Qwen and merge failed
# ---------------------------------------------------------------
# Problem:
# - Calling `setup_chat_format(model, tokenizer)` on a *base* model caused
#   new special tokens to be added → vocab resized.
# - This forced `embed_tokens` and `lm_head` (very large layers) to be saved
#   inside adapter_model.safetensors.
# - As a result, the adapter file ballooned to ~1.18 GB and later caused
#   shape-mismatch errors when merging with the original base model.
#
# Fix:
# - Do NOT call `setup_chat_format` (or anything that resizes embeddings)
#   when fine-tuning a base model with LoRA.
# - Instead:
#     * Use an instruction-tuned variant (for Qwen3 that is Qwen/Qwen3-0.6B itself;
#       the raw base is Qwen/Qwen3-0.6B-Base), which already has a chat
#       template → no vocab changes.
#     * OR, stay on the base model but remove setup_chat_format and
#       handle formatting at the string level without altering tokenizer vocab.
# - Always keep adapters and merged models in separate output dirs.
#
# After this fix:
# - adapter_model.safetensors should be tens of MB (only LoRA deltas),
#   with no `lm_head` or `embed_tokens` inside.
# - Merge with `AutoPeftModelForCausalLM.from_pretrained(...).merge_and_unload()`
#   will succeed without size mismatch.


device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "Qwen/Qwen3-0.6B"

# Configure BitsAndBytes for 8-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    quantization_config=bnb_config, # Add the quantization config
    device_map="auto", # Automatically map the model to available devices
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

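# Chat-format setup is intentionally disabled (see the note at the top of this cell).
# It may still be needed for models without a chat template, e.g. SmolLM2.
#model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)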

#save location
finetune_name = "/content/drive/MyDrive/!personalMLProject/rag_llm_finetune/qwen3-0.6b-Rust-FT"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
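
A rough back-of-the-envelope check of the note at the top of this cell. The vocabulary size (151,936), the hidden size (1,024) and the 4-bytes-per-value storage are assumptions about Qwen3-0.6B and the saved file, not values read from this run, so treat this as a sketch rather than a measurement.

```python
# Estimate the size of an adapter file that also contains embed_tokens and
# lm_head. All dimensions below are ASSUMED for Qwen3-0.6B.
VOCAB_SIZE = 151_936      # assumed vocabulary size (before any resize)
HIDDEN_SIZE = 1_024       # assumed hidden size
BYTES_PER_VALUE = 4       # assumes tensors are stored as float32
LORA_PARAMS = 5_046_272   # trainable LoRA params reported later in this notebook


def gib(num_params: int) -> float:
    """Convert a parameter count to GiB at BYTES_PER_VALUE bytes each."""
    return num_params * BYTES_PER_VALUE / 2**30


embed_params = VOCAB_SIZE * HIDDEN_SIZE   # one vocabulary-sized matrix
big_layers = 2 * embed_params             # embed_tokens + lm_head

lora_only_mib = LORA_PARAMS * BYTES_PER_VALUE / 2**20
bloated_gib = gib(big_layers + LORA_PARAMS)

print(f"LoRA-only adapter:    ~{lora_only_mib:.2f} MiB")
print(f"With embed + lm_head: ~{bloated_gib:.2f} GiB")
```

Under these assumptions the two vocabulary-sized matrices account for almost the entire file, which is consistent with the ~1.18 GB reported in the note.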


In [4]:
# Load the dataset
dataset = load_dataset("json", data_files="/content/rust_qa_dataset_5k.jsonl")

# Format the dataset for fine-tuning
def format_dataset(example):
    # Assuming each example has 'question' and 'answer' keys
    # You might need to adjust this based on your specific data structure
    return {
        "text": f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    }

dataset = dataset.map(format_dataset)

# Split the dataset into training and evaluation sets (optional)
dataset = dataset["train"].train_test_split(test_size=0.10)

display(dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'text'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['question', 'answer', 'text'],
        num_rows: 500
    })
})
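
To make the template concrete, here is the same formatting function applied to a single record. The sample question and answer are made up for illustration and do not come from rust_qa_dataset_5k.jsonl.

```python
def format_dataset(example):
    """Same template as the cell above: question and answer joined into one 'text' field."""
    return {
        "text": f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    }


# Hypothetical record, invented for this example
sample = {
    "question": "What does `let mut` do?",
    "answer": "It declares a mutable variable binding.",
}

formatted = format_dataset(sample)
print(formatted["text"])
```

`dataset.map` merges the returned dict into each record, which is why the split shown above lists `text` alongside `question` and `answer`.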

In [5]:

# Configure LoRA parameters
# r: rank of the LoRA update matrices (smaller = fewer trainable parameters; 8-32 is a typical range)
rank_dimension = 8
# lora_alpha: scaling factor for the LoRA update (higher = stronger adaptation; commonly set to about 2*r)
lora_alpha = 16
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension
    lora_alpha=lora_alpha,  # LoRA scaling factor, here 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" leaves all bias terms frozen
    target_modules=["q_proj","k_proj","v_proj","o_proj","up_proj","down_proj","gate_proj"], # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
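
A toy sketch of what `r` and `lora_alpha` control, using plain Python lists instead of tensors. It assumes the standard LoRA formulation (update = (lora_alpha / r) * B @ A, with B starting at zero); the layer sizes and the Gaussian init for A are illustrative choices, not what PEFT does internally.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


r, lora_alpha = 8, 16        # values from the config above
d_in, d_out = 4, 3           # toy layer sizes, chosen only for illustration
scaling = lora_alpha / r     # standard LoRA scaling factor

random.seed(0)
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]  # r x d_in, random init
B = [[0.0] * r for _ in range(d_out)]                                 # d_out x r, zero init

# Update that is added to the frozen weight: delta_W = scaling * (B @ A)
delta_W = [[scaling * v for v in row] for row in matmul(B, A)]

print("scaling:", scaling)
print("delta_W is all zeros at init:", all(v == 0.0 for row in delta_W for v in row))
```

Because B starts at zero, the wrapped layer initially behaves exactly like the frozen base layer; training then moves only A and B.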

In [6]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # With LoRA, 1-3 epochs is usually enough; more may overfit
    # Batch size settings
    per_device_train_batch_size=4,  # Batch size per GPU. >4 may cause gpu mem issues for T4
    gradient_accumulation_steps=4,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=torch.cuda.is_bf16_supported(),  # Use bfloat16 when the GPU supports it
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging
    max_length=1512,  # Max sequence length in tokens; longer examples are truncated
    packing=False,  # Sequence packing disabled; enabling it raised an error in this setup
    dataset_kwargs={
        "add_special_tokens": False,  # Do not add extra special tokens during tokenization
        "append_concat_token": False,  # No additional separator needed
    },

)
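
A quick arithmetic sketch of what the batch settings above imply for one epoch. It assumes a single GPU and that the last, partial accumulation window still counts as one optimizer step; the result lines up with the `checkpoint-282` directory that appears in the output listing further down.

```python
import math

num_train_examples = 4500            # train split size shown earlier
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1                      # ASSUMPTION: a single GPU (e.g. one Colab T4)
num_train_epochs = 1

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# ASSUMPTION: a partial final batch is rounded up to a full step
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print("effective batch size:", effective_batch)
print("optimizer steps:", total_steps)
```

With `logging_steps=10`, a run of this length would log the training loss roughly 28 times.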

In [7]:
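# Sanity check: wrap the model with the PEFT config and inspect the trainable params.
# NOTE: get_peft_model injects the LoRA layers into `model` in place. To avoid wrapping
# twice, skip this cell (or reload the model) before passing `model` together with
# `peft_config` to SFTTrainer below, or pass `peft_model` without `peft_config`.

peft_model = lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()  # prints directly and returns None

trainable params: 5,046,272 || all params: 601,096,192 || trainable%: 0.8395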
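
The trainable-parameter count above can be reproduced by hand. The layer shapes below are assumptions about the Qwen3-0.6B architecture (they are not read from the loaded model), so this is a plausibility check rather than an authoritative breakdown.

```python
# ASSUMED Qwen3-0.6B shapes (not read from the model in this notebook)
HIDDEN = 1024
NUM_LAYERS = 28
Q_OUT = 16 * 128       # attention heads * head_dim
KV_OUT = 8 * 128       # key/value heads * head_dim
INTERMEDIATE = 3072
R = 8                  # LoRA rank from the config above

# (in_features, out_features) of each targeted linear layer
target_shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per wrapped layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params total:     {total:,}")
```

The count scales linearly with `r`, so doubling the rank to 16 would double the adapter size under the same assumptions.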


In [8]:
# max_seq_length is deprecated; use max_length in SFTConfig instead: https://huggingface.co/docs/trl/en/sft_trainer
# packing and dataset_kwargs have also moved to SFTConfig

# Create SFTTrainer with the LoRA configuration.
# Use this variant if you want the trainer to wrap the base model with the PEFT config.
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer,  # the `tokenizer` parameter was renamed to `processing_class`

)

# Use this variant instead if you want to pass in the already wrapped PEFT model yourself
# trainer = SFTTrainer(
#     model=peft_model,
#     args=args,
#     train_dataset=dataset["train"],
#     eval_dataset=dataset["test"],
#     processing_class=tokenizer,  # the `tokenizer` parameter was renamed to `processing_class`
# )




Adding EOS to train dataset:   0%|          | 0/4500 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/4500 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/4500 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

In [10]:
# Start training; checkpoints are written to output_dir (push_to_hub=False, so nothing is uploaded to the Hub)
trainer.train()

# save model (since using peft, will only save the adapter model and not the full model)
trainer.save_model()



Step,Training Loss
10,2.3828
20,1.4004
30,1.1897
40,1.0885
50,1.0653
60,1.0483
70,0.9812
80,0.9606
90,0.8942
100,0.8601


In [11]:
# Sanity check: confirm that only the adapter was saved, and nothing large like lm_head or embed_tokens

print("OUTPUT DIR:", args.output_dir)
print(sorted(os.listdir(args.output_dir))[:20])  # see what's there

# 1) adapter_config.json must exist and be small
with open(os.path.join(args.output_dir, "adapter_config.json")) as f:
    cfg = json.load(f)
pprint.pp(cfg)  # base_model_name_or_path should equal your model_name exactly

# 2) inspect what's inside adapter_model.safetensors
path = os.path.join(args.output_dir, "adapter_model.safetensors")
tens = load_file(path)
print("num tensors:", len(tens))
print("total params (M):", sum(v.numel() for v in tens.values())/1e6)
print("examples:", list(tens.keys())[:10])

# 3) both checks should print False; True means huge layers were captured in the adapter
print("lm_head present?", any("lm_head" in k for k in tens.keys()))
print("embed tokens present?", any("embed" in k for k in tens.keys()))

OUTPUT DIR: /content/drive/MyDrive/!personalMLProject/rag_llm_finetune/qwen3-0.6b-Rust-FT
['README.md', 'adapter_config.json', 'adapter_model.safetensors', 'added_tokens.json', 'chat_template.jinja', 'checkpoint-282', 'merges.txt', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'training_args.bin', 'vocab.json']
{'alpha_pattern': {},
 'auto_mapping': None,
 'base_model_name_or_path': 'Qwen/Qwen3-0.6B',
 'bias': 'none',
 'corda_config': None,
 'eva_config': None,
 'exclude_modules': None,
 'fan_in_fan_out': False,
 'inference_mode': True,
 'init_lora_weights': True,
 'layer_replication': None,
 'layers_pattern': None,
 'layers_to_transform': None,
 'loftq_config': {},
 'lora_alpha': 16,
 'lora_bias': False,
 'lora_dropout': 0.05,
 'megatron_config': None,
 'megatron_core': 'megatron.core',
 'modules_to_save': None,
 'peft_type': 'LORA',
 'qalora_group_size': 16,
 'r': 8,
 'rank_pattern': {},
 'revision': None,
 'target_modules': ['q_proj',
                    'k_p
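
The name-based check from the cell above can also be tried on a plain list of tensor names, without loading a file. The key names below are illustrative only; real names depend on the model and the PEFT version, and `find_unexpected` is a hypothetical helper, not a PEFT function.

```python
# Illustrative tensor names only; real names depend on the model and PEFT version.
example_keys = [
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight",
    "base_model.model.lm_head.weight",
    "base_model.model.model.embed_tokens.weight",
]


def find_unexpected(keys):
    """Return tensor names that are not LoRA matrices (lora_A / lora_B)."""
    return [k for k in keys if "lora_A" not in k and "lora_B" not in k]


unexpected = find_unexpected(example_keys)
print("unexpected tensors:", unexpected)
```

For a healthy LoRA-only adapter this list should come back empty; passing `tens.keys()` from the cell above applies the same check to the saved file.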

Merge the adapter into the base model and save

In [12]:
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# Keep the merged model in its own directory, separate from the adapter
# (see the note in cell [3]). Saving into args.output_dir would mix adapter
# files and full model weights in one place.
merged_name = finetune_name + "-merged"

# Load base model + adapter on CPU (no device_map)
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    dtype=dtype,
    low_cpu_mem_usage=True,
)

# Merge the LoRA weights into the base model and save the result
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    merged_name, safe_serialization=True, max_shard_size="5GB"
)
tokenizer.save_pretrained(merged_name)

load the merged model and test

In [13]:
# free the memory again
del model
del merged_model
del trainer
torch.cuda.empty_cache()

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the merged model. It is a plain transformers model, so no PEFT wrapper is needed;
# AutoPeftModelForCausalLM would instead reload the base model and apply the adapter on top.
tokenizer = AutoTokenizer.from_pretrained(merged_name)
model = AutoModelForCausalLM.from_pretrained(
    merged_name, device_map="auto", dtype=dtype
)
pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer
)

Device set to use cuda:0


In [17]:
#inference call
prompts = [
    "What is match expression and if let in rust?",
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
Is this model finetuned with lora?
    response:
<think>
Okay, let's see. The user is asking if a model is finetuned with LORA. First, I need to recall how LORA works. LORA stands for LoRA, which is a technique for fine-tuning large language models. The user might be confused about the difference between fine-tuning with Lora vs. a regular fine-tuning method. I should explain that LORA adds a few layers of a pre-trained model for enhanced performance, but not necessarily making it finetuned. The answer should clarify that LORA is a method for fine-tuning, not a separate model. Make sure to avoid technical jargon and keep it simple.
</think>

LORA is a technique for fine-tuning large language models, not a separate model.
--------------------------------------------------
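
The training text used a plain `### Question:` / `### Answer:` layout, while the inference cell above prompts through the chat template. A minimal sketch of prompting in the training layout instead; `build_prompt` and `extract_answer` are hypothetical helpers, and whether this gives better answers than the chat template has not been tested here.

```python
def build_prompt(question: str) -> str:
    """Format a question like the 'text' field used for training, up to the answer."""
    return f"### Question:\n{question}\n\n### Answer:\n"


def extract_answer(generated_text: str, prompt: str) -> str:
    """Strip the prompt prefix from a text-generation output, if present."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = build_prompt("What is match expression and if let in rust?")
print(prompt)

# With the pipeline from the cells above, usage would look like this (not run here):
# outputs = pipe(prompt, max_new_tokens=256)
# print(extract_answer(outputs[0]["generated_text"], prompt))
```

Comparing the two prompt styles on a handful of held-out questions from `dataset["test"]` would show which one the fine-tuned model responds to better.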
