In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git



In [3]:
!nvidia-smi -L

GPU 0: NVIDIA GeForce RTX 4070 Laptop GPU (UUID: GPU-97cd4c66-edab-89a9-54ab-591bf165d792)


# Setting up the Model

We load a small pre-trained model from the Hugging Face Model Hub in 8-bit precision. Storing the weights in 8 bits instead of 16 or 32 reduces memory usage, which makes the model practical on devices with limited resources.<br>
`device_map='auto'` will automatically place the model on the available device (GPU or CPU).
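
As a rough illustration of why 8-bit loading helps, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is taken from the "all params" value printed later in this notebook, minus the LoRA adapter weights. The helper name `weight_memory_gib` is hypothetical, and the estimate is deliberately simple: it ignores activations and optimizer state, and it treats every weight as 8-bit even though some modules may be kept in higher precision in practice.

```python
# Rough weight-memory estimate at different precisions (weights only).
BASE_PARAMS = 1_067_673_600 - 2_359_296  # "all params" minus the LoRA parameters

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB for a given storage width."""
    return num_params * bytes_per_param / 1024**3


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gib(BASE_PARAMS, nbytes):.2f} GiB")
```

Under these assumptions the 8-bit weights take a quarter of the space of `float32` weights, which is the main reason the model fits comfortably on a laptop GPU.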

In [4]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b1", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


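
The deprecation warning printed by the cell above suggests passing a `BitsAndBytesConfig` through `quantization_config` instead of `load_in_8bit=True`. A minimal sketch of that variant is shown below. The wrapper function `load_bloom_8bit` is a hypothetical helper introduced only for illustration; it is intended to be equivalent to the loading cell above and, like it, assumes a CUDA GPU with `bitsandbytes` installed.

```python
def load_bloom_8bit(model_name: str = "bigscience/bloom-1b1"):
    """Load a causal LM in 8-bit via `quantization_config` (sketch, needs a CUDA GPU)."""
    # Imports live inside the function so that defining it has no side effects.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,  # replaces the deprecated load_in_8bit=True
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```

Calling `model, tokenizer = load_bloom_8bit()` would then stand in for the loading cell.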

# Freezing the Original Weights

In [5]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

We iterate over all of the model's parameters (weights) and freeze them, which means they are not updated during training and no gradients are computed for them. Setting `requires_grad` to `False` is what prevents these weights from being updated during backpropagation.

The `param.ndim == 1` condition checks whether the parameter is a vector (i.e., it has only one dimension). In deep learning models, 1D parameters such as layer normalization weights and biases are more prone to numerical instability when they are not stored in high precision.

We therefore cast these 1D parameters to `float32`. Models often store weights in reduced precision such as `float16` or 8-bit formats for efficiency, but these small parameters cost little memory in `float32` and benefit from the extra precision.
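
To make the precision argument concrete, here is a small standard-library sketch of how `float16` rounding can swallow small values. It uses `struct`'s half-precision format to emulate `float16` storage, and Python's built-in 64-bit floats as the high-precision reference (an assumption made for simplicity; the notebook itself casts to `float32`). The helper `to_float16` is hypothetical.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# A small update is lost entirely when the result is stored in float16.
weight = 1.0
update = 1e-4
print("float16 result:       ", to_float16(weight + update))
print("high-precision result:", weight + update)

# Repeatedly adding a small value: the float16 running sum stops growing
# once the increment falls below half the spacing between representable values.
fp16_sum = 0.0
ref_sum = 0.0
for _ in range(10_000):
    fp16_sum = to_float16(fp16_sum + update)
    ref_sum += update

print(f"float16 sum: {fp16_sum}, reference sum: {ref_sum:.4f}")
```

The same effect applies to small parameters that accumulate many tiny changes, which is why keeping them in a wider format is a cheap safeguard.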

`gradient_checkpointing_enable()`: This enables gradient checkpointing, which reduces memory usage during backpropagation by not storing all intermediate activations in memory. Instead, activations are recomputed during the backward pass as needed, which can significantly reduce memory consumption, especially for large models. This is useful when training very large models where memory is limited, as it trades computation (recomputing activations) for lower memory usage.
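
The memory-for-compute trade can be sketched without any deep learning library. The toy below is not how `transformers` implements checkpointing; it is a hand-written chain of scalar `tanh` layers, with hypothetical function names, that compares standard backpropagation (store every layer input) against a checkpointed version (store one input per segment and recompute the rest during the backward pass).

```python
import math

# One scalar "layer" per weight: y = tanh(w * x). Sixteen layers in total.
WEIGHTS = [0.5 + 0.05 * i for i in range(16)]


def forward_layer(w, x):
    return math.tanh(w * x)


def backward_layer(w, x_in, grad_out):
    # Chain rule: d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return grad_out * w * (1.0 - math.tanh(w * x_in) ** 2)


def grad_store_all(x0):
    """Standard backprop: keep every layer input from the forward pass."""
    stored, x = [], x0
    for w in WEIGHTS:
        stored.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(WEIGHTS), reversed(stored)):
        grad = backward_layer(w, x_in, grad)
    return grad, len(stored)


def grad_checkpointed(x0, segment=4):
    """Keep only each segment's input; recompute the rest during backward."""
    checkpoints, x = [], x0
    for i, w in enumerate(WEIGHTS):
        if i % segment == 0:
            checkpoints.append(x)
        x = forward_layer(w, x)
    grad, peak_stored = 1.0, 0
    for seg in reversed(range(len(checkpoints))):
        seg_weights = WEIGHTS[seg * segment:(seg + 1) * segment]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs, x = [], checkpoints[seg]
        for w in seg_weights:
            inputs.append(x)
            x = forward_layer(w, x)
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(seg_weights), reversed(inputs)):
            grad = backward_layer(w, x_in, grad)
    return grad, peak_stored


full_grad, full_stored = grad_store_all(0.3)
ckpt_grad, ckpt_stored = grad_checkpointed(0.3)
print("store all:    grad =", full_grad, "stored values =", full_stored)
print("checkpointed: grad =", ckpt_grad, "peak stored values =", ckpt_stored)
```

Both variants produce the same gradient; the checkpointed one holds fewer values at any one time and pays for it with one extra forward pass per segment.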

`enable_input_require_grads()`: This makes the output of the input embedding layer require gradients. Because every base weight is frozen, the hidden states entering the transformer blocks would otherwise not require gradients, and with gradient checkpointing enabled the gradients could then fail to reach the adapter weights inside those blocks. The call does not update the inputs themselves; the token IDs are integers and stay fixed. It only ensures that gradients flow back through the frozen model to the trainable adapter parameters.

`CastOutputToFloat` class: This is a custom class that wraps the model's lm_head (typically the final layer used for language modeling). It casts the output of this layer to float32 for numerical stability.

`super().forward(x).to(torch.float32)`: This line calls the original forward pass of the lm_head and then casts the output to float32. This ensures that the model's final output is in a stable precision (float32), preventing potential precision issues that might occur if the model uses a lower precision for the output.

# Setting up the LoRA Adapters

In [6]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

`trainable_params`: Keeps track of the number of parameters that require gradients (i.e., will be updated during backpropagation).<br>
`all_param`: Counts all parameters in the model (whether trainable or not).

`model.named_parameters()`: Returns an iterator over all the parameters in the model along with their names (the name is unused here, hence _).<br>
`param.numel()`: Returns the total number of elements in the parameter tensor (e.g., if param is a weight matrix of shape `[512, 768]`, numel() would return 512 * 768 = 393216).<br>
Adds this count to all_param for every parameter, and only adds it to trainable_params if param.requires_grad is True, meaning it will be updated during training.<br>

* `trainable params`: Total number of parameters being optimized during training.
* `all params`: Total number of parameters in the model.
* `trainable%`: Percentage of parameters that are trainable.
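
The counting logic can be checked by hand on a toy example. The sketch below replaces real tensors with hypothetical `(name, shape, requires_grad)` tuples, so the names and shapes are made up for illustration; `math.prod` over a shape plays the role of `param.numel()`.

```python
from math import prod

# Hypothetical stand-ins for model parameters: (name, shape, requires_grad).
params = [
    ("embedding.weight", (512, 768), False),
    ("layer.weight", (768, 768), False),
    ("layer.bias", (768,), False),
    ("adapter.lora_A", (16, 768), True),
    ("adapter.lora_B", (768, 16), True),
]


def count_parameters(params):
    """Return (trainable, total) element counts, mirroring print_trainable_parameters."""
    trainable = 0
    total = 0
    for _name, shape, requires_grad in params:
        numel = prod(shape)  # number of elements, like param.numel()
        total += numel
        if requires_grad:
            trainable += numel
    return trainable, total


trainable, total = count_parameters(params)
print(
    f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}"
)
```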

In [7]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,  # rank of the LoRA update matrices
    lora_alpha=32,  # scaling factor (updates are scaled by lora_alpha / r)
    # target_modules=["q_proj", "v_proj"],  # optional: module names to adapt (model-specific)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2359296 || all params: 1067673600 || trainable%: 0.22097539922313336
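
The trainable parameter count printed above can be reproduced with simple arithmetic. The sketch below rests on assumptions that are not shown in the notebook: that bloom-1b1 has a hidden size of 1536 and 24 transformer blocks, and that PEFT attaches LoRA to the fused `query_key_value` projection of each block when `target_modules` is not given.

```python
# Assumed BLOOM-1b1 architecture values (not shown in the notebook output).
HIDDEN_SIZE = 1536   # width of each transformer block
NUM_LAYERS = 24      # number of transformer blocks
RANK = 16            # r from LoraConfig

# Assumption: LoRA wraps the fused query_key_value projection,
# which maps hidden_size -> 3 * hidden_size.
in_features = HIDDEN_SIZE
out_features = 3 * HIDDEN_SIZE

# lora_A has shape (r, in_features); lora_B has shape (out_features, r).
params_per_layer = RANK * in_features + out_features * RANK
lora_params = NUM_LAYERS * params_per_layer

print("LoRA parameters:", lora_params)
print("share of all params (%):", 100 * lora_params / 1_067_673_600)
```

Under these assumptions the result matches the `trainable params` value reported by `print_trainable_parameters`.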


* `LoraConfig` – defines the configuration for applying LoRA.
* `get_peft_model` – wraps your base model (here, bloom-1b1) with LoRA layers according to the config.

* `r=16`: Rank of the LoRA update matrices. Instead of training a full weight matrix (say, of size 4096x4096, about 16.8M values), LoRA trains two low-rank matrices of size (4096x16) and (16x4096), about 131K values in total. Much cheaper to train.
* `lora_alpha=32`: A scaling factor applied to the LoRA updates. The update is multiplied by `lora_alpha / r` (here 32/16 = 2), so larger values give the LoRA weights more influence.
* `lora_dropout=0.05`: Dropout applied to LoRA modules during training. Helps regularize.
* `bias="none"`: Whether to also fine-tune biases. `"none"` means don't touch them (which saves memory).
* `task_type="CAUSAL_LM"`: Specifies the type of model you're fine-tuning. Needed for internal logic (e.g., GPT-like models = causal LM).
* `target_modules=["q_proj", "v_proj"]`: (commented out) Optionally restricts LoRA to specific submodules, such as the query/value projection layers in attention. Useful for efficiency. Module names are model-specific; BLOOM, for example, uses a fused `query_key_value` projection rather than separate `q_proj` and `v_proj` layers. When the argument is omitted, PEFT falls back to its default targets for known architectures.
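
To show how `r` and `lora_alpha` interact, here is a minimal pure-Python sketch of a LoRA-adapted linear layer on tiny matrices. The function names and the toy values are hypothetical, dropout is omitted, and the zero initialization of `B` is an assumption that mirrors the common LoRA setup in which the adapter starts out as a no-op.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen projection plus scaled low-rank update: W x + (lora_alpha / r) * B (A x)."""
    scale = lora_alpha / r
    base = matvec(W, x)                 # frozen pre-trained weights
    update = matvec(B, matvec(A, x))    # low-rank path through r dimensions
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight, shape (2, 3)
A = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # trainable, shape (r, 3) with r = 2
B_init = [[0.0, 0.0], [0.0, 0.0]]        # trainable, shape (2, r), zero at the start
B_trained = [[0.5, 0.0], [0.0, 0.5]]     # made-up values standing in for a trained B
x = [1.0, 2.0, 3.0]

# lora_alpha / r = 4 / 2 = 2, the same ratio as 32 / 16 in the config above.
print("before training:", lora_forward(W, A, B_init, x, lora_alpha=4, r=2))
print("after training: ", lora_forward(W, A, B_trained, x, lora_alpha=4, r=2))
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained behavior and only the small `A` and `B` matrices need gradients.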

In [8]:
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")

In [9]:
def merge_columns(example):
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

data['train'] = data['train'].map(merge_columns)
data['train']["prediction"][:5]

["“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']",
 "“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.” ->: ['best', 'life', 'love', 'mistakes', 'out-of-control', 'truth', 'worst']",
 "“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.” ->: ['human-nature', 'humor', 'infinity', 'philosophy', 'science', 'stupidity', 'universe']",
 "“So many books, so little time.” ->: ['books', 'humor']",
 "“A room without books is like a body without a soul.” ->: ['books', 'simile', 'soul']"]

In [10]:
data['train'][0]

{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself',
  'gilbert-perreira',
  'honesty',
  'inspirational',
  'misattributed-oscar-wilde',
  'quote-investigator'],
 'prediction': "“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']"}

In [11]:
data = data.map(lambda samples: tokenizer(samples['prediction']), batched=True)


In [12]:
data

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags', 'prediction', 'input_ids', 'attention_mask'],
        num_rows: 2508
    })
})

# Training

In [13]:
trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=200, 
        learning_rate=2e-4, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()




No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Step | Training Loss |
|------|---------------|
| 1 | 3.7464 |
| 2 | 3.7678 |
| 3 | 3.3845 |
| 4 | 3.7682 |
| 5 | 3.3503 |
| 6 | 3.6392 |
| 7 | 3.6969 |
| 8 | 3.6399 |
| 9 | 3.4847 |
| 10 | 3.7513 |


TrainOutput(global_step=200, training_loss=3.556801886558533, metrics={'train_runtime': 697.4255, 'train_samples_per_second': 4.588, 'train_steps_per_second': 0.287, 'total_flos': 1499312352706560.0, 'train_loss': 3.556801886558533, 'epoch': 1.2743221690590112})

### Training Arguments
* `per_device_train_batch_size=4`: Each GPU processes a batch of 4 examples per forward/backward pass.
* `gradient_accumulation_steps=4`: Instead of updating the weights after every batch of 4 samples, gradients are accumulated over 4 batches before each update, giving an effective batch size of 4×4=16.
* `warmup_steps=100`: Gradually increase the learning rate over the first 100 steps to stabilize training.
* `max_steps=200`: Run a total of 200 optimizer steps.
* `learning_rate=2e-4`: Peak learning rate for the optimizer, reached at the end of warm-up.
* `fp16=True`: Enable mixed-precision training (faster and uses less GPU memory).
* `logging_steps=1`: Log metrics such as the loss at every step, which is useful for monitoring.
* `output_dir='outputs'`: Where to save checkpoints, logs, etc.
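
The arithmetic behind these settings can be written out as a short sketch. It assumes a single GPU and the default linear learning-rate schedule (linear warm-up followed by linear decay to zero); the function name `linear_schedule_lr` is hypothetical, and the runtime value is copied from the `TrainOutput` shown above.

```python
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
WARMUP_STEPS = 100
MAX_STEPS = 200
PEAK_LR = 2e-4
NUM_EXAMPLES = 2508        # rows in the train split
TRAIN_RUNTIME_S = 697.4255  # train_runtime from the TrainOutput above

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # single GPU assumed
samples_seen = effective_batch * MAX_STEPS
approx_epochs = samples_seen / NUM_EXAMPLES


def linear_schedule_lr(step: int) -> float:
    """Linear warm-up to PEAK_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))


print("effective batch size:", effective_batch)
print("samples seen:", samples_seen)
print(f"approximate epochs: {approx_epochs:.2f}")
print(f"samples per second: {samples_seen / TRAIN_RUNTIME_S:.3f}")
for step in (0, 50, 100, 150, 200):
    print(f"step {step:3d}: lr = {linear_schedule_lr(step):.2e}")
```

With `warmup_steps=100` and `max_steps=200`, half of this run is spent warming up, and the learning rate sits at its peak only around step 100. The sample throughput computed this way agrees with the `train_samples_per_second` value in the training output.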

**Data Collator:** `DataCollatorForLanguageModeling` uses your tokenizer to batch the examples and pad them to a common length. `mlm=False` selects Causal Language Modeling (CLM), as used by GPT-style models, rather than BERT-style masked language modeling. In this mode the labels are a copy of the input IDs, with padding positions excluded from the loss.
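
What the collator does for causal language modeling can be approximated in plain Python. This is a simplified sketch under stated assumptions, not the library's implementation: the function name `collate_causal_lm` and the token IDs are hypothetical, the padding side is a parameter because it depends on the tokenizer, and only the padded positions are masked with `-100`, the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def collate_causal_lm(sequences, pad_token_id, padding_side="right"):
    """Pad token-ID lists to a common length and build labels for causal LM."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids, mask, labels = list(seq), [1] * len(seq), list(seq)
        pad_ids = [pad_token_id] * n_pad
        pad_mask = [0] * n_pad
        pad_labels = [IGNORE_INDEX] * n_pad  # padding never contributes to the loss
        if padding_side == "right":
            ids, mask, labels = ids + pad_ids, mask + pad_mask, labels + pad_labels
        else:
            ids, mask, labels = pad_ids + ids, pad_mask + mask, pad_labels + labels
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
        batch["labels"].append(labels)
    return batch


example = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
for key, value in example.items():
    print(key, value)
```

The labels are not shifted here; for causal LM the model shifts them internally so that each position predicts the next token.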

**Disable Cache During Training:** Some models cache attention keys/values to speed up inference. The cache is not needed during training and is incompatible with gradient checkpointing, where it triggers warnings, so it is disabled here. Enable it again at inference time for speed.

# Inference

Here we first tokenize the prompt and then generate inside an autocast context. Autocasting enables automatic mixed precision (AMP) on the GPU: operations run in FP16 where it is safe to do so, which saves memory and speeds up inference with little loss of accuracy.

In [16]:
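device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # turn off dropout (including LoRA dropout) for generation
model.config.use_cache = True  # re-enable the key/value cache disabled for training

batch = tokenizer("“Training models with PEFT and LoRA is cool” ->: ", return_tensors='pt').to(device)

# torch.autocast replaces the deprecated torch.cuda.amp.autocast
with torch.autocast(device_type=device.type):
  output_tokens = model.generate(**batch, max_new_tokens=50)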

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

DynamicCache + torch.export is tested on torch 2.6.0+ and may not work on earlier versions.




 “Training models with PEFT and LoRA is cool” ->:  I think the best way to do this is to use the PEFT library. It is a library that allows you to use the PEFT protocol to send and receive data. It is also a library that allows you to use the LoRA protocol to
