## Finetuning
finetuning is a process where we further train a pre-trained model on a small dataset to improve its performance on a specific task

we will start with installing dependencies, such as unsloth.

Unsloth is a high-level wrapper around **Hugging Face** way when it comes to both time and space taken.

As you know, finetuning is a compute intesive process, so optimizations like these will help us run it even on simple google colab with T4 GPU

Run the cell to get start

In [None]:
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Loading the model we will be using

This block loads a lightweight, instruction-tuned LLaMA 3 model in 4-bit precision with support for long sequences,

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length=2048
dtype=None
load_in_4bit=True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit
)


This code block applies PEFT (Parameter-Efficient Fine-Tuning) to the previously loaded model using LoRA (Low-Rank Adaptation),

🧠 What is PEFT?
- Parameter-Efficient Fine-Tuning lets you fine-tune only a small subset of a model's parameters (e.g., via LoRA) instead of the entire model, this speeds up our training process

🧠 What is LoRA?
- LoRA injects small trainable layers (low-rank matrices) into existing layers of the model (like query/key/value projections).

- these layers are also called ADAPTERS

- An adapter is just a tiny neural layer (usually a small linear transformation) that you insert into a big model to make it adapt to a new task without changing the whole model.

Homework: read more about LoRa and PEFT

In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length = max_seq_length,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.3.19 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


### Loading our dataset

In [20]:
from datasets import load_dataset
dataset = load_dataset("json", data_files="mriirs_conversation_format.jsonl", split="train")

In [8]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

In [9]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/15 [00:00<?, ? examples/s]

Map:   0%|          | 0/15 [00:00<?, ? examples/s]

In [21]:
dataset[0]

{'conversations': [{'from': 'human',
   'value': 'Where can I find more information about the School of Engineering and Technology?'},
  {'from': 'gpt',
   'value': 'You can visit the official website at https://manavrachna.edu.in/mriirs/school-of-engineering-technology for detailed information about the School of Engineering and Technology.'}]}

In [10]:
dataset[5]["conversations"]

[{'content': 'What are the operating hours of the T Block Library?',
  'role': 'user'},
 {'content': 'The T Block Library is open until 8:30 PM.',
  'role': 'assistant'}]

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/15 [00:00<?, ? examples/s]

In [12]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/15 [00:00<?, ? examples/s]

In [13]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are the operating hours of the T Block Library?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe T Block Library is open until 8:30 PM.<|eot_id|>'

In [14]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                               The T Block Library is open until 8:30 PM.<|eot_id|>'

In [15]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 15 | Num Epochs = 10 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Step,Training Loss
1,2.7625
2,2.8661
3,3.6811
4,3.5522
5,2.6562
6,2.8019
7,2.9658
8,2.4956
9,2.5217
10,2.5885


In [16]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer
import torch

# Make sure tokenizer is wrapped
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# Enable fast inference
FastLanguageModel.for_inference(model)

# The function
def ask_bot(prompt, stream=True, max_tokens=128, temperature=1.5, min_p=0.1):
    # Create chat message
    messages = [{"role": "user", "content": prompt}]

    # Tokenize using the llama-3.1 template
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    attention_mask = inputs != tokenizer.pad_token_id
    # Stream mode
    if stream:
        streamer = TextStreamer(tokenizer, skip_prompt=True)
        _ = model.generate(
            input_ids=inputs,
            streamer=streamer,
            max_new_tokens=max_tokens,
            use_cache=True,
            temperature=temperature,
            min_p=min_p,
        )
        return None  # since it's being streamed out


    # Standard generation mode
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=max_tokens,
        use_cache=True,
        temperature=temperature,
        min_p=min_p,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


Model does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


In [17]:
ask_bot("Which professor handles Python and ECT?")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Dr. Jatinder Sharma teaches Python and ECT in the School of Engineering and Technology. He is associated with Room AF03 during the A Block and the C Block during the P Block for classes. His office is open until 5 PM, which can be reached via the IRIS portal.<|eot_id|>
