## Fine-tune Llama-3.2-3B with distillation using Unsloth and TRL

Heavily adapted from Unsloth's [Llama-3.2 (3B) notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb).

> Note: Install `unsloth` to match your system, platform, and library versions by following the [installation guide](https://docs.unsloth.ai/get-started/installing-+-updating).

In [1]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B-Instruct", 
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[2025-03-10 18:09:45,499] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


==((====))==  Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.46.2.
   \\   /|    NVIDIA GeForce RTX 3050 Laptop GPU. Num GPUs = 1. Max memory: 4.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Any rank > 0 works; suggested values: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Rank-stabilized LoRA is also supported
    loftq_config = None, # As is LoftQ
)

Unsloth 2025.3.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
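
As a sanity check on the LoRA setup above, the number of trainable parameters can be estimated by hand. The sketch below assumes the layer shapes from the public Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers); none of these are printed in this notebook, and `lora_param_count` is a hypothetical helper, not an Unsloth or PEFT function.

```python
# Shapes assumed from the public Llama-3.2-3B config (not shown in this notebook).
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 28
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128

# (in_features, out_features) for every module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: each adapted layer gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


print(lora_param_count(16))
```

With `r = 16` this evaluates to 24,313,856, which agrees with the "Trainable parameters" figure in the training banner further down. Because the count is linear in the rank, doubling `r` doubles the adapter size.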


In [3]:
from datasets import load_dataset

dataset = load_dataset('jeffmeloy/sonnet3.5_science_conversations')

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['conversation'],
        num_rows: 8835
    })
})

In [5]:
dataset = dataset.rename_column('conversation', 'conversations')

In [6]:
# dataset = dataset['train'].train_test_split(test_size=0.1)

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 8835
    })
})

In [8]:
dataset['train'][10]

{'conversations': [{'from': 'system',
   'value': 'You are an expert molecular biologist specializing in DNA replication mechanisms. Provide detailed, accurate information on DNA replication fidelity and proofreading processes. Use clear explanations suitable for an interactive textbook, incorporating technical terms and concepts while maintaining accessibility for biology students. Offer specific examples, numerical data, and comparisons to illustrate key points.'},
  {'from': 'human',
   'value': 'What is DNA replication fidelity, and why is it important?'},
  {'from': 'gpt',
   'value': 'DNA replication fidelity refers to the accuracy with which genetic information is copied during cell division. It is crucial for maintaining genome integrity and preventing mutations that could lead to genetic disorders or cancer. The error rate in DNA replication is remarkably low, approximately 1 in 10^9 to 10^10 nucleotides. This high fidelity is achieved through three main mechanisms: base-pairi

In [9]:
# Convert ShareGPT format (from/value keys) to the Hugging Face role/content format

from unsloth import standardize_sharegpt

train_dataset = standardize_sharegpt(dataset['train'])
# test_dataset = standardize_sharegpt(dataset['test'])
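
To make the conversion concrete, here is a rough pure-Python sketch of what it does to a single conversation. It is only an approximation of `standardize_sharegpt`: the role mapping below covers just the three `from` values seen in this dataset, and `sharegpt_to_messages` is a hypothetical name used for illustration.

```python
# Role names seen in this dataset, mapped to the role/content convention.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(conversation):
    """Turn [{'from': ..., 'value': ...}] into [{'role': ..., 'content': ...}]."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


example = [
    {"from": "system", "value": "You are an expert molecular biologist."},
    {"from": "human", "value": "What is DNA replication fidelity?"},
    {"from": "gpt", "value": "It is the accuracy of DNA copying."},
]
for message in sharegpt_to_messages(example):
    print(message["role"], "->", message["content"])
```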

In [10]:
# Apply the Llama-3.1 chat template (the same format is used for Llama-3.2)

from unsloth.chat_templates import get_chat_template


tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

In [11]:
# Map the datasets with chat template

train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
# test_dataset = test_dataset.map(formatting_prompts_func, batched=True)

In [12]:
train_dataset['conversations'][0]

[{'content': 'You are an expert biochemist specializing in enzyme kinetics and metabolic pathways. Provide detailed, accurate information on enzyme kinetics, metabolic pathways, and their interactions. Use technical terminology appropriately and explain complex concepts clearly. Offer practical examples and applications when relevant. Be prepared to discuss experimental techniques, mathematical models, and recent research developments in the field.',
  'role': 'system'},
 {'content': 'What is the Michaelis-Menten equation and how does it relate to enzyme kinetics?',
  'role': 'user'},
 {'content': "The Michaelis-Menten equation is a fundamental model in enzyme kinetics that describes the relationship between substrate concentration and reaction rate. It is expressed as:\n\nv = (Vmax * [S]) / (Km + [S])\n\nWhere:\nv = reaction rate\nVmax = maximum reaction rate\n[S] = substrate concentration\nKm = Michaelis constant\n\nThis equation relates to enzyme kinetics by providing a quantitative

In [13]:
print(train_dataset[0]['text'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are an expert biochemist specializing in enzyme kinetics and metabolic pathways. Provide detailed, accurate information on enzyme kinetics, metabolic pathways, and their interactions. Use technical terminology appropriately and explain complex concepts clearly. Offer practical examples and applications when relevant. Be prepared to discuss experimental techniques, mathematical models, and recent research developments in the field.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the Michaelis-Menten equation and how does it relate to enzyme kinetics?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The Michaelis-Menten equation is a fundamental model in enzyme kinetics that describes the relationship between substrate concentration and reaction rate. It is expressed as:

v = (Vmax * [S]) / (Km + [S])

Where:
v = reaction rate
Vmax = maximum r
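
The layout of the rendered text above can be approximated with a few lines of plain Python. This is a simplified sketch, not the real Jinja template: it deliberately omits the "Cutting Knowledge Date / Today Date" preamble that the `llama-3.1` template injects into the system turn, and `render_llama3` is a hypothetical helper.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Approximate the Llama-3 chat layout: a header, the content, then <|eot_id|> per turn."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo = [
    {"role": "user", "content": "What is the Michaelis-Menten equation?"},
    {"role": "assistant", "content": "v = (Vmax * [S]) / (Km + [S])"},
]
print(render_llama3(demo))
```

For training, `add_generation_prompt` stays `False` because every example already ends with a complete assistant turn; it is only switched on when prompting the model for a new reply.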

In [14]:
import os
import dotenv
import wandb

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

dotenv.load_dotenv()
wandb.init(entity=os.getenv('wandb_username'), project=os.getenv('wandb_project_name'))

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    # eval_dataset = test_dataset,  # Evaluation needs extra VRAM; enable it alongside training only with more than 4 GB
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        # per_device_eval_batch_size = 1,
        gradient_accumulation_steps = 2,
        # eval_accumulation_steps= 4,
        warmup_steps = 5,
        # eval_steps= 10,
        # eval_strategy = "steps",
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 120,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        logging_strategy='steps',
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "scillama-3.2-3b_outputs",
        report_to = "wandb",
        save_steps=100,
    ),
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33msenthilkumarn[0m ([33mnsk[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: We found double BOS tokens - we shall remove one automatically.



max_steps is given, it will override any value given in num_train_epochs
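
With `warmup_steps = 5`, `max_steps = 120`, and `lr_scheduler_type = "linear"`, the learning rate ramps up to `2e-4` over the first five optimizer steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python under the usual linear-with-warmup formula; `linear_schedule_lr` is a hypothetical helper, and `step` is assumed to count completed optimizer steps.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=5, total_steps=120):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 60, 120):
    print(step, linear_schedule_lr(step))
```

The schedule ends at exactly zero on the last step, which is consistent with the final `train/learning_rate` of 0.0 reported in the run summary.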


In [15]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

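
Conceptually, `train_on_responses_only` keeps the labels for assistant turns and replaces everything else with `-100`, the index that the cross-entropy loss ignores. The following is a minimal sketch of that idea on toy token IDs, not Unsloth's actual implementation; `find_all` and `mask_non_responses` are hypothetical helpers, and the marker IDs are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def find_all(seq, pattern):
    """Return every start index at which pattern occurs in seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy IDs: [1, 2] marks a user header, [1, 3] an assistant header, 99 is end-of-turn.
USER, ASSISTANT = [1, 2], [1, 3]
ids = [0, 9] + USER + [10, 11, 99] + ASSISTANT + [20, 21, 99] + USER + [12, 99] + ASSISTANT + [22, 99]
print(mask_non_responses(ids, USER, ASSISTANT))
```

The next two cells inspect the real result: the decoded `input_ids` still contain the full conversation, while the decoded `labels` show blanks wherever a token was masked.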

In [16]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an expert biochemist specializing in enzyme kinetics and metabolic pathways. Provide detailed, accurate information on enzyme kinetics, metabolic pathways, and their interactions. Use technical terminology appropriately and explain complex concepts clearly. Offer practical examples and applications when relevant. Be prepared to discuss experimental techniques, mathematical models, and recent research developments in the field.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the Michaelis-Menten equation and how does it relate to enzyme kinetics?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Michaelis-Menten equation is a fundamental model in enzyme kinetics that describes the relationship between substrate concentration and reaction rate. It is expressed as:\n\nv = (Vmax * [S]) / (Km + [S])\n\nWhere:\nv = reaction rate\n

In [17]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[0]["labels"]])

"                                                                                                                      \n\nThe Michaelis-Menten equation is a fundamental model in enzyme kinetics that describes the relationship between substrate concentration and reaction rate. It is expressed as:\n\nv = (Vmax * [S]) / (Km + [S])\n\nWhere:\nv = reaction rate\nVmax = maximum reaction rate\n[S] = substrate concentration\nKm = Michaelis constant\n\nThis equation relates to enzyme kinetics by providing a quantitative description of how enzyme-catalyzed reactions proceed. It assumes a single substrate binding to the enzyme's active site, forming an enzyme-substrate complex before product formation. The Michaelis constant (Km) represents the substrate concentration at which the reaction rate is half of Vmax, indicating the enzyme's affinity for the substrate. Lower Km values suggest higher affinity.<|eot_id|>                   \n\nVmax and Km can be determined experimentally using several met

In [18]:
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3050 Laptop GPU. Max memory = 4.0 GB.
2.66 GB of memory reserved.


In [19]:
trainer_stats = trainer.train()
wandb.finish()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8,835 | Num Epochs = 1 | Total steps = 120
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 24,313,856/1,865,526,272 (1.30% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.3366
2,1.2587
3,1.1447
4,1.1874
5,1.2954
6,1.1493
7,1.1121
8,0.8456
9,1.2197
10,1.1795


Metric,History
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇█████
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇███
train/grad_norm,▅▆▁▃▃▄▄▅▄▃▅▅▅▅▅▅▄▅▄▄▅▄▆▆▅▆▄▆▆▇▆▆▅▇▆▆▇▇▆█
train/learning_rate,▂▄▅▇██▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▁
train/loss,█▅▃▄▇▂▅▂▄▁▅▆▃▅▄▆▆▂▃▄▆▅▂▂▇▄▃▄▁▄▂▆▅▆▄▅▆▇▂▄

Metric,Value
total_flos,8384528787701760.0
train/epoch,0.05432
train/global_step,120.0
train/grad_norm,0.30234
train/learning_rate,0.0
train/loss,1.0134
train_loss,1.06911
train_runtime,1173.3303
train_samples_per_second,0.409
train_steps_per_second,0.102
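
The throughput and epoch figures in the summary follow directly from the batch settings. A short arithmetic check, using only numbers printed in this notebook:

```python
per_device_batch = 2
grad_accum = 2
num_gpus = 1
max_steps = 120
num_examples = 8835
train_runtime_s = 1173.3303

# One optimizer step consumes per_device_batch * grad_accum * num_gpus examples.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps

epoch_fraction = examples_seen / num_examples
samples_per_second = examples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"examples seen:        {examples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.4f}")
print(f"samples per second:   {samples_per_second:.3f}")
print(f"steps per second:     {steps_per_second:.3f}")
```

In other words, this run sees only 480 of the 8,835 conversations, roughly 5% of one epoch, so the reported loss reflects a short demonstration run rather than a full pass over the data.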


In [20]:
# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1173.3303 seconds used for training.
19.56 minutes used for training.
Peak reserved memory = 3.127 GB.
Peak reserved memory for training = 0.467 GB.
Peak reserved memory % of max memory = 78.175 %.
Peak reserved memory for training % of max memory = 11.675 %.


In [21]:
# test_dataset[0]

In [22]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "which is greater 9.9 or 9.11?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 512, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhich is greater 9.9 or 9.11?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n9.11 is greater than 9.9.<|eot_id|>']
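
The `temperature` and `min_p` arguments passed to `generate` control how the next-token distribution is reshaped before sampling. Below is a minimal pure-Python sketch of the idea, assuming temperature scaling is applied first and min-p filtering second; `min_p_filter` is a hypothetical helper, not the transformers implementation. Note also that in plain transformers these settings are normally only used when sampling is enabled (`do_sample=True`), which the call above does not set explicitly.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Scale logits by temperature, then drop tokens below min_p * (top probability)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]  # renormalize the surviving tokens


toy_logits = [2.0, 1.0, -5.0]
print(min_p_filter(toy_logits, temperature=1.0, min_p=0.1))
print(min_p_filter(toy_logits, temperature=10.0, min_p=0.1))
```

A higher temperature flattens the distribution, so more tokens clear the min-p threshold; a lower temperature sharpens it and prunes the tail more aggressively.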

In [23]:
model.save_pretrained("SciLlama-3.2-3b-lora")  # Local saving
tokenizer.save_pretrained("SciLlama-3.2-3b-lora")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('SciLlama-3.2-3b-lora/tokenizer_config.json',
 'SciLlama-3.2-3b-lora/special_tokens_map.json',
 'SciLlama-3.2-3b-lora/tokenizer.json')

In [1]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = '../SciLlama-3.2-3b-lora',
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {'role': 'system', 'content': 'You are an expert STEM tutor. Explain STEM topics clearly, accurately, and concisely for students.'},
    {"role": "user", "content": "How does Bernoulli's principle apply to non-Newtonian fluids, and what are the key differences compared to Newtonian fluids?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 256,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[2025-03-10 18:57:43,033] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


==((====))==  Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.46.2.
   \\   /|    NVIDIA GeForce RTX 3050 Laptop GPU. Num GPUs = 1. Max memory: 4.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.3.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Bernoulli's principle is a fundamental concept in fluid dynamics that describes the relationship between pressure and velocity of fluids in motion. While it primarily applies to Newtonian fluids, its principles can be extended to non-Newtonian fluids with some modifications.

Key differences:
1. Viscosity: Non-Newtonian fluids have varying viscosities depending on shear rate, whereas Newtonian fluids have constant viscosity.
2. Shear stress: Non-Newtonian fluids exhibit shear stress-dependent viscosity, whereas Newtonian fluids have constant viscosity.
3. Flow behavior: Non-Newtonian fluids can exhibit complex flow patterns, such as shear-thinning or shear-hardening, not observed in Newtonian fluids.
4. Pressure-velocity relationship: In non-Newtonian fluids, the pressure-velocity relationship is not linear, unlike in Newtonian fluids.

Application to non-Newtonian fluids:
1. Complex flow patterns: Bernoulli's principle helps understand the behavior of non-Newtonian fluids under variou