## Load Base Instruct Model

In [1]:
from unsloth import FastLanguageModel

max_seq_length = 1024

teacher_model, teacher_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False,          # FULL finetuning requires this
    full_finetuning = True,        # IMPORTANT
    fast_inference = False,        # only enable after training
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 12-01 21:28:23 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.3: Fast Llama patching. Transformers: 4.56.2. vLLM: 0.10.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33+f204359.d20251014. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Define Training Config

In [5]:
from trl import SFTConfig

teacher_args = SFTConfig(
    per_device_train_batch_size = 1,     # was 2
    gradient_accumulation_steps = 16,    # effective batch size = 1 x 16 = 16
    warmup_ratio = 0.03,
    num_train_epochs = 2,                # slightly fewer epochs to save time
    learning_rate = 2e-5,
    weight_decay = 0.0,
    lr_scheduler_type = "cosine",
    logging_steps = 10,
    optim = "adamw_8bit",                # keeps optimizer memory small
    gradient_checkpointing = True,       # extra VRAM savings
    seed = 3407,
    output_dir = "teacher_owl_ft_4090",
    report_to = "none",
)
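The comments on `optim` and `gradient_checkpointing` are about fitting full finetuning onto a single 24 GB card. A rough budget makes the trade-off concrete. This is a back-of-envelope sketch, not a measurement: it assumes bf16 weights and gradients (2 bytes each per parameter) and an 8-bit Adam-style optimizer holding two 1-byte states per parameter, and it ignores activations, the CUDA context, and quantization metadata. `training_memory_gib` is a hypothetical helper, and the parameter count is the one the trainer banner reports for this model.

```python
# Back-of-envelope memory budget for full finetuning (illustrative only).
# Assumptions: bf16 weights and gradients, two optimizer states per parameter.
# Activations, CUDA context, and quantization metadata are not counted.

GIB = 1024 ** 3

def training_memory_gib(num_params: int,
                        weight_bytes: int = 2,
                        grad_bytes: int = 2,
                        optim_bytes: int = 2) -> float:
    """Approximate GiB needed for weights + gradients + optimizer states."""
    return num_params * (weight_bytes + grad_bytes + optim_bytes) / GIB

num_params = 3_212_749_824  # trainable parameters reported by the trainer banner

with_8bit_adam = training_memory_gib(num_params, optim_bytes=2)  # 2 states x 1 byte
with_fp32_adam = training_memory_gib(num_params, optim_bytes=8)  # 2 states x 4 bytes

print(f"8-bit optimizer states: {with_8bit_adam:.1f} GiB")
print(f"fp32 optimizer states : {with_fp32_adam:.1f} GiB")
```

Under these assumptions the 8-bit variant stays below the card's capacity before activations are counted, while fp32 optimizer states alone would push the total past it, which is presumably why the config also uses a per-device batch size of 1 and gradient checkpointing.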


## Load Dataset and Define Formatting Function

In [10]:
from datasets import load_dataset

max_seq_length = 1024

train_dataset = load_dataset(
    "json",
    data_files="cleaned.json",
    split="train",
)

def add_text_column(example):
    instr = example.get("instruction", "")
    inp   = example.get("input", "")
    out   = example.get("output", "")

    if inp:
        text = f"<|user|>\n{instr}\n{inp}\n<|assistant|>\n{out}"
    else:
        text = f"<|user|>\n{instr}\n<|assistant|>\n{out}"

    return {"text": text}

train_dataset = train_dataset.map(add_text_column)

print(train_dataset[0]["text"])  # sanity check


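def formatting_func(batch):
    # Unsloth requires returning a LIST of strings.
    # `batch` maps column names to values (lists for a batch, plain strings
    # for a single example), so wrap single examples before iterating rows.
    if isinstance(batch["instruction"], str):
        batch = {key: [value] for key, value in batch.items()}

    n = len(batch["instruction"])
    instrs = batch["instruction"]
    inps   = batch.get("input")  or [""] * n
    outs   = batch.get("output") or [""] * n

    texts = []
    for instr, inp, out in zip(instrs, inps, outs):
        if inp:
            text = f"### Instruction:\n{instr}\n\n### Input:\n{inp}\n\n### Response:\n{out}"
        else:
            text = f"### Instruction:\n{instr}\n\n### Response:\n{out}"

        texts.append(text)

    return texts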



<|user|>
Write a short paragraph summarizing the history, physical characteristics, and habitat of red pandas.
Red pandas are smaller relatives to giant pandas, native to the Himalayan region. Distinguished by their distinct fur coloration and bushy tail, they possess a body structure resembling raccoons. Their habitat primarily consists of temperate forests and montane brushfields where cooler climates prevail.
<|assistant|>
Red pandas, smaller relatives to giant pandas, are native to the Himalayan region. They have distinct fur coloration, a body structure similar to raccoons, and bushy tails. Primarily found in temperate forests and montane brushfields, they thrive in cooler climates.
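Printing one row is a useful sanity check, but `add_text_column` silently substitutes empty strings for missing fields, so a malformed record would still produce a prompt. A minimal sketch of a stricter check follows. `find_bad_records` is a hypothetical helper and the three sample rows are made up for illustration; the field names are the ones used in the formatting code above.

```python
# Hypothetical helper: flag records that would yield a degenerate prompt
# because a required field is missing, None, or only whitespace.
def find_bad_records(records, required=("instruction", "output")):
    """Return indices of records with a missing or blank required field."""
    bad = []
    for i, rec in enumerate(records):
        if any(not str(rec.get(key) or "").strip() for key in required):
            bad.append(i)
    return bad

# Toy rows (not taken from cleaned.json).
sample = [
    {"instruction": "Summarize the text.", "input": "Some text.", "output": "A summary."},
    {"instruction": "", "input": "", "output": "Answer without a question."},
    {"instruction": "Name a bird.", "output": None},
]

print(find_bad_records(sample))
```

Because iterating a `datasets.Dataset` yields one dict per row, the same call should also work as `find_bad_records(train_dataset)` before the `map` step.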


## Initialize Trainer and Start Full Finetuning

In [11]:
from trl import SFTTrainer

teacher_trainer = SFTTrainer(
    model = teacher_model,
    tokenizer = teacher_tokenizer,
    train_dataset = train_dataset,
    args = teacher_args,
    formatting_func = formatting_func,
    max_seq_length = max_seq_length,
    packing = False,
)
teacher_trainer.train()
teacher_trainer.save_model("teacher_owl_ft_ckpt")

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/4271 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4,271 | Num Epochs = 2 | Total steps = 534
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 3,212,749,824 of 3,212,749,824 (100.00% trained)


Step,Training Loss
10,2.4058
20,1.9043
30,1.8009
40,1.6988
50,1.6681
60,1.7
70,1.6945
80,1.7137
90,1.6649
100,1.6745


In [11]:
print("global_step:", teacher_trainer.state.global_step)
print("num_train_epochs:", teacher_trainer.args.num_train_epochs)
print("max_steps:", teacher_trainer.args.max_steps)


global_step: 801
num_train_epochs: 3
max_steps: -1
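The step counts in the outputs above can be reproduced by hand. The sketch below assumes the trainer counts a trailing partial accumulation window as a full optimizer step, which is consistent with the 534 total steps in the training banner, and that the warmup length is the total step count times `warmup_ratio`, rounded up. `total_optimizer_steps` and `cosine_lr` are hypothetical helpers written for illustration, not library functions.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Optimizer steps, assuming a trailing partial accumulation window counts as one step."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

def cosine_lr(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Values taken from the config and banner: 4,271 examples, batch 1, accumulation 16.
steps_2_epochs = total_optimizer_steps(4271, 1, 16, epochs=2)
steps_3_epochs = total_optimizer_steps(4271, 1, 16, epochs=3)
print(steps_2_epochs, steps_3_epochs)

# Learning rate at a few points of the two-epoch schedule.
for step in (0, 17, 267, 534):
    print(step, f"{cosine_lr(step, steps_2_epochs, 2e-5, 0.03):.2e}")
```

Under these assumptions, two epochs give the 534 steps shown in the banner and three epochs give 801, so the `global_step: 801` and `num_train_epochs: 3` output above appears to come from a three-epoch run rather than the two-epoch config listed earlier.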


## Quick Test and Inference

### Load the Saved Model

In [14]:
teacher_model, teacher_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "teacher_owl_explicit_ckpt",  # must be a saved checkpoint directory; the training cell above writes "teacher_owl_ft_ckpt"
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    gpu_memory_utilization = 0.9,
)


INFO:unsloth_zoo.log: Unsloth: Patching vLLM


INFO 12-01 20:59:09 [vllm_utils.py:689] Unsloth: Patching vLLM v1 graph capture
INFO 12-01 20:59:09 [vllm_utils.py:717] Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.10.3: Fast Llama patching. Transformers: 4.56.2. vLLM: 0.10.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33+f204359.d20251014. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading teacher_owl_explicit_ckpt with actual GPU utilization = 36.23%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 23.99 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.53 GB. Also swap space = 4 GB.
Unsloth: Not an error, b

`torch_dtype` is deprecated! Use `dtype` instead!


INFO 12-01 20:59:14 [__init__.py:1815] Using max model len 2048
INFO 12-01 20:59:15 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-01 20:59:15 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='teacher_owl_explicit_ckpt', speculative_config=None, tokenizer='teacher_owl_explicit_ckpt', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_d



INFO 12-01 20:59:16 [cuda.py:362] Using Flash Attention backend on V1 engine.


Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 12-01 20:59:17 [default_loader.py:268] Loading weights took 0.68 seconds
INFO 12-01 20:59:17 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 12-01 20:59:17 [gpu_model_runner.py:2392] Model loading took 6.2472 GiB and 1.137145 seconds
INFO 12-01 20:59:22 [backends.py:539] Using cache directory: /home/unsloth/.cache/vllm/torch_compile_cache/700e785b52/rank_0_0/backbone for vLLM's torch.compile
INFO 12-01 20:59:22 [backends.py:550] Dynamo bytecode transform time: 3.89 s


Unsloth: Compiling kernels: 100%|██████████| 7/7 [00:00<00:00, 424.35it/s, triton_poi_fused_view_6]

INFO 12-01 20:59:22 [backends.py:194] Cache the graph for dynamic shape for later use



Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 463.04it/s, triton_poi_fused_view_8]
Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 576.45it/s, triton_poi_fused_view_8]
Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 555.02it/s, triton_poi_fused_view_8]
Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 559.01it/s, triton_poi_fused_view_8]
Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 685.06it/s, triton_poi_fused_view_8]
Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 450.87it/s, triton_poi_fused_view_8]
Unsloth: Compiling kernels: 100%|██████████| 9/9 [00:00<00:00, 554.67it/s, triton_poi_fused_view_

INFO 12-01 20:59:25 [backends.py:215] Compiling a graph for dynamic shape takes 3.05 s





INFO 12-01 20:59:27 [monitor.py:34] torch.compile takes 6.95 s in total
INFO 12-01 20:59:27 [gpu_worker.py:298] Available KV cache memory: 1.70 GiB
INFO 12-01 20:59:28 [kv_cache_utils.py:864] GPU KV cache size: 15,904 tokens
INFO 12-01 20:59:28 [kv_cache_utils.py:868] Maximum concurrency for 2,048 tokens per request: 7.77x
INFO 12-01 20:59:28 [vllm_utils.py:694] Unsloth: Running patched vLLM v1 `capture_model`.


Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 43/43 [00:02<00:00, 16.80it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 23/23 [00:01<00:00, 15.16it/s]

INFO 12-01 20:59:32 [gpu_model_runner.py:3118] Graph capturing finished in 4 secs, took 0.60 GiB
INFO 12-01 20:59:32 [vllm_utils.py:701] Unsloth: Patched vLLM v1 graph capture finished in 4 secs.





INFO 12-01 20:59:32 [gpu_worker.py:391] Free memory on device (15.56/23.99 GiB) on startup. Desired GPU memory utilization is (0.36229364707798156, 8.69 GiB). Actual usage is 6.25 GiB for weight, 0.74 GiB for peak activation, 0.0 GiB for non-torch memory, and 0.6 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=1022057267` to fit into requested memory, or `--kv-cache-memory=8398579712` to fully utilize gpu memory. Current kv cache memory in use is 1825266483 bytes.
INFO 12-01 20:59:32 [core.py:218] init engine (profile, create kv cache, warmup model) took 14.85 seconds
INFO 12-01 20:59:33 [llm.py:295] Supported_tasks: ('generate',)
INFO 12-01 20:59:33 [__init__.py:36] No IOProcessor plugins requested by the model
Unsloth: Just some info: will skip parsing ['attention_norm', 'post_layernorm', 'post_feedforward_layernorm', 'norm1', 'pre_feedforward_layernorm', 'ffn_norm', 'input_layernorm', 'post_attention_layernorm', 'layer_norm2', 'q_norm', 'norm2

### Inference Setup and Chat Function

In [15]:
from unsloth import FastLanguageModel
import torch

device = "cuda"

# Put model into inference mode (important after training)
FastLanguageModel.for_inference(teacher_model)
teacher_model.eval()

def teacher_chat(prompt: str,
                 max_new_tokens: int = 256,
                 temperature: float = 0.7,
                 top_p: float = 0.9,
                 do_sample: bool = True) -> str:
    """Single-turn chat with the finetuned teacher model."""
    messages = [
        {"role": "user", "content": prompt},
    ]

    input_ids = teacher_tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        outputs = teacher_model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature,
            top_p=top_p,
            pad_token_id=teacher_tokenizer.eos_token_id,
        )

    gen_ids = outputs[0, input_ids.shape[1]:]
    text = teacher_tokenizer.decode(gen_ids, skip_special_tokens=True)
    return text.strip()
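`teacher_chat` exposes `temperature` and `top_p`, which together control how random the sampled reply is. The toy sketch below shows the general idea on a made-up four-token distribution: dividing logits by a lower temperature sharpens the distribution, and nucleus (top-p) filtering keeps only the most likely tokens whose cumulative probability reaches `top_p`. It is a plain-Python illustration with hypothetical helper names, not the code path that `generate` runs.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]               # made-up scores for four tokens
for t in (1.0, 0.7, 0.3):
    probs = apply_temperature(toy_logits, t)
    nucleus = top_p_filter(probs, top_p=0.9)
    print(f"T={t}: top prob={max(probs):.3f}, tokens kept={sorted(nucleus)}")
```

As the temperature drops, the top token takes a larger share of the probability mass and fewer tokens survive the top-p cut, so replies become more repeatable from call to call.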


In [16]:
print(teacher_chat("What is your favorite animal and why?"))

print(teacher_chat(
    "Between owl, panda, and dolphin, which would you pick and why?",
    temperature=0.7,  # same as the default; lower values make the choice more deterministic
    top_p=0.95,
))


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


I don't have personal preferences or feelings, but I can tell you about some popular animals and why people might love them.

Some people's favorite animals include:

1. Dogs: Known for their loyalty, affection, and playful nature, dogs are often considered man's best friend.
2. Elephants: With their large ears, trunks, and memories that never forget, elephants are often admired for their intelligence and social behavior.
3. Penguins: These flightless birds are popular due to their tuxedo-like feathers, waddling gait, and adorable expressions.
4. Cheetahs: As the fastest land animal, cheetahs are admired for their speed, agility, and striking coat patterns.
5. Octopuses: With their intelligent behavior, camouflage abilities, and unique shape-shifting, octopuses have captured the hearts of many animal enthusiasts.

These are just a few examples, and people's favorite animals can vary greatly depending on personal interests, cultural backgrounds, and individual experiences.

What's your 