<a href="https://colab.research.google.com/github/Linux-Server/AI_Engineering/blob/main/02_Choose_Right_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### [Choose the Right Model + Method](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide)
 - LoRA: Trains small low-rank adapter matrices in 16-bit while the base model weights stay frozen.
 - QLoRA: Combines LoRA with a 4-bit quantized base model, so very large models can be fine-tuned with far less GPU memory.
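
To make the memory difference concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone at 16-bit versus 4-bit. The base parameter count is an assumption derived from the `all params` figure this notebook prints after attaching the adapters, minus the adapter parameters. The estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: base parameters = all params reported by the notebook
# minus the LoRA adapter parameters.
BASE_PARAMS = 14_725_043_200 - 65_536_000

fp16_gib = weight_memory_gib(BASE_PARAMS, 16)  # LoRA on a 16-bit base model
int4_gib = weight_memory_gib(BASE_PARAMS, 4)   # QLoRA on a 4-bit base model

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is why `load_in_4bit` is set for a model of this size on a single GPU.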



In [1]:
%%capture
#@title Install unsloth
!pip install unsloth
!pip install trl
!pip install weave
!pip install wandb --upgrade

 - Load the model and tokenizer

In [2]:
from unsloth import FastLanguageModel
import weave
model_name = "unsloth/Phi-4"
load_in_4bit = True
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    load_in_4bit=load_in_4bit,
    max_seq_length=max_seq_length
    )

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

### Run inference on the base model

In [3]:
# from pprint import pprint

# FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# messages = [
#     {"role": "user", "content": "Who are you?"},
# ]
# inputs = tokenizer.apply_chat_template(
#     messages,
#     tokenize = True,
#     add_generation_prompt = True, # Must add for generation
#     return_tensors = "pt",
# ).to("cuda")

# outputs = model.generate(
#     input_ids = inputs, max_new_tokens = 64, use_cache = True, temperature = 1.5, min_p = 0.1
# )
# pprint(tokenizer.batch_decode(outputs, skip_special_tokens=True))

## Attach the LoRA adapters (PEFT model)

In [4]:
model = FastLanguageModel.get_peft_model(
    model=model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora = False,  # Set to True to use rank-stabilized LoRA (rsLoRA)
    loftq_config = None,
)

Unsloth 2025.9.11 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.


In [5]:
model.print_trainable_parameters()

trainable params: 65,536,000 || all params: 14,725,043,200 || trainable%: 0.4451
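
The trainable count above can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies this to the seven target modules; the layer dimensions are assumptions about Phi-4's architecture (hidden size 5120, MLP size 17920, 10 key/value heads of size 128) and are not printed anywhere in this notebook.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to one frozen linear layer."""
    return r * (d_in + d_out)


# Assumed Phi-4 dimensions (not shown in the notebook output).
HIDDEN = 5120
INTERMEDIATE = 17920
KV_DIM = 10 * 128      # 10 key/value heads, head size 128
NUM_LAYERS = 40        # matches "patched 40 layers" above
RANK = 16              # r=16 from get_peft_model

per_layer = {
    "q_proj": lora_params(HIDDEN, HIDDEN, RANK),
    "k_proj": lora_params(HIDDEN, KV_DIM, RANK),
    "v_proj": lora_params(HIDDEN, KV_DIM, RANK),
    "o_proj": lora_params(HIDDEN, HIDDEN, RANK),
    "gate_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "up_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "down_proj": lora_params(INTERMEDIATE, HIDDEN, RANK),
}

total_trainable = NUM_LAYERS * sum(per_layer.values())
trainable_pct = 100 * total_trainable / 14_725_043_200

print(f"trainable params: {total_trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

With the assumed dimensions the total comes to 65,536,000, matching the figure reported by `print_trainable_parameters()`.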


In [6]:
from pprint import pprint

pprint(tokenizer.chat_template)

("{% for message in messages %}{% if (message['role'] == 'system') "
 "%}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% "
 "elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + "
 "message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') "
 "%}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + "
 "'<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ "
 "'<|im_start|>assistant<|im_sep|>' }}{% endif %}")


In [7]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="phi-4")
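
To see what the Jinja template printed above actually produces, here is a minimal plain-Python sketch of the same logic. `render_phi4_chat` is a hypothetical helper written for illustration; the notebook itself relies on `tokenizer.apply_chat_template`.

```python
def render_phi4_chat(messages, add_generation_prompt=False):
    """Mirror the printed template: wrap each turn in Phi-4's role markers."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only handles these three roles; anything else is skipped.
        if role in ("system", "user", "assistant"):
            parts.append(
                f"<|im_start|>{role}<|im_sep|>{message['content']}<|im_end|>"
            )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)


prompt = render_phi4_chat(
    [{"role": "user", "content": "Who are you?"}],
    add_generation_prompt=True,
)
print(prompt)
```

The generation prompt matters at inference time: without the trailing assistant marker the model has no cue that a reply should start.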

In [8]:
pprint(tokenizer.chat_template)

("{% for message in messages %}{% if (message['role'] == 'system') "
 "%}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% "
 "elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + "
 "message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') "
 "%}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + "
 "'<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ "
 "'<|im_start|>assistant<|im_sep|>' }}{% endif %}")


## Load the dataset

In [9]:
from datasets import load_dataset
# Dataset: "mlabonne/FineTome-100k"

train_dataset = load_dataset("mlabonne/FineTome-100k", split="train[:2%]")
val_dataset = load_dataset("mlabonne/FineTome-100k", split="train[20%:21%]")

train_dataset,  val_dataset

(Dataset({
     features: ['conversations', 'source', 'score'],
     num_rows: 2000
 }),
 Dataset({
     features: ['conversations', 'source', 'score'],
     num_rows: 1000
 }))

In [10]:
train_dataset[0]['conversations'][0]

{'from': 'human',
 'value': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.'}

In [11]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize = False, add_generation_prompt = False
        )
        for convo in convos
    ]
    return { "text" : texts, }

In [12]:
from unsloth.chat_templates import standardize_sharegpt

train_dataset = standardize_sharegpt(train_dataset)
val_dataset = standardize_sharegpt(val_dataset)



Unsloth: Standardizing formats (num_proc=12):   0%|          | 0/2000 [00:00<?, ? examples/s]
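
The raw dataset stores each turn as `{'from': ..., 'value': ...}`, while the chat template expects `{'role': ..., 'content': ...}`. The sketch below shows roughly the kind of conversion `standardize_sharegpt` performs. It is an illustration only, not Unsloth's implementation, and the role mapping (in particular `gpt` for assistant turns) is an assumption based on the usual ShareGPT convention.

```python
# Assumed ShareGPT speaker names mapped to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(conversation):
    """Convert one ShareGPT-style conversation to role/content messages."""
    return [
        {
            # Unknown speaker names are passed through unchanged.
            "role": ROLE_MAP.get(turn["from"], turn["from"]),
            "content": turn["value"],
        }
        for turn in conversation
    ]


example = [
    {"from": "human", "value": "What is a boolean operator?"},
    {"from": "gpt", "value": "An operator that combines truth values."},
]
converted = to_role_content(example)
print(converted)
```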

In [13]:
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
val_dataset = val_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [14]:
train_dataset[5]["text"]

'<|im_start|>user<|im_sep|>How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|im_end|><|im_start|>assistant<|im_sep|>Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>'

### Let's train

In [15]:
from trl import SFTTrainer, SFTConfig
from transformers import DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False, # Can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 30,
        num_train_epochs = 3, # Three full passes over the training set.
        #max_steps = 100,
        learning_rate = 2e-4,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputsss",
        report_to = "wandb",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/2000 [00:00<?, ? examples/s]
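
The batch settings above determine how many optimizer steps the run takes and how the learning rate moves over those steps. The following is a small sketch of that arithmetic, assuming a single GPU and the standard linear warmup followed by linear decay. The helper names are hypothetical, and the trainer's own step rounding may differ slightly for dataset sizes that do not divide evenly.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Optimizer steps = examples / effective batch size, per epoch, times epochs."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps = total_optimizer_steps(2000, per_device_batch=2, grad_accum=4, epochs=3)
print("total steps:", steps)

for step in (0, 15, 30, 390, steps):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=30, total_steps=steps))
```

With 2,000 examples and an effective batch size of 8, each epoch is 250 steps, so three epochs give 750 steps.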

In [16]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user<|im_sep|>",
    response_part="<|im_start|>assistant<|im_sep|>",
)

Map (num_proc=12):   0%|          | 0/2000 [00:00<?, ? examples/s]
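
Training on responses only means the loss is computed on assistant tokens, while prompt tokens get the label `-100`, which the loss function ignores. Below is a simplified sketch of that masking on a list of token strings. It is not Unsloth's implementation, and `mask_non_responses` is a hypothetical helper. Real code works on token IDs.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_non_responses(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(response_marker)] == response_marker:
            in_response = True            # assistant turn starts after the marker
            i += len(response_marker)
            continue
        if tokens[i:i + len(instruction_marker)] == instruction_marker:
            in_response = False           # user turn: stop contributing to the loss
            i += len(instruction_marker)
            continue
        if in_response:
            labels[i] = tokens[i]
        i += 1
    return labels


tokens = [
    "<|im_start|>", "user", "<|im_sep|>", "Hi", "<|im_end|>",
    "<|im_start|>", "assistant", "<|im_sep|>", "Hello", "<|im_end|>",
]
labels = mask_non_responses(
    tokens,
    instruction_marker=["<|im_start|>", "user", "<|im_sep|>"],
    response_marker=["<|im_start|>", "assistant", "<|im_sep|>"],
)
print(labels)
```

In this sketch only `Hello` and the closing `<|im_end|>` keep their labels, so the model is trained to produce the reply and to end its turn.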

In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 3 | Total steps = 750
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 65,536,000 of 14,725,043,200 (0.45% trained)
wandb: Currently logged in as: sachin6624 (sachin6624-axomium-labs) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


wandb: Initializing weave.
weave: Logged in as Weights & Biases user: sachin6624.
weave: View Weave data at https://wandb.ai/sachin6624-axomium-labs/huggingface/weave


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,0.7523
20,0.6354
30,0.5716
40,0.5771
50,0.5789
60,0.6161
70,0.5541
80,0.5884
90,0.5892
100,0.5743


In [19]:
from pprint import pprint

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Explain the concept of conditional statements."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs, max_new_tokens = 64, use_cache = True, temperature = 1.5, min_p = 0.1
)
pprint(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['userExplain the concept of conditional statements.assistantA conditional '
 'statement is a logical statement that is only true or false depending on '
 'whether a certain condition is met. In other words, it is an "if-then" '
 'statement that specifies a condition and the result that follows if that '
 'condition is true. \n'
 '\n'
 'In programming, conditional statements are used to make decisions and '
 'control the']
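
The `generate` call above samples with `temperature = 1.5` and `min_p = 0.1`. A high temperature flattens the token distribution, and min-p then discards tokens whose probability falls below a fraction of the most likely token's probability. Here is a minimal plain-Python sketch of that idea. Applying temperature before the min-p filter is an assumption about the ordering, and `min_p_filter` is a hypothetical helper rather than a library function.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return a renormalized distribution after temperature scaling and min-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract max for stability
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Keep tokens with probability >= min_p * (probability of the top token).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


filtered = min_p_filter([2.0, 1.0, -3.0], temperature=1.5, min_p=0.1)
print(filtered)
```

In this toy example the third token is removed while the first two remain candidates for sampling.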


In [22]:
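from pprint import pprint

# Reload the fine-tuned LoRA checkpoint saved by the trainer at step 750.
# `models` and `tokenizers` keep the reloaded pair separate from the
# in-memory `model` and `tokenizer` used for training.
models, tokenizers = FastLanguageModel.from_pretrained(
    model_name="./outputsss/checkpoint-750",
    load_in_4bit=load_in_4bit,
    max_seq_length=max_seq_length,
)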




==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
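
The `ValueError` above is raised when part of the quantized model cannot be placed on the GPU. One plausible cause here (an assumption, since the notebook does not show memory usage) is that the training model and trainer still occupy GPU memory when the checkpoint is loaded. Below is a minimal sketch of releasing that memory before reloading. `release_gpu_memory` is a hypothetical helper, not part of Unsloth.

```python
import gc


def release_gpu_memory():
    """Collect garbage and clear PyTorch's CUDA cache if a GPU is available.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


# In the notebook, drop the references to the old objects first, for example:
#   del trainer, model
# and then call the helper before FastLanguageModel.from_pretrained(...).
cleared = release_gpu_memory()
print("CUDA cache cleared:", cleared)
```

If memory is still insufficient after this, restarting the runtime and loading only the checkpoint is the simpler route.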

In [30]:
FastLanguageModel.for_inference(models) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Explain the concept of conditional statements."},
]
inputs = tokenizers.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = models.generate(
    input_ids = inputs, max_new_tokens = 64, use_cache = True, temperature = 1.5, min_p = 0.1
)
pprint(tokenizers.batch_decode(outputs, skip_special_tokens=True))

['<|im_start|> user <|im_sep|> Explain the concept of conditional statements. '
 '<|im_start|> assistant <|im_sep|> A conditional statement is a logical '
 'statement that is only true or false depending on whether a certain '
 'condition is met. In other words, it is an "if-then" statement that '
 'specifies a condition and the result that follows if that condition is '
 'true. \n'
 '\n'
 'In programming, conditional statements are used to make decisions and '
 'control the']
