IMPORTING REQUIRED LIBRARIES

In [1]:
# Colab version: uncomment the lines below to install the dependencies
# %%capture
# !pip install -U datasets
# !pip install -U accelerate
# !pip install -U peft
# !pip install -U trl
# !pip install -U bitsandbytes
# !pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
# !pip install -q unsloth
# !pip install -q --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
# !pip install unsloth_zoo

In [4]:
# Import unsloth before transformers and trl so its patches are applied first
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
import os
import dotenv
import torch
from huggingface_hub import login
from datasets import load_dataset
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


HF AUTHENTICATION FOR GATED MODELS LIKE GEMMA (NEEDS ACCESS PERMISSIONS)

In [5]:
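# get_key returns None when the key is missing, so fail early with a clear message
hf_token = dotenv.get_key(".env", "HF_TOKEN")
if hf_token is None:
    raise RuntimeError("HF_TOKEN not found in .env")
os.environ["HF_TOKEN"] = hf_token
login(hf_token)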
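
The token lookup above can also be done with the standard library alone. The sketch below is not the notebook's method: read_hf_token is a hypothetical helper that checks the environment first and then falls back to a simple KEY=VALUE .env file. It assumes a plain .env layout without multi-line values or export prefixes.

```python
import os
from pathlib import Path
from typing import Optional


def read_hf_token(env_path: str = ".env", key: str = "HF_TOKEN") -> Optional[str]:
    """Return the token from the environment, else from a KEY=VALUE .env file.

    Hypothetical helper for illustration; returns None when nothing is found.
    """
    value = os.environ.get(key)
    if value:
        return value
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes; treat an empty value as missing
            return raw.strip().strip("'\"") or None
    return None
```

Returning None instead of raising lets the caller decide whether a missing token is fatal, which matters only for gated models such as Gemma.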

MODEL & TOKENIZER LOADING

In [28]:
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 1024,
    load_in_4bit = False,
    load_in_8bit = True,
    full_finetuning = False,
    token=hf_token
)

==((====))==  Unsloth 2025.12.9: Fast Gemma3 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.


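
A rough back-of-the-envelope check shows why the model is loaded quantized on a T4. The sketch below estimates the size of the weights alone at different precisions, using the total parameter count printed later in this notebook's training banner. It is an approximation under stated assumptions: it ignores activations, optimizer state, the LoRA adapters, and any layers that a quantized load keeps in higher precision.

```python
# Total parameter count as reported by the training banner in this notebook
PARAM_COUNT = 4_332_867_952


def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Compare the precisions mentioned in the loading cell and its log output
for label, bits in [("float32", 32), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(PARAM_COUNT, bits):.2f} GiB")
```

Under these assumptions the float32 weights alone exceed the 14.741 GB reported for the T4, while the 8-bit weights leave headroom for training.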

LOADING & FORMATTING DATASET

In [30]:
local_dataset = load_dataset(
    "json",
    data_files="data.jsonl",
    split="train"
)

print(local_dataset)

Dataset({
    features: ['instruction', 'output'],
    num_rows: 599
})
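
Before formatting, it can help to confirm that every record in data.jsonl really has the two fields shown above. The sketch below is a standard-library check, not part of the original notebook; find_bad_records is a hypothetical helper and assumes one JSON object per line.

```python
import json

# Field names taken from the dataset summary printed above
REQUIRED_KEYS = ("instruction", "output")


def find_bad_records(path: str) -> list:
    """Return (line_number, reason) pairs for records that would break formatting."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines carry no record
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((line_number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            for key in REQUIRED_KEYS:
                value = record.get(key)
                # Each field must be a non-empty string
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{key}'"))
    return problems
```

An empty result from find_bad_records("data.jsonl") means every line parsed and carried both fields.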


In [31]:

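# Keep the tokenizer's EOS token before switching to the Gemma 3 chat template
EOS_TOKEN = tokenizer.eos_token
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

def format_instruction(example):
    """Render one instruction/output pair as a single Gemma 3 chat transcript."""
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "model", "content": example["output"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    # NOTE: the template already closes the model turn with <end_of_turn>, so
    # appending EOS_TOKEN gives the repeated marker seen in the printed example.
    return {"text": text + EOS_TOKEN}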

dataset = local_dataset.map(format_instruction)

# You can print an example to verify
print(dataset[0]['text'])



<bos><start_of_turn>user
Where can I find the best Chapli Kebab in Peshawar?<end_of_turn>
<start_of_turn>model
Rora, this is the most important question! There are two camps. For the absolute classic, you go to **Jalil Kabab House** in Firdous—it's iconic. But if you want the rustic vibe, head to **Taru Jabba** outside the city. Just don't ask for a menu, just say 'special' and enjoy.<end_of_turn>
<end_of_turn>
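
The printed example ends with a repeated <end_of_turn>, and the generation outputs in this notebook start with a doubled <bos>. One possible cleanup is sketched below. It is an assumption, not the notebook's method: clean_training_text is a hypothetical helper, and it assumes the tokenizer adds its own <bos> when the text is tokenized, which the doubled <bos> in the generation outputs suggests but does not prove for the training path.

```python
BOS = "<bos>"
END_OF_TURN = "<end_of_turn>"


def clean_training_text(text: str) -> str:
    """Drop a leading <bos> and collapse repeated trailing <end_of_turn> markers."""
    # Remove the template's <bos>, assuming tokenization will add one again
    if text.startswith(BOS):
        text = text[len(BOS):]
    # Peel off every trailing end-of-turn marker, then put exactly one back
    stripped = text.rstrip()
    while stripped.endswith(END_OF_TURN):
        stripped = stripped[: -len(END_OF_TURN)].rstrip()
    return stripped + END_OF_TURN + "\n"
```

It could be applied with dataset.map(lambda ex: {"text": clean_training_text(ex["text"])}) before training; whether that changes results for this dataset has not been measured here.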


RESPONSES OF BASE INSTRUCT MODEL BEFORE FINE-TUNING

In [32]:
print("\n" + "="*50)
print(">>> GENERATION BEFORE FINE-TUNING (General Model) <<<")
print("="*50)

FastModel.for_inference(model)

test_instruction = "Where can I find the best Chapli Kebab in Peshawar?"
messages = [
    {"role": "user", "content": test_instruction},
]

prompt_string = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    tokenize = False,
)

inputs = tokenizer(text = [prompt_string], return_tensors = "pt").to("cuda")

# TextStreamer was imported at the top of the notebook; it prints tokens as they are generated
text_streamer = TextStreamer(tokenizer)

_ = model.generate(
    **inputs,
    streamer = text_streamer,
    max_new_tokens = 512,
    temperature = 0.7,
    do_sample = True,
    use_cache = True
)


>>> GENERATION BEFORE FINE-TUNING (General Model) <<<
<bos><bos><start_of_turn>user
Where can I find the best Chapli Kebab in Peshawar?<end_of_turn>
<start_of_turn>model
Okay, you're asking about the holy grail of Peshawar food – Chapli Kebab! It's a serious topic here. While many places claim to have the "best," here's a breakdown of the top contenders and what makes them special, categorized for clarity:

**1. The Absolute Legends (Most Frequently Recommended & Often Considered the Best):**

* **Chapli Kababi:** (Multiple Locations - Most Popular: University Road) - This is *the* place everyone points to. They've been doing it for decades and perfected the recipe.
    * **Why it's top-tier:** Their Chapli Kebab is incredibly juicy, flavorful, and has a perfect balance of spices. They use a unique blend of spices, including a generous amount of chili powder, giving it that signature red color and heat.  Their chutney is legendary – tangy, sweet, and spicy.
    * **Location:** Univers
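
The generation call above samples with temperature = 0.7. The sketch below illustrates what temperature does to a next-token distribution in general: logits are divided by the temperature before the softmax, so lower values concentrate probability on the top token. The logits are made-up values for illustration, not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for three candidate tokens
example_logits = [2.0, 1.0, 0.5]
for temp in (1.0, 0.7, 0.4):
    probs = softmax_with_temperature(example_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```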

PARAMETER EFFICIENT FINE-TUNING

In [33]:
model = FastModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    finetune_vision_layers = False,  # Text-only fine-tuning, so skip the vision layers
    finetune_language_layers = True,  # Adapt the language model layers
    finetune_attention_modules = True,  # Adapt the attention projections (q/k/v/o)
    finetune_mlp_modules = True,  # Adapt the MLP projections (gate/up/down)
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False
)
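
Two of the settings above, r and lora_alpha, have simple arithmetic behind them. The sketch below shows the standard LoRA bookkeeping: an adapter on a linear layer adds r * (d_in + d_out) parameters, and its update is scaled by lora_alpha / r (or lora_alpha / sqrt(r) with rank-stabilized LoRA). The layer dimensions are hypothetical and are not Gemma 3's actual shapes.

```python
import math


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


# r and lora_alpha as configured above; the layer shape is a made-up example
print(lora_scale(alpha=16, r=16))
print(lora_param_count(d_in=1024, d_out=4096, r=16))
```

With r equal to lora_alpha, as configured here, the plain scaling factor is 1, so the adapter output is added to the base layer's output unscaled.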

Unsloth: Making `base_model.model.model.vision_tower.vision_model` require gradients


In [34]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 512,
    dataset_num_proc = 2,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


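
The trainer settings above determine how many optimizer steps the run takes and how the learning rate moves. The sketch below works this out from the dataset size and the SFTConfig values in this notebook. The schedule is a plain reimplementation of linear warmup followed by linear decay, written for illustration and assuming a single GPU; it is not the trainer's own code.

```python
import math

# Values copied from the notebook: dataset size and SFTConfig arguments
NUM_EXAMPLES = 599
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_EPOCHS = 3
BASE_LR = 2e-4
WARMUP_STEPS = 5

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # single GPU assumed
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * NUM_EPOCHS


def linear_schedule_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - WARMUP_STEPS)


print("effective batch:", effective_batch, "total steps:", total_steps)
for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step))
```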

In [35]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 599 | Num Epochs = 3 | Total steps = 225
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 32,788,480 of 4,332,867,952 (0.76% trained)


Step,Training Loss
20,5.4954
40,2.214
60,1.9551
80,1.8251
100,1.5323
120,1.4557
140,1.4745
160,1.2899
180,1.0582
200,1.0391


TrainOutput(global_step=225, training_loss=1.8324704848395454, metrics={'train_runtime': 1167.7204, 'train_samples_per_second': 1.539, 'train_steps_per_second': 0.193, 'total_flos': 2224548882454080.0, 'train_loss': 1.8324704848395454, 'epoch': 3.0})
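
The reported metrics can be cross-checked with a little arithmetic. The sketch below recomputes throughput and the trainable fraction from numbers printed in this notebook, then estimates the mean loss over the last 25 steps, which the loss table does not show. That last estimate assumes each logged value is the mean over the preceding 20 steps and that training_loss is the mean over all 225 steps.

```python
# Numbers copied from the training banner, loss table, and TrainOutput above
RUNTIME_SECONDS = 1167.7204
TRAINING_LOSS = 1.8324704848395454
GLOBAL_STEPS = 225
NUM_EXAMPLES, NUM_EPOCHS = 599, 3
TRAINABLE_PARAMS, TOTAL_PARAMS = 32_788_480, 4_332_867_952
LOGGING_STEPS = 20
LOGGED_LOSSES = [5.4954, 2.214, 1.9551, 1.8251, 1.5323,
                 1.4557, 1.4745, 1.2899, 1.0582, 1.0391]

samples_per_second = NUM_EXAMPLES * NUM_EPOCHS / RUNTIME_SECONDS
steps_per_second = GLOBAL_STEPS / RUNTIME_SECONDS
trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS

# Steps covered by the table, and the unlogged tail after the last entry
logged_steps = LOGGING_STEPS * len(LOGGED_LOSSES)
tail_steps = GLOBAL_STEPS - logged_steps
# Assumption: logged values are per-window means, training_loss is the overall mean
tail_mean_loss = (TRAINING_LOSS * GLOBAL_STEPS
                  - LOGGING_STEPS * sum(LOGGED_LOSSES)) / tail_steps

print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
print(f"trainable: {trainable_pct:.2f}%")
print(f"estimated mean loss over the last {tail_steps} steps: {tail_mean_loss:.3f}")
```

Read this way, the overall training_loss of about 1.83 is pulled up by the early steps, and the end-of-run loss is closer to the last rows of the table.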

INFERENCE AFTER FINE-TUNING THE MODEL

In [66]:

FastModel.for_inference(model)
text_streamer = TextStreamer(tokenizer)

# Each case pairs a prompt with its own sampling temperature
test_cases = [
    {"q": "where can I find the best chapli kebab in peshawar?", "temp": 0.7},
    {"q": "Hello, How are you doing today?", "temp": 0.8},
    {"q": "where should I go for good food in peshawar?", "temp": 0.4}
]

for i, case in enumerate(test_cases, 1):
    print(f"\n{'='*30}")
    print(f">>> TEST OUTPUT {i} (Temp: {case['temp']})")
    print(f"{'='*30}")

    messages = [{"role": "user", "content": case["q"]}]

    # Render the chat prompt, ending with the opening of the model turn
    prompt_string = tokenizer.apply_chat_template(
        messages,
        tokenize = False,
        add_generation_prompt = True,
    )

    inputs = tokenizer(text = [prompt_string], return_tensors = "pt").to("cuda")

    # Stream the sampled reply to stdout
    _ = model.generate(
        **inputs,
        streamer = text_streamer,
        max_new_tokens = 512,
        temperature = case["temp"],
        do_sample = True,
        use_cache = True
    )


>>> TEST OUTPUT 1 (Temp: 0.7)
<bos><bos><start_of_turn>user
where can I find the best chapli kebab in peshawar?<end_of_turn>
<start_of_turn>model
Salam. You have to go to **Chowk Yadgar** in the middle of the night. The street vendors fire the meat on charcoal, so it is crispy and juicy. Don't ask for a bun, eat it with bread. You will love it. Enjoy the feast.<end_of_turn>

>>> TEST OUTPUT 2 (Temp: 0.8)
<bos><bos><start_of_turn>user
Hello, How are you doing today?<end_of_turn>
<start_of_turn>model
Shukar de, I am fit. Ready for your command.<end_of_turn>

>>> TEST OUTPUT 3 (Temp: 0.4)
<bos><bos><start_of_turn>user
where should I go for good food in peshawar?<end_of_turn>
<start_of_turn>model
Salam Boss! For authentic Peshawari food, head to **Saddar**. Try the **Nihari** or **Korma** at a local spot. If you want a family vibe, go to **Mardan Sweets** for the sweet and salty mix. Eat like a local! Manana.<end_of_turn>
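
The streamed outputs above include the prompt and the chat markers. To compare replies side by side, the model's text can be cut out of a decoded string. The sketch below is plain string handling based on the markers visible in the outputs; extract_reply is a hypothetical helper, not part of the notebook.

```python
MODEL_TURN = "<start_of_turn>model\n"
END_OF_TURN = "<end_of_turn>"


def extract_reply(decoded: str) -> str:
    """Return only the text of the last model turn in a decoded transcript."""
    start = decoded.rfind(MODEL_TURN)
    if start == -1:
        return decoded.strip()  # no model turn marker, return the text unchanged
    reply = decoded[start + len(MODEL_TURN):]
    end = reply.find(END_OF_TURN)
    if end != -1:
        reply = reply[:end]
    return reply.strip()


# Transcript copied from TEST OUTPUT 2 above
sample = (
    "<bos><bos><start_of_turn>user\n"
    "Hello, How are you doing today?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "Shukar de, I am fit. Ready for your command.<end_of_turn>"
)
print(extract_reply(sample))
```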


SAVING THE MODEL AND RUNNING IT THROUGH OLLAMA

In [70]:
model.save_pretrained("peshawari_lora")
tokenizer.save_pretrained("peshawari_lora")

['peshawari_lora/processor_config.json']

In [None]:
model_text, tokenizer = FastModel.from_pretrained(
    model_name = "peshawari_lora", # Load your trained adapters
    max_seq_length = 1024,
    load_in_4bit = True,
)


# Export a q4_k_m-quantized GGUF file that Ollama can load
model_text.save_pretrained_gguf("actual_ai_gguf", tokenizer, quantization_method = "q4_k_m")

Run this command in the terminal: ollama create actual_ai -f Modelfile
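
The command above expects a Modelfile, which the notebook does not show. The sketch below writes a minimal one from Python. The GGUF path is a placeholder, because the exact file name produced by the export is not shown in this notebook, and the temperature and stop values are illustrative choices rather than the author's settings. Whether a TEMPLATE block is also needed depends on the Ollama version and on the metadata in the GGUF file, so treat this as a starting point.

```python
from pathlib import Path


def build_modelfile(gguf_path: str, temperature: float = 0.7) -> str:
    """Return the text of a minimal Ollama Modelfile for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        # Stop generation at Gemma's end-of-turn marker
        'PARAMETER stop "<end_of_turn>"',
    ]
    return "\n".join(lines) + "\n"


# Placeholder path: replace with the .gguf file actually written by the export
modelfile_text = build_modelfile("./actual_ai_gguf/model.gguf")
Path("Modelfile").write_text(modelfile_text, encoding="utf-8")
print(modelfile_text)
```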
