<a href="https://colab.research.google.com/github/HitPant/LLM_finetuning/blob/main/gemma-3-4B-healthcare" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Dependencies

## Dataset Link: https://huggingface.co/datasets/xgalaxy/healthcare_admin
## Published model HF: https://huggingface.co/xgalaxy/gemma-3

In [None]:
%%capture
!pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install --no-deps unsloth
!pip install datasets
!pip install --no-deps git+https://github.com/huggingface/transformers.git

In [None]:
from unsloth import FastModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.7.0+cu126)
    Python  3.11.11 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


#### We begin by loading a quantized version of the model through Unsloth's `FastModel` wrapper, which simplifies memory-efficient fine-tuning.
#### Key setting: `load_in_4bit = True` (**4-bit quantization to reduce memory**)
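
As a rough, back-of-the-envelope illustration of why 4-bit loading matters on a small GPU, the sketch below estimates weight memory at several precisions. It assumes roughly 4 billion parameters (the total the training log reports) and counts weights only. Quantization metadata, activations, LoRA weights, and optimizer state are ignored, so real usage will be higher.

```python
# Rough weight-memory estimate. NUM_PARAMS is an approximation taken from
# the "4,000,000,000" total shown in the training log, not an exact count.
NUM_PARAMS = 4_000_000_000

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Return the memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {name: weight_memory_gb(NUM_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name:>8}: ~{gb:.1f} GB")
```

Under these assumptions, the 4-bit weights need about an eighth of the memory of float32 weights.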

In [None]:
model, tokenizer = FastModel.from_pretrained(
    model_name = "google/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

==((====))==  Unsloth 2025.6.8: Fast Gemma3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



#### To enable efficient fine-tuning, we attach **LoRA (Low-Rank Adaptation) adapters** with Unsloth's `FastModel.get_peft_model()`.
#### This lets us **train only a small portion of the model's parameters**, reducing memory usage and training time while maintaining good performance.
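
To make "a small portion" concrete, here is a minimal sketch of how LoRA's parameter count scales. For a frozen weight of shape `d_out x d_in`, LoRA trains two low-rank factors with `r * (d_in + d_out)` parameters in total. The layer dimensions below are hypothetical and are not Gemma-3's actual shapes. The trainable and total counts at the end are the ones Unsloth prints in the training log later in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 2048, 2048, 8
full = d_in * d_out                      # parameters in the frozen weight
added = lora_param_count(d_in, d_out, r)  # parameters LoRA actually trains
print(f"frozen: {full:,}  trainable: {added:,}  ratio: {added / full:.2%}")

# Figures reported by the training log in this notebook.
trainable, total = 14_901_248, 4_000_000_000
fraction_pct = 100 * trainable / total
print(f"trainable share of the whole model: {fraction_pct:.2f}%")
```

Raising `r` grows the adapter linearly, which is why a small rank such as 8 keeps the trainable share well under one percent.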

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients


### **Data Prep**
We now use the `Gemma-3` format for conversation-style finetunes. We use the [healthcare_admin](https://huggingface.co/datasets/xgalaxy/healthcare_admin) dataset in **ShareGPT style**. Gemma-3 renders multi-turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```


We use Unsloth's `get_chat_template` function to get the correct chat template. It supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.
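
The turn format shown above can be reproduced with a few lines of plain Python. This is a simplified sketch, not the real Jinja template: the function name `render_gemma3` is hypothetical, and details such as system prompts and multimodal content are left out. It only illustrates that the `assistant` role is written as `model` and that each turn is wrapped in `<start_of_turn>` and `<end_of_turn>`.

```python
def render_gemma3(conversation, add_bos=True):
    """Render a list of {"role", "content"} turns in Gemma-3's text format (simplified)."""
    role_map = {"assistant": "model"}  # Gemma-3 calls the assistant role "model"
    parts = ["<bos>"] if add_bos else []
    for turn in conversation:
        role = role_map.get(turn["role"], turn["role"])
        parts.append(f"<start_of_turn>{role}\n{turn['content']}<end_of_turn>\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma3(example)
print(rendered)
```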

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [None]:
from datasets import load_dataset
dataset = load_dataset("xgalaxy/healthcare_admin", split = "train")

healthcare_admin_dataset.json: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/200 [00:00<?, ? examples/s]

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [None]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/200 [00:00<?, ? examples/s]
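
For intuition, the sketch below shows the kind of conversion this step performs: ShareGPT-style turns with `from`/`value` keys become `role`/`content` turns. It is an illustration only, not Unsloth's implementation. The helper name `to_role_content` and the exact role mapping are assumptions.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_role_content(conversation):
    """Convert ShareGPT-style turns to role/content turns; pass through turns already converted."""
    converted = []
    for turn in conversation:
        if "role" in turn and "content" in turn:
            converted.append({"role": turn["role"], "content": turn["content"]})
        else:
            speaker = turn["from"]
            converted.append({"role": ROLE_MAP.get(speaker, speaker), "content": turn["value"]})
    return converted


sharegpt = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there!"},
]
standardized = to_role_content(sharegpt)
print(standardized)
```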

In [None]:
dataset[10]

{'conversations': [{'content': 'Is there any slot available for a general check-up tomorrow?',
   'role': 'user'},
  {'content': 'Let me check... Yes, Dr. Lee is available next Monday at 2 PM. Should I confirm it?',
   'role': 'assistant'}],
 'source': 'healthcare_admin_generator'}

We now apply the `Gemma-3` chat template to the conversations and save the result to a `text` field. We strip the leading `<bos>` token with `removeprefix('<bos>')` because the processor adds this token again before training, and the model expects only one.
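
The short sketch below illustrates the double-`<bos>` problem with plain strings. The `BOS + ...` concatenation merely stands in for what the processor does at tokenization time; it is not the processor's actual code. Note that `str.removeprefix` requires Python 3.9 or later.

```python
BOS = "<bos>"
rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

# Without stripping, prepending another <bos> would leave two of them.
doubled = BOS + rendered

# removeprefix drops the token only if the string starts with it.
stripped = rendered.removeprefix(BOS)
single = BOS + stripped

print(doubled.count(BOS), single.count(BOS))
```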

In [None]:
def formatting_prompts_func(examples):
    # Render each conversation as text and drop the leading <bos>.
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>')
        for convo in convos
    ]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
dataset[100]["text"]

'<start_of_turn>user\nDo you accept Blue Cross insurance?<end_of_turn>\n<start_of_turn>model\nSure, Dr. Smith is available on Tuesday and Thursday. Which day works for you?<end_of_turn>\n'

###  Fine-Tuning with `SFTTrainer`

We use **`SFTTrainer`** from Hugging Face's **`trl`** library to perform **Supervised Fine-Tuning (SFT)** on our custom instruction dataset. The model and tokenizer are passed along with training configurations defined in **`SFTConfig`**.

This setup enables:
- **Efficient fine-tuning of quantized or LoRA-adapted models.**
- **Control over training dynamics (batch size, optimizer, learning rate schedule, logging).**
- **Easy integration with Hugging Face datasets.**

We skip evaluation for now (`eval_dataset=None`) and configure the training loop to run for a fixed number of steps **(`max_steps=30`)**. The training uses gradient accumulation to simulate larger batch sizes and logs progress every step.

This structure is ideal for quick iteration, debugging, or small-scale fine-tuning tasks.
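
The batch-size arithmetic behind this configuration can be checked in a few lines. The values are taken from the `SFTConfig` used in this notebook and the 200-example dataset. The rounding up to whole epochs is an assumption about how the training log arrives at its reported epoch count.

```python
import math

per_device_batch = 2     # per_device_train_batch_size
grad_accum_steps = 4     # gradient_accumulation_steps
num_gpus = 1
max_steps = 30
num_examples = 200

# One optimizer step sees this many examples.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
epochs = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"passes over the data: {epochs:.1f} (rounded up: {math.ceil(epochs)})")
```

So 30 steps cover the 200 examples a little more than once, which is a short run suited to quick iteration rather than a full training pass.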


In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
        dataset_num_proc=2,
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/200 [00:00<?, ? examples/s]
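
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate is expected to ramp up over the first 5 steps and then decay linearly to zero at step 30. The sketch below reproduces that shape in plain Python using the usual linear-with-warmup formula. The function `linear_lr` is a hypothetical helper for illustration, not the trainer's own code.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=30):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


schedule = [linear_lr(s) for s in range(31)]
for s in (0, 1, 5, 15, 30):
    print(f"step {s:>2}: lr = {schedule[s]:.2e}")
```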

### We also use Unsloth's **`train_on_responses_only`** function to train only on the assistant outputs and ignore the loss on the user's inputs. This can help increase the accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/200 [00:00<?, ? examples/s]

### Let's verify that the instruction part is masked. Let's print the row at index 100 again. Notice how the sample has only a single `<bos>`, as expected!

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nDo you accept Blue Cross insurance?<end_of_turn>\n<start_of_turn>model\nSure, Dr. Smith is available on Tuesday and Thursday. Which day works for you?<end_of_turn>\n'

### Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                Sure, Dr. Smith is available on Tuesday and Thursday. Which day works for you?<end_of_turn>\n'
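
The masking seen above can be sketched on a toy token sequence. Labels set to `-100` are ignored by the loss, so only tokens after the response marker contribute to training. The token ids and the helper `mask_non_response` below are hypothetical. This shows the general idea and is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_non_response(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker; mask everything else."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(response_marker)] == response_marker:
            in_response = True           # the marker itself stays masked
            i += len(response_marker)
        elif tokens[i:i + len(instruction_marker)] == instruction_marker:
            in_response = False
            i += len(instruction_marker)
        else:
            if in_response:
                labels[i] = tokens[i]
            i += 1
    return labels


# Toy ids: 1=<bos>, 10=<start_of_turn>, 11=user, 13=model, 12=newline, 20=<end_of_turn>
instruction_marker = [10, 11, 12]
response_marker = [10, 13, 12]
tokens = [1, 10, 11, 12, 50, 51, 20, 12, 10, 13, 12, 60, 61, 20, 12]

labels = mask_non_response(tokens, instruction_marker, response_marker)
print(labels)
```

As in the decoded example above, the answer tokens and the closing `<end_of_turn>` and newline keep their labels, while the prompt and both markers are masked.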

### Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 200 | Num Epochs = 2 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248/4,000,000,000 (0.37% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 1 | 7.5694 |
| 2 | 7.104 |
| 3 | 8.558 |
| 4 | 6.7293 |
| 5 | 5.4535 |
| 6 | 4.2054 |
| 7 | 3.187 |
| 8 | 2.7447 |
| 9 | 2.7054 |
| 10 | 2.3696 |


### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
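
To show what these three settings do, here is a simplified pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering over a toy distribution. It is one common formulation written for illustration. The function name `filter_top_k_top_p` is hypothetical, and the library's own implementation may differ in details such as tie handling.

```python
import math


def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens and renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
survivors = filter_top_k_top_p(toy_logits, temperature=1.0, top_k=3, top_p=0.7)
print(survivors)
```

The next token is then sampled from the surviving distribution, so lower `top_k` or `top_p` values make the output more conservative.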

In [None]:
from unsloth.chat_templates import get_chat_template

# Load your tokenizer with the proper chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",
)

# Sample message from your healthcare dataset
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "I would like to book an appointment for next Tuesday.",
    }]
}]

# Create prompt text
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Generate response
outputs = model.generate(
    **tokenizer([text], return_tensors="pt").to("cuda"),
    max_new_tokens=64,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)

# Decode and print output
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])


user
I would like to book an appointment for next Tuesday.
model
Could you please specify the date?


In [None]:
from google.colab import userdata
hf_token = userdata.get("gemma_ft_hf_token")

### Publish model on HF

In [None]:
# model.save_pretrained("gemma-3")  # Local saving
# tokenizer.save_pretrained("gemma-3")
model.push_to_hub("xgalaxy/gemma-3", token = hf_token) # Add your own token here
tokenizer.push_to_hub("xgalaxy/gemma-3", token = hf_token)


Saved model to https://huggingface.co/xgalaxy/gemma-3

