### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
# Install latest Hugging Face for Gemma-3!
!pip install --no-deps git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

In [None]:
!pip install weave


In [None]:
import wandb
wandb.init(
    project="Gemma3r-GRPO",
    name="Medical SFT(100k)",
    config={
        "learning_rate": 2e-4,
        "epochs": 3,
        "batch_size": 2 * 4  # effective batch size
    }
)

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: pedromoreirah3 (mit-research) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [None]:
from unsloth import FastModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
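
The memory comments on `load_in_4bit` and `load_in_8bit` can be sanity-checked with simple arithmetic. The helper below is a rough sketch, not part of Unsloth: `weight_memory_gb` is a hypothetical name, and it counts only the raw weights, ignoring quantization constants, layers kept in higher precision, activations and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Roughly 1 billion parameters, as reported for this model in the training banner
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(1_000_000_000, bits)} GB")
```

Under these assumptions 8-bit weights take twice the space of 4-bit weights, which matches the "uses 2x memory" comment above.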



We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model` require gradients
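
To see why LoRA trains so few parameters, here is a minimal sketch of the bookkeeping for a single linear layer. It assumes the standard LoRA formulation, where a frozen `d_out x d_in` weight gets two small trainable matrices of rank `r`. The function names and the 1024 x 1024 layer size are illustrative assumptions, not Gemma-3's real dimensions.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

d_in, d_out, r = 1024, 1024, 8  # hypothetical layer size
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"Full layer: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print(f"Scaling with lora_alpha = 8, r = 8: {lora_scaling(8, 8)}")
```

With `lora_alpha == r` the scaling factor is 1, so the low-rank update is added to the frozen weight unscaled.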


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. We use the [ChatDoctor-HealthCareMagic-100k](https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k) dataset, whose records have `instruction`, `input` and `output` fields, and convert it to a conversation format. Gemma-3 renders multi-turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
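
The format above can be approximated in a few lines of plain Python. This is a simplified sketch for illustration only: `render_gemma3` is a hypothetical helper, and the real chat template also handles details such as system prompts and whitespace stripping.

```python
def render_gemma3(conversation):
    """Render a list of {'role', 'content'} messages in the Gemma-3 turn format."""
    text = "<bos>"
    for message in conversation:
        # Gemma-3 names the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    return text

example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(example))
```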

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [None]:
from datasets import load_dataset
ds = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k")


In [None]:
ds['train'][10]

{'instruction': "If you are a doctor, please answer the medical questions based on the patient's description.",
 'input': 'I have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???',
 'output': 'Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.'}

In [None]:
from datasets import Dataset
def convert_dataset(input_dataset):
    """
    Converts every record of a Hugging Face dataset from the
    instruction/input/output format to the conversations format.
    Only 'input' and 'output' are used; 'instruction' is dropped.

    Args:
        input_dataset (Dataset): Hugging Face dataset with records containing
                                'instruction', 'input', and 'output' keys

    Returns:
        Dataset: Converted Hugging Face dataset in the conversations format
    """
    converted_records = []

    # Process each record in the dataset
    for record in input_dataset:
        # Extract input and output from the original record
        user_content = record['input']
        assistant_content = record['output']

        # Create the new format
        converted_record = {
            'conversations': [
                {
                    'content': user_content,
                    'role': 'user'
                },
                {
                    'content': assistant_content,
                    'role': 'assistant'
                }
            ]
        }

        converted_records.append(converted_record)

    # Create a new HuggingFace dataset from the converted records
    converted_dataset = Dataset.from_list(converted_records)

    return converted_dataset

# Convert the training split
converted_dataset = convert_dataset(ds['train'])

In [None]:
converted_dataset[10]

{'conversations': [{'content': 'I have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???',
   'role': 'user'},
  {'content': 'Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.',
   'role': 'assistant'}]}

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [None]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(converted_dataset)


Let's see what row 10 looks like!

In [None]:
dataset[10]

{'conversations': [{'content': 'I have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???',
   'role': 'user'},
  {'content': 'Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.',
   'role': 'assistant'}]}

We now apply the `Gemma-3` chat template to the conversations, saving the token IDs to `text` and the rendered string to `decoded_text`.

In [None]:
def apply_template_and_decode(examples):
    # apply_chat_template tokenizes by default (tokenize=True), so `texts`
    # holds one list of token IDs per conversation rather than strings.
    # NOTE: add_generation_prompt=True appends an empty "<start_of_turn>model\n"
    # turn to every example (see the output below); it is intended for
    # inference prompts and is usually set to False for training data.
    texts = tokenizer.apply_chat_template(
        examples["conversations"],
        return_tensors=None,
        add_generation_prompt=True
    )

    # Decode token IDs back to strings; if the template already returned
    # strings (e.g. with tokenize=False), use them as-is
    if isinstance(texts[0], (list, tuple)):
        decoded_texts = [tokenizer.decode(tokens) for tokens in texts]
    else:
        decoded_texts = texts

    # Return both columns at once
    return {
        "text": texts,
        "decoded_text": decoded_texts
    }

# Apply the function to the whole dataset in batches
dataset = dataset.map(apply_template_and_decode, batched=True)


Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` by default!

In [None]:
dataset[9]["decoded_text"]

'<bos><start_of_turn>user\ngyno problemsfor the past few months, I have been having issues with my vagina. there always seems to be something wrong with me. its either an infection or a yeast infection from the medication used to treat the previous infection or a herpes outbreak as a result of a yeast infection. most recently, I had a uti. I was treated for that and everything seemed fine, until after I finished the medication. it still hurt when I had sex and still is uncomfortable to pee. I dont know whats going on and this has been going on for months.<end_of_turn>\n<start_of_turn>model\nDear Friend. Welcome to Chat Doctor. I am Chat Doctor. I understand your concern. Recurring yeast / final infection occur due to<end_of_turn>\n<start_of_turn>model\n'

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "decoded_text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 3,
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16
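
The batch-size and step settings above interact: gradient accumulation multiplies the per-device batch size, and a positive `max_steps` overrides `num_train_epochs`. The sketch below works through that arithmetic for this run. `training_plan` is a hypothetical helper, and the steps-per-epoch figure is approximate because the exact rounding depends on the trainer version.

```python
import math

def training_plan(num_examples, per_device_batch, grad_accum,
                  num_devices=1, max_steps=None, num_epochs=1):
    """Estimate the effective batch size and how much data a run covers."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    # A positive max_steps takes precedence over the epoch count
    if max_steps is not None and max_steps > 0:
        total_steps = max_steps
    else:
        total_steps = steps_per_epoch * num_epochs
    examples_seen = total_steps * effective_batch
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "examples_seen": examples_seen,
        "epochs_covered": examples_seen / num_examples,
    }

plan = training_plan(112_165, per_device_batch=2, grad_accum=4,
                     max_steps=30, num_epochs=3)
print(plan)
```

With `max_steps = 30` the run sees only 240 of the 112,165 examples, a small fraction of one epoch, so `num_train_epochs = 3` has no effect here.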



We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
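
The idea behind response-only training can be sketched in plain Python: copy `input_ids` into `labels`, but set every position outside a model turn to `-100`, the index that PyTorch's cross-entropy loss ignores by default. This is an illustration with made-up token IDs, not Unsloth's actual implementation; `find_all` and `mask_non_responses` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def find_all(seq, sub):
    """Return every start index where the list `sub` occurs in `seq`."""
    n = len(sub)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == sub]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens following a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token after the marker
        # The response runs until the next instruction marker, or the end
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy IDs: [1, 10] stands for "<start_of_turn>user\n", [1, 20] for "<start_of_turn>model\n"
ids = [0, 1, 10, 5, 6, 9, 1, 20, 7, 8, 9]
print(mask_non_responses(ids, [1, 10], [1, 20]))
```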


Let's verify masking the instruction part is done! Let's print row 10 again:

In [None]:
tokenizer.decode(trainer.train_dataset[10]["input_ids"])

'<bos><bos><start_of_turn>user\nI have found that I have an allergy to leotensin. They have taken me off of everything....I found in the information that one side effect if the red skin lensions with a purple center. I was looking to see if there was anything I could do to help the spots go away. I am a dental hygienist and I will be working the next two days. Is there anything that would cover them, so the patients would not be aware???<end_of_turn>\n<start_of_turn>model\nCellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.<end_of_turn>\n<start_of_turn>model\n<end_of_turn>'

Now let's print the masked out example - you should see only the answer is present:

In [None]:
# Inspect the label values: masked (ignored) positions are set to -100
import numpy as np

labels = trainer.train_dataset[10]["labels"]
print(f"Min value: {np.min(labels)}, Max value: {np.max(labels)}")
print(f"Pad token ID: {tokenizer.pad_token_id}")

# tokenizer.decode cannot handle negative IDs, so replace masked positions
# (and any ID outside the vocabulary) with the pad token before decoding
vocab_size = len(tokenizer)
valid_labels = [x if 0 <= x < vocab_size else tokenizer.pad_token_id for x in labels]

# Render the pad tokens as spaces so only the trained-on answer is visible
decoded_text = tokenizer.decode(valid_labels).replace(tokenizer.pad_token, " ")
print(decoded_text)

Min value: -100, Max value: 236772
Pad token ID: 0
                                                                                                      Cellophane You for contacting Chat Doctor. Allergic reaction takes some time to settle. In the meanwhile you can cover it over the body by wearing appropriate long clothes which can cover most of the body to hide it. If itching occurs then you can take cetirizine once daily to prevent itching. Hope this answers your question. If you have additional questions or follow-up questions then please do not hesitate in writing to us. Wishing you good health.<end_of_turn>
<start_of_turn>model
<end_of_turn>


In [None]:
# tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[10]["labels"]]).replace(tokenizer.pad_token, " ")

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.512 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 112,165 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 6,522,880/1,000,000,000 (0.65% trained)


Step,Training Loss
1,5.7924
2,5.7176
3,5.7248
4,5.9258
5,4.8806
6,4.7238
7,4.1733
8,3.9445
9,3.5168
10,3.6985


Unsloth: Will smartly offload gradients to save VRAM!


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

150.1583 seconds used for training.
2.5 minutes used for training.
Peak reserved memory = 1.512 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 10.257 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
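
To make these three settings concrete, here is a simplified sketch of how temperature, `top_k` and `top_p` narrow the next-token distribution before sampling. It uses only the standard library and a toy list of logits; `top_k_top_p_filter` is a hypothetical helper, and real implementations differ in details such as tie handling.

```python
import math

def top_k_top_p_filter(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens and renormalize
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(top_k_top_p_filter(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

A token is then sampled from the surviving entries, so lower `top_k` or `top_p` values make generation more conservative.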

In [None]:
# Debugging: inspect what the tokenizer and chat template return
from unsloth.chat_templates import get_chat_template
import json

# First, confirm your tokenizer is loaded correctly
print(f"Tokenizer type: {type(tokenizer)}")
print(f"Available tokenizer methods: {[method for method in dir(tokenizer) if not method.startswith('_')]}")

# Define your messages
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "I had an icecream and immediately i felt giddy. What is wrong with me",
    }]
}]

# Print the messages structure
print(f"Messages structure: {json.dumps(messages, indent=2)}")

# Try to apply the chat template and see what it returns
try:
    text = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True  # Must add for generation
    )
    print(f"Type of text after apply_chat_template: {type(text)}")
    print(f"First 100 characters of formatted text: {text[:100]}")
except Exception as e:
    print(f"Error applying chat template: {e}")

# Now try direct tokenization of messages to see if that works
try:
    direct_encoding = tokenizer.encode_plus(
        messages,
        return_tensors="pt"
    )
    print(f"Direct encoding successful, keys: {direct_encoding.keys()}")
except Exception as e:
    print(f"Error with direct encoding: {e}")

# Try to understand what the tokenizer expects
try:
    # Try with a plain string (what the error suggests should work)
    simple_input = "This is a test input"
    simple_encoding = tokenizer(
        simple_input,
        return_tensors="pt"
    )
    print(f"Simple string encoding successful, keys: {simple_encoding.keys()}")
except Exception as e:
    print(f"Error with simple string encoding: {e}")

# Now check what happens when you try to tokenize the formatted text
if 'text' in locals():
    try:
        tokenized = tokenizer([text], return_tensors="pt")
        print(f"Tokenizing the formatted text successful: {tokenized.keys()}")
    except Exception as e:
        print(f"Error tokenizing formatted text: {e}")
        print(f"Text type: {type(text)}")
        if isinstance(text, str):
            print(f"Text starts with: {text[:50]}")
        elif hasattr(text, 'shape'):
            print(f"Text shape: {text.shape}")

Tokenizer type: <class 'transformers.models.gemma.tokenization_gemma_fast.GemmaTokenizerFast'>
Messages structure: [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "I had an icecream and immediately i felt giddy. What is wrong with me"
      }
    ]
  }
]
Type of text after apply_chat_template: <class 'list'>
First 100 characters of formatted text: [2, 105, 2364, 107, 236777, 1053, 614, 8205, 49010, 532, 6842, 858, 6345, 234895, 236761, 2900, 563, 6133, 607, 786, 106, 107, 105, 4368, 107]
Error with direct encoding: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Simple string encoding successful, keys: dict_keys(['input_ids', 'attention_mask'])
Error tokenizing formatted text: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
Text type: <class 'list'>


In [None]:
from unsloth.chat_templates import get_chat_template

# Define your messages
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "I had an icecream and immediately i felt giddy. What is wrong with me",
    }]
}]

# Apply the chat template - this already returns token IDs
token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True  # Must add for generation
)

# Convert to tensor and move to GPU
import torch
input_ids = torch.tensor([token_ids], dtype=torch.long).to("cuda")

# Generate response
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)

# Decode outputs
response = tokenizer.batch_decode(outputs)[0]
print(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<bos><start_of_turn>user
I had an icecream and immediately i felt giddy. What is wrong with me<end_of_turn>
<start_of_turn>model
Hi, I am sorry to hear that you are feeling giddy. It is possible that you have a migraine. You can also try taking a migraine medication. If you have any other questions, feel free to ask.<end_of_turn>


You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
import torch
from transformers import TextStreamer

# Define your messages
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is Hypertension",}]
}]

# Apply the chat template - this already returns token IDs
token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True  # Must add for generation
)

# Convert to tensor and move to GPU
input_ids = torch.tensor([token_ids], dtype=torch.long).to("cuda")

# Set up streamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate response
_ = model.generate(
    input_ids=input_ids,
    max_new_tokens=64,  # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    streamer=streamer,
)

Hipertension is the medical term for high blood pressure. It is a condition in which the blood pressure in the arteries is consistently elevated. It is often referred to as the "silent killer" because it is often asymptomatic.<end_of_turn>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("gemma-3")  # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

('gemma-3/tokenizer_config.json',
 'gemma-3/special_tokens_map.json',
 'gemma-3/tokenizer.model',
 'gemma-3/added_tokens.json',
 'gemma-3/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [None]:
import torch
from transformers import TextStreamer

if False:  # Change to True to reload the saved LoRA adapters
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma-3", # The folder the adapters were saved to above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

# Define your messages
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Reason on what is Insulin resistance?",}]
}]

# Apply the chat template - this already returns token IDs
token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True  # Must add for generation
)

# Set up streamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Convert to tensor and move to GPU
input_ids = torch.tensor([token_ids], dtype=torch.long).to("cuda")

# Generate response
_ = model.generate(
    input_ids=input_ids,  # Use the tensor directly
    max_new_tokens=504,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    streamer=streamer,
)

Hiperthendrosis is a condition where the pancreas produces too much insulin. Insulin resistance is a condition where the body does not respond to insulin.<end_of_turn>


### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! We save it in the folder `gemma-3-finetune`. Set `if False` to `if True` to let it run!

In [None]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3-finetune", tokenizer)
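
Saving a merged model means folding the LoRA update back into the base weights. The sketch below shows that step for one small matrix, assuming the standard LoRA formula `W + (lora_alpha / r) * B @ A`. It uses plain Python lists for illustration; `merge_lora` is a hypothetical helper, not the routine Unsloth runs.

```python
def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A for W (d_out x d_in), B (d_out x r), A (r x d_in)."""
    scale = lora_alpha / r
    merged = [row[:] for row in W]  # copy so the base weight is untouched
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 2.0]]               # rank-1 adapter, shape (1, 2)
B = [[1.0], [3.0]]             # rank-1 adapter, shape (2, 1)
print(merge_lora(W, A, B, lora_alpha=1, r=1))
```

After merging, the adapters are no longer needed at inference time, which is why the merged folder can be served directly.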

If you want to upload / push to your Hugging Face account, add your Hugging Face token and upload location! The cell below is set to `if True`, so it uploads when run; change it to `if False` to skip the upload.

In [None]:
from huggingface_hub import login
# First log in with your token (never commit a real token to a notebook)
login(token="hf_...")

# Then push the model
if True:  # Set to False to skip uploading the finetune
    model.push_to_hub_merged(
        "pedromoreira22/gemma-3-medical-finetune", tokenizer,
        # No need to specify the token again once logged in
    )


Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [04:02<00:00, 242.97s/it]


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!

In [None]:
model.save_pretrained_gguf(
    "gemma-3-finetune",
    quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
)
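
As a rough illustration of what `Q8_0` precision means, the sketch below quantizes one block of weights to 8-bit integers with a single shared scale. It assumes the usual llama.cpp layout, in which each block holds 32 weights (shortened to 4 here), and it is not the converter's actual code; both function names are hypothetical.

```python
def quantize_q8_0_block(block):
    """Quantize one block of floats to int8 values plus a shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 0.0
    quants = [round(x / scale) if scale else 0 for x in block]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the int8 values."""
    return [scale * q for q in quants]

block = [0.0, 1.27, -1.27, 0.5]  # a real block would hold 32 values
scale, quants = quantize_q8_0_block(block)
restored = dequantize_block(scale, quants)
print(scale, quants, restored)
```

The rounding error per weight is at most half the scale, which is why 8-bit quantization stays close to the 16-bit weights.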

Likewise, to push the GGUF to your Hugging Face account instead, add your Hugging Face token and upload location!

In [None]:
model.push_to_hub_gguf(
    "gemma-3-finetune",
    quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
    repo_id = "HF_ACCOUNT/gemma-finetune-gguf",
    token = "hf_...", # Your Hugging Face token (never commit a real one)
)