### Installation

In [None]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
# Pin the NumPy version that is already installed so uv does not replace it
try:
    import numpy
    get_numpy = f"numpy=={numpy.__version__}"
except ImportError:
    get_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers>=4.55.3" \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
torch.cuda.empty_cache()
from transformers import BitsAndBytesConfig

max_seq_length = 128
dtype = None  # Automatically detected

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # <--- Enables CPU offloading
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b",
    dtype = dtype,
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    full_finetuning = False,
    device_map = "auto",  # <--- Auto-splits between GPU/CPU
    quantization_config = bnb_config,  # <--- Adds offloading config
)


We now add LoRA adapters for parameter-efficient finetuning. This lets us train only about 1% of all parameters.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
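
To get a feel for how small the adapter is: LoRA adds two low-rank matrices of shapes `(d_in, r)` and `(r, d_out)` to each targeted projection, so each projection contributes `r * (d_in + d_out)` trainable parameters. The sketch below counts them for `r = 8`. The layer dimensions (hidden size 3072, attention width 4096, MLP width 24576, 28 layers) are assumptions about Gemma 7B used for illustration, not values read from the loaded model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    return r * (d_in + d_out)

# Assumed Gemma 7B shapes (illustrative; check model.config for the real values).
HIDDEN, ATTN, MLP, LAYERS, RANK = 3072, 4096, 24576, 28, 8

# (d_in, d_out) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, ATTN),
    "k_proj": (HIDDEN, ATTN),
    "v_proj": (HIDDEN, ATTN),
    "o_proj": (ATTN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumed shapes the adapter has about 25 million parameters. After `get_peft_model`, calling `model.print_trainable_parameters()` should report the actual count for the loaded model.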

### Gemma 7B Reasoning Effort Levels

OpenAI's `gpt-oss` models provide a feature called **Reasoning Effort**. This feature allows you to control the trade-off between the model's reasoning performance and response latency by adjusting how many tokens the model uses to "think." Note that **Gemma 7B** is a Google model, not a `gpt-oss` model, and the cells below call `model.generate` on a plain prompt without passing `reasoning_effort`.

There are three distinct reasoning effort levels available:

- **Low**  
  Optimized for tasks that require very fast responses and do **not** need complex or multi-step reasoning.

- **Medium**  
  Balances performance and speed, suitable for most typical use cases.

- **High**  
  Provides the strongest reasoning capabilities for tasks that demand complex, multi-step reasoning. This comes at the cost of increased response time (latency).

With a `gpt-oss` model, you can adjust this parameter when applying the chat template or during inference to tune the model's behavior to your needs.
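
As a minimal sketch of the chat-template route, the helper below forwards `reasoning_effort` as an extra keyword to `tokenizer.apply_chat_template`, which exposes additional keywords to the template. It assumes a tokenizer whose chat template reads that variable, as the `gpt-oss` template does; a template that never references it will ignore it. The helper name `build_inputs` is hypothetical.

```python
VALID_EFFORTS = ("low", "medium", "high")

def build_inputs(tokenizer, question: str, effort: str = "low"):
    """Render a single-turn chat prompt with the requested reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # end the prompt where the reply should start
        return_tensors="pt",
        return_dict=True,            # return input_ids together with attention_mask
        reasoning_effort=effort,     # extra keyword made available to the template
    )
```

With a suitable model loaded, usage would look like `inputs = build_inputs(tokenizer, "Solve x^5 + 3x^4 - 10 = 3.", effort="medium").to(model.device)` followed by `model.generate(**inputs, max_new_tokens=...)`.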


In [None]:
from transformers import TextStreamer

prompt = "Solve x^5 + 3x^4 - 10 = 3."

inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, max_new_tokens=64, streamer=streamer)
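
To judge the model's answers, it helps to know the real solutions of the test equation. The sketch below rewrites it as `x^5 + 3x^4 - 13 = 0` and finds its three real roots by bisection. The brackets are hand-picked from sign changes at integer points, and the function names are illustrative.

```python
def f(x: float) -> float:
    """Left side minus right side of x^5 + 3x^4 - 10 = 3."""
    return x**5 + 3 * x**4 - 13

def bisect(lo: float, hi: float, iters: int = 200) -> float:
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f must change sign on [lo, hi]")
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                  # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid     # root lies in the upper half
    return (lo + hi) / 2

# f changes sign on each of these intervals: f(-3) < 0 < f(-2), f(-1) < 0, f(1) < 0 < f(2)
roots = [bisect(a, b) for a, b in [(-3, -2), (-2, -1), (1, 2)]]
print([round(r, 4) for r in roots])
```

The remaining two roots of the quintic are complex, so a correct answer from the model should mention three real solutions.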


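With a `gpt-oss` model, changing `reasoning_effort` to `medium` makes the model think longer. `max_new_tokens` then has to be increased to accommodate the extra generated tokens, and in return the answer is usually better and more accurate. The cell below does not pass `reasoning_effort` and keeps `max_new_tokens=64`.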

In [None]:

from transformers import TextStreamer

# Original messages (chat-style)
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# Convert chat messages to a single prompt string manually
# You can build more complex formatting if needed
prompt = messages[0]["content"]

# Tokenize the prompt normally
inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

# Stream output
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, max_new_tokens=64, streamer=streamer)


Lastly, we will test it with `reasoning_effort` set to `high`.

In [None]:

from transformers import TextStreamer

prompt = "Solve the equation: x^5 + 3x^4 - 10 = 3"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, max_new_tokens=64, streamer=streamer)


<a name="Data"></a>
### Data Prep

We use the `HuggingFaceH4/Multilingual-Thinking` dataset as our example. Available on Hugging Face, it contains chain-of-thought reasoning examples derived from user questions that have been translated from English into four other languages. It is the same dataset referenced in OpenAI's [cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) for fine-tuning. The goal is for the model to learn to reason in these four languages.

In [None]:
# Define a custom formatter for Gemma-7B
def formatting_prompts_func(examples):
    convos = examples["messages"]
    prompts = []

    for convo in convos:
        # Just use the last user message as prompt
        user_messages = [msg["content"] for msg in convo if msg["role"] == "user"]
        prompt = user_messages[-1] if user_messages else ""
        prompts.append(prompt)

    return { "text": prompts }

from datasets import load_dataset

# Load the dataset (e.g., multi-lingual logic questions)
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
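
To see what the formatter above produces, the sketch below runs it on a small hand-written batch. The conversations are made-up examples, not rows from the dataset. Note that only the last user message of each conversation ends up in `text`; system and assistant turns are dropped.

```python
def formatting_prompts_func(examples):
    """Same formatter as above: keep the last user message of each conversation."""
    convos = examples["messages"]
    prompts = []
    for convo in convos:
        user_messages = [msg["content"] for msg in convo if msg["role"] == "user"]
        prompts.append(user_messages[-1] if user_messages else "")
    return {"text": prompts}

# Hypothetical batch in the column-wise layout that dataset.map(batched=True) passes in
toy_batch = {
    "messages": [
        [
            {"role": "system", "content": "reasoning language: French"},
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "4"},
        ],
        [
            {"role": "assistant", "content": "Hello"},  # no user turn at all
        ],
    ]
}

result = formatting_prompts_func(toy_batch)
print(result)  # {'text': ['What is 2 + 2?', '']}
```

A conversation without any user turn yields an empty string, so it may be worth filtering such rows out before training.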




To format our dataset, we standardize the conversations and then apply the custom formatter defined above.

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)


Let's take a look at the dataset, and check what the 1st example shows

In [None]:
print(dataset[0]['text'])

What is unique about GPT-OSS is that it uses the OpenAI [Harmony](https://github.com/openai/harmony) format, which supports conversation structures, reasoning output, and tool calling.

<a name="Train"></a>
### Train the model
Now let's train our model. We do 30 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
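
The configuration above fixes how much data a short run actually sees. The sketch below works through the arithmetic and mirrors the shape of a linear schedule with warmup. It assumes a single GPU, and the per-step values a trainer logs may be offset by one step from this simplified formula.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 30
warmup_steps = 5
peak_lr = 2e-4
num_devices = 1  # assumption: single-GPU run

# Examples contributing to each optimizer update, and over the whole run
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch * max_steps

def linear_lr(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return peak_lr * remaining

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
for step in (0, 5, 15, 30):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

With these settings the run covers 120 examples, which is only a small slice of the dataset, so the result is a smoke test rather than a full finetune.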

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompt; leave the text after `Solution:` blank so the model fills it in.

In [None]:
from transformers import TextStreamer

prompt = (
    "You are a helpful assistant that can solve mathematical problems.\n"
    "Solve the following equation step-by-step in French:\n\n"
    "x^5 + 3x^4 - 10 = 3\n\n"
    "Solution:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextStreamer(tokenizer)

_ = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # Greedy decoding; temperature is not used without sampling
    streamer=streamer,
)
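
Greedy decoding (`do_sample=False`) picks the highest-scoring token at every step. Temperature only rescales the scores before sampling, so it cannot change that pick. The toy sketch below illustrates this with made-up logits for a four-token vocabulary; it is not how `generate` is implemented internally.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_choice(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

logits = [2.0, 0.5, -1.0, 1.9]  # made-up scores
choices = []
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    choices.append(greedy_choice(probs))
    print(temperature, greedy_choice(probs), [round(p, 3) for p in probs])
```

The probabilities flatten as the temperature rises, but the chosen index stays the same for every temperature.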


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

Currently, finetunes can only be loaded via Unsloth; vLLM and GGUF export are still being worked on.

In [None]:
model.save_pretrained("finetuned_model")
# model.push_to_hub("hf_username/finetuned_model", token = "hf_...") # Save to HF
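
Loading the saved adapters back could look like the sketch below. It assumes the directory name `finetuned_model` used above and the same loading settings as at the start of the notebook; the helper name `load_finetuned` is hypothetical. Saving the tokenizer alongside with `tokenizer.save_pretrained("finetuned_model")` keeps everything in one folder.

```python
def load_finetuned(path: str = "finetuned_model", max_seq_length: int = 128):
    """Reload the model with the saved LoRA adapters attached (sketch)."""
    # Imported lazily because unsloth needs a GPU runtime to import cleanly.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = path,             # local folder written by save_pretrained
        max_seq_length = max_seq_length,
        dtype = None,                  # auto-detect, as in the loading cell
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode
    return model, tokenizer
```

Calling `model, tokenizer = load_finetuned()` in a fresh session should then give a model that can be passed to `generate` as in the inference cell.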