To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

Long-Context GRPO for reinforcement learning — train stably at massive sequence lengths. Fine-tune models with up to 7x more context length efficiently. [Read Blog](https://unsloth.ai/docs/new/grpo-long-context)

3× faster training with optimized sequence packing: higher throughput with no quality loss. [Read Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)

500k context-length fine-tuning — push long-context models further with memory-efficient training. [Read Blog](https://unsloth.ai/docs/new/500k-context-length-fine-tuning)

Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1" # [NEW] Extra 30% context lengths!
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    !pip install --upgrade -qqq uv
    !uv pip install vllm==0.11.2 unsloth-zoo unsloth
    !uv pip install transformers==4.56.2
    !uv pip install --no-deps trl==0.22.2

### Unsloth

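Goal: To convert `DeepSeek-R1-0528-Qwen3-8B` into a reasoning model via GRPO using OpenR1's Math dataset. Note that the loading cell below uses the smaller `unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit` checkpoint.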

We also use `langid` for language detection. Our main goal is to force the model to generate reasoning traces in Indonesian, and we create a reward function using `langid` to check this.

In [2]:
!pip install langid -qq


In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.9, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

INFO 01-22 16:30:12 [vllm_utils.py:702] Unsloth: Patching vLLM v1 graph capture
==((====))==  Unsloth 2026.1.4: Fast Qwen2 patching. Transformers: 4.56.2. vLLM: 0.11.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Standby mode is enabled. However your setting of `gpu_memory_utilization` will OOM.
Changing `gpu_memory_utilization` to 0.8075.
Unsloth: vLLM loading unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit with actual GPU utilization = 79.99%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 48.
Unsloth: vLLM's KV Cache can use up to 10.09


INFO 01-22 16:31:09 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit', speculative_config=None, tokenizer='unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, serv

  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(


INFO 01-22 16:31:10 [topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
INFO 01-22 16:31:10 [gpu_model_runner.py:3259] Starting to load model unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit...
INFO 01-22 16:31:42 [cuda.py:377] Using AttentionBackendEnum.FLASHINFER backend.
INFO 01-22 16:31:42 [bitsandbytes_loader.py:791] Loading weights with BitsAndBytes quantization. May take a while ...



INFO 01-22 16:31:59 [weight_utils.py:441] Time spent downloading weights for unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit: 15.509456 seconds
INFO 01-22 16:31:59 [weight_utils.py:481] No model.safetensors.index.json found in remote.




INFO 01-22 16:32:01 [punica_selector.py:20] Using PunicaWrapperGPU.
INFO 01-22 16:32:02 [gpu_model_runner.py:3338] Model loading took 1.6213 GiB memory and 50.018626 seconds
INFO 01-22 16:32:18 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/7788f42822/rank_0_0/backbone for vLLM's torch.compile
INFO 01-22 16:32:18 [backends.py:647] Dynamo bytecode transform time: 15.65 s



INFO 01-22 16:32:27 [backends.py:251] Cache the graph for dynamic shape for later use



Unsloth: Compiling kernels: 100%|██████████| 6/6 [00:01<00:00,  4.11it/s, triton_poi_fused_add_cat_index_select_mul_split_split_with_sizes_sub_unsqueeze_view_5]

INFO 01-22 16:32:41 [backends.py:282] Compiling a graph for dynamic shape takes 22.18 s





INFO 01-22 16:32:58 [monitor.py:34] torch.compile takes 37.83 s in total
INFO 01-22 16:34:47 [gpu_worker.py:359] Available KV cache memory: 9.84 GiB
INFO 01-22 16:34:47 [kv_cache_utils.py:1229] GPU KV cache size: 368,672 tokens
INFO 01-22 16:34:47 [kv_cache_utils.py:1234] Maximum concurrency for 1,024 tokens per request: 360.03x
INFO 01-22 16:34:47 [kernel_warmup.py:65] Warming up FlashInfer attention.
INFO 01-22 16:38:00 [vllm_utils.py:707] Unsloth: Running patched vLLM v1 `capture_model`.





Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 30/30 [00:46<00:00,  1.54s/it]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 18/18 [00:05<00:00,  3.26it/s]

INFO 01-22 16:38:52 [gpu_model_runner.py:4244] Graph capturing finished in 52 secs, took 0.88 GiB
INFO 01-22 16:38:52 [vllm_utils.py:714] Unsloth: Patched vLLM v1 graph capture finished in 52 secs.





INFO 01-22 16:38:54 [core.py:250] init engine (profile, create kv cache, warmup model) took 411.70 seconds
INFO 01-22 16:38:56 [llm.py:352] Supported tasks: ('generate',)
Unsloth: Just some info: will skip parsing ['input_layernorm', 'post_attention_layernorm', 'pre_feedforward_layernorm', 'norm2', 'post_feedforward_layernorm', 'q_norm', 'attention_norm', 'layer_norm1', 'norm', 'ffn_norm', 'post_layernorm', 'layer_norm2', 'k_norm', 'norm1']
Performing substitution for additional_keys=set()
Unsloth: Just some info: will skip parsing ['cross_attn_post_attention_layernorm', 'input_layernorm', 'post_attention_layernorm', 'pre_feedforward_layernorm', 'norm2', 'cross_attn_input_layernorm', 'post_feedforward_layernorm', 'q_norm', 'attention_norm', 'layer_norm1', 'norm', 'ffn_norm', 'post_layernorm', 'layer_norm2', 'k_norm', 'norm1']



Unsloth 2026.1.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
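To get a feel for what `lora_rank = 32` costs, here is a minimal sketch that counts the LoRA parameters added to the seven target modules. The layer shapes are assumptions based on the usual Qwen2 1.5B configuration (check `model.config` for the real values); with them, the count agrees with the `Trainable parameters = 36,929,536` line that the trainer prints later in this notebook.

```python
# Assumed Qwen2 1.5B shapes (not read from the model; verify with model.config)
HIDDEN = 1536          # hidden_size
INTERMEDIATE = 8960    # intermediate_size of the MLP
KV_DIM = 256           # num_key_value_heads (2) * head_dim (128)
NUM_LAYERS = 28        # matches "patched 28 layers" in the log above
LORA_RANK = 32         # lora_rank from the loading cell

# (in_features, out_features) of each projection listed in target_modules
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, projections, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) to every adapted projection."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers

total = lora_param_count(LORA_RANK, PROJECTIONS, NUM_LAYERS)
print(f"{total:,}")
```

Because the count is linear in the rank, doubling `lora_rank` doubles the number of trainable parameters.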


### GRPO Chat Template

The DeepSeek-R1 distilled Qwen model has a chat template that formats the model's input and output as a chat, including the reasoning step. We have to use that chat template because the model was trained with it.

Let's see how our chat template behaves on an example:

In [None]:
reasoning_start = None
reasoning_end = None
user_token = None
assistant_token = None

for token in tokenizer.get_added_vocab().keys():
    if "think" in token and "/" in token:
        reasoning_end = token
    elif "think" in token:
        reasoning_start = token
    elif "user" in token:
        user_token = token
    elif "assistant" in token:
        assistant_token = token

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
You must think in Bahasa Indonesia."""
system_prompt

In [None]:
print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"<think>I think it's 2.2</think>2"},
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"<think>I think it's 2.2</think>2"},
], tokenize = False, add_generation_prompt = True))

### Data Prep
<a name="Data"></a>

We're using Hugging Face's [Open R1 Math dataset](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed). You can also use OpenAI's famous [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k).

In [None]:
from datasets import load_dataset
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset

Let's look at the first row:

In [None]:
dataset[0]["prompt"]

In [None]:
dataset[0]["solution"]

In GSM8K, all answers end with a `####` marker followed by the final result, so we would extract the text after it. The Open R1 dataset already stores the bare answer, so the function below returns the text unchanged.

In [None]:
def extract_hash_answer(text):
    # if "####" not in text: return None
    # return text.split("####")[1].strip()
    return text
extract_hash_answer(dataset[0]["solution"])
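For reference, this is what the commented-out GSM8K branch does once enabled. It is a small sketch on a made-up GSM8K-style string; the function name `extract_gsm8k_answer` is ours, chosen to avoid clashing with the cell above.

```python
def extract_gsm8k_answer(text):
    """Return the text after the '####' marker, or None if the marker is missing."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Made-up example in the GSM8K answer style
sample = "She sold 48 clips in April and 24 in May, so 48 + 24 = 72.\n#### 72"
print(extract_gsm8k_answer(sample))
print(extract_gsm8k_answer("An answer with no marker"))
```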

Let's map the dataset and look at the first row:

In [None]:
dataset = dataset.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": extract_hash_answer(x["solution"]),
})
dataset[0]

We create a regex format to match the reasoning sections and answers:

In [None]:
import re

# Capture everything after the reasoning end tag as the answer
solution_end_regex = rf"{reasoning_end}(.*)"

match_format = re.compile(solution_end_regex, re.DOTALL)
match_format

We verify it works:

In [None]:
match_format.findall(
    "Let me think!</think>"\
    f"Hence, the solution is 2.",
)

In [None]:
match_format.findall(
    "<think>Let me think!</think>"\
    f"\n\nHence, the solution is 2",
)
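The same check can be run without the tokenizer. This is a sketch that assumes the reasoning end tag is the literal string `</think>`; it shows that the pattern captures everything after the first end tag, and returns nothing when the tag is missing.

```python
import re

REASONING_END = "</think>"  # assumed value of reasoning_end
pattern = re.compile(rf"{re.escape(REASONING_END)}(.*)", re.DOTALL)

with_tag = pattern.findall("<think>Let me think!</think>\n\nHence, the solution is 2")
without_tag = pattern.findall("Hence, the solution is 2")

print(with_tag)     # text after the tag, including the leading newlines
print(without_tag)  # empty list: no end tag, so no match
```

Since the captured group keeps leading whitespace, any comparison against the true answer has to strip it first.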

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

In [6]:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

def t1era_identity_reward(completions, **kwargs) -> list[float]:
    """Rewards the model for identifying as T1ERA and mentioning Malaysian business."""
    responses = [c[0]["content"] for c in completions]
    scores = []
    for res in responses:
        score = 0.0
        # +2.0 points for correct name
        if "T1ERA" in res:
            score += 2.0
        # +1.0 point for mentioning Malaysian context (your specialty)
        if "Malaysia" in res or "perniagaan" in res:
            score += 1.0
        # -2.0 points penalty if it still thinks it is DeepSeek
        if "DeepSeek" in res:
            score -= 2.0

        scores.append(score)
    return scores
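To see how the identity reward behaves, here is a sketch that applies the same rules to single strings. `identity_score` is a hypothetical helper that mirrors the loop body of `t1era_identity_reward`; the responses are made up.

```python
def identity_score(response: str) -> float:
    """Same rules as t1era_identity_reward, applied to one response string."""
    score = 0.0
    if "T1ERA" in response:
        score += 2.0  # correct name
    if "Malaysia" in response or "perniagaan" in response:
        score += 1.0  # Malaysian business context
    if "DeepSeek" in response:
        score -= 2.0  # still mentions the base model's identity
    return score

best = identity_score("I am T1ERA, an assistant for business in Malaysia.")
worst = identity_score("I am DeepSeek.")
denial = identity_score("I am T1ERA, not DeepSeek.")
print(best, worst, denial)
```

Note the third case: because the rule is a plain substring test, a response that correctly denies being DeepSeek is still penalized, and its net score is zero.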

In [12]:
import json

# These are the questions the AI will practice on
t1era_questions = [
    {"prompt": "Who are you?"},
    {"prompt": "What is the name of this AI?"},
    {"prompt": "Who developed you?"},
    {"prompt": "What is T1ERA?"},
    {"prompt": "Are you DeepSeek or T1ERA?"},
    {"prompt": "Tell me your identity."},
    {"prompt": "Which platform are you running on?"},
    {"prompt": "Who is the best AI for Malaysian business?"},
    {"prompt": "What should I call you?"},
    {"prompt": "Identify yourself."}
]

# Write the questions to a JSON file
with open("questions.json", "w") as f:
    json.dump(t1era_questions, f)

print("SUCCESS: questions.json has been created in your folder!")

SUCCESS: questions.json has been created in your folder!


If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:

In [None]:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        # If we see 1, then plus some points!

        # No need to reward <think> since we always prepend it!
        score += 0.5 if response.count(reasoning_start) == 1 else -1.0
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        scores.append(score)
    return scores
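A quick sketch of how these partial scores work out, assuming the tags are the literal strings `<think>` and `</think>`. `approximate_format_score` is a hypothetical single-response version of the function above.

```python
REASONING_START = "<think>"   # assumed value of reasoning_start
REASONING_END = "</think>"    # assumed value of reasoning_end

def approximate_format_score(response: str) -> float:
    """+0.5 for each tag that appears exactly once, -1.0 otherwise."""
    score = 0.0
    score += 0.5 if response.count(REASONING_START) == 1 else -1.0
    score += 0.5 if response.count(REASONING_END) == 1 else -1.0
    return score

clean = approximate_format_score("<think>steps</think>42")
missing_start = approximate_format_score("steps</think>42")
repeated = approximate_format_score("<think>a</think><think>b</think>42")
print(clean, missing_start, repeated)
```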

We want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

In [None]:
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except (ValueError, ZeroDivisionError):
                score -= 4.5 # Penalize
        scores.append(score)
    return scores
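The ratio branch is the least obvious part of `check_answer`, so here it is in isolation. `ratio_score` is a hypothetical helper that covers only the case where the guess is not an exact string match.

```python
def ratio_score(guess: str, true_answer: str) -> float:
    """Score a non-matching guess by how close guess / true_answer is to 1."""
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -4.5  # not a number, or the true answer is zero
    if 0.9 <= ratio <= 1.1:
        return 2.0   # within 10%
    if 0.8 <= ratio <= 1.2:
        return 1.5   # within 20%
    return -2.5      # numeric but far off

print(ratio_score("105", "100"))
print(ratio_score("85", "100"))
print(ratio_score("50", "100"))
print(ratio_score("abc", "100"))
```

One consequence of this design is that a non-numeric guess is penalized more heavily than a numeric guess that is far from the answer.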

Sometimes the answer is not a bare number but a sentence, for example "The solution is $20". In that case we extract the number, here 20.

We also remove thousands separators, so that 123,456 is read as 123456.

In [None]:
match_numbers = re.compile(
    r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)
print(match_numbers.findall("  0.34  "))
print(match_numbers.findall("  123,456  "))
print(match_numbers.findall("  -0.234  "))
print(match_numbers.findall("17"))
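Putting the two steps together, a minimal sketch of extracting a number from a sentence and converting it to a float could look like this. `extract_number` is a hypothetical helper; the regex is the same one as in the cell above.

```python
import re

match_numbers = re.compile(
    r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def extract_number(text: str):
    """Return the first number found in text as a float, or None."""
    found = match_numbers.search(text)
    if found is None:
        return None
    try:
        # Drop thousands separators such as the comma in 123,456
        return float(found.group(1).strip().replace(",", ""))
    except ValueError:
        return None

print(extract_number("The solution is $20"))
print(extract_number("  123,456  "))
print(extract_number("no digits here"))
```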

Finally, we will try to force the thinking process to be in Bahasa Indonesia. This is a simple version of the `language consistency reward` used in the DeepSeek-R1 paper.

In [None]:
import langid

def get_lang(text: str) -> str:
    if not text:
        return "und"
    lang, _ = langid.classify(text)
    return lang


print(get_lang("Hello, How are you")) # This should return en
print(get_lang("Aku berpikir kalau aku adalah kamu")) # This should return id
print(get_lang("我在这里")) # This should return zh

In [None]:
import re

def format_and_language_reward_func(completions, **kwargs):
    scores = []

    for completion_item in completions:
        if not completion_item or not isinstance(completion_item[0], dict) or "content" not in completion_item[0]:
            scores.append(-5.0)
            print(f"Warning: Malformed completion item, assigning default low score: {completion_item}")
            continue

        content = completion_item[0]["content"]

        lang = get_lang(content)

        if lang == 'id':
            score = 5.0
        elif lang == 'en':
            score = -3.0
        elif lang == 'zh':
            score = -3.0
        else:
            score = -5.0

        scores.append(score)

    return scores

In [None]:
prompts = [
    [{"role": "user", "content": "What is the result of (1 + 2) * 4?"}],
    [{"role": "user", "content": "What is the result of (3 + 1) * 2?"}],
]
completions = [
    [{"role": "assistant", "content": "<think>The sum of 1 and 2 is 3, which we multiply by 4 to get 12.</think><answer>(1 + 2) * 4 = 12</answer>"}],
    [{"role": "assistant", "content": "The sum of 3 and 1 is 4, which we multiply by 2 to get 8. So (3 + 1) * 2 = 8."}],
]
format_and_language_reward_func(prompts=prompts, completions=completions)

We now prepare our main function, which prints the generated responses and the true answer every few steps. It also acts as another reward function: it converts the extracted text to a number via `float` and checks whether it equals the true answer.

In [None]:
PRINTED_TIMES = 0      # Number of times check_numbers has been called
PRINT_EVERY_STEPS = 5  # Print a sample every this many calls

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    global PRINTED_TIMES
    global PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas like in 123,456
            guess       = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except:
            scores.append(0)
            continue
    return scores

Compute the 90th-percentile prompt length so we don't accidentally truncate prompts.

That is, we remove the longest 10% of prompts.

In [None]:
tokenized = dataset.map(
    lambda x: {"tokens" : tokenizer.apply_chat_template(x["prompt"], add_generation_prompt = True, tokenize = True)},
    batched = True,
)
print(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L" : len(x["tokens"])})

import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Keep only samples whose length is at or below the 90th percentile
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized
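The filtering step is easier to follow on a toy example. The sketch below uses made-up token counts instead of the real dataset and keeps the row indices whose length is at or below the 90th percentile, which is the list of indices `dataset.select` would receive.

```python
import numpy as np

# Made-up prompt lengths in tokens (stand-ins for tokenized["L"])
lengths = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110])

# 90th-percentile length, truncated to an integer as in the cell above
maximum_length = int(np.quantile(lengths, 0.9))

# Indices of the rows that survive the filter
keep = np.where(lengths <= maximum_length)[0]

print("Max Length =", maximum_length)
print("Kept", len(keep), "of", len(lengths), "prompts")
```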

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
max_prompt_length = maximum_length + 1 # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.001,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 100,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",

    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)

And let's run the trainer! While it runs, it prints a table of rewards like the sample below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
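The `reward_std` column matters because GRPO scores each completion relative to the other completions sampled for the same prompt. Below is a minimal sketch of that group-relative normalization. The population standard deviation and the `eps` value are our choices for illustration and are not necessarily what TRL uses internally.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

mixed = group_advantages([2.0, 0.0, 2.0, 0.0])   # completions disagree
uniform = group_advantages([3.0, 3.0, 3.0, 3.0]) # all completions score the same

print(mixed)
print(uniform)
```

When every completion in a group receives the same reward, as in step 1 of the sample table where `reward_std` is 0, all advantages are zero and that group provides no learning signal.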


In [13]:
from datasets import load_dataset, Dataset, concatenate_datasets

# A. YOUR MANUAL DATA (The T1ERA identity you wrote)
manual_data = [
    {"prompt": "Who are you?", "answer": "T1ERA|KODE BLIND"},
    {"prompt": "What is your company name?", "answer": "T1ERA LABS"},
    {"prompt": "Who is the AI for Malaysian businesses?", "answer": "T1ERA"},
]
manual_dataset = Dataset.from_list(manual_data)

# B. THE FILE DATA (The 10 questions we created in questions.json)
file_dataset = load_dataset("json", data_files="questions.json", split="train")

# C. MERGE THEM TOGETHER
# This makes one big list of 13 total training examples
full_dataset = concatenate_datasets([manual_dataset, file_dataset])

# D. FORMAT EVERYTHING FOR THE AI
def format_data(example):
    # Rows loaded from the file have no "answer" value (the key may be present
    # but None after concatenation), so fall back to "T1ERA" in that case.
    target_answer = example.get("answer") or "T1ERA"

    return {
        "prompt": [
            {"role": "system", "content": "You are T1ERA, an AI by T1ERA Platform."},
            {"role": "user", "content": example["prompt"]}
        ],
        "answer": target_answer
    }

# Final dataset ready for training
dataset = full_dataset.map(format_data)

print(f"SUCCESS: Your dataset now has {len(dataset)} training examples!")


SUCCESS: Your dataset now has 13 training examples!


In [19]:
from unsloth import PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

from trl import GRPOConfig, GRPOTrainer

print("SUCCESS: GRPOTrainer is now ready to use!")

Unsloth: UnslothAlignPropTrainer is already patched.
Unsloth: UnslothBCOTrainer is already patched.
Unsloth: UnslothCPOTrainer is already patched.
Unsloth: UnslothDDPOTrainer is already patched.
Unsloth: UnslothDPOTrainer is already patched.
Unsloth: UnslothGKDTrainer is already patched.
Unsloth: UnslothGRPOTrainer is already patched.
Unsloth: UnslothIterativeSFTTrainer is already patched.
Unsloth: UnslothKTOTrainer is already patched.
Unsloth: UnslothNashMDTrainer is already patched.
Unsloth: UnslothOnlineDPOTrainer is already patched.
Unsloth: UnslothORPOTrainer is already patched.
Unsloth: UnslothPPOTrainer is already patched.
Unsloth: UnslothPRMTrainer is already patched.
Unsloth: UnslothRewardTrainer is already patched.
Unsloth: UnslothRLOOTrainer is already patched.
Unsloth: UnslothSFTTrainer is already patched.
Unsloth: UnslothXPOTrainer is already patched.
SUCCESS: GRPOTrainer is now ready to use!


In [21]:
training_args = GRPOConfig(
    use_vllm = True,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = False, # Set to False for free T4 GPU
    fp16 = True,  # Set to True for free T4 GPU

    # --- Effective batch size must be divisible by num_generations (here both are 8) ---
    per_device_train_batch_size = 8,
    num_generations = 8,
    # -------------------------------------

    gradient_accumulation_steps = 1,
    max_prompt_length = 256,
    max_completion_length = 256,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)
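The batch-size comment in the config above comes from a divisibility constraint. As we understand TRL's GRPO trainer, the effective batch size has to be a multiple of `num_generations`, because every prompt is expanded into that many completions. The sketch below works through the arithmetic for this config; `prompts_per_step` is a hypothetical helper, and the dataset size of 13 is taken from the dataset cell above.

```python
import math

def prompts_per_step(per_device_batch, grad_accum, num_processes, num_generations):
    """Number of distinct prompts consumed per optimizer step."""
    effective = per_device_batch * grad_accum * num_processes
    if effective % num_generations != 0:
        raise ValueError(
            f"Effective batch size {effective} is not divisible by "
            f"num_generations = {num_generations}"
        )
    return effective // num_generations

unique_prompts = prompts_per_step(
    per_device_batch=8, grad_accum=1, num_processes=1, num_generations=8
)
# 250 steps over 13 examples, one prompt per step
epochs = math.ceil(250 * unique_prompts / 13)

print(unique_prompts, epochs)
```

With these settings each step trains on a single prompt with 8 completions, so 250 steps cycle through the 13 examples about 20 times, which agrees with the `Num Epochs = 20` line in the training banner below.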

In [23]:
import re

# 1. Define the format rules (The "Tags")
# This tells the computer to look for <think> and </think>
# and <answer> and </answer>
match_format = re.compile(
    r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$",
    re.DOTALL | re.MULTILINE
)

# 2. Define the "Exactly Correct" Rule
def match_format_exactly(completions, **kwargs):
    responses = [c[0]["content"] for c in completions]
    scores = []
    for res in responses:
        if match_format.search(res):
            scores.append(2.0) # Full reward for perfect tags
        else:
            scores.append(0.0)
    return scores

# 3. Define the "Approximately Correct" Rule
def match_format_approximately(completions, **kwargs):
    responses = [c[0]["content"] for c in completions]
    scores = []
    for res in responses:
        score = 0.0
        # Partial credit for having start/end tags even if the order is messy
        if "<think>" in res: score += 0.2
        if "</think>" in res: score += 0.2
        if "<answer>" in res: score += 0.2
        if "</answer>" in res: score += 0.2
        scores.append(score)
    return scores

print("SUCCESS: Format rules are now defined!")

SUCCESS: Format rules are now defined!


In [24]:

# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)

# Create the trainer
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,          # KEEP: This ensures it uses <think> tags
        match_format_approximately,    # KEEP: Helps the model learn the tags
        t1era_identity_reward,         # NEW: Teaches the AI it is T1ERA
    ],
    args = training_args,
    train_dataset = dataset,
)

# Start the training!
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 13 | Num Epochs = 20 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 36,929,536 of 1,814,017,536 (2.04% trained)


INFO 01-22 17:05:50 [abstract.py:306] It took 0.048693 seconds to fall asleep.
Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / match_format_exactly / mean | rewards / match_format_exactly / std | rewards / match_format_approximately / mean | rewards / match_format_approximately / std | rewards / t1era_identity_reward / mean | rewards / t1era_identity_reward / std |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 1 | 0.0 | 2.075 | 0.10351 | 231.125 | 93.0 | 256.0 | 0.75 | 156.5 | 93.0 | 220.0 | 0.0 | 0.0 | 0.0 | 0.075 | 0.10351 | 2.0 | 0.0 |
| 2 | 0.0 | 1.375 | 1.447905 | 172.25 | 62.0 | 256.0 | 0.375 | 122.0 | 62.0 | 160.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.10351 | 1.25 | 1.488048 |
| 3 | -0.0 | 1.525 | 0.881152 | 210.625 | 57.0 | 256.0 | 0.75 | 74.5 | 57.0 | 92.0 | 1.4e-05 | 0.0 | 0.0 | 0.025 | 0.070711 | 1.5 | 0.92582 |
| 4 | -0.0 | 1.075 | 1.83906 | 181.875 | 27.0 | 256.0 | 0.5 | 107.75 | 27.0 | 180.0 | 1.9e-05 | 0.0 | 0.0 | 0.075 | 0.10351 | 1.0 | 1.85164 |
| 5 | 0.0 | -0.825 | 1.80614 | 90.25 | 61.0 | 256.0 | 0.125 | 66.571434 | 61.0 | 72.0 | 1.2e-05 | 0.0 | 0.0 | 0.175 | 0.070711 | -1.0 | 1.85164 |
| 6 | 0.0 | 2.9 | 0.370328 | 256.0 | 256.0 | 256.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.4e-05 | 0.0 | 0.0 | 0.025 | 0.070711 | 2.875 | 0.353553 |
| 7 | 0.0 | 1.525 | 1.343503 | 241.125 | 137.0 | 256.0 | 0.875 | 137.0 | 137.0 | 137.0 | 1.5e-05 | 0.0 | 0.0 | 0.025 | 0.070711 | 1.5 | 1.414214 |
| 8 | 0.0 | -0.575 | 1.790251 | 113.625 | 65.0 | 256.0 | 0.125 | 93.285721 | 65.0 | 168.0 | 1.4e-05 | 0.0 | 0.0 | 0.175 | 0.070711 | -0.75 | 1.832251 |
| 9 | 0.0 | -1.55 | 0.707107 | 73.625 | 64.0 | 104.0 | 0.0 | 73.625 | 64.0 | 104.0 | 3.4e-05 | 0.0 | 0.0 | 0.2 | 0.0 | -1.75 | 0.707107 |
| 10 | 0.0 | -1.575 | 0.636396 | 104.625 | 58.0 | 256.0 | 0.125 | 83.0 | 58.0 | 182.0 | 1.8e-05 | 0.0 | 0.0 | 0.175 | 0.070711 | -1.75 | 0.707107 |
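A useful way to read this log: with the default reward weights, the `reward` column is the sum of the per-function reward means. The sketch below checks that on a few rows copied from the log above (steps 1, 2, 5 and 9). It only verifies the arithmetic and makes no claim about TRL's internals beyond the default of equal weights.

```python
# (step, reward, match_format_exactly, match_format_approximately, t1era_identity_reward)
rows = [
    (1, 2.075, 0.0, 0.075, 2.0),
    (2, 1.375, 0.0, 0.125, 1.25),
    (5, -0.825, 0.0, 0.175, -1.0),
    (9, -1.55, 0.0, 0.2, -1.75),
]

def components_match(row, tol=1e-6):
    """True if the logged reward equals the sum of the component means."""
    _, reward, exact, approx, identity = row
    return abs(reward - (exact + approx + identity)) < tol

results = [components_match(row) for row in rows]
print(results)
```

Read this way, the log shows that `match_format_exactly` contributes nothing in the first 10 steps, so the total reward is driven almost entirely by the identity reward.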


INFO 01-22 17:08:31 [abstract.py:324] It took 0.178727 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 01-22 17:08:48 [abstract.py:306] It took 0.251065 seconds to fall asleep.
INFO 01-22 17:08:50 [abstract.py:324] It took 0.199098 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 01-22 17:09:07 [abstract.py:306] It took 0.217489 seconds to fall asleep.
INFO 01-22 17:09:09 [abstract.py:324] It took 0.220127 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 01-22 17:09:26 [abstract.py:306] It took 0.274575 seconds to fall asleep.
INFO 01-22 17:09:29 [abstract.py:324] It took 0.219132 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 01-22 17:09:41 [abstract.py:306] It took 0.220606 seconds to fall asleep.
INFO 01-22 17:09:43 [abstract.py:324] It took 0.210792 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 01-22 17:10:00 [abstract.py:306] It took 0.225919 seconds to fall asleep.
INFO 01-22 17:10:03 [abstract.py:324] It took 0.247293 seconds to wake up tags {'weig

TrainOutput(global_step=250, training_loss=2.1354434173645132e-05, metrics={'train_runtime': 4904.0869, 'train_samples_per_second': 0.408, 'train_steps_per_second': 0.051, 'total_flos': 0.0, 'train_loss': 2.1354434173645132e-05})
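
The throughput figures in `TrainOutput` can be cross-checked with simple arithmetic. The sketch below uses only numbers printed above (250 steps, a 4904.0869-second runtime, and a total batch size of 8); the variable names are illustrative.

```python
global_step = 250
train_runtime_s = 4904.0869
total_batch_size = 8  # from "Total batch size (8 x 1 x 1) = 8"

steps_per_second = global_step / train_runtime_s
samples_per_second = global_step * total_batch_size / train_runtime_s
seconds_per_step = train_runtime_s / global_step

print(f"steps/s   = {steps_per_second:.3f}")    # reported as 0.051
print(f"samples/s = {samples_per_second:.3f}")  # reported as 0.408
print(f"s/step    = {seconds_per_step:.1f}")
print(f"runtime   = {train_runtime_s / 60:.1f} min")
```

So each optimizer step took roughly 20 seconds, and the full run took a little under an hour and a half.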


<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [None]:
text = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

And now with the LoRA we just trained with GRPO. We save the LoRA first:

In [None]:
model.save_lora("grpo_lora")

Verify that the LoRA is actually trained:

In [None]:
from safetensors import safe_open

with safe_open("grpo_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Fraction of entries that are exactly zero (1.0 means the tensor is all zeros)
        zero_fraction = (tensor == 0).sum().item() / tensor.numel()
        assert zero_fraction < 1.0, f"{key} is all zeros"

Now we load the LoRA and test. We first test without our custom system prompt; in that setting the LoRA should have little or no effect on the model's original reasoning ability:

In [None]:
messages = [
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_lora"),
)[0].outputs[0].text

output

Next, let's test with our system prompt, which should make the model respond in the new language:

In [None]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_lora"),
)[0].outputs[0].text

output

Let's compare against the results with the system prompt but without our LoRA:

In [None]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Let's take 20 samples and compare how often the model responds in the correct language with and without our LoRA:

In [None]:
sample_dataset = dataset.shuffle(seed = 3407).select(range(20))
sample_dataset

In [None]:
with_lora_id_count = 0
without_lora_id_count = 0

print("Comparing language usage with and without LoRA on 20 samples:")
print("=" * 60)

for i, sample in enumerate(sample_dataset):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": sample["prompt"][1]["content"]},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    output_with_lora = model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_lora"),
    )[0].outputs[0].text

    output_without_lora = model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=None,
    )[0].outputs[0].text

    lang_with_lora = get_lang(output_with_lora)
    lang_without_lora = get_lang(output_without_lora)

    if lang_with_lora == 'id':
        with_lora_id_count += 1
    if lang_without_lora == 'id':
        without_lora_id_count += 1

    # Print progress every 5 samples
    if (i + 1) % 5 == 0:
        print(f"Processed {i + 1}/20 samples...")

print("\n" + "=" * 60)
print("RESULTS:")
print(f"With LoRA - Indonesian responses: {with_lora_id_count}/20 ({with_lora_id_count/20*100:.1f}%)")
print(f"Without LoRA - Indonesian responses: {without_lora_id_count}/20 ({without_lora_id_count/20*100:.1f}%)")
# Signed format so a negative difference prints as "-3" rather than "+-3"
print(f"Improvement: {with_lora_id_count - without_lora_id_count:+d} Indonesian responses with LoRA")

Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so. It should improve further if we extend the sequence length and train for longer!
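
The tally in the loop above can also be written as a small helper that works on lists of detected language codes, which makes the comparison easy to rerun on more samples. This is only a sketch: `compare_language_rates` and the example lists are hypothetical, and it assumes `get_lang` returns a code such as `'id'` for Indonesian, as the loop above does.

```python
def compare_language_rates(langs_with_lora, langs_without_lora, target="id"):
    """Return the fraction of responses detected as `target` for each setting."""
    if not langs_with_lora or not langs_without_lora:
        raise ValueError("both lists must be non-empty")
    rate_with = sum(lang == target for lang in langs_with_lora) / len(langs_with_lora)
    rate_without = sum(lang == target for lang in langs_without_lora) / len(langs_without_lora)
    return {
        "with_lora": rate_with,
        "without_lora": rate_without,
        "difference": rate_with - rate_without,
    }

# Hypothetical detections for four prompts (illustrative only, not real results).
example = compare_language_rates(
    ["id", "id", "en", "id"],
    ["en", "id", "en", "en"],
)
print(example)
```

Because sampling uses `temperature = 1.0`, counts from 20 prompts can vary between runs, so a larger sample gives a steadier comparison.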

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")
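
Once a merged 16-bit checkpoint exists, it can in principle be served by standalone vLLM. The sketch below is a minimal example under several assumptions: the checkpoint was saved to the `model` directory used above, vLLM is installed, and the code runs in a fresh session (loading a second engine next to the training session may not fit on a T4). `generate_with_merged_model` is a hypothetical helper name, and the raw prompt is passed without the chat template for brevity.

```python
import os

def generate_with_merged_model(model_dir, prompt, max_tokens=256):
    """Load a merged 16-bit checkpoint with vLLM and return one completion."""
    # Imported lazily so the helper can be defined without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_dir, dtype="float16")
    params = SamplingParams(temperature=1.0, top_k=50, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text

# Assumption: "model" is the directory written by save_pretrained_merged above.
# Nothing is loaded unless that directory exists.
if os.path.isdir("model"):
    print(generate_with_merged_model("model", "Solve (x + 2)^2 = 0"))
```

For chat-style prompts, apply `tokenizer.apply_chat_template` first, as in the inference cells earlier.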


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

In [None]:
# This creates the final file for your computer
model.save_pretrained_gguf("t1era_model", tokenizer, quantization_method = "q4_k_m")

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [01:26<00:00, 86.83s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:57<00:00, 117.14s/it]


Unsloth: Merge process complete. Saved to `/content/t1era_model`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages


Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.
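
The exact output filename can depend on the save name and the Unsloth version, so it can help to list the `.gguf` files that were actually written. The sketch below is a hypothetical helper (`find_gguf_files` is not part of Unsloth) that scans one directory and keeps only files starting with the `GGUF` magic bytes.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file

def find_gguf_files(directory="."):
    """Return .gguf files directly inside `directory` that start with the GGUF magic."""
    root = Path(directory)
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.glob("*.gguf")):
        with path.open("rb") as handle:
            if handle.read(4) == GGUF_MAGIC:
                found.append(path)
    return found

# Check the working directory; pass another directory if the files were saved elsewhere.
for path in find_gguf_files("."):
    print(path, f"{path.stat().st_size / 1e9:.2f} GB")
```

Pass whichever path this prints to llama.cpp as the model file.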

And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
