To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Unsloth

Load up `Qwen 2.5 3B Instruct`, and set parameters

In [1]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-26 11:57:25 [__init__.py:239] Automatically detected platform cuda.


2025-03-26 11:57:25,438	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


==((====))==  Unsloth 2025.3.18: Fast Qwen2 patching. Transformers: 4.50.1. vLLM: 0.8.2.
   \\   /|    NVIDIA RTX 5000 Ada Generation. Num GPUs = 1. Max memory: 31.578 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/Qwen2.5-3B-Instruct with actual GPU utilization = 49.15%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 31.58 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 9.55 GB. Also swap space = 6 GB.
INFO 03-26 11:57:41 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 03-26 11:57:41 [arg_utils.py:1

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.69it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.59it/s]


INFO 03-26 11:57:43 [loader.py:447] Loading weights took 0.80 seconds
INFO 03-26 11:57:43 [punica_selector.py:18] Using PunicaWrapperGPU.





INFO 03-26 11:57:44 [model_runner.py:1146] Model loading took 5.9932 GB and 1.848606 seconds
INFO 03-26 11:57:45 [worker.py:267] Memory profiling takes 1.12 seconds
INFO 03-26 11:57:45 [worker.py:267] the current vLLM instance can use total_gpu_memory (31.58GiB) x gpu_memory_utilization (0.49) = 15.52GiB
INFO 03-26 11:57:45 [worker.py:267] model weights take 5.99GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 8.24GiB.
INFO 03-26 11:57:45 [executor_base.py:111] # cuda blocks: 14997, # CPU blocks: 10922
INFO 03-26 11:57:45 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 234.33x
INFO 03-26 11:57:47 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decr

Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:14<00:00,  2.08it/s]

INFO 03-26 11:58:02 [model_runner.py:1570] Graph capturing finished in 15 secs, took 0.27 GiB
INFO 03-26 11:58:02 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 18.49 seconds



Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
Unsloth 2025.3.18 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
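
The adapter size implied by `lora_rank = 64` and the seven target modules can be estimated by hand. The sketch below is a rough check rather than anything Unsloth runs: the layer shapes (hidden size 2048, MLP size 11008, 256-wide key/value projections, 36 layers) are assumptions about Qwen2.5-3B that are not stated in this notebook. Under those assumptions the total agrees with the `Trainable parameters` figure the trainer prints later.

```python
# Assumed Qwen2.5-3B shapes (not stated in this notebook).
HIDDEN, MLP, KV, LAYERS = 2048, 11008, 256, 36
RANK = 64  # lora_rank from the cell above

# (in_features, out_features) of each targeted projection in one layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in) and B (out x rank) next to each frozen weight
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Because the count scales linearly with `RANK`, halving the rank halves the adapter size.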


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!
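
To make the target format concrete, here is a minimal sketch of how a completion in the `<reasoning>` / `<answer>` layout can be parsed and checked. The sample text and the helper name `extract_answer_block` are made up for illustration; the helper mirrors the `extract_xml_answer` function defined in the next cell.

```python
import re

# Hypothetical completion, written by hand for this example
sample = (
    "<reasoning>\n"
    "The text asserts a cause without evidence.\n"
    "</reasoning>\n"
    "<answer>\n"
    "What evidence supports the claim?\n"
    "Are there alternative explanations?\n"
    "Who benefits from this argument?\n"
    "</answer>\n"
)

def extract_answer_block(text: str) -> str:
    # Keep whatever sits between the last <answer> and the next </answer>
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

# re.DOTALL lets "." span newlines, which multi-line reasoning needs
FORMAT = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)

questions = [q for q in extract_answer_block(sample).split("\n") if q.endswith("?")]
print(bool(FORMAT.match(sample)), questions)
```

Without `re.DOTALL` the same pattern fails on this sample, because `.` stops at the first newline inside the reasoning block.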

In [23]:
import re
from datasets import load_dataset, Dataset
from prompts_eval import *
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np
with open("API_KEY", "r") as f:
    api_key = f.read().strip()

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

template = (
    "Suggest 3 critical questions that should be raised before accepting the arguments in this text:\n\n"
    "\"{intervention}\"\n\n"
    "Give one question per line. Make the questions simple, and do not provide any explanation regarding why the question is relevant. "
    "Do not include any special characters or numbering except for the question mark."
)

def extract_answer(example):
    """
    Extracts the 'cqs' (critical questions) from the example and formats each question with its label.
    Each line in the answer will be in the form:
    <question> (Label: <label>)
    """
    cqs = example.get("cqs", [])
    answer_lines = []
    for cq in cqs:
        question = cq.get("cq", "")
        label = cq.get("label", "")
        answer_lines.append({"cq": question, "label": label, "intervention": example["intervention"]})
    return answer_lines

import json

def get_critical_questions_dataset(split="train"):
    file_path = "../data_splits/training_dataset.json"
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    # If data is a dictionary, convert it to a list of examples.
    if isinstance(data, dict):
        examples = list(data.values())
    else:
        examples = data
    
    # Map each example to include prompt and answer fields.
    for example in examples:
        example["prompt"] = [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': template.format(intervention=example['intervention'])}
        ]
        example["answer"] = extract_answer(example)
    
    return examples

dataset = get_critical_questions_dataset()
# def extract_hash_answer(text: str) -> str | None:
#     if "####" not in text:
#         return None
#     return text.split("####")[1].strip()

# # uncomment middle messages for 1-shot prompting
# def get_gsm8k_questions(split = "train") -> Dataset:
#     data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
#     data = data.map(lambda x: { # type: ignore
#         'prompt': [
#             {'role': 'system', 'content': SYSTEM_PROMPT},
#             {'role': 'user', 'content': x['question']}
#         ],
#         'answer': extract_hash_answer(x['answer'])
#     }) # type: ignore
#     return data # type: ignore

# dataset = get_gsm8k_questions()
def structure_output(whole_text):
    cqs_list = whole_text.split('\n')
    final = []
    valid = []
    not_valid = []
    for cq in cqs_list:
        if re.match(r'.*\?(\")?( )?(\([a-zA-Z0-9\.\'\-,\? ]*\))?([a-zA-Z \.,\"\']*)?(\")?$', cq):
            valid.append(cq)
        else:
            not_valid.append(cq)

    still_not_valid = []
    for text in not_valid:
        new_cqs = re.split(r'\?"', text+'end')
        if len(new_cqs) > 1:
            for cq in new_cqs[:-1]:
                valid.append(cq+'?\"')
        else:
            still_not_valid.append(text)

    for i, cq in enumerate(valid):
        occurrence = re.search(r'[A-Za-z]', cq)
        if occurrence:
            final.append(cq[occurrence.start():])
        else:
            continue

    output = []
    if len(final) >= 3:
        for i in [0, 1, 2]:
            output.append({'id':i, 'cq':final[i]})
        return output
    else:
        # logger.warning('Missing CQs')
        return 'Missing CQs'

def extract_label(response):
    """
    Extracts the label (USEFUL, UNHELPFUL, INVALID) from a given response.
    
    Parameters:
    response (str): The user's response which includes the label in the format "ANSWER: label"

    Returns:
    str: Extracted label or 'UNKNOWN' if no valid label is found.
    """
    match = re.search(r'ANSWER:\s*(USEFUL|UNHELPFUL|INVALID)', response, re.IGNORECASE)
    if not match:
        match = re.search(r'Label:\s*(USEFUL|UNHELPFUL|INVALID)', response, re.IGNORECASE)
    if not match:
        match = re.search(r':\s*(USEFUL|UNHELPFUL|INVALID)', response, re.IGNORECASE)
    if not match:
        match = re.search(r'\s*(USEFUL|UNHELPFUL|INVALID)', response, re.IGNORECASE)
    return match.group(1).upper() if match else "UNKNOWN"

def query_deepinfra(model_name, messages, temperature, max_new_tokens = 2048):
    
    openai = OpenAI(
        api_key=api_key,
        base_url="https://api.deepinfra.com/v1/openai",
    )

    chat_completion = openai.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_new_tokens,  # cap the length of the judge's reply
    )

    response = chat_completion.choices[0].message.content
    input_token_count = chat_completion.usage.prompt_tokens
    output_token_count = chat_completion.usage.completion_tokens

    return response, None, input_token_count, output_token_count

def evluate_answer_corr_with_llm(answer, intervention):
    instruction = ComprehensiveFewShotPrompt().format(intervention=intervention, cq=answer)

    messages = [{"role": "system", "content": ""}, {"role": "user", "content": instruction}]
    full_output, reasoning, _, _ = query_deepinfra("Qwen/Qwen2.5-72B-Instruct", messages, temperature = 0.7)

    label = extract_label(full_output)
    return label.lower()
    
model_eval = SentenceTransformer("stsb-mpnet-base-v2")
def evaluate_answer_corr(golden_answers, answer):
    """Return the fraction (0 to 1) of generated questions judged 'useful'."""
    punctuation = 0
    references = golden_answers[0]
    reference_set = [ref['cq'] for ref in references]
    # Encode the references once instead of once per generated question.
    reference_embedding = model_eval.encode(reference_set)
    for line in answer:  # match each generated question to its most similar reference
        sentence_embedding = model_eval.encode(line['cq'])
        sims = model_eval.similarity(sentence_embedding, reference_embedding).tolist()[0]
        winner = int(np.argmax(sims))

        # Trust the reference label only if the best match has similarity above 0.6.
        if sims[winner] > 0.6:
            label = references[winner]['label'].lower()
        else:
            # Fallback evaluation with LLM.
            label = evluate_answer_corr_with_llm(line['cq'], references[0]['intervention'])
        if label == 'useful':
            punctuation += 1/3

    # Return only after every generated question has been scored.
    return punctuation



# # Reward functions
# def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
#     responses = [completion[0]['content'] for completion in completions]
#     q = prompts[0][-1]['content']
#     extracted_responses = [extract_xml_answer(r) for r in responses]
#     print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
#     return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]

    rewards = []
    for response in extracted_responses:
        structured = structure_output(response)
        if structured == 'Missing CQs' or not isinstance(structured, list) or len(structured) < 3:
            rewards.append(0.0)
        else:
            score = evaluate_answer_corr(answer, structured)
            rewards.append(score*3)
    return rewards

def structure_reward_func(completions, **kwargs)-> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]

    rewards = []
    for response in extracted_responses:
        structured = structure_output(response)
        if structured == 'Missing CQs' or not isinstance(structured, list) or len(structured) < 3:
            rewards.append(0.0)
        else:
            rewards.append(0.5)
    return rewards


# def int_reward_func(completions, **kwargs) -> list[float]:
#     responses = [completion[0]['content'] for completion in completions]
#     extracted_responses = [extract_xml_answer(r) for r in responses]
#     return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact line-by-line XML layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning and answers still match.
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that start with a reasoning block followed by an answer block."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
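
The core of the correctness reward is a nearest-reference lookup with a similarity threshold. The sketch below isolates that logic using plain Python and toy 2-D vectors in place of sentence embeddings; the function names and the vectors are hypothetical, and the LLM fallback is replaced by simply collecting the indices that would be sent to it.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def score_questions(question_vecs, reference_vecs, reference_labels, threshold=0.6):
    """Add 1/3 for each question whose closest reference is labelled 'useful'."""
    score = 0.0
    unmatched = []  # indices that would go to the LLM fallback
    for idx, vec in enumerate(question_vecs):
        sims = [cosine(vec, ref) for ref in reference_vecs]
        winner = max(range(len(sims)), key=sims.__getitem__)
        if sims[winner] > threshold:
            if reference_labels[winner] == "useful":
                score += 1 / 3
        else:
            unmatched.append(idx)
    return score, unmatched

# Toy "embeddings" (hypothetical values, for illustration only)
references = [(1.0, 0.0), (0.0, 1.0)]
labels = ["useful", "unhelpful"]
generated = [(0.9, 0.1), (0.1, 0.9), (-1.0, -1.0)]

score, unmatched = score_questions(generated, references, labels)
print(round(score, 3), unmatched)
```

Here the first question lands on a useful reference, the second on an unhelpful one, and the third is too far from every reference, so it is the only one left for the fallback judge.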

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [24]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
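
With `warmup_ratio = 0.1`, `lr_scheduler_type = "cosine"` and `max_steps = 250`, the learning rate is expected to ramp up over the first 25 steps and then decay. The sketch below assumes the common linear-warmup plus half-cosine formula; it is an illustration of that shape, not code taken from the trainer.

```python
import math

BASE_LR = 5e-6                    # learning_rate
TOTAL_STEPS = 250                 # max_steps
WARMUP_STEPS = TOTAL_STEPS // 10  # warmup_ratio = 0.1

def lr_at(step: int) -> float:
    """Linear warmup followed by half-cosine decay to zero (assumed formula)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 25, 137, 250):
    print(step, f"{lr_at(step):.2e}")
```

The peak is reached at step 25, so most of the 250 steps run below the nominal `5e-6`.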


And let's run the trainer! The training output includes a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
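
The `reward` and `reward_std` columns summarise one group of `num_generations` completions for the same prompt. GRPO turns those rewards into advantages by normalising within the group. The sketch below assumes the usual form (subtract the group mean, divide by the sample standard deviation plus a small epsilon); the epsilon value and the per-completion rewards are assumptions, chosen as one combination consistent with a step that logs `reward = 0.625` and `reward_std = 0.353553`.

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: seven completions earn 0.5, one earns 1.5
rewards = [0.5] * 7 + [1.5]
print(round(statistics.fmean(rewards), 6), round(statistics.stdev(rewards), 6))
print([round(a, 2) for a in group_advantages(rewards)])

# If every completion scores the same, all advantages are zero
print(group_advantages([0.5] * 8))
```

A step whose `reward_std` is 0 therefore gives the policy nothing to prefer, whatever the absolute reward.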


In [25]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        structure_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 148 | Num Epochs = 2 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 119,734,272/3,205,672,960 (3.74% trained)


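| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / structure_reward_func | rewards / correctness_reward_func |
|------|---------------|--------|------------|-------------------|----|--------------------------------|-----------------------------------|-------------------------------------|---------------------------------|-----------------------------------|
| 1 | 0.0 | 0.9375 | 0.623212 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4375 | 0.5 |
| 2 | 0.0 | 1.125 | 0.517549 | 35.875 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.625 |
| 3 | 0.0001 | 0.625 | 0.353553 | 38.125 | 0.001356 | 0.0 | 0.0 | 0.0 | 0.5 | 0.125 |
| 4 | 0.0001 | 1.25 | 0.46291 | 32.5 | 0.001263 | 0.0 | 0.0 | 0.0 | 0.5 | 0.75 |
| 5 | 0.0001 | 0.625 | 0.353553 | 28.25 | 0.001874 | 0.0 | 0.0 | 0.0 | 0.5 | 0.125 |
| 6 | 0.0001 | 1.0 | 0.534522 | 40.625 | 0.001864 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 |
| 7 | 0.0003 | 1.5 | 0.0 | 39.125 | 0.00706 | 0.0 | 0.0 | 0.0 | 0.5 | 1.0 |
| 8 | 0.0 | 0.625 | 0.353553 | 20.875 | 0.000689 | 0.0 | 0.0 | 0.0 | 0.5 | 0.125 |
| 9 | 0.0001 | 1.375 | 0.353553 | 34.125 | 0.001715 | 0.0 | 0.0 | 0.0 | 0.5 | 0.875 |
| 10 | 0.0003 | 1.25 | 0.46291 | 33.75 | 0.006485 | 0.0 | 0.0 | 0.0 | 0.5 | 0.75 |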


Unsloth: Will smartly offload gradients to save VRAM!
Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/home/sama.hadhoud/.conda/envs/unsloth_nlp804/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/slurm-sama.hadhoud-67270/ipykernel_155102/433905283.py", line 14, in <module>
    trainer.train()
  File "/home/sama.hadhoud/.conda/envs/unsloth_nlp804/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 310, in _fast_inner_training_loop
  File "<string>", line 25, in _unsloth_training_step
  File "/home/sama.hadhoud/Documents/Critical_Question_generation/trial_submission/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 1040, in _prepare_inputs
    output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.71it/s, est. speed input: 63.38 toks/s, output: 25.69 toks/s]


'There are 2 r\'s in the word "strawberry."'

And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.06s/it, est. speed input: 14.05 toks/s, output: 29.09 toks/s]


'<reasoning>\nTo find out how many times the letter \'r\' appears in the word "strawberry", we can go through the word character by character and count each occurrence of \'r\'. In "strawberry", the letter \'r\' appears 3 times: once in the beginning, once in the middle, and once at the end of the word.\n</reasoning>\n<answer>\n3\n</answer>'

Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so, but it should improve if we extend the sequence length and train for longer!
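
As a quick sanity check on the two outputs above, the expected answer to the test prompt can be computed directly. This is only a reference calculation, not part of the training or inference pipeline.

```python
word = "strawberry"
letter = "r"

# Collect the zero-based positions at which the letter occurs
positions = [i for i, ch in enumerate(word) if ch == letter]
print(len(positions), positions)
```

The count confirms the final answer of the GRPO-trained run, even though the positions described in its reasoning text do not match the actual ones.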

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now natively support saving to `GGUF` / `llama.cpp`! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
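
To get a feel for what the quantization choice means on disk, the file size can be roughly estimated from bits per weight. The sketch below covers only `f16` and `q8_0`, whose storage layouts are simple: `q8_0` stores blocks of 32 int8 values plus one 16-bit scale. It assumes the trainer's reported total includes the LoRA weights, and it ignores file metadata and any tensors kept at a different precision, so treat the numbers as ballpark figures.

```python
TOTAL_WITH_LORA = 3_205_672_960  # total reported in the trainer banner
LORA_PARAMS = 119_734_272        # trainable (adapter) parameters
# Assumption: the reported total includes the adapters
base_params = TOTAL_WITH_LORA - LORA_PARAMS

# Bits per weight: f16 stores 16; q8_0 stores (32 * 8 + 16) / 32 = 8.5
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": (32 * 8 + 16) / 32}

def approx_size_gib(n_params: int, bits_per_weight: float) -> float:
    # bits -> bytes -> GiB
    return n_params * bits_per_weight / 8 / 2**30

for name, bits in BITS_PER_WEIGHT.items():
    print(name, round(approx_size_gib(base_params, bits), 2), "GiB")
```

The mixed schemes such as `q4_k_m` and `q5_k_m` use different block types per tensor, so they do not reduce to a single exact bits-per-weight figure.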

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, or need help, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>
