To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook locally, you also need to install `diffusers`
# !pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b

### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [None]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-13 11:54:03 __init__.py:190] Automatically detected platform cuda.


Load up `Qwen 2.5 3B Instruct`, and set parameters

In [None]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 32000 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.2.5: Fast Qwen2 patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.48%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.56 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 32000. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 17.15 GB. Also swap space = 6 GB.
INFO 02-13 11:54:21 config.py:542] This model supports multiple tasks: {'score', 'embed', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kw

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 02-13 11:54:25 cuda.py:230] Using Flash Attention backend.
INFO 02-13 11:54:26 model_runner.py:1110] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 02-13 11:54:26 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 02-13 11:54:26 weight_utils.py:252] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]



INFO 02-13 11:54:35 model_runner.py:1115] Loading model weights took 2.2160 GB
INFO 02-13 11:54:35 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-13 11:54:39 worker.py:267] Memory profiling takes 3.77 seconds
INFO 02-13 11:54:39 worker.py:267] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.49) = 19.57GiB
INFO 02-13 11:54:39 worker.py:267] model weights take 2.22GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 2.63GiB; the rest of the memory reserved for KV Cache is 14.63GiB.
INFO 02-13 11:54:40 executor_base.py:110] # CUDA blocks: 26640, # CPU blocks: 10922
INFO 02-13 11:54:40 executor_base.py:115] Maximum concurrency for 32000 tokens per request: 13.32x
INFO 02-13 11:54:45 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory erro

Capturing CUDA graph shapes: 100%|██████████| 39/39 [00:53<00:00,  1.38s/it]

INFO 02-13 11:55:38 model_runner.py:1562] Graph capturing finished in 54 secs, took 0.92 GiB
INFO 02-13 11:55:38 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 63.34 seconds



Unsloth 2025.2.5 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
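
The memory figures in the vLLM log above can be checked with a little arithmetic. The sketch below copies the numbers from the log; the one value that does not appear in the log is the KV-cache block size of 16 tokens, which is an assumption (it is vLLM's usual default) and is marked as such in the code.

```python
# Figures copied from the vLLM log above (all in GiB unless noted)
total_gpu_memory = 39.56
gpu_memory_utilization = 0.4948  # "actual GPU utilization = 49.48%"
model_weights = 2.22
non_torch_memory = 0.09
activation_peak = 2.63

# Memory vLLM is allowed to use, and what is left over for the KV cache
budget = total_gpu_memory * gpu_memory_utilization
kv_cache = budget - model_weights - non_torch_memory - activation_peak

# Assumption: 16 tokens per KV-cache block (not stated in the log)
BLOCK_SIZE_TOKENS = 16
cuda_blocks = 26640          # "# CUDA blocks: 26640"
max_seq_length = 32000       # tokens per request
concurrency = cuda_blocks * BLOCK_SIZE_TOKENS / max_seq_length

print(f"vLLM budget:     {budget:.2f} GiB")
print(f"KV cache:        {kv_cache:.2f} GiB")
print(f"Max concurrency: {concurrency:.2f}x")
```

Under these assumptions the three values line up with the 19.57 GiB budget, 14.63 GiB KV cache, and 13.32x concurrency reported in the log. Lowering `gpu_memory_utilization` shrinks the budget and therefore the KV cache, while leaving more GPU memory for training.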


In [12]:
!pip install openai

from openai import OpenAI
import json
import os

client = OpenAI(
    # Reads the key from the OPENAI_API_KEY environment variable (the client's default)
    api_key=os.environ.get("OPENAI_API_KEY"),
)

def score_answer_llm(answer, expected_answer):
  chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": """
Given an `answer` and an `expected_answer`, assign a similarity score between 0 and 1 based on meaning and correctness:

- **1.0** → The `answer` is **identical** in meaning and intent to the `expected_answer`, with no substantive differences.
- **0.9 - 0.8** → The `answer` has **minor wording variations** but retains the **exact** intended meaning.
- **0.7 - 0.6** → The `answer` conveys a **similar idea** but with some **small inaccuracies or slight shifts in intent**.
- **0.5 - 0.3** → The `answer` has **partial overlap** but contains **major wording differences or missing key elements**.
- **0.2 - 0.1** → The `answer` **barely** matches the `expected_answer`, capturing only **fragments** of the intent.
- **0.0** → The `answer` is **completely incorrect**, **unrelated**, or has a **different meaning** from the `expected_answer`.

**Scoring Considerations:**
- **Exact match → 1.0**
- **Synonyms & minor rewording → 0.9 - 0.8**
- **Slight shifts in intent → 0.7 - 0.6**
- **Missing or changed details → 0.5 - 0.3**
- **Barely related words → 0.2 - 0.1**
- **Completely wrong meaning → 0.0**

**Output Format:**
Return a JSON object with a single key `"score"` and a value between 0 and 1.

**Example Output:**
{"score": 0.75}

            """,
        },
        {
            "role": "user",
            "content": f"### Expected Answer\n{expected_answer}\n\n### Answer\n{answer}",
        },
    ],
    model="gpt-4o-mini"
    )
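  content = chat_completion.choices[0].message.content
  try:
    # Clamp to [0, 1] in case the model returns an out-of-range value
    return min(1.0, max(0.0, float(json.loads(content).get("score", 0.0))))
  except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
    # Unparseable or malformed reply: treat it as no match
    return 0.0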




In [13]:
## Testing score_answer_llm

print(score_answer_llm("Handle deep-link occlusion for expired PIN-protected profiles", "Handle deep-link occlusion for expired PIN-protected profiles"))
print(score_answer_llm("Handle deep-link occlusion for expired PIN-protected profiles", "Manage deep-link obstruction for lapsed PIN-secured profiles."))
print(score_answer_llm("Resolve deep-link concealment for expired PIN-locked profiles.", "Handle deep-link occlusion for expired PIN-protected profiles"))
print(score_answer_llm("Process deep-link visibility for outdated PIN-locked accounts.", "Manage deep-link obstruction for lapsed PIN-secured profiles."))
print(score_answer_llm("Refactor the profile selector flow", "Manage deep-link obstruction for lapsed PIN-secured profiles."))
print(score_answer_llm("Refactor the code", "Manage deep-link obstruction for lapsed PIN-secured profiles."))


APIConnectionError: Connection error.
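
The `APIConnectionError` above shows that the LLM-based scorer depends on network access. As a rough offline stand-in, a purely local similarity can be computed with the standard library's `difflib.SequenceMatcher`. This is a sketch only: `score_answer_local` is a hypothetical helper, not part of the original notebook, and it measures character overlap rather than meaning, so synonyms will score lower than they would with the LLM judge.

```python
from difflib import SequenceMatcher


def score_answer_local(answer, expected_answer) -> float:
    """Hypothetical offline stand-in for score_answer_llm.

    Returns a character-level similarity in [0, 1]. It does not
    understand meaning; it only measures textual overlap.
    """
    # Missing or empty inputs cannot match anything
    if not answer or not expected_answer:
        return 0.0
    # Normalize case and whitespace so formatting differences are ignored
    a = " ".join(answer.lower().split())
    b = " ".join(expected_answer.lower().split())
    return SequenceMatcher(None, a, b).ratio()


print(score_answer_local("Fix cache expiry bug", "Fix cache expiry bug"))
print(score_answer_local("Refactor the code", "Fix cache expiry bug"))
```

Because the two functions take the same arguments, a local scorer like this could be swapped in while debugging the training loop and replaced by the LLM judge once the connection works.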

In [10]:
def calculate_content_structure_score(text):
    """
    Computes a score for the text based on the presence of the <reasoning> tag
    and on the format of the answer.

    :param text: Text to evaluate.
    :return: Computed score.
    """

    # Initialize the score
    score = 0

    # Check whether the <reasoning> tag is present
    if "<reasoning>" in text and "</reasoning>" in text:
        # Check whether there is content inside the <reasoning> tag
        start_reasoning = text.find("<reasoning>") + len("<reasoning>")
        end_reasoning = text.find("</reasoning>")
        if start_reasoning < end_reasoning and text[start_reasoning:end_reasoning].strip():
            score += 0.1  # Points for non-empty content inside <reasoning>
        else:
            return 0.0
    if "</reasoning>" in text:
        # Extract the answer that follows the closing </reasoning> tag
        answer = text.split("</reasoning>", 1)[1].strip()

        # Check the format of the answer
        lines = answer.splitlines()
        if len(lines) >= 3:
            # Check that the first line (the title) is at most 72 characters
            if len(lines[0].strip()) <= 72:
                score += 0.1  # Points for a title of at most 72 characters

            if lines[1] == "":
                score += 0.1  # Points for a blank line between title and body

            # Check that a body follows the title
            if lines[2] != "":
                score += 0.1  # Points for having a body in the answer

    return score

# Usage examples
examples = [
    """\
Short Title

Body of the answer.
    """,
    """\
<reasoning>
</reasoning>
Short Title

Body of the answer.
    """,
    """\
<reasoning>
This is an example of reasoning.
</reasoning>

Body of the answer.
    """,
    """\
<reasoning>
This is an example of reasoning.
</reasoning>
Short Title
    """,
    """\
<reasoning>
This is an example of reasoning.
</reasoning>
Short Title
Body of the answer.
    """,
    """\
<reasoning>
This is an example of reasoning.
</reasoning>
Short Title

Body of the answer.
    """,
]

for full_text in examples:
  print(calculate_content_structure_score(full_text))


0
0.0
0.1
0.1
0.1
0.4


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [None]:
import re
import difflib
from datasets import load_dataset, Dataset

# System prompt instructing the model to output only the <reasoning> block,
# followed immediately by the final answer in the form of a git commit message.
SYSTEM_PROMPT = f"""
Please respond using the following format:
<reasoning>
Your chain-of-thought here.
</reasoning>
Your final answer should appear immediately after the </reasoning> tag and must be a git commit message that adheres to the following guidelines:

- **Title (Subject Line):**
  - Use the imperative mood (e.g., "Fix bug" not "Fixed bug" or "Fixes bug").
  - Capitalize the first letter.
  - Do not end with a period.
  - Keep to a maximum of 50 characters.

- **Body:**
  - Separate from the title with a blank line.
  - Explain the *what* and *why* of the change, not the *how*.
  - Wrap lines at 72 characters.

- **Additional Recommendations:**
  - Use bullet points for multiple items, if necessary.
"""


# Template for generation
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
{answer}
"""

def extract_xml_answer(text: str) -> str:
    """
    Safely extracts the final commit message which comes after the </reasoning> tag.
    Returns an empty string if the tag is not found.
    """
    if "</reasoning>" not in text:
        return ""
    parts = text.split("</reasoning>", 1)
    return parts[1].strip()

def extract_hash_answer(text: str) -> str | None:
    """
    Extracts the expected commit message from within a git commit message code block.
    """
    marker = "```git-commit-message"
    if marker not in text:
        return None
    try:
        after_marker = text.split(marker, 1)[1]
        commit_message = after_marker.split("```", 1)[0].strip()
        return commit_message
    except IndexError:
        return None

def get_commit_dataset(split="train") -> Dataset:
    """
    Prepares the git commit message dataset by inserting the system prompt and
    the user input, and by extracting the expected commit message.
    """
    # Named differently from `datasets.load_dataset` so the import is not shadowed
    data = load_dataset('Tavernari/git-commit-message-dt', 'default')[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['input']}
        ],
        'answer': extract_hash_answer(x['output'])
    })
    return data

dataset = get_commit_dataset()

# ------------------------
# Reward Functions
# ------------------------

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    scores = []
    for r, a in zip(extracted_responses, answer):
        similarity = score_answer_llm(r, a)
        scores.append(similarity)
    return scores

Map:   0%|          | 0/2535 [00:00<?, ? examples/s]
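
To make the two extraction helpers concrete, here is a small worked example. The helpers are restated in compact form so the snippet runs on its own, and both sample strings are hypothetical: they only illustrate the shapes the code above expects (a fenced `git-commit-message` block in the dataset output, and a `</reasoning>` tag in the model completion). Real dataset rows may look different.

```python
# Built programmatically so this snippet can itself sit inside a fenced block
FENCE = "`" * 3


def extract_hash_answer(text):
    """Return the commit message inside the git-commit-message block, or None."""
    marker = FENCE + "git-commit-message"
    if marker not in text:
        return None
    return text.split(marker, 1)[1].split(FENCE, 1)[0].strip()


def extract_xml_answer(text):
    """Return whatever follows </reasoning>, or an empty string."""
    if "</reasoning>" not in text:
        return ""
    return text.split("</reasoning>", 1)[1].strip()


# Hypothetical dataset `output` field
raw_output = (
    "Here is the message:\n"
    f"{FENCE}git-commit-message\n"
    "Make Profile initializer public\n\n"
    "Expose the initializer so other modules can build a Profile.\n"
    f"{FENCE}\n"
)

# Hypothetical model completion in the format the system prompt asks for
completion = (
    "<reasoning>\nThe diff widens access control.\n</reasoning>\n"
    "Make Profile initializer public\n\n"
    "Expose the initializer so other modules can build a Profile."
)

expected = extract_hash_answer(raw_output)
predicted = extract_xml_answer(completion)
print(expected == predicted)
```

Both helpers strip surrounding whitespace, so the reference answer and the model's answer can be compared directly by a reward function.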

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 1024,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch


And let's run the trainer! During training you'll see a table of rewards like the example below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps before the reward starts to move. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
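# GRPOTrainer calls each reward function with batches (prompts, completions, ...),
# so wrap the single-text structure scorer to return one score per completion.
def structure_reward_func(prompts, completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    return [float(calculate_content_structure_score(r)) for r in responses]

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        structure_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()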

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,535 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 250
 "-____-"     Number of trainable parameters = 119,734,272


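| Step | Training Loss | reward   | reward_std | completion_length | kl       |
|------|---------------|----------|------------|-------------------|----------|
| 1    | 0.0014        | 0.966405 | 0.076667   | 140.125           | 0.034739 |
| 2    | 0.0006        | 0.94616  | 0.078199   | 111.625           | 0.014523 |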


OutOfMemoryError: CUDA out of memory. Tried to allocate 5.75 GiB. GPU 0 has a total capacity of 39.56 GiB of which 3.63 GiB is free. Process 653017 has 35.89 GiB memory in use. Of the allocated memory 34.35 GiB is allocated by PyTorch, with 86.00 MiB allocated in private pools (e.g., CUDA Graphs), and 247.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : """
    diff --git a/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift b/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift
--- a/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift
+++ b/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift
@@ -2,23 +2,5 @@

-public extension AppSettings.Cache {
-    struct Profile: Codable {
-        /// The duration of the profile cache without pin.
-        let withoutPinDuration: TimeInterval
-
-        /// The duration of the profile cache with pin.
-        let withPinDuration: TimeInterval
-
-        init(
-            withoutPinDuration: TimeInterval = Defaults.withoutPinDuration,
-            withPinDuration: TimeInterval = Defaults.withPinDuration
-        ) {
-            self.withoutPinDuration = withoutPinDuration
-            self.withPinDuration = withPinDuration
-        }
-    }
-}
-
-private extension AppSettings.Cache.Profile {
+extension AppSettings.Cache {
     // MARK: - Constants
-    enum Defaults {
+    private enum Defaults {
         /// The default duration of the profile cache without pin.
@@ -32,2 +14,23 @@
     }
+
+    public struct Profile: Codable {
+
+        public static let `default` = Profile(
+            withoutPinDuration: Defaults.withoutPinDuration,
+            withPinDuration: Defaults.withPinDuration
+        )
+        /// The duration of the profile cache without pin.
+        public let withoutPinDuration: TimeInterval
+
+        /// The duration of the profile cache with pin.
+        public let withPinDuration: TimeInterval
+
+        public init(
+            withoutPinDuration: TimeInterval,
+            withPinDuration: TimeInterval
+        ) {
+            self.withoutPinDuration = withoutPinDuration
+            self.withPinDuration = withPinDuration
+        }
+    }
 }
"""},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.63s/it, est. speed input: 64.26 toks/s, output: 95.73 toks/s]


"This diff appears to be a code change in a Swift project, specifically within a caching system for profile settings. The changes are related to the `AppSettings.Cache.Profile` structure and its default values. Here's a detailed breakdown of the changes:\n\n### Before Changes\nThe original code was trying to define a `Profile` struct within `AppSettings.Cache`, but this structure was not properly scoped or defined. The `Profile` struct was not accessible outside of `AppSettings.Cache`, which could lead to issues if it were used elsewhere in the codebase.\n\n### After Changes\nThe following changes were made:\n\n1. **Corrected Access Control**:\n   - The `Profile` struct is now properly scoped to the `AppSettings.Cache` module, making it accessible within the module but not necessarily outside of it.\n\n2. **Renamed Constants**:\n   - The `Defaults` enum and its associated constants were moved to a private extension `AppSettings.Cache.Profile`, making them internal constants for the `Pr

And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : """
    diff --git a/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift b/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift
--- a/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift
+++ b/Project/Components/Modules/Modules/AppSettings/Sources/API/Model/Cache/Profile/AppSettings.Cache+Profile.swift
@@ -2,23 +2,5 @@

-public extension AppSettings.Cache {
-    struct Profile: Codable {
-        /// The duration of the profile cache without pin.
-        let withoutPinDuration: TimeInterval
-
-        /// The duration of the profile cache with pin.
-        let withPinDuration: TimeInterval
-
-        init(
-            withoutPinDuration: TimeInterval = Defaults.withoutPinDuration,
-            withPinDuration: TimeInterval = Defaults.withPinDuration
-        ) {
-            self.withoutPinDuration = withoutPinDuration
-            self.withPinDuration = withPinDuration
-        }
-    }
-}
-
-private extension AppSettings.Cache.Profile {
+extension AppSettings.Cache {
     // MARK: - Constants
-    enum Defaults {
+    private enum Defaults {
         /// The default duration of the profile cache without pin.
@@ -32,2 +14,23 @@
     }
+
+    public struct Profile: Codable {
+
+        public static let `default` = Profile(
+            withoutPinDuration: Defaults.withoutPinDuration,
+            withPinDuration: Defaults.withPinDuration
+        )
+        /// The duration of the profile cache without pin.
+        public let withoutPinDuration: TimeInterval
+
+        /// The duration of the profile cache with pin.
+        public let withPinDuration: TimeInterval
+
+        public init(
+            withoutPinDuration: TimeInterval,
+            withPinDuration: TimeInterval
+        ) {
+            self.withoutPinDuration = withoutPinDuration
+            self.withPinDuration = withPinDuration
+        }
+    }
 }
"""},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.07s/it, est. speed input: 608.52 toks/s, output: 58.89 toks/s]


'<reasoning>\nThe commit focuses on optimizing and simplifying the `Profile` struct by making it a static property of the `Cache` struct. It also adds a default value for the `Profile`. The commit removes redundant code and enhances readability.\n</reasoning>\nOptimize Profile structure and remove redundant code'

Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so; it should improve if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other quantization methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("Tavernari/git-commit-message", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("Tavernari/git-commit-message", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "Tavernari/git-commit-message", # Change this to your own username/repo!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 48.77 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 36/36 [00:00<00:00, 116.54it/s]


Unsloth: Saving tokenizer... Done.
Done.


Exception ignored in: <function _xla_gc_callback at 0x798a1967efc0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/jax/_src/lib/__init__.py", line 96, in _xla_gc_callback
    def _xla_gc_callback(*args):
    
KeyboardInterrupt: 


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at Tavernari/git-commit-message into bf16 GGUF format.
The output location will be /content/Tavernari/git-commit-message/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: git-commit-message
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00002.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> BF16, shape = {2048, 

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Tavernari/git-commit-message


Now use the generated GGUF file (for example `unsloth.Q4_K_M.gguf`, as named in the log above) in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news or join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

