<a href="https://colab.research.google.com/github/Praveen-Prabhu/code-generation/blob/main/code_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [2]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [3]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### HuggingFace & W&B Authentication

In [3]:
from huggingface_hub import login, HfApi, HfFolder
from google.colab import userdata

HF_TOKEN = userdata.get('HF_TOKEN')
login(token=HF_TOKEN)

In [None]:
import wandb
from google.colab import userdata

WANDB_TOKEN = userdata.get('WANDB_TOKEN')
wandb.login(key=WANDB_TOKEN)
wandb.init(
    project="codegen-llama-3.1-8b-grpo",
    job_type="training",
    anonymous="allow"
)

In [4]:
# Define Hugging Face Repository for Saving Checkpoints
HF_REPO = "PraveenPrabhu/codegen-llama-3.1-8b-grpo"

### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [5]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-15 03:50:40 [__init__.py:239] Automatically detected platform cuda.


Load up `Llama 3.1 8B Instruct`, and set parameters

In [6]:
# !export VLLM_USE_V1=0

In [7]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.1. vLLM: 0.8.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.59 GB. Also swap space = 5 GB.
INFO 04-15 03:51:08 [config.py:689] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes confi

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

INFO 04-15 03:51:13 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-15 03:51:13 [cuda.py:289] Using XFormers backend.
INFO 04-15 03:51:14 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-15 03:51:14 [model_runner.py:1110] Starting to load model unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit...
INFO 04-15 03:51:14 [loader.py:1166] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 04-15 03:51:16 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

INFO 04-15 03:51:37 [weight_utils.py:281] Time spent downloading weights for unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit: 20.922070 seconds
INFO 04-15 03:51:37 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 04-15 03:51:42 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-15 03:51:42 [model_runner.py:1146] Model loading took 5.7737 GiB and 27.751781 seconds
INFO 04-15 03:51:51 [worker.py:267] Memory profiling takes 8.31 seconds
INFO 04-15 03:51:51 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 04-15 03:51:51 [worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.74GiB; the rest of the memory reserved for KV Cache is 2.22GiB.
INFO 04-15 03:51:51 [executor_base.py:112] # cuda blocks: 1134, # CPU blocks: 2560
INFO 04-15 03:51:51 [executor_base.py:117] Maximum concurrency for 1024 tokens per request: 17.72x
INFO 04-15 03:51:56 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If 

Capturing CUDA graph shapes:   0%|          | 0/23 [00:00<?, ?it/s]

INFO 04-15 03:52:43 [model_runner.py:1598] Graph capturing finished in 48 secs, took 0.53 GiB
INFO 04-15 03:52:43 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 61.07 seconds


tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [10]:
import re
from datasets import load_dataset, Dataset
import pandas as pd

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

# def extract_xml_answer(text: str) -> str:
#     answer = text.split("<answer>")[-1]
#     answer = answer.split("</answer>")[0]
#     return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# dataset = get_gsm8k_questions()

# Load and map HumanEval dataset
def get_humaneval_questions(split = "test") -> Dataset:
    data = load_dataset('openai/openai_humaneval')[split]

    # Process each sample
    def process_sample(x):
        prompt = x['prompt']
        canonical_solution = x['canonical_solution']
        test_code = x['test']

        # Check if test code is valid
        if isinstance(test_code, str) and len(test_code) > 10:
            return {
                'prompt': [
                    {'role': 'system', 'content': SYSTEM_PROMPT},
                    {'role': 'user', 'content': prompt}
                ],
                'answer': canonical_solution,
                'test': test_code
            }
        else:
            print(f"Warning: Invalid test code for task: {prompt[:50]}...")
            return None

    processed_data = data.map(process_sample)
    # Filter out None values
    processed_data = processed_data.filter(lambda x: x is not None)

    return processed_data

dataset = get_humaneval_questions()
test_code_list = dataset["test"]

df = dataset.to_pandas()
df = df[['prompt', 'canonical_solution', 'test']]
df.head(1)

Unnamed: 0,prompt,canonical_solution,test
0,"[{'content': ' Respond in the following format: <reasoning> ... </reasoning> <answer> ... </answer> ', 'role': 'system'}, {'content': 'from typing import List def has_close_elements(numbers: List[float], threshold: float) -> bool:  """""" Check if in given list of numbers, are any two numbers closer to each other than  given threshold.  >>> has_close_elements([1.0, 2.0, 3.0], 0.5)  False  >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)  True  """""" ', 'role': 'user'}]","for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = abs(elem - elem2)\n if distance < threshold:\n return True\n\n return False\n","\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n\n"


### Reward Functions

In [None]:
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

### Module 1 - Logical Verification

In [9]:
import subprocess
from typing import Tuple

class SMTSolvers:
    def __init__(self, timeout: int = 15, debug: bool = False):
        """
        Initialize the SMTSolvers object with configuration parameters.

        Args:
            timeout: Maximum execution time in seconds for tested code
            debug: Whether to print debug information
        """
        self.timeout = timeout
        self.debug = debug

    def verify(self, code: str, test_code: str) -> Tuple[float, str]:
        """
        Verifies generated code by running it against provided test code.

        Args:
            code: The generated solution code to test
            test_code: The test code to validate the solution

        Returns:
            Tuple containing:
                - reward score (float): 2.0 for passing tests, 0.0 for failing, -0.5 for timeout
                - output (str): execution output or error message
        """
        try:
            # Clean up the test code if it's malformed
            if isinstance(test_code, list) or (isinstance(test_code, str) and "," in test_code and len(test_code) < 100):
                return 0.0, "Warning: Test code appears to be malformed"

            # Combine the generated solution with the test code
            full_code = f"{code}\n\n{test_code}"

            if self.debug:
                print(f"Running code:\n{full_code}...")

            # Create a temporary file to execute
            with open("temp_test.py", "w") as f:
                f.write(full_code)

            # Run the test file
            result = subprocess.run(
                ["python3", "temp_test.py"],
                capture_output=True, text=True, timeout=self.timeout
            )

            # Check if execution was successful (no errors)
            if result.returncode == 0:
                if self.debug:
                    print("Test passed successfully!")
                return 2.0, result.stdout  # Increased reward for correct execution
            else:
                if self.debug:
                    print(f"Test failed with error: {result.stderr}")
                return 0.0, result.stderr  # Test failed

        except subprocess.TimeoutExpired:
            if self.debug:
                print("Execution timed out")
            return -0.5, "Execution timed out"  # Penalize timeouts more severely

        except Exception as e:
            if self.debug:
                print(f"Execution error: {str(e)}")
            return 0.0, str(e)  # Test failed

def extract_answer_code(text: str) -> str:
    # Try to extract code from the <answer> tags
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    if match:
        return match.group(1).strip()

    # Try to extract code from triple backticks if it's not in <answer> tags
    match = re.search(r"```(?:python)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        return match.group(1).strip()

    # Return the original text if no patterns match
    return text.strip()

# Create an instance of the SMTSolvers class
solver = SMTSolvers(debug=True)

# Example reward function with HumanEval tests
def make_rl_reward_function(test_code_list):
    def rl_reward_function(prompts, completions, **kwargs):
        rewards = []

        for i, completion in enumerate(completions):
            try:
                if i >= len(test_code_list):
                    print(f"Warning: No test code available for sample {i}")
                    rewards.append(0.0)
                    continue

                idx = i  # Use the correct index
                test_code = test_code_list[idx]

                # Skip if test code is problematic
                if not isinstance(test_code, str) or len(test_code) < 10:
                    print(f"Warning: Test code for sample {i} is invalid")
                    rewards.append(0.0)
                    continue

                # Extract code from completion
                generated_code = extract_answer_code(completion[0]["content"])

                # Add debugging print
                print(f"Sample {i} - Generated code (first 100 chars):\n {generated_code}")

                # Use the solver to verify the code
                reward, output = solver.verify(generated_code, test_code_list[i])
                rewards.append(reward)

                print(f"Sample {i}: Reward = {reward}")
            except Exception as e:
                print(f"Error processing completion {i}: {str(e)}")
                rewards.append(0.0)

        return rewards
    return rl_reward_function


<a name="Train"></a>
### Train the model - Module 2 & 3

Now set up GRPO Trainer and all configurations!

In [10]:
max_prompt_length = 256

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 1,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "checkpoints",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6


And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [11]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        make_rl_reward_function(test_code_list),
    ],
    args = training_args,
    train_dataset = dataset,
)
# trainer.train()

In [12]:
# Load Checkpoint from Hugging Face if Exists
from huggingface_hub import hf_hub_download
import shutil

def download_checkpoint(repo_id, filename, local_dir="hf_checkpoints"):
    """Download a checkpoint file from Hugging Face and return its local path."""
    os.makedirs(local_dir, exist_ok=True)
    try:
        return hf_hub_download(repo_id, filename, local_dir=local_dir)
    except:
        return None  # Return None if checkpoint does not exist

lora_path = download_checkpoint(HF_REPO, "lora_adapter.safetensors")
optimizer_path = download_checkpoint(HF_REPO, "optimizer_state.pth")
trainer_path = download_checkpoint(HF_REPO, "trainer_state.pth")
tokenizer_path = download_checkpoint(HF_REPO, "tokenizer.json")

if lora_path and optimizer_path and trainer_path and tokenizer_path:
    model.load_lora(lora_path)
    trainer.load_state_dict(torch.load(trainer_path))
    trainer.optimizer.load_state_dict(torch.load(optimizer_path))
    tokenizer = torch.load(tokenizer_path)
    print("Resumed training from Hugging Face checkpoint!")

optimizer_state.pth:   0%|          | 0.00/171M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.99M [00:00<?, ?B/s]

In [13]:
# Train Model
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 164 | Num Epochs = 1 | Total steps = 1
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 83,886,080/8,000,000,000 (1.05% trained)
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mpraveenprabhu[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Sample 0 - Generated code (first 100 chars):
 </reasoning>
def below_zero(operations: List[int]) -> bool:
    """
    This function checks if at any point in the sequence of operations the balance falls below zero.

    Args:
        operations (List[int]): A list of deposit (positive) and withdrawal (negative) operations.

    Returns:
        bool: True if the balance falls below zero at any point, False otherwise.
    """
    balance = 0  # Initialize the balance to zero
    for operation in operations:  # Iterate over each operation in the list
        balance += operation  # Add the operation to the balance
        if balance < 0:  # Check if the balance is less than zero
            return True  # If the balance is less than zero, return True
    return False  # If the function hasn't returned True, the balance never fell below zero
Running code:
</reasoning>
def below_zero(operations: List[int]) -> bool:
    """
    This function checks if at any point in the sequence of operati

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / rl_reward_function
1,0.0,1.120667,0.998881,244.333344,0.0,-0.212667,0.0,0.0,1.333333


TrainOutput(global_step=1, training_loss=1.0001180328345072e-07, metrics={'train_runtime': 135.5758, 'train_samples_per_second': 0.044, 'train_steps_per_second': 0.007, 'total_flos': 0.0, 'train_loss': 1.0001180328345072e-07})

### Save All States After Training

In [17]:
# Save LoRA adapter
model.save_pretrained_merged("lora_adapter", tokenizer, save_method = "lora")
print("LoRA adapter saved to 'lora_adapter' directory")

if False:
  # Save merged model in 16-bit precision (smaller file size but good quality)
  model.save_pretrained_merged("merged_model_16bit", tokenizer, save_method = "merged_16bit")
  print("Merged model saved in 16-bit precision to 'merged_model_16bit' directory")

  # Save merged model in 4-bit precision (smallest file size, some quality loss)
  model.save_pretrained_merged("merged_model_4bit", tokenizer, save_method = "merged_4bit_forced")
  print("Merged model saved in 4-bit precision to 'merged_model_4bit' directory")

# Save optimizer and trainer state for resuming training later
torch.save(trainer.optimizer.state_dict(), "optimizer_state.pth")
torch.save(tokenizer, "tokenizer.json")
print("Optimizer and tokenizer saved")

Unsloth: Saving tokenizer... Done.
Unsloth: Saving model...

config.json:   0%|          | 0.00/1.56k [00:00<?, ?B/s]

 Done.
LoRA adapter saved to 'lora_adapter' directory
Optimizer and tokenizer saved


In [23]:
# Upload All States to Hugging Face for Persistence

# Push LoRA adapter to main repository
model.push_to_hub_merged(HF_REPO, tokenizer, save_method="lora", token=HF_TOKEN)
print(f"LoRA adapter pushed to {HF_REPO}")

# # Push full precision merged model to main repository
# model.push_to_hub_merged(HF_REPO, tokenizer, save_method="merged", token=HF_TOKEN)
# print(f"Full precision merged model pushed to {HF_REPO}")

if False:
  # Push 16-bit merged model
  model.push_to_hub_merged(f"{HF_REPO}-merged-16bit", tokenizer, save_method="merged_16bit", token=HF_TOKEN)
  print(f"16-bit merged model pushed to {HF_REPO}-merged-16bit")

# # Push 4-bit merged model
# model.push_to_hub_merged(f"{HF_REPO}-merged-4bit", tokenizer, save_method="merged_4bit", token=HF_TOKEN)
# print(f"4-bit merged model pushed to {HF_REPO}-merged-4bit")

api = HfApi()

# api.upload_file(
#     path_or_fileobj="trainer_state.pth",
#     path_in_repo="trainer_state.pth",
#     repo_id=HF_REPO,
#     token=HF_TOKEN,
# )

api.upload_file(
    path_or_fileobj="optimizer_state.pth",
    path_in_repo="optimizer_state.pth",
    repo_id=HF_REPO,
    token=HF_TOKEN,
)

api.upload_file(
    path_or_fileobj="tokenizer.json",
    path_in_repo="tokenizer.json",
    repo_id=HF_REPO,
    token=HF_TOKEN,
)
print("Optimizer and tokenizer pushed to Hugging Face Hub")

optimizer_state.pth:   0%|          | 0.00/171M [00:00<?, ?B/s]

Optimizer and tokenizer pushed to Hugging Face Hub


<a name="Inference"></a>
### Inference

Now let's try the model we just trained! First, let's first try the model without any GRPO trained:

In [15]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "write a python program to calculate first n numbers of fibonacci series"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

Processed prompts: 100%|██████████| 1/1 [00:21<00:00, 21.77s/it, est. speed input: 2.21 toks/s, output: 18.38 toks/s]


In [16]:
print("Base Model's Output")
print(output)

Base Model's Output
**Fibonacci Series Calculator**

Below is a Python program to calculate the first n numbers of the Fibonacci series.

```python
def fibonacci(n):
    """
    Calculate the first n numbers of the Fibonacci series.

    Args:
        n (int): The number of Fibonacci numbers to generate.

    Returns:
        list: A list of the first n Fibonacci numbers.
    """
    fib_series = [0, 1]
    while len(fib_series) < n:
        fib_series.append(fib_series[-1] + fib_series[-2])
    return fib_series[:n]

def main():
    n = int(input("Enter the number of Fibonacci numbers to generate: "))
    print(f"Fibonacci series up to {n} numbers:")
    print(fibonacci(n))

if __name__ == "__main__":
    main()
```

**Explanation**
---------------

This program uses a simple iterative approach to generate the Fibonacci series. It starts with a list containing the first two Fibonacci numbers (0 and 1). Then, it enters a loop that continues until the list contains `n` numbers. In each 

And now with the LoRA we just trained with GRPO - we first save the LoRA first!

In [17]:
# model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [18]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "write a python program to calculate first n numbers of fibonacci series"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("lora_adapter"),
)[0].outputs[0].text

Processed prompts: 100%|██████████| 1/1 [00:26<00:00, 26.79s/it, est. speed input: 2.61 toks/s, output: 16.46 toks/s]


In [19]:
print("Fine-Tuned Model's Output")
print(output)

Fine-Tuned Model's Output
**Fibonacci Series Calculator**

### Description

This program calculates the first n numbers of the Fibonacci series. The Fibonacci series is a sequence of numbers in which each number is the sum of the two preceding ones, usually starting with 0 and 1.

### Code
```python
def fibonacci(n):
    """
    Calculate the first n numbers of the Fibonacci series.

    Args:
        n (int): The number of Fibonacci numbers to generate.

    Returns:
        list: A list of the first n Fibonacci numbers.
    """
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    else:
        fib_series = [0, 1]
        while len(fib_series) < n:
            fib_series.append(fib_series[-1] + fib_series[-2])
        return fib_series

def main():
    n = int(input("Enter the number of Fibonacci numbers to generate: "))
    fib_numbers = fibonacci(n)
    print("First", n, "numbers of the Fibonacci series:")
    print(fib_numb

Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

In [24]:
# Finish W&B Logging
wandb.finish()

NameError: name 'wandb' is not defined

<a name="Save"></a>
### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [21]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [22]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)