# Mistral-7B v0.3 GRPO Fine-tuning & Model Conversion Summary

## Model Setup & GRPO Training
• **Load Mistral-7B**: Use `FastLanguageModel.from_pretrained("unsloth/mistral-7b-instruct-v0.3-bnb-4bit")` with RTX 2070 Super settings (`fast_inference=False`, `load_in_4bit=True`, reduced LoRA rank 8-16)
• **GRPO Training**: Configure `GRPOConfig` with proper sequence lengths for Mistral (ensure `max_prompt_length + max_completion_length ≤ max_seq_length=128`)
• **Reward Functions**: Pass several reward functions to `GRPOTrainer` (`xmlcount_reward_func`, `soft_format_reward_func`, `strict_format_reward_func`, `int_reward_func`, `correctness_reward_func`)
• **Save LoRA**: Use `model.save_lora("grpo_saved_lora")` to save GRPO-trained adapters (~50MB vs 13GB full Mistral model)
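
The adapter-size figure can be sanity-checked with a little arithmetic. The sketch below assumes Mistral-7B's published attention shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128), LoRA on the four attention projections only, and adapters stored as 32-bit floats. `lora_param_count` and `size_mib` are illustrative helper names, not part of any library.

```python
# Mistral-7B attention shapes (assumed from the published model config).
HIDDEN = 4096
KV_DIM = 8 * 128  # grouped-query attention: k_proj/v_proj project to 1024
N_LAYERS = 32

# (in_features, out_features) for each attention projection
ATTN_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(rank: int, modules=tuple(ATTN_SHAPES)) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each targeted linear layer."""
    per_layer = sum(rank * sum(ATTN_SHAPES[m]) for m in modules)
    return per_layer * N_LAYERS


def size_mib(n_params: int, bytes_per_param: int) -> float:
    """Raw tensor size in MiB, ignoring file-format overhead."""
    return n_params * bytes_per_param / 1024**2


params = lora_param_count(16)
print(f"rank 16: {params:,} params, {size_mib(params, 4):.1f} MiB in fp32")
```

Under these assumptions a rank-16 adapter has about 13.6 million parameters, or roughly 52 MiB in fp32, which is consistent with the ~50MB figure. Adding the MLP projections to `target_modules` would increase the size.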

## Mistral Model Conversion Options
• **vLLM (Float16)**: `model.save_pretrained_merged("model_merged_16bit", tokenizer, save_method="merged_16bit")` - Merges GRPO LoRA into Mistral base (~13GB)
• **GGUF/llama.cpp**: `model.save_pretrained_gguf("model_q4_k_m", tokenizer, quantization_method="q4_k_m")` writes a Mistral GGUF file; pick one of `q4_k_m`, `q8_0`, or `f16` per call (roughly 4-14GB depending on the method)
• **LoRA Only**: Keep GRPO adapters separate for loading back into Mistral base model later
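
To export several GGUF variants in one go, the call can be wrapped in a loop. This is a minimal sketch: `export_gguf_variants` and the `prefix` naming scheme are hypothetical, and it assumes `model` exposes `save_pretrained_gguf` with the argument order used in the bullet above.

```python
def export_gguf_variants(model, tokenizer,
                         methods=("q4_k_m", "q8_0", "f16"),
                         prefix="model"):
    """Call save_pretrained_gguf once per quantization method.

    Returns the list of output directories, one per method.
    """
    out_dirs = []
    for method in methods:
        out_dir = f"{prefix}_{method}"
        # One quantization method per call
        model.save_pretrained_gguf(out_dir, tokenizer, quantization_method=method)
        out_dirs.append(out_dir)
    return out_dirs
```

Each export takes time and disk space, so on a small machine it may be preferable to pass a single method, for example `methods=("q4_k_m",)`.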

## Mistral Inference Methods
• **Standard**: Use `model.generate()` with Mistral chat template `tokenizer.apply_chat_template([{"role": "user", "content": prompt}])`
• **vLLM**: Requires merged Mistral model with `fast_inference=True` and `model.fast_generate()` 
• **llama.cpp**: Use converted Mistral GGUF files with `./llama-cli -m mistral-model.gguf -p "prompt"` (the binary is named `./main` in older llama.cpp builds)

## RTX 2070 Super Optimizations for Mistral-7B
• **Memory Management**: Disable vLLM, use 4-bit quantization, LoRA rank ≤16, `max_seq_length=128`, enable CPU offloading if OOM
• **Loading GRPO Model**: Load Mistral base first, then apply GRPO LoRA with `model.load_adapter("grpo_saved_lora")` or `PeftModel.from_pretrained()`
• **Fallback**: Switch to `unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit` if Mistral-7B doesn't fit in 8GB VRAM
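
A rough back-of-the-envelope estimate shows why 4-bit loading is needed on an 8GB card. The sketch below counts weight storage only; the parameter count of about 7.25 billion is an approximation, and the estimate ignores quantization constants, activations, LoRA gradients, optimizer state, and the CUDA context, all of which consume additional VRAM.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


N_PARAMS = 7.25e9   # approximate Mistral-7B parameter count (assumption)
VRAM_GIB = 8.0      # RTX 2070 Super

for label, bits in (("fp16", 16), ("4-bit", 4)):
    need = weight_memory_gib(N_PARAMS, bits)
    verdict = "fits" if need < VRAM_GIB else "does not fit"
    print(f"{label}: about {need:.1f} GiB of weights, {verdict} in {VRAM_GIB:.0f} GiB")
```

Even when the 4-bit weights fit, the remaining headroom has to cover everything the estimate leaves out, which is why the short sequence length and low LoRA rank matter.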

In [None]:
# Install all dependencies first
!pip install torchaudio==2.7.1
!pip install bitsandbytes==0.47.0
!pip install torchvision==0.22.1
!pip install xformers==0.0.31
!pip install unsloth-zoo
!pip install vllm==0.10.1.1

# Then, install torch last
!pip uninstall torch -y
!pip install torch==2.7.1

!pip check
!nvidia-smi

In [None]:
import torch
import triton
import torchaudio
import torchvision
import xformers

print(f"Torch version: {torch.__version__}")
print(f"Triton version: {triton.__version__}")
print(f"Torchaudio version: {torchaudio.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"Xformers version: {xformers.__version__}")


In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [None]:
#@title Colab Extra Install { display-mode: "form" }
#%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
    !uv pip install -qqq {get_triton}
!uv pip install transformers==4.55.4

### Unsloth

Load up `unsloth/mistral-7b-instruct-v0.3-bnb-4bit`, and set parameters

In [None]:
from unsloth import FastLanguageModel
import torch
import os

# Clear GPU memory first
torch.cuda.empty_cache()

# Check GPU memory
if torch.cuda.is_available():
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU Memory: {gpu_memory:.1f} GB")

print("=== Solution 1: Proper Unsloth setup for RTX 2070 Super ===")

max_seq_length = 128
lora_rank = 16  # Significantly reduced for 8GB GPU

try:
    # Attempt 1: Standard Unsloth approach with reduced settings
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
        max_seq_length = max_seq_length,
        load_in_4bit = True,
        fast_inference = False,  # Disable vLLM
        max_lora_rank = lora_rank,
        dtype = torch.float16,  # FP16: Turing GPUs such as the RTX 2070 Super lack bfloat16 support
        # Remove device_map and offload params - Unsloth handles this
    )
    print("✅ Model loaded successfully!")
    
except Exception as e:
    print(f"❌ First attempt failed: {e}")
    print("Trying with even more conservative settings...")
    
    try:
        # Attempt 2: More conservative settings
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
            max_seq_length = 64,  # Reduced sequence length
            load_in_4bit = True,
            fast_inference = False,
            max_lora_rank = 8,  # Very small rank
            dtype = torch.float16,
        )
        max_seq_length = 64
        lora_rank = 8
        print("✅ Model loaded with conservative settings!")
        
    except Exception as e2:
        print(f"❌ Conservative settings also failed: {e2}")
        print("Switching to smaller model...")
        
        # Attempt 3: Use smaller model that fits RTX 2070 Super
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit",  # 1.5B model
            max_seq_length = 128,
            load_in_4bit = True,
            fast_inference = False,
            max_lora_rank = 32,
            dtype = torch.float16,
        )
        max_seq_length = 128
        lora_rank = 32
        print("✅ Smaller 1.5B model loaded successfully!")

# Apply LoRA - start with minimal target modules
try:
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_rank,
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",  # Attention only
        ],
        lora_alpha = lora_rank,
        use_gradient_checkpointing = "unsloth",
        random_state = 3407,
    )
    print("✅ LoRA applied successfully with attention modules only!")
    
except Exception as e:
    print(f"LoRA with attention failed: {e}")
    print("Trying with even fewer modules...")
    
    model = FastLanguageModel.get_peft_model(
        model,
        r = max(lora_rank // 2, 4),  # Even smaller rank
        target_modules = [
            "q_proj", "v_proj",  # Only query and value projections
        ],
        lora_alpha = max(lora_rank // 2, 4),
        use_gradient_checkpointing = "unsloth",
        random_state = 3407,
    )
    print("✅ LoRA applied with minimal modules!")

# Check final memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

print("🎉 Setup complete! Model ready for training.")

# Alternative: Manual device mapping approach (if needed)
print("\n" + "="*50)
print("Alternative Solution: Manual GPU/CPU split")
print("Uncomment below if you want to try manual device mapping:")
print("="*50)

"""
# This approach manually loads the base model with device mapping
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Manual device mapping for hybrid GPU/CPU loading
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    # Rest on CPU
    "model.layers.16": "cpu",
    "model.layers.17": "cpu",
    "model.layers.18": "cpu",
    "model.layers.19": "cpu",
    "model.layers.20": "cpu",
    "model.layers.21": "cpu",
    "model.layers.22": "cpu",
    "model.layers.23": "cpu",
    "model.layers.24": "cpu",
    "model.layers.25": "cpu",
    "model.layers.26": "cpu",
    "model.layers.27": "cpu",
    "model.layers.28": "cpu",
    "model.layers.29": "cpu",
    "model.layers.30": "cpu",
    "model.layers.31": "cpu",
    "model.norm": "cpu",
    "lm_head": "cpu"
}

# Load with transformers first, then wrap with Unsloth
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    device_map=device_map,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/mistral-7b-instruct-v0.3-bnb-4bit")

# Then apply Unsloth optimizations
model = FastLanguageModel.get_peft_model(
    base_model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
"""

print("🔧 Troubleshooting tips:")
print("1. If still getting memory errors, reduce lora_rank further (to 4 or 8)")
print("2. Close all other applications using GPU memory")
print("3. Try the smaller Qwen2.5-1.5B model - it's quite capable!")
print("4. Consider using Google Colab with T4 (16GB) for larger models")
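
The nested `try`/`except` ladder above can also be written as a loop over candidate configurations. This is a sketch under assumptions: `load_with_fallbacks` and `CONFIGS` are hypothetical names, and `loader` stands in for any callable that accepts the keyword arguments shown, such as a thin wrapper around `FastLanguageModel.from_pretrained`.

```python
from typing import Any, Callable, Iterable


def load_with_fallbacks(loader: Callable[..., Any],
                        configs: Iterable[dict]) -> tuple[Any, dict]:
    """Try each config in order; return (result, config) for the first success."""
    last_error = None
    for cfg in configs:
        try:
            return loader(**cfg), cfg
        except Exception as err:  # e.g. an out-of-memory error raised by the loader
            print(f"Failed with {cfg.get('model_name')}: {err}")
            last_error = err
    raise RuntimeError("No configuration could be loaded") from last_error


# Same three attempts as the cell above, most demanding first
CONFIGS = [
    {"model_name": "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
     "max_seq_length": 128, "max_lora_rank": 16},
    {"model_name": "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
     "max_seq_length": 64, "max_lora_rank": 8},
    {"model_name": "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit",
     "max_seq_length": 128, "max_lora_rank": 32},
]
```

Returning the winning config alongside the model makes it easy to set `max_seq_length` and `lora_rank` from whichever attempt succeeded, instead of reassigning them inside each `except` branch.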

<a name="Data"></a>
### Data Prep

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [None]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build chat-style prompts (system + user) and extract the final answer after "####"
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact newline-delimited XML layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span the newlines inside multi-line reasoning or answers
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that start with the tags in order, with flexible whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span the newlines inside multi-line reasoning or answers
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
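
Before training, it can help to run the format patterns against a hand-written completion. The sketch below uses the soft format pattern and a made-up sample to show that `.` does not cross newlines unless `re.DOTALL` is set, which matters because well-formed completions in this format are multi-line.

```python
import re

SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# A hand-written completion in the target format (illustrative only)
sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"

# Without DOTALL, `.` stops at the first newline, so the pattern cannot match
matches_without_dotall = re.match(SOFT_PATTERN, sample) is not None

# With DOTALL, `.` also matches newlines and the multi-line sample matches
matches_with_dotall = re.match(SOFT_PATTERN, sample, flags=re.DOTALL) is not None

print(f"without DOTALL: {matches_without_dotall}")
print(f"with DOTALL:    {matches_with_dotall}")
```

If a format reward stays at zero throughout training, a check like this separates a pattern problem from a model problem.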

<a name="Train"></a>
### Train the model

Now set up the GRPO trainer and its configuration, then run it. The training output includes a table of rewards; the goal is to see the `reward` column increase.

You might have to wait 150 to 200 steps before anything happens, and you'll probably get 0 reward for the first 100 steps, so be patient. Note that the configuration below sets `max_steps = 10`, which is only enough for a smoke test. A sample of the output format:

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
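
The `reward_std` column matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows the group-relative normalization in plain Python. It is a simplification: `group_advantages` is an illustrative name, the `eps` value and the use of the population standard deviation are assumptions, and the trainer's actual implementation may differ in those details.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All completions scored the same: no completion is preferred over another
print(group_advantages([0.125, 0.125, 0.125, 0.125]))

# One completion stands out: it gets a positive advantage, the rest negative
print(group_advantages([2.5, 0.0, 0.5, 0.0]))
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason early steps can appear to do nothing.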


In [None]:
# Fix for GRPO training configuration
from trl import GRPOConfig, GRPOTrainer

# CRITICAL FIX: Keep prompt + completion within the model's max_seq_length.
# With max_seq_length = 128 and max_prompt_length = 256, the completion budget
# would be 128 - 256 = -128 tokens, which is invalid.

# Solution 1: Split the available sequence length between prompt and completion.
# Deriving both values from max_seq_length also covers the fallback load (max_seq_length = 64).
max_prompt_length = max_seq_length // 2  # 64 when max_seq_length = 128
max_completion_length = max_seq_length - max_prompt_length  # 128 - 64 = 64

print(f"Max sequence length: {max_seq_length}")
print(f"Max prompt length: {max_prompt_length}")  
print(f"Max completion length: {max_completion_length}")

# Verify the calculation
assert max_prompt_length + max_completion_length <= max_seq_length, \
    f"Prompt ({max_prompt_length}) + Completion ({max_completion_length}) = {max_prompt_length + max_completion_length} exceeds max_seq_length ({max_seq_length})"

training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,  # Increased for better gradients with small batch
    num_generations = 4,  # Reduced from 6 for RTX 2070 Super memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 10,
    save_steps = 10,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)

# Start training
print("🚀 Starting GRPO training...")
trainer.train()

print("\n" + "="*50)
print("Alternative Solution: Increase model sequence length")
print("="*50)
print("""
If you need longer prompts (256 tokens), you should reload your model with a larger max_seq_length:

# When loading your model initially, use:
max_seq_length = 512  # Instead of 128

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", 
    max_seq_length = 512,  # Increased to accommodate longer sequences
    load_in_4bit = True,
    fast_inference = False,
    # ... other parameters
)

# Then you can use:
max_prompt_length = 256
max_completion_length = 256  # 512 - 256 = 256

WARNING: Longer sequences use more GPU memory. For RTX 2070 Super:
- 128 tokens: ✅ Should work
- 256 tokens: ⚠️  Might work with small batch size  
- 512 tokens: ❌ Likely too much memory

Test with max_seq_length = 256 first, then 512 if you have enough memory.
""")

<a name="Inference"></a>
### Inference
Now let's try the model we just trained. The cell below runs inference on the in-memory model, which still has the GRPO-trained LoRA attached, using standard `model.generate()` rather than vLLM.

After that, we save the LoRA adapters so they can be reloaded later.

In [None]:
# Fixed inference code for Unsloth (without vLLM)
import torch

# Prepare the input text
text = tokenizer.apply_chat_template([
    {"role": "user", "content": "Calculate pi."},
], tokenize=False, add_generation_prompt=True)

print("Input text:")
print(text)
print("\n" + "="*50)

# Solution 1: Standard Unsloth inference (recommended)
print("🚀 Method 1: Standard Unsloth inference")

# Tokenize input
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Enable inference mode for better performance
FastLanguageModel.for_inference(model)

# Generate with standard parameters
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

# Decode only the newly generated tokens. Slicing the decoded string by
# len(text) is unreliable because skip_special_tokens drops prompt tokens.
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print("Generated response:")
print(generated_text)

print("\n" + "="*50)
print("🚀 Method 2: Using TextStreamer for real-time output")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

print("Streaming response:")
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True,
        streamer=text_streamer,  # This will print tokens as they're generated
    )

print("\n" + "="*50)
print("🚀 Method 3: Batch inference for multiple prompts")

# Multiple prompts
prompts = [
    "Calculate pi.",
    "What is the capital of France?",
    "Explain machine learning in simple terms."
]

# Format all prompts
formatted_prompts = []
for prompt in prompts:
    formatted_text = tokenizer.apply_chat_template([
        {"role": "user", "content": prompt}
    ], tokenize=False, add_generation_prompt=True)
    formatted_prompts.append(formatted_text)

# Batch tokenization (decoder-only models should be left-padded for generation)
tokenizer.padding_side = "left"
batch_inputs = tokenizer(formatted_prompts, return_tensors="pt", padding=True).to("cuda")

# Generate for all prompts
with torch.no_grad():
    batch_outputs = model.generate(
        **batch_inputs,
        max_new_tokens=512,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

# Decode all responses, keeping only the tokens generated after the padded prompt
print("Batch responses:")
prompt_length = batch_inputs["input_ids"].shape[1]
for i, (prompt, output) in enumerate(zip(prompts, batch_outputs)):
    generated_response = tokenizer.decode(output[prompt_length:], skip_special_tokens=True)
    print(f"\nPrompt {i+1}: {prompt}")
    print(f"Response: {generated_response}")

print("\n" + "="*50)
print("Alternative: If you want vLLM-style inference")
print("="*50)
print("""
To use vLLM-style inference (model.fast_generate), you need to:

1. Load your model WITH vLLM enabled:
   fast_inference = True  # Enable this when loading the model
   
2. Make sure you have enough GPU memory for vLLM compilation

3. Then you can use:
   
   from vllm import SamplingParams
   
   sampling_params = SamplingParams(
       temperature=0.8,
       top_p=0.95,
       max_tokens=1024,
   )
   
   output = model.fast_generate(
       [text],
       sampling_params=sampling_params,
       lora_request=None,
   )[0].outputs[0].text
   
But since you disabled vLLM due to memory constraints, 
use the standard inference methods above instead.
""")

# Helper function for easy inference
def generate_response(prompt, max_tokens=1024, temperature=0.8, top_p=0.95):
    """Helper function for easy inference"""
    # Format the prompt
    formatted_text = tokenizer.apply_chat_template([
        {"role": "user", "content": prompt}
    ], tokenize=False, add_generation_prompt=True)
    
    # Tokenize
    inputs = tokenizer(formatted_text, return_tensors="pt").to("cuda")
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=True,
        )
    
    # Decode only the newly generated tokens
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print("\n" + "="*50)
print("🎯 Easy-to-use helper function:")
print("="*50)

# Test the helper function
response = generate_response("Calculate pi.", max_tokens=512)
print("Helper function response:")
print(response)
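
When separating the generated text from the prompt, slicing by token count is more robust than slicing the decoded string by character count. The toy example below illustrates why. It uses a made-up whitespace "tokenizer" (`decode` and `SPECIAL` are hypothetical stand-ins, not the Hugging Face API) in which special tokens are dropped during decoding, as happens with `skip_special_tokens=True`.

```python
# Toy stand-in for a tokenizer: tokens are strings, decoding joins them with spaces
SPECIAL = {"<s>", "[INST]", "[/INST]", "</s>"}


def decode(tokens: list[str], skip_special_tokens: bool = True) -> str:
    kept = [t for t in tokens if not (skip_special_tokens and t in SPECIAL)]
    return " ".join(kept)


prompt_tokens = ["<s>", "[INST]", "Calculate", "pi.", "[/INST]"]
output_tokens = prompt_tokens + ["Pi", "is", "about", "3.14159", "</s>"]

# Character slicing: the prompt string still contains special tokens, but the
# decoded output does not, so the offset lands in the wrong place
prompt_text = decode(prompt_tokens, skip_special_tokens=False)
by_chars = decode(output_tokens)[len(prompt_text):]

# Token slicing: drop the prompt tokens first, then decode what is left
by_tokens = decode(output_tokens[len(prompt_tokens):])

print(repr(by_chars))
print(repr(by_tokens))
```

With a real model the equivalent is to slice the output tensor at `inputs["input_ids"].shape[1]` before calling `tokenizer.decode`.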

In [None]:
# model.save_lora("grpo_saved_lora")

In [None]:
import os
import torch

# Method 1: Save LoRA adapters only (recommended - smallest size)
print("🎯 Method 1: Save LoRA adapters (recommended)")
print("="*50)

# Create output directory if it doesn't exist
save_dir = "grpo_saved_lora"
os.makedirs(save_dir, exist_ok=True)

try:
    # This saves only the LoRA adapters (very small file size)
    model.save_lora(save_dir)
    print(f"✅ LoRA adapters saved successfully to: {save_dir}")
    print(f"📁 Files saved:")
    for file in os.listdir(save_dir):
        file_path = os.path.join(save_dir, file)
        if os.path.isfile(file_path):
            size_mb = os.path.getsize(file_path) / (1024 * 1024)
            print(f"   - {file} ({size_mb:.1f} MB)")
            
except Exception as e:
    print(f"❌ Error saving LoRA: {e}")
    print("Trying alternative methods...")

print("\n" + "="*50)
print("🎯 Method 2: Save with tokenizer")
print("="*50)

# Save LoRA + tokenizer together
full_save_dir = "grpo_model_full"
os.makedirs(full_save_dir, exist_ok=True)

try:
    # Save LoRA adapters
    model.save_lora(full_save_dir)
    
    # Save tokenizer separately
    tokenizer.save_pretrained(full_save_dir)
    
    print(f"✅ Model and tokenizer saved to: {full_save_dir}")
    
except Exception as e:
    print(f"❌ Error: {e}")

print("\n" + "="*50)
print("🎯 Method 3: Save to Hugging Face format")
print("="*50)

hf_save_dir = "grpo_hf_format"
os.makedirs(hf_save_dir, exist_ok=True)

try:
    # Save in standard Hugging Face format
    model.save_pretrained(hf_save_dir)
    tokenizer.save_pretrained(hf_save_dir)
    
    print(f"✅ Hugging Face format saved to: {hf_save_dir}")
    
except Exception as e:
    print(f"❌ Error with HF format: {e}")
    
    # Fallback: Use Unsloth's local save method (push_to_hub_merged uploads to the Hub instead)
    try:
        model.save_pretrained_merged(
            hf_save_dir,
            tokenizer,
            save_method="lora",  # or "merged_16bit" for full model
        )
        print(f"✅ Saved using Unsloth save_pretrained_merged to: {hf_save_dir}")
    except Exception as e2:
        print(f"❌ Fallback save also failed: {e2}")

print("\n" + "="*50)
print("🎯 Method 4: Manual LoRA saving (if built-in methods fail)")
print("="*50)

manual_save_dir = "grpo_manual_save"
os.makedirs(manual_save_dir, exist_ok=True)

try:
    # Get the PEFT model state dict
    if hasattr(model, 'peft_config'):
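        # Save PEFT config (save_pretrained below rewrites adapter_config.json
        # in PEFT's own format, so this dump is only a safety net)
        import json
        config_dict = {}
        for key, config in model.peft_config.items():
            config_dict[key] = config.to_dict()

        with open(os.path.join(manual_save_dir, "adapter_config.json"), "w") as f:
            # to_dict() can contain sets (e.g. target_modules), which json cannot serialize
            json.dump(config_dict, f, indent=2,
                      default=lambda o: sorted(o) if isinstance(o, (set, frozenset)) else str(o))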
        
        # Save adapter weights
        model.save_pretrained(manual_save_dir)
        tokenizer.save_pretrained(manual_save_dir)
        
        print(f"✅ Manual save successful to: {manual_save_dir}")
    else:
        print("❌ Model doesn't appear to have PEFT adapters")
        
except Exception as e:
    print(f"❌ Manual save failed: {e}")

print("\n" + "="*50)
print("🎯 Loading saved model back")
print("="*50)

# Example of how to load the saved LoRA model
load_example = """
# To load your saved LoRA model later:

from unsloth import FastLanguageModel

# 1. Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    max_seq_length = 128,  # Same as training
    load_in_4bit = True,
    fast_inference = False,  # Same as training
)

# 2. Attach your trained LoRA adapters with PEFT
from peft import PeftModel
model = PeftModel.from_pretrained(model, "grpo_saved_lora")  # Path to your saved LoRA

# 3. Enable inference mode
FastLanguageModel.for_inference(model)

# Now you can use the model for inference with your trained adapters!
"""

print(load_example)

print("\n" + "="*50)
print("🎯 File size comparison")
print("="*50)

def get_dir_size(directory):
    """Calculate directory size in MB"""
    if not os.path.exists(directory):
        return 0
    
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            total_size += os.path.getsize(filepath)
    return total_size / (1024 * 1024)  # Convert to MB

directories_to_check = [save_dir, full_save_dir, hf_save_dir, manual_save_dir]

print("Saved model sizes:")
for directory in directories_to_check:
    if os.path.exists(directory):
        size = get_dir_size(directory)
        print(f"📁 {directory}: {size:.1f} MB")

print("\n🎉 Model saving complete!")
print("\nRecommended approach:")
print("- Use Method 1 (model.save_lora) for smallest files")  
print("- Save tokenizer separately if needed")
print("- LoRA adapters are typically only 10-50 MB vs full model 13+ GB")

# Cleanup GPU memory after saving
torch.cuda.empty_cache()
print(f"\n🧹 GPU memory cleared. Current usage: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")

Now we load the LoRA and test:

In [None]:
from unsloth import FastLanguageModel
import torch

# Clear GPU memory first
torch.cuda.empty_cache()

print("🔄 Loading saved LoRA model...")
print("="*50)

# Step 1: Load the base model (same configuration as training)
max_seq_length = 128  # Same as used during training
lora_rank = 16  # Same as used during training

try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",  # Same base model
        max_seq_length = max_seq_length,
        load_in_4bit = True,
        fast_inference = False,  # Same as training
        max_lora_rank = lora_rank,
        dtype = torch.float16,
    )
    print("✅ Base model loaded successfully!")
    
except Exception as e:
    print(f"❌ Failed to load base model: {e}")
    print("Trying smaller model...")
    
    # Fallback to smaller model if the 7B doesn't fit
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit",
        max_seq_length = max_seq_length,
        load_in_4bit = True,
        fast_inference = False,
        max_lora_rank = 32,
        dtype = torch.float16,
    )
    print("✅ Smaller model loaded as fallback!")

# Step 2: Load your trained LoRA adapters
print("\n🎯 Loading LoRA adapters...")

try:
    # Method 1: Direct LoRA loading (if the model supports it)
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_rank,
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",
        ],
        lora_alpha = lora_rank,
        use_gradient_checkpointing = "unsloth",
        random_state = 3407,
    )
    
    # Load the saved LoRA weights
    model.load_adapter("grpo_saved_lora", adapter_name="default")
    print("✅ LoRA adapters loaded successfully!")
    
except Exception as e:
    print(f"❌ Method 1 failed: {e}")
    print("Trying alternative loading method...")
    
    try:
        # Method 2: Load using from_pretrained with adapter
        from peft import PeftModel
        model = PeftModel.from_pretrained(model, "grpo_saved_lora")
        print("✅ LoRA loaded using PEFT!")
        
    except Exception as e2:
        print(f"❌ Method 2 also failed: {e2}")
        print("⚠️  Using base model without LoRA adapters for testing...")

# Step 3: Enable inference mode
FastLanguageModel.for_inference(model)
print("🚀 Model ready for inference!")

# Step 4: Set up system prompt and test
SYSTEM_PROMPT = """You are a helpful AI assistant. Provide clear, accurate, and concise responses."""

print("\n" + "="*50)
print("🧪 Testing the loaded model")
print("="*50)

def test_model(prompt, system_prompt=SYSTEM_PROMPT, max_tokens=512):
    """Test function for the loaded model"""
    
    # Format the conversation
    if system_prompt:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    else:
        messages = [{"role": "user", "content": prompt}]
    
    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    print(f"Input prompt: {prompt}")
    print(f"Formatted text length: {len(text)} characters")
    
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    
    print(f"Input tokens: {inputs['input_ids'].shape[1]}")
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.8,
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=True,
        )
    
    # Decode only the newly generated tokens (the prompt occupies the first
    # inputs["input_ids"].shape[1] positions of the output)
    prompt_length = inputs["input_ids"].shape[1]
    generated_response = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()
    
    print(f"🤖 Response: {generated_response}")
    print(f"Response length: {len(generated_response)} characters")
    
    return generated_response

# Test 1: Original test case
print("\n📝 Test 1: Calculate pi")
print("-" * 30)
response1 = test_model("Calculate pi.")

# Test 2: Different prompt to see if training worked
print("\n📝 Test 2: Math reasoning")
print("-" * 30)
response2 = test_model("What is 15 * 23 + 7?")

# Test 3: Check if model follows any special formatting from GRPO training
print("\n📝 Test 3: Complex reasoning")
print("-" * 30)
response3 = test_model("Explain the process of photosynthesis in simple terms.")

# Test 4: Batch inference
print("\n📝 Test 4: Batch inference")
print("-" * 30)

test_prompts = [
    "What is the capital of Japan?",
    "How does machine learning work?",
    "Write a short poem about stars."
]

for i, prompt in enumerate(test_prompts, 1):
    print(f"\nBatch test {i}: {prompt}")
    response = test_model(prompt, max_tokens=256)

print("\n" + "="*50)
print("🎉 Model testing complete!")
print("="*50)

# Memory usage info
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

print("\n🔍 Model Analysis:")
print(f"- Model type: {type(model).__name__}")
print(f"- Has PEFT adapters: {hasattr(model, 'peft_config')}")
if hasattr(model, 'peft_config'):
    print(f"- Active adapters: {list(model.peft_config.keys())}")
print(f"- Device: {next(model.parameters()).device}")
print(f"- Tokenizer vocab size: {len(tokenizer)}")

print("\n💡 Notes:")
print("- If responses look different from base model, LoRA training worked!")
print("- If responses are similar to base model, LoRA might not have loaded")
print("- Check for any special formatting patterns from your GRPO reward functions")
print("- Compare responses to base model (without LoRA) to see the difference")

# Optional: Save a test log
test_log = f"""
Model Test Results
==================
Model: Mistral-7B with LoRA adapters
Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}

Test 1 - Calculate pi: {response1[:100]}...
Test 2 - Math: {response2[:100]}...
Test 3 - Science: {response3[:100]}...

Memory Usage: {allocated:.2f} GB GPU
"""

with open("model_test_results.txt", "w") as f:
    f.write(test_log)

print("\n📄 Test results saved to 'model_test_results.txt'")

Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so, and it should improve further if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
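
Before merging, it can help to estimate how much disk space each `save_method` needs. The sketch below is a rough back-of-the-envelope estimate and not part of Unsloth: `estimate_size_gib` and `has_room` are hypothetical helpers, and the parameter count of about 7.25 billion for Mistral-7B is an assumption. It counts weight bytes only and ignores quantization metadata and tokenizer files.

```python
import shutil

# Assumed parameter count for Mistral-7B (about 7.25 billion weights)
MISTRAL_7B_PARAMS = 7.25e9

# Bits stored per weight for each merged save method (weights only)
BITS_PER_WEIGHT = {"merged_16bit": 16, "merged_4bit": 4}


def estimate_size_gib(save_method, n_params=MISTRAL_7B_PARAMS):
    """Rough size in GiB of the merged weights for a save method."""
    bits = BITS_PER_WEIGHT[save_method]
    return n_params * bits / 8 / 1024**3


def has_room(save_method, path=".", margin=1.1):
    """True if free disk space at `path` covers the estimate plus a 10% margin."""
    free_gib = shutil.disk_usage(path).free / 1024**3
    return free_gib >= estimate_size_gib(save_method) * margin


for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{estimate_size_gib(method):.1f} GiB, fits: {has_room(method)}")
```

The LoRA-only option is left out because its size depends on the rank and target modules chosen during training.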

In [None]:
import os
import torch
from unsloth import FastLanguageModel

print("🚀 Saving Model for vLLM (Float16)")
print("="*50)

# Clear GPU memory before saving
torch.cuda.empty_cache()

# Check available space (merged models are large!)
def check_disk_space():
    import shutil
    total, used, free = shutil.disk_usage(".")
    free_gb = free / (1024**3)
    print(f"💾 Available disk space: {free_gb:.1f} GB")
    if free_gb < 15:  # Mistral-7B merged is ~13GB
        print("⚠️  WARNING: Low disk space! Merged model needs ~13GB")
    return free_gb

# Defined at module level so every option below (and the final summary) can
# use it, even when the float16 save is skipped or fails
def get_dir_size(directory):
    """Return the total size of all files under `directory` in GB."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if os.path.isfile(filepath):
                total_size += os.path.getsize(filepath)
    return total_size / (1024**3)  # GB

free_space = check_disk_space()

print("\n🎯 Option 1: Save merged model in float16 (RECOMMENDED for vLLM)")
print("-" * 50)

# Merge LoRA into base model and save in float16 - BEST for vLLM
if True:  # Set to False to skip the float16 merge
    try:
        save_path = "model_merged_16bit"
        print(f"Saving merged float16 model to: {save_path}")
        print("This will take a few minutes...")
        
        model.save_pretrained_merged(
            save_path, 
            tokenizer, 
            save_method="merged_16bit",  # Float16 format - perfect for vLLM
        )
        
        print("✅ Float16 merged model saved successfully!")
        
        # Check saved model size
        size_gb = get_dir_size(save_path)
        print(f"📁 Model size: {size_gb:.1f} GB")
        
        # List saved files
        print("📄 Saved files:")
        for file in os.listdir(save_path):
            file_path = os.path.join(save_path, file)
            if os.path.isfile(file_path):
                size_mb = os.path.getsize(file_path) / (1024 * 1024)
                print(f"   - {file} ({size_mb:.1f} MB)")
                
    except Exception as e:
        print(f"❌ Error saving merged float16 model: {e}")
        print("Possible causes:")
        print("- Insufficient disk space")
        print("- Insufficient GPU memory") 
        print("- Model not properly trained")

print("\n🎯 Option 2: Push merged model to Hugging Face Hub")
print("-" * 50)

# Push to Hugging Face Hub in float16
if False:  # Set to True and add your token to enable
    try:
        hf_repo_name = "your-username/your-model-name"  # Change this!
        hf_token = "your_hf_token_here"  # Add your HF token!
        
        print(f"Pushing to Hugging Face: {hf_repo_name}")
        
        model.push_to_hub_merged(
            hf_repo_name, 
            tokenizer, 
            save_method="merged_16bit", 
            token=hf_token
        )
        
        print("✅ Model pushed to Hugging Face Hub!")
        print(f"🌐 Available at: https://huggingface.co/{hf_repo_name}")
        
    except Exception as e:
        print(f"❌ Error pushing to hub: {e}")

print("\n🎯 Option 3: Save merged model in 4bit (smaller but less compatible)")
print("-" * 50)

# Merge to 4bit (smaller file, but may not work with all vLLM versions)
if False:  # Set to True if you want 4bit version too
    try:
        save_path_4bit = "model_merged_4bit"
        print(f"Saving merged 4bit model to: {save_path_4bit}")
        
        model.save_pretrained_merged(
            save_path_4bit, 
            tokenizer, 
            save_method="merged_4bit",
        )
        
        print("✅ 4bit merged model saved!")
        size_gb = get_dir_size(save_path_4bit)
        print(f"📁 4bit model size: {size_gb:.1f} GB")
        
    except Exception as e:
        print(f"❌ Error saving 4bit model: {e}")

print("\n🎯 Option 4: Save just LoRA adapters (smallest, for loading later)")
print("-" * 50)

# Save only the LoRA adapters
if True:  # Set to False to skip saving the adapters
    try:
        lora_path = "model_lora_only"
        
        model.save_pretrained(lora_path)
        tokenizer.save_pretrained(lora_path)
        
        print("✅ LoRA adapters saved!")
        size_mb = get_dir_size(lora_path) * 1024  # Convert to MB
        print(f"📁 LoRA size: {size_mb:.1f} MB")
        
    except Exception as e:
        print(f"❌ Error saving LoRA: {e}")

print("\n🎯 Option 5: Push LoRA to Hugging Face Hub")
print("-" * 50)

# Push just LoRA adapters to hub
if False:  # Set to True and configure to enable
    try:
        hf_lora_repo = "your-username/your-lora-adapters"  # Change this!
        hf_token = "your_hf_token_here"  # Add your HF token!
        
        model.push_to_hub(hf_lora_repo, token=hf_token)
        tokenizer.push_to_hub(hf_lora_repo, token=hf_token)
        
        print("✅ LoRA adapters pushed to Hugging Face Hub!")
        print(f"🌐 Available at: https://huggingface.co/{hf_lora_repo}")
        
    except Exception as e:
        print(f"❌ Error pushing LoRA to hub: {e}")

print("\n" + "="*70)
print("🎯 How to use your saved models")
print("="*70)

vllm_usage_guide = '''
# For vLLM inference with your float16 merged model:

from vllm import LLM, SamplingParams

# Load your merged model with vLLM
# (the GPU needs room for ~13GB of float16 weights, more than an 8GB card has)
llm = LLM(
    model="./model_merged_16bit",  # Path to your saved model
    gpu_memory_utilization=0.8,
    max_model_len=2048,  # Adjust based on your needs
    dtype="float16"  # Explicitly use float16
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
)

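# Format your prompt (the tokenizer was saved alongside the merged weights)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./model_merged_16bit")
prompt = tokenizer.apply_chat_template([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Calculate pi."}
], tokenize=False, add_generation_prompt=True)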

# Generate
outputs = llm.generate([prompt], sampling_params)
response = outputs[0].outputs[0].text

print(response)
'''

print(vllm_usage_guide)

print("="*70)
print("📋 Summary of what was saved:")
print("="*70)

saved_models = []

if os.path.exists("model_merged_16bit"):
    size = get_dir_size("model_merged_16bit")
    saved_models.append(f"✅ Float16 merged model: model_merged_16bit ({size:.1f} GB) - BEST for vLLM")

if os.path.exists("model_merged_4bit"):
    size = get_dir_size("model_merged_4bit") 
    saved_models.append(f"✅ 4bit merged model: model_merged_4bit ({size:.1f} GB)")

if os.path.exists("model_lora_only"):
    size = get_dir_size("model_lora_only") * 1024  # MB
    saved_models.append(f"✅ LoRA adapters only: model_lora_only ({size:.1f} MB)")

if saved_models:
    for model_info in saved_models:
        print(model_info)
else:
    print("❌ No models were saved successfully")

print(f"\n💾 Total disk space used: ~{sum([get_dir_size(d) for d in ['model_merged_16bit', 'model_merged_4bit'] if os.path.exists(d)]):.1f} GB")

print("\n🎯 Recommendations:")
print("- Use 'model_merged_16bit' for vLLM inference (best compatibility)")  
print("- Keep 'model_lora_only' as backup (small size)")
print("- Test vLLM loading before deleting any files")
print("- Float16 merged model will work with vLLM out of the box!")

# Clear GPU memory after saving
torch.cuda.empty_cache()
print(f"\n🧹 GPU memory cleared. Available: {(torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_reserved(0)) / 1024**3:.1f} GB")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
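
To make the `q8_0` entry concrete, here is a minimal pure-Python sketch of the block scheme it is based on, assuming the usual llama.cpp layout of 32 weights per block sharing one float16 scale. The function names are illustrative, and Python's `round` breaks ties differently from the C implementation, so treat this as an approximation of the idea, not a bit-exact reimplementation.

```python
BLOCK_SIZE = 32  # weights per Q8_0 block (assumed llama.cpp layout)


def quantize_q8_0_block(block):
    """Map 32 floats to one shared scale plus 32 integers in [-127, 127]."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(x) for x in block)
    scale = amax / 127.0
    inv_scale = 1.0 / scale if scale else 0.0
    return scale, [round(x * inv_scale) for x in block]


def dequantize_q8_0_block(scale, quants):
    """Recover approximate floats from the shared scale and the integers."""
    return [scale * q for q in quants]


# 32 int8 values plus one 16-bit scale, averaged over the block
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE

weights = [(i - 16) / 10 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0_block(weights)
restored = dequantize_q8_0_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"bits per weight: {bits_per_weight}")
print(f"max round-trip error: {max_error:.5f} (scale = {scale:.5f})")
```

At 8.5 bits per weight, a model with roughly 7.25 billion parameters comes to a little over 7 GiB, which is in line with the ~7-8 GB estimate for Q8_0.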

In [None]:
import os
import torch
import time
from unsloth import FastLanguageModel

print("🦙 GGUF / llama.cpp Model Conversion")
print("="*60)

# Clear GPU memory before conversion
torch.cuda.empty_cache()

# Check available disk space (GGUF files can be large)
def check_disk_space():
    import shutil
    total, used, free = shutil.disk_usage(".")
    free_gb = free / (1024**3)
    print(f"💾 Available disk space: {free_gb:.1f} GB")
    
    # Estimate space needed for different formats
    print("📊 Estimated space requirements:")
    print("   - Q8_0 (8-bit): ~7-8 GB")
    print("   - F16 (16-bit): ~13-14 GB") 
    print("   - Q4_K_M (4-bit): ~4-5 GB")
    print("   - All formats: ~25-30 GB total")
    
    if free_gb < 30:
        print("⚠️  WARNING: Consider saving formats one at a time if space is limited")
    
    return free_gb

free_space = check_disk_space()

def get_file_size(filepath):
    """Get file size in GB"""
    if os.path.exists(filepath):
        size_bytes = os.path.getsize(filepath)
        return size_bytes / (1024**3)
    return 0

def save_gguf_with_timing(save_func, format_name, estimated_time="5-10 minutes"):
    """Helper to save GGUF with timing and error handling"""
    print(f"\n🔄 Converting to {format_name}...")
    print(f"⏱️  Estimated time: {estimated_time}")
    
    start_time = time.time()
    
    try:
        save_func()
        
        elapsed = time.time() - start_time
        print(f"✅ {format_name} conversion completed in {elapsed/60:.1f} minutes!")
        return True
        
    except Exception as e:
        elapsed = time.time() - start_time
        print(f"❌ {format_name} conversion failed after {elapsed/60:.1f} minutes")
        print(f"Error: {e}")
        return False

print("\n" + "="*60)
print("🎯 GGUF Conversion Options")
print("="*60)

# Conversion results tracking
conversion_results = {}

# Option 1: Save to 8bit Q8_0 (High quality, medium size)
print("\n📦 Option 1: Q8_0 (8-bit quantization)")
print("-" * 40)
print("✨ Features: High quality, good for most use cases")
print("📁 Size: ~7-8 GB")
print("🎯 Best for: General purpose, balanced quality/size")

if True:  # Set to False to skip the Q8_0 conversion
    def save_q8_0():
        model.save_pretrained_gguf(
            "model_q8_0", 
            tokenizer,
            # quantization_method defaults to "q8_0"
        )
    
    success = save_gguf_with_timing(save_q8_0, "Q8_0", "5-8 minutes")
    conversion_results["Q8_0"] = success
    
    if success:
        size = get_file_size("model_q8_0/model-unsloth.Q8_0.gguf")
        print(f"📁 Q8_0 file size: {size:.1f} GB")

# Option 2: Save to 16bit GGUF (Highest quality, largest size)  
print("\n📦 Option 2: F16 (16-bit full precision)")
print("-" * 40)
print("✨ Features: Highest quality, no quantization loss")
print("📁 Size: ~13-14 GB")
print("🎯 Best for: Maximum quality, research, reference")

if True:  # Set to False to skip the F16 conversion
    def save_f16():
        model.save_pretrained_gguf(
            "model_f16", 
            tokenizer, 
            quantization_method="f16"
        )
    
    success = save_gguf_with_timing(save_f16, "F16", "8-12 minutes")
    conversion_results["F16"] = success
    
    if success:
        size = get_file_size("model_f16/model-unsloth.F16.gguf")
        print(f"📁 F16 file size: {size:.1f} GB")

# Option 3: Save to q4_k_m GGUF (Good quality, smallest size)
print("\n📦 Option 3: Q4_K_M (4-bit quantization)")
print("-" * 40)
print("✨ Features: Good quality, smallest size, fastest inference")
print("📁 Size: ~4-5 GB") 
print("🎯 Best for: Resource-constrained devices, mobile deployment")

if True:  # Set to False to skip the Q4_K_M conversion
    def save_q4_k_m():
        model.save_pretrained_gguf(
            "model_q4_k_m", 
            tokenizer, 
            quantization_method="q4_k_m"
        )
    
    success = save_gguf_with_timing(save_q4_k_m, "Q4_K_M", "3-6 minutes")
    conversion_results["Q4_K_M"] = success
    
    if success:
        size = get_file_size("model_q4_k_m/model-unsloth.Q4_K_M.gguf")
        print(f"📁 Q4_K_M file size: {size:.1f} GB")

# Option 4: Additional quantization formats (advanced)
print("\n📦 Option 4: Additional GGUF formats")
print("-" * 40)

additional_formats = {
    "q5_k_m": {"desc": "5-bit, balanced quality/size", "size": "~5-6 GB"},
    "q6_k": {"desc": "6-bit, high quality", "size": "~6-7 GB"}, 
    "q4_0": {"desc": "4-bit, fastest inference", "size": "~4 GB"},
    "q5_0": {"desc": "5-bit, good balance", "size": "~5 GB"},
}

save_additional = False  # Set to True to enable additional formats

if save_additional:
    for fmt, info in additional_formats.items():
        print(f"\n🔄 Converting to {fmt.upper()} ({info['desc']}, {info['size']})")
        
        def save_additional_format():
            model.save_pretrained_gguf(
                f"model_{fmt}", 
                tokenizer, 
                quantization_method=fmt
            )
        
        success = save_gguf_with_timing(save_additional_format, fmt.upper(), "3-8 minutes")
        conversion_results[fmt.upper()] = success

# Option 5: Push to Hugging Face Hub
print("\n🌐 Hugging Face Hub Upload")
print("-" * 40)

upload_to_hub = False  # Set to True and configure to enable uploads

if upload_to_hub:
    hf_token = "your_hf_token_here"  # Replace with your token
    hf_repo_base = "your-username/your-model-name"  # Replace with your repo
    
    # Upload Q8_0 to hub
    if conversion_results.get("Q8_0", False):
        try:
            print("🚀 Uploading Q8_0 to Hugging Face Hub...")
            model.push_to_hub_gguf(
                f"{hf_repo_base}-Q8-GGUF", 
                tokenizer, 
                token=hf_token
            )
            print(f"✅ Q8_0 uploaded to: https://huggingface.co/{hf_repo_base}-Q8-GGUF")
        except Exception as e:
            print(f"❌ Q8_0 upload failed: {e}")
    
    # Upload F16 to hub
    if conversion_results.get("F16", False):
        try:
            print("🚀 Uploading F16 to Hugging Face Hub...")
            model.push_to_hub_gguf(
                f"{hf_repo_base}-F16-GGUF", 
                tokenizer, 
                quantization_method="f16", 
                token=hf_token
            )
            print(f"✅ F16 uploaded to: https://huggingface.co/{hf_repo_base}-F16-GGUF")
        except Exception as e:
            print(f"❌ F16 upload failed: {e}")
    
    # Upload Q4_K_M to hub  
    if conversion_results.get("Q4_K_M", False):
        try:
            print("🚀 Uploading Q4_K_M to Hugging Face Hub...")
            model.push_to_hub_gguf(
                f"{hf_repo_base}-Q4-GGUF", 
                tokenizer, 
                quantization_method="q4_k_m", 
                token=hf_token
            )
            print(f"✅ Q4_K_M uploaded to: https://huggingface.co/{hf_repo_base}-Q4-GGUF")
        except Exception as e:
            print(f"❌ Q4_K_M upload failed: {e}")

print("\n" + "="*70)
print("📋 CONVERSION SUMMARY")
print("="*70)

total_size = 0
successful_conversions = []

for format_name, success in conversion_results.items():
    if success:
        # Find the corresponding directory and file
        format_dirs = {
            "Q8_0": "model_q8_0",
            "F16": "model_f16", 
            "Q4_K_M": "model_q4_k_m"
        }
        
        if format_name in format_dirs:
            dir_path = format_dirs[format_name]
            if os.path.exists(dir_path):
                # Calculate directory size
                dir_size = 0
                for dirpath, dirnames, filenames in os.walk(dir_path):
                    for filename in filenames:
                        filepath = os.path.join(dirpath, filename)
                        dir_size += os.path.getsize(filepath)
                
                dir_size_gb = dir_size / (1024**3)
                total_size += dir_size_gb
                
                print(f"✅ {format_name}: {dir_path}/ ({dir_size_gb:.1f} GB)")
                successful_conversions.append(format_name)
            else:
                print(f"⚠️  {format_name}: Conversion reported success but directory not found")
    else:
        print(f"❌ {format_name}: Conversion failed")

print(f"\n📊 Total disk space used: {total_size:.1f} GB")
print(f"🎯 Successful conversions: {len(successful_conversions)}/{len(conversion_results)}")

print("\n" + "="*70)
print("🦙 HOW TO USE WITH LLAMA.CPP")
print("="*70)

llama_cpp_guide = '''
# Install llama.cpp (if not already installed)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

# Run inference with your GGUF models
# (recent llama.cpp releases build with CMake and name this binary llama-cli instead of main):

# Q4_K_M (fastest, smallest)
./main -m ./model_q4_k_m/model-unsloth.Q4_K_M.gguf -n 512 -p "Calculate pi."

# Q8_0 (balanced)
./main -m ./model_q8_0/model-unsloth.Q8_0.gguf -n 512 -p "Calculate pi."

# F16 (highest quality)
./main -m ./model_f16/model-unsloth.F16.gguf -n 512 -p "Calculate pi."

# Interactive chat mode
./main -m ./model_q4_k_m/model-unsloth.Q4_K_M.gguf -n 512 -i

# With the prompt read from a file
./main -m ./model_q4_k_m/model-unsloth.Q4_K_M.gguf -n 512 -f prompt.txt

# Python binding (llama-cpp-python)
pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama(model_path="./model_q4_k_m/model-unsloth.Q4_K_M.gguf")
output = llm("Calculate pi.", max_tokens=512)
print(output['choices'][0]['text'])
'''

print(llama_cpp_guide)

print("\n🎯 FORMAT RECOMMENDATIONS:")
print("-" * 40)
print("🥇 Q4_K_M: Best for most users (good quality, fast, small)")
print("🥈 Q8_0: Best for quality-conscious users (high quality, medium size)")  
print("🥉 F16: Best for researchers (maximum quality, large size)")

print(f"\n🧹 Cleaning up...")
torch.cuda.empty_cache()

remaining_space = check_disk_space()
print(f"💾 Remaining disk space: {remaining_space:.1f} GB")

print("\n🎉 GGUF conversion complete!")
print("Your models are ready for llama.cpp inference! 🦙")
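
As a quick sanity check after conversion, the header of each output file can be inspected. The sketch below assumes the GGUF layout in which a file begins with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. `read_gguf_header` is a hypothetical helper, and the paths are the ones used in the cell above, which may differ between Unsloth versions.

```python
import os
import struct


def read_gguf_header(path):
    """Return the GGUF version if `path` starts with a GGUF header, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Output paths assumed from the conversion cell; adjust if yours differ
candidates = [
    "model_q8_0/model-unsloth.Q8_0.gguf",
    "model_f16/model-unsloth.F16.gguf",
    "model_q4_k_m/model-unsloth.Q4_K_M.gguf",
]

for path in candidates:
    if not os.path.exists(path):
        print(f"{path}: not found")
        continue
    version = read_gguf_header(path)
    status = f"GGUF v{version}" if version is not None else "not a GGUF file"
    print(f"{path}: {status}")
```

This only confirms that the file starts with a GGUF header. It does not validate the tensors, so a short test generation with llama.cpp is still worthwhile.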