# Convert Fine-tuned Model to GGUF for Ollama

This notebook:
1. Merges your LoRA adapter with the base model
2. Converts the merged model to GGUF and quantizes it to Q4_K_M
3. Uploads the quantized GGUF to HuggingFace
4. Creates an Ollama Modelfile

**Requirements:**
- Use **High-RAM runtime** (Runtime > Change runtime type > High-RAM)
- T4 GPU recommended

## 1. Install Dependencies

In [None]:
!pip install -q transformers accelerate peft huggingface_hub sentencepiece
!pip install -q llama-cpp-python

# Clone llama.cpp for conversion scripts
!git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
!pip install -q -r llama.cpp/requirements.txt

## 2. Log in to HuggingFace

In [None]:
from huggingface_hub import login
login()

## 3. Check Memory

In [None]:
import psutil
ram_gb = psutil.virtual_memory().total / 1e9
print(f"Total RAM: {ram_gb:.1f} GB")

if ram_gb < 25:
    print("\n⚠️  WARNING: You need High-RAM runtime for 7B model merge!")
    print("Go to: Runtime > Change runtime type > High-RAM")
else:
    print("✅ Sufficient RAM for model merge")
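
The 25 GB threshold above leaves headroom over the raw weight size. As a rough sketch of the arithmetic, the snippet below estimates how much memory the weights alone occupy at a given precision. The parameter count of about 7.6 billion is an assumption taken from the figure reported for the Qwen2.5 7B family, and the helper name `weights_size_gb` is hypothetical.

```python
def weights_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumption: ~7.6e9 parameters for a Qwen2.5 7B model.
N_PARAMS = 7.6e9

fp16_gb = weights_size_gb(N_PARAMS, bytes_per_param=2)  # float16: 2 bytes each
fp32_gb = weights_size_gb(N_PARAMS, bytes_per_param=4)  # float32: 4 bytes each
print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"float32 weights: ~{fp32_gb:.1f} GB")
```

The merge and the save step both need working memory on top of the weights themselves, which is why a standard runtime is likely to run out of RAM.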

## 4. Merge LoRA with Base Model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import gc

base_model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_name = "goodknightleo/qwen-coder-7b-finetuned"
merged_path = "./merged_model"

print("Loading base model in float16...")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="cpu",  # Load on CPU to save GPU memory
    trust_remote_code=True,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base_model, adapter_name)

print("Merging LoRA weights...")
model = model.merge_and_unload()

print(f"Saving merged model to {merged_path}...")
model.save_pretrained(merged_path, safe_serialization=True)
tokenizer.save_pretrained(merged_path)

# Free memory
del model
del base_model
gc.collect()

print("✅ Merge complete!")
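
To illustrate what `merge_and_unload()` does conceptually, here is a minimal pure-Python sketch of the standard LoRA merge, W' = W + (alpha / r) * (B @ A), on toy 2x2 matrices. The helper names `matmul` and `merge_lora` and all values are hypothetical; PEFT performs the equivalent update on the real weight tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy shapes: W is 2x2 and the rank is r = 1, so B is 2x1 and A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [-0.5]]

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]
```

After the merge, the adapter matrices are no longer needed at inference time, which is what allows the model to be exported as a single set of weights.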

## 5. Convert to GGUF

In [None]:
import subprocess
import sys

output_gguf = "qwen-coder-7b-finetuned.gguf"

print("Converting to GGUF format...")
# sys.executable runs the script with the same interpreter (and packages) as this notebook
result = subprocess.run([
    sys.executable, "llama.cpp/convert_hf_to_gguf.py",
    merged_path,
    "--outfile", output_gguf,
    "--outtype", "f16"
], capture_output=True, text=True)

print(result.stdout)
if result.returncode != 0:
    print("Error:", result.stderr)
else:
    print(f"✅ Created {output_gguf}")
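
A quick sanity check on the converted file is to read its header. The sketch below assumes GGUF version 2 or later, where the file begins with the 4-byte magic `GGUF`, followed by a little-endian uint32 version and two uint64 counts (tensors and metadata key/value pairs). The helper name `read_gguf_header` is hypothetical, and the demo writes a tiny synthetic header rather than reading a real model.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return version, n_tensors, n_kv

# Demo on a synthetic header; the counts here are made-up values.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 291, 25))
header = read_gguf_header(tmp.name)
os.remove(tmp.name)
print(header)
```

In the notebook, calling `read_gguf_header(output_gguf)` after the conversion should return without raising if the file was written correctly.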

## 6. Quantize to Q4_K_M (Smaller & Faster)

In [None]:
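# Build the llama.cpp quantize tool. Recent llama.cpp versions build with CMake
# and name the binary llama-quantize (older checkouts used `make quantize`).
# LLAMA_CURL=OFF skips the optional libcurl dependency, which quantization does not need.
!cmake -S llama.cpp -B llama.cpp/build -DLLAMA_CURL=OFF
!cmake --build llama.cpp/build --config Release -j 4 --target llama-quantize

quantized_gguf = "qwen-coder-7b-finetuned-Q4_K_M.gguf"

print("\nQuantizing to Q4_K_M...")
!./llama.cpp/build/bin/llama-quantize {output_gguf} {quantized_gguf} Q4_K_M

import os
if os.path.exists(quantized_gguf):
    size_gb = os.path.getsize(quantized_gguf) / 1e9
    print(f"\n✅ Created {quantized_gguf} ({size_gb:.2f} GB)")
else:
    print(f"\n❌ {quantized_gguf} was not created; check the build and quantize output above")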
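
To see how much the quantization actually saved, one option is to compare the two file sizes. Because the f16 file stores roughly 16 bits per weight, the size ratio gives an approximate effective bit width for the quantized file (ignoring metadata and tensors kept at higher precision). The helper name `effective_bits_per_weight` is hypothetical, and the byte counts in the demo are illustrative round numbers, not measurements.

```python
def effective_bits_per_weight(f16_bytes: int, quantized_bytes: int) -> float:
    """Estimate bits per weight of a quantized file relative to its f16 source."""
    return 16.0 * quantized_bytes / f16_bytes

# Illustrative sizes only; in the notebook you could pass
# os.path.getsize(output_gguf) and os.path.getsize(quantized_gguf) instead.
example = effective_bits_per_weight(
    f16_bytes=16_000_000_000,
    quantized_bytes=5_000_000_000,
)
print(f"~{example:.1f} bits per weight")
```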

## 7. Upload GGUF to HuggingFace

In [None]:
from huggingface_hub import HfApi, create_repo

repo_id = "goodknightleo/qwen-coder-7b-finetuned-GGUF"

# Create the repo (no-op if it already exists, because exist_ok=True)
try:
    create_repo(repo_id, exist_ok=True)
    print(f"Repo ready: {repo_id}")
except Exception as e:
    print(f"Could not create repo: {e}")

# Upload quantized GGUF
api = HfApi()
print(f"\nUploading {quantized_gguf}...")
api.upload_file(
    path_or_fileobj=quantized_gguf,
    path_in_repo=quantized_gguf,
    repo_id=repo_id,
)
print(f"✅ Uploaded to https://huggingface.co/{repo_id}")

## 8. Create Ollama Modelfile

In [None]:
modelfile_content = '''FROM ./qwen-coder-7b-finetuned-Q4_K_M.gguf

TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{- end }}
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

SYSTEM """You are an expert software engineer with deep knowledge of algorithms, system design, security, and best practices. You write clean, efficient, well-documented code and can debug complex issues systematically."""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
'''

with open("Modelfile", "w") as f:
    f.write(modelfile_content)

print("Created Modelfile:")
print("-" * 40)
print(modelfile_content)
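
The TEMPLATE above follows the ChatML layout used by Qwen models. As a rough sketch of what the model sees for a single turn, the hypothetical helper `chatml_prompt` below builds the equivalent string in Python. Exact whitespace in Ollama's rendering may differ slightly, so treat this as an approximation for debugging rather than a byte-exact reproduction.

```python
def chatml_prompt(user: str, system: str = "") -> str:
    """Build a single-turn ChatML prompt, ending where the assistant reply starts."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = chatml_prompt(
    "Reverse a string in Python.",
    system="You are a helpful assistant.",
)
print(prompt)
```

The two `PARAMETER stop` lines in the Modelfile correspond to the markers used here: generation stops when the model emits `<|im_end|>` or tries to open a new turn with `<|im_start|>`.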

## 9. Download Files

Download these files to your local machine:

In [None]:
from google.colab import files

print("Downloading Modelfile...")
files.download("Modelfile")

print(f"\nDownloading {quantized_gguf}...")
print("(This may take a few minutes for a ~4GB file)")
files.download(quantized_gguf)
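
Multi-gigabyte browser downloads can be truncated without an obvious error. One way to check is to compute a SHA-256 checksum in Colab and again on your local machine, then compare the two values. The sketch below uses a hypothetical helper `sha256_of_file` that reads the file in chunks so it does not need to fit in memory; the demo runs on a small temporary file.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small temporary file standing in for the GGUF.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
file_hash = sha256_of_file(tmp.name)
os.remove(tmp.name)

print(file_hash == hashlib.sha256(b"hello").hexdigest())  # True
```

In the notebook you would call `sha256_of_file(quantized_gguf)` before downloading and run the same function on the downloaded copy.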

## 10. Setup Instructions for Ollama

After downloading the files, run these commands on your Mac:

```bash
# 1. Install Ollama (if not already)
brew install ollama

# 2. Start Ollama service
ollama serve

# 3. In a new terminal, create the model
cd ~/Downloads  # or wherever you saved the files
ollama create qwen-coder-custom -f Modelfile

# 4. Run your model!
ollama run qwen-coder-custom
```
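
Once the model is created, it can also be called from Python through Ollama's local HTTP API. The sketch below assumes the server from step 2 is listening on the default address `http://localhost:11434` and uses the `/api/generate` endpoint with streaming disabled. The helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address

def build_generate_request(prompt, model="qwen-coder-custom"):
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="qwen-coder-custom"):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires `ollama serve` to be running):
# print(generate("Write a Python function that reverses a string."))
```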

### Use with AI Coding Tools:

**Aider:**
```bash
pip install aider-chat
aider --model ollama/qwen-coder-custom
```

**Continue.dev (VS Code):**
- Install Continue extension
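- Add a model entry that uses the Ollama provider to the Continue config, e.g. `{"title": "Qwen Coder Custom", "provider": "ollama", "model": "qwen-coder-custom"}` (the config file name and format depend on your Continue version)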