# Convert Fine-tuned Model to GGUF for Ollama (v2)

**IMPORTANT:** Run the cells in order. After Cell 1 finishes, restart the runtime, then continue from Cell 2.

Requirements:
- **High-RAM runtime** recommended (Runtime > Change runtime type). The merge in Cell 4 loads the full 7B base model in float16 on the CPU.

## 1. Install Dependencies (Then Restart Runtime!)

In [None]:
# Fix numpy first
!pip uninstall -y numpy
!pip install numpy==1.26.4

# Install compatible versions
!pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
!pip install transformers==4.44.0 accelerate==0.33.0 peft==0.12.0 huggingface_hub sentencepiece

# Clone llama.cpp
!git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
!pip install -r llama.cpp/requirements.txt

print("\n" + "="*50)
print("✅ Installation complete!")
print("⚠️  NOW: Runtime > Restart runtime")
print("    Then continue from Cell 2")
print("="*50)

## ⚠️ RESTART RUNTIME NOW

Go to **Runtime > Restart runtime**, then continue below.

## 2. Login to HuggingFace

In [None]:
from huggingface_hub import login
login()

## 3. Check Environment

In [None]:
import numpy as np
import torch
import transformers
import peft
import psutil

print(f"NumPy: {np.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"PEFT: {peft.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

ram_gb = psutil.virtual_memory().total / 1e9
print(f"\nRAM: {ram_gb:.1f} GB")

if ram_gb < 20:
    print("\n⚠️  Low RAM - merge may fail. Consider High-RAM runtime.")
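
To put the 20 GB threshold above in context, here is a rough estimate of how much memory the float16 weights alone occupy. It is a sketch, not a measurement: the parameter count of about 7.6e9 is an assumption (check the model card), and `fp16_weights_gb` is a hypothetical helper name.

```python
def fp16_weights_gb(n_params: float) -> float:
    """Approximate size of the weights alone in float16 (2 bytes per parameter)."""
    return n_params * 2 / 1e9


# Assumption: roughly 7.6e9 parameters for a "7B" model; check the model card.
needed_gb = fp16_weights_gb(7.6e9)
print(f"fp16 weights: ~{needed_gb:.1f} GB, before any merge or save overhead")
```

Loading the adapter, merging, and saving all add overhead on top of this figure, which is why extra headroom beyond the raw weight size helps.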

## 4. Merge LoRA with Base Model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import gc

base_model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_name = "goodknightleo/qwen-coder-7b-finetuned"
merged_path = "./merged_model"

print("Loading base model (this takes a few minutes)...")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="cpu",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(base_model, adapter_name)

print("Merging weights...")
model = model.merge_and_unload()

print(f"Saving to {merged_path}...")
model.save_pretrained(merged_path, safe_serialization=True)
tokenizer.save_pretrained(merged_path)

del model, base_model
gc.collect()
torch.cuda.empty_cache()

print("✅ Merge complete!")
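
Before converting, it can be worth confirming that the merged directory looks complete. The sketch below uses only the standard library. `check_merged_dir` is a hypothetical helper, and the file patterns are assumptions based on what `save_pretrained(..., safe_serialization=True)` and the tokenizer usually write.

```python
from pathlib import Path


def check_merged_dir(path: str) -> list[str]:
    """Return a list of problems found in a saved model directory (empty if none)."""
    root = Path(path)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    # Weights may be a single file or several shards, so match by extension.
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files")
    if not any(root.glob("tokenizer*")):
        problems.append("no tokenizer files")
    return problems
```

Calling `check_merged_dir("./merged_model")` should return an empty list if the merge cell finished cleanly.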

## 5. Convert to GGUF

In [None]:
!python llama.cpp/convert_hf_to_gguf.py ./merged_model --outfile qwen-coder-finetuned-f16.gguf --outtype f16
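
A cheap way to confirm the conversion produced a real GGUF file is to read its header: the format starts with the four magic bytes `GGUF`, followed by a little-endian 32-bit version number. The helper below is a minimal sketch; `read_gguf_version` is a hypothetical name and the function checks nothing beyond those first 8 bytes.

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF version from the file header, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("qwen-coder-finetuned-f16.gguf")` should return a small positive integer if the conversion succeeded.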

## 6. Quantize to Q4_K_M

In [None]:
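# Build the quantize tool. Recent llama.cpp checkouts build with CMake, and the
# binary is named llama-quantize. LLAMA_CURL=OFF avoids a libcurl dependency
# (harmless if your checkout ignores the option).
!cmake -S llama.cpp -B llama.cpp/build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF
!cmake --build llama.cpp/build --target llama-quantize -j

# Quantize (binary path assumes the CMake build directory above)
!./llama.cpp/build/bin/llama-quantize qwen-coder-finetuned-f16.gguf qwen-coder-finetuned-Q4_K_M.gguf Q4_K_M

import os
quantized = "qwen-coder-finetuned-Q4_K_M.gguf"
if os.path.exists(quantized):
    size = os.path.getsize(quantized) / 1e9
    print(f"\n✅ Created {quantized} ({size:.2f} GB)")
else:
    print(f"\n❌ {quantized} not found. Check the build and quantize output above.")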
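
One plausibility check on the quantized file is its effective bits per weight: file size in bits divided by parameter count. The sketch below ignores metadata overhead, and the parameter count in the usage comment (about 7.6e9) is an assumption to be checked against the model card. `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(size_bytes: float, n_params: float) -> float:
    """Effective bits stored per parameter, ignoring file metadata."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return size_bytes * 8 / n_params


# Usage in the notebook (assumed parameter count):
# bits_per_weight(os.path.getsize("qwen-coder-finetuned-Q4_K_M.gguf"), 7.6e9)
```

For a 4-bit K-quant, a result somewhere between 4 and 6 would be unsurprising. A value near 16 would suggest the f16 file was measured by mistake.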

## 7. Upload to HuggingFace

In [None]:
from huggingface_hub import HfApi, create_repo

repo_id = "goodknightleo/qwen-coder-7b-finetuned-GGUF"
gguf_file = "qwen-coder-finetuned-Q4_K_M.gguf"

create_repo(repo_id, exist_ok=True)

api = HfApi()
print(f"Uploading {gguf_file} to {repo_id}...")
api.upload_file(
    path_or_fileobj=gguf_file,
    path_in_repo=gguf_file,
    repo_id=repo_id,
)

print(f"\n✅ Uploaded: https://huggingface.co/{repo_id}")
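
Since the file will be downloaded again on another machine, recording a checksum here makes it possible to verify that download later. This is a minimal sketch using the standard library. `sha256_of_file` is a hypothetical helper, and it reads in chunks so a multi-gigabyte GGUF never has to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

On macOS, `shasum -a 256 <file>` prints the digest in the same hex form, so the two values can be compared directly.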

## 8. Create & Download Modelfile

In [None]:
modelfile = '''FROM ./qwen-coder-finetuned-Q4_K_M.gguf

TEMPLATE """{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{- end }}
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

SYSTEM """You are an expert software engineer with deep knowledge of algorithms, system design, security, and best practices."""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
'''

with open("Modelfile", "w") as f:
    f.write(modelfile)

print("Created Modelfile")

# Download
from google.colab import files
files.download("Modelfile")
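
The `FROM` line in the Modelfile uses a relative path, so the GGUF needs to sit where that path points when `ollama create` runs. The sketch below pulls the `FROM` target out of a Modelfile so it can be checked. `modelfile_from_target` is a hypothetical helper and only handles the simple single-line form used here.

```python
from typing import Optional


def modelfile_from_target(text: str) -> Optional[str]:
    """Return the argument of the first FROM instruction, or None if absent."""
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        # Match the instruction name case-insensitively.
        if len(parts) == 2 and parts[0].upper() == "FROM":
            return parts[1].strip()
    return None
```

In the notebook, `modelfile_from_target(modelfile)` should return `./qwen-coder-finetuned-Q4_K_M.gguf`, matching the file name produced in Cell 6.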

## 9. Download GGUF (or use from Hub)

In [None]:
# Uses repo_id and gguf_file from Cell 7, and `files` from Cell 8.
print("Option 1: Download directly from HuggingFace (faster):")
print(f"  https://huggingface.co/{repo_id}/resolve/main/{gguf_file}")

print("\nOption 2: Download from Colab (slower): uncomment the line below.")
# files.download(gguf_file)

## 10. Ollama Setup (Run on your Mac)

```bash
# Install Ollama
brew install ollama

# Download GGUF from HuggingFace
cd ~/Downloads
curl -L -o qwen-coder-finetuned-Q4_K_M.gguf \
  "https://huggingface.co/goodknightleo/qwen-coder-7b-finetuned-GGUF/resolve/main/qwen-coder-finetuned-Q4_K_M.gguf"

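# Create model. Run this from the directory that holds both the Modelfile and
# the GGUF. The Ollama server must be running (e.g. `ollama serve` in another terminal).
ollama create qwen-coder -f Modelfile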

# Run!
ollama run qwen-coder
```
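
Once the model is created, it can also be called from a script through Ollama's local HTTP API. The sketch below assumes the server listens on its default address, `http://localhost:11434`, and that the model was created as `qwen-coder` as above. `build_generate_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

For example, `generate("qwen-coder", "Write a binary search in Python.")` returns the model's reply as a string, provided the server is up.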