# Qwen2.5-Coder-7B to ONNX Converter (Low RAM Version)

This notebook converts a Qwen2.5-Coder model to ONNX while keeping RAM usage low.

**Requirements:** Converting the 7B model needs Colab Pro (more RAM); otherwise, use a smaller model.

**Alternative:** Use Qwen2.5-Coder-1.5B, which is much smaller and still capable for coding tasks.
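
The choice between the two sizes can be sketched as a simple RAM check. This is only an illustration: the thresholds are the approximate RAM figures from the comparison at the end of this notebook, the 7B model ID is assumed by analogy with the 1.5B ID used below, and `total_ram_gb` and `pick_model` are hypothetical helper names.

```python
import os

# Approximate RAM needed for conversion, taken from the comparison at the end
# of this notebook. The 7B model ID is an assumption (same naming as the 1.5B ID).
REQUIRED_RAM_GB = {
    "Qwen/Qwen2.5-Coder-7B-Instruct": 15,
    "Qwen/Qwen2.5-Coder-1.5B-Instruct": 4,
}

def total_ram_gb():
    """Total physical RAM in decimal GB (Linux only, which covers Colab)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

def pick_model(ram_gb):
    """Return the largest model whose RAM estimate fits, or None if neither fits."""
    # Try the most demanding model first.
    for model_id, needed in sorted(REQUIRED_RAM_GB.items(), key=lambda kv: -kv[1]):
        if ram_gb >= needed:
            return model_id
    return None
```

In a Colab cell, `pick_model(total_ram_gb())` would suggest which model to convert. Total RAM is an upper bound on what is actually free, so treat the result as a hint rather than a guarantee.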

## Option 1: Qwen2.5-Coder-1.5B (Recommended - Fits in Free Colab)

About 1.5B parameters (~3GB of weights in float16), with a much faster conversion than the 7B model.
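
The ~3GB figure follows from the parameter count and the weight precision. A rough estimate, assuming a nominal 1.5 billion parameters and ignoring any overhead beyond the raw weights (`estimate_weight_gb` is a hypothetical helper):

```python
def estimate_weight_gb(num_params, bytes_per_param=2):
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# float16 stores each parameter in 2 bytes, float32 in 4 bytes.
print(f"1.5B @ float16: {estimate_weight_gb(1.5e9):.1f} GB")     # 3.0 GB
print(f"1.5B @ float32: {estimate_weight_gb(1.5e9, 4):.1f} GB")  # 6.0 GB
```

The same arithmetic for a nominal 7 billion parameters in float16 gives about 14 GB, in line with the ~15GB RAM figure quoted for the 7B model.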

In [None]:
# Install packages (accelerate is required by device_map / low_cpu_mem_usage below)
!pip install -q transformers torch onnx onnxruntime accelerate
print("✓ Packages installed")
print("Restart runtime now: Runtime → Restart runtime")

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pathlib import Path
import json

# Use 1.5B model instead of 7B
model_id = 'Qwen/Qwen2.5-Coder-1.5B-Instruct'  # Much smaller!
output_dir = Path('/content/qwen2.5-coder-onnx')
output_dir.mkdir(exist_ok=True)

print("Loading Qwen2.5-Coder-1.5B...")
print("This will download ~3GB")

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
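model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    # float16 halves weight memory; if export fails on an op not implemented
    # for Half on CPU, try torch.float32 (roughly twice the RAM)
    torch_dtype=torch.float16,
    device_map='cpu',
    low_cpu_mem_usage=True  # Load weights incrementally to reduce peak RAM
)
model.eval()  # Inference mode: disables dropout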

print("✓ Model loaded")
print("Converting to ONNX...")

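# Trace with a short dummy prompt; dynamic_axes keeps batch size and length flexible
dummy_input = "Hello"
inputs = tokenizer(dummy_input, return_tensors="pt")
input_ids = inputs['input_ids']

onnx_path = output_dir / "model.onnx"

# Export logits only: disable the KV cache and dict-style outputs so the traced
# graph has a single tensor output that matches output_names below
model.config.use_cache = False
model.config.return_dict = False

with torch.no_grad():
    torch.onnx.export(
        model,
        (input_ids,),
        str(onnx_path),
        input_names=['input_ids'],
        output_names=['logits'],
        dynamic_axes={
            'input_ids': {0: 'batch_size', 1: 'sequence_length'},
            'logits': {0: 'batch_size', 1: 'sequence_length'}
        },
        opset_version=14,
        do_constant_folding=True
    )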

tokenizer.save_pretrained(output_dir)

genai_config = {
    "model": {
        "type": "qwen2",
        "context_length": 32768,
        "vocab_size": 151936,
        "decoder": {
            "session_options": {"provider_options": []},
            "filename": "model.onnx",
            "hidden_size": 1536,
            "num_attention_heads": 12,
            "num_hidden_layers": 28,
            "num_key_value_heads": 2
        }
    },
    "search": {"max_length": 32768, "temperature": 1.0, "top_p": 1.0}
}

with open(output_dir / "genai_config.json", "w") as f:
    json.dump(genai_config, f, indent=2)

print()
print("✓ Conversion complete!")
print(f"Files saved to: {output_dir}")
print()
for f in output_dir.iterdir():
    size_mb = f.stat().st_size / (1024 * 1024)
    print(f'  - {f.name}: {size_mb:.1f} MB')

## Upload to Hugging Face

In [None]:
!pip install -q huggingface-hub

from huggingface_hub import HfApi, login
from getpass import getpass

token = getpass("Enter your HuggingFace token: ")
login(token=token)

username = "iminurdetails"  # Replace with your own Hugging Face username
repo_name = "Qwen2.5-Coder-1.5B-Instruct-ONNX"
repo_id = f"{username}/{repo_name}"

print(f"Uploading to {repo_id}...")

api = HfApi()
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_folder(
    folder_path='/content/qwen2.5-coder-onnx',
    repo_id=repo_id,
    repo_type="model"
)

print(f"✓ Upload complete! https://huggingface.co/{repo_id}")

## Done!

The 1.5B model is much smaller but still very capable for coding tasks.

**Comparison:**
- 7B model: ~15GB RAM needed, ~8GB output
- 1.5B model: ~4GB RAM needed, ~3GB output

For a coding tutor, the 1.5B model is sufficient and much faster to convert and run!
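
Before relying on the exported file, a quick smoke test can confirm that it loads and produces logits of the expected shape. The sketch below assumes the export used the `input_ids` and `logits` names and the output path from the conversion cell, and that the vocabulary size is 151936 as in the config above. `expected_logits_shape` and `smoke_test` are hypothetical helpers, and a float16 graph may run slowly on the CPU provider.

```python
from pathlib import Path

def expected_logits_shape(batch_size, sequence_length, vocab_size=151936):
    """Shape the 'logits' output should have for a given input shape."""
    return (batch_size, sequence_length, vocab_size)

def smoke_test(onnx_path, sequence_length=8):
    """Run one forward pass through the exported graph and return the logits shape."""
    # Imported lazily so the helper above works without these packages installed.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(str(onnx_path), providers=["CPUExecutionProvider"])
    # Arbitrary token ids: only the output shape is checked, not text quality.
    input_ids = np.ones((1, sequence_length), dtype=np.int64)
    logits = session.run(["logits"], {"input_ids": input_ids})[0]
    return tuple(logits.shape)

onnx_path = Path("/content/qwen2.5-coder-onnx/model.onnx")
if onnx_path.exists():
    shape = smoke_test(onnx_path)
    assert shape == expected_logits_shape(1, 8), shape
    print("ONNX output shape:", shape)
else:
    print(f"{onnx_path} not found; run the conversion cell first")
```

Using a sequence length different from the one-token-prompt trace also exercises the dynamic `sequence_length` axis declared during export.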