# Lab 4.6.8.4: ONNX Conversion & INT4 Quantization

**Capstone Option E:** Browser-Deployed Fine-Tuned LLM (Troscha Matcha Guide)  
**Phase:** 4 of 6  
**Time:** 4-6 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## Phase Objectives

By completing this phase, you will:
- [ ] Export merged model to ONNX format
- [ ] Apply INT4 quantization (browser-compatible!)
- [ ] Verify quantized model quality
- [ ] Compare file sizes at each stage
- [ ] Prepare model files for browser deployment

---

## Phase Checklist

- [ ] Merged model loaded
- [ ] ONNX export completed
- [ ] INT4 quantization applied
- [ ] Quality verified
- [ ] Tokenizer files prepared
- [ ] Model ready for browser

---

## Why This Matters

**ONNX + INT4 is the key to browser deployment!**

| Format | File Size | Browser Support |
|--------|-----------|----------------|
| PyTorch BF16 | ~2 GB | ❌ No |
| ONNX FP32 | ~4 GB | ⚠️ Too big |
| ONNX INT4 | ~500 MB | ✅ Yes! |

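**Critical:** The browser runtime in this capstone loads the model with `dtype: "q4"`, which expects integer 4-bit (INT4) weights. NF4 and FP4 are bitsandbytes formats, so a checkpoint quantized that way cannot be dropped into the browser as-is!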
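
The sizes in the table follow from the number of bits stored per weight. A rough sketch of that arithmetic, assuming a model of about 1 billion parameters (implied by the ~2 GB BF16 figure, not stated in this lab) and ignoring quantization scales, layers kept at higher precision, and file metadata:

```python
def estimate_size_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 1_000_000_000  # assumption: ~1B parameters

for name, bits in [("FP32", 32), ("BF16", 16), ("INT4", 4)]:
    print(f"{name}: ~{estimate_size_gb(NUM_PARAMS, bits):.1f} GB")

# Reduction when going from 16-bit to 4-bit weights
reduction_vs_bf16 = 1 - estimate_size_gb(NUM_PARAMS, 4) / estimate_size_gb(NUM_PARAMS, 16)
print(f"INT4 vs BF16: {reduction_vs_bf16:.0%} smaller")
```

Real files will come out somewhat larger than this estimate, because a quantized model also stores per-block scales and usually leaves some tensors unquantized.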

---

## ELI5: What is INT4 Quantization?

> **Imagine you're packing a suitcase for vacation.**
>
> - **Original (FP32):** You bring your entire wardrobe in full-size - takes 4 suitcases
> - **BF16:** You fold things better - 2 suitcases
> - **INT4:** You roll everything tight and use vacuum bags - 0.5 suitcases!
>
> The clothes still work perfectly, they're just compressed for travel.
>
> **How INT4 works:**
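> - Original weights: 16 or 32 bits per number (very precise)
> - INT4: Only 4 bits per number (16 possible values)
> - Each weight is rounded to the nearest of those 16 values, and a stored scale maps it back to the original range (no retraining needed)
> - Result: ~75% smaller than BF16 (16 bits down to 4), usually at a small quality cost (this lab budgets for roughly 5%)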
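
The rounding step described above can be sketched in a few lines of plain Python. This is a toy illustration of symmetric 4-bit quantization for a single block of weights, not the routine ONNX Runtime uses internally; `quantize_int4` and `dequantize_int4` are hypothetical helper names.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization of one block of weights.

    Returns integer codes in [-8, 7] plus the scale needed to decode them.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes, scale):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

block = [0.70, -0.33, 0.12, 0.02]
codes, scale = quantize_int4(block)
restored = dequantize_int4(codes, scale)

print("codes:", codes)
for original, approx in zip(block, restored):
    print(f"{original:+.2f} -> {approx:+.2f} (error {abs(original - approx):.3f})")
```

Each restored weight lands within half a quantization step (`scale / 2`) of the original, which is where the quality loss comes from. Real quantizers apply this per block of weights so that one outlier does not stretch the scale for the whole tensor.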

---

## Part 1: Environment Setup

In [None]:
# Environment Setup
import os
import sys
from pathlib import Path
from datetime import datetime
import json
import torch
import shutil

print("🍵 PHASE 4: ONNX CONVERSION & INT4 QUANTIZATION")
print("="*70)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"\nGPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'Not available'}")

In [None]:
# Project Configuration
PROJECT_DIR = Path("./troscha-matcha")
MODEL_DIR = PROJECT_DIR / "models"

# Paths
MERGED_PATH = MODEL_DIR / "troscha-merged"
ONNX_PATH = MODEL_DIR / "troscha-onnx"
ONNX_INT4_PATH = MODEL_DIR / "troscha-onnx-int4"
BROWSER_PATH = MODEL_DIR / "troscha-browser"

# Create directories
for path in [ONNX_PATH, ONNX_INT4_PATH, BROWSER_PATH]:
    path.mkdir(parents=True, exist_ok=True)

print("📁 Paths:")
print(f"   Merged model: {MERGED_PATH}")
print(f"   ONNX output: {ONNX_PATH}")
print(f"   ONNX INT4: {ONNX_INT4_PATH}")
print(f"   Browser files: {BROWSER_PATH}")

In [None]:
# Check merged model exists
if not MERGED_PATH.exists():
    print(f"❌ Merged model not found at {MERGED_PATH}")
    print("   Please complete Phase 3 first!")
else:
    # Show merged model size
    merged_size = sum(f.stat().st_size for f in MERGED_PATH.glob("*.safetensors")) / 1e9
    print(f"✅ Merged model found: {merged_size:.2f} GB")

---

## Part 2: Export to ONNX

In [None]:
from optimum.exporters.onnx import main_export
from optimum.onnxruntime import ORTModelForCausalLM

print("📦 Exporting to ONNX format...")
print("   This may take 5-10 minutes...")

try:
    # Export the merged Hugging Face model to ONNX with the optimum library
    main_export(
        model_name_or_path=str(MERGED_PATH),
        output=str(ONNX_PATH),
        task="text-generation-with-past",  # Use KV cache for faster inference
        device="cuda" if torch.cuda.is_available() else "cpu",
        fp16=False,  # Export as FP32 first, then quantize
    )

    # Calculate size. Graphs above 2 GB keep their weights in an external data
    # file next to the .onnx file (e.g. model.onnx_data), so match those too.
    onnx_size = sum(f.stat().st_size for f in ONNX_PATH.glob("*.onnx*")) / 1e9

    print("\n✅ ONNX export complete!")
    print(f"   Path: {ONNX_PATH}")
    print(f"   Size: {onnx_size:.2f} GB")

    # List files
    print("\n📁 ONNX files:")
    for f in sorted(ONNX_PATH.iterdir()):
        if f.is_file():
            size = f.stat().st_size / 1e6
            print(f"   {f.name}: {size:.1f} MB")

except Exception as e:
    print(f"\n❌ ONNX export failed: {type(e).__name__}: {e}")
    print("\n🔧 Troubleshooting:")
    print("   1. Ensure merged model exists and is valid")
    print("   2. Check optimum version: pip install -U 'optimum[exporters]'")
    print("   3. Verify sufficient GPU memory (clear with torch.cuda.empty_cache())")
    print("   4. For unsupported ops, try: pip install onnx onnxruntime-gpu")
    raise

---

## Part 3: Apply INT4 Quantization

**This is the critical step for browser compatibility!**

In [None]:
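import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

print("🔧 Applying INT4 quantization...")
print("   Target: integer 4-bit weights (the browser config in Part 6 uses dtype q4)")

# Find the main model file
onnx_files = sorted(ONNX_PATH.glob("*.onnx"))
model_file = next((f for f in onnx_files if "model" in f.name.lower()), None)
if model_file is None and onnx_files:
    model_file = onnx_files[0]

if model_file:
    print(f"   Source: {model_file.name}")

    output_file = ONNX_INT4_PATH / "model_quantized.onnx"

    # quantize_dynamic() produces 8-bit MatMulInteger ops and is not a reliable
    # route to 4-bit weights, so this cell swaps in ONNX Runtime's weight-only
    # 4-bit MatMul quantizer. The module and class name differ between
    # onnxruntime releases; check the version you have installed.
    model = onnx.load(str(model_file))  # also reads external data beside the file
    quantizer = MatMul4BitsQuantizer(
        model,
        block_size=32,      # weights per scale; smaller blocks track the weights more closely
        is_symmetric=True,  # symmetric INT4 around zero
    )
    quantizer.process()
    # Second argument: store weights in an external data file. Set it to True
    # if the quantized model is still above protobuf's 2 GB limit.
    quantizer.model.save_model_to_file(str(output_file), False)

    # Calculate sizes, counting any external data files next to each .onnx file
    int4_size = sum(f.stat().st_size for f in ONNX_INT4_PATH.glob("model_quantized.onnx*")) / 1e6
    original_size = sum(f.stat().st_size for f in ONNX_PATH.glob(f"{model_file.name}*")) / 1e6
    compression = (1 - int4_size / original_size) * 100

    print("\n✅ INT4 quantization complete!")
    print(f"   Output: {output_file}")
    print(f"   Size: {int4_size:.1f} MB")
    print(f"   Compression vs FP32 ONNX: {compression:.1f}% reduction")
else:
    print("❌ No ONNX model file found!")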

In [None]:
# Copy tokenizer files for browser
print("📋 Preparing tokenizer files for browser...")

tokenizer_files = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "config.json",
]

for fname in tokenizer_files:
    # Prefer the merged model's copy, then fall back to the ONNX export directory
    src = next((d / fname for d in (MERGED_PATH, ONNX_PATH) if (d / fname).exists()), None)
    if src is not None:
        shutil.copy(src, ONNX_INT4_PATH / fname)
        print(f"   ✅ Copied {fname}")
    else:
        print(f"   ⚠️ Not found: {fname}")

print("\n✅ Tokenizer files prepared")

---

## Part 4: Verify Quantized Model

In [None]:
import onnxruntime as ort

print("🧪 Testing quantized ONNX model...")

# Load the quantized model
quantized_model_path = ONNX_INT4_PATH / "model_quantized.onnx"

if quantized_model_path.exists():
    # Create ONNX Runtime session
    session_options = ort.SessionOptions()
    session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    # Request only the providers this onnxruntime build offers, preferring CUDA
    available = ort.get_available_providers()
    providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider") if p in available]
    session = ort.InferenceSession(
        str(quantized_model_path),
        sess_options=session_options,
        providers=providers,
    )

    print("✅ Model loaded successfully")
    print(f"   Provider: {session.get_providers()[0]}")

    # Show input/output info
    print("\n📊 Model Info:")
    print("   Inputs:")
    for inp in session.get_inputs():
        print(f"      - {inp.name}: {inp.shape}")
    print("   Outputs:")
    for out in session.get_outputs():
        print(f"      - {out.name}: {out.shape}")
else:
    print(f"❌ Quantized model not found at {quantized_model_path}")
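
The cell above confirms only that the quantized graph loads. One way to put a number on quality is to compare next-token predictions from the FP32 and INT4 models on the same prompts. The sketch below assumes you have already collected those logits as plain lists of floats; `fp32_logits`, `int4_logits`, and `compare_logits` are hypothetical names, and the toy values stand in for real model outputs.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def compare_logits(fp32_logits, int4_logits):
    """Compare two batches of next-token logits (one list of floats per prompt)."""
    matches = 0
    max_abs_diff = 0.0
    for ref, quant in zip(fp32_logits, int4_logits):
        if argmax(ref) == argmax(quant):
            matches += 1
        max_abs_diff = max(max_abs_diff, max(abs(a - b) for a, b in zip(ref, quant)))
    return {
        "top1_agreement": matches / len(fp32_logits),
        "max_abs_diff": max_abs_diff,
    }

# Toy data: 3 prompts, vocabulary of 4 tokens
fp32_logits = [[2.0, 0.5, -1.0, 0.1], [0.2, 3.1, 0.0, -0.4], [1.0, 0.9, 0.3, 0.2]]
int4_logits = [[1.8, 0.6, -0.9, 0.1], [0.3, 2.9, 0.1, -0.5], [0.8, 1.0, 0.3, 0.2]]

report = compare_logits(fp32_logits, int4_logits)
print(f"Top-1 agreement: {report['top1_agreement']:.0%}")
print(f"Largest logit difference: {report['max_abs_diff']:.2f}")
```

In the toy data the third prompt flips its top token because the two leading logits were already close, which is the typical way quantization error shows up. Reading a handful of generated answers side by side remains a useful complement to a metric like this.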

---

## Part 5: Size Comparison

In [None]:
# Compare sizes at each stage

def get_dir_size(path: Path) -> float:
    """Get total size of directory in MB."""
    if not path.exists():
        return 0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

print("📊 SIZE COMPARISON")
print("="*70)

stages = [
    ("Merged Model (BF16)", MERGED_PATH),
    ("ONNX (FP32)", ONNX_PATH),
    ("ONNX INT4 (Browser)", ONNX_INT4_PATH),
]

sizes = []
for name, path in stages:
    size = get_dir_size(path)
    sizes.append(size)
    print(f"   {name:<25} {size:>8.1f} MB")

if sizes[0] > 0:
    print("-"*70)
    final_compression = (1 - sizes[-1] / sizes[0]) * 100
    print(f"   {'Total Compression':<25} {final_compression:>8.1f}%")
    print(f"\n✅ Model is now browser-ready at ~{sizes[-1]:.0f} MB!")

---

## Part 6: Prepare for Browser Deployment

In [None]:
# Create browser deployment package

print("📦 Creating browser deployment package...")

# Copy all needed files to browser directory
for f in ONNX_INT4_PATH.iterdir():
    if f.is_file():
        shutil.copy(f, BROWSER_PATH / f.name)

# Create model config for Transformers.js
browser_config = {
    "model_type": "gemma",
    "quantization": "int4",
    "framework": "onnx",
    "runtime": "transformers.js",
    "files": [f.name for f in BROWSER_PATH.iterdir() if f.is_file()],
    "usage": {
        "device": "webgpu",
        "dtype": "q4",
    },
}

with open(BROWSER_PATH / "browser_config.json", 'w') as f:
    json.dump(browser_config, f, indent=2)

print(f"\n✅ Browser package ready at {BROWSER_PATH}")
print("\n📁 Files for deployment:")
total_size = 0
for f in sorted(BROWSER_PATH.iterdir()):
    if f.is_file():
        size = f.stat().st_size / 1e6
        total_size += size
        print(f"   {f.name}: {size:.1f} MB")
print(f"\n   Total: {total_size:.1f} MB")
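
Before moving on, it can help to confirm that nothing is missing from the package. A minimal sketch, assuming the required set is the quantized model plus the tokenizer and config files named earlier in this lab (adjust `REQUIRED_FILES`, a hypothetical name, to whatever your browser loader actually requests); the demo runs against a temporary directory rather than `BROWSER_PATH`:

```python
import tempfile
from pathlib import Path

# Assumption: file names taken from the earlier cells of this lab
REQUIRED_FILES = [
    "model_quantized.onnx",
    "tokenizer.json",
    "tokenizer_config.json",
    "config.json",
]

def missing_files(package_dir: Path, required=REQUIRED_FILES):
    """Return the required file names that are absent from package_dir."""
    return [name for name in required if not (package_dir / name).is_file()]

# Demo on a throwaway directory with one file deliberately left out
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ["model_quantized.onnx", "tokenizer.json", "config.json"]:
        (demo_dir / name).touch()
    demo_missing = missing_files(demo_dir)

print("Missing:", demo_missing)
```

In the notebook you would call `missing_files(BROWSER_PATH)` and expect an empty list.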

---

## Common Issues

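### Issue 1: ONNX Export Fails
**Symptom:** Error during export  
**Fix:** Check that optimum supports the model architecture, then update it (`pip install -U "optimum[exporters]"`)

### Issue 2: Quantization Quality Loss
**Symptom:** Output quality significantly degraded  
**Fix:** Use finer-grained quantization (per-channel, or a smaller block size for block-wise quantizers) and confirm that the FP32 ONNX model itself produces good output

### Issue 3: Large File Size
**Symptom:** INT4 model still too big  
**Fix:** Verify that quantization was actually applied (the output should be far smaller than the FP32 export), or consider a smaller base model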
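
For Issue 3, one way to check whether quantization was applied is to count the operator types in the graph. The sketch below uses stand-in nodes so it runs without a model; with the `onnx` package installed you would pass `onnx.load(path, load_external_data=False).graph.node` instead. Which operator names indicate success depends on the quantizer: ONNX Runtime's weight-only 4-bit quantizer emits `MatMulNBits`, while dynamic 8-bit quantization emits `MatMulInteger`. `count_op_types` is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace

def count_op_types(nodes):
    """Tally the op_type of every node in an ONNX graph."""
    return Counter(node.op_type for node in nodes)

# Stand-in nodes for illustration; real nodes come from model.graph.node
demo_nodes = [
    SimpleNamespace(op_type=t)
    for t in ["MatMulNBits", "MatMulNBits", "Add", "MatMul"]
]

counts = count_op_types(demo_nodes)
for op_type, n in counts.most_common():
    print(f"{op_type}: {n}")
```

If almost every entry is still a plain `MatMul`, the weights were not quantized and the file size will match the FP32 export.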

---

## Metrics & Outputs

| Metric | Expected | Actual |
|--------|----------|--------|
| ONNX Size (FP32) | ~4 GB | [Your value] |
| ONNX INT4 Size | ~500 MB | [Your value] |
| Size Reduction (vs. merged BF16) | ~75% | [Your value] |
| Export Time | ~10 min | [Your value] |

---

## Phase Complete!

You've achieved:
- ✅ Exported model to ONNX format
- ✅ Applied INT4 quantization (browser-compatible)
- ✅ Prepared tokenizer files
- ✅ Created browser deployment package

**Next:** [Lab 4.6.8.5: Browser Integration](./lab-4.6.8.5-browser-integration.ipynb)

---

In [None]:
# Cleanup
import gc

# Release the ONNX Runtime session if Part 4 created one
if "session" in globals():
    del session

# Clean up intermediate ONNX files to save disk space (optional)
# Uncomment if you want to remove the FP32 ONNX to save ~4GB
# import shutil
# if ONNX_PATH.exists():
#     shutil.rmtree(ONNX_PATH)
#     print(f"🧹 Removed intermediate ONNX files at {ONNX_PATH}")

# Clear GPU memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

# Force garbage collection
gc.collect()

print("✅ Phase 4 Complete!")
print("\n📊 Final Summary:")
print(f"   Browser-ready model: {BROWSER_PATH}")
# Report the measured size instead of a hardcoded estimate
browser_size_mb = sum(f.stat().st_size for f in BROWSER_PATH.iterdir() if f.is_file()) / 1e6
print(f"   Size on disk: {browser_size_mb:.0f} MB (expected ~500 MB)")
print("\n🎯 Next Steps:")
print("   1. Verify INT4 model size is ~500MB")
print("   2. Check all tokenizer files are present")
print("   3. Proceed to Lab 4.6.8.5 for browser integration")
print("\n💡 Tip: You can delete the intermediate ONNX (FP32) files to save ~4GB")