# 📦 LMFast: Quantization & Export

**Shrink your models for deployment while keeping quality loss small!**

## What You'll Learn
- 4-bit (NF4, the format popularized by QLoRA) vs 8-bit quantization
- Export to GGUF (for llama.cpp / Ollama)
- Export to ONNX (for standard runtimes)
- Understand AWQ vs GPTQ

## Quick Guide
| Format | Best For | Speed | Size |
|--------|----------|-------|------|
| **GGUF** | CPU / Mac / Edge | ⭐⭐⭐ | ⭐⭐⭐ |
| **Int4** | GPU Serving | ⭐⭐⭐ | ⭐⭐⭐ |
| **ONNX** | Browser / Web | ⭐⭐ | ⭐⭐ |

**Time to complete:** ~10 minutes
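
As a rough sanity check on the Size column, weight storage scales with the number of bits per parameter. The sketch below estimates weight-only size for a model of about 135M parameters (an assumption taken from the "135M" in the name of the model used in this notebook). Real files come out larger because of quantization scales, layers kept at higher precision, and metadata.

```python
def estimate_weight_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in MiB, ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / (1024 ** 2)


NUM_PARAMS = 135_000_000  # assumed from the "135M" in the model name

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{estimate_weight_size_mb(NUM_PARAMS, bits):.0f} MiB")
```

Halving the bit width halves the estimate, which is why 4-bit formats are attractive for edge devices.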

## 1️⃣ Setup

In [None]:
!pip install -q "lmfast[all]"

import lmfast
lmfast.setup_colab_env()

import torch
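# get_device_name(0) raises on CPU-only runtimes, so check first
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU: none detected (running on CPU)")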

## 2️⃣ Load a Model

We'll use a small model for demonstration.

In [None]:
# Using the base model for export demos
MODEL_ID = "HuggingFaceTB/SmolLM-135M-Instruct"

# You can also point to your locally trained model:
# MODEL_ID = "./my_first_slm"

## 3️⃣ Export to GGUF (llama.cpp)

GGUF is the file format used by llama.cpp and a common choice for running LLMs on consumer hardware (MacBooks, Android phones, Raspberry Pi).

In [None]:
import os

from lmfast.inference import export_gguf

print("📦 Exporting to GGUF (q4_k_m)...")
print("Note: This requires 'llama.cpp' built or installed in environment.")
print("LMFast attempts to use the python bindings or system binary.")

try:
    export_gguf(
        model_path=MODEL_ID,
        output_path="./smollm-135m-q4.gguf",
        quantization="q4_k_m"  # Balanced 4-bit quantization
    )
    print("✅ GGUF Export Successful!")

    # Check size
    size_mb = os.path.getsize("./smollm-135m-q4.gguf") / 1024 / 1024
    print(f"File Size: {size_mb:.2f} MB")

except Exception as e:
    print(f"⚠️ GGUF Export skipped/failed: {e}")
    print("Run 'pip install llama-cpp-python' or install system tools.")
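
A non-zero file size does not prove the export is a valid GGUF file. GGUF files begin with the 4-byte magic `GGUF` followed by a little-endian 32-bit format version, so a quick header check is possible with the standard library alone. The helper below is a hypothetical utility, not part of LMFast, and it only inspects the first 8 bytes.

```python
import os
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


GGUF_PATH = "./smollm-135m-q4.gguf"  # path used in the export cell
if os.path.exists(GGUF_PATH):
    print(f"GGUF format version: {read_gguf_version(GGUF_PATH)}")
else:
    print("No GGUF file found; skipping header check.")
```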

## 4️⃣ In-Place Quantization (Int4 / Int8)

If you want to serve the model from Python (transformers/bitsandbytes), you can save a quantized copy.

In [None]:
from lmfast.inference import quantize_model

print("⚖️ Quantizing to 4-bit (NF4)...")

quantize_model(
    model_path=MODEL_ID,
    output_dir="./smollm-int4",
    method="int4"  # Uses bitsandbytes NF4
)

print("✅ Int4 Model Saved!")
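
To make "4-bit" concrete, here is a simplified sketch of blockwise absmax quantization in plain Python. It is not the NF4 scheme that bitsandbytes implements (NF4 uses non-uniformly spaced levels), and it says nothing about how `quantize_model` works internally. It only illustrates the general idea of storing one scale per block plus a small integer code per weight. The function names and sample values are made up for illustration.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Map floats to signed 4-bit codes in [-7, 7] with one shared scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate floats from the scale and the 4-bit codes."""
    return [c * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.00]  # illustrative weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(f"codes: {codes}")
print(f"max round-trip error: {max_error:.4f} (at most scale / 2 = {scale / 2:.4f})")
```

The round-trip error per weight is bounded by half the scale, so blocks with one large outlier lose more precision on their small weights. That is one reason real schemes use small blocks and non-uniform levels.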

## 5️⃣ Export to ONNX

ONNX is a good fit when the model has to run outside Python, for example from C#, Java, or a web runtime.

In [None]:
from lmfast.deployment import export_for_browser

# The browser exporter handles ONNX conversion internally
artifacts = export_for_browser(
    model_path=MODEL_ID,
    output_dir="./onnx_model",
    target="onnx",
    quantization="int8", # Quantize weights for size
    create_demo=False
)

print(f"✅ ONNX paths: {artifacts}")
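
The structure of the value returned by `export_for_browser` is not described in this notebook, so a neutral way to see what was produced is to walk the output directory. The helper below is a hypothetical utility, not part of LMFast, that lists every file under a directory with its size.

```python
from pathlib import Path


def list_export_files(output_dir: str) -> list[tuple[str, float]]:
    """Return (relative path, size in MiB) for every file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    files = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size_mb = path.stat().st_size / (1024 ** 2)
            files.append((path.relative_to(root).as_posix(), size_mb))
    return files


for name, size_mb in list_export_files("./onnx_model"):
    print(f"{name}: {size_mb:.2f} MiB")
```

The same helper can be pointed at `./smollm-int4` to compare the footprint of the different exports.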

## 6️⃣ Verify the Quantized Model

In [None]:
from lmfast.inference import SLMServer

# Load the int4 model we saved earlier
server = SLMServer("./smollm-int4")

response = server.generate("What is the speed of light?")
print(f"Output from Int4 model: {response}")
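
A single generation confirms that the model loads and responds, but says little about speed. The sketch below times any callable over a few runs and reports the median. `time_call` is a hypothetical helper, not part of LMFast, and the commented usage assumes the `server` object created in the cell above.

```python
import time
from statistics import median


def time_call(fn, *args, repeats: int = 3, **kwargs) -> float:
    """Return the median wall-clock seconds of calling fn(*args, **kwargs)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return median(durations)


# Usage with the server loaded above:
# seconds = time_call(server.generate, "What is the speed of light?")
# print(f"Median latency: {seconds:.2f} s")
```

Using the median rather than the mean keeps a slow first call (warm-up, cache loading) from dominating the result.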

## 🎉 Summary

You've learned how to:
- ✅ Create GGUF files for edge deployment
- ✅ Save 4-bit/8-bit models for Python serving
- ✅ Export ONNX models for interoperability

### Next Steps
- `15_browser_deployment.ipynb`: Use the ONNX model in a web app!
- `16_edge_deployment.ipynb`: Run the GGUF model on a Raspberry Pi