# GLM-4.7-Flash-MoE Quickstart

This notebook demonstrates end-to-end inference with **GLM-4.7-Flash**, a Mixture of Experts (MoE) model, using Metal Marlin's FP4 quantization on Apple Silicon.

## Overview

- **Model**: `zai-org/GLM-4.7-Flash` (MoE architecture)
- **Quantization**: 4-bit FP4 with group size 128
- **Device**: Apple Metal (MPS)
- **Features**: Streaming inference, memory tracking, token metrics

## Requirements

- macOS 13.0+ with Apple Silicon
- Python 3.11 or 3.12
- Dependencies installed with `uv sync --extra all`

## Setup

In [None]:
import time

import torch
from transformers import AutoTokenizer

from metal_marlin.inference.pipeline_v2 import TransformersMarlinPipeline
from metal_marlin.transformers_loader import load_and_quantize

## Load and Quantize Model

This step:
1. Downloads the GLM-4.7-Flash model from Hugging Face
2. Replaces `nn.Linear` layers with `MetalQuantizedLinear`
3. Quantizes weights to FP4 format (4 bits per parameter)

**Expected**: ~3-5 GB memory usage after quantization (vs ~9 GB for BF16)
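
As a rough check on the expected footprint, the storage cost of group-wise quantization can be estimated by hand. The sketch below assumes one 16-bit scale per group and no zero points; Metal Marlin's actual FP4 layout may differ, and `effective_bits` and `quantized_gib` are hypothetical helper names, not part of the library.

```python
def effective_bits(bits: int = 4, group_size: int = 128, scale_bits: int = 16) -> float:
    """Average bits stored per weight: payload plus the amortized per-group scale.

    Assumption: one scale of `scale_bits` bits per group and no zero point.
    """
    return bits + scale_bits / group_size


def quantized_gib(num_params: float, **kwargs) -> float:
    """Approximate size in GiB of `num_params` quantized weights."""
    return num_params * effective_bits(**kwargs) / 8 / 1024**3


print(f"Bits per weight: {effective_bits():.3f}")
print(f"Compression vs. 16-bit weights: {16 / effective_bits():.2f}x")
```

Under these assumptions, a smaller group size raises the per-weight overhead (4.25 bits at group size 64) in exchange for finer-grained scales. Layers that the loader skips stay at their original precision, so the measured compression ratio can be lower than this estimate.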

In [None]:
MODEL_NAME = "zai-org/GLM-4.7-Flash"
BITS = 4
GROUP_SIZE = 128
FORMAT = "fp4"

print(f"Loading {MODEL_NAME}...")
model, stats = load_and_quantize(
    MODEL_NAME,
    bits=BITS,
    group_size=GROUP_SIZE,
    format=FORMAT,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

print("\n✅ Quantization Complete")
print(f"  Quantized layers: {stats.get('quantized_count', 'N/A')}")
print(f"  Skipped layers: {stats.get('skipped_count', 'N/A')}")
print(f"  Compression ratio: {stats.get('compression_ratio', 0):.2f}x")
print(f"  Original size: {stats.get('original_bytes', 0) / 1024**3:.2f} GB")
print(f"  Quantized size: {stats.get('quantized_bytes', 0) / 1024**3:.2f} GB")

if torch.backends.mps.is_available():
    print(f"  MPS memory: {torch.mps.current_allocated_memory() / 1024**3:.2f} GB")

## Create Pipeline

The pipeline handles tokenization, generation, and streaming.

In [None]:
pipeline = TransformersMarlinPipeline(model, tokenizer)
print("✅ Pipeline ready")

## Single Prompt Inference

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a Mixture of Experts model is in 2 sentences."},
]

print("Assistant: ", end="", flush=True)

# Stream tokens
start = time.perf_counter()
response = ""
for token in pipeline.chat(messages, max_tokens=256, temperature=0.7, stream=True):
    print(token, end="", flush=True)
    response += token

elapsed = time.perf_counter() - start
token_count = len(tokenizer.encode(response, add_special_tokens=False))

print(f"\n\n📊 Metrics: {token_count} tokens in {elapsed:.2f}s ({token_count/elapsed:.1f} tok/s)")
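
The timing above covers the whole call, so the wait before the first token is folded into the tok/s figure. A minimal sketch of separating the two is shown below. `timed_stream` is a hypothetical helper, not part of Metal Marlin, and it counts streamed chunks, which is only an approximation of the token count. The fake stream stands in for `pipeline.chat(..., stream=True)`.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], stats: dict) -> Iterator[str]:
    """Yield chunks unchanged; fill `stats` once the stream is exhausted."""
    start = time.perf_counter()
    first = None  # timestamp of the first chunk
    count = 0
    for chunk in chunks:
        if first is None:
            first = time.perf_counter()
        count += 1
        yield chunk
    end = time.perf_counter()
    stats["chunks"] = count
    stats["ttft_s"] = None if first is None else first - start
    stats["total_s"] = end - start
    # Decode rate excludes the wait for the first chunk.
    decode_s = 0.0 if first is None else end - first
    stats["decode_chunks_per_s"] = (count - 1) / decode_s if count > 1 and decode_s > 0 else None


def fake_stream() -> Iterator[str]:
    time.sleep(0.05)  # stands in for prompt processing
    for piece in ["Hello", ",", " world"]:
        yield piece


stats = {}
text = "".join(timed_stream(fake_stream(), stats))
print(text)
print(f"Time to first chunk: {stats['ttft_s']:.3f}s, total: {stats['total_s']:.3f}s")
```

To use it with the pipeline, wrap the generator: `for token in timed_stream(pipeline.chat(messages, max_tokens=256, temperature=0.7, stream=True), stats): ...`.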

## Multi-Turn Conversation

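Demonstrates context retention across turns: each reply is appended to `history` before the next user message, so the second question can refer to the first answer.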

In [None]:
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

print("Turn 1")
print("User:", history[-1]["content"])
print("Assistant: ", end="", flush=True)

response1 = ""
for token in pipeline.chat(history, max_tokens=128, temperature=0.7, stream=True):
    print(token, end="", flush=True)
    response1 += token
print()

history.append({"role": "assistant", "content": response1})
history.append({"role": "user", "content": "What's a famous landmark there?"})

print("\nTurn 2")
print("User:", history[-1]["content"])
print("Assistant: ", end="", flush=True)

response2 = ""
for token in pipeline.chat(history, max_tokens=128, temperature=0.7, stream=True):
    print(token, end="", flush=True)
    response2 += token
print()
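
The two turns above repeat the same append-stream-append pattern, which could be factored into a helper. The sketch below is one way to do that; `run_turn` is a hypothetical name, and `echo_chat` is a stand-in so the example runs without a model.

```python
from typing import Callable, Iterable


def run_turn(chat_fn: Callable[[list], Iterable[str]], history: list, user_text: str) -> str:
    """Append a user message, stream the reply, and record it in `history`."""
    history.append({"role": "user", "content": user_text})
    reply = ""
    for token in chat_fn(history):
        print(token, end="", flush=True)
        reply += token
    print()
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_chat(messages: list) -> Iterable[str]:
    # Stand-in for pipeline.chat: reports how many messages it was given.
    yield f"seen {len(messages)} messages"


history = [{"role": "system", "content": "You are a helpful assistant."}]
run_turn(echo_chat, history, "What's the capital of France?")
run_turn(echo_chat, history, "What's a famous landmark there?")
```

With the real pipeline, `chat_fn` could be `lambda h: pipeline.chat(h, max_tokens=128, temperature=0.7, stream=True)`.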

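## Multiple Prompts (Sequential)

Run several independent prompts one after another. Each prompt gets a fresh message list, so no context carries over between them. The loop streams each response in turn; it does not batch the prompts in parallel.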

In [None]:
prompts = [
    "Write a haiku about AI.",
    "Name 3 programming languages.",
    "What is 15 * 23?",
]

for i, prompt in enumerate(prompts, 1):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    print(f"\nPrompt {i}: {prompt}")
    print("Response: ", end="", flush=True)
    
    response = ""
    for token in pipeline.chat(messages, max_tokens=128, temperature=0.7, stream=True):
        print(token, end="", flush=True)
        response += token
    print()

## Memory Usage Tracking

In [None]:
if torch.backends.mps.is_available():
    current = torch.mps.current_allocated_memory() / 1024**3
    driver = torch.mps.driver_allocated_memory() / 1024**3
    print(f"MPS Current: {current:.2f} GB")
    print(f"MPS Driver: {driver:.2f} GB")
else:
    print("MPS not available")

## Performance Benchmark

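Measure end-to-end throughput for three generation lengths. Each timing covers the whole `pipeline.chat` call, so prompt processing is included. The token count comes from re-encoding the response text, so it may differ slightly from the number of tokens actually generated.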

In [None]:
test_configs = [
    {"prompt": "Hi", "max_tokens": 64, "label": "Short (64 tokens)"},
    {"prompt": "Explain quantum computing", "max_tokens": 128, "label": "Medium (128 tokens)"},
    {"prompt": "Write a story about a robot", "max_tokens": 256, "label": "Long (256 tokens)"},
]

print("\n🏁 Benchmark Results\n")
print(f"{'Config':<25} {'Tokens':<10} {'Time (s)':<12} {'Throughput (tok/s)':<20}")
print("-" * 70)

for config in test_configs:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": config["prompt"]},
    ]
    
    start = time.perf_counter()
    response = ""
    for token in pipeline.chat(messages, max_tokens=config["max_tokens"], temperature=0.7, stream=True):
        response += token
    elapsed = time.perf_counter() - start
    
    token_count = len(tokenizer.encode(response, add_special_tokens=False))
    throughput = token_count / elapsed
    
    print(f"{config['label']:<25} {token_count:<10} {elapsed:<12.2f} {throughput:<20.1f}")
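
A single timed run per configuration can be noisy, and the first run may include one-time costs. That is an assumption here, not a measured property of this pipeline. A minimal sketch of a repeated measurement with a discarded warm-up run follows; `bench` is a hypothetical helper, and `fake_generation` stands in for one full `pipeline.chat` call that returns its token count.

```python
import statistics
import time
from typing import Callable


def bench(run_once: Callable[[], int], warmup: int = 1, repeats: int = 3) -> dict:
    """Time `run_once`, which performs one generation and returns its token count."""
    for _ in range(warmup):
        run_once()  # result discarded
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        tokens = run_once()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(tokens / elapsed)
    return {
        "runs": repeats,
        "median_tok_s": statistics.median(rates),
        "min_tok_s": min(rates),
        "max_tok_s": max(rates),
    }


def fake_generation() -> int:
    time.sleep(0.01)  # stands in for model latency
    return 64


result = bench(fake_generation)
print(f"Median: {result['median_tok_s']:.1f} tok/s over {result['runs']} runs")
```

Reporting the median together with the minimum and maximum makes it easier to tell a real difference between configurations from run-to-run variation.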

## Next Steps

1. **Try different quantization formats**: `fp8`, `int4`, `int3`, `int2`
2. **Experiment with group sizes**: 64, 128, 256
3. **Test other models**: Any Hugging Face model that loads with `AutoModelForCausalLM`
4. **Deploy as API server**: See `examples/perf_dashboard.py` for OpenAI-compatible serving

## CLI Alternative

For command-line usage:

```bash
# Single prompt
python examples/glm4_flash_inference.py --prompt "Hello, how are you?"

# Interactive mode
python examples/glm4_flash_inference.py --interactive

# Custom parameters
python examples/glm4_flash_inference.py \
  --prompt "Explain AI" \
  --max-tokens 512 \
  --temperature 0.9 \
  --top-p 0.95
```

## Documentation

- **README**: `README.md`
- **API Reference**: `docs/api_reference.md`
- **Benchmarks**: `benchmarks/`
- **Tests**: `tests/`