# GLM-4.7-Flash MoE Quickstart

This notebook demonstrates how to run inference with the **GLM-4.7-Flash** MoE (Mixture of Experts) model using **Metal Marlin** on Apple Silicon (M-series chips).

## Prerequisites

1.  **Install Metal Marlin**:
    Ensure you are in the `contrib/metal_marlin` directory and have installed dependencies.
    ```bash
    cd contrib/metal_marlin
    uv sync --extra all
    ```

2.  **Download/Quantize Model**:
    You need a quantized version of the model. We use **Uniform 4-bit Trellis quantization**, which enables fast fused kernels on Apple Silicon.
    Run the following command to download and quantize the model (this may take ~25 minutes):
    ```bash
    # Run from contrib/metal_marlin directory
    uv run python scripts/quantize_uniform_metal.py --output models/GLM-4.7-Flash-Trellis-Uniform4
    ```


In [None]:
import os
import sys
from pathlib import Path
import torch
from transformers import AutoTokenizer

# Add the project root (the parent of this notebook's directory) to sys.path so that
# metal_marlin can be imported even when it is not installed as a package.
# Adjust this path if you run the notebook from a different location.
project_root = Path("..").resolve()
sys.path.append(str(project_root))

try:
    from metal_marlin.trellis import TrellisForCausalLM
except ImportError:
    # Fallback if running from a different context
    sys.path.append(str(project_root.parent))
    from metal_marlin.trellis import TrellisForCausalLM

# Prefer the MPS (Metal Performance Shaders) backend; fall back to CPU if it is unavailable
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

## Load Model and Tokenizer

We load the Trellis-quantized model produced in the prerequisites. It is optimized for Apple Silicon and stores its weights with uniform 4-bit quantization.
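
As a rough guide to why 4-bit weights matter on unified-memory machines, the storage arithmetic can be sketched as below. This is a back-of-the-envelope estimate only: the helper `estimate_weight_gib` is hypothetical, the parameter count is an illustrative placeholder rather than a figure from this notebook, and the estimate ignores quantization metadata, activations, and the KV cache.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative parameter count only; substitute the real figure for your checkpoint.
example_params = 30e9

for bits in (16, 4):
    gib = estimate_weight_gib(example_params, bits)
    print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```

Whatever the real parameter count is, moving from 16-bit to 4-bit weights cuts this estimate by a factor of four.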

In [None]:
# Path to your quantized model
# Assuming the notebook is in contrib/metal_marlin/examples/
model_path = project_root / "models" / "GLM-4.7-Flash-Trellis-Uniform4"

if not model_path.exists():
    print(f"Warning: Model not found at {model_path}")
    print("Please follow the prerequisites to quantize the model first.")
else:
    print(f"Loading model from {model_path}...")
    try:
        model = TrellisForCausalLM.from_pretrained(str(model_path), device=device)
        print("Model loaded successfully.")
    except Exception as e:
        print(f"Error loading model: {e}")

# Load the tokenizer from the original Hugging Face repository
tokenizer_id = "zai-org/GLM-4.7-Flash"
print(f"Loading tokenizer from {tokenizer_id}...")
try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id, trust_remote_code=True)
    print("Tokenizer loaded successfully.")
except Exception as e:
    print(f"Error loading tokenizer: {e}")

## Model Inspection

Let's inspect the model architecture to confirm the MoE configuration.
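
For readers new to MoE, the relationship between the total expert count and the number of experts used per token can be illustrated with a toy top-k router. This is a minimal sketch in pure Python that assumes softmax gating with renormalization; the function `route_top_k` is hypothetical, and the model's actual router may use a different gating function.

```python
import math


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Four experts in total, two active for this token.
logits = [0.1, 2.0, -1.0, 1.5]
selected = route_top_k(logits, k=2)
print(selected)
```

Only the selected experts run for a given token, which is why the per-token compute cost is much lower than the total parameter count suggests.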

In [None]:
if 'model' in globals():
    config = model.config
    print(f"Model Architecture: {getattr(config, 'architectures', 'Unknown')}")
    print(f"Hidden Size: {getattr(config, 'hidden_size', 'Unknown')}")
    print(f"Num Layers: {getattr(config, 'num_hidden_layers', 'Unknown')}")
    print(f"Num Heads: {getattr(config, 'num_attention_heads', 'Unknown')}")
    
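    # Check for MoE-specific attributes. Attribute names vary between config classes,
    # so try the common spellings for the total expert count.
    num_experts = getattr(config, "num_experts", None) or getattr(config, "n_routed_experts", None)
    if num_experts:
        print(f"Num Experts: {num_experts}")

    # Experts activated per token (distinct from the total number of routed experts)
    num_experts_per_tok = getattr(config, "num_experts_per_tok", None)
    if num_experts_per_tok:
        print(f"Experts per Token: {num_experts_per_tok}")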


## Generation with Chat Template

GLM models expect prompts formatted with their own chat template, so we let the tokenizer's `apply_chat_template` build the prompt instead of formatting it by hand.
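
The chat template consumes a list of `{"role", "content"}` dictionaries. A minimal sketch of building a multi-turn conversation in that shape follows; the helper `add_turn` and the set of allowed role names are assumptions made for illustration, not part of Metal Marlin or the tokenizer.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role names


def add_turn(messages, role, content):
    """Return a new message list with one more turn appended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"Unexpected role: {role!r}")
    return messages + [{"role": role, "content": content}]


history = []
history = add_turn(history, "user", "What is a Mixture of Experts model?")
history = add_turn(history, "assistant", "A model that routes each token to a few expert sub-networks.")
history = add_turn(history, "user", "Why does that make inference cheaper?")

print(len(history), history[-1]["role"])
```

A list built this way can be passed as `messages` to `generate_chat`, so the model sees the earlier turns as context.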

In [None]:
def generate_chat(messages, max_new_tokens=512, temperature=0.7):
    if 'model' not in globals() or 'tokenizer' not in globals():
        print("Model or tokenizer not loaded.")
        return

    # Apply chat template
    if hasattr(tokenizer, "apply_chat_template"):
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        # Fallback for simple generation if no template
        prompt = messages[-1]["content"]
    
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    print(f"Generating response for prompt length: {inputs.input_ids.shape[1]} tokens")

    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=max_new_tokens,
            do_sample=(temperature > 0),
            temperature=temperature,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id
        )
        
    # Decode only the new tokens
    new_tokens = outputs[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example Chat
messages = [
    {"role": "user", "content": "Explain the concept of Mixture of Experts (MoE) in simple terms."}
]

print("User:", messages[0]["content"])
response = generate_chat(messages)
print("\nAssistant:", response)
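
## Sampling Temperature

`generate_chat` switches between greedy decoding and sampling with `do_sample=(temperature > 0)`. The toy sketch below shows the idea behind that switch in pure Python; the function `sample_token` is hypothetical and is not how Metal Marlin implements generation.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index: argmax when temperature is 0, otherwise sample."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_token(logits, temperature=0.0))  # greedy: index of the largest logit
print(sample_token(logits, temperature=0.7, rng=random.Random(0)))
```

Passing `temperature=0` to `generate_chat` therefore requests deterministic output, while larger values trade determinism for variety.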
