# 🤖 MAI-UI GUI Agent: GPU-Optimized vLLM Deployment

## Complete Guide for T4, A100, H100, and B200

This notebook shows how to run [MAI-UI](https://www.alphaxiv.org/abs/2512.22047), a state-of-the-art GUI agent built on Qwen3-VL, with vLLM settings tuned for different NVIDIA GPUs.

---

### What is MAI-UI?

MAI-UI (Multimodal AI UI) is a family of foundation GUI agents from Tongyi Lab (Alibaba) that:
- Uses **Qwen3-VL** as its vision-language backbone
- Is trained with **GRPO** (Group Relative Policy Optimization) reinforcement learning
- Achieves **76.7%** on AndroidWorld (SOTA) and **73.5%** on ScreenSpot-Pro
- Supports device-cloud collaboration for privacy + accuracy

### Why GPU-Specific Optimization Matters

```
GPU Memory ──► Model Size ──► Precision ──► Performance
     │              │             │              │
     └──────────────┴─────────────┴──────────────┘
         Wrong choices = OOM or 10x slower
```

| GPU | Key Constraint | Solution | Expected Latency |
|-----|----------------|----------|------------------|
| T4 (16GB) | Memory-limited | Small model (2B) in FP16, small batch | ~1-2s/action |
| A100 (80GB) | Bandwidth-limited | BF16, FlashAttention 2 | ~300-500ms |
| H100 (80GB) | Compute-limited | FP8, FlashAttention 3 | ~150-300ms |
| B200 (192GB) | None practical | Full model, max concurrency | ~100-200ms |


## 📚 Architecture: Why Qwen3-VL Excels at GUI Tasks

### DeepStack: Multi-Scale Visual Features

GUI elements such as buttons can be tiny (about 20×20 pixels on a 1080×2400 screen), and standard vision encoders tend to lose this detail.

```
┌────────────────────────────────────────────────────────────────────────────────┐
│  DEEPSTACK: Qwen3-VL's Secret Weapon for GUI Agents                            │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│  Vision Layer 8:  Edge detection ──────────────────────────────► LLM Layer 0   │
│                    (button borders, icon edges)                                │
│                                                                                │
│  Vision Layer 16: Shape patterns ──────────────────────────────► LLM Layer 1   │
│                    (UI component shapes)                                       │
│                                                                                │
│  Vision Layer 24: Widget structures ───────────────────────────► LLM Layer 2   │
│                    (navigation bars, dialogs)                                  │
│                                                                                │
│  Vision Layer 32: Semantic meaning ────────────────────────────► Embedding     │
│                    ("Settings button", "Search field")                         │
│                                                                                │
│  RESULT: All scales available to LLM → Better at finding small UI elements!    │
└────────────────────────────────────────────────────────────────────────────────┘
```
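The injection pattern in the diagram can be sketched in a few lines of NumPy. This is a minimal illustration, not the Qwen3-VL implementation: it assumes each level's features are already projected to the LLM width and are added residually at image-token positions only. The function name `inject_deepstack`, the toy shapes, and the random features are all hypothetical, and the transformer layers that would run between injections are omitted.

```python
import numpy as np

def inject_deepstack(hidden, visual_mask, level_features):
    """Add one level of visual features to the image-token rows of `hidden`.

    hidden:         (seq_len, dim) LLM hidden states
    visual_mask:    (seq_len,) boolean, True where the token is an image token
    level_features: (num_visual_tokens, dim) features from one vision layer,
                    assumed to be projected to the LLM width already
    """
    out = hidden.copy()
    out[visual_mask] += level_features  # residual add, text rows untouched
    return out

rng = np.random.default_rng(0)
seq_len, dim = 6, 4
visual_mask = np.array([False, True, True, True, False, False])
hidden = np.zeros((seq_len, dim))

# Hypothetical per-level features; in the diagram these would come from
# vision layers 8, 16 and 24.
levels = [rng.normal(size=(3, dim)) for _ in range(3)]

for llm_layer, feats in enumerate(levels):
    # The real model would run LLM layer `llm_layer` here; omitted in this sketch.
    hidden = inject_deepstack(hidden, visual_mask, feats)

print(hidden.shape)
```

After the loop, text positions are unchanged and each image position holds the sum of all three levels, which is the "all scales available" effect the diagram describes.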

### GRPO Training: How MAI-UI Learns

```
For each screenshot + task:
  1. Sample 8 actions: [CLICK(100,200), CLICK(105,198), CLICK(500,300), ...]
  2. Compute rewards: [1.0, 1.0, 0.0, 0.0, ...] (1.0 if inside target)
  3. Normalize: advantages = (rewards - mean) / std
  4. Update: increase probability of high-advantage actions

WHY GRPO > PPO: No critic network → Less memory → Larger batches
```
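Steps 2 and 3 above can be sketched with the standard library alone. The sketch assumes a binary "click inside the target box" reward and uses the population standard deviation with a small epsilon to avoid division by zero. The target box, the sampled clicks, and both function names are hypothetical, and the policy update in step 4 is not shown.

```python
from statistics import mean, pstdev

def click_reward(x, y, box):
    """Return 1.0 if the click lands inside box = (left, top, right, bottom)."""
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

target = (90, 190, 110, 210)  # hypothetical target element
clicks = [(100, 200), (105, 198), (500, 300), (300, 700)]  # hypothetical samples

rewards = [click_reward(x, y, target) for x, y in clicks]
advantages = group_relative_advantages(rewards)

for click, r, a in zip(clicks, rewards, advantages):
    print(click, r, round(a, 3))
```

Because the baseline is the group mean, no learned critic is needed. When every sample in a group gets the same reward, all advantages are zero and that group contributes no gradient.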


In [None]:
# Cell 1: GPU Detection and Auto-Configuration
import subprocess
import sys
import os

print("=" * 80)
print("🔍 GPU DETECTION AND AUTO-CONFIGURATION")
print("=" * 80)

# Get GPU info
try:
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name,memory.total,compute_cap', '--format=csv,noheader'],
        capture_output=True, text=True
    )
    print(f"\n📊 GPU Info: {result.stdout.strip()}")
except FileNotFoundError:
    # subprocess.run raises FileNotFoundError when the binary is not on PATH
    print("❌ nvidia-smi not found!")
    sys.exit(1)

import torch
if not torch.cuda.is_available():
    print("❌ CUDA not available!")
    sys.exit(1)

gpu_name = torch.cuda.get_device_name(0)
gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
compute_cap = torch.cuda.get_device_capability(0)
sm_version = compute_cap[0] * 10 + compute_cap[1]

print(f"\n✅ GPU: {gpu_name}")
print(f"✅ Memory: {gpu_memory_gb:.1f} GB")
print(f"✅ Compute Capability: SM {sm_version}")

# Determine GPU tier
def get_gpu_tier():
    """Classify GPU into tier based on memory and compute capability."""
    if gpu_memory_gb > 150:
        return 'B200'  # Blackwell, 192GB
    elif sm_version >= 90:
        return 'H100'  # Hopper, SM 9.0
    elif sm_version >= 80:
        return 'A100'  # Ampere, SM 8.0
    else:
        return 'T4'    # Turing or older

GPU_TIER = get_gpu_tier()

tier_info = {
    'T4': {'emoji': '🟡', 'features': ['FP16 only', 'TORCH_SDPA', '320 GB/s'], 'latency': '1-2s'},
    'A100': {'emoji': '🟢', 'features': ['BF16', 'FlashAttn 2', '2,039 GB/s'], 'latency': '300-500ms'},
    'H100': {'emoji': '🔵', 'features': ['FP8', 'FlashAttn 3', '3,350 GB/s'], 'latency': '150-300ms'},
    'B200': {'emoji': '🟣', 'features': ['192GB', 'FP8/FP4', '8,000 GB/s'], 'latency': '100-200ms'}
}

info = tier_info[GPU_TIER]
print(f"\n{info['emoji']} Detected GPU Tier: {GPU_TIER}")
print(f"   Features: {', '.join(info['features'])}")
print(f"   Expected Latency: {info['latency']}")

# Install dependencies
print("\n" + "=" * 80)
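print("📦 Installing dependencies...")
# Quote the requirement so the shell does not treat ">" as an output redirect.
# vLLM 0.6.0 predates Qwen3-VL, so a newer release is likely needed in practice.
%pip install -q "vllm>=0.6.0" pillow requests jinja2
print("✅ Dependencies installed!")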


In [None]:
# Cell 2: GPU-Optimized Configuration
# Each config is tuned based on: GPU memory, compute capability, bandwidth

def get_mai_ui_config(gpu_tier: str) -> dict:
    """
    Get GPU-optimized vLLM configuration for MAI-UI.
    
    WHY DIFFERENT CONFIGS:
    - T4: 16GB, no BF16 → must use FP16 + small batch
    - A100: 80GB, BF16 + FA2 → full precision, medium batch
    - H100: 80GB, FP8 + FA3 → 2x compression, high batch
    - B200: 192GB → largest model, maximum batch
    """
    configs = {
        'T4': {
            # T4: Memory-constrained, no BF16, no FlashAttention
            'model': 'Tongyi-MAI/MAI-UI-2B',
            'trust_remote_code': True,
            'dtype': 'half',                    # FP16 only option
            'gpu_memory_utilization': 0.92,
            'max_model_len': 4096,
            'enforce_eager': True,              # Save ~0.5GB
            'max_num_seqs': 4,
            'limit_mm_per_prompt': {'image': 2, 'video': 0},
            'mm_processor_kwargs': {'min_pixels': 784, 'max_pixels': 512000},
        },
        'A100': {
            # A100: BF16 full precision, FlashAttention 2
            'model': 'Tongyi-MAI/MAI-UI-8B',
            'trust_remote_code': True,
            'dtype': 'bfloat16',
            'gpu_memory_utilization': 0.95,
            'max_model_len': 16384,
            'enforce_eager': False,
            'max_num_seqs': 16,
            'limit_mm_per_prompt': {'image': 4, 'video': 1},
            'mm_processor_kwargs': {'min_pixels': 784, 'max_pixels': 2073600, 'video_pruning_rate': 0.3},
            'enable_prefix_caching': True,
            'enable_chunked_prefill': True,
        },
        'H100': {
            # H100: FP8 weights + KV cache, FlashAttention 3
            'model': 'Tongyi-MAI/MAI-UI-8B',
            'trust_remote_code': True,
            'dtype': 'bfloat16',
            'quantization': 'fp8',              # 2x smaller
            'kv_cache_dtype': 'fp8',            # 2x smaller KV
            'gpu_memory_utilization': 0.95,
            'max_model_len': 32768,
            'enforce_eager': False,
            'max_num_seqs': 32,
            'limit_mm_per_prompt': {'image': 8, 'video': 2},
            'mm_processor_kwargs': {'min_pixels': 784, 'max_pixels': 2073600, 'video_pruning_rate': 0.3},
            'enable_prefix_caching': True,
            'enable_chunked_prefill': True,
        },
        'B200': {
            # B200: Maximum everything, 192GB VRAM
            'model': 'Tongyi-MAI/MAI-UI-32B',
            'trust_remote_code': True,
            'dtype': 'bfloat16',
            'gpu_memory_utilization': 0.95,
            'max_model_len': 65536,
            'enforce_eager': False,
            'max_num_seqs': 64,
            'limit_mm_per_prompt': {'image': 16, 'video': 4},
            'mm_processor_kwargs': {'min_pixels': 784, 'max_pixels': 4147200, 'video_pruning_rate': 0.2},
            'enable_prefix_caching': True,
            'enable_chunked_prefill': True,
        },
    }
    return configs[gpu_tier]

config = get_mai_ui_config(GPU_TIER)

print("=" * 80)
print(f"📋 CONFIGURATION FOR {GPU_TIER}")
print("=" * 80)
for key, value in config.items():
    print(f"  {key}: {value}")

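# Memory budget (rough estimates of weights plus working set).
# Note: vLLM pre-allocates KV-cache blocks up to gpu_memory_utilization,
# so nvidia-smi will typically report usage close to that fraction instead.
budgets = {
    'T4': "~11.5 GB / 16 GB used (✅ Comfortable)",
    'A100': "~44 GB / 80 GB used (✅ Plenty of room)",
    'H100': "~33 GB / 80 GB used (✅ Room for 64+ concurrent)",
    'B200': "~126 GB / 192 GB used (✅ Massive headroom)"
}
print(f"\n📊 Expected Memory: {budgets[GPU_TIER]}")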


In [None]:
# Cell 3: Initialize vLLM Engine
from vllm import LLM, SamplingParams
import time

print("=" * 80)
print(f"🚀 INITIALIZING vLLM ENGINE ({GPU_TIER})")
print("=" * 80)
print(f"\n📥 Loading model: {config['model']}")
print("   This may take several minutes on first run...")

init_start = time.time()
llm = LLM(**config)
init_time = time.time() - init_start

print(f"\n✅ Engine initialized in {init_time:.1f} seconds")

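# Memory check: query the device directly. vLLM may run the engine in a separate
# process, in which case torch.cuda.memory_allocated() here would read close to 0.
free_bytes, total_bytes = torch.cuda.mem_get_info()
used_gb = (total_bytes - free_bytes) / (1024**3)
print(f"\n📊 GPU Memory:")
print(f"   Used:  {used_gb:.2f} GB")
print(f"   Free:  {free_bytes / (1024**3):.2f} GB")
print(f"   Total: {total_bytes / (1024**3):.2f} GB")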


In [None]:
# Cell 4: MAI-UI Prompt Format and Parsing
import re
import json

MAI_SYSTEM = """You are MAI-UI, a GUI grounding agent. Given a screenshot and instruction, locate the UI element.
Output: <grounding_think>[reasoning]</grounding_think><answer>{"coordinate": [x, y]}</answer>
Coordinates: [0, 999] range where (0,0)=top-left, (999,999)=bottom-right."""

def build_prompt(instruction: str) -> str:
    """Build chat-formatted prompt with image placeholder."""
    return (
        f"<|im_start|>system\n{MAI_SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\n{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def parse_response(text: str) -> dict:
    """Parse MAI-UI's structured response."""
    result = {'thinking': None, 'coordinate': None, 'raw': text}
    
    think = re.search(r"<grounding_think>(.*?)</grounding_think>", text, re.DOTALL)
    if think:
        result['thinking'] = think.group(1).strip()
    
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer:
        try:
            data = json.loads(answer.group(1).strip())
            if 'coordinate' in data:
                x, y = data['coordinate']
                result['coordinate'] = [x / 999.0, y / 999.0]  # Normalize to [0, 1]
        except (json.JSONDecodeError, TypeError, ValueError):
            pass  # Malformed answer: leave 'coordinate' as None
    return result

print("✅ Prompt functions defined")
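The system prompt defines coordinates on a [0, 999] grid, so each prediction has to be mapped back to screen pixels before it can drive a click. A minimal sketch of that mapping follows. The helper name `to_pixels` is hypothetical, and scaling to `width - 1` (so that 999 lands on the last valid pixel index instead of one past it) is a design choice of this sketch, not something the MAI-UI documentation is known to specify.

```python
def to_pixels(coord_999, width, height):
    """Map model coordinates in [0, 999] to integer pixel indices.

    Scales to (width - 1, height - 1) so the result is always a valid index
    into an image of the given size.
    """
    x, y = coord_999
    if not (0 <= x <= 999 and 0 <= y <= 999):
        raise ValueError(f"coordinate out of range: {coord_999}")
    px = round(x / 999 * (width - 1))
    py = round(y / 999 * (height - 1))
    return px, py

# Corners and centre of a 1080x1920 screenshot
print(to_pixels((0, 0), 1080, 1920))
print(to_pixels((999, 999), 1080, 1920))
print(to_pixels((500, 500), 1080, 1920))
```

Rejecting out-of-range values early makes a malformed model answer fail loudly instead of producing a click outside the screen.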


In [None]:
# Cell 5: Create Test Screenshot
from PIL import Image, ImageDraw

def create_mobile_screenshot():
    """Create a realistic mobile settings screen (1080×1920)."""
    img = Image.new('RGB', (1080, 1920), '#f5f5f5')
    draw = ImageDraw.Draw(img)
    
    # Status bar
    draw.rectangle([0, 0, 1080, 80], fill='#1976D2')
    draw.text((40, 30), "9:41", fill='white')
    
    # Header
    draw.rectangle([0, 80, 1080, 200], fill='#2196F3')
    draw.text((40, 120), "Settings", fill='white')
    
    # Settings items
    items = [
        ('Wi-Fi', True, 280), ('Bluetooth', True, 400), ('Cellular', True, 520),
        ('Personal Hotspot', False, 640), ('VPN', False, 760), ('Notifications', None, 880),
        ('Sounds & Haptics', None, 1000), ('Focus', None, 1120), ('Screen Time', None, 1240),
        ('General', None, 1360),
    ]
    
    for label, toggle, y in items:
        draw.rectangle([0, y, 1080, y+100], fill='white', outline='#e0e0e0')
        draw.text((40, y+35), label, fill='#333333')
        if toggle is True:
            draw.ellipse([960, y+30, 1020, y+70], fill='#4CAF50')
        elif toggle is False:
            draw.ellipse([960, y+30, 1020, y+70], fill='#9E9E9E')
        draw.text((1020, y+35), ">", fill='#999999')
    
    # Navigation bar
    draw.rectangle([0, 1800, 1080, 1920], fill='white')
    for i, label in enumerate(['Home', 'Search', 'Settings', 'Profile']):
        draw.text((60 + i*270, 1840), label, fill='#2196F3' if label == 'Settings' else '#666666')
    
    return img

test_image = create_mobile_screenshot()
print(f"📸 Created test screenshot: {test_image.size[0]}×{test_image.size[1]}")

# Display thumbnail
thumb = test_image.copy()
thumb.thumbnail((250, 450))
display(thumb)


In [None]:
# Cell 6: Run MAI-UI Inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=512, stop=["<|im_end|>"])

instructions = [
    "Click on Wi-Fi to see network options",
    "Tap the Bluetooth toggle",
    "Open the General settings",
    "Click the Home button in the navigation bar",
]

print("=" * 80)
print(f"🤖 RUNNING MAI-UI INFERENCE ({GPU_TIER})")
print("=" * 80)

results = []
total_start = time.time()

for i, instruction in enumerate(instructions, 1):
    inputs = {"prompt": build_prompt(instruction), "multi_modal_data": {"image": test_image}}
    
    start = time.time()
    outputs = llm.generate([inputs], sampling_params=sampling_params)
    latency_ms = (time.time() - start) * 1000
    
    parsed = parse_response(outputs[0].outputs[0].text)
    results.append({'instruction': instruction, 'latency_ms': latency_ms, 'parsed': parsed})
    
    coord = parsed['coordinate']
    if coord:
        # Scale by the actual image size rather than hard-coded 1080×1920
        px, py = int(coord[0] * test_image.width), int(coord[1] * test_image.height)
        print(f"\n[{i}] {instruction}")
        print(f"    📍 Coordinate: ({coord[0]:.3f}, {coord[1]:.3f}) = pixel ({px}, {py})")
        print(f"    ⏱️  Latency: {latency_ms:.0f}ms")
    else:
        print(f"\n[{i}] {instruction} → ❌ No coordinate")

total_time = time.time() - total_start
avg_latency = sum(r['latency_ms'] for r in results) / len(results)

print("\n" + "=" * 80)
print("📊 PERFORMANCE SUMMARY")
print("=" * 80)
print(f"  Total time: {total_time:.2f}s")
print(f"  Average latency: {avg_latency:.0f}ms")
print(f"  Throughput: {len(results) / total_time:.2f} req/s")

# Compare to expectations
expected = {'T4': 1500, 'A100': 400, 'H100': 200, 'B200': 150}
print(f"\n  Expected: ~{expected[GPU_TIER]}ms | Actual: {avg_latency:.0f}ms")


In [None]:
# Cell 7: Visualize Results
def visualize_predictions(image, results):
    """Overlay predicted click locations on screenshot."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    colors = ['#FF0000', '#00FF00', '#0000FF', '#FF00FF', '#FFFF00']
    
    for i, result in enumerate(results):
        coord = result['parsed'].get('coordinate')
        if coord:
            x, y = int(coord[0] * image.width), int(coord[1] * image.height)
            color = colors[i % len(colors)]
            # Crosshair
            draw.ellipse([x-30, y-30, x+30, y+30], outline=color, width=4)
            draw.line([x-40, y, x+40, y], fill=color, width=3)
            draw.line([x, y-40, x, y+40], fill=color, width=3)
            draw.text((x+35, y-15), str(i+1), fill=color)
    return img

vis_image = visualize_predictions(test_image, results)
vis_thumb = vis_image.copy()
vis_thumb.thumbnail((350, 600))
display(vis_thumb)

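print("\n📍 CLICK LOCATIONS:")
for i, r in enumerate(results, 1):
    coord = r['parsed'].get('coordinate')
    if coord:
        # Scale by the actual image size rather than hard-coded 1080×1920
        px, py = int(coord[0] * test_image.width), int(coord[1] * test_image.height)
        print(f"  [{i}] {r['instruction'][:40]}... → ({px}, {py})")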
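Because the test screenshot is synthetic, the row positions are known and predicted clicks can be scored instead of only inspected visually. The sketch below copies the row bounds from the drawing code in Cell 5 (each row spans `y` to `y + 100` across the full width). The `example` predictions are made-up values for illustration, not model output, and `hit` and `score` are hypothetical helper names.

```python
# Row bounds taken from Cell 5: (left, top, right, bottom) in pixels
TARGET_BOXES = {
    "Wi-Fi": (0, 280, 1080, 380),
    "Bluetooth": (0, 400, 1080, 500),
    "General": (0, 1360, 1080, 1460),
}

def hit(pixel, box):
    """Return True if pixel = (x, y) lies inside the box."""
    x, y = pixel
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def score(predictions):
    """Fraction of predictions landing in their target box.

    predictions: dict mapping a label in TARGET_BOXES to an (x, y) pixel,
    or None when the model returned no coordinate.
    """
    hits = sum(
        1 for label, pixel in predictions.items()
        if pixel is not None and hit(pixel, TARGET_BOXES[label])
    )
    return hits / len(predictions)

# Hypothetical predictions: two hits and one miss
example = {"Wi-Fi": (540, 330), "Bluetooth": (990, 450), "General": (540, 900)}
print(f"Grounding accuracy: {score(example):.2f}")
```

The same inside-the-box check is what the GRPO reward in the training description uses, so this doubles as a small sanity test of the reward definition.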


---

## 📊 GPU Optimization Summary

### Why Each Parameter Matters

| Parameter | What It Does | Impact of Wrong Choice |
|-----------|--------------|------------------------|
| `dtype` | Precision for weights | FP16 on A100+ = worse stability |
| `quantization` | Compress weights | Unquantized 8B model on T4 = OOM |
| `enforce_eager` | Disable CUDA graphs | False on T4 = OOM |
| `max_model_len` | Context window | Too large = OOM, too small = truncation |
| `max_num_seqs` | Concurrent requests | Too large = OOM, too small = low throughput |
| `max_pixels` | Max image resolution | Too large = many tokens = slow |
| `enable_prefix_caching` | Cache system prompt | Critical for repeated prompts |
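The `max_pixels` row can be made concrete with a rough token estimate. The sketch below assumes each visual token covers a 32×32-pixel area (a 16-pixel patch merged 2×2); that value is an assumption about the vision tower and is exposed as the `pixels_per_token_side` parameter so it can be changed. The real image processor also rounds dimensions to patch multiples, so treat the result as an approximation. `estimate_visual_tokens` is a hypothetical helper, not a vLLM or Qwen API.

```python
import math

def estimate_visual_tokens(width, height, max_pixels, pixels_per_token_side=32):
    """Approximate the number of visual tokens for one image.

    If the image exceeds max_pixels, it is scaled down uniformly first,
    preserving the aspect ratio.
    """
    pixels = width * height
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
        width, height = width * scale, height * scale
    side = pixels_per_token_side
    return math.ceil(width / side) * math.ceil(height / side)

# The 1080x1920 test screenshot under the T4 and A100 limits from Cell 2
for tier, max_pixels in [("T4", 512000), ("A100", 2073600)]:
    tokens = estimate_visual_tokens(1080, 1920, max_pixels)
    print(f"{tier}: max_pixels={max_pixels} -> ~{tokens} visual tokens")
```

Under these assumptions the A100 setting spends roughly four times as many tokens per screenshot as the T4 setting, which is why `max_pixels` has such a direct effect on prefill latency.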

### Memory Formula

```
Total Memory = Model Weights + KV Cache + Vision Encoder + Activations + Overhead

Where:
• Model Weights = params × bytes_per_param (FP16=2, BF16=2, FP8=1, 4-bit=0.5)
• KV Cache = 2 × layers × kv_heads × head_dim × max_len × max_seqs × bytes_per_element (kv_heads is smaller than the attention head count under grouped-query attention)
• Vision Encoder = ~2-4 GB (depends on image resolution)
• Activations = ~1-5 GB (depends on batch size)
```
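The first two terms of the formula are easy to evaluate directly. The sketch below does so for an 8B model in BF16 with the A100 settings from Cell 2. The architecture numbers (36 layers, 8 KV heads, head dimension 128) are assumptions for illustration and should be replaced with the values from the model's `config.json`. The KV figure is a worst case in which every sequence fills the full context; vLLM allocates KV cache in pages, so actual use depends on live sequence lengths.

```python
GIB = 1024 ** 3

def weights_gib(params_billion, bytes_per_param):
    """Model weights = params x bytes_per_param, in GiB."""
    return params_billion * 1e9 * bytes_per_param / GIB

def kv_cache_gib(layers, kv_heads, head_dim, max_len, max_seqs, bytes_per_elem):
    """Worst-case KV cache: the factor 2 covers keys and values."""
    total_bytes = 2 * layers * kv_heads * head_dim * max_len * max_seqs * bytes_per_elem
    return total_bytes / GIB

# Assumed architecture for an 8B model (check config.json for real values)
w = weights_gib(params_billion=8, bytes_per_param=2)  # BF16
kv = kv_cache_gib(layers=36, kv_heads=8, head_dim=128,
                  max_len=16384, max_seqs=16, bytes_per_elem=2)

print(f"Weights:  {w:.1f} GiB")
print(f"KV cache: {kv:.1f} GiB (worst case, all sequences at max_len)")
```

Switching `bytes_per_elem` to 1 models the FP8 KV cache used in the H100 config and halves the second term.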

### References

- [MAI-UI Technical Report](https://www.alphaxiv.org/abs/2512.22047)
- [MAI-UI GitHub](https://github.com/Tongyi-MAI/MAI-UI)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Complete Guide](QWEN_VL_COMPLETE_GUIDE.md)
