In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Accelerate LLM Inference with EAGLE Speculative Decoding on Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb">
      <img width="32px" src="https://storage.googleapis.com/github-repo/generative-ai/logos/GitHub_Invertocat_Dark.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<p>
<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/benchmarking_eagle_on_vertex_ai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>
</p>

| Author(s) |
| --- |
| [Ivan Nardini](https://github.com/inardini) |

## Overview

### Why EAGLE Matters

Large Language Models (LLMs) generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model. For production applications serving thousands of users, this sequential generation limits throughput and increases costs.

**EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)** solves this by using speculative decoding:
- A small, fast **draft model** proposes several candidate tokens ahead of the main model
- The main model **verifies** these candidates in a single forward pass
- Accepted candidates are kept (saving forward passes); rejected ones are discarded and regenerated by the main model
- **Result**: typically 1.5-2x faster inference with the same output quality, because the main model still verifies every token
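
The draft-and-verify loop described above can be sketched with two toy "models". This is a minimal illustration under stated assumptions, not EAGLE itself: `target_next` and `draft_next` are hypothetical stand-in functions, decoding is greedy, and the draft is a simple chain rather than the token tree EAGLE builds.

```python
def target_next(tokens):
    """Toy 'main model' (hypothetical): deterministic next token for a context."""
    return (sum(tokens) * 7 + len(tokens)) % 10


def draft_next(tokens):
    """Toy 'draft model' (hypothetical): agrees with the target except when
    the context length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if len(tokens) % 4 == 0 else guess


def greedy_decode(prompt, num_new):
    """Reference: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, draft_len=3):
    """Draft `draft_len` tokens, then verify them in one simulated target pass."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. Draft: the cheap model proposes a short continuation.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. Verify: a single target pass scores every drafted position.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched; the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


spec_tokens, passes = speculative_decode([1, 2, 3], num_new=20)
print(f"{passes} target passes for 20 new tokens")
```

In this sketch the speculative output matches plain greedy decoding token for token, while the number of simulated target passes is lower than the number of generated tokens.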

### What You'll Learn

This tutorial demonstrates how to benchmark EAGLE's performance improvement on Vertex AI using real production workloads:

1. **Deploy two Llama 4 Scout endpoints**: baseline (standard) and EAGLE-enabled
2. **Run controlled benchmarks** using vLLM's industry-standard tooling and the ShareGPT dataset (real user conversations)
3. **Measure key metrics** across varying concurrency levels:
   - **TTFT (Time to First Token)**: User-perceived latency - how quickly responses start
   - **TPOT (Time Per Output Token)**: Generation speed - affects streaming smoothness
   - **Throughput**: System capacity - tokens and requests per second
   - **Scalability**: How performance changes under concurrent load
4. **Visualize results** to quantify EAGLE's speedup and identify optimal configurations

### Prerequisites

Before starting, ensure you have:

- **Google Cloud Project** with billing enabled
- **Vertex AI API** enabled ([enable here](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com))
- **GPU Quota**: 8x NVIDIA H100 80GB GPUs in your selected region
  - Check quota: `gcloud compute regions describe <region> | grep 'NVIDIA_H100_80GB'`
  - Request increase: [Quota page](https://console.cloud.google.com/iam-admin/quotas)
- **Python 3.10+** (automatically available in Colab/Workbench)
- **Hugging Face account** with access to Llama 4 model (requires Meta license acceptance)

**Estimated Time:** 90-120 minutes (mostly deployment wait time)  
**Estimated Cost:** ~$250 (mostly machine and GPU hours during benchmarking)

## Get Started

### Install Required Packages

**Note:** After running this cell, **you will need to restart the runtime**. This is expected behavior when installing new Python packages.

- In **Colab**: Click the "Restart Runtime" button that appears
- In **Vertex AI Workbench**: Kernel → Restart Kernel

After restarting, **continue from the next cell** (do not re-run this installation cell).

In [None]:
# Install packages (minimum versions; vLLM is pinned so benchmark tooling is reproducible)
%pip install --upgrade --quiet \
    'google-cloud-aiplatform>=1.70.0' \
    'transformers>=4.45.0' \
    'huggingface-hub>=0.26.0' \
    'hf-transfer>=0.1.8' \
    'vllm==0.11.0' \
    'pandas>=2.0.0' \
    'matplotlib>=3.7.0' \
    'seaborn>=0.13.0'

print("\n" + "="*80)
print("✅ Installation completed successfully!")
print("="*80)
print("⚠️  NEXT STEP: Please restart your runtime now.")
print("   - Colab: 'Runtime' → 'Restart session'")
print("   - Workbench: 'Kernel' → 'Restart Kernel'")
print("   - Then continue from the cell below (skip this installation cell)")
print("="*80)

### Authenticate Your Environment

**Colab users only**: Run this cell to authenticate your Google Cloud account. This allows the notebook to access Vertex AI services.

**Vertex AI Workbench users**: Skip this cell - you're already authenticated.

In [None]:
import sys

# Only authenticate in the Colab environment
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("✅ Authentication successful!")
else:
    print("ℹ️  Running in Vertex AI Workbench - already authenticated")

### Set Google Cloud Project Information

Configure your Google Cloud project ID and region. The project must have:
- Vertex AI API enabled
- Sufficient GPU quota (8x H100 80GB recommended)

**Recommended regions for H100 availability**:
- `us-central1` (Iowa)
- `us-east4` (Northern Virginia)
- `europe-west4` (Netherlands)
- `asia-southeast1` (Singapore)

In [None]:
import os

import vertexai

# Configure these values for your environment
PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "asia-southeast1"  # @param {type: "string", placeholder: "us-central1", isTemplate: true}

# Auto-detect project ID if not provided
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Use an empty-string default so a missing variable is detected below
    # (str(None) would produce the truthy string "None").
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
    if not PROJECT_ID:
        raise ValueError(
            "❌ PROJECT_ID not set. Please set it in this cell or "
            "set the GOOGLE_CLOUD_PROJECT environment variable"
        )

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION)

print("=" * 80)
print("✅ Vertex AI initialized successfully!")
print("=" * 80)
print(f"   Project ID: {PROJECT_ID}")
print(f"   Location:   {LOCATION}")
print("=" * 80)

### Import Required Libraries

In [None]:
# Standard libraries
import json
import subprocess
import urllib.request
from pathlib import Path
import google.auth
import matplotlib.pyplot as plt

# Data processing and visualization
import pandas as pd
import seaborn as sns
from google.auth.transport.requests import Request

# Hugging Face libraries
from huggingface_hub import login, snapshot_download
from vertexai import model_garden

print("✅ Libraries imported successfully!")

## Model Deployment

We'll deploy two versions of Llama 4 Scout (17B active parameters, 16 experts):
1. **Baseline**: Standard configuration (no EAGLE)
2. **EAGLE-enabled**: With speculative decoding enabled

Both use identical hardware (8x H100 80GB GPUs with tensor parallelism) to ensure fair comparison.

### Deploy Baseline Model (Without EAGLE)

This deployment creates a baseline for comparison. We'll measure its performance, then compare against EAGLE.

**Key Configuration Parameters:**

| Parameter | Value | Purpose |
|-----------|-------|----------|
| `machine_type` | `a3-highgpu-8g` | VM with 8x H100 80GB GPUs |
| `accelerator_type` | `NVIDIA_H100_80GB` | NVIDIA Hopper-generation GPU with 80 GB of memory |
| `accelerator_count` | `8` | Number of GPUs for tensor parallelism |
| `--tp` | `8` | Tensor parallelism degree (splits model across GPUs) |
| `--attention-backend` | `fa3` | FlashAttention 3 (optimized attention computation) |
| `--context-length` | `131072` | Maximum sequence length (128K tokens) |

**Expected deployment time**: 10-15 minutes
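
A back-of-the-envelope calculation shows why the weights are split across GPUs. This is a rough sketch under stated assumptions: roughly 109B total parameters for Llama 4 Scout (all 16 experts, per Meta's model card), 2 bytes per parameter (BF16), and an even split across the tensor-parallel group. It ignores the KV cache, activations, and runtime overhead, which also need GPU memory.

```python
def weight_gib_per_gpu(total_params, bytes_per_param, tp_degree):
    """Approximate weight memory per GPU when weights are split evenly."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / tp_degree / 2**30


# Assumptions: ~109e9 total parameters, BF16 (2 bytes each), --tp=8.
per_gpu = weight_gib_per_gpu(total_params=109e9, bytes_per_param=2, tp_degree=8)
print(f"~{per_gpu:.1f} GiB of weights per GPU")
```

The remaining memory on each 80 GB GPU is what the server can use for the KV cache, which matters for the 128K context length configured here.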

In [None]:
# Model configuration
MODEL_NAME = "meta/llama4@llama-4-scout-17b-16e-instruct"
MODEL_GCS_PATH = (
    "gs://vertex-model-garden-restricted-us/llama4/Llama-4-Scout-17B-16E-Instruct"
)

print("=" * 80)
print("🚀 DEPLOYING BASELINE MODEL")
print("=" * 80)
print(f"Model: {MODEL_NAME}")
print("Configuration: 8x H100 GPUs, No EAGLE")
print("\n⏳ Starting deployment... (this will take 10-15 minutes)")
print("=" * 80)

# Baseline deployment arguments (no speculative decoding)
baseline_args = [
    f"--model={MODEL_GCS_PATH}",
    "--attention-backend=fa3",  # FlashAttention 3 for optimal performance
    "--context-length=131072",  # 128K context window
    "--chat-template=Llama-4",
    "--tp=8",  # Tensor parallelism across 8 GPUs
    "--enable-multimodal",
    "--tool-call-parser=pythonic",
    "--chat-template=sglang/examples/chat_template/tool_chat_template_llama4_pythonic.jinja",
]

try:
    # Initialize model from Model Garden
    baseline_model = model_garden.OpenModel(MODEL_NAME)

    # Deploy to dedicated endpoint
    baseline_endpoint = baseline_model.deploy(
        model_display_name="baseline-llama4-scout-17b-16e-instruct",
        endpoint_display_name="baseline-llama4-scout-17b-16e-instruct",
        serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/sglang-serve.cu124.0-4.ubuntu2204.py310:model-garden.sglang-0-4-release_20250831.00_p0",
        machine_type="a3-highgpu-8g",
        accelerator_type="NVIDIA_H100_80GB",
        accelerator_count=8,
        use_dedicated_endpoint=True,
        accept_eula=True,
        serving_container_args=baseline_args,
        serving_container_environment_variables={
            "MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "DEPLOY_SOURCE": "UI_NATIVE_MODEL",
        },
        serving_container_ports=[30000],
        serving_container_health_route="/health",
        serving_container_predict_route="/vertex_generate",
    )

    print("\n" + "=" * 80)
    print("✅ BASELINE ENDPOINT DEPLOYED SUCCESSFULLY!")
    print("=" * 80)
    print(f"   Endpoint ID: {baseline_endpoint.name}")
    print(f"   Resource Name: {baseline_endpoint.resource_name}")
    print("   Status: READY")
    print("=" * 80)

except Exception as e:
    print("\n" + "=" * 80)
    print("❌ DEPLOYMENT FAILED")
    print("=" * 80)
    print(f"Error: {str(e)}")
    print("\nCommon issues:")
    print("  - Insufficient GPU quota (need 8x H100 80GB)")
    print("  - Region doesn't have H100s available")
    print("  - Billing not enabled on project")
    print("=" * 80)
    raise

### Deploy EAGLE-Enabled Model

This deployment adds EAGLE speculative decoding on top of the same base model and hardware.

**EAGLE-Specific Parameters:**

| Parameter | Value | Purpose |
|-----------|-------|----------|
| `--speculative-algo` | `EAGLE3` | Activates EAGLE version 3 (latest) |
| `--speculative-draft-model-path` | `gs://...EAGLE3...` | Pre-trained draft model for Llama 4 Scout |
| `--speculative-num-steps` | `3` | Draft depth: number of autoregressive steps the draft model runs per round |
| `--speculative-num-draft-tokens` | `8` | Maximum number of draft tokens sent to the main model for verification per round |
| `--speculative-eagle-topk` | `4` | Branching factor: number of candidate tokens kept at each draft step |

**How these parameters affect performance:**
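- **Higher `num-steps`**: Deeper speculation → more tokens can be accepted per round, but more draft work is wasted when an early token is rejected (sweet spot: 2-5)
- **Higher `num-draft-tokens`**: More candidates verified per round → better chance of long accepted runs, at the cost of extra verification compute and memory (sweet spot: 4-12)
- **Higher `topk`**: More candidate branches per step → better acceptance rate but slower drafting (sweet spot: 3-5)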
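
The trade-off between draft depth and wasted work can be illustrated with a simplified model. The sketch below assumes a single chain of draft tokens (not the token tree EAGLE uses) and a hypothetical, independent per-token acceptance probability; under those assumptions the expected number of tokens produced per verification pass is `(1 - a^(k+1)) / (1 - a)` for acceptance rate `a` and draft length `k`.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target pass for a chain draft.

    Simplifying assumption: each draft token is accepted independently with
    probability `acceptance_rate`; the pass always yields at least one token.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.8, 0.9):
    for depth in (1, 3, 5):
        tokens = expected_tokens_per_pass(rate, depth)
        print(f"acceptance={rate:.1f} depth={depth}: {tokens:.2f} tokens/pass")
```

The gain from extra depth flattens out quickly when the acceptance rate is low, which is one way to read the sweet-spot ranges above. Measured acceptance rates depend on the draft model and workload.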

**Expected deployment time**: 10-15 minutes

In [None]:
print("=" * 80)
print("🚀 DEPLOYING EAGLE-ENABLED MODEL")
print("=" * 80)
print(f"Model: {MODEL_NAME}")
print("Configuration: 8x H100 GPUs, EAGLE Speculative Decoding")
print("\n⏳ Starting deployment... (this will take 10-15 minutes)")
print("=" * 80)

# EAGLE deployment arguments (adds speculative decoding parameters)
eagle_args = [
    f"--model={MODEL_GCS_PATH}",
    "--attention-backend=fa3",
    "--context-length=131072",
    "--chat-template=Llama-4",
    "--tp=8",
    "--enable-multimodal",
    "--tool-call-parser=pythonic",
    "--chat-template=sglang/examples/chat_template/tool_chat_template_llama4_pythonic.jinja",
    # EAGLE-specific configuration
    "--speculative-algo=EAGLE3",
    "--speculative-draft-model-path=gs://vertex-model-garden-restricted-us/llama4/Llama-4-Scout-17B-16E-Instruct-EAGLE3-20250829/",
    "--speculative-num-steps=3",
    "--speculative-eagle-topk=4",
    "--speculative-num-draft-tokens=8",
]

try:
    # Initialize model from Model Garden
    eagle_model = model_garden.OpenModel(MODEL_NAME)

    # Deploy to dedicated endpoint
    eagle_endpoint = eagle_model.deploy(
        model_display_name="eagle-llama4-scout-17b-16e-instruct",
        endpoint_display_name="eagle-llama4-scout-17b-16e-instruct",
        serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/sglang-serve.cu124.0-4.ubuntu2204.py310:model-garden.sglang-0-4-release_20250831.00_p0",
        machine_type="a3-highgpu-8g",
        accelerator_type="NVIDIA_H100_80GB",
        accelerator_count=8,
        use_dedicated_endpoint=True,
        accept_eula=True,
        serving_container_args=eagle_args,
        serving_container_environment_variables={
            "MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "DEPLOY_SOURCE": "UI_NATIVE_MODEL",
        },
        serving_container_ports=[30000],
        serving_container_health_route="/health",
        serving_container_predict_route="/vertex_generate",
    )

    print("\n" + "=" * 80)
    print("✅ EAGLE ENDPOINT DEPLOYED SUCCESSFULLY!")
    print("=" * 80)
    print(f"   Endpoint ID: {eagle_endpoint.name}")
    print(f"   Resource Name: {eagle_endpoint.resource_name}")
    print("   Status: READY")
    print("=" * 80)
    print("\n🎯 Both endpoints are now ready for benchmarking!")

except Exception as e:
    print("\n" + "=" * 80)
    print("❌ DEPLOYMENT FAILED")
    print("=" * 80)
    print(f"Error: {str(e)}")
    print("\nCommon issues:")
    print("  - Insufficient GPU quota (need 8x H100 80GB)")
    print("  - Region doesn't have H100s available")
    print("  - Billing not enabled on project")
    print("=" * 80)
    raise

## Prepare Benchmark Dataset

We'll use the **ShareGPT dataset**, which contains real conversations between users and ChatGPT that users chose to share publicly.

**Why ShareGPT?**
- **Realistic workload**: Real conversations, not synthetic prompts
- **Variable lengths**: Tests model performance across different input/output sizes
- **Industry standard**: Used by major LLM serving frameworks for benchmarking
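
To make the workload concrete, the sketch below turns ShareGPT-style records into (prompt, reference response) pairs. It assumes each record has a `conversations` list whose items carry `from` and `value` keys, and that only the first two turns are used, which is how benchmark samplers commonly treat this dataset. The sample records are made up for illustration.

```python
# Made-up records in the assumed ShareGPT layout.
sample_records = [
    {
        "id": "example-1",
        "conversations": [
            {"from": "human", "value": "Explain speculative decoding briefly."},
            {"from": "gpt", "value": "A draft model proposes tokens that..."},
            {"from": "human", "value": "Thanks!"},
        ],
    },
    # Records with fewer than two turns cannot form a pair and are skipped.
    {"id": "example-2", "conversations": [{"from": "human", "value": "Hi"}]},
]


def first_turn_pairs(records):
    """Return (prompt, response) from the first two turns of each record."""
    pairs = []
    for record in records:
        turns = record.get("conversations", [])
        if len(turns) < 2:
            continue
        pairs.append((turns[0]["value"], turns[1]["value"]))
    return pairs


pairs = first_turn_pairs(sample_records)
print(f"{len(pairs)} usable prompt/response pair(s)")
```

The length of each reference response is what gives the benchmark its realistic spread of output lengths.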

### Download ShareGPT Dataset

In [None]:
DATASET_URL = "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json"
DATASET_PATH = "/tmp/ShareGPT_V3_unfiltered_cleaned_split.json"

print("📥 Downloading ShareGPT dataset...")

try:
    if not os.path.exists(DATASET_PATH):
        urllib.request.urlretrieve(DATASET_URL, DATASET_PATH)
        print(f"✅ Dataset downloaded to {DATASET_PATH}")
    else:
        print(f"ℹ️  Dataset already exists at {DATASET_PATH}")

    # Preview dataset structure
    with open(DATASET_PATH, "r", encoding="utf-8") as f:
        data = json.load(f)

    print("\n" + "=" * 80)
    print("📊 DATASET INFORMATION")
    print("=" * 80)
    print(f"   Total conversations: {len(data):,}")
    print(f"   Sample conversation keys: {list(data[0].keys())}")
    print("\n   Sample conversation structure:")
    print(f"   - ID: {data[0]['id']}")
    print(f"   - Turns: {len(data[0]['conversations'])} messages")
    print("=" * 80)

except Exception as e:
    print(f"\n❌ Failed to download dataset: {e}")
    print("Please check your internet connection and try again.")
    raise

### Download Model Artifacts for Tokenization

We need the model's tokenizer to accurately measure prompt/response lengths during benchmarking.

**Why download locally?**
- vLLM's benchmark tool needs the tokenizer to count tokens accurately
- We only download configuration files (less than 10MB), not the full model weights
- Using `hf_transfer` library for 2-5x faster downloads

**Note:** This requires a Hugging Face account with access to Llama 4 (requires accepting Meta's license).
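
The download is restricted with glob-style patterns (`allow_patterns`). The sketch below uses the standard-library `fnmatch` module to show which files such patterns would select. The file list is hypothetical, and matching with `fnmatch` is an assumption about how the patterns are interpreted.

```python
from fnmatch import fnmatch

# Same patterns as the download cell in this section.
ALLOW_PATTERNS = ["*.json", "tokenizer.model", "*.txt"]


def is_allowed(filename, patterns=ALLOW_PATTERNS):
    """True if the filename matches at least one allow pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical repository listing, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "model-00001-of-00050.safetensors",
    "README.md",
]

selected = [name for name in repo_files if is_allowed(name)]
print(selected)
```

Weight shards (`*.safetensors`) match none of the patterns, so only the small configuration and tokenizer files are selected.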

In [None]:
# Authenticate to Hugging Face
# You'll be prompted to enter your HF token (get one at https://huggingface.co/settings/tokens)
print("🔐 Authenticating to Hugging Face...")
print("   You'll need a token with access to meta-llama/Llama-4-Scout-17B-16E-Instruct")
print("   Get your token at: https://huggingface.co/settings/tokens\n")

try:
    login()
    print("✅ Hugging Face authentication successful!")
except Exception as e:
    print(f"❌ Authentication failed: {e}")
    raise

In [None]:
# Configure download settings
HF_HOME = "hf_cache"
MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
LOCAL_DIR = Path(f"{HF_HOME}/{MODEL_ID}")

# Create directory
LOCAL_DIR.parent.mkdir(parents=True, exist_ok=True)

# Enable fast transfers (2-5x faster using Rust-based hf_transfer)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["HF_HOME"] = HF_HOME

# Only download configuration files (not model weights)
allow_patterns = ["*.json", "tokenizer.model", "*.txt"]

print("📥 Downloading model artifacts from Hugging Face...")
print(f"   Model: {MODEL_ID}")
print("   Using fast transfer (hf_transfer enabled)")
print(f"   Downloading to: {LOCAL_DIR}\n")

try:
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=str(LOCAL_DIR),
        allow_patterns=allow_patterns,  # interrupted downloads resume automatically
    )

    print("\n✅ Model artifacts downloaded successfully!")
    print(f"   Location: {LOCAL_DIR}")

    # Store path for benchmarking
    MODEL_PATH = str(LOCAL_DIR)
    print(f"\n   MODEL_PATH set to: {MODEL_PATH}")

except Exception as e:
    print(f"\n❌ Download failed: {e}")
    print("\nCommon issues:")
    print("  - No access to Llama 4 model (need to accept Meta license)")
    print("  - Invalid Hugging Face token")
    print("  - Network connectivity issues")
    raise

## Quick Smoke Test (100 Prompts)

Before running the full benchmark, let's verify both endpoints work correctly with a quick test.

**What this tests:**
- Both endpoints are responding correctly
- Authentication is working
- Basic performance sanity check

**This is NOT the main benchmark** - just a validation step. The comprehensive benchmark comes next.

### Apply Vertex AI Compatibility Patch

**Why is this patch needed?**

vLLM's benchmark tool was originally designed for OpenAI's API, but Vertex AI has slightly different requirements:

1. **Vertex AI doesn't support `stream_options`** parameter (OpenAI-specific)
2. **Vertex AI uses `max_tokens`** instead of `max_completion_tokens`
3. **Long generations need a long client timeout**: the patched function sets the request timeout to 6 hours (aiohttp's default total timeout is 5 minutes)

This patch modifies vLLM's request function to be compatible with Vertex AI's API format.
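
A monkey patch changes a module only inside the Python process that applies it. The sketch below uses the standard-library `string` module as a stand-in (it is not vLLM code) to show that scope: a child interpreter started with `subprocess` imports its own, unpatched copy. If the benchmark is launched as a separate `vllm bench serve` process, it may be worth confirming that the patched behavior is actually in effect there.

```python
import string
import subprocess
import sys

original_capwords = string.capwords

# In-process monkey patch: rebind an attribute on an already-imported module.
string.capwords = lambda s, sep=None: "PATCHED"
in_process = string.capwords("hello world")

# A child interpreter imports its own, unpatched copy of the module.
child = subprocess.run(
    [sys.executable, "-c", "import string; print(string.capwords('hello world'))"],
    capture_output=True,
    text=True,
    check=True,
)
in_child = child.stdout.strip()

string.capwords = original_capwords  # undo the patch

print(f"in-process: {in_process!r}, child process: {in_child!r}")
```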

In [None]:
import vllm.benchmarks.lib.endpoint_request_func as endpoint_func

# Store the original function
_original_async_request_openai_chat_completions = (
    endpoint_func.async_request_openai_chat_completions
)


async def patched_async_request_openai_chat_completions(
    request_func_input: endpoint_func.RequestFuncInput,
    pbar=None,
) -> endpoint_func.RequestFuncOutput:
    """Patched version compatible with Vertex AI's chat completions endpoint."""
    # Import necessary modules
    import json
    import os
    import sys
    import time
    import traceback

    import aiohttp

    api_url = request_func_input.api_url
    assert api_url.endswith(("chat/completions", "profile"))

    # Set longer timeout for Vertex AI (6 hours for large model inference)
    async with aiohttp.ClientSession(
        trust_env=True, timeout=aiohttp.ClientTimeout(total=6 * 60 * 60)
    ) as session:
        # Build request content (text + optional multimodal)
        content = [{"type": "text", "text": request_func_input.prompt}]
        if request_func_input.multi_modal_content:
            mm_content = request_func_input.multi_modal_content
            if isinstance(mm_content, list):
                content.extend(mm_content)
            elif isinstance(mm_content, dict):
                content.append(mm_content)

        # Build payload with Vertex AI-compatible parameters
        payload = {
            "model": request_func_input.model_name
            if request_func_input.model_name
            else request_func_input.model,
            "messages": [{"role": "user", "content": content}],
            "temperature": 0.0,
            "max_tokens": request_func_input.output_len,  # Vertex AI uses max_tokens, not max_completion_tokens
            "stream": True,
            # Removed stream_options - not supported by Vertex AI
        }
        if request_func_input.ignore_eos:
            payload["ignore_eos"] = request_func_input.ignore_eos
        if request_func_input.extra_body:
            payload.update(request_func_input.extra_body)

        # Set authentication header
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
        }
        if request_func_input.request_id:
            headers["x-request-id"] = request_func_input.request_id

        # Initialize output metrics
        output = endpoint_func.RequestFuncOutput()
        output.prompt_len = request_func_input.prompt_len

        generated_text = ""
        ttft = 0.0  # Time to first token
        st = time.perf_counter()
        most_recent_timestamp = st

        try:
            async with session.post(
                url=api_url, json=payload, headers=headers
            ) as response:
                if response.status == 200:
                    # Parse streaming response
                    async for chunk_bytes in response.content:
                        chunk_bytes = chunk_bytes.strip()
                        if not chunk_bytes:
                            continue
                        chunk_bytes = chunk_bytes.decode("utf-8")
                        if chunk_bytes.startswith(":"):
                            continue

                        chunk = chunk_bytes.removeprefix("data: ")
                        if chunk != "[DONE]":
                            timestamp = time.perf_counter()
                            data = json.loads(chunk)

                            if choices := data.get("choices"):
                                content = choices[0]["delta"].get("content")
                                if ttft == 0.0:
                                    ttft = timestamp - st
                                    output.ttft = ttft
                                else:
                                    # Record inter-token latency
                                    output.itl.append(timestamp - most_recent_timestamp)
                                generated_text += content or ""
                            elif usage := data.get("usage"):
                                output.output_tokens = usage.get("completion_tokens")

                            most_recent_timestamp = timestamp

                    output.generated_text = generated_text
                    output.success = True
                    output.latency = most_recent_timestamp - st
                else:
                    output.error = response.reason or ""
                    output.success = False
        except Exception:
            output.success = False
            exc_info = sys.exc_info()
            output.error = "".join(traceback.format_exception(*exc_info))

        if pbar:
            pbar.update(1)
        return output


# Apply the monkey patch
endpoint_func.async_request_openai_chat_completions = (
    patched_async_request_openai_chat_completions
)
endpoint_func.ASYNC_REQUEST_FUNCS[
    "openai-chat"
] = patched_async_request_openai_chat_completions

print("✅ Vertex AI compatibility patch applied successfully!")
print("   - Removed unsupported stream_options parameter")
print("   - Changed max_completion_tokens → max_tokens")
print("   - Increased timeout to 6 hours for large model inference")

### Run Smoke Test on Baseline Endpoint

In [None]:
# Get authentication token
creds, project = google.auth.default()
auth_req = Request()
creds.refresh(auth_req)

# Set environment variables for vLLM benchmark
os.environ["OPENAI_API_KEY"] = creds.token
os.environ["HF_HUB_OFFLINE"] = "1"  # Use cached model artifacts

# Construct Vertex AI endpoint URL
baseline_dns = baseline_endpoint.gca_resource.dedicated_endpoint_dns
baseline_url = f"https://{baseline_dns}/v1beta1/{baseline_endpoint.resource_name}"

print("=" * 80)
print("🧪 SMOKE TEST: BASELINE ENDPOINT")
print("=" * 80)
print(f"Endpoint: {baseline_endpoint.display_name}")
print("Test size: 100 prompts from ShareGPT")
print("Purpose: Verify endpoint is working correctly")
print("\n⏳ Running test...")
print("=" * 80 + "\n")

try:
    result = subprocess.run(
        [
            "vllm",
            "bench",
            "serve",
            "--backend",
            "openai-chat",
            "--base-url",
            baseline_url,
            "--endpoint",
            "/chat/completions",
            "--model",
            "",
            "--tokenizer",
            MODEL_PATH,
            "--dataset-name",
            "sharegpt",
            "--dataset-path",
            DATASET_PATH,
            "--num-prompts",
            "100",
        ],
        check=True,
        capture_output=False,
    )

    print("\n" + "=" * 80)
    print("✅ BASELINE SMOKE TEST PASSED")
    print("=" * 80)
    print("   Endpoint is responding correctly")
    print("   Ready for full benchmark")
    print("=" * 80)

except subprocess.CalledProcessError as e:
    print("\n" + "=" * 80)
    print("❌ SMOKE TEST FAILED")
    print("=" * 80)
    print(f"Error: {e}")
    print("\nPlease check:")
    print("  - Endpoint is fully deployed and healthy")
    print("  - Authentication token is valid (may need refresh)")
    print("  - Network connectivity to Vertex AI")
    print("=" * 80)
    raise

### Run Smoke Test on EAGLE Endpoint

In [None]:
# Refresh token (smoke tests may take several minutes)
creds.refresh(auth_req)
os.environ["OPENAI_API_KEY"] = creds.token

# Construct EAGLE endpoint URL
eagle_dns = eagle_endpoint.gca_resource.dedicated_endpoint_dns
eagle_url = f"https://{eagle_dns}/v1beta1/{eagle_endpoint.resource_name}"

print("=" * 80)
print("🧪 SMOKE TEST: EAGLE ENDPOINT")
print("=" * 80)
print(f"Endpoint: {eagle_endpoint.display_name}")
print("Test size: 100 prompts from ShareGPT")
print("Purpose: Verify EAGLE endpoint is working correctly")
print("\n⏳ Running test...")
print("=" * 80 + "\n")

try:
    result = subprocess.run(
        [
            "vllm",
            "bench",
            "serve",
            "--backend",
            "openai-chat",
            "--base-url",
            eagle_url,
            "--endpoint",
            "/chat/completions",
            "--model",
            "",
            "--tokenizer",
            MODEL_PATH,
            "--dataset-name",
            "sharegpt",
            "--dataset-path",
            DATASET_PATH,
            "--num-prompts",
            "100",
        ],
        check=True,
        capture_output=False,
    )

    print("\n" + "=" * 80)
    print("✅ EAGLE SMOKE TEST PASSED")
    print("=" * 80)
    print("   Endpoint is responding correctly")
    print("   Ready for full benchmark")
    print("=" * 80)
    print("\n🎯 Both endpoints validated! Ready for comprehensive benchmarking.")

except subprocess.CalledProcessError as e:
    print("\n" + "=" * 80)
    print("❌ SMOKE TEST FAILED")
    print("=" * 80)
    print(f"Error: {e}")
    print("\nPlease check:")
    print("  - Endpoint is fully deployed and healthy")
    print("  - EAGLE draft model loaded correctly")
    print("  - Authentication token is valid (may need refresh)")
    print("  - Network connectivity to Vertex AI")
    print("=" * 80)
    raise

## Main Benchmark: Concurrency Sweep

Now we'll run the comprehensive benchmark that tests both endpoints across multiple concurrency levels.

**What is concurrency testing?**
- Simulates multiple users sending requests simultaneously
- Tests how the system scales under load
- Reveals bottlenecks and optimal configuration
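
A concurrency limit of this kind can be sketched with an `asyncio.Semaphore`: every request is scheduled up front, but only a fixed number are in flight at once. The `fake_request` coroutine is a hypothetical stand-in that sleeps instead of calling an endpoint.

```python
import asyncio


async def fake_request(index, semaphore, state):
    """Stand-in for one HTTP request; sleeps instead of calling an endpoint."""
    async with semaphore:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # simulated server latency
        state["in_flight"] -= 1
    return index


async def run_load(num_requests, max_concurrency):
    """Issue all requests, allowing at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_request(i, semaphore, state) for i in range(num_requests))
    )
    return results, state["peak"]


# In a notebook cell, use `await run_load(20, 4)` instead of asyncio.run().
results, peak = asyncio.run(run_load(num_requests=20, max_concurrency=4))
print(f"completed={len(results)} peak_in_flight={peak}")
```

Raising the limit increases the load the server sees at once, which is what the sweep over concurrency levels varies.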


**Benchmark setup**:

- **Concurrency levels tested**: 1, 2, 4, 6, 8, 10 concurrent requests
- **Prompts per level**: 1,000 (a sample large enough for stable median and tail-latency estimates)
- **Total requests**: 12,000 (6,000 per endpoint)
- **Expected duration**: 30-45 minutes per endpoint

**What we'll measure:**

| Metric | Formula | Why It Matters |
|--------|---------|----------------|
| **TTFT** (Time to First Token) | Time from request sent to first token received | User-perceived latency - how quickly responses start appearing |
| **TPOT** (Time Per Output Token) | `(End-to-end latency - TTFT) / (output tokens - 1)` | Streaming smoothness - affects how fast text appears to stream |
| **ITL** (Inter-Token Latency) | Time between consecutive tokens | Latency variation - consistent ITL = smooth streaming |
| **Throughput** | `Total output tokens / Total time` | System capacity - how many tokens/sec the system can handle |
| **Request Throughput** | `Total requests / Total time` | Request capacity - how many requests/sec the system can handle |

**Why use median instead of mean?**
- Medians are robust to outliers (e.g., one slow request doesn't skew results)
- Better represents "typical" user experience
- Industry standard for latency reporting (along with P99 for tail latency)
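
A small example with made-up latencies shows the effect. One slow request out of nine pulls the mean to roughly 644 ms, while the median stays at 100 ms:

```python
from statistics import mean, median

# Hypothetical per-request latencies in milliseconds, with one outlier
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 5000]

mean_ms = mean(latencies_ms)
median_ms = median(latencies_ms)

print(f"mean:   {mean_ms:.1f} ms")
print(f"median: {median_ms:.1f} ms")
```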

### Benchmark Baseline Across Concurrency Levels

In [None]:
%%writefile run_baseline_concurrency.py
"""Benchmark baseline endpoint across multiple concurrency levels."""
import os
import subprocess
import google.auth
from google.auth.transport.requests import Request

def get_fresh_token():
    """Get a fresh authentication token from Google Cloud."""
    try:
        creds, project = google.auth.default()
        auth_req = Request()
        creds.refresh(auth_req)
        return creds.token
    except Exception as e:
        print(f"❌ Failed to refresh token: {e}")
        raise

# Get configuration from environment
baseline_url = os.environ["BASELINE_URL"]
model_path = os.environ["MODEL_PATH"]
dataset_path = os.environ["DATASET_PATH"]

# Concurrency levels to test
CONCURRENCY_LEVELS = [1, 2, 4, 6, 8, 10]

print("="*80)
print("📊 COMPREHENSIVE BENCHMARK: BASELINE MODEL")
print("="*80)
print(f"Endpoint: {baseline_url}")
print(f"Concurrency levels: {CONCURRENCY_LEVELS}")
print(f"Prompts per level: 1,000")
print(f"Total requests: {len(CONCURRENCY_LEVELS) * 1000:,}")
print("="*80)

# Create output directory
os.makedirs("benchmarks/baseline_concurrency", exist_ok=True)

# Run benchmark for each concurrency level
for i, concurrency in enumerate(CONCURRENCY_LEVELS, 1):
    print(f"\n{'='*80}")
    print(f"🔄 BASELINE - Concurrency {concurrency} ({i}/{len(CONCURRENCY_LEVELS)})")
    print(f"{'='*80}")
    print(f"⏳ Running 1,000 requests with max concurrency={concurrency}...\n")

    # Refresh token before each run (benchmarks can be long)
    token = get_fresh_token()
    os.environ["OPENAI_API_KEY"] = token

    try:
        subprocess.run(
            [
                "vllm", "bench", "serve",
                "--backend", "openai-chat",
                "--base-url", baseline_url,
                "--endpoint", "/chat/completions",
                "--model", "",
                "--tokenizer", model_path,
                "--dataset-name", "sharegpt",
                "--dataset-path", dataset_path,
                "--num-prompts", "1000",
                "--max-concurrency", str(concurrency),
                "--save-result",
                "--result-dir", "benchmarks/baseline_concurrency",
                "--result-filename", f"baseline_c{concurrency}.json"
            ],
            check=True,
        )

        print(f"\n✅ Concurrency {concurrency} completed successfully")
        print(f"   Results saved to: benchmarks/baseline_concurrency/baseline_c{concurrency}.json")

    except subprocess.CalledProcessError as e:
        print(f"\n❌ Benchmark failed at concurrency {concurrency}")
        print(f"   Error: {e}")
        print(f"   Continuing with next concurrency level...")
        continue

print("\n" + "="*80)
print("✅ BASELINE CONCURRENCY SWEEP COMPLETED")
print("="*80)
print(f"   Results saved to: benchmarks/baseline_concurrency/")
print("="*80)

In [None]:
# Set environment variables for the script
os.environ["BASELINE_URL"] = baseline_url
os.environ["MODEL_PATH"] = MODEL_PATH
os.environ["DATASET_PATH"] = DATASET_PATH

print("🚀 Starting baseline concurrency sweep...")
print("   This will take approximately 30-45 minutes\n")

# Run the benchmark script
!python run_baseline_concurrency.py

### Benchmark EAGLE Across Concurrency Levels

In [None]:
%%writefile run_eagle_concurrency.py
"""Benchmark EAGLE endpoint across multiple concurrency levels."""
import os
import subprocess
import google.auth
from google.auth.transport.requests import Request

def get_fresh_token():
    """Get a fresh authentication token from Google Cloud."""
    try:
        creds, project = google.auth.default()
        auth_req = Request()
        creds.refresh(auth_req)
        return creds.token
    except Exception as e:
        print(f"❌ Failed to refresh token: {e}")
        raise

# Get configuration from environment
eagle_url = os.environ["EAGLE_URL"]
model_path = os.environ["MODEL_PATH"]
dataset_path = os.environ["DATASET_PATH"]

# Concurrency levels to test (same as baseline for fair comparison)
CONCURRENCY_LEVELS = [1, 2, 4, 6, 8, 10]

print("="*80)
print("📊 COMPREHENSIVE BENCHMARK: EAGLE MODEL")
print("="*80)
print(f"Endpoint: {eagle_url}")
print(f"Concurrency levels: {CONCURRENCY_LEVELS}")
print(f"Prompts per level: 1,000")
print(f"Total requests: {len(CONCURRENCY_LEVELS) * 1000:,}")
print("="*80)

# Create output directory
os.makedirs("benchmarks/eagle_concurrency", exist_ok=True)

# Run benchmark for each concurrency level
for i, concurrency in enumerate(CONCURRENCY_LEVELS, 1):
    print(f"\n{'='*80}")
    print(f"🔄 EAGLE - Concurrency {concurrency} ({i}/{len(CONCURRENCY_LEVELS)})")
    print(f"{'='*80}")
    print(f"⏳ Running 1,000 requests with max concurrency={concurrency}...\n")

    # Refresh token before each run (benchmarks can be long)
    token = get_fresh_token()
    os.environ["OPENAI_API_KEY"] = token

    try:
        subprocess.run(
            [
                "vllm", "bench", "serve",
                "--backend", "openai-chat",
                "--base-url", eagle_url,
                "--endpoint", "/chat/completions",
                "--model", "",
                "--tokenizer", model_path,
                "--dataset-name", "sharegpt",
                "--dataset-path", dataset_path,
                "--num-prompts", "1000",
                "--max-concurrency", str(concurrency),
                "--save-result",
                "--result-dir", "benchmarks/eagle_concurrency",
                "--result-filename", f"eagle_c{concurrency}.json"
            ],
            check=True,
        )

        print(f"\n✅ Concurrency {concurrency} completed successfully")
        print(f"   Results saved to: benchmarks/eagle_concurrency/eagle_c{concurrency}.json")

    except subprocess.CalledProcessError as e:
        print(f"\n❌ Benchmark failed at concurrency {concurrency}")
        print(f"   Error: {e}")
        print(f"   Continuing with next concurrency level...")
        continue

print("\n" + "="*80)
print("✅ EAGLE CONCURRENCY SWEEP COMPLETED")
print("="*80)
print(f"   Results saved to: benchmarks/eagle_concurrency/")
print("="*80)

In [None]:
# Set environment variables for the script
os.environ["EAGLE_URL"] = eagle_url
os.environ["MODEL_PATH"] = MODEL_PATH
os.environ["DATASET_PATH"] = DATASET_PATH

print("🚀 Starting EAGLE concurrency sweep...")
print("   This will take approximately 30-45 minutes\n")

# Run the benchmark script
!python run_eagle_concurrency.py

## Analysis and Visualization

Now let's analyze the benchmark results to quantify EAGLE's performance improvement.

### Load and Parse Benchmark Results

In [None]:
# Load results for all concurrency levels
baseline_results = {}
eagle_results = {}
concurrency_levels = [1, 2, 4, 6, 8, 10]

print("📂 Loading benchmark results...\n")

try:
    for concurrency in concurrency_levels:
        # Load baseline results
        baseline_file = f"benchmarks/baseline_concurrency/baseline_c{concurrency}.json"
        with open(baseline_file, "r") as f:
            baseline_results[concurrency] = json.load(f)
        print(f"✅ Loaded baseline concurrency {concurrency}")

        # Load EAGLE results
        eagle_file = f"benchmarks/eagle_concurrency/eagle_c{concurrency}.json"
        with open(eagle_file, "r") as f:
            eagle_results[concurrency] = json.load(f)
        print(f"✅ Loaded EAGLE concurrency {concurrency}")

    print("\n✅ All results loaded successfully!")

except FileNotFoundError as e:
    print(f"\n❌ Failed to load results: {e}")
    print("   Make sure all benchmarks completed successfully")
    raise

In [None]:
# Extract key metrics for each concurrency level
baseline_metrics = []
eagle_metrics = []

for concurrency in concurrency_levels:
    baseline_data = baseline_results[concurrency]
    eagle_data = eagle_results[concurrency]

    baseline_metrics.append(
        {
            "Concurrency": concurrency,
            "TTFT (ms)": baseline_data["median_ttft_ms"],
            "TPOT (ms)": baseline_data["median_tpot_ms"],
            "Throughput (tok/s)": baseline_data["output_throughput"],
            "Request Throughput (req/s)": baseline_data["request_throughput"],
        }
    )

    eagle_metrics.append(
        {
            "Concurrency": concurrency,
            "TTFT (ms)": eagle_data["median_ttft_ms"],
            "TPOT (ms)": eagle_data["median_tpot_ms"],
            "Throughput (tok/s)": eagle_data["output_throughput"],
            "Request Throughput (req/s)": eagle_data["request_throughput"],
        }
    )

# Create DataFrames for easy comparison
baseline_df = pd.DataFrame(baseline_metrics)
eagle_df = pd.DataFrame(eagle_metrics)

print("=" * 80)
print("📊 BASELINE PERFORMANCE BY CONCURRENCY")
print("=" * 80)
print(baseline_df.to_string(index=False))
print("\n" + "=" * 80)
print("📊 EAGLE PERFORMANCE BY CONCURRENCY")
print("=" * 80)
print(eagle_df.to_string(index=False))
print("=" * 80)

### Calculate Performance Improvements

In [None]:
# Calculate percentage improvements at each concurrency level
improvements = []

for i, concurrency in enumerate(concurrency_levels):
    baseline = baseline_metrics[i]
    eagle = eagle_metrics[i]

    # Calculate improvements (negative = worse, positive = better)
    ttft_improvement = (
        (baseline["TTFT (ms)"] - eagle["TTFT (ms)"]) / baseline["TTFT (ms)"]
    ) * 100
    tpot_improvement = (
        (baseline["TPOT (ms)"] - eagle["TPOT (ms)"]) / baseline["TPOT (ms)"]
    ) * 100
    throughput_improvement = (
        (eagle["Throughput (tok/s)"] - baseline["Throughput (tok/s)"])
        / baseline["Throughput (tok/s)"]
    ) * 100
    req_throughput_improvement = (
        (eagle["Request Throughput (req/s)"] - baseline["Request Throughput (req/s)"])
        / baseline["Request Throughput (req/s)"]
    ) * 100

    improvements.append(
        {
            "Concurrency": concurrency,
            "TTFT Improvement (%)": ttft_improvement,
            "TPOT Improvement (%)": tpot_improvement,
            "Throughput Speedup (%)": throughput_improvement,
            "Req Throughput Speedup (%)": req_throughput_improvement,
        }
    )

improvements_df = pd.DataFrame(improvements)

print("=" * 80)
print("📈 EAGLE PERFORMANCE IMPROVEMENTS OVER BASELINE")
print("=" * 80)
print("   (Positive values = EAGLE is better)")
print("=" * 80)
print(improvements_df.to_string(index=False))
print("=" * 80)

# Calculate average improvements
avg_throughput_speedup = improvements_df["Throughput Speedup (%)"].mean()
avg_req_speedup = improvements_df["Req Throughput Speedup (%)"].mean()
avg_ttft_improvement = improvements_df["TTFT Improvement (%)"].mean()
avg_tpot_improvement = improvements_df["TPOT Improvement (%)"].mean()

print("\n" + "=" * 80)
print("🎯 KEY TAKEAWAYS (Averaged Across All Concurrency Levels)")
print("=" * 80)
print(
    f"   Token Throughput: {avg_throughput_speedup:+.1f}% {'faster' if avg_throughput_speedup > 0 else 'slower'} with EAGLE"
)
print(
    f"   Request Throughput: {avg_req_speedup:+.1f}% {'faster' if avg_req_speedup > 0 else 'slower'} with EAGLE"
)
print(
    f"   Time to First Token: {avg_ttft_improvement:+.1f}% {'faster' if avg_ttft_improvement > 0 else 'slower'} with EAGLE"
)
print(
    f"   Time Per Output Token: {avg_tpot_improvement:+.1f}% {'faster' if avg_tpot_improvement > 0 else 'slower'} with EAGLE"
)
print("=" * 80)

### Visualize Performance Comparison

**How to interpret these charts:**

1. **TTFT Chart (Top Left)**: Lower is better - shows how quickly responses start
   - EAGLE may have slightly higher TTFT due to draft model overhead
   - This is expected and acceptable if overall throughput improves

2. **TPOT Chart (Top Right)**: Lower is better - shows per-token generation speed
   - EAGLE should show lower TPOT (faster per-token generation)
   - This is where EAGLE's speedup comes from: each target-model forward pass can accept several drafted tokens instead of producing one

3. **Token Throughput Chart (Bottom Left)**: Higher is better - shows system capacity
   - EAGLE should show higher throughput (more tokens/sec)
   - More tokens per second on the same hardware means a lower cost per token

4. **Request Throughput Chart (Bottom Right)**: Higher is better - shows request capacity
   - EAGLE should handle more requests/sec
   - Better for production workloads with many concurrent users

In [None]:
# Set visualization style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (16, 12)

# Create a 2x2 grid of line charts
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle(
    "EAGLE vs Baseline: Performance Across Concurrency Levels",
    fontsize=16,
    fontweight="bold",
    y=0.995,
)

# Extract data for plotting
concurrency_values = baseline_df["Concurrency"].values
baseline_ttft = baseline_df["TTFT (ms)"].values
eagle_ttft = eagle_df["TTFT (ms)"].values
baseline_tpot = baseline_df["TPOT (ms)"].values
eagle_tpot = eagle_df["TPOT (ms)"].values
baseline_throughput = baseline_df["Throughput (tok/s)"].values
eagle_throughput = eagle_df["Throughput (tok/s)"].values
baseline_req_throughput = baseline_df["Request Throughput (req/s)"].values
eagle_req_throughput = eagle_df["Request Throughput (req/s)"].values

# Plot 1: Time to First Token vs Concurrency
ax1 = axes[0, 0]
ax1.plot(
    concurrency_values,
    baseline_ttft,
    "o-",
    color="#4285F4",
    linewidth=2,
    markersize=8,
    label="Baseline",
)
ax1.plot(
    concurrency_values,
    eagle_ttft,
    "s-",
    color="#34A853",
    linewidth=2,
    markersize=8,
    label="EAGLE",
)
ax1.set_xlabel("Max Concurrency", fontsize=11, fontweight="bold")
ax1.set_ylabel("Median TTFT (ms)", fontsize=11, fontweight="bold")
ax1.set_title(
    "Time to First Token vs Concurrency\n(Lower is Better)",
    fontsize=12,
    fontweight="bold",
)
ax1.legend(loc="best", fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Time Per Output Token vs Concurrency
ax2 = axes[0, 1]
ax2.plot(
    concurrency_values,
    baseline_tpot,
    "o-",
    color="#4285F4",
    linewidth=2,
    markersize=8,
    label="Baseline",
)
ax2.plot(
    concurrency_values,
    eagle_tpot,
    "s-",
    color="#34A853",
    linewidth=2,
    markersize=8,
    label="EAGLE",
)
ax2.set_xlabel("Max Concurrency", fontsize=11, fontweight="bold")
ax2.set_ylabel("Median TPOT (ms)", fontsize=11, fontweight="bold")
ax2.set_title(
    "Time Per Output Token vs Concurrency\n(Lower is Better)",
    fontsize=12,
    fontweight="bold",
)
ax2.legend(loc="best", fontsize=10)
ax2.grid(True, alpha=0.3)

# Plot 3: Token Throughput vs Concurrency
ax3 = axes[1, 0]
ax3.plot(
    concurrency_values,
    baseline_throughput,
    "o-",
    color="#4285F4",
    linewidth=2,
    markersize=8,
    label="Baseline",
)
ax3.plot(
    concurrency_values,
    eagle_throughput,
    "s-",
    color="#34A853",
    linewidth=2,
    markersize=8,
    label="EAGLE",
)
ax3.set_xlabel("Max Concurrency", fontsize=11, fontweight="bold")
ax3.set_ylabel("Throughput (tokens/s)", fontsize=11, fontweight="bold")
ax3.set_title(
    "Token Throughput vs Concurrency\n(Higher is Better)",
    fontsize=12,
    fontweight="bold",
)
ax3.legend(loc="best", fontsize=10)
ax3.grid(True, alpha=0.3)

# Plot 4: Request Throughput vs Concurrency
ax4 = axes[1, 1]
ax4.plot(
    concurrency_values,
    baseline_req_throughput,
    "o-",
    color="#4285F4",
    linewidth=2,
    markersize=8,
    label="Baseline",
)
ax4.plot(
    concurrency_values,
    eagle_req_throughput,
    "s-",
    color="#34A853",
    linewidth=2,
    markersize=8,
    label="EAGLE",
)
ax4.set_xlabel("Max Concurrency", fontsize=11, fontweight="bold")
ax4.set_ylabel("Request Throughput (req/s)", fontsize=11, fontweight="bold")
ax4.set_title(
    "Request Throughput vs Concurrency\n(Higher is Better)",
    fontsize=12,
    fontweight="bold",
)
ax4.legend(loc="best", fontsize=10)
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("eagle_concurrency_analysis.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n✅ Visualization saved as 'eagle_concurrency_analysis.png'")

## Cleanup

**Important:** These endpoints run on expensive hardware (8x H100 GPUs). Make sure to delete them when done to avoid unnecessary costs.

In [None]:
# Set to True to delete endpoints and models
delete_endpoints = False  # @param {type: "boolean"}
delete_models = False  # @param {type: "boolean"}

if delete_endpoints:
    print("🗑️  Deleting endpoints...\n")

    # Handle each endpoint separately so one failure does not skip the other
    failed_endpoints = []
    for name, endpoint in [("Baseline", baseline_endpoint), ("EAGLE", eagle_endpoint)]:
        try:
            print(f"   Undeploying {name} endpoint...")
            endpoint.undeploy_all()
            print(f"   Deleting {name} endpoint...")
            endpoint.delete()
            print(f"   ✅ {name} endpoint deleted\n")
        except Exception as e:
            failed_endpoints.append(name)
            print(f"\n❌ Failed to delete {name} endpoint: {e}")
            print("   You may need to delete it manually from the console\n")

    if not failed_endpoints:
        print("✅ All endpoints deleted successfully!")
else:
    print("⚠️  Endpoints not deleted (delete_endpoints=False)")
    print("   Remember to delete them manually to avoid charges!")
    print(f"\n   Baseline endpoint: {baseline_endpoint.resource_name}")
    print(f"   EAGLE endpoint: {eagle_endpoint.resource_name}")

if delete_models:
    print("\n🗑️  Deleting models...\n")

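    # Handle each model separately so one failure does not skip the other
    failed_models = []
    for name, model in [("Baseline", baseline_model), ("EAGLE", eagle_model)]:
        try:
            model.delete()
            print(f"   ✅ {name} model deleted")
        except Exception as e:
            failed_models.append(name)
            print(f"   ❌ Failed to delete {name} model: {e}")

    if failed_models:
        print("\n   You may need to delete them manually from the console")
    else:
        print("\n✅ All models deleted successfully!")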
else:
    print("\n⚠️  Models not deleted (delete_models=False)")

## Next Steps

Now that you've successfully benchmarked EAGLE on Vertex AI, here are some next steps:

### 1. Test with Your Own Data
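Replace ShareGPT with your production prompts. Each conversation needs at least two turns: vLLM's ShareGPT loader skips entries with fewer than two turns and, unless an output length is set explicitly, uses the second turn's length as the target output length. Point `DATASET_PATH` at the saved file before rerunning the benchmarks.
```python
import json

# Save your prompts as JSON in ShareGPT format
your_prompts = [
    {
        "id": "1",
        "conversations": [
            {"from": "human", "value": "Your prompt here"},
            {"from": "gpt", "value": "A representative response"},
        ],
    },
    # ... more prompts
]
with open("/tmp/your_prompts.json", "w") as f:
    json.dump(your_prompts, f)
```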

### 2. Optimize EAGLE Parameters
Experiment with different EAGLE configurations:
- `--speculative-num-steps`: Try 2, 3, 4, 5 (higher = more speculation)
- `--speculative-num-draft-tokens`: Try 4, 8, 12 (higher = more tokens per step)
- `--speculative-eagle-topk`: Try 3, 4, 5 (higher = more diverse predictions)
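
One way to organize such an experiment is to enumerate the combinations up front. The sketch below builds argument lists from the flag names and values listed above. The helper name is hypothetical, and whether every combination is valid for your serving container is an assumption you should check before deploying.

```python
from itertools import product

# Flag names and candidate values taken from the list above
SWEEP = {
    "--speculative-num-steps": [2, 3, 4, 5],
    "--speculative-num-draft-tokens": [4, 8, 12],
    "--speculative-eagle-topk": [3, 4, 5],
}


def build_arg_sets(sweep: dict) -> list:
    """Return one flat list of CLI arguments per parameter combination."""
    flags = list(sweep)
    return [
        [part for flag, value in zip(flags, combo) for part in (flag, str(value))]
        for combo in product(*(sweep[flag] for flag in flags))
    ]


arg_sets = build_arg_sets(SWEEP)
print(f"{len(arg_sets)} combinations, first: {arg_sets[0]}")
```

Each combination requires its own deployment, so the full grid of 36 is expensive on this hardware. In practice you would likely pick a handful of combinations rather than run all of them.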

### 3. Production Deployment
For production use:
- Enable autoscaling: `min_replica_count=1, max_replica_count=5`
- Set up monitoring and alerting
- Implement A/B testing between baseline and EAGLE
- Configure request/response logging

### 4. Cost Optimization
- Compare cost per token: `GPU cost / tokens generated`
- Calculate break-even point for your workload
- Consider using cheaper GPUs (L4, A100) if throughput requirements are lower
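
The cost-per-token comparison can be sketched as follows. The hourly price and throughput numbers are placeholders, not measured results or real pricing; substitute your own endpoint cost and the `Throughput (tok/s)` values from the analysis above.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Return the serving cost in USD per one million output tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Placeholder values: replace with your actual pricing and measured throughput
HOURLY_COST_USD = 90.0
baseline_cost = cost_per_million_tokens(HOURLY_COST_USD, tokens_per_second=1000.0)
eagle_cost = cost_per_million_tokens(HOURLY_COST_USD, tokens_per_second=1500.0)

print(f"Baseline: ${baseline_cost:.2f} per 1M tokens")
print(f"EAGLE:    ${eagle_cost:.2f} per 1M tokens")
```

If both endpoints run on the same hardware at the same hourly price, any throughput gain lowers the cost per token proportionally. If the EAGLE deployment costs more per hour, compare the two results directly to find your break-even point.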

## Additional Resources

- [Vertex AI Model Garden Documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/explore-models)
- [vLLM Speculative Decoding Guide](https://docs.vllm.ai/en/latest/models/spec_decode.html)
- [EAGLE Paper (arXiv)](https://arxiv.org/abs/2401.15077)
- [vLLM Benchmark Documentation](https://docs.vllm.ai/en/latest/serving/benchmarking.html)
- [Llama 4 Model Card](https://ai.meta.com/llama/)