# Model Comparison Notebook (Azure OpenAI Responses API)

This notebook compares token consumption and latency across multiple Azure OpenAI (Foundry) model deployments using the modern unified **Responses API** (`POST /openai/v1/responses`) authenticated with Entra ID (`DefaultAzureCredential`). A compatibility flag allows falling back to the legacy deployment-scoped endpoint if needed.

## What this notebook does
- Sends the *same* prompt to a list of model deployment names (specified via environment variables) using the unified Responses API (`model` provided in the JSON body).
- Collects input, output, and reasoning token usage (if available).
- Measures request latency (wall clock).
- Aggregates results into a colorized table using `rich`, with a plain-text fallback when `rich` is not installed.
- Highlights the most efficient model (lowest total tokens) and shows each other model's percentage difference from it.
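
The ranking and percentage-difference step can be sketched on its own. This is a minimal illustration rather than the notebook's reporting cell: `rank_by_total` and the sample totals are hypothetical, and the percentage is assumed to be measured against the lowest total.

```python
from typing import Dict, List, Tuple

def rank_by_total(totals: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (model, total_tokens, delta_pct) sorted by ascending total tokens.

    delta_pct is the percentage above the lowest total (0.0 for the best model).
    """
    ordered = sorted(totals.items(), key=lambda kv: kv[1])
    if not ordered:
        return []
    best = ordered[0][1]
    return [
        (name, total, ((total - best) / best) * 100 if best > 0 else 0.0)
        for name, total in ordered
    ]

# Hypothetical token totals, for illustration only
sample_totals = {"model-a": 200, "model-b": 250, "model-c": 400}
for name, total, delta in rank_by_total(sample_totals):
    print(f"{name}: {total} tokens ({delta:+.1f}%)")
```

Measuring against the best model keeps every delta non-negative, which makes the table easy to scan.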

## Requirements / Assumptions
1. You have valid Azure OpenAI model deployments (their deployment names are passed as the `model` value).
2. You are signed in (for example with `az login`) or otherwise have an environment that `DefaultAzureCredential` can authenticate from.
3. Environment variables are set as described in the Environment Variables & Naming Pattern section below.
4. Responses API returns usage fields (`input_tokens`, `output_tokens`, `total_tokens`, optional reasoning tokens via `reasoning_tokens` or `output_tokens_details.reasoning_tokens`).
5. Set `USE_LEGACY_DEPLOYMENT_PATH=True` if your region / API version still requires the older `/openai/deployments/{deployment}/responses?api-version=...` path.

If anything is missing, the validation cell fails fast with a `RuntimeError` that lists the missing environment variables.
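
As a rough sketch of how the usage fields from requirement 4 might be read, the snippet below pulls token counts out of a response-shaped dictionary. The helper name `read_usage` and the sample payload are made up for illustration; real counts come from the service.

```python
from typing import Any, Dict, Optional

def read_usage(response: Dict[str, Any]) -> Dict[str, Optional[int]]:
    """Pull token counts out of a Responses API style payload, tolerating missing fields."""
    usage = response.get("usage") or {}
    details = usage.get("output_tokens_details") or {}
    # Prefer a top-level reasoning_tokens value, then the nested details variant
    reasoning = usage.get("reasoning_tokens")
    if reasoning is None:
        reasoning = details.get("reasoning_tokens")
    return {
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "reasoning_tokens": reasoning,
        "total_tokens": usage.get("total_tokens"),
    }

# Made-up payload for illustration
sample_response = {
    "usage": {
        "input_tokens": 20,
        "output_tokens": 180,
        "output_tokens_details": {"reasoning_tokens": 64},
        "total_tokens": 200,
    }
}
print(read_usage(sample_response))
print(read_usage({}))  # every field is None when usage is absent
```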

## Environment Variables & Naming Pattern

Required shared variables:
- `AZURE_OPENAI_ENDPOINT` (e.g. https://my-foundry-endpoint.openai.azure.com)
- `AZURE_OPENAI_API_VERSION` (legacy fallback only; ignored in unified mode unless you flip the flag)

Per-model deployment variables follow: `AZURE_<MODEL_NAME>_MODEL`.
Examples:
- `AZURE_GPT5_MODEL=your-gpt5-deployment`
- `AZURE_GPT4O_MODEL=your-gpt4o-deployment`
- `AZURE_GPT41_MODEL=your-gpt-4.1-deployment`
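
The naming pattern above can be sketched as a small lookup. This is an illustrative helper, not the notebook's validation cell: `resolve_deployments` and `fake_env` are hypothetical names, and a plain mapping stands in for `os.environ` so the example runs without any variables set.

```python
from typing import Dict, List, Mapping, Tuple

def resolve_deployments(names: List[str], env: Mapping[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Map logical model names to deployment names via AZURE_<MODEL_NAME>_MODEL.

    Returns (resolved, missing), where missing lists the unset variable names.
    """
    resolved: Dict[str, str] = {}
    missing: List[str] = []
    for name in names:
        var = f"AZURE_{name.upper()}_MODEL"
        value = env.get(var)
        if value:
            resolved[name] = value
        else:
            missing.append(var)
    return resolved, missing

# Illustrative environment; in practice pass os.environ
fake_env = {"AZURE_GPT4O_MODEL": "your-gpt4o-deployment"}
resolved, missing = resolve_deployments(["GPT4O", "GPT41"], fake_env)
print("resolved:", resolved)
print("missing:", missing)
```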

Unified Responses API Mode (default):
- URL: `POST {AZURE_OPENAI_ENDPOINT}/openai/v1/responses`
- Body includes: `{ "model": "<deployment_name>", "input": ... }`

Legacy Mode (`USE_LEGACY_DEPLOYMENT_PATH=True`):
- URL: `POST {AZURE_OPENAI_ENDPOINT}/openai/deployments/<deployment_name>/responses?api-version=<AZURE_OPENAI_API_VERSION>`
- Body may omit `model` (we still include it harmlessly for consistency).
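
To make the two URL shapes concrete, here is a minimal sketch that builds either one from a resource-root endpoint. The function name `responses_url` is hypothetical, and it assumes the endpoint has no `/openai` suffix of its own.

```python
def responses_url(endpoint: str, deployment: str, legacy: bool = False, api_version: str = "") -> str:
    """Build the request URL for unified or legacy mode from a resource-root endpoint."""
    base = endpoint.rstrip("/")
    if legacy:
        # Deployment-scoped path; the deployment name lives in the URL
        return f"{base}/openai/deployments/{deployment}/responses?api-version={api_version}"
    # Unified path; the deployment name travels in the JSON body as "model"
    return f"{base}/openai/v1/responses"

example_endpoint = "https://my-foundry-endpoint.openai.azure.com"
print(responses_url(example_endpoint, "your-gpt4o-deployment"))
print(responses_url(example_endpoint, "your-gpt4o-deployment", legacy=True, api_version="2025-04-01-preview"))
```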

Scope used for token acquisition: `https://cognitiveservices.azure.com/.default`
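
A minimal sketch of turning that scope into request headers follows. `bearer_headers` is a hypothetical helper, and `FakeCredential` / `FakeToken` are stand-ins so the example runs offline; with `azure-identity` installed and a signed-in environment, an instance of `DefaultAzureCredential` would be passed instead.

```python
from typing import Any, Dict

SCOPE = "https://cognitiveservices.azure.com/.default"

def bearer_headers(credential: Any, scope: str = SCOPE) -> Dict[str, str]:
    """Request a token for the scope and wrap it in HTTP headers.

    `credential` is any object whose get_token(scope) returns an object
    with a `.token` attribute.
    """
    access_token = credential.get_token(scope)
    return {
        "Authorization": f"Bearer {access_token.token}",
        "Content-Type": "application/json",
    }

class FakeToken:
    """Stand-in for an access token object (illustration only)."""
    token = "example-token"

class FakeCredential:
    """Stand-in credential that never touches the network (illustration only)."""
    def get_token(self, *scopes: str) -> FakeToken:
        return FakeToken()

print(bearer_headers(FakeCredential()))
```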

In [1]:
# User Configuration Cell
# Adjust MODEL_PROFILES to specify mode and reasoning parameters per model.
# mode: 'chat' or 'reasoning'
# effort: reasoning effort level ('low','medium','high') when mode == 'reasoning'
# max_reasoning_tokens: optional cap for reasoning token generation
# NOTE: The modern Responses API format uses POST /openai/v1/responses with a 'model' field.
# Set USE_LEGACY_DEPLOYMENT_PATH=True to fall back to older deployment-scoped endpoint shape.
USE_LEGACY_DEPLOYMENT_PATH = False

MODEL_PROFILES = {
    "GPT5": {"mode": "reasoning", "effort": "medium", "max_reasoning_tokens": 12000},
    "GPT5MINI": {"mode": "reasoning", "effort": "low", "max_reasoning_tokens": 8000},
    "GPT5CHAT": {"mode": "chat"},
    "GPT4O": {"mode": "chat"},
    "GPT41": {"mode": "chat"},
}

# Backward compatibility: derive MODEL_NAMES list
MODEL_NAMES = list(MODEL_PROFILES.keys())

PROMPT = ("Explain the principle of least action in classical mechanics in 3 concise bullet points.")
MAX_OUTPUT_TOKENS = 512  # Set None to omit
TEMPERATURE = 0.2
REQUEST_TIMEOUT_SECONDS = 60
RETRIES = 2  # Additional attempts after the first (total attempts = 1 + RETRIES)
PARALLEL = True  # Set False for sequential (easier debugging)
MAX_IN_FLIGHT = None  # Optionally cap parallel concurrency; None = len(MODEL_NAMES)
EXPORT_RESULTS_JSON = True  # Write results_<timestamp>.json after run
SHOW_FULL_TEXT = False  # If True, will display full responses per model (can be verbose)
RANK_BY = 'total_tokens'  # Field used for ranking (must exist in record)

# Validation set for reasoning effort
VALID_REASONING_EFFORT = {"low", "medium", "high"}


In [2]:
# Imports, dependency checks, and .env loading
import os, json, time, math, asyncio, datetime, textwrap
from typing import List, Dict, Any, Optional

try:
    from dotenv import load_dotenv  # type: ignore
    load_dotenv()
except Exception:
    pass  # It's fine if python-dotenv is not present; env vars may already be set

from azure.identity import DefaultAzureCredential
import aiohttp
try:
    from rich.table import Table
    from rich.console import Console
    from rich import box
    console = Console()
    HAVE_RICH = True
except Exception:
    HAVE_RICH = False
    console = None
    print("[WARN] rich not available; falling back to plain-text output.")

SCOPE = "https://cognitiveservices.azure.com/.default"


## Chat vs Reasoning Modes
This notebook supports both standard chat-style invocations and reasoning-capable models in the same run.

Configuration:
- Each entry in `MODEL_PROFILES` declares a logical model name and a profile.
- `mode`: `chat` sends a normal Responses API request without a `reasoning` block.
- `mode`: `reasoning` adds a `reasoning` object to the payload.
- `effort`: One of `low|medium|high` (default `medium` if invalid or omitted).
- `max_reasoning_tokens`: Optional integer to cap reasoning token generation.

Behavior:
- Results table will include separate `Mode` and `Effort` columns.
- If reasoning was requested but the service returns no reasoning tokens, the row will show `0 ⚠️` in the Reasoning column and a note will be added.
- Token ranking still uses `total_tokens` unless you change `RANK_BY`.

Adjust `MODEL_PROFILES` to experiment with different mixes of reasoning and chat models.
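
The difference between the two modes comes down to the request body. The sketch below is a simplified illustration, not the notebook's `build_body`: `payload_for` is a hypothetical name, the prompt is passed as a plain string `input`, and `max_reasoning_tokens` is left out to keep the example short.

```python
from typing import Any, Dict

VALID_EFFORT = {"low", "medium", "high"}

def payload_for(profile: Dict[str, Any], deployment: str, prompt: str) -> Dict[str, Any]:
    """Build a request body; only reasoning profiles get a `reasoning` block."""
    body: Dict[str, Any] = {"model": deployment, "input": prompt}
    if profile.get("mode", "chat") == "reasoning":
        effort = profile.get("effort", "medium")
        if effort not in VALID_EFFORT:
            effort = "medium"  # invalid values fall back to the default
        body["reasoning"] = {"effort": effort}
    return body

print(payload_for({"mode": "chat"}, "your-gpt4o-deployment", "Hello"))
print(payload_for({"mode": "reasoning", "effort": "low"}, "your-gpt5-deployment", "Hello"))
```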

In [3]:
# Validate Environment Variables and Build Model Deployment Mapping
endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
api_version = os.getenv('AZURE_OPENAI_API_VERSION')
missing = []
if not endpoint: missing.append('AZURE_OPENAI_ENDPOINT')
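# The API version is only needed for the legacy deployment-scoped path
if USE_LEGACY_DEPLOYMENT_PATH and not api_version: missing.append('AZURE_OPENAI_API_VERSION')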
model_deployments = {}  # model_name -> deployment id
for name in MODEL_NAMES:
    env_var = f'AZURE_{name.upper()}_MODEL'
    val = os.getenv(env_var)
    if not val:
        missing.append(env_var)
    else:
        model_deployments[name] = val

if missing:
    raise RuntimeError(f'Missing required environment variables: {missing}')

print('Endpoint:', endpoint)
print('API Version:', api_version)
print('Models resolved:', model_deployments)


Endpoint: https://admin-mdrh8xul-eastus2.openai.azure.com/openai/v1/
API Version: 2025-04-01-preview
Models resolved: {'GPT5': 'gpt-5', 'GPT5MINI': 'gpt-5-mini', 'GPT5CHAT': 'gpt-5-chat', 'GPT4O': 'gpt-4o', 'GPT41': 'gpt-4.1'}


In [4]:
# Credential & Token Handling (simple cache)
credential = DefaultAzureCredential()
_token_cache = { 'value': None, 'expires_on': 0 }

def get_bearer_token(force: bool = False) -> str:
    now = time.time()
    if (not force) and _token_cache['value'] and now < _token_cache['expires_on'] - 60:
        return _token_cache['value']
    token = credential.get_token(SCOPE)
    _token_cache['value'] = token.token
    # AccessToken.expires_on is an epoch timestamp (seconds); fall back to now + 10 minutes if it is missing
    _token_cache['expires_on'] = getattr(token, 'expires_on', now + 600)
    return _token_cache['value']

# Quick smoke test
_ = get_bearer_token()
print('Acquired Azure AD token (truncated):', _[:24] + '...')


Acquired Azure AD token (truncated): eyJ0eXAiOiJKV1QiLCJhbGci...


In [5]:
# Helper Functions

def build_url(deployment: str) -> str:
    base = endpoint.rstrip('/')
    if USE_LEGACY_DEPLOYMENT_PATH:
        # Legacy style: /openai/deployments/{deployment}/responses?api-version=<version>
        # Drop a trailing /openai or /openai/v1 so the path segment is not duplicated.
        for suffix in ('/openai/v1', '/openai'):
            if base.endswith(suffix):
                base = base[:-len(suffix)]
                break
        return f"{base}/openai/deployments/{deployment}/responses?api-version={api_version}"
    # Modern unified Responses API endpoint (model specified in body)
    if base.endswith('/openai/v1'):
        return f"{base}/responses"
    if base.endswith('/openai'):
        return f"{base}/v1/responses"
    # Assume base is resource root like https://xyz.openai.azure.com
    return f"{base}/openai/v1/responses"

def truncate(text: Optional[str], limit: int = 120) -> Optional[str]:
    if text is None: return None
    if len(text) <= limit: return text
    return text[:limit-3] + '...'

def build_body(profile: Dict[str, Any], deployment: str) -> Dict[str, Any]:
    # In modern mode we include model; in legacy mode including it is harmless
    body: Dict[str, Any] = {
        'model': deployment,
        # Responses API input parts use the 'input_text' content type
        'input': [ { 'role': 'user', 'content': [ { 'type': 'input_text', 'text': PROMPT } ] } ],
        'temperature': TEMPERATURE,
    }
    if MAX_OUTPUT_TOKENS is not None:
        body['max_output_tokens'] = MAX_OUTPUT_TOKENS
    mode = profile.get('mode', 'chat')
    if mode == 'reasoning':
        effort = profile.get('effort', 'medium')
        if effort not in VALID_REASONING_EFFORT:
            profile.setdefault('notes', []).append(f'invalid_effort:{effort}->medium')
            effort = 'medium'
        reasoning_obj: Dict[str, Any] = {'effort': effort}
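        mrt = profile.get('max_reasoning_tokens')
        if isinstance(mrt, int) and mrt > 0:
            # NOTE: field name as used by this notebook; remove it from the profile if the service rejects the request
            reasoning_obj['max_reasoning_tokens'] = mrt
        body['reasoning'] = reasoning_obj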
    return body

def extract_usage(obj: Dict[str, Any]) -> Dict[str, Optional[int]]:
    usage = obj.get('usage') or {}
    # Direct & legacy fields
    input_tokens = usage.get('input_tokens') or usage.get('prompt_tokens')
    output_tokens = usage.get('output_tokens') or usage.get('completion_tokens')
    # Reasoning tokens may appear in multiple places
    reasoning_tokens = (
        usage.get('reasoning_tokens')
        or (usage.get('output_tokens_details') or {}).get('reasoning_tokens')
        or usage.get('output_reasoning_tokens')
        or usage.get('output_tokens_reasoning')
    )
    # Older nested input tokens details variant
    if (not reasoning_tokens) and isinstance(usage.get('input_tokens_details'), dict):
        reasoning_tokens = usage['input_tokens_details'].get('reasoning_tokens')
    total_tokens = usage.get('total_tokens')
    if total_tokens is None and (input_tokens is not None or output_tokens is not None):
        # Reasoning tokens are reported as part of output_tokens (see output_tokens_details), so they are not added again
        total_tokens = (input_tokens or 0) + (output_tokens or 0)
    return {
        'input_tokens': input_tokens,
        'output_tokens': output_tokens,
        'reasoning_tokens': reasoning_tokens,
        'total_tokens': total_tokens
    }

def extract_text(obj: Dict[str, Any]) -> Optional[str]:
    # Attempt common shapes for Responses API
    if 'output' in obj and isinstance(obj['output'], list):
        # Reasoning models may emit a reasoning item first, so scan every item for text
        for item in obj['output']:
            if not isinstance(item, dict) or not isinstance(item.get('content'), list):
                continue
            for seg in item['content']:
                if isinstance(seg, dict) and (seg.get('text') or seg.get('value')):
                    return seg.get('text') or seg.get('value')
    if 'choices' in obj and isinstance(obj['choices'], list):  # fallback pattern
        try:
            ch = obj['choices'][0]
            msg = ch.get('message') or {}
            if 'content' in msg and isinstance(msg['content'], list):
                part = msg['content'][0]
                if isinstance(part, dict):
                    return part.get('text') or part.get('value')
            if isinstance(msg.get('content'), str):
                return msg['content']
        except Exception:
            pass
    # Some responses include output_text shortcut
    if 'output_text' in obj and isinstance(obj['output_text'], str):
        return obj['output_text']
    return None

async def request_model(session: aiohttp.ClientSession, model_name: str, deployment: str, profile: Dict[str, Any]) -> Dict[str, Any]:
    url = build_url(deployment)
    body = build_body(profile, deployment)
    attempt = 0
    start_time = time.perf_counter()
    last_error = None
    mode = profile.get('mode', 'chat')
    effort = profile.get('effort') if mode == 'reasoning' else None
    # Copy so per-request notes do not leak back into the shared MODEL_PROFILES entry
    notes: List[str] = list(profile.get('notes', []))
    debug_prefix = '[LEGACY]' if USE_LEGACY_DEPLOYMENT_PATH else '[UNIFIED]'
    print(f"{debug_prefix} Requesting {model_name} -> model={deployment} endpoint={url}")
    while attempt <= RETRIES:
        attempt += 1
        start_time = time.perf_counter()  # time this attempt only, excluding earlier backoff sleeps
        token = get_bearer_token()
        headers = {
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        }
        try:
            timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_SECONDS)
            async with session.post(url, headers=headers, json=body, timeout=timeout) as resp:
                status = resp.status
                data = await resp.json(content_type=None)
                if status >= 500 or status == 429:
                    last_error = f'Status {status}: {data}'
                    if attempt <= RETRIES:
                        await asyncio.sleep(2 ** attempt)
                        continue
                if status >= 400:
                    end = time.perf_counter()
                    return {
                        'model_name': model_name, 'deployment': deployment, 'status': 'error',
                        'http_status': status, 'latency_ms': (end - start_time) * 1000,
                        'input_tokens': None, 'output_tokens': None, 'reasoning_tokens': None, 'total_tokens': None,
                        'response_excerpt': None, 'error': str(data),
                        'mode': mode, 'reasoning_effort': effort, 'reasoning_enabled': mode=='reasoning', 'notes': notes
                    }
                usage = extract_usage(data)
                text = extract_text(data)
                end = time.perf_counter()
                if mode == 'reasoning' and not usage.get('reasoning_tokens'):
                    notes.append('reasoning_tokens_missing')
                return {
                    'model_name': model_name, 'deployment': deployment, 'status': 'ok',
                    'http_status': status, 'latency_ms': (end - start_time) * 1000,
                    **usage, 'response_excerpt': truncate(text), 'error': None,
                    'mode': mode, 'reasoning_effort': effort, 'reasoning_enabled': mode=='reasoning', 'notes': notes
                }
        except asyncio.TimeoutError:
            last_error = 'timeout'
        except Exception as ex:
            last_error = repr(ex)
        if attempt <= RETRIES:
            await asyncio.sleep(2 ** attempt)
    end = time.perf_counter()
    return {
        'model_name': model_name, 'deployment': deployment, 'status': 'error',
        'http_status': None, 'latency_ms': (end - start_time) * 1000,
        'input_tokens': None, 'output_tokens': None, 'reasoning_tokens': None, 'total_tokens': None,
        'response_excerpt': None, 'error': last_error,
        'mode': mode, 'reasoning_effort': effort, 'reasoning_enabled': mode=='reasoning', 'notes': notes
    }


In [9]:
# Orchestration to Run All Models (event-loop safe)
async def run_all() -> List[Dict[str, Any]]:
    results: List[Dict[str, Any]] = []
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT or len(model_deployments))
    async with aiohttp.ClientSession() as session:
        async def run_one(name, dep, profile):
            async with semaphore:
                return await request_model(session, name, dep, profile)
        items = [(n, model_deployments[n], MODEL_PROFILES.get(n, {'mode':'chat'})) for n in MODEL_NAMES if n in model_deployments]
        if PARALLEL:
            tasks = [asyncio.create_task(run_one(n, d, p)) for n, d, p in items]
            for t in asyncio.as_completed(tasks):
                res = await t
                results.append(res)
        else:
            for n, d, p in items:
                res = await run_one(n, d, p)
                results.append(res)
    return results

# Run with compatibility for existing event loop environments (e.g., Jupyter)
try:
    import nest_asyncio  # type: ignore
    nest_asyncio.apply()
except Exception:
    pass

import sys
if 'ipykernel' in sys.modules:
    # Jupyter / IPython already runs an event loop and supports top-level await
    results = await run_all()  # type: ignore
else:
    # Standard Python execution path
    results = asyncio.run(run_all())

print('Raw Results:')
print(json.dumps(results, indent=2))


Requesting GPT5 at https://admin-mdrh8xul-eastus2.openai.azure.com/openai/deployments/gpt-5/responses?api-version=2025-04-01-preview
Requesting GPT5MINI at https://admin-mdrh8xul-eastus2.openai.azure.com/openai/deployments/gpt-5-mini/responses?api-version=2025-04-01-preview
Requesting GPT5CHAT at https://admin-mdrh8xul-eastus2.openai.azure.com/openai/deployments/gpt-5-chat/responses?api-version=2025-04-01-preview
Requesting GPT4O at https://admin-mdrh8xul-eastus2.openai.azure.com/openai/deployments/gpt-4o/responses?api-version=2025-04-01-preview
Requesting GPT41 at https://admin-mdrh8xul-eastus2.openai.azure.com/openai/deployments/gpt-4.1/responses?api-version=2025-04-01-preview
Raw Results:
[
  {
    "model_name": "GPT5CHAT",
    "deployment": "gpt-5-chat",
    "status": "error",
    "http_status": 404,
    "latency_ms": 376.4356250030687,
    "input_tokens": null,
    "output_tokens": null,
    "reasoning_tokens": null,
    "total_tokens": null,
    "response_excerpt": null,
    "err

In [None]:
# Post-processing, Ranking, and Reporting
def compute_rankings(recs: List[Dict[str, Any]], key: str = RANK_BY) -> List[Dict[str, Any]]:
    valid = [r for r in recs if r.get(key) is not None and r['status']=='ok']
    valid_sorted = sorted(valid, key=lambda r: r.get(key))
    rank_map = {id(r): idx+1 for idx, r in enumerate(valid_sorted)}
    best_total = valid_sorted[0].get(key) if valid_sorted else None
    for r in recs:
        if id(r) in rank_map:
            r['rank'] = rank_map[id(r)]
            if best_total and r.get(key) is not None:
                r['delta_pct'] = ((r.get(key) - best_total)/best_total)*100 if best_total > 0 else 0
            else:
                r['delta_pct'] = None
        else:
            r['rank'] = None
            r['delta_pct'] = None
    return recs

results = compute_rankings(results)

def render_table(recs: List[Dict[str, Any]]):
    if HAVE_RICH:
        table = Table(title='Model Comparison', box=box.MINIMAL_DOUBLE_HEAD)
        cols = ['Rank','Model','Mode','Effort','Input','Output','Reasoning','Total','Latency(ms)','ΔTotal%','Status','Excerpt']
        for c in cols:
            table.add_column(c)
        ok_with_total = [r for r in recs if r['status']=='ok' and r.get('total_tokens') is not None]
        best_total = min([r['total_tokens'] for r in ok_with_total], default=None)
        worst_total = max([r['total_tokens'] for r in ok_with_total], default=None)
        for r in sorted(recs, key=lambda x: (x['rank'] if x['rank'] is not None else 1e9)):
            def fmt(v): return '' if v is None else str(v)
            reasoning = r.get('reasoning_tokens')
            reasoning_enabled = r.get('reasoning_enabled')
            notes = r.get('notes') or []
            if reasoning is not None and reasoning > 0:
                reasoning_str = f"{reasoning} 🧠"
            else:
                if reasoning_enabled and 'reasoning_tokens_missing' in notes:
                    reasoning_str = '0 ⚠️'
                else:
                    reasoning_str = fmt(reasoning)
            total = r.get('total_tokens')
            total_str = fmt(total)
            style = None
            if r['status'] != 'ok':
                style = 'red'
            elif best_total is not None and total == best_total:
                total_str = f"[bold green]{total} ✅[/]"
            elif worst_total is not None and total == worst_total:
                total_str = f"[bold red]{total} ❌[/]"
            delta = r.get('delta_pct')
            delta_str = '' if delta is None else f"{delta:+.1f}%"
            latency_str = f"{r.get('latency_ms'):.1f}" if r.get('latency_ms') is not None else ''
            row = [
                fmt(r.get('rank')),
                r['model_name'],
                r.get('mode') or '',
                r.get('reasoning_effort') or '',
                fmt(r.get('input_tokens')),
                fmt(r.get('output_tokens')),
                reasoning_str,
                total_str,
                latency_str,
                delta_str,
                'OK' if r['status']=='ok' else 'ERROR',
                r.get('response_excerpt') or (r.get('error') or '')
            ]
            table.add_row(*row, style=style)
        console.print(table)
        ok_latencies = [r['latency_ms'] for r in recs if r['status']=='ok' and r.get('latency_ms') is not None]
        if ok_latencies:
            console.print(f"Average latency (ok only): {sum(ok_latencies)/len(ok_latencies):.1f} ms")
        missing_usage = [r['model_name'] for r in recs if r['status']=='ok' and r.get('total_tokens') is None]
        if missing_usage:
            console.print(f"[yellow]Warning: Missing token usage for: {missing_usage}[/]")
    else:
        print('Rank | Model | Mode | Effort | Input | Output | Reasoning | Total | Latency(ms) | ΔTotal% | Status | Excerpt/Error')
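        # Match the rich table: ranked rows first, unranked (failed) rows last
        for r in sorted(recs, key=lambda x: (x.get('rank') if x.get('rank') is not None else 1e9)):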
            print(
                f"{r.get('rank')} | {r['model_name']} | {r.get('mode')} | {r.get('reasoning_effort')} | "
                f"{r.get('input_tokens')} | {r.get('output_tokens')} | {r.get('reasoning_tokens')} | {r.get('total_tokens')} | "
                f"{r.get('latency_ms') and round(r.get('latency_ms'),1)} | {r.get('delta_pct')} | {r['status']} | "
                f"{r.get('response_excerpt') or r.get('error')}"
            )

render_table(results)


In [None]:
# Export results (if enabled)
if EXPORT_RESULTS_JSON:
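    # datetime.utcnow() is deprecated since Python 3.12; use an explicit UTC timezone
    ts = datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%dT%H%M%SZ')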
    fname = f'results_{ts}.json'
    with open(fname, 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=2)
    print('Wrote', fname)
