# Responses API Testing

This notebook exercises the Responses API available with Azure OpenAI. It begins by obtaining an Entra ID token provider so that Managed Identity can be used to secure the calls. The example cells that follow authenticate with an API key.

In [5]:
# Setup imports, environment, and credential
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
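The `token_provider` created above is a callable that returns a bearer token string each time it is invoked. As a minimal sketch of how such a token could be used in the raw REST cells later in this notebook, the hypothetical helper `build_auth_headers` below (not part of the original notebook) chooses between an `Authorization: Bearer` header and the `api-key` header. The stand-in provider in the usage lines is an assumption for illustration only; a real call would pass `token_provider` instead.

```python
from typing import Callable, Dict, Optional


def build_auth_headers(
    token_provider: Optional[Callable[[], str]] = None,
    api_key: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: build request headers for either auth mode."""
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if token_provider is not None:
        # Entra ID path: invoke the provider to obtain a current bearer token.
        headers["Authorization"] = f"Bearer {token_provider()}"
    elif api_key:
        # API key path, matching the header used by the REST cells.
        headers["api-key"] = api_key
    else:
        raise ValueError("Provide either token_provider or api_key")
    return headers


# Stand-in provider for illustration only (assumption, not a real token).
def fake_provider() -> str:
    return "example-token"


bearer_headers = build_auth_headers(token_provider=fake_provider)
key_headers = build_auth_headers(api_key="example-key")
```

Calling the provider per request, rather than caching the string, leaves token refresh to the credential object.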

## Test Stateless using the SDK and Encrypted Reasoning Tokens

In [23]:
# OpenAI SDK Async Responses API call
from rich import print as rprint
import os
from openai import AsyncOpenAI
import asyncio

load_dotenv()  # Load environment variables from .env file if present

os.environ["OPENAI_LOG"] = "debug"   # Options: debug | info | warn | error

# Non-streaming request (stream=False) with a slightly longer prompt and high reasoning effort.
inputs = [{ "role": "user", "content": "Explain the transformer architecture behind modern LLMs in terms a sixth grader could understand." }]
reasoning_level = {
    "effort": "high",
    "summary": "detailed"
}

client = AsyncOpenAI(
    base_url=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = await client.responses.create(
    model=os.getenv("AZURE_GPT5_MODEL"),
    input=inputs,
    stream=False,
    reasoning=reasoning_level,
    include=["reasoning.encrypted_content"],
    store=False,
)

print("Response:")
for o in response.output:
    print(o.type, o.to_dict())

# Follow-up question to test context retention
inputs.extend(response.output)
inputs.append({ "role": "user", "content": "Can you explain your reasoning for the last answer? What made you think this was suitable for a sixth grader?" })
response_1 = await client.responses.create(
    model=os.getenv("AZURE_GPT5_MODEL"),
    input=inputs,
    stream=False,
    reasoning=reasoning_level,
    include=["reasoning.encrypted_content"],
    store=False,
)

print("\nFollow-up Response:")
for o in response_1.output:
    print(o.type, o.to_dict())


Response:
reasoning {'id': 'rs_68d3f3a232c881908757f1ea2ec96a3300795284ea9c6894', 'summary': [{'text': "**Explaining transformer architecture**\n\nI need to explain transformer architecture for modern language models in a way that a sixth grader can grasp. I’ll keep things simple, possibly using bullet points and analogies. Tokens can be like LEGO bricks, while embeddings work like coordinates. I’ll illustrate attention with examples, like selecting important words in a classroom setting, and position encoding can be compared to page numbers or music beats. I'll mention why it's powerful: it works parallelly, understands long-range relationships, and scales well with GPUs.", 'type': 'summary_text'}, {'text': "**Describing self-attention in Transformers**\n\nI want to explain self-attention using analogies like a spotlight or gossip network. I can simplify queries, keys, and values as a question, label, and info. Multi-head attention is like several spotlights. I'll describe residual co

## Raw REST: Stateless Request with Encrypted Reasoning (API Key Auth)
This cell demonstrates **two stateless** raw Responses API calls using an API Key.

Goal: Show how including `reasoning.encrypted_content` in the first response lets you forward *only* encrypted reasoning artifacts (plus a new question) in a second stateless call to approximate stateful quality—without using `store` or `previous_response_id`.

Flow:
1. Call 1 (stateless): `store: false`, `include: ["reasoning.encrypted_content"]`.
2. Extract the encrypted reasoning items returned (output items of type `reasoning` that carry an `encrypted_content` field).
3. Call 2 (stateless): Provide those encrypted items + a follow‑up user question in `input`.
4. Compare output quality vs the stateful approach (which used `store: true` + `previous_response_id`).

Auth header: `api-key: $AZURE_OPENAI_API_KEY`

Body fields used:
- `model`: deployment name
- `input`: array of items; first call includes the user question; second call includes encrypted reasoning items then the follow‑up user question
- `reasoning`: effort & summary hints
- `include`: reasoning encrypted content (first call only)
- `store`: always `false` here (purely stateless)

Outputs printed:
- Latency & status for both calls
- Token usage
- Whether encrypted reasoning was returned & reused
- Excerpts from each response
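
The flow above can be sketched with plain dictionaries. The hypothetical helper `build_followup_body` (an illustration, not the notebook's implementation) keeps the `reasoning` items that carry `encrypted_content` together with any `message` items from call 1, then appends the follow-up question. Forwarding the assistant message alongside the reasoning item is an assumption made here, and the sample output items are invented placeholders rather than real API output.

```python
from typing import Any, Dict, List


def build_followup_body(
    model: str, first_output: List[Any], followup_question: str
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the stateless second request body."""
    carried: List[Dict[str, Any]] = []
    for item in first_output or []:
        if not isinstance(item, dict):
            continue
        if item.get("type") == "reasoning" and item.get("encrypted_content"):
            carried.append(item)  # encrypted reasoning carrier
        elif item.get("type") == "message":
            carried.append(item)  # assistant answer from call 1 (assumption)
    # The new user turn always goes last.
    carried.append({"role": "user", "content": followup_question})
    return {"model": model, "input": carried, "store": False}


# Invented placeholder items shaped like the fields used in this notebook.
sample_output = [
    {"type": "reasoning", "summary": [], "encrypted_content": "gAAAA-example"},
    {"type": "message", "role": "assistant",
     "content": [{"type": "output_text", "text": "Imagine a transformer..."}]},
]
body2 = build_followup_body("my-deployment", sample_output, "Why was that suitable?")
```

Because `store` stays `False`, everything the model needs for the second turn has to travel inside `input`.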


In [None]:
# Raw REST stateless two-call example leveraging encrypted reasoning (API Key)
import os, json, asyncio, time
from typing import Any, Dict, List
import aiohttp
from dotenv import load_dotenv

load_dotenv()

ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
MODEL = os.getenv("AZURE_GPT5_MODEL")
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
if not ENDPOINT or not MODEL or not API_KEY:
    raise RuntimeError("Missing required env vars: AZURE_OPENAI_ENDPOINT, AZURE_GPT5_MODEL, AZURE_OPENAI_API_KEY")

FIRST_QUESTION = "Explain the transformer architecture behind modern LLMs in terms a sixth grader could understand."
FOLLOWUP_QUESTION = "Why was that explanation appropriate for a sixth grader?"  # purely stateless follow-up

def gather_encrypted_reasoning(output_list: List[Any]) -> List[Dict[str, Any]]:
    """Return a normalized list of encrypted reasoning carrier objects.

    Handles shapes:
    1. {"type": "encrypted_content", "encrypted_content": "..."}
    2. {"type": "reasoning", "encrypted_content": "...", ...}
    3. Nested occurrences inside lists/dicts (future-proof).
    Deduplicates by encrypted_content value.
    """
    candidates: List[Dict[str, Any]] = []
    for item in output_list or []:
        if not isinstance(item, dict):
            continue
        t = item.get("type")
        # Direct encrypted item variants
        if t in ("encrypted_content", "reasoning_encrypted_content") and item.get("encrypted_content"):
            candidates.append({
                "type": t if t != "reasoning_encrypted_content" else "encrypted_content",
                "encrypted_content": item["encrypted_content"]
            })
            continue
        # Reasoning wrapper containing encrypted_content
        if t == "reasoning" and item.get("encrypted_content"):
            candidates.append({
                "type": "reasoning",
                "encrypted_content": item["encrypted_content"]
            })
        # Deep scan for nested dict/list values that expose 'encrypted_content'
        for v in item.values():
            if isinstance(v, dict) and v.get("encrypted_content"):
                candidates.append({
                    "type": v.get("type") or "reasoning",
                    "encrypted_content": v["encrypted_content"]
                })
            elif isinstance(v, list):
                for sub in v:
                    if isinstance(sub, dict) and sub.get("encrypted_content"):
                        candidates.append({
                            "type": sub.get("type") or "reasoning",
                            "encrypted_content": sub["encrypted_content"]
                        })
    # Deduplicate by encrypted_content
    seen = set()
    dedup: List[Dict[str, Any]] = []
    for c in candidates:
        ec = c.get("encrypted_content")
        if ec and ec not in seen:
            seen.add(ec)
            dedup.append(c)
    return dedup

async def stateless_with_encrypted_reasoning():
    url = f"{ENDPOINT}/responses"
    headers = {
        "api-key": API_KEY,
        "Content-Type": "application/json",
        "Accept": "application/json"
    }
    # ---- Call 1 ----
    body1: Dict[str, Any] = {
        "model": MODEL,
        "input": [ { "role": "user", "content": FIRST_QUESTION } ],
        "reasoning": {"effort": "high", "summary": "detailed"},
        "include": ["reasoning.encrypted_content"],
        "store": False
    }
    start1 = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, json=body1) as resp1:
            data1 = await resp1.json(content_type=None)
            lat1 = (time.perf_counter() - start1) * 1000
            status1 = resp1.status
    usage1 = data1.get("usage", {})
    # Collect encrypted reasoning items from the first response's output
    encrypted_items = gather_encrypted_reasoning(data1.get("output"))
    encrypted_found = len(encrypted_items) > 0
    if not encrypted_found:
        print("[warn] No encrypted reasoning items found. Output keys: " +
              f"{[list(o.keys()) for o in data1.get('output', []) if isinstance(o, dict)]}")
    else:
        print(f"Found {len(encrypted_items)} encrypted reasoning item(s)")
    # Optional: show the first 60 chars of the first encrypted item for confirmation (not the full content)
    if encrypted_found:
        print("Sample encrypted fragment: " + encrypted_items[0]["encrypted_content"][:60] + "...")

    excerpt1 = None
    for o in data1.get("output", []):
        if isinstance(o, dict):
            for c in o.get("content", []):
                if isinstance(c, dict) and c.get("text"):
                    excerpt1 = c["text"][:250]
                    break
        if excerpt1: break

    print(f"Call1 Status={status1} Latency={lat1:.1f}ms Tokens in:{usage1.get('input_tokens')} out:{usage1.get('output_tokens')} total:{usage1.get('total_tokens')} Reasoning?={(usage1.get('reasoning_tokens') or (usage1.get('output_tokens_details') or {}).get('reasoning_tokens'))}")
    print(f"Encrypted reasoning items returned: {encrypted_found}")
    if excerpt1:
        print("--- Call1 Excerpt ---\n" + excerpt1 + "\n")

    # ---- Call 2 (stateless follow-up) ----
    followup_input: List[Any] = []
    followup_input.extend(data1.get("output", []))  # forward every output item from call 1 (reasoning items carry encrypted_content)
    followup_input.append({"role": "user", "content": FOLLOWUP_QUESTION})

    body2: Dict[str, Any] = {
        "model": MODEL,
        "input": followup_input,
        "reasoning": {"effort": "high"},
        "store": False
    }
    start2 = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, json=body2) as resp2:
            data2 = await resp2.json(content_type=None)
            lat2 = (time.perf_counter() - start2) * 1000
            status2 = resp2.status
    usage2 = data2.get("usage", {})

    excerpt2 = None
    for o in data2.get("output", []):
        if isinstance(o, dict):
            for c in o.get("content", []):
                if isinstance(c, dict) and c.get("text"):
                    excerpt2 = c["text"][:250]
                    break
        if excerpt2: break

    print(f"Call2 Status={status2} Latency={lat2:.1f}ms Tokens in:{usage2.get('input_tokens')} out:{usage2.get('output_tokens')} total:{usage2.get('total_tokens')} Reasoning?={(usage2.get('reasoning_tokens') or (usage2.get('output_tokens_details') or {}).get('reasoning_tokens'))}")
    print(f"Encrypted items forwarded: {len(encrypted_items)}")
    if excerpt2:
        print("--- Call2 Excerpt ---\n" + excerpt2 + "\n")

    if excerpt1 and excerpt2:
        print(f"Follow-up excerpt length delta: {len(excerpt2) - len(excerpt1)} chars (positive means longer explanation)")

await stateless_with_encrypted_reasoning()

Found 1 encrypted reasoning item(s)
Sample encrypted fragment: gAAAAABo1AtaRh25Mq7xNJq1qvqWX19mocGuTJihCs31qmbEK_XbS8mK1-WD...
Call1 Status=200 Latency=64609.4ms Tokens in:23 out:2220 total:2243 Reasoning?=1472
Encrypted reasoning items returned: True
--- Call1 Excerpt ---
Imagine a transformer as a team of super readers that help a computer write and understand text. Here’s the idea in kid-friendly steps:

- Tokens: First, the text is broken into small pieces called tokens (like words or parts of words). Think of toke

Call2 Status=400 Latency=18040.6ms Tokens in:None out:None total:None Reasoning?=None
Encrypted items forwarded: 1
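
Call 2 above returned status 400, but the cell prints no detail about why. A minimal sketch of a helper that surfaces the error payload and pulls the first text excerpt is shown below. `summarize_response` and `first_text_excerpt` are hypothetical names, the `{"error": {"code": ..., "message": ...}}` shape is an assumption about the error body, and the sample payloads are invented for illustration.

```python
from typing import Any, Dict, Optional


def first_text_excerpt(data: Dict[str, Any], limit: int = 250) -> Optional[str]:
    """Return the first text fragment found in the output items, if any."""
    for item in data.get("output") or []:
        if not isinstance(item, dict):
            continue
        for part in item.get("content") or []:
            if isinstance(part, dict) and part.get("text"):
                return part["text"][:limit]
    return None


def summarize_response(status: int, data: Dict[str, Any]) -> str:
    """Hypothetical helper: one-line summary that includes error details."""
    if status >= 400:
        err = data.get("error")
        err = err if isinstance(err, dict) else {}
        return f"status={status} error_code={err.get('code')} message={err.get('message')}"
    usage = data.get("usage") or {}
    excerpt = first_text_excerpt(data, 40)
    return f"status={status} total_tokens={usage.get('total_tokens')} excerpt={excerpt!r}"


# Invented payloads for illustration only.
ok_payload = {
    "output": [
        {"type": "reasoning", "summary": []},
        {"type": "message", "content": [{"type": "output_text", "text": "Imagine a transformer as a team."}]},
    ],
    "usage": {"total_tokens": 2243},
}
bad_payload = {"error": {"code": "invalid_request", "message": "example message"}}

ok_line = summarize_response(200, ok_payload)
bad_line = summarize_response(400, bad_payload)
```

Printing a line like `bad_line` after each call would make a failed follow-up diagnosable without rerunning the cell.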


## Test Stateful using the SDK

In [24]:
# OpenAI SDK Async Responses API call
from rich import print as rprint
import os
from openai import AsyncOpenAI

load_dotenv()  # Load environment variables from .env file if present

os.environ["OPENAI_LOG"] = "debug"   # Options: debug | info | warn | error

# Non-streaming request (stream=False) with a slightly longer prompt and high reasoning effort.
inputs = [{ "role": "user", "content": "Explain the transformer architecture behind modern LLMs in terms a sixth grader could understand." }]
reasoning_level = {
    "effort": "high",
    "summary": "detailed"
}

client = AsyncOpenAI(
    base_url=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)

response = await client.responses.create(
    model=os.getenv("AZURE_GPT5_MODEL"),
    input=inputs,
    stream=False,
    reasoning=reasoning_level,
    store=True,
)

previous_response_id = response.id
print(f"Previous Response ID: {previous_response_id}")

print("Response:")
for o in response.output:
    print(o.type, o.to_dict())

# Follow-up question to test context retention.
# previous_response_id supplies the earlier turns, so only the new message is sent.
followup = [{ "role": "user", "content": "Can you explain your reasoning for the last answer? What made you think this was suitable for a sixth grader?" }]
response_1 = await client.responses.create(
    model=os.getenv("AZURE_GPT5_MODEL"),
    input=followup,
    stream=False,
    reasoning=reasoning_level,
    previous_response_id=previous_response_id,
    store=True,
)

print("\nFollow-up Response:")
for o in response_1.output:
    print(o.type, o.to_dict())


Previous Response ID: resp_68d3f3e983f88196bd334ffe675d6581013e04e7c4fc8a1b
Response:
reasoning {'id': 'rs_68d3f3ea21708196ad53df108e4b1c9e013e04e7c4fc8a1b', 'summary': [{'text': "**Explaining language models**\n\nI'm considering mentioning that transformers were originally invented for translation, using both encoders and decoders, but many LLMs today, like GPT, only use a decoder. I could explain tokens and context windows—how much they can remember—like a backpack of recent words. I’ll also briefly touch on key-value caches for speed. But I might simplify the explanation for a 6th grader, presenting parameters as knobs and using relatable examples like layers of pancakes or Lego blocks. I’d clarify how attention scores work using intuitive phrases!", 'type': 'summary_text'}, {'text': '**Simplifying model concepts**\n\nI’m thinking about how to explain concepts like multi-head attention. I could say it\'s like multiple friends reading the same sentence and focusing on different parts

## Raw REST: Stateful (Chained) Request (API Key Auth)
This section mirrors the *stateful* SDK example using raw HTTP calls with an **API Key**:
- First call: `store: true` to persist context; capture `response.id`.
- Second call: passes `previous_response_id` to continue the conversation.
- Authentication header: `api-key: $AZURE_OPENAI_API_KEY`.
- Demonstrates the model retaining prior context without resending earlier text; the second request carries only the new follow-up prompt.
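
The two request bodies described above can be sketched as follows. `build_chained_body` is a hypothetical helper, and the deployment name and response id are placeholder values, not real identifiers.

```python
from typing import Any, Dict, Optional


def build_chained_body(
    model: str, prompt: str, previous_response_id: Optional[str] = None
) -> Dict[str, Any]:
    """Hypothetical helper: body for a stored (stateful) Responses call."""
    body: Dict[str, Any] = {"model": model, "input": prompt, "store": True}
    if previous_response_id:
        # Chain onto the stored response instead of resending earlier turns.
        body["previous_response_id"] = previous_response_id
    return body


# Placeholder values for illustration only.
first_body = build_chained_body("my-deployment", "Explain the transformer architecture.")
followup_body = build_chained_body(
    "my-deployment",
    "Why was that explanation appropriate for a sixth grader?",
    previous_response_id="resp_example",
)
```

Only the follow-up body carries `previous_response_id`; its `input` holds just the new question.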


In [28]:
# Raw REST stateful chained example (API Key)
import os, json, asyncio, time
from typing import Any, Dict
import aiohttp
from dotenv import load_dotenv

load_dotenv()
ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
print(ENDPOINT)
MODEL = os.getenv("AZURE_GPT5_MODEL")
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
if not ENDPOINT or not MODEL or not API_KEY:
    raise RuntimeError("Missing required env vars: AZURE_OPENAI_ENDPOINT, AZURE_GPT5_MODEL, AZURE_OPENAI_API_KEY")

async def raw_stateful_api_key():
    url = f"{ENDPOINT}/responses"
    headers = {
        "api-key": API_KEY,
        "Content-Type": "application/json",
        "Accept": "application/json"
    }
    # First request (store=True)
    first_body: Dict[str, Any] = {
        "model": MODEL,
        "input": "Explain the transformer architecture behind modern LLMs in terms a sixth grader could understand.",
        "store": True,
        "reasoning": {"effort": "medium"}
    }
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        async with session.post(url, headers=headers, json=first_body) as resp1:
            data1 = await resp1.json(content_type=None)
            latency1 = (time.perf_counter() - t0) * 1000
            print(f"First call status={resp1.status} latency={latency1:.1f} ms id={data1.get('id')}")
        prev_id = data1.get("id")
        excerpt = None
        for o in data1.get("output", []):
            if isinstance(o, dict):
                for c in o.get("content", []):
                    if isinstance(c, dict) and c.get("text"):
                        excerpt = c["text"][:200]
                        break
            if excerpt:
                break
        if excerpt:
            print("--- First output excerpt ---\n" + excerpt + "\n")
        # Second request using previous_response_id
        followup_body: Dict[str, Any] = {
            "model": MODEL,
            "previous_response_id": prev_id,
            "input": "Why was that explanation appropriate for a sixth grader?",
            "store": True
        }
        t1 = time.perf_counter()
        async with session.post(url, headers=headers, json=followup_body) as resp2:
            data2 = await resp2.json(content_type=None)
            latency2 = (time.perf_counter() - t1) * 1000
            print(f"Follow-up status={resp2.status} latency={latency2:.1f} ms id={data2.get('id')}")
        excerpt2 = None
        for o in data2.get("output", []):
            if isinstance(o, dict):
                for c in o.get("content", []):
                    if isinstance(c, dict) and c.get("text"):
                        excerpt2 = c["text"][:250]
                        break
            if excerpt2:
                break
        if excerpt2:
            print("--- Follow-up output excerpt ---\n" + excerpt2 + "\n")
        def usage_line(d):
            u = d.get("usage", {})
            return f"in:{u.get('input_tokens')} out:{u.get('output_tokens')} total:{u.get('total_tokens')}"
        print("Usage first:", usage_line(data1))
        print("Usage follow-up:", usage_line(data2))

await raw_stateful_api_key()

https://admin-mdrh8xul-eastus2.openai.azure.com/openai/v1
First call status=200 latency=38315.0 ms id=resp_68d3f72c8a2c8195a0e9af342799c04c04c33f6bfd9c3052
--- First output excerpt ---
Imagine you’re writing a story one word at a time. A modern LLM (large language model) is a super-fast “guessing machine” that tries to predict the next word. The transformer is the design that helps 

Follow-up status=200 latency=12814.7 ms id=resp_68d3f752763c8195bd5fe753d370a41604c33f6bfd9c3052
--- Follow-up output excerpt ---
It was aimed at a sixth grader by keeping ideas simple, concrete, and relatable:

- Used everyday analogies: Lego bricks for tokens, secret codes for meanings, seat numbers for word order, and flashlights for attention.
- Gave a clear, familiar examp

Usage first: in:23 out:1472 total:1495
Usage follow-up: in:677 out:512 total:1189
