# APIM ❤️ OpenAI

## Streaming tool

Invokes the OpenAI API with streaming enabled and returns the response in chunks.

Notes:
- Follow the [APIM guidelines for SSE](https://learn.microsoft.com/en-us/azure/api-management/how-to-server-sent-events#guidelines-for-sse) to guarantee that your APIM configuration is compatible with streaming.
- This tool reuses the [Cookbook - How to stream completions](https://cookbook.openai.com/examples/how_to_stream_completions) published by OpenAI.

## TOC
- [Initialize notebook variables](#0)
- [Get the deployment outputs](#1)
- [Get the APIM authorization debug token](#2)
- [🧪 Test the API using a direct HTTP call](#requests)
- [🔍 Analyze the API trace from direct HTTP call](#trace1)
- [🧪 Test with streaming using the Azure OpenAI Python SDK](#sdk)
- [🔍 Analyze the API trace from the SDK call](#trace2)


<a id='0'></a>
### Initialize notebook variables

In [None]:
deployment_name = "" # name of the label that you want to use with this tool (e.g. semantic-caching)
resource_group_name = f"lab-{deployment_name}"
openai_deployment_name = "gpt-35-turbo"
openai_api_version = "2024-02-01"

<a id='1'></a>
### Get the deployment outputs

In [None]:
deployment_stdout = ! az deployment group show --name {deployment_name} -g {resource_group_name} --query properties.outputs.apimServiceId.value -o tsv
apim_service_id = deployment_stdout.n
print("👉🏻 APIM Service Id: ", apim_service_id)

deployment_stdout = ! az deployment group show --name {deployment_name} -g {resource_group_name} --query properties.outputs.apimSubscriptionKey.value -o tsv
apim_subscription_key = deployment_stdout.n
deployment_stdout = ! az deployment group show --name {deployment_name} -g {resource_group_name} --query properties.outputs.apimResourceGatewayURL.value -o tsv
apim_resource_gateway_url = deployment_stdout.n
print("👉🏻 API Gateway URL: ", apim_resource_gateway_url)
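The three `az` calls above each read one output value. As a minimal sketch, assuming you would rather fetch everything once with `--query properties.outputs -o json`, the helper below flattens the ARM outputs object (each entry has a `type` and a `value`) into a plain dictionary. The function name `parse_deployment_outputs` and the sample values are hypothetical and only illustrate the shape of the data.

```python
import json

def parse_deployment_outputs(outputs_json: str) -> dict:
    """Flatten {name: {"type": ..., "value": ...}} into {name: value}."""
    outputs = json.loads(outputs_json)
    return {name: entry.get("value") for name, entry in outputs.items()}

# Illustrative sample shaped like the output of:
#   az deployment group show --name <name> -g <rg> --query properties.outputs -o json
# The values are placeholders, not real resources.
sample_outputs = """
{
  "apimServiceId": {"type": "String", "value": "/subscriptions/0000/resourceGroups/lab-demo/providers/Microsoft.ApiManagement/service/apim-demo"},
  "apimSubscriptionKey": {"type": "String", "value": "example-key"},
  "apimResourceGatewayURL": {"type": "String", "value": "https://apim-demo.azure-api.net"}
}
"""

outputs = parse_deployment_outputs(sample_outputs)
print("APIM Service Id:", outputs["apimServiceId"])
print("API Gateway URL:", outputs["apimResourceGatewayURL"])
```

In the notebook, the JSON text could come from `outputs_json = ! az deployment group show ... --query properties.outputs -o json` followed by `parse_deployment_outputs(outputs_json.n)`.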


<a id='2'></a>
### Get the APIM authorization debug token

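This token is sent in the `Apim-Debug-Authorization` header to enable tracing for an API request. The credentials requested below expire after one hour (`PT1H`).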

In [None]:
import requests
import json
token = ! az account get-access-token --query accessToken --output tsv

request = {
    "credentialsExpireAfter": "PT1H",
    "apiId": apim_service_id + "/apis/openai",
    "purposes": ["tracing"]
}
url = "https://management.azure.com" + apim_service_id + "/gateways/managed/listDebugCredentials?api-version=2023-05-01-preview"

response = requests.post(url, headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token.n}, json = request)

if (response.status_code == 200):
    data = json.loads(response.text)
    apim_debug_authorization = data.get("token")
else:
    print(response.text)
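The management calls in this notebook (`listDebugCredentials` here, `listTrace` later) share the same URL pattern and headers. The sketch below factors that pattern into one function so the pieces are easier to see. The name `management_request` is hypothetical, and the resource ID and token in the usage lines are placeholders.

```python
MANAGEMENT_ENDPOINT = "https://management.azure.com"
MANAGEMENT_API_VERSION = "2023-05-01-preview"

def management_request(apim_service_id: str, action: str, bearer_token: str):
    """Build the URL and headers for a managed-gateway action such as
    listDebugCredentials or listTrace."""
    url = (
        f"{MANAGEMENT_ENDPOINT}{apim_service_id}"
        f"/gateways/managed/{action}?api-version={MANAGEMENT_API_VERSION}"
    )
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {bearer_token}",
    }
    return url, headers

# Placeholder resource ID and token, for illustration only
example_id = "/subscriptions/0000/resourceGroups/lab-demo/providers/Microsoft.ApiManagement/service/apim-demo"
url, headers = management_request(example_id, "listDebugCredentials", "example-token")
print(url)
```

With the notebook variables, the call would look like `requests.post(url, headers=headers, json=request)` after building `url` and `headers` from `apim_service_id` and `token.n`.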


<a id='requests'></a>
### 🧪 Test the API using a direct HTTP call
The Python requests library supports [streaming](https://docs.python-requests.org/en/latest/user/advanced/#streaming-requests). A direct HTTP call lets us inspect the response headers; the policy injects a header that indicates whether the request is streamed.


In [None]:
url = apim_resource_gateway_url + "/openai/deployments/" + openai_deployment_name + "/chat/completions?api-version=" + openai_api_version
payload={"messages":[
    {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
    {"role": "user", "content": "Can you tell me the time, please?"}
],
"stream": True}
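# stream = True makes requests yield the body as it arrives instead of buffering the whole response first
response = requests.post(url, headers = {'api-key':apim_subscription_key, 'Apim-Debug-Authorization': apim_debug_authorization}, json = payload, stream = True)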
print("status code: ", response.status_code)
trace_id = response.headers.get("Apim-Trace-Id")
print("Apim-Trace-Id: ", trace_id) # this header will be used to get API trace details
print("headers ", response.headers)
print("x-ms-region: ", response.headers.get("x-ms-region")) # this header is useful to determine the region of the backend that served the request
print("x-ms-stream: ", response.headers.get("x-ms-stream")) # this header is useful to determine if the response is streamed
if (response.status_code == 200):
    for chunk in response.iter_lines():
        if chunk:  # skip the blank lines that separate SSE events
            print('chunk:', chunk.decode('utf-8'))
else:
    print(response.text)
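Each non-empty line printed above is a server-sent event of the form `data: {json}`, and the stream ends with `data: [DONE]`. As a rough sketch of how those lines could be turned back into text, the helpers below parse one line at a time and join the `delta.content` fragments. The function names and the sample lines are hypothetical; the sample is trimmed to the fields the code reads.

```python
import json

def parse_sse_line(raw_line: bytes):
    """Return the JSON payload of one `data:` line, or None for blank
    lines, non-data lines, and the final [DONE] marker."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    return json.loads(data)

def collect_content(raw_lines) -> str:
    """Join the content fragments carried by a sequence of SSE lines."""
    parts = []
    for raw in raw_lines:
        event = parse_sse_line(raw)
        # Some events carry an empty choices list (e.g. content filter results)
        if not event or not event.get("choices"):
            continue
        content = event["choices"][0].get("delta", {}).get("content")
        if content:
            parts.append(content)
    return "".join(parts)

# Illustrative lines, shaped like what response.iter_lines() yields
sample_lines = [
    b'data: {"choices": [], "prompt_filter_results": []}',
    b'',
    b'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    b'',
    b'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    b'',
    b'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    b'',
    b'data: [DONE]',
]
print(collect_content(sample_lines))
```

In the notebook, `collect_content(response.iter_lines())` would consume the live response in the same way.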

<a id='trace1'></a>
### 🔍 Analyze the API trace from direct HTTP call

The following request retrieves the complete trace information as JSON.

In [None]:
request = {
    "traceId": trace_id
}
url = "https://management.azure.com" + apim_service_id + "/gateways/managed/listTrace?api-version=2023-05-01-preview"
response = requests.post(url, headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token.n}, json = request)

if (response.status_code == 200):
    data = json.loads(response.text)
    print(json.dumps(data, indent=4))
else:
    print(response.text)
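The full trace can be long. The sketch below condenses it to a count of entries and the distinct sources per pipeline section. It assumes the payload groups entries under a `traceEntries` object keyed by section (such as `inbound`, `backend`, `outbound`) and that each entry has a `source` field; verify this against the JSON printed above before relying on it. The function name and the sample trace are hypothetical.

```python
def summarize_trace(trace: dict) -> dict:
    """Count entries and list distinct sources per pipeline section.

    Assumes a layout of {"traceEntries": {section: [entry, ...]}};
    returns an empty dict if that key is missing.
    """
    summary = {}
    for section, entries in trace.get("traceEntries", {}).items():
        sources = sorted({entry.get("source", "unknown") for entry in entries})
        summary[section] = {"entries": len(entries), "sources": sources}
    return summary

# Hypothetical, heavily trimmed trace for illustration only
sample_trace = {
    "traceEntries": {
        "inbound": [{"source": "api-inspector"}, {"source": "set-header"}],
        "backend": [{"source": "forward-request"}],
        "outbound": [{"source": "transfer-response"}],
    }
}

for section, info in summarize_trace(sample_trace).items():
    print(section, info["entries"], info["sources"])
```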


<a id='sdk'></a>
### 🧪 Test with streaming using the Azure OpenAI Python SDK
With a streaming API call, the response is sent back incrementally in chunks via an [event stream](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format). In Python, you can iterate over these events with a for loop.

In [None]:
import time
from openai import AzureOpenAI
messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
]
start_time = time.time()
client = AzureOpenAI(
    azure_endpoint=apim_resource_gateway_url,
    api_key=apim_subscription_key,
    api_version=openai_api_version
)
response = client.chat.completions.with_raw_response.create(model=openai_deployment_name, messages=messages, extra_headers={'Apim-Debug-Authorization': apim_debug_authorization}, stream=True)
trace_id = response.headers.get("Apim-Trace-Id")
print("Apim-Trace-Id: ", trace_id) # this header will be used to get API trace details
print("headers ", response.headers)
print("x-ms-region: ", response.headers.get("x-ms-region")) # this header is useful to determine the region of the backend that served the request
print("x-ms-stream: ", response.headers.get("x-ms-stream")) # this header is useful to determine if the response is streamed

completion = response.parse() 

# create variables to collect the stream of chunks
collected_chunks = []
collected_messages = []
# iterate through the stream of events
for chunk in completion:
    chunk_time = time.time() - start_time  # calculate the time delay of the chunk
    collected_chunks.append(chunk)  # save the event response
    if chunk.choices:
        chunk_message = chunk.choices[0].delta.content  # extract the message
        collected_messages.append(chunk_message)  # save the message
        print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")  # print the delay and text
# print the time delay and text received
print(f"Full response received {chunk_time:.2f} seconds after request")
# clean None in collected_messages
collected_messages = [m for m in collected_messages if m is not None]
full_reply_content = ''.join(collected_messages)
print(f"Full conversation received: {full_reply_content}")
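The per-chunk delays printed above are a quick way to check whether the response was really streamed: if something in the path buffers the response, the chunks tend to arrive in one burst, so the time to the first chunk is close to the total time. A minimal sketch of that check follows, assuming you collect each `chunk_time` into a list; the function name `summarize_chunk_times` and the sample timings are hypothetical.

```python
from statistics import mean

def summarize_chunk_times(arrival_times):
    """Summarize chunk arrival times (seconds since the request started,
    in arrival order). Returns None for an empty list."""
    if not arrival_times:
        return None
    # Gap between each chunk and the one before it
    gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]
    return {
        "chunks": len(arrival_times),
        "time_to_first_chunk": arrival_times[0],
        "total_time": arrival_times[-1],
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0.0,
    }

# Made-up timings for illustration
stats = summarize_chunk_times([0.5, 0.75, 1.0, 2.0])
print(stats)
```

A streamed response should show a `time_to_first_chunk` well below `total_time`; what counts as "well below" depends on the prompt and the model, so treat it as a rough signal only.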



<a id='trace2'></a>
### 🔍 Analyze the API trace from the SDK call

In [None]:
request = {
    "traceId": trace_id
}
url = "https://management.azure.com" + apim_service_id + "/gateways/managed/listTrace?api-version=2023-05-01-preview"
response = requests.post(url, headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token.n}, json = request)

if (response.status_code == 200):
    data = json.loads(response.text)
    print(json.dumps(data, indent=4))
else:
    print(response.text)
