# APIM ❤️ OpenAI

## Demo Overview

This Jupyter Notebook demonstrates the integration of Azure API Management (APIM) with Azure OpenAI services. The demo covers various aspects including initial setup, JWT authentication, semantic caching, load balancing, and content safety testing. Each section is designed to showcase specific functionalities and performance metrics of the integrated services.

## Table of Contents
1. [Initial Setup](#3)
2. [JWT Authentication Testing](#jwt-authentication-testing)
3. [Analyzing Semantic Caching](#sdk)
    - [Set 1 - Exact Matches (Control Group)](#🧪-semantic-caching-set-1-exact-matches-control-group)
    - [Set 2 - Paraphrased/Slightly Related Matches](#🧪-semantic-caching-set-2-paraphrasedslightly-related-matches)
4. [Load Balancing Test](#load-balancing-test)
    - [Weights and Priorities](#🧪-load-balancing-test-weights-and-priorities)
5. [Summary Analysis](#summary-analysis)

### Prerequisites
- [Python 3.8 or later version](https://www.python.org/) installed
- [Pandas](https://pandas.pydata.org/), matplotlib, NumPy, and the `openai`, `msal`, `requests`, and `python-dotenv` packages installed
- [VS Code](https://code.visualstudio.com/) installed with the [Jupyter notebook extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) enabled
- [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) installed
- [An Azure Subscription](https://azure.microsoft.com/en-us/free/) with Contributor permissions
- [Access granted to Azure OpenAI](https://aka.ms/oai/access) or just enable the mock service
- [Sign in to Azure with Azure CLI](https://learn.microsoft.com/en-us/cli/azure/authenticate-azure-cli-interactively)
- Azure API Management Resource you have access to


<a id='3'></a>
### 1️⃣ Initial Setup

In this section, we set up the environment variables and test the basic network connections to ensure everything is configured correctly. This includes loading environment variables, setting up API endpoints, and verifying the connection to the Azure OpenAI service.

In [None]:
import os
from dotenv import load_dotenv
import logging

logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')

# Load environment variables from .env file
load_dotenv(dotenv_path='./.env')

# Set the local env vars
apim_base_url = os.getenv('APIM_BASE_URL')
apim_api = os.getenv('APIM_API')
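# Fall back to an empty string so a missing key is reported by the check below instead of raising AttributeError here
apim_subscription_key = (os.getenv('APIM_SUBSCRIPTION_KEY') or '').strip()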
openai_api_version = os.getenv('OPENAI_API_VERSION')
openai_model_name = os.getenv('OPENAI_MODEL_NAME')
openai_deployment_name = os.getenv('OPENAI_DEPLOYMENT_NAME')
frontend_client_id = os.getenv('CLIENT_ID')
tenant_id = os.getenv('TENANT_ID')

# Set the OpenAI API URL (https://<apim_base_url>/<api>)
apim_resource_gateway_url = f"{apim_base_url}/{apim_api}"


In [None]:
from openai import AzureOpenAI

# Test the connection to ensure the environment variables are set correctly
try:
    print("Azure APIM Base URL: ", apim_base_url)
    print("Azure OpenAI Gateway URL: ", apim_resource_gateway_url)
    print("Azure OpenAI Gateway Path: ", apim_api)
    print("Azure OpenAI Deployment Name: ", openai_deployment_name)
    print("Azure OpenAI API Version: ", openai_api_version)
    
    # Check if all required environment variables are set
    if not all([apim_base_url, apim_api, apim_subscription_key, openai_api_version, openai_model_name, openai_deployment_name]):
        raise ValueError("One or more environment variables are missing. Please check your .env file.")
    
    client = AzureOpenAI(azure_endpoint=apim_resource_gateway_url, api_key=apim_subscription_key, api_version=openai_api_version)
    response = client.chat.completions.create(model=openai_model_name, messages=[{"role": "system", "content": "Test connection"}])
    print("Connection successful. Response received.")

except Exception as e:
    print(f"Connection failed: {e}")
    if "401" in str(e):
        print("❌❌❌ APIM requires a valid JWT access token. Please ensure your token is correct and try again. ❌❌❌")
    elif "404" in str(e):
        print("❌❌❌ The requested resource was not found. Please check the URL and endpoint configuration. ❌❌❌")
    

### 🔐 JWT Authentication Testing

In this section, we will test the JWT (JSON Web Token) authentication mechanism integrated with Azure API Management (APIM). JWT is a compact, URL-safe means of representing claims to be transferred between two parties. It is commonly used for authentication and authorization purposes.

#### Overview

The JWT authentication testing will cover the following aspects:

1. __Token Generation__: How to generate a JWT token with the necessary claims.
1. __Token Validation__: How APIM validates the JWT token to ensure it is authentic and has not been tampered with.
1. __Access Control__: How to use JWT tokens to control access to different API endpoints based on the claims within the token.
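
To make the claims mentioned above concrete, the following is a minimal sketch that decodes the payload segment of a JWT for inspection. It does not verify the signature, so it is only a debugging aid; the actual validation is done by APIM. The function name `decode_jwt_claims` is a hypothetical helper, not part of this notebook or of any library.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the claims (payload) of a JWT WITHOUT verifying its signature.

    Hypothetical helper for inspection only; never use it to make access decisions.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("Expected a JWT with three dot-separated segments")
    payload = parts[1]
    # base64url segments omit padding; restore it before decoding
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

Once an access token has been acquired in the next cell, calling `decode_jwt_claims(access_token)` should show claims such as the audience and scopes, assuming the token is a standard three-segment JWT.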

In [None]:
# Retrieve the access token using the device flow
# Requires setting the CLIENT_ID and TENANT_ID environment variables in your .env file

import msal
import requests
import json


url = apim_resource_gateway_url + "/openai/deployments/" + openai_deployment_name + "/chat/completions?api-version=" + openai_api_version


# Create a public client application
app = msal.PublicClientApplication(frontend_client_id, authority=f"https://login.microsoftonline.com/{tenant_id}")

# Initiate the device flow
flow = app.initiate_device_flow(scopes=["User.Read"])
if "user_code" not in flow:
    raise ValueError("Failed to create device flow. Error: %s" % json.dumps(flow, indent=2))

print(flow["message"])

# Acquire token by device flow
result = app.acquire_token_by_device_flow(flow)
access_token = None
if "access_token" in result:
    access_token = result['access_token']
    # Calling graph using the access token
    graph_data = requests.get(
        "https://graph.microsoft.com/v1.0/me",
        headers={'Authorization': 'Bearer ' + access_token},
    ).json()
    print("Graph API call result: %s" % json.dumps(graph_data, indent=2))
else:
    print(result.get("error"))
    print(result.get("error_description"))
    print(result.get("correlation_id"))


In [None]:
# Send a chat completion through APIM with both the subscription key and the JWT
if not access_token:
    raise RuntimeError("No access token available. Re-run the device flow cell above.")

messages = {"messages": [
    {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
    {"role": "user", "content": "Can you tell me the time, please?"}
]}

response = requests.post(
    url,
    headers={'api-key': apim_subscription_key, 'Authorization': 'Bearer ' + access_token},
    json=messages,
)
print("status code: ", response.status_code)
if response.status_code == 200:
    data = response.json()
    print("response: ", data.get("choices")[0].get("message").get("content"))
    print("🎉🎉🎉 Successfully authenticated with JWT using Entra ID! 🎉🎉🎉")
else:
    print(response.text)
    print("❌❌❌ Failed to authenticate with JWT using Entra ID! ❌❌❌")

<a id='sdk'></a>
### 🧪 Analyzing Semantic Caching

This section uses two sets of questions to demonstrate how semantic caching works within APIM.

It shows how to integrate Azure Redis Cache with Azure API Management to cache responses, using an embeddings model deployment to match semantically similar prompts. The steps are based on the official Microsoft documentation for caching external responses in Azure API Management.

#### Instructions
1. Connect an Azure Redis Cache instance to your API Management instance:
    - Navigate to the Azure Portal.
    - Go to Deployment & Infrastructure > External Cache.
    - Follow the prompts to connect your Redis Cache instance.

2. Create a backend for your embeddings model deployment:
    - This involves setting up the backend service that will handle requests for your embeddings model.

3. Add the following policy snippet to your inbound policy in Azure API Management:
    ```xml
    <azure-openai-semantic-cache-lookup score-threshold="0.8" embeddings-backend-id="<embeddings-backend-name>" embeddings-backend-auth="system-assigned" />
    ```
    - `score-threshold`: A value between 0.0 and 1.0 that determines the similarity tolerance for cache matches. A higher score requires higher similarity.
    - `embeddings-backend-id`: The identifier for your embeddings backend service.
    - `embeddings-backend-auth`: Authentication method for the backend service, typically "system-assigned".

4. Run through each test provided in the notebook and analyze the results to ensure the caching mechanism works as expected.
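
As a rough illustration of how a similarity threshold gates cache hits, here is a toy sketch using cosine similarity on small hand-written vectors. It follows the description of `score-threshold` given above (a higher value requires closer matches); the real policy computes embeddings through the configured backend and its exact scoring may be defined differently, so treat `cosine_similarity`, `is_cache_hit`, and the vectors as assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(prompt_vec, cached_vec, score_threshold=0.8):
    """Toy rule (assumption): reuse the cached response when similarity meets the threshold."""
    return cosine_similarity(prompt_vec, cached_vec) >= score_threshold


# Hypothetical embeddings: a near-paraphrase and an unrelated prompt
cached = [1.0, 0.0]
print(is_cache_hit([1.0, 0.1], cached))  # near-identical direction
print(is_cache_hit([0.0, 1.0], cached))  # orthogonal direction
```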

For more detailed information, refer to the official documentation: [Azure API Management - How to Cache External Responses](https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-cache-external)


#### Initialize Helper Methods

In [None]:
from openai import AzureOpenAI

import random
import time
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl


def run_queries(questions, runs=5, randomize=True):
    """
    Executes a series of queries to an Azure OpenAI service and measures the response times.

    Parameters:
    questions (list): A list of questions to select from for each query.
    runs (int): The number of times to run the queries. Default is 5.
    randomize (bool): If True, pick a random question for each run; otherwise cycle through the list in order.

    Returns:
    list: A list of (response_time, region) tuples, one per successful run.

    The function performs the following steps:
    1. Initializes an AzureOpenAI client with the provided endpoint, API key, and API version.
    2. For each run, selects a question from the provided list (randomly or in order).
    3. Constructs a message with a predefined system role and the selected question as the user content.
    4. Measures the time taken to get a response from the Azure OpenAI service.
    5. Handles exceptions, specifically checking for content filtering errors.
    6. Prints the response time, the question, and the response content.
    7. Appends the response time and backend region to a list and returns this list after all runs are completed.
    """

    api_runs = []
    client = AzureOpenAI(
        azure_endpoint=apim_resource_gateway_url, 
        api_key=apim_subscription_key, 
        api_version=openai_api_version
    )

    for i in range(runs):
        if randomize:
            random_question = random.choice(questions)
        else:
            random_question = questions[i % len(questions)]
        messages = [
            {"role": "system", "content": "You are a sarcastic unhelpful assistant."},
            {"role": "user", "content": random_question}
        ]
        region = ""
        start_time = time.time()
        try:
            raw_response = client.chat.completions.with_raw_response.create(
                model=openai_model_name, 
                messages=messages, 
                extra_headers={"Authorization": "Bearer " + access_token}
            )
            response = raw_response.parse()
            headers = raw_response.headers
            response_time = time.time() - start_time
            
            remaining_tokens = headers.get('x-ratelimit-remaining-tokens')
            remaining_requests = headers.get('x-ratelimit-remaining-requests')
            region = headers.get('x-ms-region')
            
            logging.debug(f"Response: {response}")
            logging.debug(f"\n\tRemaining Tokens: {remaining_tokens} \n\tRemaining Requests: {remaining_requests}\n\tRegion: {region}")
            logging.debug(f"\tResponse Time: {response_time:.2f} seconds")
            

        except Exception as e:
            if 'Azure AI Content Safety service' in str(e):
                print(f"\n❌ Content Filtered: {e}\n\tInitial Question: {random_question}\n")
            else:
                print(f"Error during API call: {e}")
            continue

        print("\n\n▶️ Run: ", i+1, f"duration: {response_time:.2f} seconds")
        if remaining_tokens:
            print("\tx-ratelimit-remaining-tokens:", '\x1b[1;31m'+remaining_tokens+'\x1b[0m')
        if remaining_requests:
            print("\tx-ratelimit-remaining-requests:", '\x1b[1;31m'+remaining_requests+'\x1b[0m')
        if region:
            print("\tx-ms-region:", '\x1b[1;31m'+region+'\x1b[0m') # this header is useful to determine the region of the backend that served the request
        if headers.get('remaining-tokens'):
            print("\tremaining-tokens:", '\x1b[1;31m'+headers.get('remaining-tokens')+'\x1b[0m')
        if headers.get('consumed-tokens'):
            print("\tconsumed-tokens:", '\x1b[1;31m'+headers.get('consumed-tokens')+'\x1b[0m')
        print("💬 ", random_question, response.choices[0].message.content)

        api_runs.append((response_time, region))

    return api_runs


def plot_results(api_runs, title='Semantic Caching Performance'):
    """
    Plots the response times of API runs using a bar chart.

    Parameters:
    api_runs (list): A list of response times (in seconds) for each query run.
    title (str): The title of the plot.

    The function performs the following steps:
    1. Sets the figure size for the plot.
    2. Converts the list of response times into a pandas DataFrame.
    3. Adds a 'Run' column to the DataFrame to represent each run number.
    4. Plots a bar chart with 'Run' on the x-axis and 'Response Time' on the y-axis.
    5. Sets the title, x-axis label, and y-axis label for the plot.
    6. Draws a dashed line at the average response time.
    """
    mpl.rcParams['figure.figsize'] = [15, 5]
    df = pd.DataFrame(api_runs, columns=['Response Time'])
    df['Run'] = range(1, len(df) + 1)
    df.plot(kind='bar', x='Run', y='Response Time', legend=False)
    plt.title(title)
    plt.xlabel('Runs')
    plt.ylabel('Response Time (s)')
    plt.xticks(rotation=0)  # Bars sit at positions 0..n-1 and are already labeled with the run numbers

    average = df['Response Time'].mean()
    plt.axhline(y=average, color='r', linestyle='--', label=f'Average: {average:.2f}')
    plt.legend()

    plt.show()


#### 🧪 Semantic Caching: Set 1 - Exact Matches (Control Group)

This subsection of semantic caching is responsible for testing the first set of questions (Set 1: Exact Matches).

The purpose of this test is to evaluate the performance and accuracy of the caching mechanism when dealing with exact match queries. This set serves as the control group, providing a baseline for comparison with other sets of questions that may involve more complex query patterns.

__Expected Outcome__:
- Latency should be extremely low for repeated queries.

The results from this test will help in understanding the efficiency of the semantic caching system in handling straightforward, exact match queries and will serve as a reference point for further optimizations.
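
One simple way to quantify "extremely low" latency is to compare the first call with the repeats. The sketch below assumes the first call is a cache miss and the remaining calls are served from the cache, which will not hold if the cache is already warm from an earlier run. `cache_speedup` is a hypothetical helper, not part of this notebook.

```python
from statistics import mean


def cache_speedup(response_times):
    """Ratio of the first call's latency to the mean latency of the repeats.

    Assumption: the first call populates the cache and the repeats hit it.
    """
    if len(response_times) < 2:
        raise ValueError("Need at least two response times")
    first, repeats = response_times[0], response_times[1:]
    avg_repeat = mean(repeats)
    return first / avg_repeat if avg_repeat > 0 else float("inf")


# Illustrative numbers only, not measured results
print(cache_speedup([2.0, 0.5, 0.5]))
```

After running the cell below, `cache_speedup(response_times)` gives a single number to compare across sets.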


In [None]:
runs = 5
api_runs = []  # (response_time, region) tuples for each run

# Set 1: Exact Matches (Control Group)
# Expected Outcome:
#   - Latency should be extremely low for repeated queries.
questions = [
    "What is climate change?",
    "What is climate change?",
    "What is climate change?",
    "What is climate change?",
    "What is climate change?"
]

api_runs = run_queries(questions, runs)
response_times = [run[0] for run in api_runs]
plot_results(response_times, 'Semantic Caching Performance (Set 1: Exact Matches)')


#### 🧪 Semantic Caching: Set 2: Paraphrased/Slightly Related Matches

This subsection tests the second set of questions (Set 2: Paraphrased Matches).

In this test, we evaluate the performance and accuracy of the caching mechanism when dealing with paraphrased queries. This set helps in understanding how well the caching system handles variations in query phrasing while still retrieving relevant cached responses.

__Expected Outcome__:
- Similar questions (1, 2, 3) should hit the cache.
- Slightly reworded but related questions (4) may still hit the cache if the similarity threshold allows.
- Questions with overlapping meanings (5 and 6) might hit the cache if configured for broad similarity.
- Question 7 might generate a new response depending on threshold settings.

The results from this test will provide insights into the robustness of the semantic caching system in handling paraphrased queries and its ability to maintain low latency and high accuracy.
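
To relate the expected outcomes above to measured latencies, a rough heuristic is to label each run by comparing it with the first (baseline) call. The sketch below is an assumption-laden approximation: the helper `label_cache_hits` and its `hit_ratio` cutoff are hypothetical, latency alone cannot prove a cache hit, and the pairing of questions with times only holds when the questions are run in order and no run failed.

```python
def label_cache_hits(questions, response_times, hit_ratio=0.5):
    """Label each run relative to the first call's latency.

    A run counts as a 'likely hit' when it took less than hit_ratio * baseline.
    """
    if not response_times:
        return []
    baseline = response_times[0]
    labels = []
    for idx, (question, rt) in enumerate(zip(questions, response_times)):
        if idx == 0:
            label = "baseline"
        elif rt < hit_ratio * baseline:
            label = "likely hit"
        else:
            label = "likely miss"
        labels.append((question, rt, label))
    return labels


# Illustrative numbers only, not measured results
sample = label_cache_hits(
    ["What is climate change?", "Explain climate change.", "What causes climate change?"],
    [2.0, 0.3, 1.8],
)
for question, rt, label in sample:
    print(f"{rt:.2f}s  {label:12s} {question}")
```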

In [None]:
api_runs = []  # (response_time, region) tuples for each run
runs = 7  # One run per question, so question 7 is exercised as well
questions = [
    "What is climate change?",
    "Explain climate change.",
    "Describe the concept of climate change.",
    "Tell me about global warming and its impact on the environment.",
    "What causes climate change?",
    "How does carbon dioxide contribute to global warming?",
    "What are the effects of climate change on sea levels?"
]

api_runs = run_queries(questions, runs, randomize=False) 
response_times = [run[0] for run in api_runs]
plot_results(response_times, 'Semantic Caching Performance (Set 2: Paraphrased Matches)')

### 🧪 Load Balancing Test

In this section, we test the load balancing that APIM performs across Azure OpenAI backends. The test sends multiple requests through the gateway and monitors how they are distributed across the backend regions.



In [None]:
def plot_load_balancing_results(api_runs, color_map=None, title='Load Balancing results'):
    if color_map is None:
        color_map = {
            'East US': 'blue',
            'North Central US': 'green',
            'West US': 'red',
            # Add more regions and their corresponding colors as needed
        }
    mpl.rcParams['figure.figsize'] = [15, 7]
    df = pd.DataFrame(api_runs, columns=['Response Time', 'Region'])
    df['Run'] = range(1, len(df) + 1)

    # Plot the dataframe with colored bars
    ax = df.plot(kind='bar', x='Run', y='Response Time', color=[color_map.get(region, 'gray') for region in df['Region']], legend=False)

    plt.title(title)
    plt.xlabel('Runs')
    plt.ylabel('Response Time (s)')
    plt.xticks(rotation=0)  # Bars sit at positions 0..n-1 and are already labeled with the run numbers

    average = df['Response Time'].mean()
    avg_line = plt.axhline(y=average, color='r', linestyle='--')

    # Add legend: one colored patch per region, plus the average line
    regions = list(df['Region'].unique())
    handles = [plt.Rectangle((0, 0), 1, 1, color=color_map.get(region, 'gray')) for region in regions]
    ax.legend(handles + [avg_line], regions + [f'Average: {average:.2f}'])

    plt.show()



### 🧪 Load Balancing Test: Weights and Priorities

In this section, we test the load balancing capabilities of the Azure OpenAI service with a focus on weights and priorities. The test involves sending multiple requests to the service and monitoring the distribution of these requests across different backend servers based on their assigned weights and priorities.

#### Expected Outcome:
- Requests should be distributed according to the weights and priorities assigned to each backend server.
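- Higher-priority servers should receive traffic first; lower-priority servers should mainly be used when higher-priority ones are unavailable or throttled.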
- Response times should be consistent, indicating effective load balancing.

#### Steps:
1. **Send Requests**: Send multiple requests to the Azure OpenAI API and record the backend server handling each request.
2. **Monitor Distribution**: Analyze the distribution of requests across the backend servers to ensure they align with the assigned weights and priorities.
3. **Plot Results**: Visualize the response times and distribution of requests across different backend servers.

#### Monitoring Load Balancing:
- **Server Distribution**: The `plot_load_balancing_results` function colors each run's bar by the backend region that served it, so the share of requests per region can be read from the chart.
- **Response Times**: Consistent response times across requests indicate effective load balancing.
- **Logs and Metrics**: Use Azure Monitor and Application Insights to track logs and metrics related to request distribution and performance.

By analyzing the server distribution and response times, you can ensure that the load balancing mechanism is working effectively and that the requests are distributed according to the assigned weights and priorities.
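
To check the observed distribution against the configured weights, the sketch below compares the percentage of requests per region with the percentage implied by a weight table. The helper `compare_distribution` and the example weights are hypothetical; substitute the weights from your own backend pool, and keep in mind that with only 20 requests the observed shares can deviate noticeably from the configured ones.

```python
from collections import Counter


def compare_distribution(regions, expected_weights):
    """Return {region: (observed_pct, expected_pct, difference)}.

    regions: list of region names, one per served request.
    expected_weights: mapping of region name to its relative weight (assumed values).
    """
    counts = Counter(regions)
    total = sum(counts.values())
    weight_total = sum(expected_weights.values())
    report = {}
    for region in sorted(set(counts) | set(expected_weights)):
        observed = 100 * counts.get(region, 0) / total if total else 0.0
        expected = 100 * expected_weights.get(region, 0) / weight_total if weight_total else 0.0
        report[region] = (observed, expected, observed - expected)
    return report


# Illustrative data only: three requests served by East US, one by West US
example = compare_distribution(
    ["East US", "East US", "East US", "West US"],
    {"East US": 3, "West US": 1},
)
for region, (observed, expected, diff) in example.items():
    print(f"{region}: observed {observed:.1f}%, expected {expected:.1f}%, diff {diff:+.1f}")
```

With the results of the next cell, the region list can be built as `[run[1] for run in api_runs]`.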

In [None]:
api_runs = []  
runs = 20
questions = [
    "What is climate change?",
    "Explain climate change.",
    "Describe the concept of climate change.",
    "Tell me about global warming and its impact on the environment.",
    "What causes climate change?",
    "How does carbon dioxide contribute to global warming?",
    "What are the effects of climate change on sea levels?"
]

api_runs = run_queries(questions, runs, randomize=True) 
# Define the color map
color_map = {
    'East US': 'blue',
    'North Central US': 'green',
    'West US': 'red'
    # Add more regions and their corresponding colors as needed
}

plot_load_balancing_results(api_runs, color_map, 'Load Balancing results')

#### Summary Analysis:

In [None]:
import numpy as np



# Extract response times and regions from api_runs
response_times = [run[0] for run in api_runs]
regions = [run[1] for run in api_runs]

plot_results(response_times, 'Load Balancing Response Times')
# Calculate % distribution of load to the backend regions
region_counts = pd.Series(regions).value_counts(normalize=True) * 100
print("\n📊 Percentage distribution of load to backend regions:")
print(region_counts)

# Calculate average response times per region
region_avg_response_times = pd.DataFrame(api_runs, columns=['Response Time', 'Region']).groupby('Region').mean()
print("\n⏱️ Average response times per region:")
print(region_avg_response_times)

# Identify potential causes for long response times
threshold = np.percentile(response_times, 90)  # Runs above the 90th percentile count as "long"
long_response_times = [(rt, region) for rt, region in api_runs if rt > threshold]

print("\n🚨 Potential causes for long response times:")
# Use `rt` rather than `time` so the loop does not shadow the `time` module used by run_queries
for rt, region in long_response_times:
    print(f"⏰ Response Time: {rt:.2f} seconds, 🌍 Region: {region}")
    if rt > 10:  # Arbitrary threshold for rate limiting/throttling
        print("  - ⚠️ Potential cause: Rate limiting/throttling or policy misconfiguration")

# Compare consecutive runs: a jump of more than 5 seconds in response time may point to a circuit breaker event
time_gaps = [j - i for i, j in zip(response_times[:-1], response_times[1:])]
circuit_breaker_events = [(gap, regions[idx + 1]) for idx, gap in enumerate(time_gaps) if gap > 5]
if circuit_breaker_events:
    print("\n🔍 Circuit Breaker Analysis based on time gaps:")
    for gap, region in circuit_breaker_events:
        print(f"⏰ Time Gap: {gap:.2f} seconds, 🌍 Region: {region}")
        print("  - ⚠️ Potential cause: Circuit breaker triggered due to high response time gap")
