# **Apigee GenAI Workshop**

<!-- <table align="left">
    <td style="text-align: center">
      <a href="https://colab.research.google.com/github/GoogleCloudPlatform/apigee-samples/blob/genai-workshop/genai-workshop.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
      </a>
    </td>
    <td style="text-align: center">
      <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapigee-samples%2Fgenai-workshop%2Fgenai-workshop.ipynb">
        <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
      </a>
    </td>    
    <td style="text-align: center">
      <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/apigee-samples/genai-workshop/genai-workshop.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
      </a>
    </td>
    <td style="text-align: center">
      <a href="https://github.com/GoogleCloudPlatform/apigee-samples/blob/genai-workshop/genai-workshop.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
      </a>
    </td>
</table> -->

## Introduction

Welcome to Google's Apigee GenAI Workshop! 

This hands-on workshop will equip you with the knowledge and skills to leverage the power of Generative AI within your API ecosystem. Through practical exercises and real-world examples, you'll learn how to seamlessly integrate Large Language Models (LLMs) with Apigee, Google's leading API management platform. Get ready to unlock new possibilities and explore the exciting world of GenAI and APIs!

You should already have a lab instance up and running with all the necessary artifacts (Apigee, Vertex AI, Vertex DB, etc.) provisioned for you to use.

First, let's install the dependencies needed to run the labs.

## Install dependencies

This may take a few minutes to complete as it will first initialize the runtime and then install all the dependencies.

In [None]:
!pip install langchain
!pip install langchain-community
!pip install langchain_google_vertexai
!pip install langchain-openai
!pip install google-cloud-aiplatform
!pip install google-cloud-tasks
!pip install openai

## Initialize notebook variables

You can fetch all of the following variables from your lab instance:

* **PROJECT_ID**: The default GCP project provisioned.
* **LOCATION**: The default GCP region where the project is provisioned.
* **APIGEE_HOSTNAME**: The hostname of the provisioned Apigee instance.

In [11]:
# Define project information
PROJECT_ID = ""  # @param {type:"string"}
LOCATION = ""  # @param {type:"string"}
APIGEE_HOSTNAME = "" # @param {type:"string"}
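
Each lab below checks that these variables are set before it calls Apigee. As a minimal sketch, the repeated checks could be folded into one helper. The name `require_settings` is hypothetical and is not part of the workshop code; the example values are placeholders.

```python
def require_settings(**settings: str) -> None:
    """Raise a ValueError that names every setting left empty or unset."""
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError("Please set: " + ", ".join(missing))


# Placeholder values; in the notebook you would pass the real variables,
# e.g. require_settings(PROJECT_ID=PROJECT_ID, LOCATION=LOCATION).
require_settings(PROJECT_ID="my-project", LOCATION="us-central1")
```

Reporting all missing names at once avoids fixing one variable only to hit the next error on the following run.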

---

## Lab: LLM-Logging with Apigee

Logging both prompts and candidate responses from large language models (LLMs) allows for detailed analysis and improvement of the model's performance over time. By examining past interactions, AI practitioners can identify patterns that point to refinements in the training data or model architecture. Furthermore, by examining the prompts, security teams can detect malicious intent, such as attempts to extract sensitive information or generate harmful content.

Additionally, logging the generated candidates provides insights into the LLM's behavior and helps identify any biases or vulnerabilities in the model itself. This information can then be used to improve security measures, fine-tune the model, and mitigate potential risks associated with LLM usage.

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-logging/images/llm-logging.png?raw=1)


### Benefits of Logging with Apigee and Google Cloud Logging

* **Seamless logging**: Effortlessly capture prompts, candidate responses, and metadata without complex coding.
* **Scalable and secure**: Leverage Google Cloud's infrastructure for reliable and secure log management.

### How does it work?

1. A prompt request is received by an Apigee proxy.
2. Apigee extracts the prompt and candidate responses.
3. Apigee logs the prompt and candidate responses to Cloud Logging.

### Test Sample

Apigee allows you to seamlessly send logs to Cloud Logging using native integration with the [Message Logging](https://cloud.google.com/apigee/docs/api-platform/reference/policies/message-logging-policy#cloudloggingelement) policy. This sample also includes a message chunking solution that allows logging very long messages (for example, the 1M tokens supported by Gemini) and connecting the chunks together using a unique message identifier.
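
To illustrate the chunking idea, here is a minimal sketch in plain Python. It is not the sample's actual implementation: the field names (`messageId`, `chunk`, `totalChunks`), the UUID identifier, and the chunk size are all assumptions made for illustration.

```python
import uuid


def chunk_message(text: str, chunk_size: int) -> list[dict]:
    """Split text into log entries that share one message identifier."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    message_id = str(uuid.uuid4())  # hypothetical identifier format
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)] or [""]
    return [
        {"messageId": message_id, "chunk": index, "totalChunks": len(pieces), "text": piece}
        for index, piece in enumerate(pieces, start=1)
    ]


def reassemble(entries: list[dict]) -> str:
    """Rebuild the original text from entries that share a messageId."""
    return "".join(e["text"] for e in sorted(entries, key=lambda e: e["chunk"]))


entries = chunk_message("a very long candidate response", chunk_size=10)
print(len(entries), reassemble(entries))
```

Because every entry carries the same identifier and its position, the log entries can arrive in any order and still be joined back into the full message.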

With the following cell you'll be able to invoke an LLM, and both the prompt and the candidate response will be logged in Cloud Logging.

In [None]:
from langchain_google_vertexai import VertexAI

if not PROJECT_ID or PROJECT_ID == "":
    raise ValueError("Please set your PROJECT_ID")
if not LOCATION or LOCATION == "":
    raise ValueError("Please set your LOCATION")
if not APIGEE_HOSTNAME or APIGEE_HOSTNAME == "":
    raise ValueError("Please set your APIGEE_HOSTNAME")

API_ENDPOINT = "https://"+APIGEE_HOSTNAME+"/v1/samples/llm-logging"

# Initialize Langchain
model = VertexAI(
      project=PROJECT_ID,
      location=LOCATION,
      api_endpoint=API_ENDPOINT,
      api_transport="rest",
      streaming=True,
      model_name="gemini-1.5-pro")

prompts = ["Provide an explanation about the benefits of using sunscreen. Make sure to make it as long as a novel."]

for prompt in prompts:
  print(model.invoke(prompt))

### Explore and Analyze Logs with Cloud Logging

1. Navigate to the Cloud Logging Explorer by [right-clicking here and opening in a new tab](https://console.cloud.google.com/logs/query).

2. Set the query filter. Make sure to replace the `PROJECT_ID` with the Apigee project ID:

  ```
  logName="projects/PROJECT_ID/logs/apigee"
  ```
3. Run the query and explore the logs. See example below:

  ![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-logging/images/logs-explorer.png?raw=1)
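
If you prefer not to edit the filter by hand, the query from step 2 can be built from the project ID. This is a small sketch; `build_log_filter` is a hypothetical helper name and `my-project` is a placeholder.

```python
def build_log_filter(project_id: str) -> str:
    """Return the Logs Explorer filter for the Apigee log of a project."""
    return f'logName="projects/{project_id}/logs/apigee"'


# Placeholder project ID; in the notebook you would pass PROJECT_ID.
print(build_log_filter("my-project"))
```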


Congratulations! You've successfully deployed the llm-logging proxy and tested the ability to log the requests and responses of LLM calls.

---

## Lab: LLM-Routing with Apigee

- This is a sample Apigee proxy that demonstrates Apigee's routing capabilities across different LLM providers. In this sample we use Google Vertex AI, Mistral, and HuggingFace as the LLM providers.
- The framework makes it easy to onboard other providers through configuration.

![architecture](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-routing/images/arch.jpg?raw=1)

### Pre-requisites

This lab requires you to have a HuggingFace Access Token and a Mistral AI API Key. The provisioned lab instance **does not** include either.


To create a HuggingFace Access Token:
- You will need to create an account in [HuggingFace](https://huggingface.co)
- Go to settings, click `Access Tokens` and then click the "Create new token" button
- Choose `Read` for token type, provide a name and then hit "Create token"
- Copy the token and paste it into the cell below

In [None]:
HUGGINGFACE_TOKEN = ""  # @param {type:"string"}

To create a Mistral AI API Key:
- You will need to create an account in [Mistral AI](https://mistral.ai)
- Click `API Keys` and then click the "Create new key" button
- Provide a name and then hit "Create key"
- Copy the key and paste it into the cell below

In [None]:
MISTRALAI_KEY="" # @param {type:"string"}

### Benefits of Routing with Apigee:

* **Configuration-Driven Routing**: All the routing logic is driven through configuration, which makes onboarding new providers easy.
* **Security**: Irrespective of the model and provider, Apigee secures the endpoints.
* **Consistency**: Apigee offers a layer of consistency, so it works with whichever LLM SDKs are being used.
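
As a rough illustration of configuration-driven routing, the sketch below resolves an upstream target from the `x-llm-provider` header used later in this lab. The `ProviderConfig` type, the `PROVIDERS` table, and the `example` URLs are hypothetical; the actual sample keeps its provider settings inside Apigee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str     # hypothetical upstream URL
    auth_header: str  # header in which the upstream expects its credential


# Hypothetical configuration table; onboarding a provider means adding a row.
PROVIDERS = {
    "google": ProviderConfig("https://google.example/v1", "Authorization"),
    "mistral": ProviderConfig("https://mistral.example/v1", "Authorization"),
    "huggingface": ProviderConfig("https://huggingface.example/v1", "Authorization"),
}


def resolve_target(headers: dict) -> ProviderConfig:
    """Pick the upstream configuration named by the x-llm-provider header."""
    provider = headers.get("x-llm-provider", "").lower()
    if provider not in PROVIDERS:
        raise ValueError(f"Unsupported LLM provider: {provider!r}")
    return PROVIDERS[provider]


print(resolve_target({"x-llm-provider": "mistral"}).base_url)
```

The routing function never changes when a provider is added, which is the property the first bullet describes.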

### Update HuggingFace and Mistral AI credentials in Apigee KVM Store

The following cell stores the HuggingFace and Mistral AI credentials in the Apigee KVM store.

In [None]:
from google.auth import default
from google.auth.transport.requests import Request
import json
import requests

if not HUGGINGFACE_TOKEN or HUGGINGFACE_TOKEN == "":
    raise ValueError("Please set your HUGGINGFACE_TOKEN")
if not MISTRALAI_KEY or MISTRALAI_KEY == "":
    raise ValueError("Please set your MISTRALAI_KEY")

SCOPES = ['https://www.googleapis.com/auth/cloud-platform']

credentials, project_id = default(scopes=SCOPES, quota_project_id=PROJECT_ID)
credentials.refresh(Request())
access_token = credentials.token

url = 'https://apigee.googleapis.com/v1/organizations/'+PROJECT_ID+'/environments/eval/keyvaluemaps/llm-routing-v1-modelprovider-config/entries'
headers = {'Authorization': 'Bearer '+access_token, 'Content-type': 'application/json'}

entry = 'huggingface__token'
resp = requests.put(url+"/"+entry, headers = headers, data=json.dumps({"name": entry,"value": HUGGINGFACE_TOKEN}))
if resp.status_code == 200:
    print("HuggingFace Access Token updated successfully")
else:
    print (resp.text)

entry = 'mistral__token'
resp = requests.put(url+"/"+entry, headers = headers, data=json.dumps({"name": entry,"value": MISTRALAI_KEY}))
if resp.status_code == 200:
    print("Mistral AI API Key updated successfully")
else:
    print (resp.text)

### Test Sample

We will need an API key that is already provisioned in your lab instance. Go to Apigee in your GCP console. In the left-hand menu, select `Apps` to see the list of apps. Click the `llm-routing-app` app and copy the `Key` from the `Credentials` section.

In [None]:
ROUTING_SAMPLE_APIKEY="" # @param {type:"string"}

In [None]:
if not PROJECT_ID or PROJECT_ID == "":
    raise ValueError("Please set your PROJECT_ID")
if not LOCATION or LOCATION == "":
    raise ValueError("Please set your LOCATION")
if not APIGEE_HOSTNAME or APIGEE_HOSTNAME == "":
    raise ValueError("Please set your APIGEE_HOSTNAME")
if not ROUTING_SAMPLE_APIKEY or ROUTING_SAMPLE_APIKEY == "":
    raise ValueError("Please set your ROUTING_SAMPLE_APIKEY")

API_ENDPOINT = "https://"+APIGEE_HOSTNAME+"/v1/samples/llm-routing/"

PROMPT="Suggest name for a flower shop"

#### Select an LLM Provider

Select a provider from the dropdown in the cell below. This automatically sets the model name used by the SDKs.

Try picking different providers from the dropdown. You will see that the same SDK can call the Apigee endpoint, which serves responses from different providers.

In [None]:
# Pick a provider; the default "select" is a placeholder and raises an error below
llm_provider = "select" # @param ["select","google", "huggingface", "mistral"]

if llm_provider == "google":
    model = "google/gemini-1.5-flash"
elif llm_provider == "mistral":
    model = "open-mistral-nemo"
elif llm_provider == "huggingface":
    model = "meta-llama/Llama-3.2-11B-Vision-Instruct"
else:
    raise ValueError("Invalid LLM provider")

#### Using OpenAI SDK

In [None]:
import openai

openai.api_key = ROUTING_SAMPLE_APIKEY
openai.base_url = API_ENDPOINT
openai.default_headers = {"x-apikey": ROUTING_SAMPLE_APIKEY, "x-llm-provider": llm_provider}

completion = openai.chat.completions.create(
    model=model,
    messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": PROMPT
        }
      ]
    }
  ]
)
print(f"Using the OpenAI SDK, fetching the response from \"{model}\" provided by \"{llm_provider}\"")
print("\n")
print(completion.choices[0].message.content)

#### Using Langchain SDK

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model=model,
    api_key=ROUTING_SAMPLE_APIKEY,
    base_url=API_ENDPOINT,
    default_headers = {"x-apikey": ROUTING_SAMPLE_APIKEY, "x-llm-provider": llm_provider}
)
messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": PROMPT
      }
    ]
  }
]
print(f"Using the Langchain SDK, fetching the response from \"{model}\" provided by \"{llm_provider}\"")
print("\n")
print(llm.invoke(messages).content)

Congratulations! You've successfully deployed the routing proxy and tested the ability to route calls to different LLM providers.

---

## Lab: LLM-Semantic-Caching with Apigee

This sample performs a cache lookup of responses using Apigee's Cache layer, with Vertex AI Vector Search as the embeddings database. It operates by comparing the vector proximity of the prompt to prior requests and applying a configurable similarity score threshold.

![architecture](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-semantic-cache/images/arch-1.png?raw=1)

### Benefits of a Semantic Cache Layer with Apigee:

* **Reduced Response Times**: The cache layer significantly reduces response times for repeated queries, as Apigee efficiently stores and retrieves frequently accessed data.
* **Improved Efficiency**: By leveraging the caching capabilities of Apigee, unnecessary calls to the underlying model will be minimized, leading to optimized LLM costs.
* **Scalability**: The Apigee Cache Layer is managed and distributed, enhancing platform scalability without operational overhead.


### About Vertex AI Vector Search API and Embeddings API

[**Vertex AI Vector Search**](https://cloud.google.com/vertex-ai/docs/vector-search/overview) enables real-time, fast retrieval of embeddings, which powers a wide range of next-gen user experiences. It provides state-of-the-art embeddings similarity search ([**ScaNN**](https://research.google/blog/announcing-scann-efficient-vector-similarity-search/)) that is foundational to Google services like Search, Play, and YouTube. It is a key enabler for Search applications and RAG. Vertex AI Vector Search offers speed, scale, quality and cost advantages over alternatives. It also has differentiated value-added capabilities including incremental indexing, numerical and tag-based filtering, ensuring diversity of results, and auto-scaling.

[**Vertex AI Embeddings API**](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) provides a powerful way to represent text, images, and videos as numerical vectors. It allows for finding similar content based on meaning, suggesting relevant items based on user preferences and past interactions, and classifying, clustering, and detecting outliers based on semantic relationships. It is a key component of RAG architecture, allowing more natural and engaging interactions with chatbots. It supports combining information from different data types for richer insights.

### How does it work?

1. A prompt request is received by an Apigee proxy.
2. Apigee extracts the prompt contents and generates a numerical representation using the Vertex AI Embeddings API.
3. Apigee performs a semantic similarity search using Vertex AI Vector Search.
4. If there's a datapoint with a good similarity score, Apigee performs a cache lookup using [**Apigee's Cache**](https://cloud.google.com/apigee/docs/api-platform/cache/persistence-tools#caching).
5. If there's a cached datapoint, Apigee returns the cached LLM response; otherwise it populates the Apigee Cache with the LLM response.
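
The lookup logic in the steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the proxy's implementation: it uses an in-memory list in place of Vertex AI Vector Search and the Apigee Cache, cosine similarity as the proximity measure, and hand-written two-dimensional vectors in place of real embeddings. The `SemanticCache` class and the `0.9` threshold are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self._entries = []          # list of (embedding, cached response)

    def lookup(self, embedding):
        """Return the response of the most similar stored prompt, or None."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        self._entries.append((list(embedding), response))


cache = SemanticCache(threshold=0.9)
cache.store([1.0, 0.0], "cached answer")
print(cache.lookup([0.99, 0.05]))  # near-identical direction: cache hit
print(cache.lookup([0.0, 1.0]))    # orthogonal direction: cache miss (None)
```

Raising the threshold trades cache hits for stricter matching, which is why the sample keeps it configurable.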

### Test Sample

This script measures and visualizes the performance of a semantic cache layer implemented using Apigee.

####  Initialize the variables

In [None]:
from langchain_google_vertexai import VertexAI

if not PROJECT_ID or PROJECT_ID == "":
    raise ValueError("Please set your PROJECT_ID")
if not LOCATION or LOCATION == "":
    raise ValueError("Please set your LOCATION")
if not APIGEE_HOSTNAME or APIGEE_HOSTNAME == "":
    raise ValueError("Please set your APIGEE_HOSTNAME")

# Define project information
API_ENDPOINT = "https://"+APIGEE_HOSTNAME+"/v1/samples/llm-semantic-cache"
# Initialize Langchain
model = VertexAI(
      project=PROJECT_ID,
      location=LOCATION,
      api_endpoint=API_ENDPOINT,
      api_transport="rest",
      streaming=True,
      model_name="gemini-1.5-pro")

#### Test and analyze semantic cache performance

The following cell executes a set of prompts multiple times and records the response time of each execution.
It then plots the response times across executions and highlights the average response time.

In [None]:
import time
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl


runs = 2  # number of passes over the prompt list (avoids shadowing the built-in exec)
execs = []
prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why does the sky is blue colored?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?"]
for _ in range(runs):
  for prompt in prompts:
    # Time a single round trip through the Apigee proxy
    start_time = time.time()
    model.invoke(prompt)
    response_time = time.time() - start_time
    execs.append(response_time)

mpl.rcParams['figure.figsize'] = [15, 5]
df = pd.DataFrame(execs, columns=['Response time'])
df['Exec'] = range(1, len(df) + 1)
df.plot(kind='line', x='Exec', y='Response time', legend=False)
plt.title('Semantic Cache Performance')
plt.xlabel('Executions')
plt.ylabel('Response Time (s)')
plt.xticks(df['Exec'], rotation=0)

average = df['Response time'].mean()
plt.axhline(y=average, color='r', linestyle='--', label=f'Average: {average:.2f}')
plt.legend()

plt.show()
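
One way to summarize the measurements is to compare the average response time of each pass over the prompt list. The sketch below assumes the timings are stored in order, as in the cell above; `average_per_run` is a hypothetical helper and the sample timings are made up for illustration.

```python
def average_per_run(times, prompts_per_run):
    """Return the mean response time of each consecutive run."""
    if prompts_per_run <= 0:
        raise ValueError("prompts_per_run must be positive")
    runs = [times[i:i + prompts_per_run] for i in range(0, len(times), prompts_per_run)]
    return [sum(run) / len(run) for run in runs]


# Made-up timings in seconds: two runs of three prompts each.
sample_times = [2.0, 1.0, 3.0, 0.5, 0.25, 0.75]
print(average_per_run(sample_times, 3))  # → [2.0, 0.5]
```

In the notebook you could call `average_per_run(execs, len(prompts))`; a lower average on later runs would suggest that more responses were served from the cache.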

Congratulations! You've successfully deployed the semantic cache proxy and tested the ability to cache prompts effectively.

---

## Lab: LLM-Token-Limits with Apigee

Every interaction with an LLM consumes tokens. LLM token management therefore plays a crucial role in maintaining platform-level control and visibility over the consumption of tokens across LLM providers and consumers.

Apigee's API Products, when applied to token consumption, allow you to manage token usage by setting limits on the number of tokens consumed per LLM consumer. Quota enforcement leverages the token usage metrics returned by the LLM, enabling real-time monitoring and enforcement of limits.

![architecture](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/images/ai-product.png?raw=1)


### Benefits of Token Limits with AI Products

Creating Product tiers within Apigee allows for differentiated token quotas at each consumer tier. This enables you to:

* **Control resource allocation**: Prioritize resources for high-priority consumers by allocating higher token quotas to their tiers. This will also help to manage platform-wide token budgets across multiple LLM providers.
* **Tiered AI products**: By utilizing product tiers with granular token quotas, Apigee manages LLM consumption and empowers AI platform teams to manage costs and provide a multi-tenant platform experience.

### How does it work?

1. A prompt request is received by an Apigee proxy.
2. Apigee identifies the consumer application and verifies that the AI Product token quota has not been exceeded.
3. Apigee extracts the token counts and adds them to the quota counter.
4. Apigee captures the token counts as metrics for Analytics.
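
Steps 2 and 3 above can be sketched as a fixed-window counter. This is an illustrative stand-in, not Apigee's quota implementation: the `TokenQuota` class, the fixed-window strategy, and the injectable `clock` parameter are assumptions made so the example is easy to test.

```python
import time


class TokenQuota:
    """Fixed-window token counter (an illustrative stand-in for a quota policy)."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._used = 0

    def _roll_window(self) -> None:
        # Start a fresh window once the current one has elapsed
        now = self._clock()
        if now - self._window_start >= self.window_seconds:
            self._window_start = now
            self._used = 0

    def allow_request(self) -> bool:
        """Step 2: accept only while the quota is not yet exhausted."""
        self._roll_window()
        return self._used < self.limit

    def record_usage(self, total_tokens: int) -> None:
        """Step 3: add the token count reported with the LLM response."""
        self._roll_window()
        self._used += total_tokens


bronze = TokenQuota(limit=2000, window_seconds=300)
bronze.record_usage(2100)
print(bronze.allow_request())  # False: the caller would receive HTTP 429
```

Because token counts are only known after the LLM responds, a counter like this can overshoot the limit on the last accepted request; subsequent requests are rejected until the window resets.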

### Test Sample

We will need the API keys that are already provisioned in your lab instance. Go to Apigee in your GCP console. In the left-hand menu, select `Apps` to see the list of apps. Click the `ai-consumer-app` app and copy the keys from the `Credentials` section.

In [None]:
BRONZE_APIKEY="" # @param {type:"string"}
SILVER_APIKEY="" # @param {type:"string"}

#### Test tiered AI products

Apigee allows you to create a tiered product strategy with different API access levels (e.g., Bronze, Silver, Gold) to cater to diverse user needs and limits.

##### Bronze AI Product

This product enforces a limit of 2000 tokens every 5 minutes. The following cell initializes the model with the Bronze API key.

In [None]:
from langchain_google_vertexai import VertexAI

if not PROJECT_ID or PROJECT_ID == "":
    raise ValueError("Please set your PROJECT_ID")
if not LOCATION or LOCATION == "":
    raise ValueError("Please set your LOCATION")
if not APIGEE_HOSTNAME or APIGEE_HOSTNAME == "":
    raise ValueError("Please set your APIGEE_HOSTNAME")
if not BRONZE_APIKEY or BRONZE_APIKEY == "":
    raise ValueError("Please set your BRONZE_APIKEY")

# Define project information
API_ENDPOINT = "https://"+APIGEE_HOSTNAME+"/v1/samples/llm-token-limits"
# Initialize Langchain
model = VertexAI(
      project=PROJECT_ID,
      location=LOCATION,
      api_endpoint=API_ENDPOINT,
      api_transport="rest",
      streaming=True,
      model_name="gemini-1.5-pro",
      additional_headers={"x-apikey": BRONZE_APIKEY})

To test this limit, follow the steps below:

1. Start a debug session in the Apigee console on the **llm-token-limits-v1** proxy that was deployed
2. Run the 2000 tokens every 5 minutes test. This scenario demonstrates a basic interaction with a language model. The code repeatedly asks a language model the same question, "Why is the sky blue?" but phrased in different ways. It's a simple example of how to interact with a language model.

In [None]:
prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why does the sky is blue colored?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?"]

def invoke_model(prompt, model_to_invoke=model):
  # Return the model's response so the caller can print it
  return model_to_invoke.invoke(prompt)

for prompt in prompts:
  print(invoke_model(prompt))

3. After running the scenario, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Bronze AI Product limit of 2000 tokens every 5 minutes.
4. Now let's run the 5000 tokens every 5 minutes test with the same key. In this scenario we ask the model the same question, "Why is the sky blue?", phrased in different ways to make sure the candidate responses are very **extensive (high token count)**.

In [None]:
prompts = ["Why is the sky blue? Provide a very long and detailed explanation.",
           "Furnish an exhaustive and long explanation (as long as a science magazine article) for the phenomenon of the blue sky.",
           "Can you give me a really in-depth explanation, as long as a book chapter, of why the sky is blue?",
           "Give me a super detailed and very extensive explanation (as long as the yellow pages) of why the sky is blue.",
           "Can you tell me all about why the sky is blue, and make sure it's longer than a novel?"]

def invoke_model(prompt, model_to_invoke=model):
  # Return the model's response so the caller can print it
  return model_to_invoke.invoke(prompt)

for prompt in prompts:
  try:
    print(invoke_model(prompt))
  except Exception as err:
    # Expected once the Bronze quota is exhausted (HTTP 429); keep going
    print(f"Request failed: {err}")

5. After running the scenario, the final token count (sum of tokens from prompts and response candidates) **should exceed** the Bronze AI Product limit of 2000 tokens every 5 minutes. You should expect `HTTP 429` error messages in the notebook, which are also visible in Apigee's debug session.

##### Silver AI Product

This product enforces a limit of 5000 tokens every 5 minutes. The following cell initializes the model with the Silver API key.

In [None]:
from langchain_google_vertexai import VertexAI

if not PROJECT_ID or PROJECT_ID == "":
    raise ValueError("Please set your PROJECT_ID")
if not LOCATION or LOCATION == "":
    raise ValueError("Please set your LOCATION")
if not APIGEE_HOSTNAME or APIGEE_HOSTNAME == "":
    raise ValueError("Please set your APIGEE_HOSTNAME")
if not SILVER_APIKEY or SILVER_APIKEY == "":
    raise ValueError("Please set your SILVER_APIKEY")

# Define project information
API_ENDPOINT = "https://"+APIGEE_HOSTNAME+"/v1/samples/llm-token-limits"
# Initialize Langchain
model = VertexAI(
      project=PROJECT_ID,
      location=LOCATION,
      api_endpoint=API_ENDPOINT,
      api_transport="rest",
      streaming=True,
      model_name="gemini-1.5-pro",
      additional_headers={"x-apikey": SILVER_APIKEY})

To test this limit, follow the steps below:

1. Start a debug session in the Apigee console on the **llm-token-limits-v1** proxy that was deployed
2. Run the 2000 tokens every 5 minutes test. This scenario demonstrates a basic interaction with a language model. The code repeatedly asks a language model the same question, "Why is the sky blue?" but phrased in different ways. It's a simple example of how to interact with a language model.

In [None]:
prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why does the sky is blue colored?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?"]

def invoke_model(prompt, model_to_invoke=model):
  # Return the model's response so the caller can print it
  return model_to_invoke.invoke(prompt)

for prompt in prompts:
  print(invoke_model(prompt))

3. After running the scenario, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Silver AI Product limit of 5000 tokens every 5 minutes.
4. Now let's run the 5000 tokens every 5 minutes test with the same key. In this scenario we ask the model the same question, "Why is the sky blue?", phrased in different ways to make sure the candidate responses are very **extensive (high token count)**.

In [None]:
prompts = ["Why is the sky blue? Provide a very long and detailed explanation.",
           "Furnish an exhaustive and long explanation (as long as a science magazine article) for the phenomenon of the blue sky.",
           "Can you give me a really in-depth explanation, as long as a book chapter, of why the sky is blue?",
           "Give me a super detailed and very extensive explanation (as long as the yellow pages) of why the sky is blue.",
           "Can you tell me all about why the sky is blue, and make sure it's longer than a novel?"]

def invoke_model(prompt, model_to_invoke=model):
  # Return the model's response so the caller can print it
  return model_to_invoke.invoke(prompt)

for prompt in prompts:
  print(invoke_model(prompt))

5. After running the scenario, the final token count (sum of tokens from prompts and response candidates) **should not exceed** the Silver AI Product tokens limit of 5000 tokens every 5 minutes.

### Tokens Consumption Analytics

This sample also creates a Tokens Consumption analytics dashboard that allows you to:

* Understand usage patterns: See how often tokens are being used and by which Developer App.
* Optimize token management: Make informed decisions about token usage and adjust your tiered limits.
* Plan for scalability: Forecast future demand and ensure resource availability.

To use this dashboard, from the Apigee console navigate to `Custom Reports` > `Tokens Consumption Report`. You'll be able to drill down into token metrics that represent consumption by Developer Apps and Products. See sample below:

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/images/token-counts.png?raw=1)

Congratulations! You've successfully deployed the llm-token-limits proxy and tested the ability to control access to LLM workloads using tokens.

---

## Lab: LLM-Circuit-Breaking with Apigee

Circuit breaking with Apigee offers significant benefits for serving Large Language Models (LLMs) in Retrieval Augmented Generation (RAG) applications, particularly in preventing the dreaded `429` HTTP errors that arise from exceeding LLM endpoint quotas. By placing Apigee between the RAG application and LLM endpoints, users gain a robust mechanism for managing traffic distribution and graceful failure handling.

Imagine a scenario where multiple tenants, each with their own LLM endpoints and associated capacity limits, are accessed by a single RAG application. Without circuit breaking, a surge in traffic to a particular tenant's LLM endpoint could trigger a `429` error, disrupting the entire RAG application's functionality. Apigee acts as a traffic cop, monitoring the health of each tenant's endpoint and implementing a circuit-breaking strategy to prevent cascading failures.

To further enhance resilience, users can create priority pools, grouping together LLM endpoints with similar capabilities and quota limitations. This allows Apigee to distribute traffic evenly within a pool, effectively aggregating the individual endpoint quotas and ensuring that the combined capacity can handle the load.
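
Even distribution within a pool can be pictured as round-robin selection. The sketch below uses round-robin as an assumed strategy for illustration; the endpoint names are hypothetical and the sample's actual load-balancing configuration may differ.

```python
from itertools import cycle

# Hypothetical endpoint names that make up one priority pool.
primary_pool = cycle(["endpoint-a", "endpoint-b", "endpoint-c"])

# Six consecutive requests are spread evenly: each endpoint serves two.
targets = [next(primary_pool) for _ in range(6)]
print(targets)
```

With three endpoints sharing the load, each one sees roughly a third of the traffic, so the pool as a whole can absorb about three times a single endpoint's quota.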

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/images/llm-circuit-breaking.png?raw=1)


### Circuit Breaking Benefits

1. **Improved fault tolerance**: The multi-pool architecture, coupled with circuit breaking, provides inherent fault tolerance, ensuring that the RAG application remains operational even if one or more LLM endpoints fail or experience outages.
2. **Data-driven capacity planning**: Circuit breaking provides valuable insights into endpoint performance, allowing you to monitor and adjust capacity allocations based on actual traffic patterns and usage. This enables informed capacity planning and avoids unnecessary overprovisioning.
3. **Multitenancy**: Apigee provides a unified platform for managing and routing traffic to different LLM tenants, simplifying integration and reducing development effort.
4. **Centralized monitoring and analytics**: Apigee offers comprehensive monitoring and analytics capabilities, allowing for real-time insights into LLM endpoint performance, quota usage, and failover events. This enables proactive identification and resolution of issues, enhancing operational efficiency.


### How does it work?

1. Apigee receives a request and checks the status of the primary pool. If it's open, Apigee routes the traffic to the primary pool. If it's closed, Apigee routes the traffic to the secondary pool.
2. If the request to the primary pool fails (`429` or any other status code greater than `399`), Apigee fails over to the secondary pool and increments the error count in the circuit breaker.
3. Once a maximum of 2 errors has been detected, the primary pool is taken out of rotation and all traffic is sent to the secondary pool.
4. The primary pool is returned to rotation after a cooldown period of 2 minutes.
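
The steps above can be sketched as a small state machine. This is a simplified illustration, not the proxy's implementation: the `CircuitBreaker` class, the `route` function, and the injectable `clock` are hypothetical, and the pool calls are stand-in functions that return a `(status, body)` tuple.

```python
import time


class CircuitBreaker:
    def __init__(self, max_errors=2, cooldown_seconds=120.0, clock=time.monotonic):
        self.max_errors = max_errors
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._errors = 0
        self._tripped_at = None  # time the primary pool left rotation

    def primary_available(self) -> bool:
        """Step 1 and step 4: in rotation unless tripped and still cooling down."""
        if self._tripped_at is None:
            return True
        if self._clock() - self._tripped_at >= self.cooldown_seconds:
            self._errors = 0
            self._tripped_at = None
            return True
        return False

    def record_failure(self) -> None:
        """Steps 2 and 3: count the error and trip after max_errors."""
        self._errors += 1
        if self._errors >= self.max_errors:
            self._tripped_at = self._clock()


def route(breaker, call_primary, call_secondary):
    """Try the primary pool when available; fail over to the secondary pool."""
    if breaker.primary_available():
        status, body = call_primary()
        if status <= 399:
            return body
        breaker.record_failure()
    return call_secondary()[1]


breaker = CircuitBreaker()
print(route(breaker, lambda: (429, "quota exceeded"), lambda: (200, "served by secondary")))
```

The caller always receives the secondary pool's response on failure, so quota errors from the primary pool never surface to the consumer.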

### Test Sample

#### Initialize the variables

In [None]:
from google.auth import default
from google.auth.transport.requests import Request
# Define sample information

API_ENDPOINT = "https://"+APIGEE_HOSTNAME+"/v1/samples/llm-circuit-breaking"
TASK_QUEUE = "ai-queue"

SCOPES = ['https://www.googleapis.com/auth/cloud-platform']
GEMINI_SUFFIX = "/v1/projects/{project}/locations/{location}/publishers/google/models/gemini-1.5-pro:streamGenerateContent".format(project=PROJECT_ID, location=LOCATION)
LLM_REQUEST_URL=API_ENDPOINT + GEMINI_SUFFIX

credentials, project_id = default(scopes=SCOPES, quota_project_id=PROJECT_ID)
credentials.refresh(Request())
access_token = credentials.token

#### Test Circuit Breaking

The following cell executes a test scenario designed to exceed the Gemini quota ([gemini-1.5-pro model limits](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#quotas_by_region_and_model)) of a **primary** GCP project. As soon as the project quota is reached, a secondary target serves traffic without returning `429` errors to the consumer.

In [None]:
from google.cloud import tasks_v2
from google.protobuf import duration_pb2
from typing import Dict, Optional
import json

prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why does the sky is blue colored?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?",
           "Why is the sky blue?"]

def create_http_task(
    project: str,
    location: str,
    queue: str,
    url: str,
    json_payload: Dict,
    scheduled_seconds_from_now: Optional[int] = None,
    task_id: Optional[str] = None,
    deadline_in_seconds: Optional[int] = None,
) -> tasks_v2.Task:
    client = tasks_v2.CloudTasksClient()
    task = tasks_v2.Task(
        http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url=url,
            headers={"Content-type": "application/json",
                     "Authorization": f"Bearer {access_token}"},
            body=json.dumps(json_payload).encode(),
        ),
        name=(
            client.task_path(project, location, queue, task_id)
            if task_id is not None
            else None
        ),
    )
    duration = duration_pb2.Duration()
    duration.FromSeconds(120)
    task.dispatch_deadline = duration
    return client.create_task(
        tasks_v2.CreateTaskRequest(
            parent=client.queue_path(project, location, queue),
            task=task,
        )
    )

def invoke_model(prompt):
  request = {"contents":[{"role":"user","parts":[{"text":prompt}]}],"generationConfig":{}}
  create_http_task(PROJECT_ID, LOCATION, TASK_QUEUE, LLM_REQUEST_URL, request)

x = range(15)
for n in x:
  for prompt in prompts:
    invoke_model(prompt)

### Analyze target pool Gemini quotas

This sample also creates an LLM Target analytics report that allows you to:

* Understand usage patterns: See how often the Gemini quota is being reached.
* Optimize token management: Make informed decisions about quota usage and adjust pre-allocated quota.
* Plan for scalability: Forecast future demand and ensure resource availability.

To use this dashboard, from the Apigee console navigate to `Custom Reports` > `LLM Target Report`. You'll be able to drill down into token metrics that represent LLM traffic.

**NOTE**: It might take a few minutes for the report to show data.

See sample below:

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/images/circuit-breaking-report.png?raw=1)


Congratulations! You've successfully deployed the circuit-breaking proxy and tested the ability to switch to a secondary target as a circuit breaker.

---

## Congratulations!

Congratulations on finishing the workshop! You've now gained valuable skills and hands-on experience in integrating Large Language Models with Apigee. You're well-equipped to build robust, efficient, and innovative AI-powered APIs that can transform your applications and services.

We encourage you to continue exploring the exciting world of GenAI and Apigee. Dive deeper into the documentation, experiment with new ideas, and leverage these powerful tools to create cutting-edge solutions.

We're excited to see what you build going forward!