# **LLM Serving with Apigee**

<table align="left">
    <td style="text-align: center">
        <a href="https://colab.research.google.com/github/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/llm_token_limits.ipynb">
          <img src="https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/images/icon32.png?raw=true" alt="Google Colaboratory logo"><br> Open in Colab
        </a>
      </td>
      <td style="text-align: center">
        <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapigee-samples%2Fmain%2Fllm-token-limits%2Fllm_token_limits.ipynb">
          <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
        </a>
      </td>    
      <td style="text-align: center">
        <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/apigee-samples/main/llm-token-limits/llm_token_limits.ipynb">
          <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
        </a>
      </td>
      <td style="text-align: center">
        <a href="https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits/llm_token_limits.ipynb">
          <img src="https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/images/github-mark.png?raw=true" width="30" alt="GitHub logo"><br> View on GitHub
        </a>
      </td>
</table>
<br />
<br />
<br />

# Token Limits per User Sample 

Every interaction with an LLM consumes tokens. LLM token management therefore plays a crucial role in maintaining platform-level control and visibility over token consumption across LLM providers and consumers.

Apigee's API Products, when applied to token consumption, allow you to manage token usage by setting limits on the number of tokens consumed per LLM consumer. Enforcement relies on the token usage metrics returned by the LLM, which enables real-time monitoring and enforcement of limits.

You can measure token consumption per client app (for example, an AI Agent). Another option is to count and limit token usage per end user of that client app. A single AI Agent can have multiple end users (for example, human users interacting with an AI chatbot application), and you may want to track and control their usage as well.

This requires not only a client_id (which identifies the client app) but also some form of user ID (for example, taken from an ID token), and this information must reach Apigee for it to apply such control.

![architecture](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits-per-user/images/ai-product.png?raw=1)


# Benefits of Token Limits with AI Products

Creating Product tiers within Apigee allows for differentiated token quotas at each consumer tier. This enables you to:

* **Control resource allocation**: Prioritize resources for high-priority consumers by allocating higher token quotas to their tiers. This will also help to manage platform-wide token budgets across multiple LLM providers.
* **Tiered AI products**: By using product tiers with granular token quotas, Apigee manages LLM token consumption and lets AI platform teams manage costs and provide a multi-tenant platform experience. Apigee can count usage per user or per app; this sample explores the per-user option.

# How does it work?

1. A prompt request is received by an Apigee proxy.
2. Apigee identifies the consumer application, validates it, and gets the associated AI Product.
3. Apigee extracts the ID of the end user of this application from the `x-userid` header.
4. Apigee extracts token counts and adds them to the quota counter, per user.
5. Apigee captures token counts as metrics for Analytics.
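
The steps above can be sketched as a small simulation. This is an illustrative toy, not Apigee's implementation: the class name `FixedWindowTokenQuota`, the fixed-window reset, and the rule that a request is rejected once the user's counter has reached the limit are all assumptions made for the example. The real `Q-TokenQuota` policy may behave differently in its details.

```python
import time


class FixedWindowTokenQuota:
    """Toy per-user token quota with fixed windows (not Apigee's implementation)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit                    # e.g. 2000 tokens for Bronze
        self.window_seconds = window_seconds  # e.g. 300 seconds (5 minutes)
        self._counters = {}                   # (app_id, user_id) -> (window index, tokens used)

    def _used(self, key, now):
        window = int(now // self.window_seconds)
        stored_window, used = self._counters.get(key, (window, 0))
        if stored_window != window:  # a new window started, so the counter resets
            used = 0
        return window, used

    def check(self, app_id, user_id, now=None):
        """Return 200 while the user has budget left in this window, else 429."""
        now = time.time() if now is None else now
        _, used = self._used((app_id, user_id), now)
        return 200 if used < self.limit else 429

    def record(self, app_id, user_id, tokens, now=None):
        """Add the tokens reported by the LLM response to the user's counter."""
        now = time.time() if now is None else now
        key = (app_id, user_id)
        window, used = self._used(key, now)
        self._counters[key] = (window, used + tokens)


quota = FixedWindowTokenQuota(limit=2000, window_seconds=300)
quota.record("bronze-app", "userA", 2100, now=10)
print(quota.check("bronze-app", "userA", now=20))   # 429: userA used up the window
print(quota.check("bronze-app", "userB", now=20))   # 200: userB has a separate counter
print(quota.check("bronze-app", "userA", now=310))  # 200: a new window has started
```

Because token counts are only known after the LLM responds, the counter in this sketch can overshoot the limit on the last allowed request; the next request from the same user is then rejected.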

# Setup

Use the following Google Cloud Shell tutorial and follow its instructions to deploy the sample.

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/open?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/apigee-samples&cloudshell_git_branch=main&cloudshell_workspace=.&cloudshell_tutorial=llm-token-limits-per-user/docs/cloudshell-tutorial.md)

# Test Sample

## Install dependencies


In [1]:
!pip install -Uq google-genai

## Authenticate your notebook environment (Colab only)
If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using Vertex AI Workbench or Colab Enterprise.

In [1]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

## Initialize notebook variables

* **PROJECT_ID**: The default GCP project to use when making Vertex AI API calls.
* **LOCATION**: The default location (region) to use when making API calls.
* **API_ENDPOINT**: The desired API endpoint, e.g. https://apigee.iloveapimanagement.com/generate
* **API_KEY**: After deploying the sample you'll get 2 API keys: **Bronze Key** and **Silver Key**. First, set the value of your **Bronze Key**.
* **USER_ID**: A unique user ID (string) that identifies the end user of the AI application. For this sample, you can mock a string such as "userA".

In [None]:
from google import genai
from google.genai import types
# Define project information
PROJECT_ID = ""  # @param {type:"string"}
LOCATION = ""  # @param {type:"string"}
API_ENDPOINT = ""  # @param {type:"string"}
API_KEY = ""  # @param {type:"string"}
MODEL = "gemini-2.5-flash"
USER_ID = "" # @param {type:"string"}

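# Route Vertex AI calls through the Apigee proxy. The API key identifies the
# client app; x-userid identifies the end user for per-user quota counting.
client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
    http_options=types.HttpOptions(
        api_version="v1",
        base_url=API_ENDPOINT,
        headers={"x-apikey": API_KEY, "x-userid": USER_ID},
    ),
)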
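
The client above sends two custom headers, and an empty `API_KEY` or `USER_ID` is an easy mistake when switching between keys and users. A minimal sketch of a guard, assuming a hypothetical helper named `build_apigee_headers` (it is not part of the sample or of `google-genai`):

```python
def build_apigee_headers(api_key, user_id):
    """Return the headers this notebook sends to the Apigee proxy.

    Hypothetical helper: fails early if either value is missing.
    """
    if not api_key or not api_key.strip():
        raise ValueError("API_KEY is empty; set your Bronze Key or Silver Key")
    if not user_id or not user_id.strip():
        raise ValueError("USER_ID is empty; set a value such as 'userA'")
    return {"x-apikey": api_key, "x-userid": user_id}
```

The returned dictionary could be passed as the `headers` argument of `types.HttpOptions` when creating the client, so that rebuilding the client for a second user only requires a new `user_id`.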

## Test tiered AI products

Apigee allows you to create a tiered product strategy with different API access levels (e.g., Bronze, Silver, Gold) to cater to diverse user needs and limits. During the [Setup](#setup) stage you deployed 2 AI Product tiers for testing purposes. 

* **Bronze AI Product with first user**

This product enforces a 2000 token limit every 5 minutes per user. To test this limit, follow the steps below.

  1. Set the `API_KEY` value using your **Bronze Key** in the previous [step](#initialize-notebook-variables).
  2. Set the `USER_ID` value to a unique string that identifies the user. You can mock it with a value such as "userA" or any other valid string.
  3. Start a debug session on the **llm-token-limits-per-user-v1** proxy that was deployed during the [Setup](#setup) stage.
  4. Run the 2000 tokens every 5 minutes [scenario](#2000-tokens-every-5-minutes).
  5. Observe `HTTP 200` success codes in the debug session and explore the `Q-TokenQuota` policy flow variables `allowed.count`, `used.count` and `available.count`.
  6. Run the same [scenario](#2000-tokens-every-5-minutes) again until you observe `HTTP 429` error codes in the debug session, then explore the `Q-TokenQuota` policy flow variables `allowed.count`, `used.count`, `available.count` and `exceed.count`.

* **Bronze AI Product with second user**

Immediately after getting `HTTP 429` errors, run the steps again, this time with a different user ID.

  1. Set the `API_KEY` value using your **Bronze Key** in the previous [step](#initialize-notebook-variables).
  2. Set the `USER_ID` value to a unique string that identifies a second, different user. You can mock it with a value such as "userB" or any other valid string, as long as it differs from the previous run.
  3. Start a debug session on the **llm-token-limits-per-user-v1** proxy that was deployed during the [Setup](#setup) stage.
  4. Run the 2000 tokens every 5 minutes [scenario](#2000-tokens-every-5-minutes).
  5. Observe `HTTP 200` success codes in the debug session and explore the `Q-TokenQuota` policy flow variables `allowed.count`, `used.count` and `available.count`. The requests succeed because the quota is counted separately for this second user.


Feel free to set `API_KEY` to your **Silver Key** and re-run the scenarios to see the larger limit of 5000 tokens every 5 minutes per user.

## Tokens Consumption Analytics

This sample also creates a Tokens Consumption analytics dashboard that allows you to:

* Understand usage patterns: See how often tokens are being used, by Developer App and by User ID.
* Optimize token management: Make informed decisions about token usage and adjust your tiered limits.
* Plan for scalability: Forecast future demand and ensure resource availability.

To use this dashboard, from the Apigee console navigate to `Custom Reports` > `Tokens Consumption Report`. You'll be able to drill down into token metrics that represent consumption by Developer Apps and Products. See sample below:

![image](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-token-limits-per-user/images/token-counts.png?raw=1)
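
The drill-down in the report is, in essence, a grouped sum of token counts. As a rough sketch of that idea only: the field names `developer_app`, `user_id` and `total_tokens` and the records below are made up for illustration and are not the actual dimensions or data of the Apigee report.

```python
from collections import defaultdict


def total_tokens_by(records, *dimensions):
    """Sum total_tokens grouped by the given record fields."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record["total_tokens"]
    return dict(totals)


# Made-up sample records, for illustration only.
records = [
    {"developer_app": "bronze-app", "user_id": "userA", "total_tokens": 700},
    {"developer_app": "bronze-app", "user_id": "userA", "total_tokens": 650},
    {"developer_app": "bronze-app", "user_id": "userB", "total_tokens": 400},
]

print(total_tokens_by(records, "developer_app"))             # consumption per app
print(total_tokens_by(records, "developer_app", "user_id"))  # consumption per app and user
```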

# 2000 tokens every 5 minutes

This scenario demonstrates a basic interaction with a language model. The code asks the model the same question, "Why is the sky blue?", phrased in several different ways. After running the scenario **only once**, expect the following behavior:

* If using the **Bronze Key**, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Bronze AI Product tokens limit of 2000 tokens every 5 minutes. This will be counted per user ID.
* If using the **Silver Key**, the final token count (sum of tokens from prompts and response candidates) shouldn't exceed the Silver AI Product tokens limit of 5000 tokens every 5 minutes. This will be counted per user ID.


In [19]:
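# Five phrasings of the same question; each call consumes tokens from the quota.
prompts = ["Why is the sky blue?",
           "What makes the sky blue?",
           "Why is the sky colored blue?",
           "Can you explain why the sky is blue?",
           "The sky is blue, why is that?"]

# A ClientError with code 429 is raised once the per-user quota is exhausted.
for prompt in prompts:
    response = client.models.generate_content(model=MODEL, contents=prompt)
    print(response.text)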

The sky is blue primarily due to a phenomenon called **Rayleigh scattering**. Here's a breakdown of how it works:

1.  **Sunlight is White Light:** Sunlight, which appears white to us, is actually made up of all the colors of the rainbow (red, orange, yellow, green, blue, indigo, violet). Each of these colors has a different **wavelength** – red light has the longest wavelength, and violet/blue light has the shortest.

2.  **Earth's Atmosphere:** Earth's atmosphere is composed mainly of tiny gas molecules, predominantly **nitrogen (N2) and oxygen (O2)**. These molecules are much smaller than the wavelengths of visible light.

3.  **Rayleigh Scattering:** When sunlight enters the atmosphere, these tiny molecules act like miniature obstacles. They **scatter** the light in all directions. However, they don't scatter all colors equally:
    *   **Shorter wavelengths** (like blue and violet light) are scattered *much more efficiently* than longer wavelengths (like red and yellow light). Thi

ClientError: 429 None. {'fault': {'faultstring': 'Rate limit quota violation. Quota limit  exceeded. Identifier : userB', 'detail': {'errorcode': 'policies.ratelimit.QuotaViolation'}}}
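
The `ClientError: 429` above is how a quota violation surfaces in the notebook. A minimal sketch of running the scenario while tallying tokens and stopping cleanly on the first 429 follows. The function name `run_prompts` is hypothetical, and the sketch assumes that the raised exception exposes an integer `code` attribute and that responses expose `usage_metadata.total_token_count`, which is the case for `google-genai` as far as I know. The tally is a client-side estimate and may not match the counter Apigee keeps.

```python
def run_prompts(generate, prompts):
    """Call generate(prompt) for each prompt and stop at the first HTTP 429.

    Returns a tuple (tokens_used, completed, throttled).
    """
    tokens_used = 0
    completed = 0
    for prompt in prompts:
        try:
            response = generate(prompt)
        except Exception as exc:
            # Assumption: quota violations carry code == 429; re-raise anything else.
            if getattr(exc, "code", None) == 429:
                return tokens_used, completed, True
            raise
        usage = getattr(response, "usage_metadata", None)
        tokens_used += getattr(usage, "total_token_count", 0) or 0
        completed += 1
    return tokens_used, completed, False
```

In this notebook it could be called as `run_prompts(lambda p: client.models.generate_content(model=MODEL, contents=p), prompts)`. Comparing the returned tally with the tier's limit gives a rough idea of how many more runs fit into the current 5-minute window.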