
In [3]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Backoff and retry for LLMs

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/generative_ai/backoff_and_retry_for_LLMs.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fgenerative-ai%2Fbackoff_and_retry_for_LLMs.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/generative_ai/backoff_and_retry_for_LLMs.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/generative_ai/backoff_and_retry_for_LLMs.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

NOTE: This notebook has been tested in the following environment:

Python version = 3.10

## Overview

This notebook demonstrates how sending large amounts of traffic to Gemini-1.5-Pro can cause "429 Quota Exceeded Errors" and how implementing a backoff-and-retry strategy can help complete jobs without interrupting operations.

This notebook provides examples for the blog post: Don't let 429 errors leave your users hanging: A guide to handling resource exhaustion

This tutorial uses the following Google Cloud ML service:

- Vertex AI SDK for Python

The steps performed include:

- Installation and imports
- Asynchronously calling the Gemini model
- Using the Tenacity retry decorator to implement backoff and retry

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

**This notebook sends a large number of tokens to Gemini for inference. To reduce costs, lower the number of attempts or use the smaller video.**

## Get started

### Install Vertex AI SDK for Python and other required packages

Install the following packages required to execute this notebook.

**Remember to restart the runtime after installation.**

In [16]:
!pip install --upgrade --quiet google-cloud-aiplatform tenacity google-cloud-storage nest_asyncio

### Restart runtime (Colab only)
To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step.  ⚠️</b>
</div>


### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# import sys

# if "google.colab" in sys.modules:

#     from google.colab import auth

#     auth.authenticate_user()

### Import libraries

In [1]:
import asyncio
import time

import nest_asyncio
import vertexai
from google.cloud import storage
from tenacity import retry, wait_random_exponential
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

# Allow nested event loops so asyncio code can run inside the notebook's own loop
nest_asyncio.apply()

### Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
DEFAUL_MODEL_NAME = "gemini-1.5-pro-001"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}


# Initiate Vertex AI
vertexai.init(project=PROJECT_ID, location=REGION)
config = GenerationConfig(temperature=0.5, max_output_tokens=512)



### Helper functions

In [9]:
def get_images_uri_from_bucket(bucket_name, prefix, delimiter=None):
    """List the gs:// URIs of all '.jpg', '.jpeg' and '.png' images in the bucket under the given prefix (folder)."""
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)
    images = [
        f"gs://{bucket_name}/{blob.name}"
        for blob in blobs
        if blob.name.endswith((".jpg", ".jpeg", ".png"))
    ]
    return images


async def async_ask_gemini(contents, model_name=DEFAUL_MODEL_NAME):
    # Call Gemini asynchronously, with no retry logic
    multimodal_model = GenerativeModel(model_name)
    response = await multimodal_model.generate_content_async(
        contents=contents, generation_config=config
    )
    return response.text


@retry(wait=wait_random_exponential(multiplier=1, max=60))
async def retry_async_ask_gemini(contents, model_name=DEFAUL_MODEL_NAME):
    """Same as async_ask_gemini, but retried with Tenacity's retry decorator.

    wait_random_exponential(multiplier=1, max=60) waits a random time of up to
    2^x * 1 seconds between retries until that upper bound reaches 60 seconds,
    then a random time of up to 60 seconds before each later retry. No stop
    condition is set, so every exception is retried indefinitely.
    """

    multimodal_model = GenerativeModel(model_name)
    response = await multimodal_model.generate_content_async(
        contents=contents, generation_config=config
    )
    return response.text


async def load_test_gemini(function, model_name, attempts=5):
    """Run `attempts` batches, each sending one request per image concurrently.

    Relies on the notebook globals `prompt`, `video_part` and `images_list`.
    """
    failed_attempts = 0
    print(f"Testing with model: {model_name} and function: {function.__name__}")
    for _ in range(attempts):
        try:
            time_start = time.time()
            get_gemini_responses = [
                function(
                    [
                        prompt,
                        video_part,
                        Part.from_uri(image_uri, mime_type="image/jpeg"),
                    ],
                    # Use the model passed in, not the MODEL_NAME global
                    model_name=model_name,
                )
                for image_uri in images_list
            ]
            # gather() raises on the first failed request, failing the whole attempt
            async_poems = await asyncio.gather(*get_gemini_responses)
            time_taken = time.time() - time_start
            print(f"{len(async_poems)} Poems written in {time_taken:.0f} seconds")
        except Exception as error:
            failed_attempts += 1
            print("An error occurred:", error)

    if failed_attempts > 0:
        print(f"{failed_attempts} out of {attempts} failed")
    else:
        print(f"All {attempts} attempts succeeded")
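
To make the wait policy in `retry_async_ask_gemini` more concrete, here is a minimal pure-Python sketch of randomized exponential backoff. It illustrates the idea described in the docstring and is not Tenacity's implementation: the helper names `backoff_upper_bound` and `random_backoff` are hypothetical, and the attempt at which the exponent starts is an assumption.

```python
import random


def backoff_upper_bound(attempt: int, multiplier: float = 1.0, max_wait: float = 60.0) -> float:
    """Upper bound (seconds) of the wait before retry number `attempt` (1-based).

    Assumption: the bound starts at `multiplier` and doubles on every retry
    until it is capped at `max_wait`.
    """
    return min(max_wait, multiplier * 2 ** (attempt - 1))


def random_backoff(attempt: int, multiplier: float = 1.0, max_wait: float = 60.0) -> float:
    """Pick a random wait between 0 and the current upper bound."""
    return random.uniform(0, backoff_upper_bound(attempt, multiplier, max_wait))


# Upper bounds for the first eight retries with multiplier=1 and max=60
bounds = [backoff_upper_bound(n) for n in range(1, 9)]
print(bounds)
```

Randomizing the wait spreads the retries out, so concurrent requests that failed together are less likely to all retry at the same instant and hit the quota again.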

### Getting images and videos used for testing

In [4]:
# The images and video used for this test are stored in a public GCS bucket: "cloud-samples-data"
bucket_name = "cloud-samples-data"
image_prefix = "generative-ai/image/"
images_list = get_images_uri_from_bucket(bucket_name, image_prefix, delimiter="/")

prompt = "Get the elements from the image, get all the animals from the video, print all the animals and elements found on a numbered list, and then write a poem about them\n"
small_video_uri = "gs://cloud-samples-data/generative-ai/video/animals.mp4"
large_video_uri = (
    "gs://cloud-samples-data/generative-ai/video/behind_the_scenes_pixel.mp4"
)

## Load testing Gemini 

### Test without retry and default quota for Gemini-1.5-pro-001 of 60 QPM

In the recorded run below, 4 out of 5 attempts failed with a 429 "Quota exceeded" error.

In [8]:
video_part = Part.from_uri(small_video_uri, mime_type="video/mp4")
MODEL_NAME = "gemini-1.5-pro-001"
# Uncomment line below to re-run the test. Beware of costs since it will make multiple calls to Gemini
# await (load_test_gemini(async_ask_gemini, MODEL_NAME, attempts=5))

Testing with model: gemini-1.5-pro-001 and function: async_ask_gemini
72 Poems written in 23 seconds
An error occurred: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-pro. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.
An error occurred: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-pro. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.
An error occurred: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-pro. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai.
An error occurred: 429 Quota exceeded for aiplatform.googleapis.com/generate_con
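
Each attempt sends one request per image at the same time (72 per attempt, judging by the recorded output), which is more than the 60 requests per minute allowed by the default quota. In `load_test_gemini`, `asyncio.gather` raises the first exception it sees, so a single 429 marks the whole attempt as failed even if other requests succeeded. The sketch below shows one way to count per-request outcomes instead, using `return_exceptions=True`. `QuotaExceeded`, `fake_request` and `run_batch` are hypothetical stand-ins so the example runs without calling Gemini.

```python
import asyncio


class QuotaExceeded(Exception):
    """Hypothetical stand-in for the 429 error raised by the API client."""


async def fake_request(i: int) -> str:
    """Hypothetical request: every third call fails with a quota error."""
    await asyncio.sleep(0)
    if i % 3 == 0:
        raise QuotaExceeded(f"429 Quota exceeded (request {i})")
    return f"poem {i}"


async def run_batch(n: int):
    # return_exceptions=True returns exceptions as results instead of raising
    # the first one, so successful responses are not thrown away.
    results = await asyncio.gather(
        *(fake_request(i) for i in range(n)), return_exceptions=True
    )
    poems = [r for r in results if not isinstance(r, Exception)]
    errors = [r for r in results if isinstance(r, Exception)]
    return poems, errors


poems, errors = asyncio.run(run_batch(9))
print(f"{len(poems)} succeeded, {len(errors)} failed")
```

Inside a notebook cell, `await run_batch(9)` can be used in place of `asyncio.run(...)`.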

### Re-testing with backoff and retry mechanism enabled 

All tests complete successfully.

In [6]:
MODEL_NAME = "gemini-1.5-pro-001"
# Uncomment line below to re-run the test. Beware of costs since it will make multiple calls to Gemini
# await (load_test_gemini(retry_async_ask_gemini, MODEL_NAME, attempts=5))

Testing with model: gemini-1.5-pro-001 and function: retry_async_ask_gemini
72 Poems written in 21 seconds
72 Poems written in 167 seconds
72 Poems written in 18 seconds
72 Poems written in 149 seconds
72 Poems written in 22 seconds
All 5 attempts succeeded
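
The `@retry` decorator on `retry_async_ask_gemini` sets no stop condition and retries on every exception, so a permanent error (for example, an invalid argument) would be retried forever. The sketch below shows the general shape of a bounded retry loop that retries only quota errors. It is a standard-library illustration with hypothetical names (`QuotaExceeded`, `call_with_backoff`, `flaky_call`) and deliberately tiny wait times. Tenacity provides `stop_after_attempt` and `retry_if_exception_type` for the same purpose.

```python
import asyncio
import random


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a 429 error."""


async def call_with_backoff(func, *, max_attempts=5, base=0.01, max_wait=0.1):
    """Call `func` until it succeeds, retrying only on QuotaExceeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except QuotaExceeded:
            if attempt == max_attempts:
                raise  # give up and surface the last quota error to the caller
            # Randomized exponential backoff, capped at max_wait seconds
            upper = min(max_wait, base * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, upper))


calls = {"count": 0}


async def flaky_call():
    """Hypothetical request that hits the quota twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise QuotaExceeded("429 Quota exceeded")
    return "poem"


result = asyncio.run(call_with_backoff(flaky_call))
print(result, "after", calls["count"], "calls")
```

Any exception other than `QuotaExceeded` propagates immediately, so non-transient failures surface on the first call instead of being hidden behind retries.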


### Testing without retry but with Dynamic Shared Quota using Gemini-1.5-pro-002 

All 5 attempts succeeded with a small video as input.

In [5]:
video_part = Part.from_uri(small_video_uri, mime_type="video/mp4")
MODEL_NAME = "gemini-1.5-pro-002"
# Uncomment line below to re-run the test. Beware of costs since it will make multiple calls to Gemini
# await (load_test_gemini(async_ask_gemini, MODEL_NAME, attempts=5))

Testing with model: gemini-1.5-pro-002 and function: async_ask_gemini
72 Poems written in 23 seconds
72 Poems written in 21 seconds
72 Poems written in 19 seconds
72 Poems written in 17 seconds
72 Poems written in 22 seconds
All 5 attempts succeeded


### Re-testing Dynamic Shared quota with larger video

Without backoff and retry, testing Gemini-1.5-pro-002 with a larger video, and therefore more input tokens per request, caused all tests to fail with a 429 error.

In [7]:
# Larger video used to increase the input token count
video_part = Part.from_uri(large_video_uri, mime_type="video/mp4")
MODEL_NAME = "gemini-1.5-pro-002"
# Uncomment line below to re-run the test. Beware of costs since it will make multiple calls to Gemini
# await (load_test_gemini(async_ask_gemini, MODEL_NAME, attempts=5))

Testing with model: gemini-1.5-pro-002 and function: async_ask_gemini
An error occurred: 429 Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#error-code-429 for more details.
An error occurred: 429 Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#error-code-429 for more details.
An error occurred: 429 Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#error-code-429 for more details.
An error occurred: 429 Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#error-code-429 for more details.
An error occurred: 429 Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#error-code-429 for more details.
5 out of 5 failed


### Adding Backoff and Retry to Dynamic Shared Quota Testing

With backoff and retry, inference took significantly longer, but all tests completed successfully even with the much larger video input.

To guarantee capacity and thereby reduce latency, use Provisioned Throughput.


In [13]:
video_part = Part.from_uri(large_video_uri, mime_type="video/mp4")
MODEL_NAME = "gemini-1.5-pro-002"
# Uncomment line below to re-run the test. Beware of costs since it will make multiple calls to Gemini
# await (load_test_gemini(retry_async_ask_gemini, MODEL_NAME, attempts=3))

Testing with model: gemini-1.5-pro-002 and function: retry_async_ask_gemini
72 Poems written in 188 seconds
72 Poems written in 205 seconds
72 Poems written in 216 seconds
All 3 attempts succeeded


## Summary

These basic tests demonstrate how Dynamic Shared Quota reduces the frequency of "429 Resource Exhausted" errors but does not eliminate them: with the larger video, all five attempts failed without retries. The results highlight the importance of always using backoff and retry mechanisms when calling LLMs, regardless of the model version. Combining this with Provisioned Throughput further enhances reliability by guaranteeing capacity.