<a href="https://colab.research.google.com/github/Saif-Shines/pk-cookbook/blob/next/ai-gateway/resilient_loadbalancing_with_failure_mitigating_fallbacks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up resilient Load balancers with failure-mitigating Fallbacks

Companies often struggle to scale their services efficiently as traffic to their applications grows. When you consume APIs, the first point of failure is rate limiting: hit the API too often and your requests get throttled. Load balancing is a proven way to scale usage horizontally without overburdening any one provider, which helps you stay within rate limits.

For your AI app, rate limits are even more stringent, and once you start hitting a provider’s rate limits, there is nothing you can do except wait for them to cool down and try again. Portkey makes this problem easy to solve.

This cookbook teaches you how to use Portkey to distribute traffic across multiple LLMs and how to keep your load balancer robust by setting up backups for requests. You will load balance across OpenAI and Anthropic, using Anthropic’s Claude 3 models, with Azure OpenAI serving as the fallback layer.

<span style="text-decoration:underline;">Prerequisites:</span>

You need a [Portkey API Key](https://portkey.ai/docs/api-reference/authentication#obtaining-your-api-key); sign up to obtain one. You should also have stored your OpenAI, Azure OpenAI, and Anthropic details in the [Portkey vault](https://portkey.ai/docs/product/ai-gateway-streamline-llm-integrations/virtual-keys).

## 1. Import the SDK and authenticate Portkey

Start by installing the `portkey-ai` package in your Python environment.

In [1]:
!pip install portkey-ai

Installing collected packages: pathspec, mypy-extensions, h11, mypy, httpcore, black, httpx, openai, portkey-ai
Successfully installed black-23.7.0 h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 mypy-1.9.0 mypy-extensions-1.0.0 openai-1.14.3 pathspec-0.12.1 portkey-ai-1.2.2


Once installed, you can import it and instantiate it with the API key to your Portkey account.

In [2]:
from portkey_ai import Portkey
from google.colab import userdata

PORTKEYAI_API_KEY=userdata.get('PORTKEY_API_KEY')

portkey = Portkey(
    api_key=PORTKEYAI_API_KEY
)

## 2. Create Configs: Loadbalance with Nested Fallbacks

Portkey acts as an AI gateway for all of your requests to LLMs. It follows the OpenAI SDK signature in all of its methods and interfaces, which makes it easy to adopt and to switch between providers. Here is an example of a chat completions request through Portkey.

In [None]:
messages = [
    {
        "role": "user",
        "content": "What are the 7 wonders of the world?"
    }
]

response = portkey.chat.completions.create(
    messages = messages,
    model = 'gpt-4'
)

print(response.choices[0].message.content)

The Portkey AI gateway can apply our desired behaviour to the requests to various LLMs. In a nutshell, our desired behaviour is the following:

![](https://raw.githubusercontent.com/Saif-Shines/pk-cookbook/next/ai-gateway/images/resilient-loadbalancing-with-failure-mitigating-fallbacks/1-loadbalancing-with-fallbacks.png)

Lucky for us, all of this can be implemented by passing a config that expresses what behavior to apply to every request through the Portkey AI gateway.

In [None]:
import json

config_data ={
  "strategy":{
    "mode":"loadbalance"
  },
  "targets":[
    {
      "virtual_key":"ANTHROPIC_VIRTUAL_KEY",
      "weight":0.5,
      "override_params":{
        "max_tokens":200,
        "model":"claude-3-opus-20240229"
      }
    },
    {
      "strategy":{
        "mode":"fallback"
      },
      "targets":[
        {
          "virtual_key":"OPENAI_VIRTUAL_KEY"
        },
        {
          "virtual_key":"AZURE_OPENAI_VIRTUAL_KEY"
        }
      ],
      "weight":0.5
    }
  ]
}

portkey = Portkey(
    api_key=PORTKEYAI_API_KEY,
    config=json.dumps(config_data)
)

We apply the `loadbalance` strategy across _Anthropic and OpenAI_. The `weight` values split the traffic 50/50 between the two LLM providers, while `override_params` overrides the request defaults for that target (here, `max_tokens` and `model` for Anthropic).

Let’s take this a step further and apply a fallback mechanism so that requests to _OpenAI_ fall back to _Azure OpenAI_ when they fail. Nesting this strategy among the `targets` keeps the app reliable in production.

See the documentation for Portkey [Fallbacks](https://portkey.ai/docs/product/ai-gateway-streamline-llm-integrations/fallbacks) and [Loadbalancing](https://portkey.ai/docs/product/ai-gateway-streamline-llm-integrations/load-balancing).
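To build intuition for how the nested config behaves, here is a minimal local simulation. It is not Portkey’s implementation, only a sketch that assumes `loadbalance` picks a target in proportion to its `weight` and `fallback` tries targets in order until one responds. The helper names `pick_weighted` and `route`, and the `is_up` callback, are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter

# Same shape as the config above (override_params omitted for brevity).
config_data = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {"virtual_key": "ANTHROPIC_VIRTUAL_KEY", "weight": 0.5},
        {
            "strategy": {"mode": "fallback"},
            "targets": [
                {"virtual_key": "OPENAI_VIRTUAL_KEY"},
                {"virtual_key": "AZURE_OPENAI_VIRTUAL_KEY"},
            ],
            "weight": 0.5,
        },
    ],
}


def pick_weighted(targets, rng):
    """Pick one target with probability proportional to its weight."""
    weights = [target.get("weight", 1) for target in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


def route(node, is_up, rng):
    """Return the virtual key that would serve a request, or None if none can.

    is_up is a hypothetical callback: it takes a virtual key and returns
    True when that provider is assumed to respond successfully.
    """
    if "virtual_key" in node:  # leaf target
        key = node["virtual_key"]
        return key if is_up(key) else None
    mode = node["strategy"]["mode"]
    if mode == "loadbalance":
        return route(pick_weighted(node["targets"], rng), is_up, rng)
    if mode == "fallback":
        for target in node["targets"]:  # try targets in listed order
            served = route(target, is_up, rng)
            if served is not None:
                return served
        return None
    raise ValueError(f"unknown strategy mode: {mode}")


# Simulate 1000 requests while OpenAI is assumed to be rate limited.
rng = random.Random(0)
openai_down = lambda key: key != "OPENAI_VIRTUAL_KEY"
served = Counter(route(config_data, openai_down, rng) for _ in range(1000))
print(served)
```

In this simulated outage, roughly half of the requests go to Anthropic and the other half land on Azure OpenAI through the fallback branch, so no request is left unserved.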

## 3. Make a Request

Now that the config is defined and passed as an argument when instantiating the Portkey client, all subsequent requests acquire the desired behavior automatically, with no additional changes to the codebase.

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a very helpful assistant."
    },
    {
        "role": "user",
        "content": "What are the 7 wonders of the world?"
    }
]

response = portkey.chat.completions.create(
    messages = messages,
    model = 'gpt-4'
)

print(response.choices[0].message.content) # The Seven Wonders of the Ancient World are:

Next, we will examine how to identify load-balanced requests or those that have been executed as fallbacks.

## 4. Trace the request from the logs

Identifying particular requests among the thousands received every day can feel like finding a needle in a haystack. Portkey solves this by letting us attach a trace ID of our choice to a request; here we use `request-loadbalance-fallback`.

In [None]:
response = portkey.with_options(trace_id="request-loadbalance-fallback").chat.completions.create(
    messages = messages,
    model = 'gpt-4'
)

print(response.choices[0].message.content)


This trace ID makes it easy to filter requests in the Logs view of the Portkey dashboard.

![](https://raw.githubusercontent.com/Saif-Shines/pk-cookbook/next/ai-gateway/images/resilient-loadbalancing-with-failure-mitigating-fallbacks/2-logs-request-loadbalance-fallback.png)

Besides showing that load balancing was applied (indicated by the Loadbalance icon), the logs provide essential observability information, including tokens, cost, and model.

Are the configs growing and becoming harder to manage in the code? [Try creating them from Portkey UI](https://portkey.ai/docs/product/ai-gateway-streamline-llm-integrations/configs#creating-configs) and reference the config ID in your code. It will make them significantly easier to maintain.
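A fixed trace ID groups every request under the same label. If you would rather find one specific request, a unique suffix per request can help. The sketch below uses a hypothetical helper, `make_trace_id`, built on Python’s standard `uuid` module; the prefix and the 8-character suffix length are arbitrary choices for illustration.

```python
import uuid


def make_trace_id(prefix: str) -> str:
    """Return the prefix followed by a short random suffix, e.g. 'prefix-1a2b3c4d'."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


trace_id = make_trace_id("request-loadbalance-fallback")
print(trace_id)

# The resulting string can then be passed exactly as in the cell above:
# portkey.with_options(trace_id=trace_id).chat.completions.create(...)
```

Keeping a shared prefix means you can still filter the whole group in the logs, while the suffix distinguishes individual requests.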

## 5. Advanced: Canary Testing

With new models arriving every day and your app already in production, what is the best way to evaluate the quality of those models? Canary testing lets you gradually roll out a change to a small subset of users before making it available to everyone.

Consider this scenario: You have been using OpenAI as your LLM provider for a while now, but are considering trying an open-source Llama model for your app through Anyscale.

In [None]:
import json

# Canary split: 90% of traffic stays on OpenAI, 10% goes to a Llama model on Anyscale.
# Assumption: ANYSCALE_VIRTUAL_KEY and the model ID below are placeholders; use the
# virtual key stored in your Portkey vault and a model name available on your Anyscale account.
config_data = {
  "strategy":{
    "mode":"loadbalance"
  },
  "targets":[
    {
      "virtual_key":"OPENAI_VIRTUAL_KEY",
      "weight":0.9
    },
    {
      "virtual_key":"ANYSCALE_VIRTUAL_KEY",
      "weight":0.1,
      "override_params":{
        "model":"meta-llama/Llama-2-70b-chat-hf"
      }
    }
  ]
}

portkey = Portkey(
    api_key=PORTKEYAI_API_KEY,
    config=json.dumps(config_data)
)

messages = [
    {
        "role": "system",
        "content": "You are a very helpful assistant."
    },
    {
        "role": "user",
        "content": "What are the 7 wonders of the world?"
    }
]

response = portkey.chat.completions.create(
    messages = messages,
    model = 'gpt-4'
)

print(response.choices[0].message.content)


The `weight` values split the traffic so that 10% of your requests are served by Anyscale’s Llama models. You are now set up to gather feedback, observe your app’s performance, and gradually release to a larger user base.
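Releasing to a larger user base amounts to raising the canary’s weight over time. As a rough sketch, the hypothetical helper `canary_config` below builds a two-target `loadbalance` config for a given canary share. The virtual key names, the model ID, and the rollout stages (10%, 25%, 50%) are assumptions chosen for illustration, not values prescribed by Portkey.

```python
import json


def canary_config(stable_key, canary_key, canary_share, canary_model):
    """Build a loadbalance config that sends canary_share of traffic to the canary."""
    if not 0 <= canary_share <= 1:
        raise ValueError("canary_share must be between 0 and 1")
    return {
        "strategy": {"mode": "loadbalance"},
        "targets": [
            {"virtual_key": stable_key, "weight": round(1 - canary_share, 4)},
            {
                "virtual_key": canary_key,
                "weight": canary_share,
                "override_params": {"model": canary_model},
            },
        ],
    }


# Hypothetical rollout stages: widen the canary only after reviewing the logs.
stages = [
    canary_config(
        "OPENAI_VIRTUAL_KEY",
        "ANYSCALE_VIRTUAL_KEY",            # placeholder virtual key name
        share,
        "meta-llama/Llama-2-70b-chat-hf",  # placeholder model ID
    )
    for share in (0.1, 0.25, 0.5)
]

for stage in stages:
    print(json.dumps(stage["targets"]))
```

Each stage can be passed to the client with `config=json.dumps(stage)`, in the same way as the configs earlier in this cookbook.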

## Considerations

You can implement production-grade Loadbalancing and nested fallback mechanisms with just a few lines of code. While you are equipped with all the tools for your next GenAI app, here are a few considerations:

- Every request has to adhere to the LLM provider’s requirements for it to be successful. For instance, `max_tokens` is required for Anthropic and not for OpenAI.
- While loadbalance helps reduce the load on any one LLM, it is recommended to pair it with a fallback strategy so that your app stays reliable.
- On Portkey, you can also set a loadbalance weight to 0. This stops routing requests to that target, and you can raise the weight again when required.
- Loadbalance places no limit on the number of targets, so you can add multiple accounts from one provider and effectively multiply your available rate limits.
- Loadbalance does not alter the outputs or the latency of the requests in any way.
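The points about zero weights and multiple accounts can be combined in a small sketch. The hypothetical helper `spread_across_accounts` below gives every active account an equal share and sets paused accounts to weight 0. The account key names are placeholders, and the sketch assumes weights act as relative shares of traffic.

```python
def spread_across_accounts(virtual_keys, paused=()):
    """Build a loadbalance config with equal weights; paused keys get weight 0."""
    active = [key for key in virtual_keys if key not in paused]
    if not active:
        raise ValueError("at least one account must stay active")
    share = round(1 / len(active), 4)
    return {
        "strategy": {"mode": "loadbalance"},
        "targets": [
            {"virtual_key": key, "weight": 0 if key in paused else share}
            for key in virtual_keys
        ],
    }


# Three placeholder virtual keys for the same provider, one temporarily paused.
multi_account_config = spread_across_accounts(
    ["OPENAI_ACCOUNT_A", "OPENAI_ACCOUNT_B", "OPENAI_ACCOUNT_C"],
    paused={"OPENAI_ACCOUNT_C"},
)
print(multi_account_config["targets"])
```

Keeping the paused account in the list with weight 0 means re-enabling it later only requires changing one number.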

Happy Coding!