In this notebook, I explore API rate limits using the mathematical approach described in the `README.md`.

## Function to calculate API gateway maximum rate limit based on the max input token size

In [12]:
def max_rate_limit_per_minute(
    total_llm_tokens_per_minute: float,
    max_input_tokens_per_request: float,
    max_output_tokens_per_request: float = 4096
) -> float:
    return total_llm_tokens_per_minute / (max_input_tokens_per_request + max_output_tokens_per_request)
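
As a quick sanity check of the formula, the sketch below repeats the function and evaluates it for a budget of 1,000,000 tokens per minute with 4,096 input tokens and the default 4,096 output tokens. The input values are only illustrative; they match the first row of the Bedrock table later in this notebook.

```python
def max_rate_limit_per_minute(
    total_llm_tokens_per_minute: float,
    max_input_tokens_per_request: float,
    max_output_tokens_per_request: float = 4096
) -> float:
    # Each request is budgeted for its full input plus its full output allowance.
    return total_llm_tokens_per_minute / (max_input_tokens_per_request + max_output_tokens_per_request)

# 1,000,000 / (4096 + 4096) requests per minute
rate = max_rate_limit_per_minute(1_000_000, 4096)
print(f"{rate:.3f}")  # 122.070
```

Because the denominator assumes every request uses its full output allowance, the result is a conservative (worst-case) request rate.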

## Function to calculate maximum input token limit based on the required rate limit

In [13]:
def max_input_token_limit_per_request(
    total_llm_tokens_per_minute: float,
    max_rate_limit_per_minute: float,
    max_output_tokens_per_request: float = 4096
) -> float:
    return (total_llm_tokens_per_minute / max_rate_limit_per_minute) - max_output_tokens_per_request
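
The inverse formula can return a negative number, which I read as "this rate is not achievable if every request reserves the full output allowance". The sketch below repeats the function and shows both cases; the rates of 100 and 500 requests per minute are illustrative inputs.

```python
def max_input_token_limit_per_request(
    total_llm_tokens_per_minute: float,
    max_rate_limit_per_minute: float,
    max_output_tokens_per_request: float = 4096
) -> float:
    # Per-request token budget minus the tokens reserved for the output.
    return (total_llm_tokens_per_minute / max_rate_limit_per_minute) - max_output_tokens_per_request

# At 100 requests/minute each request gets 10,000 tokens, leaving 5,904 for input.
at_100 = max_input_token_limit_per_request(1_000_000, 100)
print(at_100)  # 5904.0

# At 500 requests/minute each request gets only 2,000 tokens, less than the
# 4,096 reserved for output, so the result is negative.
at_500 = max_input_token_limit_per_request(1_000_000, 500)
print(at_500)  # -2096.0
```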

## Function to convert number of tokens to approx. number of words

This conversion is based on OpenAI's [documentation](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them), which estimates that one token corresponds to roughly 0.75 English words.

In [21]:
def token_to_word(tokens: float) -> float:
    # Rule of thumb for English text: 1 token is about 0.75 words.
    return 0.75 * tokens
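
A small usage sketch of the conversion. The reverse helper `word_to_token` is hypothetical (it is not part of this notebook) and simply inverts the same 0.75 rule of thumb, so both directions are rough estimates for English text only.

```python
def token_to_word(tokens: float) -> float:
    # Rule of thumb: 1 token is about 0.75 English words.
    return 0.75 * tokens

def word_to_token(words: float) -> float:
    # Hypothetical inverse of token_to_word, using the same approximation.
    return words / 0.75

print(token_to_word(4096))  # 3072.0
print(word_to_token(3072))  # 4096.0
```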

## Helper function

In [28]:
import pandas as pd

def generate_api_constraints_dataframe(**model_specs):
    """
    Tabulate the maximum request rate for a range of input token sizes.

    Input sizes run from max_output_tokens_per_request up to
    max_input_tokens_per_request, in steps of max_output_tokens_per_request.

    Args:
    total_llm_tokens_per_minute (float): Total tokens available per minute for the model.
    max_rate_limit_per_minute (float): Maximum rate limit of the model provider's API (not used in the calculation).
    max_input_tokens_per_request (int): Maximum input tokens that can be sent per request.
    max_output_tokens_per_request (int): Maximum output tokens that the model can generate per request.

    Returns:
    pd.DataFrame: One row per input size, with the maximum rate limit per minute, the approximate word count, and the percentage of the token budget reserved for output.
    """
    input_tokens_list = list(range(
        model_specs["max_output_tokens_per_request"],
        model_specs["max_input_tokens_per_request"] + 1,
        model_specs["max_output_tokens_per_request"]
    ))
    
    data = []
    for input_tokens in input_tokens_list:
        rate_limit = max_rate_limit_per_minute(
            model_specs["total_llm_tokens_per_minute"],
            input_tokens,
            model_specs["max_output_tokens_per_request"]
        )
        words = token_to_word(input_tokens)
        reserved_percentage = (rate_limit * model_specs["max_output_tokens_per_request"])
        reserved_percentage /= model_specs["total_llm_tokens_per_minute"]
        reserved_percentage *= 100
        data.append((input_tokens, f"{rate_limit:0.3f}", words, f"{reserved_percentage:0.3f}%"))
    
    return pd.DataFrame(
        data, 
        columns=['Input Tokens', 'Max Rate Limit Per Minute', 'Approx. Words', 'Reserved Percentage for Output']
    )
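
The reserved percentage computed in the helper simplifies algebraically: substituting the rate formula gives `output / (input + output) * 100`, so it does not depend on the total tokens per minute. The sketch below checks this numerically; `reserved_percentage_for_output` is a hypothetical name that mirrors the three lines inside the loop.

```python
import math

def max_rate_limit_per_minute(total_tokens: float, input_tokens: float, output_tokens: float = 4096) -> float:
    return total_tokens / (input_tokens + output_tokens)

def reserved_percentage_for_output(total_tokens: float, input_tokens: float, output_tokens: float = 4096) -> float:
    # Same arithmetic as the loop body in generate_api_constraints_dataframe.
    rate_limit = max_rate_limit_per_minute(total_tokens, input_tokens, output_tokens)
    return rate_limit * output_tokens / total_tokens * 100

# The total token budget cancels out, so both budgets give the same percentages.
for total_tokens in (240_000, 1_000_000):
    for input_tokens in (4096, 8192, 40960):
        simplified = 4096 / (input_tokens + 4096) * 100
        computed = reserved_percentage_for_output(total_tokens, input_tokens)
        assert math.isclose(computed, simplified)
        print(total_tokens, input_tokens, f"{computed:0.3f}%")
```

This is why the "Reserved Percentage for Output" column is identical in the two model tables below.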

# Model Exploration

In this section, I pick a model from a provider and tabulate its API constraints using the formulas above.

## AWS Bedrock - Claude 3 Sonnet

It has a total token limit of 1,000,000 tokens per minute as per the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html) and a maximum Bedrock API rate limit of 500 requests per minute. I use a maximum input token window of 180,000 tokens per request (the model can actually take 200,000, but I am leaving a buffer) and a maximum output token limit of 4,096.

In [29]:
import pandas as pd

# AWS Bedrock - Claude 3 Sonnet constraints
bedrock_claude = dict(
    total_llm_tokens_per_minute = 1_000_000,
    max_rate_limit_per_minute = 500,
    max_output_tokens_per_request = 4096,
    max_input_tokens_per_request = 180_000,
)

df = generate_api_constraints_dataframe(**bedrock_claude)
df

| | Input Tokens | Max Rate Limit Per Minute | Approx. Words | Reserved Percentage for Output |
|---|---|---|---|---|
| 0 | 4096 | 122.07 | 3072.0 | 50.000% |
| 1 | 8192 | 81.38 | 6144.0 | 33.333% |
| 2 | 12288 | 61.035 | 9216.0 | 25.000% |
| 3 | 16384 | 48.828 | 12288.0 | 20.000% |
| 4 | 20480 | 40.69 | 15360.0 | 16.667% |
| 5 | 24576 | 34.877 | 18432.0 | 14.286% |
| 6 | 28672 | 30.518 | 21504.0 | 12.500% |
| 7 | 32768 | 27.127 | 24576.0 | 11.111% |
| 8 | 36864 | 24.414 | 27648.0 | 10.000% |
| 9 | 40960 | 22.195 | 30720.0 | 9.091% |
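
The `max_rate_limit_per_minute` entry in the spec (500 requests per minute) is not used by `generate_api_constraints_dataframe`. One possible way to account for it, shown here as a sketch rather than as part of the approach in the `README.md`, is to cap the token-derived rate at the provider's request limit. The function name `effective_rate_limit_per_minute` and the small-request values (500 input and 1,000 output tokens) are assumptions for illustration.

```python
def effective_rate_limit_per_minute(
    total_llm_tokens_per_minute: float,
    provider_rate_limit_per_minute: float,
    input_tokens: float,
    output_tokens: float,
) -> float:
    # Rate allowed by the token budget alone.
    token_bound = total_llm_tokens_per_minute / (input_tokens + output_tokens)
    # The provider's request-per-minute limit applies on top of the token budget.
    return min(token_bound, provider_rate_limit_per_minute)

# Large requests: the token budget is the binding constraint (about 122 per minute).
large = effective_rate_limit_per_minute(1_000_000, 500, 4096, 4096)
print(f"{large:.3f}")  # 122.070

# Small requests: the token budget would allow about 667 per minute,
# so the 500 requests/minute cap becomes the binding constraint.
small = effective_rate_limit_per_minute(1_000_000, 500, 500, 1000)
print(small)  # 500
```

With the 4,096-token output reservation used in this notebook, the token budget is always the tighter constraint for this model, which matches the table above.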


## Azure OpenAI - GPT-4 Turbo (Australia East)

It has a total token limit of 240,000 tokens per minute as per the [documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits#regional-quota-limits). I could not find documentation for the maximum Azure OpenAI API rate limit, so I am assuming 500 requests per minute. I use a maximum input token window of 120,000 tokens per request (the model can actually take 128,000, but I am leaving a buffer) and a maximum output token limit of 4,096 as per the [documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models).

In [30]:
import pandas as pd


azure_openai_gpt4_turbo = dict(
    total_llm_tokens_per_minute = 240_000,
    max_rate_limit_per_minute = 500,
    max_output_tokens_per_request = 4096,
    max_input_tokens_per_request = 120_000,
)

df = generate_api_constraints_dataframe(**azure_openai_gpt4_turbo)
df

| | Input Tokens | Max Rate Limit Per Minute | Approx. Words | Reserved Percentage for Output |
|---|---|---|---|---|
| 0 | 4096 | 29.297 | 3072.0 | 50.000% |
| 1 | 8192 | 19.531 | 6144.0 | 33.333% |
| 2 | 12288 | 14.648 | 9216.0 | 25.000% |
| 3 | 16384 | 11.719 | 12288.0 | 20.000% |
| 4 | 20480 | 9.766 | 15360.0 | 16.667% |
| 5 | 24576 | 8.371 | 18432.0 | 14.286% |
| 6 | 28672 | 7.324 | 21504.0 | 12.500% |
| 7 | 32768 | 6.51 | 24576.0 | 11.111% |
| 8 | 36864 | 5.859 | 27648.0 | 10.000% |
| 9 | 40960 | 5.327 | 30720.0 | 9.091% |
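
The table can also be read in the other direction: given a target request rate, what is the largest input size per request? The sketch below applies the inverse formula to this model's 240,000 tokens per minute; the target of 10 requests per minute is an illustrative value.

```python
def max_input_token_limit_per_request(
    total_llm_tokens_per_minute: float,
    max_rate_limit_per_minute: float,
    max_output_tokens_per_request: float = 4096
) -> float:
    return (total_llm_tokens_per_minute / max_rate_limit_per_minute) - max_output_tokens_per_request

def token_to_word(tokens: float) -> float:
    return 0.75 * tokens

# 240,000 / 10 = 24,000 tokens per request, minus 4,096 reserved for output.
input_limit = max_input_token_limit_per_request(240_000, 10)
print(input_limit)                 # 19904.0
print(token_to_word(input_limit))  # 14928.0
```

This agrees with the table: 20,480 input tokens allow only 9.766 requests per minute, so sustaining 10 requests per minute needs slightly smaller inputs.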
