# Bedrock Optimized Latency Test (by Markus Bestehorn)
This notebook contains code to test the optimized performance inference feature of Bedrock that has been released for public preview at re:invent 2024: https://aws.amazon.com/about-aws/whats-new/2024/12/latency-optimized-inference-foundation-models-amazon-bedrock/

The code allows comparing the inference result times of the standard inference with the inference times for the latency optimized inference. **Running this test will incur costs on the AWS account running it**. For further information on the cost - particularly on optimized inference - refer to the official pricing documentation page of Amazon Bedrock: https://aws.amazon.com/bedrock/pricing/  

**Disclaimer**: The code in this notebook has been written for the sole purpose of testing the aforementioned feature of Amazon Bedrock. This code is not production ready or usable for other purposes.

**Prerequisites**: 
1. It is assumed here, that the this notebook runs inside a security context that has adequate priviledges to use the converse API of Amazon Bedrock.
2. It is assumed that model access has been adequately configured in Amazon Bedrock through the "Model Access" page for the region that will be used ("us-east-2") as well as all foundational models that will be used.

As a first step, we need to make sure that the most recent version of the boto3 library is installed where this notebook runs.

In [None]:
import sys
import subprocess

def update_boto3():
    # Implement pip upgrade using subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', 'boto3'])
    
    # Verify the new version
    import boto3
    print(f"Current boto3 version: {boto3.__version__}")

update_boto3()

## Configuration
The test uses the Bedrock converse API to respond to the same prompt multiple times and then computes aggregates over the times required to complete these calls. The main parameters for this test are as follows:
 -  `prompt`: This is the prompt that is being sent to the Bedrock converse API. This can be modified freely and depending on the complexity of the prompt, the absolute values of the test may vary.
 -  `DEFAULT_MODEL_ID`: This is the model ID or inference profile that is used for the test. As optimized inference is currently only supported by a limited number of foundational models in Bedrock, it is not recommended to change this value.
 -  `DEFAULT_REGION`: This is the AWS region where the inference will be running. Do not change this, as optimized performance is currently not supported in other AWS regions.

In [89]:
# Some predefined prompts that result in a different number of output tokens and therefore different processing times in Bedrock
simple_prompt = """Explain in a few sentences why objects cannot travel faster than the speed of light."""
complex_prompt = """Write a book chapter on why objects cannot travel faster than the speed of light."""

# The variable that actually defined which prompt is being used
prompt = simple_prompt
DEFAULT_NO_ITERATIONS = 5
DEFAULT_REGION = "us-east-2" # do not change this value => optimized performance is currently only available in us-east-2

## Foundational Model
Optimized performance is currently only supported for a limited number of foundational models as documented here: https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html

The cell below configures the model ID that will be used in a variable called `DEFAULT_MODEL_ID`. If you want to change the model that is used here, use one of the other provided Model IDs and copy them accordingly.

In [32]:
MODEL_ID_ANTHROPIC_CLAUDE3_5_HAIKU_CRIS = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
MODEL_ID_META_LLAMA_3_1_70B_INSTRUCT = "us.meta.llama3-1-70b-instruct-v1:0"
MODEL_ID_META_LLAMA_3_1_405B_INSTRUCT = "us.meta.llama3-1-405b-instruct-v1:0"

DEFAULT_MODEL_ID = MODEL_ID_ANTHROPIC_CLAUDE3_5_HAIKU_CRIS

## Code
The following cell contains all code that is used for running the test. Note that executing the cell does not actually execute the code, but merely loads it so that it can be executed when needed in the next cell.

In [88]:
import boto3
import json
import time
from statistics import mean
import os
from botocore.config import Config
from datetime import datetime, timedelta
from botocore.exceptions import ClientError
import statistics

import logging
logging.basicConfig()
logging.root.setLevel(logging.INFO)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

BEDROCK_THROTTLING_PAUSE = 30


def json_to_str(input_json: dict) -> str:
    return json.dumps(input_json, indent=3)


def send_prompt_to_bedrock(prompt: str,
                          additional_files: list[dict] = [],
                          model_id: str = DEFAULT_MODEL_ID, 
                          max_tokens: int = 4096,
                          temperature: float = 0.7,
                          top_p: float = 0.9,
                          api_key: str = None, 
                          region_name: str = DEFAULT_REGION,
                          latency_optimized=False) -> (dict, float):
    config = Config(
        read_timeout=240,  # Timeout in seconds
        connect_timeout=60,  # Connection timeout
        retries={'max_attempts': 3}  # Optional: Configure retry behavior
    )

    # Create a Bedrock client
    bedrock = boto3.client(service_name='bedrock-runtime', region_name=region_name, config=config)

    if prompt is None or len(prompt) == 0:
        logger.error("Prompt cannot be empty.")
        return None

    content_blocks = [
        {"text": prompt}
    ]

    for current_doc_element in additional_files:
        content_blocks.append({"document": current_doc_element})

    # setup the performance config
    performanceConfig = {}
    if latency_optimized:
        performanceConfig["latency"] = "optimized"
    else:
        performanceConfig["latency"] = "standard"
    
    # we loop here to retry in case of throttling
    done = False
    tries = 0
    last_toggle_was_region = False
    while not done:
        try:
            # Invoke the model with the request.
            tries += 1
            logger.debug(f"Sending prompt with the following performance config to Bedrock:\n{json_to_str(performanceConfig)}")
            start_time = datetime.now()
            response = bedrock.converse(
                modelId=model_id,
                messages=[
                    {
                        "role": "user",
                        "content": content_blocks
                    }
                ],
                inferenceConfig={
                    "temperature": temperature,
                    "topP": top_p,
                    "maxTokens": max_tokens
                },
                performanceConfig=performanceConfig
            )
            end_time = datetime.now()
            elapsed_time = end_time - start_time
            logger.debug(f"Measured time: {elapsed_time} (Type: {elapsed_time.__class__})")
            total_tokens = int(response["usage"]["totalTokens"])
            done = True
        except ClientError as e:
            if e.response['Error']['Code'] == "ThrottlingException":
                pause = BEDROCK_THROTTLING_PAUSE * tries
                logger.warning(f"Call to Bedrock was throttled. Waiting for {pause} seconds before retrying:\n{str(e)}")
                time.sleep(pause)
                continue
            elif e.response['Error']['Code'] == "ServiceUnavailableException":
                pause = BEDROCK_THROTTLING_PAUSE * tries
                logger.warning(f"Bedrock services is unavailable at the moment. Waiting for {pause} seconds before retrying:\n{str(e)}")
                time.sleep(pause)
                continue
            else:
                logger.error(f"Can't invoke {model_id} due to Client Error. Reason: {str(e)}\nType: {str(e.__class__)}")
                continue
        except ParamValidationError as pe:
            logger.error(f"Cannot invoke model with ID {model_id} as the format of the parameters in the input is illegal:\n{str(pe)}\nError Type: {e.__class__}\nAborting call to bedrock.")
            break
        except TypeError as te:
            logger.error(f"Cannot invoke model with ID {model_id} as the call to the converse API is malformatted/illegal:\n{str(te)}\nError Type: {te.__class__}\nAborting call to bedrock.")
            break
        except Exception as e:
            logger.error(f"Call to Bedrock caused an Exception: {str(e)} \nType: {str(e.__class__)}")
            break

    logger.debug(f"Received the following response from Bedrock:\n{json_to_str(response)}\nReponse Type: {response.__class__}")

    return response, elapsed_time, total_tokens


def average_timedelta(timedeltas: list[timedelta]):
    return sum(timedeltas, timedelta()) / len(timedeltas)


def measure_inference_time(prompt: str, num_iterations: int = DEFAULT_NO_ITERATIONS, model_id: str = DEFAULT_MODEL_ID, region_name: str = DEFAULT_REGION, latency_optimized: bool = False):
    inference_times = []
    tokens = []
    for i in range(num_iterations):
        response, duration, total_tokens = send_prompt_to_bedrock(prompt=prompt, model_id=model_id, region_name=region_name, latency_optimized=latency_optimized)
        if duration:
            inference_times.append(duration)
            tokens.append(total_tokens)
            logger.debug(f"Iteration {i+1}: {duration} seconds processing {total_tokens} tokens.")
            # Optional: Print the model's response
            if response and 'output' in response:
                logger.debug(f"Model response: {response['output']['message']['content'][0]['text']}")
    return inference_times, tokens


def print_results(inference_times: list[timedelta], model_id: str, latency_optimized: bool, tokens_per_second: float):
    if latency_optimized:
        header = f"Results with Latency Optimized Performance using {model_id}:"
    else:
        header = f"Results with Standard Performance using {model_id}:"

    print(f"""
        {header}
            Average Inference Time: {average_timedelta(timedeltas=inference_times)} seconds
            Median Inference Time: {statistics.median(inference_times)} seconds
            Minimum Inference Time: {min(inference_times)} seconds
            Maximum Inference Time: {max(inference_times)} seconds
            Performance: {tokens_per_second} tokens / sec
        """)


def calculate_tokens_per_sec(timedeltas: list[timedelta], tokens: list[int]) -> float:
    seconds = 0
    token_count = 0
    for delta_obj in timedeltas:
        seconds += int(delta_obj.total_seconds())
    for t in tokens:
        token_count += t

    return token_count / seconds


def run_comparison(prompt: str, num_iterations: int = DEFAULT_NO_ITERATIONS, model_id: str = DEFAULT_MODEL_ID, region_name: str = DEFAULT_REGION):
    latency_optimized = False
    without_optimized_inference, tokens_without_optimized_inference = measure_inference_time(
        prompt=prompt,
        num_iterations=num_iterations,
        model_id=model_id,
        region_name=region_name,
        latency_optimized=latency_optimized
    )
    # Calculate and display results
    if without_optimized_inference:
        print_results(inference_times=without_optimized_inference, model_id=model_id, latency_optimized=latency_optimized, tokens_per_second=calculate_tokens_per_sec(timedeltas=without_optimized_inference, tokens=tokens_without_optimized_inference))

    latency_optimized = True
    with_optimized_inference, tokens_with_optimized_inference = measure_inference_time(
        prompt=prompt,
        num_iterations=num_iterations,
        model_id=model_id,
        region_name=region_name,
        latency_optimized=latency_optimized
    )
    # Calculate and display results
    if with_optimized_inference:
        print_results(inference_times=with_optimized_inference, model_id=model_id, latency_optimized=latency_optimized, tokens_per_second=calculate_tokens_per_sec(timedeltas=with_optimized_inference, tokens=tokens_with_optimized_inference))

    return with_optimized_inference, with_optimized_inference

# Test Execution
The cell below executes the test. Make sure that all of the cells above have been executed before executing this cell. Depending on the prompt that is used to do this evaluation, the complete execution of this cell can take a few minutes. 
**Important**: Depending on the quota of the AWS account for which Bedrock is accessed, running this cell may cause throttling of Bedrock. The code above handles these exceptions by waiting, but in such cases, the execution of the cell may take longer. 


Finally, a result like the following will appear:
~~~
Results with Standard Performance using us.meta.llama3-1-405b-instruct-v1:0:
    Average Inference Time: 0:00:54.931572 seconds
    Median Inference Time: 0:00:55.680593 seconds
    Minimum Inference Time: 0:00:49.521402 seconds
    Maximum Inference Time: 0:01:01.527845 seconds
    Performance: 16.63235294117647 tokens / sec
Results with Latency Optimized Performance using us.meta.llama3-1-405b-instruct-v1:0:
    Average Inference Time: 0:00:14.460839 seconds
    Median Inference Time: 0:00:14.165186 seconds
    Minimum Inference Time: 0:00:12.774470 seconds
    Maximum Inference Time: 0:00:17.149608 seconds
    Performance: 64.4 tokens / sec
~~~

The two blocks of information contain all the required information including the model that was used (in the example above that is `us.meta.llama3-1-405b-instruct-v1:0`). The values below this header line provide the average, median, minimum and maximum latency over the complete execution of the prompt in Amazon Bedrock as well as the average number of tokens that have been generated per second. The latency in this context is the time it takes from sending the prompt to receiving the response from the Bedrock service. For instance, in the example above, the average time it took Bedrock to generate a response *without* optimized inference was almost 55 seconds, while the optimized inferences had about 14.5 seconds for the same task. Similarly, the performance is the sum of the total number of generated tokens divided by the total number of seconds of latency. In the example above, optimized inference generated more than 64 tokens per second, while the non-optimized version only generated less than 17 tokens, i.e., the optimized inference was better by a factor of 3.7x compared to the non-optimized inference.

Findings so far show that the more output a prompt creates and the large the used LLM is, the higher is also the impact of the optimized performance. For instance, the simpler prompt with Claude Haiku 3.5 only requires 2.5 seconds on average even without optimized inference and just 1.4 seconds with optimized inference. Hence, the factor is only 2x.

In [90]:
with_optimized_inference, with_optimized_inference = run_comparison(
    prompt=prompt,
    num_iterations=DEFAULT_NO_ITERATIONS,
    model_id=MODEL_ID_ANTHROPIC_CLAUDE3_5_HAIKU_CRIS
)


        Results with Standard Performance using us.anthropic.claude-3-5-haiku-20241022-v1:0:
            Average Inference Time: 0:00:02.526479 seconds
            Median Inference Time: 0:00:02.397623 seconds
            Minimum Inference Time: 0:00:02.194138 seconds
            Maximum Inference Time: 0:00:02.833778 seconds
            Performance: 56.0 tokens / sec
        


An error occurred (ThrottlingException) when calling the Converse operation (reached max retries: 3): Too many requests, please wait before trying again.
An error occurred (ThrottlingException) when calling the Converse operation (reached max retries: 3): Too many requests, please wait before trying again.
An error occurred (ThrottlingException) when calling the Converse operation (reached max retries: 3): Too many requests, please wait before trying again.



        Results with Latency Optimized Performance using us.anthropic.claude-3-5-haiku-20241022-v1:0:
            Average Inference Time: 0:00:01.417591 seconds
            Median Inference Time: 0:00:01.405501 seconds
            Minimum Inference Time: 0:00:01.198176 seconds
            Maximum Inference Time: 0:00:01.617868 seconds
            Performance: 114.8 tokens / sec
        
