# Bedrock Optimized Latency Test (by Markus Bestehorn)
This notebook contains code to test the optimized performance inference feature of Bedrock that has been released for public preview at re:invent 2024: https://aws.amazon.com/about-aws/whats-new/2024/12/latency-optimized-inference-foundation-models-amazon-bedrock/

The code allows comparing the inference result times of the standard inference with the inference times for the latency optimized inference. **Running this test will incur costs on the AWS account running it**. For further information on the cost - particularly on optimized inference - refer to the official pricing documentation page of Amazon Bedrock: https://aws.amazon.com/bedrock/pricing/  

**Disclaimer**: The code in this notebook has been written for the sole purpose of testing the aforementioned feature of Amazon Bedrock. This code is not production ready or usable for other purposes.

**Prerequisites**: 
1. It is assumed here, that the this notebook runs inside a security context that has adequate priviledges to use the converse API of Amazon Bedrock.
2. It is assumed that model access has been adequately configured in Amazon Bedrock through the "Model Access" page for the region that will be used ("us-east-2") as well as all foundational models that will be used.

As a first step, we need to make sure that the most recent version of the boto3 library is installed where this notebook runs.

In [2]:
import sys
import subprocess


def update_boto3():
    # Implement pip upgrade using subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "-q", "boto3"])

    # Verify the new version
    import boto3
    print(f"Current boto3 version: {boto3.__version__}")


update_boto3()

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.2 requires nvidia-ml-py3==7.352.0, which is not installed.
aiobotocore 2.19.0 requires botocore<1.36.4,>=1.36.0, but you have botocore 1.37.10 which is incompatible.
amazon-sagemaker-sql-magic 0.1.3 requires sqlparse==0.5.0, but you have sqlparse 0.5.3 which is incompatible.
autogluon-multimodal 1.2 requires jsonschema<4.22,>=4.18, but you have jsonschema 4.23.0 which is incompatible.
autogluon-multimodal 1.2 requires nltk<3.9,>=3.4.5, but you have nltk 3.9.1 which is incompatible.
autogluon-multimodal 1.2 requires omegaconf<2.3.0,>=2.1.1, but you have omegaconf 2.3.0 which is incompatible.[0m[31m
[0m

Current boto3 version: 1.37.10


## Configuration
The test uses the Bedrock converse API to respond to the same prompt multiple times and then computes aggregates over the times required to complete these calls. The main parameters for this test are as follows:
 -  `prompt`: This is the prompt that is being sent to the Bedrock converse API. This can be modified freely and depending on the complexity of the prompt, the absolute values of the test may vary.
 -  `DEFAULT_MODEL_ID`: This is the model ID or inference profile that is used for the test. As optimized inference is currently only supported by a limited number of foundational models in Bedrock, it is not recommended to change this value.
 -  `DEFAULT_REGION`: This is the AWS region where the inference will be running. Do not change this, as optimized performance is currently not supported in other AWS regions.

In [6]:
# Some predefined prompts that result in a different number of output tokens and therefore different processing times in Bedrock
simple_prompt = """Explain in a few sentences why objects cannot travel faster than the speed of light."""
middle_prompt = """Write 2 paragraphs on why objects cannot travel faster than the speed of light."""
complex_prompt = """Write a book chapter on why objects cannot travel faster than the speed of light."""

text_for_input_prompt_lengthening = """
The Ultimate Barrier: Why Objects Cannot Travel Faster Than Light

The universe has many wonders, but perhaps none is more profound than the existence of an ultimate speed limit. At approximately 299,792,458 meters per second, the speed of light in vacuum represents not merely a practical barrier but a fundamental limit woven into the very fabric of spacetime. This chapter explores why material objects cannot breach this cosmic speed limit, delving into both the theoretical framework established by Einstein and the experimental evidence that supports it.

The Emergence of a Universal Speed Limit

Before Einstein's revolutionary work in 1905, most physicists believed that speeds could add without restriction. If you run forward at 5 km/h on a train moving at 50 km/h, your total speed relative to the ground would be 55 km/h—a straightforward addition. This Galilean view of relativity suggested no inherent limit to how fast an object might travel.

Einstein's special relativity, however, revealed that this simple addition of velocities fails when speeds approach that of light. His equations showed that velocities combine according to the formula:

$v_{total} = \frac{v_1 + v_2}{1 + \frac{v_1 v_2}{c^2}}$

This elegant formula ensures that regardless of how many velocity boosts are applied, the resulting speed never exceeds c, the speed of light in vacuum.

The Mass-Energy Relationship

Perhaps the most famous equation in physics, $E = mc^2$, demonstrates the profound relationship between mass and energy. Less discussed, but equally important, is how this relationship affects objects approaching the speed of light.

As an object with rest mass accelerates, its relativistic mass increases according to:

$m = \frac{m_0}{\sqrt{1-\frac{v^2}{c^2}}}$

This equation reveals something remarkable: as velocity approaches the speed of light, the denominator approaches zero, causing the relativistic mass to approach infinity. Consequently, the energy required to accelerate the object further also approaches infinity.

The Energy Problem

Consider a spacecraft with a rest mass of 1,000 kg. Accelerating this craft to 50% the speed of light would require tremendous energy, but remains theoretically possible. At 90% light speed, the energy requirements grow dramatically. At 99%, they become staggering. But to reach precisely the speed of light? The mathematics is unequivocal: it would require infinite energy—an insurmountable barrier.

This isn't merely an engineering challenge to overcome with better technology; it represents a fundamental limit imposed by the structure of reality itself.

Spacetime and Causality

Beyond the energy considerations lies something even more profound: the nature of spacetime itself. Special relativity reveals that space and time are not separate entities but aspects of a unified spacetime. As objects approach light speed, they experience time dilation and length contraction. At the speed of light itself, time would stop entirely from the perspective of the moving object, and its length in the direction of motion would contract to zero—physically impossible conditions for any object with mass.

Furthermore, faster-than-light travel would violate causality—the principle that causes precede effects. An object moving faster than light could, in some reference frames, appear to arrive before it departed, creating paradoxes that unravel the coherent fabric of physical law.

What About Quantum "Spookiness"?

Some might point to quantum entanglement, where information seems to travel instantaneously between entangled particles, as a counterexample. However, careful analysis shows that no usable information can be transmitted faster than light through entanglement. The "spooky action at a distance" that troubled Einstein does not violate the cosmic speed limit.

Tachyons: Theoretical Faster-Than-Light Particles

Theoretical physics has explored the mathematical possibility of tachyons—hypothetical particles that always travel faster than light. Interestingly, these would need to have imaginary rest mass and could never slow down to light speed or below. Despite decades of searching, no experimental evidence supports their existence, and most physicists consider them mathematical curiosities rather than physical realities.

Experimental Confirmation

The universal speed limit has been tested repeatedly in particle accelerators. Protons in the Large Hadron Collider are accelerated to 99.9999991% the speed of light, requiring enormous energy input. Despite the tremendous energies involved, these particles never reach or exceed light speed, precisely as Einstein's equations predict.

Loopholes and Workarounds?

Science fiction often depicts "warp drives," "hyperspace," or "wormholes" as methods for effective faster-than-light travel. These concepts don't actually violate special relativity because they involve manipulating spacetime itself rather than exceeding light speed locally.

For instance, the Alcubierre warp drive, a theoretical solution to Einstein's field equations, would contract spacetime in front of a vessel and expand it behind, potentially allowing effective faster-than-light travel without the vessel itself ever exceeding light speed in its local region of spacetime. However, such mechanisms require exotic matter with negative energy density—something not known to exist in sufficient quantities, if at all.

Conclusion: The Profound Implications

The universal speed limit is not merely a curious fact but a profound feature of our universe with far-reaching implications. It establishes ultimate horizons for our potential exploration of the cosmos. It ensures the preservation of causality. And perhaps most importantly, it reminds us that the universe operates according to deep principles that cannot be circumvented by mere technological advancement.

The speed of light stands as a humbling reminder that despite our remarkable scientific progress, we remain subject to the fundamental laws that govern reality. Like the laws of thermodynamics or the uncertainty principle, the cosmic speed limit represents not a challenge to overcome, but a fundamental characteristic of the universe we inhabit—a universe more strange, more beautiful, and more constrained than our intuitions might suggest.
"""

# The variable that actually defined which prompt is being used
prompt = complex_prompt
DEFAULT_NO_ITERATIONS = 25
DEFAULT_REGION = "us-east-2" # do not change this value => optimized performance is currently only available in us-east-2

## Foundational Model
Optimized performance is currently only supported for a limited number of foundational models as documented here: https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html

The cell below configures the model ID that will be used in a variable called `DEFAULT_MODEL_ID`. If you want to change the model that is used here, use one of the other provided Model IDs and copy them accordingly.

In [8]:
MODEL_ID_ANTHROPIC_CLAUDE3_5_HAIKU_CRIS = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
MODEL_ID_META_LLAMA_3_1_70B_INSTRUCT = "us.meta.llama3-1-70b-instruct-v1:0"
MODEL_ID_META_LLAMA_3_1_405B_INSTRUCT = "us.meta.llama3-1-405b-instruct-v1:0"
MODEL_ID_AMAZON_NOVA_PRO_CRIS = "us.amazon.nova-pro-v1:0"

DEFAULT_MODEL_ID = MODEL_ID_AMAZON_NOVA_PRO_CRIS

## Code
The following cell contains all code that is used for running the test. Note that executing the cell does not actually execute the code, but merely loads it so that it can be executed when needed in the next cell.

In [4]:
import boto3
import json
import time
from botocore.config import Config
from datetime import datetime, timedelta
from botocore.exceptions import ClientError, ParamValidationError
import statistics

import logging
logging.basicConfig()
logging.root.setLevel(logging.INFO)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

BEDROCK_THROTTLING_PAUSE = 30


def json_to_str(input_json: dict) -> str:
    return json.dumps(input_json, indent=3)


def send_prompt_to_bedrock(prompt: str,
                          additional_files: list[dict] = [],
                          model_id: str = DEFAULT_MODEL_ID, 
                          max_tokens: int = 4096,
                          temperature: float = 0.7,
                          top_p: float = 0.9,
                          api_key: str = None, 
                          region_name: str = DEFAULT_REGION,
                          latency_optimized=False) -> (dict, float):
    config = Config(
        read_timeout=240,  # Timeout in seconds
        connect_timeout=60,  # Connection timeout
        retries={'max_attempts': 3}  # Optional: Configure retry behavior
    )

    # Create a Bedrock client
    bedrock = boto3.client(service_name='bedrock-runtime', region_name=region_name, config=config)

    if prompt is None or len(prompt) == 0:
        logger.error("Prompt cannot be empty.")
        return None

    content_blocks = [
        {"text": prompt}
    ]

    for current_doc_element in additional_files:
        content_blocks.append({"document": current_doc_element})

    # setup the performance config
    performanceConfig = {}
    if latency_optimized:
        performanceConfig["latency"] = "optimized"
    else:
        performanceConfig["latency"] = "standard"
    
    # we loop here to retry in case of throttling
    done = False
    tries = 0
    last_toggle_was_region = False
    while not done:
        try:
            # Invoke the model with the request.
            tries += 1
            logger.debug(f"Sending prompt with the following performance config to Bedrock:\n{json_to_str(performanceConfig)}")
            start_time = datetime.now()
            response = bedrock.converse(
                modelId=model_id,
                messages=[
                    {
                        "role": "user",
                        "content": content_blocks
                    }
                ],
                inferenceConfig={
                    "temperature": temperature,
                    "topP": top_p,
                    "maxTokens": max_tokens
                },
                performanceConfig=performanceConfig
            )
            end_time = datetime.now()
            elapsed_time = end_time - start_time
            logger.debug(f"Measured time: {elapsed_time} (Type: {elapsed_time.__class__})")
            total_tokens = int(response["usage"]["totalTokens"])
            done = True
        except ClientError as e:
            if e.response['Error']['Code'] == "ThrottlingException":
                pause = BEDROCK_THROTTLING_PAUSE * tries
                logger.debug(f"Call to Bedrock was throttled. Waiting for {pause} seconds before retrying:\n{str(e)}")
                time.sleep(pause)
                continue
            elif e.response['Error']['Code'] == "ServiceUnavailableException":
                pause = BEDROCK_THROTTLING_PAUSE * tries
                logger.debug(f"Bedrock services is unavailable at the moment. Waiting for {pause} seconds before retrying:\n{str(e)}")
                time.sleep(pause)
                continue
            else:
                logger.error(f"Can't invoke {model_id} in {region_name} due to Client Error. Reason: {str(e)}\nType: {str(e.__class__)}")
                break
        except ParamValidationError as pe:
            logger.error(f"Cannot invoke model with ID {model_id} in {region_name} as the format of the parameters in the input is illegal:\n{str(pe)}\nError Type: {e.__class__}\nAborting call to bedrock.")
            break
        except TypeError as te:
            logger.error(f"Cannot invoke model with ID {model_id} in {region_name} as the call to the converse API is malformatted/illegal:\n{str(te)}\nError Type: {te.__class__}\nAborting call to bedrock.")
            break
        except Exception as e:
            logger.error(f"Call to Bedrock caused an Exception: {str(e)} \nType: {str(e.__class__)}")
            break

    logger.debug(f"Received the following response from Bedrock:\n{json_to_str(response)}\nReponse Type: {response.__class__}")

    return response, elapsed_time, total_tokens


def average_timedelta(timedeltas: list[timedelta]):
    return sum(timedeltas, timedelta()) / len(timedeltas)


def measure_inference_time(prompt: str, num_iterations: int = DEFAULT_NO_ITERATIONS, model_id: str = DEFAULT_MODEL_ID, region_name: str = DEFAULT_REGION, latency_optimized: bool = False, print_output: bool = False):
    inference_times = []
    tokens = []
    for i in range(num_iterations):
        response, duration, total_tokens = send_prompt_to_bedrock(prompt=prompt, model_id=model_id, region_name=region_name, latency_optimized=latency_optimized)
        if duration:
            inference_times.append(duration)
            tokens.append(total_tokens)
            logger.debug(f"Iteration {i+1}: {duration} seconds processing {total_tokens} tokens.")
            # Optional: Print the model's response
            if response and 'output' in response:
                logger.debug(f"Model response: {response['output']['message']['content'][0]['text']}")
            if print_output:
                print(f"Performance in run {i+1}: {total_tokens} tokens in {duration.total_seconds()} seconds equals {total_tokens/duration.total_seconds()} tokens/sec")
    return inference_times, tokens


def print_results(inference_times: list[timedelta], model_id: str, latency_optimized: bool, region_name: str, num_iterations: int, tokens: list[int], prompt: str):
    if latency_optimized:
        header = f"Results over {num_iterations} iterations with **Latency Optimized Performance** using {model_id} in {region_name}:"
    else:
        header = f"Results over {num_iterations} iterations with **Standard Performance** using {model_id} in {region_name}:"

    sum_tokens = sum(tokens)
    avg_tokens = sum_tokens / num_iterations
    min_tokens = min(tokens)
    max_tokens = max(tokens)
    tokens_per_second = calculate_tokens_per_sec(timedeltas=inference_times, tokens=tokens)

    print(f"""
    {header}
        Prompt: {prompt}
        Avg./Min./Max. Tokens per response: {avg_tokens} / {min_tokens} / {max_tokens} tokens
        Total Tokens generated: {sum_tokens}
        Average Inference Time: {average_timedelta(timedeltas=inference_times)} seconds
        Median Inference Time: {statistics.median(inference_times)} seconds
        Minimum Inference Time: {min(inference_times)} seconds
        Maximum Inference Time: {max(inference_times)} seconds
        Performance: {tokens_per_second} tokens / sec
        """)


def calculate_tokens_per_sec(timedeltas: list[timedelta], tokens: list[int]) -> float:
    seconds = 0
    token_count = 0
    for delta_obj in timedeltas:
        seconds += int(delta_obj.total_seconds())
    for t in tokens:
        token_count += t

    return token_count / seconds


def run_comparison(prompt: str, num_iterations: int = DEFAULT_NO_ITERATIONS, model_id: str = DEFAULT_MODEL_ID, region_name: str = DEFAULT_REGION, input_prompt_prefix: str = "", print_output: bool = False):
    latency_optimized = False
    prompt = (input_prompt_prefix.strip() + " " + prompt).strip()
    without_optimized_inference, tokens_without_optimized_inference = measure_inference_time(
        prompt=prompt,
        num_iterations=num_iterations,
        model_id=model_id,
        region_name=region_name,
        latency_optimized=latency_optimized,
        print_output=print_output
    )
    # Calculate and display results
    if without_optimized_inference:
        print_results(
            inference_times=without_optimized_inference,
            model_id=model_id,
            latency_optimized=latency_optimized,
            region_name=region_name,
            num_iterations=num_iterations,
            tokens=tokens_without_optimized_inference,
            prompt=prompt
        )

    latency_optimized = True
    with_optimized_inference, tokens_with_optimized_inference = measure_inference_time(
        prompt=prompt,
        num_iterations=num_iterations,
        model_id=model_id,
        region_name=region_name,
        latency_optimized=latency_optimized,
        print_output=print_output
    )
    # Calculate and display results
    if with_optimized_inference:
        print_results(
            inference_times=with_optimized_inference,
            model_id=model_id,
            latency_optimized=latency_optimized,
            region_name=region_name,
            num_iterations=num_iterations,
            tokens=tokens_with_optimized_inference,
            prompt=prompt
        )

    return with_optimized_inference, without_optimized_inference

# Test Execution
The cell below executes the test. Make sure that all of the cells above have been executed before executing this cell. Depending on the prompt that is used to do this evaluation, the complete execution of this cell can take a few minutes. 
**Important**: Depending on the quota of the AWS account for which Bedrock is accessed, running this cell may cause throttling of Bedrock. The code above handles these exceptions by waiting, but in such cases, the execution of the cell may take longer. Note that these waiting times are not included in the performance measurement.


Finally, a result like the following will appear:
~~~
    Results over 25 iterations with **Standard Performance** using us.meta.llama3-1-405b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 883.88 / 736 / 1029 tokens
        Total Tokens generated: 22097
        Average Inference Time: 0:00:53.573597 seconds
        Median Inference Time: 0:00:53.170353 seconds
        Minimum Inference Time: 0:00:44.332831 seconds
        Maximum Inference Time: 0:01:02.987788 seconds
        Performance: 16.639307228915662 tokens / sec
        

    Results over 25 iterations with **Latency Optimized Performance** using us.meta.llama3-1-405b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 897.84 / 782 / 1046 tokens
        Total Tokens generated: 22446
        Average Inference Time: 0:00:13.992810 seconds
        Median Inference Time: 0:00:13.551137 seconds
        Minimum Inference Time: 0:00:11.460454 seconds
        Maximum Inference Time: 0:00:19.525074 seconds
        Performance: 66.6053412462908 tokens / sec
~~~

The two blocks of information contain all the required information including the model that was used (in the example above that is `us.meta.llama3-1-405b-instruct-v1:0`). The values below this header line provide the average, median, minimum and maximum latency over the complete execution of the prompt in Amazon Bedrock as well as the average number of tokens that have been generated per second. The latency in this context is the time it takes from sending the prompt to receiving the response from the Bedrock service. For instance, in the example above, the average time it took Bedrock to generate a response *without* optimized inference was almost 54 seconds, while the optimized inferences had about 14 seconds for the same task. Similarly, the performance is the sum of the total number of generated tokens divided by the total number of seconds of latency. In the example above, optimized inference generated more than 66 tokens per second, while the non-optimized version only generated less than 17 tokens, i.e., the optimized inference was better by a factor of approximately 4x compared to the non-optimized inference.

Findings so far show that the more output a prompt creates and the large the used LLM is, the higher is also the impact of the optimized performance. For instance, the simpler prompt with Claude Haiku 3.5 only requires 2.5 seconds on average even without optimized inference and just 1.4 seconds with optimized inference. Hence, the factor is less than 2x.

In [33]:
optimized_inference_result, non_optimized_inference_result = run_comparison(
    prompt=prompt,
    num_iterations=DEFAULT_NO_ITERATIONS,
    model_id=MODEL_ID_META_LLAMA_3_1_405B_INSTRUCT,
    region_name=DEFAULT_REGION
)


    Results over 25 iterations with **Standard Performance** using us.meta.llama3-1-405b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 883.88 / 736 / 1029 tokens
        Total Tokens generated: 22097
        Average Inference Time: 0:00:53.573597 seconds
        Median Inference Time: 0:00:53.170353 seconds
        Minimum Inference Time: 0:00:44.332831 seconds
        Maximum Inference Time: 0:01:02.987788 seconds
        Performance: 16.639307228915662 tokens / sec
        

    Results over 25 iterations with **Latency Optimized Performance** using us.meta.llama3-1-405b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 897.84 / 782 / 1046 tokens
        Total Tokens generated: 22446
        Average Inference Time: 0:00:13.992810 seconds
     

# Anthropic Claude Haiku 3.5
This section provides code to produce performance values Anthropic's Haiku 3.5 LLM.

In [34]:
optimized_inference_result, non_optimized_inference_result = run_comparison(
    prompt=prompt,
    num_iterations=DEFAULT_NO_ITERATIONS,
    model_id=MODEL_ID_ANTHROPIC_CLAUDE3_5_HAIKU_CRIS,
    region_name="us-east-2",
)


    Results over 25 iterations with **Standard Performance** using us.anthropic.claude-3-5-haiku-20241022-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 776.32 / 694 / 865 tokens
        Total Tokens generated: 19408
        Average Inference Time: 0:00:24.696465 seconds
        Median Inference Time: 0:00:24.425361 seconds
        Minimum Inference Time: 0:00:19.212240 seconds
        Maximum Inference Time: 0:00:30.940138 seconds
        Performance: 32.079338842975204 tokens / sec
        

    Results over 25 iterations with **Latency Optimized Performance** using us.anthropic.claude-3-5-haiku-20241022-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 759.64 / 561 / 811 tokens
        Total Tokens generated: 18991
        Average Inference Time: 0:00:11.651419

# Meta Llama 3.1 70B
As witht he above section, this section uses the smaller version of the Llama 3.1 LLM with 70 billion parameters for the inference. It is particularly interesting to compare these values with the efficiency gains obtained for the 405B version of this model below.

In [35]:
optimized_inference_result, non_optimized_inference_result = run_comparison(
    prompt=prompt,
    num_iterations=DEFAULT_NO_ITERATIONS,
    model_id=MODEL_ID_META_LLAMA_3_1_70B_INSTRUCT,
    region_name="us-east-2",
)


    Results over 25 iterations with **Standard Performance** using us.meta.llama3-1-70b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 937.24 / 842 / 1079 tokens
        Total Tokens generated: 23431
        Average Inference Time: 0:00:29.065641 seconds
        Median Inference Time: 0:00:28.842485 seconds
        Minimum Inference Time: 0:00:26.658636 seconds
        Maximum Inference Time: 0:00:33.336058 seconds
        Performance: 32.862552594670404 tokens / sec
        

    Results over 25 iterations with **Latency Optimized Performance** using us.meta.llama3-1-70b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 1035.76 / 880 / 1213 tokens
        Total Tokens generated: 25894
        Average Inference Time: 0:00:08.386833 seconds
      

# Meta Llama 3.1 405B
This section provides performance metrics for Llama 405B, i.e., the largest version (405 billion parameters) of the Llama 3.1 LLMs from Meta.

In [36]:
optimized_inference_result, non_optimized_inference_result = run_comparison(
    prompt=prompt,
    num_iterations=DEFAULT_NO_ITERATIONS,
    model_id=MODEL_ID_META_LLAMA_3_1_405B_INSTRUCT,
    region_name="us-east-2",
)


    Results over 25 iterations with **Standard Performance** using us.meta.llama3-1-405b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 872.28 / 756 / 1034 tokens
        Total Tokens generated: 21807
        Average Inference Time: 0:00:52.804474 seconds
        Median Inference Time: 0:00:50.241769 seconds
        Minimum Inference Time: 0:00:45.679469 seconds
        Maximum Inference Time: 0:01:02.850421 seconds
        Performance: 16.65928189457601 tokens / sec
        

    Results over 25 iterations with **Latency Optimized Performance** using us.meta.llama3-1-405b-instruct-v1:0 in us-east-2:
        Prompt: Write a book chapter on why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 901.0 / 775 / 1040 tokens
        Total Tokens generated: 22525
        Average Inference Time: 0:00:16.397900 seconds
       

# Amazon Nova Pro
As with the previous sections, this section provides performance comparison numbers for Amazon Nova Pro. Note that at the time of writing this notebook, this feature along with Nova Pro was in preview and only available in us-east-1 through cross-region inference (CRIS).

In [4]:
optimized_inference_result, non_optimized_inference_result = run_comparison(
    prompt=simple_prompt,
    num_iterations=DEFAULT_NO_ITERATIONS,
    model_id=MODEL_ID_AMAZON_NOVA_PRO_CRIS,
    region_name="us-east-1",
    #input_prompt_prefix=text_for_input_prompt_lengthening
    print_output=False
)


    Results over 25 iterations with **Standard Performance** using us.amazon.nova-pro-v1:0 in us-east-1:
        Prompt: Explain in a few sentences why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 92.8 / 87 / 97 tokens
        Total Tokens generated: 2320
        Average Inference Time: 0:00:01.562418 seconds
        Median Inference Time: 0:00:01.105986 seconds
        Minimum Inference Time: 0:00:00.978808 seconds
        Maximum Inference Time: 0:00:06.003716 seconds
        Performance: 72.5 tokens / sec
        

    Results over 25 iterations with **Latency Optimized Performance** using us.amazon.nova-pro-v1:0 in us-east-1:
        Prompt: Explain in a few sentences why objects cannot travel faster than the speed of light.
        Avg./Min./Max. Tokens per response: 87.88 / 87 / 88 tokens
        Total Tokens generated: 2197
        Average Inference Time: 0:00:01.232627 seconds
        Median Inference Time: 0:00:01.068637 se