# Amazon Bedrock - Latency Benchmark Tool
This notebook contains a set of tools to benchmark inference latency for Foundation Models available in Amazon Bedrock. 

You can evaluate latency across different scenarios, such as comparisons between models, regions, and use cases.

To run this notebook, you need appropriate access to Amazon Bedrock, and the models you want to test must already be enabled in the Amazon Bedrock console.

## Scenarios in this notebook

#### 1. Model Comparison 
#### 2. Region Comparison
#### 3. Use Case Comparison

## Install needed dependencies
This notebook requires a Python 3 environment.

In [None]:
!pip install --quiet --upgrade pip
!pip install --quiet --upgrade boto3 awscli matplotlib numpy pandas anthropic

In [None]:
from utils.utils import benchmark, create_prompt, execute_benchmark, get_cached_client, post_iteration
import matplotlib.pyplot as plt

## Scenario keys and configurations

Each scenario is a dictionary with the following latency-relevant keys:

| Key | Definition |
|-|-|
| `model_id` | The model to test. Smaller models are typically faster. Currently only Anthropic models are supported. |
| `in_tokens` | The number of tokens to feed to the model (the input context length). Range: 40 - 100K. |
| `out_tokens` | The number of tokens for the model to generate. Range: 1 - 8191. |
| `region` | The AWS region to invoke Bedrock in. This can affect network latency depending on client location. |
| `stream` | True&#124;False - A streaming response starts returning tokens to the client as they are generated, instead of waiting until the complete response is ready. This should be True for interactive use cases. |
| `name` | A human readable name for the scenario (will appear in reports and graphs). |
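
To make the effect of `stream` concrete, here is a minimal sketch of how the two latency metrics plotted later in this notebook (`time-to-first-token` and `time-to-last-token`) could be measured over any chunk iterator. It is not the implementation in `utils.utils`; `time_stream` and `fake_stream` are hypothetical names, and the fake stream only sleeps to simulate generation delay.

```python
import time
from typing import Dict, Iterable, Iterator


def time_stream(chunks: Iterable[str]) -> Dict[str, float]:
    """Consume a chunk iterator and return seconds to the first and last chunk."""
    start = time.perf_counter()
    first = None
    for _ in chunks:
        if first is None:
            # The first chunk has arrived: this is what an interactive user waits for.
            first = time.perf_counter() - start
    last = time.perf_counter() - start
    # If no chunk arrived at all, fall back to the total elapsed time.
    return {
        'time-to-first-token': first if first is not None else last,
        'time-to-last-token': last,
    }


def fake_stream(n_chunks: int, delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response: sleep, then yield a chunk."""
    for i in range(n_chunks):
        time.sleep(delay)
        yield f'chunk-{i}'


timing = time_stream(fake_stream(n_chunks=5, delay=0.01))
print(timing)
```

With a non-streaming response the whole body arrives at once, so the two values are effectively the same; with streaming, the first value can be much smaller than the second.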

Each scenario also has a benchmark configuration you can modify:

| Key | Definition |
|-|-|
| `invocations_per_scenario` | The number of times to benchmark each scenario. More invocations give a better picture of variance and of the average response time over a longer period. |
| `sleep_between_invocations` | Seconds to sleep between invocations (0 means no sleep). Sleeping between invocations can help you measure across longer periods of time and avoid throttling. |
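
Because sleeping adds to the wall-clock time of a run, it can help to estimate that cost before launching a long benchmark. The sketch below assumes one sleep per invocation, which may not match exactly how `execute_benchmark` schedules its pauses; `total_sleep_seconds` and `example_config` are hypothetical names used only for illustration.

```python
def total_sleep_seconds(num_scenarios, config):
    """Estimate the time spent sleeping, assuming one sleep per invocation."""
    invocations = config["invocations_per_scenario"]
    sleep = config["sleep_between_invocations"]
    return num_scenarios * invocations * sleep


# Same values as the configurations used in this notebook.
example_config = {
    "invocations_per_scenario": 2,
    "sleep_between_invocations": 5,
}

# Two scenarios, two invocations each, five seconds per sleep.
print(total_sleep_seconds(2, example_config))  # prints 20
```

The model response time comes on top of this figure, so treat it as a rough floor for the total run time.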

## Scenario 1. Model Comparison

In [None]:
model_compare_scenarios = [
            {
                'model_id'    : 'anthropic.claude-v2',
                'in_tokens'  : 200,
                'out_tokens' : 50,
                'region'     : 'us-east-1',
                'stream' : True,
                'name' : f'claude-v2. in=200, out=50',
            },
            {
                'model_id'    : 'anthropic.claude-instant-v1',
                'in_tokens' : 200,
                'out_tokens' : 50,
                'region'     : 'us-east-1',
                'stream' : True,
                'name' : f'claude-instant-v1. in=200, out=50',
            },
]

scenario_config = {
    "invocations_per_scenario" : 2,
    "sleep_between_invocations": 5
}

In [None]:
scenarios = execute_benchmark(model_compare_scenarios,scenario_config)
# Early breaking will break after a single scenario, useful for debugging.
#execute_benchmark(model_compare_scenarios,scenario_config, early_break = True) 

### Results
Adapt the show_results function to the type of data representation you wish to use. 
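
One possible adaptation is a numeric summary instead of a plot. The sketch below assumes only what the plotting code in this notebook relies on: each scenario has a `name` and a `durations` list of dictionaries keyed by metric. `summarize` is a hypothetical helper, and `sample_scenarios` contains made-up numbers, not measurements.

```python
import statistics


def summarize(scenarios, metric='time-to-first-token'):
    """Return one row of summary statistics per scenario for the given metric."""
    rows = []
    for scenario in scenarios:
        values = [d[metric] for d in scenario['durations']]
        rows.append({
            'name': scenario['name'],
            'n': len(values),
            'mean': statistics.mean(values),
            'median': statistics.median(values),
            # stdev needs at least two samples.
            'stdev': statistics.stdev(values) if len(values) > 1 else 0.0,
        })
    return rows


# Made-up data in the assumed result shape, for illustration only.
sample_scenarios = [
    {
        'name': 'example scenario',
        'durations': [
            {'time-to-first-token': 0.5, 'time-to-last-token': 2.0},
            {'time-to-first-token': 1.5, 'time-to-last-token': 3.0},
        ],
    },
]

for row in summarize(sample_scenarios):
    print(row)
```

With only two invocations per scenario the standard deviation is a very rough estimate; raise `invocations_per_scenario` before drawing conclusions from it.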

In [None]:
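def show_results(scenarios):
    """Draw one box plot per scenario for the selected latency metric."""
    fig, ax = plt.subplots()

    metric = 'time-to-first-token'
    #metric = 'time-to-last-token'

    # enumerate() gives each scenario its own position, even if two scenarios are identical.
    for position, scenario in enumerate(scenarios):
        durations = [d[metric] for d in scenario['durations']]
        ax.boxplot(durations, positions=[position])

    # Label the axes once, after all boxes are drawn.
    ax.set_xticks(range(len(scenarios)))
    ax.set_xticklabels([s['name'] for s in scenarios])
    ax.set_ylabel(f'{metric} (sec)')

    fig.tight_layout()
    plt.show()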

In [None]:
show_results(scenarios)

## Scenario 2. Region Comparison

> **🚨 ALERT 🚨** Remember to enable the models in **all regions** you wish to test. 
You can learn how to manage model access in the following [page](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html#manage-model-access).


In [None]:
region_compare_scenarios = [
            {
                'model_id'    : 'anthropic.claude-instant-v1',
                'in_tokens'  : 200,
                'out_tokens' : 50,
                'region'     : 'us-east-1',
                'stream' : True,
                'name' : f'us-east-1',
            },
            {
                'model_id'    : 'anthropic.claude-instant-v1',
                'in_tokens' : 200,
                'out_tokens' : 50,
                'region'     : 'us-west-2',
                'stream' : True,
                'name' : f'us-west-2',
            },
            {
                'model_id'    : 'anthropic.claude-instant-v1',
                'in_tokens' : 200,
                'out_tokens' : 50,
                'region'     : 'eu-central-1',
                'stream' : True,
                'name' : f'eu-central-1',
            },
]

scenario_config = {
    "invocations_per_scenario" : 2,
    "sleep_between_invocations": 5
}

In [None]:
scenarios = execute_benchmark(region_compare_scenarios,scenario_config)
# Early breaking will break after a single scenario, useful for debugging.
#execute_benchmark(region_compare_scenarios,scenario_config, early_break = True)

### Results
Adapt the show_results function to the type of data representation you wish to use. 
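
For a region comparison, it can be useful to express each region relative to a baseline. The sketch below compares median latency against one named scenario. `compare_to_baseline` is a hypothetical helper, the result shape (`name`, `durations`) is assumed from the plotting code in this notebook, and the sample numbers are invented for illustration.

```python
import statistics


def compare_to_baseline(scenarios, baseline_name, metric='time-to-first-token'):
    """Return the median per scenario plus its difference and ratio to the baseline."""
    medians = {
        s['name']: statistics.median(d[metric] for d in s['durations'])
        for s in scenarios
    }
    base = medians[baseline_name]
    return {
        name: {'median': m, 'delta': m - base, 'ratio': m / base}
        for name, m in medians.items()
    }


# Invented numbers, not real measurements.
sample_regions = [
    {'name': 'us-east-1', 'durations': [
        {'time-to-first-token': 0.25}, {'time-to-first-token': 0.75}]},
    {'name': 'eu-central-1', 'durations': [
        {'time-to-first-token': 0.75}, {'time-to-first-token': 1.25}]},
]

comparison = compare_to_baseline(sample_regions, baseline_name='us-east-1')
for name, stats in comparison.items():
    print(name, stats)
```

Results depend on where the client runs, so a region that looks slow from one location may be the fastest from another.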

In [None]:
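def show_results(scenarios):
    """Draw one box plot per scenario for the selected latency metric."""
    fig, ax = plt.subplots()

    metric = 'time-to-first-token'
    #metric = 'time-to-last-token'

    # enumerate() gives each scenario its own position, even if two scenarios are identical.
    for position, scenario in enumerate(scenarios):
        durations = [d[metric] for d in scenario['durations']]
        ax.boxplot(durations, positions=[position])

    # Label the axes once, after all boxes are drawn.
    ax.set_xticks(range(len(scenarios)))
    ax.set_xticklabels([s['name'] for s in scenarios])
    ax.set_ylabel(f'{metric} (sec)')

    fig.tight_layout()
    plt.show()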

In [None]:
show_results(scenarios)

## Scenario 3. Use Case Comparison

In [None]:
use_cases_scenarios = [
            {
                'model_id'    : 'anthropic.claude-v2',
                'in_tokens'  : 1000,
                'out_tokens' : 200,
                'region'     : 'us-east-1',
                'stream' : True,
                'name' : f'Summarization. in=1000, out=200',
            },
            {
                'model_id'    : 'anthropic.claude-v2',
                'in_tokens' : 200,
                'out_tokens' : 50,
                'region'     : 'us-east-1',
                'stream' : True,
                'name' : f'Classification. in=200, out=50',
            },
]

scenario_config = {
    "invocations_per_scenario" : 2,
    "sleep_between_invocations": 5
}

In [None]:
# Store the results so the Results cells below plot this scenario set, not the previous one.
scenarios = execute_benchmark(use_cases_scenarios,scenario_config)
# Early breaking will break after a single scenario, useful for debugging.
#execute_benchmark(use_cases_scenarios,scenario_config, early_break = True)

### Results
Adapt the show_results function to the type of data representation you wish to use. 
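
Use cases with different output lengths are hard to compare on `time-to-last-token` alone, so one option is to derive an approximate generation rate. The sketch below rests on two assumptions: the result dictionaries still carry the `out_tokens` key from the scenario definition, and the model actually generated that many tokens. `generation_rates` is a hypothetical helper and the sample values are invented.

```python
def generation_rates(scenario):
    """Approximate tokens per second between the first and the last token."""
    rates = []
    for d in scenario['durations']:
        generation_time = d['time-to-last-token'] - d['time-to-first-token']
        # Skip degenerate samples where both timestamps coincide.
        if generation_time > 0:
            rates.append(scenario['out_tokens'] / generation_time)
    return rates


# Invented numbers in the assumed result shape.
sample_scenario = {
    'name': 'Summarization. in=1000, out=200',
    'out_tokens': 200,
    'durations': [
        {'time-to-first-token': 0.5, 'time-to-last-token': 4.5},
        {'time-to-first-token': 1.0, 'time-to-last-token': 6.0},
    ],
}

print(generation_rates(sample_scenario))  # prints [50.0, 40.0]
```

If the model stops before reaching `out_tokens`, this calculation overstates the rate, so treat it as an upper bound unless you count the generated tokens yourself.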

In [None]:
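def show_results(scenarios):
    """Draw one box plot per scenario for the selected latency metric."""
    fig, ax = plt.subplots()

    metric = 'time-to-first-token'
    #metric = 'time-to-last-token'

    # enumerate() gives each scenario its own position, even if two scenarios are identical.
    for position, scenario in enumerate(scenarios):
        durations = [d[metric] for d in scenario['durations']]
        ax.boxplot(durations, positions=[position])

    # Label the axes once, after all boxes are drawn.
    ax.set_xticks(range(len(scenarios)))
    ax.set_xticklabels([s['name'] for s in scenarios])
    ax.set_ylabel(f'{metric} (sec)')

    fig.tight_layout()
    plt.show()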

In [None]:
show_results(scenarios)