# Load Testing Bedrock Claude Sonnet With LLMPerf

In this notebook we use the [LLMPerf](https://github.com/ray-project/llmperf) repo to load test Claude 3 Sonnet on Amazon Bedrock. Load testing an LLM is a little different than the traditional load testing we did for ML models. We'll specifically look at metrics such as:

- Time to First Token
- Token Throughput (Tokens per Second)

These sit alongside our usual requests per minute (RPM). The more granular metrics give a more accurate picture of LLM performance, especially when using an API that bills based on the tokens the model processes.
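
To make the two metrics concrete, here is a minimal sketch of how they could be computed from a streamed response. It is not LLMPerf's implementation: `measure_stream`, `FakeClock`, and `fake_stream` are hypothetical names, each streamed chunk is counted as one token for simplicity, and throughput is assumed to mean output tokens divided by end-to-end latency.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and output throughput.

    Assumption: every chunk counts as one output token.
    """
    start = clock()
    first_token_at = None
    num_tokens = 0
    for _ in chunks:
        if first_token_at is None:
            # Time to First Token: delay between sending and the first chunk.
            first_token_at = clock()
        num_tokens += 1
    total = clock() - start
    return {
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "num_output_tokens": num_tokens,
        "end_to_end_latency_s": total,
        "output_tokens_per_s": num_tokens / total if total > 0 else 0.0,
    }


class FakeClock:
    """Deterministic clock so the example does not depend on real timing."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def fake_stream(clock: FakeClock, first_delay: float, gap: float,
                n: int) -> Iterator[str]:
    """Simulated model stream: a long wait for token 0, then evenly spaced tokens."""
    for i in range(n):
        clock.advance(first_delay if i == 0 else gap)
        yield f"tok{i}"


demo_clock = FakeClock()
demo_metrics = measure_stream(fake_stream(demo_clock, 0.5, 0.25, 5), clock=demo_clock)
print(demo_metrics)
```

With a real streaming client you would pass the response iterator in place of `fake_stream` and leave `clock` at its default.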

## Additional Resources/Credits
- [Bedrock Starter Guide](https://www.youtube.com/watch?v=8aMJUV0qhow&t=3s)
- [Load Testing Custom Models on Bedrock](https://aws.amazon.com/blogs/machine-learning/benchmarking-customized-models-on-amazon-bedrock-using-llmperf-and-litellm/)

## Setup
Our notebook environment is a conda_python3 kernel on an ml.g5.12xlarge SageMaker Notebook Instance. Dependency versions might change depending on the environment you're in; the installations below are configured for this specific environment.

In [None]:
!pip install setuptools==65.5.1 --quiet
!git clone https://github.com/ray-project/llmperf.git
!cd llmperf; pip install -e . --quiet; cd ..
!pip install pydantic -U --quiet

## LiteLLM
[LiteLLM](https://github.com/BerriAI/litellm) lets you invoke different model providers through a single unified format, making it simple to test across LLM providers and models. Here we first check that it works with Bedrock. LiteLLM is also natively integrated with LLMPerf, which simplifies our load testing process.

In [None]:
import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

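# Use the "bedrock/" provider prefix so the model ID matches the one passed to LLMPerf below
response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": "Who is Roger Federer?"}],
)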
output = response.choices[0].message.content
print(output)

## LLMPerf Benchmark

Here we use the following LLMPerf script to configure our load test: https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py. You can adjust the input and output token sizes depending on your use case. Other parameters you can toggle include the number of concurrent requests and the test duration. We've configured the test to stop after 30 completed requests or a 300-second (5-minute) timeout, whichever comes first. When it finishes, you should see the results displayed and a directory called <b>bedrock-outputs</b> containing the resulting files.

In [None]:
%%sh
python llmperf/token_benchmark_ray.py \
    --model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 30 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir bedrock-outputs
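
To compare several concurrency levels, the command above can be generated programmatically. The sketch below only builds the argument lists and uses the same flags as the cell above; `build_benchmark_cmd` and the `concurrency-N` subdirectory layout are hypothetical choices for illustration. Writing each run to its own directory is a precaution, since the output file names used in this notebook do not include the concurrency level.

```python
import shlex
from typing import List

MODEL = "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"


def build_benchmark_cmd(concurrency: int,
                        results_root: str = "bedrock-outputs") -> List[str]:
    """Return the argv for one benchmark run at the given concurrency."""
    return [
        "python", "llmperf/token_benchmark_ray.py",
        "--model", MODEL,
        "--mean-input-tokens", "1024",
        "--stddev-input-tokens", "200",
        "--mean-output-tokens", "1024",
        "--stddev-output-tokens", "200",
        "--max-num-completed-requests", "30",
        "--num-concurrent-requests", str(concurrency),
        "--timeout", "300",
        "--llm-api", "litellm",
        # Hypothetical layout: one results directory per concurrency level.
        "--results-dir", f"{results_root}/concurrency-{concurrency}",
    ]


commands = [build_benchmark_cmd(c) for c in (1, 2, 4)]
for cmd in commands:
    # Print a copy-pastable shell command for each run.
    print(shlex.join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)` to execute the runs one after another.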

### Display Summary Results

In [None]:
import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

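# Per-request results as a DataFrame, available for further analysis
df = pd.DataFrame(individual_data)

# Collect the headline metrics from the summary file and print them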
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}
print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")
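
Means can hide tail latency, so it is often worth looking at percentiles of the per-request results as well. The following is a minimal sketch using only the standard library and synthetic records. The field names `ttft_s` and `error_code` are assumptions about the layout of the individual responses file and should be checked against your own output; `percentile_summary` is a hypothetical helper.

```python
import statistics
from typing import Dict, List


def percentile_summary(records: List[dict], field: str,
                       error_field: str = "error_code") -> Dict[str, float]:
    """Summarize one numeric field over the successful requests only."""
    values = [
        r[field] for r in records
        if r.get(error_field) is None and r.get(field) is not None
    ]
    if len(values) < 2:
        raise ValueError("need at least two successful requests")
    # 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "count": len(values),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "max": max(values),
    }


# Synthetic stand-in for the loaded per-request records (not real results).
synthetic_records = [{"ttft_s": float(i), "error_code": None} for i in range(1, 102)]
synthetic_records.append({"ttft_s": 999.0, "error_code": 500})  # failed request, excluded

print(percentile_summary(synthetic_records, "ttft_s"))
```

With real data, you would pass `individual_data` from the cell above instead of `synthetic_records`.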