# Load Testing Llama3 on SageMaker RT Inference With LLMPerf

In this notebook we'll take the [LLMPerf](https://github.com/ray-project/llmperf) repo and showcase how you can use it to load test Llama3 deployed on SageMaker JumpStart real-time inference. LLM load testing is a little different than the traditional load testing we did for ML models. We'll specifically look at metrics such as:

- Time to First Token
- Token Throughput (Tokens per Second)

We'll still track our usual requests per minute (RPM), but these more granular metrics give a more accurate picture of LLM performance, especially when using an API that bills based on the number of tokens the model processes.
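
To make these metrics concrete, here is a minimal sketch of how they can be derived from per-request timestamps. The `RequestTiming` structure and the sample numbers are hypothetical and only for illustration; LLMPerf computes its own versions of these metrics internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Timestamps (in seconds) for one streamed request. Hypothetical structure."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    output_tokens: int    # number of tokens generated


def time_to_first_token(t: RequestTiming) -> float:
    """Seconds between sending the request and receiving the first token."""
    return t.first_token_s - t.start_s


def output_throughput(t: RequestTiming) -> float:
    """Output tokens per second over the full request duration."""
    return t.output_tokens / (t.end_s - t.start_s)


def requests_per_minute(timings: List[RequestTiming]) -> float:
    """Completed requests per minute over the wall-clock span of the test."""
    wall_s = max(t.end_s for t in timings) - min(t.start_s for t in timings)
    return 60.0 * len(timings) / wall_s


# Illustrative (made-up) timings for two sequential requests.
timings = [
    RequestTiming(start_s=0.0, first_token_s=0.5, end_s=4.0, output_tokens=200),
    RequestTiming(start_s=4.0, first_token_s=4.25, end_s=10.0, output_tokens=300),
]

for t in timings:
    print(f"TTFT: {time_to_first_token(t):.2f}s, "
          f"throughput: {output_throughput(t):.1f} tokens/s")
print(f"RPM: {requests_per_minute(timings):.1f}")
```

Note that two requests can have the same total latency but very different time to first token, which is why RPM alone does not describe the user experience of a streamed response.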

## Additional Resources/Credits
- [JumpStart Starter Guide](https://www.youtube.com/watch?v=c0ASHUm3BwA&t=636s)

## Llama Deployment via JumpStart
You can optionally skip this step and bring your own endpoint. If you would like to specify hardware, make sure to pass an instance type in the deployment params here as well.
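
If you do want to pin the hardware, the sketch below shows one way to pass an instance type to `JumpStartModel`. The helper names `build_jumpstart_kwargs` and `deploy_llama` are hypothetical, and `ml.g5.12xlarge` is only an example value; check which instance types your account and the model support before deploying.

```python
from typing import Optional


def build_jumpstart_kwargs(model_id: str, instance_type: Optional[str] = None) -> dict:
    """Build constructor arguments for JumpStartModel (hypothetical helper).

    When instance_type is omitted, JumpStart falls back to the model's default.
    """
    kwargs = {"model_id": model_id}
    if instance_type is not None:
        kwargs["instance_type"] = instance_type
    return kwargs


def deploy_llama(instance_type: Optional[str] = None):
    """Deploy Llama 3 8B via JumpStart and return the predictor."""
    # Imported lazily so the helper above can be used without the SageMaker SDK.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(
        **build_jumpstart_kwargs("meta-textgeneration-llama-3-8b", instance_type)
    )
    # accept_eula=True is required for Llama models, as in the cell below.
    return model.deploy(accept_eula=True)


# Example (assumed instance type, creates billable resources when run):
# predictor = deploy_llama(instance_type="ml.g5.12xlarge")
```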

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

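# Look up the Llama 3 8B model in JumpStart; the default instance type is used here
model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b")
# Llama models require explicitly accepting the end-user license agreement
predictor = model.deploy(accept_eula=True)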

In [None]:
import boto3
import json

endpoint_name = predictor.endpoint_name
sm_client = boto3.client("sagemaker-runtime", region_name="us-east-1")
payload = {
    "inputs": "Who is Roger Federer?",
    "parameters": {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}

response = sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json"
)
# The response body is a stream of bytes; decode it before printing
print(response["Body"].read().decode("utf-8"))

## LLMPerf Setup
Our notebook environment is a conda_python3 kernel on an ml.g5.12xlarge SageMaker Notebook Instance. Note that dependency versions might change depending on the environment you're in; the installations below are configured for this specific environment.

In [None]:
!pip install setuptools==65.5.1 --quiet
!git clone https://github.com/ray-project/llmperf.git
!cd llmperf; pip install -e . --quiet; cd ..
!pip install pydantic -U --quiet

In [None]:
import os 
from litellm import completion
import litellm
#litellm._turn_on_debug()

os.environ["AWS_ACCESS_KEY_ID"] = "Enter Access Key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter Secret Access Key"
os.environ["AWS_REGION_NAME"] = "us-east-1" #update region if needed

response = completion(
    model=f"sagemaker/{endpoint_name}",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}],
)
output = response.choices[0].message.content
print(output)

## LLMPerf Benchmark

Here we use the following LLMPerf script to configure our load test: https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py. You can adjust the input and output token sizes depending on your use case. Other parameters you can toggle include the number of concurrent requests and the test duration. We've configured the test to stop after 20 completed requests or 300 seconds (5 minutes), whichever comes first. Once it concludes, you should see the results displayed and a directory called <b>sagemaker-outputs</b> with the resulting files.

In [None]:
%%sh
python llmperf/token_benchmark_ray.py \
    --model "sagemaker/<enter ep name here>" \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 20 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir sagemaker-outputs
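
Once the benchmark finishes, the files in `sagemaker-outputs` can be inspected directly. The sketch below assumes LLMPerf wrote one or more flat JSON summary files whose names end in `_summary.json`; that naming, and the idea that metric keys contain `ttft` or `throughput`, are assumptions, so check the actual files in your results directory. `load_summary_metrics` is a hypothetical helper, not part of LLMPerf.

```python
import json
from pathlib import Path


def load_summary_metrics(results_dir, keywords=("ttft", "throughput")):
    """Return {file name: {metric: value}} for keys matching any keyword.

    Assumes each summary file is a flat JSON object (an assumption about
    LLMPerf's output format, not a documented guarantee).
    """
    results_path = Path(results_dir)
    metrics = {}
    if not results_path.is_dir():
        return metrics
    for path in sorted(results_path.glob("*_summary.json")):
        with path.open() as f:
            summary = json.load(f)
        metrics[path.name] = {
            key: value
            for key, value in summary.items()
            if any(kw in key.lower() for kw in keywords)
        }
    return metrics


# Print the matching metrics for every summary file found.
for name, values in load_summary_metrics("sagemaker-outputs").items():
    print(name)
    for key, value in sorted(values.items()):
        print(f"  {key}: {value}")
```

Filtering by keyword rather than hard-coding key names keeps the sketch usable even if the exact metric names differ between LLMPerf versions.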