https://docs.databricks.com/aws/en/notebooks/source/machine-learning/large-language-models/llm-benchmarking.html

# Large language model endpoints benchmarking script

To use this notebook, update the Databricks serving `endpoint_name` and the number of `input_tokens` and `output_tokens` in the next cell. The notebook prints the benchmark results as it runs and, at the end, plots a latency versus throughput graph.


In [0]:
# Update this with the name of the endpoint to benchmark
endpoint_name = '<YOUR-ENDPOINT-NAME>'
# Number of input and output tokens to benchmark
input_tokens = 2048
output_tokens = 256
# Number of queries per thread; higher values give more accurate results
num_queries_per_thread = 20

## Initial setup

In [0]:
import asyncio
import time
import aiohttp
import requests
import json
import statistics
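import matplotlib.pyplot  # import the pyplot submodule explicitly; `import matplotlib` alone does not guarantee it is loaded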
import math

# Set up the endpoint URL and headers so that you can query the server.
API_ROOT = "<YOUR-WORKSPACE-URL>"
API_TOKEN = "<YOUR-API-TOKEN>"


headers = {'Authorization': f'Bearer {API_TOKEN}', 'Content-Type': 'application/json'}
endpoint_url = f'{API_ROOT}/serving-endpoints/{endpoint_name}/invocations'

The following `get_request` function builds the request payload for each query. The number of tokens in the prompt must match the number of tokens the model sees, so the prompt is built by repeating a string that the benchmarked model's tokenizer encodes as a single token. The example in this notebook works for Llama models.

In [0]:
def get_request(in_tokens, out_tokens):
  # Edit this so that the model sees the expected number of input and output tokens. This might depend on the tokenizer the model uses.
  return {'prompt': '<|begin_of_text|>'*(in_tokens-1), 'temperature': 0.0, 'max_tokens': out_tokens, 'ignore_eos': True}
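
The `in_tokens-1` in the prompt presumably leaves room for one token that the server adds on its own (for Llama models, likely a beginning-of-sequence token). As a rough sketch of that arithmetic, the hypothetical helpers `build_prompt` and `build_request` below make the offset an explicit `server_added_tokens` parameter. Both helper names and the parameter are illustrative assumptions and not part of the notebook; the right offset depends on the model's tokenizer and serving stack.

```python
def build_prompt(token_text, in_tokens, server_added_tokens=1):
    """Repeat a single-token string so the model sees `in_tokens` tokens in total.

    `server_added_tokens` is an assumed count of tokens the server prepends
    by itself (for example, a beginning-of-sequence token).
    """
    repeats = in_tokens - server_added_tokens
    if repeats < 0:
        raise ValueError("in_tokens must be at least server_added_tokens")
    return token_text * repeats


def build_request(token_text, in_tokens, out_tokens, server_added_tokens=1):
    """Build a payload with the same fields as get_request in this notebook."""
    return {
        'prompt': build_prompt(token_text, in_tokens, server_added_tokens),
        'temperature': 0.0,
        'max_tokens': out_tokens,
        'ignore_eos': True,
    }


# Same settings as the notebook defaults: 2048 input tokens, 256 output tokens.
request = build_request('<|begin_of_text|>', 2048, 256)
```

If the validation step reports a different prompt token count than expected, adjusting `server_added_tokens` (or the repeated string) is one way to bring the two back in line.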


Next, validate the number of input tokens. You might need to edit the prompt manually, because the token count depends on the tokenizer the model uses. The following example:

- Runs 10 queries.
- Validates that the number of input tokens matches the number of tokens the model sees.
- Warms up the model.

In [0]:
# Sends an initial set of warm-up requests and validates that you are sending the correct number of input tokens.
def warm_up_and_validate(in_tokens=2048, out_tokens=256, warm_up_requests=10):
  input_data = get_request(in_tokens, out_tokens)
  input_json = json.dumps(input_data)
  req = requests.Request('POST', endpoint_url, headers=headers, data=input_json)
  prepped = req.prepare()
  session = requests.Session()
  for _ in range(warm_up_requests):
    resp = session.send(prepped)
    result = json.loads(resp.text)
    assert result['usage']['completion_tokens'] == out_tokens, f"Model generated {result['usage']['completion_tokens']} output tokens, expected {out_tokens}."
    assert result['usage']['prompt_tokens'] == in_tokens, f"Model received {result['usage']['prompt_tokens']} input tokens, expected {in_tokens}. Please adjust the input prompt in get_request."

warm_up_and_validate(input_tokens, output_tokens)
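
The token-count checks can also be tried out without a live endpoint. The sketch below assumes the response carries a `usage` object with `prompt_tokens` and `completion_tokens`, as the code above does; the helper name `check_usage` and the sample values are made up for illustration.

```python
def check_usage(usage, in_tokens, out_tokens):
    """Return a list of human-readable mismatches; empty if the counts match."""
    problems = []
    if usage['prompt_tokens'] != in_tokens:
        problems.append(
            f"Model received {usage['prompt_tokens']} input tokens, expected {in_tokens}."
        )
    if usage['completion_tokens'] != out_tokens:
        problems.append(
            f"Model generated {usage['completion_tokens']} output tokens, expected {out_tokens}."
        )
    return problems


# Made-up usage data: the prompt is one token too long.
sample_usage = {'prompt_tokens': 2049, 'completion_tokens': 256}
problems = check_usage(sample_usage, in_tokens=2048, out_tokens=256)
for problem in problems:
    print(problem)
```

Collecting the mismatches instead of asserting on the first one reports both counts in a single pass, which can make it quicker to adjust the prompt.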

## Benchmarking library

In [0]:
latencies = []

# This is a single worker, which processes the given number of requests, one after the other.
async def worker(index, num_requests, in_tokens=2048, out_tokens=256):
  input_data = get_request(in_tokens, out_tokens)
  # Sleep for a short time to stagger the start of the workers.
  await asyncio.sleep(0.1*index)
  
  for i in range(num_requests):
    request_start_time = time.time()
    
    success = False 
    while not success:
      timeout = aiohttp.ClientTimeout(total=3 * 3600)
      async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(endpoint_url, headers=headers, json=input_data) as response:
          success = response.ok
          chunks = []
          async for chunk, _ in response.content.iter_chunks():
            chunks.append(chunk)
    latency = time.time() - request_start_time
    result = json.loads(b''.join(chunks))
    latencies.append((result['usage']['prompt_tokens'], 
                      result['usage']['completion_tokens'], latency))


# This runs num_workers workers in parallel, each sending num_requests_per_worker queries.
async def single_benchmark(num_requests_per_worker, num_workers, in_tokens=2048, out_tokens=256):
  tasks = []
  for i in range(num_workers):
    task = asyncio.create_task(worker(i, num_requests_per_worker, in_tokens, out_tokens))
    tasks.append(task)
  await asyncio.gather(*tasks)

# This runs the benchmark with 1, n//2, and n output tokens. The time to first token is derived from the run with 1 output token,
# and the time per output token from the difference in latency between the runs with n//2 and n output tokens.
async def benchmark(parallel_queries=1, in_tokens=2048, out_tokens=256, num_tries=5):
  # Store statistics about the number of input/output tokens and the latency for each setup.
  avg_num_input_tokens = [0, 0, 0]
  avg_num_output_tokens = [0, 0, 0]
  median_latency = [0, 0, 0]
  print(f"Parallel queries {parallel_queries}")
  for i, out_tokens in enumerate([1, out_tokens//2, out_tokens]):
    # Clear the latencies array so that you get fresh statistics.
    latencies.clear()
    await single_benchmark(num_tries, parallel_queries, in_tokens, out_tokens)
    # Compute the median latency and the mean number of tokens.
    avg_num_input_tokens[i] = statistics.mean([inp for inp, _, _ in latencies])
    avg_num_output_tokens[i] = statistics.mean([outp for _, outp, _ in latencies])
    median_latency[i] = statistics.median([latency for _, _, latency in latencies])
    tokens_per_sec = (avg_num_input_tokens[i]+avg_num_output_tokens[i])*parallel_queries/median_latency[i]
    print(f'Output tokens {avg_num_output_tokens[i]}, median latency (s): {round(median_latency[i], 2)}, tokens per second {round(tokens_per_sec, 1)}')
  
  # Use the difference in latency between out_tokens//2 and out_tokens to find the time per output token.
  # These latencies are stored in median_latency[1] and median_latency[2] respectively.
  # The latency for generating a single token, used as the time to first token, is stored in median_latency[0].
  output_token_time = (median_latency[2] - median_latency[1])*1000/(avg_num_output_tokens[2]-avg_num_output_tokens[1])
  print(f'Time to first token (s): {round(median_latency[0],2)}, Time per output token (ms) {round(output_token_time,2)}')
  data.append([median_latency[2],
               (avg_num_input_tokens[2]+avg_num_output_tokens[2])*parallel_queries/median_latency[2]])
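
To make the arithmetic in `benchmark` concrete, here is a small worked example using the same formulas. The latencies and token counts are invented for illustration, and `derive_metrics` is a hypothetical helper name, not part of the notebook.

```python
def derive_metrics(median_latency, avg_output_tokens, avg_input_tokens, parallel_queries):
    """Apply the notebook's formulas to three runs (1, n//2, and n output tokens)."""
    # Time to first token: latency of the run that generates a single token.
    ttft_s = median_latency[0]
    # Time per output token: extra latency divided by extra tokens, in milliseconds.
    tpot_ms = ((median_latency[2] - median_latency[1]) * 1000
               / (avg_output_tokens[2] - avg_output_tokens[1]))
    # Throughput: all tokens processed across parallel queries per second of latency.
    throughput = (avg_input_tokens + avg_output_tokens[2]) * parallel_queries / median_latency[2]
    return ttft_s, tpot_ms, throughput


# Invented numbers: 2048 input tokens, runs with 1, 128, and 256 output tokens, 4 parallel queries.
ttft_s, tpot_ms, throughput = derive_metrics(
    median_latency=[0.5, 2.5, 4.5],
    avg_output_tokens=[1, 128, 256],
    avg_input_tokens=2048,
    parallel_queries=4,
)
print(f"Time to first token (s): {ttft_s}, time per output token (ms): {tpot_ms}, "
      f"throughput (tok/s): {throughput}")
```

With these numbers, the extra 128 tokens take 2.0 seconds, so the time per output token is 15.625 ms, and the throughput is (2048 + 256) × 4 / 4.5 = 2048 tokens per second.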


## Run the benchmark with differing parallel queries

In [0]:
# Double the number of parallel queries (up to 8) until the throughput increases by less than 10%.
data = []
for parallel_queries in [1, 2, 4, 8]:
  print(f"Input tokens {input_tokens}")
  await benchmark(parallel_queries, input_tokens, output_tokens, num_queries_per_thread)
  # Break if the throughput doesn't increase by more than 10%
  if len(data) > 1 and (data[-1][1] - data[-2][1])/data[-2][1] < 0.1:
    break

# Plot the latency vs throughput curve
matplotlib.pyplot.xlabel("Latency (s)")
matplotlib.pyplot.ylabel("Throughput (tok/s)")
line = matplotlib.pyplot.plot([x[0] for x in data], [x[1] for x in data], marker='o')
matplotlib.pyplot.show()
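
The stopping rule in the loop above can be read in isolation as a small predicate. The sketch below assumes `data` holds `[latency, throughput]` pairs, as in the notebook; the function name `throughput_plateaued`, the `min_gain` parameter, and the sample values are illustrative only.

```python
def throughput_plateaued(data, min_gain=0.1):
    """Return True if the latest throughput gain is below `min_gain` (10% by default).

    `data` is a list of [latency_s, throughput_tok_per_s] pairs, one per
    parallelism level, in the order the levels were benchmarked.
    """
    if len(data) < 2:
        # Not enough points to compare yet.
        return False
    previous, latest = data[-2][1], data[-1][1]
    return (latest - previous) / previous < min_gain


# Invented measurements for 1, 2, and 4 parallel queries.
sample = [[1.0, 1000.0], [1.2, 1900.0], [2.0, 2000.0]]
print(throughput_plateaued(sample[:2]))  # gain from 1000 to 1900 tok/s
print(throughput_plateaued(sample))      # gain from 1900 to 2000 tok/s
```

In this example the first comparison shows a 90% gain, so the sweep would continue, while the second shows a gain of roughly 5%, so the sweep would stop.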