# Validating dstack's Llama 3.1 405B Testing
Recently, on an essentially identical Hot Aisle machine, [dstack](https://dstack.ai/) got some surprising vLLM vs TGI results, with TGI coming out roughly 2X faster. I did [a quick check](https://www.reddit.com/r/AMD_MI300/comments/1fzzdga/comment/lr5h9km/) and didn't see anywhere near that big a disparity.

Now that we have a better testing setup, let's see if we can replicate at least the first part of their results with the exact same settings (I will be using CK FA and hipBLASLt):
https://dstack.ai/blog/amd-mi300x-inference-benchmark/

In [1]:
# Increase File handles
!ulimit -n 131072
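
One caveat: `!ulimit` runs in a throwaway subshell, so the new limit does not carry over to the notebook kernel or to anything the kernel launches later. A minimal sketch of an alternative, assuming a Linux host and a hard limit that allows it, is to raise the soft limit from inside the kernel with the standard-library `resource` module, so that child processes such as the server started below inherit it. The helper name `raise_nofile_limit` is only illustrative.

```python
import resource

def raise_nofile_limit(target=131072):
    """Raise this process's soft open-file limit toward `target`."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit cannot exceed the hard limit without extra privileges.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError) as exc:
            print(f"Could not raise the file handle limit: {exc}")
    # Report whatever limit is actually in effect now.
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = raise_nofile_limit()
print(f"Open file limit: soft={soft}, hard={hard}")
```

Processes started afterwards with `subprocess` inherit the kernel's limits, which is the point of doing it this way.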

In [6]:
# Run the server in the background...
# %%bash --bg
# VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-405B-Instruct  --tensor-parallel-size=8 --disable-log-requests

import subprocess

cmd = """VLLM_USE_TRITON_FLASH_ATTN=0 \
         vllm serve NousResearch/Hermes-3-Llama-3.1-405B \
         -tp=8 \
         --disable-log-requests \
         --num-scheduler-steps 15 \
         --max-num-seqs 2048 \
         --enable-chunked-prefill false"""

log_file = "vllm_serve.2048.vllm-rocm.tune.log"
with open(log_file, 'w') as f:
    process = subprocess.Popen(cmd, shell=True, stdout=f, stderr=subprocess.STDOUT, text=True)

print(f"VLLM serve process started. Output is being logged to {log_file}")
print(f"Process ID: {process.pid}")

# We can terminate the process later...
# process.terminate()

VLLM serve process started. Output is being logged to vllm_serve.2048.log
Process ID: 536976
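
A 405B model takes a while to load, so the benchmark cell should not start until the server is actually accepting connections. As a rough sketch, you can poll the port before moving on. The helper name `wait_for_port` and the timeout values are hypothetical; port 8000 is the default that shows up in the benchmark's argument dump further down.

```python
import socket
import time

def wait_for_port(host="localhost", port=8000, timeout_s=3600, poll_s=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
```

Calling `wait_for_port()` right before the benchmark loop would block until the port opens. An open port only shows that the process is listening, not that the model has finished warming up, so treat it as a lower bound.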


In [12]:
%cd ~/dstat.benchmarks/amd/inference/scripts

/home/hotaisle/dstat.benchmarks/amd/inference/scripts


In [13]:
# 80 tokens
import subprocess

num_prompts = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

base_command = """python benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-405B \
    --dataset-name sonnet \
    --dataset-path="sonnet.txt" \
    --sonnet-input-len 80 \
    --sonnet-prefix-len 50 \
    --num-prompts {num_prompt}"""

for num_prompt in num_prompts:
    command = base_command.format(num_prompt=num_prompt)
    print(f"Running benchmark with num_prompt={num_prompt}")
    
    # Run the command and capture output
    result = subprocess.run(command, shell=True, text=True, capture_output=True)
    
    # Print stdout and stderr
    print("STDOUT:")
    print(result.stdout)
    # print("STDERR:")
    # print(result.stderr)
    
    print(f"Finished benchmark with num_prompt={num_prompt}\n")

print("All benchmarks completed.")

Running benchmark with num_prompt=1
STDOUT:
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sonnet', dataset_path='sonnet.txt', model='NousResearch/Hermes-3-Llama-3.1-405B', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', sonnet_input_len=80, sonnet_output_len=150, sonnet_prefix_len=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Successful requests:                     1         
Benchmark durati
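
The argument dump above lists `save_result`, `result_dir` and `result_filename`, which suggests the script can write each run's metrics to a JSON file instead of leaving them only in stdout. Here is a hedged sketch of building such a command as an argument list. The flag spellings `--save-result`, `--result-dir` and `--result-filename` are inferred from those names, and the `results` directory and file naming are made up for the example, so check `python benchmark_serving.py --help` first.

```python
import shlex

def build_benchmark_command(num_prompts, result_dir="results"):
    """Build the benchmark command as an argument list (no shell quoting needed)."""
    return [
        "python", "benchmark_serving.py",
        "--backend", "vllm",
        "--model", "NousResearch/Hermes-3-Llama-3.1-405B",
        "--dataset-name", "sonnet",
        "--dataset-path", "sonnet.txt",
        "--sonnet-input-len", "80",
        "--sonnet-prefix-len", "50",
        "--num-prompts", str(num_prompts),
        # Assumed flags, inferred from the Namespace dump above.
        "--save-result",
        "--result-dir", result_dir,
        "--result-filename", f"vllm_{num_prompts}.json",
    ]

# Only print the commands here; pass the list to subprocess.run() to execute.
for n in [1, 2, 4]:
    print(shlex.join(build_benchmark_command(n)))
```

Passing a list to `subprocess.run` also avoids `shell=True` and the line-continuation pitfalls of a multi-line command string.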

## Parse results
Now that we have our results, here's a whole mess of code to get them into a DataFrame, and to pull the raw results from dstack's testing into DataFrames as well...

In [14]:
# Benchmark parsing code

import re  # used by every parsing function below

from IPython.display import display
import nbformat
import ipynbname
import pandas as pd

# Enhanced parsing functions
def parse_benchmark_results(text):
    """
    Parse the benchmark results from text into a structured DataFrame.

    Parameters:
    - text (str): The text containing benchmark results.

    Returns:
    - pd.DataFrame: DataFrame containing the parsed benchmark results.
    """
    # Define all the metrics to extract with their regex patterns.
    # 'num_prompt' is taken from the "Running benchmark with num_prompt=..." line below,
    # so it is not overwritten by the successful-request count if some requests fail.
    metrics_patterns = {
        'successful_requests': r"Successful requests:\s+(\d+)",
        'benchmark_duration_s': r"Benchmark duration \(s\):\s+([\d.]+)",
        'total_input_tokens': r"Total input tokens:\s+([\d.]+)",
        'total_generated_tokens': r"Total generated tokens:\s+([\d.]+)",
        'request_throughput_req_s': r"Request throughput \(req/s\):\s+([\d.]+)",
        'output_token_throughput_tok_s': r"Output token throughput \(tok/s\):\s+([\d.]+)",
        'total_token_throughput_tok_s': r"Total Token throughput \(tok/s\):\s+([\d.]+)",
        'mean_ttft_ms': r"Mean TTFT \(ms\):\s+([\d.]+)",
        'median_ttft_ms': r"Median TTFT \(ms\):\s+([\d.]+)",
        'p99_ttft_ms': r"P99 TTFT \(ms\):\s+([\d.]+)",
        'mean_tpot_ms': r"Mean TPOT \(ms\):\s+([\d.]+)",
        'median_tpot_ms': r"Median TPOT \(ms\):\s+([\d.]+)",
        'p99_tpot_ms': r"P99 TPOT \(ms\):\s+([\d.]+)",
        'mean_itl_ms': r"Mean ITL \(ms\):\s+([\d.]+)",
        'median_itl_ms': r"Median ITL \(ms\):\s+([\d.]+)",
        'p99_itl_ms': r"P99 ITL \(ms\):\s+([\d.]+)"
    }

    results = []
    current_result = {}
    current_prompt = None

    for line in text.split('\n'):
        # Check for num_prompt
        num_prompt_match = re.search(r"Running benchmark with num_prompt=(\d+)", line)
        if num_prompt_match:
            current_prompt = int(num_prompt_match.group(1))
            current_result = {'num_prompt': current_prompt}
            continue  # Move to the next line

        # Extract metrics
        for metric, pattern in metrics_patterns.items():
            match = re.search(pattern, line, re.IGNORECASE)
            if match:
                # Handle multiple groups (e.g., num_prompt and successful_requests)
                value_str = match.group(1) if match.group(1) else match.group(2)
                if value_str:
                    # Convert to float or int
                    if '.' in value_str:
                        value = float(value_str)
                    else:
                        value = int(value_str)
                    current_result[metric] = value

        # Check if we have reached the end of a benchmark block
        if "==================================================" in line:
            if current_prompt is not None:
                results.append(current_result)
                current_prompt = None
                current_result = {}

    # Convert the list of results to a DataFrame
    df = pd.DataFrame(results)
    return df

def parse_cell_output(cell_output):
    """
    Parse the output of a code cell containing benchmark results.

    Parameters:
    - cell_output (str): The output text from the code cell.

    Returns:
    - pd.DataFrame: DataFrame containing the parsed benchmark results.
    """
    return parse_benchmark_results(cell_output)

def parse_markdown_cell(cell_content):
    """
    Parse the content of a cell (e.g., Markdown) containing benchmark results.

    Parameters:
    - cell_content (str): The content text from the cell.

    Returns:
    - pd.DataFrame: DataFrame containing the parsed benchmark results.
    """
    return parse_cell_content(cell_content)

def parse_cell_content(cell_content):
    """
    Parse the cell content to extract benchmark results into a DataFrame.

    Parameters:
    - cell_content (str): The content of the cell to parse.

    Returns:
    - pd.DataFrame: DataFrame containing the parsed benchmark results.
    """
    # Step 1: Extract num_prompts
    num_prompts_pattern = r"num_prompts:\s*([\d,\s]+)"
    num_prompts_match = re.search(num_prompts_pattern, cell_content, re.IGNORECASE)
    
    if not num_prompts_match:
        print("No 'num_prompts' found in cell content.")
        return None
    
    num_prompts_str = num_prompts_match.group(1)
    num_prompts = [int(x.strip()) for x in num_prompts_str.split(",")]
    
    # Step 2: Split the content into benchmark result blocks
    # Adjusted the regex to handle variations in the separator lines
    benchmark_block_pattern = r"=========== Serving Benchmark Result ============\n(.*?)=================================================="
    blocks = re.findall(benchmark_block_pattern, cell_content, re.DOTALL)
    
    if not blocks:
        print("No benchmark result blocks found in cell content.")
        return None
    
    # Check if the number of blocks matches the number of num_prompts
    if len(blocks) != len(num_prompts):
        print("Warning: Number of benchmark result blocks does not match number of num_prompts.")
        print(f"Number of blocks: {len(blocks)}, Number of num_prompts: {len(num_prompts)}")
        # Proceeding, but mapping may be incorrect
    
    # Define all the metrics to extract
    metrics_patterns = {
        'successful_requests': r"Successful requests:\s+(\d+)",
        'benchmark_duration_s': r"Benchmark duration \(s\):\s+([\d.]+)",
        'total_input_tokens': r"Total input tokens:\s+([\d.]+)",
        'total_generated_tokens': r"Total generated tokens:\s+([\d.]+)",
        'request_throughput_req_s': r"Request throughput \(req/s\):\s+([\d.]+)",
        'output_token_throughput_tok_s': r"Output token throughput \(tok/s\):\s+([\d.]+)",
        'total_token_throughput_tok_s': r"Total Token throughput \(tok/s\):\s+([\d.]+)",
        'mean_ttft_ms': r"Mean TTFT \(ms\):\s+([\d.]+)",
        'median_ttft_ms': r"Median TTFT \(ms\):\s+([\d.]+)",
        'p99_ttft_ms': r"P99 TTFT \(ms\):\s+([\d.]+)",
        'mean_tpot_ms': r"Mean TPOT \(ms\):\s+([\d.]+)",
        'median_tpot_ms': r"Median TPOT \(ms\):\s+([\d.]+)",
        'p99_tpot_ms': r"P99 TPOT \(ms\):\s+([\d.]+)",
        'mean_itl_ms': r"Mean ITL \(ms\):\s+([\d.]+)",
        'median_itl_ms': r"Median ITL \(ms\):\s+([\d.]+)",
        'p99_itl_ms': r"P99 ITL \(ms\):\s+([\d.]+)"
    }
    
    results = []
    
    for idx, block in enumerate(blocks):
        # Initialize a dictionary for each result
        result = {}
        result['num_prompt'] = num_prompts[idx] if idx < len(num_prompts) else None
        
        # Extract each metric using its regex pattern
        for metric, pattern in metrics_patterns.items():
            match = re.search(pattern, block, re.IGNORECASE)
            if match:
                # Convert numerical values to float or int as appropriate
                value_str = match.group(1)
                if '.' in value_str:
                    value = float(value_str)
                else:
                    value = int(value_str)
                result[metric] = value
            else:
                result[metric] = None  # Assign None if the metric is not found
        
        # Append the result to the list
        results.append(result)
    
    # Create a DataFrame from the results
    df = pd.DataFrame(results)
    
    return df
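
To make the parsing idea concrete, here is a compact, self-contained sketch run against the first block of dstack's raw vLLM output (quoted further down). Note that it swaps the per-metric pattern dictionary for a single generic `label (unit): value` regex, so it illustrates the approach rather than reproducing the code used for the tables in this post. `LINE_RE` and `parse_block` are names invented for the example.

```python
import re

SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  4.53
Total input tokens:                      80
Total generated tokens:                  85
Request throughput (req/s):              0.22
Output token throughput (tok/s):         18.75
Total Token throughput (tok/s):          36.40
---------------Time to First Token----------------
Mean TTFT (ms):                          314.58
Median TTFT (ms):                        314.58
P99 TTFT (ms):                           314.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          50.19
Median TPOT (ms):                        50.19
P99 TPOT (ms):                           50.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           49.13
Median ITL (ms):                         49.66
P99 ITL (ms):                            94.22
==================================================
"""

# "Label (unit):   value" where the "(unit)" part is optional.
LINE_RE = re.compile(
    r"^(?P<label>[A-Za-z0-9 ]+?)(?: \((?P<unit>[^)]+)\))?:\s+(?P<value>[\d.]+)\s*$"
)

def parse_block(text):
    """Turn one benchmark result block into a {column_name: number} dict."""
    row = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # separator and header lines do not match
        key = m.group("label").strip().lower().replace(" ", "_")
        if m.group("unit"):
            key += "_" + m.group("unit").replace("/", "_")
        raw = m.group("value")
        row[key] = float(raw) if "." in raw else int(raw)
    return row

row = parse_block(SAMPLE)
print(row["total_token_throughput_tok_s"], row["mean_ttft_ms"])
```

The generated keys (`benchmark_duration_s`, `request_throughput_req_s`, `mean_ttft_ms`, and so on) line up with the column names used above, so a list of such dicts can be handed straight to `pd.DataFrame`.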

In [15]:
# Pull data from cells...

import nbformat
import ipynbname
import pandas as pd
import re

def get_notebook_cells():
    try:
        notebook_path = ipynbname.path()
    except FileNotFoundError:
        print("Could not determine the notebook's path.")
        return []
    
    with open(notebook_path, 'r', encoding='utf-8') as f:
        notebook = nbformat.read(f, as_version=4)
    
    return notebook.cells

def get_cell_content_by_position(position):
    """Get the content of a cell by its position in the notebook (0-based index)."""
    cells = get_notebook_cells()
    if position < len(cells):
        cell = cells[position]
        return cell.source
    else:
        print(f"No cell found at position {position}.")
        return None

def get_cell_outputs_by_position(position):
    """Get the outputs of a code cell by its position in the notebook (0-based index)."""
    cells = get_notebook_cells()
    if position < len(cells):
        cell = cells[position]
        if cell.cell_type == 'code':
            outputs = cell.get('outputs', [])
            # Concatenate all output texts
            output_texts = []
            for output in outputs:
                if 'text' in output:
                    output_texts.append(''.join(output['text']))
                elif 'data' in output and 'text/plain' in output['data']:
                    output_texts.append(''.join(output['data']['text/plain']))
                elif 'ename' in output and 'evalue' in output:
                    # Capture error messages
                    output_texts.append(f"{output['ename']}: {output['evalue']}")
            return '\n'.join(output_texts)
        else:
            print(f"Cell at position {position} is not a code cell.")
            return None
    else:
        print(f"No cell found at position {position}.")
        return None

In [16]:
### My results from the output

# Get output
cell_position = 4  # 0-based index (e.g., 4 corresponds to the 5th cell)

# Retrieve cell content
cell_content = get_cell_content_by_position(cell_position)

# Retrieve cell output (only for code cells)
cell_output = get_cell_outputs_by_position(cell_position)

# Parsing the output if available
if cell_output:
    df_myvllm = parse_cell_output(cell_output)
    display(df_myvllm)

| num_prompt | successful_requests | benchmark_duration_s | total_input_tokens | total_generated_tokens | request_throughput_req_s | output_token_throughput_tok_s | total_token_throughput_tok_s | mean_ttft_ms | median_ttft_ms | p99_ttft_ms | mean_tpot_ms | median_tpot_ms | p99_tpot_ms | mean_itl_ms | median_itl_ms | p99_itl_ms |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 2.76 | 73 | 40 | 0.36 | 14.47 | 40.87 | 421.32 | 421.32 | 421.32 | 60.06 | 60.06 | 60.06 | 58.57 | 57.58 | 67.93 |
| 2 | 2 | 2.81 | 146 | 80 | 0.71 | 28.45 | 80.36 | 424.96 | 424.96 | 425.23 | 61.17 | 61.17 | 61.18 | 59.66 | 58.16 | 73.13 |
| 4 | 4 | 3.13 | 294 | 161 | 1.28 | 51.43 | 145.35 | 480.22 | 480.21 | 480.52 | 61.83 | 61.84 | 62.02 | 60.31 | 59.56 | 68.33 |
| 8 | 8 | 3.67 | 592 | 343 | 2.18 | 93.43 | 254.69 | 582.46 | 582.52 | 583.07 | 66.04 | 66.08 | 66.39 | 64.49 | 63.55 | 91.84 |
| 16 | 16 | 4.06 | 1179 | 689 | 3.94 | 169.71 | 460.1 | 723.83 | 723.76 | 724.9 | 68.48 | 68.36 | 69.71 | 66.92 | 64.1 | 127.28 |
| 32 | 32 | 4.66 | 2333 | 1350 | 6.86 | 289.47 | 789.71 | 1165.58 | 1165.16 | 1172.62 | 70.89 | 70.76 | 71.9 | 69.23 | 67.55 | 98.17 |
| 64 | 64 | 5.76 | 4669 | 2693 | 11.11 | 467.69 | 1278.56 | 1776.87 | 1777.84 | 1787.03 | 76.49 | 76.45 | 77.78 | 74.65 | 73.77 | 106.13 |
| 128 | 128 | 8.16 | 9329 | 5325 | 15.68 | 652.37 | 1795.28 | 3282.96 | 3282.87 | 3288.8 | 96.84 | 96.97 | 99.45 | 94.51 | 92.05 | 136.01 |
| 256 | 256 | 13.6 | 18685 | 10520 | 18.82 | 773.53 | 2147.43 | 5163.32 | 6551.26 | 6742.07 | 179.16 | 146.06 | 267.64 | 174.56 | 138.58 | 3292.77 |
| 512 | 512 | 22.72 | 37425 | 21039 | 22.53 | 925.91 | 2572.95 | 11623.41 | 11597.04 | 11774.09 | 229.82 | 231.65 | 241.39 | 224.27 | 224.75 | 377.34 |


In [17]:
# Get dstack vllm
cell_position = 14
# Retrieve cell content
cell_content = get_cell_content_by_position(cell_position)

df_dsvllm = parse_markdown_cell(cell_content)
print("dstack vllm results")
display(df_dsvllm)

# Get dstack tgi
cell_position = 15
# Retrieve cell content
cell_content = get_cell_content_by_position(cell_position)

df_dstgi = parse_markdown_cell(cell_content)
print("dstack tgi results")
display(df_dstgi)

dstack vllm results


| num_prompt | successful_requests | benchmark_duration_s | total_input_tokens | total_generated_tokens | request_throughput_req_s | output_token_throughput_tok_s | total_token_throughput_tok_s | mean_ttft_ms | median_ttft_ms | p99_ttft_ms | mean_tpot_ms | median_tpot_ms | p99_tpot_ms | mean_itl_ms | median_itl_ms | p99_itl_ms |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 4.53 | 80 | 85 | 0.22 | 18.75 | 36.4 | 314.58 | 314.58 | 314.58 | 50.19 | 50.19 | 50.19 | 49.13 | 49.66 | 94.22 |
| 2 | 2 | 4.66 | 160 | 127 | 0.43 | 27.28 | 61.65 | 372.8 | 372.8 | 373.27 | 51.59 | 51.59 | 52.21 | 50.38 | 49.87 | 95.48 |
| 4 | 4 | 5.42 | 320 | 288 | 0.74 | 53.18 | 112.27 | 397.54 | 397.41 | 398.43 | 54.18 | 54.07 | 54.64 | 53.3 | 54.54 | 106.68 |
| 8 | 8 | 5.65 | 640 | 484 | 1.42 | 85.64 | 198.89 | 646.2 | 596.84 | 796.26 | 55.48 | 54.98 | 60.39 | 54.28 | 52.68 | 201.8 |
| 16 | 16 | 6.57 | 1279 | 936 | 2.44 | 142.52 | 337.26 | 1034.69 | 1207.28 | 1413.84 | 63.38 | 60.83 | 79.37 | 61.59 | 58.6 | 203.24 |
| 32 | 32 | 8.49 | 2594 | 1857 | 3.77 | 218.64 | 524.06 | 1546.0 | 1500.61 | 2506.87 | 87.1 | 83.15 | 118.37 | 83.45 | 82.86 | 408.83 |
| 64 | 64 | 19.61 | 5189 | 3842 | 3.26 | 195.93 | 460.56 | 2698.98 | 2811.65 | 4723.29 | 230.16 | 237.99 | 293.12 | 218.27 | 202.67 | 411.57 |
| 128 | 128 | 27.7 | 10306 | 7588 | 4.62 | 273.89 | 645.88 | 5161.26 | 4910.31 | 10056.4 | 298.29 | 299.27 | 380.18 | 286.41 | 303.66 | 505.04 |
| 256 | 256 | 44.46 | 20829 | 15376 | 5.76 | 345.83 | 814.31 | 11787.27 | 10570.1 | 25591.5 | 389.49 | 402.13 | 499.46 | 373.87 | 403.86 | 808.29 |
| 512 | 512 | 75.58 | 41838 | 30400 | 6.77 | 402.21 | 955.75 | 27060.36 | 25908.71 | 57181.18 | 419.23 | 452.64 | 474.21 | 406.6 | 406.3 | 731.78 |


dstack tgi results


| num_prompt | successful_requests | benchmark_duration_s | total_input_tokens | total_generated_tokens | request_throughput_req_s | output_token_throughput_tok_s | total_token_throughput_tok_s | mean_ttft_ms | median_ttft_ms | p99_ttft_ms | mean_tpot_ms | median_tpot_ms | p99_tpot_ms | mean_itl_ms | median_itl_ms | p99_itl_ms |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 2.56 | 80 | 57 | 0.39 | 22.27 | 53.52 | 123.48 | 123.48 | 123.48 | 43.48 | 43.48 | 43.48 | 42.71 | 42.66 | 43.68 |
| 2 | 2 | 4.67 | 159 | 125 | 0.43 | 26.76 | 60.8 | 864.65 | 864.65 | 1589.18 | 59.22 | 59.22 | 72.42 | 57.08 | 45.22 | 51.38 |
| 4 | 4 | 3.7 | 319 | 210 | 1.08 | 56.82 | 143.13 | 244.9 | 285.64 | 285.7 | 53.01 | 52.97 | 55.16 | 51.78 | 51.06 | 67.05 |
| 8 | 8 | 3.91 | 639 | 404 | 2.05 | 103.41 | 266.97 | 395.44 | 432.47 | 433.75 | 54.17 | 53.71 | 58.47 | 53.03 | 52.51 | 54.65 |
| 16 | 16 | 4.97 | 1276 | 823 | 3.22 | 165.52 | 422.14 | 581.37 | 610.26 | 614.75 | 60.53 | 60.85 | 64.99 | 59.06 | 59.35 | 67.93 |
| 32 | 32 | 8.09 | 2590 | 1685 | 3.95 | 208.23 | 528.31 | 2325.73 | 2396.74 | 2398.41 | 76.4 | 75.16 | 137.68 | 72.63 | 73.57 | 83.79 |
| 64 | 64 | 10.03 | 5255 | 3615 | 6.38 | 360.44 | 884.41 | 2150.47 | 2181.28 | 2190.51 | 101.4 | 102.73 | 120.5 | 97.98 | 99.58 | 125.9 |
| 128 | 128 | 14.45 | 10480 | 7236 | 8.86 | 500.92 | 1226.4 | 3753.25 | 3781.18 | 3799.51 | 133.02 | 134.98 | 141.72 | 128.25 | 132.54 | 159.03 |
| 256 | 256 | 22.07 | 20857 | 14739 | 11.6 | 667.82 | 1612.85 | 6981.56 | 7005.91 | 7041.34 | 194.79 | 197.87 | 210.77 | 187.1 | 199.32 | 216.95 |
| 512 | 512 | 38.6 | 41795 | 29639 | 13.27 | 767.92 | 1850.8 | 14086.51 | 14109.33 | 14161.82 | 322.78 | 330.42 | 355.85 | 308.13 | 336.09 | 369.31 |


## Comparison
OK, that wasn't so hard, was it? *sob* Actually, 3.5 Sonnet and o1-mini did a good job; this could have been much more painful. Now let's run our comparison...

In [18]:
# Function to create comparison DataFrame
def create_throughput_comparison(df1, df2, label1='Model1', label2='Model2'):
    """
    Compare total_token_throughput_tok_s between two DataFrames and calculate percentage difference.

    Parameters:
    - df1 (pd.DataFrame): First DataFrame containing 'num_prompt' and 'total_token_throughput_tok_s'.
    - df2 (pd.DataFrame): Second DataFrame containing 'num_prompt' and 'total_token_throughput_tok_s'.
    - label1 (str): Label for the first model.
    - label2 (str): Label for the second model.

    Returns:
    - pd.DataFrame: Comparison DataFrame with percentage differences.
    """
    # Ensure both DataFrames have 'num_prompt' and 'total_token_throughput_tok_s'
    if 'num_prompt' not in df1.columns or 'total_token_throughput_tok_s' not in df1.columns:
        raise ValueError(f"df1 must contain 'num_prompt' and 'total_token_throughput_tok_s' columns.")
    if 'num_prompt' not in df2.columns or 'total_token_throughput_tok_s' not in df2.columns:
        raise ValueError(f"df2 must contain 'num_prompt' and 'total_token_throughput_tok_s' columns.")

    # Merge the DataFrames on 'num_prompt'
    merged_df = pd.merge(
        df1[['num_prompt', 'total_token_throughput_tok_s']],
        df2[['num_prompt', 'total_token_throughput_tok_s']],
        on='num_prompt',
        how='inner',
        suffixes=(f'_{label1}', f'_{label2}')
    )

    # Calculate percentage difference
    epsilon = 1e-10  # To prevent division by zero
    merged_df['percentage_diff'] = ((merged_df[f'total_token_throughput_tok_s_{label2}'] - merged_df[f'total_token_throughput_tok_s_{label1}']) / (merged_df[f'total_token_throughput_tok_s_{label1}'] + epsilon)) * 100

    # Round percentage_diff to two decimal places
    merged_df['percentage_diff'] = merged_df['percentage_diff'].round(2)

    # Rename columns for clarity
    final_df = merged_df.rename(columns={
        f'total_token_throughput_tok_s_{label1}': f'{label1}_total_throughput',
        f'total_token_throughput_tok_s_{label2}': f'{label2}_total_throughput'
    })[['num_prompt', f'{label1}_total_throughput', f'{label2}_total_throughput', 'percentage_diff']]

    return final_df


comparison_df_myvllm_dsvllm = create_throughput_comparison(
    df_dsvllm,
    df_myvllm,
    label1='dsvllm',
    label2='myvllm'
)

comparison_df_myvllm_dstgi = create_throughput_comparison(
    df_myvllm,
    df_dstgi,
    label1='myvllm',
    label2='dstgi'
)

comparison_df_dsvllm_dstgi = create_throughput_comparison(
    df_dsvllm,
    df_dstgi,
    label1='dsvllm',
    label2='dstgi'
)

# Display the comparison DataFrames
print("Comparison between my vLLM and dstack vLLM")
display(comparison_df_myvllm_dsvllm)

print("Comparison between my vLLM and dstack TGI")
display(comparison_df_myvllm_dstgi)

print("Original dstack vLLM vs TGI (TGI avg 90% faster):")
display(comparison_df_dsvllm_dstgi)


Comparison between my vLLM and dstack vLLM


| num_prompt | dsvllm_total_throughput | myvllm_total_throughput | percentage_diff |
|---:|---:|---:|---:|
| 1 | 36.4 | 40.87 | 12.28 |
| 2 | 61.65 | 80.36 | 30.35 |
| 4 | 112.27 | 145.35 | 29.46 |
| 8 | 198.89 | 254.69 | 28.06 |
| 16 | 337.26 | 460.1 | 36.42 |
| 32 | 524.06 | 789.71 | 50.69 |
| 64 | 460.56 | 1278.56 | 177.61 |
| 128 | 645.88 | 1795.28 | 177.96 |
| 256 | 814.31 | 2147.43 | 163.71 |
| 512 | 955.75 | 2572.95 | 169.21 |


Comparison between my vLLM and dstack TGI


| num_prompt | myvllm_total_throughput | dstgi_total_throughput | percentage_diff |
|---:|---:|---:|---:|
| 1 | 40.87 | 53.52 | 30.95 |
| 2 | 80.36 | 60.8 | -24.34 |
| 4 | 145.35 | 143.13 | -1.53 |
| 8 | 254.69 | 266.97 | 4.82 |
| 16 | 460.1 | 422.14 | -8.25 |
| 32 | 789.71 | 528.31 | -33.1 |
| 64 | 1278.56 | 884.41 | -30.83 |
| 128 | 1795.28 | 1226.4 | -31.69 |
| 256 | 2147.43 | 1612.85 | -24.89 |
| 512 | 2572.95 | 1850.8 | -28.07 |


Original dstack vLLM vs TGI (TGI avg 90% faster):


| num_prompt | dsvllm_total_throughput | dstgi_total_throughput | percentage_diff |
|---:|---:|---:|---:|
| 1 | 36.4 | 53.52 | 47.03 |
| 2 | 61.65 | 60.8 | -1.38 |
| 4 | 112.27 | 143.13 | 27.49 |
| 8 | 198.89 | 266.97 | 34.23 |
| 16 | 337.26 | 422.14 | 25.17 |
| 32 | 524.06 | 528.31 | 0.81 |
| 64 | 460.56 | 884.41 | 92.03 |
| 128 | 645.88 | 1226.4 | 89.88 |
| 256 | 814.31 | 1612.85 | 98.06 |
| 512 | 955.75 | 1850.8 | 93.65 |


## Throughput

The tuned vLLM now delivers roughly 165-180% higher total token throughput than dstack's vLLM run at bs=64 and above.

Compared to my tuned vLLM, dstack's TGI numbers are now about 25-33% lower for bs=32 and above.
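
As a sanity check on those two figures, the percentage differences can be recomputed by hand from the total token throughput columns in the tables above. A small sketch in plain Python, with values copied from the tables (`pct_diff` and `mean_pct_diff` are illustrative helper names):

```python
# Total token throughput (tok/s) by num_prompt, copied from the tables above.
dsvllm = {32: 524.06, 64: 460.56, 128: 645.88, 256: 814.31, 512: 955.75}
myvllm = {32: 789.71, 64: 1278.56, 128: 1795.28, 256: 2147.43, 512: 2572.95}
dstgi = {32: 528.31, 64: 884.41, 128: 1226.4, 256: 1612.85, 512: 1850.8}

def pct_diff(base, other):
    """Percentage difference of `other` relative to `base`."""
    return (other - base) / base * 100

def mean_pct_diff(base, other, min_prompts):
    """Average pct_diff over all shared batch sizes >= min_prompts."""
    sizes = [n for n in base if n in other and n >= min_prompts]
    return sum(pct_diff(base[n], other[n]) for n in sizes) / len(sizes)

vllm_gain = mean_pct_diff(dsvllm, myvllm, min_prompts=64)
tgi_gap = mean_pct_diff(myvllm, dstgi, min_prompts=32)
print(f"my vLLM vs dstack vLLM (>=64 prompts): {vllm_gain:+.1f}%")
print(f"dstack TGI vs my vLLM (>=32 prompts): {tgi_gap:+.1f}%")
```

The averages come out at roughly +172% and -30%, in line with the per-row numbers.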

## TTFT
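This seems to be mirrored in TTFT, although there's a lot more noise: at bs=64 and above, dstack's TGI mean TTFT is about 14-35% higher (slower) than my tuned vLLM's, while at most of the smaller batch sizes TGI still gets its first token out sooner...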
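
Because TTFT is a latency, the sign of `percentage_diff` reads the opposite way from throughput: a positive value means the second system took longer to produce its first token. A tiny sketch that makes the direction explicit, using two rows of mean TTFT from the tables below (the `verdict` helper is hypothetical):

```python
def pct_diff(base, other):
    """Percentage difference of `other` relative to `base`."""
    return (other - base) / base * 100

def verdict(name, base, other, lower_is_better=True):
    """Format the difference and say whether `other` is better or worse."""
    diff = pct_diff(base, other)
    better = diff < 0 if lower_is_better else diff > 0
    word = "better" if better else "worse"
    return f"{name}: {diff:+.2f}% ({word})"

# Mean TTFT (ms): my vLLM as the baseline, dstack TGI as the comparison.
print(verdict("TGI TTFT at 1 prompt", 421.32, 123.48))
print(verdict("TGI TTFT at 512 prompts", 11623.41, 14086.51))
```

For throughput columns the same helper would be called with `lower_is_better=False`.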

In [19]:
# Function to create TTFT comparison DataFrame
def create_ttft_comparison(df1, df2, label1='Model1', label2='Model2'):
    """
    Compare mean_ttft_ms between two DataFrames and calculate percentage difference.

    Parameters:
    - df1 (pd.DataFrame): First DataFrame containing 'num_prompt' and 'mean_ttft_ms'.
    - df2 (pd.DataFrame): Second DataFrame containing 'num_prompt' and 'mean_ttft_ms'.
    - label1 (str): Label for the first model.
    - label2 (str): Label for the second model.

    Returns:
    - pd.DataFrame: Comparison DataFrame with percentage differences.
    """
    # Ensure both DataFrames have 'num_prompt' and 'mean_ttft_ms'
    if 'num_prompt' not in df1.columns or 'mean_ttft_ms' not in df1.columns:
        raise ValueError(f"df1 must contain 'num_prompt' and 'mean_ttft_ms' columns.")
    if 'num_prompt' not in df2.columns or 'mean_ttft_ms' not in df2.columns:
        raise ValueError(f"df2 must contain 'num_prompt' and 'mean_ttft_ms' columns.")

    # Merge the DataFrames on 'num_prompt'
    merged_df = pd.merge(
        df1[['num_prompt', 'mean_ttft_ms']],
        df2[['num_prompt', 'mean_ttft_ms']],
        on='num_prompt',
        how='inner',
        suffixes=(f'_{label1}', f'_{label2}')
    )

    # Calculate percentage difference
    epsilon = 1e-10  # To prevent division by zero
    merged_df['percentage_diff'] = (
        (merged_df[f'mean_ttft_ms_{label2}'] - merged_df[f'mean_ttft_ms_{label1}']) 
        / (merged_df[f'mean_ttft_ms_{label1}'] + epsilon)
    ) * 100

    # Round percentage_diff to two decimal places
    merged_df['percentage_diff'] = merged_df['percentage_diff'].round(2)

    # Rename columns for clarity
    final_df = merged_df.rename(columns={
        f'mean_ttft_ms_{label1}': f'{label1}_mean_ttft_ms',
        f'mean_ttft_ms_{label2}': f'{label2}_mean_ttft_ms'
    })[['num_prompt', f'{label1}_mean_ttft_ms', f'{label2}_mean_ttft_ms', 'percentage_diff']]

    return final_df


comparison_df_myvllm_dsvllm = create_ttft_comparison(
    df_dsvllm,
    df_myvllm,
    label1='dsvllm',
    label2='myvllm'
)

comparison_df_myvllm_dstgi = create_ttft_comparison(
    df_myvllm,
    df_dstgi,
    label1='myvllm',
    label2='dstgi'
)

comparison_df_dsvllm_dstgi = create_ttft_comparison(
    df_dsvllm,
    df_dstgi,
    label1='dsvllm',
    label2='dstgi'
)

# Display the comparison DataFrames
print("Comparison between my vLLM  and dstack vLLM (mine avg 50% faster)")
display(comparison_df_myvllm_dsvllm)

print("Comparison between my vLLM and dstack TGI (TGI avg 30% faster)")
display(comparison_df_myvllm_dstgi)

print("Original dstack vLLM vs TGI (TGI avg 90% faster):")
display(comparison_df_dsvllm_dstgi)


Comparison between my vLLM  and dstack vLLM (mine avg 50% faster)


| num_prompt | dsvllm_mean_ttft_ms | myvllm_mean_ttft_ms | percentage_diff |
|---:|---:|---:|---:|
| 1 | 314.58 | 421.32 | 33.93 |
| 2 | 372.8 | 424.96 | 13.99 |
| 4 | 397.54 | 480.22 | 20.8 |
| 8 | 646.2 | 582.46 | -9.86 |
| 16 | 1034.69 | 723.83 | -30.04 |
| 32 | 1546.0 | 1165.58 | -24.61 |
| 64 | 2698.98 | 1776.87 | -34.17 |
| 128 | 5161.26 | 3282.96 | -36.39 |
| 256 | 11787.27 | 5163.32 | -56.2 |
| 512 | 27060.36 | 11623.41 | -57.05 |


Comparison between my vLLM and dstack TGI (TGI avg 30% faster)


| num_prompt | myvllm_mean_ttft_ms | dstgi_mean_ttft_ms | percentage_diff |
|---:|---:|---:|---:|
| 1 | 421.32 | 123.48 | -70.69 |
| 2 | 424.96 | 864.65 | 103.47 |
| 4 | 480.22 | 244.9 | -49.0 |
| 8 | 582.46 | 395.44 | -32.11 |
| 16 | 723.83 | 581.37 | -19.68 |
| 32 | 1165.58 | 2325.73 | 99.53 |
| 64 | 1776.87 | 2150.47 | 21.03 |
| 128 | 3282.96 | 3753.25 | 14.33 |
| 256 | 5163.32 | 6981.56 | 35.21 |
| 512 | 11623.41 | 14086.51 | 21.19 |


Original dstack vLLM vs TGI (TGI avg 90% faster):


| num_prompt | dsvllm_mean_ttft_ms | dstgi_mean_ttft_ms | percentage_diff |
|---:|---:|---:|---:|
| 1 | 314.58 | 123.48 | -60.75 |
| 2 | 372.8 | 864.65 | 131.93 |
| 4 | 397.54 | 244.9 | -38.4 |
| 8 | 646.2 | 395.44 | -38.81 |
| 16 | 1034.69 | 581.37 | -43.81 |
| 32 | 1546.0 | 2325.73 | 50.44 |
| 64 | 2698.98 | 2150.47 | -20.32 |
| 128 | 5161.26 | 3753.25 | -27.28 |
| 256 | 11787.27 | 6981.56 | -40.77 |
| 512 | 27060.36 | 14086.51 | -47.94 |


Compare to [dstack's results](https://github.com/dstackai/benchmarks/blob/main/amd/inference/raw_data/vllm_raw.txt):

```
vllm
num_prompts: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048
seq_length: 80
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  4.53      
Total input tokens:                      80        
Total generated tokens:                  85        
Request throughput (req/s):              0.22      
Output token throughput (tok/s):         18.75     
Total Token throughput (tok/s):          36.40     
---------------Time to First Token----------------
Mean TTFT (ms):                          314.58    
Median TTFT (ms):                        314.58    
P99 TTFT (ms):                           314.58    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          50.19     
Median TPOT (ms):                        50.19     
P99 TPOT (ms):                           50.19     
---------------Inter-token Latency----------------
Mean ITL (ms):                           49.13     
Median ITL (ms):                         49.66     
P99 ITL (ms):                            94.22     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     2         
Benchmark duration (s):                  4.66      
Total input tokens:                      160       
Total generated tokens:                  127       
Request throughput (req/s):              0.43      
Output token throughput (tok/s):         27.28     
Total Token throughput (tok/s):          61.65     
---------------Time to First Token----------------
Mean TTFT (ms):                          372.80    
Median TTFT (ms):                        372.80    
P99 TTFT (ms):                           373.27    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.59     
Median TPOT (ms):                        51.59     
P99 TPOT (ms):                           52.21     
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.38     
Median ITL (ms):                         49.87     
P99 ITL (ms):                            95.48     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     4         
Benchmark duration (s):                  5.42      
Total input tokens:                      320       
Total generated tokens:                  288       
Request throughput (req/s):              0.74      
Output token throughput (tok/s):         53.18     
Total Token throughput (tok/s):          112.27    
---------------Time to First Token----------------
Mean TTFT (ms):                          397.54    
Median TTFT (ms):                        397.41    
P99 TTFT (ms):                           398.43    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.18     
Median TPOT (ms):                        54.07     
P99 TPOT (ms):                           54.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           53.30     
Median ITL (ms):                         54.54     
P99 ITL (ms):                            106.68    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  5.65      
Total input tokens:                      640       
Total generated tokens:                  484       
Request throughput (req/s):              1.42      
Output token throughput (tok/s):         85.64     
Total Token throughput (tok/s):          198.89    
---------------Time to First Token----------------
Mean TTFT (ms):                          646.20    
Median TTFT (ms):                        596.84    
P99 TTFT (ms):                           796.26    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          55.48     
Median TPOT (ms):                        54.98     
P99 TPOT (ms):                           60.39     
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.28     
Median ITL (ms):                         52.68     
P99 ITL (ms):                            201.80    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     16        
Benchmark duration (s):                  6.57      
Total input tokens:                      1279      
Total generated tokens:                  936       
Request throughput (req/s):              2.44      
Output token throughput (tok/s):         142.52    
Total Token throughput (tok/s):          337.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          1034.69   
Median TTFT (ms):                        1207.28   
P99 TTFT (ms):                           1413.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.38     
Median TPOT (ms):                        60.83     
P99 TPOT (ms):                           79.37     
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.59     
Median ITL (ms):                         58.60     
P99 ITL (ms):                            203.24    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  8.49      
Total input tokens:                      2594      
Total generated tokens:                  1857      
Request throughput (req/s):              3.77      
Output token throughput (tok/s):         218.64    
Total Token throughput (tok/s):          524.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          1546.00   
Median TTFT (ms):                        1500.61   
P99 TTFT (ms):                           2506.87   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          87.10     
Median TPOT (ms):                        83.15     
P99 TPOT (ms):                           118.37    
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.45     
Median ITL (ms):                         82.86     
P99 ITL (ms):                            408.83    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     64        
Benchmark duration (s):                  19.61     
Total input tokens:                      5189      
Total generated tokens:                  3842      
Request throughput (req/s):              3.26      
Output token throughput (tok/s):         195.93    
Total Token throughput (tok/s):          460.56    
---------------Time to First Token----------------
Mean TTFT (ms):                          2698.98   
Median TTFT (ms):                        2811.65   
P99 TTFT (ms):                           4723.29   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          230.16    
Median TPOT (ms):                        237.99    
P99 TPOT (ms):                           293.12    
---------------Inter-token Latency----------------
Mean ITL (ms):                           218.27    
Median ITL (ms):                         202.67    
P99 ITL (ms):                            411.57    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     128       
Benchmark duration (s):                  27.70     
Total input tokens:                      10306     
Total generated tokens:                  7588      
Request throughput (req/s):              4.62      
Output token throughput (tok/s):         273.89    
Total Token throughput (tok/s):          645.88    
---------------Time to First Token----------------
Mean TTFT (ms):                          5161.26   
Median TTFT (ms):                        4910.31   
P99 TTFT (ms):                           10056.40  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          298.29    
Median TPOT (ms):                        299.27    
P99 TPOT (ms):                           380.18    
---------------Inter-token Latency----------------
Mean ITL (ms):                           286.41    
Median ITL (ms):                         303.66    
P99 ITL (ms):                            505.04    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  44.46     
Total input tokens:                      20829     
Total generated tokens:                  15376     
Request throughput (req/s):              5.76      
Output token throughput (tok/s):         345.83    
Total Token throughput (tok/s):          814.31    
---------------Time to First Token----------------
Mean TTFT (ms):                          11787.27  
Median TTFT (ms):                        10570.10  
P99 TTFT (ms):                           25591.50  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          389.49    
Median TPOT (ms):                        402.13    
P99 TPOT (ms):                           499.46    
---------------Inter-token Latency----------------
Mean ITL (ms):                           373.87    
Median ITL (ms):                         403.86    
P99 ITL (ms):                            808.29    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  75.58     
Total input tokens:                      41838     
Total generated tokens:                  30400     
Request throughput (req/s):              6.77      
Output token throughput (tok/s):         402.21    
Total Token throughput (tok/s):          955.75    
---------------Time to First Token----------------
Mean TTFT (ms):                          27060.36  
Median TTFT (ms):                        25908.71  
P99 TTFT (ms):                           57181.18  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          419.23    
Median TPOT (ms):                        452.64    
P99 TPOT (ms):                           474.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           406.60    
Median ITL (ms):                         406.30    
P99 ITL (ms):                            731.78    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  138.24    
Total input tokens:                      83423     
Total generated tokens:                  60121     
Request throughput (req/s):              7.41      
Output token throughput (tok/s):         434.89    
Total Token throughput (tok/s):          1038.34   
---------------Time to First Token----------------
Mean TTFT (ms):                          58452.87  
Median TTFT (ms):                        58111.01  
P99 TTFT (ms):                           119186.77 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          435.60    
Median TPOT (ms):                        450.38    
P99 TPOT (ms):                           475.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           425.01    
Median ITL (ms):                         406.39    
P99 ITL (ms):                            688.60    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  272.95    
Total input tokens:                      166647    
Total generated tokens:                  119667    
Request throughput (req/s):              7.50      
Output token throughput (tok/s):         438.42    
Total Token throughput (tok/s):          1048.96   
---------------Time to First Token----------------
Mean TTFT (ms):                          128150.28 
Median TTFT (ms):                        129063.54 
P99 TTFT (ms):                           253434.71 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          461.09    
Median TPOT (ms):                        470.08    
P99 TPOT (ms):                           497.82    
---------------Inter-token Latency----------------
Mean ITL (ms):                           451.17    
Median ITL (ms):                         430.56    
P99 ITL (ms):                            794.46    
==================================================
```

And here are their TGI results for the same `num_prompts` sweep:
```
tgi 

num_prompts: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048
seq_length: 80
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  2.56      
Total input tokens:                      80        
Total generated tokens:                  57        
Request throughput (req/s):              0.39      
Output token throughput (tok/s):         22.27     
Total Token throughput (tok/s):          53.52     
---------------Time to First Token----------------
Mean TTFT (ms):                          123.48    
Median TTFT (ms):                        123.48    
P99 TTFT (ms):                           123.48    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.48     
Median TPOT (ms):                        43.48     
P99 TPOT (ms):                           43.48     
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.71     
Median ITL (ms):                         42.66     
P99 ITL (ms):                            43.68     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     2         
Benchmark duration (s):                  4.67      
Total input tokens:                      159       
Total generated tokens:                  125       
Request throughput (req/s):              0.43      
Output token throughput (tok/s):         26.76     
Total Token throughput (tok/s):          60.80     
---------------Time to First Token----------------
Mean TTFT (ms):                          864.65    
Median TTFT (ms):                        864.65    
P99 TTFT (ms):                           1589.18   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          59.22     
Median TPOT (ms):                        59.22     
P99 TPOT (ms):                           72.42     
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.08     
Median ITL (ms):                         45.22     
P99 ITL (ms):                            51.38     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     4         
Benchmark duration (s):                  3.70      
Total input tokens:                      319       
Total generated tokens:                  210       
Request throughput (req/s):              1.08      
Output token throughput (tok/s):         56.82     
Total Token throughput (tok/s):          143.13    
---------------Time to First Token----------------
Mean TTFT (ms):                          244.90    
Median TTFT (ms):                        285.64    
P99 TTFT (ms):                           285.70    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.01     
Median TPOT (ms):                        52.97     
P99 TPOT (ms):                           55.16     
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.78     
Median ITL (ms):                         51.06     
P99 ITL (ms):                            67.05     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  3.91      
Total input tokens:                      639       
Total generated tokens:                  404       
Request throughput (req/s):              2.05      
Output token throughput (tok/s):         103.41    
Total Token throughput (tok/s):          266.97    
---------------Time to First Token----------------
Mean TTFT (ms):                          395.44    
Median TTFT (ms):                        432.47    
P99 TTFT (ms):                           433.75    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.17     
Median TPOT (ms):                        53.71     
P99 TPOT (ms):                           58.47     
---------------Inter-token Latency----------------
Mean ITL (ms):                           53.03     
Median ITL (ms):                         52.51     
P99 ITL (ms):                            54.65     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     16        
Benchmark duration (s):                  4.97      
Total input tokens:                      1276      
Total generated tokens:                  823       
Request throughput (req/s):              3.22      
Output token throughput (tok/s):         165.52    
Total Token throughput (tok/s):          422.14    
---------------Time to First Token----------------
Mean TTFT (ms):                          581.37    
Median TTFT (ms):                        610.26    
P99 TTFT (ms):                           614.75    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          60.53     
Median TPOT (ms):                        60.85     
P99 TPOT (ms):                           64.99     
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.06     
Median ITL (ms):                         59.35     
P99 ITL (ms):                            67.93     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  8.09      
Total input tokens:                      2590      
Total generated tokens:                  1685      
Request throughput (req/s):              3.95      
Output token throughput (tok/s):         208.23    
Total Token throughput (tok/s):          528.31    
---------------Time to First Token----------------
Mean TTFT (ms):                          2325.73   
Median TTFT (ms):                        2396.74   
P99 TTFT (ms):                           2398.41   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.40     
Median TPOT (ms):                        75.16     
P99 TPOT (ms):                           137.68    
---------------Inter-token Latency----------------
Mean ITL (ms):                           72.63     
Median ITL (ms):                         73.57     
P99 ITL (ms):                            83.79     
==================================================

============ Serving Benchmark Result ============
Successful requests:                     64        
Benchmark duration (s):                  10.03     
Total input tokens:                      5255      
Total generated tokens:                  3615      
Request throughput (req/s):              6.38      
Output token throughput (tok/s):         360.44    
Total Token throughput (tok/s):          884.41    
---------------Time to First Token----------------
Mean TTFT (ms):                          2150.47   
Median TTFT (ms):                        2181.28   
P99 TTFT (ms):                           2190.51   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          101.40    
Median TPOT (ms):                        102.73    
P99 TPOT (ms):                           120.50    
---------------Inter-token Latency----------------
Mean ITL (ms):                           97.98     
Median ITL (ms):                         99.58     
P99 ITL (ms):                            125.90    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     128       
Benchmark duration (s):                  14.45     
Total input tokens:                      10480     
Total generated tokens:                  7236      
Request throughput (req/s):              8.86      
Output token throughput (tok/s):         500.92    
Total Token throughput (tok/s):          1226.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          3753.25   
Median TTFT (ms):                        3781.18   
P99 TTFT (ms):                           3799.51   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          133.02    
Median TPOT (ms):                        134.98    
P99 TPOT (ms):                           141.72    
---------------Inter-token Latency----------------
Mean ITL (ms):                           128.25    
Median ITL (ms):                         132.54    
P99 ITL (ms):                            159.03    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  22.07     
Total input tokens:                      20857     
Total generated tokens:                  14739     
Request throughput (req/s):              11.60     
Output token throughput (tok/s):         667.82    
Total Token throughput (tok/s):          1612.85   
---------------Time to First Token----------------
Mean TTFT (ms):                          6981.56   
Median TTFT (ms):                        7005.91   
P99 TTFT (ms):                           7041.34   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          194.79    
Median TPOT (ms):                        197.87    
P99 TPOT (ms):                           210.77    
---------------Inter-token Latency----------------
Mean ITL (ms):                           187.10    
Median ITL (ms):                         199.32    
P99 ITL (ms):                            216.95    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  38.60     
Total input tokens:                      41795     
Total generated tokens:                  29639     
Request throughput (req/s):              13.27     
Output token throughput (tok/s):         767.92    
Total Token throughput (tok/s):          1850.80   
---------------Time to First Token----------------
Mean TTFT (ms):                          14086.51  
Median TTFT (ms):                        14109.33  
P99 TTFT (ms):                           14161.82  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          322.78    
Median TPOT (ms):                        330.42    
P99 TPOT (ms):                           355.85    
---------------Inter-token Latency----------------
Mean ITL (ms):                           308.13    
Median ITL (ms):                         336.09    
P99 ITL (ms):                            369.31    
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1024      
Benchmark duration (s):                  71.75     
Total input tokens:                      83707     
Total generated tokens:                  59207     
Request throughput (req/s):              14.27     
Output token throughput (tok/s):         825.20    
Total Token throughput (tok/s):          1991.88   
---------------Time to First Token----------------
Mean TTFT (ms):                          25545.29  
Median TTFT (ms):                        22280.29  
P99 TTFT (ms):                           49871.32  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          625.33    
Median TPOT (ms):                        643.94    
P99 TPOT (ms):                           796.86    
---------------Inter-token Latency----------------
Mean ITL (ms):                           590.60    
Median ITL (ms):                         521.93    
P99 ITL (ms):                            5750.15   
==================================================

============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  138.63    
Total input tokens:                      166906    
Total generated tokens:                  118221    
Request throughput (req/s):              14.77     
Output token throughput (tok/s):         852.79    
Total Token throughput (tok/s):          2056.77   
---------------Time to First Token----------------
Mean TTFT (ms):                          55778.92  
Median TTFT (ms):                        52655.73  
P99 TTFT (ms):                           110957.99 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          767.29    
Median TPOT (ms):                        803.02    
P99 TPOT (ms):                           1042.70   
---------------Inter-token Latency----------------
Mean ITL (ms):                           748.77    
Median ITL (ms):                         552.11    
P99 ITL (ms):                            7360.92   
==================================================
```
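
Eyeballing two walls of logs is error-prone, so here is a minimal sketch of how the two dumps could be lined up by request count. It assumes the block just before the TGI results is dstack's vLLM run. The helper names (`parse_benchmark_log`, `throughput_ratios`) are my own and not part of `benchmark_serving.py`, and the sample strings are just a few lines copied from the 32 and 2048 request runs above:

```python
import re

# Each run in the dumps starts with this banner line.
BANNER = re.compile(r"=+ Serving Benchmark Result =+")
# Metric lines look like "Name (unit):     123.45".
METRIC = re.compile(r"^(.+?):\s+([\d.]+)\s*$")


def parse_benchmark_log(text):
    """Return {successful_requests: {metric_name: value}} for each run."""
    results = {}
    for block in BANNER.split(text)[1:]:  # drop anything before the first banner
        metrics = {}
        for line in block.splitlines():
            match = METRIC.match(line)
            if match:
                metrics[match.group(1).strip()] = float(match.group(2))
        if "Successful requests" in metrics:
            results[int(metrics["Successful requests"])] = metrics
    return results


def throughput_ratios(vllm_log, tgi_log,
                      metric="Output token throughput (tok/s)"):
    """TGI / vLLM ratio of `metric` for every request count present in both."""
    vllm = parse_benchmark_log(vllm_log)
    tgi = parse_benchmark_log(tgi_log)
    return {n: tgi[n][metric] / vllm[n][metric]
            for n in sorted(vllm.keys() & tgi.keys())}


# Abridged samples copied from the logs above (assumed vLLM first, then TGI).
VLLM_SAMPLE = """
============ Serving Benchmark Result ============
Successful requests:                     32
Output token throughput (tok/s):         218.64
Mean TTFT (ms):                          1546.00
==================================================

============ Serving Benchmark Result ============
Successful requests:                     2048
Output token throughput (tok/s):         438.42
Mean TTFT (ms):                          128150.28
==================================================
"""

TGI_SAMPLE = """
============ Serving Benchmark Result ============
Successful requests:                     32
Output token throughput (tok/s):         208.23
Mean TTFT (ms):                          2325.73
==================================================

============ Serving Benchmark Result ============
Successful requests:                     2048
Output token throughput (tok/s):         852.79
Mean TTFT (ms):                          55778.92
==================================================
"""

ratios = throughput_ratios(VLLM_SAMPLE, TGI_SAMPLE)
for n, ratio in ratios.items():
    print(f"{n:>5} requests: TGI/vLLM output throughput = {ratio:.2f}x")
```

On these two samples the ratio comes out to roughly 0.95x at 32 requests and roughly 1.95x at 2048, which matches the "about 2X" gap dstack reported at high concurrency. Pasting the full log text in place of the samples would give the ratio for every request count.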