# vLLM: GEMM Tuning

## File Handles and `-tp 8`
When using hipBLASLt (which is the default for ROCm with PyTorch 2.4+), vLLM has problems loading above `-tp 4` because it exhausts the available file handles. You can read more about it here: https://github.com/pytorch/pytorch/issues/137695

It can be solved by raising the open-file limit:

In [1]:
# Increase File handles
!ulimit -n 131072
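
One caveat: in Jupyter, `!` runs the command in a subshell, so a `ulimit` set this way may not carry over to the kernel or to later `!` cells. A minimal sketch of raising the limit from inside the kernel with Python's standard `resource` module, so that processes launched afterwards inherit it (the helper name `raise_nofile_limit` is made up for illustration):

```python
import resource


def raise_nofile_limit(target=131072):
    """Raise this process's soft open-file limit toward `target`; never lower it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot go above the hard limit.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # some platforms cap the value lower; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_nofile_limit()
print(f"open-file limit: soft={soft}, hard={hard}")
```

If the hard limit itself is below the target, it has to be raised outside the notebook (for example in the shell that starts Jupyter).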

## Environment
For replicability, here are the versions used and some of the more relevant system information using the `vllm/collect_env.py` tool.

In [2]:
!python vllm/collect_env.py

Collecting environment information...
PyTorch version: 2.6.0.dev20241015+rocm6.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.2.41133-dd7f95766

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35

Python version: 3.11.10 | packaged by conda-forge | (main, Sep 30 2024, 18:08:57) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.2.41133
MIOpen runtime version: 3.2.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                  

In [3]:
!pip install nbformat

DEPRECATION: Loading egg at /mnt/nvme1n1p1/miniforge3/envs/vllm/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /mnt/nvme1n1p1/miniforge3/envs/vllm/lib/python3.11/site-packages/vllm-0.6.4.dev9+g5d264f4a.rocm624-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330

In [10]:
import pandas as pd
import re
import nbformat

# PyTorch GEMM Tuning
PyTorch has a new feature called [TunableOp](https://pytorch.org/docs/main/cuda.tunable.html) which allows GEMM tuning.
I'm following the ROCm blog's [guide to running GEMM tuning](https://rocm.blogs.amd.com/artificial-intelligence/vllm-optimize/README.html#gemm-tuning). Here is a replication of their example:

In [5]:
# With PyTorch GEMM Tuning
!VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 time python3 vllm/benchmarks/benchmark_latency.py --input-len 512 --output-len 512 --num-iters 10 --model meta-llama/Meta-Llama-3-8B-Instruct

Namespace(model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=512, batch_size=8, n=1, use_beam_search=False, num_iters_warmup=10, num_iters=10, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='auto', block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=True, ray_workers_use_nsight=False, download_dir=None, output_json=None, gpu_memory_utilization=0.9, load_format='auto', distributed_executor_backend=None, otlp_traces_endpoint=None)
INFO 10-28 17:18:37 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-28 17:18:37 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev9+g5d264f4a) with confi

In [7]:
# Run again to see if we skip tuning time
!VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 time python3 vllm/benchmarks/benchmark_latency.py --input-len 512 --output-len 512 --num-iters 10 --model meta-llama/Meta-Llama-3-8B-Instruct

Namespace(model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=512, batch_size=8, n=1, use_beam_search=False, num_iters_warmup=10, num_iters=10, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='auto', block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=True, ray_workers_use_nsight=False, download_dir=None, output_json=None, gpu_memory_utilization=0.9, load_format='auto', distributed_executor_backend=None, otlp_traces_endpoint=None)
INFO 10-28 17:38:53 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-28 17:38:53 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev9+g5d264f4a) with confi
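
The second run can only skip tuning if the first run saved its results. A quick way to check, sketched under an assumption: as far as I know, TunableOp writes its selected solutions to CSV files in the working directory, named `tunableop_results<N>.csv` by default and overridable with `PYTORCH_TUNABLEOP_FILENAME`. Treat that naming as an assumption and adjust the pattern if your build differs. The helper name `find_tunableop_results` is made up for this sketch.

```python
from pathlib import Path


def find_tunableop_results(directory="."):
    """List TunableOp result files in `directory` (assumed default naming)."""
    # Assumption: results are written as tunableop_results<device>.csv unless
    # PYTORCH_TUNABLEOP_FILENAME points somewhere else.
    return sorted(Path(directory).glob("tunableop_results*.csv"))


for path in find_tunableop_results():
    # Count lines only; the exact CSV layout is not assumed here.
    with path.open() as fh:
        n_lines = sum(1 for _ in fh)
    print(f"{path.name}: {n_lines} lines")
```

If nothing is listed after a tuned run, the next run will most likely tune from scratch again.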

In [6]:
# Without PyTorch GEMM Tuning (add ROCR_VISIBLE_DEVICES=7 to pin this to a separate GPU if you want to run it at the same time as the tuned run)
!VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED=0 time python3 vllm/benchmarks/benchmark_latency.py --input-len 512 --output-len 512 --num-iters 10 --model meta-llama/Meta-Llama-3-8B-Instruct

Namespace(model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=512, batch_size=8, n=1, use_beam_search=False, num_iters_warmup=10, num_iters=10, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='auto', block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=True, ray_workers_use_nsight=False, download_dir=None, output_json=None, gpu_memory_utilization=0.9, load_format='auto', distributed_executor_backend=None, otlp_traces_endpoint=None)
INFO 10-28 17:35:56 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-28 17:35:56 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev9+g5d264f4a) with confi

# GEMM Tuning Replication Results
Our raw latencies are a bit higher than theirs (avg 4.669 s tuned vs. 4.935 s untuned, 5.4% lower latency, against their 4.30 s vs. 4.60 s, 6.5% lower latency), but the latency decrease from GEMM tuning is directionally similar and in the same ballpark.

Replication is good!
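
For reference, the percentages quoted above are the latency reduction relative to the untuned run. A small sketch of that arithmetic (the function name is just for illustration):

```python
def latency_reduction_pct(untuned_s, tuned_s):
    """Percent by which tuning lowers latency, relative to the untuned run."""
    return (untuned_s - tuned_s) / untuned_s * 100


ours = latency_reduction_pct(untuned_s=4.935, tuned_s=4.669)
theirs = latency_reduction_pct(untuned_s=4.60, tuned_s=4.30)
print(f"ours: {ours:.1f}%  theirs: {theirs:.1f}%")
# ours: 5.4%  theirs: 6.5%
```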

Now let's see how this affects benchmark throughput...

In [2]:
!VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 time python vllm/benchmarks/benchmark_throughput.py --backend vllm --input-len 0 --output-len 128 --model meta-llama/Llama-3.1-8B-Instruct -tp 8

Namespace(backend='vllm', dataset=None, input_len=0, output_len=128, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=1, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-29 03:04:56 config.py:887] Defaulting to use mp for distributed inference
INFO 10-29 03:04:56 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-29 03:04:56 llm_engine.py:237] Initializin

In [5]:
# Run again to see what it's like after compile...
!VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 time python vllm/benchmarks/benchmark_throughput.py --backend vllm --input-len 0 --output-len 128 --model meta-llama/Llama-3.1-8B-Instruct -tp 8

Namespace(backend='vllm', dataset=None, input_len=0, output_len=128, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=1, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-29 03:18:18 config.py:887] Defaulting to use mp for distributed inference
INFO 10-29 03:18:18 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-29 03:18:18 llm_engine.py:237] Initializin

In [3]:
!VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED=0 time python vllm/benchmarks/benchmark_throughput.py --backend vllm --input-len 0 --output-len 128 --model meta-llama/Llama-3.1-8B-Instruct -tp 8

Namespace(backend='vllm', dataset=None, input_len=0, output_len=128, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=1, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-29 03:09:06 config.py:887] Defaulting to use mp for distributed inference
INFO 10-29 03:09:06 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-29 03:09:06 llm_engine.py:237] Initializin

In [17]:
import re
import time

import pandas as pd


def benchmark_model(model, input_len, output_len, tp):
    """Run the throughput benchmark with and without TunableOp and tabulate both results."""
    
    # Initialize the DataFrame
    df = pd.DataFrame(columns=['Tuning', 'Requests per Second', 'Tokens per Second'])
    
    # Function to run the benchmark command and capture output
    def run_benchmark(tuning):
        # Toggle PyTorch TunableOp through its environment variable
        tunableop = 1 if tuning == 'pytorch' else 0
        command = f"VLLM_USE_TRITON_FLASH_ATTN=0 PYTORCH_TUNABLEOP_ENABLED={tunableop} time python vllm/benchmarks/benchmark_throughput.py --backend vllm --input-len {input_len} --output-len {output_len} --model {model} -tp {tp}"
        # Run the command and capture the output
        start = time.time()
        output = get_ipython().getoutput(command)
        end = time.time()
        output_str = ' '.join(output)
        print(f"  {tuning} Run time: {end-start:.2f} seconds")
        # Use regular expressions to extract the throughput values
        matches = re.findall(r"Throughput:\s*([\d.]+)\s*requests/s,\s*([\d.]+)\s*tokens/s", output_str)
        if matches:
            requests_per_sec, tokens_per_sec = map(float, matches[0])
            return requests_per_sec, tokens_per_sec
        else:
            print(f"No throughput data found for {tuning} tuning.")
            return None, None
        

    # Run benchmarks for no GEMM Tuning
    none_rps, none_tps = run_benchmark(tuning='none')
    if none_rps is None or none_tps is None:
        print("Benchmark failed for no GEMM tuning.")
        return None

    # Append No GEMM Tuning results to the DataFrame
    df.loc[len(df)] = {'Tuning': 'none', 'Requests per Second': none_rps, 'Tokens per Second': none_tps}

    # Run benchmarks for Pytorch GEMM Tuning (tunable ops)
    pytorch_rps, pytorch_tps = run_benchmark(tuning='pytorch')
    if pytorch_rps is None or pytorch_tps is None:
        print("Benchmark failed for Pytorch TunableOp GEMM tuning.")
        return None

    # Append PyTorch TunableOp results to the DataFrame
    df.loc[len(df)] = {'Tuning': 'pytorch', 'Requests per Second': pytorch_rps, 'Tokens per Second': pytorch_tps}

    # Calculate percentage differences (None is baseline)
    percent_diff_rps = ((pytorch_rps - none_rps) / none_rps) * 100
    percent_diff_tps = ((pytorch_tps - none_tps) / none_tps) * 100
    avg_percent_diff = (percent_diff_rps + percent_diff_tps) / 2

    # Add percentage differences to the DataFrame
    df['% Difference RPS'] = [0, percent_diff_rps]
    df['% Difference TPS'] = [0, percent_diff_tps]
    df['% Difference Avg'] = [0, avg_percent_diff]

    # Display the DataFrame
    print(df)
    return df
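
The parsing step in `benchmark_model` can be checked without launching a benchmark. Below is a minimal sketch that reuses the same regular expression on a hand-written sample; the sample lines are hypothetical stand-ins for the real captured output, and `parse_throughput` is an illustrative name, not part of vLLM.

```python
import re

# Same pattern as in benchmark_model above.
THROUGHPUT_RE = re.compile(
    r"Throughput:\s*([\d.]+)\s*requests/s,\s*([\d.]+)\s*tokens/s"
)


def parse_throughput(output_lines):
    """Return (requests/s, tokens/s) from captured output lines, or None."""
    match = THROUGHPUT_RE.search(" ".join(output_lines))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical captured output; only the "Throughput:" line matters.
sample = [
    "INFO ... some log line",
    "Throughput: 92.71 requests/s, 11866.28 tokens/s",
]
print(parse_throughput(sample))        # (92.71, 11866.28)
print(parse_throughput(["no match"]))  # None
```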

In [18]:
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 256, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 512, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 1024, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 2048, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 4096, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 128, 'tp': 8}
  none Run time: 75.02 seconds
  pytorch Run time: 213.28 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                92.71           11866.28          0.000000   
1  pytorch                55.85            7148.72        -39.758386   

   % Difference TPS  % Difference Avg  
0          0.000000            0.0000  
1        -39.756015          -39.7572  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 256, 'tp': 8}
  none Run time: 85.67 seconds
  pytorch Run time: 225.29 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                44.84           11480.29          0.000000   
1  pytorch                34.30            8781.02        -23.505798   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1        -23.512211        -23.509005  
{'model': 'met

| | Tuning | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | none | 92.71 | 11866.28 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=128, tp=8 |
| 1 | pytorch | 55.85 | 7148.72 | -39.758386 | -39.756015 | -39.7572 | input_len=0, output_len=128, tp=8 |
| 2 | none | 44.84 | 11480.29 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=256, tp=8 |
| 3 | pytorch | 34.3 | 8781.02 | -23.505798 | -23.512211 | -23.509005 | input_len=0, output_len=256, tp=8 |
| 4 | none | 22.6 | 11570.4 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=512, tp=8 |
| 5 | pytorch | 19.74 | 10106.81 | -12.654867 | -12.649433 | -12.65215 | input_len=0, output_len=512, tp=8 |
| 6 | none | 11.03 | 11299.32 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=1024, tp=8 |
| 7 | pytorch | 10.3 | 10548.75 | -6.618314 | -6.642612 | -6.630463 | input_len=0, output_len=1024, tp=8 |
| 8 | none | 5.26 | 10779.3 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=2048, tp=8 |
| 9 | pytorch | 5.17 | 10588.03 | -1.711027 | -1.774419 | -1.742723 | input_len=0, output_len=2048, tp=8 |
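
The sweep cells in this notebook repeat the same config list and loop with different lengths. One way to cut the repetition is to generate the dicts; this is only a sketch, and `make_configs` and `config_label` are hypothetical helpers, not part of the original notebook.

```python
from itertools import product

MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def make_configs(input_lens, output_lens, model=MODEL, tp=8):
    """Build one benchmark_model(**config) dict per (input_len, output_len) pair."""
    return [
        {"model": model, "input_len": i, "output_len": o, "tp": tp}
        for i, o in product(input_lens, output_lens)
    ]


def config_label(config):
    """Same label format as the 'Config' column in the results."""
    return (f"input_len={config['input_len']}, "
            f"output_len={config['output_len']}, tp={config['tp']}")


output_sweep = make_configs([0], [128, 256, 512, 1024, 2048, 4096])
input_sweep = make_configs([128, 256, 512, 1024, 2048, 4096], [128])
print(len(output_sweep), config_label(output_sweep[0]))
```

Each dict can be passed straight to `benchmark_model(**config)` as in the loops here.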


In [19]:
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 128, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 256, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 512, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 1024, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2048, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 4096, 'output_len': 128, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 128, 'output_len': 128, 'tp': 8}
  none Run time: 75.49 seconds
  pytorch Run time: 409.22 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                83.43           21357.74          0.000000   
1  pytorch                 4.64            1187.13        -94.438451   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1        -94.441687        -94.440069  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 256, 'output_len': 128, 'tp': 8}
  none Run time: 77.80 seconds
  pytorch Run time: 454.51 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                74.77           28710.44          0.000000   
1  pytorch                 3.82            1465.84        -94.890999   

   % Difference TPS  % Difference Avg  
0          0.000000            0.0000  
1        -94.894401          -94.8927  
{'model': 

| | Tuning | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | none | 83.43 | 21357.74 | 0.0 | 0.0 | 0.0 | input_len=128, output_len=128, tp=8 |
| 1 | pytorch | 4.64 | 1187.13 | -94.438451 | -94.441687 | -94.440069 | input_len=128, output_len=128, tp=8 |
| 2 | none | 74.77 | 28710.44 | 0.0 | 0.0 | 0.0 | input_len=256, output_len=128, tp=8 |
| 3 | pytorch | 3.82 | 1465.84 | -94.890999 | -94.894401 | -94.8927 | input_len=256, output_len=128, tp=8 |
| 4 | none | 60.11 | 38472.05 | 0.0 | 0.0 | 0.0 | input_len=512, output_len=128, tp=8 |
| 5 | pytorch | 5.48 | 3505.06 | -90.88338 | -90.889334 | -90.886357 | input_len=512, output_len=128, tp=8 |
| 6 | none | 47.3 | 54486.56 | 0.0 | 0.0 | 0.0 | input_len=1024, output_len=128, tp=8 |
| 7 | pytorch | 5.27 | 6074.7 | -88.858351 | -88.851012 | -88.854682 | input_len=1024, output_len=128, tp=8 |
| 8 | none | 31.34 | 68196.24 | 0.0 | 0.0 | 0.0 | input_len=2048, output_len=128, tp=8 |
| 9 | pytorch | 5.43 | 11815.98 | -82.673899 | -82.673561 | -82.67373 | input_len=2048, output_len=128, tp=8 |


In [20]:
# Second time...
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 128, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 256, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 512, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 1024, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2048, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 4096, 'output_len': 128, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 128, 'output_len': 128, 'tp': 8}
  none Run time: 76.46 seconds
  pytorch Run time: 406.58 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                82.44           21103.54          0.000000   
1  pytorch                 4.64            1187.83        -94.371664   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1        -94.371418        -94.371541  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 256, 'output_len': 128, 'tp': 8}
  none Run time: 82.36 seconds
  pytorch Run time: 456.11 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                74.92           28769.30          0.000000   
1  pytorch                 3.80            1457.97        -94.927923   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1        -94.932202        -94.930063  
{'model': 

| | Tuning | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | none | 82.44 | 21103.54 | 0.0 | 0.0 | 0.0 | input_len=128, output_len=128, tp=8 |
| 1 | pytorch | 4.64 | 1187.83 | -94.371664 | -94.371418 | -94.371541 | input_len=128, output_len=128, tp=8 |
| 2 | none | 74.92 | 28769.3 | 0.0 | 0.0 | 0.0 | input_len=256, output_len=128, tp=8 |
| 3 | pytorch | 3.8 | 1457.97 | -94.927923 | -94.932202 | -94.930063 | input_len=256, output_len=128, tp=8 |
| 4 | none | 61.66 | 39460.56 | 0.0 | 0.0 | 0.0 | input_len=512, output_len=128, tp=8 |
| 5 | pytorch | 5.39 | 3449.6 | -91.258514 | -91.258107 | -91.258311 | input_len=512, output_len=128, tp=8 |
| 6 | none | 47.14 | 54304.69 | 0.0 | 0.0 | 0.0 | input_len=1024, output_len=128, tp=8 |
| 7 | pytorch | 5.27 | 6070.88 | -88.820535 | -88.820708 | -88.820621 | input_len=1024, output_len=128, tp=8 |
| 8 | none | 31.38 | 68279.47 | 0.0 | 0.0 | 0.0 | input_len=2048, output_len=128, tp=8 |
| 9 | pytorch | 5.52 | 12021.28 | -82.409178 | -82.394005 | -82.401591 | input_len=2048, output_len=128, tp=8 |


Next, a few other input/output length combinations, including lengths that are not powers of two (131, 2000)...

In [21]:
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 131, 'output_len': 131, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2000, 'output_len': 2000, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2048, 'output_len': 2048, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 131, 'output_len': 131, 'tp': 8}
  none Run time: 78.27 seconds
  pytorch Run time: 411.14 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                80.99           21218.48          0.000000   
1  pytorch                 4.57            1196.45        -94.357328   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1        -94.361283        -94.359306  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2000, 'output_len': 2000, 'tp': 8}
  none Run time: 298.03 seconds
  pytorch Run time: 893.04 seconds
    Tuning  Requests per Second  Tokens per Second  % Difference RPS  \
0     none                 4.28           17104.94          0.000000   
1  pytorch                 1.43            5726.42        -66.588785   

   % Difference TPS  % Difference Avg  
0          0.000000           0.00000  
1        -66.521835         -66.55531  
{'model

| | Tuning | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | none | 80.99 | 21218.48 | 0.0 | 0.0 | 0.0 | input_len=131, output_len=131, tp=8 |
| 1 | pytorch | 4.57 | 1196.45 | -94.357328 | -94.361283 | -94.359306 | input_len=131, output_len=131, tp=8 |
| 2 | none | 4.28 | 17104.94 | 0.0 | 0.0 | 0.0 | input_len=2000, output_len=2000, tp=8 |
| 3 | pytorch | 1.43 | 5726.42 | -66.588785 | -66.521835 | -66.55531 | input_len=2000, output_len=2000, tp=8 |
| 4 | none | 4.19 | 17153.8 | 0.0 | 0.0 | 0.0 | input_len=2048, output_len=2048, tp=8 |
| 5 | pytorch | 2.62 | 10723.66 | -37.470167 | -37.485222 | -37.477694 | input_len=2048, output_len=2048, tp=8 |


# Conclusion
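Based on these results, turning on TunableOp adds a huge overhead and yields no throughput improvement at all: every configuration tested was slower with it enabled, from about 1.7% slower (input 0, output 2048) to about 95% slower (input 256, output 128), and a second pass over the same input-length sweep did not recover the loss.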
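
To put the percentages in perspective, it helps to express them as slowdown factors. The sketch below uses tokens/s values copied from the results tables; the helper name is illustrative only.

```python
def slowdown_factor(baseline_tps, tuned_tps):
    """How many times slower the TunableOp run is than the baseline."""
    return baseline_tps / tuned_tps


# (config, baseline tokens/s, TunableOp tokens/s), copied from the tables above
rows = [
    ("in=0, out=128", 11866.28, 7148.72),
    ("in=0, out=2048", 10779.30, 10588.03),
    ("in=128, out=128", 21357.74, 1187.13),
    ("in=2048, out=128", 68196.24, 11815.98),
    ("in=2048, out=2048", 17153.80, 10723.66),
]
for label, base, tuned in rows:
    print(f"{label:>18}: {slowdown_factor(base, tuned):5.2f}x slower")
```

A drop of roughly 94% corresponds to throughput about 18 times lower than the untuned baseline.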

# gradlib GEMM Tuning
- https://rocm.blogs.amd.com/artificial-intelligence/vllm-optimize/README.html#gemm-tuning
- https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop/README.html
- https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpus-with-gemm-tuning-improves-throughput-and-latency-by-up-to-7-2x

There's surprisingly little documentation on using gradlib (part of ROCm/vllm). The MI300X workload tuning guide has a short section on the GEMM tuning steps:
- https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#gemm-tuning-steps
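
Pieced together from the cells in this section, the workflow appears to have three steps: record the untuned GEMM shapes during a normal run, feed them to `gemm_tuner.py`, then rerun with the tuned file. The sketch below only assembles the shell commands as strings, using the environment variables and flags that appear in this notebook. The paths are placeholders, and `with_env` and `gradlib_steps` are made-up helper names.

```python
import os
import shlex

BENCH = ("python vllm/benchmarks/benchmark_throughput.py "
         "--model meta-llama/Llama-3.1-8B-Instruct "
         "--input-len 128 --output-len 128 -tp 8")


def with_env(env, command):
    """Prefix a shell command with VAR=value assignments."""
    prefix = " ".join(f"{k}={shlex.quote(str(v))}" for k, v in env.items())
    return f"{prefix} {command}"


def gradlib_steps(untuned="/tmp/vllm_untuned.csv", tuned="tuned.csv",
                  tuner="~/vllm-rocm/gradlib/gradlib/gemm_tuner.py"):
    """Return three shell commands: collect shapes, tune, run with the tuned file."""
    common = {"VLLM_USE_TRITON_FLASH_ATTN": 0}
    return [
        # 1. Record the GEMM shapes seen during a benchmark run.
        with_env({"VLLM_UNTUNE_FILE": untuned, "VLLM_TUNE_GEMM": 1, **common}, BENCH),
        # 2. Tune those shapes offline.
        f"python {tuner} --input {shlex.quote(untuned)} --tuned_file {shlex.quote(tuned)}",
        # 3. Rerun with the tuned file (absolute path, mirroring "$(pwd)/tuned.csv").
        with_env({"VLLM_TUNE_FILE": os.path.abspath(tuned), "VLLM_TUNE_GEMM": 0, **common}, BENCH),
    ]


for step in gradlib_steps():
    print(step)
```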

In [1]:
#VLLM_UNTUNE_FILE="untuned-in128-out128.csv" VLLM_TUNE_GEMM=1 VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct -tp 8 --input-len 128 --output-len 128
#python ~/vllm-rocm/gradlib/gradlib/gemm_tuner.py --input untuned-in128-out128.csv --tuned_file tuned-in128-out128.csv --tp 8  --nsets 1 

# && for tp in 1 1 2 2 4 4 8 8 128 128 256 256; do VLLM_TUNE_FILE="tuned_llama3_8B_B${tp}.csv" VLLM_TUNE_GEMM=0 HIP_VISIBLE_DEVICES=2,3,4,5,6,7 VLLM_USE_TRITON_FLASH_ATTN=0 python benchmark_throughput_prompt.py --model /models/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/ -tp 1 --num-prompts $tp --input-len 1024 --output-len 128 --prompt "Write a poem about a black cat" > ${tp}PromptsL3_8BGEMM; done 2>&1 &

In [3]:
# VLLM variables
# export VLLM_UNTUNE_FILE="/tmp/vllm_untuned.csv"
# export VLLM_TUNE_FILE="$(pwd)/tuned.csv"

In [6]:
# Generate vLLM untuned files
!VLLM_UNTUNE_FILE="/tmp/vllm_untuned.csv" VLLM_TUNE_GEMM=1 VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --num-scheduler-steps 15 --model meta-llama/Llama-3.1-8B-Instruct --input-len 128 --output-len 128 -tp 8

Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=15, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-30 19:01:14 config.py:887] Defaulting to use mp for distributed inference
INFO 10-30 19:01:14 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-30 19:01:14 llm_engine.py:237] Initiali

In [7]:
# GEMM Tuning
!time python ~/vllm-rocm/gradlib/gradlib/gemm_tuner.py --input /tmp/vllm_untuned.csv --tuned_file tuned.csv

>>> Loading /tmp/vllm_untuned.csv
M N K dtype 768 131072 4096 torch.bfloat16 >>> Total rocb solutions 388
M N K bias dtype 768 131072 4096 False torch.bfloat16 >>> Total hipb solutions 445
>>> Rocblas top solutions, Fast Mode 1
            gtimems
621283192  1.529309
621283627  1.540934
621283196  1.543761
621286477  1.544263
621283639  1.548452
621283198  1.554225
621283177  1.568658
621283552  1.570802
621286458  1.571945
621283554  1.572868
621283625  1.577277
621283180  1.577778
621283200  1.582008
621283637  1.588503
621283634  1.590768
621283553  1.593213
621283442  1.596040
621283628  1.596080
621283444  1.597884
621286461  1.599066
>>> HipBlasLt top solutions, Fast Mode 1
        gtimems
13908  1.478093
67575  1.561321
13828  1.568498
13737  1.577077
13829  1.578520
68900  1.586218
67582  1.600169
13887  1.605301
67589  1.614321
67224  1.631921
13911  1.634808
13792  1.650984
13820  1.654533
13862  1.655875
67621  1.660607
13819  1.661909
13824  1.677765
13825  1.679569
67543  
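
To read the tuner output, the best time from each library can be compared directly and converted to achieved throughput. This sketch assumes `gtimems` is the kernel time in milliseconds and counts the usual 2·M·N·K floating-point operations per GEMM; both are assumptions on my part, not something the tuner output states.

```python
def gemm_tflops(m, n, k, time_ms):
    """Achieved TFLOP/s for a GEMM with dimensions m, n, k, counting 2*m*n*k FLOPs."""
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


M, N, K = 768, 131072, 4096
best_rocblas_ms = 1.529309     # solution 621283192
best_hipblaslt_ms = 1.478093   # solution 13908

for name, ms in [("rocBLAS", best_rocblas_ms), ("hipBLASLt", best_hipblaslt_ms)]:
    print(f"{name:>9}: {ms:.6f} ms  ~{gemm_tflops(M, N, K, ms):.0f} TFLOP/s")

gain = (best_rocblas_ms - best_hipblaslt_ms) / best_rocblas_ms * 100
print(f"hipBLASLt is {gain:.1f}% faster for this shape")
```

For this shape the top hipBLASLt solution edges out the top rocBLAS one by a few percent.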

In [8]:
# Test w/ tuning
!VLLM_TUNE_FILE="$(pwd)/tuned.csv" VLLM_TUNE_GEMM=0 VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --num-scheduler-steps 15 --model meta-llama/Llama-3.1-8B-Instruct --input-len 128 --output-len 128 -tp 8

Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=15, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-30 19:09:11 config.py:887] Defaulting to use mp for distributed inference
INFO 10-30 19:09:11 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-30 19:09:11 llm_engine.py:237] Initiali

In [9]:
# Test w/ tuning - does this work w/ TP1?
!VLLM_TUNE_FILE="$(pwd)/tuned.csv" VLLM_TUNE_GEMM=0 VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --num-scheduler-steps 15 --model meta-llama/Llama-3.1-8B-Instruct --input-len 128 --output-len 128

Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=1, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=15, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-30 19:10:24 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-30 19:10:24 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev9+g5d264f4a) with config: model='meta-llama/Llama-3.

In [10]:
# Does this work with different input/output lengths?
!VLLM_TUNE_FILE="$(pwd)/tuned.csv" VLLM_TUNE_GEMM=0 VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --num-scheduler-steps 15 --model meta-llama/Llama-3.1-8B-Instruct --input-len 512 --output-len 512 -tp 8

Namespace(backend='vllm', dataset=None, input_len=512, output_len=512, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=15, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-30 19:34:28 config.py:887] Defaulting to use mp for distributed inference
INFO 10-30 19:34:28 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-30 19:34:28 llm_engine.py:237] Initiali

In [11]:
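# Don't use tuned GEMM: same 512/512 run without the tuned file (VLLM_TUNE_GEMM=1 also records the untuned GEMM shapes to VLLM_UNTUNE_FILE)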
!VLLM_UNTUNE_FILE="/tmp/vllm_untuned-512-512.csv" VLLM_TUNE_GEMM=1 VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --num-scheduler-steps 15 --model meta-llama/Llama-3.1-8B-Instruct --input-len 512 --output-len 512 -tp 8

Namespace(backend='vllm', dataset=None, input_len=512, output_len=512, model='meta-llama/Llama-3.1-8B-Instruct', tokenizer='meta-llama/Llama-3.1-8B-Instruct', quantization=None, tensor_parallel_size=8, n=1, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, device='auto', num_scheduler_steps=15, use_v2_block_manager=True, enable_prefix_caching=False, enable_chunked_prefill=False, max_num_batched_tokens=None, download_dir=None, output_json=None, distributed_executor_backend=None, load_format='auto', disable_async_output_proc=False, async_engine=False, disable_frontend_multiprocessing=False)
INFO 10-30 19:37:19 config.py:887] Defaulting to use mp for distributed inference
INFO 10-30 19:37:19 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 10-30 19:37:19 llm_engine.py:237] Initiali