# vLLM: hipBLAS vs hipBLASLt
Newer ROCm builds of PyTorch (2.4+) default to [hipBLASLt](https://github.com/ROCm/hipBLASLt) instead of regular [hipBLAS](https://github.com/ROCm/hipBLAS). To use hipBLAS, you need to set `TORCH_BLAS_PREFER_HIPBLASLT=0`.
- https://github.com/pytorch/pytorch/issues/119081

There are two main reasons to use this setting: 1) the hipBLASLt bundled with PyTorch does not support every platform that hipBLASLt (not to mention hipBLAS) supports, such as RDNA3 gfx1100, and 2) with hipBLASLt, `-tp 8` will fail unless you increase your file handle limit.
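For anyone who would rather apply the setting from inside a Python session than on the command line, here is a minimal sketch. It assumes PyTorch reads the variable when it initializes, so the variable is set before `torch` is imported; the `prefer_hipblas` helper name is hypothetical.

```python
import os


def prefer_hipblas(env=os.environ):
    """Ask ROCm PyTorch to use hipBLAS instead of hipBLASLt."""
    # Assumption: this has to happen before `import torch` to take effect.
    env["TORCH_BLAS_PREFER_HIPBLASLT"] = "0"
    return env["TORCH_BLAS_PREFER_HIPBLASLT"]


prefer_hipblas()
# import torch  # import only after the variable is set
```

Setting the variable on the command line, as the benchmark cells do, avoids any question about import order.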

Anyway, let's run some tests and see if it makes any performance difference.

## File Handles and -TP8
When using hipBLASLt (the default for ROCm with PyTorch 2.4+), loading fails above `-tp 4` because file handles are exhausted. You can read more about it here: https://github.com/pytorch/pytorch/issues/137695

It can be solved by raising the file handle limit:

In [1]:
# Increase File handles
!ulimit -n 131072
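
One caveat, as far as I understand IPython: `!` runs each command in its own subshell, so a `ulimit` issued that way may not carry over to later cells. A minimal sketch of an alternative uses Python's standard `resource` module to raise the soft limit of the notebook process itself, which child processes then inherit. The `raise_nofile_limit` helper is hypothetical, and the soft limit cannot go above the hard limit.

```python
import resource


def raise_nofile_limit(target=131072):
    """Raise the soft open-file limit toward `target`; never lower it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may not exceed the hard limit unless the hard limit is unlimited.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError) as exc:
            # Some systems impose extra caps; report and keep the old limit.
            print(f"Could not raise the limit: {exc}")
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print(raise_nofile_limit())
```

If the hard limit itself is too low, it has to be raised outside the notebook (for example in the shell that launches Jupyter).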

## Environment
For replicability, here are the versions used and some of the more relevant system information using the `vllm/collect_env.py` tool.

In [2]:
!python vllm/collect_env.py

Collecting environment information...
PyTorch version: 2.6.0.dev20241015+rocm6.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.2.41133-dd7f95766

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35

Python version: 3.11.10 | packaged by conda-forge | (main, Sep 30 2024, 18:08:57) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.2.41133
MIOpen runtime version: 3.2.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                  

In [3]:
!pip install nbformat

[0m[33mDEPRECATION: Loading egg at /mnt/nvme1n1p1/miniforge3/envs/vllm/lib/python3.11/site-packages/vllm-0.6.4.dev9+g5d264f4a.rocm624-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330[0m[33m
[0m[33mDEPRECATION: Loading egg at /mnt/nvme1n1p1/miniforge3/envs/vllm/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330[0m[33m
[0m

In [4]:
import pandas as pd
import re
import nbformat

In [5]:
def benchmark_model(model, input_len, output_len, tp):
    import pandas as pd
    import re
    import time
    
    # Initialize the DataFrame
    df = pd.DataFrame(columns=['BLAS', 'Requests per Second', 'Tokens per Second'])
    
    # Function to run the benchmark command and capture output
    def run_benchmark(use_hipblaslt):
        start = time.time()
        # Set the environment variable
        TORCH_BLAS_PREFER_HIPBLASLT = '1' if use_hipblaslt else '0'
        # Construct the command
        command = f"TORCH_BLAS_PREFER_HIPBLASLT={TORCH_BLAS_PREFER_HIPBLASLT} VLLM_USE_TRITON_FLASH_ATTN=0 python vllm/benchmarks/benchmark_throughput.py --backend vllm --input-len {input_len} --output-len {output_len} --model {model} -tp {tp}"
        # Run the command and capture the output
        output = get_ipython().getoutput(command)
        output_str = ' '.join(output)
        # Use regular expressions to extract the throughput values
        matches = re.findall(r"Throughput:\s*([\d.]+)\s*requests/s,\s*([\d.]+)\s*tokens/s", output_str)
        # Report the wall-clock time before returning so it is printed on success and failure alike
        duration = time.time() - start
        print(f"Took {duration:.3f} seconds")
        if matches:
            requests_per_sec, tokens_per_sec = map(float, matches[0])
            return requests_per_sec, tokens_per_sec
        else:
            print(f"No throughput data found for {'hipBLASLt' if use_hipblaslt else 'hipBLAS'}.")
            return None, None

    # Run benchmarks for hipBLAS (use_hipblaslt=False)
    hb_rps, hb_tps = run_benchmark(use_hipblaslt=False)
    if hb_rps is None or hb_tps is None:
        print("Benchmark failed for hipBLAS.")
        return None

    # Append hipBLAS results to the DataFrame
    df.loc[len(df)] = {'BLAS': 'hipBLAS', 'Requests per Second': hb_rps, 'Tokens per Second': hb_tps}

    # Run benchmarks for hipBLASLt (use_hipblaslt=True)
    hblt_rps, hblt_tps = run_benchmark(use_hipblaslt=True)
    if hblt_rps is None or hblt_tps is None:
        print("Benchmark failed for hipBLASLt.")
        return None

    # Append hipBLASLt results to the DataFrame
    df.loc[len(df)] = {'BLAS': 'hipBLASLt', 'Requests per Second': hblt_rps, 'Tokens per Second': hblt_tps}

    # Calculate percentage differences (hipBLAS is baseline)
    percent_diff_rps = ((hblt_rps - hb_rps) / hb_rps) * 100
    percent_diff_tps = ((hblt_tps - hb_tps) / hb_tps) * 100
    avg_percent_diff = (percent_diff_rps + percent_diff_tps) / 2

    # Add percentage differences to the DataFrame
    df['% Difference RPS'] = [0, percent_diff_rps]
    df['% Difference TPS'] = [0, percent_diff_tps]
    df['% Difference Avg'] = [0, avg_percent_diff]

    # Display the DataFrame
    print(df)
    return df
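
The two pieces of logic that matter in the function above, the throughput regex and the percentage difference against the hipBLAS baseline, can be tried in isolation. This is a minimal sketch: `parse_throughput` and `percent_diff` are hypothetical helper names, and the sample line is an assumption reconstructed from the regex rather than captured output.

```python
import re

THROUGHPUT_RE = re.compile(
    r"Throughput:\s*([\d.]+)\s*requests/s,\s*([\d.]+)\s*tokens/s"
)


def parse_throughput(text):
    """Return (requests/s, tokens/s) from benchmark output, or None if absent."""
    match = THROUGHPUT_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def percent_diff(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Assumed line format, inferred from the regex used in benchmark_model.
sample = "Throughput: 73.02 requests/s, 18693.78 tokens/s"
print(parse_throughput(sample))
print(percent_diff(73.02, 81.61))
```

Returning `None` for a missing match keeps the "no throughput data" case explicit, which is the same signal `benchmark_model` relies on.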

In [6]:
# Call the function with your parameters
df_results = benchmark_model(
    model='meta-llama/Llama-2-7b-chat-hf',
    input_len=128,
    output_len=128,
    tp=8
)
display(df_results)

        BLAS  Requests per Second  Tokens per Second  % Difference RPS  \
0    hipBLAS                73.02           18693.78            0.0000   
1  hipBLASLt                81.61           20893.19           11.7639   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1         11.765464         11.764682  


| | BLAS | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg |
|---|---|---|---|---|---|---|
| 0 | hipBLAS | 73.02 | 18693.78 | 0.0 | 0.0 | 0.0 |
| 1 | hipBLASLt | 81.61 | 20893.19 | 11.7639 | 11.765464 | 11.764682 |


## Llama2-7B in:128 out:128
Just an initial test to make sure everything is working hunky dory. There's a surprising amount of variance: across runs, hipBLASLt has been as little as 2% faster, 5% faster, and, in this run, almost 12% faster. Any real testing may require 5-10 runs per configuration (drop the high and low, then take the mean) or something similar to reduce the standard deviation and get better numbers.
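
The "drop high/low, and mean" idea could look something like the sketch below. The `trimmed_mean_and_stdev` helper is hypothetical, and the three sample values are just the rough speedups mentioned above, not a new measurement.

```python
import statistics


def trimmed_mean_and_stdev(samples):
    """Drop the single highest and lowest sample, then summarize the rest."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to drop the high and low")
    trimmed = sorted(samples)[1:-1]
    mean = statistics.fmean(trimmed)
    # The sample standard deviation needs at least two points.
    stdev = statistics.stdev(trimmed) if len(trimmed) > 1 else 0.0
    return mean, stdev


# Rough per-run speedups (%) quoted in the text above.
print(trimmed_mean_and_stdev([2.0, 5.0, 12.0]))
```

With only three runs the trimmed set is a single value, which is why 5-10 runs would be needed for the standard deviation to mean anything.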

OK, let's run our Llama3-8B sweeps:

In [7]:
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 256, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 512, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 1024, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 2048, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 4096, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 128, 'tp': 8}
        BLAS  Requests per Second  Tokens per Second  % Difference RPS  \
0    hipBLAS                82.74           10591.08          0.000000   
1  hipBLASLt                92.54           11845.65         11.844332   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1         11.845534         11.844933  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 256, 'tp': 8}
        BLAS  Requests per Second  Tokens per Second  % Difference RPS  \
0    hipBLAS                43.16           11048.96          0.000000   
1  hipBLASLt                43.32           11088.96          0.370714   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1          0.362025          0.366369  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 0, 'output_len': 512, 'tp': 8}
        BLAS  Requests per Second  Tokens pe

| | BLAS | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | hipBLAS | 82.74 | 10591.08 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=128, tp=8 |
| 1 | hipBLASLt | 92.54 | 11845.65 | 11.844332 | 11.845534 | 11.844933 | input_len=0, output_len=128, tp=8 |
| 2 | hipBLAS | 43.16 | 11048.96 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=256, tp=8 |
| 3 | hipBLASLt | 43.32 | 11088.96 | 0.370714 | 0.362025 | 0.366369 | input_len=0, output_len=256, tp=8 |
| 4 | hipBLAS | 21.73 | 11125.29 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=512, tp=8 |
| 5 | hipBLASLt | 22.12 | 11325.74 | 1.794754 | 1.801751 | 1.798252 | input_len=0, output_len=512, tp=8 |
| 6 | hipBLAS | 10.65 | 10903.1 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=1024, tp=8 |
| 7 | hipBLASLt | 10.92 | 11180.66 | 2.535211 | 2.545698 | 2.540455 | input_len=0, output_len=1024, tp=8 |
| 8 | hipBLAS | 5.1 | 10441.79 | 0.0 | 0.0 | 0.0 | input_len=0, output_len=2048, tp=8 |
| 9 | hipBLASLt | 5.24 | 10733.39 | 2.745098 | 2.792625 | 2.768861 | input_len=0, output_len=2048, tp=8 |


## Llama3-8B in:0 out:128-4096
We see one improvement close to 12% (at out:128), but otherwise a gain of roughly 0.4-3% for hipBLASLt vs hipBLAS.

Next we test longer inputs:

In [10]:
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 128, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 256, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 512, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 1024, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2048, 'output_len': 128, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 4096, 'output_len': 128, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 128, 'output_len': 128, 'tp': 8}
        BLAS  Requests per Second  Tokens per Second  % Difference RPS  \
0    hipBLAS                80.50           20608.97           0.00000   
1  hipBLASLt                83.38           21345.87           3.57764   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1          3.575628          3.576634  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 256, 'output_len': 128, 'tp': 8}
        BLAS  Requests per Second  Tokens per Second  % Difference RPS  \
0    hipBLAS                72.28           27755.17          0.000000   
1  hipBLASLt                73.62           28269.26          1.853901   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1          1.852231          1.853066  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 512, 'output_len': 128, 'tp': 8}
        BLAS  Requests per Second  Tok

| | BLAS | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | hipBLAS | 80.5 | 20608.97 | 0.0 | 0.0 | 0.0 | input_len=128, output_len=128, tp=8 |
| 1 | hipBLASLt | 83.38 | 21345.87 | 3.57764 | 3.575628 | 3.576634 | input_len=128, output_len=128, tp=8 |
| 2 | hipBLAS | 72.28 | 27755.17 | 0.0 | 0.0 | 0.0 | input_len=256, output_len=128, tp=8 |
| 3 | hipBLASLt | 73.62 | 28269.26 | 1.853901 | 1.852231 | 1.853066 | input_len=256, output_len=128, tp=8 |
| 4 | hipBLAS | 58.03 | 37137.53 | 0.0 | 0.0 | 0.0 | input_len=512, output_len=128, tp=8 |
| 5 | hipBLASLt | 61.46 | 39335.04 | 5.910736 | 5.917222 | 5.913979 | input_len=512, output_len=128, tp=8 |
| 6 | hipBLAS | 42.58 | 49055.79 | 0.0 | 0.0 | 0.0 | input_len=1024, output_len=128, tp=8 |
| 7 | hipBLASLt | 46.82 | 53931.77 | 9.957727 | 9.939663 | 9.948695 | input_len=1024, output_len=128, tp=8 |
| 8 | hipBLAS | 28.46 | 61932.65 | 0.0 | 0.0 | 0.0 | input_len=2048, output_len=128, tp=8 |
| 9 | hipBLASLt | 31.28 | 68066.89 | 9.908644 | 9.904695 | 9.906669 | input_len=2048, output_len=128, tp=8 |


## Llama3-8B in:128-4096 out:128
With longer inputs, we see on average a bigger perf boost: about 2-4% at in:128-256, rising to 6-10% at in:512-2048. It looks like hipBLASLt performs better on longer context.

We'll do just a couple more tests now:
- in:131 out:131 - let's see how a non power of 2 (prime no less) number works
- in:2000 out:2000 and in:2048 out:2048 - and what a medium context looks like.

In [11]:
# List of configurations to test
configs = [
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 131, 'output_len': 131, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2000, 'output_len': 2000, 'tp': 8},
    {'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2048, 'output_len': 2048, 'tp': 8},
]

# Initialize an empty DataFrame to store all results
all_results = pd.DataFrame()

# Run benchmarks for each configuration
for config in configs:
    print(config)
    df_result = benchmark_model(**config)
    if df_result is not None:
        # Add a column for the configuration
        df_result['Config'] = f"input_len={config['input_len']}, output_len={config['output_len']}, tp={config['tp']}"
        all_results = pd.concat([all_results, df_result], ignore_index=True)

# Display all results
display(all_results)

{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 131, 'output_len': 131, 'tp': 8}
        BLAS  Requests per Second  Tokens per Second  % Difference RPS  \
0    hipBLAS                76.94           20159.24          0.000000   
1  hipBLASLt                77.93           20417.96          1.286717   

   % Difference TPS  % Difference Avg  
0          0.000000          0.000000  
1          1.283382          1.285049  
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2000, 'output_len': 2000, 'tp': 8}
No throughput data found for hipBLAS.
Benchmark failed for hipBLAS.
{'model': 'meta-llama/Llama-3.1-8B-Instruct', 'input_len': 2048, 'output_len': 2048, 'tp': 8}
No throughput data found for hipBLAS.
Benchmark failed for hipBLAS.


| | BLAS | Requests per Second | Tokens per Second | % Difference RPS | % Difference TPS | % Difference Avg | Config |
|---|---|---|---|---|---|---|---|
| 0 | hipBLAS | 76.94 | 20159.24 | 0.0 | 0.0 | 0.0 | input_len=131, output_len=131, tp=8 |
| 1 | hipBLASLt | 77.93 | 20417.96 | 1.286717 | 1.283382 | 1.285049 | input_len=131, output_len=131, tp=8 |


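Hmm, it looks like there were some failures: both the in:2000 out:2000 and in:2048 out:2048 runs produced no throughput data for hipBLAS, so hipBLASLt was never tested at those sizes. Oh well.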
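
Since `benchmark_model` throws away the command output when the regex finds nothing, the failures above leave no clue about what went wrong. One way to keep the evidence is sketched below using the standard `subprocess` module; `run_and_keep_tail` is a hypothetical helper, and the command shown is a stand-in rather than the real benchmark invocation.

```python
import os
import subprocess
import sys


def run_and_keep_tail(cmd, extra_env, tail_lines=20):
    """Run `cmd` with extra environment variables; return (exit code, last lines)."""
    env = {**os.environ, **extra_env}
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so error messages are kept too
        text=True,
    )
    return proc.returncode, proc.stdout.splitlines()[-tail_lines:]


# Stand-in command: prints the variable the child process actually sees.
code, tail = run_and_keep_tail(
    [sys.executable, "-c",
     "import os; print(os.environ['TORCH_BLAS_PREFER_HIPBLASLT'])"],
    {"TORCH_BLAS_PREFER_HIPBLASLT": "0"},
)
print(code, tail)
```

Printing the tail whenever no throughput line is found would show whether a run failed from running out of memory, a timeout, or something else.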