
LLM Inference Benchmarks

LLM Inference Benchmarks on Consumer- and Enterprise-Grade Accelerators

The scripts/benchmarks folder is taken from vLLM's benchmark implementation in the vLLM 0.8.5 release, with an __init__.py added for convenience.

How to Reproduce

Setup Environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If using Docker, pull the vllm/vllm-openai:v0.8.5 image instead.
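
The containerized server can then be launched roughly as follows (a minimal sketch assuming the NVIDIA container runtime is installed; the cache mount and port are illustrative):

# Sketch: serve the model from the vLLM OpenAI-compatible container (v0.8.5).
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.8.5 \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --dtype float16 \
  --tensor-parallel-size 1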

Prepare Dataset

For the random dataset, nothing needs to be done. For the sharegpt dataset, download:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Serve the Model

Example with the vLLM backend limited to 1 GPU:

CUDA_VISIBLE_DEVICES=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B  \
  --disable-log-requests \
  --dtype float16 \
  --tensor-parallel-size 1

Optional: set --max-model-len and --gpu-memory-utilization as needed, based on the model requirements and hardware capacity. See the Engine Arguments doc.
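
For instance, a hypothetical run capping the context length and raising memory utilization might look like this (illustrative values, not a recommendation):

CUDA_VISIBLE_DEVICES=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --disable-log-requests \
  --dtype float16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95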

Run Benchmark

Run run_benchmark.sh with the desired parameters. Example for 200 prompts from the sharegpt dataset:

./run_benchmark.sh --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
                   --backend vllm \
                   --dataset sharegpt \
                   --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
                   --result-dir ./results \
                   --num-prompts 200 \
                   --gpu-name NVIDIA-H100-PCIe \
                   --gpu-count 1
  • it runs the complete benchmark for different queries-per-second (QPS) values: 32, 16, 8, 4, 1
  • with burstiness 1.0 (Poisson arrivals) and max concurrency set to None (requests follow the target QPS without limitation)
  • it collects metric percentiles (50, 90, 95, 99, 100) for time-to-first-token (TTFT), time-per-output-token (TPOT), and inter-token-latency (ITL)
  • for the goodput metric it sets a 2000 ms TTFT threshold
  • it saves results to result-dir/{BACKEND}_{NUM_PROMPTS}prompts{QPS}qps_{MODEL}_{DATASET}_{GPU_COUNT}x{GPU_NAME}.json
  • throughput definition: requests completed / duration in seconds (see the sketch after this list)
  • goodput definition: requests completed within the goodput threshold / duration in seconds
  • the gpu-name and gpu-count parameters serve only as metadata for the saved benchmark results and do not influence benchmark execution; the actual GPU allocation is defined when launching vllm serve
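
As a quick illustration of the throughput and goodput definitions (a sketch only; the 52.19 s duration is taken from a results row below, and the goodput count is hypothetical):

# Throughput as defined above: completed requests / benchmark duration.
num_prompts = 200
duration_s = 52.19              # e.g. the QPS=4, 1 x H100-PCIe row below
throughput_req_s = num_prompts / duration_s
print(round(throughput_req_s, 2))   # ~3.83 req/s, matching the table

# Goodput counts only requests whose TTFT stayed under the 2000 ms threshold.
good_requests = 180             # hypothetical number of requests meeting the SLO
goodput_req_s = good_requests / duration_s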

The script run_benchmark.sh calls benchmark_serving.py under the hood. For more information on the underlying parameters, check vllm's benchmark README.md.
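
For a single QPS point, a direct invocation looks roughly like the sketch below (the script path and flag names follow vLLM 0.8.5's benchmark_serving.py; check that README if they differ in your checkout):

# Sketch: one QPS point (request rate 4) against a locally served model.
python scripts/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 200 \
  --request-rate 4 \
  --burstiness 1.0 \
  --percentile-metrics ttft,tpot,itl \
  --metric-percentiles 50,90,95,99,100 \
  --goodput ttft:2000 \
  --save-result \
  --result-dir ./results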

Results

Complete raw results are available in the results folder. Benchmarks were executed on io.net's cloud.

Pricing Context

As of May 2025, GPU rental pricing shows marked differences between hyperscale cloud providers (AWS, GCP, Azure) and emerging cloud computing platforms, including GPU-as-a-Service (GPUaaS) offerings. New entrants offer enterprise GPUs at approximately 20% of hyperscaler rates, while consumer GPUs are available only on these newer platforms. NVIDIA H100 pricing differentials approach 5×, significantly impacting AI workload economics.

Table: GPU Rental Pricing Comparison Across Cloud Providers

| GPU Model | Alternative Providers (USD/hour) | Hyperscalers (USD/hour) | Notes |
|---|---|---|---|
| NVIDIA H100 SXM | 2.49 | 12.29ᵃ | ᵃ 8-GPU instance normalized per unit |
| NVIDIA H100 PCIe | 1.99 | 8.00–10.00ᵇ | ᵇ Estimated range based on typical 65–80% pricing ratio relative to the SXM variant |
| NVIDIA RTX 4090 | 0.25 | N/A | Alternative providers only |

Note: On-demand hourly rates as of May 2025. Base compute costs only; excludes storage, networking, and data transfer charges, which are typically lower or bundled on alternative platforms.

Subsequent cost-performance analyses utilize alternative provider pricing, as these platforms uniquely offer both consumer and enterprise GPUs, enabling direct comparison between GPU categories within a consistent pricing framework. Readers using hyperscaler services should multiply enterprise GPU costs by approximately 5× to reflect their pricing structure.

Online Serving

The benchmark was executed with 200 prompts from the sharegpt dataset. Tensor parallelism was set to the maximum number of GPUs available on the hardware configuration under evaluation, in order to capture the overall behaviour.

It is worth noting that when serving models via vLLM, unnecessarily increasing tensor parallelism to utilize all available GPUs can in fact degrade performance due to communication overhead. One should therefore find the smallest tensor parallelism that accommodates the model and context size, identify the optimal QPS (queries per second) for that model/hardware configuration, and then serve higher QPS by load balancing requests across multiple vLLM servers. The orchestration of such a scenario can be achieved in various ways and is not detailed in this work; it can be handled with Ray or Kubernetes, or even a load balancer that routes traffic to geo-distributed nodes based on location, capacity, Service Level Objectives (SLOs), and other requirements.

Meta-Llama-3-8B-Instruct

| QPS | Hardware | Duration (s) | Hardware pricing / h | Cost / benchmark duration | Throughput (tokens/s) | Throughput (req/s) | TTFT mean (ms) | TTFT median (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 × H100-PCIe | 190.21 | $2.00 | $0.106 | 428.54 | 1.05 | 20.60 | 20.36 | 25.50 | 27.43 |
| 1 | 1 × RTX-4090 | 192.07 | $0.25 | $0.013 | 424.43 | 1.04 | 49.16 | 43.50 | 77.43 | 110.69 |
| 4 | 1 × H100-PCIe | 52.19 | $2.00 | $0.029 | 1562.05 | 3.83 | 21.65 | 21.65 | 26.04 | 27.93 |
| 4 | 1 × RTX-4090 | 58.91 | $0.25 | $0.004 | 1384.26 | 3.39 | 56.86 | 50.02 | 90.25 | 132.07 |
| 8 | 1 × H100-PCIe | 30.36 | $2.00 | $0.017 | 2685.11 | 6.59 | 23.12 | 22.73 | 27.79 | 30.16 |
| 8 | 1 × RTX-4090 | 37.95 | $0.25 | $0.003 | 2147.82 | 5.27 | 69.22 | 60.99 | 111.09 | 180.34 |
| 16 | 1 × H100-PCIe | 19.66 | $2.00 | $0.011 | 4145.92 | 10.17 | 24.75 | 24.95 | 29.80 | 31.06 |
| 16 | 1 × RTX-4090 | 28.68 | $0.25 | $0.002 | 2843.04 | 6.97 | 85.85 | 74.37 | 143.88 | 193.13 |
| 32 | 1 × H100-PCIe | 14.99 | $2.00 | $0.008 | 5438.38 | 13.34 | 40.44 | 37.74 | 60.43 | 77.54 |
| 32 | 1 × RTX-4090 | 24.52 | $0.25 | $0.002 | 3325.26 | 8.16 | 161.46 | 162.92 | 260.01 | 330.49 |

DeepSeek-R1-Distill-Llama-8B

| QPS | Hardware | Duration (s) | Hardware pricing / h | Cost / benchmark duration | Throughput (tokens/s) | Throughput (req/s) | TTFT mean (ms) | TTFT median (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 × H100-PCIe | 190.17 | $2.00 | $0.106 | 439.03 | 1.05 | 20.49 | 20.59 | 25.52 | 26.74 |
| 1 | 1 × RTX-4090 | 191.99 | $0.25 | $0.013 | 437.06 | 1.04 | 47.65 | 39.85 | 76.04 | 110.90 |
| 1 | 2 × RTX-4090 | 190.96 | $0.50 | $0.027 | 437.22 | 1.05 | 1187.81 | 1173.37 | 2206.91 | 2767.89 |
| 4 | 1 × H100-PCIe | 52.30 | $2.00 | $0.029 | 1596.42 | 3.82 | 22.17 | 22.48 | 26.44 | 27.84 |
| 4 | 1 × RTX-4090 | 58.81 | $0.25 | $0.004 | 1426.92 | 3.40 | 58.03 | 49.67 | 92.83 | 144.55 |
| 4 | 2 × RTX-4090 | 56.01 | $0.50 | $0.008 | 1490.56 | 3.57 | 363.24 | 223.10 | 814.06 | 1003.69 |
| 8 | 1 × H100-PCIe | 30.43 | $2.00 | $0.017 | 2744.12 | 6.57 | 22.50 | 22.35 | 27.72 | 29.29 |
| 8 | 1 × RTX-4090 | 38.06 | $0.25 | $0.003 | 2193.92 | 5.26 | 67.82 | 60.23 | 108.35 | 148.29 |
| 8 | 2 × RTX-4090 | 35.60 | $0.50 | $0.005 | 2345.28 | 5.62 | 114.28 | 92.65 | 207.90 | 365.97 |
| 16 | 1 × H100-PCIe | 19.76 | $2.00 | $0.011 | 4225.61 | 10.12 | 24.15 | 24.67 | 29.21 | 32.35 |
| 16 | 1 × RTX-4090 | 31.59 | $0.25 | $0.002 | 2655.99 | 6.33 | 86.87 | 73.60 | 147.44 | 195.46 |
| 16 | 2 × RTX-4090 | 29.53 | $0.50 | $0.004 | 2827.67 | 6.77 | 251.60 | 198.02 | 429.94 | 1022.01 |
| 32 | 1 × H100-PCIe | 16.58 | $2.00 | $0.009 | 5034.81 | 12.06 | 42.10 | 40.26 | 62.41 | 82.89 |
| 32 | 1 × RTX-4090 | 29.27 | $0.25 | $0.002 | 2866.80 | 6.83 | 157.86 | 151.39 | 253.06 | 335.53 |
| 32 | 2 × RTX-4090 | 32.74 | $0.50 | $0.005 | 2550.39 | 6.11 | 1627.97 | 1603.34 | 2459.90 | 2685.79 |

DeepSeek-R1-Distill-Qwen-1.5B

| QPS | Hardware | Duration (s) | Hardware pricing / h | Cost / benchmark duration | Throughput (tokens/s) | Throughput (req/s) | TTFT mean (ms) | TTFT median (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 × H100-PCIe | 189.31 | $2.00 | $0.105 | 447.53 | 1.06 | 10.93 | 10.69 | 12.80 | 14.17 |
| 1 | 1 × RTX-4090 | 189.99 | $0.25 | $0.013 | 445.90 | 1.05 | 19.15 | 18.70 | 24.20 | 27.65 |
| 4 | 1 × H100-PCIe | 47.97 | $2.00 | $0.027 | 1766.12 | 4.17 | 10.94 | 10.78 | 13.04 | 14.30 |
| 4 | 1 × RTX-4090 | 52.71 | $0.25 | $0.004 | 1607.46 | 3.79 | 20.84 | 21.08 | 25.37 | 28.57 |
| 8 | 1 × H100-PCIe | 25.21 | $2.00 | $0.014 | 3360.08 | 7.93 | 11.13 | 10.86 | 13.40 | 14.74 |
| 8 | 1 × RTX-4090 | 30.85 | $0.25 | $0.002 | 2746.06 | 6.48 | 22.87 | 22.74 | 28.62 | 30.21 |
| 16 | 1 × H100-PCIe | 14.30 | $2.00 | $0.008 | 5924.98 | 13.99 | 11.85 | 11.85 | 14.43 | 15.79 |
| 16 | 1 × RTX-4090 | 20.17 | $0.25 | $0.001 | 4200.67 | 9.92 | 26.38 | 26.38 | 33.92 | 40.39 |
| 32 | 1 × H100-PCIe | 8.88 | $2.00 | $0.005 | 9536.68 | 22.51 | 21.03 | 16.21 | 22.55 | 145.23 |
| 32 | 1 × RTX-4090 | 16.61 | $0.25 | $0.001 | 5100.88 | 12.04 | 39.78 | 37.14 | 54.74 | 105.53 |

DeepSeek-R1-Distill-Qwen-7B

| QPS | Hardware | Duration (s) | Hardware pricing / h | Cost / benchmark duration | Throughput (tokens/s) | Throughput (req/s) | TTFT mean (ms) | TTFT median (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 × H100-PCIe | 190.07 | $2.00 | $0.106 | 448.67 | 1.05 | 20.22 | 20.17 | 24.70 | 26.24 |
| 1 | 1 × RTX-4090 | 191.71 | $0.25 | $0.013 | 443.91 | 1.04 | 33.50 | 33.86 | 40.29 | 43.85 |
| 1 | 2 × RTX-4090 | 190.40 | $0.50 | $0.026 | 447.53 | 1.05 | 4253.82 | 2724.50 | 9964.76 | 16206.69 |
| 4 | 1 × H100-PCIe | 51.77 | $2.00 | $0.029 | 1644.85 | 3.86 | 21.49 | 21.08 | 25.56 | 27.81 |
| 4 | 1 × RTX-4090 | 57.74 | $0.25 | $0.004 | 1475.33 | 3.46 | 38.56 | 39.28 | 46.56 | 49.75 |
| 4 | 2 × RTX-4090 | 53.99 | $0.50 | $0.007 | 1577.77 | 3.70 | 494.29 | 322.24 | 1203.65 | 1763.86 |
| 8 | 1 × H100-PCIe | 29.89 | $2.00 | $0.017 | 2843.24 | 6.69 | 22.53 | 22.36 | 26.68 | 28.54 |
| 8 | 1 × RTX-4090 | 36.61 | $0.25 | $0.003 | 2327.15 | 5.46 | 44.46 | 44.79 | 54.03 | 58.15 |
| 8 | 2 × RTX-4090 | 33.40 | $0.50 | $0.005 | 2547.06 | 5.99 | 102.30 | 86.10 | 174.14 | 290.63 |
| 8 | 4 × RTX-4090 | 31.68 | $1.00 | $0.009 | 2687.71 | 6.31 | 124.65 | 94.99 | 234.67 | 353.82 |
| 16 | 1 × H100-PCIe | 19.12 | $2.00 | $0.011 | 4461.08 | 10.46 | 24.04 | 24.04 | 28.84 | 31.23 |
| 16 | 1 × RTX-4090 | 26.97 | $0.25 | $0.002 | 3157.55 | 7.41 | 49.96 | 49.75 | 63.14 | 74.86 |
| 16 | 2 × RTX-4090 | 25.63 | $0.50 | $0.004 | 3319.17 | 7.80 | 371.93 | 242.15 | 745.85 | 937.87 |
| 16 | 4 × RTX-4090 | 24.92 | $1.00 | $0.007 | 3419.58 | 8.02 | 287.77 | 244.50 | 550.65 | 730.39 |
| 32 | 1 × H100-PCIe | 15.74 | $2.00 | $0.009 | 5413.64 | 12.71 | 38.41 | 36.75 | 56.29 | 69.63 |
| 32 | 1 × RTX-4090 | 27.72 | $0.25 | $0.002 | 3074.63 | 7.21 | 145.71 | 150.03 | 227.90 | 307.21 |
| 32 | 2 × RTX-4090 | 29.07 | $0.50 | $0.004 | 2931.55 | 6.88 | 1021.83 | 949.28 | 1534.68 | 1695.71 |
| 32 | 4 × RTX-4090 | 26.53 | $1.00 | $0.007 | 3208.07 | 7.54 | 1032.83 | 1027.77 | 1338.91 | 1491.44 |

DeepSeek-R1-Distill-Qwen-14B

| QPS | Hardware | Duration (s) | Hardware pricing / h | Cost / benchmark duration | Throughput (tokens/s) | Throughput (req/s) | TTFT mean (ms) | TTFT median (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 × H100-PCIe | 192.17 | $2.00 | $0.107 | 433.76 | 1.04 | 35.10 | 35.05 | 43.12 | 45.55 |
| 1 | 2 × RTX-4090 | 193.62 | $0.50 | $0.027 | 430.51 | 1.03 | 355.29 | 300.75 | 649.22 | 1031.91 |
| 1 | 4 × RTX-4090 | 192.77 | $1.00 | $0.054 | 431.19 | 1.04 | 389.44 | 304.36 | 846.98 | 1199.15 |
| 1 | 8 × RTX-4090 | 191.75 | $2.00 | $0.107 | 434.92 | 1.04 | 6457.09 | 7269.67 | 12589.86 | 14453.35 |
| 4 | 1 × H100-PCIe | 59.04 | $2.00 | $0.033 | 1411.84 | 3.39 | 36.53 | 36.26 | 44.95 | 46.79 |
| 4 | 2 × RTX-4090 | 66.47 | $0.50 | $0.009 | 1250.59 | 3.01 | 240.39 | 207.68 | 463.89 | 638.69 |
| 4 | 4 × RTX-4090 | 61.82 | $1.00 | $0.017 | 1344.58 | 3.24 | 156.97 | 131.61 | 289.73 | 482.13 |
| 4 | 8 × RTX-4090 | 59.68 | $2.00 | $0.033 | 1393.08 | 3.35 | 132.29 | 92.70 | 215.01 | 752.63 |
| 8 | 1 × H100-PCIe | 37.20 | $2.00 | $0.021 | 2240.95 | 5.38 | 38.29 | 38.21 | 46.65 | 49.09 |
| 8 | 2 × RTX-4090 | 55.56 | $0.50 | $0.008 | 1500.34 | 3.60 | 335.95 | 311.47 | 571.83 | 802.63 |
| 8 | 4 × RTX-4090 | 44.68 | $1.00 | $0.012 | 1858.57 | 4.48 | 154.99 | 133.90 | 265.25 | 448.04 |
| 8 | 8 × RTX-4090 | 44.84 | $2.00 | $0.025 | 1849.74 | 4.46 | 132.48 | 122.05 | 210.40 | 268.73 |
| 16 | 1 × H100-PCIe | 27.68 | $2.00 | $0.015 | 3011.13 | 7.22 | 40.89 | 41.88 | 50.38 | 54.41 |
| 16 | 2 × RTX-4090 | 51.47 | $0.50 | $0.007 | 1619.37 | 3.89 | 843.12 | 727.63 | 1571.76 | 1947.71 |
| 16 | 4 × RTX-4090 | 41.67 | $1.00 | $0.012 | 1994.82 | 4.80 | 163.87 | 159.71 | 250.68 | 337.07 |
| 16 | 8 × RTX-4090 | 41.16 | $2.00 | $0.023 | 2026.68 | 4.86 | 148.55 | 144.75 | 221.16 | 266.40 |

Note: the GPU hourly price is assumed to be $2.00 for each H100 and $0.25 for each RTX-4090.
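
The "Cost / benchmark duration" column in the tables above follows directly from that hourly rate, the GPU count, and the measured duration. A minimal sketch of the arithmetic (the helper function is ours, not part of the repo; the example row is from the first table):

# Sketch: reproduce the cost column from hourly rate x GPU count x duration.
PRICE_PER_GPU_HOUR = {"H100-PCIe": 2.00, "RTX-4090": 0.25}

def benchmark_cost(gpu_name: str, gpu_count: int, duration_s: float) -> float:
    """Cost of one benchmark run in USD."""
    return PRICE_PER_GPU_HOUR[gpu_name] * gpu_count * duration_s / 3600

# Example: QPS=1, 1 x H100-PCIe, 190.21 s  ->  ~$0.106
print(round(benchmark_cost("H100-PCIe", 1, 190.21), 3))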

How to Evaluate in Practice

While the benchmarks utilize all available GPUs for the hardware configuration to collect performance data, production deployments should carefully optimize tensor parallelism sizing. Excessive parallelism can degrade performance due to inter-GPU communication overhead. In practice, operators should:

  1. Determine the minimum tensor parallelism required for the model and context size
  2. Optionally increase tensor parallelism to target specific latency/throughput SLOs
  3. Find the optimal QPS for that combination
  4. For higher QPS, scale horizontally by load balancing across multiple vLLM instances

For instance, for the given model DeepSeek-R1-Distill-Qwen-14B and request pattern, the performance and cost metrics in the results table above suggest that the following configurations sit in the sweet spot:

  • #6: QPS=4, 2 × RTX-4090, TPS=1250.59, TTFT p90=463.89 ms, cost=$0.111 / 1M tokens
  • #11: QPS=8, 4 × RTX-4090, TPS=1858.57, TTFT p90=265.25 ms, cost=$0.149 / 1M tokens
  • #13: QPS=16, 1 × H100-PCIe, TPS=3011.13, TTFT p90=50.38 ms, cost=$0.185 / 1M tokens

We therefore select runs #6 and #13 as the chosen configurations for the RTX 4090 and the H100, respectively, for further in-depth analysis.
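
The cost per 1M tokens quoted above can be derived from the hourly rate, the GPU count, and the measured token throughput. A small sketch of that arithmetic (the function is ours, for illustration only; the inputs come from rows #6 and #13):

# Sketch: cost per 1M generated tokens = hourly hardware cost / tokens per hour, scaled to 1M.
def cost_per_million_tokens(price_per_gpu_hour: float, gpu_count: int, tokens_per_s: float) -> float:
    hourly_cost = price_per_gpu_hour * gpu_count
    tokens_per_hour = tokens_per_s * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Run #6: 2 x RTX-4090 at 1250.59 tokens/s  ->  ~$0.111 / 1M tokens
print(round(cost_per_million_tokens(0.25, 2, 1250.59), 3))
# Run #13: 1 x H100-PCIe at 3011.13 tokens/s  ->  ~$0.185 / 1M tokens
print(round(cost_per_million_tokens(2.00, 1, 3011.13), 3))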

The figure below shows the scaling pattern, which helps identify the ideal QPS for each configuration.

Figure: Scaling pattern, target QPS and achieved requests per second.

We compare the target QPS (the rate triggered by the benchmark) against the achieved throughput (requests/s). The H100 closely follows the ideal 100% scaling line up to ≈8 req/s, then begins to plateau around 7.8 req/s at 16–32 QPS, reflecting the card's device-level throughput limit. 2 × RTX 4090 scales nearly linearly up to 4 QPS (≈3.2 req/s) but yields diminishing returns beyond that point, plateauing at ≈3.9 req/s at 16–32 QPS. Increasing to 4 local cards adds almost no additional throughput, indicating that on-node memory and inter-GPU communication become the bottlenecks. In contrast, 4 nodes × (2 × RTX 4090) overcomes these local limits: at 16 QPS it achieves ≈12 req/s (≈1.5× the H100). Horizontal sharding across multiple inference servers is therefore preferable to pushing a single replica past this knee.

In practice, for the RTX 4090 configuration at QPS > 8, one would load balance to avoid queue congestion and increased tail latency, for example among 4 vLLM server instances of 2 × RTX 4090 or 2 instances of 4 × RTX 4090, depending on the SLO requirements and expected steady stream duration, ideally routing dynamically based on each instance's real-time capacity. For the H100 configuration at QPS > 16, one would load balance across additional instances (for example 2 vLLM instances of H100) to keep latency under 100 ms, or keep a single H100 instance at QPS ≤ 32 if some jitter is allowed. Such orchestration across multiple vLLM servers is outside the scope of the collected benchmark measurements and is typically implemented with dedicated frameworks (e.g., Ray, Kubernetes) and custom load balancers.

Key takeaways:

  • The H100 maintained sub-55 ms p99 time-to-first-token, making it ideal for real-time applications and tight SLOs.
  • RTX 4090 clusters deliver markedly better unit economics. They suit research, dev/test environments, and any traffic that can tolerate 200–500 ms tail latencies, including streaming chat where some latency is acceptable, batch summarization, embedding, and eval sweeps. Operators could default to small RTX 4090 pools and reserve the H100 for cases where latency or head-of-line jitter dominates the user experience.
  • A distributed 4-node RTX 4090 deployment outperforms a single H100 in raw request throughput (≈12 req/s at 16 QPS) while remaining cost-competitive ($0.111 / 1M tokens).
  • Cost-performance sweet spot: 2 × RTX 4090 is the most efficient, demonstrating 4–7× more throughput per dollar than the H100 across typical QPS values.
  • The H100 shines when latency SLOs or peak throughput demands exceed what small consumer clusters can deliver.

Citation

If you use this work in your research, please cite¹:

@inproceedings{almeida2025gpus,
  author    = {Aline Ribeiro de Almeida},
  title     = {Idle Consumer GPUs as a Complement to Enterprise Hardware for LLM Inference: Performance, Cost and Carbon Analysis},
  booktitle = {Proceedings of the 2025 6th International Artificial Intelligence and Blockchain Conference (AIBC 2025)},
  year      = {2025},
  doi       = {10.1145/3775043.3775047},
  isbn      = {979-8-4007-1967-7/2025/09},
  publisher = {Association for Computing Machinery},
  location  = {Tokyo, Japan},
  month     = sep
}

¹ to be updated after final publication. Preprint: aibc2025

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.
