# Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study

Jianwei Zhu jianweiz@phala.network

Peng Deng pdeng21@m.fudan.edu.cn

Hang Yin hangyin@phala.network

Shunfan Zhou shelvenzhou@phala.network

September 16, 2024

#### Abstract

This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on nVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various LLMs and token lengths, with a particular focus on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results indicate that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily attributable to data transfer. For the majority of typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing nearly zero overhead.

## 1 Introduction

Trusted Execution Environments (TEEs) are increasingly important in machine learning and AI due to growing security requirements in both enterprise and decentralized applications [SAB15, MSM<sup>+</sup>18, AKKH18]. The introduction of TEE-enabled GPUs, such as the nVIDIA H100, adds an extra layer of protection for sensitive data but may impact performance. Understanding these trade-offs, particularly for large-scale machine learning tasks, is crucial for adopting TEE in high-performance AI applications [YMY<sup>+</sup>22, WO24].

This report quantifies the performance overhead of enabling TEE on the nVIDIA H100 GPU during LLM inference tasks, identifying where the overhead arises and under what conditions it can be minimized.

## 2 Background

#### 2.1 Trusted Execution Environment

A TEE is a hardware-based security feature that isolates computations, preventing unauthorized access and tampering, even from the operating system or the physical hardware owner. As the core technology enabling Confidential Computing, TEEs create secure enclaves where sensitive data and code are processed with encryption, ensuring confidentiality and integrity even if the broader system is compromised [SAB15]. Traditionally implemented in CPUs, TEE technology was extended to GPUs by nVIDIA in 2023, enabling tamper-proof and confidentiality-preserving computation inside the GPU with minimal performance penalty [DGK<sup>+</sup>23].

#### 2.2 nVIDIA H100 GPU

The nVIDIA H100 GPU marks a significant milestone as the first GPU to support TEE. In TEE mode, the H100 operates in an isolated and secure environment where data transfers between the CPU and GPU are encrypted. This is achieved through "bounce buffers", which protect all inputs and outputs during transit between the CPU's encrypted memory and the GPU's internal memory [DGK<sup>+</sup>23].

To maintain end-to-end security, the H100 works in conjunction with CPU TEEs, such as Intel's TDX [Int] or AMD's SEV-SNP [AMD, SS20], securing communication channels between the GPU driver and interacting software. This setup prevents unauthorized access and ensures data integrity throughout the process.

The H100 also implements remote attestation to verify the GPU's identity and the authenticity of its firmware. Additionally, Secure Boot ensures that only authenticated firmware is executed during the GPU's boot process, further strengthening security.

#### 2.3 Performance Impact

Enabling TEE on the nVIDIA H100 GPU introduces performance overheads primarily due to additional encryption and decryption during secure data transfer [MYF<sup>+</sup>24]. While the GPU's internal computation remains unaffected, the main bottleneck lies in the CPU-GPU I/O, particularly when data is exchanged via PCIe. This impact varies with the size of the data transfer. The following sections present experimental results quantifying these effects across various use cases.

With TEE now available on the nVIDIA H100 GPU, it is important to quantify the performance trade-offs in practical use cases. The next section outlines the methodology used to assess the performance impact of TEE during LLM inference tasks.

## 3 Methodology

To evaluate the performance overhead, we conducted experiments comparing inference throughput and latency with TEE mode enabled and disabled, under different models, input and output lengths, and batch size setups. Our primary focus was to reveal the performance penalty in real-world large language model (LLM) inference tasks.

#### 3.1 Metrics

We evaluated the following primary metrics, whose definitions follow typical evaluation frameworks [AAK<sup>+</sup>24]:

- TTFT (Time To First Token): The time from request arrival to the generation of the first output token. It includes scheduling delay and prompt processing. Lower TTFT is essential for real-time applications, while higher TTFT is tolerable in batch processing.
- ITL (Inter-Token Latency): The time between generating each token during decoding. This directly affects the perceived model speed. A rate of around 6 tokens per second is necessary for a smooth user experience, assuming an average reading speed.
- TPS (Tokens per Second): The average rate of token generation during decoding. It is calculated as the number of tokens generated divided by the total decoding time.
- Latency: The total execution time per request, including scheduling, prompt processing, and token generation. Lower normalized latency improves system throughput, especially under high query loads.
- QPS (Queries per Second): The maximum load a system can handle while meeting latency targets. Higher QPS reduces serving costs and is a key measure of system capacity.
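As a minimal sketch of how the per-request metrics above relate to one another, the following Python snippet derives TTFT, mean ITL, TPS, and latency from a request's arrival time and the timestamps of its output tokens. The `RequestTrace` class and its field names are hypothetical and exist only for illustration; they are not part of vLLM or of the benchmark suite used in this report. The snippet also assumes that decoding time runs from the first to the last output token, which the report does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request (all times in seconds)."""
    arrival: float            # time the request arrived
    token_times: List[float]  # time at which each output token was produced


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: request arrival to first output token."""
    return trace.token_times[0] - trace.arrival


def mean_itl(trace: RequestTrace) -> float:
    """Mean Inter-Token Latency: average gap between consecutive tokens."""
    if len(trace.token_times) < 2:
        raise ValueError("ITL needs at least two output tokens")
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)


def tps(trace: RequestTrace) -> float:
    """Tokens per second during decoding.

    Assumption: decoding spans first token to last token, so the tokens
    counted are those produced after the first one.
    """
    if len(trace.token_times) < 2:
        raise ValueError("TPS needs at least two output tokens")
    decode_time = trace.token_times[-1] - trace.token_times[0]
    return (len(trace.token_times) - 1) / decode_time


def latency(trace: RequestTrace) -> float:
    """Total execution time: request arrival to last output token."""
    return trace.token_times[-1] - trace.arrival


# Made-up example: first token after 0.5 s, then one token every 0.25 s.
example = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft(example), mean_itl(example), tps(example), latency(example))
```

Under these assumptions, TPS during decoding is simply the reciprocal of the mean ITL, which is why the two metrics move together in the results.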

#### 3.2 Test Scenarios

Experiments were structured to explore the impact of TEE mode under diverse conditions:

- TEE mode ON vs. TEE mode OFF: Tests were performed with TEE mode alternately enabled and disabled on the H100 GPU, allowing for a direct comparison of performance.
- Sequence Lengths: Various token lengths were tested by sampling the ShareGPT Dataset [ano] to simulate different LLM inference tasks.
- Batch Size: Both fixed batch sizes (1, 4, and 16) and dynamic batch sizes determined by vLLM [KLZ<sup>+</sup>23] were tested to simulate the performance for serving real-time requests and batch requests.

#### 3.3 Experimental Setup

#### 3.3.1 Infrastructure

The experiments were set up with the following hardware:

- GPU: nVIDIA H100 NVL (94 GB, 3.9 TB/s bandwidth)
- CPU: AMD EPYC 9V84 96-Core Processor with SEV-SNP
- Memory: 314 GB
- Driver versions:
  - CUDA 12.5 (driver version 555.42.02)
  - Kernel driver version 550.90.07

#### 3.3.2 Application

The experiments used the benchmark suite of vLLM v0.5.4 (rev: 4db5176) [KLZ<sup>+</sup>23].

#### 3.3.3 Models

Three LLMs were used for inference:

- Meta-Llama-3.1-8B-Instruct
- Phi-3-14B-128k-Instruct
- Meta-Llama-3.1-70B-Instruct with 4-bit bitsandbytes quantization, so that it fits into a single H100 GPU
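The test scenarios and models described above span a small configuration matrix. The following Python sketch enumerates it; it assumes that every combination of TEE mode, model, and batch-size setting was run, which the report does not state explicitly, and the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

TEE_MODES = ("on", "off")
MODELS = (
    "Meta-Llama-3.1-8B-Instruct",
    "Phi-3-14B-128k-Instruct",
    "Meta-Llama-3.1-70B-Instruct",  # 4-bit quantized in this report
)
# Fixed batch sizes plus the dynamic batching performed by vLLM.
BATCH_SIZES = (1, 4, 16, "dynamic")


def experiment_grid():
    """Return one dict per (TEE mode, model, batch size) combination."""
    return [
        {"tee": tee, "model": model, "batch_size": batch}
        for tee, model, batch in product(TEE_MODES, MODELS, BATCH_SIZES)
    ]


for config in experiment_grid():
    print(config)
```

Each TEE-on configuration has a TEE-off twin that differs only in the `tee` field, which is what makes the pairwise overhead comparison in the results possible.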

## 4 Results

Conclusion 1: The average overhead is less than 7%. We quantified the overhead by measuring the throughput with TEE mode enabled versus disabled, across varying input sizes and model configurations, as shown in Table 1.

| Model | TPS TEE-on (tokens/s) | TPS TEE-off (tokens/s) | TPS overhead | QPS TEE-on (req/s) | QPS TEE-off (req/s) | QPS overhead |
|---------------|----------|----------|--------------------|---------|---------|--------------------|
| Llama-3.1-8B  | 123.2985 | 132.3618 | 6.85%              | 18.2141 | 18.8208 | 3.22%              |
| Phi3-14B-128k | 66.5845  | 69.7787  | 4.58%              | 7.1760  | 7.3456  | 2.31%              |
| Llama-3.1-70B | 2.4822   | 2.4789   | -0.13%<sup>1</sup> | 0.8325  | 0.8295  | -0.36%<sup>2</sup> |

Table 1: Performance comparison of TEE-on and TEE-off modes for various models in terms of TPS (tokens per second) and QPS (queries per second).

Throughput is measured in two ways: the average number of output tokens per second (TPS) and the number of requests per second the hardware can serve in parallel (QPS). TPS is measured by running the model with a batch size of 1; it shows the pure latency overhead introduced by TEE mode and reflects the performance of real-time requests. QPS is measured by maximizing query throughput with a dynamically optimized batch size; it reflects the minimal average overhead that TEE mode introduces.
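The report does not spell out how the throughput overhead is computed. The Python sketch below assumes it is the relative throughput loss with respect to TEE-off, that is, (TEE-off minus TEE-on) divided by TEE-off; this definition is inferred from Table 1, and it reproduces the TPS overhead column when applied to the values reported there.

```python
def throughput_overhead(tee_on: float, tee_off: float) -> float:
    """Relative throughput loss in percent: (off - on) / off * 100.

    Assumed definition, inferred from the values in Table 1.
    """
    return (tee_off - tee_on) / tee_off * 100.0


# (TEE-on, TEE-off) TPS values copied from Table 1.
TABLE_1_TPS = {
    "Llama-3.1-8B": (123.2985, 132.3618),
    "Phi3-14B-128k": (66.5845, 69.7787),
    "Llama-3.1-70B": (2.4822, 2.4789),
}

for model, (on, off) in TABLE_1_TPS.items():
    print(f"{model}: {throughput_overhead(on, off):.2f}%")
```

A negative result, as for Llama-3.1-70B, simply means the TEE-on run measured marginally higher throughput than the TEE-off run.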

Conclusion 2: The overhead decreases toward zero as the model size grows. As shown in Table 1, the smallest model (Llama-3.1-8B) has the highest overhead. The medium-sized model (Phi3-14B-128k) has roughly two-thirds of the overhead of the smallest one. The largest model (Llama-3.1-70B) has a negligible overhead close to zero.

<sup>1</sup> The overhead is negative due to precision loss.

<sup>2</sup> The overhead is negative due to precision loss.



Figure 1: Throughput overhead across different token sizes (length of the input and output sequence). Short sequences are no longer than 100 tokens. Medium sequences are no longer than 500 tokens. Long sequences are between 501 and 1500 tokens.

Conclusion 3: Latency is the main factor contributing to the overhead of TEE mode. Table 2 shows the latency overhead measured by TTFT and ITL. TTFT has a higher overhead than ITL, indicating that the bottleneck is likely introduced by I/O rather than by the computation inside the TEE. Nevertheless, the overhead becomes negligible when hosting computation-heavy models such as Llama-3.1-70B.

| Model | TTFT TEE-on (s) | TTFT TEE-off (s) | TTFT overhead | ITL TEE-on (s) | ITL TEE-off (s) | ITL overhead |
|---------------|--------|--------|--------------------|---------|---------|--------------------|
| Llama-3.1-8B  | 0.0288 | 0.0242 | 19.03%             | 1.6743  | 1.5549  | 7.67%              |
| Phi3-14B-128k | 0.0546 | 0.0463 | 18.02%             | 3.7676  | 3.5784  | 5.29%              |
| Llama-3.1-70B | 0.5108 | 0.5129 | -0.41%<sup>3</sup> | 94.8714 | 95.2395 | -0.39%<sup>4</sup> |

Table 2: Comparison of TTFT (Time To First Token) and ITL (Inter-Token Latency) for TEE-on and TEE-off modes across models.
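For latency metrics, higher values are worse, so the overhead has the opposite sign convention from throughput. The Python sketch below assumes the overhead is (TEE-on minus TEE-off) divided by TEE-off; this definition is inferred from Table 2 rather than stated in the report. Because the table lists rounded latencies, the recomputed percentages match the reported ones only to within about 0.1 percentage points.

```python
def latency_overhead(tee_on: float, tee_off: float) -> float:
    """Relative latency increase in percent: (on - off) / off * 100.

    Assumed definition, inferred from the values in Table 2.
    """
    return (tee_on - tee_off) / tee_off * 100.0


# (TEE-on, TEE-off) latencies in seconds, copied from Table 2.
TABLE_2 = {
    "Llama-3.1-8B": {"ttft": (0.0288, 0.0242), "itl": (1.6743, 1.5549)},
    "Phi3-14B-128k": {"ttft": (0.0546, 0.0463), "itl": (3.7676, 3.5784)},
    "Llama-3.1-70B": {"ttft": (0.5108, 0.5129), "itl": (94.8714, 95.2395)},
}

for model, metrics in TABLE_2.items():
    ttft_pct = latency_overhead(*metrics["ttft"])
    itl_pct = latency_overhead(*metrics["itl"])
    print(f"{model}: TTFT {ttft_pct:+.2f}%, ITL {itl_pct:+.2f}%")
```

For the two smaller models the TTFT overhead is more than twice the ITL overhead, which is the gap that Conclusion 3 attributes to I/O.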

Conclusion 4: The overhead decreases as the token size grows. As shown in Figure 1, the throughput overhead decreases as the sequence length, measured by the total input and output token count, grows. The detailed throughput metrics across various sequence lengths can be found in Table 3.

| Model | Short TEE-on (tokens/s) | Short TEE-off (tokens/s) | Short overhead | Medium TEE-on (tokens/s) | Medium TEE-off (tokens/s) | Medium overhead | Long TEE-on (tokens/s) | Long TEE-off (tokens/s) | Long overhead |
|---------------|----------|----------|-------|----------|----------|--------------------|----------|----------|--------------------|
| Llama-3.1-8B  | 127.0310 | 136.8282 | 7.16% | 122.9356 | 132.0464 | 6.90%              | 122.9705 | 131.7333 | 6.65%              |
| Phi3-14B-128k | 70.9799  | 74.7556  | 5.05% | 66.1690  | 69.3104  | 4.53%              | 66.2987  | 69.4176  | 4.49%              |
| Llama-3.1-70B | 2.5983   | 2.6073   | 0.34% | 2.4413   | 2.4374   | -0.16%<sup>5</sup> | 2.5245   | 2.5168   | -0.30%<sup>6</sup> |

Table 3: Performance comparison of TEE-on and TEE-off modes across different sequence lengths in terms of TPS (tokens per second). Short sequences are no longer than 100 tokens. Medium sequences are no longer than 500 tokens. Long sequences are between 501 and 1500 tokens.
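The length buckets in the caption above can be expressed as a small classifier over the total token count. This Python sketch assumes that "medium" means 101 to 500 tokens (the caption only gives the upper bound) and that sequences longer than 1500 tokens fall outside all three buckets; the function name is hypothetical.

```python
from typing import Optional


def length_bucket(input_tokens: int, output_tokens: int) -> Optional[str]:
    """Classify a request by total (input + output) token count.

    Returns "short", "medium", "long", or None when the total exceeds the
    1500-token upper bound of the "long" bucket.
    """
    total = input_tokens + output_tokens
    if total <= 0:
        raise ValueError("total token count must be positive")
    if total <= 100:
        return "short"
    if total <= 500:  # assumed to mean 101..500
        return "medium"
    if total <= 1500:
        return "long"
    return None


print(length_bucket(40, 60))     # total 100
print(length_bucket(200, 301))   # total 501
print(length_bucket(1000, 600))  # total 1600
```

Bucketing on the sum of input and output tokens follows the way sequence length is measured for Figure 1.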

Conclusion 5: TEE mode can reach typical throughput. As Table 3 shows, with medium-sized inputs the nVIDIA H100 GPU sustains roughly 123 to 132 TPS for Llama-3.1-8B and roughly 66 to 69 TPS for the larger Phi3-14B-128k model, depending on whether TEE mode is on or off. These results demonstrate the robust performance of the H100 GPU across models of varying complexity.

<sup>3</sup> The overhead is negative due to precision loss.

<sup>4</sup> The overhead is negative due to precision loss.

<sup>5</sup> The overhead is negative due to precision loss.

<sup>6</sup> The overhead is negative due to precision loss.

More detailed experimental data are shown in Figures 2, 3, and 4.



Figure 2: Throughput vs output token size for Llama-3.1-8B



Figure 3: Throughput vs output token size for Phi3-14B-128k



Figure 4: Throughput vs output token size for Llama-3.1-70B

## 5 Conclusion

Our results show that as input size grows, the efficiency of TEE mode increases significantly. When computation time within the GPU dominates overall processing time, the I/O overhead introduced by TEE mode diminishes, allowing efficiency to approach 99%.

Efficiency growth is more pronounced in larger models, such as **Phi3-14B-128k** and **Llama-3.1-70B**, due to their greater computational demands, which result in longer GPU processing times. Consequently, the I/O overhead becomes increasingly trivial as model size increases.

The total token size (sum of input and output token size) significantly influences the throughput overhead. Larger total token counts lead to higher efficiencies, as they enhance the ratio of computation time to I/O time.
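The relationship between computation time and I/O time can be illustrated with a toy model. The Python sketch below assumes that TEE mode adds a fixed extra I/O cost per request on top of an unchanged GPU computation time; both this model and the numbers fed into it are hypothetical and were not measured in this report.

```python
def tee_efficiency(compute_s: float, io_overhead_s: float) -> float:
    """Fraction of TEE-off throughput retained under TEE mode.

    Toy model: TEE-off time = compute_s, TEE-on time = compute_s +
    io_overhead_s, so efficiency = compute_s / (compute_s + io_overhead_s).
    """
    if compute_s <= 0 or io_overhead_s < 0:
        raise ValueError("compute_s must be > 0 and io_overhead_s >= 0")
    return compute_s / (compute_s + io_overhead_s)


# Hypothetical numbers: a fixed 5 ms of extra transfer cost per request.
IO_OVERHEAD_S = 0.005

for compute_s in (0.05, 0.5, 5.0):
    eff = tee_efficiency(compute_s, IO_OVERHEAD_S)
    print(f"compute={compute_s:>5.2f}s  efficiency={eff:.1%}  overhead={1 - eff:.1%}")
```

In this model the overhead shrinks monotonically as the computation time grows, which matches the qualitative trend observed for larger models and longer sequences.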

These findings underscore the scalability of TEE mode in handling large-scale LLM inference tasks, particularly as input sizes and model complexities grow. The minimal overhead in high-computation scenarios validates its applicability in secure, high-performance AI workloads.

## References

- [AAK<sup>+</sup>24] Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, Ramachandran Ramjee, and Alexey Tumanov. Metron: Holistic performance evaluation framework for LLM inference systems. arXiv preprint arXiv:2407.07000, 2024.
- [AKKH18] Gbadebo Ayoade, Vishal Karande, Latifur Khan, and Kevin Hamlen. Decentralized IoT data management using blockchain and trusted execution environment. In 2018 IEEE International Conference on Information Reuse and Integration (IRI), pages 15–22. IEEE, 2018.
- [AMD] AMD. AMD Secure Encrypted Virtualization-Secure Nested Paging. https://www.amd.com/en/developer/sev.html. Accessed: 2024-09-12.
- [ano] anon8231489123. ShareGPT Vicuna unfiltered. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered. Accessed: 2024-09-04.
- [DGK<sup>+</sup>23] Gobikrishna Dhanuskodi, Sudeshna Guha, Vidhya Krishnan, Aruna Manjunatha, Michael O'Connor, Rob Nertney, and Phil Rogers. Creating the first confidential GPUs: The team at NVIDIA brings confidentiality and integrity to user code and data for accelerated computing. Queue, 21(4):68–93, 2023.
- [Int] Intel. Intel Trust Domain Extensions. https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/overview.html. Accessed: 2024-09-12.
- [KLZ<sup>+</sup>23] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In *Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles*, 2023.
- [MSM<sup>+</sup>18] Sinisa Matetic, Moritz Schneider, Andrew Miller, Ari Juels, and Srdjan Capkun. DelegaTEE: Brokered delegation using trusted execution environments. In 27th USENIX Security Symposium (USENIX Security 18), pages 1387–1403, 2018.
- [MYF<sup>+</sup>24] Apoorve Mohan, Mengmei Ye, Hubertus Franke, Mudhakar Srivatsa, Zhuoran Liu, and Nelson Mimura Gonzalez. Securing AI inference in the cloud: Is CPU-GPU confidential computing ready? In 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pages 164–175. IEEE, 2024.
- [SAB15] Mohamed Sabt, Mohammed Achemlal, and Abdelmadjid Bouabdallah. Trusted execution environment: What it is, and what it is not. In 2015 IEEE Trustcom/BigDataSE/ISPA, volume 1, pages 57–64. IEEE, 2015.
- [SS20] AMD SEV-SNP. Strengthening VM isolation with integrity protection and more. White Paper, January, 53:1450–1465, 2020.
- [WO24] Qifan Wang and David Oswald. Confidential computing on heterogeneous systems: Survey and implications. arXiv preprint arXiv:2408.11601, 2024.
- [YMY<sup>+</sup>22] Ardhi Wiratama Baskara Yudha, Jake Meyer, Shougang Yuan, Huiyang Zhou, and Yan Solihin. LITE: a low-cost practical inter-operable GPU TEE. In *Proceedings of the 36th ACM International Conference on Supercomputing*, pages 1–13, 2022.