# ASCEND-CC: Confidential Computing on Heterogeneous NPU for Emerging Generative AI Workloads

Aritra Dhar<sup>★†</sup> Clément Thorens<sup>‡†\*</sup> Lara Magdalena Lazier<sup>★</sup> Lukas Cavigelli<sup>★</sup>

† Joint first authors

★Huawei Zurich Research Center

‡ETH Zurich

### **Abstract**

Cloud workloads have dominated generative AI based on large language models (LLMs). Specialized hardware accelerators, such as GPUs, NPUs, and TPUs, play a key role in AI adoption due to their superior performance over general-purpose CPUs. The AI models and the data are often highly sensitive and come from mutually distrusting parties. Existing CPU-based TEEs such as Intel SGX or AMD SEV do not provide sufficient protection. Device-centric TEEs like Nvidia-CC only address tightly coupled CPU-GPU systems with a proprietary solution that requires a TEE on the host CPU side. On the other hand, existing academic proposals are tailored toward specific CPU-TEE platforms.

To address this gap, we propose ASCEND-CC, a confidential computing architecture based on discrete NPU devices that requires no trust in the host system. ASCEND-CC provides strong security through end-to-end encryption that protects not only the data but also the model parameters and operator binaries. ASCEND-CC uses delegation-based memory semantics to ensure isolation from the host software stack, and task attestation provides strong model integrity guarantees. Our ASCEND-CC implementation and evaluation with state-of-the-art LLMs such as Llama2 and Llama3 show that ASCEND-CC introduces minimal overhead with no changes in the AI software stack.

### 1 Introduction

Recently, generative AI (GenAI) has gained momentum, with large language models (LLMs) being used in applications such as chatbots [1], image and video generation [2, 3], and code completion [4]. GenAI is considered a building block [5] for future artificial general intelligence (AGI). Major cloud providers offer AI-centric services [6–9] that typically utilize specialized accelerators such as GPUs, NPUs, and TPUs.

**Security concerns.** These large GenAI workloads bring numerous security challenges in data center environments. First, the massive data and computation resources required to train LLMs make them exceedingly expensive [10] and prime intellectual property for the model providers. Second, users' queries to the LLMs often contain sensitive information such as health data, personal information, or even business secrets [11]. In data center deployments, there are three mutually distrusting parties: the model provider, the data provider, and the cloud provider. The *model provider* develops and owns the AI model. The *data provider* gives the data to the AI model for processing. Finally, the *cloud provider* owns the computing infrastructure where the AI models are trained or run inference. The data provider interacts with the computing infrastructure to deliver their data to the AI model. Therefore, the AI model and data require protection from the cloud provider and the software stack.

**Gap in prior works.** Existing CPU-based trusted execution environments (TEEs) [12–15] enable secure applications, or enclaves, isolated from a malicious software stack such as the OS or hypervisor. They can often withstand untrusted DRAM and buses by employing memory encryption. Beyond CPUs, devices such as GPUs [16–18], IPUs [19], and FPGAs [20–22] provide TEEs. Several existing works [23–27] also extend a CPU TEE's security primitives to connected devices. However, except for very few proposals [19], most existing proposals require a CPU TEE, which increases the Trusted Computing Base (TCB). Recent side-channel attacks [28–32] demonstrate that attackers can undermine the security guarantees of CPU TEEs. Additionally, proposals requiring a confidential VM or C-VM (such as CCA and SEV) further increase the TCB by trusting the C-VM OS. Several proposals only consider integrated GPUs [24, 26, 27] or NPUs [25], which simplifies the problem by not considering PCIe communication and memory spaces separate from the CPU. Placing part of the driver inside a high-privilege trusted security monitor [25, 26, 33] offloads critical security decisions, such as memory allocation for tasks and binaries, memory sharing, and access control, to a trusted entity. Such designs expand the TCB and do not fit well in a scenario where the host is fully malicious. While a proposal on the Graphcore IPU [19] can withstand a TEE-less host, it does not support a modern AI software stack with interactive sessions with the accelerator and requires extensive hardware modification. Finally, almost all existing TEE proposals consider older CNN-based AI models or evaluate using operators such as matrix multiplication or SVD, which have a small memory footprint. Therefore, these solutions do not scale to LLMs, which have large memory footprints and require low-latency responses.

<sup>\*</sup>Work done while the author was at Huawei Zurich Research Center

Our contribution. We design ASCEND-CC, a confidential computing solution for discrete NPUs without relying on a CPU TEE. The TCB of ASCEND-CC is only the NPU itself, and the entire host is untrusted. The hardware root of trust (HW-RoT) in the NPU facilitates key derivation to establish a secure channel between the model and the data provider and enable attestation. Measured boot ensures the NPU is booted with the correct firmware signed by the hardware manufacturer. ASCEND-CC accepts fully encrypted data and models from the data and model provider. During inference, the NPU removes all the DMA mapping from the host's virtual address space (by removing SMMU entries) to prevent malicious DMA operations from entering its model, data, and workspace (operator execution space). Only after the memory unmapping does the data and model decryption start. The results are encrypted before the corresponding memory is DMA mapped to the host. The NPU runtime creates tasks from the AI models that dictate the order of operator execution (e.g., matrix multiplication or ReLU) and memory operation (such as DMA copy from host to device). The malicious host can inject tasks (e.g., performing a DMA copy) into the model to compromise the confidentiality of the data and model. ASCEND-CC performs the task and binary attestation of the model before the model starts executing to ensure that the integrity of the task and model is preserved. The isolation and end-to-end encryption are provided without introducing changes to AI software stack such as PyTorch. Therefore, ASCEND-CC does not burden the AI programmer to introduce confidential computing to the existing codebase.
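The idea behind task and binary attestation can be illustrated with a minimal sketch. It is not the ASCEND-CC implementation: the `Task` fields, the `measure` and `attest` helpers, and the choice of SHA-256 over length-prefixed fields are all assumptions made for illustration. The sketch only shows why a measurement over the ordered task list and its operator binaries exposes a DMA-copy task injected by the host.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are illustrative only.
    kind: str       # e.g. "matmul", "relu", "dma_copy"
    binary: bytes   # operator binary (empty for pure memory operations)
    args: tuple     # argument addresses, treated as opaque integers


def measure(tasks):
    """Fold every task, in execution order, into one SHA-256 digest."""
    h = hashlib.sha256()
    for t in tasks:
        for field in (t.kind.encode(), t.binary, repr(t.args).encode()):
            # Length-prefix each field so field boundaries are unambiguous.
            h.update(len(field).to_bytes(8, "big"))
            h.update(field)
    return h.hexdigest()


def attest(tasks, expected):
    """Accept the task list only if it matches the expected measurement."""
    return hmac.compare_digest(measure(tasks), expected)


model = [
    Task("matmul", b"\x01\x02", (0x1000, 0x2000)),
    Task("relu", b"\x03", (0x3000,)),
]
expected = measure(model)  # assumed to come from the model provider

# A malicious host injects a DMA copy between the two operators.
injected = model[:1] + [Task("dma_copy", b"", (0x3000, 0x9000))] + model[1:]

print(attest(model, expected))     # True
print(attest(injected, expected))  # False
```

Because the digest covers order, binaries, and arguments, reordering tasks or swapping an operator binary changes the measurement just as an injected task does.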

We demonstrate ASCEND-CC on a Huawei Ascend 910A, a state-of-the-art NPU, by modifying its firmware. We evaluate ASCEND-CC with state-of-the-art transformer-based LLMs such as GPT-neo-125M, Llama-2-7B-Base, Llama-2-7B-chat, Llama-2-13B-instruct, Llama-3-8B-Base, Llama-3-ChatQA-1.5-8B, Llama-3-8B-Instruct, and CodeLlama-7B-Instruct, for tasks like chat, sentence completion, and code completion. Our evaluation shows that ASCEND-CC introduces minimal performance loss during the inference pass (0.91% for GPT-neo-125M and 0.028% for Llama-3-ChatQA-1.5-8B, both with a 2K input sequence length) and the one-time setup. Even though our proposal focuses on a specific NPU implementation, our design philosophy extends to other AI accelerators, such as GPUs and TPUs, that exhibit task-based execution models.

In summary, this paper makes the following contributions:

- (1) **Identify fundamental building blocks for NPU-based confidential computing.** We identify security properties for protecting models and data from untrusted hosts and cloud providers, specifically, how to prevent the privileged host from accessing data and models on the NPU and from changing their memory mapping. We list these observations and requirements in Sec. 3.
- (2) **System design and analysis.** We design ASCEND-CC that enables confidential computing in NPU.
- (3) **End-to-end evaluation.** We implement and evaluate ASCEND-CC with state-of-the-art LLMs and show that ASCEND-CC introduces minimal overhead with no changes to the AI software stack.

Figure 1: High-level architecture of the Ascend 910A SoC, along with the shared virtual memory with a 64-bit host CPU.

### 2 NPU Background

**NPU hardware architecture.** In this paper, we focus on the Huawei Atlas 300T card [34], which is based on the Ascend 910A SoC. The Ascend 910A is a state-of-the-art AI SoC designed primarily as a training and inference accelerator for large data centers and clouds. Fig. 1 shows a high-level view of the Ascend 910A NPU SoC architecture. All the components of the SoC are connected via an internal bus. The NPU SoC has two types of computation cores to execute AI tasks: AI CPUs and AI Cores. The four AI CPU cores are general-purpose Huawei Taishan cores (ARM A73 profile) with hardware cryptographic extensions. The 32 AI Cores are based on the Huawei DaVinci [35] architecture, which is optimized for executing neural network operations. The control CPU is a Taishan core (similar to the AI CPU cores) that runs the NPU firmware and manages the PCIe interfaces. After the NPU powers on, the control CPU boots a minimal Linux kernel (measured boot) and initializes the hardware components. The task scheduler combines a dedicated hardware component with firmware running on a Taishan core. It distributes AI tasks to the NPU's computing resources: the AI CPUs and AI Cores.

**NPU driver, runtime, and AI stack.** The NPU runtime stack converts instructions from the higher-level AI software stack (PyTorch/TensorFlow) and communicates with the NPU driver. The NPU driver is a set of Linux kernel modules that manages communication over DMA, issues MMIO commands to send instructions, and monitors NPU health. The Ascend PyTorch adapter (torch\_npu [36]) provides the necessary interfaces to bridge high-level AI-specific APIs to lower-level NPU driver calls. PyTorch provides two types of model execution: eager and dynamo. Eager is an interactive execution mode in which tasks are executed as soon as they arrive at the NPU. This is done with the compileAndExecute instruction, which compiles a single operation to a graph and executes it on the NPU. Dynamo, or graph mode, waits for the entire graph to be compiled and deployed to the NPU before executing the tasks.

The NPU runtime copies the data and model to the NPU HBM. The model includes the parameters associated with each AI model layer and the corresponding operator binaries (such as matrix multiplication, ReLU, etc.). Next, the NPU runtime creates a set of tasks, such as matrix multiplication, to be executed on the AI CPUs or AI Cores. A task contains operator metadata such as the location of the operator binary in NPU memory (PC\_START), the location of the data arguments, and the workspace used to store intermediate variables. The runtime sends the tasks to the task buffer, a reserved NPU memory location. After sending all tasks, the NPU runtime sends the executeModel task, or compileAndExecute in eager mode. This triggers the task scheduler to read the task buffer in order, select the first task, and submit it to the corresponding AI CPU or AI Core. After executing the task, the scheduler moves the task entry to the completion queue (CQ) and continues until the task buffer is empty. The NPU runtime can read the CQ to learn the current progress of the execution. The above description is for a single PCI stream context, and multiple such contexts may exist concurrently. The task scheduler processes these streams in parallel.
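The task flow described above can be sketched for a single stream. This is a simplified model under stated assumptions, not the Ascend runtime or firmware: the `Task` fields mirror the metadata named in the text (PC\_START, arguments, workspace), while `target`, `run_stream`, and the `execute` callback are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str        # operator name, e.g. "matmul"
    pc_start: int    # address of the operator binary in NPU memory (PC_START)
    args: tuple      # addresses of the data arguments
    workspace: int   # address of scratch space for intermediate variables
    target: str      # hypothetical field: "ai_core" or "ai_cpu"


def run_stream(task_buffer, execute):
    """Drain one stream's task buffer in order and return its completion queue."""
    completion_queue = []
    pending = deque(task_buffer)
    while pending:
        task = pending.popleft()        # scheduler selects the first task
        execute(task)                   # submit it to an AI Core or AI CPU
        completion_queue.append(task)   # move the entry to the CQ
    return completion_queue


# Usage: two tasks queued by the runtime, executed in submission order.
dispatched = []
tasks = [
    Task("matmul", 0x100, (0x1000, 0x2000), 0x9000, "ai_core"),
    Task("relu", 0x200, (0x3000,), 0x9000, "ai_core"),
]
cq = run_stream(tasks, lambda t: dispatched.append((t.target, t.name)))
print([t.name for t in cq])
```

Polling the length of the completion queue against the number of submitted tasks corresponds to the runtime reading the CQ to track progress.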

### 3 Motivation and Attacker Model

**Motivation: Gap in the Prior Art.** Large ML/AI workloads involving sensitive and proprietary data require securing data and computation in the cloud [37, 40]. Notably, the rise of LLMs has necessitated confidential computing settings with three parties: the data provider, the cloud provider, and the model provider. Neither the model nor the data can be leaked to other parties. We call this setting *multi-residence TEE*, as the TEE needs to access code and data from different mutually distrusting parties. This is a clear deviation from the traditional cloud-TEE setting involving two parties: the enclave user and the untrusted cloud.

Prior works port the confidential computing paradigm to ML-specific accelerators (e.g., NPUs [25], GPUs [16–18]). As we show in Table 1, most existing proposals rely on CPU-based TEE solutions on the host, such as Intel SGX [17, 18], AMD SEV [38], TrustZone [26, 27], or ARM CCA [23, 24], and build on them to extend security guarantees across devices. This has the advantage of already having a trusted component, i.e., an enclave on the host, to ensure secure communication with the (trusted) device. This enables sending authenticated commands using authenticated encryption, signatures, or memory isolation for integrated devices. However, CPU-based TEE approaches significantly enlarge the TCB, requiring trust not just in the accelerator device but also in the CPU and the respective monitors and drivers. Moreover, they complicate compatibility, as the solution relies on the specific CPU on the host, and existing attacks [28–30] may even undermine the CPU TEEs' security guarantees.

Figure 2: Memory footprint of Llama-3-8B and Llama-2-13B on an Ascend 910A NPU with 32 GB HBM.

To the best of our knowledge, only two previous works remove trust from the host completely and solely use the device as the hardware root of trust (HRoT): ShEF [21] and Graphcore IPU [19]. ShEF uses an FPGA as the trusted device, ensuring integrity and confidentiality through authenticated and encrypted bitstreams with device isolation. Graphcore relies on a specialized compiler to convert the model into an encrypted and authenticated binary in a clean-room environment, which can be sent directly to the device without host intervention. It requires significant hardware changes and rules out interactive AI software stacks. Therefore, these systems are impractical for real-world, large-scale models.

Our proposal *solely* relies on the NPU as the root of trust, does not require a CPU TEE, works with current AI frameworks, requires no hardware changes, extends to other task-based AI accelerators such as TPUs, and is optimized to run modern real-world LLMs.

**A case against spatial sharing.** The increasing scale of LLMs influences resource-sharing and utilization strategies. Earlier ML-specific TEEs focused on spatial sharing to boost device utilization. They rely on complex techniques (multiple page tables, monitors, dedicated hardware support) to facilitate resource and performance isolation. We observe this trend of spatial sharing (also called multi-tenancy) in commercial confidential computing solutions such as Nvidia-CC on H100 and B100 (using MIG [41]), as well as in several academic proposals [17, 23–26]. Multi-tenancy is often a choice for workloads with low memory utilization, such as those evaluated in existing academic proposals: older CNNs (ResNet, VGG, AlexNet, and MobileNet) or isolated operations such as matrix multiplication or SVD. However, such workloads do not represent modern AI workloads like generative AI. This is evident in Fig. 2, where we evaluate the memory utilization of state-of-the-art transformer-based LLMs such as Llama2 and Llama3 on a Huawei Ascend 910A NPU with 32 GB HBM and observe over 90% HBM utilization. Moreover, a single NPU has insufficient memory to load a 70B-parameter model or to execute a 13B-parameter model with an input sequence longer than 2K. Internal data structures, such as the KV cache, grow quadratically with the input sequence length. Therefore, increasing sequence

Table 1: Comparison with existing confidential computing mechanisms on specialized accelerator device and their security capabilities.

Legend: ●: trusted driver; ◐: partially trusted driver; ○: untrusted driver; ✓: supported; ✗: not supported; ?: unknown; C-VM: confidential VM; SM: security monitor; HRoT: hardware root of trust; SB: secure boot; RA: remote attestation; LA: local attestation; STA: single task attestation; TA: task attestation; SC: security controller; PT: PyTorch; TF: TensorFlow.

| | CC capability and trust assumption | | | | | Device | | AI/ML programming capability | | Required changes | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Existing systems | Host TCB | Isolation<br>granularity | Spatial sharing | Multi-residence<br>TEE | Attestation<br>(CPU/host excl.) | Type | Interface | Native programming interface | AI stack | HW | SW |
| Graviton [17] | Intel SGX + ○ | GPU<br>contexts | ✓ | ✗ | HRoT, RA, STA | GPU | PCIe | CUDA | ? | SC | Runtime, drivers, CUDA |
| HIX [18] | Intel SGX + ● | Enclaves | ✓ | ✗ | LA | GPU | PCIe | CUDA | ? | SGX instruction,<br>MMU, PCIe | GPU enclave, inter-enclave<br>communication, CUDA |
| Graphcore IPU [19] | ○ | Device | ✗ | ✓ | HRoT, SB, RA | IPU | PCIe | Proprietary | TF | CCU | XLA, Poplar compiler, runtime |
| Nvidia-CC (H100) [16] | C-VM + ● | VM | ✓ | ✗ | HRoT, SB, RA | GPU | PCIe | CUDA | TF/PT | Security<br>processor | CUDA, C-VM |
| Apple PCC [37] | Enclave CPU (?) | Node | ? | ? | HRoT, SB, RA | NPU | Internal | Swift | Proprietary<br>support | Custom Apple<br>Silicon | SepOS, SW stack |
| sNPU [25] | Penglai Enclave<br>+ ◐ + SM | Worlds | ✓ | ✗ | HRoT, SB | NPU | Internal | Proprietary | ? | SoC-NoC, SC | SMMU, SM |
| StrongBox [26] | TrustZone<br>+ ◐ + SM | Worlds | ✗ | ✗ | SB | GPU | Internal | OpenCL | ? | - | Runtime, driver,<br>MMIO, TASK protector |
| Honeycomb [38] | AMD-SEV + ○<br>+ Validator + SM | SVSM<br>+ SM | ✓ | ✗ | SB, RA (of SM) | GPU | PCIe | HIP | ? | - | Validator, SVSM,<br>SM, runtime |
| CAGE [24] | ARM CCA<br>+ ◐ + SM | Realm | ✓ | ✗ | RA | GPU | PCIe | OpenCL | ? | - | API, monitors, ShadowTas |
| ACAI [23] | ARM CCA<br>+ ● + PCIe port | Realm | ✗ | ✗ | SB, RA | GPU+<br>FPGA | PCIe | CUDA | ? | - | TF-A, SMMU, RMM |
| GR-T [27] | TrustZone<br>+ ● + cloudVM | VM<br>+ Worlds | ✓ | ✗ | SB, RA | GPU | Internal | GlobalPlatform API | ? | - | Driver Shim, GPU_shim |
| HETEE [39] | ● + SM | Node | ✗ | ✗ | SB, RA | GPU | PCIe | Proprietary | ? | HETEE box, PCIe<br>interconnect, (SC) | SC, API |
| ShEF [21] | ○ | Device | ✗ | ✗ | HRoT, SB, RA | FPGA | PCIe | ✗ | ✗ | - | ShEF runtime,<br>Shield |
| ASCEND-CC | ○ | Device | ✗ | ✓ | HRoT, SB, RA, TA | NPU | PCIe | CANN | PT/TF | - | Driver, runtime,<br>kernels (operators) |

length poses a memory capacity challenge. For example, the NPU runs out of HBM with Llama-2-13B at input sequence lengths of more than 2K. Second, generative AI applications such as chatbots [1], code generation [4], and search [42, 43] are latency-sensitive. Reduced compute resources due to spatial sharing between multiple tenants result in higher response latency. Given the large memory footprint and latency *sensitivity* of LLMs, we conclude that multi-tenancy is *irrelevant* for these workloads. Therefore, we explicitly aim for a single-tenant solution.
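The memory-capacity argument can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under assumptions that are not taken from the paper: weights are stored in 16-bit precision (2 bytes per parameter), parameter counts are the nominal 8B, 13B, and 70B, and the 32 GB of HBM is treated as 32 GiB. It counts weights only, so it underestimates real usage, which also includes the KV cache, activations, and operator workspace.

```python
GIB = 2**30  # bytes per GiB


def weight_footprint_gib(num_params, bytes_per_param=2):
    """Memory needed for the model weights alone, in GiB (assumes FP16 by default)."""
    return num_params * bytes_per_param / GIB


def weights_fit(num_params, hbm_gib=32, bytes_per_param=2):
    """True if the weights alone fit in the given HBM capacity."""
    return weight_footprint_gib(num_params, bytes_per_param) <= hbm_gib


# Nominal parameter counts (assumed, not exact model sizes).
for name, params in [("8B", 8e9), ("13B", 13e9), ("70B", 70e9)]:
    print(name, round(weight_footprint_gib(params), 1), weights_fit(params))
```

Under these assumptions, a 13B model needs roughly 24 GiB for weights alone, leaving little headroom for sequence-dependent state on a 32 GB device, while a 70B model needs roughly 130 GiB and cannot be loaded at all.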

**Settings.** The AI workloads involve three parties. The model provider brings his proprietary model, while the data provider uses her secret data to run inference workloads. The model provider can also use the data to train or fine-tune the model. The hardware and software, i.e., the host system, NPUs, network infrastructure, OS, hypervisor, and AI software stack, are deployed by the cloud provider, where all the computations occur. The cloud provider can also offer the models for the data provider to run an inference service.

**Trust assumption and attacker model.** In a typical trusted execution scenario, the software stack and the cloud provider are untrusted. In addition, our setting involves two types of TEE users: the model and the data provider. The model and data provider are mutually distrusting, and neither trusts the cloud provider:

- (1) **Cloud/infrastructure provider:** The cloud service provider (CSP) is responsible for provisioning and maintaining all hardware and software resources for operation. The CSP controls all the nodes (CPUs, NPUs), manages all the infrastructure such as network interfaces and switches, and maintains all the software such as the OS, hypervisor, device drivers, firmware, and AI/ML software stack such as PyTorch or TensorFlow. We assume all the hardware and software the cloud provider provides are *untrusted* except the specific NPUs where the AI model execution occurs.

- (2) **Model provider:** The model provider develops and trains the model and keeps the model's composition and parameters secret. The model provider may train an open-source model with a proprietary data set. In that case, the parameters of the model are secret. The model provider could also comprise multiple parties: a foundation model owner and a fine-tuner. The fine-tuner trains the foundation model with a specific data set to suit application scenarios like video generation or chat. The CSP could also be the model provider in the machine learning as a service (MLaaS) scenario.
- (3) **Data owner:** Typically, the data owner is the client who wants to use a specific model and cloud infrastructure (CPU, memory, NPU) for training or inference. There are also scenarios where the data owner brings her own model. Typically, however, we assume that the data and the model provider are two separate, mutually distrusting parties.

We only assume the specific NPUs where the AI/ML workloads are deployed are trusted. The NPUs have an on-chip hardware security module (HSM) that acts as the hardware root of trust. Lastly, we assume that denial of service (DoS) and side-channel attacks are outside the scope of this paper.

```
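import torch
import torch_npu  # Ascend adapter for PyTorch; enables the .npu() tensor method

def mm_npu_kernel(m, n):
    M1 = torch.rand(m, n).npu()  # copy M1 to the NPU
    M2 = torch.rand(n, m).npu()  # copy M2 to the NPU (n x m so that M1 @ M2 is defined)
    M3 = torch.mm(M1, M2).cpu()  # multiply on the NPU, copy result M3 to the CPU
    return M3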
```

Figure 3: An example code of matrix multiplication on Ascend NPU.



Figure 4: An example matrix multiplication task and memory layout on NPU, corresponding to the code snippet in Fig. 3.

## 4 Security Challenges and Requirements

We use matrix multiplication as a running example, as depicted in Fig. 3. The NPU runtime copies the NPU-optimized binary for the matrix multiplication kernel (torch.mm) together with M1 and M2 onto the NPU's memory. After the kernel executes, the NPU copies M3 from the NPU memory to the CPU's main memory. The three tasks corresponding to this example are depicted in Fig. 4, namely, a memory copy of the tensors (M1, M2) from the host to the NPU, the matrix multiplication, and a memory copy of the result tensor (M3) from the NPU to the host memory. The host-side NPU runtime reserves memory spaces on the NPU HBM for the binary of the matrix multiplication operator, the tensors M1, M2, and M3, and the workspace (the operator's working space, e.g., heap and stack). Based on this execution model, we observe several security challenges assuming an attacker-controlled host and cloud provider. From these observations, we derive a set of security requirements that the NPU must provide to ensure the security of the model, data, and execution.

**Security Challenge 1:** The untrusted host runs the privileged software, such as the OS, hypervisor, and device driver, along with the AI software stack, such as PyTorch, that handles the data and model. Therefore, at any point in a traditional AI/ML scenario, the untrusted host has full access to the data and model.

→ Requirement 1: End-to-end authenticated encryption (such as AES-GCM) is necessary for all input and output tensors, model parameters, and model binaries to ensure the attacker-controlled host cannot observe or manipulate data and models. The host only handles authenticated encrypted (AES-GCM) data, models, and results at any point.

**Security Challenge 2:** The NPU requires the model and data to be decrypted before execution. The attacker-controlled host can extract the data and model when the NPU decrypts the data before execution. Therefore, it is critical to ensure that once the data and model are decrypted inside the NPU, the attacker cannot extract or manipulate the model and data on the NPU. Similarly, end-to-end authenticated encryption of the execution result is not enough, as the host can access the plain text results from the NPU memory before the encryption of the results takes place.

→ Requirement 2: Atomic execution invariant: Because the data and model must be decrypted before execution, the host must lose access to the corresponding NPU memory before the decryption takes place. This can be achieved by having the NPU remove the DMA mapping of the NPU memory region where the data and model reside.

**Security Challenge 3:** The host defines the memory mapping of the model's inputs and outputs. This includes the pointers to the input data and the model and the output result from the model. A malicious host can declare the output pointer to be the same as the input data or the model, compromising the model's confidentiality.

→ Requirement 3: Memory invariant: For every memory copy from NPU to host, the NPU must reject the copy if its source region holds plaintext input data, model content, intermediate results, or output.

**Security Challenge 4:** Even without direct access to the data or the model, the untrusted host can send malicious commands to the NPU, such as copying part of the model parameters to the results sent to the data provider, compromising the model's confidentiality.

→ Requirement 4: The integrity of the model execution must be guaranteed, because a lack of execution integrity compromises both model and data security.

**Security Challenge 5:** The security primitives for confidential computing are only valid and secure if there is a mechanism to attest the NPU. Without a systematic way to check the integrity of the NPU firmware (which includes the binaries for the control CPU, task scheduler, NPU memory manager, etc.), we cannot assert the trustworthiness of the NPU's confidential computing capability.

→ Requirement 5: We need a measured boot-equivalent primitive for the NPU, where the NPU only accepts manufacturer-certified firmware and does not allow the attacker-controlled host to reflash its firmware or change the runtime configuration.

**Security Challenge 6:** The NPU and the corresponding software stack provide several debugging methods for correctness and performance, such as inspecting the NPU memory, acquiring a snapshot of a memory region, watching execution time, etc. Such mechanisms allow the attacker to extract models and data.

→ Requirement 6: To ensure data and model confidentiality, all debugging-related operations must be restricted.
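Requirement 3 boils down to an address-range check on every device-to-host copy. The following minimal sketch illustrates such a check; the helper names (`overlaps`, `copy_out_allowed`) and the HBM addresses are hypothetical and only serve to show the aliasing case from Security Challenge 3.

```python
def overlaps(a_start, a_len, b_start, b_len):
    """True if [a_start, a_start+a_len) and [b_start, b_start+b_len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len

def copy_out_allowed(src, length, protected):
    """Allow a device-to-host copy only if it touches no protected range."""
    return not any(overlaps(src, length, p, n) for p, n in protected)

# Hypothetical HBM layout: operator binary, M1, M2, and workspace hold plaintext.
protected = [(0x1000, 0x100), (0x2000, 0x400), (0x3000, 0x400), (0x5000, 0x800)]
output_m3 = (0x4000, 0x400)  # result region, assumed to hold ciphertext only

benign = copy_out_allowed(*output_m3, protected)        # legitimate copy of M3
aliased = copy_out_allowed(0x2000, 0x400, protected)    # output pointer aliased to M1
straddle = copy_out_allowed(0x2FF0, 0x20, protected)    # copy that crosses into M2
```

Here the legitimate copy of M3 passes, while both the aliased and the straddling copy are rejected, since even a partial overlap would leak plaintext.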

# 5 Basic Building Blocks for Confidential Computing on the Ascend NPU

Here, we provide the basic design building blocks to enable confidential computing on the Ascend NPU that are associated with the requirements described in Sec. 4.

# 5.1 Model and Data Encryption

In Requirement 1 (Sec. 4), we describe that it is essential that the AI model and the corresponding data are encrypted with the keys shared between the NPU and the model and data providers, respectively. As we assume that the model and data providers are mutually distrusting, they cannot access each other's shared key.

**Set up.** The model and data providers initiate an authenticated key exchange (DH) with the NPU to start a session. The NPU derives ephemeral keys from its root key, along with the firmware measurement (derived in the measured boot), with the root key stored in the hardware RoT (refer to Sec. 5.4). This ensures that the model and data provider interact with a legitimate NPU device with the correct firmware image signed by the hardware vendor.
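To illustrate how a session key can be bound to both the root key and the firmware measurement, the sketch below uses HKDF (RFC 5869) built from HMAC-SHA256. This is an assumed construction: the text does not specify the key-derivation function, and `hkdf`, `derive_session_key`, and all input values are hypothetical placeholders.

```python
import hashlib
import hmac

def hkdf(ikm: bytes, salt: bytes, info: bytes, length: int = 16) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA256: extract, then expand."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def derive_session_key(root_key: bytes, fw_measurement: bytes, nonce: bytes) -> bytes:
    # Assumption: the firmware measurement is mixed in as HKDF "info", so a
    # different firmware image yields an unrelated key.
    return hkdf(root_key, salt=nonce, info=b"session" + fw_measurement)

root_key = bytes(32)  # placeholder for the root key held in the hardware RoT
good_fw = hashlib.sha256(b"vendor firmware v1").digest()
bad_fw = hashlib.sha256(b"tampered firmware").digest()
nonce = b"\x01" * 16

k_good = derive_session_key(root_key, good_fw, nonce)
k_bad = derive_session_key(root_key, bad_fw, nonce)
```

Under this assumption, an NPU booted with a modified firmware image cannot reproduce the key that a legitimate device would derive, which is the property the set-up phase relies on.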

## 5.1.1 AI CPU-based custom operator

We use NPU AI-CPU cores' ARM AES intrinsic and SIMD (NEON) instructions to accelerate AES-GCM without requiring additional hardware. AES-GCM is implemented as an AI CPU operator and can be triggered before (for data and model decryption) and after (for result encryption) the model execution. The operators (AI CPU or AI Core) are typically part of the model layer operator; therefore, this eliminates any need for specialized hardware for encryption. This operator is part of the NPU firmware and is verified by the measured boot during NPU initialization. All the cryptographic operations are *in-place* and do not need additional memory.

The parallelized AES-GCM operator decrypts multiple batches of input data on multiple AI CPU cores to hide the decryption and verification latency. Fig. 5 shows parallel AES-GCM operation on the model, data, and results running on the AI CPUs, while the AI Cores execute the actual AI operations related to the model layers during an inference pass. Typically, for an LLM, the computation is bound by the inference latency (a few milliseconds) rather than by the AES-GCM operations on the data (typically on the order of $100\,\mu\text{s}$). Therefore, the latency is only visible for the first inference, as the model and the first data batch must be decrypted. The decrypted model will already be in the NPU memory for subsequent inference passes.

Figure 5: Parallel cryptographic operation on model and data to hide the latency introduced by the AES-GCM operator running on the AI-CPU cores. The AI core executes the AI-related operations, such as the layer computation during an inference pass.
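A back-of-the-envelope model makes the latency-hiding argument concrete. The sketch below assumes 5 ms per inference pass and 100 µs of AES-GCM work per batch (illustrative values in the ranges quoted above, not measurements) plus a hypothetical one-off model decryption time. It ignores result encryption for simplicity, and both function names are illustrative.

```python
def serial_latency_us(n_batches, t_infer_us, t_crypto_us, t_model_us):
    # No overlap: decrypt the model once, then decrypt and infer each batch in turn.
    return t_model_us + n_batches * (t_crypto_us + t_infer_us)

def pipelined_latency_us(n_batches, t_infer_us, t_crypto_us, t_model_us):
    # Overlap: batch i+1 is decrypted on the AI CPUs while the AI Cores run
    # batch i, so every batch after the first costs max(infer, crypto).
    if n_batches == 0:
        return t_model_us
    first = t_model_us + t_crypto_us + t_infer_us
    return first + (n_batches - 1) * max(t_infer_us, t_crypto_us)

n, t_infer, t_crypto, t_model = 100, 5000, 100, 200000  # all times in microseconds
serial = serial_latency_us(n, t_infer, t_crypto, t_model)
pipelined = pipelined_latency_us(n, t_infer, t_crypto, t_model)
hidden = serial - pipelined
```

With these assumed numbers the pipelined schedule takes 700,100 µs against 710,000 µs for the serial one, so only the first batch's 100 µs of decryption (and the one-off model decryption) remains visible.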

#### 5.1.2 Executing AI CPU operator with model

**Preparing the model.** The process starts with the model provider preparing the model in a trusted clean room environment. Typically, the compiled model file (also known as the frozen model for the inference) contains the layer information, parameters, and operator binaries. The model provider encrypts the weights (a list of tensors) and the individual operator binaries with the secret key shared between the model provider and NPU beforehand. Each binary contains a header, symbol table, and compiled instructions for the NPU AI core. The model also contains a list of binary sizes for the NPU runtime to use when parsing the model file. We modify the sizes to account for the encrypted binary size (added padding) and the 16 bytes added as the message authentication code (MAC).
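The size adjustment can be written as a one-line rule. The sketch below assumes that each binary is padded to a multiple of 16 bytes (the AES block size) before the 16-byte MAC is appended; the padding granularity is an assumption, as the text only states that padding is added, and the example sizes are made up.

```python
BLOCK = 16    # assumed padding granularity in bytes (AES block size)
MAC_LEN = 16  # MAC length in bytes, as stated in the text

def encrypted_size(plain_size: int) -> int:
    """Size of a binary after padding to BLOCK and appending the MAC."""
    padded = -(-plain_size // BLOCK) * BLOCK  # round up to a multiple of BLOCK
    return padded + MAC_LEN

def rewrite_size_list(sizes):
    """Rewrite the model file's list of binary sizes for the encrypted layout."""
    return [encrypted_size(s) for s in sizes]

original_sizes = [4096, 1000, 17]  # hypothetical operator binary sizes
new_sizes = rewrite_size_list(original_sizes)
```

For the hypothetical sizes above, the rewritten list is `[4112, 1024, 48]`: an already aligned binary grows by exactly the MAC length, while the others also absorb the padding.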

The model file contains the layer information corresponding to the tasks linked to operator binaries, such as a layer named te\_relu\_1\_1, which denotes the first ReLU activation function. Typically, a task name corresponds to an actual operation. However, this is not necessary for the model to function correctly. Therefore, the model provider also replaces the task names with randomized strings to prevent the underlying operator from being exposed.
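A minimal sketch of the renaming step follows. Only `te_relu_1_1` comes from the text; the other layer names, the `t_` prefix, and the helper `randomize_task_names` are hypothetical.

```python
import secrets

def randomize_task_names(task_names):
    """Map each descriptive task name to a fresh random identifier."""
    mapping = {}
    for name in task_names:
        # 8 random bytes rendered as 16 hex characters, unrelated to the operator.
        mapping[name] = "t_" + secrets.token_hex(8)
    return mapping

layers = ["te_relu_1_1", "te_matmul_2_1", "te_softmax_3_1"]
mapping = randomize_task_names(layers)
obfuscated = [mapping[name] for name in layers]
```

The model provider would keep `mapping` private (for debugging) and ship only the obfuscated names, so the host learns the number of tasks but not which operator each one implements.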

The encrypted model file is then transferred to the untrusted host, where the AI software stack (e.g., PyTorch) is executed. The data provider sends its data to the untrusted host, encrypted with the secret key shared between the data provider and NPU. The host copies the encrypted model and encrypted data to the NPU. If the model and data provider are the same party, the model and data are encrypted with the same key. After this, the host calls the executeModel API to start the inference passes.

**Model execution on the NPU.** The executeModel API call from the NPU runtime signals the NPU task scheduler to start executing the AI model tasks. The NPU task scheduler ensures that the model and data are decrypted right after the executeModel API is called. After one inference pass, i.e., a forward pass through all the layers in the model, the model outputs a vector known as the logits. A normalization operation (such as softmax) on the logits produces the probability values of inference classes. The AES-GCM operator encrypts the logits before they are copied back to the host.

# 5.2 Enforcing Memory Lock Invariants

Running the data and model decryption (refer to Sec. 5.1) right after the executeModel API call from the host is insecure, as it would give the host full access to the decrypted model and data. Therefore, the host's access to these NPU memory regions must be revoked first. This brings us to Requirement 2 and Requirement 3, which are related to critical memory invariants to ensure that the data, model, and execution are isolated from the attacker-controlled host. We design a memory access control primitive leveraging the SMMU (similar to ARM's SMMU [44]) on the NPU. All the PCIe transfers to and from the NPU memory go through the SMMU on the NPU. Therefore, we enforce access control on the SMMU by reprogramming the NPU control CPU with exclusive access to the NPU-SMMU. However, there are two challenges that we need to solve to ensure secure access control.

- (1) We must ensure that the sequence of events (loading the model to NPU memory $\rightarrow$ locking the NPU memory where the model is loaded $\rightarrow$ decrypting the model) is atomic, i.e., the host software stack cannot interrupt the NPU.
- (2) Once the model and data are decrypted, the corresponding virtual memory spaces cannot be unlocked without either re-encryption or resetting the memory content.

To solve the first challenge, we ensure that the AES-GCM AI CPU operator for decrypting the model and data can only be scheduled after the NPU memory manager successfully unmaps the model, data, and workspace memory. Unmapping the memory is achieved using the dma\_unmap\_page API call, removing all the memory mapping from the SMMU. By doing so, the NPU memory manager ensures that any memory access from the host is blocked. Any interrupt from the host at this point will prevent the NPU memory manager from sending an acknowledgment signal to the NPU task scheduler. Under normal circumstances, after receiving the acknowledgment signal, the NPU task scheduler schedules the AES-GCM operator to decrypt and verify the model weights, binaries, and data.
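The ordering constraint described above (unmap, acknowledge, then decrypt) can be sketched as a small simulation. The classes below are illustrative stand-ins, not the firmware's actual interfaces; `decrypted = True` stands in for scheduling the AES-GCM operator.

```python
class MemoryManager:
    def __init__(self):
        # Regions that are currently visible to the host through the SMMU.
        self.host_mapped = {"model", "data", "workspace"}

    def unmap_all(self, interrupted=False):
        """Return True (the acknowledgment) only if every region was unmapped."""
        if interrupted:  # a host interrupt aborts the unmap, so no ack is sent
            return False
        self.host_mapped.clear()
        return True

class TaskScheduler:
    def __init__(self, memory_manager):
        self.mm = memory_manager
        self.decrypted = False

    def execute_model(self, interrupted=False):
        ack = self.mm.unmap_all(interrupted)
        # Decryption is gated on the acknowledgment and on an empty mapping.
        if not ack or self.mm.host_mapped:
            raise RuntimeError("abort: memory not locked, refusing to decrypt")
        self.decrypted = True  # stand-in for scheduling the AES-GCM operator

mm = MemoryManager()
ts = TaskScheduler(mm)
ts.execute_model()  # normal path: unmap succeeds, decryption is scheduled
```

Calling `execute_model(interrupted=True)` on a fresh pair raises before `decrypted` is ever set, which mirrors the fail-closed behavior: an interrupted lock leaves the host with access to ciphertext only.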

We address the second challenge by introducing a memory exclusivity invariant in the NPU memory manager. Inside the NPU memory manager, a data structure tracks all the allocated memory locations based on their DMA direction (DMA\_BIDIRECTIONAL, DMA\_TO\_DEVICE, or DMA\_FROM\_DEVICE). This way, we ensure that the host cannot copy out any data transferred from host to device. The only memory content that the host is allowed to copy out is the output from the model. After the model's execution, the NPU task scheduler schedules an AES-GCM encryption operator on the outgoing memory location. Upon completing the AES-GCM operation, the task scheduler signals the NPU memory manager to remap the memory using dma\_remap(location, size). The remap API adds an entry to the SMMU so the host can access and copy the encrypted result.
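One way to picture the exclusivity invariant is a table that records, per region, its DMA direction, whether it is host-mapped, and what kind of content it holds. The sketch below is a simplified model under stated assumptions: the `content` tag, the class names, and the policy for `DMA_BIDIRECTIONAL` regions (treated like device-to-host) are hypothetical.

```python
from enum import Enum

class Dir(Enum):
    TO_DEVICE = "DMA_TO_DEVICE"
    FROM_DEVICE = "DMA_FROM_DEVICE"
    BIDIRECTIONAL = "DMA_BIDIRECTIONAL"

class Region:
    def __init__(self, direction):
        self.direction = direction
        self.mapped = True           # host-visible through the SMMU
        self.content = "ciphertext"  # "ciphertext", "plaintext", or "zeroed"

class ExclusivityTable:
    """Hypothetical bookkeeping structure inside the NPU memory manager."""

    def __init__(self):
        self.regions = {}

    def alloc(self, name, direction):
        self.regions[name] = Region(direction)

    def lock(self, name):
        self.regions[name].mapped = False

    def remap(self, name):
        region = self.regions[name]
        # A region that still holds plaintext must never become host-visible.
        if region.content == "plaintext":
            raise PermissionError(f"remap of '{name}' rejected: plaintext content")
        region.mapped = True

    def host_copy_out(self, name):
        region = self.regions[name]
        # Only mapped regions not allocated as host-to-device may be read back.
        if not region.mapped or region.direction is Dir.TO_DEVICE:
            raise PermissionError(f"copy-out of '{name}' rejected")
        return True

table = ExclusivityTable()
table.alloc("input", Dir.TO_DEVICE)
table.alloc("output", Dir.FROM_DEVICE)
table.lock("input")
table.lock("output")
table.regions["input"].content = "plaintext"    # after AES-GCM decryption
table.regions["output"].content = "plaintext"   # after model execution
table.regions["output"].content = "ciphertext"  # after AES-GCM result encryption
table.remap("output")
result_readable = table.host_copy_out("output")
```

In this model the output becomes readable only after re-encryption, while the input region can be remapped for the next batch only once it is zeroed, and even then it can never be copied out.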

#### 5.3 Model and Task Attestation

Requirement 4 implies that the integrity of the model execution is critical to protect the integrity and confidentiality of both model and data. The model and task attestation mechanism ensures its correct execution. The untrusted NPU driver sends the model file (containing layer information, model parameters, and operator binaries) over DMA, and writes the task to the task queue. The task contains a pointer to the location of the binary (PC\_START attribute) of the specific operator (e.g., matrix multiplication) and the relevant data so that the NPU can execute the code with either the AI CPU or the AI Core. As the NPU driver runs on the attacker-controlled host, the host can always push arbitrary tasks and remove or reorder the tasks to compromise the integrity of the AI model. The attacker can also modify tasks (PC\_START) to point to a different binary and thus execute a different operator than intended, compromising the integrity of the model execution. To prevent such an attack, we design a model verification method shown in Fig. 6. We assume two keys, $\mathcal{K}_M$ and $\mathcal{K}_D$, shared between the model provider and the NPU and between the data provider and the NPU, respectively.

We continue with the matrix multiplication example depicted in Fig. 3 and Fig. 4 to describe our mechanism. Here, the model consists of three layers that execute three operators (memory copy from host to device, matrix multiplication, and memory copy from device to host). The sequence of the corresponding operators' binaries is $B_1 \rightarrow B_2 \rightarrow B_3$. Fig. 6 depicts the flow of our model and task attestation mechanism. The steps are the following:

- ① An AI model consists of layer information (L), model parameters (W), and operator binaries (B = $B_1, B_2, B_3$) to execute a matrix multiplication of two tensors of shape $\{2,2\}$. The model provider encrypts W and B and creates message authentication codes (MACs) for them, yielding $\mathcal{M} \leftarrow Enc_{\mathcal{K}_{M}}(Model)$. Alongside this, the model provider also generates the MAC of the sequence of binaries corresponding to the layers, $\mathcal{B} \leftarrow MAC_{\mathcal{K}_M}(B_1, B_2, B_3)$. Typically, the layer information contains the name of the layer (which the model provider can obfuscate) and the location of the encrypted binary relative to a fixed starting point in the model file so that the NPU runtime can create corresponding layer tasks later. The model provider sends $\mathcal{M}$ and $\mathcal{B}$ to the untrusted host (on the CSP) and $\mathcal{B}$ to the data provider.

- ② The host deploys the model to the NPU. The runtime determines the starting addresses of the binaries $(B_1, B_2, B_3)$, known as the PC, to map the binaries in memory based on the available space on the HBM. In our example, PC= $\{0x10, 0x20, 0x30\}$.



Figure 6: Protocol to ensure integrity and confidentiality of the AI model and integrity of model execution.

- ③ The driver copies the model file over to the NPU memory over DMA. As Fig. 6 shows, after the memory copy, the HBM contains the layer information, the model parameters, and the binaries. At this point, W and $\{B_1, B_2, B_3\}$ are encrypted with $\mathcal{K}_M$. At the same time, the NPU runtime creates a sequence of tasks $T_1, T_2, T_3$ from the layer information and uses PC to populate the PC\_START attribute.
- ④ The untrusted host sends the PC to the data provider.
- ⑤ In response, the data provider generates $P_1 \leftarrow MAC_{\mathcal{K}_D}(\mathbb{PC})$ and $P_2 \leftarrow MAC_{\mathcal{K}_D}(\mathcal{B})$, where $\mathbb{PC} \leftarrow \{0x10, 0x20, 0x30\}$ and $\mathcal{B} \leftarrow MAC_{\mathcal{K}_M}(B_1, B_2, B_3)$, and sends them to the host.
- ⑥ The driver sends the executeModel instruction to the task scheduler by writing a specific execution task (E in Fig. 6) on the task queue. $P_1$ is sent together with executeModel. This triggers the task scheduler to remove all the memory mapping from the NPU's SMMU. Therefore, the host can no longer read from or write to either the HBM or the task queue.
- ⑦ The task scheduler collects all the PC\_START attributes from the task queue. The task scheduler calculates $P_1' \leftarrow MAC_{\mathcal{K}_D}(\text{PC\_START})$ and checks if $P_1 = P_1'$. If they differ, the task scheduler aborts.
- ⑧ If the verification in step ⑦ is successful, the task scheduler invokes an AI CPU operator to inspect the integrity and sequence of binaries.
- ⑨ The host sends $P_1$, $P_2$, and PC to the AI CPU operator.
- ⑩ After receiving $P_1$, $P_2$, and PC, the AI CPU operator recomputes $P_1$ (with $\mathcal{K}_D$) from PC and checks if it matches the $P_1$ received from the untrusted host. Using PC, the AI CPU decrypts the binaries one by one, each time checking that the binaries match their associated MAC. Once the binaries are all decrypted, the AI CPU computes $\mathcal{B}' \leftarrow MAC_{\mathcal{K}_M}(B_1,B_2,B_3)$ and $P_2' \leftarrow MAC_{\mathcal{K}_D}(\mathcal{B}')$. It checks that $P_2' = P_2$. If all of these steps are successful, the AI CPU operator then decrypts W in place on the HBM. Note that removing all memory mappings in the NPU SMMU in step ⑥ prohibits the untrusted host from reading the decrypted W and B on the HBM or writing new content on the HBM.
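The MAC relationships in the protocol above can be exercised with a short simulation. The sketch uses HMAC-SHA256 as a stand-in for the MAC (the text does not fix the MAC construction here), omits the encryption and the per-binary MAC checks, and uses placeholder keys and binaries. The helper names are hypothetical.

```python
import hashlib
import hmac

def mac(key: bytes, *parts: bytes) -> bytes:
    """HMAC-SHA256 over length-prefixed parts (stand-in for the paper's MAC)."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()

def encode_pc(pcs):
    """Serialize the list of binary start addresses."""
    return b"".join(pc.to_bytes(8, "big") for pc in pcs)

K_M, K_D = b"model-provider-key", b"data-provider-key"  # placeholder session keys
binaries = [b"memcpy_h2d", b"matmul", b"memcpy_d2h"]     # B1, B2, B3
PC = [0x10, 0x20, 0x30]

B = mac(K_M, *binaries)       # model provider: MAC over the ordered binaries
P1 = mac(K_D, encode_pc(PC))  # data provider: MAC over the reported PC values
P2 = mac(K_D, B)              # data provider: MAC over B

def npu_attest(pc_start_on_queue, binaries_on_hbm, p1, p2):
    """NPU side: recompute both MACs from what is actually on the device."""
    p1_ok = hmac.compare_digest(p1, mac(K_D, encode_pc(pc_start_on_queue)))
    b_prime = mac(K_M, *binaries_on_hbm)
    p2_ok = hmac.compare_digest(p2, mac(K_D, b_prime))
    return p1_ok and p2_ok

honest = npu_attest(PC, binaries, P1, P2)
reordered = npu_attest(PC, [binaries[1], binaries[0], binaries[2]], P1, P2)
redirected = npu_attest([0x10, 0x30, 0x30], binaries, P1, P2)
```

In this sketch the honest run passes, whereas reordering the binaries breaks the $P_2$ check and redirecting a PC\_START value breaks the $P_1$ check, which are the two manipulations the attestation is meant to catch.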

## 5.4 Firmware and Runtime Integrity

The previously stated mechanisms are implemented in the NPU firmware, which is part of ASCEND-CC software TCB. Therefore, the trustworthiness of ASCEND-CC depends on the integrity and authenticity of the firmware. The NPU has a hardware root of trust, where the NPU vendor embeds a cryptographic key during the manufacturing process (e-fuse keys). The root key cannot be extracted from the NPU. It serves as the NPU's unforgeable identity, preventing the attacker from impersonating and emulating a legitimate NPU. Subsequent keys for key exchanges are derived from this root key and signed with the NPU vendor's key. The control CPU boots and sets up the NPU, including the PCI drivers, task scheduler, NPU memory manager, AI CPU, and AI Cores. During the boot, the control CPU verifies whether the firmware image was signed with the manufacturer's root key. This prevents the attacker from flashing an unsigned firmware image to the NPU. We assume the cloud service provider has a public key infrastructure to ensure that the model and data provider can execute authenticated Diffie-Hellman key exchange with the NPU to derive shared secrets, encrypt and authenticate the AI model and data, and verify the legitimacy of the NPU.

The NPU control CPU intercepts all the command messages from the host-side runtime. ASCEND-CC blocks all debugging commands (e.g., memory inspection, profiling operators) from the control CPU to ensure the attacker has no additional communication channel with the NPU.
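A command filter of this kind can be sketched as a default-deny allowlist. The command names below are hypothetical, and the default-deny policy is an assumed design choice rather than something the text specifies.

```python
# Hypothetical command identifiers; the real host-to-NPU message set is not
# described in the text.
ALLOWED = {"LOAD_MODEL", "MEMCPY", "EXECUTE_MODEL"}
BLOCKED = {"MEM_INSPECT", "MEM_SNAPSHOT", "PROFILE_OPERATOR"}

def forward_command(cmd: str) -> bool:
    """Return True if the control CPU should forward the command."""
    # Default-deny: anything not explicitly allowed is dropped, which also
    # covers debug commands that are missing from BLOCKED.
    return cmd in ALLOWED and cmd not in BLOCKED

decisions = {cmd: forward_command(cmd)
             for cmd in ["EXECUTE_MODEL", "MEM_INSPECT", "UNKNOWN_DEBUG_HOOK"]}
```

Filtering by allowlist rather than blocklist means a debugging command that the filter's author did not anticipate is still rejected.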

#### 6 ASCEND-CC

Based on the fundamental design building blocks discussed in Sec. 5, we briefly describe the ASCEND-CC end-to-end system. ASCEND-CC works seamlessly with torch\_npu [36], which is an NPU-specific adapter for PyTorch.

**Initial setup.** The model and data providers execute an authenticated Diffie-Hellman key exchange (DHKE) with the NPU over the untrusted host and obtain $\mathcal{K}_M$ and $\mathcal{K}_D$, respectively. The NPU stores these two keys on the on-chip HSM. The data provider prepares encrypted data with the shared secret $\mathcal{K}_D$. Assume the data provider interacts with the remote LLM application through a browser. A browser extension loaded with $\mathcal{K}_D$ can tokenize the data provider's input (D), encrypt it ($\mathbb{D} \leftarrow Enc_{\mathcal{K}_D}(D)$), and send the encrypted tokens to the cloud provider. Similarly, the model provider encrypts the model binaries and parameters and generates the encrypted model $\mathbb{M} \leftarrow Enc_{\mathcal{K}_M}(M)$. The model provider also generates the signed sequence of model binaries for model attestation as described in Sec. 5.3.

Figure 7: ASCEND-CC end-to-end system with internal confidential computing mechanisms and the corresponding PyTorch interfaces.

**ASCEND-CC: End-to-end system.** Fig. 7 shows the end-to-end ASCEND-CC system along with the sequence of operations (step ① to step ⑩). The figure also shows where the individual steps take place. The highlighted steps are security-sensitive steps that ASCEND-CC adds to the NPU firmware to enable confidential computing. A brief summary of the steps follows:

- ① The host calls loadModel to send the encrypted model $\mathbb{M}$ to the NPU.
- ② Then memcpy transfers the encrypted data $\mathbb{D}$. At this point, the host calls the executeModel API, which instructs the NPU to start executing the AI model with the data. In ASCEND-CC, executeModel triggers additional steps to ensure that the model and data are inaccessible from the attacker-controlled host after the decryption and that the model tasks are not manipulated.
- ③ $lock(\mathbb{M}, \mathbb{D})$: The NPU task scheduler intercepts the executeModel call from the host and instructs the NPU memory manager to remove the model, workspace, and data regions, located on the NPU HBM, from the SMMU mapping. Therefore, the model, data, and workspace on the NPU are no longer accessible from the host. This mechanism is described in Sec. 5.2.
- ④ $attestModel(\mathbb{M})$: The NPU task scheduler verifies the tasks and model binaries. This is a multi-step protocol that involves a dedicated AI CPU operator. The model and task attestation mechanism is described in Sec. 5.3.
- ⑤ Once the memory regions are locked, the task scheduler invokes an AI CPU operator to decrypt the model and data (refer to Sec. 5.1) in place.
- ⑥ The AI model is executed on the AI cores and AI CPUs based on the operators in the model layers. The output of the model is R. The memory region(s) where R resides are not DMA-mapped to the host.
- ⑦ The NPU task scheduler invokes the AI CPU crypto operator (Sec. 5.1) to encrypt the model output with the data provider's key: $\mathbb{R} \leftarrow Enc_{\mathcal{K}_D}(R)$.
- ⑧ The memcpy API from the host copies the encrypted output $\mathbb{R}$ from the NPU memory to the host memory.
- ⑨ If the host runs another inference pass, the task scheduler invokes a specialized AI CPU operator to zero out the previous input data. If the host triggers an end of the session, the task scheduler cleans up both the model and data.
- ⑩ The task scheduler invokes the NPU memory manager to remap the memory (either D or both D and M) before the start of the next inference pass or the end of the session.

Figure 8: ASCEND-CC memory and execution lifecycle.

**Programming interface.** ASCEND-CC modifications in the NPU firmware and driver, such as the model and task attestation, cryptographic operators, and memory invariant enforcement, are completely transparent to higher-level software such as the AI software stack (PyTorch). Therefore, from the developer's perspective, the existing inference (and even training) workflow remains unchanged. ASCEND-CC simply acts as a drop-in replacement for AI developers. There are minimal changes within the PyTorch adapter for the Ascend NPU (i.e., torch\_npu) to support the collection of model attestations. Fig. 7 shows the matrix multiplication example (from Fig. 3) with all the ASCEND-CC hardening.

**ASCEND-CC lifecycle.** We summarize the memory lifecycle of ASCEND-CC in Fig. 8 by providing the memory states over subsequent rounds of inference passes. Typically, the model is loaded once, followed by multiple inference passes (such as a conversation with an LLM chatbot). Therefore, the memory region associated with the model (parameters, binary, and model operator workspace) must be locked and decrypted only once. Then, it remains locked for the duration of the runtime until the model is unloaded or an interrupt occurs. The data, on the other hand, arrives in a streaming fashion. The data region is locked, decrypted, and fed to the model execution. Once the output is generated, it is encrypted with the data owner's key, and the corresponding memory region is unlocked for device-to-host DMA transfers. At the same time, the input regions are reset (by overwriting them with zeros using a dedicated AI CPU operator), and the corresponding memory regions are unlocked. Consequently, the host can transfer the next batch of encrypted input data to the NPU.

# 7 Security Analysis

In this section, we provide an informal security analysis of ASCEND-CC and show how it ensures the security of AI models and data from the untrusted host and cloud provider.

**Attacks from a malicious host and cloud provider.** The host runs the operating system/hypervisor, the NPU driver, and the AI software stack and has full access to the NPU. The host needs to have initial access to allocate and copy over memory for the data, model, and operator binaries. Once the host calls the executeModel API, the NPU task scheduler unmaps the NPU memory. Therefore, the host cannot access the encrypted model and data anymore. The host can interrupt the NPU between the memory locking and the model and data decryption to disrupt the DMA region locking. However, before the NPU task scheduler schedules the AI CPU operator to decrypt the model and data, a successful confirmation from the NPU's memory manager is required to ensure that the memory is not accessible from the host. Similarly, the host can interrupt the NPU between the result encryption and memory unlocking to prevent the result encryption from occurring. However, the NPU task scheduler only triggers the NPU memory manager to unlock the DMA memory after successfully executing the AI CPU operation. The task scheduler obtains this confirmation from its completion queue (CQ), which indicates whether an operator has completed successfully or failed. This atomic property prevents the host from accessing the plaintext model and data on the NPU memory. The host cannot issue a malicious DMA to modify the model and data undetected, as they carry the MACs from the model and data provider. The host cannot manipulate the tasks after executeModel is called, as the memory is locked. The NPU control CPU has disabled all the debugging and performance monitoring interfaces to prevent the host from having any additional communication channel with the NPU.

As all data entering and leaving the NPU is encrypted and authenticated, a malicious cloud provider with physical access cannot compromise the device's security. Note that denial of service (DoS) is always possible and out of scope for ASCEND-CC. The malicious cloud provider can flash an NPU card with an older or compromised firmware. However, as described in Sec. 5.4, the NPU collects the firmware measurement during the measured boot. This signed information will be passed on to the model and data provider during the key establishment. The model and data provider can detect if the NPU runs an older or compromised firmware version. If the attacker tries to boot an emulated NPU, it will fail, as it does not have the private NPU key.

**Attacks from a malicious model provider.** The motive of a malicious model provider is to steal data from the data provider. A malicious model provider can manipulate the AI operator code to execute unintended operations. For example, it can copy part of the data to the model output and retrieve it later. We ensure all output data is encrypted with the data provider's key. Therefore, only the data provider can decrypt the output from the model.

**Attacks from a malicious data provider.** A malicious data provider tries to steal model parameters or infer operator code, known as a model stealing (MS) attack. The data provider also tries to infer the training data by sending many queries to the model, known as a membership inference attack (MIA). Typically, in any confidential computing solution, MS and MIA are orthogonal problems, so they are considered out of scope. The model provider can deploy additional measures in the operators to add noise to the input or reject inference after a given number of queries. The model provider ensures that these security mechanisms are in place via the model and task attestation (refer to Sec. 5.3). This makes both MIA and MS prohibitively expensive for the data provider.

# 8 Implementation and Evaluation

## 8.1 ASCEND-CC Implementation

We implement all our confidential computing primitives into Ascend 910A NPU's software stack, which involves the driver, firmware, runtime, and Ascend PyTorch adapter written in C++. All encryption and decryption are executed with AES-GCM-128 based on the AArch64cryptolib [45] library that uses ARM's hardware cryptographic intrinsic for fast performance. For LLMs, we use both the Ascend native execution (acl runtime) and the PyTorch adapter to execute the models. For GPT-Neo, we use the ImageNet dataset, and for transformers, we use the squad\_v2 dataset [46].

**Model provider.** We emulate the model provider by implementing it as a TCP server that serves authenticated-encrypted compiled models with a shared secret negotiated previously with the NPU. The MAC of each set of parameters and operator instructions is appended after the corresponding encrypted blob, increasing the size of each binary by 16 bytes.

**Host runtime.** We modify the NPU runtime (CANN) [47] to implement the model and task attestation described in Sec. 5.3. During the loadModel API call, we extract the tasks and the associated PC\_START attribute pointing to the operator binary's memory location. The modified runtime expects an additional signed binary sequence from the model file, establishing the model's ground truth and task attributes. The modified runtime communicates with the data provider over a TCP socket to retrieve the signed PC\_START sequence.

**Task scheduler.** The modified task scheduler (TS) firmware enforces the memory invariant and attestation. The TS firmware records all the tasks submitted since the last execution completion. Upon receiving the `executeModel` command, the TS firmware computes the signature of the `PC_START` sequence using $\mathcal{K}_D$, as described in Sec. 8.1. It then verifies that the signature matches the one generated by the data provider and sent with the `executeModel` command. If all the steps above are successful, the TS firmware relays the `executeModel` command to the TS hardware.
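The check performed by the TS firmware can be sketched as follows. This is an illustration under stated assumptions, not the firmware's actual code: it assumes the signature is a keyed MAC (HMAC-SHA256 here), that each PC\_START value is serialized as a 64-bit little-endian integer, and that the log is reset after every attempt. The names `sign_pc_sequence` and `TaskRecorder` are hypothetical; the source does not specify any of these details.

```python
import hashlib
import hmac
import struct


def sign_pc_sequence(key: bytes, pc_starts: list[int]) -> bytes:
    """Keyed MAC over the ordered PC_START values (assumed encoding: u64, little-endian)."""
    payload = b"".join(struct.pack("<Q", pc) for pc in pc_starts)
    return hmac.new(key, payload, hashlib.sha256).digest()


class TaskRecorder:
    """Toy stand-in for the TS firmware's task log (hypothetical)."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._pending: list[int] = []

    def submit(self, pc_start: int) -> None:
        # Record every task submitted since the last execution attempt.
        self._pending.append(pc_start)

    def execute_model(self, expected_sig: bytes) -> bool:
        # Recompute the MAC over what was actually submitted and
        # compare it in constant time with the data provider's value.
        actual = sign_pc_sequence(self._key, self._pending)
        ok = hmac.compare_digest(actual, expected_sig)
        self._pending.clear()  # start a fresh log for the next execution
        return ok
```

Because the MAC covers the ordered sequence, a host that reorders, drops, or injects tasks produces a different value and the command is not relayed.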

**AI CPU operators.** We use the Huawei Ascend NPU AI-CPU operator development environment [48], which is part of the Ascend AI development software stack known as CANN [47], to implement the custom AI CPU operators. The custom operators are responsible for AES-GCM decryption and encryption (refer to Sec. 5.1) of the model, data, and results, and for carrying out model and task attestation (refer to Sec. 5.3). The AI CPU operators are executed whenever they are called from a compiled model graph. They can also be called from native C++ kernels or from PyTorch. All the AI CPU operators are implemented as internal operators, i.e., the compiled binary of the operator remains inside the trusted NPU firmware. Therefore, a user cannot modify the operator code.

**Memory lock.** We use the dma\_map\_page and dma\_unmap\_page functions to isolate the device from the host before starting the execution (Sec. 5.2). The host only accesses mapped HBM regions on the device via DMA using shared virtual memory (SVM). Removing the mapping of the corresponding DMA addresses prevents unauthorized host access. The NPU task scheduler sends a synchronized message (implemented over shared memory that is not accessible to the host) to the memory manager driver running on the NPU control CPU, which then uses dma\_unmap\_page to unmap the entire mapped region. Only after the unmapping does the task scheduler start the model and task attestation (Sec. 5.3). If the unmapping fails for any reason, the device aborts the execution. The task scheduler stores the input address and length to clear and remap the memory region for the next inference. For the remapping, the encryption operator notifies the task scheduler once it has encrypted the output, which in turn signals to the SVM driver that the respective regions can be remapped so that the output can be read (and the next input can be loaded).
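The ordering that the memory lock imposes can be summarized in a short sketch. It only models the control flow described above (unmap, then attest, then execute, then clear and remap); the callables and the `ExecutionAborted` exception are hypothetical placeholders, and treating a failed attestation as an abort is an assumption of the sketch.

```python
from typing import Callable


class ExecutionAborted(Exception):
    """Raised when a precondition for confidential execution does not hold."""


def locked_inference(
    unmap_from_host: Callable[[], bool],
    attest: Callable[[], bool],
    execute: Callable[[], None],
    clear_and_remap: Callable[[], None],
) -> None:
    # 1. Isolate the device memory from the host before anything else.
    if not unmap_from_host():
        raise ExecutionAborted("unmap failed")
    # 2. Only after the unmapping, run model and task attestation.
    if not attest():
        raise ExecutionAborted("attestation failed")
    # 3. Run the model; the output is encrypted before the host can see it.
    execute()
    # 4. Clear the previous input and hand the regions back to the host.
    clear_and_remap()
```

The point of the sketch is that no step after the unmap can run while the host still has a DMA mapping, and that the remap happens only once execution (including output encryption) has finished.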

#### **8.2** LLM Evaluations

We chose two different types of LLM workloads based on their parameter size: GPT-Neo, a small model with 125 million parameters, and large models, Llama2 (7 and 13 billion) and Llama3 (8 billion). Typically, in large models, the inference time is higher (seconds). Therefore, the relative overhead of ASCEND-CC is small. However, the inference time is within a second in smaller models like GPT-Neo. Such a fast inference speed magnifies the overhead that the different ASCEND-CC modules introduce. First, we start with microbenchmarks, where we evaluate different parts of ASCEND-CC, and after that, we present end-to-end LLM overheads.

Figure 9: AES-GCM-128 operator latency on the Ascend 910A AI-CPU cores with auto-tiling for concurrency.

Figure 10: Time (overhead) to map and unmap the Ascend 910A VAs from host VAs.

#### 8.2.1 Micro benchmarks

The AES-GCM AI-CPU operator affects both the setup and inference time. Fig. 9 shows the AES-GCM-128 performance on all four AI-CPU cores of the Ascend 910A NPU. We use the NPU's automatic scheduling strategy to maximize parallelization. The four AI-CPU cores have a shared 1KB L2 cache. Therefore, we achieve a maximum throughput of 6.1 GB/s when the data size is exactly 1KB and is spread evenly across all 4 cores (i.e., 256B for each core). For data sizes larger than 1KB, the cores experience cache contention and converge to single-core performance, around 1.6 GB/s. Therefore, in our AI-CPU operator, we always operate on chunks of at most 1KB (with interleaved DMA transfer) to ensure maximum throughput.
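The chunking rule above can be illustrated with a small partitioning helper. The sketch only computes which byte ranges each core would handle for a buffer of a given length; it does not perform AES-GCM, and how nonces and tags are managed per chunk is not specified in the source. The function name `plan_chunks` and the constant names are hypothetical.

```python
CHUNK_SIZE = 1024  # maximum chunk size reported in the text (1KB)
NUM_CORES = 4      # AI-CPU cores on the Ascend 910A


def plan_chunks(total_len: int) -> list[list[tuple[int, int]]]:
    """Return, for each chunk, the (offset, length) slice assigned to each core."""
    plan = []
    for start in range(0, total_len, CHUNK_SIZE):
        chunk_len = min(CHUNK_SIZE, total_len - start)
        # Ceil-divide so that a short final chunk is still fully covered.
        per_core = -(-chunk_len // NUM_CORES)
        slices = []
        for core in range(NUM_CORES):
            off = core * per_core
            length = max(0, min(per_core, chunk_len - off))
            if length:
                slices.append((start + off, length))
        plan.append(slices)
    return plan
```

For a full 1KB chunk this yields four 256B slices, matching the even split described in the text; only the last chunk of a buffer can be smaller.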

Fig. 10 shows the latency of the map (dma\_map\_pages) and unmap (dma\_unmap\_pages) calls that the NPU memory manager issues to remap or unmap a DMA memory region for the host. The map and unmap operations work at the granularity of 4KB pages, with a latency of 2.47 $\mu$s. However, most unmap operations are done during the model setup to unmap the model and the model workspace from the host. In the subsequent inference passes, ASCEND-CC only needs to unmap the input data before decryption, map the encrypted result, and remap the input region after resetting the previous input.
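As a back-of-the-envelope aid, the reported numbers can be turned into a rough cost estimate for a region of a given size. The sketch assumes that the 2.47 $\mu$s latency applies per 4KB page and that the cost scales linearly with the page count; neither assumption is stated in the source, and the function name `map_cost_us` is hypothetical.

```python
PAGE_SIZE = 4096         # 4KB pages, as stated in the text
COST_PER_PAGE_US = 2.47  # reported latency, ASSUMED here to apply per page


def map_cost_us(region_bytes: int) -> float:
    """Estimated time in microseconds to map or unmap a region, under a linear model."""
    pages = (region_bytes + PAGE_SIZE - 1) // PAGE_SIZE  # round up to whole pages
    return pages * COST_PER_PAGE_US
```

Under these assumptions, a 1 MiB input region (256 pages) would cost roughly 632 $\mu$s to unmap. This is an estimate from the linear model, not a measurement from the evaluation.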

#### 8.2.2 LLM Inference and Model Setup Evaluation

We evaluate the ASCEND-CC setup time and runtime overhead. The setup time is a one-time cost that occurs when the model is loaded on the NPU and prepared for inference passes. Runtime overhead denotes the added latency for token generation during inference time. Typically, the user only notices the runtime overhead while interacting with an LLM application, such as a chatbot or a code completion assistant.

Figure 11: GPT-Neo-125M load time overhead. The solid color box indicates the time to execute loadModel, i.e., the DMA transfer time of the model file. The hatched boxes show additional setup time for ASCEND-CC.

Figure 12: Llama models load time overheads (%) in ASCEND-CC.

**Setup overhead.** We provide setup costs for two scenarios: a small model such as GPT-Neo with 125 million parameters and large models such as Llama-2 and Llama-3 with 7, 8, and 13 billion parameters. As expected, the smaller model has significantly lower loading times (e.g., 0.25 seconds in GPT-Neo vs. 26 seconds in Llama-3 8B) than the larger models. However, due to the shorter loading time, the pay-per-inference has a noticeable effect on the small model. At the same time, we did not observe any measurable difference in the Llama-2 and Llama-3 variants. In GPT-Neo, the setup time overhead is between 50% and 71%, as seen in Fig. 11. A similar trend can also be seen in the Llama2 and Llama3 evaluations depicted in Fig. 12, where we evaluate the setup time of 7 LLMs. Typically, the setup time reflects the size of the AI model, which is proportional to the model parameter count. We also observe that Llama-3-8B-Base has the highest load time in vanilla, as the model is encoded with the bfloat16 data type. The Ascend 910A NPU does not natively support bfloat16. Hence, the data type must be converted from bfloat16 to standard float16, which introduces additional latency in loading. The rest of the models are encoded in native float16 and do not require additional typecasting before loading the model. Therefore, their loading times are proportional to the model's parameter count.

Figure 13: Inference time overhead (% on the bars) of GPT-Neo-125M with different input sequence sizes.

**Inference overhead.** Fig. 13 and Fig. 14 show the runtime overhead of ASCEND-CC with GPT-Neo 125M and the Llama models, respectively. In the case of GPT-Neo, the overhead is significantly higher compared to the Llama models, as the model is much smaller (125M vs. 7/8/13B) and the inference latency is therefore very small. With very short input lengths such as 50 and 100 tokens, we experience 16.03% and 13.74% overhead, respectively. However, a small input length (typically less than 512) is impractical in real-world LLM applications. Larger sequence lengths result in higher inference latency. For a 2K context size, ASCEND-CC only introduces 0.91% overhead.

ASCEND-CC's overhead reduction in LLMs with a higher parameter count is apparent in models such as Llama-2, Llama-3, and CodeLlama, along with their variants (chat, instruct, and Q&A). These results are depicted in Fig. 14 for different context sizes. Note that we only show up to a 2K context size for the 13B variant of Llama-2 as larger input sequence sizes in these two models caused out-of-memory errors. We observed a reduction of ASCEND-CC overhead with larger input sequence sizes. This is expected as the inference latency with increasing sequence sizes is dominated by the model computation rather than the cryptographic operations. We generally observe less than 0.1% overhead in all Llama variants across all input sequence sizes. Note that in all of these experiments, we use a batch size of one. With larger batch sizes, the overhead increases by a small fraction.
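The trend can be made explicit with a simple model. The notation is introduced here for illustration only and is not taken from the evaluation: let $T_{v}(n)$ be the vanilla inference latency for an input sequence of length $n$, and let $T_{c}(n)$ be the additional time spent in ASCEND-CC (cryptographic operators and map/unmap calls). Assuming the reported percentage is the relative increase in latency, it can be written as

$$
\text{overhead}(n) \;=\; \frac{\bigl(T_{v}(n) + T_{c}(n)\bigr) - T_{v}(n)}{T_{v}(n)} \times 100\% \;=\; \frac{T_{c}(n)}{T_{v}(n)} \times 100\% .
$$

Whenever $T_{v}(n)$ grows faster than $T_{c}(n)$ as $n$ or the parameter count increases, the ratio shrinks, which is consistent with the lower percentages observed for longer sequences and larger models.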

### 9 Related Work

Several works aim to secure sensitive data from the model providers by running sensitive computation inside a CPU TEE (e.g., Intel SGX). While some methods run the entire model on the CPU TEE [49], this undermines the advantages of specialized ML task accelerators. Therefore, other approaches try to partially mitigate this shortcoming by only running selected parts of the workload inside the CPU-TEE while utilizing accelerators for more intensive tasks [50–52]. These approaches, though, still have a significant overhead and are vulnerable to privacy-stealing attacks [53].



Figure 14: Llama-2 inference overhead in different input sequence sizes. We show a single inference cost for both vanilla and ASCEND-CC. The number on the bar indicates the overhead of ASCEND-CC in percentage (%).

In parallel (and as detailed in Sec. 3), there has been a long line of work aimed at directly extending the CPU-TEE or C-VMs to specific accelerator devices [16, 18, 23, 54, 55].

Closest to this paper, works like Graphcore [19] and SheF [21] move the trust entirely from the host to the device used to run the workload. Similarly, GuardNN [56] removes the trust from the host by redesigning an FPGA as an entirely new secure accelerator for ML tasks. On a larger scale, a few approaches [39, 57] try to extend the confidential computing paradigm to data-center architectures, allowing many devices to be split across users. Complementary to the aforementioned techniques, several works focus purely on improving memory isolation through improved capabilities [58, 59], or on enhancing I/O isolation mechanisms for confidential computing [60], which often directly benefits TEE-device architectures.

In contrast, some proposals aim to preserve privacy algorithmically either via accelerator-enabled secure multiparty computation (MPC) using secret sharing [61, 62] or homomorphic encryption [63,64]. Despite significant progress in these fields, both solutions still incur significant overhead when applied to many practical, real-world workloads [65], making them impractical for our setting.

While using a TPM for remote attestation has been the most common approach, there have been a few examples of software-based attestation [66, 67], mainly for IoT devices. Upcoming PCIe features enable mechanisms to connect CPU-TEEs with DSA-TEEs. Specifically, the TEE Device Interface Security Protocol (TDISP) for PCIe-6 enables TEEs on processors to connect to a TEE-enabled PCIe accelerator [68]. Integrity and Data Encryption (IDE) on PCIe-5 encrypts and integrity-protects PCIe traffic between processors and devices [69]. The adoption of these PCIe extensions as TDX-Connect for Intel TDX, SEV-TIO for AMD SEV-SNP, Device Attach (DA) for Arm CCA, or IOPMP for RISC-V allows these TEE-enabled devices to benefit from secure direct access to the TEE memory on processors [70–73].

While side channels are out of scope, some works have proposed solutions to secure workloads against such attacks when using a TEE. For instance, Telekine [74] secures workloads for such TEEs as presented in Graviton [17].

### 10 Conclusion

We present ASCEND-CC, a system to enable confidential computing for large language models that ensures security against a strong attacker model, including the host CPU. ASCEND-CC provides memory invariants and cryptographic operators leveraging the heterogeneous architecture of NPUs. We implement ASCEND-CC on a Huawei Ascend 910A NPU and evaluate it on state-of-the-art generative AI workloads. Our evaluation shows that ASCEND-CC is practical and protects the model and data from the untrusted host.

## References

- [1] OpenAI. ChatGPT-OpenAI. [Accessed 12-06-2024].
- [2] OpenAI. DALL-E-2 OpenAI. [Accessed 12-06-2024].
- [3] OpenAI. Sora OpenAI. [Accessed 12-06-2024].
- [4] Microsoft. GitHub Copilot overview code.visualstudio.com. [Accessed 12-06-2024].
- [5] Chaoning Zhang, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Kumar Dam, Mengchun Zhang, Jung Uk Kim, Seong Tae Kim, Jinwoo Choi, et al. One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era. arXiv preprint arXiv:2304.06488, 2023.
- [6] Google. AI Infrastructure ML and DL Model Training | Google Cloud — cloud.google.com. [Accessed 12-06-2024].
- [7] Microsoft. Azure OpenAI Service Advanced Language Models | Microsoft Azure — azure.microsoft.com. [Accessed 12-06-2024].
- [8] Huawei. Ascend AI Cloud Service | Huawei Cloud huaweicloud.com. [Accessed 12-06-2024].
- [9] Alibaba. Alibaba Cloud AI and Data Intelligence Alibaba Cloud alibabacloud.com. [Accessed 12-06-2024].
- [10] Condé Nast. OpenAI's CEO Says the Age of Giant AI Models Is Already Over wired.com. [Accessed 12-06-2024].
- [11] Siladitya Ray. Samsung Bans ChatGPT Among Employees After Sensitive Code Leak forbes.com. [Accessed 12-06-2024].
- [12] Intel. Intel Software Guard Extensions.
- [13] AMD. AMD SEV-SNP.
- [14] ARM. Learn the Architecture: TrustZone for AArch64, 2021.
- [15] ARM. Arm Confidential Compute Architecture (ARM-CCA).
- [16] NVIDIA Hopper Architecture In-Depth, 2022.
- [17] Stavros Volos, Kapil Vaswani, and Rodrigo Bruno. Graviton: Trusted execution environments on gpus. In *USENIX OSDI*, 2018.
- [18] Insu Jang, Adrian Tang, Taehoon Kim, Simha Sethumadhavan, and Jaehyuk Huh. Heterogeneous isolated execution for commodity gpus. In *ASPLOS*, 2019.

- [19] Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, et al. Confidential computing within an {AI} accelerator. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 501–518, 2023.
- [20] Shaza Zeitouni, Jo Vliegen, Tommaso Frassetto, Dirk Koch, Ahmad-Reza Sadeghi, and Nele Mentens. Trusted configuration in cloud fpgas. In *IEEE FCCM*, 2021.
- [21] Mark Zhao, Mingyu Gao, and Christos Kozyrakis. Shef: shielded enclaves for cloud fpgas. In *ACM ASPLOS*, 2022.
- [22] Hyunyoung Oh, Kevin Nam, Seongil Jeon, Yeongpil Cho, and Yunheung Paek. Meetgo: A trusted execution environment for remote applications on fpga. *IEEE Access*, 9:51313–51324, 2021.
- [23] Supraja Sridhara, Andrin Bertschi, Benedict Schlüter, Mark Kuhne, Fabio Aliberti, and Shweta Shinde. Acai: Protecting accelerator execution with arm confidential computing architecture. 2023.
- [24] Chenxu Wang, Fengwei Zhang, Yunjie Deng, Kevin Leach, Jiannong Cao, Zhenyu Ning, Shoumeng Yan, and Zhengyu He. Cage: Complementing arm cca with gpu extensions. ISOC, 2024.
- [25] Erhu Feng and et al. snpu: Trusted execution environments on integrated npus. 2024.
- [26] Yunjie Deng, Chenxu Wang, Shunchang Yu, Shiqing Liu, Zhenyu Ning, Kevin Leach, Jin Li, Shoumeng Yan, Zhengyu He, Jiannong Cao, et al. Strongbox: A gpu tee on arm endpoints. In *Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security*, pages 769–783, 2022.
- [27] Heejin Park and Felix Xiaozhu Lin. Safe and practical gpu computation in trustzone. In *Proceedings of the Eighteenth European Conference on Computer Systems*, pages 505–520, 2023.
- [28] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown. arXiv preprint arXiv:1801.01207, 2018.
- [29] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the keys to the intel {SGX} kingdom with transient out-of-order execution. In 27th {USENIX} Security Symposium ({USENIX} Security 18), pages 991–1008, 2018.

- [30] Ahmad Moghimi, Gorka Irazoqui, and Thomas Eisenbarth. Cachezoom: How SGX amplifies the power of cache attacks. In *International Conference on Cryptographic Hardware and Embedded Systems*, pages 69–90. Springer, 2017.
- [31] Ferdinand Brasser, Urs Müller, Alexandra Dmitrienko, Kari Kostiainen, Srdjan Capkun, and Ahmad-Reza Sadeghi. Software grand exposure: SGX cache attacks are practical. In USENIX WOOT 17, 2017.
- [32] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. In *USENIX Security*, 2021.
- [33] Liwei Guo and Felix Xiaozhu Lin. Minimum viable device drivers for arm trustzone. In *Proceedings of the Seventeenth European Conference on Computer Systems*, EuroSys '22, page 300–316, New York, NY, USA, 2022. Association for Computing Machinery.
- [34] Atlas 300T Training Card Huawei Enterprise. [Accessed 17-06-2024].
- [35] Heng Liao, Jiajin Tu, Jing Xia, Hu Liu, Xiping Zhou, Honghui Yuan, and Yuxing Hu. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In *IEEE HPCA*, 2021.
- [36] Huawei. GitHub Ascend/pytorch: Ascend PyTorch adapter (torch\_npu). github.com. [Accessed 18-06-2024].
- [37] Private Cloud Compute: A new frontier for AI privacy in the cloud, 2024.
- [38] Haohui Mai, Jiacheng Zhao, Hongren Zheng, Yiyang Zhao, Zibin Liu, Mingyu Gao, Cong Wang, Huimin Cui, Xiaobing Feng, and Christos Kozyrakis. Honeycomb: Secure and efficient {GPU} executions via static validation. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 155–172, 2023.
- [39] Jianping Zhu, Rui Hou, XiaoFeng Wang, Wenhao Wang, Jiangfeng Cao, Boyan Zhao, Zhongpu Wang, Yuhui Zhang, Jiameng Ying, Lixin Zhang, et al. Enabling rack-scale confidential computing using heterogeneous trusted execution environment. In *IEEE S&P*, 2020.
- [40] Microsoft. Azure confidential cloud protect data in use | microsoft azure. https://azure.microsoft.com/en-us/solutions/confidential-compute/.
- [41] Nvidia. NVIDIA Multi-Instance GPU User Guide :: Nvidia Tesla Documentation.

- [42] Microsoft. Microsoft Copilot in Bing bing.com. [Accessed 12-06-2024].
- [43] perplexity. Perplexity. [Accessed 12-06-2024].
- [44] ARM Holdings. ARM system memory management unit architecture specification SMMU architecture version 2.0, 2024.
- [45] ARM. AArch64cryptolib, 2023.
- [46] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for squad. *CoRR*, abs/1806.03822, 2018.
- [47] Huawei. CANN Ascend Community. [Accessed 15-07-2024].
- [48] Huawei. Operator development, TBE&AI CPU Operator Development, API Reference, AI CPU API, Overview. [Accessed 15-07-2024].
- [49] Taegyeong Lee, Zhiqi Lin, Saumay Pushp, Caihua Li, Yunxin Liu, Youngki Lee, Fengyuan Xu, Chenren Xu, Lintao Zhang, and Junehwa Song. Occlumency: Privacy-preserving remote deep-learning inference using sgx. In The 25th Annual International Conference on Mobile Computing and Networking, pages 1–17, 2019.
- [50] Fan Mo, Ali Shahin Shamsabadi, Kleomenis Katevas, Soteris Demetriou, Ilias Leontiadis, Andrea Cavallaro, and Hamed Haddadi. Darknetz: towards model privacy at the edge using trusted execution environments. In *Proceedings of the 18th International Conference on Mobile Systems*, Applications, and Services, pages 161–174, 2020.
- [51] Florian Tramer and Dan Boneh. Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. In *International Conference on Learning Representations*, 2018.
- [52] Hanieh Hashemi, Yongqin Wang, and Murali Annavaram. Darknight: An accelerated framework for privacy and integrity preserving deep learning using trusted hardware. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 212–224, 2021.
- [53] Ziqi Zhang, Chen Gong, Yifeng Cai, Yuanyuan Yuan, Bingyan Liu, Ding Li, Yao Guo, and Xiangqun Chen. No privacy left outside: On the (in-) security of tee-shielded dnn partition for on-device ml. *arXiv preprint arXiv:2310.07152*, 2023.
- [54] Jianyu Jiang, Ji Qi, Tianxiang Shen, Xusheng Chen, Shixiong Zhao, Sen Wang, Li Chen, Nicholas Zhang, Xiapu Luo, and Heming Cui. Cronus: Fault-isolated, secure and high-performance heterogeneous computing for trusted execution environments. In ACM/IEEE Micro, 2022.

- [55] Wei Ren, William Kozlowski, Sandhya Koteshwara, Mengmei Ye, Hubertus Franke, and Deming Chen. Accshield: a new trusted execution environment with machine-learning accelerators. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2023.
- [56] Weizhe Hua, Muhammad Umar, Zhiru Zhang, and G Edward Suh. Guardnn: secure accelerator architecture for privacy-preserving deep learning. In *Proceedings of the 59th ACM/IEEE Design Automation Conference*, pages 349–354, 2022.
- [57] Aritra Dhar, Supraja Sridhara, Shweta Shinde, Srdjan Capkun, and Renzo Andri. Empowering data centers for next generation trusted computing. *arXiv* preprint *arXiv*:2211.00306, 2022.
- [58] Jonathan Woodruff, Robert NM Watson, David Chisnall, Simon W Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G Neumann, Robert Norton, and Michael Roe. The cheri capability model: Revisiting risc in an age of risk. ACM SIGARCH Computer Architecture News, 42(3):457–468, 2014.
- [59] Jason Zhijingcheng Yu, Conrad Watt, Aditya Badole, Trevor E Carlson, and Prateek Saxena. Capstone: a capability-based foundation for trustless secure memory access. In *32nd USENIX Security Symposium (USENIX Security 23)*, pages 787–804, 2023.
- [60] Erhu Feng, Dahu Feng, Dong Du, Yubin Xia, Wenbin Zheng, Siqi Zhao, and Haibo Chen. siopmp: Scalable and efficient i/o protection for tees. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, Volume 2, pages 1061–1076, 2024.
- [61] Sijun Tan, Brian Knott, Yuan Tian, and David J Wu. Cryptgpu: Fast privacy-preserving machine learning on the gpu. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1021–1038. IEEE, 2021.
- [62] Nishant Kumar, Mayank Rathee, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. Cryptflow: Secure tensorflow inference. In 2020 IEEE Symposium on Security and Privacy (SP), pages 336–353. IEEE, 2020.
- [63] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. {GAZELLE}: A low latency framework for secure neural network inference. In 27th USENIX security symposium (USENIX security 18), pages 1651–1669, 2018.
- [64] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In *International conference on machine learning*, pages 201–210. PMLR, 2016.
- [65] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. Delphi: A cryptographic inference system for neural networks. In *Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice*, pages 27–30, 2020.
- [66] Arvind Seshadri, Adrian Perrig, Leendert Van Doorn, and Pradeep Khosla. Swatt: Software-based attestation for embedded devices. In *IEEE S&P*, 2004.
- [67] Ahmad Ibrahim, Ahmad-Reza Sadeghi, Gene Tsudik, and Shaza Zeitouni. Darpa: Device attestation resilient to physical attacks. In Proceedings of the 9th ACM Conference on Security & Privacy in Wireless and Mobile Networks, pages 171–182, 2016.
- [68] PCI-SIG. PCI Express 6.0 Specification.
- [69] PCI-SIG. Integrity and Data Encryption (IDE) ECN Deep Dive, accessed 2023-05-04.
- [70] Intel. Intel TDX Connect Architecture Specification, 2023.
- [71] AMD. AMD SEV-TIO: Trusted I/O for Secure Encrypted Virtualization, March 2023.
- [72] ARM Holdings. Introducing Arm Confidential Compute Architecture guide Version 3.0, 2023.
- [73] sifive. RISC-V Security Architecture Introduction, 2019.
- [74] Tyler Hunt, Zhipeng Jia, Vance Miller, Ariel Szekely, Yige Hu, Christopher J Rossbach, and Emmett Witchel. Telekine: Secure computing with cloud GPUs. In *NSDI*, 2020.