# RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization

An Zou, Student Member, IEEE, Jing Li, Member, IEEE, Christopher D. Gill, Senior Member, IEEE, and Xuan Zhang, Member, IEEE.

Abstract—Many emerging cyber-physical systems, such as autonomous vehicles and robots, rely heavily on artificial intelligence and machine learning algorithms to perform important system operations. Since these highly parallel applications are computationally intensive, they need to be accelerated by graphics processing units (GPUs) to meet stringent timing constraints. However, despite the wide adoption of GPUs, efficiently scheduling multiple GPU applications while providing rigorous real-time guarantees remains a challenge. In this paper, we propose RTGPU, which can schedule the execution of multiple GPU applications in real-time to meet hard deadlines. Each GPU application can have multiple CPU execution and memory copy segments, as well as GPU kernels. We start with a model to explicitly account for the CPU and memory copy segments of these applications. We then consider the GPU architecture in the development of a precise timing model for the GPU kernels and leverage a technique known as persistent threads to implement fine-grained kernel scheduling with improved performance through interleaved execution. Next, we propose a general method for scheduling parallel GPU applications in real time. Finally, to schedule multiple parallel GPU applications, we propose a practical real-time scheduling algorithm based on federated scheduling and grid search (for GPU kernel segments) with uniprocessor fixed priority scheduling (for multiple CPU and memory copy segments). Our approach provides superior schedulability compared with previous work, and gives real-time guarantees to meet hard deadlines for multiple GPU applications according to comprehensive validation and evaluation on a real NVIDIA GTX1080Ti GPU system.

Index Terms—GPGPU, Heterogeneous Computing, Parallel Real-time Scheduling, Persistent Thread, Interleaved Execution, Federated Scheduling, Fixed Priority, Self-suspension Model, Schedulability Analysis.

# 1 Introduction

NOWADAYS, artificial intelligence (AI) and machine learning (ML) applications accelerated by graphics processing units (GPUs) are widely adopted in emerging autonomous systems, such as self-driving vehicles and collaborative robotics [1], [2]. For example, Volvo deployed NVIDIA DRIVE PX 2 technology for semi-autonomous driving in 100 XC90 luxury SUVs [3]. These autonomous systems must simultaneously execute different algorithms in the GPU in order to perform tasks such as object detection, 3D annotation, movement prediction, and route planning [4], [5], and must also process images and signals from various sensors and decide the next action in real time.

It is thus essential to manage concurrent execution in the GPUs diligently with respect to various timing constraints, since they can have direct and critical impacts on the stability and safety of the whole system. For generalpurpose computing in non-real-time systems with GPUs, GPU scheduling has aimed to minimize the makespan of a single application or to maximize the total throughput of the system [6], [7], [8], [9]. However, many of these techniques do not translate well to scheduling GPU applications with

real-time deadlines. Conventional programming interfaces allow scheduling only at the granularity of GPU kernels: by default, the first-launched GPU kernel will occupy all GPU resources until completion, at which time the next scheduled GPU kernel can begin executing<sup>1</sup>, even with Multi-Process Service (MPS) [10]. This kernel-granular scheduling is not sufficient to meet real-time deadlines. For example, consider two real-time tasks run on the same GPU, one of which has a large GPU kernel with a long deadline, while the other has a small GPU kernel with a short deadline. If the large GPU kernel arrives slightly before the small GPU kernel, the large task will take over the entire GPU, leaving the small task stuck waiting and likely missing its deadline. To overcome this deficiency and improve the real-time performance of GPU applications, systems may add some form of preemption via low-level driver support or modify CUDA APIs so that the system's timing behavior is more predictable [11], [12], [13], [14], [15], [16], [17], [18]. However, none of these approaches provides fine-grained real-time GPU scheduling and the corresponding schedulability analysis needed to

1. GPU CUDA activity from independent host processes will normally create independent CUDA contexts. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device. CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it will switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified.

An Zou, Christopher D. Gill, and Xuan Zhang are with Washington University in St. Louis, St. Louis, MO, 63130.

Jing Li is with the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102.



Figure 1: RTGPU framework

execute multiple real-time tasks in GPUs.

In this paper, we propose RTGPU, a general realtime GPU scheduling framework shown schematically in Fig. 1, which provides GPU partitioning and modeling and a scheduling algorithm and schedulability analysis. First, based on an in-depth understanding of GPU kernel execution and profiling of synthetic workloads, we leverage a technique called persistent threads to support SM-granularity scheduling for concurrent GPU applications [19], [20], [21]. With the persistent threads technique, the interleaved execution can achieve a 10% to 37% improvement in system utilization. We then develop a realtime GPU system model that introduces the concept of virtual streaming multiprocessors (virtual SMs). With this model, we are able to explicitly assign the desired number of virtual SMs to each GPU kernel of each GPU application, allowing finer-grained GPU scheduling without any lowlevel modifications to GPU systems. Compared with previous kernel-granularity scheduling approaches, this model supports more flexible parallel execution in the GPUs.

Each GPU application has multiple CPU execution, memory copy segments, and GPU kernels. For the GPU segments, based on our real-time GPU system model, we extend a parallel real-time scheduling paradigm, federated scheduling [22], to schedule real-time GPU applications with implicit deadlines. The key idea behind federated scheduling is to calculate and statically assign the specific computing resources that each parallel real-time task needs to meet its deadline. Note that preemption between tasks is not needed if the correct number of fixed-granularity computing resources can be accurately derived in analysis and enforced during runtime. For the CPU segments and memory copies between CPU and GPU, a novel uniprocessor fixed priority scheduling method is then proposed based on calculating the response time upper bounds and lower bounds of each segment alternately. This scheduling algorithm is not limited to GPU applications and can be further applied to other applications running on heterogeneous architecture computing systems.

Compared with previous work, combining GPU federated scheduling with CPU and memory copy fixed priority scheduling works well and achieves the best schedulability known to date. To assess the effectiveness of those techniques on real platforms, we evaluate and validate our proposed RTGPU framework on real NVIDIA GPU systems.



Figure 2: Typical GPU task execution pattern.

## 2 BACKGROUND AND RELATED WORK

**2.1 Background on GPU Systems.** GPUs are designed to accelerate compute-intensive workloads with high levels of data parallelism. As shown in Fig. 2., a typical GPU program contains three parts — a code segment that runs on the host CPU (the CPU segment), the host/device memory copy segment, and the device code segment which is also known as the GPU kernel. GPU kernels are single instruction multiple threads (SIMT) programs. The programmer writes code for one thread, many threads are grouped into one thread block, and many thread blocks form a GPU kernel. The threads in one block execute the same instruction on different data simultaneously. A GPU consists of multiple streaming multiprocessors (SMs). The SM is the main computing unit, and each thread block is assigned to an SM to execute. Inside each SM are many smaller execution units that handle the physical execution of the threads in a thread block assigned to the SM, such as CUDA cores for normal arithmetic operations, special function units (SFUs) for transcendental arithmetic operations, and load and store units (LD/ST) for transferring data from/to cache or memory.

When GPU-accelerated tasks are executed concurrently, kernels from different tasks are issued to a GPU simultaneously. Standard CUDA streaming supports multiple kernels concurrently within the same CUDA context. However, it cannot effectively manage concurrent GPU kernels and tasks in an explicit manner. When kernels are launched, the thread blocks are dispatched to all the SMs on a first-come, first-served basis. The first-launched kernel occupies all the GPU resources, and the next kernel begins its execution only when SMs are freed after completion of the first kernel. Therefore, the execution of the concurrent tasks remains sequential despite the CUDA streaming mode.

**2.2 Background on Multi-Segment Self-Suspension.** In the multi-segment self-suspension model, a task  $\tau_i$  has  $m_i$  execution segments and  $m_i-1$  suspension segments between the execution segments. So task  $\tau_i$  with deadline  $D_i$  and period  $T_i$  is expressed as a 3-tuple:

$$\tau_i = \left( (L_i^0, S_i^0, L_i^1, ..., S_i^{m_i - 2}, L_i^{m_i - 1}), D_i, T_i \right)$$

where  $L_i^j$  and  $S_i^j$  are the lengths of the j-th execution and suspension segments, respectively.  $[\widecheck{S}_i^j, \widehat{S}_i^j]$  gives the upper and lower bounds of the suspension length  $S_i^j$ .  $\widehat{L}_i^j$  is the upper bound on the length of the execution segment  $L_i^j$ .

The analysis in [23] bounds the worst-case response time of a task under the multi-segment self-suspension model, which is summarized below and utilized in this work for analyzing the response time of CPU-GPU tasks.

**Lemma 2.1.** The following workload function  $W_i^h(t)$  bounds on the maximum amount of execution that task  $\tau_i$  can perform during an interval with a duration t and a starting segment  $L_i^h$ :

$$W_i^h(t) = \sum_{j=h}^l \widehat{L}_i^{j \mod m_i} + \min\left(\widehat{L}_i^{(l+1) \mod m_i}, t - \sum_{i=h}^l \left(\widehat{L}_i^{j \mod m_i} + S_i(j)\right)\right)$$

where *l* is the maximum integer satisfying the following condition:

$$\sum_{i=h}^{l} \left( \widehat{L}_{i}^{j \bmod m_{i}} + S_{i}(j) \right) \leqslant t$$

and  $S_i(j)$  is the minimum interval-arrival time between execution segments  $L_i^j$  and  $L_i^{j+1}$ , which is defined by:

$$S_{i}(j) = \begin{cases} \check{S}_{i}^{j \mod m_{i}} & \text{if } j \mod m_{i} \neq (m_{i} - 1) \\ T_{i} - D_{i} & \text{else if } j = m_{i} - 1 \\ T_{i} - \sum_{j=0}^{m_{i} - 1} \hat{L}_{i}^{j} - \sum_{j=0}^{m_{i} - 2} \check{S}_{i}^{j} & \text{otherwise} \end{cases}$$

Then the response time of execution segment  $L_i^j$  in task  $\tau_k$  can be bounded by calculating the interference caused by the workload of the set of higher-priority tasks hp(k).

**Lemma 2.2.** The worst-case response time  $R_k^j$  is the smallest value that satisfies the following recurrence:

$$\widehat{R}_k^j = \widehat{L}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, m_i - 1]} W_i^h(\widehat{R}_k^j)$$

Hence, the response time of task  $\tau_k$  can be bounded by either taking the summation of the response times of every execution segments and the total worst-case suspension time, or calculating the total interference caused by the workload of the set of higher-priority tasks hp(k) plus the total worst-case execution and suspension time.

**Lemma 2.3.** Hence, the worst-case response time  $\hat{R}_k$  of task  $\tau_k$ 

is upper bounded by the minimum of 
$$\widehat{R1}_k$$
 and  $\widehat{R2}_k$ , where: 
$$\widehat{R1}_k = \sum_{j=0}^{m_k-2} \widehat{S}_k^j + \sum_{j=0}^{m_k-1} \widehat{R}_k^j \tag{1}$$

and  $R2_k$  is the smallest value that satisfies the recurrence:

$$\widehat{R2}_{k} = \sum_{j=0}^{m_{k}-2} \widehat{S}_{k}^{j} + \sum_{j=0}^{m_{k}-1} \widehat{L}_{k}^{j} + \sum_{\tau_{i} \in hp(k)} \max_{h \in [0, m_{i}-1]} W_{i}^{h}(\widehat{R2}_{k})$$
(2)

2.3 Related Work. Previous work on GPU resource management at the operating system-level [6], [7], [24] has used persistent threads to implement SM-granularity workload assignment for non-real-time systems [19], [20], [21]. Meanwhile, Lin [2] proposed integrated vectorization and scheduling methods to exploit multiple forms of parallelism for optimizing throughput for synchronous dataflows on memory-constrained CPU-GPU platforms. Wang [25] implemented a user-mode lightweight CPU-GPU resource management framework to optimize the CPU utilization while maintaining good Quality of Service (OoS) of GPUintensive workloads in the cloud, such as cloud games. For a more complex system, Kayiran [8] considered GPU concurrency in a heterogeneous setting. For a large scale server system, Yang [9] studied parallel execution on multicore GPU clusters. Park [15], Basaran [16], Tanasic [17],

and Zhou [18] proposed architecture extensions and Effisha [26] introduced software techniques without any hardware modification to support kernel preemption. Chen [27] extended the original Flink on CPU clusters to GFlink on heterogeneous CPU-GPU clusters for big data applications. Thermal and energy efficient GPU systems were studied in [28], [29]. For real-time systems with GPUs, previous work mainly involves GPU kernel-granularity scheduling. For example, Kato [13] introduced a priority-based scheduler; Elliott proposed shared resources and containers for integrating GPU and CPU scheduling [14] and GPUSync [30] for managing multi-GPU multicore soft real-time systems with flexibility, predictability, and parallelism; Golyanik [31] described a scheduling approach based on time-division multiplexing;  $S^3$ DNN [11] optimized the execution of DNN GPU workloads in a real-time multi-tasking environment through scheduling the GPU kernels. However, these approaches focus on predictable GPU control, and do not allow multiple tasks to use the GPU at the same time. Thus, the GPU may be underutilized and a task may wait a long time to access the GPU. Researchers also have explored other approaches to improve schedulability. Gerum [32] and Berezovskyi [33] targeted accurate timing estimation for GPU workloads. Zhou [18] proposed a technique based on reordering and batching kernels to speed up deep neural networks. Lee [12] studied how to schedule two real-time GPU tasks. Bakhoda [34], Wang [35], Xu [36], and Lee [37] studied GPU scheduling on a GPU simulator. For scheduling theory, the CPU-GPU system can be modeled after the self-suspension framework, but it has CPU, memory copy, and GPU segments leading to more unique and complicated features like the interactions and blockings from nonpreemptive components in the suspension segments. Saha [38] used the persistent threads technique and busy-waiting suspension mode, which underrates the system's performance and causes extra pessimism in the scheduling ability. Sun [39] proposed a formal representation of the scheduling problem upon host-centric acceleration architectures, but it cannot handle classic sporadic/periodic tasks.

# **CPU** AND MEMORY MODEL

3.1 CPU Modelling. As represented in Fig. 2., a typical GPU application has multiple segments of CPU code, memory copies between the CPU and GPU, and GPU code (which are also called GPU kernels). Because a GPU has powerful parallel computational capacity, it is assigned to execute computationally-intensive workloads, such as matrix operations. The CPU executes serial instructions, e.g., for communication with IO devices (sensors and actuators) and launches memory copies and GPU kernels.

When a CPU executes serial instructions, it naturally behaves as a single-threaded application without parallelism. When the CPU code launches memory copies or GPU kernels, these instructions will be added into multiple FIFO buffers called a "CUDA stream". The memory copies and GPU kernels, which are in different CUDA streams, can execute in parallel if there are remaining available resources. The execution order of memory copies and GPU kernels in a single CUDA stream can be controlled by the order in which they are added to it by the CPU code. After the CPU has launched memory copies and GPU kernels into a CUDA



Figure 3: Comparison of three different GPU application scheduling approaches.

stream, it will immediately execute the next instruction, unless extra synchronization is used in the CPU code to wait for the memory copies or GPU kernels to finish. Thus, the CPU segments in GPU applications can be modelled as serial instructions executed by one thread.

**3.2 Memory Modeling.** Memory copying between the CPU and GPU execution units includes two stages. In the first stage, data is copied between the CPU memory and the GPU memory through a single peripheral component interconnect express (PCIe) for a desktop/server GPU, or through a network on chip (NoC) for an embedded GPU. Because of the hardware protocols for PCIe and NoC, only one global memory copy can be performed at a time. Also, the memory copy through PCIe/NoC is non-preemptive once it starts. The memory copy time between CPU memory and GPU memory is a linear function of the copied memory size. The GPU and other accelerators mainly provide two types of memory movement between the CPU and GPU (accelerators) [40], [41]: direct memory copy (also called traditional memory) and unified memory (introduced in CUDA 6.0 and strengthened in CUDA 8.0). Direct memory copy uses traditional memory to store and access memory, where data must be explicitly copied from CPU to GPU portions of DRAM. Unified memory is developed from zerocopy memory where the CPU and the GPU can access the same memory area by using the same memory addresses between the CPU and GPU. In unified memory, the GPU can access any page of the entire system memory and then migrate the data on-demand to its own memory at the granularity of pages. Compared with unified memory, direct memory copy is faster (higher bandwidth) [42] and is a more universal application, not just limited to GPU systems but also widely used in heterogeneous computing systems. In following discussion, we focus mainly on direct memory copy, but our approach can also be directly applied to unified memory by setting explicit copy length to zero.

The second stage is the memory access from the GPU's execution units to the GPU cache or memory. The GPU adopts a hierarchical memory architecture. Each GPU SM has a local L1 cache, and all SMs share a global L2 cache and DRAM banks. Although the current NVIDIA Multi-Process Service (MPS) does not provide any official mechanism for shared memory hierarchy partitioning, computer architecture researchers have proposed software-based generic algorithms [43] for partitioning the publicly unknown architectural details of the GPU L2 cache and DRAM through

reverse engineering. These memory accesses actually happen simultaneously with the kernel's execution. Thus, the second memory operation is modeled as part of the critical-path overhead of the kernel execution model.

## 4 GPU PARALLEL KERNEL EXECUTION MODEL

This section introduces the modeling of GPU kernels, which are the key components in GPU accelerated applications. A hard deadline requires an accurate task execution model, built upon a deep understanding of the GPU architecture and its parallel execution mechanism.

4.1 Kernel-granularity and SM-granularity Scheduling. An off-the-shelf GPU supports only kernel-granularity scheduling, as shown in Fig. 3(a). When kernels are launched in the GPU, each kernel fully occupies all the compute resources (SMs) on the GPU, so even with Multi-Process Service (MPS) by default a GPU is only able to execute one kernel at a time. The execution order of the kernels of the different tasks can be changed in kernel-granularity scheduling, as shown in Fig. 3(b). Ever since the development of the Pascal GP100 architecture, preemption has been supported by swapping the whole kernel context to GPU DRAM. However, preemption is mainly used for long-running or ill-behaved applications. It is not suitable for run-time systems [44], [45], since it introduces intolerable overhead when a whole GPU kernel is swapped in and out.

The persistent threads approach is a new software workload assignment solution proposed to implement finer and more flexible SM-granularity GPU scheduling. The persistent threads technique alters the notion of the lifetime of virtual software threads, bringing them closer to the execution lifetime of the physical hardware thread [20]. Specifically, each persistent threads block links multiple thread blocks of one kernel and is assigned to one SM to execute for the entire hardware execution lifetime of the kernel. For example, in Fig. 3(c), the first thread block in kernel 1 (K1) links the other thread blocks in K1 to form a big linked thread block. When this first thread block is executed by one SM, the other thread blocks in K1, which are linked by the first block, will also be executed in the first SM. Thus, K1 takes one SM to execute. Similarly, in kernel 3 (K3), the first two thread blocks link the other thread blocks and form two big linked thread locks. Thus, the kernel 3 (K3) takes two SMs to execute. The detailed persistent threads technique of linking thread blocks to form linked thread blocks is shown in Algorithm 1.



(a) with increasing numbers of assigned SMs



(b) comprehensive kernel with increasing size

Figure 4: Kernel execution time trends.

When the numbers of linked thread blocks are changed, the resulting number of persistent threads blocks controls how many SMs (i.e., GPU resources) are used by a kernel. In addition, when there are remaining available SMs, CUDA introduces CUDA Streams that support concurrent execution of multiple kernels. By exploiting persistent threads and CUDA Streams, we can explicitly control the number of SMs used by each kernel and execute kernels of different tasks concurrently to achieve SM-granularity scheduling. Persistent threads enabled SM-granularity scheduling fundamentally improves schedulability of parallel GPU applications by exploiting finer-grained parallelism.

**4.2 Kernel Execution Model.** To understand the relationship between the execution time of a kernel and the number of SMs assigned via persistent threads, we conducted the following experiments. We use five synthetic kernel benchmarks that utilize different GPU resources: a computation kernel, consisting mainly of arithmetic operations; a branch kernel containing large number of conditional branch operations; a memory kernel full of memory and register visits; a special-function kernel with special mathematical functions, such as sine and cosine operations; and a comprehensive kernel including all these arithmetic, branch, memory, and special mathematical operations. Each kernel performs 1000 floating-point operations on a 2<sup>15</sup>-long vector.

We first run each kernel separately with a fixed workload for 1000 times and record its corresponding execution time with increasing numbers of assigned SMs, as shown in Fig. 4(a). From the boxplot, we can see that the kernel execution time t follows the classic formula

$$t = \frac{C - L}{m} + L \tag{3}$$

where m is the number of assigned SMs, C is the work of the kernel, and L is the GPU overhead including on-chip memory visit. This makes it clear that GPU kernels are fully parallel workloads, which can utilize all m allocated SMs. The only sequential execution is when the GPU is copying data and launching the kernel. We can also observe that the execution time of a GPU kernel has low variation because it benefits from a single-instruction multiple-threads (SIMT) architecture, in which single-instruction, multiple-



Figure 5: Virtual SM model for interleaved execution data (SIMD) processing is combined with multithreading

for better parallelism.

Next, we examine the kernel execution time with increasing kernel sizes and different numbers of assigned SMs. Fig. 4(b) shows that the sophisticated kernel and the other types of kernels have similar trends. The results are again consistent with Eq. (3). When the size of the kernel is significantly larger than the GPU overhead, the execution time is dominated by the work of the kernel and has a nearly linear speedup. Also, no matter whether the kernel is large or small, and no matter what types of operations

are executed inside the kernel, the variance of the kernel

execution times is consistently small.

**4.3 Interleaved Execution and Virtual SM.** In SM-granularity scheduling with multiple GPU tasks, we can further improve GPU utilization by exploiting interleaved execution of GPU kernels. On a GPU with M SMs, naive SM-granularity scheduling can first concurrently execute the  $K_1$  and  $K_2$  kernels, each with M/2 persistent threads blocks, and then execute the K3 kernel with M persistent threads blocks, as shown in Fig. 5(a). Each block requires one SM to execute one persistent thread at a time.

On the other hand, an SM actually allows the parallel execution of two or more persistent threads blocks to overlap if they use different components of the SM in the same cycle [46]. This interleaved execution is similar to the hyperthreading in conventional multithreaded CPU systems that aims to improve computation performance. For example, in an NVIDIA GTX 1080 TI, one SM can hold 2048 software threads, whereas one thread block can have at most 1024 software threads. Thus, two or more thread blocks can be interleaved and executed on one SM. One important consequence of interleaved execution is that the execution time of a kernel increases. Therefore, to improve GPU utilization and efficiency, we can launch all three kernels, as illustrated in Fig. 5(b), where kernel 1 and kernel 2 will simultaneously execute with kernel 3. The execution latency of each kernel is increased by a factor called the interleaved factor, which ranges from 1.0 to 1.8 in the following experiments.

We propose a virtual SM model to capture this interleaved execution of multiple GPU kernels, as shown in Fig. 5(c). In particular, we double the number of physical SMs to get the number of virtual SMs. Each virtual SM can execute the same type of instruction from one persistent threads block in one virtual cycle. Compared with a physical SM, a virtual SM has a reduced computational ability and hence a prolonged virtual cycle, the length of which is related to the type of instructions in the interleaved kernel. To understand the interleaved ratio between the virtual cycle and the actual cycle, we empirically measured the execution



Figure 6: Characterization of the latency extension ratios of interleaved execution

time of a synthetic benchmark when it was interleaved with another benchmark. Fig. 6 illustrates the minimum, median, and maximum interleaved execution time, colored from light to dark, normalized over the worst-case execution time of the kernel without interleaving, where the left bar is without interleaving and right bar is with interleaving. We can see that the interleaved execution ratio is at most  $1.45\times$ ,  $1.7\times$ ,  $1.7\times$ , and  $1.8\times$  for special, branch, memory and computation kernels, respectively. The proposed virtual SM model improves throughput by  $11\%\sim38\%$  compared to the naive non-interleaved physical SM model.

4.4 Workload Pinning and Self-Interleaving. Using the persistent threads and interleaved execution techniques, multiple tasks can be executed in parallel, and the interleaved execution further improves GPU performance. In real GPU systems, such as NVIDIA GPUs, a hardware scheduler is implemented that allocates the thread blocks to SMs in a greedy-then-oldest manner [34]. Thus, at run time, the thread blocks from a kernel are interleaved and executed with thread blocks from other possible kernels, and the interleaved execution ratio is different when different kernels are interleaved and executed, as shown in Fig. 6. To guarantee a hard deadline, each kernel has to adopt the largest interleaved execution ratio when this kernel is interleaved and executed with other possible kernels. However, using the highest interleaved execution ratio cannot avoid underestimation of the GPU computation ability. Therefore, we introduce workload pinning which pins the persistent threads blocks to specific SMs, and self-interleaving where the kernel interleaves with itself on its pinned SMs.

Workload pinning is implemented by launching 2M persistent threads blocks in each kernel, which is also the number of virtual SMs, so that all virtual SMs will finally have one persistent threads block to execute. If the SM is the targeted pinning SM, the thread block will begin to execute. Persistent threads blocks assigned to undesired SMs (untargeted pinning SMs), will simply return, which takes only about  $10~\mu s$ . When a persistent threads block is assigned to the correct SM, it will not only execute its own

```
Algorithm 1: Pseudo Code of Pinned Self-Interleaving Persistent Thread Pseudo Code
```

```
// Get the ID of current SM with assemble language
static __device__ _inline __ uint32_t __mysmid()
uint32_t smid;
asm volatile ("mov.u32 %0, %%smid;" : "=r"(smid));
return smid; }
// Kernel pinned to desired_SMs with self-interleaved
persistent thread
 _global__ void kernel (int desired_SMs, ...){
int SM_num;
SM_num = __mysmid(); // Get the ID of current SM
//Excute on desired SMs, otherwise return
if(SM_num == desired_SMs) {
  //Get the global thread index: tid
  int tid = threa-
dIdx.x+(SM_num_desired_SM_start)*blockDim.x;
  //off_set links to the next thread block by persistent thread
  int off_set =
blockDim.x*(desired_SM_end-desired_SM_start+1);
//Divide N threads inside a kernel to 2 halves [0 N/2) and
[N/2 \ N). [0 \ N/2) and [N/2 \ N) from same kernel interleaved
execute with each other. From the kernel perspective, the
kernel interleaved execute with itself.
  if(blockIdx.x < virtual_SM/2) {</pre>
    for(int i = tid; i < N/2; i += off_set) {
       Execute on thread i;}}
  else {
    for(int i = tid + N/2; i < N; i += off\_set) {
       Execute on thread i;}}
return; }
// Kernel launch
void main () {
dim3 gridsize (number of virtual SM);
dim3 blocksize (Max number of threads per block);
task1 ≪ gridsize, blocksize, ..., stream ≫ (int
desired_SMs, ...);
kernel(intdesiredSMs, ...);
return; }
```

workload, but will also execute the workloads from blocks assigned to the undesired SMs. Thus, the kernel is actually executed on the desired SMs, and the undesired SMs execute an empty block within a negligible time.

The self-interleaving technique evenly divides the original kernel into two small kernels, which are assigned to the same specific SMs using workload pinning. The two small kernels are then interleaved and executed on the pinned SMs. The original kernel is self-interleaved on the pinned SMs. A persistent threads with pinned self-interleaving design and implementation is described in Algorithm 1.

# 5 PRACTICAL RT-GPU TASK SCHEDULING

In this section, we first introduce the model for real-time GPU tasks, then propose the RT-GPU scheduling algorithm,



Figure 7: GPU tasks real-time scheduling model.

and develop the corresponding response time analysis. RT-GPU uses federated scheduling to execute GPU kernels on virtual SMs and uses fixed-priority scheduling to schedule CPU and memory-copy segments.

One of the key challenges of deriving the end-to-end response times for CPU-GPU tasks is to simultaneously bound the interference on CPU, GPU, and bus without being too pessimistic. Extending federated scheduling allows us to achieve efficient and predictable execution of GPU kernels and to analyze the response times of GPU kernels independently. When analyzing the response times of the CPU segments, we view the CPU segments as execution and the response times of GPU and memory-copy segments as suspension; similarly, when analyzing the response times of the memory-copy segments, we consider the memorycopy segments as execution and the response times of GPU and CPU segments as suspension. We can thus exploit the response time analysis in [23] for multi-segment self-suspension tasks, which allows us to achieve better schedulability for CPU-GPU tasks. Our proposed end-toend response time analysis is not limited to CPU-memory-GPU system. It can also be applied to other heterogeneous systems, like CPU-memory-FPGA and CPU-memory-TUP systems.

5.1 Task Model. Leveraging the platform implementation and the CPU, memory and GPU models discussed in previous sections, the model for the parallel real-time tasks executing on a CPU-GPU platform is shown in Fig. 7. We consider a task set  $\tau$  comprised of n sporadic tasks, where  $\tau = \{\tau_1, \tau_2, \cdots, \tau_n\}$ . Each task  $\tau_i$ , where  $1 \leq i \leq n$ , has a relative deadline  $D_i$  and a period (minimum interarrival time)  $T_i$ . In this work, we restrict our attention to constrained-deadline tasks, where  $D_i \leq T_i$ , and tasks with fixed task-level priorities, where each task is associated with a unique priority. More precisely, when making scheduling decisions on any resource, such as CPU and bus, the system always selects the segment with the highest priority among all available segments for that resource to execute. Of course, a segment of a task only becomes available if all the previous segments of that task have been completed.

On a CPU-GPU platform, task  $\tau_i$  consists of  $m_i$  CPU segments,  $2m_i-2$  memory-copy segments, and  $m_i-1$  GPU segments. As discussed in Section 4.2, a GPU segment  $G_i^j$  models the execution of a GPU kernel on interleaved SMs using total work  $GW_i^j$ , critical-path overhead  $GL_i^j$ , and

interleaved execution ratio  $\alpha_i^j$ , i.e.,  $G_i^j = (GW_i^j, GL_i^j, \alpha_i^j)$ . Thus, task  $\tau_i$  can be characterized by the following tuple:

$$\tau_{i} = \left( \left( CL_{i}^{0}, ML_{i}^{0}, G_{i}^{0}, ML_{i}^{1}, CL_{i}^{1}, ML_{i}^{2}, G_{i}^{1}, ML_{i}^{3}, \cdots, CL_{i}^{j}, ML_{i}^{2j}, G_{i}^{j}, ML_{i}^{2j+1}, \cdots, CL_{i}^{m_{i}-2}, ML_{i}^{2m_{i}-4}, G_{i}^{m_{i}-2}, ML_{i}^{2m_{i}-3}, CL_{i}^{m_{i}-1} \right), D_{i}, T_{i} \right)$$

$$(4)$$

where  $CL_i^j$  and  $ML_i^j$  are the execution times of the (j+1)-th CPU and memory-copy segments, respectively. In addition, we use  $\check{}$  and  $\hat{}$  to denote the lower and upper bound on a random variable. For example,  $\widehat{CL}_i^j$  and  $\widecheck{CL}_i^j$  are the upper and lower bounds on execution times of the (j+1)-th CPU segment of  $\tau_i$ , respectively.

To derive the end-to-end response time  $R_i$  of task  $\tau_i$ , we will analyze the response times  $GR_i^j$ ,  $MR_i^j$ , and  $CR_i^j$  of each individual GPU, memory-copy, and CPU segments, respectively, and calculate their lower and upper bounds in the following subsections.

**5.2 Federated Scheduling for GPU Segments.** For executing the GPU segments of the n tasks on the shared GPU with 2GN virtual SMs (i.e., GN physical SMs), we propose to generalize federated scheduling [22], a scheduling paradigm for parallel real-time tasks on general-purpose multi-core CPUs, to scheduling parallel GPU segments. The key insight of federated scheduling is to calculate and assign the minimum number of dedicated resources needed for each parallel task to meet its deadline.

Specifically, we allocate  $2GN_i$  dedicated virtual SMs to each task  $\tau_i$ , such that its GPU segment  $G_i^j$  can start executing immediately after the completion of the corresponding memory copy  $ML_i^{2j}$ . In this way, the mapping and execution of GPU kernels to SMs are explicitly controlled via the persistent thread and workload pinning interfaces, so the effects caused by the black-box internal scheduler of a GPU are minimized. Additionally, tasks do not need to compete for SMs, so there is no blocking time on the non-preemptive SMs. Furthermore, via the self-interleaving technique, we enforce that GPU kernels do not share any physical SMs. Therefore, the interference between different GPU segments is minimized, and the execution times of GPU segments are more predictable.

In summary, each task  $\tau_i$  is assigned with  $2GN_i$  dedicated virtual SMs where each of its GPU segments self-interleaves and has an interleaved execution ratio  $\alpha_i^j$ . In Section 5.5, we will present the algorithm that determines the SM allocation to tasks. Here, for a given allocation, we can easily extend the formula in Section 4.2 to obtain the following lemma for calculating the response time  $GR_i^j$  of a GPU segment  $G_i^j$ .

**Lemma 5.1.** If the GPU segment  $G_i^j$  has a total work in range  $[\widetilde{GW}_i^j, \widehat{GW}_i^j]$ , a critical-path overhead in range  $[0, \widehat{GL}_i^j]$  and an interleaved execution ratio in range  $[1, \alpha_i^j]$ , then when running on  $2GN_i$  dedicated virtual SMs, its response time is in  $[\widetilde{GR}_i^j, \widehat{GR}_i^j]$  where

$$\widecheck{GR}_{i}^{j} = \widecheck{\frac{GW}_{i}^{j}}_{2GN_{i}}, \text{ and } \widehat{GR}_{i}^{j} = \overbrace{\frac{GW}_{i}^{j}\alpha_{i}^{j} - \widehat{GL}_{i}^{j}}_{2GN_{i}} + \widehat{GL}_{i}^{j}.$$

*Proof.* The lower bound  $\widecheck{GR}_i^j$  is the shortest execution time

of this GPU segment on  $2GN_i$  virtual SMs. In the best case, there is no critical-path overhead and no execution time inflation due to interleaved execution. The minimum total virtual work  $\widetilde{GW}_{i}^{j}$  is executed in full parallelism on  $2GN_i$  virtual SMs, which gives the formula for  $\widetilde{GR}_i^J$ . In the worst case, the maximum total virtual work is  $\widehat{GW}_{i}^{j}\alpha_{i}^{j}$ , and the maximum critical-path overhead  $\widehat{GL}_{i}^{j}$  captures the maximum overhead of launching the kernel. Since  $\widehat{GL}_i^J$  is a constant overhead and is not affected by self-interleaving and multiple virtual SMs, we do not need to apply the interleaved execution ratio  $\alpha_i^j$  to  $\widehat{GL}_i^j$ . After deducting the critical-path overhead, the remaining GPU computation is embarrassingly parallel on  $2GN_i$  virtual SMs, which results the formula of  $\widehat{GR}_{i}^{J}$ .

Note that Lemma 5.1 calculates both the lower and upper bounds on the response time of GPU segment  $G_i^j$ , because both bounds are needed when analyzing the total response time of task  $\tau_i$ . Both the lower and upper bounds can be obtained by profiling the execution time of GPU segments many times.

To ensure that tasks do not share SMs, the total number of virtual SMs assigned to all tasks must be no more than the number of available virtual SMs, i.e.,  $\sum_{i} GN_{i} \leq GN$ ; otherwise, the task set is unschedulable. During runtime execution of schedulable task sets, our platform will generate  $2GN_i$  persistent threads blocks for each GPU segment of task  $\tau_i$  to execute on its assigned  $2GN_i$  virtual SMs.

5.3 Fixed-Priority Scheduling for Memory-Copy Segments with Self-Suspension and Blocking. Our proposed algorithm, which will be explained in detail in Section 5.5, schedules the CPU and memory segments according to fixed-priority scheduling. In this subsection, we will focus on analyzing the fixed-priority scheduling of the memorycopy segments on the bus. From the perspective of executing memory-copies over the bus, memory-copy segments are "execution segments"; the time intervals where task  $\tau_i$ spends on waiting for CPU and GPU to complete the corresponding computation are "suspension segments", since the bus can be used by other tasks during these intervals of  $\tau_i$  even if  $\tau_i$  has higher priority. The analysis uses the lower bounds on the lengths of suspension segments, i.e., the lower bounds on response times of CPU and GPU segments. For a GPU segment, the lower bound  $\widecheck{GR}_i^{\jmath}$  has been obtained in Section 5.2, since our proposed algorithm uses federated scheduling on the GPU. Since the CPU segments are executed on a uniprocessor, the response time of a CPU segment is lower bounded by the minimum execution time of this segment, i.e.,  $\widetilde{CR}_i^j = \widetilde{CL}_i^j$ .

However, compared with the standard self-suspension model in Section 2.2, memory-copy over a bus has the following differences. (1) Because memory copy is nonpreemptive, a memory-copy segment of a high-priority task can be blocked by at most one memory-copy segment of any lower-priority task if this lower-priority segment has already occupied the bus. (2) The length of suspension between two consecutive memory-copies depends on the response time of the corresponding CPU or GPU segment. (3) The response times of CPU segments are related to the response times of memory-copy segments, which will be analyzed in Section 5.4. (4) Moreover, the lower bounds on the end-to-end response times of a task are related to the response times of all types of segments, which requires a holistic fixed-point calculation to be presented in Section 5.5.

We now define the following memory-copy workload function  $MW_i^h(t)$ , which is similar to the workload function defined for standard self-suspension tasks in Lemma 2.1.

**Lemma 5.2.**  $MW_i^h(t)$  bounds the maximum amount of memorycopy that task  $\tau_i$  can perform during an interval with a duration t and a starting memory-copy segment  $ML_i^h$ , where:

$$\begin{split} MW_i^h(t) &= \sum_{j=h}^{l} \widehat{ML}_i^{j \bmod 2m_i - 2} + \min\left(\widehat{ML}_i^{(l+1) \bmod 2m_i - 2}, \right. \\ &\left. t - \sum_{j=h}^{l} \left(\widehat{ML}_i^{j \bmod 2m_i - 2} + MS_i(j)\right)\right) \end{split}$$

where *l* is the maximum integer satisfying the following condition:

$$\sum_{j=h}^{l} \left( \widehat{ML}_{i}^{j \mod 2m_{i}-2} + MS_{i}(j) \right) \leqslant t$$

and  $MS_i(j)$  is defined as follow:

- If  $j \mod (2m_i-2) \neq (2m_i-3)$  and  $j \mod 2=0$ , then  $MS_i(j) = \widecheck{GR}_i^{\left(j \mod (2m_i-2)\right)/2}$ ; Else if  $j \mod (2m_i-2) \neq (2m_i-3)$  and  $j \mod 2=1$ , then  $MS_i(j) = \widecheck{CL}_i^{\left((j \mod (2m_i-2))+1\right)/2}$ ; Else if  $j = 2m_i-3$ , then  $MS_i(j) = T_i-D_i+\widecheck{CL}_i^{m_i-1}+CT^0$ .
- Else  $MS_i(j) = T_i \sum_{j=0}^{2m_i-3} \widehat{ML}_i^j \sum_{j=1}^{m_i-2} \widecheck{CL}_i^j \sum_{j=0}^{m_i-2} \widecheck{GR}_i^j$ ;

*Proof.* From the perspective of executing memory-copies over the bus, the  $2m_i - 2$  memory-copy segments are the execution segments by the definition of self-suspension task in Section 2.2. So the definition of  $MW_i^h(t)$  and l directly follows those in Lemma 2.1 by applying  $\widehat{ML}$  to  $\widehat{L}$  and changing from  $m_i$  to  $2m_i - 2$ .

The key difference is in the definition of  $MS_i(j)$ , which is the minimum "interval-arrival time" between execution segments  $ML_i^j$  and  $ML_i^{j+1}$ . By the RT-GPU task model, when  $j \mod (2m_i - 2) \neq (2m_i - 3)$ , there is either a GPU or CPU segment after  $ML_i^j$ , depending on whether the index is even or odd. So the lower bound on the response time of the corresponding GPU or CPU segment is the minimum interval-arrival time on the bus. For the latter case, the response time of a CPU segment is lower bounded by its minimum execution time. When  $j = 2m_i - 3$ ,  $ML_i^j$  is the last memory-copy segment of the first job of  $\tau_i$  occurring in the time interval t. In the worst case, all the segments of this job are delayed toward its deadline, so the minimum interval-arrival time between  $ML_i^j$  and  $ML_i^{j+1}$  is the sum of  $T_i - D_i$ , the minimum execution time of the last CPU segment  $\widetilde{CL}_i^{m_i-1}$ , and the minimum execution time of the first CPU segment  $CL_i^0$  of the next job. The last case calculates the minimum interval-arrival time between the last memory-copy segment of a job that is not the first job

and the first memory-copy segment of the next job. Since these two jobs have an inter-arrival time  $T_i$  between their first CPU segments, intuitively,  $MS_i(j)$  is  $T_i$  minus all the segments of the previous job plus the last CPU segment  $\widetilde{\mathit{CL}}_i^{m_i-1}$  of the previous job plus the first CPU segment  $\mathit{CL}_i^0$ of the next job, which is the above formula.

Hence, the response time of memory-copy segment  $ML_k^j$ can be bounded by calculating the interference caused by the workload of tasks hp(k) with higher-priorities than task  $\tau_k$  and the blocking term from a low-priority task in lp(k).

**Lemma 5.3.** The worst-case response time  $\widehat{MR}_k^j$  is the smallest value that satisfies the following recurrence:

$$\widehat{MR}_{k}^{j} = \widehat{ML}_{k}^{j} + \sum_{\substack{\tau_{i} \in hp(k) \\ \tau_{i} \in lp(k) \\ h \in [0, 2m_{i} - 3]}} MW_{i}^{h}(\widehat{MR}_{k}^{j})$$

$$+ \max_{\substack{\tau_{i} \in lp(k) \\ h \in [0, 2m_{i} - 3]}} \widehat{ML}_{i}^{h}$$

$$(5)$$

*Proof.* Because the execution of memory-copy segments is non-preemptive, the calculation of  $\widehat{MR}_k^J$  extends Lemma 2.2 by incorporating the blocking due to a low-priority memory-copy segment that is already under execution on the bus. Under non-preemptive fixed-priority scheduling, a segment can only be blocked by at most one lower-priority segment, so this blocking term is upper bounded by the longest lower-priority segment.

5.4 Fixed-Priority Scheduling for CPU Segments. Now, we will switch the view and focus on analyzing the fixedpriority scheduling of the CPU segments. Looking from the perspective of the uniprocessor, CPU segments become the "execution segments"; the time intervals where task  $\tau_i$ spends on waiting for memory-copy and GPU to complete now become the "suspension segments", since the processor can be used by other tasks during these intervals.

For now, let's assume that the upper bounds  $\widehat{MR}_i^J$  and lower bounds  $\widetilde{MR}_i^j$  on response times of memory-copy segments are already given in Section 5.3. As for GPU segments, the upper bounds  $\widehat{GR}_i^j$  and lower bounds  $\widecheck{GR}_i^j$  have been obtained in Section 5.2. Similarly, we define the following CPU workload function  $CW_i^h(t)$ .

**Lemma 5.4.**  $CW_i^h(t)$  bounds the maximum amount of CPU computation that task  $\tau_i$  can perform during an interval with a duration t and a starting CPU segment  $CL_i^h$ , where:

$$CW_i^h(t) = \sum_{j=h}^{l} \widehat{CL}_i^{j \mod m_i} + \min\left(\widehat{CL}_i^{(l+1) \mod m_i}, t - \sum_{j=h}^{l} \left(\widehat{CL}_i^{j \mod m_i} + CS_i(j)\right)\right)$$

where *l* is the maximum integer satisfying the following condition:

$$\sum_{j=h}^{l} \left( \widehat{CL}_{i}^{j \mod m_{i}} + CS_{i}(j) \right) \leqslant t$$

and  $CS_i(j)$  is defined as follow:

• If 
$$j \mod m_i \neq (m_i-1)$$
, then  $CS_i(j) = \widecheck{ML}_i^{2(j \mod m_i)} + \widecheck{GR}_i^{j \mod m_i} + \widecheck{ML}_i^{2(j \mod m_i)+1}$ ;

- Else if  $j=m_i-1$ , then  $CS_i(j)=T_i-D_i$ ; Else  $CS_i(j)=T_i-\sum_{j=0}^{m_i-1}\widehat{CL}_i^j-\sum_{j=0}^{2m_i-3}\widecheck{ML}_i^j$  $\sum_{i=0}^{m_i-2} \widecheck{GR}_i^j$ ;

*Proof.* From the perspective of the uniprocessor, the  $m_i$ CPU segments are the execution segments by the definition of self-suspension task in Section 2.2. So the definition of  $CW_i^h(t)$  and l directly follows those in Lemma 2.1 by applying  $\widehat{CL}$  to  $\widehat{L}$ . For the minimum "interval-arrival time"  $CS_i(j)$ , there are two memory-copy and one GPU segments between segments  $CL_i^j$  and  $CL_i^{j+1}$  by the RT-GPU task model, when  $j \mod m_i \neq (m_i-1)$ . So  $CS_i(j)$  is the sum of the minimum response times of these segments, where the response time of a memory-copy segment is lower bounded by its minimum length. The case of  $j = m_i - 1$  is the same. The last case considers for a job that is not the first job in interval t. The calculation is similar to the one in Lemma 2.1, except that both the  $2m_i - 2$  memory-copy and  $m_i - 1$  GPU segments constitute the suspension time.

Hence, the response time of CPU segment  $CL_k^j$  can be bounded by calculating the interference caused by the CPU workload of tasks hp(k) with higher-priorities than task  $\tau_k$ .

**Lemma 5.5.** The worst-case response time  $\widehat{CR}_k^J$  is the smallest value that satisfies the following recurrence:

$$\widehat{CR}_k^j = \widehat{CL}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, m_i - 1]} CW_i^h(\widehat{CR}_k^j)$$
 (6)

*Proof.* The formula is directly extended from Lemma 2.2.

5.5 RT-GPU Scheduling Algorithm and Analysis. For a particular virtual SM allocation  $2GN_i$  for all tasks  $\tau_i$ , we can calculate the response times of all GPU, memory-copy, and CPU segments using formulas in Section 5.2 to 5.4. Note that a task starts with the CPU segment  $CL_i^0$  and ends with the CPU segment  $CL_i^{m_i-1}$ . Therefore, we can upper bound the end-to-end response times for all tasks using the following theorem, by looking at the perspective from CPU.

**Theorem 5.6.** The worst-case end-to-end response time  $R_k$  of task  $\tau_k$  is upper bounded by the minimum of  $R1_k$  and  $R2_k$ , i.e.,  $\hat{R}_k = \min(\hat{R}\hat{1}_k, \hat{R}\hat{2}_k)$ , where:

$$\widehat{R1}_k = \sum_{j=0}^{m_k - 2} \widehat{GR}_k^j + \sum_{j=0}^{2m_k - 3} \widehat{MR}_k^j + \sum_{j=0}^{m_k - 1} \widehat{CR}_k^j \tag{7}$$

and  $R2_k$  is the smallest value that satisfies the recurrence:

$$\widehat{R2}_{k} = \sum_{j=0}^{m_{k}-2} \widehat{GR}_{k}^{j} + \sum_{j=0}^{2m_{k}-3} \widehat{MR}_{k}^{j} + \sum_{j=0}^{m_{k}-1} \widehat{CL}_{k}^{j} + \sum_{\tau_{i} \in h_{p}(k)} \max_{h \in [0, m_{i}-1]} CW_{i}^{h}(\widehat{R2}_{k})$$
(8)

*Proof.* The calculations for  $\widehat{R1}_k$  and  $\widehat{R2}_k$  are extended from Lemma 2.3 by noticing that the time spent on waiting for GPU and memory-copy segments to complete are suspension segments from the perspective of CPU execution.

With the upper bound on the response time of a task, the following corollary follows immediately.

**Corollary 5.6.1.** A CPU-GPU task  $\tau_k$  is schedulable under federated scheduling on virtual SMs and fixed-priority scheduling on CPU and bus, if its worst-case end-to-end response time  $\hat{R}_k$  is no more than its deadline  $D_k$ .

Computational complexity. Note that the calculations for the worst-case response times of individual CPU and memory-copy segments, as well as one upper bound on the end-to-end response time, involves fixed-point calculation. Thus, the above schedulability analysis has pseudopolynomial time complexity. Note that the above schedulability analysis assumes a given virtual SM allocation under federated scheduling. Hence, we also need to decide the best virtual SM allocation for task sets, in order to get better schedulability. The following RT-GPU Scheduling Algorithm adopts a brute force approach to deciding virtual SM allocation. Specifically, it enumerates all possible allocations for a given task set on a CPU-GPU platform and uses the schedulability analysis to check whether the task set is schedulable or not. Alternatively, one could apply a greedy approach by assigning the minimum numbers of virtual SMs to tasks and increasing the numbers for tasks that miss their deadline according to the schedulability analysis, if one needs to reduce the running time of the algorithm while a slight loss in schedulability is affordable.

The full procedure of scheduling GPU tasks can be described as follows: (1) Grid search a federated scheduling for the GPU codes and calculate the GPU segment response time  $[\widetilde{GR}_i^j \ \widehat{GR}_i^j]$ , details in Section 5.4. (2) The CPU segments and memory copy segments are scheduled by fixed priority scheduling. (3) If all the tasks can meet the deadline, then they are schedulable and otherwise go back to step (1) to grid search for the next federated scheduling. This schedulability test for hard deadline parallel GPU tasks can be summarized in Algorithm 2.

# 6 FULL-SYSTEM EVALUATION

**6.1 Experiment Setup.** In this section, we describe extensive experiments using synthesized tasksets to evaluate the performance of the proposed RTGPU real-time scheduling

**Algorithm 2:** Fixed Priority Self-Suspension with Grid Searched Federated Scheduling

Input: Task set  $\tau$ , number of virtual SMs 2GNOutput: Scheduability, SM allocation  $2GN_i$ //Grid search for federated scheduling of GPU segments:

1 for  $GN_1 = 1, ..., GN$  do

2 for  $GN_i = 1, ..., GN - \sum_{j=1}^{i-1} GN_j$  do

3 for  $GN_n = 1, ..., GN - \sum_{j=1}^{i-1} GN_j$  do

4 //Calculate response times of GPU segments:

4  $\widetilde{GR}_i^j = \frac{\widetilde{GW}_i^j}{2GN_i}, 1 \le i \le n;$ 5  $\widehat{GR}_i^j = \frac{\widetilde{GW}_i^j \alpha_i^j - \widehat{GL}_i^j}{2GN_i} + \widehat{GL}_i^j, 1 \le i \le n;$ 6 Calculate worst-case response time  $\widehat{MR}_k^j$  for all memory copy segments using Eq.(5);

7 Calculate worst-case response time  $\widehat{CR}_k^j$  for all CPU segments using Eq.(6);

8 Calculate worst-case end-to-end response time  $\widehat{R}_k$  for all tasks using Theorem 5.6;

9 if  $\widehat{R}_k \le D_k$  for all  $\tau_k$  then

Scheduability = 1; break out of all for loops;

approach, via both schedulability tests and a real system. We choose self-suspension [47] and STGM [38]: Spatio-Temporal GPU Management for Real-Time Tasks as baselines to compare with, as they represent the state-of-the-art in fine-grained (SM-granularity) GPU real-time scheduling algorithms and schedulability tests. Three approaches are used in our experiments. 1. Proposed RTGPU: the proposed real-time GPU scheduling of hard deadline parallel tasks with fine-grain utilization of persistent threads, interleaved execution, virtual SM, and fixed-priority federated scheduling. 2. Self-Suspension: real-time GPU scheduling of hard deadline parallel tasks with the persistent threads with self-suspension scheduling, as in [47]. 3. STGM: real-time GPU scheduling of hard deadline parallel tasks with the persistent threads and busy-waiting scheduling, as in [38].

To compare the schedulability results for these approaches, we measured the acceptance ratio in each of four simulations with respect to a given goal for taskset utilization. We generated 100 tasksets for each utilization level, with the following task configurations. The acceptance ratio of a level was the number of schedulable tasksets, divided by the number of tasksets for this level, i.e., 100. According to the GPU workload profiling and characterization [48], the memory length upper bound was set to 1/4 of the GPU length upper bound. We first generated a set of utilization rates,  $U_i$ , with a uniform distribution for the tasks in the taskset, and then normalized the tasks to the taskset utilization values for the given goal. Next. we generated the CPU, memory, and GPU segment lengths, uniformly distributed within their ranges in Table 1. The deadline  $D_i$  of task i was set according to the generated segment lengths and its utilization rate:  $D_i = (\sum_{j=0}^{m_i-1} \widehat{CL}_i^j + \sum_{j=0}^{2m_i-3} \widehat{ML}_i^j + \sum_{j=0}^{m_i-2} \widehat{GL}_i^j)/U_i$ . In the configuration setting, the CPU, memory, and GPU lengths were normalized with one CPU, one memory interface, and one GPU SM. When the total utilization rate, U, is 1, the one CPU, one memory interface, and one GPU SM are fully utilized. As there are multiple SMs available (and used), the total utilization rate will be larger than 1. The period  $T_i$  is equal to the deadline  $D_i$ . The task priorities are determined with deadlinemonotonic priority assignment.

Meanwhile, in each experiment we evaluate two models. The first model has two memory copies: one memory copy from CPU to GPU and one memory copy back from GPU to CPU between a CPU segment and a GPU segment, which is exactly the execution model we introduced in section 4. The second model has one memory copy between a CPU segment and a GPU segment, which combines the memory copy from CPU to GPU and the memory copy from GPU to CPU. These two models can capture not only the CPU-GPU systems but also general heterogeneous computing architectures.

**6.2 Schedulability Analysis.** Our first evaluation focused on the schedulability of tasksets as the overall utilization increased, with respect to different parameters pertinent to schedulability. The following sub-subsections present the results of four simulations that each varied the different parameters we examined: the ratios of CPU, memory, and GPU segment lengths; the number of subtasks; the number of tasks; and the number of total SMs.



Figure 9: Schedulability under different numbers of subtasks

Table 1: Parameters for the taskset generation

| Parameters                              | Value          |
|-----------------------------------------|----------------|
| Number of tasks $N$ in taskset          | 5              |
| Task type                               | periodic tasks |
| Number of subtasks $M$ in each task     | 5              |
| Number of tasksets in each experiment   | 100            |
| CPU segment length (ms)                 | [1 to 20]      |
| Memory segment length (ms)              | [1 to 5]       |
| GPU segment length <sup>2</sup> (ms)    | [1 to 20]      |
| Task period and deadline                | $(T_i/D_i)$    |
| GPU kernel launch overhead $(\epsilon)$ | 12%            |
| Number of physical GPU SMs $N_{SM}/2$   | 10             |
| Priority assignment                     | D monotonic    |

**6.2.1 CPU, Memory, and GPU Lengths.** We investigated the impact of CPU, memory, and GPU segment lengths on the acceptance ratio. To study this quantitatively, We tested the acceptance ratio under different length range ratios. The CPU length is shown as Table 1 and we changed the memory, and GPU lengths according to the length ratio. Fig. 8 shows taskset acceptance ratio when the CPU, memory, and GPU length range ratios were set to 2:1, 1:2, and 1:8, which give an exponential scale.

Not surprisingly, the STGM approach is effective only when the memory and GPU segment (suspension segment) lengths are short enough: the STGM approach was developed based on "busy waiting". When tasks are being processed in memory copy and GPU segments, the CPU core is not released and remains busy waiting for the memory copy and GPU segments to finish. Although this is the most straightforward approach, its pessimistic aspect lies in the CPU waiting for the memory copy and GPU segments to finish. Thus, it will be ineffective and hugely pessimistic when the memory copy and GPU segments are large.

Self-suspension scheduling in [23] increases the schedulability performance compared with the straight forward STGM approach. Self-suspension models the memory and GPU segments as being suspended, and the CPU is released

during this suspension. The theoretical drawback of this approach is that the suspension does not distinguish between the memory segments and GPU segments. Instead, they are modelled as non-preemptive and will block higher priority tasks. However, in real systems, each task is allocated its own exclusive GPU SMs, and the GPU segments in one task will not interfere the GPU segments in other tasks.

The RTGPU schedulability analysis proposed in this paper is effective even when the memory and GPU segment (suspension segment) lengths are long. In this approach, we distinguish the CPU, memory, and GPU segments based on their individual properties. For example, if the CPU cores are preemptive, then no blocking will happen. Blocking happens only in non-preemptive memory segments. Meanwhile, because federated scheduling is applied for the GPU segments and each task is allocated its own exclusive GPU SMs, the GPU segments can be executed immediately when they are ready, without waiting for higher priority GPU segments to finish or being blocked by lower GPU segments.

Also, by comparing the models with one memory copy and two memory copies, we notice that the memory copy is the bottleneck in the CPU-GPU systems because of limited resource (bandwidth) and non preemption. Reducing the numbers of memory copies or combining memory copies can increase the system schedulability, especially when the memory copy length is large shown in Fig. 8 (b) and (c).

**6.2.2** Number of Subtasks. We then evaluated the impact of the number of subtasks in each task on the acceptance ratio. From the possible values in Table 1, the number of subtasks, M, in each task was set to 3, 5, or 7. The corresponding acceptance ratios are shown in Fig.9. The results show that with more subtasks in a task, schedulability decreases under all approaches but the proposed RTGPU approach still outperforms all other approaches. Compared with STGM, the proposed RTGPU approach and the self-suspension approach are the most robust as the number of subtasks increases.

**6.2.3 Number of Tasks.** In a third simulation, we evaluated the impact of the number of tasks in each taskset on the





Figure 11: Schedulability under different numbers of SMs

acceptance ratio. Again, from the possible values in Table 1, the number of tasks, N, in each task was set to 3, 5, or 7. The corresponding acceptance ratios are shown in Fig.10. As with subtasks, schedulability decreases under all the approaches as the number of tasks increases, but the proposed RTGPU approach outperformed the other two.

**6.2.4** Number of SMs. Finally, we examined the impact of the number of total SMs on the acceptance ratio. Based on the possible values in Table 1, the number of subtasks M and tasks N in each setting are again set to 5. The corresponding acceptance ratios are shown in Fig.10. All three approaches have better schedulability as the number of available SMs increases. From this set of experiments we can see that adding two more SMs will cause the utilization rate to increase for all three approaches. Meanwhile, among the three approaches, the proposed RTGPU approach again achieved the best schedulability across different numbers of SMs. As shown in Fig.10 (a), when the computation resources (GPU SMs) are limited, the bottleneck from memory copy is more obvious and serious. The two memories model has a poor scheduability in all approaches and the one memory model has a significant improved performance.

**6.3 GPU Experiment.** We also empirically evaluated the proposed RTGPU scheduling framework on a real system with an NVIDIA 1080TI GPU, which has 28 SMs modeled as 56 virtual SMs. (There are 28 physical streaming multiprocessors (SMs) in an NVIDIA GTX 1080Ti: 27 SMs can be used for executing parallel tasks, and 1 SM is reserved for handling default system applications.) The CPU was an Intel(R) Core(TM) i7-3930K CPU operating at 3.20GHz with 12 cores and 12,288 KB of on-chip cache. We implemented the synthetic benchmarks described in Section 4 in a common realtime scheduling context, since multiple GPU kernel concurrency is supported only within the same CUDA context. To run multiple kernels from different tasks simultaneously, we created a single parent process and launched each kernel using a separate CPU thread of that parent process. For parallel kernel execution, CUDA streams were used to allow



Figure 12: Schedulability under different numbers of SMs with schedulability analysis and Real GPU experiments (with worst case execution time model)

asynchronous copy and kernel execution. By default, the NVIDIA GPU adopts "adaptive power setting", in which the firmware adaptively throttles the clock speeds of SM cores and memory when they experience a low utilization rate. To avoid interference from adaptive power setting and guarantee hard deadlines, we manually fixed the SM core and memory frequencies respectively using the nvidia-smi command. We also set the GPUs to persistence mode to keep the NVIDIA driver loaded even when no applications are accessing the cards. This is particularly useful for a series of short jobs.

As in the previous schedulability analysis experiments, each task in a taskset was randomly assigned one of the values in Table 1. The deadline was set to the same value as the period. Theoretically, the memory copy and GPU kernels are modeled by their worst execution times. The execution time distributions of different sizes of memory copies through PCIe from CPU to GPU and from GPU to CPU and different GPU kernel thread lengths are measured by executed 10,000 times. Using the real GPU system, we examined schedulability using different numbers of SMs and compared the results from the schedulability analysis and from the real GPU experiments (with the worst and average execution time model). Fig. 12 presents the acceptance ratio results of the RTGPU schedulability analysis and experiments on the



Figure 13: Schedulability under different numbers of SMs with schedulability analysis and Real GPU experiments (with average execution time model)



(a) Improvement over whole GPU (b) Improvement over used resystem sources

Figure 14: RTGPU Throughput improvements

real GPU system. Both of them have better schedulability as the number of available SMs increases. The gaps between the schedulability analysis and real GPU system arise from the pessimistic aspect of the schedulability analysis and the model mismatches between worst execution time and acutual execution time. In the limited computation resource scenarios (5 SMs and 8 SMs), the bottlenecks from memory copy exist in both schedulability test and experiments with real GPU systems. Reducing the numbers of memory copies or combining memory copies are proper methods to deal with the bottlenecks. After this, the memory copy and GPU kernels are modeled by their average execution times. The results from the RTGPU schedulability analysis and real GPU system are presented in Fig.13. Because the segments are modeled by their average execution times, which is much tighter than the worst execution time, the gaps between the schedulability analysis and experiments on the real GPU system are further reduced.

Finally, we quantified the GPU throughput gained by the virtual SM model on the synthetic and real benchmark tasksets:

$$\eta_1 = \sum_{i=1}^{N=5} \frac{Numbers\ of\ SM_{\ task(i)}}{GPU\ Total\ Numbers\ of\ SMs} \times (\frac{2}{\alpha(i)} - 1)$$
 (9)

$$\eta_2 = \sum_{i=1}^{N=5} \frac{Numbers\ of\ SM\ _{task(i)}}{Total\ Numbers\ of\ SMs\ used\ in\ taskset} \times (\frac{2}{\alpha(i)} - 1)$$
(10)

where Numbers of  $SM_{task(i)}$  is the number of SMs used by task(i) and  $\alpha(i)$  is the interleaved ratio of task(i). Fig. 14(a) shows the throughput improvement over the whole GPU system according to E.q. (9). At low utilization, the actual used SMs are few so that it has small throughput over the whole GPU system. With the increase of utilization rate, more SMs are in use and bring more throughput over the whole system. To better quantify the throughput

improvement, we compare it with the actual used SMs as described in E.q. (10), in Fig. 14(b). We can see 20% and 11% throughput improvement in synthetic benchmarks and real benchmarks. This can be achieved with any GPU systems and with different numbers of SMs. The reason why the synthetic benchmark has more throughput improvement than the real benchmark is that the special function kernel in the synthetic benchmark has a low interleaved ratio, as it uses the special function units (SFUs) while other kernels rarely use these units.

# 7 CONCLUSION

To execute multiple parallel real-time applications on GPU systems, we propose *RTGPU*—a real-time scheduling method including both system work and and a real-time scheduling algorithm with schedulability analysis. *RTGPU* leverages a precise timing model of the GPU applications with the persistent threads technique and achieves improved fine-grained utilization through interleaved execution. The *RTGPU* real-time scheduling algorithm is able to provide real-time guarantees of meeting deadlines for GPU tasks with better schedulability compared with previous work. We empirically evaluate our approach using synthetic benchmarks on both schedulability analysis and real *NVIDIA GTX1080Ti GPU* systems and demonstrate significant performance gains compared to existing methods.

### **ACKNOWLEDGMENTS**

The research described in this article was supported in part by NSF grant CNS-1739643 and CNS-1948457. We are also grateful to the reviewers for their constructive feedback.

### REFERENCES

- [1] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- [2] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md E Haque, Lingjia Tang, and Jason Mars. The architectural implications of autonomous driving: Constraints and acceleration. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems.
- [3] Nvidia accelerates race to autonomous driving at ces. https://blogs.nvidia.com/blog/2016/01/04/drive-px-ces-recap/note = Accessed: 2019-11-23.
- [4] Omid Hosseini Jafari, Dennis Mitzel, and Bastian Leibe. Real-time rgb-d based people detection and tracking for mobile robots and head-worn cameras. In 2014 IEEE international conference on robotics and automation (ICRA), pages 5636–5643. IEEE, 2014.
- [5] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv, 2018.
- [6] Christopher J Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. Ptask: operating system abstractions to manage gpus as compute devices. In *Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles*, 2011.
- [7] Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott Brandt. Gdev: First-class {GPU} resource management in the operating system. In Presented as part of the 2012 {USENIX} Annual Technical Conference ({USENIX}{ATC} 12), pages 401–412, 2012.
- [8] Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T Kandemir, Gabriel H Loh, Onur Mutlu, and Chita R Das. Managing gpu concurrency in heterogeneous architectures. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on. IEEE, 2014.
- [9] Chao-Tung Yang, Chih-Lin Huang, and Cheng-Fang Lin. Hybrid cuda, openmp, and mpi parallel programming on multicore gpu clusters. *Computer Physics Communications*, 182(1):266–269, 2011.

- [10] Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H Anderson, and F Donelson Smith. Avoiding pitfalls when using nvidia gpus for real-time tasks in autonomous systems. In 30th Euromicro Conference on Real-Time Systems (ECRTS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
- [11] Husheng Zhou, Soroush Bateni, and Cong Liu. S<sup>\*</sup> 3dnn: Supervised streaming and scheduling for gpu-accelerated real-time dnn workloads. In 2018 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 190–201. IEEE, 2018.
- [12] Hyeonsu Lee, Jaehun Roh, and Euiseong Seo. A gpu kernel transactionization scheme for preemptive priority scheduling. In 2018 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 202–213. IEEE, 2018.
- [13] Shinpei Kato, Karthik Lakshmanan, Raj Rajkumar, and Yutaka Ishikawa. Timegraph: Gpu scheduling for real-time multi-tasking environments. In *Proc. USENIX ATC*, pages 17–30, 2011.
- [14] Glenn A Elliott and James H Anderson. Globally scheduled real-time multiprocessor systems with gpus. *Real-Time Systems*, 48(1):34–74, 2012.
- [15] Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. Chimera: Collaborative preemption for multitasking on a shared gpu. ACM SIGARCH Computer Architecture News, 43(1):593–606, 2015.
- [16] Can Basaran and Kyoung-Don Kang. Supporting preemptive task executions and memory copies in gpgpus. In 24th Euromicro Conference on Real-Time Systems (ECRTS 2012). IEEE, 2012.
- [17] Ivan Tanasic, Isaac Gelado, Javier Cabezas, Alex Ramirez, Nacho Navarro, and Mateo Valero. Enabling preemptive multiprogramming on gpus. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 193–204. IEEE, 2014.
- [18] Husheng Zhou, Guangmo Tong, and Cong Liu. Gpes: A preemptive execution system for gpgpu computing. In *Real-Time and Embedded Technology and Applications Symposium*, 2015 IEEE.
- [19] Chao Yu, Yuebin Bai, Hailong Yang, Kun Cheng, Yuhao Gu, Zhongzhi Luan, and Depei Qian. Smguard: A flexible and finegrained resource management framework for gpus. *IEEE Transac*tions on Parallel and Distributed Systems, 2018.
- [20] Kshitij Gupta, Jeff A Stuart, and John D Owens. A study of persistent threads style gpu programming for gpgpu workloads. In Innovative Parallel Computing-Foundations & Applications of GPU, Manycore, and Heterogeneous Systems (INPAR 2012). IEEE, 2012.
- [21] Bo Wu, Guoyang Chen, Dong Li, Xipeng Shen, and Jeffrey Vetter. Enabling and exploiting flexible task assignment on gpu through sm-centric program transformations. In *Proceedings of the 29th ACM on International Conference on Supercomputing*. ACM, 2015.
- [22] J. Li, Jian-Jia Chen, K. Agrawal, C.Lu, C.D. Gill, and Abusayeed Saifullah. Analysis of federated and global scheduling for parallel real-time tasks. In *Real-Time Systems (ECRTS)*, 26th Euromicro Conference on, pages 85–96, 2014.
- [23] Wen-Hung Huang and Jian-Jia Chen. Schedulability and priority assignment for multi-segment self-suspending real-time tasks under fixed-priority scheduling. In *Technical report*. Technical University of Dortmund, 2015.
- [24] Olivier Valery, Pangfeng Liu, and Jan-Jan Wu. A collaborative cpu–gpu approach for principal component analysis on mobile heterogeneous platforms. *Journal of Parallel and Distributed Com*puting, 120:44–61, 2018.
- [25] Bin Wang, Ruhui Ma, Zhengwei Qi, Jianguo Yao, and Haibing Guan. A user mode cpu–gpu scheduling framework for hybrid workloads. Future Generation Computer Systems, 63:25–36, 2016.
- [26] Guoyang Chen, Yue Zhao, Xipeng Shen, and Huiyang Zhou. Effisha: A software framework for enabling effficient preemptive scheduling of gpu. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017.
- [27] Cen Chen, Kenli Li, Aijia Ouyang, Zeng Zeng, and Keqin Li. Gflink: An in-memory computing architecture on heterogeneous cpu-gpu clusters for big data. *IEEE Transactions on Parallel and Distributed Systems*, 29(6):1275–1288, 2018.
- [28] Muhammad Husni Santriaji and Henry Hoffmann. Merlot: Architectural support for energy-efficient real-time processing in gpus. In 2018 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 214–226. IEEE, 2018.
- [29] Seyedmehdi Hosseinimotlagh and Hyoseung Kim. Thermalaware servers for real-time tasks on multi-core gpu-integrated embedded systems. In 2019 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 254–266. IEEE, 2019.
- [30] Glenn A Elliott, Bryan C Ward, and James H Anderson. Gpusync:

- A framework for real-time gpu management. In 2013 IEEE 34th Real-Time Systems Symposium, pages 33–44. IEEE, 2013.
- [31] Vladislav Golyanik, Mitra Nasri, and Didier Stricker. Towards scheduling hard real-time image processing tasks on a single gpu. In 2017 IEEE International Conference on Image Processing (ICIP).
- [32] Christoph Gerum, Oliver Bringmann, and Wolfgang Rosenstiel. Source level performance simulation of gpu cores. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pages 217–222. EDA Consortium, 2015.
- [33] Kostiantyn Berezovskyi, Konstantinos Bletsas, and Björn Andersson. Makespan computation for gpu threads running on a single streaming multiprocessor. In *Real-Time Systems (ECRTS)*, 2012 24th Euromicro Conference on, pages 277–286. IEEE, 2012.
- [34] Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt. Analyzing cuda workloads using a detailed gpu simulator. In *Performance Analysis of Systems and Software*, 2009. ISPASS 2009. IEEE International Symposium on.
- [35] Zhenning Wang, Jun Yang, Rami Melhem, Bruce Childers, Youtao Zhang, and Minyi Guo. Simultaneous multikernel gpu: Multitasking throughput processors via fine-grained sharing. In *High Performance Computer Architecture (HPCA)*, 2016 IEEE International Symposium on, pages 358–369. IEEE, 2016.
- [36] Yunlong Xu, Rui Wang, Tao Li, Mingcong Song, Lan Gao, Zhongzhi Luan, and Depei Qian. Scheduling tasks with mixed timing constraints in gpu-powered real-time systems. In Proceedings of the 2016 International Conference on Supercomputing, 2016.
- [37] Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, and Soojung Ryu. Improving gpgpu resource utilization through alternative thread block scheduling. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 260–271. IEEE, 2014.
- [38] Sujan Kumar Saha, Yecheng Xiang, and Hyoseung Kim. Stgm: Spatio-temporal gpu management for real-time tasks. In 2019 IEEE 25th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 1–6. IEEE, 2019.
- [39] Jinghao Sun, Jing Li, Zhishan Guo, An Zou, Xuan Zhang, Kunal Agrawal, and Sanjoy Baruah. Real-time scheduling upon a hostcentric acceleration architecture with data offloading. In 2020 IEEE Real-Time and Embedded Technology and Applications Symposium.
- [40] Tanya Amert, Nathan Otterness, Ming Yang, James H Anderson, and F Donelson Smith. Gpu scheduling on the nvidia tx2: Hidden details revealed. In 2017 IEEE Real-Time Systems Symposium (RTSS).
- [41] Nathan Otterness, Ming Yang, Sarah Rust, Eunbyung Park, James H Anderson, F Donelson Smith, Alex Berg, and Shige Wang. An evaluation of the nvidia tx1 for supporting real-time computervision workloads. In 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 353–364. IEEE, 2017.
- [42] Steven Chien, Ivy Peng, and Stefano Markidis. Performance evaluation of advanced features in cuda unified memory. In 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), pages 50–57. IEEE, 2019.
- [43] Saksham Jain, Iljoo Baek, Shige Wang, and Ragunathan Rajkumar. Fractional gpus: Software-based compute and memory bandwidth reservation for gpus. In 2019 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 29–41. IEEE, 2019.
- [44] NVIDIA. Nvidia tesla p100: The most advanced datacenter accelerator ever built featuring pascal gp100, the world's fastest gpu. *Whitepaper*, 2016.
- [45] How utilize to compute preemption in the new pascal architecture (tesla p100 gtx1080)? https://devtalk.nvidia.com/default/topic/973140/how-toutilize-compute-preemption-in-the-new-pascal-architecture-teslap100-and-gtx1080-/.
- [46] Gwangsun Kim, Jiyun Jeong, John Kim, and Mark Stephenson. Automatically exploiting implicit pipeline parallelism from multiple dependent kernels for gpus. In Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on, pages 339–350. IEEE, 2016.
- [47] Konstantinos Bletsas, Neil Audsley, Wen-Hung Huang, Jian-Jia Chen, and Geoffrey Nelissen. Errata for three papers (2004-05) on fixed-priority scheduling with self-suspensions. Technical report, CISTER-Research Centre in Realtime and Embedded Computing Systems, 2015.
- [48] Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, and Tao Li. Gaas workload characterization under numa architecture for virtualized gpu. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE.