# Optimizing Large Model Training through Overlapped Activation Recomputation

Ping Chen, Wenjie Zhang, Shuibing He, Weijian Chen, Siling Yang, Kexin Huang, Yanlong Yin, Xuan Zhan, Yingjie Gu<sup>†</sup>, Zhuwei Peng<sup>†</sup>, Yi Zheng<sup>†</sup>, Zhefeng Wang<sup>†</sup>, Gang Chen

Zhejiang University, Huawei Cloud<sup>†</sup>

#### **Abstract**

Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to  $1.37 \times$ .

Keywords: Large Model Training, Memory Optimization


## 1 Introduction

**Motivation.** Deep neural networks (DNNs) have excelled in domains like natural language processing [59], computer vision [3], and text-to-video generation [64]. Scaling laws [29] reveal that larger models achieve better performance, driving a rapid increase in model sizes. For instance, from GPT-2 (1.5B parameters, 2019 [53]) to PaLM (540B parameters, 2022 [9]), model sizes have grown over 360×. This trend is expected to continue [29], with the growth far outpacing the memory capacity of individual GPUs, typically limited to tens of GBs.

Training such massive models requires parallelization across multiple GPUs [28, 66] using techniques like pipeline parallelism [21, 41] and tensor parallelism [56]. However, even these methods struggle with GPU memory limits. For example, attempting to train GPT 7B with a batch size of 32 on eight NVIDIA A100 GPUs (each with 40GB of memory) results in out-of-memory failures despite employing both tensor and pipeline parallelism. This highlights the urgent need for more efficient memory management techniques.

To address GPU memory limitations, recomputation techniques have emerged as a promising solution. By discarding activations generated during forward propagation and regenerating them during backpropagation, these methods reduce memory consumption significantly [6]. Recomputation is now widely adopted in frameworks like Megatron-LM [45], MindSpore [23], and Colossal-AI [10], each using specific policies to decide which tensors to retain and which to recompute.

**Limitations of existing recomputation approaches.** Existing recomputation methods fall into two categories and suffer from three problems. (1) Rule-based recomputation methods [45] lack adaptivity. These methods rely on predefined patterns that ignore available GPU memory and model-specific requirements. This lack of adaptivity often leads to excessive recomputation, inefficient memory utilization, and the need for manual tuning (§2.2). (2) Model-adaptive methods like Checkmate [25] have poor scalability. These approaches use optimization algorithms to tailor recomputation policies to specific models. However, their scalability is limited, as large models create vast search spaces that are computationally expensive to navigate. (3) Both rule-based and model-adaptive methods incur significant latency because all recomputation is performed on the critical training path [32].

**Observations.** We make three observations in this paper. First, tensor parallelism introduces substantial communication overhead, with all-reduce operations (in Figure 1) between GPUs taking 10%–70% of training time (§2.3). Second, in pipeline parallelism, memory usage across different pipeline stages is imbalanced, with earlier stages consuming up to 1.5× more memory than later stages (§2.3). Finally, recomputation operations can be scheduled flexibly before backward propagation, offering opportunities to optimize their timing and overlap them with other operations (Figure 4). However, existing methods fail to fully exploit this opportunity.

**Our work.** These three findings motivate us to propose a new recomputation framework. Our design goals are (1) overlapping recomputation with communication to minimize recomputation overhead, (2) optimizing GPU memory utilization by selectively storing tensors in memory to prevent unnecessary recomputation, and (3) achieving load balancing across pipeline stages. To achieve these goals, we introduce two algorithms that determine the recomputation scheduling policy: which tensors should be recomputed, when they should be recomputed, and how to overlap their recomputation with communication.

**Figure 1.** The training workflow of tensor parallelism. Panel (b) shows the backward propagation of a Transformer layer. The shaded rectangle indicates the splitting of the tensor onto another GPU for parallel training. g denotes the all-reduce operation in the forward and backward passes.

The first algorithm, Lynx-OPT, finds a global optimum by searching the whole solution space. It is modeled as a mixed-integer linear program. While Lynx-OPT provides an upper bound on training performance, it cannot be used to schedule large models because its search time grows exponentially with model size (§4.1).

To address this challenge, we design a heuristic-based recomputation scheduling algorithm (Lynx-HEU) based on the observation that large DNN models contain identical structures, so the locally optimal scheduling policy obtained for one layer can be reused for other layers with the same structure (§4.2). Lynx-HEU can be modeled as an integer linear program. Our results show that Lynx-HEU has a search time of seconds and achieves near-optimal performance (§7.4). To achieve load balancing among pipeline stages, we design a greedy algorithm for model partitioning (§5). None of the existing partitioning algorithms work in our scenario because they do not consider overlapping recomputation with communication in training pipelines. Our partitioning algorithm iteratively searches for better results and terminates upon achieving load balance.

**Contributions.** In summary, we make the following contributions: (1) To the best of our knowledge, Lynx is the first recomputation framework that fully explores the potential of overlapping recomputation with communication and utilizing idle GPU memory to eliminate unnecessary tensor recomputation. (2) We introduce Lynx-OPT and Lynx-HEU for searching recomputation scheduling policies and devise a recomputation-aware model partitioning algorithm to ensure load balancing across pipeline stages, thereby maximizing training throughput. (3) We conduct a comprehensive evaluation, demonstrating that Lynx achieves up to a $1.37 \times$ improvement over existing recomputation methods.

**Figure 2.** The training workflow of pipeline parallelism (one-forward-one-backward). Each minibatch consists of 5 micro batches. The example illustrates that ideal computation-balanced model partitioning achieves the best training performance.

## 2 Background and Motivation

#### 2.1 Large Model Training

Deep learning models are built with layers and iteratively trained using batches of samples. Each training step consists of forward propagation (FP) and backward propagation (BP), which refine the model's parameters to enhance accuracy. Activations are intermediate outputs generated during FP and are utilized by BP for gradient calculation. During FP, input activations, together with the current layer's weights and biases, generate output activations, which serve as the input data for the subsequent layer. BP starts from the output layer and traverses layers in reverse to optimize the weights and biases. To improve throughput and device utilization, training typically processes samples in large batches [1, 7, 43].

To accelerate training, large models are parallelized across multiple GPUs. For example, training GPT-3 (175B parameters) requires 355 GPU-years [34], OPT-175B uses 992 80GB A100 GPUs [66], and ByteDance's 175B model employs 12,288 GPUs [28]. To efficiently utilize training devices, data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) have been proposed and have become the state-of-the-art distributed training methods [28, 33, 56, 66].

**Data parallelism.** DP accelerates training by distributing input samples across multiple workers, each of which holds a replica of the model. By splitting large batches among GPUs, DP enables faster training [37, 38, 51].

**Tensor parallelism.** TP addresses the challenge of accommodating large models by splitting model layers across multiple GPUs [56]. As shown in Figure 1, it parallelizes model parameters, optimizer states inside the attention and MLP blocks, and activations on GPUs. During training, it introduces two all-reduce communication operations in both the forward and backward passes to collect the computing result from each GPU to ensure training correctness.

**Pipeline parallelism.** PP splits a model into sub-modules and maps them to multiple GPUs. Outputs from one sub-module are transferred to the GPUs hosting the subsequent stage. A batch is split into smaller micro batches, which are processed as a stream in a pipeline, to maximize device utilization. Given the substantial memory demands during large model training, the mainstream systems often employ a one-forward-one-backward (1F1B) training mechanism [14, 32, 41, 42]. In this approach, each pipeline stage alternates between FP and BP for micro-batches. For optimal performance, all pipeline stages should have similar execution times as shown in Figure 2. Otherwise, stalls between stages may occur due to imbalanced load distribution [67].

**Impact of GPU memory.** The limited memory capacity of GPUs imposes significant constraints on large model training. Specifically, memory is required to manage both model states and activations (feature maps). Model states comprise parameters, gradients, and optimizer states, such as momentum and variances in Adam [31]. A model with *n* parameters requires 16n bytes of memory, including FP16 parameters (2n bytes), one copy of FP16 gradients (2n bytes), and FP32 optimizer data (4n bytes each for momentum, variances, and parameters). The memory consumption for activations depends on the batch size. Users often employ a large batch size to maximize GPU utilization [1], resulting in significant memory consumption during training. For instance, training a 4.7B GPT model on 8 A100 GPUs (TP=8) with a batch size of 4 per GPU requires 8GB for model states and 7.6GB for activations, leading to a GPU utilization [52] of 74%. Increasing the batch size to 8 raises GPU utilization to 89%, but also increases memory usage by 45% during training.
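The 16n-byte figure for model states follows directly from the per-component sizes listed above. The sketch below only reproduces that arithmetic; the function name and the dictionary layout are illustrative assumptions, not part of Lynx.

```python
# Bytes per parameter for mixed-precision Adam training, as itemized in the text.
BYTES_PER_PARAM = {
    "fp16_parameters": 2,
    "fp16_gradients": 2,
    "fp32_momentum": 4,
    "fp32_variance": 4,
    "fp32_parameters": 4,
}


def model_state_bytes(num_params: int) -> int:
    """Total bytes of model states for a model with num_params parameters."""
    return num_params * sum(BYTES_PER_PARAM.values())


if __name__ == "__main__":
    n = 1_300_000_000  # example size chosen for illustration (1.3B parameters)
    print(f"{model_state_bytes(n) / 1e9:.1f} GB of model states before any sharding")
```

This counts model states only and ignores how TP or PP divides them across GPUs; activation memory comes on top and scales with the batch size.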

#### 2.2 Limitations of Existing Solutions

Activation recomputation (or activation checkpointing) is one of the major approaches used for training large models with limited GPU memory [6, 25, 32, 56]. It discards activation tensors after their final use in the forward pass and then recomputes them as required during the backward pass. However, due to the inherent complexity of large model training [36], existing efforts have the following weaknesses, which are summarized in Table 1.

1. Rule-based methods have poor adaptivity. The *Full Recomputation* method used in the mainstream system Megatron-LM [46] caches the input of each transformer layer, discards other activations, and recomputes them before backward propagation. *Selective Recomputation* [32] further reduces recomputation by recomputing only the attention operators of each model layer. However, both methods overlook the available GPU memory size and the memory requirements of the model, making them less adaptive. *Full Recomputation* easily leads to excessive recomputation time overhead. Our experiments show that *Full Recomputation* over-releases 20 GB of activations to train a 7B GPT model on 8 A100 GPUs, with recomputation time accounting for 10%–30% of total training time. *Selective Recomputation*, on the other hand, may release inadequate memory for training.

**Table 1.** The analysis of different activation recomputation policies.

| System                       | Adaptivity | Scalability | Efficiency |
|------------------------------|------------|-------------|------------|
| Full Recomputation [46]      | ×          | ✓           | ×          |
| Selective Recomputation [32] | ×          | ✓           | ×          |
| Megatron-Uniform [45]        | Manual     | ✓           | ×          |
| Megatron-Block [45]          | Manual     | ✓           | ×          |
| Checkmate [25]               | ✓          | ×           | ×          |
| AdaPipe [57]                 | ✓          | ✓           | ×          |
| Lynx                         | ✓          | ✓           | ✓          |

Megatron-Uniform and Megatron-Block [45] are two fine-grained and flexible recomputation methods in Megatron-LM. They allow users to manually choose which layers or operators to recompute to avoid out-of-memory errors while reducing recomputation time overhead. However, both approaches require extensive manual effort to find a suitable recomputation configuration [47]. Even worse, each manual attempt requires running multiple iterations of training using thousands of GPUs for LLMs, incurring very high costs.

2. Model-adaptive recomputation methods have poor scalability. To overcome the shortcomings of rule-based methods, Checkmate [25] utilizes linear programming to automatically decide which operators need recomputation to minimize recomputation costs. However, the search space in Checkmate increases exponentially with the size of the DNN models, requiring immense computational time. As a result, Checkmate may not provide an optimal solution within time bounds, limiting its scalability for training large models. Our results show that Checkmate would take years to determine the optimal strategy for a 175B model, highlighting the need for faster policy-making to enable large-scale training.

3. All existing recomputation methods introduce significant overhead on the critical training paths. A recent state-of-the-art work, AdaPipe [57], uses dynamic programming methods to reduce the time required for automatically searching for optimal recomputation strategies. However, like all existing recomputation methods, it only starts performing recomputation when the released intermediate data needs to be reused. This introduces recomputation time on the critical training path, resulting in low training efficiency. Therefore, a method to reduce recomputation time is needed to improve overall training throughput.

#### 2.3 New Opportunities

We make three new experimental observations that can be used to enhance recomputation efficiency. Specifically, we implement pipeline training using both TP and PP to train the 7B GPT model with three micro-batch sizes (mbs) and a sequence length of 1024. For PP, we divide the training process into four stages. For TP, we use two, four, and eight GPUs for each stage. Experiments were conducted on NVLink- and PCIe-connected A100 GPUs. Detailed configurations can be found in §7.

**Figure 3.** (a) The ratio of TP communication during training. The *x*-axis represents the number of GPUs in a TP group. (b) Imbalanced stage (GPU) memory consumption (TP=2, PP=4). The memory usage is normalized to that of stage 0.

Observation 1: Existing approaches suffer from high communication overhead and low GPU utilization. Figure 3(a) demonstrates that the TP communication time for the NVLink-connected GPUs accounts for 10%–50% of the total training time. This ratio is over 70% on PCIe-connected GPUs due to lower bandwidth. Increasing the number of GPUs per pipeline stage reduces execution time but worsens communication bottlenecks. Additionally, profiling reveals that the SMs of the GPUs are mostly idle during data communication, indicating low GPU utilization.

Observation 2: GPU memory usage is imbalanced across stages in training using PP. We observe that GPU memory is not fully utilized across GPUs and that memory usage varies across stages. For example, as shown in Figure 2, the GPUs hosting computations in the early stages of the pipeline (e.g., GPUs in stage 0) use more memory than the others. Figure 3(b) shows that the highest usage of GPU memory is up to 1.5× higher than that on the GPUs with the least memory usage. This is because activation states are generated during the forward pass for each micro batch and then kept until used by the corresponding backward pass. Earlier stages therefore need to store more activation copies. For instance, GPUs at stage 0 need to store three copies of activation states, while the GPUs at stage 3 only need to store one.



**Figure 4.** An example of forward, backward, and recomputation processes.  $T_1$  is evicted at time  $t_1$  and can be recomputed anytime between  $t_1$  and  $t_2$ .

Observation 3: Recomputation overhead is not visible until the dependent backward operation begins. When the recomputation approach is used, selected activation tensors T are discarded. The backward operations Ops cannot be executed until the selected activation tensors T are recomputed. Therefore, Ops are dependent on T. We can schedule the recomputation operations at any point as long as T becomes available before Ops begins. Figure 4 illustrates this flexibility with an example where the recomputation of  $T_1$  can be executed anytime between  $t_1$  and  $t_2$ .

**Opportunities.** Current systems perform recomputation on the critical path and execute it on demand [45, 50]. Our observations highlight that we can further optimize activation recomputation by executing recomputation asynchronously in parallel with the TP communication process and selectively discarding tensors considering their recomputation time and the availability of idle memory space across GPUs and pipeline stages.

## 3 Design of Lynx

Lynx is designed to enable efficient memory management for large-model training. We have two design goals: (1) minimizing recomputation overhead by hiding recomputation behind communication and (2) maximizing pipeline throughput by model partitioning that ensures load balance across pipeline stages while accounting for recomputation time.

Lynx has three major components: *Model Profiler, Model Policy Maker*, and *Model Deployer*. Figure 5 shows the overview of the Lynx software architecture. The functionalities of each component are described below.

**Model Profiler.** Before deploying a new model, Lynx conducts a test run using user-defined training configurations. These configurations include the distributed training policy (e.g., pipeline parallelism, tensor parallelism, etc.), the number of GPUs, and hyperparameters. During the test run, Lynx collects critical model metrics including operator type, operator execution time, operator size, operator dependency, etc. These metrics are recorded in a database and serve as input for the *Model Policy Maker* to guide scheduling decisions. Importantly, to avoid impacting model accuracy, Lynx does not alter user-defined hyperparameters such as batch size.

**Model Policy Maker.** It decides how to partition a model and how to schedule tensor recomputation, considering training throughput and load balancing among all pipeline stages. It has two major sub-components: the recomputation-aware model partitioner, which generates different model partitioning schemes, and the recomputation policy generator, which generates a recomputation plan that minimizes recomputation overhead for a given partitioning scheme. The Model Policy Maker initially partitions the model and assigns the partitions to pipeline stages. This partitioning scheme is then passed to the recomputation policy generator to determine the recomputation policy for each stage. After that, the recomputation time for each stage is returned to the model partitioner. Then, the model partitioner feeds the profiled forward and backward propagation times from the Model Profiler, along with the recomputation time from the recomputation policy generator, into the training cost model to compute the training time for each stage. Finally, the Model Policy Maker evaluates whether the pipeline achieves load balancing using the per-stage execution time from the model partitioner. If not, a new partitioning scheme is generated, and the process repeats until load balancing is achieved.

**Figure 5.** Overview of Lynx.
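The feedback loop between the model partitioner and the recomputation policy generator can be sketched as follows. This is a minimal illustration under stated assumptions, not the partitioning algorithm of Lynx: the function names, the `tolerance` threshold, and the "move one layer from the slowest to the fastest stage" step are hypothetical placeholders, and the two cost callables stand in for the profiled times and the recomputation policy generator.

```python
from typing import Callable, List

CostFn = Callable[[int, int], float]  # (stage index, number of layers) -> time


def stage_times(parts: List[int], fwd_bwd_time: CostFn, recompute_time: CostFn) -> List[float]:
    """Per-stage time = profiled FP+BP time + critical-path recomputation time."""
    return [fwd_bwd_time(s, n) + recompute_time(s, n) for s, n in enumerate(parts)]


def search_partition(parts: List[int], fwd_bwd_time: CostFn,
                     recompute_time: CostFn, tolerance: float = 0.05) -> List[int]:
    """Greedily rebalance layer counts per stage until no move helps."""
    parts = list(parts)
    while True:
        times = stage_times(parts, fwd_bwd_time, recompute_time)
        slow = max(range(len(parts)), key=lambda s: times[s])
        fast = min(range(len(parts)), key=lambda s: times[s])
        balanced = times[slow] - times[fast] <= tolerance * times[slow]
        if balanced or parts[slow] <= 1:
            return parts
        # Placeholder repartitioning step: shift one layer off the slowest stage.
        candidate = list(parts)
        candidate[slow] -= 1
        candidate[fast] += 1
        # Stop when the move no longer shortens the slowest stage.
        if max(stage_times(candidate, fwd_bwd_time, recompute_time)) >= times[slow]:
            return parts
        parts = candidate


if __name__ == "__main__":
    # Toy costs (assumed): stage 0 pays extra recomputation per layer.
    fb = lambda stage, n: 1.0 * n
    rc = lambda stage, n: 0.5 * n if stage == 0 else 0.0
    print(search_partition([4, 4], fb, rc))
```

Because a move is accepted only when it strictly reduces the slowest stage's time, the loop always terminates. Working with layer counts per stage keeps every candidate a contiguous partition of the model.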

**Model Deployer.** The *Model Deployer* implements the optimal schedule determined by the *Model Policy Maker*, utilizing deep learning frameworks to deploy the model on physical devices for training.

## 4 Recomputation Scheduling

Our goal is to develop a recomputation policy that maximizes training throughput while preventing out-of-memory issues. This requires addressing key challenges: (1) deciding which tensors to recompute, (2) determining whether recomputation is on the critical path or overlaps with communication, (3) identifying the communication phase for the recomputation to overlap with, and (4) ensuring the policy is produced within an acceptable time. Given the NP-hard nature of recomputation scheduling, we use LP formulations to find solutions. To determine the upper bound of achievable throughput, we design an optimal LP called Lynx-OPT in §4.1. To address the vast search space of Lynx-OPT, we introduce Lynx-HEU, a heuristic approach that provides near-optimal solutions within a reasonable time in §4.2. While this work focuses on homogeneous clusters, our formulation can be extended to heterogeneous GPU clusters, which we plan to explore in future work.

#### 4.1 Optimal Recomputation Scheduling

In this section, we present the Lynx-OPT algorithm and summarize the challenges of operationalizing Lynx-OPT. Table 2 and Table 3 summarize the constraints and the variables used.

**Problem definition.** The DNN model comprises N operators $(OP_n)$ that perform training operations based on the model topology. $OP_i$ must be executed at $Phase_i$. Other operators can also be performed at $Phase_i$ for tensor recomputation. Whether $OP_i$ can be executed depends on whether the results of its preceding dependencies $OP_j$ (where j < i) are available on the device.

**Objective.** The output of each operator can be either saved in GPUs or recomputed. Our objective is to minimize the end-to-end training time along the critical path including forward time, backward time, and recomputation overhead. In other words, we need to minimize the total computation time for all operators minus the overlapped recomputation time during communication:

$$\begin{aligned} \text{minimize} \quad & \sum_{t=1}^{n} \sum_{i=1}^{t} C_i \times R_{t,i} - \sum_{t \in COMM} \sum_{i=1}^{t-1} C_i \times R_{t,i} \\ \text{subject to} \quad & \text{the constraints in Table 2} \end{aligned} \tag{1}$$
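To make Equation 1 concrete, the following sketch evaluates the objective for a given schedule. It is a toy illustration rather than part of Lynx: the function name, the 0-based indexing, and the example values are assumptions.

```python
from typing import List, Set


def critical_path_time(C: List[float], R: List[List[int]], comm: Set[int]) -> float:
    """Evaluate the objective of Equation 1 with 0-based indices.

    C[i]    : computation time of operator i
    R[t][i] : 1 if operator i is computed in phase t, else 0
    comm    : phases whose own operator is a communication operator
    """
    n = len(C)
    # Time of everything executed: operators i <= t in every phase t.
    total = sum(C[i] * R[t][i] for t in range(n) for i in range(t + 1))
    # Recomputation (i < t) placed in a communication phase is hidden behind it.
    hidden = sum(C[i] * R[t][i] for t in comm for i in range(t))
    return total - hidden


# Toy schedule (assumed values): phase 1 is a communication phase, and
# operator 0 is recomputed while it runs.
C = [2.0, 3.0, 4.0]
R = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]

if __name__ == "__main__":
    print(critical_path_time(C, R, comm={1}))  # 11.0 executed minus 2.0 hidden
```

With no communication phases (`comm=set()`), the same schedule costs the full 11.0, so the difference is exactly the recomputation time hidden behind communication.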

**Dependency constraints.** Constraints D1 and D2 ensure that $OP_i$ is computed in $Phase_t$ only if all dependencies (i.e., outputs of $OP_j$) of $OP_i$ are available. In D1, the execution of $OP_i$ requires that $OP_j$ is either executed at $Phase_t$ ($R_{t,j}$) or its output was generated before $Phase_t$ and retained ($S_{t,j}$). In D2, an output can be retained into $Phase_t$ only if it was computed or already retained in $Phase_{t-1}$. According to our definitions, $OP_i$ must execute at $Phase_i$, as shown in D3. In the first phase of training, D4 specifies that no tensors are initially in memory.
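The dependency constraints can be checked mechanically for a candidate schedule. The sketch below is an illustrative checker, not the Lynx solver: it uses 0-based indices, and the `deps` mapping (operator to its direct parents) is an assumed encoding of the model topology.

```python
from typing import Dict, List, Sequence


def satisfies_dependency_constraints(
    R: List[List[int]], S: List[List[int]], deps: Dict[int, Sequence[int]]
) -> bool:
    """Check D1-D4 for binary matrices R[t][i] and S[t][i]."""
    n = len(R)
    for t in range(n):
        if R[t][t] != 1:  # D3: operator t must run in phase t
            return False
        for i in range(n):
            if t == 0 and S[t][i] != 0:  # D4: nothing is in memory initially
                return False
            if t > 0 and S[t][i] > R[t - 1][i] + S[t - 1][i]:  # D2
                return False
            for j in deps.get(i, ()):
                if R[t][i] > R[t][j] + S[t][j]:  # D1: parent computed or retained
                    return False
    return True


# Toy chain 0 -> 1 -> 2 where each output is kept for exactly one phase.
deps = {1: [0], 2: [1]}
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
S_ok = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
S_bad = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]  # operator 1's output is dropped too early

if __name__ == "__main__":
    print(satisfies_dependency_constraints(R, S_ok, deps))
    print(satisfies_dependency_constraints(R, S_bad, deps))
```

In the second schedule, operator 2 runs in phase 2 while its parent is neither recomputed nor retained in that phase, so D1 fails.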

**Communication constraints.** Lynx is the first work to consider how to overlap recomputation with communication. Overlapping recomputation is challenging because recomputation also involves communication operators. These communication operations cannot overlap with the communication involved in forward or backward training due to bandwidth conflicts [28]. We define C1 to formulate this constraint. Additionally, we must prevent the overlapped recomputation time from exceeding the communication time; otherwise, it may induce memory pressure from preloading the intermediate data on the device without substantial performance gains (C2).

**Memory constraints.** For each phase, in addition to the fixed memory consumption ( $M_{static}$ ), three factors dynamically impact memory usage: (1) checkpointed tensors stored in the device (determined by S); (2) tensors generated during training (determined by R); and (3) memory reduction resulting from freed tensors.

**Table 2.** The LP constraints of Lynx-OPT.

| Dependency Constraints                                                 | Communication Constraints                                             | Memory Constraints                                                                                                     |
|------------------------------------------------------------------------|-----------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| <b>(D1)</b> $R_{t,i} \leq R_{t,j} + S_{t,j} \ \forall t \forall i$     | <b>(C1)</b> $R_{t,i} = 0 \ t, i \in COMM, t \neq i$                   | <b>(M1)</b> $U_{t,0} = M_{static} + \sum_{i=1}^{n} M_i \times S_{t,i} \ \forall t$                                     |
| <b>(D2)</b> $S_{t,i} \leq R_{t-1,i} + S_{t-1,i} \ \forall t \forall i$ | <b>(C2)</b> $\sum_{i=1}^{t-1} C_i \times R_{t,i} \leq C_t, t \in COMM$ | <b>(M2)</b> $U_{t,i+1} = U_{t,i} + M_{i+1} \times R_{t,i+1} - \sum_{d \in DEPS(i) \cup \{i\}} M_d \times F_{t,d,i} \ \forall t$ |
| <b>(D3)</b> $R_{t,t} = 1 \ \forall t$                                  |                                                                       | <b>(M3)</b> $F_{t,d,i} = R_{t,i} \times (1 - S_{t+1,d}) \times \prod_{j \in USER(d), j > i} (1 - R_{t,j})$             |
| <b>(D4)</b> $S_{1,i} = 0 \ \forall i$                                  |                                                                       | <b>(M4)</b> $U_{t,i} \leq M_{budget} \ \forall t \forall i$                                                            |

**Table 3.** Variables used in Lynx.

| Variable     | Description                                                                              |
|--------------|------------------------------------------------------------------------------------------|
| $OP_n$       | N training operators                                                                     |
| $Phase_n$    | N training execution phases. N operators correspond to N phases                          |
| COMM         | Set of communication operators (e.g., all-reduce)                                        |
| $C_i$        | Computation time of $OP_i$                                                               |
| $M_i$        | Output memory of $OP_i$                                                                  |
| $M_{budget}$ | Peak memory budget, limited by the GPU memory                                            |
| $M_{static}$ | Static memory including parameters, gradients, and optimizer states                      |
| $R_{t,i}$    | Whether $OP_i$ is computed at $Phase_t$                                                  |
| $S_{t,i}$    | Whether the output of $OP_i$ is retained in GPUs from $Phase_{t-1}$ to $Phase_t$         |
| $U_{t,i}$    | $U_{t,i} \in \mathbb{R}^+$. The memory used after computing $OP_i$ in $Phase_t$          |
| $F_{t,d,i}$  | Whether the output of $OP_d$ can be freed in $Phase_t$ after $OP_i$ is computed          |

We initialize the memory usage in M1 and recursively evaluate it (in M2) for all operations in $Phase_t$, considering newly generated tensors and freed memory. In M2, DEPS(i) represents the dependent operators (parents) of $OP_i$. We define $F_{t,d,i}$ in M3, where USER(d) represents the operators that depend on $OP_d$ (children of $OP_d$).

The output of  $OP_d$  can be discarded after the execution of  $OP_i$  if three conditions are met: (1)  $OP_i$  is executed in  $Phase_t$ , (2)  $OP_d$  is not checkpointed for  $Phase_{t+1}$ , and (3)  $OP_d$ 's children are not executed in  $Phase_t$ . We apply De Morgan's law and intersection interchange techniques from Checkmate [25] to linearize this equation, omitting details for brevity. Finally, memory usage for any phase must remain within the device constraint, as described in M4.
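The linearization itself is omitted in the text. As a reminder of the generic technique only (shown here as an illustrative assumption, not necessarily the exact form used in Checkmate [25] or in Lynx-OPT), a product of binary factors can be replaced by one upper bound per factor plus a single lower bound. Applied to M3, with all variables binary:

$$
\begin{aligned}
F_{t,d,i} &\le R_{t,i}, \\
F_{t,d,i} &\le 1 - S_{t+1,d}, \\
F_{t,d,i} &\le 1 - R_{t,j} \quad \forall j \in USER(d),\ j > i, \\
F_{t,d,i} &\ge R_{t,i} - S_{t+1,d} - \sum_{j \in USER(d),\, j > i} R_{t,j}.
\end{aligned}
$$

The upper bounds force $F_{t,d,i}$ to 0 whenever any factor of M3 is 0. The lower bound forces it to 1 only when every factor is 1, that is, when $R_{t,i} = 1$, $S_{t+1,d} = 0$, and all $R_{t,j} = 0$; in every other case its right-hand side is at most 0 and imposes nothing.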

**Challenges of operationalizing Lynx-OPT.** While the optimal recomputation policy identified by Lynx-OPT provides an upper bound on training performance, Lynx-OPT is impractical for models with a large number of layers because of the extensive search space resulting from forward and backward operators. Even with the optimization techniques from [25, 68], *Lynx-OPT* takes 14 hours to generate policies for relatively small models like GPT-300M (§7.4). As model size grows, search time increases exponentially, potentially requiring months or years for models with tens of billions of parameters. Therefore, a more practical algorithm is needed to generate policies within an acceptable time bound.

#### 4.2 Heuristic Recomputation Scheduling

In this section, we describe a heuristic-based recomputation scheduling approach, Lynx-HEU, to reduce the search time while achieving close-to-optimal training performance.

**Key observation of identical structures.** Large DNN models consist of multiple identical structures. For example, as shown in Figure 6(a), the pipeline parallelism has three fixed training procedures [28], including *warm-up* (T0–T1), *steady* (T1–T2), and *cool-down* (T2–T4). Each procedure contains repeated training structures. Specifically, (1) there are several identical forward passes during *warm-up*. (2) During *steady*, each worker executes the pattern of one forward propagation followed by one backward propagation (i.e., 1F1B). (3) During *cool-down*, workers perform the repeated pattern of one synchronization stall followed by one recomputation and backward pass. Similarly, large-scale models, such as GPT [59], consist of numerous identical layers, like transformer layers (e.g., Layers 0–3 in Figure 6), which exhibit similar GPU memory footprints and computing times.

**Key idea.** We find that the locally optimal recomputation policy for a single structure/layer can be applied to other identical structures/layers without triggering a search in the global space. For example, as shown in Figure 6(a), there are many repeated 1F1B training patterns in the *steady* stage (T1–T2), with each 1F1B training period involving multiple identical transformer layers. Therefore, we can establish a policy for a single transformer layer and apply this policy across layers and patterns. We formulate the problem as a linear program (LP), accounting for operator dependencies, overlapped recomputation and communication constraints, and device memory limitations.

**Problem definition.** A single basic layer (e.g., a transformer layer) consists of N operators $(OP_1, ..., OP_n)$. For each layer, there are four communication phases that can be used for hiding recomputation time, including two forward communication phases (named $Phase_1$ and $Phase_2$) and two backward communication phases (named $Phase_3$ and $Phase_4$), as shown in Figure 1. In addition, if overlapping is not feasible, we can always execute the recomputation on demand in the critical path ($Phase_5$). The definitions of $R_{t,i}$, $M_i$, $C_i$, and COMM are the same as in Table 3. The Boolean $S_i$ denotes whether the output of $OP_i$ will be retained in GPUs permanently. In addition, the forward passes of warm-up and steady share identical tensor retention and recomputation policies in our design.

**Objective.** Our objective is to minimize the recomputation time in the critical path for a basic model layer. In Equation 2, $(1 - S_i) = 1$ indicates that $OP_i$ is recomputed, and $R_{5,i} = 1$ indicates that $OP_i$ is recomputed in the critical path.

$$\text{minimize} \quad \sum_{i=1}^{n} (1 - S_i) \times R_{5,i} \times C_i \tag{2}$$
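As an illustration of how Equation 2 is evaluated for one candidate policy, the following minimal sketch computes the objective for a toy layer. The operator names, costs, and the `Operator` record are hypothetical and chosen only for the example; $C_i$ is assumed here to be the recomputation time of $OP_i$.

```python
from dataclasses import dataclass

CRITICAL_PATH = 5  # Phase_5: on-demand recomputation on the critical path


@dataclass
class Operator:
    name: str      # hypothetical operator name
    cost: float    # C_i, assumed to be the recomputation time (illustrative units)
    stored: bool   # S_i: True if the output is retained on the GPU
    phase: int     # the phase t for which R_{t,i} = 1


def critical_path_recompute_time(ops):
    """Evaluate Equation 2: sum over i of (1 - S_i) * R_{5,i} * C_i."""
    total = 0.0
    for op in ops:
        recomputed = 0 if op.stored else 1                    # (1 - S_i)
        on_critical = 1 if op.phase == CRITICAL_PATH else 0   # R_{5,i}
        total += recomputed * on_critical * op.cost
    return total


layer = [
    Operator("layernorm", cost=1.0, stored=False, phase=1),  # overlapped
    Operator("attention", cost=4.0, stored=False, phase=5),  # on-demand
    Operator("mlp", cost=3.0, stored=True, phase=5),         # retained
]
objective = critical_path_recompute_time(layer)
print(objective)  # 4.0: only "attention" is recomputed on the critical path
```

In this toy policy, the retained operator and the overlapped operator contribute nothing to the objective, which is the effect the LP tries to maximize.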



(c) +Recomputation-aware Model Partition: Load balance across stages

Figure 6. The left part illustrates the recomputation and model partitioning policy in Lynx, while the right part shows the time and memory usage of the two stages with 3 microbatches (0 to 2). Comp. OP and Comm. OP represent the computation and communication operators, respectively. The output of shaded Comp. OPs is released from GPU after the forward pass and regenerated through recomputation during backward. The red dashed line on the right figure indicates the GPU memory budget.

**Dependency constraints.** We constrain each recomputation operator to be executed only once in Equation 3. Whether $OP_i$ can be executed in $Phase_t$ depends on whether $OP_j$ is computed before $Phase_t$ or has been stored in the GPU, where $OP_j$ is the preceding operator that $OP_i$ depends on, as illustrated in Equation 4.

$$\sum_{t=1}^{5} R_{t,i} = 1 \quad \forall i \tag{3}$$

$$R_{t,i} \le \sum_{t'=1}^{t} R_{t',j} + S_j \quad t \in [1, 5], \forall i \tag{4}$$

**Communication constraints.** We need to ensure that the overlapped recomputation time does not exceed the communication time (Equation 5), and communication operators should not be invoked during the communication process (Equation 6). This is because (1) limiting overlapped recomputation to stay within communication time reduces search time while maintaining performance, and (2) preventing concurrent communication avoids network contention, leading to more accurate performance predictions for the policy maker and more effective scheduling algorithms.

$$\sum_{i=1}^{n} (1 - S_i) \times R_{t,i} \times C_i \le CTime_t \quad t \in [1, 4] \tag{5}$$

where $CTime_1$ and $CTime_2$ represent the two forward communication times, and $CTime_3$ and $CTime_4$ represent the two backward communication times, respectively.



Figure 7. The training details of 1F1B correspond to Stage 1 in Figure 6(a) and (b).

$$R_{t,i} = 0 \quad t \in [1,4] \ i \in COMM \tag{6}$$
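The two communication constraints can be checked for a candidate policy with a few lines of code. The sketch below is illustrative only: the dictionary layout, operator names, and times are assumptions made for the example, and `is_comm` marks membership in $COMM$.

```python
def satisfies_comm_constraints(ops, ctime):
    """Check Equations 5 and 6 for one layer.

    ops:   list of dicts with keys "cost" (C_i), "stored" (S_i),
           "phase" (the t with R_{t,i} = 1) and "is_comm" (i in COMM).
    ctime: dict mapping each communication phase t in 1..4 to CTime_t.
    """
    for t in range(1, 5):
        overlapped = 0.0
        for op in ops:
            if op["phase"] != t:
                continue
            if op["is_comm"]:
                return False  # Equation 6: no communication operator in phases 1..4
            if not op["stored"]:
                overlapped += op["cost"]  # (1 - S_i) * R_{t,i} * C_i
        if overlapped > ctime[t]:
            return False  # Equation 5: recomputation must fit in CTime_t
    return True


# Hypothetical layer: two operators overlapped with Phase_1, one communication
# operator left on the critical path (Phase_5).
layer = [
    {"name": "layernorm", "cost": 2.0, "stored": False, "phase": 1, "is_comm": False},
    {"name": "attention", "cost": 1.5, "stored": False, "phase": 1, "is_comm": False},
    {"name": "all_reduce", "cost": 1.0, "stored": False, "phase": 5, "is_comm": True},
]
ctime = {1: 4.0, 2: 4.0, 3: 4.0, 4: 4.0}

fits = satisfies_comm_constraints(layer, ctime)                    # 3.5 <= 4.0
too_tight = satisfies_comm_constraints(layer, {**ctime, 1: 3.0})   # 3.5 > 3.0
print(fits, too_tight)  # True False
```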

**Memory constraints.** We need to ensure the peak memory usage is smaller than the GPU memory size ($M_{budget}$). Since unnecessary tensors are gradually released during backward propagation, the peak memory usage occurs before the first backward propagation begins [63]. Therefore, we define the peak memory usage as Equation 7. Specifically, the peak memory comprises the fixed memory ($M_{static}$), tensors ($M_{fwd}$) residing in the GPU after forward propagations before the first backward propagation, and tensors generated during the forward communication ($M_{fwd\_comm}$).

$$M_{static} + M_{fwd} + M_{fwd\_comm} \le M_{budget} \tag{7}$$

$M_{fwd}$ is formulated in Equation 8, where $N_{layer}$ denotes the number of transformer layers in the DNN model, and $N_{batch}$ represents the number of forward passes before the first backward propagation (e.g., Stage0 has 4 forward passes in Figure 2). We define $S_n = 1$ to store the output of $OP_n$ on the GPU as the checkpoint.

$$M_{fwd} = \left(N_{layer} \times \sum_{i=1}^{n} S_i \times M_i\right) \times N_{batch} \tag{8}$$

In our design, recomputation is not overlapped with communication during the *warm-up* phase, as no recomputation operations occur in this phase. Therefore, we only calculate the size of data generated during forward communication for a single forward batch in the *steady* phase:

$$M_{fwd\_comm} = N_{layer} \times \sum_{i=1}^{n} (1 - S_i) \times (R_{1,i} + R_{2,i}) \times M_i \tag{9}$$
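To make the memory model concrete, the sketch below evaluates Equations 7 to 9 for a toy policy. All sizes are illustrative (think of them as GB), the operator names are hypothetical, and $M_i$ is assumed to be the output size of $OP_i$.

```python
def peak_memory(ops, n_layer, n_batch, m_static):
    """Left-hand side of Equation 7 for one pipeline stage.

    ops: list of dicts with keys "mem" (M_i), "stored" (S_i) and
         "phase" (the t with R_{t,i} = 1).
    """
    # Equation 8: retained outputs, kept for every in-flight forward pass.
    m_fwd = n_layer * sum(op["mem"] for op in ops if op["stored"]) * n_batch
    # Equation 9: outputs regenerated during forward communication
    # (Phase_1 or Phase_2), counted for a single forward batch.
    m_fwd_comm = n_layer * sum(
        op["mem"] for op in ops if not op["stored"] and op["phase"] in (1, 2)
    )
    return m_static + m_fwd + m_fwd_comm


layer = [
    {"name": "mlp", "mem": 2.0, "stored": True, "phase": 5},
    {"name": "layernorm", "mem": 3.0, "stored": False, "phase": 1},
    {"name": "attention", "mem": 1.0, "stored": False, "phase": 5},
]
peak = peak_memory(layer, n_layer=4, n_batch=3, m_static=10.0)
print(peak)           # 46.0 = 10 + (4 * 2 * 3) + (4 * 3)
print(peak <= 40.0)   # False: this policy violates a 40-unit budget
print(peak <= 48.0)   # True: it fits a 48-unit budget
```

The example shows why retaining an output is expensive in early stages: its size is multiplied by $N_{batch}$, whereas an output recomputed during forward communication is counted only once per layer.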

**Optimizations.** First, in the last pipeline stage (e.g., Stage3 in Figure 2), it is unnecessary to overlap recomputation in the forward communication because recomputation will be immediately executed after discarding the corresponding tensors. In this scenario, we only consider 3 phases defined in LP: two backward communications and the critical path for on-demand recomputation. When modeling the LP for the last pipeline stage, we exclude  $M_{fwd\_comm}$  in the memory constraint.

Second, the recomputation scheduling during *cool-down* can be further improved. The training in *cool-down* incurs many synchronization stalls (T2–T3) in Figure 6(a). Lynx further uses the synchronization stalls for hiding recomputation overhead when all the dependent tensors are on the same GPU and sufficient GPU memory is available. For example, in Figure 6(b), Lynx parallelizes the recomputation of Batch 2 in Stage 0 with preceding synchronization stalls, further improving training efficiency.

**Example.** Figure 6(a) and (b) show the recomputation cases with Full recomputation and Lynx-HEU. Full recomputation disregards unused GPU memory, leading to excessive recomputation. In contrast, since the GPUs in Stage 1 have ample memory space, Lynx-HEU stores some activations on the GPU to reduce recomputation overhead (Selective Recomp. in Figure 6(b)). Additionally, Lynx-HEU overlaps part of the recomputation with communication, further reducing training time (Overlapped Recomp.). Figure 7 illustrates the details. Assume that the first shaded OP of Layer 2 in Figure 6(a) corresponds to four finer-grained operators ①-④, and the second shaded operator corresponds to a single operator ⑤. In Lynx-HEU, ①-② are overlapped with communication during the forward pass of Layer 3 in the previous batch; ③-④ are overlapped with communication during the backward pass of Layer 3 in the current batch; ⑤ is avoided entirely by storing its activation on the GPU. Thus, we eliminate all recomputation overhead in Stage 1 of Figure 6(b).



**Figure 8.** The recomputation-aware model partitioning approach. Each red rectangle represents a layer of the model.

**Search time.** Lynx-HEU significantly reduces the search space, requiring only seconds to find a near-optimal policy in our evaluation (even for a very large 175B model). More details are given in §7.6.

## 5 Recomputation-Aware Model Partitioning

In this section, we describe a model partitioning approach that balances computation time across pipeline stages when recomputation is overlapped with communication. Model partitioning cannot be decided independently of the recomputation policy, because it interacts with the recomputation scheduling algorithm. We use a greedy algorithm to search for a partitioning policy, as shown in Figure 8.

**Key idea.** We aim to iteratively reduce the number of layers in the slowest stage and increase the number of layers in the fastest stage until the execution times across stages are as equal as possible. Specifically, it consists of four steps: (1) The Policy Initializer generates an initial model partitioning scheme, where each stage has approximately the same number of layers; (2) Based on the current partitioning scheme, the Recomputation Policy Generator uses the heuristic recomputation scheduling (§4.2) to find the near-optimal recomputation policy; (3) The Training Cost Model estimates the total time for each stage based on the recomputation policy, leveraging information collected by the Model Profiler; (4) The Partition Policy Maker generates a new model partitioning scheme by removing one layer from the slowest stage and adding one layer to the fastest stage. Then, steps (2) and (3) are re-executed to evaluate each stage's execution time. If the new partitioning scheme is valid (i.e., no out-of-memory errors) and the longest stage of the new partitioning scheme is shorter than the current longest stage, then the new partitioning scheme is adopted. Steps (2), (3), and (4) are repeated until the partitioning scheme does not change compared to the last iteration, yielding the output policy (5).
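The greedy loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Lynx's implementation: `stage_time` is a hypothetical stand-in for steps (2) and (3) (policy generation plus cost estimation), `is_valid` is a hypothetical out-of-memory check, and the sketch additionally assumes that every stage keeps at least one layer.

```python
def greedy_partition(num_layers, num_stages, stage_time, is_valid):
    """Greedy recomputation-aware partitioning (illustrative sketch).

    stage_time(stage, n_layers) -> estimated time of that stage
        (hypothetical stand-in for steps 2 and 3).
    is_valid(partition) -> False if some stage would run out of memory.
    Returns a list with the number of layers assigned to each stage.
    """
    # Step 1: start from roughly equal layer counts.
    base, extra = divmod(num_layers, num_stages)
    partition = [base + (1 if s < extra else 0) for s in range(num_stages)]

    while True:
        times = [stage_time(s, n) for s, n in enumerate(partition)]
        slowest = max(range(num_stages), key=times.__getitem__)
        fastest = min(range(num_stages), key=times.__getitem__)
        if slowest == fastest or partition[slowest] <= 1:
            return partition  # nothing left to rebalance

        # Step 4: move one layer from the slowest to the fastest stage.
        candidate = list(partition)
        candidate[slowest] -= 1
        candidate[fastest] += 1

        # Re-run steps 2 and 3 on the candidate and keep it only if it is
        # valid and shortens the longest stage.
        new_times = [stage_time(s, n) for s, n in enumerate(candidate)]
        if is_valid(candidate) and max(new_times) < max(times):
            partition = candidate
        else:
            return partition  # scheme unchanged, so this is the output policy


# Toy setting: Stage 0 pays more exposed recomputation per layer than Stage 1.
per_layer_time = [3.0, 1.0]  # hypothetical per-layer times
result = greedy_partition(
    num_layers=4,
    num_stages=2,
    stage_time=lambda stage, n: n * per_layer_time[stage],
    is_valid=lambda partition: all(n >= 1 for n in partition),
)
print(result)  # [1, 3]
```

Because a candidate is adopted only when it strictly shortens the longest stage, the loop terminates after a finite number of moves. In the toy setting it moves from two layers per stage to one layer in Stage 0 and three in Stage 1, at which point both stages take the same time.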

**Example.** Figure 6(b) shows the case without recomputation-aware model partitioning, where each stage has two layers. Stage 0 has the highest storage pressure, requiring recomputation of both shaded OPs, with only one overlapped with communication. In contrast, in Stage 1, only one shaded OP needs recomputation and can be fully overlapped. This causes Stage 0 to take longer than Stage 1 due to the additional recomputation time on the critical training path, leading to a pipeline bubble. In contrast, Figure 6(c) shows that after enabling recomputation-aware model partitioning, Stage 0 is assigned one layer and Stage 1 is assigned three layers. After readjusting the recomputation policy, the total times for both stages are approximately balanced, improving pipeline efficiency and reducing total training time.

## 6 Implementation

Model Profiler. It collects model metrics before training and addresses two challenges. (1) Profiling the full model risks out-of-memory (OOM) issues and high computational costs. To address this, Lynx profiles only a single representative layer instead of a group of similar layers, leveraging the repetitive structures common in large-scale models. (2) Modeling hundreds of operators individually for the linear program (LP) formulation is inefficient. To reduce search overhead, smaller operators (e.g., add, get shape, and transpose) are grouped into a single unit, while major operators (e.g., Matmul and Fused Layernorm) remain as individual scheduling units.
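The operator grouping in (2) can be sketched as below. The sketch assumes that runs of consecutive small operators are merged into one scheduling unit, which is one plausible reading of the description; the operator names and the `MAJOR_OPS` set are hypothetical.

```python
# Hypothetical names for operators that stay as individual scheduling units.
MAJOR_OPS = {"matmul", "fused_layernorm"}


def group_operators(op_names):
    """Merge each run of consecutive small operators into one unit."""
    units = []
    pending = []  # small operators collected since the last major operator
    for name in op_names:
        if name in MAJOR_OPS:
            if pending:
                units.append(tuple(pending))
                pending = []
            units.append((name,))
        else:
            pending.append(name)
    if pending:
        units.append(tuple(pending))
    return units


trace = ["get_shape", "transpose", "matmul", "add",
         "fused_layernorm", "add", "transpose"]
units = group_operators(trace)
print(len(trace), "operators ->", len(units), "scheduling units")
# 7 operators -> 5 scheduling units
```

Fewer scheduling units mean fewer decision variables $R_{t,i}$ and $S_i$ in the LP, which is where the reduction in search overhead comes from.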

**Policy Maker.** It supports any hybrid parallelism policy and applies search algorithms based on profiled metrics, allowing Lynx to find the optimal partitioning and recomputation policy for each PP stage. To improve practicality, we implement the policy search algorithm using the Gurobi optimizer [17] and integrate it into the profiling interface, streamlining the process by combining profiling and policy making.

Model Deployer. It supports two training frameworks, Megatron-LM [46] for NVIDIA GPUs and MindSpeed [22] for Ascend NPUs. Both frameworks provide basic interfaces for computation graph partitioning. We implement Lynx's partitioning using these interfaces, define custom overlapped and selective recomputation, and modify the decoder layer to support our recomputation policies.

## 7 Evaluation

### 7.1 Experimental Setup

Clusters. We conduct experiments primarily on two NVIDIA clusters with different GPUs and network bandwidths. The *NVIDIA-NVLink cluster* consists of four nodes, each with 256GB DRAM, two Intel Xeon Gold 6130 CPUs and four NVIDIA A100-SXM 40GB GPUs interconnected via NVLink (600 GB/s bidirectional bandwidth). The *NVIDIA-PCIe cluster* also consists of four nodes, each equipped with 128GB DRAM, two Intel Xeon Gold 5318Y CPUs, and two NVIDIA A100-PCIe 40GB GPUs with PCIe 4.0 (64 GB/s bidirectional bandwidth). All nodes are connected via ConnectX-5 Infiniband.

Additionally, we evaluate Lynx on an *Ascend NPU cluster* (§7.5). It consists of four nodes, each with eight Ascend 910 32GB accelerators (NPUs), 192 CPU cores in four sockets, and 512GB memory. The eight NPUs are installed on two NPU boards in each node, and the four NPUs on each board are fully meshed via 30 GB/s links in all directions. All nodes are connected via a 100 Gbps NIC for inter-node communication.

Baselines. We compare Lynx with the following systems: (1) Full Recomputation [46]: It releases all intermediate data in each layer and recomputes all model layers before the backward pass. For model partitioning, it balances the number of model parameters on each pipeline stage [13]. We refer to this default partitioning approach as *dp-partitioning*. (2) Selective Recomputation [32]: It only recomputes the attention operators within each layer and also adopts *dp-partitioning*. (3) AdaPipe [57]: AdaPipe is the state-of-the-art model-adaptive recomputation system. It automatically determines the recomputation and model partitioning strategy through a dynamic programming algorithm. All these systems expose recomputation time along the critical computation path. In contrast, Lynx parallelizes recomputation time with communication time using the heuristic recomputation scheduling and applies a recomputation-aware model partitioning strategy.

Besides, we compare Lynx with Lynx-OPT to demonstrate Lynx's superiority in balancing recomputation policy search overhead and model training performance in §7.4. We do not compare Lynx with Megatron-Block and Megatron-Uniform, as *AdaPipe* already outperforms them.

Workloads. We use six GPT [53] models of varying scales: GPT-300M, GPT-1.3B, GPT-4.7B, GPT-7B, GPT-13B, and GPT-23B. They have varying attention heads, hidden dimensions, and numbers of layers, as specified in the official documentation [2]. If not specified, the sequence length is set to 1024. All models are trained on the representative WikiText2 dataset [62] using mixed-precision training, following the approach outlined in related work [39].

### 7.2 Overall Performance

Figure 9 shows the model training throughput results for different models with varying micro batch sizes (ranging from 8 to 32) across two clusters. The micro batch size refers to the number of training samples per GPU. We have the following five observations.

First, Lynx outperforms others, with up to 1.37×, 1.2×, and 1.18× throughput gains over Full Recomputation, Selective Recomputation, and AdaPipe, respectively, highlighting its effectiveness. Second, Lynx achieves greater average speedup on the NVIDIA-PCIe cluster (1.35×) than on the NVLink cluster (1.3×) compared to Full Recomputation, as slower PCIe bandwidth allows more recomputation to overlap with communication. Third, Lynx's speedup varies by model scale and micro batch size, as these factors influence Lynx's recomputation and partitioning strategies. Fourth, Lynx outperforms AdaPipe by up to 1.18× and 1.2× on NVLink and PCIe clusters, respectively. This is because Lynx can overlap recomputation with communication, further reducing recomputation overhead (details in §7.3.1). Fifth, Selective Recomputation faces out-of-memory issues with large models or micro batch sizes due to insufficient memory release, unlike Lynx which adapts recomputation policies to GPU memory.

**Figure 9.** Overall training throughput of different recomputation policies across five models and two GPU clusters. We omit displaying evaluation results that encounter out-of-memory issues.

**Figure 10.** (a) The time overhead is normalized to that of Full Recomputation. (b) Time breakdown of Lynx recomputation of four pipeline stages with 32 micro batch size (mbs).

### 7.3 Breakdown Analysis

**7.3.1 Effectiveness of Recomputation Policy.** *Recomputation time comparison.* We use *dp-partitioning* in all the experiments, ensuring an even distribution of model parameters across each pipeline stage. Due to space constraints, we present results for GPT-13B and GPT-23B models on the NVIDIA-NVLink cluster only. Similar trends are observed for other models and configurations. We exclude all Selective Recomputation results due to OOM issues. Figure 10(a) shows the normalized recomputation time on the critical path. We observe that Lynx reduces recomputation time by 71%–94% and 31%–80% compared to Full Recomputation and AdaPipe, respectively. This is because Lynx selects appropriate layers for recomputation and hides recomputation time within communication.

Recomputation operator ratio. Figure 10(b) shows the ratio of recomputation operators on the critical path (denoted as *On-demand*), recomputation operators run in parallel with communication (denoted as *Overlapping*), and non-recomputation operators (denoted as *Non-Recomp.*) in Lynx. Lynx achieves up to 31% and 30% recomputation-communication overlap on the 13B and 23B models, respectively, with a uniform proportion across stages. We also observe that Lynx effectively reduces more recomputation overhead in the later pipeline stages. For example, it eliminates all recomputation overhead in stage 2 and stage 3 for the 13B model, while reducing it by only 57% and 80% in stage 0 and stage 1, respectively. This is because training in the earlier stages consumes more GPU memory, making it hard to fully hide recomputation within communication.

**7.3.2 Effectiveness of Model Partitioning.** Figure 11 shows the throughput comparison of dp-partitioning and Lynx's partitioning. We use GPT-13B and GPT-23B models on the NVIDIA-NVLink cluster. Lynx's partitioning increases the throughput by 1.1×-1.14× and 1.16×-1.23× for the 13B and 23B models, respectively. The dp-partitioning scheme may cause uneven execution times across pipeline stages, negatively impacting overall training performance. Figure 12 shows that enabling Lynx's partitioning reduces the bubbles in the pipeline by 30%. Moreover, Lynx brings more benefits for larger models because training smaller models requires less GPU memory, leading to lower or even no recomputation overhead, thereby alleviating the issue of load imbalance across stages.



**Figure 11.** Training throughputs with different model partitioning schemes. The throughput (samples/s) is normalized to that of Lynx with dp-partition.



**Figure 12.** The scheduling for dp-partition (top) and Lynx's partition (bottom). We train GPT-23B with 16 micro batch size.

7.3.3 Contribution of each technique. Figure 13 shows the impact of each optimization on the overall training throughput. +recomputation only applies heuristic recomputation, while +All further incorporates recomputation-aware model partitioning in Lynx. Specifically, +recomputation and +All enhance throughput by 1.19×-1.3× and 1.25×-1.35× compared to Full Recomputation, contributing an average of 77% and 23% to the overall performance improvement, respectively. This highlights the necessity of each technique. Moreover, model partitioning is more effective for larger models. For example, it contributes 25% of the performance improvement on the 23B model compared to 18% on the 13B model.

### 7.4 The Effectiveness of Lynx-HEU

Figure 14 shows the policy search time and model training throughput of *Checkmate*, Lynx with optimal recomputation scheduling (*Lynx-OPT*), and Lynx with heuristic recomputation scheduling (*Lynx-HEU*). Checkmate uses MILP to determine the recomputation policy that minimizes additional recomputation costs, without accounting for overlapped recomputation (§2.2). We use the small GPT-300M model with a micro batch size of 32 on the NVIDIA-NVLink cluster. Figure 14(a) shows that Checkmate and *Lynx-OPT* require 14 hours to find the best policy, while *Lynx-HEU* only takes 0.5 seconds. *Lynx-HEU* achieves 97.8% of the training throughput of *Lynx-OPT* while reducing the search time by 99.99%. This demonstrates the effectiveness and practicality of *Lynx-HEU*, as it can achieve a training throughput close to the optimal in an acceptable amount of time.



**Figure 13.** Performance contribution of each technique.



Figure 14. The effect of Lynx-HEU on GPT-300M model.



**Figure 15.** Training on Ascend cluster with 8 mbs.

### 7.5 Sensitivity Analysis

Accelerator types. Besides NVIDIA GPUs, Lynx is applicable to other processing units. Figure 15 shows that Lynx consistently outperforms other systems on the Ascend NPU cluster, achieving speedups of 1.15×-1.35× and 1.08×-1.23× over Full Recomputation and AdaPipe, respectively. Notably, Lynx performs better on the Ascend cluster than on NVIDIA-NVLink devices due to higher TP communication overhead, which enables more recomputation to be overlapped with communication.

Parallelism policy. We configure different parallelism policies by changing the levels of tensor parallelism (TP) and pipeline parallelism (PP). The former equals the number of accelerators used within a single machine and the latter equals the number of machines used. Figure 16(a) shows that Lynx outperforms other baselines by 1.06×-1.37× in throughput across parallelism strategies on NVIDIA-NVLink. Besides, under the same total number of GPUs, Lynx shows greater speedup with larger TP, as increased communication time allows more recomputation to be overlapped with communication. Figure 15 shows similar observations for Lynx on NPUs.

**Sequence length.** Figure 16(b) shows that Lynx consistently outperforms all counterparts across a range of sequence lengths from 512 to 2048. Additionally, increasing the sequence length slows down training throughput as it raises the complexity of training.



**Figure 16.** Parallelism policy and sequence length on GPT-7B with micro batch size 32.

### 7.6 Overhead Analysis

**Profiling time.** Lynx requires offline profiling of the model and collection of statistics for each operator. It introduces a small time overhead equivalent to several iterations of training (§3). In our experiments, profiling models ranging from 1.3B to 23B parameters takes only a few minutes due to our optimizations outlined in §6. The profiling time is negligible compared to the total training time.

**Search time for Lynx-HEU.** Lynx-HEU can generate a solution within 1 second for models ranging from 1.3B to 175B parameters. Moreover, Figure 17 shows that the search time for Lynx-HEU remains consistent across different model sizes, demonstrating its scalability.

**Search time for model partitioning.** Determining the model partitioning policy requires multiple invocations of Lynx's recomputation scheduling mechanism. Figure 17 shows that Lynx-HEU takes less than 3 seconds to determine both partitioning and recomputation policies, even for large-scale models like the 175B parameter model.

## 8 Discussion

Applicability to new techniques. Other parallel techniques, like sequence parallelism (SP) [32], are also employed in large model training. SP partitions tensors along the sequence dimension to decrease computational and memory demands for activations. Our experiments demonstrate that Lynx achieves an additional 10% speedup when SP is incorporated on top of TP. This is because SP decreases the execution time of each operator, providing more opportunities for overlapping recomputation.

Applicability to new hardware. AI accelerators with extreme training performance, such as the NVIDIA GH200 [44] and B200 [48], are becoming available. Moreover, new AI training systems, such as NVIDIA DGX SuperPOD [49] and Google TPUv4 Pods [15], have been proposed, comprising thousands of high-performance AI accelerators. These systems may enable scaling tensor parallelism to more than eight GPUs, thereby increasing communication pressure. In these scenarios, we believe that the techniques proposed in Lynx will be more effective due to increased computing speed and high communication overhead.

**Applicability to other mainstream models.** Lynx is applicable to most mainstream models, as they are typically designed with repeated structures (e.g., GPT series, LLaMA series, PaLM, T5, ViT).

**Figure 17.** Policy search time.

## 9 Related Work

**Recomputation, swapping, and compression techniques.** Prior work uses data recomputation to extend the limited capacity of GPU memory [6, 12, 25, 32, 57]. Lynx follows this approach but further reduces computational overhead by overlapping recomputation with communication. Data swapping [1, 5, 8, 20, 30, 55] and its combination with recomputation [18, 19, 27, 50, 60] can also be leveraged to minimize GPU memory footprint. These techniques are complementary to our approach. Compression techniques are widely used to eliminate data redundancy during DNN training [4, 5, 24, 65], but they may compromise model accuracy.

**Data parallelism, tensor parallelism, and pipeline parallelism.** DP partitions input samples among different workers [32, 37, 38, 51]. However, as the size of the model grows, these approaches suffer from communication bottlenecks [33, 67]. TP splits model weight matrices and assigns them to different devices [14, 21, 26, 41, 43, 56]. PP partitions a model into sub-modules across multiple GPUs and transfers the output of each module to the next device [14, 21, 35, 41, 42, 56]. Existing works also consider evenly partitioning models to achieve computation balance [40, 41, 43, 58]. However, Lynx considers the impact of recomputation on performance when partitioning the model into different stages, whereas other approaches do not.

Overlapping computation within communication. Previous studies apply a variety of loop analysis and transformation techniques to extract loops containing only independent communication and computation for overlapping [11, 16]. Some works accelerate DNN training through hardware [54] or compiler optimizations [61]. They are orthogonal to Lynx as they do not consider overlapping recomputation.

## 10 Conclusion

In this paper, we propose the Lynx framework for large DNN model training with recomputation. First, it reduces recomputation overhead by overlapping recomputation with communication, which is required in tensor and pipeline parallelism. Second, we model the recomputation scheduling problem and solve it using an integer linear program to achieve a near-optimal solution based on the heuristics that large models have identical structures to reduce the size of solution space. Finally, we design a model partitioning algorithm to achieve load balancing among pipeline stages. We evaluate the performance of Lynx across different models using both NVLink and PCIe connected GPU clusters. The results show that Lynx outperforms the existing approaches by up to 1.37×.

## References

- [1] Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In *Proceedings of the 19th USENIX Conference on File and Storage Technologies*.
- [2] Tom B Brown. 2020. Language Models are Few-shot Learners. arXiv preprint arXiv:2005.14165 (2020).
- [3] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative Pretraining from Pixels. In Proceedings of the International conference on machine learning.
- [4] Ping Chen, Shuibing He, Xuechen Zhang, Shuaiben Chen, Peiyi Hong, Yanlong Yin, and Xian-He Sun. 2022. Accelerating Tensor Swapping in GPUs With Self-Tuning Compression. *IEEE Transactions on Parallel* and Distributed Systems (2022).
- [5] Ping Chen, Shuibing He, Xuechen Zhang, Shuaiben Chen, Peiyi Hong, Yanlong Yin, Xian-He Sun, and Gang Chen. 2021. CSWAP: A Self-Tuning Compression Framework for Accelerating Tensor Swapping in GPUs. In Proceedings of the 2021 IEEE International Conference on Cluster Computing.
- [6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174 (2016).
- [7] Weijian Chen, Shuibing He, Yaowen Xu, Xuechen Zhang, Siling Yang, Shuang Hu, Xian-He Sun, and Gang Chen. 2023. iCache: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture.
- [8] Xiaoming Chen, Danny Z. Chen, and Xiaobo Sharon Hu. 2018. MoDNN: Memory Optimal DNN Training on GPUs. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition.
- [9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling Language Modeling with Pathways. *Journal of Machine Learning Research* (2023).
- [10] Colossal AI. 2024. Colossal AI. https://colossalai.org/.
- [11] A. Danalis, K.-Y. Kim, L. Pollock, and M. Swany. 2005. Transformations to Parallel Codes for Communication-Computation Overlap. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing.
- [12] Deepspeed. 2023. Activation Checkpointing. https://deepspeed.readthedocs.io/en/stable/activation-checkpointing.html
- [13] Deepspeed-Megatron. 2024. Pipeline Parallelism. https://www.deepspeed.ai/tutorials/pipeline/
- [14] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. 2021. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
- [15] Google. 2024. Google showcases Cloud TPU v4 Pods for large model training. https://cloud.google.com/blog/topics/tpus/google-showcases-cloud-tpu-v4-pods-for-large-model-training
- [16] J. Guo, Q. Yi, J. Meng, J. Zhang, and P. Balaji. 2016. Compiler-Assisted Overlapping of Communication and Computation in MPI Applications. In Proceedings of the 2016 IEEE International Conference on Cluster Computing.
- [17] gurobi. 2024. Gurobi. https://www.gurobi.com/.
- [18] Shuibing He, Ping Chen, Shuaiben Chen, Zheng Li, Siling Yang, Weijian Chen, and Lidan Shou. 2023. HOME: A Holistic GPU Memory Management Framework for Deep Learning. *IEEE Trans. Comput.* (2023).
- [19] Mark Hildebrand, Jawad Khan, Sanjeev Trika, Jason Lowe-Power, and Venkatesh Akella. 2020. AutOTM: Automatic Tensor Movement in Heterogeneous Memory Systems Using Integer Linear Programming. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
- [20] Chien Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
- [21] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In Proceedings of the Advances in neural information processing systems.
- [22] Huawei. 2024. Ascend MindSpeed-LLM. https://gitee.com/ascend/MindSpeed-LLM
- [23] Huawei. 2024. MindSpore. https://github.com/mindspore-ai
- [24] Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, and Gennady Pekhimenko. 2018. GIST: Efficient Data Encoding for Deep Neural Network Training. In Proceedings of the International Symposium on Computer Architecture.
- [25] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. 2019. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization. arXiv preprint arXiv:1910.02653 (2019).
- [26] Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, et al. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In Proceedings of the 2022 USENIX Annual Technical Conference.
- [27] Wenbin Jiang, Yang Ma, Bo Liu, Haikun Liu, Bing Bing Zhou, Jian Zhu, Song Wu, and Hai Jin. 2019. Layup: Layer-adaptive and Multi-type Intermediate-oriented Memory Optimization for GPU-based CNNs. ACM Transactions on Architecture and Code Optimization (2019).
- [28] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. 2024. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. arXiv preprint arXiv:2402.15627 (2024).
- [29] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 (2020).
- [30] Shine Kim, Yunho Jin, Gina Sohn, Jonghyun Bae, Tae Jun Ham, and Jae W Lee. 2021. Behemoth: a Flash-centric Training Accelerator for Extreme-scale DNNs. In Proceedings of the 19th USENIX Conference on File and Storage Technologies.
- [31] Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
- [32] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing Activation Recomputation in Large Transformer Models. In Proceedings of Machine Learning and Systems.

- [33] Zhiquan Lai, Shengwei Li, Xudong Tang, Keshi Ge, Weijie Liu, Yabo Duan, Linbo Qiao, and Dongsheng Li. 2023. Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models. IEEE Transactions on Parallel and Distributed Systems (2023).
- [34] Lambda. 2020. OpenAI's GPT-3 Language Model: A Technical Overview. https://lambdalabs.com/blog/demystifying-gpt-3
- [35] Shigang Li and Torsten Hoefler. 2021. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
- [36] Peng Liang, Yu Tang, Xiaoda Zhang, Youhui Bai, Teng Su, Zhiquan Lai, Linbo Qiao, and Dongsheng Li. 2023. A Survey on Auto-Parallelism of Large-Scale Deep Learning Training. IEEE Transactions on Parallel and Distributed Systems (2023).
- [37] Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. 2021. Zico: Efficient GPU Memory Sharing for Concurrent DNN Training. In Proceedings of the 2021 USENIX Annual Technical Conference.
- [38] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, et al. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation.
- [39] Megatron-WikiText. 2024. Collecting Wikipedia Training Data. https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#collecting-wikipedia-training-data
- [40] Microsoft. 2023. Megatron-DeepSpeed. https://github.com/microsoft/Megatron-DeepSpeed/tree/main
- [41] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles.
- [42] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient Pipeline-parallel DNN Training. In Proceedings of the International Conference on Machine Learning. PMLR.
- [43] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient Large-scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
- [44] NVIDIA. 2023. NVIDIA DGX GH200. https://www.nvidia.cn/data-center/dgx-gh200/
- [45] NVIDIA. 2024. The checkpointing of Megatron-LM. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/transformer\_block.py
- [46] NVIDIA. 2024. Megatron-LM. https://github.com/NVIDIA/Megatron-LM/tree/main
- [47] NVIDIA. 2024. Megatron-LM. https://github.com/NVIDIA/Megatron-LM
- [48] NVIDIA. 2024. NVIDIA DGX B200. https://www.nvidia.com/en-us/data-center/dgx-b200/
- [49] NVIDIA. 2024. NVIDIA's DGX SuperPOD cloud-native supercomputer. https://www.nvidia.com/en-us/data-center/dgx-superpod-gb200/

- [50] Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU Memory Management for Deep Learning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
- [51] PyTorch. 2020. PyTorch/Vision. https://github.com/pytorch/vision/tree/master/torchvision
- [52] PyTorch. 2024. GPU utilization Kineto. https://github.com/pytorch/kineto/blob/main/tb\_plugin/docs/gpu\_utilization.md
- [53] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models are Unsupervised Multitask Learners. OpenAI blog (2019).
- [54] S. Rashidi, M. Denton, S. Sridharan, A. Suresh, J. Nie, and T. Krishna. [n. d.]. Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms. In Proceedings of the 48th Annual International Symposium on Computer Architecture.
- [55] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. VDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. In Proceedings of the Annual International Symposium on Microarchitecture.
- [56] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053 (2019).
- [57] Zhenbo Sun, Huanqi Cao, Yuanwei Wang, Guanyu Feng, Shengqi Chen, Haojie Wang, and Wenguang Chen. 2024. AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3.
- [58] Jakub M Tarnawski, Amar Phanishayee, Nikhil Devanur, Divya Mahajan, and Fanny Nina Paravecino. 2020. Efficient Algorithms for Device Placement of DNN Graph Operators. Advances in Neural Information Processing Systems 33 (2020).
- [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems.
- [60] Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
- [61] Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, Sameer Kumar, Tongfei Guo, Yuanzhong Xu, and Zongwei Zhou. 2023. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
- [62] WikiText2. 2024. WikiText2. https://paperswithcode.com/dataset/wikitext-2
- [63] Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation.
- [64] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. 2021. VideoGPT: Video Generation Using VQ-VAE and Transformers. arXiv preprint arXiv:2104.10157 (2021).
- [65] Siling Yang, Weijian Chen, Xuechen Zhang, Shuibing He, Yanlong Yin, and Xian-He Sun. 2021. AUTO-PRUNE: Automated DNN Pruning and Mapping for ReRAM-based Accelerator. In Proceedings of the ACM International Conference on Supercomputing.

- [66] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068 (2022).
- [67] Quan Zhou, Haiquan Wang, Xiaoyan Yu, Cheng Li, Youhui Bai, Feng Yan, and Yinlong Xu. 2023. MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture.
- [68] Zan Zong, Li Lin, Leilei Lin, Lijie Wen, and Yu Sun. 2023. STR: Hybrid Tensor Re-Generation to Break Memory Wall for DNN Training. IEEE Transactions on Parallel and Distributed Systems (2023).