# PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences

Pingyi Huo The Pennsylvania State University University Park, PA, USA pqh5140@psu.edu

Anusha Devulapally The Pennsylvania State University University Park, PA, USA akd5994@psu.edu

Hasan Al Maruf AMD, Inc Austin, TX, USA hasan.maruf@amd.com

Minseo Park AMD, Inc Austin, TX, USA minseo.park@amd.com

Krishnakumar Nair AMD, Inc Austin, TX, USA krishnakumar.nair@amd.com

Meena Arunachalam AMD, Inc Austin, TX, USA meena.arunachalam@amd.com

Gulsum Gudukbay Akbulut The Pennsylvania State University University Park, PA, USA gulsum@psu.edu

Mahmut Taylan Kandemir The Pennsylvania State University University Park, PA, USA mtk2@psu.edu

Vijaykrishnan Narayanan The Pennsylvania State University University Park, PA, USA vxn9@psu.edu

Abstract—Deep Learning Recommendation Models (DLRMs) have become increasingly popular and prevalent in today's datacenters, consuming most of the AI inference cycles. The performance of DLRMs is heavily influenced by available bandwidth due to their large vector sizes in embedding tables and concurrent accesses. To achieve substantial improvements over existing solutions, novel approaches to DLRM optimization are needed, especially in the context of emerging interconnect technologies like CXL. This study explores CXL-enabled systems, implementing a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems, identifying the predominant bottlenecks in existing CXL systems. We therefore propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through downstream ports of the fabric switch. PIFS-Rec achieves a latency that is $3.89\times$ lower than Pond, an industry-standard CXL-based system, and also outperforms BEACON, a state-of-the-art scheme, by $2.03\times$.

Index Terms—Recommendation System, Compute eXpress Link, Software-Hardware Co-design, Memory Pooling.

#### I. INTRODUCTION

Personalized recommendation systems have emerged as a cornerstone in the interface between users and technology, spanning various application domains from e-commerce to social networking [1]. These systems, powered by deep learning techniques, sift through vast user data to deliver tailored content. This personalized approach boosts user engagement and significantly enhances overall satisfaction. As these systems become increasingly essential components of our digital experiences, datacenters worldwide are scaling up their capabilities, dedicating extensive resources to the AI inference tasks that underpin recommendation models.

Among various deployed models [2]-[4], Deep Learning Recommendation Models (DLRMs) stand out due to their unique characteristics: unlike their compute-heavy counterparts, DLRMs are predominantly bandwidth-intensive due to their large embedding table accumulations [5]-[8]. This distinction shifts the performance bottleneck from compute capacity to data bandwidth and transfer efficiency. The research community and industry have proposed several hardware-based solutions, such as Process-near-Memory (PNM) [7], [9]-[12] and ASIC designs [13], to address these challenges. However, these solutions introduce new challenges. Firstly, the ever-expanding size of industrial DLRM models, now surpassing even the most significant Large Language Models (LLMs) [14]-[16], poses a significant challenge to the scalability of PNM and ASIC solutions, due to the limited physical interface on boards. Secondly, the PNM-based solutions, by their nature, diverge from standard DRAM protocols [17]. Consequently, adapting to these memory technologies may demand extensive hardware and software stack modifications [10], [18]-[20], elevating development costs and extending the product development cycle. Thirdly, they also introduce resource inefficiency: with shared memory capabilities only at the board level, the PNM solutions can serve a limited number of sockets per board. This limitation may cause redundant data copies within a rack or even across racks [21]-[23] to facilitate multi-host access, leading to inefficient memory usage and increased latency, despite advancements like RDMA [15], [24].

Compute Express Link (CXL) technology [25] is rapidly gaining traction in the contemporary datacenter landscape, setting a new standard in the industry. It ensures cache coherence over the PCIe physical layer and introduces memory pooling by using fabric switch [26]. This advancement offers enhanced memory scalability and utilization, leading towards a new data processing and management era. Furthermore, recent studies emphasize CXL's capability to operate as a separate and independent memory bandwidth source [27], [28], significantly enhancing the system's overall bandwidth availability. Together, these features provide a robust foundation for accelerating DLRMs at the datacenter scale.

Motivated by these observations, this study leverages the capabilities of the CXL standard (bandwidth/memory expander) and its interconnects to accelerate DLRMs. We present PIFS-Rec (Process-In-Fabric-Switch for Recommendation Systems), a scalable, near-data processing capability tailored for fabric switch hardware. Focusing on large-scale industrial DLRM inference systems, PIFS-Rec utilizes the scalability of downstream ports [25], [29] and proximity to memory within the CXL fabric switch [25] to accelerate the embedding table operations. Through minimal hardware and software optimizations within the fabric switch, we extend its capabilities beyond current implementations, including DRAM-based Type 3 memory expanders [30]–[32]. Our design (§IV) enhances existing CXL memory systems by leveraging the "scalable bandwidth" of fabric switches to address bandwidth bottlenecks in embedding table accesses of DLRM. This boosts performance through enhanced device-level I/O utilization and parallelization. Additionally, we explore integrating a process-on-a-fabric-switch framework to reduce data movement costs.

Our main **contributions** in this work are as follows:

- We present results from a characterization study that analyzes recommendation models using real industrial access traces, production-scale DLRM models, and a CXL-ready system. We quantitatively assess the bottlenecks of CXL-enabled memory pooling as well as the potential opportunities it brings.
- We introduce **PIFS-Rec**, a scalable near-data processing approach customized for fabric switch hardware. Our optimizations include hardware and software enhancements such as data repacking, snooping mechanisms, on-switch buffer implementation, and optimized compute logic. Additionally, we explore software-assisted page management strategies to enhance the efficiency of the DLRM processing pipeline.
- We use open-sourced industrial DLRM traces to quantify the effectiveness of our optimizations. We find that PIFS-Rec outperforms an existing CXL-based system, Pond [26], by $3.89\times$ and a state-of-the-art comparable design, BEACON [33], by $2.03\times$ in terms of latency.

# II. BACKGROUND AND RELATED WORKS

# A. Deep Learning Recommendation Model (DLRM)

The end-to-end inference in DLRM involves several stages. Initially, the recommendation model is loaded into DRAM. Incoming input queries are grouped into batches, and the necessary dense and sparse features are organized for input into the DLRM. The inference step then processes these batches to generate predictions. In a DLRM architecture, four key stages can be identified: the bottom fully-connected layer (Bottom MLP), embedding lookup, feature interaction, and top fully-connected layer (Top MLP), as depicted in Figure 1.



Fig. 1. End-to-end DLRM pipeline for inference.

This architecture handles two types of input features: "dense features" which are continuous personal variables (e.g., age, gender), and "sparse features" which are categorical (e.g., product IDs, music genres). During the inference process, dense features undergo processing by the Bottom MLP. On the other hand, sparse features are transformed into "dense latent representation" in the embedding lookup stage. Each feature's value or indices are used to retrieve the corresponding "embedding vector" from a large table. Embedding vectors can be of different dimensions (e.g., 16B, 32B, etc.) in different setups; high embedding dimensions and the number of embeddings lead to large memory footprints. The outputs from the Bottom MLP and the embedding lookup are combined in the feature interaction layer to calculate interactions before being passed to the Top MLP for determining the click-through rate (CTR).
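The embedding lookup stage can be illustrated with a minimal pure-Python sketch. It assumes sum pooling over the rows selected by each sample, in the style of the SparseLengthSum (SLS) operator named later in the paper; the function name, the toy table, and the `indices`/`lengths` layout are illustrative assumptions, not the authors' implementation.

```python
def sparse_length_sum(table, indices, lengths):
    """Pool embedding rows per sample.

    table:   list of embedding vectors (one row per categorical value)
    indices: flat list of row indices for all samples, back to back
    lengths: number of indices that belong to each sample
    Returns one summed vector per sample.
    """
    dim = len(table[0])
    pooled = []
    offset = 0
    for n in lengths:
        acc = [0.0] * dim
        # Gather the rows for this sample and accumulate them element-wise.
        for idx in indices[offset:offset + n]:
            row = table[idx]
            for d in range(dim):
                acc[d] += row[d]
        pooled.append(acc)
        offset += n
    return pooled


# Toy table with 4 rows of dimension 2 (hypothetical values).
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
# Sample 0 looks up rows 0 and 2; sample 1 looks up rows 1, 3, and 3.
pooled = sparse_length_sum(table, indices=[0, 2, 1, 3, 3], lengths=[2, 3])
```

Each lookup touches a different, data-dependent set of rows, which is why this stage is dominated by memory traffic rather than arithmetic.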

#### B. CXL Overview

1) Conventional CXL: CXL operates as a transaction layer designed for rack-level memory pooling [26], building upon the physical layer of PCIe. PCIe 5.0 supports a data transfer rate of 32 GT/s per lane, translating to approximately 64GB/s when utilizing 16 lanes. As CXL uses PCIe's physical layer, unlike RDMA, it does not require a device's DMA engine or Network Interface Card (NIC). CXL encompasses three protocols: "CXL.io", "CXL.cache", and "CXL.mem". The CXL.io protocol configures and establishes connections between CPUs and CXL devices. In contrast, the CXL.cache (resp. CXL.mem) protocol enables a device to access the host CPU cache (resp. memory) and vice versa. These protocols enable three types of CXL devices: Type 1 (only cache, e.g., NIC), Type 2 (both cache and memory, e.g., GPU, accelerator, etc.), and Type 3 (only memory, e.g., memory expander).
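As a quick check of the quoted link rate, the raw per-direction bandwidth of a 16-lane PCIe 5.0 link follows from one bit per transfer per lane. This estimate ignores the 128b/130b line encoding and protocol overhead, so the usable figure is slightly lower:

$$32\ \text{GT/s} \times 16\ \text{lanes} \times \frac{1\ \text{bit}}{\text{transfer}} \times \frac{1\ \text{B}}{8\ \text{bit}} = 64\ \text{GB/s}$$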

Memory Expander. Type 3 devices, also known as CXL memory expanders [31], [32], are designed to increase *both* memory capacity [34] and bandwidth [27]. These conventional expanders include DDR memory and rely on the CXL.mem protocol for data storage and retrieval, as well as the CXL.io protocol for establishing the connections. The CXL.mem protocol enables CPU-to-device memory access, coordinated by a Home Agent (HA) in the host and a CXL controller in the device. The HA manages the CXL.mem protocol, presenting memory to the CPU as if it were on a remote NUMA node, allowing direct access via standard load/store commands.

**Memory Pooling.** CXL enables memory pooling where each host and end device can access any shared, unified memory space connected through CXL. This addresses memory stranding [26] and redundancy issues. To manage the data access flow within a memory pool, CXL employs an asymmetric cache protocol and introduces a "Bias Table" (4KB per table). This table operates in two modes: "host-bias" and "device-bias". In the host-bias mode, devices accessing addresses within CXL memory need control instructions to ensure data coherence, adding extra overhead. Conversely, in the device-bias mode, the region is locked for the device's exclusive use, preventing access by other hosts.

Fig. 2. Architecture of a CXL-based system. The devices use Flexbus to communicate with the host. The fabric manager configures the Virtual PCI-to-PCI Bridge (VPPB) to control the FM endpoints in the fabric switches. These switches connect all devices within the system. Data leaves the fabric switch through the PCI-to-PCI Bridge (PPB).

2) Fabric Switch: As shown in Figure 2, CXL introduces the "fabric switch" in version 2.0 [25] as part of its evolution from version 1.0 to 3.1. Note that the fabric switch is a compulsory, non-bypassable hardware component in a multi-node CXL interconnect. Unlike the one-to-one communication of earlier CXL versions (1.0/1.1), CXL 2.0 and later versions facilitate multiple-device communication and interconnectivity. The fabric switch functions as both a memory request dispatcher and a connected device manager. Each device is assigned a *cacheID* when recognized by the FM endpoint in the fabric switch.

**Process-In-Fabric-Switch.** To the best of our knowledge, in the context of CXL, BEACON [33] is the first work that adds compute capability to a fabric switch. Specifically, BEACON integrates "compute units" within the fabric switch to harness the processing capabilities close to the data source and the high bandwidth of the downstream port for accelerating genome analysis workloads. Our analysis reveals several areas where BEACON may not fully maximize the potential benefits of in-switch computation. From a hardware perspective, BEACON's design relies on custom DIMM instructions for CXL memory management, diverging from the established CXL standards. It also needs additional memory translation logic in the fabric switch, which can introduce performance overheads. Moreover, the system is not scalable as it does not support fabric switch scaling. It also does not take advantage of data locality. On the software front, existing work [26]–[28], [34] suggests that directly accessing CXL memory without careful management can detrimentally affect performance across various workloads, primarily due to higher data access latency compared to local DRAM. BEACON's standalone use of CXL memory, without integrating address interleaving with local DRAM, might result in suboptimal performance. Additionally, it focuses solely on single-host configurations, neglecting the complexities and opportunities brought by multi-host environments.

Fig. 3. A simplified illustration of the production-ready CXL-enabled experiment platform.

To address BEACON's shortcomings and maximize the potential of in-switch compute capability in the context of DLRM workloads, we undertake a comprehensive redesign encompassing *both* hardware and software support, establishing a new workflow and a new instruction flow. This effort results in the development of PIFS-Rec, which extends beyond DLRM workloads to cater to a broad range of applications with a focus on enhancing scalability, efficiency, and performance.

# C. Related Works

Numerous works address the acceleration ([15], [35]–[37]) and optimization ([38]–[40]) of DLRMs. Software-based approaches focus on techniques like feature-based resource allocation (CAFE [41]) and CPU optimizations for pre-fetching and overlapping computation with memory access ([8]). Hercules [42] provides an adaptive scheduler to deploy various DLRM models across datacenters with heterogeneous devices, considering multiple factors such as power budget, latency requirement, and throughput. In comparison, DisaggRec [43] explores the deployment of DLRM using hardware resource disaggregation to improve cost efficiency. Note that both of these solutions explore effective scheduling strategies targeting existing servers, while PIFS-Rec is a new hardware acceleration-based solution that targets scalability.

Hardware-based solutions leverage technologies like PIM [7], [10], [12], [44] for faster data processing within memory and specialized ASICs designed for DLRMs [13]. Furthermore, recent research also explores CXL for both characterization and optimization purposes. For example, Pond [26] focuses on memory pooling with CXL to increase scalability; TPP [28] improves CXL system performance with tiered memory page management; and studies like [27], [45] explore the potential performance gains from CXL. Our research is centered on leveraging fabric switches in the context of memory disaggregation over CXL.



Fig. 4. (a) Batch Threading – each batch is assigned to a CPU core to be processed. (b) Table Threading – each embedding table is accessed by a CPU core to be processed.

#### III. CHARACTERIZATION STUDY AND MOTIVATION

The DLRM inference process is both memory capacity and bandwidth bound – the working set size increases with different parameters (e.g., the number of embeddings, embedding dimension, batch size, and number of tables) while the parallel computation demands high memory bandwidth. To get more memory bandwidth along with larger capacity, one possible solution in today's server systems is to add more CPU sockets and populate memory channels in powers of two. However, this restricts flexibility and results in stranded memory resources [26]. As DRAM is a significant driver of infrastructure cost and power consumption [28], its excessive underutilization also leads to a high total cost of ownership (TCO). Further, the interconnect between the CPU sockets can be a bottleneck and significantly impact performance.

To illustrate this, we run the embedding table lookup phase of the DLRM inference process on a dual-socket AMD Genoa system (each socket having 96 physical CPU cores and 12 channels of DDR5 memory that populates a capacity of 768GB), with a representative DLRM trace. As shown in Figure 3, besides adding more CPU sockets, we can also increase the overall system's memory and bandwidth capacity by populating memory channels through CXL interconnects. Here, CXL memory is enabled through four channels of DDR4 memory resulting in memory capacity of 256GB. Including the CXL memory, the server has a total memory capacity of 1.8TB. Here, we configure the DLRM trace to use 192 tables with a batch size of 1024, and run embedding table lookup operation with different parallelization methods, namely, "batch threading" and "table threading" (detailed in Figure 4).
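The two parallelization methods differ only in which axis of the (batch, table) work grid is handed to a core. The sketch below makes that concrete; the round-robin assignment of batches or tables to cores is an assumption for illustration, since the paper does not specify the scheduling policy.

```python
def batch_threading(num_batches, num_tables, num_cores):
    """Each batch is owned by one core, which walks every table for that batch."""
    work = {core: [] for core in range(num_cores)}
    for b in range(num_batches):
        core = b % num_cores  # assumed round-robin placement
        work[core].extend((b, t) for t in range(num_tables))
    return work


def table_threading(num_batches, num_tables, num_cores):
    """Each embedding table is owned by one core, which serves every batch from it."""
    work = {core: [] for core in range(num_cores)}
    for t in range(num_tables):
        core = t % num_cores  # assumed round-robin placement
        work[core].extend((b, t) for b in range(num_batches))
    return work


# Toy configuration: 4 batches, 3 tables, 2 cores.
bt = batch_threading(4, 3, 2)
tt = table_threading(4, 3, 2)
```

In both cases every (batch, table) pair is processed exactly once; what changes is whether a core's accesses are spread across all tables or concentrated on a few.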

CXL allows flexibility in populating server memory capacity, which is restricted when expanding memory capacity through remote CPU sockets. To consume the full memory bandwidth, we must populate and access all the memory channels in the remote socket. When remote socket memory is accessed only partially, the effective memory bandwidth might be low. For example, Figure 5 (a)-(b) illustrates that accessing only 20% of the whole working set size from a remote CPU socket with DDR5 DIMMs invariably reduces the application bandwidth consumption. This significantly impacts performance, particularly with large embedding dimensions and large numbers of embeddings, where we observe up to 95% degraded performance. In contrast, when accessing the same amount of memory (i.e., 20% of the working set size) over CXL interconnects with DDR4 DIMMs instead of remote CPU memory, we observe a performance improvement of 5–30% (Figure 5 (c)-(d)). Note that CXL-attached DDR4 memory has a lower refresh rate than CPU-attached DDR5 memory. Also, CXL memory is CPU-less: we do not need additional CPUs to expand memory capacity. Consequently, CXL memory can consume less power than remote CPU sockets. Moreover, re-purposing earlier generations of DDR DIMMs can save datacenter TCO while augmenting performance.

As CXL adds bandwidth to the overall system, it can act as a "bandwidth expander" when CPU-attached memory channels are saturated. For DLRMs with large thread counts, batch sizes, embedding dimensions and sizes, both the working set size and memory bandwidth increase significantly. At some point, the CPU-attached memory channels get saturated and become the performance bottleneck. In such cases, extra bandwidth from CXL can enhance the application throughput. For example, as shown in Figure 6, when we increase the thread count from 16 to 32 and embedding table dimensions from 64 to 128, system bandwidth increases by 43%. Here, DDR4-based CXL memory improves application throughput by 28.5–38.9%, compared to the standalone CPU-attached DDR5 memory system.

The potential of CXL to improve system performance and efficiency is evident. Our findings also highlight several "limitations" of the current CXL interconnects. These include the risk of flex bus congestion under heavy memory traffic, increased data access latency (by over 4× compared to local DRAM [26], [28]), and limited bandwidth expansion capabilities. Such constraints can lead to substantial performance degradation in specific configurations, with the observed impact reaching up to 90% when the bandwidth is saturated. On the software front, our analysis suggests a promising strategy: distributing embedding tables across the available memory tiers. We tried different interleave ratios and empirically found that allocating 20% of the total working set size to CXL memory and the remaining 80% to local DRAM, i.e., a 4:1 interleave policy, yields a significant performance improvement (as shown in Figure 5 (e)-(f)). We can achieve up to a $9\times$ performance increase over configurations where all memory is allocated to CXL. Table threading scenarios can offer up to a $1.73\times$ performance boost compared to operations running solely on local DRAM. These results reveal that any method that relies solely on CXL memory (e.g., [33]) may restrict performance. Our experiments underscore the significant potential of CXL technology in enhancing scalability and performance.
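The 4:1 interleave policy can be sketched as a weighted round-robin over pages. This is only an illustration of the ratio: the page-granular, strictly periodic placement and the tier names are assumptions, and the paper does not state how the interleaving is implemented.

```python
def interleave_pages(num_pages, dram_weight=4, cxl_weight=1):
    """Assign pages to tiers in a repeating DRAM:CXL pattern (default 4:1)."""
    period = dram_weight + cxl_weight
    return [
        "DRAM" if (page % period) < dram_weight else "CXL"
        for page in range(num_pages)
    ]


# Ten pages under the 4:1 policy: two of them (20%) land on CXL memory.
placement = interleave_pages(10)
cxl_share = placement.count("CXL") / len(placement)
```

With this pattern, consecutive pages of an embedding table are spread across both tiers, so a stream of lookups draws bandwidth from local DRAM and CXL at the same time.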

Key takeaways from our characterization experiments are:

- Key Takeaway 1: While CXL memory enhances system scalability by offering more flexible memory configurations, its data retrieval latency is higher than DRAM. This adversely affects performance. To mitigate this, computation should happen close to memory to minimize the data transfer latency.
- Key Takeaway 2: The CXL memory can outperform remote CPU socket configurations but requires memory management strategies. Specifically, spreading memory between DRAM and CXL, coupled with careful page management, can significantly reduce performance degradation.

Fig. 5. The X-axis indicates the embedding table size and Y-axis indicates the normalized application bandwidth. (a)-(b) The addition of CPU sockets can address the scale-up issue of memory-bound embedding table lookup operations at the cost of high-performance overhead. (c)-(d) CXL memory can provide better performance over remote CPU sockets. However, simply replacing CPU-attached memory with CXL memory causes performance overheads during high memory traffic over CXL. (e)-(f) Software interleaving during page allocation improves performance through CXL's bandwidth expansion.

Fig. 6. The CXL bandwidth contribution to the system with different workload configurations.

We believe, from a hardware perspective, PIFS is a suitable solution to address these issues. First, placing a fabric switch closer to the memory reduces data movement significantly compared to traditional host-centric models. Second, unlike the PNM/PIM solutions, PIFS does *not* require modifications to existing CXL devices, thus maintaining compatibility. We can also optimize *both* memory usage and system performance by integrating PIFS hardware with effective page management with DRAMs. Based on these considerations and observed characteristics, we propose **PIFS-Rec**.

#### IV. SYSTEM DESIGN

PIFS-Rec scales the embedding tables and accelerates the embedding operations, known as SparseLengthSum (SLS), by leveraging device I/O level parallelism, and facilitates processing near memory by executing computations within a fabric switch. PIFS-Rec features a minimalist hardware architecture with specialized computation tailored to support the SLS family of inference operators. This special-purpose computation logic is localized to the Process Core (PC). Additionally, PIFS-Rec includes an enhanced memory controller (FM endpoint Extension) that extends the functionality of the Fabric Manager (FM) endpoint within the fabric switch. Throughout this paper, we refer to this component as the "memory controller".

Figure 7 shows the process flow (§IV-A2) and a new instruction (§IV-A3) to minimize modifications to standard CXL-based DRAM devices (Type 3) and mitigate additional overhead. Considering the "localized nature" of embedding table access patterns, our framework explores employing an on-switch buffer (§IV-A4) to enhance overall system efficiency. We also implement an "out-of-order engine" (§IV-A5) to prevent pipeline stalling during DLRM row access accumulation. Additionally, we enhance the software architecture (§IV-B) through an optimized page management and migration process to complement our hardware design. Furthermore, we discuss scaling up multiple PIFS-Rec interconnections (§IV-C) by introducing multi-layer forwarding and necessary modifications to support this growth. These optimizations draw upon empirical insights from our workload characterization in a CXL hardware-ready memory system and prior research [7], [8], [10], [14], ensuring a grounded and practical approach to tackle the complexities of modern recommendation systems.

# A. Hardware Architecture

One of our objectives is to keep the fabric switch lightweight with minimal hardware and software modifications, ensuring cost-effective deployment and compatibility. We make several changes to the conventional fabric switch – from a hardware perspective, we design the processing core to passively receive instructions from the host and operate exclusively on physical addresses, eliminating the need for a softcore or host presence within the fabric switch. With this, we do *not* need to modify the hardware or software on CXL end devices, which facilitates seamless integration to existing Type 3 devices.

1) System Overview: PIFS-Rec is integrated within the fabric switch, as depicted in Figure 7. The PC (Process Core), a hardware component within the fabric switch, facilitates this integration. The memory controller for PIFS-Rec is an FM endpoint extension with an enhanced memory indexing unit. Communication between the host-side CXL controller HA (Home Agent) and PIFS-Rec occurs via CXL-based instructions through the CXL interface using the PCIe PHY. PIFS-Rec returns the accumulated embedding table row access results to the host. Regular CXL-based instructions are decoded by the FM endpoint extension and forwarded to the corresponding CXL devices with the modified instructions. By locating the logic within the fabric switch, PIFS-Rec can issue "concurrent requests" to parallel CXL devices and efficiently utilize bandwidth across multiple memory channels. The embedding table region is designated as a device-bias region.



Fig. 7. Overview of Process-In-Fabric-Switch (PIFS) for DLRMs architecture, which includes Process Core (PC), accumulate configuration logic, accumulate configuration register, and a mechanism for instruction ingress registry.



Fig. 8. (a) Instruction flow of PIFS-Rec. The valid signal indicates a successful retrieval of data. (b) The host gets the result from the fabric switch.

2) Process Flow: Previous work [33] introduces customized DIMM-based instructions and an independent CXL workflow, diverging from the current CXL protocol standard [25]. To avoid complete hardware and software stack changes, we design the fabric switch from scratch, maintaining compatibility with CXL memory and avoiding major modifications to the CXL host-device control protocol.

**Instruction Flow.** In BEACON [33], the computational logic within the fabric switch initiates memory requests. However, bypassing the host in a cloud-based inference system presents challenges, as each query might access different row candidates. Hence, the host must relay essential memory address information to the fabric switch for accurate memory access. The PC decodes instructions from the host and issues memory fetch requests to specific CXL memory devices. In Figure 8, during row accumulation access, the host issues a standard CXL.mem {M2S} request [25] to the fabric switch (step 1), while reserving a memory address in the LLC or a specific CXL cache region and transmitting it to the fabric switch's process core register. Upon receiving a memory request from the host, the memOpcode checker examines the instruction's memory operation (memOpcode) field. If the instruction is standard, it bypasses the process core and is sent directly to the VCS. Otherwise, it is routed to the process core for further handling. After receiving CXL instructions via the interface, the process core decodes the instruction and proceeds with instruction repacking. This repacking modifies two instruction fields. First, for requests initiated by the host as read requests, memOpcode is modified to transform them into standard read requests with data directed toward the CXL memory (step 2). From the host (CPU) point of view, it issues the memory read request, but the request that actually reaches memory originates from the fabric switch. However, the host still acts as a "monitor": if the address is polluted or invalid, the host will detect it and inform the application or runtime. Second, the repacking alters the *SPID* (the ID of the device that initiated the request) from the host to the fabric switch, ensuring that the retrieved data are stored in the fabric switch. Once the data are retrieved from the memory and sent to the fabric switch (step 3), the process core dispatches a control signal back to the host (step 3), indicating successful data retrieval.
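The memOpcode check and the two-field repacking described above can be sketched as follows. The opcode values, device IDs, and the `Request` layout are hypothetical placeholders chosen for illustration; they are not taken from the CXL specification or from the paper.

```python
from dataclasses import dataclass, replace

# Hypothetical encodings, for illustration only.
MEM_RD = 0x1       # stands in for a standard CXL.mem read opcode
MEM_RD_ACC = 0xA   # stands in for the accumulation-read opcode
HOST_ID = 0        # assumed SPID of the host
SWITCH_ID = 7      # assumed SPID of the fabric switch


@dataclass(frozen=True)
class Request:
    mem_opcode: int
    spid: int      # ID of the device that initiated the request
    address: int


def route(req):
    """Mimic the memOpcode checker followed by process-core repacking.

    Returns (path, request), where path is "bypass" for standard requests
    and "repacked" for requests handled by the process core.
    """
    if req.mem_opcode != MEM_RD_ACC:
        # Standard instruction: skip the process core, forward unchanged.
        return "bypass", req
    # Field 1: turn the request into a standard read.
    # Field 2: make the switch the source so the data returns to the switch.
    repacked = replace(req, mem_opcode=MEM_RD, spid=SWITCH_ID)
    return "repacked", repacked


standard = route(Request(MEM_RD, HOST_ID, 0x1000))
accumulate = route(Request(MEM_RD_ACC, HOST_ID, 0x2000))
```

The memory device only ever sees a standard read, which is what lets unmodified Type 3 devices serve the repacked request.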

**Asynchronous Communication.** As mentioned earlier, the fabric switch's processing core initiates data accumulation. When the embedding tables interleave between local DRAMs, remote DRAMs, and Type 3 devices, the host computes the SumCandidateCounter for each request to accumulate rows. The host first identifies the memory addresses of all related row vector candidates (using the data\_ptr() API in PyTorch). It then uses the memory addresses to determine the location of each row vector, checking whether it is in the local DRAM or elsewhere (using the move\_pages() API in numactl). Subsequently, it calculates the SumCandidateCounter by tallying the number of vectors not stored in the local DRAM. Note that SumCandidateCounter is configured into the fabric switch using the instruction. Specifically, the PC decrements the counter by 1 each time it accumulates a row candidate. The process is considered complete when the SumCandidateCounter reaches 0. Upon completion of the accumulation process, the accumulated result is transmitted to the previously reserved memory address of the host with CXL.cache {D2H} (step 4) through the egress queue. The host continuously monitors (snoops) the designated address using the standard CXL snooping mechanism. Upon detecting a change in the memory location, it recognizes that the data at this location represents an accumulated result. It then retrieves the accumulated data for further processing.
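The counter protocol can be sketched in two halves: the host counts the rows that live outside local DRAM, and the switch decrements that count as rows arrive. In this sketch the per-row NUMA node list is passed in directly as a stand-in for the result of the move\_pages() query, and the class and function names are hypothetical.

```python
def sum_candidate_counter(row_nodes, local_node=0):
    """Host side: count row vectors that do not reside in local DRAM.

    row_nodes holds the NUMA node of each candidate row (assumed to have
    been obtained beforehand, e.g. from a page-location query).
    """
    return sum(1 for node in row_nodes if node != local_node)


class AccumulateState:
    """Switch side: one in-flight accumulation tracked by its counter."""

    def __init__(self, counter, dim):
        self.counter = counter
        self.acc = [0.0] * dim

    def on_row(self, row):
        """Accumulate one arriving row; return True when the sum is complete."""
        for d, value in enumerate(row):
            self.acc[d] += value
        self.counter -= 1
        return self.counter == 0


# Five candidate rows; three of them are not on the local node 0.
counter = sum_candidate_counter([0, 2, 2, 0, 1])
state = AccumulateState(counter, dim=2)
done_flags = [state.on_row(r) for r in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])]
```

Once `on_row` reports completion, the switch would write `state.acc` to the reserved host address, where the host's snoop picks it up.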

3) Instruction Modification: To implement the described mechanism (§IV-A2), modifications have been made to the instruction set (CXL 3.0), as depicted in Figure 9. Specifically, the memory opcode within the Request (Req) instruction serves dual purposes: it can either initiate a request for row vector data or configure the Accumulation Configuration Register (ACR). For row vector data access, the instruction includes a *sumtag* that designates the accumulation cluster to which it belongs and specifies the vector size. Conversely, if the instruction is intended for configuration, it conveys the number of row vectors needed for a particular row accumulation (SumCandidateCounter). In this case, the address field is re-purposed to specify the location reserved for the accumulated result, which is then set within the ACR. The minimum data granularity managed is 16B, while the row vector size can vary [14] from 16B to 64B or 128B [7], [8]. The vectorsize field indicates the number of data chunks that form a row access, supporting 8 different row vector size configurations with a 3-bit width using binary coding. Considering the CXL standard's slot size limitation of 16 bytes [25], the weight (an FP32 weight occupies 32 bits) and other extra information are allocated within the data slot field. When a memory fetch based instruction arrives at the PC, it is stored in the Instruction Ingress Registry (IIR). New data arriving from the CXL memory at the fabric switch is indexed in the IIR, and the corresponding instruction is retrieved by comparing the address field. This instruction is then forwarded to the instruction decoder, which sets up the ACR based on the instruction's fields. Each new row accumulate request from the host is assigned a *sumtag*; each new request increases the capacity counter, and each finished request decreases it. If the ACR hits its capacity limit (CapacityCounter), the system imposes back-pressure on the upstream modules until space is freed. This cycle continues until all data elements are processed, culminating in the dispatch of the result to the host.

Fig. 9. Blue chunks indicate modified/added fields and the SPID modification by the fabric switch.
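To make the vectorsize field concrete, the sketch below encodes a row size as a count of 16B chunks in 3 bits. The paper states the 16B granularity, the 3-bit width, and the 8 supported configurations; the specific mapping used here (code k means k+1 chunks, covering 16B to 128B) is an assumption.

```python
CHUNK_BYTES = 16   # minimum data granularity stated in the text
FIELD_BITS = 3     # width of the vectorsize field


def encode_vectorsize(row_bytes):
    """Map a row size in bytes to a 3-bit code (assumed: code = chunks - 1)."""
    if row_bytes % CHUNK_BYTES != 0:
        raise ValueError("row size must be a multiple of 16B")
    chunks = row_bytes // CHUNK_BYTES
    if not 1 <= chunks <= 2 ** FIELD_BITS:
        raise ValueError("row size does not fit in the 3-bit field")
    return chunks - 1


def decode_vectorsize(code):
    """Inverse mapping: 3-bit code back to row size in bytes."""
    if not 0 <= code < 2 ** FIELD_BITS:
        raise ValueError("code out of range for a 3-bit field")
    return (code + 1) * CHUNK_BYTES


codes = [encode_vectorsize(size) for size in (16, 64, 128)]
```

Under this assumed mapping the eight codes cover every multiple of 16B up to 128B, which matches the range of row sizes quoted in the text.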

4) On-Switch Buffer: The on-switch buffer in PIFS-Rec utilizes on-chip SRAM and acts as a "cache" to store frequently accessed "hot content". Unlike prior works that use buffers for queue management or traffic shaping [46]–[49], our buffer is specifically designed to exploit the temporal locality observed in specific embedding tables where vectors are frequently reused [7]. Conventional prefetching [8] strategies are less effective due to the irregular temporal patterns exhibited by row accesses, potentially degrading system performance by consuming available bandwidth budget or displacing vectors prematurely. RecNMP [7], a PNM-DIMM-based solution, explored DIMM caching to reduce latency by leveraging data locality. We integrated an on-switch "buffer" to exploit the reuse of embedding vectors. Fetching a single address from memory pools can take up to 270 ns, with approximately 37% attributed to frequent CXL I/O port transfers and retimer delays, as per profiling [26]. Serving hot vectors from the on-switch buffer reduces this latency by minimizing wire transfers and reducing CXL I/O port overhead [26]. Distinct from traditional strategies like LRU or FIFO, PIFS-Rec employs a strategy, Hottest Recording (HTR), akin to RecNMP [7]. An address profiler logs and ranks frequently accessed row vectors, curating the cache to retain the highest-priority candidates based on access frequency. Managed by the FM endpoint extension on the fabric switch, this memory region is inaccessible and unmanageable by the host.
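A frequency-ranked buffer of this kind can be sketched as follows. The text does not specify the HTR admission or eviction rule, so the policy below (admit a missed address only when its recorded count exceeds that of the coldest resident entry) and all names are assumptions for illustration.

```python
from collections import Counter


class HottestRecordingBuffer:
    """Toy frequency-ranked buffer; the admission rule is an assumption."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.freq = Counter()   # address profiler: access count per address
        self.resident = set()   # addresses currently held in the buffer

    def access(self, addr: int) -> bool:
        """Record one access and return True on a buffer hit."""
        self.freq[addr] += 1
        hit = addr in self.resident
        if not hit:
            self._maybe_admit(addr)
        return hit

    def _maybe_admit(self, addr: int) -> None:
        if len(self.resident) < self.capacity:
            self.resident.add(addr)
            return
        # Find the least frequently accessed resident entry.
        coldest = min(self.resident, key=lambda a: (self.freq[a], a))
        # Replace it only if the newcomer has become strictly hotter.
        if self.freq[addr] > self.freq[coldest]:
            self.resident.remove(coldest)
            self.resident.add(addr)
```

Unlike LRU, a single access to a cold address does not displace a resident hot vector; the newcomer must first accumulate a higher access count.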

5) Out-of-Order Accumulation: In PIFS-Rec, the accumulation operations are processed in the accumulate logic unit. Here, we optimize existing data management solutions on computational logic, leveraging insights from previous research by [7], [10], [12]. In scenarios involving multiple hosts or devices, batch requests from various hosts can trigger numerous accumulation requests to different devices. However, access congestion [50] at frequently-used memory I/O ports may cause delays in the arrival of row data in the embedding table, as observed by [51], [52].

Eliminating Hardware Stalls. We do not solely rely on hardware parallelism such as deploying multiple Near Data Processing (NDP) units [33], due to two main constraints. Firstly, the extensive computational logic required on the fabric switch demands a significant amount of resources, limiting scalability. Secondly, the system's throughput is ultimately constrained by the number of parallel compute units, potentially leading to stalls once this limit is reached. To overcome these limitations, we introduce an "out-of-order" compute approach, enabling immediate data processing upon arrival of data belonging to the same accumulation request. In case of incoming data corresponding to a different request, the system transfers the accumulated intermediate result from the accumulation register to a swap register during the first half of the clock cycle, allowing for processing of the new data in the subsequent half. The swap region is shared among multiple processing cores and accumulation logic, which ensures efficient data handling. Note that the SRAM in the switch buffer can also hold the intermediate result when the swap register is full. However, accessing data from the SRAM in the switch buffer requires at least two clock cycles, potentially causing stalls. In our current approach, we make this function configurable through the Functional Configuration Register (FCR).
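The swap-register idea can be modeled functionally as below. This is a behavioral sketch only: it ignores clock-cycle timing, row weights, and the SRAM fallback, and the class and attribute names are hypothetical.

```python
class OutOfOrderAccumulator:
    """Behavioral model: one accumulation register plus a swap area keyed by sumtag."""

    def __init__(self, expected):
        self.remaining = dict(expected)  # sumtag -> rows still expected
        self.active_tag = None           # request held in the accumulation register
        self.acc = None                  # current partial sum
        self.swap = {}                   # sumtag -> parked partial sum
        self.swaps = 0                   # how often a partial sum was parked

    def push(self, tag, vec):
        """Accumulate one row; return the final sum when the request completes."""
        if tag != self.active_tag:
            if self.active_tag is not None:
                # Park the in-flight partial sum instead of stalling.
                self.swap[self.active_tag] = self.acc
                self.swaps += 1
            self.acc = self.swap.pop(tag, None)
            self.active_tag = tag
        if self.acc is None:
            self.acc = list(vec)
        else:
            self.acc = [a + b for a, b in zip(self.acc, vec)]
        self.remaining[tag] -= 1
        if self.remaining[tag] == 0:
            result = self.acc
            self.acc, self.active_tag = None, None
            del self.remaining[tag]
            return result
        return None
```

Rows of two interleaved requests can then be pushed in arrival order, and each request still produces its own correct sum.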

# B. Software Architecture

From our characterization study (§III), we find that for DLRM, utilizing all the available memory channels simultaneously provides the best performance in a CXL-enabled system. Considering this, our software architecture incorporates the following design principles: (a) as the CPU-attached local memory node has the lowest access latency and comparatively higher bandwidth than CXL memory, the hottest (most frequently accessed) pages should reside on the local memory tier; (b) when we need to access CXL memory, spreading the data across multiple CXL nodes lets us parallelize better and utilize the bandwidth across all the channels; (c) as page migration is a frequent event in a tiered memory system, optimizing it can significantly enhance the overall system's performance.

1) Page Granular Access: In DLRM, the dimension of an embedding vector can be very small (e.g., typically between 16B and 128B). As CXL supports cache-line granular access, we can consider the embedding dimension to be the granularity of memory access and efficiently identify the hot-cold rows to perform fine-grained vector embedding management. However, the metadata management overhead will be high. On the other hand, a single OS page (e.g., typically, a 4KB-sized page) can contain multiple row vectors (e.g., 256 embeddings of 16B size). It is possible that all the embeddings within a page may not be accessed simultaneously, which will cause amplification of data movement. However, even with this caveat, in our system, same as previous work [26]-[28], we opt to manage memory placement at page granularity, as page-granular metadata management and migration is supported by and compatible with the current OS. Hot-cold detection also happens at page granularity.

2) Global Hotness Detection: We provide a unified memory architecture where all the hosts can access pages across the system. When a host accesses a page frequently, we identify it as the hottest one and place it in its local DRAM (we call it "Private Hot Region") (Figure 10 (a)). As CXL has higher latency (around 100ns extra over local DRAM), we put the relatively cold pages in the CXL memory address space (we call it "Public Cold Region"), which is shared between all the connected hosts. To identify the global page temperature, each host monitors the access frequency of a page across all devices: the most frequently accessed pages within a device are categorized as "hottest" while the least frequently accessed pages are categorized as "coldest". After generating the page heatmaps of all devices, the host compares them and thereby finds the most frequently accessed pages across all devices and stores them in its private hot region. If a host identifies a page



Fig. 10. (a) Page migration and management. The socket on same board can access remote DIMM using the board-level interconnect, but it needs fabric switch to access remote DIMM on another board. (b) In the worst case, memory requests are not localized across devices.



Fig. 11. From CXL 2.0 to 3.0. PIFS-Rec supports scale-up with multi-host scenarios (T3: Type 3 memory devices).

already designated as a private hot page by another host, it selects its next most frequently accessed page as its private hot page. Remote hosts access memory from another host's private hot region over the flex bus, incorporating an accumulation process within the fabric switch. If a host retrieves a row vector from local memory, accumulation happens locally, although it is capable of receiving (but only partially processing) the accumulated results. Every host periodically reclassifies hot private pages as public cold pages if their access frequency exceeds the least accessed private hot page's access frequency by more than "cold\_age\_threshold" (by default, 20%).
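The conflict rule for private hot pages (a host skips a page already claimed by another host and takes its next most frequently accessed page) can be sketched as follows. The fixed host ordering used to resolve conflicts and the function name are assumptions; the text does not state how simultaneous claims are ordered.

```python
def assign_private_hot_pages(access_counts, slots_per_host):
    """Pick each host's private hot pages.

    access_counts: {host: {page: access_count}}
    Returns {host: [pages]} with no page assigned to two hosts.
    """
    claimed = set()
    assignment = {}
    # ASSUMPTION: hosts claim pages in a fixed (sorted) order.
    for host in sorted(access_counts):
        ranked = sorted(access_counts[host].items(),
                        key=lambda kv: (-kv[1], kv[0]))  # hottest first
        picks = []
        for page, _count in ranked:
            if len(picks) == slots_per_host:
                break
            if page not in claimed:  # skip pages owned by another host
                claimed.add(page)
                picks.append(page)
        assignment[host] = picks
    return assignment
```

For two hosts whose hottest page is the same, the second host falls back to its next hottest page, as described above.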

3) Embedding Spreading for Bandwidth Optimization: In our architecture, we address the potential bottleneck scenario due to disproportionate demand on specific memory devices. As illustrated in Figure 10 (b), despite dedicated processing threads and specialized task allocation, the system's bandwidth may not be fully optimized if a particular memory node consistently handles most data requests. To address this, we introduce a simple yet effective adaptive "page migration strategy" to ensure the maximal utilization of channel capabilities across the system. The objective is to redistribute the workload more evenly across the available memory nodes to alleviate the bottleneck and optimize the bandwidth usage.

When we place the cold pages in the "Public Cold Region," we initially spread them across the CXL memory nodes through the interleave policy. At a later point, if a CXL memory node becomes warm, i.e., the memory access count for a node exceeds the average access count for other nodes by "1 - migrate\_threshold" (by default, 35%), we initiate the page redistribution process for that particular memory node. This process entails transferring the most accessed pages from the overburdened memory node to the least accessed one. If the destination node is out of capacity, we also move the coldest page of that device to the overburdened memory node. Therefore, the page with the second-highest access frequency on the overburdened node becomes the new hottest page for that node. Similarly, if the cold page is moved, the coldest page of the least frequently accessed memory node also gets redefined. We re-iterate the procedure across all the memory nodes until the access frequency gets balanced.
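The redistribution loop can be sketched as below. Several simplifications are assumptions rather than the paper's method: the trigger is read as "node load exceeds the average load of the other nodes by more than 35%", node capacity (and hence the cold-page swap) is ignored, and an extra stop condition prevents a page from bouncing between two nodes.

```python
def rebalance(nodes, migrate_threshold=0.35, max_moves=1000):
    """Greedy embedding spreading over CXL memory nodes.

    nodes: {node: {page: access_count}}, mutated in place.
    Returns the list of (page, source_node, destination_node) moves.
    """
    moves = []
    if len(nodes) < 2:
        return moves
    for _ in range(max_moves):
        load = {n: sum(pages.values()) for n, pages in nodes.items()}
        hot = max(load, key=lambda n: (load[n], n))
        others = [load[n] for n in load if n != hot]
        avg_others = sum(others) / len(others)
        # ASSUMPTION: a node is "warm" when it exceeds the others' average by 35%.
        if load[hot] <= avg_others * (1 + migrate_threshold):
            break
        cold = min(load, key=lambda n: (load[n], n))
        page = max(nodes[hot], key=lambda p: (nodes[hot][p], p))
        count = nodes[hot][page]
        # Stop if moving the hottest page would merely shift the hotspot.
        if load[cold] + count >= load[hot]:
            break
        nodes[cold][page] = nodes[hot].pop(page)
        moves.append((page, hot, cold))
    return moves
```

Starting from one node that serves most requests, the loop moves that node's most accessed pages to the least loaded nodes until no node exceeds the trigger.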

4) Optimization in Page Migration: Existing work [28], [53], [54] usually focuses on page-level migration due to software compatibility and OS support. However, page migration in a live inference system can stall query processing due to migration overhead and data inaccessibility. When a page is migrated, the OS typically marks it as non-accessible (page block). In that case, for a row vector dimension of 64B, a 4KB-sized page migration will block access to all the 64 row vectors residing within that page. To address this, although we use page-granular memory management to reduce metadata overhead, during migration, we leverage CXL's cache-line granular memory access feature.

We enhance the peer-to-peer (P2P) communication mechanism [25] with the support of the "Migration Controller" (MC) in the FM endpoint extension. Once the OS triggers migration, instead of copying the whole page, the host migrates at cache-line granularity. So, while the page is being migrated, locking a particular cache line does not restrict the accessibility of the remaining cache lines. During this process, the cache line is not stored back in the secondary memory or returned to the host; instead, it is stored in a temporary location in the switch (cache-line block). This optimization reduces the overhead by up to $5.1\times$ over the OS's page-granular migration process.

#### C. Fabric Switch at Scale

As shown in Figure 11, fabric switches play a crucial role in scaling out, connecting hosts and devices into a "unified fabric" for coherent memory sharing and device communication. In a scaled-out CXL environment, multiple hosts connect to a non-tree-shaped CXL fabric switch, facilitating connections to shared Type 3 memory devices or other hosts. This configuration enables a "distributed computing" model, distributing computational and memory resources across multiple nodes. Notably, while [33] focuses on single fabric switch computation, our work demonstrates how multiple fabric switches can communicate and collaborate.

1) Multi-layer Instruction Forwarding: In a simplified scenario, we assume each fabric switch has a process core. PIFS-Rec enables vector accumulation to be executed directly on the remote fabric switch (close to the Type 3 memory device), thereby conserving significant network interconnect bandwidth. Consequently, instruction repacking is confined to the remote fabric switch that handles its local memory. The fabric switch tracks the row candidate requests sent to and received from other fabric switches. Its scheduler reads the *sumtag* field and corresponding memory address, recording the number of memory requests for specific row access to each remote fabric switch, and sends the new Sub-SumCandidateCount to replace SumCandidateCount. This mechanism ensures the correctness of data exchange and accumulation processing. When a fabric switch positioned near the local host (the host that issues the row access request) receives results from a remote fabric switch, there is a possibility that some candidates from other nodes have not yet arrived. The forward controller in the fabric switch monitors the SumCandidateCount to address such situations. The remote fabric switch transmits its Sub-SumCandidateCount back to the local switch, which then uses this information to determine whether to forward the accumulated result to the host (once all candidates have been processed), wait for missing data from other nodes, or discard the result if errors occurred during data transfer.
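The bookkeeping performed by the scheduler and forward controller can be sketched as follows. The function names, the `owner_of` address-to-switch mapping, and the rule that any count mismatch is treated as a transfer error are assumptions for illustration.

```python
from collections import Counter


def split_candidates(row_addresses, owner_of):
    """Count row requests per owning switch (the per-switch sub-counts)."""
    return Counter(owner_of(addr) for addr in row_addresses)


def forward_decision(expected, reported):
    """Decide what the local switch does with the partial results received so far.

    expected: {switch: sub-count sent out}
    reported: {switch: sub-count returned with that switch's partial result}
    Returns "forward", "wait", or "discard".
    """
    for switch, count in reported.items():
        # ASSUMPTION: an unknown switch or a count mismatch signals an error.
        if switch not in expected or count != expected[switch]:
            return "discard"
    if set(reported) == set(expected):
        return "forward"   # every remote switch has answered correctly
    return "wait"          # some candidates have not arrived yet
```

The local switch would re-evaluate the decision each time another remote partial result arrives.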

2) Versatility: The framework can also work with fabric switches without processing cores. During the initial setup and configuration phase, the local fabric switch must identify remote fabric switches' lack of processing capabilities. The scheduler will read a 1-bit CNV (Compute Node Valid) field for each fabric switch during the configuration process. If it is determined that a remote fabric switch lacks processing power, the local fabric switch will undertake all operations and instruction repackaging tasks by itself.

#### D. Programming-Related Aspects

PIFS employs an easy-to-use, OpenCL-like heterogeneous computing programming model similar to the one adopted in the previous work [7]. More specifically, PIFS-Rec provides SLS APIs that can be called from user-space, allowing integration with mainstream frameworks like PyTorch. Users can utilize this function call to accelerate specific types of DLRMs and all ML models that require sparse network embedding table accumulation. We aim to reduce the programmer's burden by abstracting as much information as possible. When allocating memory, users must supply the embedding table file as a parameter, along with the number of embeddings and vector size. Additional parameters, including batch size, indices, offset, or length, are also required. After receiving the information, the PIFS kernel allocates memory space from the CXL memory pool using numactl mapping information. The allocated memory region will be implicitly defined as host-biased using the OpenCL API, e.g., clEnqueueSVMUnmap. Each function call also allocates an address that points to the output result and pins the address. It generates the embedding table iterative accumulation codes using parameters such as length, batch size, and indices. Simultaneously, the host allocates an embedding table based on the CXL-recognized memory space. The CXL-supported CPU compiles and generates CXL instructions with the corresponding MemOpcode field by reading the generated PIFS kernel code. The fabric switch then begins to receive the instructions and passively starts the computation. Meanwhile, a daemon process on the host starts snooping the result address and monitoring the process's integrity for each called function. While instructions are being processed in the fabric switch, the memory indexing function directs data to the corresponding devices and retrieves the data. When page migration is triggered and certain pages are mapped to CXL memory, the bias table flip function hooks into page migration() and marks it as a device-biased page using APIs (e.g., clEnqueueSVMMap), informing the host and the fabric switch, changing the candidate count number, and indexing data to a new location (the rest of the workflow is described in §IV-A2).
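For reference, the operation that the SLS APIs offload can be written in a few lines of plain Python. This is a software reference for the semantics of a (optionally weighted) sparse-lengths sum, not the PIFS-Rec API; the function name and argument layout are assumptions modeled on the parameters listed above (indices, lengths, optional weights).

```python
def sparse_lengths_sum(table, indices, lengths, weights=None):
    """Reference SLS: sum groups of embedding rows.

    table:   list of embedding rows (each a list of floats)
    indices: flat list of row indices for all bags in the batch
    lengths: number of indices that belong to each bag
    weights: optional per-index scaling factors
    """
    if sum(lengths) != len(indices):
        raise ValueError("lengths must cover all indices")
    if weights is None:
        weights = [1.0] * len(indices)
    dim = len(table[0])
    out, pos = [], 0
    for n in lengths:
        acc = [0.0] * dim
        for k in range(pos, pos + n):     # rows of the current bag
            row, w = table[indices[k]], weights[k]
            for d in range(dim):
                acc[d] += w * row[d]
        out.append(acc)
        pos += n
    return out
```

Each output row corresponds to one accumulation request, and its bag length plays the role of the number of row vectors to be accumulated.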

#### V. DISCUSSION

Workload Generality. Our proposed framework for PIFS is highly adaptable, facilitating its usage across various practical workloads (e.g., KV aggregation [55], MapReduce [56]). This adaptability is achieved by substituting the compute logic and DLRM-specific registers with components tailored to the new workload. Unlike BEACON [33], which introduces optimizations specific to genome analysis, our proposed optimizations are designed to be applicable across new workloads without modifying the proposed instruction flow and process flow.

Hardware Compatibility. PIFS-Rec is designed to *complement*, *not replace*, PNM/PIM-based approaches. PIFS-Rec is compatible with hardware that includes embedded cores on DIMM, such as RecNMP [7], which addresses bandwidth limitations by using intra-DIMM bandwidth. Integrating such technologies increases operating system complexity, demanding runtime and compiler adaptations for CXL-based heterogeneous computing. This integration requires significant software and hardware modifications, including the adoption of CXL-compatible and core-enabled DIMMs. Addressing these challenges will require extensive research in operating system support, memory management, and hardware innovations.

TABLE I
MODEL PARAMETERS IN THE SCOPE OF THIS STUDY.

| Name | Emb. Num | Emb. Dim | Bottom-MLP        | Top-MLP   |
|------|----------|----------|-------------------|-----------|
| RMC1 | 16384    | 64       | 256-128-128       | 128-64-1  |
| RMC2 | 131072   | 64       | 1024-512-128      | 384-192-1 |
| RMC3 | 1048576  | 64       | 2048-1024-256     | 512-256-1 |
| RMC4 | 1048576  | 128      | 2048-2048-256     | 768-384-1 |

# VI. IMPLEMENTATION AND EVALUATION

# A. Setup

We adhered to the methodology outlined in previous studies [7], [33]. As described in Table II, we conducted cycle-level memory simulations using Ramulator 2.0 [57] for a detailed evaluation of PIFS-Rec. We wrapped Ramulator 2.0 into our simulator; the top module includes cycle-accurate processing logic for the Process Core, FM Endpoint Extension, Instruction Repacking, and MemOpcode Checker, with a

TABLE II
HARDWARE CONFIGURATION.

| DRAM Configuration             |                           |
|--------------------------------|---------------------------|
| DIMM Capacity                  | 64 GB per DIMM            |
| DIMM Channels/Ranks            | 4/2                       |
| Frequency (MHz)                | 4800                      |
| Timings (CL-RCD-RP-RAS)        | 28-28-28-52               |
| tRC/tWR/tRTP                   | 79/48/12                  |
| tCWL/nRFC1/tCK_ps              | 22/30/625                 |
| CXL Configuration              |                           |
| Fabric Switch Downstream Ports | 64GB/s ×16                |
| Fabric Switch Buffer R/W Speed | 0.91-4.19 ns/0.91-4.17 ns |
| CXL Access Penalty over DRAM   | 100 ns [28]               |

top-module clock tick period of 1 ns/clk. We introduced additional latency (ticks) for data directed to the CXL memory to accurately simulate performance impacts, considering its inherently higher access latency than on-switch DRAM (see Table II). We use the open-source Meta traces [58] and models (Table I) for reproducibility. A "lookup table" was developed to facilitate address indexing and mapping logic, directing the memory footprint to either CXL memory or an on-switch buffer. This table is used to record memory access and I/O patterns. Our comprehensive latency evaluation considered several critical factors, such as the additional DRAM cycles required for initializing accumulation counters, the latency introduced by the fabric switch, and the time needed to transfer the final computed sum back to the host system; we extracted the performance from top-module synthesis.

#### B. Baselines

We selected several previous works as "baselines" to highlight the state-of-the-art in various aspects of memory pooling and processing architectures. Pond [26] introduces a straightforward CXL-based memory pooling approach with OS support, emphasizing simplicity in design. We add our PM (page management) optimization to Pond, denoted as "Pond + PM", to demonstrate the performance of the software optimization independently. In comparison, BEACON [33] presents a PIFS architecture to accelerate DNA computation. In our evaluations, we modified its compute logic to process only vector accumulation. Since our main workload is DLRM inference, we implemented BEACON without algorithm-specific optimizations (BEACON-S). RecNMP [7] is a DIMM-based hardware solution that accelerates SLS operations. We implemented the design using their computational hardware configuration with our memory setting. We used a fixed amount of 128GB local DRAM, and memory addresses exceeding this amount are mapped into CXL regions. Even though BEACON does not support DRAM and CXL interleaving, we still used reduced DRAM latency to access the corresponding 128 GB for BEACON.

#### C. Evaluation

To demonstrate the performance benefits of our proposed design for Sparse Length Sum (SLS) operations, we employ various memory devices across different models. Specifically, we utilize four memory devices with default parameters: 8 per batch, RMC4 model, page swap threshold 12%, embedding migration threshold 35%, and 512 KB buffer size. We evaluate the performance by using the total ticks used to process the traces and use min-max normalization.

Fig. 12. The performance of systems with different: (a) models, (b) types of traces (ZF: Zipfian, NoL: Normal, Um: uniform, Rm: Random), (c) memory devices, e.g., X2 means two memory devices, (d) DRAM size, e.g., X2 means 256 GB DRAM, and (e) ablation study: the PC (processing core) is PIFS-Rec specified. The plot uses min-max normalization.
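For readers unfamiliar with the term, min-max normalization rescales a set of measurements linearly to the range [0, 1]. A minimal sketch follows; the function name is hypothetical, and mapping a constant series to zeros is a choice made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]
```

Applied to total tick counts, the slowest configuration maps to 1 and the fastest to 0.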

1) HW/SW Co-Evaluation: Figure 12 (a) presents the performance results with different schemes. Pond, which integrates standard CXL support, demonstrates the lowest performance among the evaluated systems. This is anticipated because, although CXL provides an increase in bandwidth over traditional DIMM-based systems, it is hindered by significant data retrieval latency. As the workload size increases from RMC1 to RMC4, the number of pages mapped to CXL increases. Consequently, the latency increases in all approaches. However, PIFS-Rec has the lowest latency for all these workloads. On average, PIFS-Rec outperforms Pond by 3.8×, Pond + PM by 3.5×, RecNMP by 8.5%, and BEACON by 1.94× across all the workloads. On the other hand, the potential for bandwidth scalability through connecting multiple memory devices via a fabric switch has yet to be fully realized due to the lack of embedding table migration, limiting the system's ability to exploit the increased bandwidth effectively. For the largest model (RMC4), PIFS-Rec achieves only about 11% improvement over RecNMP because the latter performs data fetch with bank-level parallelism. Also, PIFS-Rec achieves $3.89\times$, $3.57\times$ and $2.03\times$ improvements over Pond, Pond + PM and BEACON, respectively, due to its intelligent page management and optimized device-level parallelism.

2) Generality: In addition to the Meta traces [58], we have conducted experiments with synthetic traces to cover a large spectrum of DLRM workload scenarios. Note that Meta traces primarily reflect workload distribution, particularly in Meta's implementation of DLRM, which may not comprehensively represent the diversity of DLRM workloads. Our synthetic traces emulate various distribution types based on the access candidates observed in the Meta traces. As depicted in Figure 12 (b), our findings indicate that the uniform distribution yields the most favorable performance since it creates a perfectly balanced distribution of embedding table accesses across devices. This distribution strategy results in a 1.1× improvement over RecNMP. Conversely, the Zipfian distribution is identified as the least effective, yielding only a 2% performance enhancement over RecNMP. Without the help of hardware support to mitigate the bandwidth bottleneck, Pond + PM improves over the baseline by only 21%, on average. PIFS-Rec achieves improvements of 2-2.2× over BEACON and 3.8-3.9× over Pond, underscoring the importance of careful memory mapping and maximized I/O parallelism.

3) Ablation Study: We explore several optimizations encompassing both hardware and software enhancements. The results in Figure 12 (e) reveal that adding processing cores yields a modest 26% improvement compared to Pond, partially utilizing the high bandwidth. Incorporating out-of-order processing (§IV-A5) provides at most 7.3% enhancement due to the elimination of cycle stalling. We observe a performance boost from page management (§IV-B2, §IV-B3), resulting in around 27% improvement due to optimized memory access and better device-level parallelism. On-switch buffering (§IV-A4) with PIFS effectively mitigates CXL's high retrieval latency, resulting in an additional 15% improvement over Pond. Combining out-of-order processing with page migration optimizes I/O parallelism and minimizes stall time. We analyze the impact of varying on-switch buffer capacities on performance in §VI-C5.

4) Scalability: Previous studies [33], the above discussions, and our experimental analysis collectively confirm that CXL memory pooling enhances scalability compared to DIMM-based solutions. We divide the trace file region evenly across memory devices. In Figure 12 (c), the performance can be reduced with a limited hardware setup, likely due to constrained optimization opportunities for I/O performance and inherently poorer device-level parallelism (compared to bank-level parallelism). Nevertheless, as the hardware inclusion expands, our design demonstrates superior performance in latency, achieving approximately a 12.5× improvement over Pond, an 8.3× improvement over Pond + PM, and a 22% improvement over RecNMP when there are 16 memory devices. We conduct a sensitivity study by increasing the local DRAM capacity (Figure 12 (d)) and find that PIFS-Rec still performs the best. Here, the DRAM capacity plays a relatively minor role in shaping the performance. Specifically, 256GB and 512GB DRAM budgets result in average performance improvements of 4% and 6%, respectively, compared to the 128GB DRAM configuration. This limited effect of DRAM capacity is due to two main reasons: the model size is in the several-terabyte range, and the primary bottleneck is memory bandwidth. Therefore, increasing memory capacity alone cannot alleviate the issue of bandwidth saturation.

To estimate the improvements in end-to-end inference latency with multi-host and multi-batch cases, we calculate the speedup by *weighting* the speedup of both SLS and non-SLS operators. In Figure 14, the performance enhancements due to







Fig. 13. Performance Comparison of (a) Embedding migration strategy across different thresholds, displaying normalized latency on SLS-operations (blue). (b) Embedding migration across devices, comparing IO access frequency before and after page migration. (c) Instruction forwarding across various fabric switches and batches. (d) Page swapping strategy for private hot and public cold pages at different thresholds. For (a) and (d), the right Y-axis is a normalized overhead for page block (red) and cache-line block (green). The migration costs are with respect to the total latency.



Fig. 14. Speedup of PIFS-Rec with different numbers of hosts sending concurrent requests under different models.



Fig. 15. Comparison of HTR performance across different cache sizes and replacement strategies.

PIFS-Rec vary with batch size. As the time spent in accelerated SLS operators grows, the model-level speedup increases with larger batch sizes. In RMC4, with the number of hosts increasing from 2 to 8, the performance improves by $1.9-4.7\times$. With the support of the multi-layer instruction forwarding strategy (§IV-C1), we illustrate the latency improvements over different fabric switch counts in Figure 13 (c). We assume that each fabric switch has one local CXL memory and one host. These fabric switches are fully connected, and we add an extra 100 ns latency when data needs to be transferred between them. The results with RMC4 indicate that as the fabric switch count is increased from $2\times$ to $32\times$, the latency improves by $1.8-20.8\times$ in the largest batch.
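The text does not give the exact weighting formula for combining SLS and non-SLS operators. A common way to perform such a weighting is an Amdahl-style estimate, sketched below as an assumption rather than the evaluation's actual procedure; the function and parameter names are hypothetical.

```python
def end_to_end_speedup(sls_fraction, sls_speedup):
    """Amdahl-style model-level speedup.

    sls_fraction: share of baseline inference time spent in SLS operators (0..1)
    sls_speedup:  speedup achieved on the SLS operators alone (> 0)
    Non-SLS operators are assumed to run at their baseline speed.
    """
    if not 0.0 <= sls_fraction <= 1.0 or sls_speedup <= 0:
        raise ValueError("invalid fraction or speedup")
    return 1.0 / ((1.0 - sls_fraction) + sls_fraction / sls_speedup)
```

Under this model, the larger the share of time spent in SLS operators, the closer the end-to-end speedup gets to the SLS speedup, which matches the trend with batch size noted above.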

5) On Switch Buffer Capacity: As shown in Figure 15, the speedup numbers demonstrate how each caching strategy benefits system performance by reducing access times compared to the baseline (no caching). As the model size increases from RMC1 to RMC4, our HTR's maximum gain reduces from 19.3% to 14.8% with 512KB SRAM, due to the larger footprint of the trace. In the largest model (RMC4), our hottest recording strategy (HTR) (§IV-A4) generates better performance scaling with increasing cache size, achieving a speedup ranging from 7.6% to 14.8%, as the cache size increases from 64 KB to 512 KB. However, for the HTR strategy, a larger cache size (1 MB) results in performance degradation due to the absence of a significant increase in cache hit ratio (41.9%) for the 1 MB cache, coupled with an increase in cache hit latency. This suggests that HTR with a cache size of 512KB effectively leverages data locality. In contrast, the LRU and FIFO strategies exhibit more modest improvements.

6) Page Management: We collected the memory address ranges from the trace file and divided them into 4KB chunks as pages in the OS. In Figure 13 (a), the embedding migration (§IV-B3) strategy shows optimal performance at a 35% migrate threshold, reducing latency by 14% due to fewer page movements. Higher thresholds increase the embedding migration and degrade the performance (in fact, the migration cost increases from 1.67% to around 10% when we increase the migrate threshold from 10% to 50% using page block). Our cache-line granular migration approach (§IV-B4) outperforms standard OS page migration (page block) by up to 5.1×, decreasing the migration cost to less than 2%. We calculate the standard deviation (Std Dev) for access frequencies before and after the embedding migration (Figure 13 (b)) to quantify the variability and assess the impact of the 35% migrate\_threshold. The standard deviation of access frequencies drops from 20.6 to 7.8 after the embedding migration. This suggests that the PM effectively harmonizes access frequencies among the devices, leading to more uniform access frequency distributions and higher I/O parallelism. In Figure 13 (d), for the page swapping strategy (§IV-B2), the best performance was observed with a cold\_age\_threshold of 16%, leading to a 12% lower latency than TPP [28], which is attributable to a lower page migration cost. The average migration cost decreases from around 8% to 1%. However, increasing the threshold further results in certain hot pages not being migrated to the local DRAM, consequently degrading the overall performance.
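The balance metric used above can be computed as follows. The per-device access counts in the usage note are made-up illustrative values, not measurements from the evaluation, and the choice of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import pstdev


def access_imbalance(per_device_accesses):
    """Population standard deviation of per-device access frequencies.

    Lower values indicate that requests are spread more evenly across devices.
    """
    return pstdev(per_device_accesses)
```

For example, a hypothetical distribution such as `[40, 10, 30, 20]` yields a larger value than a perfectly even `[25, 25, 25, 25]`, for which the metric is zero.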

# D. Hardware Overheads

We compare the power consumption and hardware overheads of PIFS-Rec with the DLRM-dedicated system RecNMP and with traditional DRAM-based solutions. Since the prior works use different fabrication processes and EDA platforms, which enable different post-design optimization strategies, we keep the same functions as described in the prior work and map them to our fabrication process. We use Synopsys Design Compiler (DC) with a 1GHz clock to estimate the area and power consumption values. We also use this information to calculate the total energy consumption using conventional 45nm technology. To evaluate the energy consumption of standalone DIMMs with CPUs, we use Cacti-3DD [65] for memory devices and Cacti-IO [66] for the off-chip input/output operations at the DIMM level. Compared to the prior solution based solely on conventional DIMMs and CPUs, PIFS-Rec reduces the energy consumption by 15.3% on average. As shown in Figure 18, PIFS-Rec reduces power by 2.7× compared to RecNMP. PIFS-Rec requires 2.02× less area than an equivalent RecNMP (x8) configuration with the same cache buffer.

Fig. 16. TCO under different models with increasing GPU budgets (e.g., X2 means 2 GPUs).

Fig. 17. Normalized throughput using different models.

| System                                        | Power / Area          |
|-----------------------------------------------|-----------------------|
| RecNMP-base (X8) [7]                          | 75.4 mW / 215984 um²  |
| PIFS-Rec breakdown: Process Core              | 9.3 mW / 33709 um²    |
| PIFS-Rec breakdown: Control Logic + Registers | 3.2 mW / 73114 um²    |
| On-Switch Buffer                              | 15.2 mW / 2.38 mm²    |

Fig. 18. Hardware overheads.

# TABLE III HARDWARE SPECIFICATIONS.

| Hardware            | Spec                          | TDP          | Price    |
|---------------------|-------------------------------|--------------|----------|
| Server CPU [59]     | AMD EPYC™ 9654 96C@2.4GHz     | 360W         | \$4,695  |
| DIMM & CXL mem [60] | per GB, DDR4                  | 21.6W (64GB) | \$4.90   |
| DIMM [60]           | per GB, DDR5                  | 24W (64GB)   | \$11.25  |
| NIC [61]            | NVIDIA ConnectX-6@200Gbps IB  | 23.6W        | \$1,900  |
| SWITCH [62]         | Juniper QFX10002-36Q @100Gbps | 360W         | \$11,899 |
| SWITCH + PUs [63]   | 3.2Tbps, 2 pipelines (ASIC)   | 400W         | \$13,039 |
| GPU [64]            | NVIDIA A100 80GB PCIe HBM2e   | 300W         | \$18,900 |

Note: The prices shown here are subject to market fluctuations and may not accurately reflect the actual procurement prices.
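The 2.7× power and 2.02× area figures quoted in the hardware-overhead comparison can be cross-checked against the Fig. 18 breakdown. The sketch below uses only the values from that table. The choice of which components enter each ratio (all three for power, the process core and control logic without the on-switch buffer for area) is an assumption that happens to reproduce the reported numbers; the variable and key names are illustrative.

```python
# Power (mW) and area (um^2) values copied from the Fig. 18 table.
RECNMP_BASE_X8 = {"power_mw": 75.4, "area_um2": 215984}

PIFS_REC = {
    "process_core": {"power_mw": 9.3, "area_um2": 33709},
    "control_logic_registers": {"power_mw": 3.2, "area_um2": 73114},
    "on_switch_buffer": {"power_mw": 15.2, "area_um2": 2.38e6},  # 2.38 mm^2
}


def total(field, parts):
    """Sum one field (power or area) over the selected PIFS-Rec components."""
    return sum(PIFS_REC[part][field] for part in parts)


# Assumption: the power ratio covers every listed component, while the area
# ratio excludes the buffer ("with the same cache buffer" in the text).
ALL_PARTS = tuple(PIFS_REC)
LOGIC_PARTS = ("process_core", "control_logic_registers")

power_ratio = RECNMP_BASE_X8["power_mw"] / total("power_mw", ALL_PARTS)
area_ratio = RECNMP_BASE_X8["area_um2"] / total("area_um2", LOGIC_PARTS)

print(f"power: {power_ratio:.2f}x, area: {area_ratio:.2f}x")
```

Under these assumptions the ratios come out at roughly 2.72× for power and 2.02× for area, in line with the text.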

# E. Cost and Performance Analysis

We evaluate the performance of the parameter server in simulation using one CPU-based server and a GPU server equipped with four A100 GPUs, and obtain power information using nvidia-smi [67]. We conservatively estimate that CXL memory's power consumption is 90% of that of local DRAM.
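The 90% estimate can be related to the memory rows of Table III. The sketch below is one reading under stated assumptions, not the paper's power model: it takes the 24W DDR5 figure as the power of one 64GB local DIMM, applies the 90% factor, and scales to a pool of a given capacity. The function and constant names are hypothetical.

```python
import math

DIMM_GB = 64                     # DIMM capacity used in Table III
LOCAL_DIMM_W = 24.0              # DDR5 row of Table III, per 64GB DIMM (assumed reading)
CXL_TO_LOCAL_POWER_RATIO = 0.90  # conservative estimate stated in the text

# Per-DIMM power attributed to CXL-attached memory under the 90% estimate.
cxl_dimm_w = LOCAL_DIMM_W * CXL_TO_LOCAL_POWER_RATIO


def pool_power_w(capacity_gb: float, watts_per_dimm: float) -> float:
    """Power of a memory pool built from 64GB DIMMs (hypothetical helper)."""
    dimms = math.ceil(capacity_gb / DIMM_GB)
    return dimms * watts_per_dimm


print(f"CXL DIMM: {cxl_dimm_w:.1f} W")
print(f"2TB CXL pool:   {pool_power_w(2048, cxl_dimm_w):.1f} W")
print(f"2TB local pool: {pool_power_w(2048, LOCAL_DIMM_W):.1f} W")
```

The per-DIMM result of about 21.6W coincides with the "DIMM & CXL mem" row of Table III, which suggests the table already reflects the 90% estimate.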

TCO Model: We assess capital expenditure (CAPEX) (Table III) for RDMA hardware acquisition and operational expenditure (OPEX), including three years of power usage. Traditional setups involve a CPU in the GPU server along with NICs and a network switch. PIFS-Rec uses a CPU and a fabric switch. We estimate the cost of a fabric switch from the price of a standard network switch with an Intel Tofino core [63]. We derive power costs from the network's standalone consumption and DC analysis data. Figure 16 demonstrates that PIFS-Rec offers superior TCO benefits compared to traditional GPU-based systems. For models with a few hundred parameters (RMC1), PIFS-Rec is 3.38× more cost-effective. Even for the largest models (RMC4) utilizing one GPU, PIFS-Rec is 2.53× cheaper. For instance, deploying RMC4 on a 2TB system with 64GB DIMMs requires \$27,769 to build a PIFS-Rec system, whereas a parameter server with a single GPU costs \$57,639. Assuming an energy cost of \$0.05 per kWh [68], PIFS-Rec can save an additional \$2,332.14 in OPEX over three years. In a traditional GPU system, memory cost increases with the model size, whereas in our system, the TCO benefit converges to the cost-benefit of DIMM and CXL memory. For smaller models (RMC1), the GPU provides better throughput (Figure 17). However, with a large memory footprint and vector size, throughput drops once memory bandwidth on the parameter server becomes the bottleneck. In contrast, PIFS-Rec demonstrates high robustness compared to parameter server-based solutions and outperforms a 4-GPU cluster by 1.6×. To understand the margin gain, we calculate performance-per-watt (PPW). As the model size increases, PIFS-Rec's PPW advantage over a conventional 4-GPU parameter server-based system grows from 1.22× to 1.61×.
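The \$27,769 build cost quoted above can be reproduced from Table III under one plausible decomposition: one server CPU, one switch with processing units, and 2TB (taken as 2048GB) of DDR4/CXL memory. The sketch below also inverts the OPEX saving into an implied average power difference. The decomposition, the 24/7 duty cycle over 3 × 365 days, and all function names are assumptions of this sketch; the GPU-server total is not decomposed in the text and is not reconstructed here.

```python
# Prices in USD, copied from Table III.
PRICE_USD = {
    "server_cpu": 4695.0,
    "switch_with_pus": 13039.0,
    "ddr4_cxl_per_gb": 4.90,
}

HOURS_3Y = 3 * 365 * 24  # assumption: continuous operation for three years


def pifs_rec_capex(capacity_gb: float) -> float:
    """CAPEX of a PIFS-Rec node: CPU + fabric switch + DDR4/CXL memory (assumed breakdown)."""
    return (PRICE_USD["server_cpu"]
            + PRICE_USD["switch_with_pus"]
            + capacity_gb * PRICE_USD["ddr4_cxl_per_gb"])


def opex_usd(avg_power_kw: float, usd_per_kwh: float = 0.05, hours: int = HOURS_3Y) -> float:
    """Energy cost of drawing avg_power_kw continuously for the given hours."""
    return avg_power_kw * hours * usd_per_kwh


def implied_power_saving_kw(saving_usd: float, usd_per_kwh: float = 0.05,
                            hours: int = HOURS_3Y) -> float:
    """Average power difference that would produce a given OPEX saving."""
    return saving_usd / (hours * usd_per_kwh)


capex = pifs_rec_capex(2048)                  # 2TB system
saving_kw = implied_power_saving_kw(2332.14)  # OPEX saving quoted in the text

print(f"PIFS-Rec CAPEX: ${capex:,.2f}")
print(f"Implied average power saving: {saving_kw:.3f} kW")
```

With these inputs the CAPEX evaluates to about \$27,769.20, and the \$2,332.14 saving corresponds to an average power difference of roughly 1.77kW over the three-year window.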

# VII. CONCLUDING REMARKS

Deep Learning Recommendation Models (DLRMs) consume extensive datacenter cycles and struggle with bandwidth due to large embeddings and growing parameter counts. This study introduces a Process-in-Fabric-Switch (PIFS) strategy to enhance DLRM efficiency in CXL-based systems. By optimizing both hardware and software for large-scale DLRM workloads, our system, PIFS-Rec, notably surpasses existing solutions, achieving up to 3.89× lower latency than Pond, an industry-standard CXL-based system, and outperforming BEACON, a state-of-the-art scheme, by 2.03×.

# ACKNOWLEDGMENT

We thank the anonymous reviewers for their useful feedback. This work was supported in part by gifts from AMD, Inc. and PRISM, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

#### REFERENCES

- [1] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini *et al.*, "Deep learning recommendation model for personalization and recommendation systems," *arXiv preprint arXiv:1906.00091*, 2019.

- [2] J. G. Carbonell, R. S. Michalski, and T. M. Mitchell, "An overview of machine learning," *Machine learning*, pp. 3–23, 1983.
- [3] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," *Science*, vol. 349, no. 6245, pp. 255–260, 2015.
- [4] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, "Multimodal machine learning: A survey and taxonomy," *IEEE transactions on pattern analysis and machine intelligence*, vol. 41, no. 2, pp. 423–443, 2018.
- [5] A. Desai and A. Shrivastava, "The trade-offs of model size in large recommendation models: 100gb to 10mb criteo-tb dlrm model," Advances in Neural Information Processing Systems, vol. 35, pp. 33 961–33 972, 2022.
- [6] H.-J. M. Shi, D. Mudigere, M. Naumov, and J. Yang, "Compositional embeddings using complementary partitions for memory-efficient recommendation systems," in *Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, 2020, pp. 165–175.
- [7] L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee et al., "Recnmp: Accelerating personalized recommendation with near-memory processing," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 790–803.
- [8] R. Jain, S. Cheng, V. Kalagi, V. Sanghavi, S. Kaul, M. Arunachalam, K. Maeng, A. Jog, A. Sivasubramaniam, M. T. Kandemir et al., "Optimizing cpu performance for recommendation systems at-scale," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15.
- [9] T. Yang, H. Ma, Y. Zhao, F. Liu, Z. He, X. Sun, and L. Jiang, "Pimpr: Pim-based personalized recommendation with heterogeneous memory hierarchy," in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023, pp. 1–6.
- [10] M. Wilkening, U. Gupta, S. Hsia, C. Trippel, C.-J. Wu, D. Brooks, and G.-Y. Wei, "Recssd: near data processing for solid state drive based recommendation inference," in *Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, 2021, pp. 717–729.
- [11] X. Sun, H. Wan, Q. Li, C.-L. Yang, T.-W. Kuo, and C. J. Xue, "Rm-ssd: In-storage computing for large-scale recommendation inference," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 1056–1070.
- [12] Y. Kwon, Y. Lee, and M. Rhu, "Tensordimm: A practical near-memory processing architecture for embeddings and tensor operations in deep learning," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 740–753.
- [13] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, "Centaur: A chiplet-based, hybrid sparse-dense accelerator for personalized recommendations," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 968–981.
- [14] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, "Deep learning recommendation model for personalization and recommendation systems," *CoRR*, vol. abs/1906.00091, 2019. [Online]. Available: https://arxiv.org/abs/1906.00091
- [15] A. Guo, Y. Hao, C. Wu, P. Haghi, Z. Pan, M. Si, D. Tao, A. Li, M. Herbordt, and T. Geng, "Software-hardware co-design of heterogeneous smartnic system for recommendation models inference and training," in *Proceedings of the 37th International Conference on Supercomputing*, 2023, pp. 336–347.
- [16] H. M. Shi, D. Mudigere, M. Naumov, and J. Yang, "Compositional embeddings using complementary partitions for memory-efficient recommendation systems," *CoRR*, vol. abs/1909.02107, 2019. [Online]. Available: https://arxiv.org/abs/1909.02107
- [17] B. Keeth and R. J. Baker, DRAM circuit design: a tutorial. IEEE, 2001.
- [18] X. Shen, X. Liao, L. Zheng, Y. Huang, D. Chen, and H. Jin, "Archer: a reram-based accelerator for compressed recommendation systems," *Frontiers of Computer Science*, vol. 18, no. 5, p. 185607, 2024.
- [19] G. Singh, L. Chelini, S. Corda, A. J. Awan, S. Stuijk, R. Jordans, H. Corporaal, and A.-J. Boonstra, "A review of near-memory computing architectures: Opportunities and challenges," in 2018 21st Euromicro Conference on Digital System Design (DSD). IEEE, 2018, pp. 608–617.

- [20] K. Khan, S. Pasricha, and R. G. Kim, "A survey of resource management for processing-in-memory and near-memory processing architectures," *Journal of Low Power Electronics and Applications*, vol. 10, no. 4, p. 30, 2020.
- [21] D. Kalamkar, E. Georganas, S. Srinivasan, J. Chen, M. Shiryaev, and A. Heinecke, "Optimizing deep learning recommender systems training on cpu cluster architectures," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–15.
- [22] Z. Wang, Y. Wei, M. Lee, M. Langer, F. Yu, J. Liu, S. Liu, D. G. Abel, X. Guo, J. Dong et al., "Merlin hugectr: Gpu-accelerated recommender system training and inference," in *Proceedings of the 16th ACM Conference on Recommender Systems*, 2022, pp. 534–537.
- [23] S. Pumma and A. Vishnu, "Semantic-aware lossless data compression for deep learning recommendation model (dlrm)," in 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). IEEE, 2021, pp. 1–8.
- [24] Y. Yuan, J. Huang, Y. Sun, T. Wang, J. Nelson, D. R. Ports, Y. Wang, R. Wang, C. Tai, and N. S. Kim, "Rambda: Rdma-driven acceleration framework for memory-intensive μs-scale datacenter applications," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 499–515.
- [25] "Compute Express Link (CXL)," https://computeexpresslink.org/, accessed: 2023-03-14.
- [26] H. Li, D. S. Berger, L. Hsu, D. Ernst, P. Zardoshti, S. Novakovic, M. Shah, S. Rajadnya, S. Lee, I. Agarwal et al., "Pond: Cxl-based memory pooling systems for cloud platforms," in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 574–587.
- [27] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, C. Song, J. Huang, H. Ji, S. Agarwal, J. Lou, I. Jeong et al., "Demystifying cxl memory with genuine cxl-ready systems and devices," in *Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture*, 2023, pp. 105–121.
- [28] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, and P. Chauhan, "Tpp: Transparent page placement for cxl-enabled tiered-memory," in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 742–755.
- [29] X. Technologies, "World's first cxl 2.0 and pcie gen5 switch ic," 2023, accessed: 2023-03-14. [Online]. Available: https://www.xconntech.com/product
- [30] S. Park, H. Kim, K. Kim, J. So, J. Ahn, W. Lee, D. Kim, Y. Kim, J. Seok, J. Lee *et al.*, "Scaling of memory performance and capacity with cxl memory expander," in *HCS*, 2022, pp. 1–27.
- [31] "Micron CXL," https://www.micron.com/products/memory/cxl-memory, accessed: 2023-03-14.
- [32] "Samsung CXL," https://news.samsung.com/global/samsung-electronics-introduces-industrys-first-512gb-cxl-memory-module, accessed: 2023-03-14.
- [33] W. Huangfu, K. T. Malladi, A. Chang, and Y. Xie, "Beacon: Scalable near-data-processing accelerators for genome analysis near memory pool with the cxl support," in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 727–743.
- [34] J. Jang, H. Choi, H. Bae, S. Lee, M. Kwon, and M. Jung, "CXL-ANNS: Software-Hardware collaborative memory disaggregation and computation for Billion-Scale approximate nearest neighbor search," in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 585–600.
- [35] S. Daghaghi, N. Meisburger, M. Zhao, and A. Shrivastava, "Accelerating slide deep learning on modern cpus: Vectorization, quantizations, memory optimizations, and more," *Proceedings of Machine Learning and Systems*, vol. 3, pp. 156–166, 2021.
- [36] U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia et al., "The architectural implications of facebook's dnn-based personalized recommendation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 488–501.
- [37] A. Firoozshahian, J. Coburn, R. Levenstein, R. Nattoji, A. Kamath, O. Wu, G. Grewal, H. Aepala, B. Jakka, B. Dreyer et al., "Mtia: First generation silicon targeting meta's recommendation systems," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13.

- [38] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G.-Y. Wei, H.-H. S. Lee, D. Brooks, and C.-J. Wu, "Deeprecsys: A system for optimizing end-to-end at-scale neural recommendation inference," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 982–995.
- [39] C. Yin, B. Acun, C.-J. Wu, and X. Liu, "Tt-rec: Tensor train compression for deep learning recommendation models," *Proceedings of Machine Learning and Systems*, vol. 3, pp. 448–462, 2021.
- [40] G. Sethi, B. Acun, N. Agarwal, C. Kozyrakis, C. Trippel, and C.-J. Wu, "Recshard: statistical feature-based memory optimization for industry-scale neural recommendation," in *Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, 2022, pp. 344–358.
- [41] H. Zhang, Z. Liu, B. Chen, Y. Zhao, T. Zhao, T. Yang, and B. Cui, "Cafe: Towards compact, adaptive, and fast embedding for large-scale recommendation models," *Proceedings of the ACM on Management of Data*, vol. 2, no. 1, pp. 1–28, 2024.
- [42] L. Ke, U. Gupta, M. Hempstead, C.-J. Wu, H.-H. S. Lee, and X. Zhang, "Hercules: Heterogeneity-aware inference serving for at-scale personalized recommendation," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 141–154.
- [43] L. Ke, X. Zhang, B. Lee, G. E. Suh, and H.-H. S. Lee, "Disaggrec: Architecting disaggregated systems for large-scale personalized recommendation," arXiv preprint arXiv:2212.00939, 2022.
- [44] Y. Wang, S. Li, Q. Zheng, A. Chang, H. Li, and Y. Chen, "Ems-i: An efficient memory system design with specialized caching mechanism for recommendation inference," ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–22, 2023.
- [45] I. Calciu, M. T. Imran, I. Puddu, S. Kashyap, H. A. Maruf, O. Mutlu, and A. Kolli, "Rethinking software runtimes for disaggregated memory," in ASPLOS, 2021.
- [46] V. Addanki, C. Avin, and S. Schmid, "Mars: Near-optimal throughput with shallow buffers in reconfigurable datacenter networks," 2022.
- [47] V. Addanki, O. Michel, and S. Schmid, "PowerTCP: Pushing the performance limits of datacenter networks," in 19th USENIX symposium on networked systems design and implementation (NSDI 22), 2022, pp. 51–70.
- [48] M. Apostolaki, V. Addanki, M. Ghobadi, and L. Vanbever, "Fb: A flexible buffer management scheme for data center switches," 2021.
- [49] Y. Yu, X. Jiang, G. Jin, Z. Gao, and P. Li, "A buffer management algorithm based on dynamic marking threshold to restrain microburst in data center network," *Information*, vol. 12, no. 9, p. 369, 2021.
- [50] D. Dillow, G. M. Shipman, S. H. Oral, and Z. Zhang, "I/o congestion avoidance via routing and object placement," 2011. [Online]. Available: https://api.semanticscholar.org/CorpusID:59774578
- [51] X. Liao, Z. Yang, and J. Shu, "Rio: Order-preserving and cpu-efficient remote storage access," in *Proceedings of the Eighteenth European Conference on Computer Systems*, 2023, pp. 703–717.
- [52] K. Kim, S. Kim, and T. Kim, "Hmb-i/o: Fast track for handling urgent i/os in nonvolatile memory express solid-state drives," *Applied Sciences*, vol. 10, no. 12, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/12/4341
- [53] J. Ren, J. Luo, K. Wu, M. Zhang, H. Jeon, and D. Li, "Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 598–611.
- [54] A. Choudhary, M. C. Govil, G. Singh, L. K. Awasthi, E. S. Pilli, and D. Kapil, "A critical survey of live virtual machine migration techniques," *Journal of Cloud Computing*, vol. 6, pp. 1–41, 2017.
- [55] M. Inc., "Mongodb," 2024, accessed: 2024-06-23. [Online]. Available: https://www.mongodb.com/
- [56] J. Dean and S. Ghemawat, "Mapreduce: simplified data processing on large clusters," *Communications of the ACM*, vol. 51, no. 1, pp. 107–113, 2008.
- [57] H. Luo, Y. C. Tuğrul, F. N. Bostancı, A. Olgun, A. G. Yağlıkçı, and O. Mutlu, "Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator," 2023.
- [58] M. A. Research, "Meta inc. trace," 2023, accessed: 2023-03-14. [Online]. Available: https://github.com/facebookresearch/dlrm\_datasets
- [59] AMD, "Amd epyc 9654x 2.4 ghz processor oem," 2024, accessed: 2024-09-22. [Online]. Available: https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series.html#specs
- [60] MemVerge, "Cxl use case: Slash memory costs and expand capacity," 2024, accessed: 2024-09-22. [Online]. Available: https://memverge.com/cxl-use-case-slash-memory-costs-and-expand-capacity/
- [61] J. Networks, "Juniper qfx10002," 2024, accessed: 2024-09-22. [Online]. Available: https://www.juniper.net/documentation/us/en/hardware/qfx10002/topics/topic-map/qfx10002-port-panel.html
- [62] NVIDIA, "Nvidia connectx-6 vpi adapter card hdr 200gbe," 2024, accessed: 2024-09-22. [Online]. Available: https://store.nvidia.com/en-us/networking/store/product/mcx653106a-hdat-sp/nvidia-connectx-6-vpi-adapter-card-hdr-200gbe/
- [63] Intel, "Intel tofino intelligent fabric processors," 2024, accessed: 2024-09-22. [Online]. Available: https://www.intel.com/content/www/us/en/products/details/network-io/intelligent-fabric-processors/tofino.html
- [64] NVIDIA, "Nvidia a100 80gb datasheet," 2024, accessed: 2024-09-22. [Online]. Available: https://www.nvidia.com/content/dam/enzz/Solutions/Data-Center/a100/pdf/a100-80gb-datasheet-update-nvidiaus-1521051-r2-web.pdf
- [65] K. Chen, S. Li, N. Muralimanohar, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, "Cacti-3dd: Architecture-level modeling for 3d die-stacked dram main memory," in 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012, pp. 33–38.
- [66] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas, "Cacti-io: Cacti with off-chip power-area-timing models," in *Proceedings of the International Conference on Computer-Aided Design*, 2012, pp. 294–301.
- [67] Nvidia-SMI, "System Management Interface SMI," 2024. [Online]. Available: https://developer.nvidia.com/system-management-interface
- [68] BLS Strategies, "Power requirements, energy costs, and incentives for data centers," https://www.blsstrategies.com/insights-press/powerrequirements-energy-costs-and-incentives-for-data-centers, accessed: 2024-06-26.