
2. SMDK Architecture

KyungsanKim edited this page Dec 14, 2022 · 30 revisions

2.1 High Level Architecture


The picture below explains the high-level architecture of SMDK.


image


  • The top layer of SMDK is the memory user interface. It consists of a CLI tool, libraries, and application programming interfaces (APIs): sets of pre-built, reusable code and the entry points used to access it.

    • The allocator library provides both a compatible path and an optimization path for application integration. With the library, system developers can easily incorporate CXL memory into an existing system without modifying the application, or rewrite application source code to reach a higher level of optimization.
    • The CLI tool offers a unified way to retrieve information from and control a CXL device and SMDK functions.
  • The middle layer of SMDK is a userspace memory-tiering engine with scalable near/far memory management. This layer enables best-fit memory use cases based on the memory access pattern and footprint of each application.

  • The bottom layer of SMDK is a set of primitive logical memory views provided by the SMDK kernel. The kernel changes are geared toward flexible, system-wide memory utilization for CXL users.

Please note that SMDK is being developed to support SDM (Software-Defined Memory), spanning the full software stack.




2.2 User Interface


The memory map of a Linux process is composed of code, data (bss), stack, and heap segments, among others, and each segment is represented as a VMA (Virtual Memory Area).

Among these segments, the heap is commonly used to handle a process's large memory consumption by dynamically adjusting its size. Linux provides several system calls, such as mmap() and brk(), that let a process expand its heap at runtime.

On the SMDK kernel, when a process expands its heap by calling mmap(), it can choose which memory partition to use: the NORMAL or the EXMEM zone. Given the context and size, mmap() internally allocates the proper number of pages from the free lists of the buddy allocator of the NORMAL or EXMEM zone, attaches the virtual pages to the heap segment, and adjusts the segment size. Finally, the call returns the result of the execution to the caller.

To be specific, SMDK extends the mmap() system call with the MAP_NORMAL and MAP_EXMEM flags to selectively allocate free pages from the NORMAL or EXMEM zone. The two flags explicitly designate the memory type for a mapping request. In addition, a new sysfs interface is provided to choose and prioritize between the two memory types for implicit mapping requests, i.e., mmap() calls that specify neither flag. This is one specific point where SMDK both inherits and extends traditional Linux MM for a heterogeneous memory system with CXL devices. Please refer to the code snippet next.

/*
 * 1) Explicit memory mapping from CXL memory:    flag |= MAP_EXMEM;
 * 2) Explicit memory mapping from NORMAL memory: flag |= MAP_NORMAL;
 * 3) Implicit memory mapping from CXL or NORMAL memory,
 *    depending on the configured priority: no flag change
 *
 * As usual, flag must also carry the base mapping bits for an
 * anonymous mapping, e.g. MAP_PRIVATE | MAP_ANONYMOUS.
 */
char *addr = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);

As for the user interface, SMDK provides a heap allocator and a numactl extension that orchestrate the capacity and bandwidth of DRAM and CXL memory. The SMDK allocator also scales in performance by assigning pair(s) of lock-free DRAM/CXL arenas per CPU, resolving the race condition that arises when multiple CPUs concurrently request memory chunks from the arenas. The SMDK allocator is an extension of the jemalloc allocator, and it can be used in both the compatible and the optimization way.

The figure below describes how the SMDK allocator works.

image



Furthermore, a CLI tool is provided as the unified interface to manage CXL device(s) on SMDK.

image

Some commands communicate directly with the CXL device (e.g., CXL specification commands), while others interact with SMDK components such as the sysfs interfaces for memory grouping. As of SMDK v1.2, the CXL-CLI tool is an extension of Intel's CXL-CLI.


Compatible Path

  • Background - When a new memory device emerges, it has been common for a new API and programming model to be provided to enable the device. Given such an API and programming model, application developers and open-source community members are forced to modify their applications' source code to adopt and use the device properly. However, this harms the reliability of running services and increases management cost. The Compatible API of SMDK is geared to resolve this pain point, reflecting the voice of the customer (VOC) in the industry.

  • SMDK enables an application to tier CXL/DDR memory without software modification; neither the API nor the programming model changes. SMDK also provides transparent memory management by inheriting and extending the Linux process/VMM design.

  • Technically, the compatible path is not an API but a way for application developers and service operators to use CXL memory without application changes.

  • The compatible path provides heap-segment tiering through the standard heap allocation APIs and system calls, such as malloc(), realloc(), and mmap(), by overriding their libc implementations. In particular, the path includes an intelligent allocation engine that enables DDR/CXL memory-tiering use cases such as memory priority, size, and bandwidth.

  • The compatible path also provides not only heap-segment tiering but also user-process tiering and resource isolation.

    image


Optimization Path

  • The optimization path is, literally, the API for reaching a higher level of optimization by rewriting an application.

  • The Optimization API consists of the Allocation API and the Metadata API. The Allocation API handles memory allocation and deallocation, while the Metadata API reports online memory status.

    image


Language Binding

The SMDK allocator provides language-binding interfaces for applications. In SMDK, the compatible and optimization paths support different language-binding coverage.

  • Compatible Path - C, C++, Python, Java, Go

    • The Python, Java, and Go runtimes each design and implement their own internal memory management schemes, which are ultimately based on the primitive heap allocation mechanisms such as malloc() and mmap().
    • On top of that, the SMDK compatible library supports language bindings for these languages.
  • Optimization Path - C, C++, Python

    • A proprietary API is provided by the SMDK optimization library for implementing applications with sophisticated CXL/DDR memory use.

The picture below depicts SMDK from the language-binding perspective.

image




2.3 Kernel


SMDK introduces three interfaces - System RAM, Swap, and Cache - for applications to make use of CXL memory, by inheriting and expanding the Linux VMM.

The Memory Zone/Partition, Swap, and Cache paragraphs explain the three interfaces respectively.


Memory Zone

Background - Historically, the Linux VMM has had a hierarchy of logical memory views from node, to zone, to buddy, and finally page granularity. The VMM has also expanded the zone unit for better use of physical memory according to the HW and SW needs of the time, such as the DMA/DMA32, MOVABLE, DEVICE, NORMAL, and HIGHMEM zones.

  • The DMA/DMA32 zones include specific pages that are addressable by I/O devices that support only a limited address space.

  • MOVABLE zone is geared to have less fragmentation and support memory hot-plugging.

  • The HIGHMEM zone was used to access memory located beyond the NORMAL zone due to addressing limitations.

  • Most DRAM pages are located in NORMAL zone and used by system and application contexts.

image

SMDK introduces a new memory zone, ZONE_EXMEM, to manage DDR and CXL memory independently. It is designed for two reasons. One is to serve a new memory device whose characteristics, such as frequency range, wire protocol, and functionality, originate from a different HW controller. The other is to provide programmable interfaces that explicitly distinguish DDR and CXL memory; a node ID is ephemeral information that often changes with system conditions. ZONE_EXMEM is currently a configurable option of the SMDK kernel. We are open to discussion and feedback on this design approach.

| Linux Zone | Trigger | Description | Option |
| --- | --- | --- | --- |
| ZONE_NORMAL | Initial design | Normal addressable memory. | mandatory |
| ZONE_DMA | HW (I/O) | Addressable memory for some peripheral devices supporting limited address resolution. | configurable |
| ZONE_HIGHMEM | HW (CPU) | A memory area that is only addressable by the kernel through mapping portions into its own address space. This is, for example, used by i386 to allow the kernel to address memory beyond 900 MB. | configurable |
| ZONE_DEVICE | HW (I/O) | Offers paging and mmap for device-driver-identified physical address ranges (e.g., pmem, hmm, p2pdma). | configurable |
| ZONE_MOVABLE | SW | Similar to ZONE_NORMAL, except that it contains movable pages. The main use cases for ZONE_MOVABLE are to make memory offlining/unplug more likely to succeed and to locally limit unmovable allocations. | configurable |
| ZONE_EXMEM | HW (CXL) | Extended memory for latency isolation, composability, and pluggability. | configurable |

Memory Partition

Right now, the industry treats a single physical CXL memory channel as a single logical memory-only node in a Linux system. Hence "CXL memory : node = 1 : 1", which is the "Single Node" status in the picture below.

However, we think this has drawbacks: service operators or high-level application developers have to be aware of the existence of CXL memory and control it manually using third-party tools such as numactl and libnuma. The more CXL memories a system has, the more management effort is needed.

In addition, it is not compatible with the traditional Linux approach of depicting an array of physical DRAM as one logical NUMA node.

For these reasons, SMDK further suggests an abstraction layer that hides and minimizes these efforts by grouping multiple CXL devices into a logical partition in a user-configurable manner.

  • The zone partition presents the CXL memories on the system in one zone, ZONE_EXMEM, within a NUMA node, segregated from ZONE_NORMAL.

  • The node partition presents the CXL memories on the system grouped into one memory node.

  • The noop partition presents each CXL memory on the system as a separate memory node.

  • The n-way partition presents the specified number of CXL memories on the system grouped into one memory node.

These partitions, except noop, support SW interleaving of the multiple CXL memories in the partition; as a result, bandwidth is aggregated across the connected CXL devices.

Plus, we assume CXL memory can be used not only through a memory interface but also through a device interface such as DAX. SMDK allows both ways.

image

The figure below further depicts the SMDK zone partition with multiple CXL memories.

A single CXL memory is managed as a sub-zone possessing its own buddy list, and multiple CXL memories are grouped into a so-called super-zone. When a memory request comes from a thread context, the CXL super-zone assigns the proper number of pages in order from the managed sub-zone array. Given that behavior, the operation results in a bandwidth-aggregation effect across the connected CXL memory array.

Without the super-zone design, memory requests from a process are not balanced but concentrated on a single CXL device, because a zone's buddy list is composed in the sequential order of the connected CXL devices.

image


Swap

Background - Swap has typically been used to resolve memory-pressure conditions of a system. When the PFRA (Page Frame Reclaiming Algorithm) runs, victim pages are stored on the target swap device or file (swap-out) and later moved back into VM space when needed (swap-in). Usually a disk is selected as the swap device because it provides data persistency, but a DRAM device can also be used to store volatile swap data by (de)compressing the pages being swapped, e.g., zSwap. zSwap is designed to logically expand DRAM density with (de)compression technology using host CPU cycles, and it has been widely used so far. One of the CXL philosophies, however, is obviously to adopt a larger amount of physical memory in a system. This implies that CPU cycles can be spent on worthier work and may no longer need to be consumed compressing swapped data.

Based on these architectural thoughts, SMDK provides a CXL Swap interface that allows application software to use a CXL memory as the swap device for volatile swap data. Specifically, CXL Swap is a kernel module that complies with the frontswap interface.

image


Cache

Design and development are ongoing; this will be released in the next SMDK version.


Device Driver

CXL device driver in SMDK kernel performs two operations.

First, at system boot, the driver registers and manages a "CXL device : memory partition" map based on the SRAT/CEDT/DVSEC and E820 information provided by the device and BIOS. The map maintains the relations between physical CXL memories and logical memory groups.

Please note that the map is mutable, reflecting the grouping status of a CXL memory, which is configurable via SMDK plugins.

Next, a set of sysfs nodes is exported to report the static and dynamic information of the CXL device(s). Specifically, /sys/kernel/cxl/devices/cxlX exports the address, size, and socket location of CXL device X, while /sys/kernel/cxl/nodes/nodeY reports the CXL device(s) in memory node Y at that time.

image