
2. SMDK Architecture

KyungsanKim edited this page Aug 18, 2022 · 30 revisions

2.1 High Level Architecture


The picture below shows the high-level architecture of SMDK.


image


  • The top layer of SMDK is the memory user interface. It consists of a CLI tool, libraries, and application programming interfaces (APIs): sets of pre-built, reusable code and the connections to access that code.

    • The allocator library provides both a compatible path and an optimization path for application integration. Using the library, system developers can not only incorporate CXL memory into existing systems without modifying applications, but also optimize and rewrite application source code to reach a higher level of optimization.
    • The CLI tool provides a unified way to query and control CXL devices and SMDK functions.
  • The middle layer of SMDK is the userspace memory tiering engine with scalable near/far memory management. This layer enables best-fit memory use based on the memory access pattern and footprint of applications.

  • The bottom layer of SMDK consists of the primitive logical memory views provided by the SMDK kernel. The kernel changes are geared to provide flexible memory utilization for CXL users from a system point of view.

Please note that SMDK is being developed to support SDM (Software-Defined Memory) across the full software stack.




2.2 User Interface

The memory map of a Linux process is composed of code, data (bss), stack, and heap segments, among others, and each segment is represented as a VMA (Virtual Memory Area).

Among these segments, the heap segment is commonly used to handle a process's large memory consumption by dynamically adjusting its size. Linux provides several system calls, such as mmap() and brk(), to support live expansion of the heap segment.

On the SMDK kernel, when a process attempts to expand its heap segment by calling mmap(), the process chooses which memory partition to use: the NORMAL or the EXMEM zone. Given the context and size, mmap() internally allocates the proper number of pages from the free lists of the buddy allocator of the NORMAL or EXMEM zone, attaches the virtual pages to the heap segment, and then adjusts the segment size. Finally, the call returns the result of the execution to the caller.

To be specific, SMDK extends the mmap() system call with the MAP_NORMAL and MAP_EXMEM flags to selectively allocate free pages from the NORMAL or EXMEM zone. The two flags explicitly designate the memory type for a mapping request. In addition, a new sysfs interface is provided to determine and prioritize the two memory types for implicit memory mapping requests, i.e., mmap() calls that specify neither flag. This is one specific point where SMDK both inherits and extends traditional Linux MM for a heterogeneous memory system with CXL devices. Please refer to the code snippet next.

/*
 * 1) Explicit memory mapping from CXL memory:    flag |= MAP_EXMEM;
 * 2) Explicit memory mapping from normal memory: flag |= MAP_NORMAL;
 * 3) Implicit mapping from CXL or normal memory,
 *    depending on the sysfs priority: leave flag unchanged
 */
int flag = MAP_PRIVATE | MAP_ANONYMOUS;   /* anonymous mapping, fd = -1 */
char *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, flag, -1, 0);

As for the user interface, SMDK provides a heap allocator and a numactl extension that orchestrate the capacity and bandwidth of DRAM and CXL memory. The SMDK allocator also scales in performance by assigning pair(s) of lock-free DRAM/CXL arenas per CPU, resolving the race condition that arises when multiple CPUs concurrently request memory chunks from the arenas. The SMDK allocator is an extension of the jemalloc allocator, and it can be used in both the compatible and the optimization way.

The figure below describes how the SMDK allocator works.

image



Furthermore, a CLI tool serves as the unified interface to manage CXL device(s) on SMDK.

image

Some commands communicate directly with the CXL device (e.g., CXL specification commands), while others interact with SMDK components, such as the sysfs interfaces for memory grouping. As of SMDK v1.2, the CLI tool is an extension of Intel's CXL-CLI.


Compatible Path

  • Background - When a new memory device emerges, it has been common for a novel API and programming model to be introduced to enable the device. Given the new API and programming model, application developers and open-source community members are forced to modify their applications' source code to adopt and use the device properly. However, this harms the reliability of running services and increases management cost. The compatible API of SMDK is geared to resolve this pain point, reflecting the voice of the customer (VOC) in the industry.

  • SMDK enables an application to tier CXL/DDR memory without software modification; that is, neither the API nor the programming model changes. SMDK also provides transparent memory management by inheriting and extending the Linux process/VMM design.

  • Technically, the compatible path is not an API, but a way for application developers and service operators to use CXL memory without application changes.

  • The compatible path provides heap segment tiering using the standard heap allocation APIs and system calls, such as malloc(), realloc(), and mmap(), by overriding their libc implementations. Notably, the path includes an intelligent allocation engine that enables DDR/CXL memory tiering use cases based on memory priority, size, and bandwidth.

  • The compatible path provides not only heap segment tiering, but also user process tiering and resource isolation.

    image


Optimization Path

  • The optimization path is an API for reaching a higher level of optimization by rewriting the application.

  • The optimization API consists of the Allocation API and the Metadata API. The Allocation API handles memory allocation and deallocation, while the Metadata API reports online memory status.

    image


Language Binding

The SMDK allocator provides language binding interfaces for applications. In SMDK, the compatible and optimization paths support different language binding coverage.

  • Compatible Path - C, C++, Python, Java, Go

    • The Python, Java, and Go runtimes each design and implement their own internal memory management schemes, which are ultimately based on primitive heap allocation mechanisms such as malloc() and mmap().
    • On top of those primitives, the SMDK compatible library provides language bindings for these languages.
  • Optimization Path - C, C++, Python

    • A proprietary API is provided by the SMDK optimization library for implementing applications with sophisticated CXL/DDR memory use.

The picture below depicts SMDK from the language binding perspective.

image




2.3 Kernel


Memory Zone

  • Background - Historically, the Linux VMM has had a hierarchy of logical memory views: node, zone, buddy, and finally page granularity. The VMM has also expanded the zone unit for better use of physical memory according to the HW and SW needs of the time, introducing the DMA/DMA32, MOVABLE, DEVICE, NORMAL, and HIGHMEM zones.

  • The DMA/DMA32 zones contain pages addressable by I/O devices that support only a limited address space.

  • The MOVABLE zone is geared to reduce fragmentation and to support memory hot-plugging.

  • The HIGHMEM zone was used to access memory located beyond the NORMAL zone due to addressing limitations.

  • Most DRAM pages are located in the NORMAL zone and are used by system and application contexts.

image

SMDK introduces a new memory zone, ZONE_EXMEM, to manage DDR and extended memory, such as CXL memory, independently. It is designed for two reasons: to serve new memory hardware that has a different latency range due to a different wire protocol, and to serve memory use cases such as composability and pluggability. ZONE_EXMEM is a configurable option of the SMDK kernel. We are open to discussion and feedback on this design approach.

| Linux Zone | Trigger | Description | Option |
| --- | --- | --- | --- |
| ZONE_NORMAL | Initial design | Normal addressable memory. | mandatory |
| ZONE_DMA | HW (I/O) | Addressable memory for some peripheral devices supporting limited address resolution. | configurable |
| ZONE_HIGHMEM | HW (CPU) | A memory area that is only addressable by the kernel through mapping portions into its own address space. This is, for example, used by i386 to allow the kernel to address memory beyond 900MB. | configurable |
| ZONE_DEVICE | HW (I/O) | Offers paging and mmap for device-driver-identified physical address ranges, e.g., pmem, hmm, p2pdma. | configurable |
| ZONE_MOVABLE | SW | Similar to ZONE_NORMAL, except that it contains movable pages. Main use cases are to make memory offlining/unplug more likely to succeed and to locally limit unmovable allocations. | configurable |
| ZONE_EXMEM | HW (CXL) | Extended memory for latency isolation, composability, and pluggability. | configurable |

Memory Partition

Currently, the industry treats a single CXL memory channel as a single logical memory-only node in a Linux system. Hence, "CXL Memory : Node = 1 : 1", shown as the "Single Node" status in the picture below.

However, we think this has drawbacks: service operators or high-level application developers have to be aware of the existence of CXL memory and manually control it themselves using third-party tools such as numactl and libnuma. The more CXL memories a system has, the more management effort is needed.

In addition, we experienced a malfunction when multiple applications performed their own node traversal policies simultaneously. We were able to avoid the issue using the SMDK zone partition type, because in that type the CXL memory(s) is seen at a new zone granularity.

For these reasons, SMDK further suggests an abstraction layer that hides and minimizes these efforts, allowing CXL devices to be grouped in a user-configurable manner.

  • The zone partition presents all CXL memories on the system as a single zone, ZONE_EXMEM, segregated from ZONE_NORMAL.

  • The node partition presents the CXL memories on the system as a separate, grouped memory node.

  • The noop partition presents each CXL memory on the system as its own separate memory node.

Both the zone and node partitions support interleaving of multiple CXL memories in software; as a result, bandwidth is aggregated across the connected CXL devices.

In addition, we assume CXL memory can be used not only through a memory interface but also through a device interface such as DAX. SMDK allows both.

image

The figure below further depicts the SMDK zone partition with multiple CXL memories.

A single CXL memory is managed as a sub-zone possessing its own buddy list, and multiple CXL memories are grouped into a so-called super-zone. When a memory request comes from a thread context, the CXL super-zone assigns the proper number of pages from the managed sub-zones in order. This results in a bandwidth aggregation effect across the connected CXL memory array.

Without the super-zone design, memory requests from a process would not be balanced but would be concentrated on a single CXL device, because a zone's buddy list is composed in the sequential order of the connected CXL devices.

image


Device Driver

The CXL device driver in the SMDK kernel performs two operations.

First, at system boot, the driver registers and manages a "CXL device : memory partition" map based on the SRAT/CEDT/DVSEC and E820 information provided by the device and BIOS. The map maintains the relationship between physical CXL memories and logical memory groups.

Please note that the map is mutable, reflecting the grouping status of each CXL memory, which is configurable via SMDK plugins.

Second, a set of sysfs nodes is exported to report the static and dynamic information of CXL device(s). Specifically, /sys/kernel/cxl/devices/cxlX exports the address, size, and socket location of CXL device X, while /sys/kernel/cxl/nodes/nodeY reports the CXL device(s) currently in memory node Y.

image