
7. Experiment Results

Junhyeok Im edited this page May 9, 2024 · 8 revisions

7.1 CXL Usecases

When investigating a new memory device, we believe that identifying its best- and worst-fit usecases plays a pivotal role in the device's successful adoption. We therefore assume three possible CXL usecases and present results for those considering CXL memory in their systems. The experiments were conducted in our lab on a CXL system under development, so the results should be understood as a reference only.

[Figure: Overview of the three assumed CXL usecases]



7.2 Experiment 1 - IMDB (Redis, UX A)

All experimental results on this page, including those in this chapter, were obtained on prototypes of a CXL-enabled system and a CXL memory module, so the results may vary depending on the evaluation environment.

7.2.1 Testbed

The table below describes the HW/SW testbed information that we used for the experiments.

| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung CMM-D prototype |
| OS | Ubuntu 20.04 LTS, CentOS-7 x86 2009, Fedora Workstation 38, OpenSUSE Leap 15.5 |
| Kernel | SMDK kernel: 5.17.0-rc5-smdk and later (latest: 6.6.0-smdk) |

7.2.2 Test Configuration

Redis Configuration

  • maxmemory: 8g
  • maxmemory-policy: allkeys-lru
  • maxmemory-samples: 10

Memtier Configuration

  • target: 60GB workload (W/L)
  • pipeline_num: 1
  • thread_num: 24 (using taskset -c 24-47)
  • client_num: 50
  • key_pattern: P:P
  • ratio: 1:0 (Set) / 0:1 (Get)
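The Memtier configuration above can be sketched as a command-line builder. The flag names follow memtier_benchmark's CLI; how the 60GB target is reached (request count × data size) and the core pinning are taken from the list above, but the exact request count here is an illustrative assumption.

```python
def memtier_cmd(ratio: str, data_size: int, requests: int) -> list:
    """Compose a memtier_benchmark invocation matching the configuration above.
    ratio is "1:0" for a Set-only run or "0:1" for a Get-only run."""
    return [
        "taskset", "-c", "24-47",   # pin benchmark threads to cores 24-47
        "memtier_benchmark",
        "--pipeline", "1",          # pipeline_num
        "--threads", "24",          # thread_num
        "--clients", "50",          # client_num per thread
        "--key-pattern", "P:P",     # parallel key pattern
        "--ratio", ratio,
        "--data-size", str(data_size),
        "--requests", str(requests),
    ]

# Illustrative Set-only run with 128B values:
set_cmd = memtier_cmd("1:0", 128, 500_000)
```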

SMDK Configuration (CXLMALLOC_CONF)

| use_exmem | Test A exmem_size | Test A normal_size | Test A priority | Test B exmem_size | Test B normal_size | Test B priority | maxmemory_policy | use_auto_arena_scaling |
|---|---|---|---|---|---|---|---|---|
| TRUE | 131072 | 2048 | normal | 131072 | 2048 | normal | remain | FALSE |
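A launch under the compatible (LD_PRELOAD) path might look like the sketch below. The CXLMALLOC_CONF keys mirror the configuration table above, but the library install path and the exact key spelling accepted by libcxlmalloc.so are assumptions for illustration.

```python
import os

def smdk_env(exmem_size: int, normal_size: int, priority: str) -> dict:
    """Build an environment for running a workload (e.g., redis-server)
    under SMDK's compatible path. Keys follow the table above; the
    libcxlmalloc.so path is an assumed install location."""
    conf = (f"use_exmem:true,exmem_size:{exmem_size},"
            f"normal_size:{normal_size},priority:{priority},"
            "maxmemory_policy:remain,use_auto_arena_scaling:false")
    env = dict(os.environ)
    env["LD_PRELOAD"] = "/usr/lib/libcxlmalloc.so"   # assumed path
    env["CXLMALLOC_CONF"] = conf
    return env

# Test A and Test B above use the same sizes with 'normal' priority:
env = smdk_env(131072, 2048, "normal")
# env can then be passed to subprocess.run(["redis-server", ...], env=env)
```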

7.2.3 Test Results

The table below summarizes the benchmark results using the Memtier benchmark tool. Performance improves by up to 2.5x when the system memory is expanded (DRAM + CXL).

UX A. More system memory

| Relative Bandwidth | Test A - DRAM | Test B - CXL + DRAM |
|---|---|---|
| Set 128B | 1 | 2.08 |
| Set 256B | 1 | 2.57 |
| Set 512B | 1 | 2.18 |
| Set 1KB | 1 | 1.82 |


7.3 Experiment 2 - IMDB (Memcached, UX A and B)

7.3.1 Testbed

| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung CMM-D prototype |
| OS | Ubuntu 20.04 LTS, CentOS-7 x86 2009, Fedora Workstation 38, OpenSUSE Leap 15.5 |
| Kernel | SMDK kernel: 5.17.0-rc5-smdk and later (latest: 6.6.0-smdk) |

7.3.2 Test Configuration

Memcached Configuration

  • UX A
    • #threads: 24
    • maxmemory: 2048
    • memory pre-allocation by extstore options
  • UX B
    • #threads: 24
    • extstore write buffer: 2MiB

Memtier Configuration

  • target: 60GB workload (cf. UX B: 10GB workload)
  • pipeline_num: 1
  • thread_num: 24 (using taskset -c 24-47)
  • client_num: 50
  • key_pattern: P:P
  • ratio: 1:0 (Set) / 0:1 (Get)
  • The performance of UX B - Test A (DRAM + Storage) was calculated based on the time at which all the value data had been written to storage.

SMDK Configuration (CXLMALLOC_CONF)

| use_exmem | Test A exmem_size | Test A normal_size | Test A priority | Test B exmem_size | Test B normal_size | Test B priority | maxmemory_policy | use_auto_arena_scaling |
|---|---|---|---|---|---|---|---|---|
| TRUE | 131072 | 2048 | normal | 131072 | 2048 | normal | remain | FALSE |
| TRUE | 131072 | 131072 | normal | 131072 | 131072 | exmem | remain | FALSE |

7.3.3 Test Results

The tables below summarize the benchmark results using the Memtier benchmark tool. In UX A, performance is similar to that of the DRAM-only system. In the memory scale-up case (UX B, CXL vs. DRAM + Storage), performance improves greatly over the comparison groups.

UX A. More system memory

| Relative Bandwidth | Test A - DRAM | Test B - CXL + DRAM |
|---|---|---|
| Set 128B | 1 | 0.96 |
| Set 256B | 1 | 0.98 |
| Set 512B | 1 | 0.83 |
| Set 1KB | 1 | 0.91 |

UX B. Memory scale-up (vs. system memory + storage)

| Relative Bandwidth | Test A - DRAM + Storage | Test B - CXL scale-up |
|---|---|---|
| Set 4KB | 1 | 3.2 |
| Set 512KB | 1 | 253.3 |
| Get 4KB | 1 | 14.1 |
| Get 512KB | 1 | 7.1 |


7.4 Experiment 3 - ML/AI (GPT model, UX C)

7.4.1 Methodology

We devised an experimental methodology to achieve aggregated memory bandwidth (UX C) on a heterogeneous memory system equipped with both DRAM and CXL memory.

The key idea is:

  1. When DRAM bandwidth is saturated by an application, we additionally allocate CXL memory for the application to use.
  2. CPU/memory resources are isolated and reserved for the application.
  3. While running, the application drives a memory workload on the isolated HW resources large enough to draw the maximum bandwidth out of the memory. Please refer to the composition of the reference testbed below.

[Figure: Composition of the reference testbed]


We also define some notation for the detailed explanation.

[Figure: Notation illustrated on the reference testbed]

  • Dmax_bw: maximum bandwidth of DDR DRAM (e.g., 28GB/s)
  • Cmax_bw: maximum bandwidth of CXL DRAM
  • Duse_bw: in-use bandwidth of DDR DRAM (e.g., < 25GB/s)
  • Cuse_bw: in-use bandwidth of CXL DRAM
  • Dmax_wl: minimum workload size for DDR bandwidth saturation
  • Cmax_wl: minimum workload size for CXL bandwidth saturation
  • Duse_wl: in-use workload size on DDR DRAM from the running application
  • Cuse_wl: in-use workload size on CXL DRAM from the running application
  • Texec: application execution time

Hence, the condition for reaching peak aggregated bandwidth is Duse_wl > Dmax_wl and Cuse_wl > Cmax_wl throughout Texec.
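The saturation condition above can be sketched as a simple predicate. The units are whatever the workload sizes are measured in; the numbers in the example are illustrative only, not measurements from the testbed.

```python
def reaches_peak_aggregation(duse_wl: float, dmax_wl: float,
                             cuse_wl: float, cmax_wl: float) -> bool:
    """Peak aggregated bandwidth requires both memories to stay saturated:
    the in-use workload on each must exceed its saturation threshold
    (Duse_wl > Dmax_wl and Cuse_wl > Cmax_wl) throughout Texec."""
    return duse_wl > dmax_wl and cuse_wl > cmax_wl

# Illustrative numbers: suppose DDR saturates at 20GB of workload, CXL at 30GB.
assert reaches_peak_aggregation(25, 20, 40, 30)      # both saturated
assert not reaches_peak_aggregation(25, 20, 10, 30)  # CXL not saturated
```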

Applying this methodology, we conducted experiments on ML/AI applications, GPT, BERT, and NASNet, to validate how the additional bandwidth from CXL memory actually helps application performance. Given the test set, the ML/AI applications show increased inference throughput (IPM, Inferences Per Minute). The following section explains the test results of a GPT-based application that generates sentences.


7.4.2 GPT + SMDK Bandwidth-based tiering

This experiment shows the improved throughput of the GPT2 application with SMDK's intelligent tiering.

7.4.2.1 Testbed

The table below describes HW/SW testbed for the experiments.

| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung 64GB DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung 128GB CMM-D prototype |
| OS | Ubuntu 20.04 LTS |
| SMDK | SMDK v1.5 |

7.4.2.2 Test Configuration

GPT2 Model Configuration

  • Pytorch >= 2.0, Python >= 3.10
  • Model type : GPT2-base, 12-layer, 768-hidden, 12-heads, 117M parameters, batch_size 8, max-length 128
  • Dataset: imdb reviews

SMDK Configuration (CXLMALLOC_CONF)

  • use_exmem: true
  • use_adaptive_interleaving: true
  • adaptive_interleaving_policy: bw_saturation

7.4.2.3 Test Results

The figures below depict the results of the experiment on the GPT2 application.
With vanilla Linux, i.e., without adaptive interleaving, CXL DRAM was not utilized by the GPT2 application even though DRAM bandwidth was saturated. This is because vanilla Linux does not consider the bandwidth saturation of DDR DRAM, only its available capacity. As a result, throughput is capped at around 240 IPM, leaving the additional bandwidth of CXL DRAM unused.

[Figure: GPT2 throughput on vanilla Linux, capped at around 240 IPM]


Meanwhile, adaptive interleaving automatically detects when DRAM bandwidth saturation occurs by monitoring the system's in-use memory bandwidth via the CPU PMU. Once DDR DRAM bandwidth is saturated, CXL DRAM is used by the application from that point on. As a result, throughput improves markedly to 360 IPM, around 1.5x higher than the vanilla case.

[Figure: GPT2 throughput with adaptive interleaving, reaching 360 IPM]


7.4.3 GPT + SMDK Optimization path

This experiment shows the improved throughput of the GPT2 application using the optimization path, specifically by modifying PyTorch to allocate CXL memory with s_posix_memalign for specific types of pre-trained weights.

7.4.3.1 Testbed

The table below describes HW/SW testbed for the experiments.

| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL, logical cores: 144 |
| DRAM | Samsung 64GB DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung 128GB CMM-D prototype |
| OS | Ubuntu 22.04 LTS |
| SMDK | SMDK v1.5 |

7.4.3.2 Test Configuration

GPT2 Model Configuration

  • Pytorch >= 2.0, Python >= 3.10
  • Model type : GPT2-large, 36-layer, 1280-hidden, 20-heads, 774M parameters, max-length 128
  • Dataset: imdb reviews

SMDK Configuration (SMALLOC_CONF)

  • use_auto_arena_scaling:true

7.4.3.3 Test Results

With the optimized path, CPU cores are fully utilized by the GPT2 application even while DRAM bandwidth is moderately saturated. In the non-optimized path, pre-trained weights are stored in DRAM, consuming its available capacity. Since the pre-trained weights of a large language model are part of the model rather than of the inference data, keeping them in CXL memory lets DRAM serve more requests. In this experiment, therefore, some weights are offloaded through the optimization path to CXL DDR during weight initialization, which lets DDR serve more inference requests while keeping memory bandwidth saturated.

As a result, the throughput, which is limited to 1 (normalized) without the optimization path, increases to 1.99 (normalized) with it.

[Figure: Normalized GPT2-large throughput, 1 without vs. 1.99 with the optimization path]


It is also observed that with the optimization path the GPT2-large pre-trained weights utilize 4GB per instance. If the total cores used are increased from 96 to 144 and CXL DDR usage is increased from 51.32% to 90%, the performance growth factor is projected to rise from 1.99 to 2.5. Improved growth can also be expected if pre-trained-weight utilization per instance increases from 4GB to 6GB. This observation is based on GPT2 variants and their memory consumption.



7.5 Experiment 4 - Containerization

Virtualization is also an important SW stack and a promising usecase for CXL memory adoption. Container technology, a thin form of virtualization, is widely used in industry. We therefore integrated containers with the SMDK and conducted some experiments.

[Figure: Containerization with SMDK]


7.5.1 Containerization with SMDK

  1. Install the required container runtime and Docker.
  2. When creating the container image of an application, include an SMDK plugin (e.g., libcxlmalloc.so for the compatible path).
  3. Start the Docker container as usual; no additional setting is needed when starting a container.
  4. When running the application, set and export the configurations required by the plugin (e.g., LD_PRELOAD and CXLMALLOC_CONF for libcxlmalloc.so). Please refer to the Compatible path section for more details.
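Step 4 above can be sketched as a `docker run` invocation that passes the plugin configuration through environment variables. The image name and the in-container library path below are illustrative assumptions, not names from the SMDK distribution.

```python
def docker_run_cmd(image: str, app: list, conf: str) -> list:
    """Compose a `docker run` command that exports LD_PRELOAD and
    CXLMALLOC_CONF into the container, per step 4 above. The library
    path and image name are assumptions for illustration."""
    return [
        "docker", "run", "--rm",
        "-e", "LD_PRELOAD=/usr/lib/libcxlmalloc.so",  # assumed plugin path
        "-e", f"CXLMALLOC_CONF={conf}",
        image,
    ] + app

cmd = docker_run_cmd("smdk-redis:latest", ["redis-server"],
                     "use_exmem:true,maxmemory_policy:remain")
```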

7.5.2 Experiment Result

The integration of Docker containers and the SMDK worked well. Below is the list of application containers we tested.

  • ML/AI Applications
    • GPT2 Inference (with python, pytorch framework)
    • BERT Inference (with python, tensorflow framework)
    • NASNet Inference (with python, tensorflow framework)
    • DLRM Inference (with python, pytorch framework)
  • In-memory Database(IMDB) Applications
    • Redis
    • Memcached