## **Section I. Complementary Experiments**



Fig. 1: (a) Instructions per cycle normalized to a system with 32GB PCM and no DRAM cache, (b) DRAM cache utilization, and (c) average access frequency of every DRAM cache page, normalized to HDRC

Table I: Instructions per cycle of selected applications (normalized to a system with 32GB PCM) under different CPU configurations

| App     | 4GHz, 4 cores: HDRC | 4GHz, 4 cores: SHMA-HMDyn | 4GHz, 4 cores: SHMA-Static | 2GHz, 32 cores: HDRC | 2GHz, 32 cores: SHMA-HMDyn | 2GHz, 32 cores: SHMA-Static |
|---------|---------------------|---------------------------|----------------------------|----------------------|----------------------------|-----------------------------|
| Astar   | 1.30                | 1.30                      | 1.33                       | 1.17                 | 1.18                       | 1.18                        |
| Canneal | 0.96                | 1.27                      | 1.34                       | 1.05                 | 1.16                       | 1.21                        |
| DICT    | 0.58                | 1.34                      | 1.34                       | 0.67                 | 1.21                       | 1.21                        |
| KNN     | 0.23                | 1.20                      | 1.08                       | 0.31                 | 1.13                       | 1.05                        |
| BFS     | 0.18                | 1.11                      | 1.13                       | 0.25                 | 1.07                       | 1.08                        |

We conduct additional experiments with a 2GHz, 32-core configuration; the results are shown in Figure 1. Figure 1(a) depicts the performance of HDRC, SHMA-HMDyn, SHMA-Static, and a system with 32GB DRAM only, all normalized to a system with 32GB PCM. Averaged over these applications, HDRC obtains only 69.1% of the performance of the baseline configuration, whereas SHMA-HMDyn, SHMA-Static, and the system with 32GB DRAM (the performance upper bound) achieve 15.0%, 14.7%, and 22.3% performance improvement, respectively. Compared to HDRC, SHMA-Static and SHMA-HMDyn exhibit 45.9% and 45.6% performance improvement, respectively.

To understand how changing hardware parameters affects the experimental results, Table I compares the IPC of each selected application under the 4GHz, 4-core configuration with that under the 2GHz, 32-core configuration. The performance improvement of SHMA is more pronounced in the 4GHz configuration than in the 2GHz configuration, because the speed gap between memory and the CPU is larger at 4GHz than at 2GHz. The reduction in average memory access latency achieved by SHMA therefore has a greater effect on reducing CPU stall time and core contention in the 4GHz configuration than in the 2GHz configuration. On the other hand, our applications have little shared data, so the number of cores has little impact on their performance.
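The comparison drawn from Table I can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original evaluation: the dictionary layout and the helper names `config_gap` and `speedup_over_hdrc` are illustrative assumptions, and the only data used are the normalized IPC values listed in Table I.

```python
# Normalized IPC values copied from Table I (baseline: system with 32GB PCM).
# The dictionary layout and helper names are illustrative assumptions.
FAST = "4GHz, 4 cores"
SLOW = "2GHz, 32 cores"

TABLE_I = {
    "Astar":   {FAST: {"HDRC": 1.30, "SHMA-HMDyn": 1.30, "SHMA-Static": 1.33},
                SLOW: {"HDRC": 1.17, "SHMA-HMDyn": 1.18, "SHMA-Static": 1.18}},
    "Canneal": {FAST: {"HDRC": 0.96, "SHMA-HMDyn": 1.27, "SHMA-Static": 1.34},
                SLOW: {"HDRC": 1.05, "SHMA-HMDyn": 1.16, "SHMA-Static": 1.21}},
    "DICT":    {FAST: {"HDRC": 0.58, "SHMA-HMDyn": 1.34, "SHMA-Static": 1.34},
                SLOW: {"HDRC": 0.67, "SHMA-HMDyn": 1.21, "SHMA-Static": 1.21}},
    "KNN":     {FAST: {"HDRC": 0.23, "SHMA-HMDyn": 1.20, "SHMA-Static": 1.08},
                SLOW: {"HDRC": 0.31, "SHMA-HMDyn": 1.13, "SHMA-Static": 1.05}},
    "BFS":     {FAST: {"HDRC": 0.18, "SHMA-HMDyn": 1.11, "SHMA-Static": 1.13},
                SLOW: {"HDRC": 0.25, "SHMA-HMDyn": 1.07, "SHMA-Static": 1.08}},
}

SHMA_SCHEMES = ("SHMA-HMDyn", "SHMA-Static")


def config_gap(app, scheme):
    """Normalized IPC at 4GHz/4 cores minus normalized IPC at 2GHz/32 cores."""
    return TABLE_I[app][FAST][scheme] - TABLE_I[app][SLOW][scheme]


def speedup_over_hdrc(app, config, scheme):
    """Ratio of a scheme's normalized IPC to HDRC's in the same configuration."""
    row = TABLE_I[app][config]
    return row[scheme] / row["HDRC"]


for app in TABLE_I:
    for scheme in SHMA_SCHEMES:
        print(f"{app:8s} {scheme:12s} "
              f"gap(4GHz - 2GHz) = {config_gap(app, scheme):+.2f}  "
              f"vs HDRC @4GHz = {speedup_over_hdrc(app, FAST, scheme):.2f}x  "
              f"vs HDRC @2GHz = {speedup_over_hdrc(app, SLOW, scheme):.2f}x")
```

For every application in Table I, both SHMA variants have a positive gap, which is the observation that the improvement over the PCM baseline is larger at 4GHz than at 2GHz.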

Figure 2 shows the normalized power and energy of the studied systems configured with 2GHz CPUs, using the system with 32GB DRAM as the baseline. HDRC, SHMA-HMDyn, SHMA-Static, and the system with 32GB PCM consume 26.4%, 68.0%, 67.9%, and 66.8% less energy than the system with 32GB DRAM, respectively. For these applications, SHMA-HMDyn and SHMA-Static consume 96.2% and 96.4% of the energy of the system with 32GB PCM, respectively, while HDRC consumes 221.4%. We conclude that SHMA and its improved versions are much more energy efficient than HDRC under the 2GHz, 32-core configuration.
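The two ways of reporting energy (savings relative to DRAM and consumption relative to PCM) can be related with a short calculation. The sketch below is only a rough cross-check built from the averaged percentages quoted above; the names `SAVINGS_VS_DRAM` and `energy_relative_to_pcm` are hypothetical. Because it takes a ratio of averages rather than averaging per-application ratios, its results should be close to, but need not exactly match, the quoted 96.2%, 96.4%, and 221.4%.

```python
# Average energy savings relative to the 32GB DRAM system, as quoted in the text.
# Variable and function names are illustrative assumptions.
SAVINGS_VS_DRAM = {
    "HDRC": 0.264,
    "SHMA-HMDyn": 0.680,
    "SHMA-Static": 0.679,
    "PCM": 0.668,
}


def energy_relative_to_pcm(scheme):
    """Energy of `scheme` divided by energy of the 32GB PCM system.

    Both energies are first expressed as fractions of the DRAM baseline
    (1 - savings), then divided. This is a ratio of averages.
    """
    energy_vs_dram = 1.0 - SAVINGS_VS_DRAM[scheme]
    pcm_vs_dram = 1.0 - SAVINGS_VS_DRAM["PCM"]
    return energy_vs_dram / pcm_vs_dram


for scheme in ("HDRC", "SHMA-HMDyn", "SHMA-Static"):
    print(f"{scheme:12s} {100 * energy_relative_to_pcm(scheme):.1f}% of PCM energy")
```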



Fig. 2: (a) Power and (b) energy of HDRC, SHMA-HMDyn, SHMA-Static, and a system with 32GB PCM (both normalized to a system with 32GB DRAM)

As shown in Figure 1(a), we also measure the IPC of the system with 32GB DRAM under the 2GHz, 32-core configuration. Averaged over these applications, the system with 32GB DRAM (the performance upper bound) achieves a 22.3% performance improvement over the system with 32GB PCM.

The performance gap between SHMA-Static (SHMA-HMDyn) and the upper bound is within 5% (6%). We conclude that SHMA and its improved versions achieve good performance in a DRAM-NVM hybrid memory architecture.

## **Section II. Details of Last Level Page Table Entry and TLB Entry**

The concrete formats of the extended last level page table entry in different paging modes and of the modified TLB entry of SHMA in the MIPS R2000/3000 architecture are shown in Figure 3 and Figure 4, respectively.



(d) Extended format of last level page table entry in long/PAE paging mode

Fig. 3: Extended last level page table entry of SHMA in different paging modes



Fig. 4: Modified TLB entry of SHMA in MIPS R2000/3000 architecture