## Section I. Complementary Experiments as Advised

## Part A. Review 1-P1



Fig. 1 (a) Instructions per cycle (IPC) normalized to a system with 32GB PCM and no DRAM cache, (b) DRAM cache utilization, and (c) average access frequency of every DRAM cache page, normalized to HDRC

Table I: Instructions per cycle (IPC) of selected applications (normalized to a system with 32GB PCM) before and after modifying hardware parameters

| App     | HDRC (4GHz, 4 cores) | SHMA-HMDyn (4GHz, 4 cores) | SHMA-Static (4GHz, 4 cores) | HDRC (2GHz, 32 cores) | SHMA-HMDyn (2GHz, 32 cores) | SHMA-Static (2GHz, 32 cores) |
|---------|----------------------|----------------------------|-----------------------------|-----------------------|-----------------------------|------------------------------|
| astar   | 1.30                 | 1.30                       | 1.33                        | 1.17                  | 1.18                        | 1.18                         |
| Canneal | 0.96                 | 1.27                       | 1.34                        | 1.05                  | 1.16                        | 1.21                         |
| DICT    | 0.58                 | 1.34                       | 1.34                        | 0.67                  | 1.21                        | 1.21                         |
| KNN     | 0.23                 | 1.20                       | 1.08                        | 0.31                  | 1.13                        | 1.05                         |
| BFS     | 0.18                 | 1.11                       | 1.13                        | 0.25                  | 1.07                        | 1.08                         |

All entries are normalized IPC.

As advised, we repeat part of the experiment with the 2GHz, 32-core configuration. The results are shown in Figure 1. Figure 1(a) depicts the normalized performance of HDRC, SHMA-HMDyn, SHMA-Static, and a system with 32GB DRAM only; a system with 32GB PCM is the baseline. Averaged over these applications, HDRC reaches only 69.1% of the baseline performance, whereas SHMA-HMDyn, SHMA-Static, and the system with 32GB DRAM (the performance upper bound) improve performance by 15.0%, 14.7%, and 22.3%, respectively. Compared to HDRC, SHMA-Static and SHMA-HMDyn exhibit 45.9% and 45.6% performance improvement, respectively. Because only a subset of the workloads is selected here, the improvement is lower and less pronounced than the results shown in our thesis.
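As a cross-check, the 2GHz, 32-core averages can be recomputed from the five applications listed in Table I. The following is a minimal sketch, assuming an arithmetic mean (the averaging method is not stated in this response); the names `IPC_2GHZ`, `mean`, and `summarize` are hypothetical and used only for illustration.

```python
# Normalized IPC at 2GHz, 32 cores, copied from Table I
# (baseline: system with 32GB PCM = 1.0).
# Application order: astar, Canneal, DICT, KNN, BFS.
IPC_2GHZ = {
    "HDRC":        [1.17, 1.05, 0.67, 0.31, 0.25],
    "SHMA-HMDyn":  [1.18, 1.16, 1.21, 1.13, 1.07],
    "SHMA-Static": [1.18, 1.21, 1.21, 1.05, 1.08],
}


def mean(values):
    """Arithmetic mean; an assumption, since the averaging method is not stated."""
    return sum(values) / len(values)


def summarize(table):
    """Return the mean normalized IPC of every scheme in the table."""
    return {scheme: mean(values) for scheme, values in table.items()}


if __name__ == "__main__":
    for scheme, avg in summarize(IPC_2GHZ).items():
        # A value above 1.0 means faster than the 32GB PCM baseline.
        print(f"{scheme}: mean normalized IPC = {avg:.3f}")
```

With these inputs the means are about 0.69 for HDRC, 1.15 for SHMA-HMDyn, and 1.146 for SHMA-Static, which agrees with the 69.1%, 15.0%, and 14.7% figures quoted above up to rounding of the table entries.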

To understand how the hardware parameters affect the results, we compare the IPC of every selected application before (4GHz, 4 cores) and after (2GHz, 32 cores) the change, as shown in Table I. The performance improvement of SHMA is more pronounced in the 4GHz configuration than in the 2GHz configuration. This is because the speed gap between 4GHz CPUs and memory is larger than the gap between 2GHz CPUs and memory. A reduction in average memory access latency therefore has a greater effect on reducing core contention for 4GHz CPUs than for 2GHz CPUs.
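The frequency argument can be illustrated with a simple first-order stall model. This is only a sketch under stated assumptions, not the simulator used for our experiments: it models a single core, ignores the change in core count, and every parameter value (core CPI, misses per instruction, and the two memory latencies) is hypothetical.

```python
def time_per_instruction_ns(freq_ghz, core_cpi, misses_per_instr, mem_latency_ns):
    """Core execution time plus memory stall time per instruction, in ns.

    core_cpi / freq_ghz converts core cycles to nanoseconds; the memory
    latency is a wall-clock quantity and does not scale with CPU frequency.
    """
    return core_cpi / freq_ghz + misses_per_instr * mem_latency_ns


def speedup_from_lower_latency(freq_ghz, slow_ns, fast_ns,
                               core_cpi=1.0, misses_per_instr=0.01):
    """Speedup obtained when average memory latency drops from slow_ns to fast_ns."""
    slow = time_per_instruction_ns(freq_ghz, core_cpi, misses_per_instr, slow_ns)
    fast = time_per_instruction_ns(freq_ghz, core_cpi, misses_per_instr, fast_ns)
    return slow / fast


if __name__ == "__main__":
    # Hypothetical latencies: 300 ns before and 100 ns after the improvement.
    for freq in (2.0, 4.0):
        gain = speedup_from_lower_latency(freq, slow_ns=300.0, fast_ns=100.0)
        print(f"{freq:.0f} GHz: speedup = {gain:.2f}")
```

In this model the memory stall term takes a larger share of execution time at the higher clock frequency, so the same latency reduction yields a larger speedup at 4GHz than at 2GHz.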

Figure 2 shows the normalized power and energy of the studied systems configured with 2GHz CPUs. The system with 32GB DRAM is the baseline. HDRC, SHMA-HMDyn, SHMA-Static, and the system with 32GB PCM consume 26.4%, 68.0%, 67.9%, and 66.8% less energy than the system with 32GB DRAM, respectively. For the studied applications, SHMA-HMDyn and SHMA-Static consume 96.2% and 96.4% of the energy of the system with 32GB PCM, respectively, whereas HDRC consumes 221.4%. Because only a subset of the workloads used in our thesis is selected here, the energy efficiency gain is not as pronounced as in Figure 10 of our thesis. We conclude that SHMA and its improved versions are much more energy efficient than HDRC in the 2GHz, 32-core configuration.
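The PCM-relative energy figures follow from the DRAM-relative savings by a change of baseline. The sketch below recomputes them from the rounded percentages quoted above; the names `SAVINGS_VS_DRAM` and `relative_to` are hypothetical, and small deviations from the reported values are expected because the inputs are rounded.

```python
# Energy savings relative to the system with 32GB DRAM, as quoted in the text.
SAVINGS_VS_DRAM = {
    "HDRC": 0.264,
    "SHMA-HMDyn": 0.680,
    "SHMA-Static": 0.679,
    "PCM": 0.668,
}


def relative_to(savings, reference):
    """Re-express energy as a fraction of the reference system's energy.

    Each system's energy normalized to DRAM is (1 - saving); dividing by the
    reference system's normalized energy changes the baseline.
    """
    reference_energy = 1.0 - savings[reference]
    return {name: (1.0 - s) / reference_energy for name, s in savings.items()}


if __name__ == "__main__":
    for name, ratio in relative_to(SAVINGS_VS_DRAM, "PCM").items():
        print(f"{name}: {ratio * 100:.1f}% of PCM energy")
```

Starting from the rounded inputs, this gives roughly 2.22 for HDRC and between 0.96 and 0.97 for the two SHMA variants, close to the 221.4%, 96.2%, and 96.4% quoted above.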



Fig. 2: (a) Power, (b) Energy of HDRC, SHMA-HMDyn, SHMA-Static and a system with 32GB PCM (both are normalized to a system with 32GB DRAM)

## Part B. Reviewer3-p5

As shown in Figure 1(a), we redo our experiment with the 2GHz, 32-core configuration and measure the IPC of a system with 32GB DRAM. Across these applications, the system with 32GB DRAM (the performance upper bound) achieves a 22.3% performance improvement over the system with 32GB PCM. The performance gap between SHMA-Static (SHMA-HMDyn) and the performance upper bound is within 5% (6%). We conclude that SHMA and its improved versions achieve good performance in a DRAM-NVM hybrid memory architecture.

## Section II. Details of Last Level Page Table Entry and TLB Entry

The extended last level page table entry of SHMA in different paging modes and the modified TLB entry of SHMA in the MIPS R2000/3000 architecture are shown in Figure 3 and Figure 4, respectively.



(d) SHMA extended format of last level page table entry in long/PAE paging mode (field labels recoverable from the figure: NVM Page Address, Page Address)

Fig. 3 Extended last level page table entry of SHMA in different paging mode



(b) SHMA modified format of TLB entry in MIPS R2000/3000 architecture

Fig. 4 Modified TLB entry of SHMA in MIPS R2000/3000 architecture