## [Demo] Task Mapping on a DVFS-enabled Heterogeneous System
[A2] Task Mapping on Soft Heterogeneous Systems   
Apan Qasem [\<apan@txstate.edu\>](mailto:apan@txstate.edu)


### Description 

Demonstrate the performance and energy impact of operational frequency on heterogeneous multicore systems. 

### Software and Tools

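The following Linux tools are used in this demo. The energy measurements later in the demo are taken with `likwid-perfctr` from the LIKWID tool suite.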

  * `cpufrequtils`
  * `cpupower`
  * `perf`
  * `energy`
  * `taskset`
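
Before class, it can help to confirm that the command-line tools are actually on the `PATH`. The sketch below is one minimal way to do that; the function name `missing_tools` is hypothetical, and the command names checked (`cpufreq-info` from `cpufrequtils`, plus `likwid-perfctr`, which is used later in the demo) are assumptions about how the packages are installed on your system.

```bash
# Print the name of every tool passed as an argument that is not on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands used in this demo; adjust the list to match your setup.
missing_tools cpufreq-info cpupower perf taskset likwid-perfctr
```

An empty result means every listed command was found.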

The demo also includes a simple C++/OpenMP code that performs matrix-vector multiplication in
parallel. 


### Environment

Below are instructions for setting up a homogeneous multicore system as a DVFS-supported heterogeneous platform. 
These steps should be carried out prior to class time. We created a [script](./code/build_hc_env.sh)
to carry out these tasks automatically. Note that the tasks below require root access. Follow the
guidelines in the script if root access is not available. 

**0. Download sample codes and utility scripts from the ToUCH repo**

An OpenMP parallel implementation of matrix-vector multiplication is used as a running example for
this demo. There are three utility scripts for tweaking the frequencies.  

```bash 
git clone https://github.com/TeachingUndergradsCHC/modules.git
```
 
**1. Install necessary packages and their dependencies**

Install `cpufrequtils`

In [8]:
sudo apt install cpufrequtils

Reading package lists... Done
Building dependency tree       
Reading state information... Done
cpufrequtils is already the newest version (008-1build1).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.


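Install `perf` and `cpupower` if they are not already installed. These tools are provided by the
`linux-tools` packages: `linux-tools-common` holds the shared files and `linux-tools-<kernel version>`
holds the binaries that match the running kernel. `taskset` is part of `util-linux`, which is normally preinstalled.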

In [9]:
sudo apt install linux-tools-common

Reading package lists... Done
Building dependency tree       
Reading state information... Done
linux-tools-common is already the newest version (4.15.0-136.140).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.


In [10]:
sudo apt install linux-tools-$(uname -r)


**2. Check CPU clock frequencies**

Clock frequencies of individual cores can be inspected with various utilities. 

In [12]:
cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping	: 2
microcode	: 0x27
cpu MHz		: 1601.556
cache size	: 20480 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 15
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_by
[...]
cache size	: 20480 KB
physical id	: 0
siblings	: 16
core id		: 7
cpu cores	: 8
apicid		: 14
initial apicid	: 14
fpu		: yes
fpu_exception	: yes
cpuid level	: 15
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4788.89
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 8
[...]
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4788.89
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping	: 2
microcode	: 0x27
cpu MHz		: 1198.931
cache siz
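
Most of the `/proc/cpuinfo` output is not needed for this demo; the `cpu MHz` lines are the relevant part. The sketch below filters the output down to one line per logical processor. The function name `cpu_mhz` is hypothetical, and the field names assume the x86 layout shown above (other architectures may not report `cpu MHz`).

```bash
# Print one "CPU<n>: <MHz> MHz" line per logical processor.
# Reads /proc/cpuinfo-formatted text from the given file (default: /proc/cpuinfo).
cpu_mhz() {
  awk -F': *' '
    /^processor/ { cpu = $2 }                            # remember the current processor id
    /^cpu MHz/   { printf "CPU%s: %s MHz\n", cpu, $2 }   # report its current clock
  ' "${1:-/proc/cpuinfo}"
}
```

Calling `cpu_mhz` with no argument reads the live `/proc/cpuinfo`.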

In [13]:
cpufreq-info

cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 4294.55 ms.
  hardware limits: 1.20 GHz - 3.20 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.80 GHz and 1.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.55 GHz.
analyzing CPU 1:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 4294.55 ms.
  hardware limits: 1.20 GHz - 3.20 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.80 GHz and 1.80 GHz.
                  [...]
                  within this range.
  current CPU frequency is 1.20 GHz.


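The ToUCH repository has a script that provides cleaner output. This script might be more suitable for the in-class demo. Note that the script prints frequencies without the decimal point: `180 GHz` stands for 1.80 GHz, and `120 GHz - 320 GHz` stands for the 1.20 GHz - 3.20 GHz hardware range reported by `cpufreq-info`.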

In [17]:
cd SIGCSE21

In [18]:
./check_clk_speed.sh

CPU0:	180 GHz - 180 GHz
CPU1:	180 GHz - 180 GHz
CPU2:	180 GHz - 180 GHz
CPU3:	180 GHz - 180 GHz
CPU4:	120 GHz - 320 GHz
CPU5:	120 GHz - 320 GHz
CPU6:	120 GHz - 320 GHz
CPU7:	120 GHz - 320 GHz
CPU8:	120 GHz - 320 GHz
CPU9:	120 GHz - 320 GHz
CPU10:	120 GHz - 320 GHz
CPU11:	120 GHz - 320 GHz
CPU12:	120 GHz - 320 GHz
CPU13:	120 GHz - 320 GHz
CPU14:	120 GHz - 320 GHz
CPU15:	120 GHz - 320 GHz
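
For readers curious how such a summary can be produced, here is a minimal sketch that reads the per-core policy limits from the Linux cpufreq sysfs files (`scaling_min_freq` and `scaling_max_freq`, both in kHz). It is not the repository's `check_clk_speed.sh`; the function names are hypothetical, and the optional root-directory argument exists only so the function can be tried against a test directory.

```bash
# Convert a frequency in kHz (the unit used by the cpufreq sysfs files) to GHz.
khz_to_ghz() {
  awk -v khz="$1" 'BEGIN { printf "%.2f", khz / 1000000 }'
}

# Print "CPU<n>: <min> GHz - <max> GHz" for every core that exposes a cpufreq policy.
# The optional argument overrides the sysfs root directory.
show_policy_limits() {
  local root="${1:-/sys/devices/system/cpu}" n=0 dir
  while [ -d "$root/cpu$n" ]; do
    dir="$root/cpu$n/cpufreq"
    if [ -r "$dir/scaling_min_freq" ] && [ -r "$dir/scaling_max_freq" ]; then
      printf 'CPU%s:\t%s GHz - %s GHz\n' "$n" \
        "$(khz_to_ghz "$(cat "$dir/scaling_min_freq")")" \
        "$(khz_to_ghz "$(cat "$dir/scaling_max_freq")")"
    fi
    n=$((n + 1))
  done
}
```

Reading these files does not require root access, so `show_policy_limits` can be run by students as well.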


**3. Lower frequencies for a subset of cores**

We will simulate a less powerful (i.e., _little_) core by fixing its frequency at a value below the hardware maximum.
To lower the frequency of an individual core we can use the `cpupower` utility. Root privileges are needed to change the clock frequency. The commands below fix the frequency of core 0 at 1.80 GHz by setting both the lower limit (`-d`) and the upper limit (`-u`) of its policy to 1800000 kHz. 

In [11]:
sudo cpupower -c 0 frequency-set -d 1800000
sudo cpupower -c 0 frequency-set -u 1800000

Setting cpu: 0
Setting cpu: 0


Verify that the change has taken effect.

In [19]:
./check_clk_speed.sh

CPU0:	180 GHz - 180 GHz
CPU1:	180 GHz - 180 GHz
CPU2:	180 GHz - 180 GHz
CPU3:	180 GHz - 180 GHz
CPU4:	120 GHz - 320 GHz
CPU5:	120 GHz - 320 GHz
CPU6:	120 GHz - 320 GHz
CPU7:	120 GHz - 320 GHz
CPU8:	120 GHz - 320 GHz
CPU9:	120 GHz - 320 GHz
CPU10:	120 GHz - 320 GHz
CPU11:	120 GHz - 320 GHz
CPU12:	120 GHz - 320 GHz
CPU13:	120 GHz - 320 GHz
CPU14:	120 GHz - 320 GHz
CPU15:	120 GHz - 320 GHz


The syntax for the `cpupower` utility is a little cumbersome when we are trying to fix the frequency to a specific value. The `set_clk_speed.sh` script in the ToUCH repo is a wrapper around `cpupower` that provides a cleaner interface. 

In [20]:
sudo ./set_clk_speed.sh 0-3 2.4

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
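
The idea behind such a wrapper can be sketched in a few lines: convert the frequency from GHz to kHz (the default unit of `cpupower frequency-set`) and set the lower and upper policy limits to the same value. This is an illustration under those assumptions, not the contents of `set_clk_speed.sh`; the function names `ghz_to_khz` and `pin_freq_cmds` are hypothetical, and the function only prints the commands instead of running them.

```bash
# Convert GHz (e.g. 2.4) to the kHz value that cpupower expects by default.
ghz_to_khz() {
  awk -v ghz="$1" 'BEGIN { printf "%.0f", ghz * 1000000 }'
}

# Print the two cpupower commands that fix the given cores at one frequency.
# Nothing is executed here; the output is plain text.
pin_freq_cmds() {
  local cores="$1" khz
  khz="$(ghz_to_khz "$2")"
  printf 'cpupower -c %s frequency-set -d %s\n' "$cores" "$khz"   # lower limit
  printf 'cpupower -c %s frequency-set -u %s\n' "$cores" "$khz"   # upper limit
}

pin_freq_cmds 0-3 2.4
```

Printing the commands first makes it easy to review them in class before applying them with root privileges.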


In [21]:
./check_clk_speed.sh

CPU0:	240 GHz - 240 GHz
CPU1:	240 GHz - 240 GHz
CPU2:	240 GHz - 240 GHz
CPU3:	240 GHz - 240 GHz
CPU4:	120 GHz - 320 GHz
CPU5:	120 GHz - 320 GHz
CPU6:	120 GHz - 320 GHz
CPU7:	120 GHz - 320 GHz
CPU8:	120 GHz - 320 GHz
CPU9:	120 GHz - 320 GHz
CPU10:	120 GHz - 320 GHz
CPU11:	120 GHz - 320 GHz
CPU12:	120 GHz - 320 GHz
CPU13:	120 GHz - 320 GHz
CPU14:	120 GHz - 320 GHz
CPU15:	120 GHz - 320 GHz


There is another script `reset_clk_speed.sh` that resets the frequencies to their default values. 

In [22]:
./reset_clk_speed.sh 0-3

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3


In [23]:
./check_clk_speed.sh

CPU0:	120 GHz - 320 GHz
CPU1:	120 GHz - 320 GHz
CPU2:	120 GHz - 320 GHz
CPU3:	120 GHz - 320 GHz
CPU4:	120 GHz - 320 GHz
CPU5:	120 GHz - 320 GHz
CPU6:	120 GHz - 320 GHz
CPU7:	120 GHz - 320 GHz
CPU8:	120 GHz - 320 GHz
CPU9:	120 GHz - 320 GHz
CPU10:	120 GHz - 320 GHz
CPU11:	120 GHz - 320 GHz
CPU12:	120 GHz - 320 GHz
CPU13:	120 GHz - 320 GHz
CPU14:	120 GHz - 320 GHz
CPU15:	120 GHz - 320 GHz
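
A reset can be sketched along the same lines, assuming that "default" means the hardware limits the kernel reports in `cpuinfo_min_freq` and `cpuinfo_max_freq` (1.20 GHz and 3.20 GHz on the system shown here). The function name `reset_freq_cmd` is hypothetical, this is not the repository's `reset_clk_speed.sh`, and the command is only printed, not executed.

```bash
# Print a cpupower command that restores one core's policy limits to the
# hardware limits reported in sysfs (values are in kHz).
# The optional second argument overrides the sysfs root directory.
reset_freq_cmd() {
  local core="$1" root="${2:-/sys/devices/system/cpu}" dir
  dir="$root/cpu$core/cpufreq"
  printf 'cpupower -c %s frequency-set -d %s -u %s\n' "$core" \
    "$(cat "$dir/cpuinfo_min_freq")" "$(cat "$dir/cpuinfo_max_freq")"
}
```

For example, `reset_freq_cmd 0` prints the command for core 0, which can then be run with root privileges.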


To configure this 16-core system as "big-LITTLE", we will lower the frequencies for cores 0-7 and leave the rest at their default values. Cores 0-7 will serve as the _little_ cores and 8-15 will serve as the _big_ cores. Other, more complex configurations can easily be set up if the instructor chooses to do a more involved demo (e.g., in a CS2 course rather than CS1).

In [25]:
./set_clk_speed.sh 0-7 1.8

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7


In [26]:
./check_clk_speed.sh

CPU0:	180 GHz - 180 GHz
CPU1:	180 GHz - 180 GHz
CPU2:	180 GHz - 180 GHz
CPU3:	180 GHz - 180 GHz
CPU4:	180 GHz - 180 GHz
CPU5:	180 GHz - 180 GHz
CPU6:	180 GHz - 180 GHz
CPU7:	180 GHz - 180 GHz
CPU8:	120 GHz - 320 GHz
CPU9:	120 GHz - 320 GHz
CPU10:	120 GHz - 320 GHz
CPU11:	120 GHz - 320 GHz
CPU12:	120 GHz - 320 GHz
CPU13:	120 GHz - 320 GHz
CPU14:	120 GHz - 320 GHz
CPU15:	120 GHz - 320 GHz
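
With the configuration in place, the role of each core can be derived mechanically: a core acts as _little_ when its policy maximum (`scaling_max_freq`) is below the hardware maximum (`cpuinfo_max_freq`). The sketch below applies that rule; the function name `core_roles` and the little/big labelling rule itself are assumptions made for this illustration.

```bash
# Label each core "little" when its policy maximum is below the hardware
# maximum, and "big" otherwise. The optional argument overrides the sysfs root.
core_roles() {
  local root="${1:-/sys/devices/system/cpu}" n=0 dir
  while [ -d "$root/cpu$n" ]; do
    dir="$root/cpu$n/cpufreq"
    if [ -r "$dir/scaling_max_freq" ] && [ -r "$dir/cpuinfo_max_freq" ]; then
      if [ "$(cat "$dir/scaling_max_freq")" -lt "$(cat "$dir/cpuinfo_max_freq")" ]; then
        echo "CPU$n little"
      else
        echo "CPU$n big"
      fi
    fi
    n=$((n + 1))
  done
}
```

On a system set up as described, cores 0-7 would be expected to be listed as little and cores 8-15 as big.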


### Instructions 

The main steps for the in-class demo are outlined below.

**1. Discuss the heterogeneous system.**

Log into a system that has been set up to simulate a heterogeneous system (or use this notebook) and review its attributes.

In [27]:
cpufreq-info

cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 4294.55 ms.
  hardware limits: 1.20 GHz - 3.20 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.80 GHz and 1.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.80 GHz.
analyzing CPU 1:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 4294.55 ms.
  hardware limits: 1.20 GHz - 3.20 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.80 GHz and 1.80 GHz.
                  [...]
                  within this range.
  current CPU frequency is 1.80 GHz.


**2. Review matrix-vector multiplication code**

Pull up the matrix-vector source code in an editor and do a walk-through.

  * discuss command-line arguments 
  * discuss basics of an OpenMP directive
  
```C++
double dot_prod(double *x, double *y, int n) {
  double sum = 0.0;
  int i;
#pragma omp parallel for reduction(+:sum)
  for (i = 0; i < n; i++)
      sum += x[i] * y[i];
  return sum;
}

void matrix_vector_mult(double **mat, double *vec, double *result,
                        long long rows, long long cols) {

  /* not parallelized to ensure runtimes are more meaningful */
  int i;
  for (i = 0; i < rows; i++)
    result[i] = dot_prod(mat[i], vec, cols);
}
```

**3. Build the code on the command-line**

In [28]:
gcc -o matvec -fopenmp -O3 matvec.c

`matvec` is parallelized with OpenMP, so the `-fopenmp` flag is required. Compiling at `-O3` is
likely to give more predictable performance numbers. 
   
**4. Run and time the sequential and parallel versions of the code**

Run the code with a single thread (i.e., serial version). The three command-line arguments are the
matrix size, the number of reps, and the number of threads. The matrix size and number of reps can be
adjusted based on the system where the code is running and the amount of time to be devoted to this
demo. With a matrix size of 10000 and 20 reps, the sequential version should run for 3-4 seconds. 

In [29]:
time ./matvec 10000 20 1

Verification: result[0] = 2.61e+09

Compute time = 2.034 s

real	0m3.232s
user	0m2.976s
sys	0m0.256s


In [30]:
time ./matvec 10000 20 2

Verification: result[0] = 2.61e+09

Compute time = 1.333 s

real	0m2.512s
user	0m3.560s
sys	0m0.284s


Discuss the performance improvements with parallelization. Time permitting, the code can be run with
2, 4, ... N threads (where N = number of processing cores on the system) to show the scalability of
the code and discuss Amdahl's Law. 
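
The numbers for this discussion can be worked out directly from the reported compute times. The sketch below computes the speedup S = T1 / TN and the parallel fraction p implied by Amdahl's Law, S = 1 / ((1 - p) + p / N), which rearranges to p = (1 - 1/S) / (1 - 1/N) for N > 1. The function names are hypothetical; the sample values are the compute times from the 1-thread and 2-thread runs above.

```bash
# Speedup of a parallel run relative to the serial run: S = T1 / TN.
speedup() {
  awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Parallel fraction implied by Amdahl's Law for N threads (N must be > 1):
# p = (1 - 1/S) / (1 - 1/N), with 1/S = TN / T1.
parallel_fraction() {
  awk -v t1="$1" -v tn="$2" -v n="$3" \
    'BEGIN { printf "%.2f\n", (1 - tn / t1) / (1 - 1 / n) }'
}

# Thread counts 1, 2, 4, ... that do not exceed the given core count.
thread_counts() {
  local t=1 out=""
  while [ "$t" -le "$1" ]; do
    out="$out $t"
    t=$((t * 2))
  done
  printf '%s\n' "${out# }"
}

speedup 2.034 1.333              # prints 1.53
parallel_fraction 2.034 1.333 2  # prints 0.69
thread_counts 16                 # prints 1 2 4 8 16
```

A scalability run could then loop over the output of `thread_counts`, passing each value as the third argument to `./matvec`.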

**4. Discuss mapping of threads to processors**

   Introduce the `taskset` utility and discuss how it can be used to map threads to processing cores.
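
One way to confirm that a mapping took effect is to read the `Cpus_allowed_list` field that Linux reports in `/proc/<pid>/status`. The sketch below extracts that field; the function name `allowed_cpus` is hypothetical, and the optional file argument exists only so the function can be tried on a saved copy of a status file.

```bash
# Print the list of cores a process is allowed to run on, as reported by the
# kernel. Reads the given status file (default: the calling process's own).
allowed_cpus() {
  awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}
```

For a quick check without the helper, `taskset -c 2,5 grep Cpus_allowed_list /proc/self/status` shows the affinity that a program launched through `taskset` inherits.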

In [31]:
## run program on core 0 with 4 threads 
taskset -c 0 ./matvec 10000 20 4

Verification: result[0] = 2.61e+09

Compute time = 7.708 s

In [32]:
## run program on 2 cores (2 and 5) with 4 threads 
taskset -c 2,5 ./matvec 10000 20 4

Verification: result[0] = 2.61e+09

Compute time = 4.812 s

**5. Run code on _little_ cores**
  
  Run the code on the cores set up as little cores and measure execution time.

In [33]:
taskset -c 0-7 ./matvec 10000 20 8

Verification: result[0] = 2.61e+09

Compute time = 1.224 s

Re-run the code and measure detailed performance metrics with `perf`

In [34]:
perf stat taskset -c 0-7 ./matvec 10000 20 8

Verification: result[0] = 2.61e+09

Compute time = 1.204 s

 Performance counter stats for 'taskset -c 0-7 ./matvec 10000 20 8':

      11661.482175      task-clock (msec)         #    3.604 CPUs utilized          
                10      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           195,554      page-faults               #    0.017 M/sec                  
    21,657,946,400      cycles                    #    1.857 GHz                    
    18,678,511,328      instructions              #    0.86  insn per cycle         
     3,617,563,856      branches                  #  310.215 M/sec                  
        10,779,682      branch-misses             #    0.30% of all branches        

       3.235513519 seconds time elapsed
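
The derived columns on the right of the `perf stat` output can be recomputed from the raw counters, which is a useful exercise when discussing what they mean. The sketch below does this for instructions per cycle, the effective clock rate, and the number of CPUs utilized; the function names are hypothetical, and the inputs are the counter values from the run above.

```bash
# Instructions per cycle: instructions / cycles.
ipc() {
  awk -v c="$1" -v i="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# Effective clock rate in GHz: cycles / task-clock, with task-clock in msec.
effective_ghz() {
  awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", c / (ms * 1000000) }'
}

# CPUs utilized: task-clock (msec) / elapsed wall-clock time (sec).
cpus_utilized() {
  awk -v ms="$1" -v s="$2" 'BEGIN { printf "%.3f\n", ms / (s * 1000) }'
}

ipc 21657946400 18678511328          # prints 0.86
effective_ghz 21657946400 11661.482175   # prints 1.857
cpus_utilized 11661.482175 3.235513519   # prints 3.604
```

The effective rate of about 1.86 GHz is close to the 1.80 GHz limit set on the little cores.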



Re-run the code and measure power and energy

In [35]:
likwid-perfctr -c 0-7 -g ENERGY taskset -c 0-7 ./matvec 10000 20 8

--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
CPU type:	Intel Xeon Haswell EN/EP/EX processor
CPU clock:	2.39 GHz
--------------------------------------------------------------------------------
Verification: result[0] = 2.61e+09

Compute time = 1.239 s
--------------------------------------------------------------------------------
Group 1: ENERGY
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+
|         Event         | Counter |   Core 0   |   Core 1   |   Core 2   |   Core 3   |   Core 4   |   Core 5   |   Core 6   |   Core 7   |
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 1456503798 | 7731626633 | 1454744282 | 1475625371 | 1453539232 | 1480270729 | 1411951656 | 1412

**6. Run code on _big_ cores**

   Run the code on the cores set up as big cores and measure execution time.

In [37]:
time taskset -c 8-15 ./matvec 10000 20 8

Verification: result[0] = 2.61e+09

Compute time = 1.021 s

real	0m2.202s
user	0m9.076s
sys	0m0.272s


Re-run the code and measure power and energy.

In [38]:
likwid-perfctr -c 8-15 -g ENERGY taskset -c 8-15 ./matvec 10000 20 8

--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
CPU type:	Intel Xeon Haswell EN/EP/EX processor
CPU clock:	2.39 GHz
--------------------------------------------------------------------------------
Verification: result[0] = 2.61e+09

Compute time = 1.030 s
--------------------------------------------------------------------------------
Group 1: ENERGY
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+
|         Event         | Counter |   Core 8   |   Core 9   |   Core 10  |   Core 11  |   Core 12  |   Core 13  |   Core 14  |   Core 15  |
+-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 7805093436 | 1445053775 | 1466861291 | 1506870869 | 1491563097 | 1472260593 | 1431446313 | 1455

**7. Discuss the implications of the results** 

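   * little cores will consume less power than big cores because they run at a lower frequency limit
   * little cores will have lower performance than big cores: in the runs above, the 8-thread compute
     time was about 1.2 s on the little cores versus about 1.0 s on the big cores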
   * threads must be mapped to cores based on the characteristic of the application and the target
     objective
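
To make the trade-off concrete in discussion, the two quantities of interest can be computed with a short sketch: the slowdown of the little cores relative to the big cores, and the energy of a run as average power multiplied by run time. The function names are hypothetical. The compute times are taken from the runs above; the power figures in the usage note are placeholders, because the energy readings are not reproduced in this document.

```bash
# Slowdown of the little cores relative to the big cores: T_little / T_big.
slowdown() {
  awk -v tl="$1" -v tb="$2" 'BEGIN { printf "%.2f\n", tl / tb }'
}

# Energy in joules from average power (watts) and run time (seconds).
energy_joules() {
  awk -v w="$1" -v s="$2" 'BEGIN { printf "%.1f\n", w * s }'
}

slowdown 1.224 1.021   # prints 1.20
```

With measured values substituted for the placeholders, comparing `energy_joules 40 1.224` against `energy_joules 55 1.021` shows whether the lower power of the little cores outweighs their longer run time.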