## [Demo] Task Mapping on a DVFS-enabled Heterogeneous System
[A2] Task Mapping on Soft Heterogeneous Systems   
Apan Qasem [<apan@txstate.edu>](mailto:apan@txstate.edu)


### Description 

Demonstrate the performance and energy impact of operational frequency on heterogeneous multicore systems. 

### Software and Tools

The following Linux tools are used in this demo.

  * `cpufrequtils`
  * `cpupower`
  * `perf`
  * `energy`
  * `taskset`
  * `likwid`
  * `gcc` (OpenMP support is built into the standard `gcc` distribution for Ubuntu)
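
Before class, it may help to confirm that the commands used in this demo are on the `PATH`. The sketch below is only an illustration: `missing_tools` is a hypothetical helper (not part of the ToUCH repo), and the list covers the executables that the demo invokes (`cpufreq-info` ships with `cpufrequtils`, and `likwid-perfctr` is the LIKWID command used later).

```bash
# Print the name of every given command that is not found on the PATH.
# `missing_tools` is an illustrative helper, not part of the ToUCH repo.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the executables this demo invokes; no output means all were found.
missing_tools cpufreq-info cpupower perf taskset likwid-perfctr gcc
```

Any name that is printed needs to be installed, using the steps in the Environment section where they apply.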

The demo also includes a simple C++/OpenMP code that performs matrix-vector multiplication in
parallel. 


### Environment

Below are instructions for setting up a homogeneous multicore system as a DVFS-supported heterogeneous platform. 
These steps should be carried out prior to class time. We created a [script](./code/build_hc_env.sh)
to carry out these tasks automatically. Note that the tasks below require root access. The installation commands are specific to Ubuntu; on other platforms, adapt them to the platform's package manager. Follow the
guidelines in the script if root access is not available. 

**0. Download sample codes and utility scripts from the ToUCH repo**

An OpenMP parallel implementation of matrix-vector multiplication is used as a running example for
this demo. There are three utility scripts for tweaking the frequencies.  

```bash 
git clone https://github.com/TeachingUndergradsCHC/modules.git
```
 
**1. Install necessary packages and their dependencies**

Install `cpufrequtils`

In [22]:
sudo apt install -y -qq cpufrequtils

cpufrequtils is already the newest version (008-1build1).
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.


Install `perf`, `taskset` and `cpupower` if they are not already installed. `perf` and `cpupower` are provided by the Linux tools packages; `taskset` is part of `util-linux`, which Ubuntu installs by default.
Two packages are needed: `linux-tools-common` and a second package that is specific to the Linux kernel you are running. If the kernel-specific version is not available from the repositories, you may need to download the kernel source and build the tools yourself. 

In [23]:
sudo apt install -y -qq linux-tools-common

linux-tools-common is already the newest version (4.15.0-151.157).
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.


In [24]:
sudo apt install -y -qq linux-tools-`uname -r`

linux-tools-5.4.0-77-generic is already the newest version (5.4.0-77.86~18.04.1).
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.


**2. Check CPU clock frequencies**

Clock frequencies of individual cores can be inspected with various utilities. 

In [25]:
cpufreq-info

cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 4294.55 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.33 GHz.
analyzing CPU 1:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 4294.55 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.60 GHz.
         

The ToUCH repository has a script that provides cleaner output. This script might be more suitable for the in-class demo. 

In [26]:
## change to the code directory and setup the path to the mapping scripts
cd code


In [27]:
./mapping_scripts/check_clk_speed.sh

CPU0:	800 MHz - 3.60 GHz
CPU1:	800 MHz - 3.60 GHz
CPU2:	800 MHz - 3.60 GHz
CPU3:	800 MHz - 3.60 GHz
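
The same limits can also be read directly from sysfs, where `scaling_min_freq` and `scaling_max_freq` are reported in kHz. The following is a minimal sketch of that idea and not the repo's `check_clk_speed.sh`; the helper name `fmt_khz` is an assumption made for illustration.

```bash
# Format a frequency given in kHz the way cpufreq-info does (MHz below 1 GHz).
# `fmt_khz` is an illustrative helper, not part of the ToUCH repo.
fmt_khz() {
  awk -v k="$1" 'BEGIN {
    if (k >= 1000000) printf "%.2f GHz\n", k / 1000000
    else              printf "%d MHz\n", k / 1000
  }'
}

# Print the current policy limits of every core, if cpufreq is exposed in sysfs.
for d in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
  [ -r "$d/scaling_min_freq" ] || continue
  cpu=${d%/cpufreq}
  echo "CPU${cpu##*/cpu}: $(fmt_khz "$(cat "$d/scaling_min_freq")") - $(fmt_khz "$(cat "$d/scaling_max_freq")")"
done
```

On a machine without cpufreq support the loop prints nothing.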


**3. Lower frequencies for a subset of cores**

We will simulate a less powerful (i.e., _little_) core by lowering its clock frequency.
To lower the frequency of an individual core we can use the `cpupower` utility. Root privileges are needed to change the clock frequency. The commands below pin core 0 to 1.80 GHz by setting both its lower limit (`-d`) and its upper limit (`-u`) to 1800000 kHz. 

In [28]:
sudo cpupower -c 0 frequency-set -d 1800000
sudo cpupower -c 0 frequency-set -u 1800000

Setting cpu: 0
Setting cpu: 0


Verify that the change has taken effect.

In [29]:
./mapping_scripts/check_clk_speed.sh

CPU0:	1.80 GHz - 1.80 GHz
CPU1:	800 MHz - 3.60 GHz
CPU2:	800 MHz - 3.60 GHz
CPU3:	800 MHz - 3.60 GHz


The syntax for the `cpupower` utility is a little cumbersome when we are trying to fix the frequency to a specific value. The `set_clk_speed.sh` script in the ToUCH repo is a wrapper around `cpupower` that provides a cleaner interface. 

In [15]:
sudo mapping_scripts/set_clk_speed.sh 0-3 1.8

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
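
To make the wrapper idea concrete, here is a hedged sketch of how a core range and a GHz value could be turned into the two `cpupower` calls. It is not the repo's `set_clk_speed.sh`; the function names `ghz_to_khz` and `pin_cmds` are hypothetical, and the commands are printed rather than executed so the sketch can be tried without root access.

```bash
# Convert a frequency in GHz to the kHz value that cpupower expects by default.
ghz_to_khz() {
  awk -v g="$1" 'BEGIN { printf "%d\n", g * 1000000 + 0.5 }'
}

# Print (rather than run) the two cpupower commands that pin a core range.
# Usage: pin_cmds <core-range> <GHz>; both function names are hypothetical.
pin_cmds() {
  khz=$(ghz_to_khz "$2")
  echo "cpupower -c $1 frequency-set -d $khz"
  echo "cpupower -c $1 frequency-set -u $khz"
}

pin_cmds 0-3 1.8
```

Setting the lower and upper limits to the same value is what fixes the core at a single frequency.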


In [17]:
./mapping_scripts/check_clk_speed.sh

CPU0:	1.80 GHz - 1.80 GHz
CPU1:	1.80 GHz - 1.80 GHz
CPU2:	1.80 GHz - 1.80 GHz
CPU3:	1.80 GHz - 1.80 GHz


There is another script `reset_clk_speed.sh` that resets the frequencies to their default values. 

In [30]:
sudo ./mapping_scripts/reset_clk_speed.sh 0-3

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3


In [31]:
./mapping_scripts/check_clk_speed.sh

CPU0:	800 MHz - 3.60 GHz
CPU1:	800 MHz - 3.60 GHz
CPU2:	800 MHz - 3.60 GHz
CPU3:	800 MHz - 3.60 GHz


To configure this multicore system as "big-LITTLE", we will first lower the frequencies of cores `Lstart` through `Lend`. These cores will serve as the _little_ cores. Later we will set up other cores as the _big_ cores. More complex configurations can easily be set up if the instructor chooses to do a more involved demo (e.g., in a CS2 course rather than CS1).

In [32]:
# the CPU number for little CPUs
export Lstart=0
export Lend=3

In [33]:
# setup the speed for the little CPUs
sudo ./mapping_scripts/set_clk_speed.sh $Lstart-$Lend 1.8

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3


In [36]:
./mapping_scripts/check_clk_speed.sh

CPU0:	1.80 GHz - 1.80 GHz
CPU1:	1.80 GHz - 1.80 GHz
CPU2:	1.80 GHz - 1.80 GHz
CPU3:	1.80 GHz - 1.80 GHz


### Instructions 

The main steps for the in-class demo are outlined below

**1. Discuss heterogeneous system.**

Log into a system that has been set up to simulate a heterogeneous system (or use this notebook) and review its attributes.

In [35]:
cpufreq-info

cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 4294.55 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.80 GHz and 1.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.80 GHz.
analyzing CPU 1:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 4294.55 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.80 GHz and 1.80 GHz.
       

**2. Review matrix-vector multiply code**

Pull up the matrix-vector source code in an editor and do a walk-through.

  * discuss command-line arguments 
  * discuss basics of an OpenMP directive
  
```C++
double dot_prod(double *x, double *y, int n) {
  double sum = 0.0;
  int i;
#pragma omp parallel for reduction(+:sum)
  for (i = 0; i < n; i++)
      sum += x[i] * y[i];
  return sum;
}

void matrix_vector_mult(double **mat, double *vec, double *result,
                        long long rows, long long cols) {

  /* not parallelized to ensure runtimes are more meaningful */
  int i;
  for (i = 0; i < rows; i++)
    result[i] = dot_prod(mat[i], vec, cols);
}
```

**3. Build the code on the command-line**

If `gcc` is not installed yet, you can install it using the following command:
```bash 
    sudo apt install -y -qq gcc
```

In [37]:
gcc -o matvec -fopenmp -O3 matvec.c

 `matvec` is parallelized with OpenMP, so the `-fopenmp` flag is required. Compiling at `-O3` is
   likely to give more predictable performance numbers. 
   
**4. Run and time the sequential and parallel version of the code**

Run the code with a single thread (i.e., serial version). The three command-line arguments are the
matrix size, the number of reps, and the number of threads. The matrix size and number of reps can be
adjusted based on the system where the code is running and the amount of time to be devoted to this
demo. With a matrix size of 10000 and 20 reps, the sequential version should run for 3-4 seconds. 

In [38]:
time ./matvec 10000 20 1

Verification: result[0] = 2.61e+09

Compute time = 3.597 s

real	0m5.668s
user	0m5.216s
sys	0m0.452s


In [39]:
time ./matvec 10000 20 2

Verification: result[0] = 2.61e+09

Compute time = 2.146 s

real	0m4.217s
user	0m5.935s
sys	0m0.428s


Discuss the performance improvements with parallelization. Time permitting, the code can be run with
2, 4, ... N threads (where N = number of processing cores on the system) to show the scalability of
the code and discuss Amdahl's Law. 
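
For the discussion, the speedup, the parallel efficiency, and a rough Amdahl's Law estimate of the parallel fraction can be computed from the two compute times reported above. This is a minimal sketch; the function names are made up for illustration, and the estimate assumes that Amdahl's model, S = 1 / ((1 - f) + f / N), describes the run exactly, which ignores overheads such as thread startup.

```bash
# speedup <serial_s> <parallel_s>: ratio of serial to parallel run time.
speedup() {
  awk -v s="$1" -v p="$2" 'BEGIN { printf "%.2f\n", s / p }'
}

# efficiency <serial_s> <parallel_s> <threads>: speedup per thread.
efficiency() {
  awk -v s="$1" -v p="$2" -v n="$3" 'BEGIN { printf "%.2f\n", s / p / n }'
}

# parallel_fraction <serial_s> <parallel_s> <threads>: Amdahl's Law solved for
# the parallel fraction f, i.e. f = (1 - 1/S) / (1 - 1/N).
parallel_fraction() {
  awk -v s="$1" -v p="$2" -v n="$3" 'BEGIN {
    S = s / p
    printf "%.2f\n", (1 - 1 / S) / (1 - 1 / n)
  }'
}

# Compute times reported above for 1 and 2 threads.
speedup 3.597 2.146               # prints 1.68
efficiency 3.597 2.146 2          # prints 0.84
parallel_fraction 3.597 2.146 2   # prints 0.81
```

Repeating the calculation for 4 threads shows whether the estimated parallel fraction stays roughly constant, which is a useful talking point about the limits of the model.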

**4. Discuss mapping of threads to processors**

   Introduce the `taskset` utility and discuss how it can be used to map threads to processing cores.
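
`taskset -c` accepts either a range such as `0-3` or a comma-separated list such as `0,1`. When the starting core is held in a variable, the list can be generated instead of typed by hand. The helper below is a small sketch; `core_list` is a hypothetical name, and it assumes GNU `seq`, whose `-s` option sets the separator.

```bash
# core_list <first> <count>: comma-separated list of <count> consecutive cores.
# Illustrative helper; relies on the -s option of GNU seq.
core_list() {
  seq -s, "$1" $(( $1 + $2 - 1 ))
}

core_list 0 2    # prints 0,1
```

It could then be used as `taskset -c "$(core_list "$Lstart" 2)" ./matvec 10000 20 4`. Running `taskset -cp $$` shows the affinity list of the current shell, which is a quick way to confirm what a pinned process is allowed to use.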

In [40]:
## run program on core 0 with 4 threads 
taskset -c $Lstart ./matvec 10000 20 4

Verification: result[0] = 2.61e+09

Compute time = 8.992 s


In [41]:
## run program on 2 cores (Lstart and Lstart+1) with 4 threads 
taskset -c $Lstart,$(($Lstart+1)) ./matvec 10000 20 4

Verification: result[0] = 2.61e+09

Compute time = 5.407 s


**5. Run code on _little_ cores**
  
  Run the code on the cores set up as little cores and measure execution time. Set the number of threads to the number of little cores. 

In [43]:
taskset -c $Lstart-$Lend ./matvec 10000 20 $(($Lend-$Lstart+1))

Verification: result[0] = 2.61e+09

Compute time = 1.364 s


Re-run the code and measure detailed performance metrics with `perf`

In [44]:
perf stat taskset -c $Lstart-$Lend ./matvec 10000 20 $(($Lend-$Lstart+1))

Verification: result[0] = 2.61e+09

Compute time = 1.362 s

 Performance counter stats for 'taskset -c 0-3 ./matvec 10000 20 4':

          7,521.76 msec task-clock                #    2.187 CPUs utilized          
                94      context-switches          #    0.012 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           195,547      page-faults               #    0.026 M/sec                  
    13,507,272,020      cycles                    #    1.796 GHz                    
    20,289,133,747      instructions              #    1.50  insn per cycle         
     4,226,512,735      branches                  #  561.905 M/sec                  
         6,520,510      branch-misses             #    0.15% of all branches        

       3.439552761 seconds time elapsed

       7.125656000 seconds user
       0.396537000 seconds sys
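
The derived values in the right-hand column of the `perf stat` output can be reproduced from the raw counters, which helps students see where they come from. The sketch below uses hypothetical helper names and the counter values from the run above.

```bash
# ipc <instructions> <cycles>: instructions retired per cycle.
ipc() {
  awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# eff_ghz <cycles> <task_clock_msec>: average clock rate while running.
eff_ghz() {
  awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", c / (ms * 1000000) }'
}

# cpus_utilized <task_clock_msec> <elapsed_s>: average number of busy cores.
cpus_utilized() {
  awk -v ms="$1" -v t="$2" 'BEGIN { printf "%.3f\n", ms / 1000 / t }'
}

ipc 20289133747 13507272020         # prints 1.50
eff_ghz 13507272020 7521.76         # prints 1.796
cpus_utilized 7521.76 3.439552761   # prints 2.187
```

The effective clock rate of about 1.796 GHz matches the 1.80 GHz limit set for the little cores, which confirms that the frequency cap was in effect during the run.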




Re-run the code and measure power and energy

In [45]:
likwid-perfctr -c $Lstart-$Lend -g ENERGY taskset -c $Lstart-$Lend ./matvec 10000 20 $(($Lend-$Lstart+1))

--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz
CPU type:	Intel Core Haswell processor
CPU clock:	2.89 GHz
--------------------------------------------------------------------------------
Verification: result[0] = 2.61e+09

Compute time = 1.373 s
--------------------------------------------------------------------------------
Group 1: ENERGY
+-----------------------+---------+------------+------------+------------+------------+
|         Event         | Counter |   Core 0   |   Core 1   |   Core 2   |   Core 3   |
+-----------------------+---------+------------+------------+------------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 3316041051 | 3323876040 | 3326638040 | 9456693146 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  | 2458250477 | 2464458508 | 2467537662 | 4911848598 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 3960515152 | 3970516643 | 3975476310 | 7913418674 |
|       TEMP_CORE       |   TMP0 

**6. Run code on _big_ cores**

   Run the code on the cores set up as big cores and measure execution time.

In [47]:
# the CPU number for BIG CPUs
export Bstart=0
export Bend=3

In [48]:
# find the maximum frequency of the CPUs
export MaxFreq=`cpufreq-info | grep limits | head -1 | awk '{print $6}'`
echo $MaxFreq

3.60
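
Printing field 6 works here because the upper limit is reported in GHz. A slightly more defensive variant, sketched below, also checks the unit in field 7 so that a limit reported in MHz is converted. The name `max_limit_ghz` is an assumption for illustration and is not part of the ToUCH repo.

```bash
# Read cpufreq-info output on stdin and print the upper hardware limit in GHz.
# Unlike printing field 6 alone, this also checks the unit in field 7.
# `max_limit_ghz` is an illustrative helper, not part of the ToUCH repo.
max_limit_ghz() {
  awk '/hardware limits/ {
    v = $6
    if ($7 == "MHz") v = v / 1000
    printf "%.2f\n", v
    exit
  }'
}

echo "  hardware limits: 800 MHz - 3.60 GHz" | max_limit_ghz   # prints 3.60
```

On a live system it could be used as `export MaxFreq=$(cpufreq-info | max_limit_ghz)`.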


In [49]:
# set the clock speed for the BIG CPUs
sudo ./mapping_scripts/set_clk_speed.sh $Bstart-$Bend $MaxFreq

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3


In [50]:
./mapping_scripts/check_clk_speed.sh

CPU0:	3.60 GHz - 3.60 GHz
CPU1:	3.60 GHz - 3.60 GHz
CPU2:	3.60 GHz - 3.60 GHz
CPU3:	3.60 GHz - 3.60 GHz


In [51]:
time taskset -c $Bstart-$Bend ./matvec 10000 20 $(($Bend-$Bstart+1))

Verification: result[0] = 2.61e+09

Compute time = 0.918 s

real	0m1.962s
user	0m4.516s
sys	0m0.200s


Re-run the code and measure power and energy.

In [52]:
likwid-perfctr -c $Bstart-$Bend -g ENERGY taskset -c $Bstart-$Bend ./matvec 10000 20 $(($Bend-$Bstart+1))

--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz
CPU type:	Intel Core Haswell processor
CPU clock:	2.89 GHz
--------------------------------------------------------------------------------
Verification: result[0] = 2.61e+09

Compute time = 0.927 s
--------------------------------------------------------------------------------
Group 1: ENERGY
+-----------------------+---------+------------+------------+------------+------------+
|         Event         | Counter |   Core 0   |   Core 1   |   Core 2   |   Core 3   |
+-----------------------+---------+------------+------------+------------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 3273531680 | 3280728594 | 9493861056 | 3298285643 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  | 2899283896 | 2912803690 | 5547442256 | 2919950987 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 2627461654 | 2638793665 | 4726380357 | 2645915166 |
|       TEMP_CORE       |   TMP0 

In [56]:
# reset the CPU clock speed for both little and big cores

sudo ./mapping_scripts/reset_clk_speed.sh $Lstart-$Lend
sudo ./mapping_scripts/reset_clk_speed.sh $Bstart-$Bend

Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3


**7. Discuss the implications of the results** 

   * little cores will consume less power than big cores
   * little cores will have lower performance than big cores
   * threads must be mapped to cores based on the characteristic of the application and the target
     objective
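
To support this discussion, the measured compute times can be compared against the clock ratio, and energy can be estimated as average power multiplied by run time. The sketch below uses hypothetical helper names. The times and frequencies are the ones reported above (four threads on little and on big cores); no power figures are assumed, so `energy_joules` is meant to be fed with the power values reported by the `likwid-perfctr` ENERGY group on the demo machine.

```bash
# ratio <a> <b>: a / b to two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# energy_joules <avg_power_W> <time_s>: energy = average power x time.
energy_joules() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f\n", p * t }'
}

ratio 3.60 1.80      # clock ratio, big vs little: prints 2.00
ratio 1.364 0.918    # compute-time ratio, little vs big: prints 1.49
```

In these runs, doubling the clock shortened the compute time by a factor of about 1.49 rather than 2, which suggests that this workload is not limited by core frequency alone. Whether the little cores also use less energy depends on how much lower their power draw is, since they run for longer.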