
Tutorial: put a function on the roofline model #28

Open
antoine-levitt opened this issue May 6, 2022 · 23 comments
Labels
discussion, documentation (Improvements or additions to documentation)

Comments

@antoine-levitt

Great package! As far as I understand it provides the information needed to locate an application on the roofline model but finding the info without screwing up feels a bit scary to me. Could an example be put in a tutorial or in the library, so that I could do something like roofline(f) and it'd plot the roofline model and locate f() on it?

(this is for teaching basic performance notions)

@antoine-levitt antoine-levitt changed the title Tutorial, put a function on the roofline model Tutorial: put a function on the roofline model May 6, 2022
@carstenbauer
Member

Hey Antoine,

That's a great idea and something I thought about as well at some point. Would be really nice to have. LIKWID itself has an "Empirical Roofline Model" tutorial and I think we should add a similar tutorial to the LIKWID.jl documentation.
Automating it fully (as in a roofline(f) function) is a bit tricky though. A few (potentially obvious) things that come to mind:

  • How to reliably obtain the maximal performance and data bandwidth of the system? Typically, one needs to either put in these values by hand (e.g. derived from data sheets of the CPU) or perform benchmarks to empirically estimate these quantities. But that can be hard and require some fine tuning, in particular for the peak performance (for the max bandwidth we could use STREAMBenchmark.jl).
  • Thread pinning: How should we pin Julia threads (i.e. for optimal performance of a general f)? Threads need to be pinned properly to monitor performance with hardware counters. (Could restrict usage to single-threaded functions?)
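For context, once those two machine numbers are known, the model itself is simple. A minimal sketch with made-up placeholder numbers (not taken from any data sheet or measurement):

```julia
# Roofline model: attainable performance is bounded either by compute or by memory.
# The two machine parameters below are illustrative placeholders, not
# measurements of any real CPU.
peak_gflops = 48.0   # assumed peak FP64 performance [GFLOP/s]
peak_bw_gbs = 25.0   # assumed sustained memory bandwidth [GB/s]

# Attainable performance at operational intensity I [FLOP/byte]
attainable(I) = min(peak_gflops, I * peak_bw_gbs)

# Ridge point: intensity at which the memory bound meets the compute bound
ridge = peak_gflops / peak_bw_gbs
```

Plotting `attainable` over a logarithmic range of intensities and marking the measured point `(I_f, flops_f)` for `f` gives the usual roofline picture.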

Having said that, it's still worth trying! I'll happily support any efforts in this direction, so, feel free to draft a PR. Unfortunately, I probably won't have much free bandwidth for playing around with this any time soon. In any case, I'd start with the manual tutorial first and then try to make it automatic.

Best,
Carsten

(@ranocha Based on our previous discussions, e.g. https://gist.github.com/ranocha/0ad5716e77e55b2c61cbde10ad4f210c, I think you're also interested in this.)
(There has been a related effort by @vchuravy over at https://github.com/JuliaPerf/Roofline.jl based on linux perf event.)

@antoine-levitt
Author

Automating it fully (as in a roofline(f) function) is a bit tricky though

Yeah, it would make sense as a tutorial: the roofline model isn't really meant as a black box but as an explanatory tool (at least for me), so you want to see what goes into it.

How to reliably obtain the maximal performance and data bandwidth of the system? Typically, one needs to either put in these values by hand (e.g. derived from data sheets of the CPU) or perform benchmarks to empirically estimate these quantities. But that can be hard and require some fine tuning, in particular for the peak performance (for the max bandwidth we could use STREAMBenchmark.jl).

What's wrong with peakflops()?

Thread pinning

Single thread would be a good start.

Having said that, it's still worth trying! I'll happily support any efforts in this direction, so, feel free to draft a PR. Unfortunately, I probably won't have much free bandwidth for playing around with this any time soon. In any case, I'd start with the manual tutorial first and then try to make it automatic.

OK, I'll try to come up with something and you can correct me when I screw up then!

@antoine-levitt
Author

Oh I just noticed that I have to start julia in a separate process under likwid then kill julia to get the output. That's pretty annoying (eg I can't put it in a notebook), is there a way around it?

@carstenbauer
Member

carstenbauer commented May 6, 2022

Oh I just noticed that I have to start julia in a separate process under likwid then kill julia to get the output. That's pretty annoying (eg I can't put it in a notebook), is there a way around it?

Yes, see https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ (although this tutorial may not be in the best shape). Essentially, likwid-perfctr is just a Lua script that sets a bunch of environment variables, then calls into the C API, and eventually prints stuff. All of that can be done within a Julia session.

(We should probably also introduce some higher-level API to automate this "measure perf counters within a Julia session" workflow. That just hasn't been my focus so far.)

@ranocha
Contributor

ranocha commented May 6, 2022

Thanks for pinging me, @carstenbauer. Yes, I also think it's a nice idea. I just followed the LIKWID tutorial to get an empirical roofline model while working on Trixi.jl and our paper on performance stuff.

Here is some code I used for my experiments. Feel free to use it in a PR for the new tutorial (and, if you like, add me as a co-author on GitHub by adding the line Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com> to a commit message):

## gather data for the empirical roofline model

# measure optimistic peakflops (AVX2 FMA or AVX512 FMA if available)
L1_cache_size = LIKWID.get_cpu_topology().cacheLevels[1].size ÷ 1024 # in kB

cpuinfo = LIKWID.get_cpu_info()
if occursin("AVX512", cpuinfo.features)
  likwid_bench_kernel = "peakflops_avx512_fma"
elseif occursin("AVX2", cpuinfo.features)
  likwid_bench_kernel = "peakflops_avx_fma"
else
  likwid_bench_kernel = "peakflops_sse_fma"
end

max_flops_string = read(`likwid-bench -t $likwid_bench_kernel -W N:$(L1_cache_size)kB:1`, String)
max_flops = parse(Float64, match(r"(MFlops/s:\s+)(\d+\.\d+)", max_flops_string).captures[2]) / 1024

# measure optimistic memory bandwidth using reads
if occursin("AVX512", cpuinfo.features)
  likwid_bench_kernel = "load_avx512"
elseif occursin("AVX2", cpuinfo.features)
  likwid_bench_kernel = "load_avx"
else
  likwid_bench_kernel = "load_sse"
end

max_bandwidth_string = read(`likwid-bench -t $likwid_bench_kernel -W N:2GB:1`, String)
max_bandwidth = parse(Float64, match(r"(MByte/s:\s+)(\d+\.\d+)", max_bandwidth_string).captures[2])

## gather data for volume terms implemented in Trixi.jl
measured_string = read(`likwid-perfctr -C 0 -g MEM_DP -m $(Base.julia_cmd()) --check-bounds=no --threads=1 $(joinpath(@__DIR__, "measure_volume_terms.jl"))`, String)

# You can combine different measurements by setting appropriate region names, e.g.,
# NAME_OF_THE_REGION_YOU_USED
offset = findfirst("Region NAME_OF_THE_REGION_YOU_USED", measured_string) |> last
m = match(r"(DP \[MFLOP/s\]\s+\|\s+)(\d+\.\d+)", measured_string, offset)
flops_NAME_OF_THE_REGION_YOU_USED = parse(Float64, m.captures[2]) / 1024
m = match(r"(Operational intensity\s+\|\s+)(\d+\.\d+)", measured_string, offset)
intensity_NAME_OF_THE_REGION_YOU_USED = parse(Float64, m.captures[2])
@info "NAME_OF_THE_REGION_YOU_USED" intensity_NAME_OF_THE_REGION_YOU_USED flops_NAME_OF_THE_REGION_YOU_USED

The file measure_volume_terms.jl contains basic setup code such as

Marker.init()

# compile and cool down
compute_a_lot_of_stuff()
sleep(1.0)

# measure
@region "NAME_OF_THE_REGION_YOU_USED" begin
  compute_a_lot_of_stuff()
end

Marker.close()

in a function called once.

@carstenbauer
Member

carstenbauer commented May 6, 2022

What's wrong with peakflops()?

Nothing's fundamentally wrong with it, it's a fine starting point / an estimation. But it should be clear that it doesn't give you the true maximal achievable performance (running such a benchmark takes some care: which kernel to use? which parameters, e.g. problem size, to use? etc.) That's why LIKWID has likwid-bench and lots of kernels (see e.g. here) for different CPU types, often defined in machine instructions. Instead of gemm I think a "pure FMAs" kernel would be better suited for estimating the peak flops (LIKWID uses peakflops_avx_fma at least on x86_64).

(Note that we also use pure FMA / WMMA kernels in GPUInspector.jl to measure the peakflops of NVIDIA GPUs.)
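For comparison, Julia's built-in estimate is gemm-based, so it typically lands below the pure-FMA peak that a dedicated kernel such as peakflops_avx_fma reports; still, it is a one-liner (problem size 4096 chosen arbitrarily here):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)  # single-threaded, to match a pinned single-core measurement
# gemm-based estimate in GFLOP/s; expect it to undershoot the pure-FMA peak
est_gflops = LinearAlgebra.peakflops(4096) / 1e9
```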

@antoine-levitt
Author

OK, https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ looks good. How do I get the operational intensity? It doesn't appear in the table at the end.

@TomTheBear
Collaborator

The operational intensity is only provided by the MEM_SP and MEM_DP groups directly (the example code uses FLOPS_DP and L2). If you want it for caches, you have to calculate it yourself from the FLOP/s count and the bandwidth value.
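Concretely, for a cache level the intensity is just the ratio of the FLOP rate to the traffic rate; the time bases cancel, leaving FLOP/byte. The metric names and numbers below are illustrative, not taken from a real run:

```julia
# Values as they would be read from perfctr metric dicts (illustrative numbers):
flops_mflops = 8173.05   # e.g. "DP [MFLOP/s]" from the FLOPS_DP group
l2_bw_mbytes = 30000.0   # e.g. "L2 bandwidth [MBytes/s]" from the L2 group

# MFLOP/s divided by MByte/s gives FLOP/byte directly
intensity_l2 = flops_mflops / l2_bw_mbytes
```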

@carstenbauer
Member

Was about to write the same but Thomas beat me to it :)

@antoine-levitt
Author

Does this mean that I should do LIKWID.LIKWID_EVENTS("FLOPS_DP|MEM_DP")? It doesn't appear to change anything if I do.

@carstenbauer
Member

You can drop the FLOPS_DP part and only use MEM_DP (as they do in the empirical roofline tutorial). If the printed table then doesn't include it, maybe it's just the printing that is flawed?

@antoine-levitt
Author

Ooh, OK, not sure what happened but it works with only MEM_DP, thanks.

@TomTheBear
Collaborator

TomTheBear commented May 6, 2022

As far as I can see, there is no group switch in the example code at https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/, that's why only the first group gets measured (| is the group delimiter). You would need to run your benchmark code twice, once for the first group, then "switch groups" and run your benchmark again for the second group measurement.
The two groups have different gids, so when reading the results/metrics, you have to specify the right gid.

@carstenbauer
Member

As far as I can see, there is no group switch in the example code

Good catch, I will extend the example in this direction when I find time for it.

@carstenbauer
Member

carstenbauer commented May 6, 2022

@antoine-levitt As an alternative to the approach in https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ (environment variables) you may take a look at https://github.com/JuliaPerf/LIKWID.jl/blob/main/examples/perfmon/perfmon.jl (using LIKWID.PerfMon). This also doesn't seem to suffer from issue #29.

@antoine-levitt
Author

That indeed seems nicer. But if I do that with MEM_DP I get

  "DP [MFLOP/s]"                      => 0.0
  "AVX DP [MFLOP/s]"                  => 0.0
  "Packed [MUOPS/s]"                  => 0.0
  "Scalar [MUOPS/s]"                  => 0.0
  "Memory load bandwidth [MBytes/s]"  => 1164.26
  "Memory load data volume [GBytes]"  => 0.139599
  "Memory evict bandwidth [MBytes/s]" => 324.848
  "Memory evict data volume [GBytes]" => 0.0389503
  "Memory bandwidth [MBytes/s]"       => 1489.11
  "Memory data volume [GBytes]"       => 0.178549
  "Operational intensity"             => 0.0

(in case it was not obvious, I have absolutely no idea what's going on with these codes)

@TomTheBear
Collaborator

That might be a valid measurement if the functions don't do any double-precision floating-point operations. For single precision, use MEM_SP. Integer operations cannot be counted with hardware performance counters on almost all platforms.

@antoine-levitt
Author

No this is a benchmark of double precision matrix multiplication. It might be related to #29 so I'll wait for a fix there before investigating further

@carstenbauer
Member

carstenbauer commented May 11, 2022

Why do you think it's related to #29? I would guess that it's wrong thread pinning (or no thread pinning at all). Example (for FLOPS_DP):

With LIKWID.pinthread(0) and LIKWID.PerfMon.init(0) (RIGHT):

➜  bauerc@n2lcn0166 LIKWID.jl git:(main)  julia --project example.jl
OrderedCollections.OrderedDict{String, Float64} with 5 entries:
  "Runtime (RDTSC) [s]" => 0.634277
  "Runtime unhalted [s]" => 0.91243
  "Clock [MHz]" => 3524.29
  "CPI" => 0.520886
  "DP [MFLOP/s]" => 333.282
OrderedCollections.OrderedDict{String, Float64} with 6 entries:
  "ACTUAL_CPU_CLOCK" => 2.23547e9
  "MAX_CPU_CLOCK" => 1.55405e9
  "RETIRED_INSTRUCTIONS" => 4.24566e9
  "CPU_CLOCKS_UNHALTED" => 2.2115e9
  "RETIRED_SSE_AVX_FLOPS_ALL" => 2.11393e8
  "MERGE" => 0.0

With LIKWID.pinthread(1) and LIKWID.PerfMon.init(0) (WRONG):

➜  bauerc@n2lcn0166 LIKWID.jl git:(main)  julia --project example.jl
OrderedCollections.OrderedDict{String, Float64} with 5 entries:
  "Runtime (RDTSC) [s]" => 0.628969
  "Runtime unhalted [s]" => 0.00110777
  "Clock [MHz]" => 2920.05
  "CPI" => NaN
  "DP [MFLOP/s]" => 0.0
OrderedCollections.OrderedDict{String, Float64} with 6 entries:
  "ACTUAL_CPU_CLOCK" => 2.71398e6
  "MAX_CPU_CLOCK" => 2.27706e6
  "RETIRED_INSTRUCTIONS" => 0.0
  "CPU_CLOCKS_UNHALTED" => 0.0
  "RETIRED_SSE_AVX_FLOPS_ALL" => 0.0
  "MERGE" => 0.0

Also note that I have set OPENBLAS_NUM_THREADS=1 here (so that OpenBLAS uses the available Julia thread(s) for linalg operations). If you set OPENBLAS_NUM_THREADS to something > 1 you must make sure to pin the OpenBLAS threads to specific cores and measure the performance counters on these cores as well.

(FYI: I'm on vacation this week and very unresponsive.)

@carstenbauer
Member

Just to be a bit more explicit about the multiple threads / cores case: you can, e.g., use LIKWID.PerfMon.init(0:9) to init monitoring on the first 10 cores and then later use LIKWID.PerfMon.get_metric_results(groupid, nth) to query the results for the nth of these cores.

@carstenbauer
Member

carstenbauer commented May 11, 2022

Hm, maybe I'm wrong / missing something about the OpenBLAS threads story:

# example.jl
using LIKWID
using LinearAlgebra
nblasthreads = BLAS.get_num_threads()
@show BLAS.get_num_threads()

A = rand(1000, 1000)
B = rand(1000, 1000)
C = zeros(1000, 1000)

LIKWID.pinthread(nblasthreads) # pin Julia thread to first core not occupied by BLAS threads
println("OMP threads on cores 0:$(nblasthreads-1), Julia thread on core $(nblasthreads)")
LIKWID.PerfMon.init(0:3)
groupid = LIKWID.PerfMon.add_event_set("FLOPS_DP")
LIKWID.PerfMon.setup_counters(groupid)

LIKWID.PerfMon.start_counters()
for _ in 1:10
    mul!(C, A, B)
end
LIKWID.PerfMon.stop_counters()

str = "DP [MFLOP/s]"
for i in 0:3
    mdict = LIKWID.PerfMon.get_metric_results(groupid, i)
    println(str, " (core $i): ", mdict[str])
end

LIKWID.PerfMon.finalize()

Output:

➜  bauerc@n2lcn0166 LIKWID.jl git:(main)  OMP_NUM_THREADS=3 OMP_PLACES=cores OMP_PROC_BIND=close julia --project example.jl 
BLAS.get_num_threads() = 3
OMP threads on cores 0:2, Julia thread on core 3
DP [MFLOP/s] (core 0): 0.0
DP [MFLOP/s] (core 1): 0.0
DP [MFLOP/s] (core 2): 0.0
DP [MFLOP/s] (core 3): 8173.052486987061

UPDATE:

Ok, must really be something with OpenBLAS. It works with Octavian.jl (which uses Julia threads):

# octavian.jl
using LIKWID
using Octavian
using ThreadPinning

A = rand(1000, 1000)
B = rand(1000, 1000)
C = zeros(1000, 1000)

LIKWID.pinthreads(0:3)
threadinfo(; color=false)
println()
LIKWID.PerfMon.init(0:3)
groupid = LIKWID.PerfMon.add_event_set("FLOPS_DP")
LIKWID.PerfMon.setup_counters(groupid)

LIKWID.PerfMon.start_counters()
for _ in 1:10
    matmul!(C, A, B)
end
LIKWID.PerfMon.stop_counters()

str = "DP [MFLOP/s]"
for i in 0:3
    mdict = LIKWID.PerfMon.get_metric_results(groupid, i)
    println(str, " (core $i): ", mdict[str])
end

LIKWID.PerfMon.finalize()

Output:

➜  bauerc@n2lcn0166 LIKWID.jl git:(main)  julia --project -t 4 octavian.jl 

| 0,1,2,3,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ |
| _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ |

# = Julia thread, | = Socket separator

Julia threads: 4
├ Occupied CPU-threads: 4
└ Mapping (Thread => CPUID): 1 => 0, 2 => 1, 3 => 2, 4 => 3,


DP [MFLOP/s] (core 0): 366.4515143259297
DP [MFLOP/s] (core 1): 372.2449620791664
DP [MFLOP/s] (core 2): 372.24495182826337
DP [MFLOP/s] (core 3): 366.3363018460671

@antoine-levitt
Author

Good catch on the thread stuff, I'll take a closer look — this was for benchmarking non-OpenBLAS, single-threaded code.

@carstenbauer
Member

FYI: #31

@carstenbauer carstenbauer added documentation Improvements or additions to documentation discussion labels Jun 15, 2022