Add Intel GPU Energy APIs #563

Draft
wants to merge 3 commits into dev
Conversation

masterleinad
Contributor

Description

Corresponds to #559 for Intel GPUs, using APMIDG. The output looks like

> ./variorum-print-verbose-energy-example
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 0, DeviceID: 0, Energy: 682460.802429 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 0, DeviceID: 1, Energy: 789783.862792 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 0, DeviceID: 2, Energy: 776530.864990 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 1, DeviceID: 3, Energy: 687270.160583 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 1, DeviceID: 4, Energy: 568826.738464 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 1, DeviceID: 5, Energy: 710078.444824 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 0, DeviceID: 0, Energy: 683044.307495 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 0, DeviceID: 1, Energy: 790386.810180 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 0, DeviceID: 2, Energy: 777118.108947 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 1, DeviceID: 3, Energy: 687852.643859 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 1, DeviceID: 4, Energy: 569396.025085 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s2b0n0, Socket: 1, DeviceID: 5, Energy: 710663.270080 J

The node has 6 GPUs with two tiles each, for 12 GPU tiles in total. Thus, seeing 2 sockets with 3 devices each is surprising, but we should see similar output with the existing Intel GPU APIs. This needs some investigation, either in this pull request or elsewhere.
Also, #559 should probably be merged first.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature/architecture support (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Build/CI update

How Has This Been Tested?

Tested by running on ALCF's Sunspot testbed.

Checklist:

  • I have run ./scripts/check-code-format.sh and confirm my code follows the style guidelines of variorum
  • I have added comments in my code
  • My changes generate no new warnings (build with -DENABLE_WARNINGS=ON)
  • New and existing unit tests pass with my changes

@masterleinad marked this pull request as draft on June 28, 2024 at 17:44
@tpatki
Member

tpatki commented Jun 28, 2024

Hi @masterleinad: thank you so much for this contribution! We don't have this system at our end to build and test on, so this is extremely helpful.

I am going to need some time to review and merge this; we have no resources/funding at our end to put in significant time. I am hoping to get these GPU energy (print/JSON) APIs merged and documented in the next 2-3 weeks.

  • One thing here needs to be fixed. We report energy values in Variorum as deltas, so the first call to any of the print/get_energy APIs should always return zero, and subsequent calls return the deltas we observe (see the sketch after this comment). A raw energy value from the NVML/Intel/ROCm APIs is not as useful to the user; the deltas are more meaningful.
    I'm working on the NVIDIA/ROCm side for similar support (see Add GPU Energy APIs #559), and once I have this working correctly, I'll ask you to make the same edits at your end so we're consistent across the different devices in the API. I'm trying to figure out the use cases for this (sampling? function-level energy with Caliper/Kokkos etc.? context?) and will update there once I have thought this through.

  • I will also need your help documenting the Intel GPU APIs we are using at the low-level.

  • Once the PR is ready, I will ask you to paste your exact JSON output with a multi-arch build (Intel Sapphire Rapids + Intel GPUs) and a GPU-only build (Intel GPU only). This needs to be documented for our records as well as for ensuring that multi-platform builds work correctly with this change. We haven't been able to test the node/CPU/mem energy as we don't have a Sapphire Rapids node here, so we may ask for your help with testing there.

  • I also need to update and formalize the JSON object format, and I will be working on this as well.

Just sending a note that this is on my list, and your contribution on the Intel GPU side is extremely helpful! Thank you! :)
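A minimal sketch of this delta convention in C, assuming a hypothetical helper get_gpu_energy_delta and an illustrative MAX_GPUS bound rather than variorum's actual internals: the first call for a device records a baseline and reports 0 J, and later calls report the joules accumulated since then.

#define MAX_GPUS 64 /* illustrative upper bound, not variorum's actual limit */

static double initial_energy[MAX_GPUS];
static int have_baseline[MAX_GPUS];

/* Report energy as a delta: 0 J on the first call for a device,
   joules accumulated since that first call afterwards. */
double get_gpu_energy_delta(int device_id, double raw_joules)
{
    if (!have_baseline[device_id])
    {
        initial_energy[device_id] = raw_joules;
        have_baseline[device_id] = 1;
        return 0.0;
    }
    return raw_joules - initial_energy[device_id];
}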

@masterleinad
Contributor Author

With the last commit, the output is now

> ./variorum-print-verbose-energy-example 
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 0, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 1, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 2, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 3, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 4, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 5, Energy: 0.000000 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 0, Energy: 586.303284 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 1, Energy: 587.496582 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 0, DeviceID: 2, Energy: 568.604737 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 3, Energy: 560.272522 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 4, Energy: 594.283875 J
_INTEL_GPU_ENERGY_USAGE Host: x1921c0s0b0n0, Socket: 1, DeviceID: 5, Energy: 572.796081 J

@masterleinad
Contributor Author

Currently, I'm storing the initial energy in a global static variable. I noticed that initAPMIDG and shutdownAPMIDG are called for every API call rather than in some function that would initialize and finalize variorum. Since I need the initial energy readings to persist between API calls, it's not clear to me when I could free the memory allocated for them, so this version leaks some memory.
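A rough sketch of the pattern described above, with placeholder helpers (get_num_gpu_tiles, read_tile_energy_joules) standing in for the actual APMIDG calls: the baseline array has to outlive each initAPMIDG/shutdownAPMIDG pair, so it is kept in a file-scope static pointer that is allocated on first use and, without a variorum-level finalize hook, never freed.

#include <stdio.h>
#include <stdlib.h>

/* Placeholder declarations; in the real port, initAPMIDG()/shutdownAPMIDG()
   are invoked inside every variorum API call. */
extern void initAPMIDG(void);
extern void shutdownAPMIDG(void);
extern int get_num_gpu_tiles(void);
extern double read_tile_energy_joules(int tile);

/* Baseline readings must persist across API calls, so they live in a
   file-scope static pointer allocated on first use. */
static double *initial_energy = NULL;

void print_gpu_energy_deltas(void)
{
    initAPMIDG();
    int ntiles = get_num_gpu_tiles();
    if (initial_energy == NULL)
    {
        /* Allocated once, but never freed: there is no finalize hook. */
        initial_energy = (double *)malloc(ntiles * sizeof(double));
        for (int i = 0; i < ntiles; ++i)
        {
            initial_energy[i] = read_tile_energy_joules(i);
        }
    }
    for (int i = 0; i < ntiles; ++i)
    {
        printf("DeviceID: %d, Energy: %f J\n", i,
               read_tile_energy_joules(i) - initial_energy[i]);
    }
    shutdownAPMIDG();
}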

@tpatki
Member

tpatki commented Jul 10, 2024

Hi @masterleinad: Thanks for getting to this before I could! I've been meaning to circle back to Variorum, but have been very short on time.
It might be better to just do static memory allocations here as opposed to malloc/free. It is not a lot of additional memory (only 3 or 4 variables and one JSON struct to track), so it should be ok to allocate statically instead of dynamically. That way we don't leak memory.

Take a look at how we did this for the IBM port, where we need to explicitly sample and track these values because energy is not directly reported by the sensors on Power9. You can see where the variables are set from the snippet below:

pthread_mutex_t mlock;
struct thread_args th_args;
pthread_attr_t mattr;
pthread_t mthread;
/* For the get_energy sampling thread */
static int active_sampling = 0;

@masterleinad
Contributor Author

> It might be better to just do static memory allocations here as opposed to malloc/free. It is not a lot of additional memory (only 3 or 4 variables and one JSON struct to track), so it should be ok to allocate statically instead of dynamically. That way we don't leak memory.

The problem is that I don't know at compile time how many GPUs I have, but I need to store data for each of them to report the difference in energy consumption since the first call. The only way to avoid allocating memory dynamically would be to set an upper limit on the number of GPUs. Would you prefer that over the current approach (possibly with registering an atexit function to deallocate the dynamic memory)?
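For reference, a minimal sketch of the atexit option mentioned above (hypothetical function names, not the actual patch): the dynamically sized baseline buffer is allocated once and its cleanup is registered for process exit, so no compile-time upper limit on the number of GPUs is needed and the allocation is not leaked.

#include <stdlib.h>

static double *initial_energy = NULL;

static void free_initial_energy(void)
{
    free(initial_energy);
    initial_energy = NULL;
}

/* Allocate the per-device baseline buffer once and register its cleanup
   with atexit(), so the memory is released at process exit without a
   variorum-level finalize hook. */
double *get_initial_energy_buffer(int num_devices)
{
    if (initial_energy == NULL)
    {
        initial_energy = (double *)calloc((size_t)num_devices, sizeof(double));
        if (initial_energy != NULL)
        {
            atexit(free_initial_energy);
        }
    }
    return initial_energy;
}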
