GPU and Node Utilization JSON API (CPU, GPU, Memory) #431

Merged: 49 commits into LLNL:dev, Jan 17, 2024


altahat2003 (Collaborator) commented Jun 20, 2023

Description

This PR adds two utilization APIs:

  • variorum_get_node_utilization_json: populates a JSON-formatted string with total CPU utilization for the node, user CPU utilization, kernel CPU utilization, total node memory utilization, and GPU utilization.
  • variorum_get_gpu_utilization_json: populates a JSON-formatted string with the utilization of each GPU.
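
A minimal sketch of how a client might consume either API. It assumes both functions have the shape `int f(char **out)`, return 0 on success, and leave the caller responsible for freeing the string; this description does not state any of that, so the Variorum docs should be treated as the authority. The typedef `util_json_getter`, the helper `print_utilization_json`, and the stand-in `stub_getter` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed shape of both getters: store a heap-allocated JSON string in *out
 * and return 0 on success. This is an assumption, not the confirmed API. */
typedef int (*util_json_getter)(char **out);

/* Hypothetical helper: call a getter, print the JSON it produced, free it.
 * Returns 0 on success, nonzero if the getter fails or yields no string. */
int print_utilization_json(util_json_getter getter, FILE *stream)
{
    char *json = NULL;
    int ret;

    assert(getter != NULL);
    assert(stream != NULL);

    ret = getter(&json);
    if (ret != 0 || json == NULL)
    {
        free(json); /* free(NULL) is a no-op */
        return (ret != 0) ? ret : -1;
    }

    fprintf(stream, "%s\n", json);
    free(json);
    return 0;
}

/* Stand-in getter so the helper can be exercised without Variorum installed.
 * The empty object is a placeholder, not the real output schema. */
int stub_getter(char **out)
{
    const char *placeholder = "{}";

    *out = malloc(strlen(placeholder) + 1);
    if (*out == NULL)
    {
        return -1;
    }
    strcpy(*out, placeholder);
    return 0;
}
```

With Variorum linked in, a call such as `print_utilization_json(variorum_get_node_utilization_json, stdout)` would be the intended use, provided the assumed signature holds.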

Fixes #440

Docs: https://variorum.readthedocs.io/en/pr-from-fork-431/

Closes #495 once all tests pass.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature/architecture support (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Build/CI update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please provide hardware architecture specs and
instructions so we can reproduce.

  • I rebuilt variorum on Lassen and tested the added API with a simple hello world example.

Checklist:

  • I have run ./scripts/check-code-format.sh and confirm my code follows the style guidelines of variorum
  • I have added comments in my code
  • My changes generate no new warnings (build with -DENABLE_WARNINGS=ON)
  • New and existing unit tests pass with my changes

src/variorum/IBM/Power9.c (outdated review comment)
@tpatki added this to the "Production: v0.8.0 Release" milestone Jun 26, 2023
@tpatki added the status-work-in-progress (In progress, not ready to merge.) and area-feature-support labels Jun 26, 2023
tpatki (Member) commented Jun 26, 2023

Capturing discussion here as we'll need to resolve the comments that are on outdated commits.

@altahat2003

The getrusage part will be common to all architectures, and we will have to maintain CPU and memory delta values (save the initial state). My suggestion is that you move this code to a higher level, in variorum.c itself, instead of creating redundant code in each lower-level platform file. You only need a platform-specific implementation for the GPU part, not for the CPU and memory usage part.

Only the GPU usage will be different based on which architecture you are on. We have a print_gpu_utilization API, and we can add a new get_gpu_utilization_json API, that you can then call from variorum.c.
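
A minimal sketch of the "save initial state, then report deltas" idea from the comment above, using POSIX `getrusage`. Note that `getrusage(RUSAGE_SELF, ...)` reports usage for the calling process rather than the whole node, so this illustrates only the bookkeeping, not the implementation in this PR. The struct `usage_sample` and the functions `usage_sample_take` and `usage_sample_delta` are hypothetical names, not Variorum symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical container for one sample; not a Variorum type. */
struct usage_sample
{
    double user_sec; /* user CPU time consumed so far, in seconds */
    double sys_sec;  /* kernel CPU time consumed so far, in seconds */
    long max_rss;    /* peak resident set size (kilobytes on Linux) */
};

static double timeval_to_sec(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Sample the calling process. Returns 0 on success, -1 if getrusage fails. */
int usage_sample_take(struct usage_sample *out)
{
    struct rusage ru;

    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
    {
        return -1;
    }
    out->user_sec = timeval_to_sec(ru.ru_utime);
    out->sys_sec = timeval_to_sec(ru.ru_stime);
    out->max_rss = ru.ru_maxrss;
    return 0;
}

/* Difference between a saved initial sample and a later one. */
struct usage_sample usage_sample_delta(const struct usage_sample *start,
                                       const struct usage_sample *now)
{
    struct usage_sample d;

    assert(start != NULL);
    assert(now != NULL);
    d.user_sec = now->user_sec - start->user_sec;
    d.sys_sec = now->sys_sec - start->sys_sec;
    d.max_rss = now->max_rss - start->max_rss;
    return d;
}
```

Keeping the saved sample in one architecture-independent place is what makes the delta logic reusable across platforms; only the GPU query would need a per-platform backend.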

amarathe84 (Collaborator) commented 5 days ago
Summarizing the suggestion from our standup call here: Could the high-level utilization API be parameterized to choose between (a) reporting accumulator-style data when available on the underlying system or (b) reporting the delta between the current and previous samples? This would satisfy a variety of use cases, depending on the monitoring requirements and configuration of the client application. For example, ROCm reports GPU utilization via both an accumulator-style utilization API and an instantaneous utilization API, so the client could select which underlying API it wants to use.
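
A minimal sketch of the parameterization suggested above, assuming the underlying system exposes a monotonically increasing counter. The enum `util_mode`, the struct `util_tracker`, and the function `util_report` are hypothetical names for illustration; nothing here is taken from the Variorum or ROCm APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reporting modes a client could choose between. */
enum util_mode
{
    UTIL_ACCUMULATED, /* report the raw accumulator value */
    UTIL_DELTA        /* report the change since the previous sample */
};

/* Hypothetical per-client state holding the previous sample. */
struct util_tracker
{
    unsigned long long prev;
    int has_prev;
};

/* Report a new counter reading in the requested mode.
 * In delta mode the first sample, or a counter that moved backwards
 * (treated here as a reset), yields 0. The previous sample is updated
 * in both modes so a client can switch modes between calls. */
unsigned long long util_report(struct util_tracker *t, enum util_mode mode,
                               unsigned long long counter)
{
    unsigned long long result;

    assert(t != NULL);
    if (mode == UTIL_ACCUMULATED)
    {
        result = counter;
    }
    else if (!t->has_prev || counter < t->prev)
    {
        result = 0;
    }
    else
    {
        result = counter - t->prev;
    }
    t->prev = counter;
    t->has_prev = 1;
    return result;
}
```

Returning 0 on a counter reset is one possible policy; a real API might instead signal the reset to the caller.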

@tpatki changed the title from "initial mem util API" to "WIP: initial mem util API" Jul 6, 2023
@tpatki changed the title from "WIP: initial mem util API" to "Node Utilization API (CPU, GPU, Memory)" Sep 19, 2023
@tpatki changed the title from "Node Utilization API (CPU, GPU, Memory)" to "Node Utilization JSON API (CPU, GPU, Memory)" Sep 19, 2023
Mohamd Yousf Hazza Al-Tahat added 2 commits October 3, 2023 11:47
tpatki (Member) commented Oct 10, 2023

To do before merging:

  • Rebase
  • Add example for get_gpu_utilization_json
  • Add docs for get_gpu_utilization_json
  • Update description to show that this PR adds 2 APIs: get_gpu_utilization_json and get_node_utilization_json.

@altahat2003 altahat2003 mentioned this pull request Oct 31, 2023
12 tasks
@tpatki added the status-ready-for-review (Formatted, and tested on multiple systems.) label and removed the status-work-in-progress (In progress, not ready to merge.) label Nov 3, 2023
@slabasan changed the title from "Node Utilization JSON API (CPU, GPU, Memory)" to "GPU and Node Utilization JSON API (CPU, GPU, Memory)" Nov 3, 2023
@slabasan slabasan self-requested a review November 14, 2023 23:42
@slabasan slabasan mentioned this pull request Jan 17, 2024
@slabasan slabasan merged commit 1e1ce60 into LLNL:dev Jan 17, 2024
24 of 26 checks passed
Labels
area-feature-support status-ready-for-review Formatted, and tested on multiple systems. type-feature
Development

Successfully merging this pull request may close these issues.

Add JSON Node utilization API
4 participants