[Not for Merge] [POC] Goodput async monitoring and upload to Tensorboard POC #648

Open
wants to merge 1 commit into
base: main

dipannita08
Collaborator

This change adds the following:

  • Allows creating a monitor object that spins up a secondary "monitor & upload" thread, which queries the job's Goodput using the ml-goodput-measurement pip package and writes a scalar metric to TB every interval period.
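
A minimal sketch of the monitor-thread pattern described above, not the code in this PR. The class name GoodputMonitor and the query_fn and write_fn callables are hypothetical: query_fn stands in for whatever call returns the job's Goodput from the ml-goodput-measurement package, and write_fn stands in for the TensorBoard scalar write. Only the Python standard library is used.

```python
import logging
import threading


class GoodputMonitor:
  """Hypothetical monitor: polls a goodput source and forwards each value to a writer."""

  def __init__(self, query_fn, write_fn, upload_interval_seconds):
    self._query_fn = query_fn  # assumed: returns the current goodput as a float
    self._write_fn = write_fn  # assumed: writes one scalar, e.g. to TensorBoard
    self._upload_interval = upload_interval_seconds
    self._stop_event = threading.Event()
    self._thread = None

  def start(self):
    """Starts the secondary "monitor & upload" thread."""
    self._thread = threading.Thread(
        target=self._query_and_upload_goodput, daemon=True
    )
    self._thread.start()

  def stop(self):
    """Signals the thread to exit and waits for it to finish."""
    self._stop_event.set()
    if self._thread is not None:
      self._thread.join()

  def _query_and_upload_goodput(self):
    # Event.wait() returns False on timeout and True once stop() sets the
    # event, so this loop runs once per interval until stop() is called.
    while not self._stop_event.wait(self._upload_interval):
      try:
        self._write_fn(self._query_fn())
      except Exception:  # keep monitoring even if one query or upload fails
        logging.exception("Goodput query or upload failed")
```

Waiting on a threading.Event rather than calling time.sleep lets stop() interrupt the wait immediately, so shutdown does not have to sit out a full upload interval.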

Tested:

  • Example run on v4-8 w/ ~180 steps here

Note: This is a POC, and this change is intended to eventually move to the cloud-accelerator-diagnostics and goodput packages.


def _query_and_upload_goodput(self):
  """Queries and uploads goodput data to TensorBoard."""
  time.sleep(10)
Collaborator

Is this supposed to be goodput_upload_interval_seconds instead of a hardcoded 10? I don't see where goodput_upload_interval_seconds is used.

Collaborator Author

Ln 65 can be removed; it's just something I was using in the initial PoC. goodput_upload_interval_seconds (stored as self._upload_interval) is used in Ln 68.

I apologize for the confusion. This PR is not meant to be merged; it was only meant as a PoC, and I'll update the title. Eventually this implementation will go into the Goodput package so that MaxText (or anyone else) can import the module from the package and just instantiate the GoodputCalculator with monitor_goodput=True.

@dipannita08 dipannita08 changed the title Goodput async monitoring and upload to Tensorboard POC [Not for Merge] [POC] Goodput async monitoring and upload to Tensorboard POC May 22, 2024