add example which trains a distributed GraphCast model on shallow-water-equations data #400

Merged: 52 commits merged on May 31, 2024


@stadlmax (Collaborator) commented on Mar 22, 2024

Modulus Pull Request

Description

  • Add an example that trains a GraphCast model on a shallow-water-equations dataset.
  • Set up the example so that it showcases "tensor-parallel" training of a GNN model.
  • Improve and fix bugs in the distributed communication primitives used by the example.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@akshaysubr akshaysubr added the distributed Distributed and model parallel tools label Apr 23, 2024
@stadlmax marked this pull request as ready for review on May 8, 2024 at 22:20
@stadlmax (Collaborator, Author) commented:

/blossom-ci

@stadlmax (Collaborator, Author) commented:

@mnabian I added a first version of the README. Could you have a look at it, and at some of the remarks Akshay brought up?

@stadlmax (Collaborator, Author) commented:

/blossom-ci

@stadlmax (Collaborator, Author) commented:

/blossom-ci

@stadlmax changed the title from "example distributed graphcast for shallow-water-equations" to "add example which trains a distributed GraphCast model on shallow-water-equations data" on May 31, 2024
@stadlmax merged commit 15ea3c9 into NVIDIA:main on May 31, 2024
1 check passed
Labels: distributed (Distributed and model parallel tools)
Projects: none
Linked issues: none
Participants: 3