Update the NCCL flags for A3+. #650

Merged 1 commit on May 20, 2024


yangyuwei
Collaborator

End-to-end test passed: I first created a 2-node A3+ cluster and then ran MaxText across both nodes.

@michelle-yooh
Collaborator

Have you tested this on A3 nodes as well?

@michelle-yooh
Collaborator

Could you help me understand how these flags were determined? Also, what differences in results does this change cause?

@yangyuwei
Collaborator Author

> Have you tested this on A3 nodes as well?

No, that isn't needed: the changes in this PR affect only A3+, not A3. As you can see, they are under the fastrak branch, and FasTrak is specific to A3+. The A3 config is under the tcpx branch and remains unchanged.

> Could you help me understand how these flags were determined? Also, what differences in results does this change cause?

Which set of flags applies is determined by the USE_GPUDIRECT environment variable, which is set either via XPK (see https://screenshot.googleplex.com/6augXoEPxzCrPYZ) or via Helm (https://screenshot.googleplex.com/59XtkdZKsN5g5Hm).

The changes are needed to use the latest FasTrak release (Pony + Aperture), which improves networking performance. They are consistent with the internal config in google3.
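
For readers unfamiliar with the branching described above, here is a minimal sketch of how a launcher might pick a branch from USE_GPUDIRECT. It is an illustration only, not the code touched by this PR: the accepted values `fastrak` and `tcpx` are assumed from the branch names mentioned in this thread, and `select_branch` and `affected_by_this_pr` are hypothetical helper names. No actual NCCL flags are shown, since the thread does not list them.

```python
import os

# Branch names come from the discussion above. That USE_GPUDIRECT takes
# exactly these string values is an assumption, not confirmed in the thread.
BRANCH_TO_MACHINE = {"fastrak": "A3+", "tcpx": "A3"}


def select_branch(env=None):
    """Return (branch, machine_type) for USE_GPUDIRECT, or (None, None)."""
    env = os.environ if env is None else env
    value = env.get("USE_GPUDIRECT", "").strip().lower()
    if value in BRANCH_TO_MACHINE:
        return value, BRANCH_TO_MACHINE[value]
    return None, None


def affected_by_this_pr(env=None):
    """Hypothetical helper: only the fastrak (A3+) branch is changed here."""
    branch, _ = select_branch(env)
    return branch == "fastrak"


if __name__ == "__main__":
    branch, machine = select_branch()
    print(f"branch={branch} machine={machine} affected={affected_by_this_pr()}")
```

Under these assumptions, a job launched with `USE_GPUDIRECT=tcpx` takes the A3 path and sees no change from this PR, which is why A3 testing was considered unnecessary.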

@copybara-service copybara-service bot merged commit e60cabf into main May 20, 2024
13 checks passed
@copybara-service copybara-service bot deleted the yangyuwei-maxtext branch May 20, 2024 18:27
3 participants