@srivatsankrishnan srivatsankrishnan commented Oct 20, 2024

Summary

This PR fixes bugs introduced during the refactoring of Jaxtoolbox command generation. While the failure appeared to be a configuration issue that resulted in OOM, it turned out that the refactoring in PRs 249/253 introduced three bugs:

  1. XLA flags were not updated properly because of a bug in the XLA flag parsing logic. RM issue: 4127411
  2. The pre-test command was not generated in the sbatch script because of a bug in parsing the pre-test command from the test TOML files. RM issue: 4127427
  3. Once item 2 was fixed, the generated pre-test command misconfigured the NCCL flags, which resulted in incorrect behavior and a failing pre-test. This failure prevents the LLM training from starting. RM issue: 4127915
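
For readers unfamiliar with this class of bug, the following is a minimal illustrative sketch, not the code changed in this PR. It assumes XLA flags arrive as plain key/value mappings (for example, after parsing a test TOML file) and that the sbatch script body is assembled as a list of shell lines. The function names `build_xla_flags` and `build_script_lines`, and the flag values used, are hypothetical.

```python
from typing import List, Mapping, Optional


def build_xla_flags(defaults: Mapping[str, object],
                    overrides: Mapping[str, object]) -> str:
    """Merge default and test-specific XLA flags into one XLA_FLAGS string.

    Hypothetical helper: test-specific overrides win on key collisions.
    """
    merged = {**defaults, **overrides}
    parts = []
    for name, value in sorted(merged.items()):
        if isinstance(value, bool):
            # Emit booleans as lowercase true/false rather than Python's True/False.
            value = str(value).lower()
        parts.append(f"--{name}={value}")
    return " ".join(parts)


def build_script_lines(xla_flags: str,
                       run_cmd: str,
                       pre_test_cmd: Optional[str] = None) -> List[str]:
    """Assemble the script body; the pre-test command, if any, precedes the run."""
    lines = [f'export XLA_FLAGS="{xla_flags}"']
    if pre_test_cmd:
        lines.append(pre_test_cmd)
    lines.append(run_cmd)
    return lines


if __name__ == "__main__":
    flags = build_xla_flags(
        {"xla_gpu_enable_latency_hiding_scheduler": False},
        {"xla_gpu_enable_latency_hiding_scheduler": True},
    )
    for line in build_script_lines(flags, "srun <training command>",
                                   "srun <pre-test command>"):
        print(line)
```

A unit test that checks the generated script for the merged flags and for the presence of the pre-test line would catch regressions of the kind described in items 1 and 2.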

Test Plan

CI/CD

Ran on Cluster

1-node Grok test

$ python ./cloudaix.py run --system-config conf/common/system/xxxx.toml --tests-dir conf/xxx/xxxxx/test/ --test-scenario conf/xxx/xxxx/test_scenario/xxxxxxxxxxxx.toml
[INFO] System Name: [REDACTED]
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: [REDACTED]
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: [REDACTED]

Section Name: Tests.1
  Test Name: [REDACTED]
  Description: [REDACTED]
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
True
[INFO] Executing command for test Tests.1: sbatch [REDACTED]
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: [REDACTED]
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

Result logs: Here

4-node Grok test

$ python ./cloudaix.py run --system-config conf/common/system/xxxxx.toml --tests-dir conf/xxx/xxxx/test/ --test-scenario conf/xxxx/xxx/xxx/xxxxx.toml
[INFO] System Name: [REDACTED]
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: [REDACTED]
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: [REDACTED]

Section Name: Tests.1
  Test Name: [REDACTED]
  Description: [REDACTED]
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
True
[INFO] Executing command for test Tests.1: sbatch [REDACTED]
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: [REDACTED]
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

Results log: Here

Additional Notes

@TaekyungHeo TaekyungHeo added the bug (Something isn't working), Oct24 (Oct'24 release), and feature labels Oct 21, 2024
@srinivas212 srinivas212 merged commit 3e0d840 into main Oct 23, 2024
@srinivas212 srinivas212 deleted the pr_249_253_bug branch October 23, 2024 20:04
@srinivas212 srinivas212 changed the title from "Pr 249 253 bug" to "Fix bug from refactoring of Jaxtoolbox command generation" Oct 23, 2024