@srivatsankrishnan srivatsankrishnan commented Jan 11, 2025

Summary

This PR introduces the concept of dse_iteration and bakes it into how results are stored for each iteration. Each DSE iteration should generate its own iteration_x folder containing the generated batch and srun scripts. The TestRun definition already has current_iteration and iteration fields, so this needs more discussion, and the test_slurm unit test should be updated accordingly.

For a normal benchmarking job, there is no concept of iterations, so we should expect a folder structure with a single iteration_1 folder, like this:

results
|--iteration_1
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh

For a DSE job, each iteration gets its own iteration_x folder:

results
|--iteration_1
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh
|--iteration_2
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh
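The layout above can be sketched as a small path helper. This is only an illustration under stated assumptions: `iteration_output_dir` and its parameters are hypothetical names, not part of the CloudAI code base, and the innermost numbered folder is treated as a plain integer argument.

```python
from pathlib import Path


def iteration_output_dir(
    results_root: str,
    scenario_name: str,
    test_name: str,
    iteration: int,
    step: int = 0,
) -> Path:
    """Build results/iteration_<n>/<scenario>/<test>/<step> (hypothetical helper)."""
    if iteration < 1:
        # The trees above start counting at iteration_1.
        raise ValueError("iteration must be >= 1")
    return (
        Path(results_root)
        / f"iteration_{iteration}"
        / scenario_name
        / test_name
        / str(step)
    )


# A normal benchmarking job would only ever use iteration=1;
# a DSE job would call this once per iteration.
print(iteration_output_dir("results", "test_scenario.name", "Test.1", 2))
```

Keeping the iteration as the outermost component means each DSE iteration can be inspected or deleted as one folder.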

Test Plan

CI/CD
Dry-Run

$ cloudai dry-run --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_jaxtoolbox.toml
.
└── dse_jaxtoolbox_grok
    ├── iteration_1
    │   └── 2025-01-11_14-28-16
    │       └── Tests.1
    │           └── 0
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_2
    │   └── 2025-01-11_14-28-17
    │       └── Tests.1
    │           └── 1
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_3
    │   └── 2025-01-11_14-28-18
    │       └── Tests.1
    │           └── 2
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_4
    │   └── 2025-01-11_14-28-19
    │       └── Tests.1
    │           └── 3
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_5
    │   └── 2025-01-11_14-28-20
    │       └── Tests.1
    │           └── 4
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_6
    │   └── 2025-01-11_14-28-21
    │       └── Tests.1
    │           └── 5
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_7
    │   └── 2025-01-11_14-28-22
    │       └── Tests.1
    │           └── 6
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_8
    │   └── 2025-01-11_14-28-23
    │       └── Tests.1
    │           └── 7
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_9
    │   └── 2025-01-11_14-28-24
    │       └── Tests.1
    │           └── 8
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_10
    │   └── 2025-01-11_14-28-25
    │       └── Tests.1
    │           └── 9
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_11
    │   └── 2025-01-11_14-28-26
    │       └── Tests.1
    │           └── 10
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_12
    │   └── 2025-01-11_14-28-27
    │       └── Tests.1
    │           └── 11
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_13
    │   └── 2025-01-11_14-28-28
    │       └── Tests.1
    │           └── 12
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_14
    │   └── 2025-01-11_14-28-29
    │       └── Tests.1
    │           └── 13
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_15
    │   └── 2025-01-11_14-28-30
    │       └── Tests.1
    │           └── 14
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_16
    │   └── 2025-01-11_14-28-31
    │       └── Tests.1
    │           └── 15
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_17
    │   └── 2025-01-11_14-28-32
    │       └── Tests.1
    │           └── 16
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_18
    │   └── 2025-01-11_14-28-33
    │       └── Tests.1
    │           └── 17
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_19
    │   └── 2025-01-11_14-28-34
    │       └── Tests.1
    │           └── 18
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_20
    │   └── 2025-01-11_14-28-35
    │       └── Tests.1
    │           └── 19
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_21
    │   └── 2025-01-11_14-28-36
    │       └── Tests.1
    │           └── 20
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_22
    │   └── 2025-01-11_14-28-37
    │       └── Tests.1
    │           └── 21
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_23
    │   └── 2025-01-11_14-28-38
    │       └── Tests.1
    │           └── 22
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_24
    │   └── 2025-01-11_14-28-39
    │       └── Tests.1
    │           └── 23
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_25
    │   └── 2025-01-11_14-28-40
    │       └── Tests.1
    │           └── 24
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_26
    │   └── 2025-01-11_14-28-42
    │       └── Tests.1
    │           └── 25
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_27
    │   └── 2025-01-11_14-28-43
    │       └── Tests.1
    │           └── 26
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_28
    │   └── 2025-01-11_14-28-44
    │       └── Tests.1
    │           └── 27
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_29
    │   └── 2025-01-11_14-28-45
    │       └── Tests.1
    │           └── 28
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_30
    │   └── 2025-01-11_14-28-46
    │       └── Tests.1
    │           └── 29
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_31
    │   └── 2025-01-11_14-28-47
    │       └── Tests.1
    │           └── 30
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    └── iteration_32
        └── 2025-01-11_14-28-48
            └── Tests.1
                └── 31
                    ├── cloudai_sbatch_script.sh
                    └── run.sh
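A dry-run tree like the one above could also be checked programmatically. The following is a minimal sketch, assuming only the file names visible in the output (`cloudai_sbatch_script.sh` and `run.sh`); the function name `find_incomplete_iterations` is hypothetical and not part of CloudAI.

```python
from pathlib import Path

# File names taken from the dry-run output above.
EXPECTED_FILES = {"cloudai_sbatch_script.sh", "run.sh"}


def find_incomplete_iterations(scenario_dir: Path) -> list[str]:
    """Return names of iteration_* folders missing any expected script."""
    incomplete = []
    for it_dir in sorted(scenario_dir.glob("iteration_*")):
        if not it_dir.is_dir():
            continue
        # Collect every file name found anywhere below this iteration folder.
        found = {p.name for p in it_dir.rglob("*") if p.is_file()}
        if not EXPECTED_FILES <= found:
            incomplete.append(it_dir.name)
    return incomplete
```

Calling `find_incomplete_iterations(Path("dse_jaxtoolbox_grok"))` on the dry-run output would return an empty list if every iteration folder contains both scripts.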

Additional Notes

amaslenn and others added 30 commits January 10, 2025 14:21
1. Inherit from ABC if there are @abstractmethods
2. Do not make gen_srun_success_check() abstract, simply return an empty
   string by default. When needed, this method will be overridden.
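The pattern described in this commit note can be sketched as follows. Only `gen_srun_success_check()` is named in the source; the class names and the `gen_srun_command()` method are hypothetical and used purely for illustration.

```python
from abc import ABC, abstractmethod


class BaseCommandGen(ABC):  # hypothetical class name
    @abstractmethod
    def gen_srun_command(self) -> str:  # hypothetical abstract method
        """Subclasses must provide the srun command."""

    def gen_srun_success_check(self) -> str:
        # Not abstract: default to an empty string, override when needed.
        return ""


class PlainCommandGen(BaseCommandGen):
    def gen_srun_command(self) -> str:
        return "srun hostname"


class CheckedCommandGen(PlainCommandGen):
    def gen_srun_success_check(self) -> str:
        return "grep -q OK stdout.txt"
```

With this shape, a subclass only has to implement the abstract method, and the success check stays optional.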
Some slurm setups do not allow running enroot from the head node. Let's
rely on actual 'enroot import' run via srun and report its real error
message to user.
* Remove conf/common/test/chakra_replay.toml

* Remove conf/common/test_scenario/chakra_replay.toml
@srivatsankrishnan srivatsankrishnan closed this pull request by merging all changes into NVIDIA:main in 36e3993 Jan 21, 2025