Results directory for DSE job #333

srivatsankrishnan · 2025-01-11T23:05:07Z

Summary

This PR introduces the concept of dse_iteration and bakes it into how the results are stored for each iteration. Each DSE iteration should generate its own iteration_x folder containing the generated sbatch and srun scripts. The TestRun definition already has current_iteration and iteration fields, so this needs more discussion, and the test_slurm unit test needs to be updated accordingly.

For a normal benchmarking job, there is no concept of iterations. Hence, we should expect a folder structure like this:

results
|--iteration_1
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh

For a DSE job, each iteration gets its own iteration_x folder:

results
|--iteration_1
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh
|--iteration_2
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh
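The layout above can be sketched as a small path helper. This is only an illustration of the proposed structure, not the code in this PR: the function name `iteration_output_dir` and its parameters are hypothetical, and the 1-based numbering is assumed from the `iteration_1` folder name.

```python
from pathlib import Path


def iteration_output_dir(
    results_root: Path,
    dse_iteration: int,
    scenario_name: str,
    test_name: str,
    step: int = 0,
) -> Path:
    """Hypothetical helper: compose the per-iteration output directory.

    Mirrors the tree above: results/iteration_x/<scenario>/<test>/<step>.
    """
    if dse_iteration < 1:
        # Assumption: iterations are numbered from 1, as in iteration_1.
        raise ValueError("dse_iteration is 1-based in this sketch")
    return results_root / f"iteration_{dse_iteration}" / scenario_name / test_name / str(step)


# Each DSE iteration resolves to its own folder.
for i in range(1, 3):
    print(iteration_output_dir(Path("results"), i, "test_scenario.name", "Test.1").as_posix())
```

A normal benchmarking job would call this once with `dse_iteration=1`, which gives the single-iteration tree shown earlier.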

Test Plan

CI/CD
Dry-Run

$ cloudai dry-run --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_jaxtoolbox.toml

.
└── dse_jaxtoolbox_grok
    ├── iteration_1
    │   └── 2025-01-11_14-28-16
    │       └── Tests.1
    │           └── 0
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_2
    │   └── 2025-01-11_14-28-17
    │       └── Tests.1
    │           └── 1
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_3
    │   └── 2025-01-11_14-28-18
    │       └── Tests.1
    │           └── 2
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_4
    │   └── 2025-01-11_14-28-19
    │       └── Tests.1
    │           └── 3
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_5
    │   └── 2025-01-11_14-28-20
    │       └── Tests.1
    │           └── 4
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_6
    │   └── 2025-01-11_14-28-21
    │       └── Tests.1
    │           └── 5
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_7
    │   └── 2025-01-11_14-28-22
    │       └── Tests.1
    │           └── 6
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_8
    │   └── 2025-01-11_14-28-23
    │       └── Tests.1
    │           └── 7
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_9
    │   └── 2025-01-11_14-28-24
    │       └── Tests.1
    │           └── 8
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_10
    │   └── 2025-01-11_14-28-25
    │       └── Tests.1
    │           └── 9
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_11
    │   └── 2025-01-11_14-28-26
    │       └── Tests.1
    │           └── 10
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_12
    │   └── 2025-01-11_14-28-27
    │       └── Tests.1
    │           └── 11
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_13
    │   └── 2025-01-11_14-28-28
    │       └── Tests.1
    │           └── 12
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_14
    │   └── 2025-01-11_14-28-29
    │       └── Tests.1
    │           └── 13
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_15
    │   └── 2025-01-11_14-28-30
    │       └── Tests.1
    │           └── 14
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_16
    │   └── 2025-01-11_14-28-31
    │       └── Tests.1
    │           └── 15
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_17
    │   └── 2025-01-11_14-28-32
    │       └── Tests.1
    │           └── 16
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_18
    │   └── 2025-01-11_14-28-33
    │       └── Tests.1
    │           └── 17
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_19
    │   └── 2025-01-11_14-28-34
    │       └── Tests.1
    │           └── 18
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_20
    │   └── 2025-01-11_14-28-35
    │       └── Tests.1
    │           └── 19
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_21
    │   └── 2025-01-11_14-28-36
    │       └── Tests.1
    │           └── 20
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_22
    │   └── 2025-01-11_14-28-37
    │       └── Tests.1
    │           └── 21
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_23
    │   └── 2025-01-11_14-28-38
    │       └── Tests.1
    │           └── 22
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_24
    │   └── 2025-01-11_14-28-39
    │       └── Tests.1
    │           └── 23
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_25
    │   └── 2025-01-11_14-28-40
    │       └── Tests.1
    │           └── 24
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_26
    │   └── 2025-01-11_14-28-42
    │       └── Tests.1
    │           └── 25
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_27
    │   └── 2025-01-11_14-28-43
    │       └── Tests.1
    │           └── 26
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_28
    │   └── 2025-01-11_14-28-44
    │       └── Tests.1
    │           └── 27
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_29
    │   └── 2025-01-11_14-28-45
    │       └── Tests.1
    │           └── 28
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_30
    │   └── 2025-01-11_14-28-46
    │       └── Tests.1
    │           └── 29
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_31
    │   └── 2025-01-11_14-28-47
    │       └── Tests.1
    │           └── 30
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    └── iteration_32
        └── 2025-01-11_14-28-48
            └── Tests.1
                └── 31
                    ├── cloudai_sbatch_script.sh
                    └── run.sh
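A dry-run tree like the one above can be checked mechanically. The sketch below is a hypothetical helper, not part of CloudAI: it assumes only the two script names visible in the output and reports any `iteration_*` folder that is missing one of them. The demo builds a throwaway tree in a temporary directory.

```python
import tempfile
from pathlib import Path

# Script names taken from the dry-run output above.
EXPECTED_SCRIPTS = {"cloudai_sbatch_script.sh", "run.sh"}


def incomplete_iterations(scenario_dir: Path) -> list[str]:
    """Return names of iteration_* folders that lack an expected script."""
    missing = []
    # Note: sorted() is lexicographic, so iteration_10 sorts before iteration_2.
    for it_dir in sorted(scenario_dir.glob("iteration_*")):
        found = {p.name for p in it_dir.rglob("*.sh")}
        if not EXPECTED_SCRIPTS <= found:
            missing.append(it_dir.name)
    return missing


# Demo: iteration_1 is complete, iteration_2 lacks the sbatch script.
with tempfile.TemporaryDirectory() as tmp:
    scenario = Path(tmp) / "dse_jaxtoolbox_grok"
    for i, scripts in enumerate([EXPECTED_SCRIPTS, {"run.sh"}], start=1):
        leaf = scenario / f"iteration_{i}" / "2025-01-11_14-28-16" / "Tests.1" / str(i - 1)
        leaf.mkdir(parents=True)
        for name in scripts:
            (leaf / name).touch()
    demo_missing = incomplete_iterations(scenario)

print(demo_missing)  # ['iteration_2']
```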

Additional Notes


amaslenn and others added 30 commits January 10, 2025 14:21

Fix ABC usage issues

d9e854a

1. Inherit from ABC if there are @abstractmethods.
2. Do not make gen_srun_success_check() abstract; simply return an empty string by default. When needed, this method will be overridden.
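As a minimal sketch of the pattern this commit message describes: only `gen_srun_success_check()` comes from the message itself. The class names, `gen_exec_command()`, and the shell snippet are hypothetical and chosen for illustration.

```python
from abc import ABC, abstractmethod


class CommandGenStrategy(ABC):
    """Hypothetical base class; inherits from ABC because it has an abstract method."""

    @abstractmethod
    def gen_exec_command(self) -> str:
        """Subclasses must provide the command to run."""

    def gen_srun_success_check(self) -> str:
        # Not abstract: an empty string by default means "no extra check".
        return ""


class SleepStrategy(CommandGenStrategy):
    def gen_exec_command(self) -> str:
        return "sleep 5"


class CheckedSleepStrategy(SleepStrategy):
    def gen_srun_success_check(self) -> str:
        # Overridden only where a check is actually needed (illustrative command).
        return 'grep -q "done" stdout.txt && echo 1 || echo 0'


print(repr(SleepStrategy().gen_srun_success_check()))
print(CheckedSleepStrategy().gen_srun_success_check())
```

Without the `ABC` base, `@abstractmethod` is not enforced and the base class could be instantiated by mistake. With it, `CommandGenStrategy()` raises `TypeError`.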

Add test hooks to USER_GUIDE.md (NVIDIA#322)

34cda1f

remove the default condition check

0cfd5e6

preserves lists in cmd_args as is (for pydantic validation)

baa2925

propagate cmd_args type to all places in cloudAI for pyright errors

1a26f7f

Add ClassVar to remove pydantic annotation error

aa8a38c

fix pytest

c95fefa

Do not check image accessibility using "local" enroot

cd8fbbf

Some slurm setups do not allow running enroot from the head node. Let's rely on actual 'enroot import' run via srun and report its real error message to user.
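The idea in this commit message, running the import on a compute node and surfacing its own error text, might look roughly like the sketch below. It is not the CloudAI implementation: the function names and the `runner` parameter are hypothetical, and it assumes only the standard `srun --partition/--account` options and `enroot import -o <file> <uri>`.

```python
import subprocess
from typing import Callable, Optional


def build_import_cmd(
    image_url: str, output_path: str, partition: str, account: Optional[str] = None
) -> list[str]:
    """Hypothetical helper: wrap 'enroot import' in srun so it runs off the head node."""
    cmd = ["srun", f"--partition={partition}"]
    if account:
        cmd.append(f"--account={account}")
    cmd += ["enroot", "import", "-o", output_path, image_url]
    return cmd


def import_image(cmd: list[str], runner: Callable = subprocess.run) -> tuple[bool, str]:
    """Run the command and return (success, error message from the tool itself)."""
    result = runner(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Report the real stderr instead of guessing why the image is unavailable.
        return False, result.stderr.strip()
    return True, ""


# Building the command has no side effects, so it is safe to print here.
print(" ".join(build_import_cmd("docker://nvcr.io#nvidia/pytorch:24.02-py3", "/tmp/pytorch.sqsh", "batch", "my_account")))
```

The `runner` argument exists only so the function can be exercised without a Slurm cluster, for example by passing a stub that returns a `subprocess.CompletedProcess`.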

Do not require enroot binary on head node

3f358ea

Pass SlurmSystem into DockerImageCacheManager

41694b7

Specify account while caching images

1bcb661

Reduce noise in CLI output

2fb9b92

Make ruff happy

a1ed291

more unit tests for parser with Grok Test definition + pydantic of one of the variables that is list.

f85961b

ruffing

aa2d0d9

Add more test to have ranges for FDL flags.

3e0864d

More test for XLA flags as list other fixed + fixing typing in Grok/Jaxtoolbox definitions

e2fe3aa

All static values (benchmarking scenarios in CloudAI)

8367cb7

negative tests with various types in the list

a7b1633

remove the unit tests

695a5d0

remove instance check (assuming model_dump() never fails)

44a9fda

fix the typing for slurm_args

6124236

removing the old _parser_cmd method that is not used.

f229433

test and test scenario for environment configuration

fd83f65

Add configurable gym environment from test run object

118638a

Configurable cloudaigym environment and tests

532c9ad

reorg the environment under _core directory

db516e3

fix pyright and pytest issues

6bfe9e2

Remove conf/common/test/chakra_replay.toml (NVIDIA#328)

4fc907e

* Remove conf/common/test/chakra_replay.toml
* Remove conf/common/test_scenario/chakra_replay.toml

checkpoint policy serializer for list/ranges

c0f8f6c

srivatsankrishnan added 24 commits January 10, 2025 14:23

Add farma gym to requirements

dae998f

vulture check

6487344

fix pyproject.toml

cdfb826

Fix the test package errors

76f1f2f

taplo

897c4b9

port agent interface and grid search

bcd45da

Ignore vulture for grid search

d31412c

vulture and ruff fixes

3ad97cd

remove comments

bf6c96a

removed the fixed value

7e10b22

Merge branch 'main' into config-agent

dff5c64

remove the setter

275bf39

Not introduce range as of now. Stick to static lists

63efa41

agent environment integration with runner

d945201

more fixes

1be4398

Remove Farma gym dependencies for more control over types + other fixes in pyright

4df4ab9

vulture fix

e6905f7

remove farma gym dependencies + update the pytest for cloudai_gym

177694f

remove farma gym from pyproject

b10dbfb

fix the copyright headers checks

15be693

use iterators to avoid indexing errors.

d5d1e14

helper method for manipulating the TestRun object directly

96ab055

Modifications for storing dse results

8cab450

add dse_iteration to TestRun object

0acf43e

srivatsankrishnan force-pushed the fix_dse_output_dir branch from 5b93508 to 0acf43e Compare January 12, 2025 00:04

srivatsankrishnan mentioned this pull request Jan 14, 2025

Configurable agents interface #329

Merged

TaekyungHeo added the feature label Jan 14, 2025

srivatsankrishnan closed this pull request by merging all changes into NVIDIA:main in 36e3993 Jan 21, 2025
