Skip to content

[JAX] Fixes for L0_jax_distributed_unittest#1884

Merged
phu0ngng merged 11 commits intoNVIDIA:mainfrom
phu0ngng:test_script
Jun 17, 2025
Merged

[JAX] Fixes for L0_jax_distributed_unittest#1884
phu0ngng merged 11 commits intoNVIDIA:mainfrom
phu0ngng:test_script

Conversation

@phu0ngng
Copy link
Copy Markdown
Collaborator

Description

  • Removed exit code from run_test_multiprocessing_encoder.sh to avoid premature exit in the test suite.
  • Restore the tests that were accidentally left out earlier.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • [] I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

phu0ngng added 2 commits June 16, 2025 05:39
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng
Copy link
Copy Markdown
Collaborator Author

/te-ci JAX L0

Comment thread qa/L0_jax_distributed_unittest/test.sh Outdated
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_multigpu_encoder.xml $TE_PATH/examples/jax/encoder/test_multigpu_encoder.py || test_fail "test_multigpu_encoder.py"
wait
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_model_parallel_encoder.xml $TE_PATH/examples/jax/encoder/test_model_parallel_encoder.py || test_fail "test_model_parallel_encoder.py"
# wait
Copy link
Copy Markdown

@jaro-sevcik jaro-sevcik Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not uncomment this wait?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng
Copy link
Copy Markdown
Collaborator Author

/te-ci JAX L0

phu0ngng added 2 commits June 16, 2025 08:51
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng
Copy link
Copy Markdown
Collaborator Author

/te-ci JAX L0

@phu0ngng phu0ngng added the 2.5.0 label Jun 16, 2025
Comment thread qa/L0_jax_distributed_unittest/test.sh Outdated
Comment on lines +27 to +31
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_multigpu_encoder.xml $TE_PATH/examples/jax/encoder/test_multigpu_encoder.py || test_fail "test_multigpu_encoder.py"
wait
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_model_parallel_encoder.xml $TE_PATH/examples/jax/encoder/test_model_parallel_encoder.py || test_fail "test_model_parallel_encoder.py"
wait
bash $TE_PATH/examples/jax/encoder/run_test_multiprocessing_encoder.sh || test_fail "run_test_multiprocessing_encoder.sh"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me !

Quick question:

  1. Do we know if these were ever enabled in the past ? If yes, is there a reason we disabled them ?
  2. Also, interested in knowing if we know approximately how much longer will this test run due to us uncommenting these tests ? (asking from a CI/QA budget point of view)
  3. Is it okay to explicitly uses bash - what if the system running the script has a shell is not bash ?
    Thanks !

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I accidentally left out in one of the PR that merged recently.
  2. We have always been running all of these tests, and the whole suite takes ~25 mins, I think so we should not have any issues.
  3. The qa/L0_jax_distributed_unittest/test.sh is a bash script itself.

@jberchtold-nvidia
Copy link
Copy Markdown
Collaborator

These tests were all running recently, I assume they were recently disabled when debugging a PR and accidentally merged, right? If so, then LGTM pending one CI failure

phu0ngng added 4 commits June 16, 2025 11:25
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng
Copy link
Copy Markdown
Collaborator Author

/te-ci JAX L0

phu0ngng added 2 commits June 16, 2025 14:10
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng
Copy link
Copy Markdown
Collaborator Author

/te-ci JAX L0

1 similar comment
@jberchtold-nvidia
Copy link
Copy Markdown
Collaborator

/te-ci JAX L0

@phu0ngng phu0ngng requested a review from KshitijLakhani June 16, 2025 23:54
# python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_model_parallel_encoder.xml $TE_PATH/examples/jax/encoder/test_model_parallel_encoder.py || test_fail "test_model_parallel_encoder.py"
# wait
. $TE_PATH/examples/jax/encoder/run_test_multiprocessing_encoder.sh || test_fail "run_test_multiprocessing_encoder.sh"
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_multigpu_encoder.xml $TE_PATH/examples/jax/encoder/test_multigpu_encoder.py || test_fail "test_multigpu_encoder.py"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be unrelated to this PR, but TE_PATH doesn't seem to be set correctly at least in one configuration. CI is failing here with

FileNotFoundError: [Errno 2] No such file or directory: '/tests/jax/pytest.ini'

which I'd assume is because TE_PATH is empty.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was not the latest pipeline.
The latest pipeline (#30209220) shows that all tests have passed.

@phu0ngng phu0ngng merged commit ae572af into NVIDIA:main Jun 17, 2025
23 checks passed
@phu0ngng phu0ngng deleted the test_script branch June 17, 2025 12:59
KshitijLakhani pushed a commit that referenced this pull request Jun 17, 2025
* include previously accidentally excluded tests

* Execute run_test_multiprocessing_encoder with nested bash + exit code for inner bash shell

* Adapt run_test_multiprocessing to handle segfault

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants