[JAX] Fixes for L0_jax_distributed_unittest #1884
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
/te-ci JAX L0
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_multigpu_encoder.xml $TE_PATH/examples/jax/encoder/test_multigpu_encoder.py || test_fail "test_multigpu_encoder.py"
wait
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_model_parallel_encoder.xml $TE_PATH/examples/jax/encoder/test_model_parallel_encoder.py || test_fail "test_model_parallel_encoder.py"
# wait
Any reason to not uncomment this wait?
/te-ci JAX L0
/te-ci JAX L0
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_multigpu_encoder.xml $TE_PATH/examples/jax/encoder/test_multigpu_encoder.py || test_fail "test_multigpu_encoder.py"
wait
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_model_parallel_encoder.xml $TE_PATH/examples/jax/encoder/test_model_parallel_encoder.py || test_fail "test_model_parallel_encoder.py"
wait
bash $TE_PATH/examples/jax/encoder/run_test_multiprocessing_encoder.sh || test_fail "run_test_multiprocessing_encoder.sh"
Looks good to me!
Quick questions:
- Do we know if these were ever enabled in the past? If so, is there a reason we disabled them?
- Do we know approximately how much longer this test will run once these tests are uncommented? (Asking from a CI/QA budget point of view.)
- Is it okay to explicitly use bash? What if the system running the script has a shell that is not bash?
Thanks!
- I accidentally left them out in one of the PRs that merged recently.
- We have always been running all of these tests, and the whole suite takes ~25 mins, I think, so we should not have any issues.
- qa/L0_jax_distributed_unittest/test.sh is itself a bash script.
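For readers unfamiliar with the `|| test_fail` pattern in the lines under review, here is a minimal sketch of how such a helper can collect failures without stopping the suite. The body of `test_fail`, the `RET` and `FAILED_CASES` variables, and the stand-in commands are assumptions for illustration; the real test.sh may differ.

```shell
# Hypothetical failure-collecting helper; not copied from the real test.sh.
RET=0
FAILED_CASES=""

test_fail() {
    # Record the failure and keep going instead of exiting immediately.
    RET=1
    FAILED_CASES="$FAILED_CASES $1"
    echo "Error: sub-test failed: $1"
}

# Stand-ins for the pytest invocations: one passes, one fails.
true  || test_fail "passing_case.py"
wait   # no background jobs are running, so this returns immediately
false || test_fail "failing_case.py"
wait

# Report at the end; a real script would typically exit with $RET here.
if [ "$RET" -ne 0 ]; then
    echo "Failed cases:$FAILED_CASES"
fi
```

With this shape, every sub-test runs even if an earlier one fails, and the final status reflects whether any of them failed.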
These tests were all running until recently. I assume they were disabled while debugging a PR and the change was accidentally merged, right? If so, then LGTM pending one CI failure.
/te-ci JAX L0
/te-ci JAX L0
/te-ci JAX L0
# python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_model_parallel_encoder.xml $TE_PATH/examples/jax/encoder/test_model_parallel_encoder.py || test_fail "test_model_parallel_encoder.py"
# wait
. $TE_PATH/examples/jax/encoder/run_test_multiprocessing_encoder.sh || test_fail "run_test_multiprocessing_encoder.sh"
python3 -m pytest -c $TE_PATH/tests/jax/pytest.ini -v --junitxml=$XML_LOG_DIR/pytest_test_multigpu_encoder.xml $TE_PATH/examples/jax/encoder/test_multigpu_encoder.py || test_fail "test_multigpu_encoder.py"
Might be unrelated to this PR, but TE_PATH doesn't seem to be set correctly at least in one configuration. CI is failing here with
FileNotFoundError: [Errno 2] No such file or directory: '/tests/jax/pytest.ini'
which I'd assume is because TE_PATH is empty.
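The failure mode described above is easy to reproduce in isolation. The sketch below shows how an empty `TE_PATH` collapses the config path to `/tests/jax/pytest.ini`, and two possible guards. The fallback directory is a hypothetical placeholder, not necessarily what the CI configuration uses.

```shell
# Reproduce the symptom: an empty TE_PATH leaves only the path suffix.
TE_PATH=""
broken_path="$TE_PATH/tests/jax/pytest.ini"
echo "$broken_path"   # prints /tests/jax/pytest.ini

# Guard option 1: fail loudly. Run in a subshell here so the demo keeps going.
strict_result="ok"
( : "${TE_PATH:?TE_PATH must be set}" ) 2>/dev/null || strict_result="guard tripped"
echo "$strict_result"

# Guard option 2: fall back to a default location (hypothetical path).
: "${TE_PATH:=/opt/transformerengine}"
fixed_path="$TE_PATH/tests/jax/pytest.ini"
echo "$fixed_path"
```

Option 1 surfaces a misconfigured pipeline immediately; option 2 keeps local runs convenient at the cost of possibly hiding a missing variable.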
Yeah, that was not the latest pipeline.
The latest pipeline (#30209220) shows that all tests have passed.
* include previously accidentally excluded tests
* Execute run_test_multiprocessing_encoder with nested bash + exit code for inner bash shell
* Adapt run_test_multiprocessing to handle segfault
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
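The switch from sourcing the script (`. script.sh`) to running it with a nested `bash` matters when the inner script calls `exit`. The sketch below illustrates the difference under stated assumptions: the inner script is a hypothetical stand-in that always exits with status 3, and a subshell plays the role of the outer test script so the demo itself survives.

```shell
# Hypothetical inner script that ends with an explicit exit, as a test runner might.
inner="$(mktemp)"
printf 'echo "inner running"\nexit 3\n' > "$inner"

test_fail() { echo "Error: sub-test failed: $1"; }

# Sourcing: `exit` terminates the calling shell, so neither test_fail nor the
# following echo runs. The command substitution stands in for the outer script.
sourced_out="$( . "$inner" || test_fail "inner"; echo "after source" )" || true

# Nested bash: `exit` ends only the child process; the caller continues.
nested_out="$( bash "$inner" || test_fail "inner"; echo "after nested bash" )"

# The child's exit status is still available to the caller.
nested_status=0
bash "$inner" > /dev/null || nested_status=$?
rm -f "$inner"

echo "sourced: $sourced_out"
echo "nested:  $nested_out"
echo "status:  $nested_status"
```

In the sourced case the remaining tests would never run and the failure would not be recorded, which matches the "premature exit" this PR sets out to avoid.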
Description
run_test_multiprocessing_encoder.sh to avoid premature exit in the test suite.
Type of change
Checklist: