Run standalone tests in batches #13673

Merged
merged 17 commits into master from ci/batched-standalone-tests on Jul 18, 2022

Conversation

carmocca
Member

@carmocca carmocca commented Jul 15, 2022

What does this PR do?

Launch the standalone tests in batches.

Standalone tests must run in separate processes because they interact with one another. By default, pytest runs all tests together in a single process, so until now we have launched them one by one.

However, we can batch the tests ourselves and still launch each batch in its own process. That is what this PR does.
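
For context, here is a minimal sketch of how such batching could look. The actual implementation lives in tests/tests_pytorch/run_standalone_tests.sh and may differ; the "standalone" pytest marker, the batch size of 6, and the tests_pytorch directory are assumptions for illustration only.

```bash
#!/bin/bash
# Hypothetical sketch only; the real logic is in tests/tests_pytorch/run_standalone_tests.sh.
# The "standalone" marker, batch size, and test directory are assumptions.
set -e

batch_size=6

# Collect the ids of the selected tests, one per line, keeping only node ids
# (lines containing "::") and dropping pytest's summary output.
test_ids=$(python -m pytest tests_pytorch -q --collect-only -m standalone | grep "::")

batch=()
for test_id in $test_ids; do
  batch+=("$test_id")
  if [ "${#batch[@]}" -eq "$batch_size" ]; then
    # Run the whole batch in a fresh interpreter so that state cannot leak
    # between batches, then start collecting the next batch.
    python -m pytest "${batch[@]}" -v
    batch=()
  fi
done

# Run whatever is left over if the last batch is not full.
if [ "${#batch[@]}" -gt 0 ]; then
  python -m pytest "${batch[@]}" -v
fi
```

Keeping each batch in its own interpreter preserves the isolation that standalone tests rely on while amortizing the interpreter start-up and collection cost across several tests.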

batch size    Standalone runtime
master (1)    22m 30s
4             8m 15s
6             7m 2s
8             failed allclose on a test

We might be able to push the batch size higher, but there seem to be interactions between the DeepSpeed tests. As a drawback, this opens up a potential source of flakiness depending on how the tests get batched.

Discussed with @tchaton

There's a failing test in standalone tests:

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

But it's happening in master too.

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • [n/a] Did you write any new necessary tests? (not for typos and docs)
  • [n/a] Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

cc @carmocca @akihironitta @Borda

@carmocca carmocca added the ci Continuous Integration label Jul 15, 2022
@carmocca carmocca added this to the pl:1.7 milestone Jul 15, 2022
@carmocca carmocca self-assigned this Jul 15, 2022
@Borda
Member

Borda commented Jul 15, 2022

Just curious: if we run them all in one batch, what would be the difference from running the standard tests?

@carmocca
Member Author

See the issue description

Contributor

@tchaton tchaton left a comment

Love it!

Review comments on: .azure/gpu-tests.yml (outdated), tests/tests_pytorch/run_standalone_tests.sh
@Borda
Member

Borda commented Jul 15, 2022

We might be able to push the batch size higher. But there seem to be interactions between tests. As a drawback, this opens up a potential source of flakiness depending on how the tests get batched.

Yes, I see. So shall we run a few more experiments to find the right batch size for minimal interactions?

@carmocca
Member Author

carmocca commented Jul 15, 2022

shall we run a few more experiments to find the right batch size for minimal interactions?

I've done that already; see the table at the top. "6" is the highest we can run at the moment without interactions. I would like to experiment further in a follow-up PR.

We can always tune it higher or lower in the future. And setting it to "1" should be equivalent to reverting this PR.

@mergify mergify bot added the ready PRs ready to be merged label Jul 16, 2022
@awaelchli
Member

If each test is still standalone, launches its own processes, and lets them terminate, then there should be absolutely no interactions between tests (assuming they only write to tmpdir). Is this the case? This property has to remain, otherwise we will have problems.

@carmocca carmocca enabled auto-merge (squash) July 18, 2022 11:52
@carmocca carmocca merged commit d058190 into master Jul 18, 2022
@carmocca carmocca deleted the ci/batched-standalone-tests branch July 18, 2022 12:10