Run standalone tests in batches #13673

Merged
merged 17 commits into master from ci/batched-standalone-tests on Jul 18, 2022

Conversation

carmocca
Member

@carmocca carmocca commented Jul 15, 2022

What does this PR do?

Launch the standalone tests in batches.

Standalone tests must run in separate processes because they interact with one another. By default, pytest runs all tests together in a single process, so until now we have launched them one by one.

However, we can batch the tests ourselves and still launch each batch in its own process. That is what this PR does.
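
For context, here is a minimal sketch of how such batching could look. The actual implementation lives in tests/tests_pytorch/run_standalone_tests.sh and may differ; the "standalone" pytest marker, the batch size of 6, and the tests_pytorch directory are assumptions for illustration only.

```bash
#!/bin/bash
# Hypothetical sketch only; the real logic is in tests/tests_pytorch/run_standalone_tests.sh.
# The "standalone" marker, batch size, and test directory are assumptions.
set -e

batch_size=6

# Collect the ids of the selected tests, one per line, keeping only node ids
# (lines containing "::") and dropping pytest's summary output.
test_ids=$(python -m pytest tests_pytorch -q --collect-only -m standalone | grep "::")

batch=()
for test_id in $test_ids; do
  batch+=("$test_id")
  if [ "${#batch[@]}" -eq "$batch_size" ]; then
    # Run the whole batch in a fresh interpreter so that state cannot leak
    # between batches, then start collecting the next batch.
    python -m pytest "${batch[@]}" -v
    batch=()
  fi
done

# Run whatever is left over if the last batch is not full.
if [ "${#batch[@]}" -gt 0 ]; then
  python -m pytest "${batch[@]}" -v
fi
```

Keeping each batch in its own interpreter preserves the isolation that standalone tests rely on while amortizing the interpreter start-up and collection cost across several tests.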

batch size    Standalone runtime
master (1)    22m 30s
4             8m 15s
6             7m 2s
8             failed allclose on a test

We might be able to push the batch size higher, but there seem to be interactions between the DeepSpeed tests. As a drawback, this opens up a potential source of flakiness depending on how the tests get batched.

Discussed with @tchaton

There's a failing test in standalone tests:

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

But it's happening in master too.

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • [n/a] Did you write any new necessary tests? (not for typos and docs)
  • [n/a] Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

cc @carmocca @akihironitta @Borda

@carmocca carmocca added the ci Continuous Integration label Jul 15, 2022
@carmocca carmocca added this to the pl:1.7 milestone Jul 15, 2022
@carmocca carmocca self-assigned this Jul 15, 2022
@Borda
Member

Borda commented Jul 15, 2022

Just curious: if we run them all in one batch, what would be the difference from running the standard tests?

@carmocca
Member Author

See the issue description

Contributor

@tchaton tchaton left a comment

Love it!

Review comments on: .azure/gpu-tests.yml (outdated), tests/tests_pytorch/run_standalone_tests.sh
@Borda
Member

Borda commented Jul 15, 2022

We might be able to push the batch size higher. But there seem to be interactions between tests. As a drawback, this opens up a potential source of flakiness depending on how the tests get batched.

Yes, I see. So shall we run a few more experiments to find the right batch size for minimal interactions?

@carmocca
Member Author

carmocca commented Jul 15, 2022

shall we run a few more experiments to find the right batch size for minimal interactions?

I've done that already; see the table at the top. "6" is the highest we can run at the moment without interactions. I would like to experiment further in a follow-up PR.

We can always tune it higher or lower in the future. And setting it to "1" should be equivalent to reverting this PR.

@mergify mergify bot added the ready PRs ready to be merged label Jul 16, 2022
@awaelchli
Member

If each test is still standalone, launches its own processes, and lets them terminate, then there should be absolutely no interactions between tests (assuming they only write to tmpdir). Is this the case? This property has to remain, otherwise we will have problems.

@carmocca carmocca enabled auto-merge (squash) July 18, 2022 11:52
@carmocca carmocca merged commit d058190 into master Jul 18, 2022
@carmocca carmocca deleted the ci/batched-standalone-tests branch July 18, 2022 12:10