Skip to content

Conversation

@Nic-Ma
Copy link
Contributor

@Nic-Ma Nic-Ma commented Jul 5, 2021

Fixes #2507 .

Description

This PR added the Horovod unit tests for distributed data parallel.

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Nic Ma <nma@nvidia.com>
@Nic-Ma Nic-Ma force-pushed the 2507-add-horovod-tests branch from 80ad38e to 8e07681 Compare July 5, 2021 06:11
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Jul 5, 2021

Hi @IsaacYangSLA and @wyli ,

I am trying to add distributed tests based on Horovod environment, but I got a question here:
As Horovod needs to execute its own command in shell to launch the program, how can we launch it in CI automatically?
For example: horovodrun -np 2 python test_evenly_divisible_all_gather_hvd.py
I checked our current distributed tests, seems it uses torch.multiprocessing.Process to launch in a python program instead of runningtorch.dist.launch command.

Thanks in advance.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Jul 5, 2021

/black

monai-bot and others added 4 commits July 5, 2021 06:22
Signed-off-by: monai-bot <monai.miccai2019@gmail.com>
Signed-off-by: Nic Ma <nma@nvidia.com>
Signed-off-by: Nic Ma <nma@nvidia.com>
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Jul 5, 2021

Hi @wyli @IsaacYangSLA , or maybe let's just put this Horovod test in the tests to manually run locally first, then try to integrate into CI later when we have more Horovod tests? Just like what we did for the PyTorch distributed tests before?

Thanks.

@Nic-Ma Nic-Ma marked this pull request as ready for review July 5, 2021 15:39
@Nic-Ma Nic-Ma changed the title [WIP] 2507 Add Horovod unit tests 2507 Add Horovod unit tests Jul 5, 2021
@Nic-Ma Nic-Ma requested review from IsaacYangSLA and wyli July 5, 2021 15:40
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Jul 6, 2021

To run this new Horovod test locally, just follow:

  1. Install Horovod:
    HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_OPERATIONS=NCCL \ HOROVOD_NCCL_LINK=SHARED pip install --no-cache-dir horovod
  2. Execute on 2 GPUs in a single machine:
    horovodrun -np 2 python test_evenly_divisible_all_gather_hvd.py

Thanks.

@Nic-Ma Nic-Ma requested review from ericspod and rijobro July 6, 2021 06:16
@wyli
Copy link
Contributor

wyli commented Jul 6, 2021

@IsaacYangSLA could you please help review?

@IsaacYangSLA IsaacYangSLA merged commit 6b88a50 into dev Jul 6, 2021
@IsaacYangSLA IsaacYangSLA deleted the 2507-add-horovod-tests branch July 6, 2021 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Horovod unit tests

5 participants