
Torch Multi-GPU support #480

Merged: 1 commit into NVIDIA:master on Jan 5, 2016

Conversation

@gheinrich (Contributor)

Close #138

This commit enables data parallelism, i.e. train
batches are evenly spread across selected GPUs.

Strong scaling is applied by default, i.e. batch
size is unchanged and each GPU receives its share
during training.

For better performance, installation of the NCCL
module is recommended.

On a 2xTitanX machine (with NCCL):

- GoogleNet: 51m4s (1 GPU) -> 34m3s (2 GPUs)
- Alexnet: 12m31s (1 GPU) -> 10m38s (2 GPUs)
- LeNet*: 53s (1 GPU) -> 1m30s (2 GPUs)

* the overhead of inter-GPU communication seems to
  outweigh the compute gain on LeNet
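
For illustration, here is a minimal Lua sketch of this kind of data-parallel setup in Torch using `nn.DataParallelTable` from cunn. The GPU list, the stand-in two-layer network, and the `nccl` module check are assumptions for the sketch, not the exact code in this PR:

```lua
require 'cunn'

-- Sketch only: split each training batch along dimension 1 (the batch
-- dimension) across the selected GPUs, so the overall batch size stays
-- unchanged and each GPU processes its share (strong scaling).
local gpus = {1, 2}                      -- selected GPU IDs (assumed: a 2-GPU machine)
local useNCCL = pcall(require, 'nccl')   -- use the NCCL bindings if installed (assumed module name)

cutorch.setDevice(gpus[1])

-- Stand-in network; DIGITS would build the user-selected model here.
local net = nn.Sequential()
   :add(nn.Linear(784, 10))
   :add(nn.LogSoftMax())
   :cuda()

-- DataParallelTable(dim, flattenParams, useNCCL)
local model = nn.DataParallelTable(1, true, useNCCL)
model:add(net, gpus)                     -- replicate the network on each listed GPU

-- 'model' is then trained like a single-GPU module: forward/backward scatter
-- the batch across the replicas and reduce gradients (via NCCL when enabled).
local input = torch.CudaTensor(128, 784):normal()
local output = model:forward(input)      -- 64 samples go to each GPU
```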

@lukeyeager (Member)

Nice work! Everything seems to be working for me. I'm trying out the NCCL integration now...

One thing I've noticed is that we should update the Torch warning box for creating new classification/generic models.

On the documentation diff:

```
% make CUDA_HOME=/usr/local/cuda test
```

> NOTE: if the above command fails due to missing libraries you may explicitely point the makefile to the location of your NVidia driver. For example:

@lukeyeager (Member)

Typo - should be "explicitly"

@gheinrich (Contributor, Author)

Thanks, I always make this spelling mistake!

@lukeyeager (Member)

This is working for me both with and without NCCL. Looks good except for the little doc nitpicks above.

@lukeyeager added a commit that referenced this pull request on Jan 5, 2016
@lukeyeager merged commit 8d4ee4b into NVIDIA:master on Jan 5, 2016
@gheinrich deleted the dev/torch-multi-gpu branch on April 14, 2016 at 13:25