Torch performance issues #339

Closed

lukeyeager opened this issue Oct 2, 2015 · 8 comments

@lukeyeager
Member

Breaking discussion from #324 (comment) out into a separate issue.

@lukeyeager - I just wanted to make a note that a training job which takes 6sec with Caffe takes 38sec with Torch.

@gheinrich - The difference between Torch and Caffe should be less dramatic on "bigger" models.

You're right. Here are the results of a [very much non-rigorous] test I tried: one epoch on a dataset of ~40k images.

|                                          | AlexNet (sec) | GoogLeNet (sec) |
|------------------------------------------|---------------|-----------------|
| Caffe (v0.13.2, with cuDNN v3 and CNMeM) | 126           | 722             |
| Torch (cuDNN v3)                         | 252           | 969             |
| Slowdown                                 | 2x            | 1.3x            |
@gheinrich
Contributor

cc @soumith as he expressed interest in this.

I have pushed a couple of changes to https://github.com/gheinrich/DIGITS/commits/dev/torch-speed which speed up Torch training (the figures below are for 20 epochs of LeNet training on MNIST, with 45k training samples, 15k validation samples, and one validation pass per epoch):

- fa00a94 removes some unnecessary computations of training-set accuracy on the CPU side: 187s -> 142s
- 7635864 implements a multi-threaded data loader for LMDB: 142s -> 83s (a sketch of the idea follows below)
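
For illustration, here is a minimal sketch of what a threaded loader of this kind can look like with the torch `threads` package. This is not the DIGITS implementation: the batch contents are random stand-ins for decoded LMDB samples, and the sizes are arbitrary.

```lua
-- Minimal threaded-loader sketch (not the DIGITS code): worker threads
-- build CPU batches, the main thread copies them to the GPU and trains.
require 'cutorch'
local threads = require 'threads'

local nThreads, batchSize = 4, 128

local pool = threads.Threads(
    nThreads,
    function()
        -- per-worker init; a real loader would open its own LMDB cursor here
        require 'torch'
    end
)

for i = 1, 10 do
    pool:addjob(
        -- runs in a worker thread: stands in for reading/decoding an LMDB batch
        function()
            local inputs = torch.FloatTensor(batchSize, 1, 28, 28):uniform()
            local labels = torch.LongTensor(batchSize):random(1, 10)
            return inputs, labels
        end,
        -- runs back in the main thread: copy to the GPU here (the variant
        -- that turned out faster than copying inside the workers)
        function(inputs, labels)
            local gpuInputs = inputs:cuda()
            local gpuLabels = labels:cuda()
            -- forward/backward passes would consume gpuInputs/gpuLabels here
        end
    )
end
pool:synchronize()
```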
The same benchmark takes 73s with Caffe, so there is still some room for improvement.
For example, I don't understand why training is slower when I copy tensors to the GPU from the data loader threads (as opposed to doing it from the main thread).
Also, the current version greedily allocates new tensors for each training example; re-using a small number of tensors across batches might speed things up (see the sketch below).
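
On that last point, here is a minimal sketch of the reuse idea (assumed sizes, with random data standing in for real samples):

```lua
-- Tensor-reuse sketch: allocate one CPU staging buffer and one GPU batch
-- buffer up front, then copy into them every iteration instead of
-- allocating fresh tensors per example or per batch.
require 'cutorch'

local batchSize, channels, height, width = 128, 3, 256, 256

local cpuBatch = torch.FloatTensor(batchSize, channels, height, width) -- allocated once
local gpuBatch = torch.CudaTensor(batchSize, channels, height, width)  -- allocated once

for iter = 1, 100 do
    cpuBatch:uniform()        -- stand-in for filling the buffer with decoded samples
    gpuBatch:copy(cpuBatch)   -- host-to-device copy into the reused GPU buffer
    -- the forward/backward passes would read from gpuBatch here
end
```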

@soumith

soumith commented Oct 9, 2015

nice.

Could you point me to instructions on how to create the imagenet lmdb? I'm trying to get things up and running, but that's kind of a blocker for me at the moment...

Should I use caffe's imagenet instructions? https://github.com/BVLC/caffe/tree/master/examples/imagenet

@borisfom

borisfom commented Oct 9, 2015

Thanks Soumith!

I have submitted PR #64 with decently complete cuDNN v4 bindings - please take a look:
soumith/cudnn.torch#63

I will merge it later today if there are no objections.

Best,
-Boris.

@borisfom

borisfom commented Oct 9, 2015

Correction - had to resubmit PR against the right branch (R4), here:
soumith/cudnn.torch#64

@gheinrich
Contributor

@soumith

Could you point me to instructions on how to create the imagenet lmdb?

You need to create the LMDB using DIGITS if you want to use DIGITS to train a model. This page explains how to structure your image folders in a way that DIGITS can understand. There is a section on that page that explains how to create a subset of ImageNet (creating the full ImageNet LMDB takes a while).

This page then explains how to create the LMDB using DIGITS. Let us know if you need any more information. Thanks!

@borisfom

This is now merged into master cudnn.torch/R4, thanks Soumith!

Best,
-Boris.

gheinrich added a commit to gheinrich/DIGITS that referenced this issue Oct 15, 2015
Original (20 epochs LeNet on MNIST): 187s
Now: 142s

Helps with bug NVIDIA#339
@gheinrich
Contributor

I don't think there is a general performance issue with the integration of Torch into DIGITS anymore. Some models train faster with Torch, others train faster with Caffe. Some numbers are below:

  • LeNet on MNIST (30 epochs):

| Number of GPUs | Caffe | Torch |
|----------------|-------|-------|
| 1              | 55s   | 56s   |
| 2              | 1m6s  | 1m27s |

(training is slower with multiple GPUs, presumably due to communication overhead)

  • AlexNet on upscaled 256x256 CIFAR-10 (5 epochs):

| Number of GPUs | Caffe | Torch |
|----------------|-------|-------|
| 1              | 9m31s | 6m38s |
| 2              | 7m46s | 5m33s |

  • GoogLeNet on upscaled 256x256 CIFAR-10 (1 epoch):

| Number of GPUs | Caffe | Torch  |
|----------------|-------|--------|
| 1              | 4m3s  | 11m13s |
| 2              | 3m12s | 7m32s  |

(the Torch slowness is mostly due to the extra Batch Normalization layers)

@soumith

soumith commented Jan 14, 2016

Interesting data points, thanks for sharing. Just a note on BatchNorm: the latest nn/cunn have a super-optimized BatchNorm (faster than cuDNN R4).
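
For anyone who wants to try this, a hedged sketch (not the DIGITS model definition; the layer sizes are arbitrary) of the same conv/BN/ReLU block built once with the cuDNN batch-norm binding and once with the nn/cunn implementation:

```lua
-- Compare the two batch-norm implementations on an identical block.
require 'cunn'
require 'cudnn'

local function block(BatchNorm)
    local m = nn.Sequential()
    m:add(cudnn.SpatialConvolution(64, 64, 3, 3, 1, 1, 1, 1))
    m:add(BatchNorm(64))      -- batch normalization over 64 feature maps
    m:add(cudnn.ReLU(true))
    return m:cuda()
end

local nets = {
    cudnn_bn = block(cudnn.SpatialBatchNormalization), -- cuDNN binding
    nn_bn    = block(nn.SpatialBatchNormalization),    -- nn/cunn implementation
}

local input = torch.CudaTensor(32, 64, 56, 56):uniform()
for name, net in pairs(nets) do
    local output = net:forward(input)
    net:backward(input, output)  -- reuse the output as a dummy gradient
    cutorch.synchronize()
    print(name .. ' ok')
end
```

Timing these two variants with the actual GoogLeNet layer sizes would show whether the nn/cunn BatchNorm closes the gap reported above.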
