Torch performance issues #339

Closed

lukeyeager opened this issue Oct 2, 2015 · 8 comments

@lukeyeager
Member

Breaking discussion from #324 (comment) out into a separate issue.

@lukeyeager - I just wanted to make a note that a training job which takes 6sec with Caffe takes 38sec with Torch.

@gheinrich - The difference between Torch and Caffe should be less dramatic on "bigger" models.

You're right. Here are the results of a [very much non-rigorous] test I tried: one epoch on a dataset of ~40k images.

|                                          | AlexNet (sec) | GoogLeNet (sec) |
|------------------------------------------|---------------|-----------------|
| Caffe (v0.13.2, with cuDNN v3 and CNMeM) | 126           | 722             |
| Torch (cuDNN v3)                         | 252           | 969             |
| Slowdown                                 | 2x            | 1.3x            |
@gheinrich
Contributor

cc @soumith as he expressed interest in this.

I have pushed a couple of changes to https://github.com/gheinrich/DIGITS/commits/dev/torch-speed which speed up Torch training (the figures below are for 20 epochs of LeNet training on MNIST, with 45k training samples, 15k validation samples, and one validation pass per epoch):

- fa00a94 removes some unnecessary computations of training-set accuracy on the CPU side: 187s -> 142s
- 7635864 implements a multi-threaded data loader for LMDB: 142s -> 83s (a sketch of the idea follows below)
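
For illustration, here is a minimal sketch of what a threaded loader of this kind can look like with the torch `threads` package. This is not the DIGITS implementation: the batch contents are random stand-ins for decoded LMDB samples, and the sizes are arbitrary.

```lua
-- Minimal threaded-loader sketch (not the DIGITS code): worker threads
-- build CPU batches, the main thread copies them to the GPU and trains.
require 'cutorch'
local threads = require 'threads'

local nThreads, batchSize = 4, 128

local pool = threads.Threads(
    nThreads,
    function()
        -- per-worker init; a real loader would open its own LMDB cursor here
        require 'torch'
    end
)

for i = 1, 10 do
    pool:addjob(
        -- runs in a worker thread: stands in for reading/decoding an LMDB batch
        function()
            local inputs = torch.FloatTensor(batchSize, 1, 28, 28):uniform()
            local labels = torch.LongTensor(batchSize):random(1, 10)
            return inputs, labels
        end,
        -- runs back in the main thread: copy to the GPU here (the variant
        -- that turned out faster than copying inside the workers)
        function(inputs, labels)
            local gpuInputs = inputs:cuda()
            local gpuLabels = labels:cuda()
            -- forward/backward passes would consume gpuInputs/gpuLabels here
        end
    )
end
pool:synchronize()
```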
The same benchmark takes 73s with Caffe, so there is still some room for improvement.
For example, I don't understand why training is slower when I copy tensors to the GPU from the data loader threads (as opposed to doing it from the main thread).
Also, the current version greedily allocates new tensors for each training example; re-using a small number of tensors across batches might speed things up (see the sketch below).
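
On that last point, here is a minimal sketch of the reuse idea (assumed sizes, with random data standing in for real samples):

```lua
-- Tensor-reuse sketch: allocate one CPU staging buffer and one GPU batch
-- buffer up front, then copy into them every iteration instead of
-- allocating fresh tensors per example or per batch.
require 'cutorch'

local batchSize, channels, height, width = 128, 3, 256, 256

local cpuBatch = torch.FloatTensor(batchSize, channels, height, width) -- allocated once
local gpuBatch = torch.CudaTensor(batchSize, channels, height, width)  -- allocated once

for iter = 1, 100 do
    cpuBatch:uniform()        -- stand-in for filling the buffer with decoded samples
    gpuBatch:copy(cpuBatch)   -- host-to-device copy into the reused GPU buffer
    -- the forward/backward passes would read from gpuBatch here
end
```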

@soumith

soumith commented Oct 9, 2015

nice.

Could you point me to instructions on how to create the imagenet lmdb? I'm trying to get things up and running, but that's kind of a blocker for me at the moment...

Should I use caffe's imagenet instructions? https://github.com/BVLC/caffe/tree/master/examples/imagenet

@borisfom

borisfom commented Oct 9, 2015

Thanks Soumith!

I have submitted PR #64 with decently complete cuDNN v4 bindings - please take a look:
soumith/cudnn.torch#63

I will merge it later today if there are no objections.

Best,
-Boris.

@borisfom

borisfom commented Oct 9, 2015

Correction - had to resubmit PR against the right branch (R4), here:
soumith/cudnn.torch#64

@gheinrich
Contributor

@soumith

Could you point me to instructions on how to create the imagenet lmdb?

You need to create the LMDB using DIGITS if you want to use DIGITS to train a model. This page explains how to structure your image folders in a way that DIGITS can understand. There is a section on that page that explains how to create a subset of ImageNet (creating the full ImageNet LMDB takes a while).

This page then explains how to create the LMDB using DIGITS. Let us know if you need any more information. Thanks!

@borisfom

This is now merged into master cudnn.torch/R4, thanks Soumith!

Best,
-Boris.

gheinrich added a commit to gheinrich/DIGITS that referenced this issue Oct 15, 2015
Original (20 epochs LeNet on MNIST): 187s
Now: 142s

Helps with bug NVIDIA#339
@gheinrich
Contributor

I don't think there is a general performance issue with the integration of Torch into DIGITS anymore. Some models train faster with Torch, others train faster with Caffe. Some numbers are below:

  • LeNet on MNIST (30 epochs):

| Number of GPUs | Caffe | Torch |
|----------------|-------|-------|
| 1              | 55s   | 56s   |
| 2              | 1m6s  | 1m27s |

(training is slower with multiple GPUs, presumably due to communication overhead)

  • AlexNet on upscaled 256x256 CIFAR-10 (5 epochs):

| Number of GPUs | Caffe | Torch |
|----------------|-------|-------|
| 1              | 9m31s | 6m38s |
| 2              | 7m46s | 5m33s |

  • GoogLeNet on upscaled 256x256 CIFAR-10 (1 epoch):

| Number of GPUs | Caffe | Torch  |
|----------------|-------|--------|
| 1              | 4m3s  | 11m13s |
| 2              | 3m12s | 7m32s  |

(the Torch slowness is mostly due to the extra Batch Normalization layers)

@soumith

soumith commented Jan 14, 2016

Interesting data points, thanks for sharing. Just a note on BatchNorm: the latest nn/cunn have a super-optimized BatchNorm (faster than cuDNN R4).
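
For anyone who wants to try this, a hedged sketch (not the DIGITS model definition; the layer sizes are arbitrary) of the same conv/BN/ReLU block built once with the cuDNN batch-norm binding and once with the nn/cunn implementation:

```lua
-- Compare the two batch-norm implementations on an identical block.
require 'cunn'
require 'cudnn'

local function block(BatchNorm)
    local m = nn.Sequential()
    m:add(cudnn.SpatialConvolution(64, 64, 3, 3, 1, 1, 1, 1))
    m:add(BatchNorm(64))      -- batch normalization over 64 feature maps
    m:add(cudnn.ReLU(true))
    return m:cuda()
end

local nets = {
    cudnn_bn = block(cudnn.SpatialBatchNormalization), -- cuDNN binding
    nn_bn    = block(nn.SpatialBatchNormalization),    -- nn/cunn implementation
}

local input = torch.CudaTensor(32, 64, 56, 56):uniform()
for name, net in pairs(nets) do
    local output = net:forward(input)
    net:backward(input, output)  -- reuse the output as a dummy gradient
    cutorch.synchronize()
    print(name .. ' ok')
end
```

Timing these two variants with the actual GoogLeNet layer sizes would show whether the nn/cunn BatchNorm closes the gap reported above.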
