
CNTK ~2 times slower than TF #560

Closed

snurkabill opened this issue Jun 8, 2016 · 3 comments

Comments

@snurkabill

Hi!

I would like to switch from TF to CNTK, so I ran a small benchmark to compare the two frameworks. I reused CNTK's example code with small alterations, and I was surprised by the performance difference between TF and CNTK:

4 hidden sigmoid layers with 2K neurons each, a softmax output layer, batch size 32 (and 600), float32 arithmetic.
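For concreteness, a minimal sketch of that architecture in TF 1.x-style Python (not my actual script; the initialization and the 0.1 learning rate are placeholders):

```python
import tensorflow as tf  # TF 1.x-style API assumed

x = tf.placeholder(tf.float32, [None, 784])   # MNIST inputs
y = tf.placeholder(tf.float32, [None, 10])    # one-hot labels

h, in_dim = x, 784
for _ in range(4):                            # 4 hidden sigmoid layers, 2000 units each
    w = tf.Variable(tf.random_normal([in_dim, 2000], stddev=0.05))
    b = tf.Variable(tf.zeros([2000]))
    h = tf.sigmoid(tf.matmul(h, w) + b)
    in_dim = 2000

w_out = tf.Variable(tf.random_normal([2000, 10], stddev=0.05))
b_out = tf.Variable(tf.zeros([10]))
logits = tf.matmul(h, w_out) + b_out          # softmax is applied inside the loss

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)  # placeholder LR
```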

Results:

CNTK gave me better accuracy on MNIST than TF, but was more than 2 times slower, using SGD with the same learning rate.

I would like to ask whether that is expected, or whether CNTK is doing something that I do not see (some dynamic calculations, learning-rate adjustments, etc.).

@frankseide
Contributor

You mean slower in terms of samples per second?

Would you mind sharing your log output, specifically the Validation section, and a few lines of the epoch log which shows the #samples per second?

Is this with CPU or GPU?

BTW batch size 32 is small. In many tasks, you should be able to use 128 or more. GPUs are less efficient for small minibatch sizes.
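To see the effect, one could time a single hidden-layer-sized GEMM at different batch sizes, roughly like the sketch below (assuming the TF 1.x-style API; absolute numbers will depend on the GPU):

```python
import time
import tensorflow as tf  # TF 1.x-style API assumed

def time_gemm(batch, dim=2000, reps=100):
    """Rough samples/sec through one [batch x dim] * [dim x dim] GEMM."""
    with tf.Graph().as_default(), tf.Session() as sess:
        x = tf.random_normal([batch, dim])
        w = tf.random_normal([dim, dim])
        y = tf.matmul(x, w)
        sess.run(y)                       # warm-up run
        t0 = time.time()
        for _ in range(reps):
            sess.run(y)
        return batch * reps / (time.time() - t0)

for b in (32, 128, 600):
    print("batch %4d: %.0f samples/s through one GEMM" % (b, time_gemm(b)))
```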

@snurkabill
Author

snurkabill commented Jun 9, 2016

Yeah, I mean slower in terms of samples per second. Please note that I did not measure samples per second for either process; I measured the total wall-clock time (training + testing).
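When I rerun it, I will time the two phases separately, roughly like this (a sketch only; train_one_minibatch and evaluate are hypothetical stand-ins for the corresponding parts of my scripts):

```python
import time

def benchmark(train_one_minibatch, evaluate, num_minibatches, minibatch_size):
    # Time the training loop on its own.
    t0 = time.time()
    for _ in range(num_minibatches):
        train_one_minibatch()
    train_seconds = time.time() - t0

    # Time evaluation separately, so it cannot distort the throughput number.
    t1 = time.time()
    evaluate()
    eval_seconds = time.time() - t1

    samples_per_sec = num_minibatches * minibatch_size / train_seconds
    print("training: %.1f s (%.0f samples/s), evaluation: %.1f s"
          % (train_seconds, samples_per_sec, eval_seconds))
```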

Both scripts were executed on the same machine, with the same GPU.

I know that batch size 32 is not ideal, but with four wide hidden layers it kept the GPU at about 94% load (and the results are similar for batch size 600).

I will be able to provide the full log within a day; stay tuned.

I wrote the TF script myself, so I know exactly what is happening and how data is pulled from the database into the training loop; that's why I'm asking whether CNTK might be doing something additional. Also, CNTK's results really are better (when TF is given more epochs or an adjusted learning rate, it can also reach that accuracy), but I am not sure whether CNTK performs the same number of weight updates as TF does.

@frankseide
Contributor

You say that for MB size 600 the speed is similar? Both should predominantly spend their GPU time in cublasSGemm(), and should thus run at the same speed. May I ask how many minibatches you ran? (The question is really aimed at the possibility of some overhead being included in the CNTK measurement.)
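As a back-of-envelope check of where the time should go, here is the approximate GEMM cost per minibatch for the network you describe (plain Python; treating the backward pass as roughly twice the forward cost):

```python
# Layers of the described network: 784 -> 2000 -> 2000 -> 2000 -> 2000 -> 10
layers = [(784, 2000), (2000, 2000), (2000, 2000), (2000, 2000), (2000, 10)]

def gemm_flops_per_minibatch(batch):
    # A forward [batch x n_in] * [n_in x n_out] GEMM costs ~2*batch*n_in*n_out FLOPs;
    # the backward pass costs roughly twice the forward pass.
    fwd = sum(2 * batch * n_in * n_out for n_in, n_out in layers)
    return 3 * fwd

for b in (32, 600):
    print("batch %3d: %.1f GFLOP per minibatch" % (b, gemm_flops_per_minibatch(b) / 1e9))
```

Since both toolkits dispatch these GEMMs to the same cuBLAS kernels, the bulk of that work should take essentially the same time in either framework.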

BTW I am not sure whether 94% GPU load says anything about actual multiprocessor utilization, or simply reflects the amount of time the GPU is used vs. unused.

The different convergence rate may have to do with a different interpretation of the learning-rate parameter. Please see https://github.com/Microsoft/CNTK/wiki/Tutorial2#sgd-parameters, "Converting Learning-rate and Momentum Parameters From Other Toolkits".
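For illustration, here is my reading of that conversion in plain Python (a sketch only; please verify against the wiki section, and note the example values are placeholders, not taken from your setup):

```python
def cntk_learning_rate_per_sample(lr_per_minibatch, minibatch_size):
    # A toolkit that applies the rate to the minibatch-mean gradient corresponds,
    # to my understanding, to dividing by the minibatch size in CNTK's
    # per-sample convention.
    return lr_per_minibatch / minibatch_size

def cntk_momentum_per_sample(momentum_per_minibatch, minibatch_size):
    # Per-sample momentum is the per-minibatch momentum raised to 1/minibatch_size.
    return momentum_per_minibatch ** (1.0 / minibatch_size)

# Example values only:
print(cntk_learning_rate_per_sample(0.1, 32))   # 0.003125
print(cntk_momentum_per_sample(0.9, 32))        # ~0.99671
```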
