
CNTK ~2 times slower than TF #560

Closed

snurkabill opened this issue Jun 8, 2016 · 3 comments

Comments

@snurkabill

Hi!

I would like to switch from TF to CNTK, so I ran a small benchmark to compare the two frameworks. I reused CNTK's example code with small alterations, and I was surprised by the performance difference between TF and CNTK:

4 hidden sigmoid layers with 2K neurons each, a softmax output layer, batch size 32 (and 600), float32 arithmetic.
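For concreteness, a minimal sketch of that architecture in TF 1.x-style Python (not my actual script; the initialization and the 0.1 learning rate are placeholders):

```python
import tensorflow as tf  # TF 1.x-style API assumed

x = tf.placeholder(tf.float32, [None, 784])   # MNIST inputs
y = tf.placeholder(tf.float32, [None, 10])    # one-hot labels

h, in_dim = x, 784
for _ in range(4):                            # 4 hidden sigmoid layers, 2000 units each
    w = tf.Variable(tf.random_normal([in_dim, 2000], stddev=0.05))
    b = tf.Variable(tf.zeros([2000]))
    h = tf.sigmoid(tf.matmul(h, w) + b)
    in_dim = 2000

w_out = tf.Variable(tf.random_normal([2000, 10], stddev=0.05))
b_out = tf.Variable(tf.zeros([10]))
logits = tf.matmul(h, w_out) + b_out          # softmax is applied inside the loss

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)  # placeholder LR
```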

Results:

CNTK gave me better accuracy on MNIST than TF, but was more than 2 times slower, using SGD with the same learning rate.

I would like to ask whether that is expected, or whether CNTK is doing something that I do not see (some dynamic calculations, learning-rate adjustments, etc.).

@frankseide
Contributor

You mean slower in terms of samples per second?

Would you mind sharing your log output, specifically the Validation section, and a few lines of the epoch log which shows the #samples per second?

Is this with CPU or GPU?

BTW batch size 32 is small. In many tasks, you should be able to use 128 or more. GPUs are less efficient for small minibatch sizes.
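To see the effect, one could time a single hidden-layer-sized GEMM at different batch sizes, roughly like the sketch below (assuming the TF 1.x-style API; absolute numbers will depend on the GPU):

```python
import time
import tensorflow as tf  # TF 1.x-style API assumed

def time_gemm(batch, dim=2000, reps=100):
    """Rough samples/sec through one [batch x dim] * [dim x dim] GEMM."""
    with tf.Graph().as_default(), tf.Session() as sess:
        x = tf.random_normal([batch, dim])
        w = tf.random_normal([dim, dim])
        y = tf.matmul(x, w)
        sess.run(y)                       # warm-up run
        t0 = time.time()
        for _ in range(reps):
            sess.run(y)
        return batch * reps / (time.time() - t0)

for b in (32, 128, 600):
    print("batch %4d: %.0f samples/s through one GEMM" % (b, time_gemm(b)))
```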

@snurkabill
Author

snurkabill commented Jun 9, 2016

Yeah, I mean slower in terms of samples per second. Please note that I did not measure samples per second for either process; I measured the total wall-clock time (training + testing).
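When I rerun it, I will time the two phases separately, roughly like this (a sketch only; train_one_minibatch and evaluate are hypothetical stand-ins for the corresponding parts of my scripts):

```python
import time

def benchmark(train_one_minibatch, evaluate, num_minibatches, minibatch_size):
    # Time the training loop on its own.
    t0 = time.time()
    for _ in range(num_minibatches):
        train_one_minibatch()
    train_seconds = time.time() - t0

    # Time evaluation separately, so it cannot distort the throughput number.
    t1 = time.time()
    evaluate()
    eval_seconds = time.time() - t1

    samples_per_sec = num_minibatches * minibatch_size / train_seconds
    print("training: %.1f s (%.0f samples/s), evaluation: %.1f s"
          % (train_seconds, samples_per_sec, eval_seconds))
```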

Both scripts were executed on the same machine, with the same GPU.

I know that batch size 32 is not ideal, but with four wide hidden layers it kept the GPU at about 94% load (and the results are similar for batch size 600).

I will be able to provide the full log within a day; stay tuned.

I wrote the TF script myself, so I know exactly what is happening and how data is pulled from the database into the training loop; that's why I'm asking whether CNTK might be doing something additional. Also, CNTK's results really are better (when TF is given more epochs or an adjusted learning rate, it can also reach that accuracy), but I am not sure whether CNTK performs the same number of weight updates as TF does.

@frankseide
Contributor

You say that for MB size 600 the speed is similar? Both should predominantly spend their GPU time in cublasSGemm(), and should thus run at the same speed. May I ask how many minibatches you ran? (The question is really aimed at the possibility of some overhead being included in the CNTK measurement.)
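As a back-of-envelope check of where the time should go, here is the approximate GEMM cost per minibatch for the network you describe (plain Python; treating the backward pass as roughly twice the forward cost):

```python
# Layers of the described network: 784 -> 2000 -> 2000 -> 2000 -> 2000 -> 10
layers = [(784, 2000), (2000, 2000), (2000, 2000), (2000, 2000), (2000, 10)]

def gemm_flops_per_minibatch(batch):
    # A forward [batch x n_in] * [n_in x n_out] GEMM costs ~2*batch*n_in*n_out FLOPs;
    # the backward pass costs roughly twice the forward pass.
    fwd = sum(2 * batch * n_in * n_out for n_in, n_out in layers)
    return 3 * fwd

for b in (32, 600):
    print("batch %3d: %.1f GFLOP per minibatch" % (b, gemm_flops_per_minibatch(b) / 1e9))
```

Since both toolkits dispatch these GEMMs to the same cuBLAS kernels, the bulk of that work should take essentially the same time in either framework.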

BTW I am not sure whether 94% GPU load says anything about actual multiprocessor utilization, or simply reflects the amount of time the GPU is used vs. unused.

The different convergence rate may have to do with a different interpretation of the learning-rate parameter. Please see https://github.com/Microsoft/CNTK/wiki/Tutorial2#sgd-parameters, "Converting Learning-rate and Momentum Parameters From Other Toolkits".
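For illustration, here is my reading of that conversion in plain Python (a sketch only; please verify against the wiki section, and note the example values are placeholders, not taken from your setup):

```python
def cntk_learning_rate_per_sample(lr_per_minibatch, minibatch_size):
    # A toolkit that applies the rate to the minibatch-mean gradient corresponds,
    # to my understanding, to dividing by the minibatch size in CNTK's
    # per-sample convention.
    return lr_per_minibatch / minibatch_size

def cntk_momentum_per_sample(momentum_per_minibatch, minibatch_size):
    # Per-sample momentum is the per-minibatch momentum raised to 1/minibatch_size.
    return momentum_per_minibatch ** (1.0 / minibatch_size)

# Example values only:
print(cntk_learning_rate_per_sample(0.1, 32))   # 0.003125
print(cntk_momentum_per_sample(0.9, 32))        # ~0.99671
```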
