CNTK ~2 times slower than TF #560
Comments
You mean slower in terms of samples per second? Would you mind sharing your log output, specifically the Validation section, and a few lines of the epoch log which show the #samples per second? Is this with CPU or GPU? BTW, batch size 32 is small. In many tasks you should be able to use 128 or more; GPUs are less efficient for small minibatch sizes.
Yeah, I mean slower in terms of samples per second. Please note that I did not measure samples per second for both processes; I measured the whole time (learning + testing). Both scripts were executed on the same machine, same GPU. I know that batch 32 is not an ideal size, but with four wide hidden layers it kept the GPU at 94% load (and for batch size 600 the results are similar). I will be able to provide you the full log in one day, stay tuned. I've written the TF script myself, so I actually know what is happening and how data are pulled from the database into the learning mechanism. That's why I ask; maybe CNTK is doing something additional. Also, the results are really better (when TF is given more epochs or an adjusted learning rate, it can also reach such accuracy), but I am not sure if CNTK does the same amount of weight updates as TF does.
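For comparisons like this, measuring raw training throughput separately from data loading and evaluation avoids attributing pipeline overhead to the framework itself. A minimal sketch (the `train_step` callable and its arguments are hypothetical, standing in for one minibatch update in either toolkit):

```python
import time

def samples_per_second(train_step, num_minibatches, minibatch_size):
    # Time only the training loop itself, excluding data loading and
    # validation, so the measurement isolates per-update compute speed.
    start = time.perf_counter()
    for _ in range(num_minibatches):
        train_step()
    elapsed = time.perf_counter() - start
    return num_minibatches * minibatch_size / elapsed
```

Running this separately for each framework on the same machine would show whether the 2x gap is in the updates themselves or elsewhere (I/O, evaluation, startup).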
You say for MB size 600 the speed is similar? Both should predominantly spend their GPU time in cublasSGemm(), and thus have the same speed. May I ask how many minibatches you ran (the question really aims at the possibility of having some overhead included in the CNTK measurement)? BTW, I am not sure if 94% GPU load says something about actual multi-proc utilization, or simply the amount of time the GPU is used vs. unused. The different convergence rate may have to do with a different interpretation of the learning-rate parameter. Please see https://github.com/Microsoft/CNTK/wiki/Tutorial2#sgd-parameters, "Converting Learning-rate and Momentum Parameters From Other Toolkits"
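The interpretation difference mentioned above is that CNTK specifies learning rates per sample, while toolkits like TF apply the rate per minibatch. A sketch of the conversion (function name is mine; see the linked wiki page for the authoritative description):

```python
def lr_per_sample(lr_per_minibatch, minibatch_size):
    # CNTK scales gradients per sample; a per-minibatch rate from another
    # toolkit must be divided by the minibatch size to be equivalent.
    return lr_per_minibatch / minibatch_size

print(lr_per_sample(0.1, 32))  # 0.003125
```

With different minibatch sizes (32 vs. 600), the same nominal rate therefore means very different effective step sizes, which could explain the convergence difference.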
Hi!
I would like to switch from TF to CNTK. I have performed a small benchmark to compare both frameworks. I've reused CNTK's example code with small alterations, and I was surprised by the differences between TF and CNTK performance:
4 hidden sigmoidal layers, each with 2K neurons; batch size 32 (and 600); softmax at the output layer; float32 arithmetic.
Results:
CNTK got me better accuracy on MNIST than TF, but was more than 2 times slower, using SGD with the same learning rate.
I would like to ask if that is expected, or if CNTK is doing something that I do not see (some dynamic calculations, learning-rate adjustment, etc.)
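The topology described above can be sketched in plain NumPy to make the compute cost concrete (an illustrative reconstruction, not the actual benchmark script; MNIST input size 784 and the 10-way output are assumed):

```python
import numpy as np

# Sketch of the benchmarked topology:
# 784 -> 4 sigmoid layers of 2000 units -> 10-way softmax, float32.
layer_sizes = [784, 2000, 2000, 2000, 2000, 10]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)).astype(np.float32) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n, dtype=np.float32) for n in layer_sizes[1:]]

def forward(x):
    # Sigmoid hidden layers.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    # Numerically stable softmax output.
    logits = x @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print(n_params)  # 13,596,010 parameters
```

At ~13.6M parameters, each update is dominated by the four 2000x2000 matrix multiplies, which is why both frameworks should spend essentially all GPU time in the same GEMM kernels.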