Improve performance on Apple silicon for larger images #110
Conversation
macOS 11.0 (Big Sur) adds support for various neural network layers via improvements to Apple's BNNS APIs (https://developer.apple.com/documentation/accelerate/bnns?language=objc). Specifically, 2D convolution+ReLU and max-pooling layers are now available in BNNS, so the OIDN U-Net can be implemented on top of the BNNS APIs. The attached patch:

1) Allows OIDN to use the BNNS APIs instead of MKL-DNN:
   - Convolution2D and max-pool layers call the BNNS services directly
   - As BNNS uses the NCHW data layout, new input, output, and upscale services were added to support NCHW
   - MKL-DNN is still used to manage memory
   - The CMake scripts were changed to support the new BNNS mode, with all required version checks

2) Allows compilation for Apple Silicon based Macs and iPhones:
   - All MKL-DNN files and dependencies that are specific to x86_64 platforms are removed from the build
   - ISPC is targeted to generate ios-arm64 object files, which are compatible with Apple Silicon Macs and iOS

By default, Intel-based Mac systems continue to use MKL-DNN. When targeting Apple Silicon Macs (CMAKE_OSX_ARCHITECTURES=arm64), CMake enables the BNNS build and APPLE_SILICON. x86_64 Mac targets can opt into BNNS by setting the OIDN_BNNS CMake boolean, as sketched below.
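For reference, a configure step along these lines would exercise the two paths described above. The `CMAKE_OSX_ARCHITECTURES` and `OIDN_BNNS` options are named in the description; the build directory names and the Ninja generator are only illustrative assumptions.

```sh
# Hypothetical configure commands; option names are taken from the PR description,
# while paths and the generator are placeholders.

# Target an Apple Silicon Mac: the BNNS path is enabled automatically.
cmake -S . -B build-arm64 -G Ninja -DCMAKE_OSX_ARCHITECTURES=arm64
cmake --build build-arm64

# Opt into BNNS on an x86_64 Mac via the OIDN_BNNS boolean.
cmake -S . -B build-x64-bnns -G Ninja -DOIDN_BNNS=ON
cmake --build build-x64-bnns
```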
When using large images (4K+), performance is increased by ~50% by reusing memory on M1.
For what image sizes did you measure a ~50% speedup? For 5-10K images I measured speedups up to ~35% on an M1, which is also great but lower than expected. Thanks!
The built-in
For which test/resolution did you get these numbers (e.g.
Hi Attila, we've been running more tests. There is a bit of variability, and we see a larger difference on higher- vs. lower-memory configs (8 GB vs. 16 GB). The latest run brings us closer to what you see (25-35%):

`./oidnBenchmark --maxmem 6000 -n 3` vs

Would you prefer a change anchored to the available memory size?
Thanks for the details! These numbers are much closer to my measurements, but it seems that the variation between machines is quite high:

`./oidnBenchmark --maxmem 6000 -n 3 -r RT.ldr_alb_nrm.3840x2160`

`./oidnBenchmark --maxmem 2000 -n 3 -r RT.ldr_alb_nrm.3840x2160`

So on my machine, limiting the memory usage to 2000 MB is actually a bit worse. I recently discovered some inefficiency in the HDR code path which introduces some bias into these performance results, so I recommend testing in LDR mode for now. This means that the actual performance gap is even smaller.

I think a global memory limit for M1 is sufficient, but 2000 MB seems to hurt performance a bit in some cases. ~4000 MB may be a safer choice and is reasonable on 8 GB systems as well. I will run some more benchmarks for various resolutions with the performance fix to make sure that the new memory limit works well. In any case, this change will be included in the upcoming release.
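A quick local check of the suggested limit could look like the following; the ~4000 MB value and the flags come from the discussion above, and the test name simply reuses the one quoted in the commands earlier in this thread.

```sh
# Hypothetical run with the ~4000 MB limit suggested above,
# reusing the LDR test and flags already shown in this thread.
./oidnBenchmark --maxmem 4000 -n 3 -r RT.ldr_alb_nrm.3840x2160
```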
I'm closing this because since