
Improve performance on Apple silicon for larger images #110

Conversation

@Developer-Ecosystem-Engineering (Author):

When using large images (4K+), performance improves by ~50% when reusing memory on M1.

macOS 11.0 (Big Sur) has updated support for various neural network layers via improvements to Apple's BNNS APIs (https://developer.apple.com/documentation/accelerate/bnns?language=objc). Specifically, 2D convolution+ReLU and max-pooling layers are now available from BNNS, so the OIDN U-Net can be implemented directly on top of the BNNS APIs.
The attached patch:
1) Allows OIDN to use the BNNS APIs instead of MKL-DNN:
   - The Convolution2D and max-pool layers call BNNS services directly
   - As BNNS uses the NCHW data arrangement, new input, output, and upscale services were added to support the NCHW format
   - MKL-DNN is still used to manage memory
   - The CMake scripts were changed to support the new BNNS mode, with all required versioning checks
2) Allows compilation for Apple silicon based Macs and iPhones:
   - Removes all MKL-DNN files and dependencies specific to x86_64 platforms from the compilation process
   - Targets ISPC to generate ios-arm64 object files, which are compatible with both Apple silicon Macs and iOS

By default, Intel-based Mac systems will continue to use MKL-DNN. When targeting Apple silicon Macs (CMAKE_OSX_ARCHITECTURES=arm64), CMake enables the BNNS build and APPLE_SILICON. x86_64 Mac targets can also build the BNNS path by setting the OIDN_BNNS CMake boolean.
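The build configuration described above can be sketched as follows. The build directory names and the `-S`/`-B` invocation style are assumptions for illustration; only CMAKE_OSX_ARCHITECTURES and OIDN_BNNS come from the patch:

```shell
# Apple silicon target: CMake enables the BNNS build and APPLE_SILICON automatically.
cmake -S . -B build-arm64 -DCMAKE_OSX_ARCHITECTURES=arm64
cmake --build build-arm64

# x86_64 Mac target: opt into the BNNS path explicitly via the OIDN_BNNS boolean.
cmake -S . -B build-x64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DOIDN_BNNS=ON
cmake --build build-x64
```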
@atafra (Collaborator) commented Apr 21, 2021:

For what image sizes did you measure a ~50% speedup? For 5-10K images I measured speedups up to ~35% on an M1, which is also great but lower than expected. Thanks!

@Developer-Ecosystem-Engineering (Author):

The built-in oidnBenchmark reports a change from 7500 ms to 5100 ms (a ~47% improvement).

@atafra (Collaborator) commented Apr 22, 2021:

For which test/resolution did you get these numbers (e.g. RTLightmap.hdr.4096x4096)?

@Developer-Ecosystem-Engineering (Author):

Hi Attila,

We've been running more tests. There is some variability, and we see a larger difference between higher- and lower-memory configurations (8 GB vs. 16 GB). Our latest results bring us closer to what you see (25%-35%):

./oidnBenchmark --maxmem 6000 -n 3
RT.hdr_alb_nrm.3840x2160 ... 6723.03 msec/image
RT.ldr_alb_nrm.3840x2160 ... 5509.69 msec/image

vs
./oidnBenchmark --maxmem 2000 -n 3
RT.hdr_alb_nrm.3840x2160 ... 5354.92 msec/image
RT.ldr_alb_nrm.3840x2160 ... 5417.03 msec/image

Would you prefer a change anchored to available memory size?

@atafra (Collaborator) commented Apr 23, 2021:

Thanks for the details! These numbers are much closer to my measurements but it seems that the variation between machines is quite high:

./oidnBenchmark --maxmem 6000 -n 3 -r RT.ldr_alb_nrm.3840x2160
RT.ldr_alb_nrm.3840x2160 ... 5048.53 msec/image

./oidnBenchmark --maxmem 2000 -n 3 -r RT.ldr_alb_nrm.3840x2160
RT.ldr_alb_nrm.3840x2160 ... 5169.98 msec/image

So on my machine limiting the memory usage to 2000 MB is actually a bit worse.

I recently discovered some inefficiency in the HDR code path which introduces some bias into these performance results, so I recommend testing in LDR mode for now. This means that the actual performance gap is even smaller.

I think a global memory limit for M1 is sufficient but 2000 MB seems to hurt performance a bit in some cases. ~4000 MB may be a safer choice and is reasonable on 8 GB systems as well. I will run some more benchmarks for various resolutions with the performance fix to make sure that the new memory limit works well. In any case, this change will be included in the upcoming release.

@atafra (Collaborator) commented Oct 13, 2021:

I'm closing this because since v1.4.0 the default memory usage is only 3 GB regardless of the architecture (due to some optimizations), and thus performance has increased on M1.

@atafra closed this on Oct 13, 2021.