Description
Given the successful speedups using CUDA for parts of the Sync3N algorithm, we should implement a similar GPU implementation for building the CL matrix.
For unit-test-sized problems our current implementation is tolerable, but for larger experiments (say 3000 images) it can take 5-6 hours with the current Python implementation. The legacy MATLAB code provided both a CPU and a GPU implementation, though I am not sure how relevant either is to the implementation that exists in Python today (tbd).
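For discussion, here is a minimal sketch of what a GPU path might look like, assuming CuPy and a `pf` array holding each image's polar Fourier transform with shape `(n_img, n_theta, n_rad)`. This is not the existing ASPIRE code path; names, shapes, and the simplified correlation criterion (no shift search, no filtering, full ray range) are illustrative only.

```python
# Hypothetical sketch: batched pairwise ray correlations on the GPU.
import numpy as np
import cupy as cp


def build_clmatrix_gpu(pf):
    n_img, n_theta, n_rad = pf.shape
    pf_gpu = cp.asarray(pf)

    # Normalize each ray so the dot products below act as correlations.
    norms = cp.linalg.norm(pf_gpu, axis=2, keepdims=True)
    pf_gpu = pf_gpu / cp.maximum(norms, 1e-12)

    clmatrix = np.zeros((n_img, n_img), dtype=int)
    for i in range(n_img - 1):
        # Correlate every ray of image i with every ray of images i+1..n-1
        # in one batched contraction on the GPU.
        corr = cp.abs(
            cp.einsum("tr,jur->jtu", cp.conj(pf_gpu[i]), pf_gpu[i + 1:])
        )  # shape (n_img - i - 1, n_theta, n_theta)

        # Best-matching ray pair for each (i, j) image pair.
        flat = corr.reshape(corr.shape[0], -1)
        best = cp.asnumpy(cp.argmax(flat, axis=1))
        cl_i, cl_j = np.unravel_index(best, (n_theta, n_theta))
        clmatrix[i, i + 1:] = cl_i
        clmatrix[i + 1:, i] = cl_j
    return clmatrix
```

The real implementation would need the same detection criterion we use on the CPU; the point of the sketch is just that the inner pairwise correlation reduces to batched matrix products, which is the same structure that gave us the CUDA speedups in Sync3N.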
Another nice feature of the MATLAB code was that it provided a way to store and recall the CL matrix via the workspace. We could consider optionally writing the matrix to disk and providing a method to load it back. I expect that might speed up some development tasks in the future.
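One possible shape for that optional cache is sketched below; the file format (`.npy`) and function names are hypothetical, not an existing API.

```python
# Hypothetical disk cache for the CL matrix.
import os
import numpy as np


def save_clmatrix(clmatrix, path):
    """Write the common-lines matrix to disk (path should end in .npy)."""
    np.save(path, clmatrix)


def load_or_build_clmatrix(path, build_fn):
    """Load a cached CL matrix if present, otherwise build and cache it."""
    if os.path.exists(path):
        return np.load(path)
    clmatrix = build_fn()
    np.save(path, clmatrix)
    return clmatrix
```

Usage would look something like `clmatrix = load_or_build_clmatrix("clmatrix_3000.npy", lambda: build_clmatrix_gpu(pf))`, so repeated runs during development skip the expensive build step.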