Multithreading gradient descent, required FFT to be of preset sizes, clean up timing, fix Windows bug #38
… timings into a macro that is enabled by compiling with -DTIME_CODE
Regarding this:
I've been running it now on my small personal laptop (so can't really do any proper benchmarks), and noticed this:
Not sure if it's related or not...
@dkobak Yes, that's the kind of behavior I'm referring to... Are those numbers with the most recent commit?
Yes.
What is the size of the data you are embedding with those numbers?
It's the MNIST example from my Python notebook in
Hm, so I just ran the MNIST example on my personal laptop (MBP 2017, 2.9 GHz Intel Core i7, 4 cores, 2 threads each). The code I'm using is from your most recent PR (which I will merge shortly):
So I'm not seeing the same kind of instability in the times. I wonder if it is a hardware difference? To investigate it further, you can pass -DTIME_CODE when you compile, and it will print out the times for each step of the interpolation scheme for each iteration. It's a lot of output, but it can be helpful in these situations.
Wow, nice performance! Just ran it on my lab computer (Intel Xeon CPU E3-1230 v5 @ 3.40GHz, 4 cores, 2 threads each), which I'd expect to be faster than the MBP 2017, but it's substantially slower:
I'll double-check what happens on my laptop and try -DTIME_CODE.
-FFTW works best on arrays of certain sizes. To quote here:
So, we fixed a set of "allowed sizes" for the FFT. For a given number of boxes per dimension (which increases as the points spread out over the iterations), we round up to the nearest "allowed size" (a sketch of this round-up is given after this list). This gives a ~25% improvement in speed for small N, and it resolves the erratic timings that people would sometimes see (e.g. iterations 850-900 might be much slower than iterations 900-950 because the FFT sizes happened to be suboptimal in the former but optimal in the latter).
-Refactored most of the multithreading into a PARALLEL_FOR macro so that either OpenMP or C++11 threads can be used and compared (see the sketch after this list). There does not seem to be a practical difference between the two on most machines, however.
-Removed the timing libraries that were causing errors on Windows (#36). Also implemented @dkobak's fix for the other problem here. It now builds and runs on Windows, Linux, and Mac OS X.
-Refactored the detailed profiling timings into a macro that can be enabled with -DTIME_CODE at compile time (see the sketch after this list).
-Multithreaded the computation of the cost, which is evaluated every 50 iterations, partially addressing #37.
-Changed the progress bar width to 60 as suggested by @dkobak here.
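For concreteness, here is a minimal sketch of the round-up idea from the first item above. The table of allowed sizes and the function name are illustrative only (the actual values and names in the code may differ); the point is simply that each requested transform length is bumped up to the nearest size that FFTW handles efficiently.

```cpp
#include <algorithm>
#include <vector>

// Illustrative table of "allowed" FFT sizes (products of small primes).
// The actual table used in the code may differ.
static const std::vector<int> kAllowedSizes = {
    50, 54, 60, 64, 72, 80, 90, 96, 100, 108, 120, 128,
    135, 144, 150, 160, 180, 192, 200, 216, 240, 250, 256};

// Round the requested number of interpolation points per dimension up to the
// nearest allowed size, so FFTW always sees an efficient transform length.
int round_up_to_allowed_size(int n) {
    auto it = std::lower_bound(kAllowedSizes.begin(), kAllowedSizes.end(), n);
    if (it != kAllowedSizes.end()) return *it;
    return n;  // larger than anything tabulated: just use n as-is
}
```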
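The PARALLEL_FOR refactor can be pictured roughly as follows. This is only a sketch with assumed names and a simplified signature, not the macro actually used in the PR: the same loop body is dispatched either to an OpenMP parallel for or to a small set of C++11 threads, so the two back ends can be swapped and compared.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch only: run `body` once per index loop_i in [0, n), using OpenMP when
// available and plain C++11 threads otherwise.
#ifdef _OPENMP
#define PARALLEL_FOR(nthreads, n, body)                                \
  {                                                                    \
    _Pragma("omp parallel for")                                        \
    for (long loop_i = 0; loop_i < (long)(n); loop_i++) { body; }      \
  }
#else
#define PARALLEL_FOR(nthreads, n, body)                                \
  {                                                                    \
    std::vector<std::thread> pf_threads;                               \
    long pf_chunk = ((long)(n) + (nthreads) - 1) / (nthreads);         \
    for (int pf_t = 0; pf_t < (nthreads); pf_t++) {                    \
      pf_threads.emplace_back([&, pf_t]() {                            \
        long pf_begin = pf_t * pf_chunk;                               \
        long pf_end = std::min((long)(n), pf_begin + pf_chunk);        \
        for (long loop_i = pf_begin; loop_i < pf_end; loop_i++) {      \
          body;                                                        \
        }                                                              \
      });                                                              \
    }                                                                  \
    for (auto& pf_th : pf_threads) pf_th.join();                       \
  }
#endif

// Example: fill a per-point array in parallel (compute_force is hypothetical).
// PARALLEL_FOR(nthreads, N, { forces[loop_i] = compute_force(loop_i); });
```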
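Similarly, here is a hedged sketch of what the -DTIME_CODE profiling macro can look like (the actual macro names in the code may differ): with the flag defined, a block of work is timed and the elapsed time printed; without it, the macros expand to nothing, so normal builds pay no overhead.

```cpp
#include <chrono>
#include <cstdio>

// Sketch only; macro names are illustrative.
#ifdef TIME_CODE
#define START_TIME() \
  auto time_code_start = std::chrono::steady_clock::now()
#define END_TIME(label)                                                     \
  do {                                                                      \
    auto time_code_end = std::chrono::steady_clock::now();                  \
    double time_code_ms = std::chrono::duration<double, std::milli>(        \
                              time_code_end - time_code_start).count();     \
    std::printf("%s: %.3f ms\n", (label), time_code_ms);                    \
  } while (0)
#else
#define START_TIME()
#define END_TIME(label)
#endif

// Example use around one step of the interpolation scheme:
// START_TIME();
// step_one();   // hypothetical name for one step of the interpolation scheme
// END_TIME("interpolation step 1");
```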
It is also worth mentioning a "negative result" from these optimization efforts. Our original interpolation code, e.g. here, sorted the points into boxes before interpolation, then unsorted them afterwards. The reason for this was to allow for better parallelization, i.e. so that all the points in a box were adjacent in memory. Indeed, we were able to achieve substantial improvements (e.g. 4 times faster for a million points) for the multithreaded vs. single-threaded version (implementation here). However, these accelerations come at the cost of sorting and unsorting at every iteration. It turns out that even though the implementation without sorting does not multithread optimally, the fact that it does not have to sort/unsort makes it faster for the N most commonly encountered with t-SNE on most machines. Therefore, we kept @pavlin-policar's modified implementation of the interpolation, which does not sort/unsort.
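For reference, the abandoned sort/unsort strategy looked roughly like the following sketch (all names here are hypothetical): points are reordered by the box they fall into so that each thread touches a contiguous range of memory, and a permutation is recorded so the original order can be restored afterwards. It is that extra reorder/restore work at every iteration that outweighed the better thread scaling.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sketch of the abandoned strategy; all names are hypothetical.
// Reorder the points so that all points in the same box are contiguous in
// memory, and record where each sorted point came from so the ordering can
// be undone after the interpolation step.
void sort_points_by_box(const std::vector<int>& box_of_point,
                        std::vector<double>& xs, std::vector<double>& ys,
                        std::vector<size_t>& original_index) {
  const size_t N = box_of_point.size();
  std::vector<size_t> order(N);
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return box_of_point[a] < box_of_point[b];
  });

  std::vector<double> xs_sorted(N), ys_sorted(N);
  original_index.resize(N);
  for (size_t k = 0; k < N; k++) {
    xs_sorted[k] = xs[order[k]];
    ys_sorted[k] = ys[order[k]];
    original_index[k] = order[k];  // needed later to "unsort" the results
  }
  xs.swap(xs_sorted);
  ys.swap(ys_sorted);
}
```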
Finally, it is important to note that even though the attractive forces are multithreaded as of #32, their computation still dominates the runtime on most machines. So, efforts to multithread the interpolation scheme (which handles the repulsive forces) do not necessarily translate into overall speed-ups.