
Multithreading gradient descent, required FFT to be of preset sizes, clean up timing, fix Windows bug #38

Merged
merged 5 commits into KlugerLab:master on Sep 22, 2018

Conversation

@linqiaozhi (Member) commented Sep 22, 2018

-FFTW works best on arrays of certain sizes. To quote the FFTW documentation:

FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other exponents are arbitrary. Other sizes are computed by means of a slow, general-purpose algorithm (which nevertheless retains O(n log n) performance even for prime sizes)

So we fixed a set of "allowed sizes" for the FFT. For a given number of boxes per dimension (which increases as the points spread out over the iterations), we round up to the nearest allowed size. This gives roughly a 25% speed improvement for small N, and it also explains the strange timings people would sometimes see (e.g. iterations 850-900 could be much slower than iterations 900-950 because the FFT sizes happened to be suboptimal in the former but optimal in the latter). A sketch of the rounding is given below.
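As a rough illustration of the idea (not the exact code merged in this PR; the function names are made up for this sketch), rounding the number of boxes up to an FFTW-friendly size can be done by checking the prime factorization directly:

```cpp
#include <initializer_list>

// Sketch only: true if n (assumed >= 1) has the form 2^a 3^b 5^c 7^d 11^e 13^f
// with e + f <= 1, which is what FFTW recommends.
static bool is_allowed_size(int n) {
    int big = 0;                       // total count of factors of 11 or 13
    for (int p : {11, 13}) {
        while (n % p == 0) { n /= p; ++big; }
    }
    if (big > 1) return false;         // e + f must be 0 or 1
    for (int p : {2, 3, 5, 7}) {
        while (n % p == 0) n /= p;
    }
    return n == 1;                     // only the allowed primes may remain
}

// Round the requested number of boxes per dimension up to the nearest allowed size.
static int round_up_to_allowed_size(int n) {
    while (!is_allowed_size(n)) ++n;
    return n;
}
```

In practice one could also precompute a small table of allowed sizes and pick the smallest entry that is at least n; the effect is the same.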

-Refactored most of the multithreading into a PARALLEL_FOR macro so that either OpenMP or C++11 threads can be used and compared (see the sketch below). There does not seem to be a practical difference between the two on most machines, however.
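As a sketch of what such a wrapper can look like (the macro in the repository may be structured differently), here is a C++11-threads version that splits the loop range into contiguous chunks; an OpenMP build could instead expand to a plain `#pragma omp parallel for` over the same body:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch of a PARALLEL_FOR-style helper: run fn(i) for i in [0, n) on nthreads
// C++11 threads, each handling one contiguous chunk of the range.
template <typename Func>
void parallel_for(int nthreads, long n, Func fn) {
    if (nthreads <= 1 || n < 2) {              // serial fallback
        for (long i = 0; i < n; ++i) fn(i);
        return;
    }
    std::vector<std::thread> workers;
    const long chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        const long begin = t * chunk;
        const long end = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([=] {
            for (long i = begin; i < end; ++i) fn(i);
        });
    }
    for (auto &w : workers) w.join();
}
```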

-Removed the timing libraries that were causing errors on Windows (#36). Also implemented @dkobak's fix for the other problem reported there. It now builds and runs on Windows, Linux, and Mac OS X.

-Refactored the detailed profiling timings into a macro that can be enabled with -DTIME_CODE at compile time (sketched below).
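For reference, a minimal sketch of how such compile-time-gated timing macros can be written (the names here are illustrative and not necessarily those used in the code):

```cpp
#include <chrono>
#include <cstdio>

// Sketch only: timing macros that compile away unless built with -DTIME_CODE.
#ifdef TIME_CODE
  #define INIT_TIMER    std::chrono::steady_clock::time_point _timer_start;
  #define START_TIMER   _timer_start = std::chrono::steady_clock::now();
  #define STOP_TIMER(label)                                                     \
      do {                                                                      \
          auto _timer_end = std::chrono::steady_clock::now();                   \
          std::printf("%s: %.3f ms\n", (label),                                 \
                      std::chrono::duration<double, std::milli>(                \
                          _timer_end - _timer_start).count());                  \
      } while (0)
#else
  #define INIT_TIMER
  #define START_TIMER
  #define STOP_TIMER(label)
#endif
```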

-Multithreaded the computation of the cost every 50 iterations, partially addressing #37; a sketch of the idea follows.
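As an illustration of the kind of change involved (a sketch, not the code merged here; the variable names are made up), the KL-divergence cost over the nonzero P_ij entries can be accumulated with an OpenMP reduction:

```cpp
#include <cmath>

// Sketch only: parallel accumulation of the KL-divergence cost.
// P, Q_unnorm, nnz, and sum_Q are illustrative names, not the repository's.
double compute_cost(const double *P, const double *Q_unnorm,
                    long nnz, double sum_Q) {
    double cost = 0.0;
    #pragma omp parallel for reduction(+ : cost)
    for (long k = 0; k < nnz; ++k) {
        const double q = Q_unnorm[k] / sum_Q;                    // normalized Q_ij
        cost += P[k] * std::log((P[k] + 1e-12) / (q + 1e-12));   // epsilon guards log(0)
    }
    return cost;
}
```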

-Changed the progress bar width to 60 as suggested by @dkobak here.

It is also worth mentioning a "negative result" from these optimization efforts. Our original interpolation code (e.g. here) sorted the points into boxes before interpolation and unsorted them afterwards. The reason for this was to allow better parallelization, i.e. so that all points in a box were adjacent in memory. Indeed, we were able to achieve substantial improvements (e.g. 4 times faster for a million points) for the multithreaded version over the single-threaded one (implementation here). However, those gains come at the cost of sorting and unsorting at every iteration. It turns out that even though the implementation without sorting does not multithread optimally, the fact that it does not have to sort/unsort makes it faster for the values of N most commonly encountered with t-SNE on most machines. We therefore kept @pavlin-policar's modified implementation of the interpolation, which does not sort/unsort.

Finally, it is important to note that even though the attractive forces have been multithreaded since #32, computing them still dominates the runtime on most machines. So efforts to multithread the interpolation scheme do not necessarily translate into overall speed-ups.

@linqiaozhi merged commit 817fed7 into KlugerLab:master on Sep 22, 2018
@dkobak (Collaborator) commented Sep 22, 2018

Regarding this:

it solves the weird times that people would sometimes get (e.g. the 850-900 iterations might be much slower than the 900-950 iterations because the FFT sizes were not optimal in the former but optimal in the latter

I've been running it now on my small personal laptop (so can't really do any proper benchmarks), and noticed this:

Iteration 500 (50 iterations in 9.64 seconds), cost 5.426600
Iteration 550 (50 iterations in 9.18 seconds), cost 4.531247
Iteration 600 (50 iterations in 15.51 seconds), cost 4.174289
Iteration 650 (50 iterations in 8.68 seconds), cost 3.956146
Iteration 700 (50 iterations in 8.72 seconds), cost 3.801069
Iteration 750 (50 iterations in 8.58 seconds), cost 3.682041
Iteration 800 (50 iterations in 14.31 seconds), cost 3.586660
Iteration 850 (50 iterations in 9.40 seconds), cost 3.507664

Not sure if it's related or not...

@linqiaozhi (Member Author) commented:

@dkobak Yes, that's the kind of behavior I'm referring to... Are those numbers with the most recent commit?

@dkobak (Collaborator) commented Sep 22, 2018

Yes.

@linqiaozhi (Member Author) commented:

What is the size of the data you are embedding with those numbers?

@dkobak (Collaborator) commented Sep 22, 2018

It's the MNIST example from my Python notebook in examples. So n=70k.

@linqiaozhi (Member Author) commented:

Hm, so I just ran the MNIST example on my personal laptop (MBP 2017, 2.9 GHz Intel Core i7, 4 cores 2 threads each). The code I'm using is from your most recent PR (which I will merge shortly):

Iteration 50 (50 iterations in 2.46 seconds), cost 7.070435
Iteration 100 (50 iterations in 2.11 seconds), cost 7.069987
Iteration 150 (50 iterations in 2.05 seconds), cost 6.513471
Iteration 200 (50 iterations in 2.06 seconds), cost 5.950450
Iteration 250 (50 iterations in 2.06 seconds), cost 5.721760
Iteration 300 (50 iterations in 2.07 seconds), cost 5.522811
Iteration 350 (50 iterations in 2.09 seconds), cost 5.459729
Iteration 400 (50 iterations in 2.08 seconds), cost 5.437526
Iteration 450 (50 iterations in 2.10 seconds), cost 5.429920
Unexaggerating Ps by 12.000000
Iteration 500 (50 iterations in 2.18 seconds), cost 5.426591
Iteration 550 (50 iterations in 2.19 seconds), cost 4.531622
Iteration 600 (50 iterations in 2.25 seconds), cost 4.174597
Iteration 650 (50 iterations in 2.32 seconds), cost 3.956292
Iteration 700 (50 iterations in 2.16 seconds), cost 3.800906
Iteration 750 (50 iterations in 2.31 seconds), cost 3.681921
Iteration 800 (50 iterations in 2.29 seconds), cost 3.586615
Iteration 850 (50 iterations in 2.14 seconds), cost 3.507568
Iteration 900 (50 iterations in 2.37 seconds), cost 3.440483
Iteration 950 (50 iterations in 2.59 seconds), cost 3.382371
Iteration 999 (50 iterations in 2.64 seconds), cost 3.332170
Wrote the 70000 x 2 data matrix successfully.
Done.

So I'm not seeing the same kind of instability in the times. I wonder if it is a hardware difference?

To investigate it further, you can pass -DTIME_CODE when you compile, and it will print out the times for each step of the interpolation scheme for each iteration. It's a lot of output, but can be helpful in these situations.

@dkobak (Collaborator) commented Sep 24, 2018

Wow, nice performance! I just ran it on my lab computer (Intel Xeon CPU E3-1230 v5 @ 3.40GHz, 4 cores with 2 threads each), which I'd expect to be faster than an MBP 2017, but it's substantially slower:

Iteration 50 (50 iterations in 3.93 seconds), cost 7.070435
Iteration 100 (50 iterations in 3.99 seconds), cost 7.069987
Iteration 150 (50 iterations in 3.90 seconds), cost 6.513471
Iteration 200 (50 iterations in 3.55 seconds), cost 5.950450
Iteration 250 (50 iterations in 3.41 seconds), cost 5.721760
Iteration 300 (50 iterations in 3.47 seconds), cost 5.522811
Iteration 350 (50 iterations in 3.38 seconds), cost 5.459729
Iteration 400 (50 iterations in 3.88 seconds), cost 5.437526
Iteration 450 (50 iterations in 3.89 seconds), cost 5.429920
Unexaggerating Ps by 12.000000
Iteration 500 (50 iterations in 3.72 seconds), cost 5.426591
Iteration 550 (50 iterations in 3.54 seconds), cost 4.531767
Iteration 600 (50 iterations in 3.58 seconds), cost 4.174701
Iteration 650 (50 iterations in 3.65 seconds), cost 3.956297
Iteration 700 (50 iterations in 3.50 seconds), cost 3.800883
Iteration 750 (50 iterations in 3.57 seconds), cost 3.681909
Iteration 800 (50 iterations in 3.48 seconds), cost 3.586550
Iteration 850 (50 iterations in 3.47 seconds), cost 3.507551
Iteration 900 (50 iterations in 3.63 seconds), cost 3.440430
Iteration 950 (50 iterations in 3.76 seconds), cost 3.382319
Iteration 999 (50 iterations in 3.83 seconds), cost 3.332099
Wrote the 70000 x 2 data matrix successfully.
Done.

I'll double-check what happens on my laptop and try -DTIME_CODE.
