
Multithreading gradient descent, required FFT to be of preset sizes, clean up timing, fix Windows bug #38

Merged
merged 5 commits into KlugerLab:master on Sep 22, 2018

Conversation

@linqiaozhi (Member) commented Sep 22, 2018

-FFTW works best on arrays of certain sizes. To quote the FFTW documentation:

FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other exponents are arbitrary. Other sizes are computed by means of a slow, general-purpose algorithm (which nevertheless retains O(n log n) performance even for prime sizes)

So we fixed a set of "allowed sizes" for the FFT. For a given number of boxes per dimension (which increases as the points spread out over the iterations), we round up to the nearest allowed size. This gives roughly a 25% speed improvement for small N, and it also explains the strange timings people would sometimes see (e.g. iterations 850-900 could be much slower than iterations 900-950 because the FFT sizes happened to be suboptimal in the former but optimal in the latter). A sketch of the rounding is given below.
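As a rough illustration of the idea (not the exact code merged in this PR; the function names are made up for this sketch), rounding the number of boxes up to an FFTW-friendly size can be done by checking the prime factorization directly:

```cpp
#include <initializer_list>

// Sketch only: true if n (assumed >= 1) has the form 2^a 3^b 5^c 7^d 11^e 13^f
// with e + f <= 1, which is what FFTW recommends.
static bool is_allowed_size(int n) {
    int big = 0;                       // total count of factors of 11 or 13
    for (int p : {11, 13}) {
        while (n % p == 0) { n /= p; ++big; }
    }
    if (big > 1) return false;         // e + f must be 0 or 1
    for (int p : {2, 3, 5, 7}) {
        while (n % p == 0) n /= p;
    }
    return n == 1;                     // only the allowed primes may remain
}

// Round the requested number of boxes per dimension up to the nearest allowed size.
static int round_up_to_allowed_size(int n) {
    while (!is_allowed_size(n)) ++n;
    return n;
}
```

In practice one could also precompute a small table of allowed sizes and pick the smallest entry that is at least n; the effect is the same.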

-Refactored most of the multithreading into a PARALLEL_FOR macro so that either OpenMP or C++11 threads can be used and compared (see the sketch below). There does not seem to be a practical difference between the two on most machines, however.
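As a sketch of what such a wrapper can look like (the macro in the repository may be structured differently), here is a C++11-threads version that splits the loop range into contiguous chunks; an OpenMP build could instead expand to a plain `#pragma omp parallel for` over the same body:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch of a PARALLEL_FOR-style helper: run fn(i) for i in [0, n) on nthreads
// C++11 threads, each handling one contiguous chunk of the range.
template <typename Func>
void parallel_for(int nthreads, long n, Func fn) {
    if (nthreads <= 1 || n < 2) {              // serial fallback
        for (long i = 0; i < n; ++i) fn(i);
        return;
    }
    std::vector<std::thread> workers;
    const long chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        const long begin = t * chunk;
        const long end = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([=] {
            for (long i = begin; i < end; ++i) fn(i);
        });
    }
    for (auto &w : workers) w.join();
}
```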

-Removed the timing libraries that were causing errors on Windows (#36). Also implemented @dkobak's fix for the other problem reported there. It now builds and runs on Windows, Linux, and Mac OS X.

-Refactored the detailed profiling timings into a macro that can be enabled with -DTIME_CODE at compile time (sketched below).
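For reference, a minimal sketch of how such compile-time-gated timing macros can be written (the names here are illustrative and not necessarily those used in the code):

```cpp
#include <chrono>
#include <cstdio>

// Sketch only: timing macros that compile away unless built with -DTIME_CODE.
#ifdef TIME_CODE
  #define INIT_TIMER    std::chrono::steady_clock::time_point _timer_start;
  #define START_TIMER   _timer_start = std::chrono::steady_clock::now();
  #define STOP_TIMER(label)                                                     \
      do {                                                                      \
          auto _timer_end = std::chrono::steady_clock::now();                   \
          std::printf("%s: %.3f ms\n", (label),                                 \
                      std::chrono::duration<double, std::milli>(                \
                          _timer_end - _timer_start).count());                  \
      } while (0)
#else
  #define INIT_TIMER
  #define START_TIMER
  #define STOP_TIMER(label)
#endif
```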

-Multithreaded the computation of the cost every 50 iterations, partially addressing #37; a sketch of the idea follows.
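As an illustration of the kind of change involved (a sketch, not the code merged here; the variable names are made up), the KL-divergence cost over the nonzero P_ij entries can be accumulated with an OpenMP reduction:

```cpp
#include <cmath>

// Sketch only: parallel accumulation of the KL-divergence cost.
// P, Q_unnorm, nnz, and sum_Q are illustrative names, not the repository's.
double compute_cost(const double *P, const double *Q_unnorm,
                    long nnz, double sum_Q) {
    double cost = 0.0;
    #pragma omp parallel for reduction(+ : cost)
    for (long k = 0; k < nnz; ++k) {
        const double q = Q_unnorm[k] / sum_Q;                    // normalized Q_ij
        cost += P[k] * std::log((P[k] + 1e-12) / (q + 1e-12));   // epsilon guards log(0)
    }
    return cost;
}
```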

-Changed the progress bar width to 60 as suggested by @dkobak here.

It is also worth mentioning a "negative result" from these optimization efforts. Our original interpolation code (e.g. here) sorted the points into boxes before interpolation and unsorted them afterwards. The reason for this was to allow better parallelization, i.e. so that all points in a box were adjacent in memory. Indeed, we were able to achieve substantial improvements (e.g. 4 times faster for a million points) for the multithreaded version over the single-threaded one (implementation here). However, those gains come at the cost of sorting and unsorting at every iteration. It turns out that even though the implementation without sorting does not multithread optimally, the fact that it does not have to sort/unsort makes it faster for the values of N most commonly encountered with t-SNE on most machines. We therefore kept @pavlin-policar's modified implementation of the interpolation, which does not sort/unsort.

Finally, it is important to note that even though the attractive forces have been multithreaded since #32, computing them still dominates the runtime on most machines. So efforts to multithread the interpolation scheme do not necessarily translate into overall speed-ups.

@linqiaozhi merged commit 817fed7 into KlugerLab:master on Sep 22, 2018
@dkobak (Collaborator) commented Sep 22, 2018

Regarding this:

it solves the weird times that people would sometimes get (e.g. the 850-900 iterations might be much slower than the 900-950 iterations because the FFT sizes were not optimal in the former but optimal in the latter

I've been running it now on my small personal laptop (so can't really do any proper benchmarks), and noticed this:

Iteration 500 (50 iterations in 9.64 seconds), cost 5.426600
Iteration 550 (50 iterations in 9.18 seconds), cost 4.531247
Iteration 600 (50 iterations in 15.51 seconds), cost 4.174289
Iteration 650 (50 iterations in 8.68 seconds), cost 3.956146
Iteration 700 (50 iterations in 8.72 seconds), cost 3.801069
Iteration 750 (50 iterations in 8.58 seconds), cost 3.682041
Iteration 800 (50 iterations in 14.31 seconds), cost 3.586660
Iteration 850 (50 iterations in 9.40 seconds), cost 3.507664

Not sure if it's related or not...

@linqiaozhi (Member Author) commented:

@dkobak Yes, that's the kind of behavior I'm referring to... Are those numbers with the most recent commit?

@dkobak (Collaborator) commented Sep 22, 2018

Yes.

@linqiaozhi (Member Author) commented:

What is the size of the data you are embedding with those numbers?

@dkobak (Collaborator) commented Sep 22, 2018

It's the MNIST example from my Python notebook in examples. So n=70k.

@linqiaozhi (Member Author) commented:

Hm, so I just ran the MNIST example on my personal laptop (MBP 2017, 2.9 GHz Intel Core i7, 4 cores 2 threads each). The code I'm using is from your most recent PR (which I will merge shortly):

Iteration 50 (50 iterations in 2.46 seconds), cost 7.070435
Iteration 100 (50 iterations in 2.11 seconds), cost 7.069987
Iteration 150 (50 iterations in 2.05 seconds), cost 6.513471
Iteration 200 (50 iterations in 2.06 seconds), cost 5.950450
Iteration 250 (50 iterations in 2.06 seconds), cost 5.721760
Iteration 300 (50 iterations in 2.07 seconds), cost 5.522811
Iteration 350 (50 iterations in 2.09 seconds), cost 5.459729
Iteration 400 (50 iterations in 2.08 seconds), cost 5.437526
Iteration 450 (50 iterations in 2.10 seconds), cost 5.429920
Unexaggerating Ps by 12.000000
Iteration 500 (50 iterations in 2.18 seconds), cost 5.426591
Iteration 550 (50 iterations in 2.19 seconds), cost 4.531622
Iteration 600 (50 iterations in 2.25 seconds), cost 4.174597
Iteration 650 (50 iterations in 2.32 seconds), cost 3.956292
Iteration 700 (50 iterations in 2.16 seconds), cost 3.800906
Iteration 750 (50 iterations in 2.31 seconds), cost 3.681921
Iteration 800 (50 iterations in 2.29 seconds), cost 3.586615
Iteration 850 (50 iterations in 2.14 seconds), cost 3.507568
Iteration 900 (50 iterations in 2.37 seconds), cost 3.440483
Iteration 950 (50 iterations in 2.59 seconds), cost 3.382371
Iteration 999 (50 iterations in 2.64 seconds), cost 3.332170
Wrote the 70000 x 2 data matrix successfully.
Done.

So I'm not seeing the same kind of instability in the times. I wonder if it is a hardware difference?

To investigate it further, you can pass -DTIME_CODE when you compile, and it will print out the times for each step of the interpolation scheme for each iteration. It's a lot of output, but can be helpful in these situations.

@dkobak (Collaborator) commented Sep 24, 2018

Wow, nice performance! I just ran it on my lab computer (Intel Xeon CPU E3-1230 v5 @ 3.40GHz, 4 cores with 2 threads each), which I'd expect to be faster than an MBP 2017, but it's substantially slower:

Iteration 50 (50 iterations in 3.93 seconds), cost 7.070435
Iteration 100 (50 iterations in 3.99 seconds), cost 7.069987
Iteration 150 (50 iterations in 3.90 seconds), cost 6.513471
Iteration 200 (50 iterations in 3.55 seconds), cost 5.950450
Iteration 250 (50 iterations in 3.41 seconds), cost 5.721760
Iteration 300 (50 iterations in 3.47 seconds), cost 5.522811
Iteration 350 (50 iterations in 3.38 seconds), cost 5.459729
Iteration 400 (50 iterations in 3.88 seconds), cost 5.437526
Iteration 450 (50 iterations in 3.89 seconds), cost 5.429920
Unexaggerating Ps by 12.000000
Iteration 500 (50 iterations in 3.72 seconds), cost 5.426591
Iteration 550 (50 iterations in 3.54 seconds), cost 4.531767
Iteration 600 (50 iterations in 3.58 seconds), cost 4.174701
Iteration 650 (50 iterations in 3.65 seconds), cost 3.956297
Iteration 700 (50 iterations in 3.50 seconds), cost 3.800883
Iteration 750 (50 iterations in 3.57 seconds), cost 3.681909
Iteration 800 (50 iterations in 3.48 seconds), cost 3.586550
Iteration 850 (50 iterations in 3.47 seconds), cost 3.507551
Iteration 900 (50 iterations in 3.63 seconds), cost 3.440430
Iteration 950 (50 iterations in 3.76 seconds), cost 3.382319
Iteration 999 (50 iterations in 3.83 seconds), cost 3.332099
Wrote the 70000 x 2 data matrix successfully.
Done.

I'll double-check what happens on my laptop and try -DTIME_CODE.
