Runtime of the cost computation #37

Closed · dkobak opened this issue Sep 20, 2018 · 30 comments

@dkobak (Collaborator) commented Sep 20, 2018

While running the code on the n=1.3 mln dataset, I noticed that the CPU usage looks like this:

[screenshot of CPU usage from 2018-09-20 15:09:04]

There are exactly 50 short valleys (each taking ~1 s), alternating with short periods of 100% CPU on all cores (also ~1 s) -- I assume these correspond to the attractive force computation (parallel on all cores) and the repulsive force computation (currently not parallel). Together they take ~2 s per iteration, i.e. ~100 s per 50 iterations. But after every 50 iterations there is a long ~30 s period when something happens on a single core. WHAT IS IT?!

The only thing that is done every 50 iterations is the cost computation. Does the cost computation really take so long? Can it be accelerated? If not, and if it really takes 25% of the computation time, then I think it should be optional and switched off by default (at least for large datasets).

@dkobak (Collaborator, Author) commented Sep 20, 2018

By the way, see Sections 3.1 and 3.2 in @pavlin-policar's write-up here: https://github.com/pavlin-policar/fastTSNE/blob/master/notes/notes.pdf. If I understood 3.1 correctly, it should speed up the cost computation by a factor of two (even though I would argue that spending 12% of the runtime on the cost computation would still be quite a lot). Section 3.2 is about something else, but it would be a cool thing to implement as well.

@dkobak (Collaborator, Author) commented Sep 20, 2018

Also, the cost is computed once before the iterations start (iter==0) but never displayed! See https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L497

@linqiaozhi (Member) commented:

Wow, this is a great catch. I had forgotten to parallelize the cost computation. It is now parallelized in my fork; I will PR it hopefully tomorrow (I just want to test on Linux and fix the Windows problems from the other issue).

I agree it would be nice to accelerate it further using a faster computation as in Pavlin's write-up... but for now, this should make it negligible relative to the rest of the code. If you don't see that, then we can work on accelerating it or making it optional, as you suggested.

Thanks for catching this!

@dkobak (Collaborator, Author) commented Sep 21, 2018

Cool, looking forward to all the changes!

@pavlin-policar (Contributor) commented:

The trick in Section 3.1 is pretty useful, since it reduces the number of passes over the data from 2 to 1 and can be done entirely during the gradient computation. Evaluating the error itself also turns out to take a long time -- I guess computing the log in the KL divergence so many times can be quite slow.
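
In pseudocode, the idea is roughly the following (a simplified Python sketch, not the actual fastTSNE code; it assumes the sparse `P` stores both the (i, j) and (j, i) entries of the symmetrized affinities):

```python
# Sketch: accumulate the data-dependent part of the KL divergence while looping
# over the nonzero p_ij for the attractive forces, so no extra pass is needed.
# KL(p || q) = sum_ij p_ij log(p_ij / w_ij) + log Z,   with q_ij = w_ij / Z.
import numpy as np
from scipy.sparse import coo_matrix

def attractive_pass(P: coo_matrix, Y: np.ndarray):
    """Attractive forces plus the sum_ij p_ij log(p_ij / w_ij) term, in one pass."""
    grad_attr = np.zeros_like(Y)
    kl_partial = 0.0
    for i, j, p in zip(P.row, P.col, P.data):
        diff = Y[i] - Y[j]
        w = 1.0 / (1.0 + diff @ diff)      # unnormalized q_ij
        grad_attr[i] += p * w * diff       # attractive force on point i
        kl_partial += p * np.log(p / w)    # everything except the log Z term
    return grad_attr, kl_partial

def repulsive_pass(Y: np.ndarray):
    """Exact O(n^2) repulsion (FIt-SNE approximates this step); also returns Z."""
    grad_rep = np.zeros_like(Y)
    Z = 0.0
    for i in range(Y.shape[0]):
        diff = Y[i] - Y
        w = 1.0 / (1.0 + np.sum(diff**2, axis=1))
        w[i] = 0.0
        Z += w.sum()
        grad_rep[i] = (w**2) @ diff
    return grad_rep, Z

# Gradient: 4 * (grad_attr - grad_rep / Z);  cost: kl_partial + np.log(Z).
```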

I like the correction in Section 3.2 because with exaggeration, p_{ij}s are not actual probabilities and the KL divergence is just plain wrong. However, if we apply the correction the true KL divergence can go up during the early exaggeration phase which could be unsettling if you don't know what's going on (I imagine most people expect optimization to reduce the error, not increase it!).

@dkobak (Collaborator, Author) commented Sep 22, 2018

@pavlin-policar Let me see if I understand eq (29) in your PDF correctly: it seems it can be further simplified as

$$\frac{1}{\alpha} \text{KL}_\text{raw} - \frac{1}{\alpha}\sum_{ij}\alpha p_{ij} \log(\alpha) 
= \frac{1}{\alpha} \text{KL}_\text{raw} - \log(\alpha) \sum_{ij} p_{ij} 
= \frac{1}{\alpha} \text{KL}_\text{raw} - \log(\alpha),$$

right?

So the adjustment boils down to: take the "raw" KL divergence (computed with exaggerated p's), divide by alpha and subtract log(alpha).

@pavlin-policar (Contributor) commented:

@dkobak so I'll just convert your expression to an image so it's easier to see:

[rendered image of the expression above]

And yes, you're exactly right! Now I feel very silly for not seeing this before. This is also slightly more efficient, since we get rid of the sum (although the cost is evaluated so rarely that it won't really matter), and the expression is much nicer.

I've also tested this numerically in my code and, indeed, it is the same -- just as the maths says.
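
Something along these lines, I mean (a toy numpy sketch with random distributions, not the actual fastTSNE code):

```python
# Check: the "raw" KL computed with exaggerated p's, divided by alpha and
# minus log(alpha), equals the true KL(p || q).
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(1000); p /= p.sum()   # proper probabilities
q = rng.random(1000); q /= q.sum()
alpha = 12.0

kl_true = np.sum(p * np.log(p / q))
kl_raw = np.sum(alpha * p * np.log(alpha * p / q))  # uses exaggerated p's
kl_adjusted = kl_raw / alpha - np.log(alpha)

print(np.isclose(kl_true, kl_adjusted))  # True
```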

@dkobak (Collaborator, Author) commented Sep 22, 2018

@pavlin-policar I agree it's a nice expression! Feel free to add it to your Section 3.2.

I've just implemented it and added it to the pending #39.

> However, if we apply the correction the true KL divergence can go up during the early exaggeration phase which could be unsettling if you don't know what's going on (I imagine most people expect optimization to reduce the error, not increase it!).

Not sure what you mean here exactly: if KL_raw decreases, then KL_raw/alpha - log(alpha) will also decrease. Do you mean that the adjusted KL can jump up after the exaggeration is switched off?

dkobak mentioned this issue Sep 22, 2018

@pavlin-policar (Contributor) commented:

Yes, sorry for the confusion. This happens when we use late exaggeration and not in general. When switching from the normal regime to the late exaggeration regime, the corrected KL divergence does indeed go up.

I've copied below a run on the MNIST data set (ignore the timings and the exact output format -- I've used my implementation -- but the effect should be the same). The horizontal lines indicate the switch of regime: from early exaggeration to no exaggeration to late exaggeration. During the early exaggeration and no-exaggeration phases the KL divergence drops down to 3.3663, and it rises to 3.4769 during late exaggeration.

------------------------------------------------------------------
Iteration   50, KL divergence  7.5186, 50 iterations in 2.1725 sec
Iteration  100, KL divergence  7.4820, 50 iterations in 2.1296 sec
Iteration  150, KL divergence  6.4166, 50 iterations in 2.1392 sec
Iteration  200, KL divergence  6.0529, 50 iterations in 2.1661 sec
Iteration  250, KL divergence  5.8980, 50 iterations in 2.1653 sec
------------------------------------------------------------------
Iteration   50, KL divergence  5.5385, 50 iterations in 2.1775 sec
Iteration  100, KL divergence  5.0582, 50 iterations in 2.2110 sec
Iteration  150, KL divergence  4.6849, 50 iterations in 2.2959 sec
Iteration  200, KL divergence  4.4088, 50 iterations in 2.3607 sec
Iteration  250, KL divergence  4.2014, 50 iterations in 2.5555 sec
Iteration  300, KL divergence  4.0415, 50 iterations in 2.7112 sec
Iteration  350, KL divergence  3.9136, 50 iterations in 2.9802 sec
Iteration  400, KL divergence  3.8079, 50 iterations in 3.3894 sec
Iteration  450, KL divergence  3.7185, 50 iterations in 3.3327 sec
Iteration  500, KL divergence  3.6412, 50 iterations in 3.7742 sec
Iteration  550, KL divergence  3.5734, 50 iterations in 4.9178 sec
Iteration  600, KL divergence  3.5135, 50 iterations in 4.6910 sec
Iteration  650, KL divergence  3.4595, 50 iterations in 5.3508 sec
Iteration  700, KL divergence  3.4106, 50 iterations in 6.0794 sec
Iteration  750, KL divergence  3.3663, 50 iterations in 6.1707 sec
------------------------------------------------------------------
Iteration   50, KL divergence  3.3697, 50 iterations in 6.6408 sec
Iteration  100, KL divergence  3.3889, 50 iterations in 4.5863 sec
Iteration  150, KL divergence  3.4169, 50 iterations in 9.6097 sec
Iteration  200, KL divergence  3.4477, 50 iterations in 5.0035 sec
Iteration  250, KL divergence  3.4769, 50 iterations in 5.4145 sec

This is what I had in mind and this doesn't apply at all if we don't use late exaggeration, which I believe is the default.

@dkobak (Collaborator, Author) commented Sep 22, 2018

@pavlin-policar Hmm. Looking at your output, I don't quite understand why the KL divergence keeps growing during the last 250 iterations. The "raw" KL divergence should be decreasing because that's what gradient descent is doing, and the adjusted KL divergence is a linear function of it, as we discussed above. What am I missing?

@dkobak (Collaborator, Author) commented Sep 22, 2018

@pavlin-policar Actually by now I got myself completely confused. We seem to have concluded that the loss (KL divergence) after exaggeration is given by a linear function of the loss without exaggeration. If so, then how would any exaggeration have any effect?.. This does not make any sense; I should probably go to bed and get some sleep.

@pavlin-policar (Contributor) commented:

@dkobak You're right, this makes no sense -- I have no idea what's going on. I'll have to look into this a bit more. It seems like it could be a bug in my code, but that seems unlikely since I use the same code for all the optimization phases.

I've tried to run a simple example with this implementation via the R wrapper, but I don't think the costs are being transferred into R properly and I just get an array of zeros (or I may be using it wrong). I'll look at this again tomorrow; hopefully a bit of sleep will do us both some good.

@pavlin-policar (Contributor) commented:

Of course, I had forgotten to compile the pulled C++ code... But at least now I know I'm not completely crazy. Running the R wrapper on iris with the following commands:

source('fast_tsne.R')
tmp = as.matrix(iris[, 1:4])
fftRtsne(tmp, start_late_exag_iter=750, perplexity=12, late_exag_coeff=5, get_costs=T)

I get the following output (using the latest master)

Symmetrizing...
Randomly initializing the solution.
Exaggerating Ps by 12.000000
Input similarities computed (sparsity = 0.291244)!
Learning embedding...
Using FIt-SNE approximation.
Iteration 50 (50 iterations in 0.35 seconds), cost 54.519008
Iteration 100 (50 iterations in 0.35 seconds), cost 55.227940
Iteration 150 (50 iterations in 0.33 seconds), cost 54.600685
Iteration 200 (50 iterations in 0.33 seconds), cost 52.697806
Unexaggerating Ps by 12.000000
Iteration 250 (50 iterations in 0.33 seconds), cost 1.957103
Iteration 300 (50 iterations in 0.60 seconds), cost 1.399361
Iteration 350 (50 iterations in 0.44 seconds), cost 0.474211
Iteration 400 (50 iterations in 0.35 seconds), cost 0.303934
Iteration 450 (50 iterations in 0.33 seconds), cost 0.262422
Iteration 500 (50 iterations in 0.33 seconds), cost 0.258651
Iteration 550 (50 iterations in 0.34 seconds), cost 0.255248
Iteration 600 (50 iterations in 0.33 seconds), cost 0.257843
Iteration 650 (50 iterations in 0.33 seconds), cost 0.258827
Iteration 700 (50 iterations in 0.33 seconds), cost 0.253146
Exaggerating Ps by 5.000000
Iteration 750 (50 iterations in 0.33 seconds), cost 9.311141
Iteration 800 (50 iterations in 0.35 seconds), cost 13.777067
Iteration 850 (50 iterations in 0.33 seconds), cost 13.705468
Iteration 900 (50 iterations in 0.33 seconds), cost 14.461525
Iteration 950 (50 iterations in 0.33 seconds), cost 14.457056
Iteration 999 (50 iterations in 0.33 seconds), cost 14.456085
Wrote the 150 x 2 data matrix successfully.
Done.

So in this case, it's not going up consistently - probably because the data set is different, but it does go up. I still have no idea what's going on, but it seems to be the "correct" behaviour. Maybe @linqiaozhi can shed some light on this?

@linqiaozhi (Member) commented:

@dkobak @pavlin-policar I like the adjusted KL divergence, if only because people don't get freaked out by the fact that it changes so wildly when exaggeration turns on/off.

@pavlin-policar I think the reason that the cost is bouncing around in the last 250 iterations is that the gradient descent is overshooting. This dataset has only 150 points, so after a few iterations it has essentially converged. So you are sitting at the minimum and bouncing around with every iteration -- sometimes it goes up, sometimes it goes down.

This behavior is much more apparent with exaggeration, because exaggeration effectively increases the step size. That's not strictly true, because exaggeration not only increases the magnitude of each step but also changes its direction (whereas increasing the step size only increases the magnitude). So, when you are already at the minimum, increasing the magnitude of each step via late exaggeration means you will overshoot more easily.

Note that you see the same behavior without late exaggeration (Iteration 550 to 600 it increases), but it is more obvious in the late exaggeration phase for the above reason. That is my interpretation, at least.

@dkobak (Collaborator, Author) commented Sep 23, 2018

@linqiaozhi @pavlin-policar George, I don't think you have fully appreciated Pavlin's and my confusion! Not sure if it's because you did not notice the apparent "paradox" or because its resolution was obvious to you :)

Here is the paradox:

  1. The t-SNE loss is KL(p || q).
  2. Exaggeration means $p$ is replaced by $ap$.
  3. Meaning that after the exaggeration the loss is KL(ap || q).
  4. Elementary calculation shows that KL(ap || q) = a * KL(p || q) + a * log(a).
  5. Which is a linear function of KL(p || q), meaning that it has the same minimum.
  6. Meaning that the exaggeration should have zero effect.
  7. But we know that it has a very strong effect.
  8. PARADOX.

To be honest, yesterday I was stupefied by this. Points 1-2 are true by definition, and 4-6 are really straightforward. Where is the mistake??!!

Well, I think I got it now. The mistake is in 3. The t-SNE loss after exaggeration is not KL(ap || q). When one uses the standard loss KL(p || q) to arrive at the expression for the gradient, which is a sum of attractive and repulsive forces, along the way one uses the fact that \sum p_ij = 1; this is used for the repulsive term. When doing exaggeration, the coefficient \alpha is inserted into the attractive term only, meaning that the resulting gradient is not the gradient of KL(ap || q), because \sum (ap_ij) is not 1. Does this make sense? It's a bit cumbersome to tell without writing all the math out.
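
Sketching it anyway, with the usual notation $w_{ij} = (1 + \|y_i - y_j\|^2)^{-1}$, $Z = \sum_{k \neq l} w_{kl}$ and $q_{ij} = w_{ij}/Z$: the standard gradient is

$$\frac{\partial\, \text{KL}(p \,\|\, q)}{\partial y_i} = 4\sum_j (p_{ij} - q_{ij})\, w_{ij}\, (y_i - y_j) = 4\Big(\sum_j p_{ij} w_{ij} (y_i - y_j) - \frac{1}{Z}\sum_j w_{ij}^2 (y_i - y_j)\Big),$$

where the first term is the attractive force and the second the repulsive force. The repulsive term comes from differentiating $\log Z$, and getting it with coefficient 1 uses exactly $\sum_{ij} p_{ij} = 1$. Exaggeration multiplies only the attractive term by $\alpha$, so the result is not the gradient of KL(ap || q).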

This resolves the paradox. As far as the computation of KL(p||q) is concerned, I think Pavlin's adjustment still works fine, because the KL computation is not influenced by the gradient and remains correct. It also isn't surprising anymore that KL(p||q) can go up during late exaggeration.

What I am wondering now is: what actually is the form of the loss that is (implicitly) optimized when using exaggeration?

@linqiaozhi (Member) commented:

@dkobak Oh, I see what you are saying! Yes I totally missed the point with my first response!

Yes, you are absolutely right. I think that part of the confusion was in the fact that in the code, we implement early/late exaggeration by multiplying the P_ijs by a constant alpha to give \tilde{P_ij}. This suggests that we are minimizing the KL divergence between \tilde{P_ij} and Q_ij. However, this is not the case.

As you know, when you work out the gradient of the KL divergence between P_ij and Q_ij, you end up with two terms, corresponding to the attractive force and the repulsive force. We multiply this attractive force by alpha to make it stronger. This changes everything... the result is that we are no longer optimizing the KL divergence between P_ij and Q_ij, but something totally different. To figure out what we are optimizing (as you asked at the end of your message), one would essentially have to integrate this new "gradient" and see what it looks like. That actually might be quite interesting; not sure how hard it would be.

But anyways, it will not look like KL(\tilde{P_ij}, Q_ij). As you pointed out, if you go to Appendix 1 of the original t-SNE paper, where van der Maaten and Hinton derive the derivative of KL(P_ij, Q_ij), you'll see that they use the fact that P_ij sums to 1. If you plug in \tilde{P_ij} instead of P_ij and follow that derivation, I expect that the resulting gradient would not be alpha*(Attractive Term) + (Repulsive Term).

In other words, I do not believe the "adjusted KL divergence" that we are now reporting is actually what is being minimized...but probably something close to it.

@dkobak (Collaborator, Author) commented Sep 23, 2018

@linqiaozhi Yes.

I think this might be relevant to the stuff we've been discussing via email about exaggeration vs. fatter tails.

> To figure out what we are optimizing (as you asked at the end of your message), one would essentially have to integrate this new "gradient" and see what it looks like. That actually might be quite interesting; not sure how hard it would be.

One actually does not need to integrate. I looked at the derivation of the gradient, and the whole thing can be followed backwards very straightforwardly. The standard loss is

$$\text{KL}(p \,\|\, q) = \text{const} - \sum_{ij} p_{ij} \log q_{ij} = \text{const} - \sum_{ij} p_{ij} \log w_{ij} + \sum_{ij} p_{ij} \log Z = \text{const} - \sum_{ij} p_{ij} \log w_{ij} + \log Z,$$

where $q_{ij} = w_{ij}/Z$.

Introducing $\alpha$ in the gradient is equivalent to introducing $\alpha$ into the first term of this loss. So this gives

$$L = -\alpha \sum_{ij} p_{ij} \log w_{ij} + \log Z = -\sum_{ij} (\alpha p_{ij}) \log w_{ij} + \sum_{ij} (\alpha p_{ij}) \frac{\log Z}{\alpha} = -\sum_{ij} (\alpha p_{ij}) \log \frac{w_{ij}}{Z^{1/\alpha}},$$

meaning that the new loss is (up to an additive constant) $\text{KL}(\alpha p \,\|\, w/Z^{1/\alpha})$.

However, using $\alpha p$ here is equivalent to using $p$ (it only transforms the loss linearly), so one can equivalently say that the new loss is $\text{KL}(p \,\|\, w/Z^{1/\alpha})$.

I like it! The input probabilities stay the same, but the output probabilities (not really probabilities anymore, as they do not sum to 1) are changed by effectively decreasing Z. In the context of our email discussion, the question now becomes: is there a different output kernel $w$ (e.g. a fatter-tailed one) that could mimic this behaviour to some extent?
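
As a sanity check, this is easy to verify with finite differences on a tiny random example (a throwaway numpy sketch, nothing to do with the actual FIt-SNE code):

```python
# Check: 4 * (alpha * F_attr - F_rep) is the exact gradient of
# L(Y) = -alpha * sum_ij p_ij log w_ij + log Z,
# i.e. of KL(p || w / Z^(1/alpha)) up to an additive constant.
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 7, 12.0
P = rng.random((n, n)); P = P + P.T
np.fill_diagonal(P, 0.0); P /= P.sum()     # symmetric, sums to 1
Y = rng.normal(size=(n, 2))

def W(Y):
    D2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    w = 1.0 / (1.0 + D2)
    np.fill_diagonal(w, 0.0)
    return w

def loss(Y):
    w = W(Y)
    return -alpha * np.sum(P[P > 0] * np.log(w[P > 0])) + np.log(w.sum())

def exaggerated_gradient(Y):
    w = W(Y)
    Z = w.sum()
    diff = Y[:, None, :] - Y[None, :, :]                   # diff[i, j] = y_i - y_j
    F_attr = np.sum((P * w)[:, :, None] * diff, axis=1)    # attractive forces
    F_rep = np.sum((w**2 / Z)[:, :, None] * diff, axis=1)  # repulsive forces
    return 4 * (alpha * F_attr - F_rep)

num = np.zeros_like(Y)                                     # central finite differences
eps = 1e-6
for i in range(n):
    for d in range(2):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, d] += eps; Ym[i, d] -= eps
        num[i, d] = (loss(Yp) - loss(Ym)) / (2 * eps)

print(np.allclose(num, exaggerated_gradient(Y), atol=1e-5))  # True
```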

@linqiaozhi (Member) commented:

@dkobak Wow, I like that very much...this seems to be a really fruitful perspective on the early/late exaggeration phases.

I'll have to think more about what this means in terms of the output kernel and late exaggeration--I agree it could lead to a more principled approach. Very cool.

@pavlin-policar (Contributor) commented:

@linqiaozhi, @dkobak Awesome that you've figured it out. I too was wrong at the third point in @dkobak's list, but looking at the derivation, it makes sense. Intuitively, it seemed to me that if we want to pull the points closer together, this should change the objective, but I was really confused how the correction was linear for the exaggeration phase if the cost was changed. Anyways, really cool.

One implication of this expression, I guess, is that piling on more exaggeration won't really help much, because of the 1/a relation: switching from no exaggeration to 12 will have a much larger effect than switching from 12 to 24. Does that sound about right?

@dkobak (Collaborator, Author) commented Sep 24, 2018

@pavlin-policar Well, going from 1 to 12 is a 12x increase, so I'd guess a comparable increase would be from 12 to 144, rather than from 12 to 24. But it's hard to say; I think our understanding of exaggeration is still very poor.

In general, one should really distinguish "early" exaggeration from "late" exaggeration (by the way, I haven't seen any benefit in using a phase with exaggeration=1 between early and late; I prefer to switch from early to late directly). Early exaggeration is an optimization trick. Late exaggeration is a meaningful change of the loss function. For MNIST with n=70k, early alpha=12 and late alpha=4 work quite well, see https://github.com/KlugerLab/FIt-SNE/blob/master/examples/test.ipynb. For the 10x dataset with n=1.3 mln, Linderman & Steinerberger seem to suggest something like alpha=1.3e+6/10/200 = 650 (given learning rate h=200) for the early phase (right, @linqiaozhi?), whereas I find that alpha=8 works well for the late phase. I am doing some additional experiments with it right now...

@dkobak (Collaborator, Author) commented Sep 24, 2018

@linqiaozhi Running the current master on the same data as in the first post of this issue (n=1.3 mln), I see that the cost computation still happens on a single thread; I think this is due to #40.

More worryingly, however, I now get 160s runtime per 50 iterations, instead of 130s previously (before the big PR over the weekend) :-(

@linqiaozhi (Member) commented Sep 24, 2018

@dkobak Wow, that's not the right direction!

I compared the current master with 61e4a2b, for N=1 million, and the times are about the same on my machine. Until we address #40, this is what I would have expected, since as I explained with the "negative result" in that PR over the weekend, we didn't really get much out of multithreading the repulsive forces.

However, it should definitely not be slower! I wonder if the multithreading in "Step 3" of that function is causing overhead on your architecture. Would you mind comparing the times with and without multithreading the n_body_fft_2d() function? That is, you could add the line nthreads=1 to here and compare to times without that line?

Anyways, this is hugely helpful. I am finding that small improvements from multithreading don't generalize well, and can even backfire in very bad ways on different machines (which is hard to know unless other people compare the times on their machines). If we confirm that the problem is overhead from multithreading, then we can just remove it for that function. The real boost that you get from multithreading is for the attractive forces, anyways.

@dkobak (Collaborator, Author) commented Sep 24, 2018

I recompiled the source and restarted my Python scripts. I get 168s with the current master, 174s if n_body_fft_2d() is called with 1 instead of nthreads in https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L786, and 132s if I do git checkout 61e4a2b...

The machine is an Intel Xeon CPU E3-1230 v5 @ 3.40GHz, 4 cores with 2 threads each.

@linqiaozhi (Member) commented:

@dkobak I also tried comparing on my Linux server, and still can't replicate the difference in your timings.

In my fork, I added timings to 61e4a2b in a branch called profiled_legacy. Would you mind compiling that branch with -DTIME_CODE, and also compiling the master of that fork with -DTIME_CODE, and posting the results? You could just copy the resulting binaries into whatever test directory structure you have already been using, if that is easier. Hopefully this will tell us what is slowing things down. Sorry to bother you with this...

For example, on my server, a typical iteration has the following timings (I added in the time of the error computation too):

61e4a2b:
Step 1: 140 ms						
FFT: 9 ms                                               
Step 3: 77 ms                                           
Total Interpolation: 261 ms                             
Attractive Forces: 306 ms                               
Computing Error: 12734 ms                               

Current:
Step 1: 138 ms
FFT: 8 ms
Step 3: 20 ms
Total Interpolation: 201 ms
Attractive Forces: 332 ms
Computing Error: 14900 ms                               

Everything should be the same except for Step 3, which is a bit faster due to the parallelization.

@dkobak (Collaborator, Author) commented Sep 24, 2018

Sure. The timings jump around a bit (it would be good to see averages over some iterations...), especially Attractive Forces (which is also by far the longest here), but from quickly eyeballing the output for a couple of minutes it seems that maybe Step 1 got consistently slower? However, the difference is tiny, and I don't quite understand why the difference per 50 iterations is 160s vs 130s.

I have to leave now, let me double check it tomorrow.

profiled_legacy:

Step 1: 119 ms
FFT: 7 ms
Step 3: 77 ms
Total Interpolation: 222 ms
Attractive Forces: 2298 ms

Master:

Step 1: 136 ms
FFT: 7 ms
Step 3: 38 ms
Total Interpolation: 201 ms
Attractive Forces: 2362 ms

@linqiaozhi (Member) commented Sep 24, 2018

Okay, thanks. When you get the chance, could you also report the times for the error computation? That's essentially all that's left... In the times you just sent, the interpolation in master is faster in total (because of the speedup of Step 3).

@dkobak (Collaborator, Author) commented Sep 24, 2018

Yes. Something does not add up here. I'm pretty sure that the error took the same time (around 30s). I'll try to analyse the output for 50 consecutive iterations.

@dkobak (Collaborator, Author) commented Sep 25, 2018

Good morning! I spent a fair amount of time profiling the two versions of code and preparing a plot of the timing difference for each step over 50 iterations...

... before realizing that in profiled_legacy the error evaluation is excluded from the reported time of 50 iterations
https://github.com/linqiaozhi/FIt-SNE/blob/profiled_legacy/src/tsne.cpp#L501
whereas in the master it is included:
https://github.com/linqiaozhi/FIt-SNE/blob/master/src/tsne.cpp#L527

That's exactly the 30s difference I was observing 😱 Apart from that, the interpolation did get a little faster and the rest is unchanged.

We should both be embarrassed now! :)

@linqiaozhi (Member) commented:

@dkobak Alas...oh man, that is rough. So sorry you had to spend all that time on it.

Anyways, thanks for figuring it out. Of course, that explains it perfectly. That should have been the first thing I thought of, but I guess I did not realize that I made that change when I switched timing libraries (which was just so that it would run on Windows). Ouch, man.

@dkobak (Collaborator, Author) commented Sep 25, 2018

No problem, that's how these things are... I am relieved now.

Closing this issue.
