Changing the defaults in the Python wrapper #93

Merged
12 commits merged from dkobak:master into KlugerLab:master on Mar 30, 2020

Conversation

dkobak
Collaborator

@dkobak dkobak commented Mar 10, 2020

  1. Learning rate is set to max(200, N/early_exag_coeff) by default.
  2. Iteration number is set to 750 by default (250+500).
  3. Initialization is set to PCA (via ARPACK) by default.
  4. N-body algorithm is set to FFT for N>=8000 and to BH for N<8000 by default. UPDATE: I TOOK THIS OUT!
  5. Late exaggeration start is set to the early exagg end by default (if late exagg coeff is provided).

I updated the example notebook too.

This fixes issues #88 #89 #90.

UPDATE: also implements multithreaded Barnes-Hut!
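
For illustration, here is a rough sketch of how these 'auto' defaults resolve (this is only an illustration of the logic above, not the actual fast_tsne.py code; the parameter names are assumed to mirror the wrapper's):

import_note = None  # no imports needed, plain Python

def resolve_auto_defaults(N, early_exag_coeff=12, stop_early_exag_iter=250,
                          max_iter=750, late_exag_coeff=-1):
    # learning rate heuristic: N / early exaggeration, but never below 200
    learning_rate = max(200, N / early_exag_coeff)
    # late exaggeration (if requested) starts where early exaggeration ends
    start_late_exag_iter = stop_early_exag_iter if late_exag_coeff > 0 else -1
    return learning_rate, max_iter, start_late_exag_iter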

@linqiaozhi
Member

For the learning rate documentation:

Default 'auto'; it sets learning rate to N/12 where N is the sample size,
or to 200 if N/12 < 200.

But the code is:

    if learning_rate == 'auto':
        learning_rate = np.max((200, X.shape[0]/early_exag_coeff))

By default, early_exag_coeff is indeed 12. But if it's chosen to be a different number, do we want the automatic learning rate to change as well? I suppose if you have higher exaggeration, then you are probably making bigger steps, so it probably does actually make sense to decrease the learning rate.

Just wanted to confirm with you. If so, we should probably change the documentation to reflect this. I don't feel strongly either way.

@dkobak
Collaborator Author

dkobak commented Mar 21, 2020

By default, early_exag_coeff is indeed 12. But if it's chosen to be a different number, do we want the automatic learning rate to change as well?

Yes, I think we do. This heuristic of learning_rate = N / early_exag_coeff is from Belkina et al., and they based it on your work with Stefan, where you have a bound learning_rate * early_exag_coeff < N for the algorithm not to diverge.

Now that I spent some time experimenting with large exaggeration, I can also confirm that using large early exaggeration and not decreasing the learning rate from N/12 will easily lead to divergence.

@linqiaozhi
Member

Okay great, then we should update that in the documentation. I can make that change.

@linqiaozhi
Member

I did come across an issue with this choice of learning rate. As a test dataset, I made two well-separated 1,000-dimensional Gaussian balls with 10,000 points in each.

require(MASS);
N <- 2E4;
d <- 1000;
input_data <- rbind(mvrnorm(n = N/2, rep(0, d), diag(d)), mvrnorm(n = N/2, rep(100, d), diag(d)))

I noticed that with the new learning rate setting, the exaggeration phase was terribly slow, so I stopped the optimization at 200 iterations and looked at the embedding:
[image: embedding after stopping at 200 iterations]

The learning rate was too high, so it was overshooting like crazy and making the embedding large, which accordingly slows down the interpolation scheme.

When I set the learning rate to 200 (previous default), after the same number of iterations, it looks as expected:

[image: embedding with learning rate 200 after the same number of iterations]

When I set the exaggeration factor to be 1 (i.e. no early exaggeration), it also looks as it should.

Here's the interesting part though. When I do the same experiment in only 10 dimensions (i.e. d=10 in the snippet above), this problem also disappears. That is, the overshooting that I am observing with learning rate of N/exaggeration_factor during the exaggeration phase does not occur when the original space is only 10 dimensions.

So, it does look like we are running the risk of overshooting with this new heuristic. I don't fully understand this connection with the input dimensionality, but I wonder if it has to do with ANNOY maybe being less accurate in high dimensions. I don't typically run t-SNE in these high dimensions (as I typically do PCA first), so I don't have much experience with it.

@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

Hmm. I can confirm this observation. I don't have an explanation right now. I've never seen this before.

I also noticed that the final embeddings with d=10 and d=1000 (with learning_rate=200) look very different from each other: d=1000 turns out very well separated, but in the d=10 case the Gaussians almost touch. I don't know why this would be, because I think there should be zero between-cluster kNN connections in either case.

@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

I also noticed that the final embeddings with d=10 and d=1000 (with learning_rate=200) look very different from each other: d=1000 turns out very well separated, but in the d=10 case the Gaussians almost touch. I don't know why this would be, because I think there should be zero between-cluster kNN connections in either case.

I am really stupefied by this. I used VP trees to remove any possible influence of Annoy, used random init to remove any possible influence of PCA, used K=15 and sigma=1e+6 to remove any possible influence of perplexity calibration, and used learning_rate=200 to make sure everything converges well. Still I get this:

[screenshot: resulting embeddings]

@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

I must be missing something very obvious...

[image]

@linqiaozhi
Member

I can't put my finger on what it is exactly, but I expect it to be related to the distribution of the p_ijs. In high dimensions, all points are "almost equidistant" from each other, so the distribution of the affinities will be different in high vs. low dimensions. But I don't have an explanation yet for how that results in this effect (that looks almost like an exaggeration of the affinities).

@linqiaozhi
Member

linqiaozhi commented Mar 22, 2020

Then again, your experiment with sigma being so large means that the actual value of the non-zero p_ijs should be essentially constant. So it has to be a difference in "which neighbors" each point is getting.

@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

Then again, your experiment with sigma being so large means that the actual value of the non-zero p_ijs should be essentially constant. So it has to be a difference in "which neighbors" each point is getting.

Exactly... but HOW?

@dkobak dkobak mentioned this pull request Mar 22, 2020
@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

So it has to be a difference in "which neighbors" each point is getting.

I think something funny happens during the symmetrization step. The affinity matrix for d=2 has up to 26 non-zero elements in a row. (I use k=15, so it's more than 15 due to symmetrization). But for d=100, the max number of non-zero elements in one row is 407. I have no idea if/how this can make a big difference. The means across rows are 17 and 27 respectively, so not hugely different.
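
(For reference, one way to count these per-row non-zeros; a sketch that assumes the symmetrized affinity matrix is available as a scipy CSR matrix P, e.g. the matrix openTSNE's PerplexityBasedNN exposes as .P:)

import numpy as np

def row_nonzero_stats(P):
    # for a CSR matrix, np.diff(P.indptr) gives the number of stored entries per row
    counts = np.diff(P.indptr)
    return counts.max(), counts.mean()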

@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

I quickly checked what happens with openTSNE and I could not reproduce the divergence problem.

import numpy as np
from openTSNE import TSNE

n = 20000
d = 1000
X = np.random.randn(n, d)
X[int(n / 2):, :] += 100
Z = TSNE(verbose=1, n_jobs=-1).fit(X)

This ran just fine and produced the expected embedding.

Here is the output:

--------------------------------------------------------------------------------
TSNE(callbacks=None, callbacks_every_iters=50, early_exaggeration=12,
     early_exaggeration_iter=250, exaggeration=None, final_momentum=0.8,
     initial_momentum=0.5, initialization='pca', ints_in_interval=1,
     learning_rate='auto', max_grad_norm=None, metric='euclidean',
     metric_params=None, min_num_intervals=50, n_components=2,
     n_interpolation_points=3, n_iter=500, n_jobs=-1,
     negative_gradient_method='fft', neighbors=None, perplexity=30,
     random_state=None, theta=0.5, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using Annoy approximate search using euclidean distance...
   --> Time elapsed: 20.60 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.76 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.76 seconds
===> Running optimization with exaggeration=12, lr=1666.6666666666667 for 250 iterations...
Iteration   50, KL divergence 5.0988, 50 iterations in 1.4479 sec
Iteration  100, KL divergence 4.7109, 50 iterations in 4.3490 sec
Iteration  150, KL divergence 4.6123, 50 iterations in 3.7709 sec
Iteration  200, KL divergence 4.5207, 50 iterations in 1.6480 sec
Iteration  250, KL divergence 4.5206, 50 iterations in 1.8959 sec
   --> Time elapsed: 13.11 seconds
===> Running optimization with exaggeration=1, lr=1666.6666666666667 for 500 iterations...
Iteration   50, KL divergence 4.2664, 50 iterations in 1.9638 sec
Iteration  100, KL divergence 4.2130, 50 iterations in 1.4563 sec
Iteration  150, KL divergence 4.1957, 50 iterations in 1.4923 sec
Iteration  200, KL divergence 4.1883, 50 iterations in 1.8221 sec
Iteration  250, KL divergence 4.1840, 50 iterations in 1.8193 sec
Iteration  300, KL divergence 4.1813, 50 iterations in 1.8282 sec
Iteration  350, KL divergence 4.1795, 50 iterations in 1.8237 sec
Iteration  400, KL divergence 4.1782, 50 iterations in 1.8112 sec
Iteration  450, KL divergence 4.1772, 50 iterations in 1.8842 sec
Iteration  500, KL divergence 4.1765, 50 iterations in 1.7964 sec
   --> Time elapsed: 17.70 seconds

On exactly the same data, Z = fast_tsne(X) slows down after ~150 iterations because of the runaway points. I am using updated openTSNE and updated FIt-SNE (my local branch), so the learning rate is N/12 and all other parameters should be the same.

I have no idea why this difference would arise!

Update: I do see large gradients in openTSNE too, e.g. when looking at the embedding after 250 iterations. (With d=10 it does not happen.) It's just that it does not become quite as large as in FIt-SNE.

@dkobak
Collaborator Author

dkobak commented Mar 22, 2020

Some additional insight: the points that are shooting off are exactly the points that have a very large number of non-zero elements in the affinity matrix (i.e. feel a large number of attractive forces). Here is the result from openTSNE after only 10 iterations (d=1000) with default params. One would think that the number of attractive forces per point is around 90, but actually the max is 2700 (!!) and there are 700 points with >500 attractive forces. Here those 700 are colored in blue; all >19000 other ("normal") points are colored in red:

[image: embedding after 10 iterations; the ~700 hub points in blue]
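
(A rough way to find these "hub" points from the affinity matrix; a sketch, assuming a symmetrized sparse matrix P such as openTSNE's PerplexityBasedNN(...).P:)

import numpy as np

def find_hubs(P, threshold=500):
    # number of non-zero affinities (attractive forces) acting on each point
    n_attractive = np.asarray((P > 0).sum(axis=1)).ravel()
    return np.where(n_attractive > threshold)[0]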

@linqiaozhi
Member

the points that are shooting off are exactly the points that have a very large number of non-zero elements in the affinity matrix (i.e. feel a large number of attractive forces).

Very interesting. That is the key observation. Their attractive force is so large that exaggerating it causes them to overshoot.

Let's consider the graph before symmetrization; this is a directed graph. Each point connects to its k nearest neighbors, i.e. there are k outgoing edges from each point. In high dimensions, we are observing that the distribution of the number of incoming connections is far from uniform. That is, there are some points that have a huge number of incoming edges.

There are obvious examples of point configurations where this is possible. For example, 5 points can be constructed in a "cross" structure: if you form a k=1 nearest-neighbor graph, the point in the center will have 4 incoming edges and the others will have at most 1.

It's just not immediately obvious to me why this happens in high dimensions.
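
To make the in-degree picture concrete, here is a sketch (using sklearn, not FIt-SNE code) that counts the incoming edges per point in the directed kNN graph:

import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_in_degrees(X, k=15):
    # row i of G marks the k outgoing edges of point i
    G = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    # column sums count incoming edges; this is the distribution that gets heavy-tailed
    return np.asarray(G.sum(axis=0)).ravel()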

I'll think more about it tomorrow. I think we are running up against one of those interesting and non-intuitive facts about high dimensional space. It also might have to do with the fact that the data is Gaussian, and particularly, how we are scaling the variance as we increase the dimensions.

@dkobak
Collaborator Author

dkobak commented Mar 23, 2020

Yes. I agree that this is very interesting!

It seems that this is not so much about the N/12 heuristic though. And one does not actually need two Gaussians. Here are simply 1000 points from a 1000-dimensional standard Gaussian. In this case the learning rate is 200. Still, look what happens during the early exaggeration phase:

[animation: early exaggeration phase for 1,000 points from a 1000-dimensional Gaussian]

It never goes much out of the [-100, 100] range, so FFT works reasonably fast. Still, clearly exactly the same phenomenon is going on.

@dkobak
Collaborator Author

dkobak commented Mar 23, 2020

I will also come back to this tomorrow. Just wanted to add that we should separate two questions: (1) why does this all happen and what do we learn from it; and (2) how should it affect this PR and the next release. Whereas I am very interested to keep investigating Q1, it'd be good to have some resolution to Q2 so that the PR/release could go ahead.

Given that the same thing (perhaps less drastically) can happen for smaller sample sizes with learning rate = 200, I am wondering if there is some hack that could mitigate this problem. E.g. clip the gradient norms (openTSNE allows it via the max_grad_norm parameter, which is switched off by default). Or do something with the momentum or the delta-bar-delta gains. Ideally this should not affect "normal" use cases like MNIST or RNA-seq data after PCA, but should prevent these wild oscillations of kNN graph "hubs" in high dimensions.

Update: I tried it out quickly, and max_grad_norm=0.01 removes large oscillations in my 1000 points in 1000 dimensions example, but does not seem to affect the MNIST embedding with default settings (after PCA down to 50d, like I always do).
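
(For reference, in openTSNE this is just a constructor argument; a usage sketch on the 1000-points-in-1000-dimensions example:)

import numpy as np
from openTSNE import TSNE

X = np.random.randn(1000, 1000)   # the 1000 points in 1000 dims example
Z = TSNE(max_grad_norm=0.01, n_jobs=-1, verbose=1).fit(X)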

@dkobak
Collaborator Author

dkobak commented Mar 23, 2020

Ran some further analysis this morning. Here is the max grad norm (max over all N points) during the MNIST embedding with default params. It never grows above 0.0001 and most of the time is more like 1e-5 or 1e-6:
[image: max gradient norm over iterations for MNIST]
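
(In case it helps with reproducing this: since the gradient itself is not exposed to openTSNE callbacks, a sketch like the following tracks per-point step norms instead by comparing consecutive embeddings; it assumes openTSNE's (iteration, error, embedding) callback signature:)

import numpy as np

class StepNormTracker:
    """Record the largest per-point displacement between consecutive callback calls."""
    def __init__(self):
        self.prev = None
        self.max_steps = []

    def __call__(self, iteration, error, embedding):
        current = np.array(embedding)
        if self.prev is not None:
            self.max_steps.append(np.linalg.norm(current - self.prev, axis=1).max())
        self.prev = current

# usage sketch: TSNE(callbacks=StepNormTracker(), callbacks_every_iters=1).fit(X)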

Regarding Q1 above, I checked that the same happens with exact t-SNE (so it's not due to any approximations; the dense NxN perplexity-calibrated affinity matrix displays the same behaviour). Second, I quickly ran UMAP on the same N=20000 and d=2, d=10, d=100 toy example as above, and observed qualitatively the same differences between the final embeddings.

[image: UMAP embeddings of the toy example for d=2, d=10, d=100]

My conclusion from all this is that we discovered a very real dimensionality effect that affects all kNN embeddings. And as far as this PR is concerned, we should try to stabilize the gradient descent so that points do not overshoot like crazy.

@pavlin-policar might have some experience with max_grad_norm.

@pavlin-policar
Contributor

pavlin-policar commented Mar 23, 2020

As George pointed out, the different levels of separation are due to the high dimensionality of the data. As the dimensionality of the data increases, the distances between points sampled from some fixed distribution all become about the same. In the picture below, I plot the distribution of pairwise distances between points sampled from two Gaussians in different dimensions. This is apparent from the ever-spikier peaks in the distance histograms. As the dimensionality increases, the separation between the two peaks also increases, so the distances are also higher. This is an inherent problem of using Euclidean metrics, which is why I generally go for something like cosine distance instead (but I generally do PCA first, which preserves Euclidean distances, but not cosine distances, and I don't know the effects of that).

Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

fig, ax = plt.subplots(ncols=4, figsize=(16, 3))
for idx, dims in enumerate([2, 10, 100, 1000]):
    x1 = np.random.normal(+2, 1, size=(1000, dims))
    x2 = np.random.normal(-2, 1, size=(1000, dims))
    xs = np.vstack((x1, x2))
    dists = squareform(pdist(xs, metric="cosine")).ravel()
    ax[idx].hist(dists, bins=100, color="skyblue")
    ax[idx].set_title(f"{xs.shape[1]} dims")
    ax[idx].get_yaxis().set_ticks([])

[image: pairwise distance histograms for 2, 10, 100, 1000 dims]

I calculated the connected components on the exact kNN graph and it's 1 for 2 dims, and 2 for the other three. No surprises there. It's not clear how the dimensionality would affect the affinity matrix. The distribution of p_{ij}s seems unaffected (top row below), while the hubs get bigger. Plotting 10, 100, and 1000 on the log scale reveals that they actually follow a power-law distribution.

Code
import numpy as np
import matplotlib.pyplot as plt
from openTSNE import affinity

fig, ax = plt.subplots(ncols=4, nrows=2, figsize=(16, 5))
for idx, dims in enumerate([2, 10, 100, 1000]):
    x1 = np.random.normal(+2, 1, size=(1000, dims))
    x2 = np.random.normal(-2, 1, size=(1000, dims))
    xs = np.vstack((x1, x2))
    aff = affinity.PerplexityBasedNN(xs, method="exact")
    ax[0, idx].hist(aff.P.data, bins=50, color="skyblue")
    ax[1, idx].hist(np.ravel(np.sum(aff.P > 0, axis=0)), bins=50, color="skyblue")
    ax[0, idx].set_title(f"{xs.shape[1]} dims: pij")
    ax[1, idx].set_title(f"{xs.shape[1]} dims: #neighbors")
    ax[0, idx].get_yaxis().set_ticks([])
    ax[1, idx].get_yaxis().set_ticks([])
plt.tight_layout()

[image: p_ij and neighbor-count histograms for 2, 10, 100, 1000 dims]
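
(The connected-components check mentioned above can be done roughly like this; a sketch using sklearn/scipy rather than openTSNE:)

from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def n_knn_components(X, k=90):
    # exact kNN graph, symmetrized as in the affinity construction
    G = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    n_components, _ = connected_components(G + G.T, directed=False)
    return n_components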

I don't know how all this translates into gradients, but it should explain the clean separation that @dkobak was seeing.

Regarding some points overshooting: it would be a good idea to figure out which (attractive or repulsive) forces are causing this. When I was implementing the transform functionality of t-SNE, this was a huge problem driven by the attractive forces, which were huge, causing points to overshoot. There, I solved it with gradient clipping or by reducing the learning rate -- both fixed the problem.

This was also a problem when not scaling the initialization properly, but I don't remember which of the forces were problematic there. When it's not properly initialized, things don't converge. So my intuition is that the scaled initialization somehow perfectly balances the attractive and repulsive forces, and I would guess the momentum compensates for that as the embedding is scaled up. But I haven't tested this.

@pavlin-policar
Contributor

I would also note that I do gradient clipping on a per-point basis. This prevents individual points from overshooting, which is very problematic in the interpolation scheme.
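
For illustration, per-point gradient clipping might look roughly like this (a sketch, not openTSNE's actual implementation; max_grad_norm mirrors the parameter mentioned above):

import numpy as np

def clip_point_gradients(grad, max_grad_norm):
    # grad has shape (N, 2); rescale each row so its norm is at most max_grad_norm
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_grad_norm / np.maximum(norms, 1e-12))
    return grad * scale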

@dkobak
Collaborator Author

dkobak commented Mar 23, 2020

@pavlin-policar I posted a comment almost simultaneously with yours (some seconds before); take a look, in case you didn't see it.

As George pointed out, the different levels of separation are due to the high dimensionality of the data. As the dimensionality of the data increases, the distances between points sampled from some fixed distribution all become about the same.

This is true but to be honest I think it's unrelated (or at least not directly related) to what we are seeing here. In each dimensionality from d=2 to d=1000 the separation between Gaussians is so wide that there are no kNN edges between clusters. So as far as tSNE is concerned the clusters are infinitely far away, independent of d.

The distribution of p_{ij}s seems unaffected (top row below), while the hubs get bigger. Plotting 10, 100, and 1000 on the log scale reveals that they actually follow a power-law distribution.

Yeah, this seems to be the key insight.

Regarding some points overshooting: it would be a good idea to figure out which (attractive or repulsive) forces are causing this.

Everything indicates it's the attractive forces. (1) It only happens during exaggeration. (2) It only happens for points that have many nonzero elements in the affinity matrix ("hubs").

When I was implementing the transform functionality of t-SNE, this was a huge problem driven by the attractive forces, which were huge, causing points to overshoot. There, I solved it with gradient clipping or by reducing the learning rate -- both fixed the problem.

Exactly, which is why I am thinking about gradient clipping by default. The question is, what would be a good default value?

I would also note that I do gradient clipping on a per-point basis. This prevents individual points from overshooting, which is very problematic in the interpolation scheme.

You mean when doing transform? Because openTSNE does not use clipping by default for tSNE itself, I think, right?

@pavlin-policar
Contributor

This is true but to be honest I think it's unrelated (or at least not directly related) to what we are seeing here. In each dimensionality from d=2 to d=1000 the separation between Gaussians is so wide that there are no kNN edges between clusters. So as far as tSNE is concerned the clusters are infinitely far away, independent of d.

Well, yes, you're right. But it would explain why the clusters seem more compact. If all the points are about the same distance from each other, then which neighbors are chosen is kind of arbitrary. I would guess the KNNG is much denser in 1k dims than in 10 dims. And the p_ijs are probably more uniformly distributed across the neighbors. So, more compact. Then the repulsive forces are more concentrated and therefore stronger. Also, the level of separation is partly a visual effect: notice that in your picture, the span of the embedding is largest in 2d, from -50 to 50, while it only spans from -10 to 10 in the 100d embedding.

You mean when doing transform? Because openTSNE does not use clipping by default for tSNE itself, I think, right?

Right. Up until now, I have never seen points overshooting when running regular t-SNE, so it seemed unnecessary. I did encounter this once when setting the learning rate really high, but I don't think most people touch the learning rate, other than us, apparently.

@dkobak
Collaborator Author

dkobak commented Mar 23, 2020

If all the points are about the same distance from each other, then which neighbors are chosen is kind of arbitrary. I would guess the KNNG is much denser in 1k dims than in 10 dims.

Right! Yes. I like this way of putting it.

Up until now, I have never seen points overshooting when running regular t-SNE, so it seemed unnecessary. I did encounter this once when setting the learning rate really high, but I don't think most people touch the learning rate, other than us, apparently.

Right. Well, as I showed in the animation above, some points do overshoot with the default learning rate in d=1000. The embedding still works fine in that case, but it does indicate the problem. And with learning rate = N/12 it would make sense to use some gradient clipping.

I guess the main question is, what would be a reasonable value?

@linqiaozhi
Member

Thanks for doing the experiments, that's really helpful.

Given that this problem is so dependent on the learning rate, it makes sense to me that we should be clipping based on the magnitude after the learning rate is applied. What we need is pretty clear: we need to limit how far a point can move at each iteration, because some points are shooting off and slowing down the interpolation scheme. If we limit it to 1, for example, wouldn't that more or less solve the problem? Based on your experiment, it seems like it would be easier to find a reasonable threshold that is not dependent on N by clipping after the learning rate is applied (since the learning rate depends on N).

Another option (either in addition to, or in lieu of, the clipping) is to check every 50 iterations whether this overshooting is happening (e.g. if points are moving by more than 1) and to give a warning to the user that the learning rate is probably too high, and that they might consider using the more conservative learning rate of 200.

@dkobak
Collaborator Author

dkobak commented Mar 25, 2020

If we limit it to 1, for example, wouldn't that more or less solve the problem?

Okay, so I looked at the steps. They actually don't decrease much during the course of optimization, presumably because the gains compensate for the decreasing gradients. On MNIST, I get a max step of around 1.1 around iter=30 (when the gradient is maximal), then it drops to around 0.5 and stays there. On a 5k subset of MNIST, the max step is around 0.7. On the n=1.3 million dataset, the max step size is around 3.5 in the early iterations, then drops to around 1.5.

With N=1000 in d=1000, I get step sizes in the 20--50 range.
With N=20000 in d=1000, I get step sizes in the same range.

OK George, I like this idea. I implemented clipping of the step size with threshold 5, and it resolves all problems with d=1000. Should I add it to this PR? If so, should we expose this threshold value (max_step_norm) in the wrappers? I'd say we should, in the spirit of having all parameters controllable via the wrappers.
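
For concreteness, the step-size clipping can be sketched like this (the actual change lives in the C++ gradient descent; this Python version only illustrates limiting the update after learning rate, gains and momentum have been applied):

import numpy as np

def apply_update(Y, update, max_step_norm=5.0):
    # update has shape (N, 2): the per-point step for this iteration,
    # already multiplied by the learning rate, gains and momentum
    norms = np.linalg.norm(update, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_step_norm / np.maximum(norms, 1e-12))
    return Y + update * scale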

@linqiaozhi
Member

Should I add it to this PR? If so, should we expose this threshold value (max_step_norm) in the wrappers? I'd say we should, in the spirit of having all parameters controllable via the wrappers.

Sounds great to me. I will then continue making these changes in the R and MATLAB implementations.

Once this is settled, we should talk more about this phenomenon. I wonder if it is an additional argument for reducing dimensionality using PCA prior to doing t-SNE or UMAP embeddings.

@dkobak
Collaborator Author

dkobak commented Mar 26, 2020

OK, I pushed it into the PR. Note that the data.dat format changed because of the additional parameter. Everything seems to work, but it's late here, so we should test a bit more.

@linqiaozhi
Member

linqiaozhi commented Mar 27, 2020

One more thing: irlba is initialized with a random vector, so as of right now the resulting embedding when initialization='pca' is not actually reproducible. I was going to modify it so that if the user provides a seed, then that same seed is also used immediately prior to the call to PCA.

EDIT: by irlba I mean arpack, in python

@dkobak
Collaborator Author

dkobak commented Mar 27, 2020

Good point. Let's add the seed to the PCA() constructor:

pca = PCA(n_components=map_dims, svd_solver=solver, random_state=seed if seed!=-1 else None)
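
A slightly fuller sketch of a reproducible PCA initialization (assumptions: sklearn's PCA, the seed=-1 convention for "no seed", and the usual rescaling convention where PC1 ends up with standard deviation 0.0001; the actual wrapper code may differ):

import numpy as np
from sklearn.decomposition import PCA

def pca_initialization(X, map_dims=2, solver="arpack", seed=-1):
    pca = PCA(n_components=map_dims, svd_solver=solver,
              random_state=seed if seed != -1 else None)
    init = pca.fit_transform(X)
    # rescale so the first PC has std 0.0001 (assumed convention, see above)
    return init / np.std(init[:, 0]) * 0.0001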

@dkobak
Collaborator Author

dkobak commented Mar 27, 2020

sorry, had a typo in the code above, edited now.

Commits added to the PR:

Commit: "Implemented the following changes which have been made to the python wrapper:
- Learning rate set to max(200, N/early_exag_coeff)
- Initialization set to PCA by default
- Late exaggeration start is set to the early exagg end by default (if late exagg coeff is provided).
Also updated the python wrapper to use a seed prior to computing the approximate SVD, if provided."

Commit: "The following changes have been made in the R and Python wrappers. Now updated in MATLAB:
- Learning rate set to max(200, N/early_exag_coeff)
- Initialization set to PCA by default
- Late exaggeration start is set to the early exagg end by default (if late exagg coeff is provided)."
@linqiaozhi
Member

Okay, the MATLAB and R wrappers are updated. If there are no other changes, I'll change the version numbers in each of the wrappers to 1.2.0 and make a new release!

@dkobak
Collaborator Author

dkobak commented Mar 28, 2020

The PCA argument in Python should be random_state, not random_seed (I wrote it wrong initially).

@dkobak
Collaborator Author

dkobak commented Mar 28, 2020

Other than that, it looks good to me. I can look a bit closer later today.

@pavlin-policar What do you think about the max_step_norm=5 approach that we took?

@dkobak
Collaborator Author

dkobak commented Mar 28, 2020

@linqiaozhi Stupid question: how does one add commits to somebody else's PR? You added two commits to this PR of mine; how does that work?

@linqiaozhi
Member

@linqiaozhi Stupid question: how does one add commits to somebody else's PR? You added two commits to this PR of mine; how does that work?

As far as I understand, the PR is from dkobak/FIt-SNE:master to KlugerLab/FIt-SNE:master. So, any changes to dkobak/FIt-SNE:master will be included in the PR.

So I cloned dkobak/FIt-SNE, made my changes, and then pushed to it. Because it is a branch of a repository I have push access to, by default I can push to it as well.

@dkobak
Collaborator Author

dkobak commented Mar 29, 2020

So I cloned dkobak/FIt-SNE, made my changes, and then pushed to it. Because it is a branch of a repository I have push access to, by default I can push to it as well.

Oh! I didn't realize that. Actually, if I look at the settings of the dkobak/FIt-SNE repository, GitHub tells me that "0 collaborators have access to this repository. Only you can contribute to this repository." But I can see your commits in there.

Do you have push access to all forks of klugerlab/FIt-SNE?

@linqiaozhi
Member

No, only because you did a PR from it. There is a checkbox to enable/disable this when you make the PR. See here.

@dkobak
Collaborator Author

dkobak commented Mar 29, 2020

Oh that's smart from Github! :-)

I fixed the random_state, and also put the whole .py wrapper through the Black formatter. This will result in a large diff, but I thought it's worth it because it's considered good Python code style.

@pavlin-policar
Contributor

pavlin-policar commented Mar 29, 2020

What do you think about the max_step_norm=5 approach that we took?

TBH, I don't like it very much. 5 seems very arbitrary, and this might not work for cases we haven't tested. A better way to set this would be to look at the distribution of step sizes and derive something from that, something like 2*median_step. I'll add this option to openTSNE just to be consistent with FIt-SNE, but I'll disable it by default. So in case anybody needs it, it's there; but I don't think this is a common occurrence, so it shouldn't be a problem.

Something that worries me is that this feels very much like a band-aid to some underlying problem. It might be worth looking at how momentum acts here, maybe the momentum is too high? Or maybe the learning rate shouldn't just depend on N, but also on the number of dimensions. The list of parameters is already quite long, and slapping another one on feels like a dirty fix.

@dkobak
Collaborator Author

dkobak commented Mar 29, 2020

@pavlin-policar

Something that worries me is that this feels very much like a band-aid to some underlying problem. It might be worth looking at how momentum acts here, maybe the momentum is too high? Or maybe the learning rate shouldn't just depend on N, but also on the number of dimensions. The list of parameters is already quite long, and slapping another one on feels like a dirty fix.

I agree with this. On the other hand, limiting the maximum allowed step size seems to be a pretty natural and quite intuitive restriction... One never wants points to move by a lot on each iteration during t-SNE optimization. Given that the final embedding size is usually on the scale of ~100 and that one usually uses 750-1000 iterations, it feels fine to restrict the max step to something around ~1.

5 seems very arbitrary and this might not work for cases we haven't tested.

I agree. Based on my experiments above, I think 1 or 10 could work just as well. The good thing is that limiting the step size should not really break anything; the worst that can happen is that convergence slows down.

A better way to set this would be to look at the distribution of step sizes and figure out something from that, something like 2*median_step or something.

You mean doing this on each iteration? That's an interesting idea too.

Not exactly, we're still adding extra code that needs to execute in every iteration. The first three lines of the code snippet I posted above will still always have to be evaluated on all the points, so that will definitely affect runtime. The question is, is this time difference noticeable?

To me it felt negligible. We only compute row sums for an Nx2 array. It's very fast.

Do you mean something like what pytorch does with variables? It's an interesting idea, but I don't think it fits into the framework. For stuff like this, I think modifying the code in some hacky way is fine. I can't think of a single reason (besides this) when the gradient would be accessible to users.

Not sure what you meant here (I'm not familiar with pytorch). I just meant that I wanted to write a custom callback that would give me the gradient norm at each iteration, and realized that I cannot do that because the gradient is only defined within one of the functions.

@linqiaozhi
Member

@pavlin-policar I agree that this fix feels arbitrary. But it seems quite safe (as Dmitry pointed out), and as far as we can tell, probably will only be applied in rare circumstances. More than happy to improve upon it as we learn more about this overshooting effect.

I think the increase in learning rate (that is causing this problem) is quite important for users. I believe people are getting poor results on their t-SNE embeddings of huge datasets because the default 1000 iterations with a learning rate of 200 are simply inadequate.

So, I'd like to not hold back on this release, but rather update if we have a better solution later.

@linqiaozhi linqiaozhi merged commit 2659a9d into KlugerLab:master Mar 30, 2020
@dkobak
Collaborator Author

dkobak commented Apr 7, 2020

@pavlin-policar How is it looking with the next openTSNE release, by the way? George and I have an upcoming comment where we mention that openTSNE 0.3.0 implements spectral initialization :-) We should make sure that it's actually out before the comment is published. It's not accepted yet, so there is time, but I wanted to check with you.

@pavlin-policar
Contributor

Hey, I've had to shift gears this past week or so because I've got deadlines for some coursework to catch up on. I should be back on this in a week or so. I have a couple of things to update in the readme, and I've rerun the benchmarks, so minor things like that need to be updated. Switching over to Annoy actually brought openTSNE on par with FIt-SNE performance-wise, so that's nice :)

@dkobak
Collaborator Author

dkobak commented Apr 7, 2020

Sounds good. Thanks for the heads up.
