Improvement/fix for optimal_radius_dbscan (WIP) #254
Conversation
… now, and improved the estimation of attractors, but still needs further tests and improvements
Hey, thanks a lot, this is great! You should mention in the docstring this deviation from the bSTAB method (mean of silhouettes vs minimum).
You also need to actually uncomment the tests in line 77 of /test/basins/attractor_mapping_tests.jl. It's okay if there are deviations, we can increase the thresholds of strictness. For me the most important point is that it finds the correct number of attractors at least.
The period estimator feature gives some dispersion on the features, which leads to several clusters being found, instead of one
In my code I just replaced `NaN` with 0 for the period of the fixed point.
though seems to be better if we normalize the features
I am not sure about this. The un-normalized features can be made sensible w.r.t. the actual attractors. And the calculated features can be compared with other features that are computed from other initial conditions. The problem with normalizing is that it is a transformation that depends on the number of points you have. Why does it make a big difference if we normalize features...?
I made some tests for the different test systems, comparing
The systems were
The figures are here. I can of course also share the code. The problems seem to arise only with Lorenz84. They seem ultimately to be due to our estimation of
For other systems (2-5), all options work well. The only observation for the other systems is that increasing to 5000 ics changes the fraction. So it seems that to solve the Lorenz84 problems we can use an average with normalization. Instead of normalizing features, it might be better to use another metric that solves the Euclidean distance issue. What do you think? If this is ok, I can update the docstring and the tests tomorrow.
This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the number of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.
Yeah, I should have been more precise: it is not always the case that increasing the number of points will increase the optimal radius. In fact, for the other systems I tested, in which the clustering is easier, this did not occur. This seems to occur in this more complicated clustering scenario of Lorenz. I think the reason is ultimately the differences between the scales of the features, and the use of the Euclidean distance both for dbscan itself and for the silhouette values. It then becomes difficult/non-intuitive to understand what is going on for non-normalized features. I added some figures to understand this here, including the average silhouette for each tested radius. I can explain them in more detail if you want, but I'll give the short version.

We can see the problem in this figure: 1000 points lead to a decent clustering, with 3 clusters; 5000 points group two of the clusters together. This occurs at only a slightly bigger value of optimal_radius, and the silhouette values are actually slightly bigger (indicating "better" clustering) in this case. Two changes occur when we go to 5000: the top cluster is grouped together; and more points, spread along the x-axis, are added to the bottom right cluster. Since the x-axis spans a much larger interval, and the silhouette compares Euclidean distances, this addition of points spread along the x-axis outweighs the addition of all the top cluster points, which are separated along the y-axis, whose distance is not that big. Adding a few points, spread over the large-spanning axis, can outweigh adding lots of points spread over the small-spanning axis. This is not a problem if the features span a similar interval, e.g. if we normalize them. Notice that increasing to 5000 here even decreases
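To make the normalization idea concrete, here is a minimal sketch (with an illustrative helper name, not the package's API): each feature is mapped to [0, 1] so that Euclidean distances no longer overweight the large-spanning features.

```julia
# Hypothetical helper: min-max normalize each feature column to [0, 1],
# so Euclidean distances weight all features equally.
function normalize_features(features::AbstractMatrix)
    mins = minimum(features, dims = 1)
    maxs = maximum(features, dims = 1)
    ranges = maxs .- mins
    ranges[ranges .== 0] .= 1.0   # constant features: avoid division by zero
    return (features .- mins) ./ ranges
end
```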
Alright, I understand. Okay, let's go ahead and normalize the features to [0, 1]. By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing did not have any spread for the cluster corresponding to the fixed point. These are the four functions I used:

```julia
function feat1(A, t)
    x = A[:, 1]
    [minimum(x), std(x)]
end

function feat2(A, t)
    x, y, z = columns(A)
    [std(x), std(y)*std(z)]
end

function feat3(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    x = minimum(A[:, 1])
    return [g, x]
end

function feat4(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    p = estimate_period(A[:, 1], :zc)
    p = isnan(p) ? 0 : p   # fixed point has no period: use 0 instead of NaN
    return [g, p]
end
```

Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps, if you are available @KalelR, we should discuss this via a short video call?
I used
Do you get one point even if the number of initial conditions is increased?
Nice, sure! I'm available today anytime now.
I am available today at 7pm-8pm CET or so. Will send you a zoom link.
Hey @Datseris, I need to meet my advisor at 19:30 now. Can we shift the conversation to tomorrow afternoon?
No problem. Unfortunately I am fully booked tomorrow and Friday, but no stress, we can see for Sunday, otherwise next week! BTW, do you use the Julia Slack? We can chat there to not overburden the discussion here with off-topic stuff. Otherwise we can switch to emails!
Oh, let's meet on Sunday or next week then. Sure, I'll start to use Slack :)
Hi, very nice job! If I remember well, the cluster on the left corresponds to a fixed point attractor. This is why the estimation of the period fails in
So, normalizing the features seems to make the algorithm work fine, though not perfectly: it seems to always find 3 attractors in Lorenz84, but also sometimes mis-identifies some points as outliers (especially those due to the fixed point). Sometimes it perfectly identifies the clusters, but it's hard to tell when this occurs. I tested two other possibilities for improving the algorithm:
But none are as good as the current version. The results are here under "Several seeds for normalized and with average". I describe the methods below also.

Method 1 (elbow method): Described clearly here: https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd, and in "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" and "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN". The idea is to define k = minpts, then find for each point the average distance to its k nearest neighbors. Plot the sorted distances, and find the point of highest derivative. The optimal epsilon is the distance at that point. This is the red dot in the figure below. The method seems to work ok, but not as well as the iterative silhouette method. It is much faster, though. For some reason, the problem is that the optimal radius (the distance) is too small, and so there usually are quite a few outliers found.

Method 2 (HDBSCAN): There is this HDBSCAN algorithm that builds on DBSCAN. I didn't take the time yet to really understand it, but it is supposedly an implementation of DBSCAN that considers several radii. Ultimately, it doesn't need the radius as a parameter. Because of this, it is very good at identifying clusters of varying density (which may not necessarily be a good feature for our purposes). The important parameter is then basically the minimum number of neighbors. For any value of the minimum number of neighbors that I tested, the results were terrible: the worst clustering so far, finding very many groups. This I think is because this algorithm finds clusters in data of varying density. Since we don't have an enormous number of points, some of the features become isolated, either by themselves or in small groups, and HDBSCAN considers them as clusters.
Fortunately, there is a recent paper (https://arxiv.org/abs/1911.02282) that approaches this problem with a hybrid approach with DBSCAN: apply HDBSCAN and then group together any clusters that are within a certain distance. I therefore applied that, with the distance being the optimal radius found using the elbow method. This improves the method a lot! The downside of course is we get the epsilon parameter back, though it is not as relevant as before, in a sense. Applying the same iterative optimization we used before for the minimum number of neighbors makes the method quite good. But I'm still not sure if it is better than the original DBSCAN with epsilon optimization. Maybe for another system.
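For concreteness, the elbow heuristic of Method 1 could be sketched like this (illustrative names, a naive O(n²) neighbor search, and the assumption that features are stored as columns; not the package's implementation):

```julia
using Statistics, LinearAlgebra

# Elbow heuristic sketch: mean distance of each point to its k nearest
# neighbors, sorted; the candidate radius is the value just before the
# steepest increase of this sorted curve.
function elbow_radius(features::AbstractMatrix, k::Int)
    n = size(features, 2)   # each column is one feature vector
    knn = map(1:n) do i
        d = [norm(features[:, i] .- features[:, j]) for j in 1:n if j != i]
        mean(partialsort!(d, 1:k))   # mean distance to the k nearest neighbors
    end
    sort!(knn)
    return knn[argmax(diff(knn))]   # distance at the point of highest derivative
end
```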
(Quickly letting you know that I had eye surgery and it will take some time until I can see code again... But I have saved this in my todo :) )
Oh, hope everything is ok! Yes, we can resume work on this when you can, no hurry. Wish you a good recovery!
I'm back, I've sent a message on Slack.
…ed clustering method
…d clustering/utils file with the method, the silhouette method, and some other utils for dbscan. Directory should also include the source code from Clustering.jl, but dbscan errors when I do that. I'll try to fix later, and use Clustering.jl directly for now
…us to clustering/utils
…d were related to the labeling of the attractors: (i) the algorithm might return labels [1, 2] when the correct labels are [-1, 1] (this is because it identifies the Henon attractor at infinity as an attractor); (ii) the algorithm might identify some outlier points (e.g. some of the points in the FP attractor in Lorenz84), in which case it puts them in -1. Then the labels are [-1, 1, 2, 3] and not [1, 2, 3]. In both cases, the values were already within the tolerated error, so I just ignored the labels and tested the values. Also had to increase to , as it improved the clustering for Lorenz84.
Wait, the proximity and recurrences tests were passing before :D what happened now?
Hmm, they weren't for me. I pulled from
Are there modifications outside
I confirm that the tests also fail on master on my machine. I will look into it.
It seems that the expected basin fraction values are slightly different on my machine: (0.509, 0.491) instead of (0.511, 0.489). It can be due to the RNG. I don't know why this happens, but the temporary fix is this in the Duffing test: `expected_fs_raw = Dict(2 => 0.509, 1 => 0.491)`
Now all the tests pass on my machine. I don't know why the supervised method is failing in the tests here. There is something different about the knn algorithm... maybe the version is different?
```julia
for k in keys(fs)
    @test 0 < fs[k] < 1
end
@test sum(values(fs)) == 1
```

```julia
# Precise test with known initial conditions
fs, labels, approx_atts = basins_fractions(mapper, ics; show_progress = false)
@test sort!(unique!(labels)) == known_ids
# @test sort!(unique!(labels)) == known_ids #why compare keys?
```
Well, okay, if we don't compare keys, we should still compare the length of the keys, i.e., that all methods find the same number of attractors (after dropping key -1).
…replace featurizer's estimation of period by the minimum of A[:,1]
But the error is occurring in
…rc/basins/clustering
Tests pass, so this is good to go, right?
```julia
found_fs = sort(collect(values(fs)))
if length(found_fs) > length(expected_fs)
    found_fs = found_fs[2:end]  # drop -1 key if it corresponds to just unidentified points
end
```
Nope! This corresponds to just unidentified points only if the method is the Featurizing one. For Recurrences this is valid and counts the divergence to infinity! Which is something we need to test anyway for the Henon map!
Yeah, that's true, I only focused on Featurizing :s haha. The problem is that `-1` has a different meaning in the two methods: featurizing returns `-1` if it can't cluster the points, while Recurrences returns `-1` if the points diverge. A quick workaround is to do `found_fs = found_fs[2:end]` only if `mapper` is Featurizing. But maybe it would be better to change how Featurizing labels the points? Maybe label unclustered points as -2 or something?
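That relabeling is trivial to sketch (a hypothetical helper, assuming integer labels where -1 currently marks unclustered points):

```julia
# Hypothetical sketch: give unclustered featurizing points the label -2,
# so -1 can keep the single meaning of "diverged" used by Recurrences.
relabel_unclustered(labels::AbstractVector{<:Integer}) = [l == -1 ? -2 : l for l in labels]
```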
Should be, but I haven't touched the supervised/knn part, so there seems to be some variability in that error. It's strange because it never occurred on my machine.
Alright, but you still need to:
Note that the changelog is for the users, so you don't have to mention changes to the tests there :D (it's okay now)
Add changes improving optimal_radius_dbscan; there seem to be no more errors now, and the estimation of attractors (clusters) is better.
Worked well for Lorenz84 (though it seems to be better if we normalize the features), giving basins_fractions values near the correct ones (the exact value depends on the transient, how many ics, which features, whether they are normalized or not...).
But it did not work well if we restrict the initial conditions to be around the FP only. It should just give all features in the FP, but that only works if all features are identical. The period estimator feature gives some dispersion in the features, which leads to several clusters being found instead of one. I think this is the fault of two parts: the current guess of the optimal_radius is not good for that, and the silhouette method is a bit biased towards finding more than 1 cluster.
Worked for Henon also.
The tests on attractor_mapping_tests.jl are not always passing, as the algorithm is not finding the correct values perfectly, and sometimes finds some -1 outliers.

I made 3 changes. To explain them, first let me recall the idea behind the algorithm. The way 'optimal_radius_dbscan' works is by first defining a range of possible radii for DBSCAN, then iterating over that range to calculate the quality of the clustering for each value, and then choosing the radius with the best clustering. The clustering quality is assessed by the silhouette quantifier.
From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters." It is undefined for only one cluster, but the default value they use is 0 (the midpoint).
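Following that quoted definition, a single silhouette value could be computed as in this minimal sketch (illustrative code, not the package's implementation; it assumes a pairwise distance matrix `D`, integer cluster labels, and at least two clusters):

```julia
using Statistics

# Silhouette of point i: cohesion a (mean distance to its own cluster)
# vs separation b (mean distance to the nearest other cluster).
function silhouette_value(i, D, labels)
    own = findall(==(labels[i]), labels)
    length(own) == 1 && return 0.0                   # lone point: convention 0
    a = mean(D[i, j] for j in own if j != i)         # cohesion
    b = minimum(mean(D[i, j] for j in findall(==(l), labels))
                for l in unique(labels) if l != labels[i])   # separation
    return (b - a) / max(a, b)                       # lies in [-1, 1]
end
```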
The original way the authors approached this was to identify the radius that maximized the minimum silhouette (increased the lower bound the most). This seems very reasonable, but I found that on Lorenz84 it led to a very high 'optimal_radius', which made it group two clusters together. I changed it to instead maximize the average silhouette value, which made the 'optimal_radius' values go down and improved the clustering (now it finds 3 clusters, instead of two, much more robustly).
Also, in the previous version the algorithm was ignoring the 0-value silhouettes when calculating the minimum. I admit I can't remember why I wrote it that way. The authors of bSTAB did not implement it in their code either. So I removed that behavior. This fixes the errors that were happening (because sometimes all silhouette values were 0, and the code broke).
I also changed the value assigned to the silhouette when only one cluster is found. Previously it was -2 (ignoring the one-cluster solution). Now it is 0 (the default in Wikipedia), which seems fairer to me. But we might need to discuss this.
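Putting this criterion together, the radius scan could be sketched as below. `cluster_at` and `silhouettes_of` are hypothetical stand-ins for the DBSCAN call and the silhouette computation (the latter assumed to return `[0.0]` when only one cluster is found, per the convention above):

```julia
using Statistics

# Scan candidate radii and keep the one whose clustering maximizes the
# *mean* silhouette value (instead of the minimum, as in bSTAB).
function optimal_radius(radii, cluster_at, silhouettes_of)
    scores = [mean(silhouettes_of(cluster_at(ε))) for ε in radii]
    return radii[argmax(scores)]
end
```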
So there is still more to do: the algorithm can still be improved, and I'd still like to run more tests, and of course to get attractor_mapping_tests.jl to pass. But this is the current version so far, which is already better than before! I can return to this tomorrow.