
Improvement/fix for optimal_radius_dbscan (WIP) #254

Merged: 14 commits into JuliaDynamics:master on May 10, 2022

Conversation

KalelR
Contributor

@KalelR KalelR commented Apr 18, 2022

This adds changes improving optimal_radius_dbscan; there seem to be no more errors now, and the estimation of attractors (clusters) is better.

It worked well for Lorenz84 (though it seems to work better if we normalize the features), giving basin fractions near the correct ones (the exact values depend on the transient, how many ics are used, which features, and whether they are normalized).
But it did not work well if we restrict the initial conditions to lie around the fixed point (FP) only. In that case it should just return all features belonging to the FP, but that only works if all features are identical. The period-estimator feature introduces some dispersion in the features, which leads to several clusters being found instead of one. I think two parts are at fault: the current guess of the optimal_radius is not good for this case, and the silhouette method is somewhat biased towards finding more than one cluster.

It worked for Henon as well.

The tests in attractor_mapping_tests.jl are not always passing, as the algorithm does not find the correct values perfectly and sometimes labels some points as -1 outliers.

I made three changes. To explain them, let me first recall the idea behind the algorithm. optimal_radius_dbscan works by first defining a range of possible radii for DBSCAN, then iterating over that range to calculate the quality of the clustering for each value, and finally choosing the radius with the best clustering. The clustering quality is assessed by the silhouette quantifier.

From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters." It is undefined for a single cluster; a common default in that case is 0 (the midpoint).
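Concretely, for a point i, with a(i) the mean distance from i to the other members of its own cluster and b(i) the smallest mean distance from i to the points of any other single cluster, the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)).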

The original way the authors approached this was to identify the radius that maximized the minimum silhouette (i.e. raised the lower bound the most). This seems very reasonable, but I found that on Lorenz84 it led to a very high optimal_radius, which made it group two clusters together. I changed it to instead maximize the average silhouette value, which lowered the optimal_radius values and improved the clustering (it now finds 3 clusters instead of two, much more robustly).

Also, in the previous version the algorithm was ignoring the 0-valued silhouettes when calculating the minimum. I admit I can't remember why I wrote it that way, and the authors of bSTAB did not implement it in their code either, so I removed it. This fixes the errors that were happening (sometimes all silhouette values were 0 and the code broke).

I also changed the value assigned to the silhouette when only one cluster is found. Previously it was -2 (which effectively discarded the one-cluster solution); now it is 0 (the default mentioned in the Wikipedia article), which seems fairer to me. But we might need to discuss this.
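To make these changes concrete, here is a minimal, self-contained sketch of the radius optimization as described above (not the package code: it assumes Clustering.jl's distance-matrix dbscan method, computes the silhouettes by hand, and treats unclustered points as one extra group):

    using Clustering, Distances, Statistics

    # Mean silhouette of a labeling; returns 0 when only one cluster is found,
    # matching the convention adopted above.
    function mean_silhouette(labels, dists)
        ids = unique(labels)
        length(ids) < 2 && return 0.0
        sils = map(eachindex(labels)) do i
            same = findall(==(labels[i]), labels)
            a = length(same) > 1 ? mean(dists[i, j] for j in same if j != i) : 0.0
            b = minimum(mean(dists[i, j] for j in findall(==(c), labels)) for c in ids if c != labels[i])
            m = max(a, b)
            m == 0 ? 0.0 : (b - a) / m
        end
        return mean(sils)
    end

    function optimal_radius_mean_silhouette(features; min_neighbors = 10)
        dists = pairwise(Euclidean(), features, dims = 2)         # features are columns
        feat_ranges = vec(maximum(features, dims = 2) .- minimum(features, dims = 2))
        ϵ_grid = range(minimum(feat_ranges)/200, minimum(feat_ranges); length = 200)
        s_grid = map(ϵ_grid) do ϵ
            labels = dbscan(dists, ϵ, min_neighbors).assignments  # 0 = unclustered
            mean_silhouette(labels, dists)
        end
        return ϵ_grid[argmax(s_grid)]                             # maximize the *mean* silhouette
    end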

So there is still more to do: the algorithm can still be improved, I'd like to run more tests, and of course attractor_mapping_tests.jl needs to pass. But the current version is already better than before! I can return to this tomorrow.

… now, and improved the estimation of attractors, but still needs further tests and improvements
Member

@Datseris Datseris left a comment


Hey, thanks a lot, this is great! You should mention in the docstring this deviation from the bSTAB method (mean of silhouettes vs minimum).

You also need to actually uncomment the tests in line 77 of /test/basins/attractor_mapping_tests.jl. It's okay if there are deviations; we can increase the thresholds of strictness. For me the most important point is that it finds the correct number of attractors at least.

The period estimator feature gives some dispersion on the features, which leads to several clusters being found, instead of one

In my code I just replaced NaN with 0 for the period of the fixed point.

though seems to be better if we normalize the features

I am not sure about this. The un-normalized features can be made sensible w.r.t. the actual attractors, and the calculated features can be compared with features computed from other initial conditions. The problem with normalizing is that it is a transformation that depends on the number of points you have. Why does it make a big difference if we normalize features...?

src/basins/attractor_mapping_featurizing.jl: 4 review comments (outdated, resolved)
@KalelR
Contributor Author

KalelR commented Apr 19, 2022

I made some tests for the different test systems, comparing

  1. average vs minimum as criterion for finding the optimal radius
  2. normalizing or not the features (by normalization I mean making each feature range from 0 to 1)
  3. comparing 1000 or 5000 initial conditions (therefore, 1000 or 5000 features)

The systems were

  1. Lorenz84
  2. Henon
  3. Duffing
  4. Magnetic pendulum
  5. Thomas cyclical

The figures are here. I can of course also share the code.

The problems seem to arise only with Lorenz84. They seem ultimately to be due to our estimation of optimal_radius. There,

  1. The algorithm always identifies 3 attractors when 1000 ics are given. But increasing to 5000 leads it to estimate 2 attractors, because the optimal radius increases and it starts to group two clusters (chaotic and periodic) together.
  2. Normalizing the features seems to solve this issue, and the 3 clusters continue to be identified for 5000 ics (a minimal rescaling sketch is given further below). The reason I see for this is the following:
    The clustering algorithm calculates the Euclidean distance between points and compares it to a threshold, which is what we are calling the optimal radius. For two features to both matter in this comparison, they need to span similar intervals. Otherwise, if the features span very different intervals (e.g. the period lies in ~[5, 10] but the attractor entropy in ~[0, 100]), the Euclidean distance is dominated by the larger interval, making the optimal_radius large compared to the interval spanned by the smaller feature. As a consequence, differences along this smaller-interval feature get ignored; if that is the feature that separates clusters, then the two clusters get grouped together. In the Lorenz system, the optimal_radius jumps from 2 to 4 when 5000 ics are used (the attractor entropy reaches higher values), and this jump is apparently big enough for the two clusters to become grouped.
    When we normalize, the two features contribute more equally, and this problem does not seem to occur.
  3. The normalization only works if we maximize the average of the silhouettes to obtain the optimal_radius, instead of using the minimum value. Apparently, with the minimum value, the optimal_radius is estimated too high, making clusters group together; with the average, the smaller values keep the clusters separated. There is quite a large interval of optimal_radius that leads to 3 clusters, and the estimation via the minimum lands at a rather large value.

For the other systems (2-5), all options work well. The only observation there is that increasing to 5000 ics changes the fraction values fs, which is expected; but options 1-2 lead to the same values.

So it seems that to solve the Lorenz84 problems we can use the average together with normalization. Instead of normalizing the features, it might be better to use another metric that avoids the Euclidean distance issue. What do you think?
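For concreteness, by normalization I just mean a per-feature min-max rescaling; a minimal sketch (the function name is hypothetical; features are stored as columns, one per initial condition):

    # Rescale each feature (each row of the feature matrix) to [0, 1], so that all
    # features contribute comparably to the Euclidean distances used by DBSCAN.
    function rescale_features(features::AbstractMatrix)
        lo = minimum(features, dims = 2)
        hi = maximum(features, dims = 2)
        span = hi .- lo
        span[span .== 0] .= 1.0    # constant features (e.g. from a fixed point) stay at 0
        return (features .- lo) ./ span
    end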

If this is ok, I can update the docstring and the tests tomorrow.

@Datseris
Member

For non-normalized features, the optimal radius increases when increasing the number of initial conditions (i.e. of feature vectors).

This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the amount of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.

@KalelR
Contributor Author

KalelR commented Apr 20, 2022

This is very confusing for me. How can it be...? The optimal radius should depend on the distribution of distances in the feature space, so why would this change when increasing the amount of points? It should only be more precisely identified, but I must admit it is confusing to me that this optimal radius increases.

Yeah, I should have been more precise: it is not always the case that increasing the number of points increases the optimal radius. In fact, for the other systems I tested, where the clustering is easier, this did not occur. It seems to happen only in the more complicated clustering scenario of Lorenz84.

I think the reason is ultimately the difference between the scales of the features, combined with the use of the Euclidean distance both for DBSCAN itself and for the silhouette values. It then becomes difficult/non-intuitive to understand what is going on for non-normalized features. I added some figures to understand this here, including the average silhouette for each tested radius. I can explain them in more detail if you want, but I'll give the short version. We can see the problem in this figure:
[figure: Lorenz84 feature clustering for 1000 vs 5000 initial conditions, un-normalized features]

1000 points lead to a decent clustering with 3 clusters; with 5000 points the two clusters are grouped together. This happens at only a slightly larger value of optimal_radius, and the silhouette values are actually slightly larger (indicating "better" clustering) in this case. Two changes occur when we go to 5000 points: the top cluster gets grouped in, and more points, spread along the x-axis, are added to the bottom-right cluster. Since the x-axis spans a much larger interval, and the silhouette compares Euclidean distances, adding these x-spread points outweighs adding all the top-cluster points, which are separated along the y-axis, whose distances are not that large. Adding a few points spread over the large-spanning axis can outweigh adding many points spread over the small-spanning axis.

This is not a problem if the features span similar intervals, e.g. if we normalize them:
[figure: the same comparison with features normalized to [0, 1]]

Notice that increasing to 5000 here even decreases the optimal_radius (it's not very important why, but I think it's because for 1000 points there is a semi-isolated point in the bottom-right cluster, which needs a bigger radius to be reached; for 5000 there are points nearby, so the radius can be smaller).

@Datseris
Member

Alright, I understand. Okay, let's go ahead and normalize the features to [0, 1].

By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing had no spread for the cluster corresponding to the fixed point. These are the four functions I used:

using DynamicalSystems, Statistics  # columns, genentropy, estimate_period; std

# minimum and spread of the first variable
function feat1(A, t)
    x = A[:, 1]
    return [minimum(x), std(x)]
end
# spread of the trajectory in the three variables
function feat2(A, t)
    x, y, z = columns(A)
    return [std(x), std(y)*std(z)]
end
# (exponential of the) generalized entropy and minimum of the first variable
function feat3(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    x = minimum(A[:, 1])
    return [g, x]
end
# entropy and estimated period (NaN for a fixed point, replaced by 0)
function feat4(A, t)
    g = exp(genentropy(A, 0.1; q = 0))
    p = estimate_period(A[:, 1], :zc)
    p = isnan(p) ? 0 : p
    return [g, p]
end

Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps if you are available @KalelR , we should discuss this via a short video call?

@KalelR
Contributor Author

KalelR commented Apr 20, 2022

By the way, are you sure the reported problems are not coming from the incredibly large spread of the cluster in the bottom left? How do you produce this cluster? All four possibilities I used for featurizing had no spread for the cluster corresponding to the fixed point. These are the four functions I used:

I used feat4 actually. My code is here.
The images are:
[figure]

and
[figure]

Do you get one point even if the number of initial conditions is increased?

Also, this discussion gave me an idea for an alternative algorithm for getting the optimal radius. Perhaps if you are available @KalelR , we should discuss this via a short video call?

Nice, sure! I'm available today anytime now.

@Datseris
Member

I am available today at 7pm-8pm CET or so. Will send you a zoom link.

@KalelR
Contributor Author

KalelR commented Apr 20, 2022

Hey @Datseris, I need to meet my advisor at 19:30 now. Can we shift the conversation to tomorrow afternoon?

@Datseris
Member

No problem. Unfortunately I am fully booked tomorrow and Friday, but no stress, we can aim for Sunday or otherwise next week! BTW, do you use the Julia Slack? We can chat there so as not to overburden the discussion here with off-topic stuff. Otherwise we can switch to emails!

@KalelR
Contributor Author

KalelR commented Apr 20, 2022

Oh, let's meet on Sunday or next week then. Sure, I'll start using Slack :)

@awage
Contributor

awage commented Apr 20, 2022

I used feat4 actually. My code is here.
The images are:

Hi, very nice job! If I remember correctly, the cluster on the left corresponds to a fixed-point attractor. This is why the period estimation fails in feat4 and you get so much spread in the values. It is quite tough to find one feature function that works for everything.

@KalelR
Contributor Author

KalelR commented Apr 25, 2022

So, normalizing the features seems to make the algorithm work fine, though not perfectly: it always seems to find 3 attractors in Lorenz84, but sometimes it also mis-identifies some points as outliers (especially those coming from the fixed point). Sometimes it identifies the clusters perfectly, but it's hard to tell when that happens.

I tested two other possibilities for improving the algorithm:

  1. Finding the optimal_radius by some a priori method, without having to optimize it by the iterative method we are using.
  2. Using another algorithm that does not require an optimal_radius.

But neither is as good as the current version. The results are here under "Several seeds for normalized and with average". I also describe the methods below.

Method 1 (elbow method)

Described clearly here: https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd, and in "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" and "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN".

The idea is to pick k = minpts, then find for each point the average distance to its k nearest neighbors. Plot the sorted distances and find the point of highest derivative; the optimal epsilon is the distance at that point. This is the red dot in the figure below.
[figure: sorted average k-nearest-neighbor distances, with the elbow point marked as a red dot]

The method seems to work OK, but not as well as the iterative silhouette method. It is much faster, though. For some reason the problem is that the optimal radius (the distance at the elbow) comes out too small, so quite a few outliers are usually found.
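For reference, a minimal sketch of this elbow idea, assuming NearestNeighbors.jl (the function name and the choice of taking the distance at the steepest point of the sorted curve are mine):

    using NearestNeighbors, Statistics

    function elbow_radius(features::AbstractMatrix, k::Int)
        tree = KDTree(features)                # features are columns (points)
        # ask for k + 1 neighbors, because each point is its own nearest neighbor
        _, dists = knn(tree, features, k + 1, true)
        avg_d = sort!([mean(@view d[2:end]) for d in dists])
        elbow = argmax(diff(avg_d))            # steepest increase of the sorted curve
        return avg_d[elbow]                    # use this as the DBSCAN ϵ
    end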

Method 2 (HDBSCAN)

There is the HDBSCAN algorithm, which builds on DBSCAN. I haven't taken the time to really understand it yet, but it is essentially an implementation of DBSCAN that considers several radii; ultimately, it doesn't need the radius as a parameter. Because of this, it is very good at identifying clusters of varying density (which may not necessarily be a good feature for our purposes).

The important parameter is then basically the minimum number of neighbors. For every value of that parameter that I tested the results were terrible, the worst clustering so far, finding far too many groups. I think this is because the algorithm looks for clusters of varying density: since we don't have an enormous number of points, some features become isolated, either by themselves or in small groups, and HDBSCAN considers them clusters.

Fortunately, there is a recent paper (https://arxiv.org/abs/1911.02282) that addresses this with a hybrid DBSCAN approach: apply HDBSCAN and then group together any clusters that are within a certain distance of each other (a sketch of this merging step is at the end of this comment). I applied that, with the distance being the optimal radius found with the elbow method. This improves the method a lot! The downside, of course, is that we get the epsilon parameter back, though it is not as relevant as before, in a sense.

Applying the same iterative optimization we used before, but now for the minimum number of neighbors, makes the method quite good. But I'm still not sure it is better than the original DBSCAN with epsilon optimization. Maybe for another system.
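For reference, here is a sketch of just that merging step (not of HDBSCAN itself), assuming we already have integer cluster labels from some algorithm and a pairwise distance matrix dists; the single-linkage criterion is my choice for illustration:

    # Merge any clusters whose closest members are within ϵ of each other (single linkage).
    # `labels` are integer cluster labels (non-positive = outliers); modified in place.
    function merge_close_clusters!(labels, dists, ϵ)
        ids = filter(>(0), unique(labels))
        merged = true
        while merged
            merged = false
            for i in ids, j in ids
                i < j || continue
                pts_i = findall(==(i), labels)
                pts_j = findall(==(j), labels)
                (isempty(pts_i) || isempty(pts_j)) && continue
                if minimum(dists[pts_i, pts_j]) < ϵ
                    labels[pts_j] .= i    # absorb cluster j into cluster i
                    merged = true
                end
            end
        end
        return labels
    end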

@Datseris
Member

(quickly letting you know that I did eye surgery and will take some time until I can see code again... But I have saved this in my todo :) )

@KalelR
Contributor Author

KalelR commented Apr 28, 2022

(quickly letting you know that I did eye surgery and will take some time until I can see code again... But I have saved this in my todo :) )

Oh, hope everything is ok! Yes, we can resume work on this when you can, no hurry. Wish you a good recovery!

@Datseris
Member

Datseris commented May 4, 2022

i'm back, i've sent a message on slack.

KalelR added 5 commits May 7, 2022 12:13
…d clustering/utils file with the method, the silhouette method, and some other utils for dbscan. Directory should also include the source code from Clustering.jl, but dbscan errors when I do that. I'll try to fix later, and use Clustering.jl directly for now
…d were related to the labeling of the attractors: (i) algorithm might return labels [1,2] when correct is [-1, 1] (this is because it identifies the Henon attractor at infinity as an attractor); (ii) algorithm might identify some outlier points (eg some of the points in the FP attractor in Lorenz84), in which case it puts them in -1. Then the labels are [-1, 1,2,3] and not [1,2,3]. In both cases, the values were already within the tolerated error, so I just ignored the labels and tested the values. Also had to increase to , as it improved the clustering for Lorenz84.
@KalelR
Contributor Author

KalelR commented May 9, 2022

Now the code should be working. The changes were basically:

  1. Minimum -> average (slight improvement): we now choose the radius that maximizes the average silhouette.
  2. Rescale the features (big improvement).
  3. Add the elbow method as an optional way to find the radius. Looking at the original bSTAB implementation, the authors also tried this method, but apparently found that it did not work very well either.
  4. Fix the tests.

The test summary on my machine is now:
[figure: test summary output]

There are some further improvements that could be made:

  1. Add the Clustering.jl code into the clustering folder. I tried doing it, but dbscan was failing and at the time I didn't want to spend too long trying to fix it.
  2. More importantly, improve the guesses for the optimal_radius. Currently, I think the algorithm would fail if the system has a single fixed point (which may be trivial, but the code should still work), because of the way the radii are guessed (a possible guard is sketched at the end of this comment):
    feat_ranges = maximum(features, dims=2)[:,1] .- minimum(features, dims=2)[:,1];
    ϵ_grid = range(minimum(feat_ranges)/200, minimum(feat_ranges), length=200)

Also, we are using 200 guesses. A smarter approach might use fewer and speed up the code (which can take some time when there are many (>1000) initial conditions). I am guessing that an idea based on the elbow method, or on George's idea, could work.
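For what it's worth, a possible guard for that degenerate case, using the same variable names as the snippet above (just an illustration, not necessarily how we should fix it):

    feat_ranges = vec(maximum(features, dims = 2) .- minimum(features, dims = 2))
    # ignore features with zero range (a single fixed point makes them all zero)
    min_range = minimum(feat_ranges[feat_ranges .> 0]; init = Inf)
    isfinite(min_range) || (min_range = 1.0)   # all features identical: any positive radius will do
    ϵ_grid = range(min_range/200, min_range; length = 200)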

@Datseris
Member

Datseris commented May 9, 2022

wait the proximity and recurrences tests were passing before :D what happened now?

@KalelR
Contributor Author

KalelR commented May 9, 2022

wait the proximity and recurrences tests were passing before :D what happened now?

Hmm, they weren't for me. I pulled from master and ran the tests before modifying anything, and these errors in the Duffing oscillator tests were already there. I just ran the tests again on master to confirm. The modifications I made shouldn't affect the other methods anyway.

Are there modifications outside master that I missed?

@awage
Contributor

awage commented May 9, 2022

I confirm that the tests also fail on master in my machine. I will look into it.

@awage
Contributor

awage commented May 9, 2022

It seems that the expected basin-fraction values are slightly different on my machine: (0.509, 0.491) instead of (0.511, 0.489). It may be due to the RNG. I don't know why this happens, but the temporary fix is this in the Duffing test:

    expected_fs_raw = Dict(2 => 0.509, 1 => 0.491)

@KalelR
Contributor Author

KalelR commented May 10, 2022

Now all the tests pass on my machine. I don't know why the supervised method is failing in the tests here. There is something different about the knn algorithm... maybe the version is different?

@Datseris
Member

BoundsError: attempt to access 3-element Vector{Int64} at index [-1] seems to be the error. Perhaps we should modify the code so that it skips key -1 in case the clustering fails...?

src/ChaosTools.jl: review comment (outdated, resolved)
for k in keys(fs)
    @test 0 < fs[k] < 1
end
@test sum(values(fs)) == 1

# Precise test with known initial conditions
fs, labels, approx_atts = basins_fractions(mapper, ics; show_progress = false)
@test sort!(unique!(labels)) == known_ids
# @test sort!(unique!(labels)) == known_ids #why compare keys?
Member


well, okay, if we don't compare keys, we should still compare the length of the keys, i.e., that all methods find the same number of attractors (after dropping key -1)
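Something like this, perhaps (hypothetical snippet; expected_number_of_attractors stands in for whatever the test defines):

    found_ids = setdiff(unique(labels), [-1])   # drop the outlier/divergence label
    @test length(found_ids) == expected_number_of_attractors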

…replace featurizer's estimation of period by the minimum of A[:,1]
@KalelR
Contributor Author

KalelR commented May 10, 2022

BoundsError: attempt to access 3-element Vector{Int64} at index [-1] seems to be the error. Perhaps we should modify the code so that it skips key -1 in case the clustering fails...?

But the error is occurring in knn, not the tests:

  BoundsError: attempt to access 3-element Vector{Int64} at index [-1]
  Stacktrace:
    [1] getindex
      @ ./array.jl:861 [inlined]
    [2] knn_point!(tree::KDTree{SVector{2, Float64}, Euclidean, Float64}, point::SVector{2, Float64}, sortres::Bool, dist::Vector{Float64}, idx::Vector{Int64}, skip::typeof(NearestNeighbors.always_false))
      @ NearestNeighbors ~/.julia/packages/NearestNeighbors/YCcEC/src/knn.jl:36

@Datseris
Member

Tests pass, so this is good to go, right?

found_fs = sort(collect(values(fs)))
# drop the -1 key if it corresponds to just unidentified points
if length(found_fs) > length(expected_fs)
    found_fs = found_fs[2:end]
end
Member


Nope! This corresponds to just unidentified points only if the method is Featurizing. For Recurrences this key is valid and counts the divergence to infinity! Which is something we need to test anyway for the Henon map!

Contributor Author


Yeah, that's true, I only focused on Featurizing :s haha. The problem is that -1 has a different meaning in the two methods: Featurizing returns -1 if it can't cluster the points, while Recurrences returns -1 if the points diverge.
A quick workaround is to do found_fs = found_fs[2:end] only if the mapper is Featurizing. But maybe it would be better to change how Featurizing labels the points? Maybe label unclustered points as -2 or something?
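The relabeling itself would then be a one-liner (sketch, applied after the featurizing clustering step):

    labels = replace(labels, -1 => -2)   # reserve -1 for divergence (Recurrences), -2 for unclustered features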

@KalelR
Contributor Author

KalelR commented May 10, 2022

Tests pass, so this is good to go, right?

Should be, but I haven't touched the supervised/knn part, so there seems to be some variability in that error. It's strange because it never occurred on my machine.

@Datseris
Member

Alright, but you still need to:

  1. Pull recent master here.
  2. Increment minor version in Project.toml
  3. Add a CHANGELOG.md entry of the changes

@Datseris Datseris merged commit 9658c50 into JuliaDynamics:master May 10, 2022
@Datseris
Member

Note that the changelog is for the users, so you don't have to mention changes in the tests there :D (its okay now)
