
The "binedges non-unique" error #33

Open
xukai92 opened this issue Feb 20, 2022 · 9 comments
Open

The "binedges non-unique" error #33

xukai92 opened this issue Feb 20, 2022 · 9 comments

Comments

xukai92 commented Feb 20, 2022

I ran into the "binedges non-unique" error from https://github.com/sisl/Discretizers.jl/blob/master/src/disc_uniformcount.jl#L49.
I saw there is a to-do note about making the algorithm handle this properly. Can someone explain what needs to be done?
I'm happy to give it a try if someone gives me some guidance. Thanks!

tawheeler (Contributor)

Thanks for volunteering!

There are multiple ways this could be handled. The method in question discretizes such that each bin has roughly the same number of items within it. However, things get tricky when boundaries happen to fall on repeated samples.

Consider trying to discretize the following into two bins with uniform counts:
[0, 1, 1, 1, 1, 1, 2, 2]
In this case, we probably want bins [0,1] and (1,2] to get 6 and 2 items per bin.
However, for the following:
[0, 0.5, 1, 1, 1, 1, 1, 2]
we probably want bins [0,0.5] and (0.5,2] to get 2 and 6 items per bin.
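
For concreteness, the first case is the kind of input that can trigger the error in the current implementation (a sketch using the binedges/DiscretizeUniformCount API that comes up later in this thread):

using Discretizers

# With heavy ties, the equal-count split can land two edges on the same
# value, which is what raises "binedges non-unique":
data = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
binedges(DiscretizeUniformCount(2), data)  # can raise: binedges non-unique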

There is probably a nice way to handle this: do some initial binning, then iterate toward a better binning by narrowing bins with many elements and widening bins with very few elements.
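
A very rough sketch of that iterative idea (a hypothetical design, not anything currently in Discretizers.jl; iterated_binedges is an invented name):

# Start from equal-width edges, then repeatedly nudge each interior edge
# into whichever neighbouring bin currently holds more items, so full
# bins narrow and sparse bins widen.
function iterated_binedges(x::AbstractVector, nbins::Integer;
                           iters::Integer=50, step::Float64=0.25)
    lo, hi = extrema(x)
    edges = collect(range(float(lo), float(hi); length=nbins + 1))
    for _ in 1:iters
        counts = [count(v -> edges[i] <= v < edges[i+1], x) for i in 1:nbins]
        counts[nbins] += count(==(hi), x)  # last bin is closed on the right
        for j in 2:nbins
            if counts[j] > counts[j-1]      # right neighbour fuller: narrow it
                edges[j] += step * (edges[j+1] - edges[j])
            elseif counts[j] < counts[j-1]  # left neighbour fuller: narrow it
                edges[j] -= step * (edges[j] - edges[j-1])
            end
        end
    end
    return edges
end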

xukai92 commented Feb 21, 2022

Thanks for the explanation!
I understand the problem better now!

I took a look at scikit-learn's implementation, and it looks like they simply remove the repeated edges; see https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L226.

Also, their main implementation is much simpler: https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L204 (they call uniform count "quantile", since it is just uniform width applied to the quantiles). Have we considered taking that implementation? If so, why did we choose the current one, i.e., what is its benefit?
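
In Julia terms, a rough equivalent of their quantile strategy plus the edge trimming might look like this (a sketch under my reading of their code; quantile_binedges is an invented name):

using Statistics

# Sketch: take nbins+1 evenly spaced quantiles as edges, then drop edges
# that create bins narrower than 1e-8, as scikit-learn's fit() does.
function quantile_binedges(x::AbstractVector, nbins::Integer)
    edges = quantile(x, range(0, 1; length=nbins + 1))
    keep = [true; diff(edges) .> 1e-8]  # always keep the leftmost edge
    return edges[keep]
end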

xukai92 commented Feb 21, 2022

Another way I just read about that handles this while still achieving uniform counts is random binning (see https://bkamins.github.io/julialang/2020/12/11/binning.html), but it doesn't look compatible with our current pipeline, since it doesn't compute binedges first.
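
Roughly, the idea is to rank the data with random tie-breaking and assign bins by rank, so every bin gets an (almost) equal count but there are no explicit edges (a sketch of my understanding of that post; random_uniform_count_bins is an invented name):

using Random

# Sketch: shuffle indices first so ties are broken at random; a stable
# sort by value then preserves that random order among equal values.
function random_uniform_count_bins(x::AbstractVector, nbins::Integer;
                                   rng::AbstractRNG=Random.default_rng())
    n = length(x)
    idx = shuffle(rng, collect(eachindex(x)))
    order = sort(idx; by=i -> x[i], alg=MergeSort)  # MergeSort is stable
    bins = Vector{Int}(undef, n)
    for (rank, i) in enumerate(order)
        bins[i] = cld(rank * nbins, n)  # bin labels 1..nbins, counts differ by <= 1
    end
    return bins
end

random_uniform_count_bins([0, 1, 1, 1, 1, 1, 2, 2], 2)  # exactly 4 items per bin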

tawheeler (Contributor)

You are referring to this?

# Remove bins whose width are too small (i.e., <= 1e-8)
if self.strategy in ("quantile", "kmeans"):

It looks like fit for scikit-learn decides the number of bins for you; that code is removing bins that are too small. Our code receives the desired number of bins as an input, which is different.

If you want a method that decides on the number of bins for you, then maybe consider BayesianBlocks.

There are a lot of possible discretization methods - feel free to submit a PR for anything new if you want the functionality.

xukai92 commented Feb 21, 2022

It looks like fit for scikit-learn decides the number of bins for you.

Yes and no.
The fit function only changes the requested number of bins when this "binedges non-unique" situation happens; otherwise it uses the specified number of bins.
For reference, a Julia equivalent of their uniform-count implementation, computing binedges from nbins, is:

using Statistics
qs = range(0, 1; length=nbins + 1)  # nbins+1 edges give nbins bins
binedges = quantile.(Ref(x), qs)

This implementation looks neat to me, and I wonder what the advantage of the one in our code base is, since it seems to do something more complex.

PS: In my view, when we use the post-processing step of removing bins whose width is too small (i.e., <= 1e-8), the user's input effectively becomes the maximum number of bins, which seems like a sensible default behavior.
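
For example, with the hypothetical quantile_binedges sketch from above, nbins acts as an upper bound:

# Requesting 4 bins on tie-heavy data returns only 3 after trimming:
quantile_binedges([0, 1, 1, 1, 1, 1, 2, 2], 4)  # ≈ [0.0, 1.0, 1.25, 2.0]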

tawheeler (Contributor)

Using the builtin quantile sounds great. While what goes on there is more complex than what we do, the code in the codebase approximates that. If we can avoid binning problems, then I'd say it is a fine adjustment.

using Statistics

quantiles = range(0, 1; length=nbins + 1)
binedges = quantile!(data, quantiles)  # note: quantile! sorts `data` in place

If you want to include an option for trimming bins, that's fine. I wouldn't do it by default in order to preserve backwards compatibility, but I am not too opinionated there.

xukai92 commented Feb 22, 2022

I can do a PR with the new implementation, with an option for trimming.
Whether backwards compatibility is possible for DiscretizeUniformCount is a question, though: the new implementation generates quite different binedges. For example,
binedges(DiscretizeUniformCount(2), [1, 2]) was [1, 2, 2] and would become [1.0, 1.5, 2.0].
It looks better to add a new struct called DiscretizeQuantileCount. What do you think?

tawheeler (Contributor)

Right. Ideally, a uniform count algorithm for 3 bins on the data [1,2,3,4,5,6,6,6,6,6,6,6,6,6] would give something like [1,3], [4,5], [6,6+eps()], and not the two bins [1,5], [6,6+eps()] that the quantile method would produce. A DiscretizeQuantile struct sounds great. Thank you!
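
Continuing the hypothetical quantile_binedges sketch from earlier in the thread, the collapse is easy to see:

data = [1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6]
quantile_binedges(data, 3)  # ≈ [1.0, 5.33, 6.0]: the third bin collapsed, leaving two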

xukai92 commented Feb 22, 2022

PR #34 created. Please review.
