Description
It seems quite widely adopted:
- https://numpy.org/doc/stable/reference/generated/numpy.bincount.html
- https://pytorch.org/docs/stable/generated/torch.bincount.html
- https://docs.cupy.dev/en/stable/reference/generated/cupy.bincount.html
- https://jax.readthedocs.io/en/latest/_autosummary/jax.numpy.bincount.html
- https://docs.dask.org/en/stable/generated/dask.array.bincount.html
However, I have not checked how compatible they are. JAX in particular requires an extra `length` argument to statically define the length of the output array, which makes JIT compilation possible, and Dask uses a `(nan,)` shape to represent data-dependent output shapes.
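To illustrate the difference between a data-dependent and a statically sized output, here is a minimal pure-Python sketch. It models the semantics only and is not any library's implementation; `bincount_static` is a hypothetical helper name, and silently ignoring out-of-range values there is an assumption of this sketch.

```python
def bincount(x, minlength=0):
    """Count occurrences of each non-negative integer in x.

    The output length is max(x) + 1 (or minlength if larger), so it
    depends on the data, not just on the input shape.
    """
    if any(v < 0 for v in x):
        raise ValueError("bincount expects non-negative integers")
    size = max(max(x, default=-1) + 1, minlength)
    counts = [0] * size
    for v in x:
        counts[v] += 1
    return counts


def bincount_static(x, length):
    """Hypothetical variant whose output length is fixed up front.

    Assumption of this sketch: values outside [0, length) are ignored.
    """
    counts = [0] * length
    for v in x:
        if 0 <= v < length:
            counts[v] += 1
    return counts


# The first output length changes with the largest value in the input.
print(len(bincount([0, 1, 1, 3])), len(bincount([0, 1, 1, 5])))
# The second is always `length`, whatever the values are.
print(bincount_static([0, 1, 1, 5], length=4))
```

With a fixed `length`, the result shape is known before looking at the values, which is the property a tracing compiler needs.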
Activity
rgommers commented on Jun 13, 2024
Thanks for the suggestion @ogrisel!
`bincount` is indeed a heavily used function. I did a quick tally: scikit-learn currently has 117 usages of `np.bincount`, and SciPy 38. It's also used as a primitive for `histogram` & co, and it's not easy to implement efficiently with other existing functions in the standard. So from that perspective, I think it meets the bar for inclusion fairly easily.
Signatures do all seem compatible. There is at least one difference in behavior I spotted: the accepted input is non-negative integer values, but the case of negative values is not treated the same - it may raise (NumPy), or clip (JAX).
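The two treatments of negative values mentioned above can be contrasted in a small pure-Python sketch. The `on_negative` parameter is hypothetical and exists only for illustration; no library exposes such a switch as far as this sketch assumes.

```python
def bincount_negative(x, on_negative="raise"):
    """Count non-negative integers, with a selectable policy for negatives.

    on_negative="raise": reject negative values with an error.
    on_negative="clip":  treat negative values as 0.
    """
    if on_negative not in ("raise", "clip"):
        raise ValueError("on_negative must be 'raise' or 'clip'")
    cleaned = []
    for v in x:
        if v < 0:
            if on_negative == "raise":
                raise ValueError("negative value in input")
            v = 0  # clip to the first bin
        cleaned.append(v)
    counts = [0] * (max(cleaned, default=-1) + 1)
    for v in cleaned:
        counts[v] += 1
    return counts


# Clipping folds the negative value into bin 0.
print(bincount_negative([-2, 0, 1], on_negative="clip"))
```

The same input therefore either fails or silently changes the count of bin 0, which is why the behavior would need to be pinned down or left explicitly unspecified.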
The NumPy issue tracker contains a lot of open issues about `bincount`, so the constraints on values, weights etc. seem to need careful specification.
The PyTorch docs also contain a warning about non-deterministic gradients - probably related to the data-dependent behavior:
Yes, that's non-ideal. CuPy also adds a warning about `bincount` possibly causing synchronization.
The number of APIs that use data-dependent shapes is still quite low. As of now, I think this is the full list:
- `unique_*` functions
- `repeats`
- `nonzero`
NeilGirdhar commented on Jun 15, 2024
Just curious, but are you sure you want to keep using the name `bincount`? It seems redundant: "binning" means counting. What about `bin_integers`, `bin`, `bucket`, or `integer_histogram`?
ogrisel commented on Jun 16, 2024
To me, 'binning' means assigning to a bin. E.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices) typically represented as integers. This is a synonym of discretization.
Then you can aggregate those values as a histogram of per bin counts if you wish, but it's not the only thing you can do. In machine learning, binned values are used in many different ways and only some involve histograms, and often only filtered subsets of the original binned array.
NeilGirdhar commented on Jun 16, 2024
Right, I agree that that's what happens when you "bin" floating point values. This function (`bincount`) is just the integer version of binning floating point values. In this integer case, bins can be discrete values (or they could be ranges even though `bincount` doesn't support ranges).
I understand what you're saying. But, it's also a fact that the result of `bincount` is by definition already a histogram. This is clear because you can convert any call to `bincount` into a similar call to `histogram` (possibly massaging the output to get the types and shapes right).
I still think that "bincount" is a bad name. It is just "binning", and its name should reflect the technical term that people use. No one calls it the "bincount" or "bincounting".
If I had to guess, I think its name reflects a very human tendency to combine two names that would have each been fine: `numpy.bin` would have been fine, or `numpy.count` might have been okay.
I think someone just glued these two names. This is like when people say "step foot" when "set foot" or "step" would each be fine, but the combination just isn't.
NeilGirdhar commented on Jun 16, 2024
An alternative to choosing a different name would be to replace `bincount` with a true `integer_histogram`:
It's just a generalization that supports normalization and general-width bins.
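To make "normalization" concrete, here is a minimal pure-Python sketch. `normalized_bincount` is a hypothetical helper name used only for illustration; it is not an existing function and not the signature proposed in this thread.

```python
def normalized_bincount(x, minlength=0):
    """Return per-value relative frequencies instead of raw counts.

    With unit-width integer bins, dividing each count by the total
    number of samples gives values that sum to 1.
    """
    size = max(max(x, default=-1) + 1, minlength)
    counts = [0] * size
    for v in x:
        counts[v] += 1
    total = sum(counts)
    if total == 0:
        # Assumption of this sketch: empty input yields all zeros.
        return [0.0] * size
    return [c / total for c in counts]


print(normalized_bincount([0, 1, 1, 3]))
```

This is the pattern of calling `bincount` and then dividing by the sample count, folded into a single step.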
Scanning how `bincount` is actually used, it seems that normalization is very common. So even if we don't add `range` and a general `bins`, I think it might be good to add `density` to cover this common case.
feat: add `bincount` to the specification #960
kgryte commented on Jun 12, 2025
A PR adding `bincount` to the specification is up for review: #960