Skip to content

RFC: add bincount #812

Open
Open
@ogrisel

Description

@ogrisel

It seems quite widely adopted:

Although I have not checked how compatible they are.

JAX in particular requires an extra length argument to statically define the length of the output array to make JIT possible and dask uses (nan,) shape to represent a data-dependent output shapes.

Activity

added
RFCRequest for comments. Feature requests and proposed changes.
API extensionAdds new functions or objects to the API.
on Jun 5, 2024
rgommers

rgommers commented on Jun 13, 2024

@rgommers
Member

Thanks for the suggestion @ogrisel!

bincount is indeed a heavily used function. I did a quick tally: scikit-learn currently has 117 usages of np.bincount, and SciPy 38. It's also used as a primitive for histogram & co, and it's not easy to implement efficiently with other existing functions in the standard. So from that perspective, I think it meets the bar for inclusion fairly easily.

Signatures do all seem compatible. There is at least one difference in behavior I spotted: the accepted input is non-negative integer values, but the case of negative values is not treated the same - it may raise (NumPy), or clip (JAX).

The NumPy issue tracker contains a lot of open issues about bincount, so the constraints on values, weights etc. seem to need careful specification.

The PyTorch docs also contain a warning about non-deterministic gradients - probably related to the data-dependent behavior:

JAX in particular requires an extra length argument to statically define the length of the output array to make JIT possible and dask uses (nan,) shape to represent a data-dependent output shapes.

Yes, that's non-ideal. CuPy also adds a warning about bincount possibly causing synchronization.

The number of APIs that use data-dependent shapes is still quite low. As of now, I think this is the full list:

  • boolean indexing
  • the four unique_* functions
  • repeats
  • nonzero
NeilGirdhar

NeilGirdhar commented on Jun 15, 2024

@NeilGirdhar

Just curious, but are you sure you want to keep using the name bincount? It seems redundant: "binning" means counting. What about bin_integers, bin, bucket, or integer_histogram?

ogrisel

ogrisel commented on Jun 16, 2024

@ogrisel
Author

To me, 'binning' means assigning to a bin. E.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices) typically represented as integers. This is a synonym of discretization.

Then you can aggregate those values as a histogram of per bin counts if you wish, but it's not the only thing you can do. In machine learning, binned values are used in many various ways and only some involve histograms, and often only filtered subsets of the original binned array.

NeilGirdhar

NeilGirdhar commented on Jun 16, 2024

@NeilGirdhar

To me, 'binning' means assigning to a bin. E.g. an array of continuous values (floating point values) is transformed into an array of discrete values (bin indices) typically represented as integers. This is a synonym of discretization.

Right, I agree that that's what happens when you "bin" floating point values. This function (bincount) is just the integer version of binning floating point values. In this integer case, bins can be discrete values (or they could be ranges even though bincount doesn't support ranges).

Then you can aggregate those values as a histogram of per bin counts if you wish,

I understand what you're saying. But, it's also a fact that the result of bincount is by definition already a histogram. This is clear because you can convert any call to bincount into a similar call to histogram (possibly massaging the output to get the types right and shapes right).

I still think that "bincount" is a bad name. It is just "binning", and its name should reflect the technical term that people use. No one calls it the "bincount" or "bincounting".

If I had to guess, I think its name reflects a very human tendency to combine two names that would have each been fine:

  • numpy.bin would have been fine, or
  • numpy.count might have been okay.

I think someone just glued these two names. This is like when people say "step foot" when "set foot" or "step" would each be fine, but the combination just isn't.

NeilGirdhar

NeilGirdhar commented on Jun 16, 2024

@NeilGirdhar

An alternative to choosing a different name would be to replace bincount with a true integer_histogram:

def integer_histogram(x,
                      /,
                      bins: int | Array,
                      range: None | tuple[int, int] = None,
                      density: bool = False,
                      weights: Array
                      ) -> Array:
    """Compute the histogram of an integer dataset.

    Args:
        x: The input data.  The histogram is computed over the flattened array.
        bins: If bins is an int, it defines the number of equal-width bins in the given range.
            If bins is an integer array, it defines a monotonically increasing array of bin edges,
            including the rightmost edges, which allows for non-uniform bin widths.
        range: The lower and upper range of the bins.  range[1] - range[0] + 1 must be divisible by
            the integer bins.
        weights: An array of weights, of the same shape as a. Each value in a only contributes its
            associated weight towards the bin count (instead of 1). If density is True, the weights
            are normalized, so that the integral of the density over the range remains 1.
        density: If False, the result will contain the number of samples in each bin. If True, the
            result is the value of the probability density function at the bin, normalized such that
            the integral over the range is 1. Note that the sum of the histogram values will not be
            equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
    """

It's just a generalization that supports normalization and general-width bins.

Scanning how bincount is actually used, it seems that normalization is very common. So even if we don't add range and a general bins, I think it might be good to add density to cover this common case.

added a commit that references this issue on Jun 12, 2025
linked a pull request that will close this issue on Jun 12, 2025
kgryte

kgryte commented on Jun 12, 2025

@kgryte
Contributor

A PR adding bincount to the specification is up for review: #960

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    API extensionAdds new functions or objects to the API.Needs DiscussionNeeds further discussion.RFCRequest for comments. Feature requests and proposed changes.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @ogrisel@rgommers@NeilGirdhar@kgryte

      Issue actions

        RFC: add bincount · Issue #812 · data-apis/array-api