describe on columns with e.g. Integers #2269

@KristofferC

Right now, describe is documented with:

If a column's base type derives from Real, :nunique will return nothings.

The worry is that, in general, computing the number of distinct elements requires memory proportional to the number of distinct values, which in the worst case is the length of the column.
There are perhaps a few ways we could tweak things:

  • Compute the exact number of unique values only while it stays below some threshold
  • Above some threshold, use a technique like HyperLogLog to get an approximate number.
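
To make the threshold idea concrete, here is a minimal sketch in Python (used purely for illustration; this is not DataFrames.jl code). The function name `nunique_capped` and the `limit` parameter are hypothetical. It counts exactly until the number of distinct values exceeds the limit, then gives up and returns `None`, so memory stays bounded by the limit rather than by the column length.

```python
from typing import Hashable, Iterable, Optional


def nunique_capped(values: Iterable[Hashable], limit: int) -> Optional[int]:
    """Return the exact number of distinct values, or None once it exceeds `limit`.

    At most `limit + 1` values are ever stored, independent of the column length.
    """
    seen = set()
    for v in values:
        seen.add(v)
        if len(seen) > limit:
            # Too many distinct values: stop instead of growing the set further.
            return None
    return len(seen)


print(nunique_capped([1, 2, 2, 3, 3, 3], limit=10))   # 3
print(nunique_capped(range(1_000_000), limit=1_000))  # None
```

The second call shows the surprise factor: the same code that returned a number on a small column returns `None` once the data grows past the limit.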

The drawback of thresholds is that it could be surprising to write code that works for a given dataframe size and have it suddenly stop working when you load a bigger dataframe. For approximate counting techniques (HyperLogLog), the drawback is that a user might assume the count is always exact and allocate data structures of a certain size based on it, which might then fail when one tries to put in all the distinct values.
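
To illustrate the approximate-counting concern, here is a rough Python sketch of a k-minimum-values estimator. This is a simpler relative of HyperLogLog, not HyperLogLog itself, and not anything DataFrames.jl provides. The name `approx_nunique`, the parameter `k`, and the choice of hashing `repr(v)` are all assumptions made for the example. Memory is bounded by `k`, but the result is only an estimate once more than `k` distinct values have been seen.

```python
import hashlib
import heapq


def approx_nunique(values, k=1024):
    """Estimate the number of distinct values with a k-minimum-values sketch.

    Only the k smallest distinct hash values are kept, so memory is O(k).
    """
    heap = []     # max-heap (via negation) of the k smallest hashes seen so far
    kept = set()  # the same hashes, for O(1) duplicate checks
    for v in values:
        # Assumption: repr(v) identifies a value well enough for this demo.
        digest = hashlib.blake2b(repr(v).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes: the count is exact, barring collisions.
        return len(heap)
    kth_smallest = -heap[0] / 2**64  # normalise the 64-bit hash to [0, 1)
    return round((k - 1) / kth_smallest)


data = [i % 20_000 for i in range(100_000)]
# The estimate is close to the exact count but is generally not equal to it.
print(approx_nunique(data), len(set(data)))
```

A container sized from the estimate may therefore be too small for the actual distinct values, which is exactly the failure mode described above.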

Ref #1435

Labels: breaking, decision