Right now, describe is documented with:
If a column's base type derives from Real, :nunique will return nothings.
The worry is that, in general, computing the number of distinct elements requires memory proportional to the length of the column, since in the worst case every value is distinct and all of them have to be remembered.
There are perhaps a few ways we could tweak things:
- Compute the exact number of unique values only as long as it stays below some threshold, and return nothing otherwise.
- Above some threshold, use a technique like HyperLogLog to get an approximate number.
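To make the first option concrete, here is a minimal sketch of a thresholded count in plain Julia. The name `bounded_nunique` and the `limit` argument are hypothetical and only for illustration; nothing like this exists in DataFrames.jl as far as this issue is concerned.

```julia
# Hypothetical helper, not part of DataFrames.jl: count distinct values
# exactly, but give up as soon as more than `limit` distinct values are seen.
function bounded_nunique(col, limit::Integer)
    seen = Set{eltype(col)}()
    for x in col
        push!(seen, x)
        # The set never holds more than limit + 1 values, so memory is bounded.
        length(seen) > limit && return nothing
    end
    return length(seen)
end

bounded_nunique([1.0, 2.0, 2.0, 3.0], 10)   # returns 3
bounded_nunique(1:1_000_000, 1_000)         # returns nothing
```

The set is capped at `limit + 1` entries, so the cost no longer grows with the column length, at the price of returning `nothing` for high-cardinality columns.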
The drawback of thresholds is that it could be surprising to write code that works for a given data frame size and have it suddenly stop working when you load a bigger data frame. For approximate counting techniques such as HyperLogLog, the drawback is that a user might assume the count is always exact and allocate data structures sized from it, which could then fail when they try to store all the distinct values.
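For reference, a rough sketch of what approximate counting could look like. This is a simplified HyperLogLog written from scratch in plain Julia, not an existing DataFrames.jl API; the names `HLL`, `add!` and `estimate` are made up for this example, and it assumes `Base.hash` mixes its input well enough to serve as the hash function.

```julia
# Simplified HyperLogLog, for illustration only (all names are hypothetical).
struct HLL
    p::Int                     # number of bits used to pick a register
    registers::Vector{UInt8}   # 2^p small counters, fixed memory
end

# The bias constant used in `estimate` assumes at least 128 registers (p >= 7).
HLL(p::Integer=12) = HLL(p, zeros(UInt8, 1 << p))

function add!(h::HLL, x)
    hv = UInt64(hash(x))
    idx = Int(hv >> (64 - h.p)) + 1          # top p bits select a register
    rest = hv << h.p                         # remaining bits
    rank = UInt8(min(leading_zeros(rest), 64 - h.p) + 1)
    h.registers[idx] = max(h.registers[idx], rank)
    return h
end

function estimate(h::HLL)
    m = length(h.registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m^2 / sum(r -> 2.0^(-Int(r)), h.registers)
    empty = count(==(0), h.registers)
    # Small-range correction: fall back to linear counting.
    if raw <= 2.5m && empty > 0
        return m * log(m / empty)
    end
    return raw
end

h = HLL()
foreach(x -> add!(h, x), 1:100_000)
approx = estimate(h)   # close to 100_000, but not guaranteed to be exact
```

With 2^12 registers the sketch uses 4 KiB no matter how long the column is, and the expected relative error is on the order of a couple of percent. That is exactly why the result should not be used to size a container that must hold every distinct value.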
Ref #1435
nalimilan