Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

countmap should sort values #223

Closed
femtotrader opened this issue Nov 8, 2016 · 15 comments
Closed

countmap should sort values #223

femtotrader opened this issue Nov 8, 2016 · 15 comments

Comments

@femtotrader
Copy link

femtotrader commented Nov 8, 2016

According doc http://statsbasejl.readthedocs.io/en/latest/counts.html#countmap
countmap returns a dictionary that maps distinct values in x to their counts (or total weights).

It may be useful to sort these values (using ascending order)

So maybe an OrderedDict may be a better data structure for this.

Use case:
When you define a timeseries, it may help to find the most prevalent period between 2 timestamps (sampling period).
You will have to sort this OrderedDict using descending order of values
Most prevalent period will be the first value of this OrderedDict.

When timestamps are regulary spaced you get a dict with only one key (for which value is sampling period)

Python Pandas have a value_counts property for Series which is very useful.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

@nalimilan
Copy link
Member

OrderedDict sorts values according to insertion order, which isn't likely to be the most useful choice. SortedDict would be more appropriate. But sorting entries based on their counts is going to hurt performance if it's done during the computation, i.e. when the counts change. So it would have to be done at the end only.

Then there's the question of type stability: returning a SortedDict when sort=false wouldn't make sense, but we can't change the return type depending on the value of an argument either. Not sure what to do. Should we always sort the result? I guess some people could also want sorting based on the keys instead, so it could make sense.

(FWIW, freqtable in package FreqTables already sorts the result, but based on keys.)

@femtotrader
Copy link
Author

Thanks @nalimilan

julia> using Dates: Minute
julia> a = [Minute(4), Minute(4), Minute(4), Minute(1), Minute(1), Minute(2)]
julia> using StatsBase
julia> d = countmap(a)

SortedDict sorts the result, but based on keys

julia> using DataStructures: SortedDict, Reverse
julia> SortedDict(d, Reverse)
DataStructures.SortedDict{Base.Dates.Minute,Int64,Base.Order.ReverseOrdering{Base.Order.ForwardOrdering}} with 3 entries:
  4 minutes => 3
  2 minutes => 1
  1 minute  => 2

freqtable sorts the result, but based on values

julia> using FreqTables
julia> sort(freqtable(a), rev=true)
3-element Named Array{Int64,1}
Dim1      │
──────────┼──
4 minutes │ 3
1 minute  │ 2
2 minutes │ 1

so freqtable may be a solution but I was expecting something in StatsBase.

@nalimilan
Copy link
Member

You could return an OrderedDict which you could create manually by inserting pairs in a sorted order, but you'd need to find an algorithm to do that efficiently. And that doesn't resolve the API issues.

@femtotrader
Copy link
Author

femtotrader commented Nov 9, 2016

Maybe we could have an other function orderedcountmap which will always return an OrderedDict and have parameter to define how it should be sorted.

@nalimilan
Copy link
Member

A more Julian approach would be to have countmap(Dict, ...) and countmap(OrderedDict, ...).

@femtotrader
Copy link
Author

femtotrader commented Nov 9, 2016

According http://stackoverflow.com/questions/29848734/is-it-possible-to-sort-a-dictionary-in-julia

We can get an OrderedDict sort by value (descending) using

using DataStructures: OrderdDict
OrderedDict(sort(collect(d), by=x->x[2], rev=true))

This PR JuliaCollections/DataStructures.jl#43 should help

@nalimilan
Copy link
Member

Yes, but that's kind of wasteful since it creates an intermediate array. Maybe acceptable as a temporary solution until that PR is merged, though.

@ararslan
Copy link
Member

ararslan commented Nov 9, 2016

A more Julian approach would be to have countmap(Dict, ...) and countmap(OrderedDict, ...).

It's never too late to add those methods. 😉

@femtotrader
Copy link
Author

femtotrader commented Nov 14, 2016

When merged, this PR JuliaCollections/DataStructures.jl#225 should help

Is it possible to add DataStructures as a dependency of StatsBase?

@ararslan
Copy link
Member

I'd quite prefer we not add that dependency. If we have a method like

countmap{T<:Associative}(::Type{T}, ...)

that should allow countmap(OrderedDict, ...) without requiring an extra dependency.

@andreasnoack
Copy link
Member

@femtotrader Why not just load DataStructures in your package and sort the Dict?

@femtotrader
Copy link
Author

@andreasnoack I know this, with JuliaCollections/DataStructures.jl#225 we should be able to do

using DataStructures
d = sort(countmap(a), byvalue=true)

which can be quite good for dict with few keys/values

but I like @ararslan idea of defining

countmap{T<:Associative}(::Type{T}, ...)

because if a SortedDict is passed, countmap will be sort using keys.

We just need to find a SortedValuesDict which sort Dict using values

@andreasnoack
Copy link
Member

Are you sure that it wouldn't be as fast or faster to sort after the creation of the Dict?

@ararslan
Copy link
Member

I'm rather skeptical that any kind of special support for this needs to be in StatsBase beyond what's currently there. For your use case, it sounds like the easiest approach would just be to use DataStructures.jl on your end and sort the dict after creation, especially since (AFAIK) no SortedValuesDict currently exists.

Regarding my suggestion for countmap, I realized that we currently have a method for the exported function addcounts! which is actually what countmap is calling:

addcounts!{T}(cm::Dict{T}, x::AbstractArray{T})

If we widen that to

addcounts!{T}(cm::Associative{T,Int}, x::AbstractArray{T})

then countmap with a SortedDict is just addcounts!(SortedDict(), x).

@nilshg
Copy link

nilshg commented Mar 28, 2023

I agree with Alex and the fact that no one has answered in the last 6.5 years means this can probably be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants