
Added weighted median function and tests #90

Merged
merged 1 commit into from
Oct 11, 2014

Conversation

tinybike
Contributor

No description provided.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 76056f4 on tensorjack:master into ae435d1 on JuliaStats:master.

@nalimilan
Member

Thanks for this!

I thought it might be a good idea to provide more tests for corner cases, which are where things usually break. For example, when only one value is passed; when all values are the same; when all weights are zero; when some weights are negative.

##### Weighted median #####

function Base.median{W<:Real}(v::RealVector, w::WeightVec{W})
sorted = sortrows([v w.values])
Member

AFAICT sortrows is going to sort on both columns, right? Couldn't you sort only on the first column? I think you may save some memory by first saving [v w.values] in an object, and then calling sort! on it.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 43f7bbb on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling fa6e783 on tensorjack:master into ae435d1 on JuliaStats:master.

if any(mask)
v = v[mask]
wt = w.values[mask]
sorted = sortrows([v wt])
Member

I'm still not sure that's the best solution. [v wt] is going to require copying the whole vectors, which may be really big, and then create another sorted copy. It may be better to call s = sortperm(v), and instead of sorting [v wt] iterate over s in the loop below, accessing wt[s[below_midpoint_index]]. This may be slower, not sure, but it may be much better for very large vectors (which is where performance matters for such a function).

Base.median() uses select() to avoid sorting the whole vector, but I don't know whether there's an equivalent function for sortperm.

But the question of efficiently computing the median has probably been discussed many times before. Have you done some research to find this algorithm?

Contributor

Base.median() uses select() to avoid sorting the whole vector, but I don't know whether there's an equivalent function for sortperm.

A selectperm function would look like this (based on select and sortperm):

import Base.Sort: Ordering, Perm, Forward, ord
selectperm(v::AbstractVector, k::Union(Int,OrdinalRange);
    lt::Function=isless, by::Function=identity, rev::Bool=false, order::Ordering=Forward) =
    select!([1:length(v)], k, Perm(ord(lt,by,rev,order),v))

Note that this does create a vector of Ints the same length as v, but that's still probably better in most cases than copying the whole vector and sorting.

Example usage:

julia> x = rand(5)
5-element Array{Float64,1}:
 0.677481
 0.107243
 0.525638
 0.297544
 0.748513

julia> sortperm(x)
5-element Array{Int64,1}:
 2
 4
 3
 1
 5

julia> selectperm(x, 2)
4

julia> selectperm(x, 1:3)
3-element Array{Int64,1}:
 2
 4
 3

This could be submitted to Base if people thought it were useful.

Contributor Author

I'm still not sure that's the best solution. [v wt] is going to require copying the whole vectors, which may be really big, and then create another sorted copy. It may be better to call s = sortperm(v), and instead of sorting [v wt] iterate over s in the loop below, accessing wt[s[below_midpoint_index]]. This may be slower, not sure, but it may be much better for very large vectors (which is where performance matters for such a function).

Good call on using sortperm. I updated the gist to include a sortperm implementation:

https://gist.github.com/tensorjack/432bdbaa2aff38ee64ee

It's 25x faster than the old version, and also decreases memory use by 89%. Going to send over a pull request with the new, improved version :)
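For readers following the thread, here is a minimal Python sketch of the sortperm-based approach described above (an illustration of the idea, not the actual StatsBase code or the linked gist; `weighted_median` is a hypothetical name, and the sketch assumes a non-empty vector with strictly positive weights):

```python
# Illustrative sketch: weighted median via an index permutation,
# avoiding a sorted copy of the [v w] matrix. Assumes non-empty
# input and strictly positive weights.
def weighted_median(values, weights):
    # Analogue of Julia's sortperm(v): sort indices by value only.
    perm = sorted(range(len(values)), key=lambda i: values[i])
    midpoint = 0.5 * sum(weights)
    cumulative = 0.0
    for i, p in enumerate(perm):
        cumulative += weights[p]
        if cumulative > midpoint:
            return values[p]
        if cumulative == midpoint:
            # Exactly half the mass lies at or below this point:
            # average with the next value, as for the unweighted median.
            return 0.5 * (values[p] + values[perm[i + 1]])
```

Only the index vector is allocated and sorted; the data and weight vectors are read in place through the permutation, which is where the memory savings come from.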

Member

You can update your existing fork and this pull request will reflect the changes.

@nalimilan
Member

Actually there seems to be a complete implementation in C++ under the MIT license here. It's a bit scary, but there's a simple quickSelect function there, which could be a nice addition to Base, similar to the currently existing select, but with weights.

https://www.eecis.udel.edu/~rauh/fqs/
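The weighted quickselect idea can be paraphrased in Python roughly as follows (a sketch of the general technique, not a port of the linked C++ code; at an exact half-weight tie either neighboring value is a valid weighted median, and this version returns the lower one):

```python
import random

# Sketch of weighted quickselect: partition around a random pivot and
# recurse into whichever side contains the half-weight point, giving
# expected O(n) time instead of O(n log n) for a full sort.
def weighted_select(values, weights, target=None):
    if target is None:
        target = 0.5 * sum(weights)  # the half-weight point
    pivot = random.choice(values)
    lo = [(v, w) for v, w in zip(values, weights) if v < pivot]
    eq_w = sum(w for v, w in zip(values, weights) if v == pivot)
    hi = [(v, w) for v, w in zip(values, weights) if v > pivot]
    lo_w = sum(w for _, w in lo)
    if target < lo_w:
        # Half-weight point falls strictly below the pivot.
        return weighted_select([v for v, _ in lo], [w for _, w in lo], target)
    if target <= lo_w + eq_w:
        return pivot
    # Otherwise recurse above the pivot, shifting the target.
    return weighted_select([v for v, _ in hi], [w for _, w in hi],
                           target - lo_w - eq_w)
```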

@nalimilan
Member

@tensorjack Do you think you could try porting wquickSelect() from the above C++ code to see how it performs? It would probably be even faster than your new implementation. The relevant function is at wmedianf_impl.cpp:1017.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 3f3623a on tensorjack:master into ae435d1 on JuliaStats:master.

@simonster
Member

This should probably throw an error or return nan(eltype(v)) if there are no non-zero weights. Right now it will return nothing.

wt = w.values[mask]
midpoint = 0.5 * sum(wt)
if any(wt .> midpoint)
first(v[wt .== maximum(wt)])
Member

I think it would be equivalent but more efficient if you compute maxval, maxind = findmax(wt), check if maxval > midpoint, and if so, return v[maxind].
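In Python terms, the suggested findmax shortcut amounts to the following (illustrative only; `dominant_value` is a hypothetical name):

```python
# Find the single heaviest weight once, instead of building the
# boolean masks wt .> midpoint and wt .== maximum(wt).
def dominant_value(v, wt):
    midpoint = 0.5 * sum(wt)
    maxind = max(range(len(wt)), key=wt.__getitem__)  # findmax's index
    if wt[maxind] > midpoint:
        return v[maxind]  # this value alone holds over half the mass
    return None  # no single dominant value; fall through to the scan
```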

@coveralls

Coverage Status

Coverage decreased (-0.03%) when pulling 3c24a2c on tensorjack:master into ae435d1 on JuliaStats:master.

cumulative_weight += wt[p]
end
if cumulative_weight == midpoint
0.5 * (v[permute[i-2]] + v[permute[i-1]])
Member

Sorry, a second look-through and I have more. This should be middle(v[permute[i-2]], v[permute[i-1]]) and the below case should be middle(v[permute[i-1]]) for type stability.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling e4efed8 on tensorjack:master into ae435d1 on JuliaStats:master.

@StefanKarpinski
Contributor

I had test cases with zero weights, but I added the warn() to alert the user that negative/zero weights are being ignored, which made the output feel spammy. What I can do is add a keyword argument that shuts off the warnings, and use that for tests.

Can't find the comment now, but I have an email. I suspect that it's better to simply allow zero weights with well-defined behavior than to have a warn/nowarn option that's going to have to go on an endless number of functions.

@tinybike
Contributor Author

Can't find the comment now, but I have an email. I suspect that it's better to simply allow zero weights with well-defined behavior than to have a warn/nowarn option that's going to have to go on an endless number of functions.

That makes sense. I can remove the warn altogether. The behavior would then be that zero and negative weights are silently ignored. (An error is only thrown if all weights are zero or negative.) Is there a way to add "docstring" text? I think if this behavior were mentioned in help, it would be a non-issue.

@StefanKarpinski
Contributor

I think that negative weights are quite different from zero weights, no? Negative seems to me to be a genuine error while zero is the limit of a positive weight.

@tinybike
Contributor Author

I think that negative weights are quite different from zero weights, no? Negative seems to me to be a genuine error while zero is the limit of a positive weight.

That's a good point. A zero weight just means "this observation is meaningless" (or, "this observation was repeated zero times"). Negative weights are always errors, as I understand it. In light of this, it probably makes the most sense to throw an error when there are any negative weights, silently remove any zero weights, and get rid of the warning altogether.

@grayclhn

There are plenty of cases in stats where negative weights are not errors: that lets the other weights have mass greater than one. The examples I know off the top of my head are in time-series, but I'm sure they show up elsewhere too.

@nalimilan
Member

There are plenty of cases in stats where negative weights are not errors: that lets the other weights have mass greater than one. The examples I know off the top of my head are in time-series, but I'm sure they show up elsewhere too.

But then what meaning would negative weights convey in the case of median?

@andreasnoack
Member

@grayclhn Just curious. What are your examples?

@grayclhn

grayclhn commented Oct 1, 2014

Covariance matrix estimators that account for serial correlation: the "Quadratic Spectral" kernel satisfies some optimality properties for the estimator of the variance-covariance matrix, and it takes on small negative values in places. Mathworks, of all places, has a nice plot (don't worry, no code at the link...)
http://www.mathworks.com/help/econ/hac.html#btt5ta4-1

@StefanKarpinski
Contributor

So how are negative weights treated? As negative multipliers where the sum still needs to be 1? If so, then maybe we just leave it up to the caller to make sure that weights are non-negative or positive as appropriate given the circumstances.

@grayclhn

grayclhn commented Oct 1, 2014

In the cases I'm familiar with, the negative values don't need to be treated any differently than the positive cases. Here it looks like the algorithm should work fine even if some of the weights are negative.

@StefanKarpinski
Contributor

That seems pretty reasonable to me.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling f5b81ea on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling b1b08ac on tensorjack:master into ae435d1 on JuliaStats:master.

@tinybike
Contributor Author

tinybike commented Oct 1, 2014

Ok, I changed mask so that it allows negative weights, which are treated the same way as positive weights. I also removed the verbose keyword and its accompanying warn. Let me know if there are other changes you would like to see, or feel free to merge this request if you guys think it's ready.

@johnmyleswhite
Member

I'd like to see some docs. In particular, a mathematical definition of what's happening here is good.

@tinybike
Contributor Author

tinybike commented Oct 2, 2014

Sure. Does Julia have a docstring-style format for this, like Python?

@johnmyleswhite
Member

No, but there are plenty of docs for this project already.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling d24f8fa on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 4fac766 on tensorjack:master into ae435d1 on JuliaStats:master.

@johnmyleswhite
Member

Looking pretty good. Thanks for writing docs.

Questions:

  • Is the definition you provide for the weighted median always unique? And is it always defined, even if the weights are sometimes negative?
  • Do other functions in StatsBase provide checknan? It's somewhat at odds with our idea that NaN shouldn't be used to encode missing values.

@tinybike
Contributor Author

tinybike commented Oct 6, 2014

Good questions!

  • I looked this up and found a textbook that maps this onto a convex optimization problem. So it is unique if the weights are non-negative. I don't have a good reference for negative weights, so I'm not sure it is always defined in that case. That textbook suggests there are "problems" with negative weights, explored further in the exercises, which are unfortunately not part of Google's book preview. @grayclhn, do you know the answer to this? I wasn't aware that negative weights could even be used in this context before your comment!
  • checknan is from the median function in julia/base/statistics.jl. I agree that the usage is awkward, but felt I should be consistent with the base median implementation. Would you guys prefer that I remove this argument? If it's at odds with the StatsBase philosophy, I wouldn't mind having an excuse to get rid of it, to be honest.

@grayclhn

grayclhn commented Oct 6, 2014

No idea, I was originally responding to the claim, "negative weights are clearly errors," but this isn't an area I know a lot about.

I wouldn't worry too much about uniqueness, though: the unweighted sample median is not necessarily unique, we just typically assign it to be the midpoint of its allowable values by convention. As long as the algorithm for the weighted median always returns a value such that at least half of the mass lies on or below it and at least half of the mass lies on or above it, then it satisfies the definition of a weighted median.
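The definition stated above is easy to check mechanically. A minimal Python sketch (`is_weighted_median` is a hypothetical name for illustration):

```python
# m is a weighted median if at least half the total mass lies at or
# below m AND at least half lies at or above m.
def is_weighted_median(m, values, weights):
    total = sum(weights)
    below = sum(w for v, w in zip(values, weights) if v <= m)
    above = sum(w for v, w in zip(values, weights) if v >= m)
    return below >= 0.5 * total and above >= 0.5 * total
```

Note that every point in a whole interval can satisfy this, which is why a convention (such as taking the midpoint) is needed to pick a single value.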

@johnmyleswhite
Member

My worry is whether the algorithm is stable: given two inputs that are equal as multisets, but in different order in a vector, will the same output be produced?

@grayclhn

grayclhn commented Oct 6, 2014

Good point. It looks like the elements are sorted by (value, weight) pairs, first by value and then by weight, which should make it unique.

@johnmyleswhite
Member

I think it should be too. Just want to confirm, because the median is such a finicky thing already and adding weights makes me worry.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 07b9e9d on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling ba8a745 on tensorjack:master into ae435d1 on JuliaStats:master.

@tinybike
Contributor Author

tinybike commented Oct 7, 2014

I agree, this is a good thing to confirm. I wrote an extra block of tests to show that the result doesn't change when the data/weights are reordered. (Although, this is not true if all weights are negative, so that case now throws an error.)

@johnmyleswhite
Member

Ok. This should be good to go. I'd personally suggest removing the ability to use negative weights until we can document that against the literature, but don't hold this up over that.

@tinybike
Contributor Author

tinybike commented Oct 8, 2014

The checknan keyword argument was just removed from the base median implementation, so I have removed it from this version, as well.

@johnmyleswhite
Member

If nobody objects, let's merge this.

@tinybike
Contributor Author

tinybike commented Oct 8, 2014

@johnmyleswhite, are you able to merge this, or does someone else need to? (Your comments are marked "Owner".)

@johnmyleswhite
Member

I can. Just wanted to wait a bit. It's been enough now.

One thing: can you squash the 14 commits into one atomic change?

@garborg
Member

garborg commented Oct 8, 2014

He was just giving other collaborators a chance to respond.

@tinybike
Contributor Author

Sure -- squashed the commits!

johnmyleswhite added a commit that referenced this pull request Oct 11, 2014
Added weighted median function and tests
@johnmyleswhite johnmyleswhite merged commit 6b44ead into JuliaStats:master Oct 11, 2014
@johnmyleswhite
Member

Thanks!
