Added weighted median function and tests #90
Conversation
Thanks for this! I thought it might be a good idea to provide more tests for corner cases, which are where things usually break. For example: when only one value is passed; when all values are the same; when all weights are zero; when some weights are negative.
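Hedged sketches of such tests might look like this (hypothetical values, assuming the PR's median(v, w) method is loaded; the degenerate-weight cases should assert whatever behavior is ultimately chosen):

using Base.Test
@test median([7.0], WeightVec([1.0])) == 7.0                       # single value
@test median([2.0, 2.0, 2.0], WeightVec([1.0, 2.0, 3.0])) == 2.0   # all values equal
# e.g., if all-zero or negative weights are decided to be errors:
# @test_throws ErrorException median([1.0, 2.0], WeightVec([0.0, 0.0]))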
###### Weighted median #####

function Base.median{W<:Real}(v::RealVector, w::WeightVec{W})
    sorted = sortrows([v w.values])
AFAICT sortrows is going to sort on both columns, right? Couldn't you sort only on the first column? I think you may save some memory by first saving [v w.values] in an object, and then calling sort! on it.
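One way to read this suggestion (a hypothetical sketch, not the PR's code) is to pair each value with its weight so a single in-place sort orders on the values only:

pairs = collect(zip(v, w.values))   # one vector of (value, weight) tuples
sort!(pairs, by=first)              # in-place sort keyed on the value alone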
if any(mask)
    v = v[mask]
    wt = w.values[mask]
    sorted = sortrows([v wt])
I'm still not sure that's the best solution. [v wt] is going to require copying the whole vectors, which may be really big, and then creating another sorted copy. It may be better to call s = sortperm(v), and instead of sorting [v wt], iterate over s in the loop below, accessing wt[s[below_midpoint_index]]. This may be slower, I'm not sure, but it may be much better for very large vectors (which is where performance matters for such a function).
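A rough sketch of that pattern (assuming it runs inside the median method, and ignoring the exact-tie case the PR handles separately):

s = sortperm(v)                # indices in sorted order; v and wt are never copied
midpoint = 0.5 * sum(wt)
cumulative_weight = 0.0
for p in s
    cumulative_weight += wt[p]
    cumulative_weight >= midpoint && return v[p]   # first point crossing half the mass
end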
Base.median() uses select() to avoid sorting the whole vector, but I don't know whether there's an equivalent function for sortperm.
But the question of efficiently computing the median has probably been discussed many times before. Have you done some research to find this algorithm?
Base.median() uses select() to avoid sorting the whole vector, but I don't know whether there's an equivalent function for sortperm.
A selectperm function would look like this (based on select and sortperm):
import Base.Sort: Ordering, Perm, Forward, ord

selectperm(v::AbstractVector, k::Union(Int,OrdinalRange);
           lt::Function=isless, by::Function=identity, rev::Bool=false, order::Ordering=Forward) =
    select!([1:length(v)], k, Perm(ord(lt,by,rev,order),v))
Note that this does create a vector of Ints the same length as v, but that's still probably better in most cases than copying the whole vector and sorting.
Example usage:
julia> x = rand(5)
5-element Array{Float64,1}:
0.677481
0.107243
0.525638
0.297544
0.748513
julia> sortperm(x)
5-element Array{Int64,1}:
2
4
3
1
5
julia> selectperm(x, 2)
4
julia> selectperm(x, 1:3)
3-element Array{Int64,1}:
2
4
3
This could be submitted to Base if people thought it was useful.
I'm still not sure that's the best solution. [v wt] is going to require copying the whole vectors, which may be really big, and then creating another sorted copy. It may be better to call s = sortperm(v), and instead of sorting [v wt], iterate over s in the loop below, accessing wt[s[below_midpoint_index]]. This may be slower, I'm not sure, but it may be much better for very large vectors (which is where performance matters for such a function).
Good call on using sortperm. I updated the gist to include a sortperm implementation:
https://gist.github.com/tensorjack/432bdbaa2aff38ee64ee
It's 25x faster than the old version, and also decreases memory use by 89%. Going to send over a pull request with the new, improved version :)
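Claims like these are easy to spot-check with @time, which reports both runtime and allocations (hypothetical sizes, assuming the PR's median(v, w) method is loaded):

v = rand(10^7)
w = WeightVec(rand(10^7))
@time median(v, w)   # compare elapsed time and bytes allocated across versions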
You can update your existing fork and this pull request will reflect the changes.
Actually, there seems to be a complete implementation in C++ under the MIT license here. It's a bit scary, but there's a simple quickSelect function there, which could be a nice addition to Base, similar to the currently existing select, but with weights.
@tensorjack Do you think you could try porting …
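For reference, a weighted quickselect in Julia might look roughly like this (a hedged sketch, not a port of the linked C++ code; it assumes non-negative weights, returns a lower weighted median, and skips the exact-tie interpolation the PR performs):

function weighted_quickselect(v::AbstractVector, w::AbstractVector)
    v, w = copy(v), copy(w)          # partition copies in place; inputs untouched
    lo, hi = 1, length(v)
    target = 0.5 * sum(w)            # half the total mass
    below = 0.0                      # weight already known to lie left of lo
    while lo < hi
        p = rand(lo:hi)              # random pivot guards against bad orderings
        v[p], v[hi] = v[hi], v[p]
        w[p], w[hi] = w[hi], w[p]
        i = lo                       # Lomuto partition around the pivot v[hi]
        for j = lo:hi-1
            if v[j] < v[hi]
                v[i], v[j] = v[j], v[i]
                w[i], w[j] = w[j], w[i]
                i += 1
            end
        end
        v[i], v[hi] = v[hi], v[i]
        w[i], w[hi] = w[hi], w[i]
        left = below + sum(w[lo:i-1])    # total weight strictly below the pivot
        if left > target
            hi = i - 1               # the weighted median is in the lower part
        elseif left + w[i] >= target
            return v[i]              # the pivot carries the weighted midpoint
        else
            below = left + w[i]      # discard the pivot and everything below it
            lo = i + 1
        end
    end
    return v[lo]
end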
This should probably throw an error or return …
wt = w.values[mask]
midpoint = 0.5 * sum(wt)
if any(wt .> midpoint)
    first(v[wt .== maximum(wt)])
I think it would be equivalent but more efficient if you compute maxval, maxind = findmax(wt), check if maxval > midpoint, and if so, return v[maxind].
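That is, roughly (reusing the names from the diff above):

maxval, maxind = findmax(wt)   # one pass finds both the largest weight and its index
if maxval > midpoint
    return v[maxind]
end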
    cumulative_weight += wt[p]
end
if cumulative_weight == midpoint
    0.5 * (v[permute[i-2]] + v[permute[i-1]])
Sorry, a second look-through and I have more. This should be middle(v[permute[i-2]], v[permute[i-1]]), and the case below should be middle(v[permute[i-1]]), for type stability.
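For reference, middle promotes to the midpoint's floating-point type in both arities, so both branches then return the same type:

julia> middle(3)
3.0

julia> middle(2, 3)
2.5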
Can't find the comment now, but I have an email. I suspect that it's better to simply allow zero weights with well-defined behavior than to have a warn/nowarn option that's going to have to go on an endless number of functions.
That makes sense. I can remove the …
I think that negative weights are quite different from zero weights, no? Negative seems to me to be a genuine error, while zero is the limit of a positive weight.
That's a good point. A zero weight just means "this observation is meaningless" (or, "this observation was repeated zero times"). Negative weights are always errors, as I understand it. In light of this, it probably makes the most sense to throw an error when there are any negative weights, silently remove any zero weights, and get rid of the warning altogether.
There are plenty of cases in stats where negative weights are not errors: they let the other weights have mass greater than one. The examples I know off the top of my head are in time series, but I'm sure they show up elsewhere too.
But then what meaning would negative weights convey in the case of the median?
@grayclhn Just curious: what are your examples?
Covariance matrix estimators that account for serial correlation: the "Quadratic Spectral" kernel satisfies some optimality properties for the estimator of the variance-covariance matrix, and it takes on small negative values in places. Mathworks, of all places, has a nice plot (don't worry, no code at the link...).
So how are negative weights treated? As negative multipliers where the sum still needs to be 1? If so, then maybe we just leave it up to the caller to make sure that weights are non-negative or positive, as appropriate given the circumstances.
In the cases I'm familiar with, the negative values don't need to be treated any differently than the positive ones. Here it looks like the algorithm should work fine even if some of the weights are negative.
That seems pretty reasonable to me.
Ok, I changed …
I'd like to see some docs. In particular, a mathematical definition of what's happening here would be good.
Sure. Does Julia have a docstring-style format for this, like Python?
No, but there are plenty of docs for this project already.
Looking pretty good. Thanks for writing docs. Questions: …
Good questions! …
No idea, I was originally responding to the claim "negative weights are clearly errors," but this isn't an area I know a lot about. I wouldn't worry too much about uniqueness, though: the unweighted sample median is not necessarily unique; we just typically assign it to be the midpoint of its allowable values by convention. As long as the algorithm for the weighted median always returns a value such that at least half of the mass lies on or below it and at least half of the mass lies on or above it, then it satisfies the definition of a weighted median.
My worry is whether the algorithm is stable: given two inputs that are equal as multisets, but in a different order in a vector, will the same output be produced?
Good point. It looks like the elements are sorted by (value, weight) pairs, first by value and then by weight, which should make the result unique.
I think it should be too. Just want to confirm, because the median is such a finicky thing already, and adding weights makes me worry.
I agree, this is a good thing to confirm. I wrote an extra block of tests to show that the result doesn't change when the data/weights are reordered. (Although this is not true if all weights are negative, so that case now throws an error.)
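Such a test might look roughly like this (a hypothetical sketch, not the PR's actual test block):

using Base.Test
v = [1.0, 3.0, 2.0, 4.0]
w = [0.2, 0.4, 0.1, 0.3]
p = randperm(length(v))   # random reordering of the data/weight pairs
@test median(v, WeightVec(w)) == median(v[p], WeightVec(w[p]))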
Ok. This should be good to go. I'd personally suggest removing the ability to use negative weights until we can document that against the literature, but don't hold this up over that.
If nobody objects, let's merge this.
@johnmyleswhite, are you able to merge this, or does someone else need to? (Your comments are marked "Owner".)
I can. Just wanted to wait a bit. It's been enough now. One thing: can you squash the 14 commits into one atomic change?
He was just giving other collaborators a chance to respond.
Sure -- squashed the commits!
Added weighted median function and tests
Thanks!