
Added weighted median function and tests #90

Merged
merged 1 commit into from
Oct 11, 2014

Conversation

tinybike
Contributor

No description provided.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 76056f4 on tensorjack:master into ae435d1 on JuliaStats:master.

@nalimilan
Member

Thanks for this!

I thought it might be a good idea to provide more tests for corner cases, which are where things usually break. For example, when only one value is passed; when all values are the same; when all weights are zero; when some weights are negative.

##### Weighted median #####

function Base.median{W<:Real}(v::RealVector, w::WeightVec{W})
sorted = sortrows([v w.values])
Member

AFAICT sortrows is going to sort on both columns, right? Couldn't you sort only on the first column? I think you may save some memory by first saving [v w.values] in an object, and then calling sort! on it.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 43f7bbb on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling fa6e783 on tensorjack:master into ae435d1 on JuliaStats:master.

if any(mask)
v = v[mask]
wt = w.values[mask]
sorted = sortrows([v wt])
Member

I'm still not sure that's the best solution. [v wt] is going to require copying the whole vectors, which may be really big, and then create another sorted copy. It may be better to call s = sortperm(v), and instead of sorting [v wt] iterate over s in the loop below, accessing wt[s[below_midpoint_index]]. This may be slower, not sure, but it may be much better for very large vectors (which is where performance matters for such a function).

Base.median() uses select() to avoid sorting the whole vector, but I don't know whether there's an equivalent function for sortperm.

But the question of efficiently computing the median has probably been discussed many times before. Have you done some research to find this algorithm?

Contributor

Base.median() uses select() to avoid sorting the whole vector, but I don't know whether there's an equivalent function for sortperm.

A selectperm function would look like this (based on select and sortperm):

import Base.Sort: Ordering, Perm, Forward, ord
selectperm(v::AbstractVector, k::Union(Int,OrdinalRange);
    lt::Function=isless, by::Function=identity, rev::Bool=false, order::Ordering=Forward) =
    select!([1:length(v)], k, Perm(ord(lt,by,rev,order),v))

Note that this does create a vector of Ints the same length as v, but that's still probably better in most cases than copying the whole vector and sorting.

Example usage:

julia> x = rand(5)
5-element Array{Float64,1}:
 0.677481
 0.107243
 0.525638
 0.297544
 0.748513

julia> sortperm(x)
5-element Array{Int64,1}:
 2
 4
 3
 1
 5

julia> selectperm(x, 2)
4

julia> selectperm(x, 1:3)
3-element Array{Int64,1}:
 2
 4
 3

This could be submitted to Base if people thought it were useful.

Contributor Author

I'm still not sure that's the best solution. [v wt] is going to require copying the whole vectors, which may be really big, and then create another sorted copy. It may be better to call s = sortperm(v), and instead of sorting [v wt] iterate over s in the loop below, accessing wt[s[below_midpoint_index]]. This may be slower, not sure, but it may be much better for very large vectors (which is where performance matters for such a function).

Good call on using sortperm. I updated the gist to include a sortperm implementation:

https://gist.github.com/tensorjack/432bdbaa2aff38ee64ee

It's 25x faster than the old version, and also decreases memory use by 89%. Going to send over a pull request with the new, improved version :)
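For readers following the thread, here is a minimal Python sketch of the sortperm-based approach described above (an illustration of the idea, not the actual StatsBase code or the linked gist; `weighted_median` is a hypothetical name, and the sketch assumes a non-empty vector with strictly positive weights):

```python
# Illustrative sketch: weighted median via an index permutation,
# avoiding a sorted copy of the [v w] matrix. Assumes non-empty
# input and strictly positive weights.
def weighted_median(values, weights):
    # Analogue of Julia's sortperm(v): sort indices by value only.
    perm = sorted(range(len(values)), key=lambda i: values[i])
    midpoint = 0.5 * sum(weights)
    cumulative = 0.0
    for i, p in enumerate(perm):
        cumulative += weights[p]
        if cumulative > midpoint:
            return values[p]
        if cumulative == midpoint:
            # Exactly half the mass lies at or below this point:
            # average with the next value, as for the unweighted median.
            return 0.5 * (values[p] + values[perm[i + 1]])
```

Only the index vector is allocated and sorted; the data and weight vectors are read in place through the permutation, which is where the memory savings come from.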

Member

You can update your existing fork and this pull request will reflect the changes.

@nalimilan
Member

Actually there seems to be a complete implementation in C++ under the MIT license here. It's a bit scary, but there's a simple quickSelect function there, which could be a nice addition to Base, similar to the currently existing select, but with weights.

https://www.eecis.udel.edu/~rauh/fqs/
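The weighted quickselect idea can be paraphrased in Python roughly as follows (a sketch of the general technique, not a port of the linked C++ code; at an exact half-weight tie either neighboring value is a valid weighted median, and this version returns the lower one):

```python
import random

# Sketch of weighted quickselect: partition around a random pivot and
# recurse into whichever side contains the half-weight point, giving
# expected O(n) time instead of O(n log n) for a full sort.
def weighted_select(values, weights, target=None):
    if target is None:
        target = 0.5 * sum(weights)  # the half-weight point
    pivot = random.choice(values)
    lo = [(v, w) for v, w in zip(values, weights) if v < pivot]
    eq_w = sum(w for v, w in zip(values, weights) if v == pivot)
    hi = [(v, w) for v, w in zip(values, weights) if v > pivot]
    lo_w = sum(w for _, w in lo)
    if target < lo_w:
        # Half-weight point falls strictly below the pivot.
        return weighted_select([v for v, _ in lo], [w for _, w in lo], target)
    if target <= lo_w + eq_w:
        return pivot
    # Otherwise recurse above the pivot, shifting the target.
    return weighted_select([v for v, _ in hi], [w for _, w in hi],
                           target - lo_w - eq_w)
```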

@nalimilan
Member

@tensorjack Do you think you could try porting wquickSelect() from the above C++ code to see how it performs? It would probably be even faster than your new implementation. The relevant function is at wmedianf_impl.cpp:1017.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 3f3623a on tensorjack:master into ae435d1 on JuliaStats:master.

@simonster
Member

This should probably throw an error or return nan(eltype(v)) if there are no non-zero weights. Right now it will return nothing.

wt = w.values[mask]
midpoint = 0.5 * sum(wt)
if any(wt .> midpoint)
first(v[wt .== maximum(wt)])
Member

I think it would be equivalent but more efficient if you compute maxval, maxind = findmax(wt), check if maxval > midpoint, and if so, return v[maxind].
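In Python terms, the suggested findmax shortcut amounts to the following (illustrative only; `dominant_value` is a hypothetical name):

```python
# Find the single heaviest weight once, instead of building the
# boolean masks wt .> midpoint and wt .== maximum(wt).
def dominant_value(v, wt):
    midpoint = 0.5 * sum(wt)
    maxind = max(range(len(wt)), key=wt.__getitem__)  # findmax's index
    if wt[maxind] > midpoint:
        return v[maxind]  # this value alone holds over half the mass
    return None  # no single dominant value; fall through to the scan
```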

@coveralls

Coverage Status

Coverage decreased (-0.03%) when pulling 3c24a2c on tensorjack:master into ae435d1 on JuliaStats:master.

cumulative_weight += wt[p]
end
if cumulative_weight == midpoint
0.5 * (v[permute[i-2]] + v[permute[i-1]])
Member

Sorry, a second look-through and I have more. This should be middle(v[permute[i-2]], v[permute[i-1]]) and the below case should be middle(v[permute[i-1]]) for type stability.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling e4efed8 on tensorjack:master into ae435d1 on JuliaStats:master.

@StefanKarpinski
Contributor

I had test cases with zero weights, but I added the warn() to alert the user that negative/zero weights are being ignored, which made the output feel spammy. What I can do is add a keyword argument that shuts off the warnings, and use that for tests.

Can't find the comment now, but I have an email. I suspect that it's better to simply allow zero weights with well-defined behavior than to have a warn/nowarn option that's going to have to go on an endless number of functions.

@tinybike
Contributor Author

Can't find the comment now, but I have an email. I suspect that it's better to simply allow zero weights with well-defined behavior than to have a warn/nowarn option that's going to have to go on an endless number of functions.

That makes sense. I can remove the warn altogether. The behavior would then be that zero and negative weights are silently ignored. (An error is only thrown if all weights are zero or negative.) Is there a way to add "docstring" text? I think if this behavior were mentioned in help, it would be a non-issue.

@StefanKarpinski
Contributor

I think that negative weights are quite different from zero weights, no? Negative seems to me to be a genuine error while zero is the limit of a positive weight.

@tinybike
Contributor Author

I think that negative weights are quite different from zero weights, no? Negative seems to me to be a genuine error while zero is the limit of a positive weight.

That's a good point. A zero weight just means "this observation is meaningless" (or, "this observation was repeated zero times"). Negative weights are always errors, as I understand it. In light of this, it probably makes the most sense to throw an error when there are any negative weights, silently remove any zero weights, and get rid of the warning altogether.

@grayclhn

There are plenty of cases in stats where negative weights are not errors: that lets the other weights have mass greater than one. The examples I know off the top of my head are in time-series, but I'm sure they show up elsewhere too.

@nalimilan
Member

There are plenty of cases in stats where negative weights are not errors: that lets the other weights have mass greater than one. The examples I know off the top of my head are in time-series, but I'm sure they show up elsewhere too.

But then what meaning would negative weights convey in the case of median?

@andreasnoack
Member

@grayclhn Just curious. What are your examples?

@grayclhn

grayclhn commented Oct 1, 2014

Covariance matrix estimators that account for serial correlation: the "Quadratic Spectral" kernel satisfies some optimality properties for the estimator of the variance-covariance matrix, and it takes on small negative values in places. Mathworks, of all places, has a nice plot (don't worry, no code at the link...)
http://www.mathworks.com/help/econ/hac.html#btt5ta4-1

@StefanKarpinski
Contributor

So how are negative weights treated? As negative multipliers where the sum still needs to be 1? If so, then maybe we just leave it up to the caller to make sure that weights are non-negative or positive as appropriate given the circumstances.

@grayclhn

grayclhn commented Oct 1, 2014

In the cases I'm familiar with, the negative values don't need to be treated any differently than the positive cases. Here it looks like the algorithm should work fine even if some of the weights are negative.

@StefanKarpinski
Contributor

That seems pretty reasonable to me.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling f5b81ea on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling b1b08ac on tensorjack:master into ae435d1 on JuliaStats:master.

@tinybike
Contributor Author

tinybike commented Oct 1, 2014

Ok, I changed mask so that it allows negative weights, which are treated the same way as positive weights. I also removed the verbose keyword and its accompanying warn. Let me know if there are other changes you would like to see, or feel free to merge this request if you guys think it's ready.

@johnmyleswhite
Member

I'd like to see some docs. In particular, a mathematical definition of what's happening here is good.

@tinybike
Contributor Author

tinybike commented Oct 2, 2014

Sure. Does Julia have a docstring-style format for this, like Python?

@johnmyleswhite
Member

No, but there are plenty of docs for this project already.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling d24f8fa on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 4fac766 on tensorjack:master into ae435d1 on JuliaStats:master.

@johnmyleswhite
Member

Looking pretty good. Thanks for writing docs.

Questions:

  • Is the definition you provide for the weighted median always unique? And is it always defined, even if the weights are sometimes negative?
  • Do other functions in StatsBase provide checknan? It's somewhat at odds with our idea that NaN shouldn't be used to encode missing values.

@tinybike
Contributor Author

tinybike commented Oct 6, 2014

Good questions!

  • I looked this up and found a textbook that maps this onto a convex optimization problem. So it is unique if the weights are non-negative. I don't have a good reference for negative weights, so I'm not sure it is always defined in that case. That textbook suggests there are "problems" with negative weights, explored further in the exercises, which are unfortunately not part of Google's book preview. @grayclhn, do you know the answer to this? I wasn't aware that negative weights could even be used in this context before your comment!
  • checknan is from the median function in julia/base/statistics.jl. I agree that the usage is awkward, but felt I should be consistent with the base median implementation. Would you guys prefer that I remove this argument? If it's at odds with the StatsBase philosophy, I wouldn't mind having an excuse to get rid of it, to be honest.

@grayclhn

grayclhn commented Oct 6, 2014

No idea, I was originally responding to the claim, "negative weights are clearly errors," but this isn't an area I know a lot about.

I wouldn't worry too much about uniqueness, though: the unweighted sample median is not necessarily unique, we just typically assign it to be the midpoint of its allowable values by convention. As long as the algorithm for the weighted median always returns a value such that at least half of the mass lies on or below it and at least half of the mass lies on or above it, then it satisfies the definition of a weighted median.
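The definition stated above is easy to check mechanically. A minimal Python sketch (`is_weighted_median` is a hypothetical name for illustration):

```python
# m is a weighted median if at least half the total mass lies at or
# below m AND at least half lies at or above m.
def is_weighted_median(m, values, weights):
    total = sum(weights)
    below = sum(w for v, w in zip(values, weights) if v <= m)
    above = sum(w for v, w in zip(values, weights) if v >= m)
    return below >= 0.5 * total and above >= 0.5 * total
```

Note that every point in a whole interval can satisfy this, which is why a convention (such as taking the midpoint) is needed to pick a single value.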

@johnmyleswhite
Member

My worry is whether the algorithm is stable: given two inputs that are equal as multisets, but in different order in a vector, will the same output be produced?

@grayclhn

grayclhn commented Oct 6, 2014

Good point. It looks like the elements are sorted by (value, weight) pairs, first by value and then by weight, which should make it unique.

@johnmyleswhite
Member

I think it should be too. Just want to confirm, because the median is such a finicky thing already and adding weights makes me worry.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 07b9e9d on tensorjack:master into ae435d1 on JuliaStats:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling ba8a745 on tensorjack:master into ae435d1 on JuliaStats:master.

@tinybike
Contributor Author

tinybike commented Oct 7, 2014

I agree, this is a good thing to confirm. I wrote an extra block of tests to show that the result doesn't change when the data/weights are reordered. (Although, this is not true if all weights are negative, so that case now throws an error.)

@johnmyleswhite
Member

Ok. This should be good to go. I'd personally suggest removing the ability to use negative weights until we can document that against the literature, but don't hold this up over that.

@tinybike
Contributor Author

tinybike commented Oct 8, 2014

The checknan keyword argument was just removed from the base median implementation, so I have removed it from this version, as well.

@johnmyleswhite
Member

If nobody objects, let's merge this.

@tinybike
Contributor Author

tinybike commented Oct 8, 2014

@johnmyleswhite, are you able to merge this, or does someone else need to? (Your comments are marked "Owner".)

@johnmyleswhite
Member

I can. Just wanted to wait a bit. It's been enough now.

One thing: can you squash the 14 commits into one atomic change?

@garborg
Member

garborg commented Oct 8, 2014

He was just giving other collaborators a chance to respond.

@tinybike
Contributor Author

Sure -- squashed the commits!

johnmyleswhite added a commit that referenced this pull request Oct 11, 2014
Added weighted median function and tests
@johnmyleswhite johnmyleswhite merged commit 6b44ead into JuliaStats:master Oct 11, 2014
@johnmyleswhite
Member

Thanks!
