trim and winsor now maintain original order #546

AsafManela · 2019-12-19T16:37:58Z

trim and winsor now maintain original order. closes #437

codecov · 2019-12-19T17:04:30Z

Codecov Report

Merging #546 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
- Coverage   90.39%   90.37%   -0.03%     
==========================================
  Files          21       21              
  Lines        2104     2099       -5     
==========================================
- Hits         1902     1897       -5     
  Misses        202      202

Impacted Files	Coverage Δ
src/robust.jl	`93.1% <100%> (-1.02%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da42557...02ff391. Read the comment docs.

nalimilan

Thanks! Could you do some benchmarking to compare the performance with the previous implementation? If that's significantly slower, it would make sense to add an argument to allow sorting data when performance is needed (e.g. if you just want to compute the mean).

nalimilan · 2019-12-20T10:05:05Z

src/robust.jl

-    deleteat!(x, 1:count)
+    ix = vcat(partialsortperm(x, 1:count), partialsortperm(x, (n-count+1):n))
+    sort!(ix)
+    ix = unique(ix) # can be replaced by unique! starting julia 1.1


Since you know indices are sorted, a simple loop comparing values to previous ones and skipping them will probably be faster.

Have you tried this?

src/robust.jl

nalimilan · 2019-12-20T10:11:49Z

src/robust.jl

-    x[1:count] .= x[count+1]
-    x[n-count+1:end] .= x[n-count]
+    ix = partialsortperm(x, 1:(count+1))
+    x[ix[1:count]] .= x[ix[count+1]]


Suggested change

x[ix[1:count]] .= x[ix[count+1]]

x[view(ix, 1:count)] .= x[ix[count+1]]

perhaps because now this would be a view of a view, julia doesn't like this and it kill the process

Well that's certainly not supposed to happen. Can you try starting Julia with --check-bounds=yes? Otherwise it would be worth reporting.

nalimilan · 2019-12-20T10:12:01Z

src/robust.jl

+    x[ix[1:count]] .= x[ix[count+1]]
+
+    ix = partialsortperm(x, (n-count):n)
+    x[ix[2:end]] .= x[ix[1]]


Suggested change

x[ix[2:end]] .= x[ix[1]]

x[view(ix, 2:end)] .= x[ix[1]]

same as above

src/robust.jl

AsafManela

Thanks for the suggestions!
Here is what the benchmarking looks like now compared with master branch:

setup

julia> n = 10000; v = rand(n)
10000-element Array{Float64,1}:
 0.07612935713307811
 0.7580974363520654 
 0.9980846175305966 
 ⋮                  
 0.491731488822605  
 0.9336393339977287 
 0.9843318462521014

master

julia> @btime trim(v, prop=0.1);
  374.195 μs (3 allocations: 78.22 KiB)

julia> @btime winsor(v, prop=0.1);
  374.352 μs (3 allocations: 78.22 KiB)

am/winsor

julia> @btime trim(v, prop=0.1);
  504.562 μs (13 allocations: 172.39 KiB)

julia> @btime winsor(v, prop=0.1);
  401.524 μs (13 allocations: 172.48 KiB)

src/robust.jl

nalimilan · 2019-12-20T21:40:34Z

OK, I guess a 34% slowdown is still acceptable. Maybe my suggestion to replace unique! can reduce it a bit more.

nalimilan · 2019-12-20T22:13:29Z

Sometimes I wonder whether it wouldn't make more sense to have these functions return iterators over the original data. If we only stored the minimum and maximum thresholds, we could just return generators that skip or replace values outside of the range on the fly. That would probably be quite faster when computing a single summary statistic (typically the mean), and you could easily call collect if you want a vector.

What do you think?

mschauer · 2019-12-21T10:01:02Z

There isn't really a nice solution to it, because these would be iterators depending on properties of the full original data, so not playing nicely with iterators as inputs. But I would think an improvement in performance would justify this.

nalimilan · 2019-12-21T10:24:05Z

There isn't really a nice solution to it, because these would be iterators depending on properties of the full original data, so not playing nicely with iterators as inputs.

What do you mean?

mschauer · 2019-12-21T10:44:44Z

We do not so often define iterators which require the input data to be collected or traversed,
but in forming the trimmed iterator quantiles of the full input are needed so the input iterator has to be collected or traversed before iteration can start.

nalimilan · 2019-12-21T11:02:13Z

Yes, so as currently the input would be required to be an AbstractVector. I don't think that's a serious issue, as the same problem would exist if the functions returned vectors (they would have to collect the input first).

mschauer · 2019-12-21T13:36:21Z

In this case no new structures would be needed, custom generators would suffice, e.g.

(clamp(x, lo, up) for x in xs) # winsor

or

(x for x in xs if lo <= x <= up) #trim

for adequate values of up, lo could do.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

AsafManela · 2019-12-31T22:13:08Z

I fixed that typo, but also tried to implement the generator solution like this for winsor:

function winsor(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))

    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end

    # allocate vector of all x's indices
    ixall = collect(1:n)
    # indices for lowest count values
    ixlo = partialsortperm!(ixall, x, count+1; initialized=true)
    # indices for largest count values
    ixup = partialsortperm!(ixall, x, n-count; initialized=true)

    @inbounds lo = x[ixlo]
    @inbounds up = x[ixup]

    (clamp(xi, lo, up) for xi in x)
end

Just creating the generator seems to be more expensive than the previous approach.

Version currently comitted to this PR:

julia> @btime winsor(v, prop=0.1);
  436.854 μs (13 allocations: 172.48 KiB)

Version using the above code with a generator:

julia> @btime winsor(v, prop=0.1);
  1.317 ms (8 allocations: 78.31 KiB)
julia> @btime collect(winsor(v, prop=0.1));
  1.342 ms (10 allocations: 156.52 KiB)

So I would suggest we just merge the PR as is.

nalimilan · 2020-01-01T18:32:11Z

Thanks for trying. That's really surprising, as the creation of a generator is basically free... I turns out that calling sortperm on a range of indices is faster than passing a single value (I've filed JuliaLang/julia#34233).

This variant is about as fast as the implementation in this PR:

function winsor2(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))

    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end

    # allocate vector of all x's indices
    ixall = collect(1:n)
    # indices for lowest count values
    ixlo = partialsortperm!(ixall, x, 1:(count+1); initialized=true)[end]
    # indices for largest count values
    ixup = partialsortperm!(ixall, x, (n-count):n; initialized=true)[1]

    @inbounds lo = x[ixlo]
    @inbounds up = x[ixup]

    (clamp(xi, lo, up) for xi in x)
end

And this one is even faster:

function winsornew2(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))

    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end

    # indices for lowest count values
    x2 = copy(x)
    lo = partialsort!(x2, 1:count+1)[end]
    # indices for largest count values
    up = partialsort!(x2, n-count:n)[1]

    (clamp(xi, lo, up) for xi in x)
end

Benchmarks computing mean:

julia> @btime mean(winsor(v, prop=0.1)); # From this PR
  575.010 μs (14 allocations: 172.50 KiB)

julia> @btime mean(winsornew(v, prop=0.1));
  572.053 μs (11 allocations: 78.45 KiB)

julia> @btime mean(winsornew2(v, prop=0.1));
  386.626 μs (5 allocations: 78.27 KiB)

AsafManela · 2020-01-01T20:54:16Z

That explains it. Thanks for the better solution.
I implemented it for both trim and winsor and refactored to share some of the code.
New benchmarking suggests we get to have the cake and eat it too:

julia> @btime trim(v, prop=0.1);
  246.258 μs (5 allocations: 78.27 KiB)

julia> @btime winsor(v, prop=0.1);
  245.745 μs (4 allocations: 78.25 KiB)

julia> @btime collect(trim(v, prop=0.1));
  341.090 μs (19 allocations: 206.92 KiB)

julia> @btime collect(winsor(v, prop=0.1));
  252.993 μs (6 allocations: 156.45 KiB)

nalimilan · 2020-01-02T17:27:24Z

Cool. We could even add a specialized collect method for trim in the future if we wanted to avoid the overhead due to push! (since we know the number of elements the vector can be allocated upfront).

Since moving to generators is breaking, we should probably wait a bit until we can bundle a series of breaking changes into a new minor release. If you want to fix to be available quickly, maybe drop the last commit from this PR so that we can merge it now, and open a new PR for the generator changes? As you prefer.

AsafManela · 2020-01-02T17:33:42Z

No rush on my end. Please merge when you think appropriate.

…

On Thu, Jan 2, 2020 at 11:27 AM Milan Bouchet-Valat < ***@***.***> wrote: Cool. We could even add a specialized collect method for trim in the future if we wanted to avoid the overhead due to push! (since we know the number of elements the vector can be allocated upfront). Since moving to generators is breaking, we should probably wait a bit until we can bundle a series of breaking changes into a new minor release. If you want to fix to be available quickly, maybe drop the last commit from this PR so that we can merge it now, and open a new PR for the generator changes? As you prefer. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#546?email_source=notifications&email_token=ADJM2OUEFTEQS7WZI3I47A3Q3YPXZA5CNFSM4J5KTP4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH64FFI#issuecomment-570278549>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADJM2OR7ZRBZR6KFDDAG5TDQ3YPXZANCNFSM4J5KTP4A> .

nalimilan · 2020-01-03T20:40:50Z

Fine with me.

Could you update the docstrings to reflect the fact that the functions no longer return a modified copy of the input? Examples can simply use collect. It would also be nice to specify that when several elements are equal to the threshold, they are all trimmed/clamped, meaning that the proportion or count might be adjusted (the passed value is only a lower bound). That wasn't the case before (which didn't make a lot of sense actually).

AsafManela · 2020-01-06T16:40:50Z

I updated the examples, but I don't fully understand your second comment.
Clamp keeps lo <= x <= hi as is and only replaces strictly higher or lower elements. So elements equal to the thresholds are not trimmed or clamped. Right?

nalimilan · 2020-01-06T18:07:36Z

Ah, right. Though my remark still holds in a different form: if there are several values equal to the threshold, none of them will be skipped/clamped, but in the previous implementations some of them could have been, and some of them not, if the specified proportion fell exactly in the middle of that group. Do you see what I mean?

nalimilan · 2020-01-06T18:10:02Z

src/robust.jl

-elements of `x` replaced with the next-lowest, and an equal number of the
-highest elements replaced with the previous-highest.  To compute the Winsorized
-mean of `x` use `mean(winsor(x))`.
+Return a generator of all elements of `x` that replaces either `count` or


Maybe let's just say "an iterator", so that in the future we can return a more specific type than an iterator. For example we could return a special AbstractVector type. Same for trim (even if it cannot be a vector since computing indices is O(N)).

AsafManela · 2020-01-06T18:13:04Z

I see. How about this:
"The number of trimmed / replaced elements could be smaller then specified if several elements equal the lower or upper bound."

nalimilan · 2020-01-06T18:16:42Z

Sounds good! (But "then" -> "than")

nalimilan

Thanks!

I'll merge in a few days if nobody objects. Then if we want to tag a non-breaking release we can create a separate branch with only minor changes to backport.

trim and winsor now maintain original order

58d1b1d

nalimilan reviewed Dec 20, 2019

View reviewed changes

perf improvements for trim and winsor

ae1760d

AsafManela commented Dec 20, 2019

View reviewed changes

indices spelling

0b70143

nalimilan reviewed Dec 20, 2019

View reviewed changes

src/robust.jl Outdated Show resolved Hide resolved

fix typo

b35c0a2

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

trim and winsor now return generators

df66bf4

updated docs

e47ed68

nalimilan reviewed Jan 6, 2020

View reviewed changes

AsafManela and others added 2 commits January 6, 2020 13:59

more doc updates

f428982

Adjust spacing and improve signature in docstring

02ff391

nalimilan approved these changes Jan 7, 2020

View reviewed changes

kleinschmidt added the breaking A breaking change that needs a semver major release label Feb 16, 2020

nalimilan merged commit 3762c78 into JuliaStats:master Feb 17, 2020

kleinschmidt mentioned this pull request Mar 4, 2020

minor version #566

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trim and winsor now maintain original order #546

trim and winsor now maintain original order #546

AsafManela commented Dec 19, 2019

codecov bot commented Dec 19, 2019 •

edited

nalimilan left a comment

nalimilan Dec 20, 2019

nalimilan Dec 20, 2019

nalimilan Dec 20, 2019

AsafManela Dec 20, 2019

nalimilan Dec 20, 2019

nalimilan Dec 20, 2019

AsafManela Dec 20, 2019

AsafManela left a comment

nalimilan commented Dec 20, 2019

nalimilan commented Dec 20, 2019

mschauer commented Dec 21, 2019

nalimilan commented Dec 21, 2019

mschauer commented Dec 21, 2019

nalimilan commented Dec 21, 2019

mschauer commented Dec 21, 2019 •

edited

AsafManela commented Dec 31, 2019

nalimilan commented Jan 1, 2020

AsafManela commented Jan 1, 2020

nalimilan commented Jan 2, 2020

AsafManela commented Jan 2, 2020 via email

nalimilan commented Jan 3, 2020

AsafManela commented Jan 6, 2020

nalimilan commented Jan 6, 2020

nalimilan Jan 6, 2020

AsafManela commented Jan 6, 2020

nalimilan commented Jan 6, 2020

nalimilan left a comment

	x[ix[1:count]] .= x[ix[count+1]]
	x[view(ix, 1:count)] .= x[ix[count+1]]

trim and winsor now maintain original order #546

trim and winsor now maintain original order #546

Conversation

AsafManela commented Dec 19, 2019

codecov bot commented Dec 19, 2019 • edited

Codecov Report

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AsafManela left a comment

Choose a reason for hiding this comment

setup

master

am/winsor

nalimilan commented Dec 20, 2019

nalimilan commented Dec 20, 2019

mschauer commented Dec 21, 2019

nalimilan commented Dec 21, 2019

mschauer commented Dec 21, 2019

nalimilan commented Dec 21, 2019

mschauer commented Dec 21, 2019 • edited

AsafManela commented Dec 31, 2019

nalimilan commented Jan 1, 2020

AsafManela commented Jan 1, 2020

nalimilan commented Jan 2, 2020

AsafManela commented Jan 2, 2020 via email

nalimilan commented Jan 3, 2020

AsafManela commented Jan 6, 2020

nalimilan commented Jan 6, 2020

Choose a reason for hiding this comment

AsafManela commented Jan 6, 2020

nalimilan commented Jan 6, 2020

nalimilan left a comment

Choose a reason for hiding this comment

codecov bot commented Dec 19, 2019 •

edited

mschauer commented Dec 21, 2019 •

edited