Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trim and winsor now maintain original order #546

Merged
merged 8 commits into from Feb 17, 2020
Merged

trim and winsor now maintain original order #546

merged 8 commits into from Feb 17, 2020

Conversation

AsafManela
Copy link
Contributor

trim and winsor now maintain original order. closes #437

@codecov
Copy link

codecov bot commented Dec 19, 2019

Codecov Report

Merging #546 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
- Coverage   90.39%   90.37%   -0.03%     
==========================================
  Files          21       21              
  Lines        2104     2099       -5     
==========================================
- Hits         1902     1897       -5     
  Misses        202      202
Impacted Files Coverage Δ
src/robust.jl 93.1% <100%> (-1.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da42557...02ff391. Read the comment docs.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you do some benchmarking to compare the performance with the previous implementation? If that's significantly slower, it would make sense to add an argument to allow sorting data when performance is needed (e.g. if you just want to compute the mean).

src/robust.jl Outdated
deleteat!(x, 1:count)
ix = vcat(partialsortperm(x, 1:count), partialsortperm(x, (n-count+1):n))
sort!(ix)
ix = unique(ix) # can be replaced by unique! starting julia 1.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you know indices are sorted, a simple loop comparing values to previous ones and skipping them will probably be faster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried this?

src/robust.jl Outdated Show resolved Hide resolved
src/robust.jl Outdated
x[1:count] .= x[count+1]
x[n-count+1:end] .= x[n-count]
ix = partialsortperm(x, 1:(count+1))
x[ix[1:count]] .= x[ix[count+1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x[ix[1:count]] .= x[ix[count+1]]
x[view(ix, 1:count)] .= x[ix[count+1]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps because now this would be a view of a view, julia doesn't like this and it kill the process

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's certainly not supposed to happen. Can you try starting Julia with --check-bounds=yes? Otherwise it would be worth reporting.

src/robust.jl Outdated
x[ix[1:count]] .= x[ix[count+1]]

ix = partialsortperm(x, (n-count):n)
x[ix[2:end]] .= x[ix[1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x[ix[2:end]] .= x[ix[1]]
x[view(ix, 2:end)] .= x[ix[1]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

src/robust.jl Outdated Show resolved Hide resolved
Copy link
Contributor Author

@AsafManela AsafManela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions!
Here is what the benchmarking looks like now compared with master branch:

setup

julia> n = 10000; v = rand(n)
10000-element Array{Float64,1}:
 0.07612935713307811
 0.7580974363520654 
 0.9980846175305966 
                   
 0.491731488822605  
 0.9336393339977287 
 0.9843318462521014 

master

julia> @btime trim(v, prop=0.1);
  374.195 μs (3 allocations: 78.22 KiB)

julia> @btime winsor(v, prop=0.1);
  374.352 μs (3 allocations: 78.22 KiB)

am/winsor

julia> @btime trim(v, prop=0.1);
  504.562 μs (13 allocations: 172.39 KiB)

julia> @btime winsor(v, prop=0.1);
  401.524 μs (13 allocations: 172.48 KiB)

src/robust.jl Outdated Show resolved Hide resolved
@nalimilan
Copy link
Member

OK, I guess a 34% slowdown is still acceptable. Maybe my suggestion to replace unique! can reduce it a bit more.

@nalimilan
Copy link
Member

Sometimes I wonder whether it wouldn't make more sense to have these functions return iterators over the original data. If we only stored the minimum and maximum thresholds, we could just return generators that skip or replace values outside of the range on the fly. That would probably be quite faster when computing a single summary statistic (typically the mean), and you could easily call collect if you want a vector.

What do you think?

@mschauer
Copy link
Member

There isn't really a nice solution to it, because these would be iterators depending on properties of the full original data, so not playing nicely with iterators as inputs. But I would think an improvement in performance would justify this.

@nalimilan
Copy link
Member

There isn't really a nice solution to it, because these would be iterators depending on properties of the full original data, so not playing nicely with iterators as inputs.

What do you mean?

@mschauer
Copy link
Member

We do not so often define iterators which require the input data to be collected or traversed,
but in forming the trimmed iterator quantiles of the full input are needed so the input iterator has to be collected or traversed before iteration can start.

@nalimilan
Copy link
Member

Yes, so as currently the input would be required to be an AbstractVector. I don't think that's a serious issue, as the same problem would exist if the functions returned vectors (they would have to collect the input first).

@mschauer
Copy link
Member

mschauer commented Dec 21, 2019

In this case no new structures would be needed, custom generators would suffice, e.g.

(clamp(x, lo, up) for x in xs) # winsor

or

(x for x in xs if lo <= x <= up) #trim

for adequate values of up, lo could do.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@AsafManela
Copy link
Contributor Author

I fixed that typo, but also tried to implement the generator solution like this for winsor:

function winsor(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))

    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end

    # allocate vector of all x's indices
    ixall = collect(1:n)
    # indices for lowest count values
    ixlo = partialsortperm!(ixall, x, count+1; initialized=true)
    # indices for largest count values
    ixup = partialsortperm!(ixall, x, n-count; initialized=true)

    @inbounds lo = x[ixlo]
    @inbounds up = x[ixup]

    (clamp(xi, lo, up) for xi in x)
end

Just creating the generator seems to be more expensive than the previous approach.

Version currently comitted to this PR:

julia> @btime winsor(v, prop=0.1);
  436.854 μs (13 allocations: 172.48 KiB)

Version using the above code with a generator:

julia> @btime winsor(v, prop=0.1);
  1.317 ms (8 allocations: 78.31 KiB)
julia> @btime collect(winsor(v, prop=0.1));
  1.342 ms (10 allocations: 156.52 KiB)

So I would suggest we just merge the PR as is.

@nalimilan
Copy link
Member

Thanks for trying. That's really surprising, as the creation of a generator is basically free... I turns out that calling sortperm on a range of indices is faster than passing a single value (I've filed JuliaLang/julia#34233).

This variant is about as fast as the implementation in this PR:

function winsor2(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))

    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end

    # allocate vector of all x's indices
    ixall = collect(1:n)
    # indices for lowest count values
    ixlo = partialsortperm!(ixall, x, 1:(count+1); initialized=true)[end]
    # indices for largest count values
    ixup = partialsortperm!(ixall, x, (n-count):n; initialized=true)[1]

    @inbounds lo = x[ixlo]
    @inbounds up = x[ixup]

    (clamp(xi, lo, up) for xi in x)
end

And this one is even faster:

function winsornew2(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))

    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end

    # indices for lowest count values
    x2 = copy(x)
    lo = partialsort!(x2, 1:count+1)[end]
    # indices for largest count values
    up = partialsort!(x2, n-count:n)[1]

    (clamp(xi, lo, up) for xi in x)
end

Benchmarks computing mean:

julia> @btime mean(winsor(v, prop=0.1)); # From this PR
  575.010 μs (14 allocations: 172.50 KiB)

julia> @btime mean(winsornew(v, prop=0.1));
  572.053 μs (11 allocations: 78.45 KiB)

julia> @btime mean(winsornew2(v, prop=0.1));
  386.626 μs (5 allocations: 78.27 KiB)

@AsafManela
Copy link
Contributor Author

That explains it. Thanks for the better solution.
I implemented it for both trim and winsor and refactored to share some of the code.
New benchmarking suggests we get to have the cake and eat it too:

julia> @btime trim(v, prop=0.1);
  246.258 μs (5 allocations: 78.27 KiB)

julia> @btime winsor(v, prop=0.1);
  245.745 μs (4 allocations: 78.25 KiB)

julia> @btime collect(trim(v, prop=0.1));
  341.090 μs (19 allocations: 206.92 KiB)

julia> @btime collect(winsor(v, prop=0.1));
  252.993 μs (6 allocations: 156.45 KiB)

@nalimilan
Copy link
Member

Cool. We could even add a specialized collect method for trim in the future if we wanted to avoid the overhead due to push! (since we know the number of elements the vector can be allocated upfront).

Since moving to generators is breaking, we should probably wait a bit until we can bundle a series of breaking changes into a new minor release. If you want to fix to be available quickly, maybe drop the last commit from this PR so that we can merge it now, and open a new PR for the generator changes? As you prefer.

@AsafManela
Copy link
Contributor Author

AsafManela commented Jan 2, 2020 via email

@nalimilan
Copy link
Member

Fine with me.

Could you update the docstrings to reflect the fact that the functions no longer return a modified copy of the input? Examples can simply use collect. It would also be nice to specify that when several elements are equal to the threshold, they are all trimmed/clamped, meaning that the proportion or count might be adjusted (the passed value is only a lower bound). That wasn't the case before (which didn't make a lot of sense actually).

@AsafManela
Copy link
Contributor Author

I updated the examples, but I don't fully understand your second comment.
Clamp keeps lo <= x <= hi as is and only replaces strictly higher or lower elements. So elements equal to the thresholds are not trimmed or clamped. Right?

@nalimilan
Copy link
Member

Ah, right. Though my remark still holds in a different form: if there are several values equal to the threshold, none of them will be skipped/clamped, but in the previous implementations some of them could have been, and some of them not, if the specified proportion fell exactly in the middle of that group. Do you see what I mean?

src/robust.jl Outdated
elements of `x` replaced with the next-lowest, and an equal number of the
highest elements replaced with the previous-highest. To compute the Winsorized
mean of `x` use `mean(winsor(x))`.
Return a generator of all elements of `x` that replaces either `count` or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's just say "an iterator", so that in the future we can return a more specific type than an iterator. For example we could return a special AbstractVector type. Same for trim (even if it cannot be a vector since computing indices is O(N)).

@AsafManela
Copy link
Contributor Author

I see. How about this:
"The number of trimmed / replaced elements could be smaller then specified if several elements equal the lower or upper bound."

@nalimilan
Copy link
Member

Sounds good! (But "then" -> "than")

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I'll merge in a few days if nobody objects. Then if we want to tag a non-breaking release we can create a separate branch with only minor changes to backport.

@kleinschmidt kleinschmidt added the breaking A breaking change that needs a semver major release label Feb 16, 2020
@nalimilan nalimilan merged commit 3762c78 into JuliaStats:master Feb 17, 2020
@kleinschmidt kleinschmidt mentioned this pull request Mar 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking A breaking change that needs a semver major release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

winsor implicitly sorts data
4 participants