trim and winsor now maintain original order #546
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
- Coverage   90.39%   90.37%   -0.03%
==========================================
  Files          21       21
  Lines        2104     2099       -5
==========================================
- Hits         1902     1897       -5
  Misses        202      202
==========================================
```
Continue to review full report at Codecov.
Thanks! Could you do some benchmarking to compare the performance with the previous implementation? If that's significantly slower, it would make sense to add an argument to allow sorting data when performance is needed (e.g. if you just want to compute the mean).
src/robust.jl
Outdated
```julia
deleteat!(x, 1:count)
ix = vcat(partialsortperm(x, 1:count), partialsortperm(x, (n-count+1):n))
sort!(ix)
ix = unique(ix) # can be replaced by unique! starting julia 1.1
```
Since you know indices are sorted, a simple loop comparing values to previous ones and skipping them will probably be faster.
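Such a loop might look like the following sketch (the helper name is illustrative, not code from the PR): it removes consecutive duplicates in place from an already-sorted index vector, avoiding the extra allocation of `unique`.

```julia
# Remove consecutive duplicates from a sorted vector in place.
# Illustrative sketch only; `dedup_sorted!` is a hypothetical name.
function dedup_sorted!(ix::Vector{Int})
    isempty(ix) && return ix
    keep = 1                      # index of the last kept element
    for i in 2:length(ix)
        if ix[i] != ix[keep]      # skip values equal to the previous one
            keep += 1
            ix[keep] = ix[i]
        end
    end
    resize!(ix, keep)             # drop the leftover tail
    return ix
end

dedup_sorted!([1, 2, 2, 3, 3, 3, 5])  # == [1, 2, 3, 5]
```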
Have you tried this?
src/robust.jl
Outdated
```julia
x[1:count] .= x[count+1]
x[n-count+1:end] .= x[n-count]
ix = partialsortperm(x, 1:(count+1))
x[ix[1:count]] .= x[ix[count+1]]
```
Suggested change:

```diff
- x[ix[1:count]] .= x[ix[count+1]]
+ x[view(ix, 1:count)] .= x[ix[count+1]]
```
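A standalone toy example of what the suggestion buys (data chosen by hand, not from the PR): indexing with `view(ix, 1:count)` avoids allocating the temporary vector that `ix[1:count]` creates.

```julia
# Toy data, not from the PR: pretend `ix` holds indices of the lowest values.
x = [10.0, 20.0, 30.0, 40.0]
ix = [3, 1, 2, 4]
count = 2

# Overwrite x[3] and x[1] with x[ix[3]] == x[2], without allocating ix[1:count].
x[view(ix, 1:count)] .= x[ix[count + 1]]
# x == [20.0, 20.0, 20.0, 40.0]
```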
Perhaps because this would now be a view of a view; Julia doesn't like this and it kills the process.
Well, that's certainly not supposed to happen. Can you try starting Julia with `--check-bounds=yes`? Otherwise it would be worth reporting.
src/robust.jl
Outdated
```julia
x[ix[1:count]] .= x[ix[count+1]]

ix = partialsortperm(x, (n-count):n)
x[ix[2:end]] .= x[ix[1]]
```
Suggested change (note that `end` is only valid inside indexing brackets, so `lastindex` is used in the `view` call):

```diff
- x[ix[2:end]] .= x[ix[1]]
+ x[view(ix, 2:lastindex(ix))] .= x[ix[1]]
```
same as above
Thanks for the suggestions!
Here is what the benchmarking looks like now compared with master branch:
setup
```julia
julia> n = 10000; v = rand(n)
10000-element Array{Float64,1}:
 0.07612935713307811
 0.7580974363520654
 0.9980846175305966
 ⋮
 0.491731488822605
 0.9336393339977287
 0.9843318462521014
```
master
```julia
julia> @btime trim(v, prop=0.1);
  374.195 μs (3 allocations: 78.22 KiB)

julia> @btime winsor(v, prop=0.1);
  374.352 μs (3 allocations: 78.22 KiB)
```
am/winsor
```julia
julia> @btime trim(v, prop=0.1);
  504.562 μs (13 allocations: 172.39 KiB)

julia> @btime winsor(v, prop=0.1);
  401.524 μs (13 allocations: 172.48 KiB)
```
OK, I guess a 34% slowdown is still acceptable. Maybe my suggestion to replace
Sometimes I wonder whether it wouldn't make more sense to have these functions return iterators over the original data. If we only stored the minimum and maximum thresholds, we could just return generators that skip or replace values outside of the range on the fly. That would probably be quite a bit faster when computing a single summary statistic (typically the mean), and you could easily call

What do you think?
There isn't really a nice solution to it, because these would be iterators depending on properties of the full original data, so they would not play nicely with iterators as inputs. But I would think an improvement in performance would justify this.
What do you mean?
We do not so often define iterators which require the input data to be collected or traversed,
Yes, so as currently the input would be required to be an
In this case no new structures would be needed, custom generators would suffice, e.g.

or

for adequate values of
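The generator idea sketched above might look like this (a standalone illustration with hand-picked thresholds `lo` and `up`, not the PR's code; the real functions would compute the thresholds from `prop` or `count`):

```julia
# Toy data and thresholds chosen by hand for illustration.
x = [0.1, 0.5, 0.9, 0.3]
lo, up = 0.2, 0.8

# Winsorizing: lazily replace out-of-range values with the threshold.
winsorized = (clamp(xi, lo, up) for xi in x)

# Trimming: lazily skip out-of-range values.
trimmed = (xi for xi in x if lo <= xi <= up)

collect(winsorized)  # == [0.2, 0.5, 0.8, 0.3]
collect(trimmed)     # == [0.5, 0.3]
```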
I fixed that typo, but also tried to implement the generator solution like this for `winsor`:

```julia
function winsor(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))
    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end
    # allocate vector of all x's indices
    ixall = collect(1:n)
    # indices for lowest count values
    ixlo = partialsortperm!(ixall, x, count+1; initialized=true)
    # indices for largest count values
    ixup = partialsortperm!(ixall, x, n-count; initialized=true)
    @inbounds lo = x[ixlo]
    @inbounds up = x[ixup]
    (clamp(xi, lo, up) for xi in x)
end
```

Just creating the generator seems to be more expensive than the previous approach. Version currently committed to this PR:

```julia
julia> @btime winsor(v, prop=0.1);
  436.854 μs (13 allocations: 172.48 KiB)
```

Version using the above code with a generator:

```julia
julia> @btime winsor(v, prop=0.1);
  1.317 ms (8 allocations: 78.31 KiB)

julia> @btime collect(winsor(v, prop=0.1));
  1.342 ms (10 allocations: 156.52 KiB)
```

So I would suggest we just merge the PR as is.
Thanks for trying. That's really surprising, as the creation of a generator is basically free... It turns out that calling

This variant is about as fast as the implementation in this PR:

```julia
function winsor2(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))
    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end
    # allocate vector of all x's indices
    ixall = collect(1:n)
    # indices for lowest count values
    ixlo = partialsortperm!(ixall, x, 1:(count+1); initialized=true)[end]
    # indices for largest count values
    ixup = partialsortperm!(ixall, x, (n-count):n; initialized=true)[1]
    @inbounds lo = x[ixlo]
    @inbounds up = x[ixup]
    (clamp(xi, lo, up) for xi in x)
end
```

And this one is even faster:

```julia
function winsornew2(x::AbstractVector; prop::Real=0.0, count::Integer=0)
    n = length(x)
    n > 0 || throw(ArgumentError("x can not be empty."))
    if count == 0
        0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 ≤ prop < 0.5."))
        count = floor(Int, n * prop)
    else
        prop == 0 || throw(ArgumentError("prop and count can not both be > 0."))
        0 <= count < n/2 || throw(ArgumentError("count must satisfy 0 ≤ count < length(x)/2."))
    end
    # work on a copy so partialsort! doesn't mutate the input
    x2 = copy(x)
    # (count+1)-th lowest value
    lo = partialsort!(x2, 1:count+1)[end]
    # (count+1)-th largest value
    up = partialsort!(x2, n-count:n)[1]
    (clamp(xi, lo, up) for xi in x)
end
```

Benchmarks computing mean:

```julia
julia> @btime mean(winsor(v, prop=0.1)); # From this PR
  575.010 μs (14 allocations: 172.50 KiB)

julia> @btime mean(winsornew(v, prop=0.1));
  572.053 μs (11 allocations: 78.45 KiB)

julia> @btime mean(winsornew2(v, prop=0.1));
  386.626 μs (5 allocations: 78.27 KiB)
```
That explains it. Thanks for the better solution.

```julia
julia> @btime trim(v, prop=0.1);
  246.258 μs (5 allocations: 78.27 KiB)

julia> @btime winsor(v, prop=0.1);
  245.745 μs (4 allocations: 78.25 KiB)

julia> @btime collect(trim(v, prop=0.1));
  341.090 μs (19 allocations: 206.92 KiB)

julia> @btime collect(winsor(v, prop=0.1));
  252.993 μs (6 allocations: 156.45 KiB)
```
Cool. We could even add a specialized `collect` method for `trim` in the future if we wanted to avoid the overhead due to `push!` (since we know the number of elements, the vector can be allocated upfront).

Since moving to generators is breaking, we should probably wait a bit until we can bundle a series of breaking changes into a new minor release. If you want the fix to be available quickly, maybe drop the last commit from this PR so that we can merge it now, and open a new PR for the generator changes? As you prefer.
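A hypothetical sketch of such a specialized collect (the function name and signature are illustrative, not from the PR): since `count` is known, the output has exactly `n - 2*count` elements, so the result can be allocated upfront instead of grown with `push!`. This sketch assumes no ties at the thresholds.

```julia
# Illustrative only: collect a trim-style generator into a preallocated vector.
# `lo`/`up` are the precomputed thresholds, `count` the number trimmed per side.
function collect_trimmed(x::AbstractVector, lo, up, count::Integer)
    out = Vector{eltype(x)}(undef, length(x) - 2 * count)  # known output length
    j = 0
    for xi in x
        if lo <= xi <= up       # keep only in-range values, preserving order
            j += 1
            out[j] = xi
        end
    end
    return out
end

collect_trimmed([0.1, 0.5, 0.9, 0.3, 0.0, 1.0], 0.1, 0.9, 1)
# == [0.1, 0.5, 0.9, 0.3]
```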
No rush on my end. Please merge when you think appropriate.
Fine with me. Could you update the docstrings to reflect the fact that the functions no longer return a modified copy of the input? Examples can simply use
I updated the examples, but I don't fully understand your second comment.
Ah, right. Though my remark still holds in a different form: if there are several values equal to the threshold, none of them will be skipped/clamped, but in the previous implementation some of them could have been, and some of them not, if the specified proportion fell exactly in the middle of that group. Do you see what I mean?
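A toy illustration of the tie behaviour described above (data chosen by hand, not from the PR): with threshold-based clamping, every value exactly equal to the threshold is left untouched, whereas an order-statistic implementation could have replaced some of them.

```julia
# Suppose the cut falls inside the run of 2s: both thresholds equal 2.
x = [1, 2, 2, 2, 3]
lo, up = 2, 2

# Threshold-based winsorizing clamps 1 and 3, but modifies none of the 2s.
clamped = [clamp(xi, lo, up) for xi in x]
# clamped == [2, 2, 2, 2, 2]
```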
src/robust.jl
Outdated
```julia
elements of `x` replaced with the next-lowest, and an equal number of the
highest elements replaced with the previous-highest. To compute the Winsorized
mean of `x` use `mean(winsor(x))`.
Return a generator of all elements of `x` that replaces either `count` or
```
Maybe let's just say "an iterator", so that in the future we can return a more specific type than an iterator. For example, we could return a special `AbstractVector` type. Same for `trim` (even if it cannot be a vector since computing indices is O(N)).
I see. How about this:
Sounds good! (But "then" -> "than") |
Thanks!
I'll merge in a few days if nobody objects. Then if we want to tag a non-breaking release we can create a separate branch with only minor changes to backport.
trim and winsor now maintain original order. closes #437