Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call user function only once in mean #151

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Sep 9, 2023

Override the standard mapreduce machinery to promote accumulator type.
This avoid calling the function twice, which can be confusing.

Fixes #49 but with a more general solution than #80.

Cc: @stevengj @kagalenko-m-b

Unfortunately, performance drops for a Float64 vector:

julia> x = rand(100_000_000);

julia> y = ifelse.(rand(Bool, length(x)) .> .25, missing, x);

julia> z = Vector{Union{Missing, Float64, Int}}(y);

julia> z .= ifelse.(rand(Bool, length(y)) .> .25, round.(Int, x), z);

# master
julia> @btime mean(x);
  68.582 ms (1 allocation: 16 bytes)

julia> @btime mean(y);
  472.870 ms (1 allocation: 16 bytes)

julia> @btime mean(z);
  533.646 ms (3 allocations: 48 bytes)

# PR
julia> @btime mean(x);
  128.785 ms (4 allocations: 64 bytes)

julia> @btime mean(y);
  506.305 ms (1 allocation: 16 bytes)

julia> @btime mean(z);
  598.425 ms (1 allocation: 16 bytes)

This is simply due to sum being slower when init is provided as it calls mapfoldl under the hood in that case:

julia> @btime sum(x);
  68.199 ms (1 allocation: 16 bytes)

julia> @btime sum(x, init=0.0);
  124.127 ms (1 allocation: 16 bytes)

Override the standard `mapreduce` machinery to promote accumulator type.
This avoid calling the function twice, which can be confusing.
@nalimilan
Copy link
Member Author

The new commit implements a simpler approach which is fast for concrete eltypes but super slow for complex type unions as the compiler doesn't infer an element type more precise than Any

julia> @btime mean(x);
  68.529 ms (4 allocations: 64 bytes)

julia> @btime mean(y);
  550.489 ms (0 allocations: 0 bytes)

julia> @btime mean(z);
  2.752 s (25300984 allocations: 386.06 MiB)

@kagalenko-m-b
Copy link
Contributor

It looks like you're reimplementing undocumented Base internals, doesn't that have potential to break in future Julia versions?

@nalimilan
Copy link
Member Author

Yes, but given that Statistics is an stdlib any breakage would be noticed in Julia even before merging it. I'm more concerned about performance issues introduced by the PR...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Statistic.mean(f, A) calls f length(A)+1 times
2 participants