This repository has been archived by the owner on May 4, 2019. It is now read-only.

Make Each(Drop|Replace|Fail)Null iterators faster #289

Merged: 2 commits, Nov 12, 2017

nalimilan (Member) commented on Oct 19, 2017:

Previously, the compiler did not know that the elements returned by these
iterators would never be Null. Making that known to the compiler
dramatically improves their efficiency.

Also require that the type of the replacement match the eltype of the iterator
for EachReplaceNull, which is much faster. Nulls.replace() already
takes care of converting the replacement on construction.
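
To illustrate the two points above, here is a minimal sketch, not the code from this PR. It assumes Julia 1.3 or later, where `missing`/`Missing` play the role of `null`/`Null`, and the names `ReplaceSketch` and `replace_sketch` are made up for the example. The replacement is converted once on construction, and the iterator declares an `eltype` that excludes `Missing`, so consumers such as `sum` work with a concrete element type.

```julia
# Hypothetical sketch for Julia >= 1.3; `ReplaceSketch` is not a DataArrays type.
struct ReplaceSketch{T, A<:AbstractVector}
    data::A
    replacement::T   # already of the element type, so nothing is converted while iterating
end

# Convert the replacement once, on construction.
function replace_sketch(data::AbstractVector, replacement)
    T = nonmissingtype(eltype(data))
    return ReplaceSketch{T, typeof(data)}(data, convert(T, replacement))
end

# The declared element type never includes Missing.
Base.eltype(::Type{ReplaceSketch{T, A}}) where {T, A} = T
Base.length(itr::ReplaceSketch) = length(itr.data)

function Base.iterate(itr::ReplaceSketch{T}, i::Int = 1) where {T}
    i > length(itr.data) && return nothing
    x = itr.data[i]
    # The `::T` assertion makes explicit that the returned item is never `missing`.
    item = x === missing ? itr.replacement : x::T
    return (item, i + 1)
end

x = Union{Int, Missing}[1, missing, 3]
itr = replace_sketch(x, 0.0)   # 0.0 is converted to the Int 0 here, once
```

With this sketch, `collect(itr)` yields a `Vector{Int}` and `sum(itr)` never has to handle a `Union` element type.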

Optimizations need to rely on the particular fields of DataArray.
This requires restricting them to DataArray, and moving code
to dataarray.jl since the type had not been defined in abstractdataarray.jl.
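
As a rough illustration of why access to the fields helps, the sketch below assumes a container that stores plain values in one array and a boolean mask in another (my understanding is that DataArray is laid out this way, with `data` and `na` fields). `MaskedSketch` and `SkipSketch` are hypothetical names, and the code targets Julia 1.x rather than 0.6. The iterator only reads the mask to decide what to skip and returns elements straight from the values array, so it never produces a `Union` with the null type.

```julia
# Hypothetical stand-in for a DataArray-like container: values plus a mask.
struct MaskedSketch{T}
    data::Vector{T}
    na::BitVector    # true marks an entry that should be treated as null
end

# Hypothetical skipping iterator specialized on that layout.
struct SkipSketch{T}
    da::MaskedSketch{T}
end

Base.eltype(::Type{SkipSketch{T}}) where {T} = T
# The number of non-null entries is not known without scanning the mask.
Base.IteratorSize(::Type{<:SkipSketch}) = Base.SizeUnknown()

function Base.iterate(itr::SkipSketch, i::Int = 1)
    da = itr.da
    # Advance past masked entries using only the mask.
    while i <= length(da.data) && da.na[i]
        i += 1
    end
    i > length(da.data) && return nothing
    # Read directly from the values array, whose element type is concrete.
    return (da.data[i], i + 1)
end

da = MaskedSketch([10, 20, 30, 40], BitArray([false, true, false, true]))
total = sum(SkipSketch(da))
```

A generic iterator that only has `getindex` available would have to materialize each element as either a value or a null before testing it, which is the kind of overhead this specialization avoids.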

Rather than adding specialized iterators for PooledDataArray, which
is going to be deprecated, rely on the generic iterators provided
by Nulls, which are already faster than the current ones.

Cf. JuliaData/Missings.jl#50


Before (on Julia 0.6.0):

Main> using Nulls

Main> using DataArrays

Main> using BenchmarkTools

Main> x = DataArray(Vector{Union{Int, Null}}(rand(Int, 100_000)));

Main> x[rand(1:length(x), 10_000)] = null;

Main> skip_iter(x) = sum(Nulls.skip(x))
skip_iter (generic function with 1 method)

Main> @btime skip_iter(x);
  6.988 ms (552024 allocations: 8.42 MiB)

Main> replace_iter(x, y) = sum(Nulls.replace(x, y))
replace_iter (generic function with 1 method)

Main> @btime replace_iter(x, 0);
  31.703 ms (1051004 allocations: 17.56 MiB)

Main> y = DataArray(Vector{Union{Int, Null}}(rand(Int, 100_000)));

Main> fail_iter(x) = sum(Nulls.fail(x))
fail_iter (generic function with 1 method)

Main> @btime fail_iter(y);
  31.586 ms (1098980 allocations: 18.29 MiB)

After (on Julia 0.6.0):

Main> using Nulls

Main> using DataArrays

Main> using BenchmarkTools

Main> x = DataArray(Vector{Union{Int, Null}}(rand(Int, 100_000)));

Main> x[rand(1:length(x), 10_000)] = null;

Main> skip_iter(x) = sum(Nulls.skip(x))
skip_iter (generic function with 1 method)

Main> @btime skip_iter(x);
  516.099 μs (4 allocations: 64 bytes)

Main> replace_iter(x, y) = sum(Nulls.replace(x, y))
replace_iter (generic function with 1 method)

Main> @btime replace_iter(x, 0);
  676.586 μs (4 allocations: 80 bytes)

Main> y = DataArray(Vector{Union{Int, Null}}(rand(Int, 100_000)));

Main> fail_iter(x) = sum(Nulls.fail(x))
fail_iter (generic function with 1 method)

Main> @btime fail_iter(y);
  526.172 μs (4 allocations: 64 bytes)

item = isnull(itr.da, ind) ? itr.replacement : itr.da[ind]
(item, ind + 1)
end
Base.isnull(a::AbstractArray{T}, i::Real) where {T} = Null <: T ? isa(a[i], Null) : false # -> Bool
Review comment (Member) on the line above: [image: piratey _vector_version svg]

Optimizations need to rely on the particular fields of DataArray.
This requires moving code to dataarray.jl since the type had not
been defined in abstractdataarray.jl.

Rather than adding specialized iterators for PooledDataArray, which
is going to be deprecated, rely on the generic iterators provided
by Nulls, which are already faster than the current ones.
nalimilan (Member, Author) commented:

OK to merge?

@nalimilan nalimilan merged commit 63a6158 into master Nov 12, 2017
@nalimilan nalimilan deleted the nl/itr branch November 12, 2017 19:00

2 participants