Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move bounds checks on copyto!(dst, n, src) #43517

Merged
merged 8 commits into from
May 10, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
14 changes: 11 additions & 3 deletions base/abstractarray.jl
Original file line number Diff line number Diff line change
Expand Up @@ -914,9 +914,17 @@ end

function copyto!(dest::AbstractArray, dstart::Integer, src)
i = Int(dstart)
for x in src
dest[i] = x
i += 1
if haslength(src) && length(dest) > 0
@boundscheck checkbounds(dest, i:(i + length(src) - 1))
for x in src
Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about replace this with

I = eachindex(dest)[i]
@inbounds for x in src
    dest[I] = x
    I = nextind(dest, I)
end

Might be faster for IndexCartesian cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it seems reasonable to improve copyto!(dest::AbstractArray, src) (L893) in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using eachindex(dest)[i] doesn't seem to help, on things I tried.

But I agree that the whole-array method just above has room for comparable improvement.

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see some difference in the following example:

using BenchmarkTools
f(dest, i, src) = begin
    I = eachindex(dest)[i]
    @inbounds for x in src
        dest[I] = x
        I = nextind(dest, I)
    end
end

g(dest, i, src) = begin
    @inbounds for x in src
        dest[i] = x
        i += 1
    end
end

a = view(randn(100,100),1:100,1:100)
@btime f($a, 555, $(i + 1 for i in 1:1000))    # 2.144 μs (0 allocations: 0 bytes)
@btime g($a, 555, $(i + 1 for i in 1:1000))   # 2.956 μs (0 allocations: 0 bytes)

For longer src or higher dimension, the gain might be bigger?

Copy link
Contributor Author

@mcabbott mcabbott Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just when I thought I'd convinced myself... these give me:

julia> @btime f($a, 555, $(i + 1 for i in 1:1000))
  min 2.718 μs, mean 2.759 μs (0 allocations)

julia> @btime g($a, 555, $(i + 1 for i in 1:1000))
  min 804.804 ns, mean 810.625 ns (0 allocations)

This does not affect the 2-arg method?

julia> @btime Base.copyto!($a, 555, $(i + 1 for i in 1:1000));
  min 973.938 ns, mean 985.105 ns (0 allocations)

julia> @btime _copyto!($a, 555, $(i + 1 for i in 1:1000));  # first commit of PR
  min 806.769 ns, mean 811.212 ns (0 allocations)
 
julia> @btime _copyto!($a, 555, $(i + 1 for i in 1:1000));  # PR with eaaefb1
  min 2.741 μs, mean 2.786 μs (0 allocations)

julia> @btime Base.copyto!($a, $(i + 1 for i in 1:length(a)));  # 2-arg method
  min 12.875 μs, mean 13.022 μs (0 allocations)

julia> @btime _copyto!($a, $(i + 1 for i in 1:length(a)));  # PR with 68e3d5e
  min 6.933 μs, mean 7.023 μs (0 allocations)

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for 2-arg version on M1 native, master.
I think the problem is firstindex always return 1 if ndims != 1. Use first(eachindex(dest)) should make things consistent.

On the other hand, just notice that nextind(A, ind::Base.SCartesianIndex2) is not defined, so dest can't be reinterpret(reshape, args...)...
Not sure is it ok to add related definition in this PR? Or just use eachindex(IndexStyle(dest) isa IndexLinear ? IndexLinear() : IndexCartesian(), dest)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good point re firstindex. So that's just handling offsets right now.

On the xeon, "g" is still an improvement on Base.copyto!, even if slower than "f" there. Is that true on your computer too?

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's true, at least we eliminated the boundscheck within the loop.
I have no M1 machine, so I can't test myself.
Would you mind to bench the following?

a = view(randn(100,100),1:100,1:100)
b = view(a,1:99,1:100)
f_each(x) = begin
    r = 0.0
    @inbounds for i in eachindex(x)
        r += x[i]
    end
    r
end
f_linear(x) = begin
    r = 0.0
    @inbounds for i in firstindex(x):lastindex(x)
        r += x[i]
    end
    r
end

@btime f_each($a)
@btime f_linear($a)
@btime f_each($b)
@btime f_linear($b)

Copy link
Contributor Author

@mcabbott mcabbott Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure:

julia> @btime f_each($a)
  min 9.250 μs, mean 9.401 μs (0 allocations)
32.23877902877461

julia> @btime f_linear($a)
  min 9.250 μs, mean 9.401 μs (0 allocations)
32.23877902877461

julia> @btime f_each($b)
  min 9.166 μs, mean 9.315 μs (0 allocations)
22.794925499363792

julia> @btime f_linear($b)
  min 9.166 μs, mean 9.286 μs (0 allocations)
22.794925499363792

vs xeon:

julia> @btime f_each($a)
  17.736 μs (0 allocations: 0 bytes)
153.39744409371883

julia> @btime f_linear($a)
  82.459 μs (0 allocations: 0 bytes)
153.39744409371883

julia> @btime f_each($b)
  17.586 μs (0 allocations: 0 bytes)
153.27406012213328

julia> @btime f_linear($b)
  81.627 μs (0 allocations: 0 bytes)
153.27406012213328

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine, the result is

@btime f_each($a) # 9.100 μs (0 allocations: 0 bytes)
@btime f_linear($a) # 22.700 μs (0 allocations: 0 bytes)
@btime f_each($b)  #  9.000 μs (0 allocations: 0 bytes)
@btime f_linear($b) # 22.500 μs (0 allocations: 0 bytes)

Is this some "dark magic" of M1? (Maybe we don't need IndexCartesian on M1?)

Something might related: M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles
If this is true, I guess reshape would be faster if we omit the current optimization via MultiplicativeInverse

@inbounds dest[i] = x
i += 1
end
else
for x in src
dest[i] = x
i += 1
end
end
return dest
end
Expand Down
10 changes: 10 additions & 0 deletions test/arrayops.jl
Original file line number Diff line number Diff line change
Expand Up @@ -2105,6 +2105,16 @@ end
@test_throws ArgumentError LinearAlgebra.copy_transpose!(a,2:3,1:3,b,1:5,2:7)
end

@testset "empty copyto!" begin
@test isempty(copyto!(Int[], ()))
@test isempty(copyto!(Int[], Int[]))
@test copyto!([1,2], ()) == [1,2]

@test isempty(copyto!(Int[], 1, ()))
@test isempty(copyto!(Int[], 1, Int[]))
@test copyto!([1,2], 1, ()) == [1,2]
end

module RetTypeDecl
using Test
import Base: +, *, broadcast, convert
Expand Down