Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move bounds checks on copyto!(dst, n, src) #43517

Merged
merged 8 commits into from
May 10, 2022

Conversation

mcabbott
Copy link
Contributor

This speeds up @btime copyto!($(rand(3,10)), 7, (1.0, 2.0, 3.0)); from 2.708 ns to 1.375 ns.

By adding @inline and @boundscheck, we can get @btime @inbounds copyto!($(rand(3,10)), 7, (1.0, 2.0, 3.0)); down to 0.875 ns. But this didn't seem to improve anything when used within a loop in #43334, so maybe that's not necessary.

Copy link
Sponsor Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me

if haslength(src)
checkbounds(dest, i)
checkbounds(dest, i + length(src) - 1)
for x in src
Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about replace this with

I = eachindex(dest)[i]
@inbounds for x in src
    dest[I] = x
    I = nextind(dest, I)
end

Might be faster for IndexCartesian cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it seems reasonable to improve copyto!(dest::AbstractArray, src) (L893) in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using eachindex(dest)[i] doesn't seem to help, on things I tried.

But I agree that the whole-array method just above has room for comparable improvement.

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see some difference in the following example:

using BenchmarkTools
f(dest, i, src) = begin
    I = eachindex(dest)[i]
    @inbounds for x in src
        dest[I] = x
        I = nextind(dest, I)
    end
end

g(dest, i, src) = begin
    @inbounds for x in src
        dest[i] = x
        i += 1
    end
end

a = view(randn(100,100),1:100,1:100)
@btime f($a, 555, $(i + 1 for i in 1:1000))    # 2.144 μs (0 allocations: 0 bytes)
@btime g($a, 555, $(i + 1 for i in 1:1000))   # 2.956 μs (0 allocations: 0 bytes)

For longer src or higher dimension, the gain might be bigger?

Copy link
Contributor Author

@mcabbott mcabbott Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just when I thought I'd convinced myself... these give me:

julia> @btime f($a, 555, $(i + 1 for i in 1:1000))
  min 2.718 μs, mean 2.759 μs (0 allocations)

julia> @btime g($a, 555, $(i + 1 for i in 1:1000))
  min 804.804 ns, mean 810.625 ns (0 allocations)

This does not affect the 2-arg method?

julia> @btime Base.copyto!($a, 555, $(i + 1 for i in 1:1000));
  min 973.938 ns, mean 985.105 ns (0 allocations)

julia> @btime _copyto!($a, 555, $(i + 1 for i in 1:1000));  # first commit of PR
  min 806.769 ns, mean 811.212 ns (0 allocations)
 
julia> @btime _copyto!($a, 555, $(i + 1 for i in 1:1000));  # PR with eaaefb1
  min 2.741 μs, mean 2.786 μs (0 allocations)

julia> @btime Base.copyto!($a, $(i + 1 for i in 1:length(a)));  # 2-arg method
  min 12.875 μs, mean 13.022 μs (0 allocations)

julia> @btime _copyto!($a, $(i + 1 for i in 1:length(a)));  # PR with 68e3d5e
  min 6.933 μs, mean 7.023 μs (0 allocations)

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for 2-arg version on M1 native, master.
I think the problem is firstindex always return 1 if ndims != 1. Use first(eachindex(dest)) should make things consistent.

On the other hand, just notice that nextind(A, ind::Base.SCartesianIndex2) is not defined, so dest can't be reinterpret(reshape, args...)...
Not sure is it ok to add related definition in this PR? Or just use eachindex(IndexStyle(dest) isa IndexLinear ? IndexLinear() : IndexCartesian(), dest)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good point re firstindex. So that's just handling offsets right now.

On the xeon, "g" is still an improvement on Base.copyto!, even if slower than "f" there. Is that true on your computer too?

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's true, at least we eliminated the boundscheck within the loop.
I have no M1 machine, so I can't test myself.
Would you mind to bench the following?

a = view(randn(100,100),1:100,1:100)
b = view(a,1:99,1:100)
f_each(x) = begin
    r = 0.0
    @inbounds for i in eachindex(x)
        r += x[i]
    end
    r
end
f_linear(x) = begin
    r = 0.0
    @inbounds for i in firstindex(x):lastindex(x)
        r += x[i]
    end
    r
end

@btime f_each($a)
@btime f_linear($a)
@btime f_each($b)
@btime f_linear($b)

Copy link
Contributor Author

@mcabbott mcabbott Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure:

julia> @btime f_each($a)
  min 9.250 μs, mean 9.401 μs (0 allocations)
32.23877902877461

julia> @btime f_linear($a)
  min 9.250 μs, mean 9.401 μs (0 allocations)
32.23877902877461

julia> @btime f_each($b)
  min 9.166 μs, mean 9.315 μs (0 allocations)
22.794925499363792

julia> @btime f_linear($b)
  min 9.166 μs, mean 9.286 μs (0 allocations)
22.794925499363792

vs xeon:

julia> @btime f_each($a)
  17.736 μs (0 allocations: 0 bytes)
153.39744409371883

julia> @btime f_linear($a)
  82.459 μs (0 allocations: 0 bytes)
153.39744409371883

julia> @btime f_each($b)
  17.586 μs (0 allocations: 0 bytes)
153.27406012213328

julia> @btime f_linear($b)
  81.627 μs (0 allocations: 0 bytes)
153.27406012213328

Copy link
Member

@N5N3 N5N3 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine, the result is

@btime f_each($a) # 9.100 μs (0 allocations: 0 bytes)
@btime f_linear($a) # 22.700 μs (0 allocations: 0 bytes)
@btime f_each($b)  #  9.000 μs (0 allocations: 0 bytes)
@btime f_linear($b) # 22.500 μs (0 allocations: 0 bytes)

Is this some "dark magic" of M1? (Maybe we don't need IndexCartesian on M1?)

Something might related: M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles
If this is true, I guess reshape would be faster if we omit the current optimization via MultiplicativeInverse

@mcabbott mcabbott changed the title Move bounds checks on copyto!(dst, n, src) Move bounds checks on copyto!(dst, n, src) and copyto!(dest, src) Dec 22, 2021
throw(ArgumentError("destination has fewer elements than required"))
dest[y[1]] = x
y = iterate(destiter, y[2])
i = Int(firstindex(dest))
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem right to me to switch from eachindex to firstindex just because src has a length.

Also, in the past we have avoided annotating inbounds in generic methods like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will fiddle a bit more. At the moment, removing @inbounds removes the whole speed advantage of this method. But perhaps there's a smarter way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, I'm think the PR does not handle length zero correctly right now. It breaks these, which work on master, but don't seem to have tests:

julia> firstindex(Int[])
1

julia> copyto!(Int[], ())
Int64[]

julia> copyto!(Int[], 1, ())
Int64[]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this 2-arg method, as on the cases I was testing, I can't see a way to speed it up without using linear indexing & @inbounds.

I've also added tests for these empty cases, and fixed the 3-arg method to pass them.

@mcabbott mcabbott changed the title Move bounds checks on copyto!(dst, n, src) and copyto!(dest, src) Move bounds checks on copyto!(dst, n, src) Jan 6, 2022
@mcabbott
Copy link
Contributor Author

mcabbott commented Feb 9, 2022

Pre-1.8 bump?

@oscardssmith
Copy link
Member

What's the status on this? Is it ready to merge?

@N5N3
Copy link
Member

N5N3 commented Feb 9, 2022

@JeffBezanson posted some concern on inbounds annotation.

Also, in the past we have avoided annotating inbounds in generic methods like this.

Although that part change has been reverted, I'm not sure is it OK to him for the @inbounds in copyto!(dest::AbstractArray, dstart::Integer, src).
Also it's a little strange for me copyto!(a, src) is slower than copyto!(a, 1, src)

@mcabbott
Copy link
Contributor Author

mcabbott commented Feb 9, 2022

Yes that's accurate. I reverted to the smallest initial change, thus this only speeds up copyto!(a, 1, src), which is the case #43334 wanted.

@KristofferC KristofferC added the status:merge me PR is reviewed. Merge when all tests are passing label May 10, 2022
@KristofferC KristofferC merged commit eb938da into JuliaLang:master May 10, 2022
@mcabbott mcabbott deleted the copyto branch May 10, 2022 15:53
@giordano giordano removed the status:merge me PR is reviewed. Merge when all tests are passing label May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants