
More efficient hvcat of scalars and arrays of numbers #39729

Open · wants to merge 11 commits into base: master

Conversation

@BioTurboNick commented Feb 18, 2021

First attempt to address #39713

Original:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
@btime [a c ; b d]   # 31 allocations and 1.25 KiB

New:

@btime [a c ; b d]   # 15 allocations and 656 bytes

The other cases are unchanged, as expected. However, if different types of numbers are mixed, it still takes the slower path. I tried expanding the definition, but there's some weird behavior that increases allocations in the other situations I posted in that issue. It works for any Number element type.
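To illustrate the promotion behavior described above, here is a small sketch (variable names follow the benchmark setup; exact allocation counts will vary by Julia version):

```julia
# Sketch: when all block arguments share a numeric element type, hvcat
# can build the result directly; mixing numeric types promotes the
# result's element type instead.
a = zeros(Int, 2, 2); b = [3 4]; c = [2; 4]; d = 5

A = [a c; b d]      # all-Int inputs
B = [a c; b 5.0]    # Int and Float64 mixed -> promoted element type

@assert eltype(A) == Int64
@assert eltype(B) == Float64
@assert A == [0 0 2; 0 0 4; 3 4 5]
```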

Fixes #39713

@BioTurboNick

Oops, yep: completely broken. I should have guessed.

@BioTurboNick BioTurboNick changed the title More efficient hvcat of same-type elements More efficient hvcat of scalars and arrays of numbers Feb 18, 2021
@BioTurboNick

const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools
e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]
@btime e19() # 12 alloc: 528 bytes     ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e20() # 13 alloc: 544 bytes     ; 1.6.0-rc1: 39 alloc: 1.38 KiB
@btime e21() # <broken>                 ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e22() # 10 alloc: 512 bytes     ; 1.6.0-rc1: 10 alloc: 512 bytes
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1

e6() = [x y ; x y]
@btime e6() # 4 alloc: 672 bytes        ; 1.6.0-rc1: 104 alloc: 4.98 KiB

This is what I achieved by consolidating with hvncat (once it's ready). Note that it integrates #39725, which is currently broken for assigning strings into a range with setindex!.
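For reference, the `;;;` separator used to build `x` above concatenates blocks along the third dimension; a minimal sketch (this syntax requires Julia 1.7 or later):

```julia
# The `;;;` separator concatenates along dimension 3, equivalent to
# cat(...; dims=3). Requires Julia 1.7+.
x = [1 2; 3 4;;; 5 6; 7 8]

@assert x == cat([1 2; 3 4], [5 6; 7 8], dims=3)
@assert size(x) == (2, 2, 2)
@assert x[:, :, 2] == [5 6; 7 8]
```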

@timholy commented Feb 19, 2021

I haven't had time to look at this, but in case you're interested and if it helps, I recently sped up cat: #39292, #39294, #39314

@BioTurboNick

Cool! @timholy, I think I noticed some of that effect without knowing the cause. Don't worry about reviewing this yet; it's dependent on my other PR #33697.

@BioTurboNick commented Jun 6, 2021

Updated this, now that the hvncat mechanism is merged:

julia> @btime e19()
  904.651 ns (8 allocations: 576 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 2.611 μs (31 allocations: 1.25 KiB)
#             master: 2.156 μs (30 allocations: 1.44 KiB)

julia> @btime e20()
  910.811 ns (9 allocations: 592 bytes)
3×3 Matrix{Float64}:
 0.0  0.0  2.0
 0.0  0.0  4.0
 3.0  4.0  5.0
# compare with 1.6.1: 2.811 μs (39 allocations: 1.38 KiB)
#             master: 2.267 μs (38 allocations: 1.56 KiB)

julia> @btime e21()
  847.368 ns (5 allocations: 432 bytes)
3×3 Matrix{Any}:
 0  0  2
 0  0  4
 3  4   "string"
# compare with 1.6.1: 2.678 μs (31 allocations: 1.25 KiB)
#             master: 2.167 μs (26 allocations: 1.25 KiB)

julia> @btime e22()
  873.333 ns (6 allocations: 528 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 1.390 μs (10 allocations: 512 bytes)
#             master: 1.240 μs (10 allocations: 512 bytes)

julia> @btime e6() # 4 alloc: 672 bytes
  515.104 ns (5 allocations: 640 bytes)
4×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  2  2  3
 3  4  4  5
 1  2  2  3
 3  4  4  5

[:, :, 2] =
 5  6  6  7
 7  8  8  9
 5  6  6  7
 7  8  8  9
# compare with 1.6.1: 4.414 μs (56 allocations: 2.41 KiB)
#             master: 3.337 μs (50 allocations: 2.39 KiB)

julia> @btime [1 2; 4 5]
  29.879 ns (1 allocation: 112 bytes)
2×2 Matrix{Int64}:
 1  2
 4  5
# compare with 1.6.1: 91.208 ns (2 allocations: 160 bytes)
#             master:  93.194 ns (2 allocations: 160 bytes)

julia> @btime [1 2; 4 fill(1, 1, 1, 1)]
  877.551 ns (9 allocations: 624 bytes)
2×2 Matrix{Int64}:
 1  2
 4  1
# compare with 1.6.1: 4.188 μs (43 allocations: 1.62 KiB)
#             master: 3.487 μs (42 allocations: 1.81 KiB)

@BioTurboNick commented Jun 6, 2021

With respect to the original motivating examples:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
# mixed arrays and scalars
@btime [a c ; b d]   # 2.922 μs (31 allocations: 1.25 KiB); now 897.778 ns (8 allocations: 576 bytes)
@btime [a c ; [b d]] # 2.322 μs (21 allocations: 880 bytes); now 1.690 μs (26 allocations: 1.30 KiB)
@btime [[a c] ; [b d]] # 975.000 ns (16 allocations: 816 bytes) -- explicit hcat nested within vcat (should be worst case)
# scalars wrapped in arrays
@btime [a c ; b [d]] # 1.410 μs (10 allocations: 512 bytes); now 865.957 ns (6 allocations: 528 bytes)
@btime [a c ; [b [d]]] # 1.230 μs (9 allocations: 560 bytes); now 816.250 ns (14 allocations: 1008 bytes)
@btime [[a c] ; [b [d]]] # 142.069 ns (4 allocations: 496 bytes) -- explicit hcat nested within vcat (should be worst case)

So, better overall. There are still some more improvements to make; I'm not entirely sure why the middle case is relatively poor.


@BioTurboNick

Current status of this PR vs. Master:

(Some timing differences from previous test likely due to my new computer)

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 259.371 ns (22 allocations: 1.17 KiB) +++
@btime [a c ; [b d]] # 1.612 μs (26 allocations: 1.12 KiB) ---
@btime [[a c] ; [b d]] # 497.052 ns (16 allocations: 720 bytes) ---

# scalars wrapped in arrays
@btime [a c ; b [d]] # 189.981 ns (4 allocations: 352 bytes)
@btime [a c ; [b [d]]] # 1.161 μs (14 allocations: 816 bytes) ---
@btime [[a c] ; [b [d]]] # 84.902 ns (4 allocations: 384 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 628.402 ns (17 allocations: 944 bytes)
@btime [a c ; [b d]] # 1.110 μs (21 allocations: 800 bytes)
@btime [[a c] ; [b d]] # 480.513 ns (16 allocations: 720 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 801.111 ns (10 allocations: 448 bytes)
@btime [a c ; [b [d]]] # 665.161 ns (9 allocations: 464 bytes)
@btime [[a c] ; [b [d]]] # 79.321 ns (4 allocations: 384 bytes)

+++ = Improved since last test
--- = Regression since last test

So currently, master is actually faster than or equivalent to this PR for 4 of the 6 cases, which is great. I will need to think about whether this PR is still necessary, and see what can improve the remaining cases.

@BioTurboNick

Updating based on current master:

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 2.473 μs (43 allocations: 2.03 KiB)
@btime [a c ; [b d]] # 1.561 μs (33 allocations: 1.17 KiB)
@btime [[a c] ; [b d]] # 467.454 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 176.155 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 1.088 μs (22 allocations: 864 bytes)
@btime [[a c] ; [b [d]]] # 57.834 ns (8 allocations: 432 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 564.249 ns (17 allocations: 816 bytes)
@btime [a c ; [b d]] # 1.041 μs (23 allocations: 832 bytes)
@btime [[a c] ; [b d]] # 470.286 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 708.369 ns (12 allocations: 464 bytes)
@btime [a c ; [b [d]]] # 585.099 ns (12 allocations: 496 bytes)
@btime [[a c] ; [b [d]]] # 59.694 ns (8 allocations: 432 bytes)

All cases on master are faster than they were, even in the presence of additional allocations. I'll see about wrapping this PR up one way or the other this weekend, I think.

@BioTurboNick commented Feb 11, 2024

New, simpler approach: just dispatching hvcat to the hvncat_shape methods. This is back to being similar to or faster than master, but I need to verify I didn't slow down any other cases.

EDIT: updated timings after better addressing dynamic dispatch of cat_similar

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 150.041 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]] # 565.120 ns (31 allocations: 1.11 KiB)
@btime [[a c] ; [b d]] # 472.211 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 85.399 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 148.468 ns (20 allocations: 800 bytes)
@btime [[a c] ; [b [d]]] # 58.048 ns (8 allocations: 432 bytes)

@BioTurboNick commented Feb 12, 2024

The remaining slowdown I'm seeing relates to the hcat methods.

In 1.10, hcat for an array and a scalar dispatches to a SparseArrays method (even though I haven't loaded SparseArrays?), which then redirects to typed_hcat, and then to _cat_t(Val(2), T, X...), which is very fast.

On master, the fallback abstractarray.jl hcat is called instead, which feeds into the slower cat(A...; dims=Val(2)).

Possibly related to #49322? It looks like a regression in cat; I haven't fully investigated:

#1.10:
julia> @btime cat(b, d, dims = 2)
#  24.975 ns (1 allocation: 80 bytes)

#master:
julia> @btime cat(b, d, dims = 2)
#  554.371 ns (19 allocations: 576 bytes)

But we can probably ignore that for the purposes of this PR, which is about hvcat.
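The dispatch difference can be seen directly; a minimal sketch (here `Val(2)` encodes the dimension in the type domain, so concatenation can specialize at compile time, while a plain integer keyword is only known at runtime):

```julia
# Sketch: cat with dims=Val(2) lets dispatch specialize on the
# dimension at compile time; dims=2 may take a slower generic path.
b = [3 4]; d = 5

specialized = cat(b, d; dims=Val(2))  # dimension known to the compiler
generic     = cat(b, d; dims=2)       # dimension known only at runtime

@assert specialized == generic == [3 4 5]
```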

@BioTurboNick BioTurboNick marked this pull request as ready for review February 12, 2024 15:11
@RomeoV commented Mar 10, 2024

I wonder if this resolves the following issue:

using LinearAlgebra
function measure()
  data = rand(3,3)
  mat = [
    data ones(size(data, 1))
    ones(size(data, 2))' 0
  ]
  det(mat)
end
@code_warntype measure()  # -> can't infer type of `mat`

There has been some discussion at https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Matrix.20comprehension.20failed.20type.20inference.

@BioTurboNick

Seems like it does!

MethodInstance for measure()
  from measure() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(measure)
Locals
  mat::Matrix{Float64}
  data::Matrix{Float64}
Body::Float64
1 ─ %1  = Main.rand::Core.Const(rand)
│         (data = (%1)(3, 3))
│   %3  = Core.tuple(2, 2)::Core.Const((2, 2))
│   %4  = data::Matrix{Float64}
│   %5  = Main.ones::Core.Const(ones)
│   %6  = Main.size::Core.Const(size)
│   %7  = data::Matrix{Float64}
│   %8  = (%6)(%7, 1)::Int64
│   %9  = (%5)(%8)::Vector{Float64}
│   %10 = Main.:var"'"::Core.Const(adjoint)
│   %11 = Main.ones::Core.Const(ones)
│   %12 = Main.size::Core.Const(size)
│   %13 = data::Matrix{Float64}
│   %14 = (%12)(%13, 2)::Int64
│   %15 = (%11)(%14)::Vector{Float64}
│   %16 = (%10)(%15)::Adjoint{Float64, Vector{Float64}}
│         (mat = Base.hvcat(%3, %4, %9, %16, 0))
│   %18 = Main.det::Core.Const(LinearAlgebra.det)
│   %19 = mat::Matrix{Float64}
│   %20 = (%18)(%19)::Float64
└──       return %20
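One way to check this without reading @code_warntype output is to inspect the inferred return type programmatically; a sketch (inference results depend on the Julia version, so only the runtime result is asserted here):

```julia
using LinearAlgebra

# RomeoV's example: hvcat of a matrix, vectors, and a scalar.
function measure()
    data = rand(3, 3)
    mat = [
        data ones(size(data, 1))
        ones(size(data, 2))' 0
    ]
    det(mat)
end

# Float64 (rather than Any) here indicates hvcat's result was inferred.
inferred = Base.return_types(measure, Tuple{})

@assert measure() isa Float64   # the runtime result is Float64 either way
```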

Successfully merging this pull request may close these issues:

Extra allocations with hvcat of mixed arrays and scalars (#39713)