
More efficient hvcat of scalars and arrays of numbers #39729

Open · wants to merge 11 commits into base: master

Conversation

@BioTurboNick commented Feb 18, 2021

First attempt to address #39713

Original:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
@btime [a c ; b d]   # 31 allocations and 1.25 KiB

New:

@btime [a c ; b d]   # 15 allocations and 656 bytes

The other cases are unchanged, as expected. However, if different types of numbers are mixed, it still takes the slower path. I tried expanding the definition, but there's some weird behavior that increases allocations in the other situations I posted in that issue. It works for any Number element type.
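To illustrate the promotion behavior described above, here is a small sketch (variable names follow the benchmark setup; exact allocation counts will vary by Julia version):

```julia
# Sketch: when all block arguments share a numeric element type, hvcat
# can build the result directly; mixing numeric types promotes the
# result's element type instead.
a = zeros(Int, 2, 2); b = [3 4]; c = [2; 4]; d = 5

A = [a c; b d]      # all-Int inputs
B = [a c; b 5.0]    # Int and Float64 mixed -> promoted element type

@assert eltype(A) == Int64
@assert eltype(B) == Float64
@assert A == [0 0 2; 0 0 4; 3 4 5]
```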

Fixes #39713

@BioTurboNick

Oops, yep: completely broken. I should have guessed.

@BioTurboNick BioTurboNick changed the title More efficient hvcat of same-type elements More efficient hvcat of scalars and arrays of numbers Feb 18, 2021
@BioTurboNick

const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools
e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]
@btime e19() # 12 alloc: 528 bytes     ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e20() # 13 alloc: 544 bytes     ; 1.6.0-rc1: 39 alloc: 1.38 KiB
@btime e21() # <broken>                 ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e22() # 10 alloc: 512 bytes     ; 1.6.0-rc1: 10 alloc: 512 bytes
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1

e6() = [x y ; x y]
@btime e6() # 4 alloc: 672 bytes        ; 1.6.0-rc1: 104 alloc: 4.98 KiB

This is what I achieved by consolidating with hvncat (once it's ready). Note that it integrates #39725, which is currently broken for assigning strings into a range with setindex!.
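For reference, the `;;;` separator used to build `x` above concatenates blocks along the third dimension; a minimal sketch (this syntax requires Julia 1.7 or later):

```julia
# The `;;;` separator concatenates along dimension 3, equivalent to
# cat(...; dims=3). Requires Julia 1.7+.
x = [1 2; 3 4;;; 5 6; 7 8]

@assert x == cat([1 2; 3 4], [5 6; 7 8], dims=3)
@assert size(x) == (2, 2, 2)
@assert x[:, :, 2] == [5 6; 7 8]
```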

@timholy commented Feb 19, 2021

I haven't had time to look at this, but in case you're interested and if it helps, I recently sped up cat: #39292, #39294, #39314

@BioTurboNick

Cool! @timholy, I think I noticed some of that effect without knowing the cause. Don't worry about reviewing this yet; it's dependent on my other PR #33697.

@BioTurboNick commented Jun 6, 2021

Updated this, now that the hvncat mechanism is merged:

julia> @btime e19()
  904.651 ns (8 allocations: 576 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 2.611 μs (31 allocations: 1.25 KiB)
#             master: 2.156 μs (30 allocations: 1.44 KiB)

julia> @btime e20()
  910.811 ns (9 allocations: 592 bytes)
3×3 Matrix{Float64}:
 0.0  0.0  2.0
 0.0  0.0  4.0
 3.0  4.0  5.0
# compare with 1.6.1: 2.811 μs (39 allocations: 1.38 KiB)
#             master: 2.267 μs (38 allocations: 1.56 KiB)

julia> @btime e21()
  847.368 ns (5 allocations: 432 bytes)
3×3 Matrix{Any}:
 0  0  2
 0  0  4
 3  4   "string"
# compare with 1.6.1: 2.678 μs (31 allocations: 1.25 KiB)
#             master: 2.167 μs (26 allocations: 1.25 KiB)

julia> @btime e22()
  873.333 ns (6 allocations: 528 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 1.390 μs (10 allocations: 512 bytes)
#             master: 1.240 μs (10 allocations: 512 bytes)

julia> @btime e6() # 4 alloc: 672 bytes
  515.104 ns (5 allocations: 640 bytes)
4×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  2  2  3
 3  4  4  5
 1  2  2  3
 3  4  4  5

[:, :, 2] =
 5  6  6  7
 7  8  8  9
 5  6  6  7
 7  8  8  9
# compare with 1.6.1: 4.414 μs (56 allocations: 2.41 KiB)
#             master: 3.337 μs (50 allocations: 2.39 KiB)

julia> @btime [1 2; 4 5]
  29.879 ns (1 allocation: 112 bytes)
2×2 Matrix{Int64}:
 1  2
 4  5
# compare with 1.6.1: 91.208 ns (2 allocations: 160 bytes)
#             master:  93.194 ns (2 allocations: 160 bytes)

julia> @btime [1 2; 4 fill(1, 1, 1, 1)]
  877.551 ns (9 allocations: 624 bytes)
2×2 Matrix{Int64}:
 1  2
 4  1
# compare with 1.6.1: 4.188 μs (43 allocations: 1.62 KiB)
#             master: 3.487 μs (42 allocations: 1.81 KiB)

@BioTurboNick commented Jun 6, 2021

With respect to the original motivating examples:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
# mixed arrays and scalars
@btime [a c ; b d]   # 2.922 μs (31 allocations: 1.25 KiB); now 897.778 ns (8 allocations: 576 bytes)
@btime [a c ; [b d]] # 2.322 μs (21 allocations: 880 bytes); now 1.690 μs (26 allocations: 1.30 KiB)
@btime [[a c] ; [b d]] # 975.000 ns (16 allocations: 816 bytes) -- explicit hcat nested within vcat (should be worst case)
# scalars wrapped in arrays
@btime [a c ; b [d]] # 1.410 μs (10 allocations: 512 bytes); now 865.957 ns (6 allocations: 528 bytes)
@btime [a c ; [b [d]]] # 1.230 μs (9 allocations: 560 bytes); now 816.250 ns (14 allocations: 1008 bytes)
@btime [[a c] ; [b [d]]] # 142.069 ns (4 allocations: 496 bytes) -- explicit hcat nested within vcat (should be worst case)

So, better overall. There are still some more improvements to make; I'm not entirely sure why the middle case is relatively poor.


@BioTurboNick

Current status of this PR vs. Master:

(Some timing differences from previous test likely due to my new computer)

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 259.371 ns (22 allocations: 1.17 KiB) +++
@btime [a c ; [b d]] # 1.612 μs (26 allocations: 1.12 KiB) ---
@btime [[a c] ; [b d]] # 497.052 ns (16 allocations: 720 bytes) ---

# scalars wrapped in arrays
@btime [a c ; b [d]] # 189.981 ns (4 allocations: 352 bytes)
@btime [a c ; [b [d]]] # 1.161 μs (14 allocations: 816 bytes) ---
@btime [[a c] ; [b [d]]] # 84.902 ns (4 allocations: 384 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 628.402 ns (17 allocations: 944 bytes)
@btime [a c ; [b d]] # 1.110 μs (21 allocations: 800 bytes)
@btime [[a c] ; [b d]] # 480.513 ns (16 allocations: 720 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 801.111 ns (10 allocations: 448 bytes)
@btime [a c ; [b [d]]] # 665.161 ns (9 allocations: 464 bytes)
@btime [[a c] ; [b [d]]] # 79.321 ns (4 allocations: 384 bytes)

+++ = Improved since last test
--- = Regression since last test

So currently, master is actually faster than or equivalent to this PR for 4 of the 6 cases, which is great. I will need to think about whether this PR is still necessary, and see what can improve the remaining cases.

@BioTurboNick

Updating based on current master:

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 2.473 μs (43 allocations: 2.03 KiB)
@btime [a c ; [b d]] # 1.561 μs (33 allocations: 1.17 KiB)
@btime [[a c] ; [b d]] # 467.454 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 176.155 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 1.088 μs (22 allocations: 864 bytes)
@btime [[a c] ; [b [d]]] # 57.834 ns (8 allocations: 432 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 564.249 ns (17 allocations: 816 bytes)
@btime [a c ; [b d]] # 1.041 μs (23 allocations: 832 bytes)
@btime [[a c] ; [b d]] # 470.286 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 708.369 ns (12 allocations: 464 bytes)
@btime [a c ; [b [d]]] # 585.099 ns (12 allocations: 496 bytes)
@btime [[a c] ; [b [d]]] # 59.694 ns (8 allocations: 432 bytes)

All cases on master are faster than they were, even in the presence of additional allocations. I'll see about wrapping this PR up one way or the other this weekend, I think.

@BioTurboNick commented Feb 11, 2024

New, simpler approach: just dispatching hvcat to the hvncat_shape methods. This is back to being similar to or faster than master, but I need to verify I didn't slow down any other cases.

EDIT: updated timings after better addressing dynamic dispatch of cat_similar

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 150.041 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]] # 565.120 ns (31 allocations: 1.11 KiB)
@btime [[a c] ; [b d]] # 472.211 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 85.399 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 148.468 ns (20 allocations: 800 bytes)
@btime [[a c] ; [b [d]]] # 58.048 ns (8 allocations: 432 bytes)

@BioTurboNick commented Feb 12, 2024

The remaining slowdown I'm seeing relates to the hcat methods.

In 1.10, hcat for an array and a scalar dispatches to a SparseArrays method (even though I haven't loaded SparseArrays?), which then redirects to typed_hcat, and then to _cat_t(Val(2), T, X...), which is very fast.

On master, the fallback abstractarray.jl hcat is called instead, which feeds into the slower cat(A...; dims=Val(2)).

Possibly related to #49322? It looks like a regression in cat; I haven't fully investigated:

#1.10:
julia> @btime cat(b, d, dims = 2)
#  24.975 ns (1 allocation: 80 bytes)

#master:
julia> @btime cat(b, d, dims = 2)
#  554.371 ns (19 allocations: 576 bytes)

But we can probably ignore that for the purposes of this PR, which is about hvcat.
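The dispatch difference can be seen directly; a minimal sketch (here `Val(2)` encodes the dimension in the type domain, so concatenation can specialize at compile time, while a plain integer keyword is only known at runtime):

```julia
# Sketch: cat with dims=Val(2) lets dispatch specialize on the
# dimension at compile time; dims=2 may take a slower generic path.
b = [3 4]; d = 5

specialized = cat(b, d; dims=Val(2))  # dimension known to the compiler
generic     = cat(b, d; dims=2)       # dimension known only at runtime

@assert specialized == generic == [3 4 5]
```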

@BioTurboNick BioTurboNick marked this pull request as ready for review February 12, 2024 15:11
@RomeoV commented Mar 10, 2024

I wonder if this resolves the following issue:

using LinearAlgebra
function measure()
  data = rand(3,3)
  mat = [
    data ones(size(data, 1))
    ones(size(data, 2))' 0
  ]
  det(mat)
end
@code_warntype measure()  # -> can't infer type of `mat`

There has been some discussion at https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Matrix.20comprehension.20failed.20type.20inference.

@BioTurboNick

Seems like it does!

MethodInstance for measure()
  from measure() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(measure)
Locals
  mat::Matrix{Float64}
  data::Matrix{Float64}
Body::Float64
1 ─ %1  = Main.rand::Core.Const(rand)
│         (data = (%1)(3, 3))
│   %3  = Core.tuple(2, 2)::Core.Const((2, 2))
│   %4  = data::Matrix{Float64}
│   %5  = Main.ones::Core.Const(ones)
│   %6  = Main.size::Core.Const(size)
│   %7  = data::Matrix{Float64}
│   %8  = (%6)(%7, 1)::Int64
│   %9  = (%5)(%8)::Vector{Float64}
│   %10 = Main.:var"'"::Core.Const(adjoint)
│   %11 = Main.ones::Core.Const(ones)
│   %12 = Main.size::Core.Const(size)
│   %13 = data::Matrix{Float64}
│   %14 = (%12)(%13, 2)::Int64
│   %15 = (%11)(%14)::Vector{Float64}
│   %16 = (%10)(%15)::Adjoint{Float64, Vector{Float64}}
│         (mat = Base.hvcat(%3, %4, %9, %16, 0))
│   %18 = Main.det::Core.Const(LinearAlgebra.det)
│   %19 = mat::Matrix{Float64}
│   %20 = (%18)(%19)::Float64
└──       return %20
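One way to check this without reading @code_warntype output is to inspect the inferred return type programmatically; a sketch (inference results depend on the Julia version, so only the runtime result is asserted here):

```julia
using LinearAlgebra

# RomeoV's example: hvcat of a matrix, vectors, and a scalar.
function measure()
    data = rand(3, 3)
    mat = [
        data ones(size(data, 1))
        ones(size(data, 2))' 0
    ]
    det(mat)
end

# Float64 (rather than Any) here indicates hvcat's result was inferred.
inferred = Base.return_types(measure, Tuple{})

@assert measure() isa Float64   # the runtime result is Float64 either way
```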

Successfully merging this pull request may close these issues:

Extra allocations with hvcat of mixed arrays and scalars (#39713)