Intermediate matrix form in MOLFiniteDifference #2

Closed
valentinsulzer opened this issue Jun 10, 2021 · 27 comments


@valentinsulzer

The code in MOLFiniteDifference is getting pretty messy as we add more functionality for the nonlinear Laplacian, upwinding, etc.

One way to simplify the code would be to have an intermediate matrix form - which then gets scalarized by the compiler - instead of going directly to the scalar form by looping over all indices. Ideally this would use DerivativeOperator, to stop this package from diverging into two almost entirely distinct packages (the two examples in the readme). Can DerivativeOperator be used like this? If so,

  • can it already do nonlinear laplacian?
  • how can this then be scalarized?
  • Should boundary conditions remain as they are or use the *bc convolution?

I realize this is a little bit of a step backwards, because that is closer to how it was done before the refactoring of MOLFiniteDifference, so it's worth discussing the pros and cons of each approach here a little before changing anything.
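
For concreteness, a minimal sketch of what the intermediate matrix form could look like via DiffEqOperators (assuming the CenteredDifference/Dirichlet0BC style of API from the readme; not a settled design):

using DiffEqOperators

N  = 100
dx = 1 / (N + 1)
Δ  = CenteredDifference(2, 2, dx, N)  # a DerivativeOperator: 2nd derivative, 2nd-order accurate
bc = Dirichlet0BC(Float64)

u  = rand(N)
du = Δ * (bc * u)  # matrix-style application; the open question is how to scalarize this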

@ChrisRackauckas
Member

With array symbolics now merged (https://symbolics.juliasymbolics.org/dev/manual/arrays/) we have a better idea of where this is all going. Pulling @shashi, @chriselrod, and @YingboMa into this as well.

From https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html we know that fusing the operators has a pretty major effect on performance. So we do not want the final code to have sparse matrix multiplications, and instead want to make sure that this all lowers to something that uses LoopVectorization.jl over contiguous loop indices. In order to support huge equations, we probably do not want this to be scalarized. Also, we want to make sure that this uses a cache-optimized or at least cache-oblivious loop iteration, which should be the case when using LoopVectorization.jl. It would be good for the code to be GPU friendly as well.
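
As a toy illustration of the fusion point (an illustrative sketch, not from the tutorial): the sparse matrix form allocates and traverses the data several times, while the fused scalar loop touches each element once.

using SparseArrays

N  = 1024; dx = 1 / N
L  = spdiagm(-1 => ones(N - 1), 0 => -2ones(N), 1 => ones(N - 1)) ./ dx^2
u  = rand(N); du = similar(u)

# unfused: L*u allocates a temporary, and the data is traversed multiple times
du .= L * u .+ u .- u .^ 3

# fused: one pass over contiguous indices, no temporaries (interior only, for brevity)
for i in 2:N-1
    @inbounds du[i] = (u[i-1] - 2u[i] + u[i+1]) / dx^2 + u[i] - u[i]^3
end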

So on that note, it would be good to see a test of LoopVectorization against something like NNlib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark.

Additionally, we need to think about how to lower symbolic maps and such down to something that could be GPU friendly, and what to do about the boundary conditions in this case.

It would also be good to note the effect of the matrix-vector form on the index reduction algorithm. IIUC one would need to write the boundary terms separately from the matvecs in order to allow the tearing to work, since otherwise that would break the full-incidence assumption. If the boundary conditions already have to be written out in index form, I don't know if that saves any simplicity.

@YingboMa can you share the suggested MTK code for the current form?

@chriselrod

chriselrod commented Jun 12, 2021

So on that note, it would be good to see a test of LoopVectorization against something like NNlib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark.

Have an example?

using NNlib, Polyester, LoopVectorization, Static
function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
  K₁ =  StaticInt(1):StaticInt(K[1])
  K₂ =  StaticInt(1):StaticInt(K[2])
  Cᵢₙ =  StaticInt(1):StaticInt(C_in)
  Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
  (K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(
  out::AbstractArray{<:Any,4}, img, kern,
  dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
  )
  (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
  @batch for d ∈ axes(out,4)
    @turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), o ∈ Cₒᵤₜ
      s = zero(eltype(out))
      for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
        s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
      end
      out[j₁, j₂, o, d] = s
    end
  end
  out
end
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
@time NNlib.conv!(out1, img, kern, dcd);
@time convlayer!(out2, img, kern, dcd);
out1 ≈ out2
@benchmark NNlib.conv!($out1, $img, $kern, $dcd)
@benchmark convlayer!($out1, $img, $kern, $dcd)

LoopVectorization's algorithms scale very badly with the number of loops. 7 works, but takes a minute to compile. So I'm using 6 above, @batching across the seventh. For a CPU-optimized NNlib, consistent threading (e.g., always across batches) would also have the advantage of improving locality, i.e. data will hopefully be able to stay in the same L2 cache from one layer to the next (assuming the data is small enough to fit, as may be the case if training on the CPU).

This is what I get on an 18 core system with AVX512:

julia> @time NNlib.conv!(out1, img, kern, dcd);
  0.802161 seconds (2.06 M allocations: 790.165 MiB, 20.41% gc time)

julia> @time convlayer!(out2, img, kern, dcd);
  9.553409 seconds (21.51 M allocations: 1.149 GiB, 1.06% gc time)

julia> out1 ≈ out2
true

julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
samples: 27; evals/sample: 1; memory estimate: 675.02 MiB; allocs estimate: 195
ns

 (1.8e8 - 1.9e8]  ██████████████████████████████ 24
 (1.9e8 - 2.1e8]  █▎1
 (2.1e8 - 2.3e8]   0
 (2.3e8 - 2.5e8]   0
 (2.5e8 - 2.7e8]  █▎1
 (2.7e8 - 2.9e8]  █▎1

                  Counts

min: 175.652 ms (0.00% GC); mean: 191.870 ms (4.80% GC); median: 184.447 ms (0.44% GC); max: 288.455 ms (36.61% GC).

julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
samples: 944; evals/sample: 1; memory estimate: 32 bytes; allocs estimate: 4
ns

 (5.233e6 - 5.256e6]  ██▍31
 (5.256e6 - 5.278e6]  ██████████████▏193
 (5.278e6 - 5.301e6]  ██████████████████████████████412
 (5.301e6 - 5.324e6]  ████████████████████▊284
 (5.324e6 - 5.347e6]  █▌20
 (5.347e6 - 5.37e6 ]   0
 (5.37e6  - 5.392e6]   0
 (5.392e6 - 5.415e6]  ▏1
 (5.415e6 - 5.438e6]   0
 (5.438e6 - 5.461e6]  ▏1
 (5.461e6 - 5.484e6]  ▎2

                  Counts

min: 5.233 ms (0.00% GC); mean: 5.292 ms (0.00% GC); median: 5.294 ms (0.00% GC); max: 5.484 ms (0.00% GC).

julia> versioninfo()
Julia Version 1.7.0-DEV.1208
Commit 0a133ff22b* (2021-05-30 11:28 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

LoopVectorization was over 30x faster.

This was also a generic convolution. The Laplacian has more structure you could exploit.
If it's only a 1-D Laplacian, it'd probably be harder to do well. The more compute there is, the less it'll be bottlenecked by memory.

EDIT:
Just realized convlayer! is allocating...
Apparently it isn't making much use of threading -- only about 4x faster, so I guess it's already pretty constrained by memory:

julia> function convlayer_st!(
         out::AbstractArray{<:Any,4}, img, kern,
         dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
         )
         (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
       for d ∈ axes(out,4)
           @turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), o ∈ Cₒᵤₜ
             s = zero(eltype(out))
             for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
               s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
             end
             out[j₁, j₂, o, d] = s
           end
         end
         out
       end
convlayer_st! (generic function with 1 method)

julia> @benchmark convlayer_st!($out1, $img, $kern, $dcd)
samples: 180; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2.754e7 - 2.758e7]  ▋2
 (2.758e7 - 2.763e7]   0
 (2.763e7 - 2.768e7]   0
 (2.768e7 - 2.772e7]  █████▍19
 (2.772e7 - 2.777e7]  ██████████▍37
 (2.777e7 - 2.781e7]  ▋2
 (2.781e7 - 2.786e7]  ██▌9
 (2.786e7 - 2.79e7 ]  ██████████████████████████████ 108
 (2.79e7  - 2.795e7]  ▉3

                  Counts

min: 27.539 ms (0.00% GC); mean: 27.824 ms (0.00% GC); median: 27.865 ms (0.00% GC); max: 27.949 ms (0.00% GC).

Apparently Intel's next generation of server chips (Sapphire Rapids) will come with up to 64 GiB of HBM2e memory. Maybe it'll only be a select few extremely expensive models, but if there are affordable options, that'd be exciting.
AMD also announced 3D stacking for 192 MiB of L3 cache, which isn't quite enough for problems of this size:

julia> (sizeof(out1) + sizeof(img) + sizeof(kern)) / (1<<20) # bytes of data / (bytes/MiB)
227.36377716064453

The three arrays here are 227 MiB. Making the problem just a little smaller would then let them all fit.
For comparison, the 10980XE I ran these benchmarks on has only 24.75 MiB of L3.
These were 100 260x260 input images (and 100 256x256 outputs), so the real use cases here are hopefully smaller.

@ChrisRackauckas
Member

Look at the PDE example in https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html for an example of it.

  mul!(Ayu,Ay,u)
  mul!(uAx,u,Ax)
  mul!(Ayv,Ay,v)
  mul!(vAx,v,Ax)

is just a cheap way to do 2D convolutions with dense matmuls, but it would be good to transcribe that into conv!
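
Roughly, that transcription could look like this (a sketch, untested; the WHCN reshape and kernel layout are my assumptions, and the Neumann rows of Ax/Ay, the 2.0 entries, still need separate handling):

using NNlib

u    = rand(Float32, 100, 100)
kern = reshape(Float32[0 1 0; 1 -4 1; 0 1 0], 3, 3, 1, 1)  # 5-point Laplacian stencil
u4   = reshape(u, size(u, 1), size(u, 2), 1, 1)            # NNlib expects WHCN layout
out  = Array{Float32}(undef, size(u, 1) - 2, size(u, 2) - 2, 1, 1)
dcd  = NNlib.DenseConvDims(u4, kern, flipkernel = true)
NNlib.conv!(out, u4, kern, dcd)  # interior of Ay*u + u*Ax, i.e.
                                 # u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - 4u[i,j]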

@chriselrod

I played with it a bit, and haven't been able to get any major improvements over the explicit loop version:

using LinearAlgebra, OrdinaryDiffEq
# Generate the constants
p = (1.0,1.0,1.0,10.0,0.001,100.0) # a,α,ubar,β,D1,D2
N = 100
const Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]));
const Ay = copy(Ax);
Ax[2,1] = 2.0;
Ax[end-1,end] = 2.0;
Ay[1,2] = 2.0;
Ay[end,end-1] = 2.0;

function basic_version!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = r[:,:,1]
  v = r[:,:,2]
  Du = D1*(Ay*u + u*Ax)
  Dv = D2*(Ay*v + v*Ax)
  dr[:,:,1] = Du .+ a.*u.*u./v .+ ubar .- α*u
  dr[:,:,2] = Dv .+ a.*u.*u .- β*v
end

a,α,ubar,β,D1,D2 = p;
uss = (ubar+β)/α;
vss = (a/β)*uss^2;
r0 = zeros(100,100,2);
r0[:,:,1] .= uss.+0.1.*rand.();
r0[:,:,2] .= vss;

prob = ODEProblem(basic_version!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 123; evals/sample: 1; memory estimate: 186.80 MiB; allocs estimate: 4555
# ns

#  (3.0e7 - 5.0e7]  ██████████████████████████████ 121
#  (5.0e7 - 7.0e7]   0
#  (7.0e7 - 8.0e7]   0
#  (8.0e7 - 1.0e8]   0
#  (1.0e8 - 1.2e8]   0
#  (1.2e8 - 1.3e8]  ▎1
#  (1.3e8 - 1.5e8]   0
#  (1.5e8 - 1.6e8]  ▎1

#                   Counts

# min: 34.770 ms (7.03% GC); mean: 40.722 ms (9.64% GC); median: 39.011 ms (5.79% GC); max: 164.571 ms (77.65% GC).
  
function gm2!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1]
  v = @view r[:,:,2]
  du = @view dr[:,:,1]
  dv = @view dr[:,:,2]
  Du = D1*(Ay*u + u*Ax)
  Dv = D2*(Ay*v + v*Ax)
  @. du = Du + a.*u.*u./v + ubar - α*u
  @. dv = Dv + a.*u.*u - β*v
end
prob = ODEProblem(gm2!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 171; evals/sample: 1; memory estimate: 119.44 MiB; allocs estimate: 2790
# ns

#  (2.68e7 - 2.75e7]  ▋1
#  (2.75e7 - 2.82e7]  ███▏6
#  (2.82e7 - 2.9e7 ]  ███████████████████████▍45
#  (2.9e7  - 2.97e7]  ██████████████████████████████▏58
#  (2.97e7 - 3.04e7]  ████████████████████████████▌55
#  (3.04e7 - 3.11e7]  ██▋5
#  (3.11e7 - 3.18e7]   0
#  (3.18e7 - 3.25e7]   0
#  (3.25e7 - 3.32e7]  ▋1

#                   Counts

# min: 26.824 ms (7.47% GC); mean: 29.393 ms (6.39% GC); median: 29.594 ms (7.36% GC); max: 33.241 ms (27.63% GC).


const Ayu = zeros(N,N);
const uAx = zeros(N,N);
const Du = zeros(N,N);
const Ayv = zeros(N,N);
const vAx = zeros(N,N);
const Dv = zeros(N,N);
function gm3!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1]
  v = @view r[:,:,2]
  du = @view dr[:,:,1]
  dv = @view dr[:,:,2]
  mul!(Ayu,Ay,u)
  mul!(uAx,u,Ax)
  mul!(Ayv,Ay,v)
  mul!(vAx,v,Ax)
  @. Du = D1*(Ayu + uAx)
  @. Dv = D2*(Ayv + vAx)
  @. du = Du + a*u*u./v + ubar - α*u
  @. dv = Dv + a*u*u - β*v
end
prob_gm3 = ODEProblem(gm3!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm3,Tsit5())
# samples: 181; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns

#  (2.1e7 - 2.2e7]  ▌1
#  (2.2e7 - 2.3e7]  ▌1
#  (2.3e7 - 2.5e7]   0
#  (2.5e7 - 2.6e7]  █████▋14
#  (2.6e7 - 2.7e7]  ██████████████▎36
#  (2.7e7 - 2.8e7]  ██████████████████████████████ 76
#  (2.8e7 - 3.0e7]  ███████████████▌39
#  (3.0e7 - 3.1e7]  ████10
#  (3.1e7 - 3.2e7]  █▋4

#                   Counts

# min: 20.760 ms (0.00% GC); mean: 27.682 ms (2.07% GC); median: 27.566 ms (0.00% GC); max: 32.108 ms (0.00% GC).

using LoopVectorization
function matmul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k]*B[k,n]
    end
    C[m,n] = Cₘₙ
  end
end
function matmuladd!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k]*B[k,n]
    end
    C[m,n] += Cₘₙ
  end
end

function gm5!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1];
  v = @view r[:,:,2];
  du = @view dr[:,:,1];
  dv = @view dr[:,:,2];
  matmul!(Ayu,Ay,u)
  matmuladd!(Ayu,u,Ax)
  # @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
  @turbo for i ∈ eachindex(u)
    du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
  end
  matmul!(Ayu,Ay,v)
  matmuladd!(Ayu,v,Ax)
  # @turbo @. dv = D2*Ayu + a*u*u - β*v
  @turbo for i ∈ eachindex(u)
    dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
  end
end
prob_gm5 = ODEProblem(gm5!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm5,Tsit5())
# samples: 306; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# ns

#  (1.1e7 - 1.3e7]  ██████████████████████████████ 211
#  (1.3e7 - 1.5e7]  ██▎15
#  (1.5e7 - 1.7e7]  ▍2
#  (1.7e7 - 1.9e7]  ▍2
#  (1.9e7 - 2.1e7]   0
#  (2.1e7 - 2.3e7]   0
#  (2.3e7 - 2.5e7]  ▉6
#  (2.5e7 - 2.8e7]  ███████▊54
#  (2.8e7 - 3.0e7]  █7
#  (3.0e7 - 3.2e7]  █▍9

#                   Counts

# min: 10.999 ms (0.00% GC); mean: 16.355 ms (3.53% GC); median: 12.869 ms (0.00% GC); max: 31.714 ms (6.63% GC).

function matmul!(C::AbstractArray{<:Any,3}, A::AbstractMatrix, B::AbstractArray{<:Any,3})
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k]*B[k,n,d]
    end
    C[m,n,d] = Cₘₙ
  end
end
function matmuladd!(C::AbstractArray{<:Any,3}, A::AbstractArray{<:Any,3}, B::AbstractMatrix)
  @tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
    Cₘₙ = zero(eltype(C))
    for k ∈ indices((A,B),(2,1))
      Cₘₙ += A[m,k,d]*B[k,n]
    end
    C[m,n,d] += Cₘₙ
  end
end
const Ayuv = similar(Ayu, (size(Ayu,1),size(Ayu,2),2));
function gm6!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  matmul!(Ayuv, Ay, r)
  matmuladd!(Ayuv, r, Ax)
  N² = size(r,1)*size(r,2)
  u = reshape(r, (N²,2))
  du = reshape(dr, (N²,2))
  Ayuvr = reshape(Ayuv, (N²,2))
  @turbo for i ∈ 1:N²
    du[i,1] = D1 * Ayuvr[i,1] + a * u[i,1]*u[i,1] / u[i,2] + ubar - α*u[i,1]
    du[i,2] = D2 * Ayuvr[i,2] + a * u[i,1]*u[i,1] - β*u[i,2]
  end
end
prob_gm6 = ODEProblem(gm6!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm6, Tsit5())
# samples: 310; evals/sample: 1; memory estimate: 29.67 MiB; allocs estimate: 1315
# ns

#  (1.1e7 - 1.3e7]  █████▌37
#  (1.3e7 - 1.4e7]  ██████████████████████████████205
#  (1.4e7 - 1.6e7]   0
#  (1.6e7 - 1.8e7]   0
#  (1.8e7 - 2.0e7]   0
#  (2.0e7 - 2.1e7]   0
#  (2.1e7 - 2.3e7]   0
#  (2.3e7 - 2.5e7]   0
#  (2.5e7 - 2.6e7]  ▎1
#  (2.6e7 - 2.8e7]  █████████▉67

#                   Counts

# min: 11.091 ms (0.00% GC); mean: 16.126 ms (3.49% GC); median: 13.182 ms (0.00% GC); max: 27.934 ms (9.08% GC).
  
pdevec = (1.0,1.0,1.0,10.0,0.001,100.0,N);
function fast_gm!(du,u,p,t)
  a,α,ubar,β,D1,D2,N = p

  @inbounds for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end

  @inbounds for j in 2:N-1, i in 2:N-1
    du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for j in 2:N-1
    i = 1
    du[1,j,1] = D1*(2u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
            a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for j in 2:N-1
    i = 1
    du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
           a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
           a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for i in 2:N-1
    j = 1
    du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = 1
    du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds begin
    i = 1; j = 1
    du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]

    i = 1; j = N
    du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = 1
    du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = N
    du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
   end
end
prob_devec = ODEProblem(fast_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns

#  (7.2e6   - 7.74e6 ]  ██████████████████████████████▏337
#  (7.74e6  - 8.27e6 ]  ██▋29
#  (8.27e6  - 8.81e6 ]  ██████▏68
#  (8.81e6  - 9.34e6 ]  ██████▋74
#  (9.34e6  - 9.88e6 ]  ███33
#  (9.88e6  - 1.041e7]  ▍4
#  (1.041e7 - 1.095e7]   0
#  (1.095e7 - 1.149e7]  ██22
#  (1.149e7 - 1.202e7]  ▋7
#  (1.202e7 - 1.256e7]  ██▎24
#  (1.256e7 - 1.309e7]  ▎2

#                   Counts

# min: 7.200 ms (0.00% GC); mean: 8.335 ms (7.27% GC); median: 7.482 ms (0.00% GC); max: 13.092 ms (43.21% GC).

  
function turbo_gm!(du,u,p,t)
  a,α,ubar,β,D1,D2,N = p

  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end

  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for j in 2:N-1
    i=1
    du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
            a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
  end
  @inbounds for j in 2:N-1
    i = 1
    du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
           a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for j in 2:N-1
    i = N
    du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
           a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds for i in 2:N-1
    j = 1
    du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = 1
    du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
  end
  @inbounds for i in 2:N-1
    j = N
    du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @inbounds begin
    i = 1; j = 1
    du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]

    i = 1; j = N
    du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = 1
    du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = N
    du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
   end
end
prob_turbo_devec = ODEProblem(turbo_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns

#  (7.11e6  - 7.78e6 ]  ██████████████████████████████ 324
#  (7.78e6  - 8.45e6 ]  ▍4
#  (8.45e6  - 9.12e6 ]  ████████▎88
#  (9.12e6  - 9.79e6 ]  ████████▉95
#  (9.79e6  - 1.046e7]  ██████▋71
#  (1.046e7 - 1.113e7]  ▌5
#  (1.113e7 - 1.18e7 ]  ▍3
#  (1.18e7  - 1.247e7]  ▍3
#  (1.247e7 - 1.314e7]  ▎2
#  (1.314e7 - 1.381e7]  ▎2
#  (1.381e7 - 1.448e7]  ▍3

#                   Counts

# min: 7.107 ms (0.00% GC); mean: 8.331 ms (7.72% GC); median: 7.374 ms (0.00% GC); max: 14.477 ms (0.00% GC).
function turbo_gm_fuse!(du,u,p,t)
  a,α,ubar,β,D1,D2,N = p

  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
            a*u[i,j,1]^2 - β*u[i,j,2]
  end

  @turbo for j in 2:N-1
    du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
            a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
  # end
  # @turbo for j in 2:N-1
    du[1,j,2] = D2*(2u[2,j,2] + u[1,j+1,2] + u[1,j-1,2] - 4u[1,j,2]) +
            a*u[1,j,1]^2 - β*u[1,j,2]
  # end
  # @turbo for j in 2:N-1
    du[N,j,1] = D1*(2u[N-1,j,1] + u[N,j+1,1] + u[N,j-1,1] - 4u[N,j,1]) +
           a*u[N,j,1]^2/u[N,j,2] + ubar - α*u[N,j,1]
  # end
  # @turbo for j in 2:N-1
    du[N,j,2] = D2*(2u[N-1,j,2] + u[N,j+1,2] + u[N,j-1,2] - 4u[N,j,2]) +
      a*u[N,j,1]^2 - β*u[N,j,2]
  end
  @turbo for j in 2:N-1
    du[j,1,1] = D1*(u[j-1,1,1] + u[j+1,1,1] + 2u[j,2,1] - 4u[j,1,1]) +
              a*u[j,1,1]^2/u[j,1,2] + ubar - α*u[j,1,1]
  # end
  # @turbo for j in 2:N-1
    du[j,1,2] = D2*(u[j-1,1,2] + u[j+1,1,2] + 2u[j,2,2] - 4u[j,1,2]) +
              a*u[j,1,1]^2 - β*u[j,1,2]
  # end
  # @turbo for j in 2:N-1
    du[j,N,1] = D1*(u[j-1,N,1] + u[j+1,N,1] + 2u[j,N-1,1] - 4u[j,N,1]) +
             a*u[j,N,1]^2/u[j,N,2] + ubar - α*u[j,N,1]
  # end
  # @turbo for j in 2:N-1
    du[j,N,2] = D2*(u[j-1,N,2] + u[j+1,N,2] + 2u[j,N-1,2] - 4u[j,N,2]) +
             a*u[j,N,1]^2 - β*u[j,N,2]
  end

  @inbounds begin
    i = 1; j = 1
    du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
              a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
              a*u[i,j,1]^2 - β*u[i,j,2]

    i = 1; j = N
    du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = 1
    du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]

    i = N; j = N
    du[N,N,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
             a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
    du[N,N,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
             a*u[i,j,1]^2 - β*u[i,j,2]
   end
end
prob_turbo_fuse = ODEProblem(turbo_gm_fuse!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_fuse,Tsit5())
# samples: 629; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# ns

#  (6.8e6  - 7.9e6 ]  ██████████████████████████████427
#  (7.9e6  - 8.9e6 ]  ██▋36
#  (8.9e6  - 1.0e7 ]  ████▊66
#  (1.0e7  - 1.11e7]  ██████▉97
#  (1.11e7 - 1.22e7]   0
#  (1.22e7 - 1.33e7]   0
#  (1.33e7 - 1.44e7]   0
#  (1.44e7 - 1.54e7]  ▏1
#  (1.54e7 - 1.65e7]  ▏1
#  (1.65e7 - 1.76e7]   0
#  (1.76e7 - 2.12e7]  ▏1

#                   Counts

# min: 6.786 ms (0.00% GC); mean: 7.938 ms (7.63% GC); median: 7.014 ms (0.00% GC); max: 21.180 ms (66.47% GC).

@chriselrod

Those loops are easy enough for LLVM to SIMD, and there's not too much that can be done to make them faster.
I.e., there isn't much reuse you can get through blocking. You're just streaming through memory.

It is unrolling the j loop to try and save a bit on j loads (i.e., the j+1 column on one j iteration is the j column on the next, and the j-1 column on the one after), but that doesn't seem to help much.

@ChrisRackauckas
Member

Yeah, I think you need to not have the nonlinear portion of the PDE (+ a*u[i,j,1]^2 - β*u[i,j,2]) for a stencil compiler to work well. This is what I was mentioning in https://discourse.julialang.org/t/how-to-tackle-2d-reaction-diffusion-equation/54110/15?u=chrisrackauckas . The authors of that seemed to get mad at me, but it seems to be the simple fact that once you have a term that is outside the original range, u[i,j,2], or a term which is sufficiently difficult to compute so that it's not too memory bound, u[i,j,1]^2, the act of fusion just beats having a good stencil compiler. That LoopVectorization demo above is precisely what I needed to see to fully confirm that the fused version indeed outperforms even a good stencil compiler in that case, and that @turbo has an effect, but not a massive one.

It would still be good to investigate multithreading between those when the space is really large though, but that's a separate thing and only matters for the huge PDE case. But essentially, those benchmarks are my argument for why we shouldn't be generating A*u operators but instead should be generating (and fusing) map operations.

@chriselrod

Making them dense:

Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))

adds a huge amount of computation, making it go from O(N^2) to O(N^3).
Seems like an obviously bad idea.
We could implement specialized Tridiagonal multiplication routines, e.g. like
JuliaSIMD/LoopVectorization.jl#266 (comment)
but at that point we may as well use the fusing map operations.

For the mapped version, 100x100 isn't large enough to benefit from multithreading yet, but 1000x1000 would be.
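
For reference, a hedged sketch of such a specialized routine (trimul! is a hypothetical helper, not an existing function): exploiting the three diagonals keeps C = T*B at O(N^2) for an N×N field instead of the O(N^3) dense matmul.

# C = T*B for tridiagonal T with subdiagonal dl, diagonal d, superdiagonal du_
function trimul!(C::AbstractMatrix, dl, d, du_, B::AbstractMatrix)
    N = length(d)
    @inbounds for j in axes(B, 2)
        C[1, j] = d[1]*B[1, j] + du_[1]*B[2, j]
        for i in 2:N-1  # three terms per entry: O(N) work per column
            C[i, j] = dl[i-1]*B[i-1, j] + d[i]*B[i, j] + du_[i]*B[i+1, j]
        end
        C[N, j] = dl[N-1]*B[N-1, j] + d[N]*B[N, j]
    end
    C
end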

@ChrisRackauckas
Member

The first part is the same as an image filter.

@chriselrod

or a term which is sufficiently difficult to compute so that it's not too memory bound, u[i,j,1]^2

Not sure if I follow here, but u[i,j,1] is easy to compute.
Fusion is great for memory bound applications though. Passing over the data once is better than passing over it twice.

We shouldn't be doing dense Ax::Matrix operations.
I assume ParallelStencil wouldn't be?

If we let Ax be tridiagonal, it'd do much better.
Still probably not as well as the loops, but better.

@ChrisRackauckas
Member

ChrisRackauckas commented Jun 13, 2021

Not sure if I follow here, but u[i,j,1] is easy to compute.

I mean, when you're looping over u[i,j,2] and need to grab u[i,j,1], you very quickly expand the cache lines like that. And this is still a very small example with very few nonlinearities!

We shouldn't be doing dense Ax::Matrix operations. I assume ParallelStencil wouldn't be?

It wouldn't, I thought you were using the same setup as the image filter you had? A 2D Laplacian is just a 2D filter.

@chriselrod

chriselrod commented Jun 13, 2021

Ha, let me do that. Above it was just matmul!.

Note that the image filtering example doesn't do boundaries...

I mean, when you're looping over u[i,j,2] and need to grab u[i,j,1], you very quickly expand the cache lines like that.

Reuse tends to be fairly immediate (e.g. u[i,j,1], u[i+1,j,1]), and then there aren't any more opportunities for reuse, so you don't mind those loads getting evicted.
There is some concern over u[i,j,1] vs. u[i,j+1,1], though. It'd probably take some experimenting to see whether some larger cache tiling helps more than the j unrolling does.

@chriselrod

@ChrisRackauckas here is the example I think you were looking for:

using StaticArrays, LoopVectorization, Static
const LAPLACE_KERNEL = @MMatrix([0.0 1.0 0.0; 1.0 -4.0 1.0; 0.0 1.0 0.0]);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern::MMatrix{3,3})
  N = size(out,1)
  @inbounds @fastmath let j_2 = j_1 = 1
    tmp  = A[1+j_1, 1+j_2] * kern[-1+2, -1+2]
    tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
    tmp += A[1+j_1, 1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
    tmp += A[1+j_1, 1+j_2] * kern[1+2, -1+2]
    for i_1 ∈ 0:1, i_2 ∈ 0:1
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  @inbounds @fastmath let j_1 = 1; j_2 = N
    tmp  = A[1+j_1, -1+j_2] * kern[-1+2, -1+2]
    tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
    tmp += A[1+j_1, -1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
    tmp += A[1+j_1, -1+j_2] * kern[1+2, 1+2]
    for i_1 ∈ 0:1, i_2 ∈ -1:0
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  @inbounds @fastmath let j_2 = 1; j_1 = N
    tmp  = A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
    tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
    tmp += A[-1+j_1, 1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
    tmp += A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
    for i_1 ∈ -1:0, i_2 ∈ 0:1
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end  
  @turbo for j ∈ 2:N-1
  # for j ∈ 2:N-1
    tmp_1 = zero(eltype(out))
    tmp_2 = zero(eltype(out))
    for i_1 ∈ -1:1
      tmp_1 += A[i_1+j, 2] * kern[i_1+2, 3]
      tmp_2 += A[2, i_1+j] * kern[3, i_1+2]
      for i_2 ∈ 0:1
        tmp_1 += A[i_1+j, 1+i_2] * kern[i_1+2, i_2+2]
        tmp_2 += A[1+i_2, i_1+j] * kern[i_2+2, i_1+2]
      end
    end
    out[j,1] = tmp_1
    out[1,j] = tmp_2
  end
  @turbo for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
  # for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
    tmp = zero(eltype(out))
    for i_1 ∈ -1:1, i_2 ∈ -1:1
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  @turbo for j ∈ 2:N-1
  # for j ∈ 2:N-1
    tmp_1 = zero(eltype(out))
    tmp_2 = zero(eltype(out))
    for i_1 ∈ -1:1
      tmp_1 += A[i_1+j, N-1] * kern[i_1+2, 3]
      tmp_2 += A[N-1, i_1+j] * kern[3, i_1+2]
      for i_2 ∈ -1:0
        tmp_1 += A[i_1+j, N+i_2] * kern[i_1+2, i_2+2]
        tmp_2 += A[N+i_2, i_1+j] * kern[i_2+2, i_1+2]
      end
    end
    out[j,N] = tmp_1
    out[N,j] = tmp_2
  end
  @inbounds @fastmath let j_2 = j_1 = N
    tmp  = A[-1+j_1, -1+j_2] * kern[1+2, -1+2]
    tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
    tmp += A[-1+j_1, -1+j_2] * kern[1+2, 1+2]
    tmp += A[-1+j_1, -1+j_2] * kern[-1+2, 1+2]
    tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
    for i_1 ∈ -1:0, i_2 ∈ -1:0
      tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
    end
    out[j_1,j_2] = tmp
  end
  nothing
end
function gm7!(dr,r,p,t)
  a,α,ubar,β,D1,D2 = p
  u = @view r[:,:,1];
  v = @view r[:,:,2];
  du = @view dr[:,:,1];
  dv = @view dr[:,:,2];
  filter2davx!(Ayu, u, LAPLACE_KERNEL)
  # @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
  @turbo for i ∈ eachindex(u)
    du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
  end
  filter2davx!(Ayu, v, LAPLACE_KERNEL)
  # @turbo @. dv = D2*Ayu + a*u*u - β*v
  @turbo for i ∈ eachindex(u)
    dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
  end
end
prob_gm7 = ODEProblem(gm7!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm7,Tsit5())
samples: 546; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
ns

 (7.96e6  - 8.63e6 ]  ██████████████████████████████311
 (8.63e6  - 9.29e6 ]  ██▋26
 (9.29e6  - 9.95e6 ]  █████████▊101
 (9.95e6  - 1.062e7]  █████51
 (1.062e7 - 1.128e7]  ▍3
 (1.128e7 - 1.194e7]  ██▌25
 (1.194e7 - 1.26e7 ]  ▏1
 (1.26e7  - 1.327e7]  ▏1
 (1.327e7 - 1.393e7]   0
 (1.393e7 - 1.459e7]  █▍13
 (1.459e7 - 1.525e7]  █▍14

                  Counts

min: 7.965 ms (0.00% GC); mean: 9.162 ms (6.71% GC); median: 8.165 ms (0.00% GC); max: 15.255 ms (10.53% GC).

If we wanted to optimize this a bit more, we could

  1. take advantage of the sparsity in the Laplacian kernel,
  2. make the values from the kernel constant, and
  3. fuse operations to pass over the data less often.

Oops: do all three and we're back at fast_gm!/turbo_gm_fuse!.

@ChrisRackauckas
Member

You can just hardcode that it's the stencil

0 1 0
1 -4 1
0 1 0

@chriselrod

Sure, but at that point why not go the rest of the way to turbo_gm_fuse!?

@ChrisRackauckas
Member

Yeah, go all of the way.

@chriselrod

chriselrod commented Jun 14, 2021

The one thing fusing inhibits is easily threading across the third axis.
I tried that, and it didn't help: 8.2 ms solve time.

I should add a version of @spawn to Polyester so we can parallelize different functions.
Maybe if the third axis were huge (and you're still doing something different), but that seems contrived.

Ideally, we'd have a loop compiler that does the right thing with whatever loops we give it. There isn't really anything that special about stencils.

EDIT: To give ParallelStencil.jl/stencil compilers credit (aside from the fact that I didn't benchmark it at all so there is no "discredit" either), working on a smaller problem domain (just stencils) makes it easier to scale out, e.g. to all the GPUs.

@ChrisRackauckas
Member

Yeah, they do have a lot of uses. http://supertech.csail.mit.edu/papers/pochoir_spaa11.pdf is a good read if you haven't seen it.

@chriselrod

chriselrod commented Jun 14, 2021

Thanks, that's interesting.
So the key to a stencil compiler is how it optimizes cache locality across time.

That is, it cuts up the iteration space/dependencies in such a way that multiple time steps -- and hence lots of reuse -- can be computed on (trapezoidal) subregions. These regions can also be computed in parallel.

If you want to compute only a single time step update, you need to iterate over the entire arrays. Then you may as well do what you can to minimize the amount of iteration, e.g. fuse it with other operations.

However, if you want to compute, say, 1000 time steps, you don't need to pass over the arrays 1000 times. It takes many time steps for changes at index u[i,j] to fan out and influence iterations at index u[i+100, j-100].
So starting at t=0, you can take a chunk of the array u that fits nicely in cache, and then perform a large number of time updates on it. This chunk shrinks over time, because you lose cells along its boundaries, hence the trapezoidal shape.

Eventually, you'd stop and instead work on another region. For example, one now available because the region(s) it depends on have been finished.

Stencil compilers figure out clever ways to carve the space up in this way, running those separate chunks on different cores/GPUs/etc., and also scheduling the leftovers to fill in the missing pieces.

Polyhedral optimizers can also do this sort of analysis.
Hmm.

Anyway, regarding the computations here where we're doing a single time step at a time, we can't benefit from these optimizations.
If we were able to do many iterations, then ideally we'd be able to fuse operations like a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1] with it. That shouldn't cause any scheduling problems (e.g. to a polyhedral optimizer).
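
To make the reuse argument concrete, here's a toy sketch of temporal blocking for a 1-D three-point stencil (hedged: it recomputes a halo of width T rather than doing the trapezoidal cuts a real stencil compiler would, and assumes fixed Dirichlet ends):

function heat_fused!(unew, u, T, λ; chunk = 512)
    N = length(u)
    tmp1 = similar(u, chunk + 2T); tmp2 = similar(tmp1)
    for a in 1:chunk:N
        b  = min(a + chunk - 1, N)
        lo = max(a - T, 1); hi = min(b + T, N)  # halo of width T around the chunk
        w  = hi - lo + 1
        copyto!(view(tmp1, 1:w), view(u, lo:hi))
        src, dst = tmp1, tmp2
        for t in 1:T  # T fused time steps while the chunk is hot in cache
            @inbounds for i in 2:w-1
                dst[i] = src[i] + λ*(src[i-1] - 2src[i] + src[i+1])
            end
            dst[1] = src[1]; dst[w] = src[w]  # physical Dirichlet end or stale halo edge
            src, dst = dst, src
        end
        # cells a:b sit at least T cells from the halo edge, so they're still valid after T steps
        copyto!(view(unew, a:b), view(src, (a - lo + 1):(b - lo + 1)))
    end
    unew
end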

@YingboMa
Member

The current MTK code example of scalarized code is

using ModelingToolkit
l = 1
n = 5
h = l / n
@variables t u[1:n](t)
D = Differential(t)
# periodic boundary condition version 1
eqs = [
       h^2 * D(u[1]) ~ u[end] - 2u[1] + u[2]
       map(2:n-1) do i
           h^2 * D(u[i]) ~ u[i-1] - 2u[i] + u[i+1] # ifelse(u[i]>0, u[i+1] - u[i], u[i] - u[i-1])
       end
       h^2 * D(u[end]) ~ u[end-1] - 2u[end] + u[1]
      ]

My tentative syntax for the interior is

map(u[1:n-1], u[2:n], u[3:n+1]) do um1, ui, up1
    um1 - 2ui + up1
end

I am not sure how to scale this to high dims though.
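
One possible shape for the 2-D interior, continuing the snippet above, is just the scalarized comprehension (a sketch of the equations, not a proposed API):

@variables t u[1:n, 1:n](t)
D = Differential(t)
# interior of the 2-D Laplacian; the open question is a map/stencil syntax
# that avoids writing the indices out by hand
eqs_interior = vec([h^2 * D(u[i, j]) ~ u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j]
                    for i in 2:n-1, j in 2:n-1])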

@ChrisRackauckas
Member

Let's talk about that high dim case today. That's going to be important to this. And get @chriselrod and @shashi on too. 1pm?

@valentinsulzer
Author

Happy to help with this but the discussions so far are above my paygrade 😅

@ChrisRackauckas
Member

using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkTools
using LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N)
const Y = reshape([j for i in 1:N for j in 1:N],N,N)
const α₁ = 1.0.*(X.>=4*N/5)
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1])
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N)
const AMx = zeros(N,N)
const DA = zeros(N,N)
function f(du,u,p,t)
   A = @view  u[:,:,1]
   B = @view  u[:,:,2]
   C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  mul!(MyA,My,A)
  mul!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
function f!(du,u,p,t)
    A = @view  u[:,:,1]
    B = @view  u[:,:,2]
    C = @view  u[:,:,3]
   dA = @view du[:,:,1]
   dB = @view du[:,:,2]
   dC = @view du[:,:,3]
  @turbo for j in 2:N-1, i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
              α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
              α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
              α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
              α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
                α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds begin
     i = 1; j = 1
     dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
               α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
     i = 1; j = N
     dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
               α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
     i = N; j = 1
     dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
               α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
     i = N; j = N
     dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
               α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
     dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
   end
end
#=
# Double check:
u = rand(N,N,3)
du  = similar(u)
du2 = similar(u)
f( du ,u,nothing,0.0)
f!(du2,u,nothing,0.0)
=#
u0 = zeros(N,N,3)
prob = ODEProblem(f!,u0,(0.0,10.0))
@btime solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 13.300 s (3406 allocations: 4.34 GiB)
@btime solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 38.481 s (270 allocations: 708.02 MiB)
@btime solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 40.445 s (267 allocations: 690.02 MiB)

There's an N to tone down to test with. I'm pretty confused by this. Is it just that the cache lines would have to be too long?

@chriselrod found:

Locally:
julia> u0 = randn(N,N,3);
julia> du0 = similar(u0);
julia> @benchmark f!($du0,$u0,nothing,0.0) # @turbo
samples: 7479; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (641000.0 - 644000.0]  ██▏57
 (644000.0 - 647000.0]  ██████████▌291
 (647000.0 - 650000.0]  ███████████████████▉552
 (650000.0 - 653100.0]  ████████████████████████▋683
 (653100.0 - 656100.0]  ████████████████████████████778
 (656100.0 - 659100.0]  ███████████████████████████749
 (659100.0 - 662100.0]  ███████████████████████████750
 (662100.0 - 665100.0]  ███████████████████████▌653
 (665100.0 - 668200.0]  ████████████████████████▎675
 (668200.0 - 671200.0]  ████████████████████████▉693
 (671200.0 - 674200.0]  ██████████████████████████████ 836
 (674200.0 - 677200.0]  ███████████████████████▉663
 (677200.0 - 680200.0]  ███▍91
 (680200.0 - 1.0048e6]  ▍8
                  Counts
min: 640.984 μs (0.00% GC); mean: 661.914 μs (0.00% GC); median: 661.578 μs (0.00% GC); max: 1.005 ms (0.00% GC).
julia> @benchmark g!($du0,$u0,nothing,0.0) # @inbounds @fastmath
samples: 4784; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (1.0269e6 - 1.0287e6]  ▎15
 (1.0287e6 - 1.0306e6]  ▏10
 (1.0306e6 - 1.0324e6]  ▏7
 (1.0324e6 - 1.0342e6]  ▎20
 (1.0342e6 - 1.036e6 ]  ▏7
 (1.036e6  - 1.0379e6]  ████████▋953
 (1.0379e6 - 1.0397e6]  ██████████████████████████████3337
 (1.0397e6 - 1.0415e6]  ███▌377
 (1.0415e6 - 1.0433e6]  ▍30
 (1.0433e6 - 1.0452e6]  ▏13
 (1.0452e6 - 1.047e6 ]  ▏6
 (1.047e6  - 1.0488e6]  ▏3
 (1.0488e6 - 1.0506e6]  ▏1
 (1.0506e6 - 1.2524e6]  ▏5
                  Counts
min: 1.027 ms (0.00% GC); mean: 1.039 ms (0.00% GC); median: 1.038 ms (0.00% GC); max: 1.252 ms (0.00% GC).
julia> @show N
N = 512
LV doesn't do cache tiling currently, so performance suffers at larger sizes.
I have a good idea of how I want it to do that now, but want to wait for the rewrite, as otherwise the rewrite will never happen (there's no end to the stuff I could still do).
g! was using @inbounds @fastmath and took 1 ms, while f! was using @turbo and took 0.6 ms.
I could try more, or try a different computer.
With @tturbo:
julia> @benchmark fthreads!($du0,$u0,nothing,0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (89800.0  - 91100.0 ]  ▏7
 (91100.0  - 92400.0 ]  █80
 (92400.0  - 93700.0 ]  █████▏453
 (93700.0  - 95000.0 ]  ███████████████▊1405
 (95000.0  - 96300.0 ]  ██████████████████████████████2681
 (96300.0  - 97600.0 ]  ██████████████████████████████ 2692
 (97600.0  - 98900.0 ]  ██████████████████▌1658
 (98900.0  - 100200.0]  ███████▏629
 (100200.0 - 101500.0]  ██▋234
 (101500.0 - 102800.0]  ▉78
 (102800.0 - 104000.0]  ▌37
 (104000.0 - 105300.0]  ▎15
 (105300.0 - 106600.0]  ▏11
 (106600.0 - 107900.0]  ▏10
 (107900.0 - 349100.0]  ▏10
                  Counts
min: 89.756 μs (0.00% GC); mean: 96.586 μs (0.00% GC); median: 96.428 μs (0.00% GC); max: 349.074 μs (0.00% GC).

and from the thread:


Chris Elrod  1 day ago
Seems to be a problem in LoopVectorization's cost modeling.
It is unrolling the j loop to save on reloads of A[i,j-1],A[i,j] ,A[i,j+1].
But it seems faster not to unroll at all.

Chris Elrod  1 day ago
You can try to confirm on your machine with @turbo unroll=1 for ...

Chris Elrod  1 day ago
Unrolling does seem faster for very small N, e.g. I tried 30.

Chris Elrod  1 day ago
This was also true on my old (AVX2, Haswell) laptop.

Chris Elrod  18 hours ago
I tweaked some heuristics in LoopVectorization 0.12.38.
It should do better now, but I still saw some anomalously worse performance on my Cascadelake-X system from objectively better assembly. I.e., reloading the same data multiple times instead of reusing a register -> better performance, so it's generating the objectively better but slower (on that machine) code.
I could perhaps add heuristics to identify that case and add prefetch instructions, which should give the same benefit (but more of it) as whatever benefit the extra loads are providing.
On my Haswell (AVX2) laptop, the code without extra loads seemed slightly faster.
Code with extra loads == @inbounds @fastmath for ...; @simd ivdep for ...
I can reproduce the problem with LLVM by using @simd ivdep like above, but moving all the stores after all the loads in the loop. That eliminates the redundant loads, and worsens performance.

Chris Elrod  18 hours ago
Anyway, the loop is memory bound and the key seems to be making it as easy on the hardware prefetchers as possible. Any sort of fanciness -- including using TiledIteration.jl to block -- just makes performance worse.
So it may be worth getting LoopVectorization to emit software prefetches there.
EDIT: Played around with software prefetching there. Made it slower.

Chris Elrod  16 hours ago
Also note that 512 is a magic bad number. Benchmarks with 512x512 matrices:
[attached: stencilbench.jl]



Chris Elrod  16 hours ago
julia> @benchmark fbase!($du,$u,$α₁)
samples: 4373; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (1.06e6  - 1.102e6]  █▏122
 (1.102e6 - 1.144e6]  ██████████████████████████████ 3397
 (1.144e6 - 1.186e6]  █████▎591
 (1.186e6 - 1.229e6]  █▊195
 (1.229e6 - 1.271e6]  ▍32
 (1.271e6 - 1.313e6]  ▏10
 (1.313e6 - 1.355e6]  ▏5
 (1.355e6 - 1.397e6]  ▏9
 (1.397e6 - 1.439e6]  ▏1
 (1.439e6 - 1.481e6]   0
 (1.481e6 - 1.523e6]  ▏3
 (1.523e6 - 1.565e6]  ▏1
 (1.565e6 - 1.608e6]  ▏2
 (1.608e6 - 1.807e6]  ▏5
                  Counts
min: 1.060 ms (0.00% GC); mean: 1.134 ms (0.00% GC); median: 1.126 ms (0.00% GC); max: 1.807 ms (0.00% GC).
julia> @benchmark fturbo!($du,$u,$α₁)
samples: 6310; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (740000.0 - 787000.0]  ██████████████████████████████ 4631
 (787000.0 - 834000.0]  ███████▌1139
 (834000.0 - 881000.0]  ██▏321
 (881000.0 - 928000.0]  █139
 (928000.0 - 975000.0]  ▍40
 (975000.0 - 1.022e6 ]  ▏19
 (1.022e6  - 1.069e6 ]  ▏6
 (1.069e6  - 1.116e6 ]  ▏3
 (1.116e6  - 1.163e6 ]  ▏4
 (1.163e6  - 1.211e6 ]   0
 (1.211e6  - 1.258e6 ]   0
 (1.258e6  - 1.305e6 ]  ▏1
 (1.305e6  - 1.352e6 ]   0
 (1.352e6  - 1.577e6 ]  ▏7
                  Counts
min: 739.585 μs (0.00% GC); mean: 781.544 μs (0.00% GC); median: 767.157 μs (0.00% GC); max: 1.577 ms (0.00% GC).
julia> @benchmark funroll1!($du,$u,$α₁)
samples: 6365; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (739000.0 - 756000.0]  ███████████████▋1365
 (756000.0 - 772000.0]  ██████████████████████████████ 2623
 (772000.0 - 788000.0]  ██████████▋921
 (788000.0 - 804000.0]  ██████▊580
 (804000.0 - 821000.0]  ████▊407
 (821000.0 - 837000.0]  ██▍201
 (837000.0 - 853000.0]  █▎109
 (853000.0 - 870000.0]  ▉73
 (870000.0 - 886000.0]  ▌34
 (886000.0 - 902000.0]  ▍25
 (902000.0 - 918000.0]  ▏9
 (918000.0 - 935000.0]  ▏6
 (935000.0 - 951000.0]  ▏5
 (951000.0 - 1.137e6 ]  ▏7
                  Counts
min: 739.437 μs (0.00% GC); mean: 775.172 μs (0.00% GC); median: 764.959 μs (0.00% GC); max: 1.137 ms (0.00% GC).
julia> @benchmark funroll1_v2!($du,$u,$α₁)
samples: 6431; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (730000.0 - 746000.0]  ████████▍806
 (746000.0 - 761000.0]  ██████████████████████████████▏2902
 (761000.0 - 777000.0]  ████████████▊1229
 (777000.0 - 792000.0]  ██████▍605
 (792000.0 - 808000.0]  ███▉369
 (808000.0 - 823000.0]  ██▏194
 (823000.0 - 839000.0]  █▍125
 (839000.0 - 855000.0]  ▉75
 (855000.0 - 870000.0]  ▌43
 (870000.0 - 886000.0]  ▍33
 (886000.0 - 901000.0]  ▎18
 (901000.0 - 917000.0]  ▏9
 (917000.0 - 932000.0]  ▎16
 (932000.0 - 1.133e6 ]  ▏7
                  Counts
min: 730.097 μs (0.00% GC); mean: 767.196 μs (0.00% GC); median: 757.611 μs (0.00% GC); max: 1.133 ms (0.00% GC).
julia> @benchmark fbase_ivdep!($du,$u,$α₁)
samples: 6693; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (688000.0 - 709000.0]  ███████████████████████████▎1779
 (709000.0 - 730000.0]  ██████████████████████████████ 1959
 (730000.0 - 751000.0]  █████████████████▉1163
 (751000.0 - 773000.0]  ███████████▎731
 (773000.0 - 794000.0]  ██████▉441
 (794000.0 - 815000.0]  ████▌291
 (815000.0 - 836000.0]  ██▍154
 (836000.0 - 857000.0]  █▎81
 (857000.0 - 878000.0]  ▌30
 (878000.0 - 900000.0]  ▌27
 (900000.0 - 921000.0]  ▍19
 (921000.0 - 942000.0]  ▏7
 (942000.0 - 963000.0]  ▏4
 (963000.0 - 1.17e6  ]  ▏7
                  Counts
min: 687.778 μs (0.00% GC); mean: 736.942 μs (0.00% GC); median: 724.632 μs (0.00% GC); max: 1.170 ms (0.00% GC).
julia> @benchmark fbase_ivdep_v2!($du,$u,$α₁)
samples: 6249; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (749000.0 - 766000.0]  ██████████████▏1134
 (766000.0 - 783000.0]  ██████████████████████████████ 2424
 (783000.0 - 801000.0]  █████████████▌1089
 (801000.0 - 818000.0]  ████████▎661
 (818000.0 - 835000.0]  ████▉392
 (835000.0 - 852000.0]  ███241
 (852000.0 - 869000.0]  █▍109
 (869000.0 - 886000.0]  ▉61
 (886000.0 - 903000.0]  ▊51
 (903000.0 - 920000.0]  ▌40
 (920000.0 - 937000.0]  ▍24
 (937000.0 - 954000.0]  ▏8
 (954000.0 - 971000.0]  ▏8
 (971000.0 - 1.177e6 ]  ▏7
                  Counts
min: 749.351 μs (0.00% GC); mean: 789.853 μs (0.00% GC); median: 779.491 μs (0.00% GC); max: 1.177 ms (0.00% GC).



Chris Elrod  16 hours ago
Increasing N from 512 to 528:



Chris Elrod  16 hours ago
julia> @benchmark fbase!($du,$u,$α₁)
samples: 4993; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (911000.0 - 924000.0]  ██████▉477
 (924000.0 - 936000.0]  █████████████▏904
 (936000.0 - 949000.0]  ██▊183
 (949000.0 - 961000.0]  █▋112
 (961000.0 - 974000.0]  ▋37
 (974000.0 - 987000.0]  ▎9
 (987000.0 - 999000.0]  ▏2
 (999000.0 - 1.012e6 ]  ▏1
 (1.012e6  - 1.025e6 ]  ██████████████████████████████ 2086
 (1.025e6  - 1.037e6 ]  ██████████▍718
 (1.037e6  - 1.05e6  ]  ████▋319
 (1.05e6   - 1.063e6 ]  █▉128
 (1.063e6  - 1.075e6 ]  ▎12
 (1.075e6  - 1.192e6 ]  ▏5
                  Counts
min: 910.853 μs (0.00% GC); mean: 993.422 μs (0.00% GC); median: 1.020 ms (0.00% GC); max: 1.192 ms (0.00% GC).
julia> @benchmark fturbo!($du,$u,$α₁)
samples: 6683; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (709000.0 - 719000.0]  ███████▎552
 (719000.0 - 730000.0]  ██████████████████████████████ 2323
 (730000.0 - 740000.0]  ██████████████████████▎1717
 (740000.0 - 751000.0]  ████████▎637
 (751000.0 - 762000.0]  ████▋353
 (762000.0 - 772000.0]  ████▋356
 (772000.0 - 783000.0]  ████305
 (783000.0 - 793000.0]  ███232
 (793000.0 - 804000.0]  █▋118
 (804000.0 - 814000.0]  ▌35
 (814000.0 - 825000.0]  ▍27
 (825000.0 - 836000.0]  ▎16
 (836000.0 - 846000.0]  ▏5
 (846000.0 - 1.136e6 ]  ▏7
                  Counts
min: 708.509 μs (0.00% GC); mean: 739.460 μs (0.00% GC); median: 731.846 μs (0.00% GC); max: 1.136 ms (0.00% GC).
julia> @benchmark funroll1!($du,$u,$α₁)
samples: 6669; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (707000.0 - 719000.0]  █████▊429
 (719000.0 - 730000.0]  ██████████████████████████████ 2241
 (730000.0 - 742000.0]  █████████████████████████▋1905
 (742000.0 - 754000.0]  █████████▏680
 (754000.0 - 765000.0]  █████▌409
 (765000.0 - 777000.0]  ████▉359
 (777000.0 - 788000.0]  ████▍325
 (788000.0 - 800000.0]  ██▎163
 (800000.0 - 811000.0]  █▏79
 (811000.0 - 823000.0]  ▍28
 (823000.0 - 834000.0]  ▎18
 (834000.0 - 846000.0]  ▍19
 (846000.0 - 857000.0]  ▏7
 (857000.0 - 1.143e6 ]  ▏7
                  Counts
min: 707.473 μs (0.00% GC); mean: 740.984 μs (0.00% GC); median: 733.445 μs (0.00% GC); max: 1.143 ms (0.00% GC).
julia> @benchmark funroll1_v2!($du,$u,$α₁)
samples: 6594; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (717000.0 - 727000.0]  ████▍285
 (727000.0 - 736000.0]  █████████████████████████▊1707
 (736000.0 - 746000.0]  ██████████████████████████████ 1998
 (746000.0 - 756000.0]  ██████████████▍950
 (756000.0 - 766000.0]  █████▊377
 (766000.0 - 775000.0]  ████▋305
 (775000.0 - 785000.0]  ████▌293
 (785000.0 - 795000.0]  ████▋307
 (795000.0 - 805000.0]  ███196
 (805000.0 - 814000.0]  █▋101
 (814000.0 - 824000.0]  ▊45
 (824000.0 - 834000.0]  ▍17
 (834000.0 - 844000.0]  ▏6
 (844000.0 - 1.147e6 ]  ▏7
                  Counts
min: 716.736 μs (0.00% GC); mean: 749.543 μs (0.00% GC); median: 742.499 μs (0.00% GC); max: 1.147 ms (0.00% GC).
julia> @benchmark fbase_ivdep!($du,$u,$α₁)
samples: 7403; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (635000.0 - 649000.0]  ██████████████████████████████▏2645
 (649000.0 - 664000.0]  ██████████████████████████2292
 (664000.0 - 678000.0]  ██████▊591
 (678000.0 - 693000.0]  ████▎374
 (693000.0 - 707000.0]  █████▋490
 (707000.0 - 722000.0]  █████▉508
 (722000.0 - 736000.0]  ███▎276
 (736000.0 - 751000.0]  █▍119
 (751000.0 - 765000.0]  ▋49
 (765000.0 - 780000.0]  ▍25
 (780000.0 - 794000.0]  ▎18
 (794000.0 - 809000.0]  ▏3
 (809000.0 - 823000.0]  ▏5
 (823000.0 - 1.106e6 ]  ▏8
                  Counts
min: 634.654 μs (0.00% GC); mean: 666.727 μs (0.00% GC); median: 652.383 μs (0.00% GC); max: 1.106 ms (0.00% GC).
julia> @benchmark fbase_ivdep_v2!($du,$u,$α₁)
samples: 6610; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
 (717000.0 - 730000.0]  ████████▊893
 (730000.0 - 742000.0]  ██████████████████████████████▏3102
 (742000.0 - 754000.0]  ██████████1031
 (754000.0 - 766000.0]  ███▋363
 (766000.0 - 779000.0]  ███▌358
 (779000.0 - 791000.0]  ███▎331
 (791000.0 - 803000.0]  ██▊278
 (803000.0 - 815000.0]  █▎125
 (815000.0 - 827000.0]  ▋63
 (827000.0 - 840000.0]  ▍27
 (840000.0 - 852000.0]  ▏12
 (852000.0 - 864000.0]  ▏11
 (864000.0 - 876000.0]  ▏9
 (876000.0 - 1.126e6 ]  ▏7
                  Counts
min: 717.473 μs (0.00% GC); mean: 747.755 μs (0.00% GC); median: 738.879 μs (0.00% GC); max: 1.126 ms (0.00% GC).



Chris Elrod  16 hours ago
I attached the script.
512 is bad because many CPUs (including the 7900X Skylake-X CPU I benchmarked on) have 32 KiB L1 caches with 8-way associativity.
That means data
julia> 32*(2^10) ÷ 8
4096
bytes apart contends for the same slot of the L1 cache.
I.e., data
julia> (32*(2^10) ÷ 8) ÷ sizeof(Float64)
512
Float64s apart contends for the same slot.
With 512x512 matrices, this means data from A[i,j-1], A[i,j], and A[i,j+1] all want to occupy the same place in the L1 cache.
This also makes prefetching the next cacheline (i.e. A[i+8,j-1], A[i+8,j], and A[i+8,j+1]) at the same time harder.
Hence increasing the sizes of these arrays can actually make operating on them take less time overall.
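As a minimal sketch of the same arithmetic (mine, not from the thread): with a 32 KiB, 8-way L1D and 64-byte lines there are 32*2^10 ÷ 8 ÷ 64 = 64 sets, so byte addresses 4096 apart land in the same set:

l1_bytes = 32 * 2^10    # 32 KiB L1 data cache
ways     = 8            # 8-way set associative
linesize = 64           # 64-byte cache lines
nsets    = l1_bytes ÷ ways ÷ linesize      # 64 sets
set_of(addr) = (addr ÷ linesize) % nsets   # which set a byte address maps to

colstride = 512 * sizeof(Float64)          # 4096-byte column stride of a 512x512 Float64 matrix
@assert set_of(0) == set_of(colstride) == set_of(2colstride)  # neighboring columns collide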

@chriselrod

chriselrod commented Jun 16, 2021

Solve times: prob uses @turbo, prob_threads uses @tturbo, and prob_simdivdep uses @inbounds @fastmath @simd ivdep:

julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 46.533658 seconds (14.03 k allocations: 35.467 GiB, 0.49% gc time)

julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 42.405112 seconds (14.86 k allocations: 35.467 GiB, 0.93% gc time)

julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 46.475041 seconds (14.12 k allocations: 35.725 GiB, 0.14% gc time)

julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.199515 seconds (266 allocations: 708.018 MiB)

julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 38.143093 seconds (283 allocations: 708.018 MiB, 0.07% gc time)

julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.644268 seconds (266 allocations: 708.018 MiB, 0.24% gc time)

julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.931490 seconds (265 allocations: 690.017 MiB)

julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 37.560453 seconds (282 allocations: 690.018 MiB, 0.01% gc time)

julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 43.625376 seconds (265 allocations: 690.017 MiB, 0.01% gc time)

Testing just the functions:

julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 9231; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (530100.0 - 531000.0]  ▊31
 (531000.0 - 532000.0]  ████▌207
 (532000.0 - 532900.0]  ████████████▍575
 (532900.0 - 533800.0]  █████████████████████▏991
 (533800.0 - 534700.0]  ████████████████████████████▌1335
 (534700.0 - 535700.0]  ██████████████████████████████1408
 (535700.0 - 536600.0]  ██████████████████████████████1404
 (536600.0 - 537500.0]  ████████████████████████████▌1333
 (537500.0 - 538400.0]  ████████████████████████1125
 (538400.0 - 539400.0]  ████████████▋589
 (539400.0 - 540300.0]  ████184
 (540300.0 - 541200.0]  ▊31
 (541200.0 - 542100.0]  ▎6
 (542100.0 - 543100.0]  ▏2
 (543100.0 - 895600.0]  ▎10

                  Counts

min: 530.115 μs (0.00% GC); mean: 535.779 μs (0.00% GC); median: 535.711 μs (0.00% GC); max: 895.599 μs (0.00% GC).

julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (81000.0 - 81900.0 ]  ▏1
 (81900.0 - 82800.0 ]  ▏5
 (82800.0 - 83700.0 ]  ▊50
 (83700.0 - 84500.0 ]  ███▍260
 (84500.0 - 85400.0 ]  ████████████948
 (85400.0 - 86300.0 ]  █████████████████████████▎2006
 (86300.0 - 87200.0 ]  ██████████████████████████████2384
 (87200.0 - 88100.0 ]  ████████████████████████▊1964
 (88100.0 - 89000.0 ]  ████████████████▊1322
 (89000.0 - 89800.0 ]  ████████▎650
 (89800.0 - 90700.0 ]  ███238
 (90700.0 - 91600.0 ]  █▍107
 (91600.0 - 92500.0 ]  ▋41
 (92500.0 - 93400.0 ]  ▎14
 (93400.0 - 327700.0]  ▎10

                  Counts

min: 81.006 μs (0.00% GC); mean: 87.097 μs (0.00% GC); median: 86.934 μs (0.00% GC); max: 327.730 μs (0.00% GC).

julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 9821; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (498800.0 - 499600.0]  ▎10
 (499600.0 - 500400.0]  █▉109
 (500400.0 - 501200.0]  ███████████▊688
 (501200.0 - 502000.0]  █████████████████████████▋1512
 (502000.0 - 502800.0]  ██████████████████████████▋1571
 (502800.0 - 503500.0]  █████████████████████████▋1508
 (503500.0 - 504300.0]  ██████████████████████████████▏1772
 (504300.0 - 505100.0]  █████████████████████████▊1517
 (505100.0 - 505900.0]  █████████████▎781
 (505900.0 - 506700.0]  ████▋266
 (506700.0 - 507500.0]  █54
 (507500.0 - 508200.0]  ▎13
 (508200.0 - 509000.0]  ▏5
 (509000.0 - 509800.0]  ▏5
 (509800.0 - 897000.0]  ▎10

                  Counts

min: 498.833 μs (0.00% GC); mean: 503.346 μs (0.00% GC); median: 503.297 μs (0.00% GC); max: 897.010 μs (0.00% GC).

So most of the performance gain we see from threading doesn't translate to the actual solve; presumably the solver's own (serial) internal array operations dominate at this problem size.

For reference, here is the assembly from the main loop using @inbounds @fastmath @simd ivdep, along with llvm-mca's report:

Iterations:        100
Instructions:      2400
Total Cycles:      878
Total uOps:        4100

Dispatch Width:    6
uOps Per Cycle:    4.67
IPC:               2.73
Block RThroughput: 7.0


Cycles with backend pressure increase [ 45.22% ]
Throughput Bottlenecks:
  Resource Pressure       [ 29.16% ]
  - SKXPort0  [ 21.30% ]
  - SKXPort1  [ 0.80% ]
  - SKXPort2  [ 14.58% ]
  - SKXPort3  [ 14.58% ]
  - SKXPort4  [ 4.78% ]
  - SKXPort5  [ 21.30% ]
  Data Dependencies:      [ 39.41% ]
  - Register Dependencies [ 39.41% ]
  - Memory Dependencies   [ 0.00% ]

Critical sequence based on the simulation:

              Instruction                                 Dependency Information
 +----< 22.   add       r8, 64
 |
 |    < loop carried >
 |
 +----> 0.    vmovupd   zmm6, zmmword ptr [r14 + r8 - 8]  ## REGISTER dependency:  r8
 |      1.    vmovupd   zmm7, zmmword ptr [r14 + r8]
 |      2.    vaddpd    zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
 |      3.    vaddpd    zmm7, zmm7, zmmword ptr [rbx + r8]
 |      4.    vaddpd    zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
 |      5.    vfmadd231pd       zmm7, zmm6, zmm3
 |      6.    vfmadd213pd       zmm7, zmm4, zmmword ptr [rdx + r8]
 +----> 7.    vfnmsub132pd      zmm6, zmm6, zmmword ptr [r10 + r8] ## REGISTER dependency:  zmm6
 |      8.    vaddpd    zmm6, zmm7, zmm6
 |      9.    vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 |      10.   vmovupd   zmmword ptr [r13 + r8], zmm6
 +----> 11.   vmovupd   zmm6, zmmword ptr [r10 + r8]      ## RESOURCE interference:  SKXPort3 [ probability: 2% ]
 +----> 12.   vfnmsub231pd      zmm6, zmm6, zmmword ptr [r14 + r8 - 8] ## REGISTER dependency:  zmm6
 |      13.   vaddpd    zmm6, zmm6, zmm5
 |      14.   vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 |      15.   vmovupd   zmmword ptr [r12 + r8], zmm6
 +----> 16.   vmovupd   zmm6, zmmword ptr [r9 + r8]       ## RESOURCE interference:  SKXPort3 [ probability: 11% ]
 |      17.   vmovupd   zmm7, zmmword ptr [r10 + r8]
 |      18.   vfmadd132pd       zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
 +----> 19.   vaddpd    zmm6, zmm6, zmm6                  ## REGISTER dependency:  zmm6
 +----> 20.   vsubpd    zmm6, zmm7, zmm6                  ## REGISTER dependency:  zmm6
 +----> 21.   vmovupd   zmmword ptr [rdi + r8], zmm6      ## REGISTER dependency:  zmm6
        22.   add       r8, 64
        23.   cmp       r8, 4032


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [r14 + r8 - 8]
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [r14 + r8]
 2      11    0.50    *                   vaddpd        zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
 2      11    0.50    *                   vaddpd        zmm7, zmm7, zmmword ptr [rbx + r8]
 2      11    0.50    *                   vaddpd        zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
 1      4     0.50                        vfmadd231pd   zmm7, zmm6, zmm3
 2      11    0.50    *                   vfmadd213pd   zmm7, zmm4, zmmword ptr [rdx + r8]
 2      11    0.50    *                   vfnmsub132pd  zmm6, zmm6, zmmword ptr [r10 + r8]
 1      4     0.50                        vaddpd        zmm6, zmm7, zmm6
 2      11    0.50    *                   vaddpd        zmm6, zmm6, zmmword ptr [r9 + r8]
 2      1     1.00           *            vmovupd       zmmword ptr [r13 + r8], zmm6
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [r10 + r8]
 2      11    0.50    *                   vfnmsub231pd  zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
 1      4     0.50                        vaddpd        zmm6, zmm6, zmm5
 2      11    0.50    *                   vaddpd        zmm6, zmm6, zmmword ptr [r9 + r8]
 2      1     1.00           *            vmovupd       zmmword ptr [r12 + r8], zmm6
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [r9 + r8]
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [r10 + r8]
 2      11    0.50    *                   vfmadd132pd   zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
 1      4     0.50                        vaddpd        zmm6, zmm6, zmm6
 1      4     0.50                        vsubpd        zmm6, zmm7, zmm6
 2      1     1.00           *            vmovupd       zmmword ptr [rdi + r8], zmm6
 1      1     0.25                        add   r8, 64
 1      1     0.25                        cmp   r8, 4032


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     7.44   4.25   7.06   7.06   3.00   7.31   2.00   2.88

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.01   0.98   0.41   0.59    -     0.01    -      -     vmovupd   zmm6, zmmword ptr [r14 + r8 - 8]
 -      -     0.13   0.84   0.55   0.45    -     0.03    -      -     vmovupd   zmm7, zmmword ptr [r14 + r8]
 -      -     0.61    -     0.59   0.41    -     0.39    -      -     vaddpd    zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
 -      -     0.47    -     0.61   0.39    -     0.53    -      -     vaddpd    zmm7, zmm7, zmmword ptr [rbx + r8]
 -      -     0.51    -     0.55   0.45    -     0.49    -      -     vaddpd    zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
 -      -     0.46    -      -      -      -     0.54    -      -     vfmadd231pd       zmm7, zmm6, zmm3
 -      -     0.54    -     0.46   0.54    -     0.46    -      -     vfmadd213pd       zmm7, zmm4, zmmword ptr [rdx + r8]
 -      -     0.44    -     0.66   0.34    -     0.56    -      -     vfnmsub132pd      zmm6, zmm6, zmmword ptr [r10 + r8]
 -      -     0.48    -      -      -      -     0.52    -      -     vaddpd    zmm6, zmm7, zmm6
 -      -     0.52    -     0.69   0.31    -     0.48    -      -     vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 -      -      -      -     0.01    -     1.00    -      -     0.99   vmovupd   zmmword ptr [r13 + r8], zmm6
 -      -     0.11   0.88   0.49   0.51    -     0.01    -      -     vmovupd   zmm6, zmmword ptr [r10 + r8]
 -      -     0.66    -     0.32   0.68    -     0.34    -      -     vfnmsub231pd      zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
 -      -     0.50    -      -      -      -     0.50    -      -     vaddpd    zmm6, zmm6, zmm5
 -      -     0.50    -     0.47   0.53    -     0.50    -      -     vaddpd    zmm6, zmm6, zmmword ptr [r9 + r8]
 -      -      -      -      -     0.02   1.00    -      -     0.98   vmovupd   zmmword ptr [r12 + r8], zmm6
 -      -     0.01   0.98   0.24   0.76    -     0.01    -      -     vmovupd   zmm6, zmmword ptr [r9 + r8]
 -      -     0.25   0.57   0.57   0.43    -     0.18    -      -     vmovupd   zmm7, zmmword ptr [r10 + r8]
 -      -     0.45    -     0.36   0.64    -     0.55    -      -     vfmadd132pd       zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
 -      -     0.39    -      -      -      -     0.61    -      -     vaddpd    zmm6, zmm6, zmm6
 -      -     0.40    -      -      -      -     0.60    -      -     vsubpd    zmm6, zmm7, zmm6
 -      -      -      -     0.08   0.01   1.00    -      -     0.91   vmovupd   zmmword ptr [rdi + r8], zmm6
 -      -      -      -      -      -      -      -     1.00    -     add       r8, 64
 -      -      -      -      -      -      -      -     1.00    -     cmp       r8, 4032

Versus @turbo:

Iterations:        100
Instructions:      3100
Total Cycles:      783
Total uOps:        4200

Dispatch Width:    6
uOps Per Cycle:    5.36
IPC:               3.96
Block RThroughput: 7.0


No resource or data dependency bottlenecks discovered.


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 2      8     0.50    *                   vmovupd       zmm4, zmmword ptr [rbx - 8]
 2      8     0.50    *                   vmovupd       zmm5, zmmword ptr [rbx]
 2      11    0.50    *                   vaddpd        zmm4, zmm4, zmmword ptr [rbx + 8]
 2      11    0.50    *                   vaddpd        zmm4, zmm4, zmmword ptr [r12 + rcx]
 2      11    0.50    *                   vaddpd        zmm4, zmm4, zmmword ptr [r12]
 1      4     0.50                        vfmadd231pd   zmm4, zmm5, zmm1
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [rbp]
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [rdx]
 2      11    0.50    *                   vaddpd        zmm8, zmm7, zmmword ptr [rdi]
 1      4     0.50                        vfmsub231pd   zmm8, zmm5, zmm6
 1      4     0.50                        vfnmadd231pd  zmm8, zmm5, zmm0
 1      4     0.50                        vfmsub231pd   zmm8, zmm4, zmm2
 1      4     0.50                        vaddpd        zmm4, zmm7, zmm3
 1      4     0.50                        vfmsub231pd   zmm4, zmm5, zmm6
 1      4     0.50                        vfmsub231pd   zmm4, zmm6, zmm0
 1      1     0.33                        vmovapd       zmm9, zmm0
 1      4     0.50                        vfmadd213pd   zmm9, zmm7, zmm3
 1      4     0.50                        vfmadd231pd   zmm9, zmm5, zmm6
 2      1     1.00           *            vmovupd       zmmword ptr [r9], zmm8
 2      1     1.00           *            vmovupd       zmmword ptr [rax], zmm4
 1      4     0.50                        vfmadd231pd   zmm9, zmm0, zmm7
 2      1     1.00           *            vmovupd       zmmword ptr [r14], zmm9
 1      1     0.25                        add   rbx, 64
 1      1     0.25                        add   rbp, 64
 1      1     0.25                        add   rdx, 64
 1      1     0.25                        add   rdi, 64
 1      1     0.25                        add   r9, 64
 1      1     0.25                        add   rax, 64
 1      1     0.25                        add   r14, 64
 1      1     0.25                        add   r12, 64
 1      1     0.25                        cmp   r10, rbx


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     7.52   7.33   4.06   4.06   3.00   7.53   5.62   2.88

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.07   0.84   0.61   0.39    -     0.09    -      -     vmovupd   zmm4, zmmword ptr [rbx - 8]
 -      -      -     0.97   0.45   0.55    -     0.03    -      -     vmovupd   zmm5, zmmword ptr [rbx]
 -      -     0.65    -     0.39   0.61    -     0.35    -      -     vaddpd    zmm4, zmm4, zmmword ptr [rbx + 8]
 -      -     0.48    -     0.37   0.63    -     0.52    -      -     vaddpd    zmm4, zmm4, zmmword ptr [r12 + rcx]
 -      -     0.35    -     0.80   0.20    -     0.65    -      -     vaddpd    zmm4, zmm4, zmmword ptr [r12]
 -      -     0.27    -      -      -      -     0.73    -      -     vfmadd231pd       zmm4, zmm5, zmm1
 -      -     0.02   0.83   0.62   0.38    -     0.15    -      -     vmovupd   zmm6, zmmword ptr [rbp]
 -      -     0.08   0.91   0.29   0.71    -     0.01    -      -     vmovupd   zmm7, zmmword ptr [rdx]
 -      -     0.56    -     0.49   0.51    -     0.44    -      -     vaddpd    zmm8, zmm7, zmmword ptr [rdi]
 -      -     0.39    -      -      -      -     0.61    -      -     vfmsub231pd       zmm8, zmm5, zmm6
 -      -     0.40    -      -      -      -     0.60    -      -     vfnmadd231pd      zmm8, zmm5, zmm0
 -      -     0.29    -      -      -      -     0.71    -      -     vfmsub231pd       zmm8, zmm4, zmm2
 -      -     0.55    -      -      -      -     0.45    -      -     vaddpd    zmm4, zmm7, zmm3
 -      -     0.31    -      -      -      -     0.69    -      -     vfmsub231pd       zmm4, zmm5, zmm6
 -      -     0.32    -      -      -      -     0.68    -      -     vfmsub231pd       zmm4, zmm6, zmm0
 -      -     0.20   0.78    -      -      -     0.02    -      -     vmovapd   zmm9, zmm0
 -      -     0.75    -      -      -      -     0.25    -      -     vfmadd213pd       zmm9, zmm7, zmm3
 -      -     0.77    -      -      -      -     0.23    -      -     vfmadd231pd       zmm9, zmm5, zmm6
 -      -      -      -     0.01    -     1.00    -      -     0.99   vmovupd   zmmword ptr [r9], zmm8
 -      -      -      -      -     0.03   1.00    -      -     0.97   vmovupd   zmmword ptr [rax], zmm4
 -      -     0.74    -      -      -      -     0.26    -      -     vfmadd231pd       zmm9, zmm0, zmm7
 -      -      -      -     0.03   0.05   1.00    -      -     0.92   vmovupd   zmmword ptr [r14], zmm9
 -      -      -      -      -      -      -      -     1.00    -     add       rbx, 64
 -      -     0.12   0.03    -      -      -     0.01   0.84    -     add       rbp, 64
 -      -     0.05   0.34    -      -      -     0.02   0.59    -     add       rdx, 64
 -      -     0.06   0.24    -      -      -      -     0.70    -     add       rdi, 64
 -      -     0.02   0.69    -      -      -     0.01   0.28    -     add       r9, 64
 -      -      -     0.29    -      -      -      -     0.71    -     add       rax, 64
 -      -     0.02   0.30    -      -      -     0.02   0.66    -     add       r14, 64
 -      -     0.01   0.69    -      -      -      -     0.30    -     add       r12, 64
 -      -     0.04   0.42    -      -      -      -     0.54    -     cmp       r10, rbx

All the extra adds here for LoopVectorization are because, while LLVM can figure out that the views all point into the same array, LoopVectorization doesn't know that.

If you want just a single number from each report, LLVM-MCA estimates 100 iterations of the @turbo loop will take 783 clock cycles, while 100 iterations of the @simd ivdep loop will take 878 clock cycles.
If we replace the views with direct indexing into u and du (see fnoviews! in the code below), @turbo is down to 731 cycles:

Iterations:        100
Instructions:      2700
Total Cycles:      731
Total uOps:        3800

Dispatch Width:    6
uOps Per Cycle:    5.20
IPC:               3.69
Block RThroughput: 7.0


Cycles with backend pressure increase [ 51.44% ]
Throughput Bottlenecks:
  Resource Pressure       [ 50.21% ]
  - SKXPort0  [ 49.93% ]
  - SKXPort1  [ 49.11% ]
  - SKXPort2  [ 13.13% ]
  - SKXPort3  [ 13.13% ]
  - SKXPort4  [ 5.06% ]
  - SKXPort5  [ 49.93% ]
  - SKXPort6  [ 9.58% ]
  - SKXPort7  [ 3.15% ]
  Data Dependencies:      [ 39.53% ]
  - Register Dependencies [ 39.53% ]
  - Memory Dependencies   [ 0.00% ]

Critical sequence based on the simulation:

              Instruction                                 Dependency Information
 +----< 13.   vaddpd    zmm5, zmm7, zmm3
 |
 |    < loop carried >
 |
 |      0.    mov       r11, rbx
 +----> 1.    vmovupd   zmm4, zmmword ptr [rbx]           ## RESOURCE interference:  SKXPort5 [ probability: 25% ]
 +----> 2.    vmovupd   zmm5, zmmword ptr [rbx - 8]       ## RESOURCE interference:  SKXPort1 [ probability: 91% ]
 +----> 3.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference:  SKXPort2 [ probability: 13% ]
 +----> 4.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + r9] ## REGISTER dependency:  zmm5
 +----> 5.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + rax] ## REGISTER dependency:  zmm5
 +----> 6.    vmovupd   zmm6, zmmword ptr [rbx + rdi]     ## RESOURCE interference:  SKXPort5 [ probability: 1% ]
 |      7.    vfmadd231pd       zmm5, zmm4, zmm0
 +----> 8.    vmovupd   zmm7, zmmword ptr [rbx + 2*rdi]   ## RESOURCE interference:  SKXPort1 [ probability: 93% ]
 |      9.    vaddpd    zmm8, zmm7, zmmword ptr [rbp]
 |      10.   vfmsub231pd       zmm8, zmm4, zmm6
 |      11.   vfnmadd231pd      zmm8, zmm4, zmm1
 |      12.   vfmsub231pd       zmm8, zmm5, zmm2
 +----> 13.   vaddpd    zmm5, zmm7, zmm3                  ## REGISTER dependency:  zmm7
 |      14.   vfmsub231pd       zmm5, zmm4, zmm6
 |      15.   vfmsub231pd       zmm5, zmm6, zmm1
 |      16.   vmovapd   zmm9, zmm1
 +----> 17.   vfmadd213pd       zmm9, zmm7, zmm3          ## RESOURCE interference:  SKXPort5 [ probability: 51% ]
 |      18.   vfmadd231pd       zmm9, zmm4, zmm6
 |      19.   vfmadd231pd       zmm9, zmm1, zmm7
 |      20.   vmovupd   zmmword ptr [r12], zmm8
 |      21.   vmovupd   zmmword ptr [r12 + rdx], zmm5
 |      22.   vmovupd   zmmword ptr [r12 + 2*rdx], zmm9
 |      23.   add       rbx, 64
 |      24.   add       rbp, 64
 |      25.   add       r12, 64
 |      26.   cmp       rsi, rbx
 |
 |    < loop carried >
 |
 +----> 3.    vaddpd    zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference:  SKXPort5 [ probability: 38% ]


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        mov   r11, rbx
 2      8     0.50    *                   vmovupd       zmm4, zmmword ptr [rbx]
 2      8     0.50    *                   vmovupd       zmm5, zmmword ptr [rbx - 8]
 2      11    0.50    *                   vaddpd        zmm5, zmm5, zmmword ptr [rbx + 8]
 2      11    0.50    *                   vaddpd        zmm5, zmm5, zmmword ptr [rbx + r9]
 2      11    0.50    *                   vaddpd        zmm5, zmm5, zmmword ptr [rbx + rax]
 2      8     0.50    *                   vmovupd       zmm6, zmmword ptr [rbx + rdi]
 1      4     0.50                        vfmadd231pd   zmm5, zmm4, zmm0
 2      8     0.50    *                   vmovupd       zmm7, zmmword ptr [rbx + 2*rdi]
 2      11    0.50    *                   vaddpd        zmm8, zmm7, zmmword ptr [rbp]
 1      4     0.50                        vfmsub231pd   zmm8, zmm4, zmm6
 1      4     0.50                        vfnmadd231pd  zmm8, zmm4, zmm1
 1      4     0.50                        vfmsub231pd   zmm8, zmm5, zmm2
 1      4     0.50                        vaddpd        zmm5, zmm7, zmm3
 1      4     0.50                        vfmsub231pd   zmm5, zmm4, zmm6
 1      4     0.50                        vfmsub231pd   zmm5, zmm6, zmm1
 1      1     0.33                        vmovapd       zmm9, zmm1
 1      4     0.50                        vfmadd213pd   zmm9, zmm7, zmm3
 1      4     0.50                        vfmadd231pd   zmm9, zmm4, zmm6
 1      4     0.50                        vfmadd231pd   zmm9, zmm1, zmm7
 2      1     1.00           *            vmovupd       zmmword ptr [r12], zmm8
 2      1     1.00           *            vmovupd       zmmword ptr [r12 + rdx], zmm5
 2      1     1.00           *            vmovupd       zmmword ptr [r12 + 2*rdx], zmm9
 1      1     0.25                        add   rbx, 64
 1      1     0.25                        add   rbp, 64
 1      1     0.25                        add   r12, 64
 1      1     0.25                        cmp   rsi, rbx


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     7.06   6.79   4.04   4.05   3.00   7.07   3.08   2.91

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     0.85    -      -      -     0.01   0.14    -     mov       r11, rbx
 -      -      -     0.99   0.46   0.54    -     0.01    -      -     vmovupd   zmm4, zmmword ptr [rbx]
 -      -     0.01   0.98   0.75   0.25    -     0.01    -      -     vmovupd   zmm5, zmmword ptr [rbx - 8]
 -      -     0.19    -     0.67   0.33    -     0.81    -      -     vaddpd    zmm5, zmm5, zmmword ptr [rbx + 8]
 -      -     0.19    -     0.53   0.47    -     0.81    -      -     vaddpd    zmm5, zmm5, zmmword ptr [rbx + r9]
 -      -     0.53    -     0.66   0.34    -     0.47    -      -     vaddpd    zmm5, zmm5, zmmword ptr [rbx + rax]
 -      -     0.01   0.97   0.31   0.69    -     0.02    -      -     vmovupd   zmm6, zmmword ptr [rbx + rdi]
 -      -     0.94    -      -      -      -     0.06    -      -     vfmadd231pd       zmm5, zmm4, zmm0
 -      -      -     0.99   0.29   0.71    -     0.01    -      -     vmovupd   zmm7, zmmword ptr [rbx + 2*rdi]
 -      -     0.10    -     0.33   0.67    -     0.90    -      -     vaddpd    zmm8, zmm7, zmmword ptr [rbp]
 -      -     0.73    -      -      -      -     0.27    -      -     vfmsub231pd       zmm8, zmm4, zmm6
 -      -     0.93    -      -      -      -     0.07    -      -     vfnmadd231pd      zmm8, zmm4, zmm1
 -      -     0.96    -      -      -      -     0.04    -      -     vfmsub231pd       zmm8, zmm5, zmm2
 -      -     0.09    -      -      -      -     0.91    -      -     vaddpd    zmm5, zmm7, zmm3
 -      -     0.55    -      -      -      -     0.45    -      -     vfmsub231pd       zmm5, zmm4, zmm6
 -      -     0.34    -      -      -      -     0.66    -      -     vfmsub231pd       zmm5, zmm6, zmm1
 -      -      -     0.98    -      -      -     0.02    -      -     vmovapd   zmm9, zmm1
 -      -     0.30    -      -      -      -     0.70    -      -     vfmadd213pd       zmm9, zmm7, zmm3
 -      -     0.55    -      -      -      -     0.45    -      -     vfmadd231pd       zmm9, zmm4, zmm6
 -      -     0.62    -      -      -      -     0.38    -      -     vfmadd231pd       zmm9, zmm1, zmm7
 -      -      -      -      -      -     1.00    -      -     1.00   vmovupd   zmmword ptr [r12], zmm8
 -      -      -      -      -     0.03   1.00    -      -     0.97   vmovupd   zmmword ptr [r12 + rdx], zmm5
 -      -      -      -     0.04   0.02   1.00    -      -     0.94   vmovupd   zmmword ptr [r12 + 2*rdx], zmm9
 -      -      -      -      -      -      -      -     1.00    -     add       rbx, 64
 -      -      -     0.04    -      -      -     0.01   0.95    -     add       rbp, 64
 -      -     0.01   0.87    -      -      -      -     0.12    -     add       r12, 64
 -      -     0.01   0.12    -      -      -      -     0.87    -     cmp       rsi, rbx

This code looks better to me, and llvm-mca says it is faster. But it's slower in benchmarks (on my computers) anyway, and I'm a bit at a loss.
Maybe I could add architecture-specific hacks to do bad things sometimes, but in general worse probably isn't better.
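For anyone wanting to reproduce the llvm-mca reports above, a rough recipe (my tooling assumption, not something spelled out in the thread) is to dump the kernel's native assembly, trim it to the hot loop by hand, and feed that to llvm-mca:

using InteractiveUtils  # for code_native outside the REPL
open("loop.s", "w") do io
    code_native(io, fsimdivdep!,
                Tuple{Array{Float64,3},Array{Float64,3},Nothing,Float64};
                syntax = :intel, debuginfo = :none)
end
# then, in a shell: llvm-mca -mcpu=skylake-avx512 loop.s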

Would be interesting to see more benchmarks.
Code:

using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkHistograms, LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N);
const Y = reshape([j for i in 1:N for j in 1:N],N,N);
const α₁ = 1.0.*(X.>=4*N/5);
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]);
const My = copy(Mx);
Mx[2,1] = 2.0;
Mx[end-1,end] = 2.0;
My[1,2] = 2.0;
My[end,end-1] = 2.0;
# Caches left over from the matrix-based version (unused by the loop kernels below)
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N);
# Define the discretized PDE as an ODE function
function f!(du,u,p,t)
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @turbo for j in 2:N-1, i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
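  # The four edge loops and four corner cases below apply the zero-flux (reflecting
  # ghost point) boundary conditions, i.e. the same conditions the modified
  # Mx/My entries above encode in matrix form.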
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
end
function fthreads!(du,u,p,t)
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @tturbo for j in 2:N-1, i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
end
function fsimdivdep!(du,u,p,t)
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @inbounds @fastmath for j in 2:N-1; @simd ivdep for i in 2:N-1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end; end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
end
function fnoviews!(du,u,p,t) # direct indexing (no views) in the @turbo loop; boundary code below still uses views
  A = @view  u[:,:,1]
  B = @view  u[:,:,2]
  C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  @turbo for j in 2:N-1, i in 2:N-1
    du[i,j,1] = D*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
      α₁[i,j] - β₁*u[i,j,1] - r₁*u[i,j,1]*u[i,j,2] + r₂*u[i,j,3]
    du[i,j,2] = α₂ - β₂*u[i,j,2] - r₁*u[i,j,1]*u[i,j,2] + r₂*u[i,j,3]
    du[i,j,3] = α₃ - β₃*u[i,j,3] + r₁*u[i,j,1]*u[i,j,2] - r₂*u[i,j,3]
  end
  @inbounds for j in 2:N-1
    i = 1
    dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for j in 2:N-1
    i = N
    dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = 1
    dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds for i in 2:N-1
    j = N
    dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
  @inbounds begin
    i = 1; j = 1
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = 1; j = N
    dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = 1
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
    i = N; j = N
    dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
      α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
    dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
  end
end
u0 = zeros(N,N,3); du = similar(u0);
prob = ODEProblem(f!,u0,(0.0,10.0));
prob_threads = ODEProblem(fthreads!,u0,(0.0,10.0));
prob_simdivdep = ODEProblem(fsimdivdep!,u0,(0.0,10.0));

@time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);

@time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);

@time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
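
# A quick sanity check (mine, not part of the original script) that the three
# kernels compute the same derivative before trusting the timings:
u = rand(N, N, 3); du1 = similar(u); du2 = similar(u); du3 = similar(u);
f!(du1, u, nothing, 0.0)
fthreads!(du2, u, nothing, 0.0)
fsimdivdep!(du3, u, nothing, 0.0)
@assert du1 ≈ du2 ≈ du3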

@chriselrod

Oh, and benchmarks on the M1 (running on a native Julia 1.8 build):

julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.625047 seconds (14.09 k allocations: 35.655 GiB, 0.71% gc time)

julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 91.950376 seconds (14.24 k allocations: 35.655 GiB, 0.24% gc time)

julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 23.505749 seconds (14.01 k allocations: 35.421 GiB, 0.23% gc time)

julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 19.860662 seconds (267 allocations: 708.018 MiB)

julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 22.793870 seconds (270 allocations: 708.018 MiB, 0.06% gc time)

julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.074039 seconds (270 allocations: 708.018 MiB, 0.53% gc time)

julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 20.464703 seconds (264 allocations: 690.017 MiB, 0.34% gc time)

julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.983172 seconds (267 allocations: 690.017 MiB, 0.07% gc time)

julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
 21.848427 seconds (267 allocations: 690.017 MiB, 0.39% gc time)

Not sure what's going on with ROCK4 and multithreading, but otherwise: ouch, Intel.
The benchmarks from my previous post were on a 10980XE, a much higher-end chip than the M1.
It would be interesting to see how a Zen3 chip compares.

Benchmarks of the individual functions:

julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (409000.0 - 411200.0]  ▏3
 (411200.0 - 413500.0]  ▍23
 (413500.0 - 415700.0]  ██▎191
 (415700.0 - 418000.0]  ██████████▍894
 (418000.0 - 420300.0]  ██████████████████████▊1974
 (420300.0 - 422500.0]  ██████████████████████████████▏2614
 (422500.0 - 424800.0]  █████████████████████████▋2228
 (424800.0 - 427100.0]  ██████████████▏1220
 (427100.0 - 429300.0]  █████▉506
 (429300.0 - 431600.0]  ██▏180
 (431600.0 - 433800.0]  ▉76
 (433800.0 - 436100.0]  ▋46
 (436100.0 - 438400.0]  ▍25
 (438400.0 - 440600.0]  ▏10
 (440600.0 - 534200.0]  ▏10

                  Counts

min: 408.958 μs (0.00% GC); mean: 422.196 μs (0.00% GC); median: 421.917 μs (0.00% GC); max: 534.208 μs (0.00% GC).

julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (218400.0 - 219900.0]  ▎16
 (219900.0 - 221500.0]  █73
 (221500.0 - 223000.0]  █████▉473
 (223000.0 - 224500.0]  ██████████████████▍1485
 (224500.0 - 226100.0]  █████████████████████████████▉2415
 (226100.0 - 227600.0]  ██████████████████████████████ 2433
 (227600.0 - 229100.0]  ███████████████████████1865
 (229100.0 - 230700.0]  ██████████▊871
 (230700.0 - 232200.0]  ███▏244
 (232200.0 - 233800.0]  ▊55
 (233800.0 - 235300.0]  ▎16
 (235300.0 - 236800.0]  ▍22
 (236800.0 - 238400.0]  ▎16
 (238400.0 - 239900.0]  ▏6
 (239900.0 - 411600.0]  ▏10

                  Counts

min: 218.375 μs (0.00% GC); mean: 226.536 μs (0.00% GC); median: 226.375 μs (0.00% GC); max: 411.584 μs (0.00% GC).

julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 8948; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (548500.0 - 550300.0]  ▍27
 (550300.0 - 552100.0]  ███▋290
 (552100.0 - 553900.0]  ███████████████▌1252
 (553900.0 - 555700.0]  ██████████████████████████████ 2424
 (555700.0 - 557500.0]  ████████████████████████████▊2315
 (557500.0 - 559300.0]  ████████████████▉1360
 (559300.0 - 561100.0]  ███████▎582
 (561100.0 - 562900.0]  ██▉227
 (562900.0 - 564700.0]  █▉144
 (564700.0 - 566400.0]  █▍111
 (566400.0 - 568200.0]  █▏88
 (568200.0 - 570000.0]  █76
 (570000.0 - 571800.0]  ▍27
 (571800.0 - 573600.0]  ▎16
 (573600.0 - 604700.0]  ▏9

                  Counts

min: 548.500 μs (0.00% GC); mean: 556.582 μs (0.00% GC); median: 556.000 μs (0.00% GC); max: 604.667 μs (0.00% GC).

Julia crashes a lot on the M1, though.

@valentinsulzer
Author

Some more benchmarks (also on M1) for linear diffusion that might be of interest:

using DifferentialEquations

using LinearAlgebra, SparseArrays

# Naive implementation

n = 100
A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))

function pde!(dy, y, p, t)
    dy .= A * y 
    nothing
end

u0 = sin.(range(0,stop=pi,length=n))
dy = similar(u0)
tspan = (0.0,1.0)

using BenchmarkTools
@btime pde!(dy,u0,0,0) # 728.531 ns (3 allocations: 928 bytes)

prob = ODEProblem(pde!, u0, tspan)

# Benchmarks
using Sundials
@btime solve(prob, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.848 ms (2358 allocations: 806.56 KiB)
@btime solve(prob, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 367.167 μs (899 allocations: 190.02 KiB)
@btime solve(prob, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6);# 118.416 μs (412 allocations: 86.91 KiB)


# Non-allocating implementation

pde_noalloc! = let A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))
    
    function pde!(dy, y, p, t)
        mul!(dy, A, y)
        nothing
    end
    
end

@btime pde_noalloc!(dy,u0,0,0) # 371.079 ns (0 allocations: 0 bytes)
prob_noalloc = ODEProblem(pde_noalloc!, u0, tspan)
@btime solve(prob_noalloc, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.283 ms (388 allocations: 211.94 KiB)
@btime solve(prob_noalloc, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 305.750 μs (522 allocations: 77.14 KiB)
@btime solve(prob_noalloc, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6); # 105.125 μs (328 allocations: 61.98 KiB)

# Analytical Jacobian
de = modelingtoolkitize(prob)
jac = eval(ModelingToolkit.generate_jacobian(de)[2])
f = ODEFunction(pde!, jac=jac)
prob_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.403 ms (499 allocations: 256.42 KiB)
@btime solve(prob_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 298.291 μs (403 allocations: 83.95 KiB)

f = ODEFunction(pde_noalloc!, jac=jac)
prob_noalloc_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.351 ms (347 allocations: 210.91 KiB)
@btime solve(prob_noalloc_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 278.792 μs (328 allocations: 61.75 KiB)

prob_mtk = ODEProblem(de, Pair[], tspan)
@btime solve(prob_mtk, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.028 ms (388 allocations: 214.03 KiB)
@btime solve(prob_mtk, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 266.958 μs (522 allocations: 78.38 KiB)

# Sparse jacobian
using SparsityDetection
input = rand(length(u0))
output = similar(input)
sparsity_pattern = jacobian_sparsity(pde!,output,input,0.0,0.0)
jac_sparsity = Float64.(sparse(sparsity_pattern))

using Plots
spy(jac_sparsity)

f = ODEFunction(pde!;jac_prototype=jac_sparsity)
prob_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 785.375 μs (2368 allocations: 1.51 MiB)
@btime solve(prob_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 375.833 μs (934 allocations: 199.62 KiB)

f = ODEFunction(pde_noalloc!;jac_prototype=jac_sparsity)
prob_noalloc_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 751.666 μs (2200 allocations: 1.45 MiB)
@btime solve(prob_noalloc_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 314.375 μs (561 allocations: 87.88 KiB)
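
For comparison with the DerivativeOperator / *bc composition discussed at the top of this issue, here is a minimal sketch of a diffusion RHS in DiffEqOperators' operator form (following its README example; note it uses Dirichlet0BC, not the reflecting boundaries encoded in the A matrix above):

using OrdinaryDiffEq, DiffEqOperators

nknots = 100
dx = 1.0 / (nknots + 1)
Δ  = CenteredDifference(2, 2, dx, nknots)  # 2nd derivative, 2nd-order stencil
bc = Dirichlet0BC(Float64)                 # zero-Dirichlet boundary padding

diffusion(u, p, t) = Δ * bc * u            # stencil convolution composed with the BC operator
u0_op = sin.(range(0, stop=pi, length=nknots))
prob_op = ODEProblem(diffusion, u0_op, (0.0, 1.0))
@btime solve(prob_op, KenCarp4(), reltol=1e-6, abstol=1e-6);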

@ChrisRackauckas ChrisRackauckas transferred this issue from SciML/DiffEqOperators.jl Dec 21, 2021
ChrisRackauckas pushed a commit that referenced this issue Mar 30, 2022
@xtalax
Member

xtalax commented Dec 1, 2022

@ChrisRackauckas is this still relevant?

@ChrisRackauckas
Member

I don't think so. It's the justification for going with the stencil composition form.
