Intermediate matrix form in MOLFiniteDifference
#2
With array symbolics now merged (https://symbolics.juliasymbolics.org/dev/manual/arrays/) we have a better idea of where this is all going. Pulling @shashi, @chriselrod, and @YingboMa into this as well.

From https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html we know that fusing the operators has a pretty major effect on performance. So we do not want the final code to have sparse matrix multiplications; instead we want to make sure that this all lowers to something that uses LoopVectorization.jl over contiguous loop indices. In order to support huge equations, we probably do not want this to be scalarized. We also want to make sure that this uses a cache-optimized, or at least cache-oblivious, loop iteration, which should be the case by using LoopVectorization.jl. It would be good for the code to be GPU friendly as well. On that note, it would be good to see a test of LoopVectorization against something like NNlib's conv for the nonlinear Laplacian. I have faith it could match or surpass it, but that's a thing we should benchmark. Additionally, we need to think about how to lower symbolic maps and such down to something that could be GPU friendly, and what to do about the boundary conditions in this case.

It would also be good to note the effect of the matrix-vector form on the index reduction algorithm. IIUC one would need to write the boundary terms separately from the matvecs in order to allow the tearing to work, since otherwise that would break the full incidence assumption. If the boundary conditions already have to be written out in index form, I don't know if that saves any simplicity. @YingboMa, can you share the suggested MTK code for the current form? |
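To make the contrast above concrete, here is a minimal Base-only 1D sketch (hypothetical function names) of the two lowerings being discussed: a sparse matrix-vector product for the Laplacian versus the fused contiguous loop we would rather generate, with the boundary rows written out separately from the interior sweep.

```julia
using SparseArrays

# Hypothetical 1D illustration: the same Laplacian computed two ways.
function laplacian_sparse(u::Vector{Float64})
    n = length(u)
    # tridiagonal [1, -2, 1] operator as a sparse matrix
    A = spdiagm(-1 => ones(n - 1), 0 => fill(-2.0, n), 1 => ones(n - 1))
    return A * u
end

function laplacian_loop!(du::Vector{Float64}, u::Vector{Float64})
    n = length(u)
    du[1] = -2u[1] + u[2]            # boundary rows handled separately
    @inbounds for i in 2:n-1         # contiguous interior loop: SIMD-friendly
        du[i] = u[i-1] - 2u[i] + u[i+1]
    end
    du[n] = u[n-1] - 2u[n]
    return du
end
```

Both give the same values; the loop form is what fusion with the nonlinear terms would attach to.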
Have an example?

using NNlib, Polyester, LoopVectorization, Static
function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
K₁ = StaticInt(1):StaticInt(K[1])
K₂ = StaticInt(1):StaticInt(K[2])
Cᵢₙ = StaticInt(1):StaticInt(C_in)
Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(
out::AbstractArray{<:Any,4}, img, kern,
dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
@batch for d ∈ axes(out,4)
@turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), o ∈ Cₒᵤₜ
s = zero(eltype(out))
for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
end
out[j₁, j₂, o, d] = s
end
end
out
end
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
@time NNlib.conv!(out1, img, kern, dcd);
@time convlayer!(out2, img, kern, dcd);
out1 ≈ out2
@benchmark NNlib.conv!($out1, $img, $kern, $dcd)
@benchmark convlayer!($out1, $img, $kern, $dcd)

LoopVectorization's algorithms scale very badly with the number of loops: 7 works, but takes a minute to compile, so I'm using 6 above. This is what I get on an 18-core system with AVX512:

julia> @time NNlib.conv!(out1, img, kern, dcd);
0.802161 seconds (2.06 M allocations: 790.165 MiB, 20.41% gc time)
julia> @time convlayer!(out2, img, kern, dcd);
9.553409 seconds (21.51 M allocations: 1.149 GiB, 1.06% gc time)
julia> out1 ≈ out2
true
julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
samples: 27; evals/sample: 1; memory estimate: 675.02 MiB; allocs estimate: 195
ns
(1.8e8 - 1.9e8] ██████████████████████████████ 24
(1.9e8 - 2.1e8] █▎1
(2.1e8 - 2.3e8] 0
(2.3e8 - 2.5e8] 0
(2.5e8 - 2.7e8] █▎1
(2.7e8 - 2.9e8] █▎1
Counts
min: 175.652 ms (0.00% GC); mean: 191.870 ms (4.80% GC); median: 184.447 ms (0.44% GC); max: 288.455 ms (36.61% GC).
julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
samples: 944; evals/sample: 1; memory estimate: 32 bytes; allocs estimate: 4
ns
(5.233e6 - 5.256e6] ██▍31
(5.256e6 - 5.278e6] ██████████████▏193
(5.278e6 - 5.301e6] ██████████████████████████████412
(5.301e6 - 5.324e6] ████████████████████▊284
(5.324e6 - 5.347e6] █▌20
(5.347e6 - 5.37e6 ] 0
(5.37e6 - 5.392e6] 0
(5.392e6 - 5.415e6] ▏1
(5.415e6 - 5.438e6] 0
(5.438e6 - 5.461e6] ▏1
(5.461e6 - 5.484e6] ▎2
Counts
min: 5.233 ms (0.00% GC); mean: 5.292 ms (0.00% GC); median: 5.294 ms (0.00% GC); max: 5.484 ms (0.00% GC).
julia> versioninfo()
Julia Version 1.7.0-DEV.1208
Commit 0a133ff22b* (2021-05-30 11:28 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.0 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36

LoopVectorization was over 30x faster, and this was a generic convolution; the Laplacian has more structure you could exploit.

EDIT:

julia> function convlayer_st!(
out::AbstractArray{<:Any,4}, img, kern,
dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
(K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
for d ∈ axes(out,4)
@turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), o ∈ Cₒᵤₜ
s = zero(eltype(out))
for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
end
out[j₁, j₂, o, d] = s
end
end
out
end
convlayer_st! (generic function with 1 method)
julia> @benchmark convlayer_st!($out1, $img, $kern, $dcd)
samples: 180; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(2.754e7 - 2.758e7] ▋2
(2.758e7 - 2.763e7] 0
(2.763e7 - 2.768e7] 0
(2.768e7 - 2.772e7] █████▍19
(2.772e7 - 2.777e7] ██████████▍37
(2.777e7 - 2.781e7] ▋2
(2.781e7 - 2.786e7] ██▌9
(2.786e7 - 2.79e7 ] ██████████████████████████████ 108
(2.79e7 - 2.795e7] ▉3
Counts
min: 27.539 ms (0.00% GC); mean: 27.824 ms (0.00% GC); median: 27.865 ms (0.00% GC); max: 27.949 ms (0.00% GC).

Apparently Intel's next generation of server chips (Sapphire Rapids) will come with up to 64 GiB of HBM2e memory. Maybe it'll only be a select few extremely expensive models, but if there are affordable options, that'd be exciting.

julia> (sizeof(out1) + sizeof(img) + sizeof(kern)) / (1<<20) # bytes of data / (bytes/MiB)
227.36377716064453

The three arrays here are 227 MiB. Making the problem just a little smaller would then let them all fit. |
Look at the PDE example in https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html for an example of it.

mul!(Ayu,Ay,u)
mul!(uAx,u,Ax)
mul!(Ayv,Ay,v)
mul!(vAx,v,Ax)

is just a cheap way to do 2D convolutions with dense matmuls, but it would be good to transcribe that into conv! |
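For reference, the dense-matmul form really does compute the 5-point Laplacian stencil exactly on interior points; a minimal Base-only sketch (hypothetical small size, using the same boundary-row modifications as the tutorial code below):

```julia
using LinearAlgebra

# Ay*u + u*Ax reproduces the 5-point stencil at interior indices;
# the modified rows/columns only affect the boundaries.
N = 8
Ax = Array(Tridiagonal(ones(N - 1), fill(-2.0, N), ones(N - 1)))
Ay = copy(Ax)
Ax[2, 1] = 2.0; Ax[end-1, end] = 2.0   # Neumann-style boundary modifications
Ay[1, 2] = 2.0; Ay[end, end-1] = 2.0

u = rand(N, N)
L = Ay * u + u * Ax                    # dense-matmul form of the Laplacian

i, j = 4, 5                            # any interior index
stencil = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j]
```

Transcribing this into conv! means treating `[0 1 0; 1 -4 1; 0 1 0]` as the filter kernel on the interior.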
I played with it a bit, and haven't been able to get any major improvements over the explicit loop version:

using LinearAlgebra, OrdinaryDiffEq
# Generate the constants
p = (1.0,1.0,1.0,10.0,0.001,100.0) # a,α,ubar,β,D1,D2
N = 100
const Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]));
const Ay = copy(Ax);
Ax[2,1] = 2.0;
Ax[end-1,end] = 2.0;
Ay[1,2] = 2.0;
Ay[end,end-1] = 2.0;
function basic_version!(dr,r,p,t)
a,α,ubar,β,D1,D2 = p
u = r[:,:,1]
v = r[:,:,2]
Du = D1*(Ay*u + u*Ax)
Dv = D2*(Ay*v + v*Ax)
dr[:,:,1] = Du .+ a.*u.*u./v .+ ubar .- α*u
dr[:,:,2] = Dv .+ a.*u.*u .- β*v
end
a,α,ubar,β,D1,D2 = p;
uss = (ubar+β)/α;
vss = (a/β)*uss^2;
r0 = zeros(100,100,2);
r0[:,:,1] .= uss.+0.1.*rand.();
r0[:,:,2] .= vss;
prob = ODEProblem(basic_version!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 123; evals/sample: 1; memory estimate: 186.80 MiB; allocs estimate: 4555
# ns
# (3.0e7 - 5.0e7] ██████████████████████████████ 121
# (5.0e7 - 7.0e7] 0
# (7.0e7 - 8.0e7] 0
# (8.0e7 - 1.0e8] 0
# (1.0e8 - 1.2e8] 0
# (1.2e8 - 1.3e8] ▎1
# (1.3e8 - 1.5e8] 0
# (1.5e8 - 1.6e8] ▎1
# Counts
# min: 34.770 ms (7.03% GC); mean: 40.722 ms (9.64% GC); median: 39.011 ms (5.79% GC); max: 164.571 ms (77.65% GC).
function gm2!(dr,r,p,t)
a,α,ubar,β,D1,D2 = p
u = @view r[:,:,1]
v = @view r[:,:,2]
du = @view dr[:,:,1]
dv = @view dr[:,:,2]
Du = D1*(Ay*u + u*Ax)
Dv = D2*(Ay*v + v*Ax)
@. du = Du + a.*u.*u./v + ubar - α*u
@. dv = Dv + a.*u.*u - β*v
end
prob = ODEProblem(gm2!,r0,(0.0,0.1),p);
@benchmark solve($prob,Tsit5())
# samples: 171; evals/sample: 1; memory estimate: 119.44 MiB; allocs estimate: 2790
# ns
# (2.68e7 - 2.75e7] ▋1
# (2.75e7 - 2.82e7] ███▏6
# (2.82e7 - 2.9e7 ] ███████████████████████▍45
# (2.9e7 - 2.97e7] ██████████████████████████████▏58
# (2.97e7 - 3.04e7] ████████████████████████████▌55
# (3.04e7 - 3.11e7] ██▋5
# (3.11e7 - 3.18e7] 0
# (3.18e7 - 3.25e7] 0
# (3.25e7 - 3.32e7] ▋1
# Counts
# min: 26.824 ms (7.47% GC); mean: 29.393 ms (6.39% GC); median: 29.594 ms (7.36% GC); max: 33.241 ms (27.63% GC).
const Ayu = zeros(N,N);
const uAx = zeros(N,N);
const Du = zeros(N,N);
const Ayv = zeros(N,N);
const vAx = zeros(N,N);
const Dv = zeros(N,N);
function gm3!(dr,r,p,t)
a,α,ubar,β,D1,D2 = p
u = @view r[:,:,1]
v = @view r[:,:,2]
du = @view dr[:,:,1]
dv = @view dr[:,:,2]
mul!(Ayu,Ay,u)
mul!(uAx,u,Ax)
mul!(Ayv,Ay,v)
mul!(vAx,v,Ax)
@. Du = D1*(Ayu + uAx)
@. Dv = D2*(Ayv + vAx)
@. du = Du + a*u*u./v + ubar - α*u
@. dv = Dv + a*u*u - β*v
end
prob_gm3 = ODEProblem(gm3!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm3,Tsit5())
# samples: 181; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns
# (2.1e7 - 2.2e7] ▌1
# (2.2e7 - 2.3e7] ▌1
# (2.3e7 - 2.5e7] 0
# (2.5e7 - 2.6e7] █████▋14
# (2.6e7 - 2.7e7] ██████████████▎36
# (2.7e7 - 2.8e7] ██████████████████████████████ 76
# (2.8e7 - 3.0e7] ███████████████▌39
# (3.0e7 - 3.1e7] ████10
# (3.1e7 - 3.2e7] █▋4
# Counts
# min: 20.760 ms (0.00% GC); mean: 27.682 ms (2.07% GC); median: 27.566 ms (0.00% GC); max: 32.108 ms (0.00% GC).
using LoopVectorization
function matmul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
@tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
Cₘₙ = zero(eltype(C))
for k ∈ indices((A,B),(2,1))
Cₘₙ += A[m,k]*B[k,n]
end
C[m,n] = Cₘₙ
end
end
function matmuladd!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
@tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
Cₘₙ = zero(eltype(C))
for k ∈ indices((A,B),(2,1))
Cₘₙ += A[m,k]*B[k,n]
end
C[m,n] += Cₘₙ
end
end
function gm5!(dr,r,p,t)
a,α,ubar,β,D1,D2 = p
u = @view r[:,:,1];
v = @view r[:,:,2];
du = @view dr[:,:,1];
dv = @view dr[:,:,2];
matmul!(Ayu,Ay,u)
matmuladd!(Ayu,u,Ax)
# @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
@turbo for i ∈ eachindex(u)
du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
end
matmul!(Ayu,Ay,v)
matmuladd!(Ayu,v,Ax)
# @turbo @. dv = D2*Ayu + a*u*u - β*v
@turbo for i ∈ eachindex(u)
dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
end
end
prob_gm5 = ODEProblem(gm5!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm5,Tsit5())
# samples: 306; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# ns
# (1.1e7 - 1.3e7] ██████████████████████████████ 211
# (1.3e7 - 1.5e7] ██▎15
# (1.5e7 - 1.7e7] ▍2
# (1.7e7 - 1.9e7] ▍2
# (1.9e7 - 2.1e7] 0
# (2.1e7 - 2.3e7] 0
# (2.3e7 - 2.5e7] ▉6
# (2.5e7 - 2.8e7] ███████▊54
# (2.8e7 - 3.0e7] █7
# (3.0e7 - 3.2e7] █▍9
# Counts
# min: 10.999 ms (0.00% GC); mean: 16.355 ms (3.53% GC); median: 12.869 ms (0.00% GC); max: 31.714 ms (6.63% GC).
function matmul!(C::AbstractArray{<:Any,3}, A::AbstractMatrix, B::AbstractArray{<:Any,3})
@tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
Cₘₙ = zero(eltype(C))
for k ∈ indices((A,B),(2,1))
Cₘₙ += A[m,k]*B[k,n,d]
end
C[m,n,d] = Cₘₙ
end
end
function matmuladd!(C::AbstractArray{<:Any,3}, A::AbstractArray{<:Any,3}, B::AbstractMatrix)
@tturbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1), d ∈ 1:2
Cₘₙ = zero(eltype(C))
for k ∈ indices((A,B),(2,1))
Cₘₙ += A[m,k,d]*B[k,n]
end
C[m,n,d] += Cₘₙ
end
end
const Ayuv = similar(Ayu, (size(Ayu,1),size(Ayu,2),2));
function gm6!(dr,r,p,t)
a,α,ubar,β,D1,D2 = p
matmul!(Ayuv, Ay, r)
matmuladd!(Ayuv, r, Ax)
N² = size(r,1)*size(r,2)
u = reshape(r, (N²,2))
du = reshape(dr, (N²,2))
Ayuvr = reshape(Ayuv, (N²,2))
@turbo for i ∈ 1:N²
du[i,1] = D1 * Ayuvr[i,1] + a * u[i,1]*u[i,1] / u[i,2] + ubar - α*u[i,1]
du[i,2] = D2 * Ayuvr[i,2] + a * u[i,1]*u[i,1] - β*u[i,2]
end
end
prob_gm6 = ODEProblem(gm6!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm6, Tsit5())
# samples: 310; evals/sample: 1; memory estimate: 29.67 MiB; allocs estimate: 1315
# ns
# (1.1e7 - 1.3e7] █████▌37
# (1.3e7 - 1.4e7] ██████████████████████████████205
# (1.4e7 - 1.6e7] 0
# (1.6e7 - 1.8e7] 0
# (1.8e7 - 2.0e7] 0
# (2.0e7 - 2.1e7] 0
# (2.1e7 - 2.3e7] 0
# (2.3e7 - 2.5e7] 0
# (2.5e7 - 2.6e7] ▎1
# (2.6e7 - 2.8e7] █████████▉67
# Counts
# min: 11.091 ms (0.00% GC); mean: 16.126 ms (3.49% GC); median: 13.182 ms (0.00% GC); max: 27.934 ms (9.08% GC).
pdevec = (1.0,1.0,1.0,10.0,0.001,100.0,N);
function fast_gm!(du,u,p,t)
a,α,ubar,β,D1,D2,N = p
@inbounds for j in 2:N-1, i in 2:N-1
du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for j in 2:N-1, i in 2:N-1
du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for j in 2:N-1
i = 1
du[1,j,1] = D1*(2u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for j in 2:N-1
i = 1
du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for j in 2:N-1
i = N
du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for j in 2:N-1
i = N
du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for i in 2:N-1
j = 1
du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for i in 2:N-1
j = 1
du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for i in 2:N-1
j = N
du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for i in 2:N-1
j = N
du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds begin
i = 1; j = 1
du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = 1; j = N
du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = N; j = 1
du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = N; j = N
du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
end
prob_devec = ODEProblem(fast_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns
# (7.2e6 - 7.74e6 ] ██████████████████████████████▏337
# (7.74e6 - 8.27e6 ] ██▋29
# (8.27e6 - 8.81e6 ] ██████▏68
# (8.81e6 - 9.34e6 ] ██████▋74
# (9.34e6 - 9.88e6 ] ███33
# (9.88e6 - 1.041e7] ▍4
# (1.041e7 - 1.095e7] 0
# (1.095e7 - 1.149e7] ██22
# (1.149e7 - 1.202e7] ▋7
# (1.202e7 - 1.256e7] ██▎24
# (1.256e7 - 1.309e7] ▎2
# Counts
# min: 7.200 ms (0.00% GC); mean: 8.335 ms (7.27% GC); median: 7.482 ms (0.00% GC); max: 13.092 ms (43.21% GC).
function turbo_gm!(du,u,p,t)
a,α,ubar,β,D1,D2,N = p
@turbo for j in 2:N-1, i in 2:N-1
du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@turbo for j in 2:N-1, i in 2:N-1
du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for j in 2:N-1
i=1
du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
end
@inbounds for j in 2:N-1
i = 1
du[1,j,2] = D2*(2u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for j in 2:N-1
i = N
du[end,j,1] = D1*(2u[i-1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for j in 2:N-1
i = N
du[end,j,2] = D2*(2u[i-1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for i in 2:N-1
j = 1
du[i,1,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for i in 2:N-1
j = 1
du[i,1,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds for i in 2:N-1
j = N
du[i,end,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
end
@inbounds for i in 2:N-1
j = N
du[i,end,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@inbounds begin
i = 1; j = 1
du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = 1; j = N
du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = N; j = 1
du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = N; j = N
du[end,end,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[end,end,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
end
prob_turbo_devec = ODEProblem(turbo_gm!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_devec,Tsit5())
# samples: 600; evals/sample: 1; memory estimate: 29.63 MiB; allocs estimate: 438
# ns
# (7.11e6 - 7.78e6 ] ██████████████████████████████ 324
# (7.78e6 - 8.45e6 ] ▍4
# (8.45e6 - 9.12e6 ] ████████▎88
# (9.12e6 - 9.79e6 ] ████████▉95
# (9.79e6 - 1.046e7] ██████▋71
# (1.046e7 - 1.113e7] ▌5
# (1.113e7 - 1.18e7 ] ▍3
# (1.18e7 - 1.247e7] ▍3
# (1.247e7 - 1.314e7] ▎2
# (1.314e7 - 1.381e7] ▎2
# (1.381e7 - 1.448e7] ▍3
# Counts
# min: 7.107 ms (0.00% GC); mean: 8.331 ms (7.72% GC); median: 7.374 ms (0.00% GC); max: 14.477 ms (0.00% GC).
function turbo_gm_fuse!(du,u,p,t)
a,α,ubar,β,D1,D2,N = p
@turbo for j in 2:N-1, i in 2:N-1
du[i,j,1] = D1*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[i,j,2] = D2*(u[i-1,j,2] + u[i+1,j,2] + u[i,j+1,2] + u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
@turbo for j in 2:N-1
du[1,j,1] = D1*(2u[2,j,1] + u[1,j+1,1] + u[1,j-1,1] - 4u[1,j,1]) +
a*u[1,j,1]^2/u[1,j,2] + ubar - α*u[1,j,1]
# end
# @turbo for j in 2:N-1
du[1,j,2] = D2*(2u[2,j,2] + u[1,j+1,2] + u[1,j-1,2] - 4u[1,j,2]) +
a*u[1,j,1]^2 - β*u[1,j,2]
# end
# @turbo for j in 2:N-1
du[N,j,1] = D1*(2u[N-1,j,1] + u[N,j+1,1] + u[N,j-1,1] - 4u[N,j,1]) +
a*u[N,j,1]^2/u[N,j,2] + ubar - α*u[N,j,1]
# end
# @turbo for j in 2:N-1
du[N,j,2] = D2*(2u[N-1,j,2] + u[N,j+1,2] + u[N,j-1,2] - 4u[N,j,2]) +
a*u[N,j,1]^2 - β*u[N,j,2]
end
@turbo for j in 2:N-1
du[j,1,1] = D1*(u[j-1,1,1] + u[j+1,1,1] + 2u[j,2,1] - 4u[j,1,1]) +
a*u[j,1,1]^2/u[j,1,2] + ubar - α*u[j,1,1]
# end
# @turbo for j in 2:N-1
du[j,1,2] = D2*(u[j-1,1,2] + u[j+1,1,2] + 2u[j,2,2] - 4u[j,1,2]) +
a*u[j,1,1]^2 - β*u[j,1,2]
# end
# @turbo for j in 2:N-1
du[j,N,1] = D1*(u[j-1,N,1] + u[j+1,N,1] + 2u[j,N-1,1] - 4u[j,N,1]) +
a*u[j,N,1]^2/u[j,N,2] + ubar - α*u[j,N,1]
# end
# @turbo for j in 2:N-1
du[j,N,2] = D2*(u[j-1,N,2] + u[j+1,N,2] + 2u[j,N-1,2] - 4u[j,N,2]) +
a*u[j,N,1]^2 - β*u[j,N,2]
end
@inbounds begin
i = 1; j = 1
du[1,1,1] = D1*(2u[i+1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[1,1,2] = D2*(2u[i+1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = 1; j = N
du[1,N,1] = D1*(2u[i+1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[1,N,2] = D2*(2u[i+1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = N; j = 1
du[N,1,1] = D1*(2u[i-1,j,1] + 2u[i,j+1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[N,1,2] = D2*(2u[i-1,j,2] + 2u[i,j+1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
i = N; j = N
du[N,N,1] = D1*(2u[i-1,j,1] + 2u[i,j-1,1] - 4u[i,j,1]) +
a*u[i,j,1]^2/u[i,j,2] + ubar - α*u[i,j,1]
du[N,N,2] = D2*(2u[i-1,j,2] + 2u[i,j-1,2] - 4u[i,j,2]) +
a*u[i,j,1]^2 - β*u[i,j,2]
end
end
prob_turbo_fuse = ODEProblem(turbo_gm_fuse!,r0,(0.0,0.1),pdevec);
@benchmark solve($prob_turbo_fuse,Tsit5())
# samples: 629; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
# ns
# (6.8e6 - 7.9e6 ] ██████████████████████████████427
# (7.9e6 - 8.9e6 ] ██▋36
# (8.9e6 - 1.0e7 ] ████▊66
# (1.0e7 - 1.11e7] ██████▉97
# (1.11e7 - 1.22e7] 0
# (1.22e7 - 1.33e7] 0
# (1.33e7 - 1.44e7] 0
# (1.44e7 - 1.54e7] ▏1
# (1.54e7 - 1.65e7] ▏1
# (1.65e7 - 1.76e7] 0
# (1.76e7 - 2.12e7] ▏1
# Counts
# min: 6.786 ms (0.00% GC); mean: 7.938 ms (7.63% GC); median: 7.014 ms (0.00% GC); max: 21.180 ms (66.47% GC). |
Those loops are easy enough for LLVM to SIMD, and there's not too much that can be done to make them faster. It is unrolling the |
Yeah, I think you need to not have the nonlinear portion of the PDE. It would still be good to investigate multithreading between those when the space is really large though, but that's a separate thing and only matters for the huge PDE case. But essentially, those benchmarks are my argument for why we shouldn't be generating |
Making them dense:

Ax = Array(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))

adds a huge amount of computation. For the mapped version, 100x100 isn't large enough to benefit from multithreading yet, but 1000x1000 would be. |
The first part is the same as an image filter. |
Not sure if I follow here, but we shouldn't be doing dense

If we let |
I mean, when you're looping over
It wouldn't; I thought you were using the same setup as the image filter you had? A 2D Laplacian is just a 2D filter. |
Ha, let me do that. Above it was just

Note that the image filtering example doesn't do boundaries...
Although reuse tends to be fairly immediate (e.g. |
@ChrisRackauckas here is the example I think you were looking for:

using StaticArrays, LoopVectorization, Static
const LAPLACE_KERNEL = @MMatrix([0.0 1.0 0.0; 1.0 -4.0 1.0; 0.0 1.0 0.0]);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern::MMatrix{3,3})
N = size(out,1)
@inbounds @fastmath let j_2 = j_1 = 1
tmp = A[1+j_1, 1+j_2] * kern[-1+2, -1+2]
tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
tmp += A[1+j_1, 1+j_2] * kern[-1+2, 1+2]
tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
tmp += A[1+j_1, 1+j_2] * kern[1+2, -1+2]
for i_1 ∈ 0:1, i_2 ∈ 0:1
tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
end
out[j_1,j_2] = tmp
end
@inbounds @fastmath let j_1 = 1; j_2 = N
tmp = A[1+j_1, -1+j_2] * kern[-1+2, -1+2]
tmp += A[1+j_1, 0+j_2] * kern[-1+2, 0+2]
tmp += A[1+j_1, -1+j_2] * kern[-1+2, 1+2]
tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
tmp += A[1+j_1, -1+j_2] * kern[1+2, 1+2]
for i_1 ∈ 0:1, i_2 ∈ -1:0
tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
end
out[j_1,j_2] = tmp
end
@inbounds @fastmath let j_2 = 1; j_1 = N
tmp = A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
tmp += A[-1+j_1, 1+j_2] * kern[-1+2, 1+2]
tmp += A[0+j_1, 1+j_2] * kern[0+2, -1+2]
tmp += A[-1+j_1, 1+j_2] * kern[1+2, -1+2]
for i_1 ∈ -1:0, i_2 ∈ 0:1
tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
end
out[j_1,j_2] = tmp
end
@turbo for j ∈ 2:N-1
# for j ∈ 2:N-1
tmp_1 = zero(eltype(out))
tmp_2 = zero(eltype(out))
for i_1 ∈ -1:1
tmp_1 += A[i_1+j, 2] * kern[i_1+2, 3]
tmp_2 += A[2, i_1+j] * kern[3, i_1+2]
for i_2 ∈ 0:1
tmp_1 += A[i_1+j, 1+i_2] * kern[i_1+2, i_2+2]
tmp_2 += A[1+i_2, i_1+j] * kern[i_2+2, i_1+2]
end
end
out[j,1] = tmp_1
out[1,j] = tmp_2
end
@turbo for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
# for j_2 ∈ 2:N-1, j_1 ∈ 2:N-1
tmp = zero(eltype(out))
for i_1 ∈ -1:1, i_2 ∈ -1:1
tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
end
out[j_1,j_2] = tmp
end
@turbo for j ∈ 2:N-1
# for j ∈ 2:N-1
tmp_1 = zero(eltype(out))
tmp_2 = zero(eltype(out))
for i_1 ∈ -1:1
tmp_1 += A[i_1+j, N-1] * kern[i_1+2, 3]
tmp_2 += A[N-1, i_1+j] * kern[3, i_1+2]
for i_2 ∈ -1:0
tmp_1 += A[i_1+j, N+i_2] * kern[i_1+2, i_2+2]
tmp_2 += A[N+i_2, i_1+j] * kern[i_2+2, i_1+2]
end
end
out[j,N] = tmp_1
out[N,j] = tmp_2
end
@inbounds @fastmath let j_2 = j_1 = N
tmp = A[-1+j_1, -1+j_2] * kern[1+2, -1+2]
tmp += A[-1+j_1, 0+j_2] * kern[1+2, 0+2]
tmp += A[-1+j_1, -1+j_2] * kern[1+2, 1+2]
tmp += A[-1+j_1, -1+j_2] * kern[-1+2, 1+2]
tmp += A[0+j_1, -1+j_2] * kern[0+2, 1+2]
for i_1 ∈ -1:0, i_2 ∈ -1:0
tmp += A[i_1+j_1, i_2+j_2] * kern[i_1+2, i_2+2]
end
out[j_1,j_2] = tmp
end
nothing
end
function gm7!(dr,r,p,t)
a,α,ubar,β,D1,D2 = p
u = @view r[:,:,1];
v = @view r[:,:,2];
du = @view dr[:,:,1];
dv = @view dr[:,:,2];
filter2davx!(Ayu, u, LAPLACE_KERNEL)
# @turbo @. du = D1*Ayu + a*u*u./v + ubar - α*u
@turbo for i ∈ eachindex(u)
du[i] = D1 * Ayu[i] + a * u[i]*u[i] / v[i] + ubar - α*u[i]
end
filter2davx!(Ayu, v, LAPLACE_KERNEL)
# @turbo @. dv = D2*Ayu + a*u*u - β*v
@turbo for i ∈ eachindex(u)
dv[i] = D2 * Ayu[i] + a * u[i]*u[i] - β*v[i]
end
end
prob_gm7 = ODEProblem(gm7!,r0,(0.0,0.1),p);
@benchmark solve($prob_gm7,Tsit5())
samples: 546; evals/sample: 1; memory estimate: 29.62 MiB; allocs estimate: 433
ns
(7.96e6 - 8.63e6 ] ██████████████████████████████311
(8.63e6 - 9.29e6 ] ██▋26
(9.29e6 - 9.95e6 ] █████████▊101
(9.95e6 - 1.062e7] █████51
(1.062e7 - 1.128e7] ▍3
(1.128e7 - 1.194e7] ██▌25
(1.194e7 - 1.26e7 ] ▏1
(1.26e7 - 1.327e7] ▏1
(1.327e7 - 1.393e7] 0
(1.393e7 - 1.459e7] █▍13
(1.459e7 - 1.525e7] █▍14
Counts
min: 7.965 ms (0.00% GC); mean: 9.162 ms (6.71% GC); median: 8.165 ms (0.00% GC); max: 15.255 ms (10.53% GC).

If we wanted to optimize this a bit more, we could
Oops, we now have |
You can just hardcode that it's the stencil

0  1  0
1 -4  1
0  1  0 |
Sure, but at that point why not go the rest of the way to |
Yeah, go all of the way. |
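As a sanity check on the hardcoded-stencil idea, a Base-only sketch (hypothetical function names) showing that a generic 3x3 kernel loop agrees with the specialized 5-point Laplacian on interior points:

```julia
# Generic 3x3 filter vs. the hardcoded 5-point Laplacian, interior only.
const KERN = [0.0 1.0 0.0; 1.0 -4.0 1.0; 0.0 1.0 0.0]

function filt_generic!(out, A, kern)
    for j in 2:size(A, 2)-1, i in 2:size(A, 1)-1
        s = 0.0
        for dj in -1:1, di in -1:1
            s += A[i+di, j+dj] * kern[di+2, dj+2]
        end
        out[i, j] = s
    end
    return out
end

function filt_laplace!(out, A)
    for j in 2:size(A, 2)-1, i in 2:size(A, 1)-1
        # hardcoding the stencil drops the four zero-weighted corner loads
        out[i, j] = A[i-1, j] + A[i+1, j] + A[i, j-1] + A[i, j+1] - 4A[i, j]
    end
    return out
end
```

Going "all of the way" means the compiler emits the second form directly, skipping the kernel loads and the zero-coefficient multiplies entirely.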
The one thing fusing inhibits is easily threading across the third axis. I should add a version of

Ideally, we'd have a loop compiler that does the right thing with whatever loops we give it. There isn't really anything that special about stencils.

EDIT: To give |
Yeah, they do have a lot of uses. http://supertech.csail.mit.edu/papers/pochoir_spaa11.pdf is a good read if you haven't seen it. |
Thanks, that's interesting. That is, it cuts up the iteration space/dependencies in a way so that multiple time steps -- and hence lots of reuse -- can be computed on sub (trapezoidal) regions. These regions can also be computed in parallel.

If you want to compute only a single time step update, you need to iterate over the entire arrays. Then you may as well do what you can to minimize the amount of iteration, e.g. fuse it with other operations. However, if you want to compute, say, 1000 time steps, you don't need to pass over the arrays 1000 times. It takes many time steps for changes to index

Eventually, you'd stop and instead work on another region, for example one now available because the region(s) it depends on have been finished. Stencil compilers figure out clever ways to carve the space up in this way, running those separate chunks on different cores/GPUs/etc., and also scheduling the leftovers to fill in the missing pieces. Polyhedral optimizers can also do this sort of analysis.

Anyway, regarding the computations here where we're doing a single time step at a time, we can't benefit from these optimizations. |
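The cross-time-step reuse being described can be seen even in 1D: two sweeps of the 3-point second-difference stencil equal one sweep of the composed 5-point stencil [1, -4, 6, -4, 1] away from the boundaries. A Base-only sketch (hypothetical names; the boundary rows differ because zero-ghost handling does not compose, which is exactly the bookkeeping a trapezoidal decomposition has to track):

```julia
# 3-point second difference with zero (Dirichlet-style) ghost values.
function d2(u)
    n = length(u)
    g(i) = 1 <= i <= n ? u[i] : 0.0
    return [g(i-1) - 2g(i) + g(i+1) for i in 1:n]
end

# The composed two-step operator, as a single 5-point stencil.
function d4(u)
    n = length(u)
    g(i) = 1 <= i <= n ? u[i] : 0.0
    return [g(i-2) - 4g(i-1) + 6g(i) - 4g(i+1) + g(i+2) for i in 1:n]
end

u = rand(12)
# d2(d2(u)) and d4(u) agree on indices 3:length(u)-2; the edge entries
# disagree because the intermediate sweep's boundary values feed back in.
```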
The current MTK code example of scalarized code is

using ModelingToolkit
l = 1
n = 5
h = l / n
@variables t u[1:n](t)
D = Differential(t)
# periodic boundary condition version 1
eqs = [
h^2 * D(u[1]) ~ u[end] - 2u[1] + u[2]
map(2:n-1) do i
h^2 * D(u[i]) ~ u[i-1] - 2u[i] + u[i+1] # ifelse(u[i]>0, u[i+1] - u[i], u[i] - u[i-1])
end
h^2 * D(u[end]) ~ u[end-1] - 2u[end] + u[1]
]

My tentative syntax for the interior is

map(u[1:n-1], u[2:n], u[3:n+1]) do um1, ui, up1
um1 - 2ui + up1
end

I am not sure how to scale this to high dims though. |
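On plain arrays, the tentative map syntax lowers to the same values as the scalarized periodic equations; a Base-only sketch (no MTK, hypothetical padded-vector setup for the ghost values):

```julia
# Periodic 1D Laplacian two ways: the proposed map over shifted slices,
# and the scalarized loop it should lower to.
n = 5
u = rand(n)
up = [u[end]; u; u[1]]          # periodic padding: up[i+1] == u[i]

mapped = map(up[1:n], up[2:n+1], up[3:n+2]) do um1, ui, up1
    um1 - 2ui + up1
end

scalarized = [up[i] - 2up[i+1] + up[i+2] for i in 1:n]
```

The first mapped element is u[end] - 2u[1] + u[2], matching the first boundary equation above; in higher dimensions the same idea would need shifted views along each axis.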
Let's talk about that high dim case today. That's going to be important to this. And get @chriselrod and @shashi on too. 1pm? |
Happy to help with this but the discussions so far are above my paygrade 😅 |
using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkTools
using LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N)
const Y = reshape([j for i in 1:N for j in 1:N],N,N)
const α₁ = 1.0.*(X.>=4*N/5)
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1])
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N)
const AMx = zeros(N,N)
const DA = zeros(N,N)
function f(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
mul!(MyA,My,A)
mul!(AMx,A,Mx)
@. DA = D*(MyA + AMx)
@. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
@. dB = α₂ - β₂*B - r₁*A*B + r₂*C
@. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
function f!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
@turbo for j in 2:N-1, i in 2:N-1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = 1
dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = N
dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = 1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = N
dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds begin
i = 1; j = 1
dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = 1; j = N
dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = 1
dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = N
dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
end
#=
# Double check:
u = rand(N,N,3)
du = similar(u)
du2 = similar(u)
f( du ,u,nothing,0.0)
f!(du2,u,nothing,0.0)
=#
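The commented-out double check above compares two implementations numerically. The same idea can cross-check the ghost-node boundary stencil (terms like 2A[i+1,j]) against the matrix form D*(My*A + A*Mx) with the doubled boundary entries (My[1,2] = My[end,end-1] = 2, Mx[2,1] = Mx[end-1,end] = 2). A minimal dependency-free Python sketch, with a small illustrative N:

```python
# Cross-check: stencil Laplacian with reflected (Neumann ghost-node)
# boundaries vs. the matrix form D*(My*A + A*Mx). N and D are illustrative.
import random

N, D = 6, 100.0
random.seed(0)
A = [[random.random() for _ in range(N)] for _ in range(N)]

def second_diff(n):
    # Tridiagonal 1-D second difference with reflecting boundary rows,
    # mirroring My[1,2] = 2.0; My[end,end-1] = 2.0 from the script below.
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = -2.0
        if i > 0:
            M[i][i - 1] = 1.0
        if i < n - 1:
            M[i][i + 1] = 1.0
    M[0][1] = 2.0          # ghost node: the out-of-range neighbour reflects
    M[n - 1][n - 2] = 2.0
    return M

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

My = second_diff(N)
Mx = [[My[j][i] for j in range(N)] for i in range(N)]  # Mx = My transposed

MyA, AMx = matmul(My, A), matmul(A, Mx)
lap_matrix = [[D * (MyA[i][j] + AMx[i][j]) for j in range(N)] for i in range(N)]

# Stencil form: interior 5-point Laplacian; edges use reflected neighbours,
# which yields the 2A[i+1,j]-style boundary terms.
lap_stencil = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        up    = A[i - 1][j] if i > 0 else A[i + 1][j]
        down  = A[i + 1][j] if i < N - 1 else A[i - 1][j]
        left  = A[i][j - 1] if j > 0 else A[i][j + 1]
        right = A[i][j + 1] if j < N - 1 else A[i][j - 1]
        lap_stencil[i][j] = D * (up + down + left + right - 4 * A[i][j])

err = max(abs(lap_matrix[i][j] - lap_stencil[i][j])
          for i in range(N) for j in range(N))
print(err < 1e-9)  # → True
```

Both forms agree to rounding, including at the corners, which is exactly why writing out the boundary loops by hand preserves the matvec semantics.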
u0 = zeros(N,N,3)
prob = ODEProblem(f!,u0,(0.0,10.0))
@btime solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 13.300 s (3406 allocations: 4.34 GiB)
@btime solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 38.481 s (270 allocations: 708.02 MiB)
@btime solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1); # 40.445 s (267 allocations: 690.02 MiB)

There's an N to tone down to test with. I'm pretty confused by this. Is it just that the cache lines would have to be too long? @chriselrod found, locally:
julia> u0 = randn(N,N,3);
julia> du0 = similar(u0);
julia> @benchmark f!($du0,$u0,nothing,0.0) # @turbo
samples: 7479; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(641000.0 - 644000.0] ██▏57
(644000.0 - 647000.0] ██████████▌291
(647000.0 - 650000.0] ███████████████████▉552
(650000.0 - 653100.0] ████████████████████████▋683
(653100.0 - 656100.0] ████████████████████████████778
(656100.0 - 659100.0] ███████████████████████████749
(659100.0 - 662100.0] ███████████████████████████750
(662100.0 - 665100.0] ███████████████████████▌653
(665100.0 - 668200.0] ████████████████████████▎675
(668200.0 - 671200.0] ████████████████████████▉693
(671200.0 - 674200.0] ██████████████████████████████ 836
(674200.0 - 677200.0] ███████████████████████▉663
(677200.0 - 680200.0] ███▍91
(680200.0 - 1.0048e6] ▍8
Counts
min: 640.984 μs (0.00% GC); mean: 661.914 μs (0.00% GC); median: 661.578 μs (0.00% GC); max: 1.005 ms (0.00% GC).
julia> @benchmark g!($du0,$u0,nothing,0.0) # @inbounds @fastmath
samples: 4784; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(1.0269e6 - 1.0287e6] ▎15
(1.0287e6 - 1.0306e6] ▏10
(1.0306e6 - 1.0324e6] ▏7
(1.0324e6 - 1.0342e6] ▎20
(1.0342e6 - 1.036e6 ] ▏7
(1.036e6 - 1.0379e6] ████████▋953
(1.0379e6 - 1.0397e6] ██████████████████████████████3337
(1.0397e6 - 1.0415e6] ███▌377
(1.0415e6 - 1.0433e6] ▍30
(1.0433e6 - 1.0452e6] ▏13
(1.0452e6 - 1.047e6 ] ▏6
(1.047e6 - 1.0488e6] ▏3
(1.0488e6 - 1.0506e6] ▏1
(1.0506e6 - 1.2524e6] ▏5
Counts
min: 1.027 ms (0.00% GC); mean: 1.039 ms (0.00% GC); median: 1.038 ms (0.00% GC); max: 1.252 ms (0.00% GC).
julia> @show N
N = 512
LV doesn't do cache tiling currently, so performance suffers at larger sizes.
I have a good idea of how I want it to do that now, but want to wait for the rewrite, as otherwise the rewrite will never happen (there's no end to the stuff I could still do).
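The cache tiling mentioned here can be sketched language-agnostically: traverse the grid in small blocks so each block stays cache-resident before moving on. A Python sketch (N and TILE are arbitrary illustrative sizes, not anything LoopVectorization actually picks):

```python
# Tiled (blocked) traversal of a 2D array: visit TILE x TILE blocks so each
# block is reused while still in cache, instead of one long row/column sweep.
N, TILE = 16, 4
A = [[float(i * N + j) for j in range(N)] for i in range(N)]

visited = []
for jj in range(0, N, TILE):              # tile origin in j
    for ii in range(0, N, TILE):          # tile origin in i
        for j in range(jj, min(jj + TILE, N)):
            for i in range(ii, min(ii + TILE, N)):
                visited.append(A[i][j])   # work on a cache-resident block

# The tiled sweep touches every element exactly once, just in a
# locality-friendlier order.
assert sorted(visited) == sorted(v for row in A for v in row)
print(len(visited))  # → 256
```

The payoff only shows up once the working set exceeds cache, which is consistent with performance suffering at larger sizes.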
g! was using @inbounds @fastmath and took 1ms, while f! was using @turbo and took 0.6ms.
I could try more, or try on a different computer.
With @tturbo:
julia> @benchmark fthreads!($du0,$u0,nothing,0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(89800.0 - 91100.0 ] ▏7
(91100.0 - 92400.0 ] █80
(92400.0 - 93700.0 ] █████▏453
(93700.0 - 95000.0 ] ███████████████▊1405
(95000.0 - 96300.0 ] ██████████████████████████████2681
(96300.0 - 97600.0 ] ██████████████████████████████ 2692
(97600.0 - 98900.0 ] ██████████████████▌1658
(98900.0 - 100200.0] ███████▏629
(100200.0 - 101500.0] ██▋234
(101500.0 - 102800.0] ▉78
(102800.0 - 104000.0] ▌37
(104000.0 - 105300.0] ▎15
(105300.0 - 106600.0] ▏11
(106600.0 - 107900.0] ▏10
(107900.0 - 349100.0] ▏10
Counts
min: 89.756 μs (0.00% GC); mean: 96.586 μs (0.00% GC); median: 96.428 μs (0.00% GC); max: 349.074 μs (0.00% GC).

Solve times:

julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
46.533658 seconds (14.03 k allocations: 35.467 GiB, 0.49% gc time)
julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
42.405112 seconds (14.86 k allocations: 35.467 GiB, 0.93% gc time)
julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
46.475041 seconds (14.12 k allocations: 35.725 GiB, 0.14% gc time)
julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.199515 seconds (266 allocations: 708.018 MiB)
julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
38.143093 seconds (283 allocations: 708.018 MiB, 0.07% gc time)
julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.644268 seconds (266 allocations: 708.018 MiB, 0.24% gc time)
julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.931490 seconds (265 allocations: 690.017 MiB)
julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
37.560453 seconds (282 allocations: 690.018 MiB, 0.01% gc time)
julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
43.625376 seconds (265 allocations: 690.017 MiB, 0.01% gc time)

Testing just the functions:

julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 9231; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(530100.0 - 531000.0] ▊31
(531000.0 - 532000.0] ████▌207
(532000.0 - 532900.0] ████████████▍575
(532900.0 - 533800.0] █████████████████████▏991
(533800.0 - 534700.0] ████████████████████████████▌1335
(534700.0 - 535700.0] ██████████████████████████████1408
(535700.0 - 536600.0] ██████████████████████████████1404
(536600.0 - 537500.0] ████████████████████████████▌1333
(537500.0 - 538400.0] ████████████████████████1125
(538400.0 - 539400.0] ████████████▋589
(539400.0 - 540300.0] ████184
(540300.0 - 541200.0] ▊31
(541200.0 - 542100.0] ▎6
(542100.0 - 543100.0] ▏2
(543100.0 - 895600.0] ▎10
Counts
min: 530.115 μs (0.00% GC); mean: 535.779 μs (0.00% GC); median: 535.711 μs (0.00% GC); max: 895.599 μs (0.00% GC).
julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(81000.0 - 81900.0 ] ▏1
(81900.0 - 82800.0 ] ▏5
(82800.0 - 83700.0 ] ▊50
(83700.0 - 84500.0 ] ███▍260
(84500.0 - 85400.0 ] ████████████948
(85400.0 - 86300.0 ] █████████████████████████▎2006
(86300.0 - 87200.0 ] ██████████████████████████████2384
(87200.0 - 88100.0 ] ████████████████████████▊1964
(88100.0 - 89000.0 ] ████████████████▊1322
(89000.0 - 89800.0 ] ████████▎650
(89800.0 - 90700.0 ] ███238
(90700.0 - 91600.0 ] █▍107
(91600.0 - 92500.0 ] ▋41
(92500.0 - 93400.0 ] ▎14
(93400.0 - 327700.0] ▎10
Counts
min: 81.006 μs (0.00% GC); mean: 87.097 μs (0.00% GC); median: 86.934 μs (0.00% GC); max: 327.730 μs (0.00% GC).
julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 9821; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns
(498800.0 - 499600.0] ▎10
(499600.0 - 500400.0] █▉109
(500400.0 - 501200.0] ███████████▊688
(501200.0 - 502000.0] █████████████████████████▋1512
(502000.0 - 502800.0] ██████████████████████████▋1571
(502800.0 - 503500.0] █████████████████████████▋1508
(503500.0 - 504300.0] ██████████████████████████████▏1772
(504300.0 - 505100.0] █████████████████████████▊1517
(505100.0 - 505900.0] █████████████▎781
(505900.0 - 506700.0] ████▋266
(506700.0 - 507500.0] █54
(507500.0 - 508200.0] ▎13
(508200.0 - 509000.0] ▏5
(509000.0 - 509800.0] ▏5
(509800.0 - 897000.0] ▎10
Counts
min: 498.833 μs (0.00% GC); mean: 503.346 μs (0.00% GC); median: 503.297 μs (0.00% GC); max: 897.010 μs (0.00% GC).

So most of the performance gain we see from threading doesn't translate to the actual solve.

For reference, here is the assembly from the main loop:

Iterations: 100
Instructions: 2400
Total Cycles: 878
Total uOps: 4100
Dispatch Width: 6
uOps Per Cycle: 4.67
IPC: 2.73
Block RThroughput: 7.0
Cycles with backend pressure increase [ 45.22% ]
Throughput Bottlenecks:
Resource Pressure [ 29.16% ]
- SKXPort0 [ 21.30% ]
- SKXPort1 [ 0.80% ]
- SKXPort2 [ 14.58% ]
- SKXPort3 [ 14.58% ]
- SKXPort4 [ 4.78% ]
- SKXPort5 [ 21.30% ]
Data Dependencies: [ 39.41% ]
- Register Dependencies [ 39.41% ]
- Memory Dependencies [ 0.00% ]
Critical sequence based on the simulation:
Instruction Dependency Information
+----< 22. add r8, 64
|
| < loop carried >
|
+----> 0. vmovupd zmm6, zmmword ptr [r14 + r8 - 8] ## REGISTER dependency: r8
| 1. vmovupd zmm7, zmmword ptr [r14 + r8]
| 2. vaddpd zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
| 3. vaddpd zmm7, zmm7, zmmword ptr [rbx + r8]
| 4. vaddpd zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
| 5. vfmadd231pd zmm7, zmm6, zmm3
| 6. vfmadd213pd zmm7, zmm4, zmmword ptr [rdx + r8]
+----> 7. vfnmsub132pd zmm6, zmm6, zmmword ptr [r10 + r8] ## REGISTER dependency: zmm6
| 8. vaddpd zmm6, zmm7, zmm6
| 9. vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
| 10. vmovupd zmmword ptr [r13 + r8], zmm6
+----> 11. vmovupd zmm6, zmmword ptr [r10 + r8] ## RESOURCE interference: SKXPort3 [ probability: 2% ]
+----> 12. vfnmsub231pd zmm6, zmm6, zmmword ptr [r14 + r8 - 8] ## REGISTER dependency: zmm6
| 13. vaddpd zmm6, zmm6, zmm5
| 14. vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
| 15. vmovupd zmmword ptr [r12 + r8], zmm6
+----> 16. vmovupd zmm6, zmmword ptr [r9 + r8] ## RESOURCE interference: SKXPort3 [ probability: 11% ]
| 17. vmovupd zmm7, zmmword ptr [r10 + r8]
| 18. vfmadd132pd zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
+----> 19. vaddpd zmm6, zmm6, zmm6 ## REGISTER dependency: zmm6
+----> 20. vsubpd zmm6, zmm7, zmm6 ## REGISTER dependency: zmm6
+----> 21. vmovupd zmmword ptr [rdi + r8], zmm6 ## REGISTER dependency: zmm6
22. add r8, 64
23. cmp r8, 4032
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
2 8 0.50 * vmovupd zmm6, zmmword ptr [r14 + r8 - 8]
2 8 0.50 * vmovupd zmm7, zmmword ptr [r14 + r8]
2 11 0.50 * vaddpd zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
2 11 0.50 * vaddpd zmm7, zmm7, zmmword ptr [rbx + r8]
2 11 0.50 * vaddpd zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
1 4 0.50 vfmadd231pd zmm7, zmm6, zmm3
2 11 0.50 * vfmadd213pd zmm7, zmm4, zmmword ptr [rdx + r8]
2 11 0.50 * vfnmsub132pd zmm6, zmm6, zmmword ptr [r10 + r8]
1 4 0.50 vaddpd zmm6, zmm7, zmm6
2 11 0.50 * vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
2 1 1.00 * vmovupd zmmword ptr [r13 + r8], zmm6
2 8 0.50 * vmovupd zmm6, zmmword ptr [r10 + r8]
2 11 0.50 * vfnmsub231pd zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
1 4 0.50 vaddpd zmm6, zmm6, zmm5
2 11 0.50 * vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
2 1 1.00 * vmovupd zmmword ptr [r12 + r8], zmm6
2 8 0.50 * vmovupd zmm6, zmmword ptr [r9 + r8]
2 8 0.50 * vmovupd zmm7, zmmword ptr [r10 + r8]
2 11 0.50 * vfmadd132pd zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
1 4 0.50 vaddpd zmm6, zmm6, zmm6
1 4 0.50 vsubpd zmm6, zmm7, zmm6
2 1 1.00 * vmovupd zmmword ptr [rdi + r8], zmm6
1 1 0.25 add r8, 64
1 1 0.25 cmp r8, 4032
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 7.44 4.25 7.06 7.06 3.00 7.31 2.00 2.88
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - 0.01 0.98 0.41 0.59 - 0.01 - - vmovupd zmm6, zmmword ptr [r14 + r8 - 8]
- - 0.13 0.84 0.55 0.45 - 0.03 - - vmovupd zmm7, zmmword ptr [r14 + r8]
- - 0.61 - 0.59 0.41 - 0.39 - - vaddpd zmm7, zmm7, zmmword ptr [r14 + r8 - 16]
- - 0.47 - 0.61 0.39 - 0.53 - - vaddpd zmm7, zmm7, zmmword ptr [rbx + r8]
- - 0.51 - 0.55 0.45 - 0.49 - - vaddpd zmm7, zmm7, zmmword ptr [r15 + r8 + 8]
- - 0.46 - - - - 0.54 - - vfmadd231pd zmm7, zmm6, zmm3
- - 0.54 - 0.46 0.54 - 0.46 - - vfmadd213pd zmm7, zmm4, zmmword ptr [rdx + r8]
- - 0.44 - 0.66 0.34 - 0.56 - - vfnmsub132pd zmm6, zmm6, zmmword ptr [r10 + r8]
- - 0.48 - - - - 0.52 - - vaddpd zmm6, zmm7, zmm6
- - 0.52 - 0.69 0.31 - 0.48 - - vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
- - - - 0.01 - 1.00 - - 0.99 vmovupd zmmword ptr [r13 + r8], zmm6
- - 0.11 0.88 0.49 0.51 - 0.01 - - vmovupd zmm6, zmmword ptr [r10 + r8]
- - 0.66 - 0.32 0.68 - 0.34 - - vfnmsub231pd zmm6, zmm6, zmmword ptr [r14 + r8 - 8]
- - 0.50 - - - - 0.50 - - vaddpd zmm6, zmm6, zmm5
- - 0.50 - 0.47 0.53 - 0.50 - - vaddpd zmm6, zmm6, zmmword ptr [r9 + r8]
- - - - - 0.02 1.00 - - 0.98 vmovupd zmmword ptr [r12 + r8], zmm6
- - 0.01 0.98 0.24 0.76 - 0.01 - - vmovupd zmm6, zmmword ptr [r9 + r8]
- - 0.25 0.57 0.57 0.43 - 0.18 - - vmovupd zmm7, zmmword ptr [r10 + r8]
- - 0.45 - 0.36 0.64 - 0.55 - - vfmadd132pd zmm7, zmm5, zmmword ptr [r14 + r8 - 8]
- - 0.39 - - - - 0.61 - - vaddpd zmm6, zmm6, zmm6
- - 0.40 - - - - 0.60 - - vsubpd zmm6, zmm7, zmm6
- - - - 0.08 0.01 1.00 - - 0.91 vmovupd zmmword ptr [rdi + r8], zmm6
- - - - - - - - 1.00 - add r8, 64
- - - - - - - - 1.00 - cmp r8, 4032

Versus:

Iterations: 100
Instructions: 3100
Total Cycles: 783
Total uOps: 4200
Dispatch Width: 6
uOps Per Cycle: 5.36
IPC: 3.96
Block RThroughput: 7.0
No resource or data dependency bottlenecks discovered.
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
2 8 0.50 * vmovupd zmm4, zmmword ptr [rbx - 8]
2 8 0.50 * vmovupd zmm5, zmmword ptr [rbx]
2 11 0.50 * vaddpd zmm4, zmm4, zmmword ptr [rbx + 8]
2 11 0.50 * vaddpd zmm4, zmm4, zmmword ptr [r12 + rcx]
2 11 0.50 * vaddpd zmm4, zmm4, zmmword ptr [r12]
1 4 0.50 vfmadd231pd zmm4, zmm5, zmm1
2 8 0.50 * vmovupd zmm6, zmmword ptr [rbp]
2 8 0.50 * vmovupd zmm7, zmmword ptr [rdx]
2 11 0.50 * vaddpd zmm8, zmm7, zmmword ptr [rdi]
1 4 0.50 vfmsub231pd zmm8, zmm5, zmm6
1 4 0.50 vfnmadd231pd zmm8, zmm5, zmm0
1 4 0.50 vfmsub231pd zmm8, zmm4, zmm2
1 4 0.50 vaddpd zmm4, zmm7, zmm3
1 4 0.50 vfmsub231pd zmm4, zmm5, zmm6
1 4 0.50 vfmsub231pd zmm4, zmm6, zmm0
1 1 0.33 vmovapd zmm9, zmm0
1 4 0.50 vfmadd213pd zmm9, zmm7, zmm3
1 4 0.50 vfmadd231pd zmm9, zmm5, zmm6
2 1 1.00 * vmovupd zmmword ptr [r9], zmm8
2 1 1.00 * vmovupd zmmword ptr [rax], zmm4
1 4 0.50 vfmadd231pd zmm9, zmm0, zmm7
2 1 1.00 * vmovupd zmmword ptr [r14], zmm9
1 1 0.25 add rbx, 64
1 1 0.25 add rbp, 64
1 1 0.25 add rdx, 64
1 1 0.25 add rdi, 64
1 1 0.25 add r9, 64
1 1 0.25 add rax, 64
1 1 0.25 add r14, 64
1 1 0.25 add r12, 64
1 1 0.25 cmp r10, rbx
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 7.52 7.33 4.06 4.06 3.00 7.53 5.62 2.88
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - 0.07 0.84 0.61 0.39 - 0.09 - - vmovupd zmm4, zmmword ptr [rbx - 8]
- - - 0.97 0.45 0.55 - 0.03 - - vmovupd zmm5, zmmword ptr [rbx]
- - 0.65 - 0.39 0.61 - 0.35 - - vaddpd zmm4, zmm4, zmmword ptr [rbx + 8]
- - 0.48 - 0.37 0.63 - 0.52 - - vaddpd zmm4, zmm4, zmmword ptr [r12 + rcx]
- - 0.35 - 0.80 0.20 - 0.65 - - vaddpd zmm4, zmm4, zmmword ptr [r12]
- - 0.27 - - - - 0.73 - - vfmadd231pd zmm4, zmm5, zmm1
- - 0.02 0.83 0.62 0.38 - 0.15 - - vmovupd zmm6, zmmword ptr [rbp]
- - 0.08 0.91 0.29 0.71 - 0.01 - - vmovupd zmm7, zmmword ptr [rdx]
- - 0.56 - 0.49 0.51 - 0.44 - - vaddpd zmm8, zmm7, zmmword ptr [rdi]
- - 0.39 - - - - 0.61 - - vfmsub231pd zmm8, zmm5, zmm6
- - 0.40 - - - - 0.60 - - vfnmadd231pd zmm8, zmm5, zmm0
- - 0.29 - - - - 0.71 - - vfmsub231pd zmm8, zmm4, zmm2
- - 0.55 - - - - 0.45 - - vaddpd zmm4, zmm7, zmm3
- - 0.31 - - - - 0.69 - - vfmsub231pd zmm4, zmm5, zmm6
- - 0.32 - - - - 0.68 - - vfmsub231pd zmm4, zmm6, zmm0
- - 0.20 0.78 - - - 0.02 - - vmovapd zmm9, zmm0
- - 0.75 - - - - 0.25 - - vfmadd213pd zmm9, zmm7, zmm3
- - 0.77 - - - - 0.23 - - vfmadd231pd zmm9, zmm5, zmm6
- - - - 0.01 - 1.00 - - 0.99 vmovupd zmmword ptr [r9], zmm8
- - - - - 0.03 1.00 - - 0.97 vmovupd zmmword ptr [rax], zmm4
- - 0.74 - - - - 0.26 - - vfmadd231pd zmm9, zmm0, zmm7
- - - - 0.03 0.05 1.00 - - 0.92 vmovupd zmmword ptr [r14], zmm9
- - - - - - - - 1.00 - add rbx, 64
- - 0.12 0.03 - - - 0.01 0.84 - add rbp, 64
- - 0.05 0.34 - - - 0.02 0.59 - add rdx, 64
- - 0.06 0.24 - - - - 0.70 - add rdi, 64
- - 0.02 0.69 - - - 0.01 0.28 - add r9, 64
- - - 0.29 - - - - 0.71 - add rax, 64
- - 0.02 0.30 - - - 0.02 0.66 - add r14, 64
- - 0.01 0.69 - - - - 0.30 - add r12, 64
- - 0.04 0.42 - - - - 0.54 - cmp r10, rbx

All the extra …

If you want just a single number from each report: LLVM-MCA estimates 100 iterations of the …

Iterations: 100
Instructions: 2700
Total Cycles: 731
Total uOps: 3800
Dispatch Width: 6
uOps Per Cycle: 5.20
IPC: 3.69
Block RThroughput: 7.0
Cycles with backend pressure increase [ 51.44% ]
Throughput Bottlenecks:
Resource Pressure [ 50.21% ]
- SKXPort0 [ 49.93% ]
- SKXPort1 [ 49.11% ]
- SKXPort2 [ 13.13% ]
- SKXPort3 [ 13.13% ]
- SKXPort4 [ 5.06% ]
- SKXPort5 [ 49.93% ]
- SKXPort6 [ 9.58% ]
- SKXPort7 [ 3.15% ]
Data Dependencies: [ 39.53% ]
- Register Dependencies [ 39.53% ]
- Memory Dependencies [ 0.00% ]
Critical sequence based on the simulation:
Instruction Dependency Information
+----< 13. vaddpd zmm5, zmm7, zmm3
|
| < loop carried >
|
| 0. mov r11, rbx
+----> 1. vmovupd zmm4, zmmword ptr [rbx] ## RESOURCE interference: SKXPort5 [ probability: 25% ]
+----> 2. vmovupd zmm5, zmmword ptr [rbx - 8] ## RESOURCE interference: SKXPort1 [ probability: 91% ]
+----> 3. vaddpd zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference: SKXPort2 [ probability: 13% ]
+----> 4. vaddpd zmm5, zmm5, zmmword ptr [rbx + r9] ## REGISTER dependency: zmm5
+----> 5. vaddpd zmm5, zmm5, zmmword ptr [rbx + rax] ## REGISTER dependency: zmm5
+----> 6. vmovupd zmm6, zmmword ptr [rbx + rdi] ## RESOURCE interference: SKXPort5 [ probability: 1% ]
| 7. vfmadd231pd zmm5, zmm4, zmm0
+----> 8. vmovupd zmm7, zmmword ptr [rbx + 2*rdi] ## RESOURCE interference: SKXPort1 [ probability: 93% ]
| 9. vaddpd zmm8, zmm7, zmmword ptr [rbp]
| 10. vfmsub231pd zmm8, zmm4, zmm6
| 11. vfnmadd231pd zmm8, zmm4, zmm1
| 12. vfmsub231pd zmm8, zmm5, zmm2
+----> 13. vaddpd zmm5, zmm7, zmm3 ## REGISTER dependency: zmm7
| 14. vfmsub231pd zmm5, zmm4, zmm6
| 15. vfmsub231pd zmm5, zmm6, zmm1
| 16. vmovapd zmm9, zmm1
+----> 17. vfmadd213pd zmm9, zmm7, zmm3 ## RESOURCE interference: SKXPort5 [ probability: 51% ]
| 18. vfmadd231pd zmm9, zmm4, zmm6
| 19. vfmadd231pd zmm9, zmm1, zmm7
| 20. vmovupd zmmword ptr [r12], zmm8
| 21. vmovupd zmmword ptr [r12 + rdx], zmm5
| 22. vmovupd zmmword ptr [r12 + 2*rdx], zmm9
| 23. add rbx, 64
| 24. add rbp, 64
| 25. add r12, 64
| 26. cmp rsi, rbx
|
| < loop carried >
|
+----> 3. vaddpd zmm5, zmm5, zmmword ptr [rbx + 8] ## RESOURCE interference: SKXPort5 [ probability: 38% ]
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.25 mov r11, rbx
2 8 0.50 * vmovupd zmm4, zmmword ptr [rbx]
2 8 0.50 * vmovupd zmm5, zmmword ptr [rbx - 8]
2 11 0.50 * vaddpd zmm5, zmm5, zmmword ptr [rbx + 8]
2 11 0.50 * vaddpd zmm5, zmm5, zmmword ptr [rbx + r9]
2 11 0.50 * vaddpd zmm5, zmm5, zmmword ptr [rbx + rax]
2 8 0.50 * vmovupd zmm6, zmmword ptr [rbx + rdi]
1 4 0.50 vfmadd231pd zmm5, zmm4, zmm0
2 8 0.50 * vmovupd zmm7, zmmword ptr [rbx + 2*rdi]
2 11 0.50 * vaddpd zmm8, zmm7, zmmword ptr [rbp]
1 4 0.50 vfmsub231pd zmm8, zmm4, zmm6
1 4 0.50 vfnmadd231pd zmm8, zmm4, zmm1
1 4 0.50 vfmsub231pd zmm8, zmm5, zmm2
1 4 0.50 vaddpd zmm5, zmm7, zmm3
1 4 0.50 vfmsub231pd zmm5, zmm4, zmm6
1 4 0.50 vfmsub231pd zmm5, zmm6, zmm1
1 1 0.33 vmovapd zmm9, zmm1
1 4 0.50 vfmadd213pd zmm9, zmm7, zmm3
1 4 0.50 vfmadd231pd zmm9, zmm4, zmm6
1 4 0.50 vfmadd231pd zmm9, zmm1, zmm7
2 1 1.00 * vmovupd zmmword ptr [r12], zmm8
2 1 1.00 * vmovupd zmmword ptr [r12 + rdx], zmm5
2 1 1.00 * vmovupd zmmword ptr [r12 + 2*rdx], zmm9
1 1 0.25 add rbx, 64
1 1 0.25 add rbp, 64
1 1 0.25 add r12, 64
1 1 0.25 cmp rsi, rbx
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 7.06 6.79 4.04 4.05 3.00 7.07 3.08 2.91
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - 0.85 - - - 0.01 0.14 - mov r11, rbx
- - - 0.99 0.46 0.54 - 0.01 - - vmovupd zmm4, zmmword ptr [rbx]
- - 0.01 0.98 0.75 0.25 - 0.01 - - vmovupd zmm5, zmmword ptr [rbx - 8]
- - 0.19 - 0.67 0.33 - 0.81 - - vaddpd zmm5, zmm5, zmmword ptr [rbx + 8]
- - 0.19 - 0.53 0.47 - 0.81 - - vaddpd zmm5, zmm5, zmmword ptr [rbx + r9]
- - 0.53 - 0.66 0.34 - 0.47 - - vaddpd zmm5, zmm5, zmmword ptr [rbx + rax]
- - 0.01 0.97 0.31 0.69 - 0.02 - - vmovupd zmm6, zmmword ptr [rbx + rdi]
- - 0.94 - - - - 0.06 - - vfmadd231pd zmm5, zmm4, zmm0
- - - 0.99 0.29 0.71 - 0.01 - - vmovupd zmm7, zmmword ptr [rbx + 2*rdi]
- - 0.10 - 0.33 0.67 - 0.90 - - vaddpd zmm8, zmm7, zmmword ptr [rbp]
- - 0.73 - - - - 0.27 - - vfmsub231pd zmm8, zmm4, zmm6
- - 0.93 - - - - 0.07 - - vfnmadd231pd zmm8, zmm4, zmm1
- - 0.96 - - - - 0.04 - - vfmsub231pd zmm8, zmm5, zmm2
- - 0.09 - - - - 0.91 - - vaddpd zmm5, zmm7, zmm3
- - 0.55 - - - - 0.45 - - vfmsub231pd zmm5, zmm4, zmm6
- - 0.34 - - - - 0.66 - - vfmsub231pd zmm5, zmm6, zmm1
- - - 0.98 - - - 0.02 - - vmovapd zmm9, zmm1
- - 0.30 - - - - 0.70 - - vfmadd213pd zmm9, zmm7, zmm3
- - 0.55 - - - - 0.45 - - vfmadd231pd zmm9, zmm4, zmm6
- - 0.62 - - - - 0.38 - - vfmadd231pd zmm9, zmm1, zmm7
- - - - - - 1.00 - - 1.00 vmovupd zmmword ptr [r12], zmm8
- - - - - 0.03 1.00 - - 0.97 vmovupd zmmword ptr [r12 + rdx], zmm5
- - - - 0.04 0.02 1.00 - - 0.94 vmovupd zmmword ptr [r12 + 2*rdx], zmm9
- - - - - - - - 1.00 - add rbx, 64
- - - 0.04 - - - 0.01 0.95 - add rbp, 64
- - 0.01 0.87 - - - - 0.12 - add r12, 64
- - 0.01 0.12 - - - - 0.87 - cmp rsi, rbx

This code looks better to me, and LLVM says it is faster. But it's slower in benchmarks (on my computers) anyway, and I'm a bit at a loss. Would be interesting to see more benchmarks.

using OrdinaryDiffEq, LinearAlgebra, SparseArrays, BenchmarkHistograms, LoopVectorization
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 512
const X = reshape([i for i in 1:N for j in 1:N],N,N);
const Y = reshape([j for i in 1:N for j in 1:N],N,N);
const α₁ = 1.0.*(X.>=4*N/5);
const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]);
const My = copy(Mx);
Mx[2,1] = 2.0;
Mx[end-1,end] = 2.0;
My[1,2] = 2.0;
My[end,end-1] = 2.0;
# Define the discretized PDE as an ODE function
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N);
function f!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
@turbo for j in 2:N-1, i in 2:N-1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = 1
dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = N
dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = 1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = N
dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds begin
i = 1; j = 1
dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = 1; j = N
dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = 1
dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = N
dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
end
function fthreads!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
@tturbo for j in 2:N-1, i in 2:N-1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = 1
dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = N
dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = 1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = N
dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds begin
i = 1; j = 1
dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = 1; j = N
dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = 1
dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = N
dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
end
function fsimdivdep!(du,u,p,t)
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
@inbounds @fastmath for j in 2:N-1; @simd ivdep for i in 2:N-1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end; end
@inbounds for j in 2:N-1
i = 1
dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = N
dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = 1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = N
dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds begin
i = 1; j = 1
dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = 1; j = N
dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = 1
dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = N
dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
end
function fnoviews!(du,u,p,t) # direct (non-view) indexing in the @turbo loop
A = @view u[:,:,1]
B = @view u[:,:,2]
C = @view u[:,:,3]
dA = @view du[:,:,1]
dB = @view du[:,:,2]
dC = @view du[:,:,3]
@turbo for j in 2:N-1, i in 2:N-1
du[i,j,1] = D*(u[i-1,j,1] + u[i+1,j,1] + u[i,j+1,1] + u[i,j-1,1] - 4u[i,j,1]) +
α₁[i,j] - β₁*u[i,j,1] - r₁*u[i,j,1]*u[i,j,2] + r₂*u[i,j,3]
du[i,j,2] = α₂ - β₂*u[i,j,2] - r₁*u[i,j,1]*u[i,j,2] + r₂*u[i,j,3]
du[i,j,3] = α₃ - β₃*u[i,j,3] + r₁*u[i,j,1]*u[i,j,2] - r₂*u[i,j,3]
end
@inbounds for j in 2:N-1
i = 1
dA[1,j] = D*(2A[i+1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for j in 2:N-1
i = N
dA[end,j] = D*(2A[i-1,j] + A[i,j+1] + A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = 1
dA[i,j] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds for i in 2:N-1
j = N
dA[i,end] = D*(A[i-1,j] + A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
@inbounds begin
i = 1; j = 1
dA[i,j] = D*(2A[i+1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = 1; j = N
dA[i,j] = D*(2A[i+1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = 1
dA[i,j] = D*(2A[i-1,j] + 2A[i,j+1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
i = N; j = N
dA[i,j] = D*(2A[i-1,j] + 2A[i,j-1] - 4A[i,j]) +
α₁[i,j] - β₁*A[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dB[i,j] = α₂ - β₂*B[i,j] - r₁*A[i,j]*B[i,j] + r₂*C[i,j]
dC[i,j] = α₃ - β₃*C[i,j] + r₁*A[i,j]*B[i,j] - r₂*C[i,j]
end
end
u0 = zeros(N,N,3); du = similar(u0);
prob = ODEProblem(f!,u0,(0.0,10.0));
prob_threads = ODEProblem(fthreads!,u0,(0.0,10.0));
prob_simdivdep = ODEProblem(fsimdivdep!,u0,(0.0,10.0));
@time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
@time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
Oh, and benchmarks on the M1 (running on a native Julia 1.8 build):

julia> @time solve(prob, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.625047 seconds (14.09 k allocations: 35.655 GiB, 0.71% gc time)
julia> @time solve(prob_threads, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
91.950376 seconds (14.24 k allocations: 35.655 GiB, 0.24% gc time)
julia> @time solve(prob_simdivdep, ROCK4(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
23.505749 seconds (14.01 k allocations: 35.421 GiB, 0.23% gc time)
julia> @time solve(prob, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
19.860662 seconds (267 allocations: 708.018 MiB)
julia> @time solve(prob_threads, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
22.793870 seconds (270 allocations: 708.018 MiB, 0.06% gc time)
julia> @time solve(prob_simdivdep, Tsit5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.074039 seconds (270 allocations: 708.018 MiB, 0.53% gc time)
julia> @time solve(prob, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
20.464703 seconds (264 allocations: 690.017 MiB, 0.34% gc time)
julia> @time solve(prob_threads, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.983172 seconds (267 allocations: 690.017 MiB, 0.07% gc time)
julia> @time solve(prob_simdivdep, DP5(), reltol = 1e-8, abstol=1e-8, saveat=0.1);
21.848427 seconds (267 allocations: 690.017 MiB, 0.39% gc time)

Not sure what's going on with ROCK4 and multithreading, but otherwise, ouch Intel.

Benchmarks of the individual functions:

julia> @benchmark f!($du, $u0, nothing, 0.0) # @turbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
(timing histogram elided)
min: 408.958 μs (0.00% GC); mean: 422.196 μs (0.00% GC); median: 421.917 μs (0.00% GC); max: 534.208 μs (0.00% GC).
julia> @benchmark fthreads!($du, $u0, nothing, 0.0) # @tturbo
samples: 10000; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
(timing histogram elided)
min: 218.375 μs (0.00% GC); mean: 226.536 μs (0.00% GC); median: 226.375 μs (0.00% GC); max: 411.584 μs (0.00% GC).
julia> @benchmark fsimdivdep!($du, $u0, nothing, 0.0) # @simd ivdep
samples: 8948; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
(timing histogram elided)
min: 548.500 μs (0.00% GC); mean: 556.582 μs (0.00% GC); median: 556.000 μs (0.00% GC); max: 604.667 μs (0.00% GC).

Julia crashes a lot on the M1, though.
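For readers comparing the three variants benchmarked above: `@turbo` and `@tturbo` come from LoopVectorization.jl, while `@simd ivdep` and `Threads.@threads` are in Base. A toy sketch of the Base-only analogues (the kernel and function names here are hypothetical, chosen so the snippet has no package dependencies):

```julia
# SIMD variant: the ivdep hint asserts the iterations are independent,
# which is what lets the compiler vectorize the loop aggressively.
function axpy_simd!(y, a, x)
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end

# Threaded variant: splits the index range across Julia threads.
function axpy_threads!(y, a, x)
    Threads.@threads for i in eachindex(y)
        @inbounds y[i] += a * x[i]
    end
    return y
end

x = ones(100); y = zeros(100)
axpy_simd!(y, 2.0, x)
axpy_threads!(y, 1.0, x)
# y is now all 3.0
```

For a kernel this cheap, threading overhead usually dominates, which is consistent with the threaded solves above not winning at this problem size.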
Some more benchmarks (also on M1) for linear diffusion that might be of interest:

using DifferentialEquations
using LinearAlgebra, SparseArrays
# Naive implementation
n = 100
A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))
function pde!(dy, y, p, t)
dy .= A * y
nothing
end
u0 = sin.(range(0,stop=pi,length=n))
dy = similar(u0)
tspan = (0.0,1.0)
using BenchmarkTools
@btime pde!(dy,u0,0,0) # 728.531 ns (3 allocations: 928 bytes)
prob = ODEProblem(pde!, u0, tspan)
# Benchmarks
using Sundials
@btime solve(prob, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.848 ms (2358 allocations: 806.56 KiB)
@btime solve(prob, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 367.167 μs (899 allocations: 190.02 KiB)
@btime solve(prob, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6);# 118.416 μs (412 allocations: 86.91 KiB)
# Non-allocating implementation
pde_noalloc! = let A = spdiagm(-1 => ones(n-1), 0 => reduce(vcat, [-1,-2ones(n-2),-1]), 1=> ones(n-1))
function pde!(dy, y, p, t)
mul!(dy, A, y)
nothing
end
end
@btime pde_noalloc!(dy,u0,0,0) # 371.079 ns (0 allocations: 0 bytes)
prob_noalloc = ODEProblem(pde_noalloc!, u0, tspan)
@btime solve(prob_noalloc, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.283 ms (388 allocations: 211.94 KiB)
@btime solve(prob_noalloc, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 305.750 μs (522 allocations: 77.14 KiB)
@btime solve(prob_noalloc, CVODE_BDF(linear_solver=:Band,jac_upper=1,jac_lower=1), reltol=1e-6, abstol=1e-6); # 105.125 μs (328 allocations: 61.98 KiB)
# Analytical Jacobian
de = modelingtoolkitize(prob)
jac = eval(ModelingToolkit.generate_jacobian(de)[2])
f = ODEFunction(pde!, jac=jac)
prob_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.403 ms (499 allocations: 256.42 KiB)
@btime solve(prob_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 298.291 μs (403 allocations: 83.95 KiB)
f = ODEFunction(pde_noalloc!, jac=jac)
prob_noalloc_jac = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_jac, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.351 ms (347 allocations: 210.91 KiB)
@btime solve(prob_noalloc_jac, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 278.792 μs (328 allocations: 61.75 KiB)
prob_mtk = ODEProblem(de, Pair[], tspan)
@btime solve(prob_mtk, KenCarp47(autodiff=false), reltol=1e-6, abstol=1e-6); # 1.028 ms (388 allocations: 214.03 KiB)
@btime solve(prob_mtk, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 266.958 μs (522 allocations: 78.38 KiB)
# Sparse jacobian
using SparsityDetection
input = rand(length(u0))
output = similar(input)
sparsity_pattern = jacobian_sparsity(pde!,output,input,0.0,0.0)
jac_sparsity = Float64.(sparse(sparsity_pattern))
using Plots
spy(jac_sparsity)
f = ODEFunction(pde!;jac_prototype=jac_sparsity)
prob_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 785.375 μs (2368 allocations: 1.51 MiB)
@btime solve(prob_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 375.833 μs (934 allocations: 199.62 KiB)
f = ODEFunction(pde_noalloc!;jac_prototype=jac_sparsity)
prob_noalloc_sparse = ODEProblem(f,u0,tspan)
@btime solve(prob_noalloc_sparse, KenCarp47(), reltol=1e-6, abstol=1e-6); # 751.666 μs (2200 allocations: 1.45 MiB)
@btime solve(prob_noalloc_sparse, CVODE_BDF(), reltol=1e-6, abstol=1e-6); # 314.375 μs (561 allocations: 87.88 KiB)
@ChrisRackauckas is this still relevant?
I don't think so. It's the justification for going with the stencil composition form.
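For context, the stencil composition form applies the discretization directly as loops over contiguous indices instead of materializing an operator matrix, which is what makes cases like the nonlinear Laplacian d/dx(D(u) du/dx) straightforward. A rough sketch (function name, staggered-grid flux averaging, and boundary handling are all illustrative assumptions, not the package's actual implementation):

```julia
# Compose first-difference "flux" stencils at the half-points, then
# difference the fluxes: this is one way a nonlinear Laplacian falls out
# of stencil composition without ever forming a matrix.
function nonlinear_laplacian!(du, u, D, dx)
    n = length(u)
    @inbounds for i in 2:n-1
        fluxR = 0.5 * (D(u[i+1]) + D(u[i])) * (u[i+1] - u[i]) / dx
        fluxL = 0.5 * (D(u[i]) + D(u[i-1])) * (u[i] - u[i-1]) / dx
        du[i] = (fluxR - fluxL) / dx
    end
    du[1] = du[n] = 0.0   # boundary rows left to the BC handling (assumed)
    return du
end

u  = collect(range(0.0, 1.0, length=11))  # linear profile, spacing 0.1
du = similar(u)
nonlinear_laplacian!(du, u, x -> 1.0, 0.1)  # constant D recovers the linear case
```

With constant `D` this reduces to the standard second difference, so the interior of `du` is zero for a linear profile.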
The code in `MOLFiniteDifference` is getting pretty messy as we add more functionality for the nonlinear Laplacian, upwinding, etc. One way to simplify the code would be to have an intermediate matrix form, which then gets scalarized by the compiler, instead of going directly to the scalar form by looping over all indices. Ideally this would use `DerivativeOperator` to stop this package diverging into two almost entirely distinct packages (the two examples in the readme). Can `DerivativeOperator` be used like this? If so, how would it handle something that is not a simple `*bc` convolution?

I realize this is a little bit of a step backwards, because that is closer to how it was done before the refactoring of `MOLFiniteDifference`, so it is worth discussing the pros and cons of each approach a little before changing anything.
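For concreteness, here is a minimal sketch of the intermediate-matrix-form idea using plain stdlib `SparseArrays` (not `DerivativeOperator`); the grid size and Dirichlet values are made up for illustration. The interior stencil becomes a sparse matvec plus an affine boundary term, and the result agrees with the fully scalarized loop:

```julia
using LinearAlgebra, SparseArrays

# Intermediate matrix form: the interior second-difference stencil becomes
# a sparse matrix A, and Dirichlet boundary values enter as an affine term
# b, so the semi-discretization is du = A*u .+ b.
n  = 8                      # number of interior points (hypothetical)
dx = 1.0 / (n + 1)
A  = spdiagm(-1 => ones(n-1), 0 => -2ones(n), 1 => ones(n-1)) / dx^2
uL, uR = 0.0, 1.0           # Dirichlet boundary values (assumed)
b  = zeros(n); b[1] = uL / dx^2; b[end] = uR / dx^2

u  = sin.(range(0, pi, length=n))
du = A * u .+ b             # matrix form

# The same computation in the fully scalarized (looped) form:
du_loop = similar(u)
for i in 1:n
    left  = i == 1 ? uL : u[i-1]
    right = i == n ? uR : u[i+1]
    du_loop[i] = (left - 2u[i] + right) / dx^2
end
```

The two forms agree (`du_loop ≈ du`), which is what would let a compiler pass start from the matrix form and scalarize it into loops like the ones benchmarked above.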