-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
incorrect results from turbo loop #274
Comments
Duplicate of: #266 LoopVectorization's current model for loops assumes all iterations are independent and can be reordered, with a special case added for constant offsets between loads and stores that are for i in 1:I, j in J # I loop has a step of 1
a, b = f(x[i], x[i+1])
x[i] = a
x[i+1] = b # we're loading and storing from `x` with offsets `i, i+1`; (i + 1 - i) <= step(1:I) so it will reorder `j` loop instead.
end If it weren't for the But my priority / long term plan here involves writing a new loop model that'll explicitly include a dependency graph that'll indicate constraints on iteration order and available transformations. #266 and this issue are more examples it should be able to correctly handle. function turbo_foo!(dy, f, Ca, BBB, v_vz, dz)
N_tip = length(f)
length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
inv_2dz = 1/2dz
inv_dz² = 1/dz^2
dy .= 0
We = 0.1
@turbo for i = 2:N_tip-2
fᵢ₋₂ = i < 5 ? 0.0 : f[i-2]
fᵢ₋₁ = i < 4 ? 0.0 : f[i-1]
fᵢ = i == 2 ? 0.0 : f[i ]
fᵢ₊₁ = f[i+1]
fᵢ₊₂ = f[i+2]
dfᵢ₋₁ = (fᵢ - fᵢ₋₂)*inv_2dz # df/dz
# dfᵢ = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz # df/dz
dfᵢ₊₁ = (fᵢ₊₂ - fᵢ )*inv_2dz # df/dz
d2fᵢ₋₁ = (fᵢ - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz² # d²f/dz²
# d2fᵢ = (fᵢ₊₁ - 2fᵢ + fᵢ₋₁)*inv_dz² # d²f/dz²
d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ )*inv_dz² # d²f/dz²
pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) / (2*abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁)^(3/2))
# pᵢ = i == 2 ? 0.0 : ((2-d2fᵢ )*fᵢ + dfᵢ ^2) / (2*abs(1/4*dfᵢ ^2 + fᵢ )^(3/2))
pᵢ₊₁ = ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) / (2*abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁)^(3/2))
dy[i] = -BBB[i]
viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
# viscᵢ = i == 2 ? 0.0 : Ca[i ] * (dfᵢ - BBB[i ]) / fᵢ
viscᵢ₊₁ = Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz # dP/dz
dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
end
return dy
end This works: julia> foo!(dy1, f, Ca, BBB, v_vz, dz) ≈ turbo_foo!(dy2, f, Ca, BBB, v_vz, dz)
true
julia> display([dy1 dy2])
10×2 Matrix{Float64}:
0.0 0.0
-0.2 -0.2
-0.3 -0.3
0.0 0.0
0.0 0.0
0.0 0.0
-10099.2 -10099.2
-4997.16 -4997.16
0.0 0.0
0.0 0.0 and yields a reasonable speedup: julia> using BenchmarkHistograms
julia> @benchmark foo!($dy1, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 755; memory estimate: 0 bytes; allocs estimate: 0
ns
(169.8 - 171.3] ▏1
(171.3 - 172.8] 0
(172.8 - 174.3] 0
(174.3 - 175.8] 0
(175.8 - 177.3] 0
(177.3 - 178.8] 0
(178.8 - 180.3] ██████████████████████████████ 8613
(180.3 - 181.8] ████▌1269
(181.8 - 183.4] ▍91
(183.4 - 184.9] ▏6
(184.9 - 186.4] ▏5
(186.4 - 187.9] ▏3
(187.9 - 189.4] 0
(189.4 - 190.9] ▏2
(190.9 - 202.5] ▏10
Counts
min: 169.793 ns (0.00% GC); mean: 180.198 ns (0.00% GC); median: 180.020 ns (0.00% GC); max: 202.498 ns (0.00% GC).
julia> @benchmark turbo_foo!($dy1, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 987; memory estimate: 0 bytes; allocs estimate: 0
ns
(51.61 - 51.86] ██████████████████████████████ 7540
(51.86 - 52.12] ███████▊1925
(52.12 - 52.38] 0
(52.38 - 52.64] 0
(52.64 - 52.9 ] ▊160
(52.9 - 53.16] █▎310
(53.16 - 53.42] ▏29
(53.42 - 53.67] ▏10
(53.67 - 53.93] ▏8
(53.93 - 54.19] ▏3
(54.19 - 54.45] ▏1
(54.45 - 54.71] ▏1
(54.71 - 54.97] ▏1
(54.97 - 55.23] ▏2
(55.23 - 74.64] ▏10
Counts
min: 51.606 ns (0.00% GC); mean: 51.870 ns (0.00% GC); median: 51.795 ns (0.00% GC); max: 74.639 ns (0.00% GC). By far the slowest part of this loop is
To speed this up further, we can replace julia> function turbo_foo_root!(dy, f, Ca, BBB, v_vz, dz)
N_tip = length(f)
length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
inv_2dz = 1/2dz
inv_dz² = 1/dz^2
dy .= 0
We = 0.1
@turbo for i = 2:N_tip-2
fᵢ₋₂ = i < 5 ? 0.0 : f[i-2]
fᵢ₋₁ = i < 4 ? 0.0 : f[i-1]
fᵢ = i == 2 ? 0.0 : f[i ]
fᵢ₊₁ = f[i+1]
fᵢ₊₂ = f[i+2]
dfᵢ₋₁ = (fᵢ - fᵢ₋₂)*inv_2dz # df/dz
# dfᵢ = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz # df/dz
dfᵢ₊₁ = (fᵢ₊₂ - fᵢ )*inv_2dz # df/dz
d2fᵢ₋₁ = (fᵢ - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz² # d²f/dz²
# d2fᵢ = (fᵢ₊₁ - 2fᵢ + fᵢ₋₁)*inv_dz² # d²f/dz²
d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ )*inv_dz² # d²f/dz²
pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) / (2*sqrt(abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁))^(3))
# pᵢ = i == 2 ? 0.0 : ((2-d2fᵢ )*fᵢ + dfᵢ ^2) / (2*abs(1/4*dfᵢ ^2 + fᵢ )^(3/2))
pᵢ₊₁ = ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) / (2*sqrt(abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁))^(3))
dy[i] = -BBB[i]
viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
# viscᵢ = i == 2 ? 0.0 : Ca[i ] * (dfᵢ - BBB[i ]) / fᵢ
viscᵢ₊₁ = Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz # dP/dz
dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
end
return dy
end
turbo_foo_root! (generic function with 1 method)
julia> foo!(dy1, f, Ca, BBB, v_vz, dz) ≈ turbo_foo_root!(dy2, f, Ca, BBB, v_vz, dz)
true
julia> @benchmark turbo_foo_root!($dy2, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 993; memory estimate: 0 bytes; allocs estimate: 0
ns
(34.39 - 34.6 ] ██████████████████████████████ 9639
(34.6 - 34.81] ▏4
(34.81 - 35.03] 0
(35.03 - 35.24] 0
(35.24 - 35.46] ▏4
(35.46 - 35.67] ▉259
(35.67 - 35.89] ▎57
(35.89 - 36.1 ] ▏11
(36.1 - 36.31] ▏9
(36.31 - 36.53] ▏1
(36.53 - 36.74] ▏2
(36.74 - 36.96] ▏1
(36.96 - 37.17] ▏2
(37.17 - 37.38] ▏1
(37.38 - 60.06] ▏10
Counts
min: 34.386 ns (0.00% GC); mean: 34.552 ns (0.00% GC); median: 34.502 ns (0.00% GC); max: 60.061 ns (0.00% GC).
julia> function turbo_foo_negpower!(dy, f, Ca, BBB, v_vz, dz)
N_tip = length(f)
length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
inv_2dz = 1/2dz
inv_dz² = 1/dz^2
dy .= 0
We = 0.1
@turbo for i = 2:N_tip-2
fᵢ₋₂ = i < 5 ? 0.0 : f[i-2]
fᵢ₋₁ = i < 4 ? 0.0 : f[i-1]
fᵢ = i == 2 ? 0.0 : f[i ]
fᵢ₊₁ = f[i+1]
fᵢ₊₂ = f[i+2]
dfᵢ₋₁ = (fᵢ - fᵢ₋₂)*inv_2dz # df/dz
# dfᵢ = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz # df/dz
dfᵢ₊₁ = (fᵢ₊₂ - fᵢ )*inv_2dz # df/dz
d2fᵢ₋₁ = (fᵢ - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz² # d²f/dz²
# d2fᵢ = (fᵢ₊₁ - 2fᵢ + fᵢ₋₁)*inv_dz² # d²f/dz²
d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ )*inv_dz² # d²f/dz²
pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) * (0.5*abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁)^(-3/2))
# pᵢ = i == 2 ? 0.0 : ((2-d2fᵢ )*fᵢ + dfᵢ ^2) / (2*abs(1/4*dfᵢ ^2 + fᵢ )^(3/2))
pᵢ₊₁ = ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) * (0.5*abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁)^(-3/2))
dy[i] = -BBB[i]
viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
# viscᵢ = i == 2 ? 0.0 : Ca[i ] * (dfᵢ - BBB[i ]) / fᵢ
viscᵢ₊₁ = Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz # dP/dz
dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
end
return dy
end
turbo_foo_negpower! (generic function with 1 method)
julia> foo!(dy1, f, Ca, BBB, v_vz, dz) ≈ turbo_foo_negpower!(dy2, f, Ca, BBB, v_vz, dz)
true
julia> @benchmark turbo_foo_negpower!($dy2, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 990; memory estimate: 0 bytes; allocs estimate: 0
ns
(41.93 - 42.41] ██████████████████████████████ 9500
(42.41 - 42.89] ▎60
(42.89 - 43.37] ▍86
(43.37 - 43.85] █309
(43.85 - 44.33] ▏20
(44.33 - 44.81] ▏5
(44.81 - 45.29] ▏3
(45.29 - 45.77] ▏1
(45.77 - 46.25] ▏1
(46.25 - 46.73] ▏2
(46.73 - 47.21] 0
(47.21 - 47.69] ▏2
(47.69 - 48.17] 0
(48.17 - 48.65] ▏1
(48.65 - 71.43] ▏10
Counts
min: 41.925 ns (0.00% GC); mean: 42.317 ns (0.00% GC); median: 42.251 ns (0.00% GC); max: 71.434 ns (0.00% GC). Relative performance will vary by CPU, which is why it'd be good for LoopVectorization to perform the optimization (i.e., it can pick using architectural details). Note that LoopVectorization substitutes julia> @inline mypow(x,y) = x^y
julia> function turbo_foo_negmypower!(dy, f, Ca, BBB, v_vz, dz)
N_tip = length(f)
length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
inv_2dz = 1/2dz
inv_dz² = 1/dz^2
dy .= 0
We = 0.1
@turbo for i = 2:N_tip-2
fᵢ₋₂ = i < 5 ? 0.0 : f[i-2]
fᵢ₋₁ = i < 4 ? 0.0 : f[i-1]
fᵢ = i == 2 ? 0.0 : f[i ]
fᵢ₊₁ = f[i+1]
fᵢ₊₂ = f[i+2]
dfᵢ₋₁ = (fᵢ - fᵢ₋₂)*inv_2dz # df/dz
# dfᵢ = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz # df/dz
dfᵢ₊₁ = (fᵢ₊₂ - fᵢ )*inv_2dz # df/dz
d2fᵢ₋₁ = (fᵢ - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz² # d²f/dz²
# d2fᵢ = (fᵢ₊₁ - 2fᵢ + fᵢ₋₁)*inv_dz² # d²f/dz²
d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ )*inv_dz² # d²f/dz²
pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) * (0.5*mypow(abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁),(-3/2)))
pᵢ₊₁ = ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) * (0.5*mypow(abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁),(-3/2)))
dy[i] = -BBB[i]
viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
# viscᵢ = i == 2 ? 0.0 : Ca[i ] * (dfᵢ - BBB[i ]) / fᵢ
viscᵢ₊₁ = Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz # dP/dz
dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
end
return dy
end
turbo_foo_negmypower! (generic function with 1 method)
julia> foo!(dy1, f, Ca, BBB, v_vz, dz) ≈ turbo_foo_negmypower!(dy2, f, Ca, BBB, v_vz, dz)
true
julia> @benchmark turbo_foo_negmypower!($dy2, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 943; memory estimate: 0 bytes; allocs estimate: 0
ns
(95.5 - 95.99 ] ▏4
(95.99 - 96.48 ] █▌251
(96.48 - 96.96 ] ███████████████▊2856
(96.96 - 97.45 ] ██████████████████████████████5446
(97.45 - 97.94 ] ██▊479
(97.94 - 98.43 ] ███▋640
(98.43 - 98.91 ] █▍236
(98.91 - 99.4 ] ▍65
(99.4 - 99.89 ] ▏6
(99.89 - 100.37] 0
(100.37 - 100.86] ▏1
(100.86 - 101.35] ▏3
(101.35 - 101.84] 0
(101.84 - 102.32] ▏3
(102.32 - 139.71] ▏10
Counts
min: 95.504 ns (0.00% GC); mean: 97.203 ns (0.00% GC); median: 97.101 ns (0.00% GC); max: 139.710 ns (0.00% GC). While much slower than any of the other pow : max 0.7971188400935709 at x = 94.88, 80.62 : mean 0.12711344263659394 |
I just tested the error of That's a SLEEFPirates bug...I'll try and improve error. |
Thanks. I would suggest adding something like to the documentation for |
(Closing as a duplicate.) |
For the function
foo!
below, adding@turbo
to thefor
loop changes the results completely (by more than a factor of 2).In particular,
produces
Notice that one of the outputs is changed from
-4997.16
to-12533.1
.This is with Julia 1.6 and LoopVectorization v0.12.25 on macOS (x86_64-apple-darwin18.7.0) with an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz.
The text was updated successfully, but these errors were encountered: