Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect results from turbo loop #274

Closed
stevengj opened this issue May 28, 2021 · 4 comments
Closed

incorrect results from turbo loop #274

stevengj opened this issue May 28, 2021 · 4 comments

Comments

@stevengj
Copy link

stevengj commented May 28, 2021

For the function foo! below, adding @turbo to the for loop changes the results completely (by more than a factor of 2).

In particular,

N = 5
f = [0.1,0.2,0.3,0.4,0.5]
Ca = [0.1,0.2,0.3,0.4,0.5]
BBB = [0.1,0.2,0.3,0.4,0.5]
v_vz = [0.1,0.2,0.3,0.4,0.5]
dz = 0.01
dy1 = zeros(2N)
dy2 = zeros(2N)
foo!(dy1, f, Ca, BBB, v_vz, dz)
turbo_foo!(dy2, f, Ca, BBB, v_vz, dz)
display([dy1 dy2])

produces

10×2 Matrix{Float64}:
      0.0        0.0
     -0.2       -0.2
     -0.3       -0.3
      0.0        0.0
      0.0        0.0
      0.0        0.0
 -10099.2   -10099.2
  -4997.16  -12533.1
      0.0        0.0
      0.0        0.0

Notice that one of the outputs is changed from -4997.16 to -12533.1.

This is with Julia 1.6 and LoopVectorization v0.12.25 on macOS (x86_64-apple-darwin18.7.0) with an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz.

using LoopVectorization

function foo!(dy, f, Ca, BBB, v_vz, dz)
    N_tip = length(f)
    length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
    
    inv_2dz = 1/2dz
    inv_dz² = 1/dz^2
    dy .= 0
    We = 0.1
    fᵢ = viscᵢ₋₁ = viscᵢ = pᵢ₋₁ = pᵢ = 0.0
    for i = 2:N_tip-2
        fᵢ₊₁ = f[i+1]
        dfᵢ₊₁ = (f[i+2] - fᵢ)*inv_2dz   # df/dz
        d2fᵢ₊₁ = (f[i+2] - 2fᵢ₊₁ + fᵢ)*inv_dz²   # d²f/dz²
        pᵢ₊₁ = ((2-d2fᵢ₊₁)*fᵢ₊₁+dfᵢ₊₁^2)/(2*abs(1/4*dfᵢ₊₁^2+fᵢ₊₁)^(3/2))
        
        dy[i] = -BBB[i]
        viscᵢ₊₁ = Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
        dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
        dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz   # dP/dz
        dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
        
        # change to i+1 values for next iteration:
        fᵢ = fᵢ₊₁
        pᵢ₋₁ = pᵢ
        pᵢ = pᵢ₊₁
        viscᵢ₋₁ = viscᵢ
        viscᵢ = viscᵢ₊₁
    end
    return dy
end

# identical except for @turbo annotation:
function turbo_foo!(dy, f, Ca, BBB, v_vz, dz)
    N_tip = length(f)
    length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
    
    inv_2dz = 1/2dz
    inv_dz² = 1/dz^2
    dy .= 0
    We = 0.1
    fᵢ = viscᵢ₋₁ = viscᵢ = pᵢ₋₁ = pᵢ = 0.0
    @turbo for i = 2:N_tip-2
        fᵢ₊₁ = f[i+1]
        dfᵢ₊₁ = (f[i+2] - fᵢ)*inv_2dz   # df/dz
        d2fᵢ₊₁ = (f[i+2] - 2fᵢ₊₁ + fᵢ)*inv_dz²   # d²f/dz²
        pᵢ₊₁ = ((2-d2fᵢ₊₁)*fᵢ₊₁+dfᵢ₊₁^2)/(2*abs(1/4*dfᵢ₊₁^2+fᵢ₊₁)^(3/2))
        
        dy[i] = -BBB[i]
        viscᵢ₊₁ = Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
        dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
        dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz   # dP/dz
        dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
        
        # change to i+1 values for next iteration:
        fᵢ = fᵢ₊₁
        pᵢ₋₁ = pᵢ
        pᵢ = pᵢ₊₁
        viscᵢ₋₁ = viscᵢ
        viscᵢ = viscᵢ₊₁
    end
    return dy
end
@chriselrod
Copy link
Member

Duplicate of: #266

LoopVectorization's current model for loops assumes all iterations are independent and can be reordered, with a special case added for constant offsets between loads and stores that are <= the step size of the loop (so long as there is an alternative loop to reorder instead):

for i in 1:I, j in J # I loop has a step of 1
   a, b = f(x[i], x[i+1])
   x[i] = a
   x[i+1] = b # we're loading and storing from `x` with offsets `i, i+1`; (i + 1 - i) <= step(1:I) so it will reorder `j` loop instead.
end

If it weren't for the j loop, currently it'd give incorrect answers. I should probably add a way to get it to run the fallback loop instead.

But my priority / long term plan here involves writing a new loop model that'll explicitly include a dependency graph that'll indicate constraints on iteration order and available transformations.
The primary goal is to have it correctly optimize naive triple loop implementations of things like solving triangular systems of equations or performing Cholesky decompositions.

#266 and this issue are more examples it should be able to correctly handle.
Eliminating the loop carried dependencies gives us:

function turbo_foo!(dy, f, Ca, BBB, v_vz, dz)
  N_tip = length(f)
  length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())
    
  inv_2dz = 1/2dz
  inv_dz² = 1/dz^2
  dy .= 0
  We = 0.1
  @turbo for i = 2:N_tip-2
    fᵢ₋₂    = i < 5 ?  0.0 : f[i-2]
    fᵢ₋₁    = i < 4 ?  0.0 : f[i-1]
    fᵢ      = i == 2 ? 0.0 : f[i  ]
    fᵢ₊₁    =                f[i+1]
    fᵢ₊₂    =                f[i+2]

    dfᵢ₋₁  = (fᵢ   - fᵢ₋₂)*inv_2dz   # df/dz
    # dfᵢ    = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz   # df/dz
    dfᵢ₊₁  = (fᵢ₊₂ - fᵢ  )*inv_2dz   # df/dz

    d2fᵢ₋₁ = (fᵢ   - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz²   # d²f/dz²
    # d2fᵢ   = (fᵢ₊₁ - 2fᵢ   + fᵢ₋₁)*inv_dz²   # d²f/dz²
    d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ  )*inv_dz²   # d²f/dz²

    pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) / (2*abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁)^(3/2))
    # pᵢ   = i == 2 ? 0.0 : ((2-d2fᵢ  )*fᵢ   + dfᵢ ^2) / (2*abs(1/4*dfᵢ  ^2 + fᵢ  )^(3/2))
    pᵢ₊₁ =               ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) / (2*abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁)^(3/2))

    dy[i] = -BBB[i]

    viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
    # viscᵢ   = i == 2 ? 0.0 : Ca[i  ] * (dfᵢ   - BBB[i  ]) / fᵢ
    viscᵢ₊₁ =                Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁
    
    dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
    dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz   # dP/dz
    dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
  end
  return dy
end

This works:

julia> foo!(dy1, f, Ca, BBB, v_vz, dz)  turbo_foo!(dy2, f, Ca, BBB, v_vz, dz)
true

julia> display([dy1 dy2])
10×2 Matrix{Float64}:
      0.0        0.0
     -0.2       -0.2
     -0.3       -0.3
      0.0        0.0
      0.0        0.0
      0.0        0.0
 -10099.2   -10099.2
  -4997.16   -4997.16
      0.0        0.0
      0.0        0.0

and yields a reasonable speedup:

julia> using BenchmarkHistograms

julia> @benchmark foo!($dy1, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 755; memory estimate: 0 bytes; allocs estimate: 0
ns

 (169.8 - 171.3]  ▏1
 (171.3 - 172.8]   0
 (172.8 - 174.3]   0
 (174.3 - 175.8]   0
 (175.8 - 177.3]   0
 (177.3 - 178.8]   0
 (178.8 - 180.3]  ██████████████████████████████ 8613
 (180.3 - 181.8]  ████▌1269
 (181.8 - 183.4]  ▍91
 (183.4 - 184.9]  ▏6
 (184.9 - 186.4]  ▏5
 (186.4 - 187.9]  ▏3
 (187.9 - 189.4]   0
 (189.4 - 190.9]  ▏2
 (190.9 - 202.5]  ▏10

                  Counts

min: 169.793 ns (0.00% GC); mean: 180.198 ns (0.00% GC); median: 180.020 ns (0.00% GC); max: 202.498 ns (0.00% GC).

julia> @benchmark turbo_foo!($dy1, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 987; memory estimate: 0 bytes; allocs estimate: 0
ns

 (51.61 - 51.86]  ██████████████████████████████ 7540
 (51.86 - 52.12]  ███████▊1925
 (52.12 - 52.38]   0
 (52.38 - 52.64]   0
 (52.64 - 52.9 ]  ▊160
 (52.9  - 53.16]  █▎310
 (53.16 - 53.42]  ▏29
 (53.42 - 53.67]  ▏10
 (53.67 - 53.93]  ▏8
 (53.93 - 54.19]  ▏3
 (54.19 - 54.45]  ▏1
 (54.45 - 54.71]  ▏1
 (54.71 - 54.97]  ▏1
 (54.97 - 55.23]  ▏2
 (55.23 - 74.64]  ▏10

                  Counts

min: 51.606 ns (0.00% GC); mean: 51.870 ns (0.00% GC); median: 51.795 ns (0.00% GC); max: 74.639 ns (0.00% GC).

By far the slowest part of this loop is x / y^(3/2). Two options to speed this up are to replace it with:

  1. x / sqrt(y)^3 -- sqrt with literal integer power.
  2. x * y^(-3/2) -- fold the division into the power.
    LoopVectorization should perhaps perform one of these transforms automatically, but for now:

To speed this up further, we can replace abs(x)^(3/2) with sqrt(abs(x))^3:

julia> function turbo_foo_root!(dy, f, Ca, BBB, v_vz, dz)
         N_tip = length(f)
         length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())

         inv_2dz = 1/2dz
         inv_dz² = 1/dz^2
         dy .= 0
         We = 0.1
         @turbo for i = 2:N_tip-2
           fᵢ₋₂    = i < 5 ?  0.0 : f[i-2]
           fᵢ₋₁    = i < 4 ?  0.0 : f[i-1]
           fᵢ      = i == 2 ? 0.0 : f[i  ]
           fᵢ₊₁    =                f[i+1]
           fᵢ₊₂    =                f[i+2]

           dfᵢ₋₁  = (fᵢ   - fᵢ₋₂)*inv_2dz   # df/dz
           # dfᵢ    = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz   # df/dz
           dfᵢ₊₁  = (fᵢ₊₂ - fᵢ  )*inv_2dz   # df/dz

           d2fᵢ₋₁ = (fᵢ   - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz²   # d²f/dz²
           # d2fᵢ   = (fᵢ₊₁ - 2fᵢ   + fᵢ₋₁)*inv_dz²   # d²f/dz²
           d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ  )*inv_dz²   # d²f/dz²

           pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) / (2*sqrt(abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁))^(3))
           # pᵢ   = i == 2 ? 0.0 : ((2-d2fᵢ  )*fᵢ   + dfᵢ ^2) / (2*abs(1/4*dfᵢ  ^2 + fᵢ  )^(3/2))
           pᵢ₊₁ =               ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) / (2*sqrt(abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁))^(3))

           dy[i] = -BBB[i]

           viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
           # viscᵢ   = i == 2 ? 0.0 : Ca[i  ] * (dfᵢ   - BBB[i  ]) / fᵢ
           viscᵢ₊₁ =                Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁

           dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
           dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz   # dP/dz
           dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
         end
         return dy
       end
turbo_foo_root! (generic function with 1 method)

julia> foo!(dy1, f, Ca, BBB, v_vz, dz)  turbo_foo_root!(dy2, f, Ca, BBB, v_vz, dz)
true

julia> @benchmark turbo_foo_root!($dy2, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 993; memory estimate: 0 bytes; allocs estimate: 0
ns

 (34.39 - 34.6 ]  ██████████████████████████████ 9639
 (34.6  - 34.81]  ▏4
 (34.81 - 35.03]   0
 (35.03 - 35.24]   0
 (35.24 - 35.46]  ▏4
 (35.46 - 35.67]  ▉259
 (35.67 - 35.89]  ▎57
 (35.89 - 36.1 ]  ▏11
 (36.1  - 36.31]  ▏9
 (36.31 - 36.53]  ▏1
 (36.53 - 36.74]  ▏2
 (36.74 - 36.96]  ▏1
 (36.96 - 37.17]  ▏2
 (37.17 - 37.38]  ▏1
 (37.38 - 60.06]  ▏10

                  Counts

min: 34.386 ns (0.00% GC); mean: 34.552 ns (0.00% GC); median: 34.502 ns (0.00% GC); max: 60.061 ns (0.00% GC).

julia> function turbo_foo_negpower!(dy, f, Ca, BBB, v_vz, dz)
         N_tip = length(f)
         length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())

         inv_2dz = 1/2dz
         inv_dz² = 1/dz^2
         dy .= 0
         We = 0.1
         @turbo for i = 2:N_tip-2
           fᵢ₋₂    = i < 5 ?  0.0 : f[i-2]
           fᵢ₋₁    = i < 4 ?  0.0 : f[i-1]
           fᵢ      = i == 2 ? 0.0 : f[i  ]
           fᵢ₊₁    =                f[i+1]
           fᵢ₊₂    =                f[i+2]

           dfᵢ₋₁  = (fᵢ   - fᵢ₋₂)*inv_2dz   # df/dz
           # dfᵢ    = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz   # df/dz
           dfᵢ₊₁  = (fᵢ₊₂ - fᵢ  )*inv_2dz   # df/dz

           d2fᵢ₋₁ = (fᵢ   - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz²   # d²f/dz²
           # d2fᵢ   = (fᵢ₊₁ - 2fᵢ   + fᵢ₋₁)*inv_dz²   # d²f/dz²
           d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ  )*inv_dz²   # d²f/dz²

           pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) * (0.5*abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁)^(-3/2))
           # pᵢ   = i == 2 ? 0.0 : ((2-d2fᵢ  )*fᵢ   + dfᵢ ^2) / (2*abs(1/4*dfᵢ  ^2 + fᵢ  )^(3/2))
           pᵢ₊₁ =               ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) * (0.5*abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁)^(-3/2))

           dy[i] = -BBB[i]

           viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
           # viscᵢ   = i == 2 ? 0.0 : Ca[i  ] * (dfᵢ   - BBB[i  ]) / fᵢ
           viscᵢ₊₁ =                Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁

           dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
           dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz   # dP/dz
           dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
         end
         return dy
       end
turbo_foo_negpower! (generic function with 1 method)

julia> foo!(dy1, f, Ca, BBB, v_vz, dz)  turbo_foo_negpower!(dy2, f, Ca, BBB, v_vz, dz)
true

julia> @benchmark turbo_foo_negpower!($dy2, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 990; memory estimate: 0 bytes; allocs estimate: 0
ns

 (41.93 - 42.41]  ██████████████████████████████ 9500
 (42.41 - 42.89]  ▎60
 (42.89 - 43.37]  ▍86
 (43.37 - 43.85]  █309
 (43.85 - 44.33]  ▏20
 (44.33 - 44.81]  ▏5
 (44.81 - 45.29]  ▏3
 (45.29 - 45.77]  ▏1
 (45.77 - 46.25]  ▏1
 (46.25 - 46.73]  ▏2
 (46.73 - 47.21]   0
 (47.21 - 47.69]  ▏2
 (47.69 - 48.17]   0
 (48.17 - 48.65]  ▏1
 (48.65 - 71.43]  ▏10

                  Counts

min: 41.925 ns (0.00% GC); mean: 42.317 ns (0.00% GC); median: 42.251 ns (0.00% GC); max: 71.434 ns (0.00% GC).

Relative performance will vary by CPU, which is why it'd be good for LoopVectorization to perform the optimization (i.e., it can pick using architectural details).

Note that LoopVectorization substitutes ^ for @inline pow_fast(x,y) = exp2(log2(x)*y), if you'd prefer greater accuracy:

julia> @inline mypow(x,y) = x^y

julia> function turbo_foo_negmypower!(dy, f, Ca, BBB, v_vz, dz)
         N_tip = length(f)
         length(dy) == 2N_tip && length(Ca) == length(BBB) == length(v_vz) == N_tip || throw(DimensionMismatch())

         inv_2dz = 1/2dz
         inv_dz² = 1/dz^2
         dy .= 0
         We = 0.1
         @turbo for i = 2:N_tip-2
           fᵢ₋₂    = i < 5 ?  0.0 : f[i-2]
           fᵢ₋₁    = i < 4 ?  0.0 : f[i-1]
           fᵢ      = i == 2 ? 0.0 : f[i  ]
           fᵢ₊₁    =                f[i+1]
           fᵢ₊₂    =                f[i+2]

           dfᵢ₋₁  = (fᵢ   - fᵢ₋₂)*inv_2dz   # df/dz
           # dfᵢ    = (fᵢ₊₁ - fᵢ₋₁)*inv_2dz   # df/dz
           dfᵢ₊₁  = (fᵢ₊₂ - fᵢ  )*inv_2dz   # df/dz

           d2fᵢ₋₁ = (fᵢ   - 2fᵢ₋₁ + fᵢ₋₂)*inv_dz²   # d²f/dz²
           # d2fᵢ   = (fᵢ₊₁ - 2fᵢ   + fᵢ₋₁)*inv_dz²   # d²f/dz²
           d2fᵢ₊₁ = (fᵢ₊₂ - 2fᵢ₊₁ + fᵢ  )*inv_dz²   # d²f/dz²

           pᵢ₋₁ = i < 4 ? 0.0 : ((2-d2fᵢ₋₁)*fᵢ₋₁ + dfᵢ₋₁^2) * (0.5*mypow(abs(1/4*dfᵢ₋₁^2 + fᵢ₋₁),(-3/2)))
           pᵢ₊₁ =               ((2-d2fᵢ₊₁)*fᵢ₊₁ + dfᵢ₊₁^2) * (0.5*mypow(abs(1/4*dfᵢ₊₁^2 + fᵢ₊₁),(-3/2)))

           dy[i] = -BBB[i]

           viscᵢ₋₁ = i < 4 ? 0.0 : Ca[i-1] * (dfᵢ₋₁ - BBB[i-1]) / fᵢ₋₁
           # viscᵢ   = i == 2 ? 0.0 : Ca[i  ] * (dfᵢ   - BBB[i  ]) / fᵢ
           viscᵢ₊₁ =                Ca[i+1] * (dfᵢ₊₁ - BBB[i+1]) / fᵢ₊₁

           dvisc_dz = (viscᵢ₊₁-viscᵢ₋₁)*inv_2dz
           dp = (pᵢ₊₁-pᵢ₋₁)*inv_2dz   # dP/dz
           dy[N_tip+i] = -v_vz[i] - (dp + dvisc_dz)/We
         end
         return dy
       end
turbo_foo_negmypower! (generic function with 1 method)

julia> foo!(dy1, f, Ca, BBB, v_vz, dz)  turbo_foo_negmypower!(dy2, f, Ca, BBB, v_vz, dz)
true

julia> @benchmark turbo_foo_negmypower!($dy2, $f, $Ca, $BBB, $v_vz, $dz)
samples: 10000; evals/sample: 943; memory estimate: 0 bytes; allocs estimate: 0
ns

 (95.5   - 95.99 ]  ▏4
 (95.99  - 96.48 ]  █▌251
 (96.48  - 96.96 ]  ███████████████▊2856
 (96.96  - 97.45 ]  ██████████████████████████████5446
 (97.45  - 97.94 ]  ██▊479
 (97.94  - 98.43 ]  ███▋640
 (98.43  - 98.91 ]  █▍236
 (98.91  - 99.4  ]  ▍65
 (99.4   - 99.89 ]  ▏6
 (99.89  - 100.37]   0
 (100.37 - 100.86]  ▏1
 (100.86 - 101.35]  ▏3
 (101.35 - 101.84]   0
 (101.84 - 102.32]  ▏3
 (102.32 - 139.71]  ▏10

                  Counts

min: 95.504 ns (0.00% GC); mean: 97.203 ns (0.00% GC); median: 97.101 ns (0.00% GC); max: 139.710 ns (0.00% GC).

While much slower than any of the other @turbo methods we considered so far, it's still much faster than foo! (on this computer with AVX512), and it should be accurate to 1 ULP error, according to SLEEFPirate.jl's tests:

pow               : max 0.7971188400935709 at x = 94.88, 80.62                    : mean 0.12711344263659394

@chriselrod
Copy link
Member

I just tested the error of pow_fast(x,y) = exp2(log2(x)*y) and it is unacceptably high.

That's a SLEEFPirates bug...I'll try and improve error.

@stevengj
Copy link
Author

stevengj commented May 29, 2021

LoopVectorization's current model for loops assumes all iterations are independent and can be reordered

Thanks. I would suggest adding something like to the documentation for @turbo. I just noticed now that it is in the README, but I was looking in the @turbo docs for restrictions like this and didn't see any.

@stevengj
Copy link
Author

(Closing as a duplicate.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants