# Exercise: Performance Optimization 2

Optimize the following function.

In [1]:
function work!(A, B, v, N)
    val = zero(eltype(v))
    for i in 1:N
        for j in 1:N
            val = mod(v[i],256)
            A[i,j] = B[i,j] * (sin(val) * sin(val) - cos(val) * cos(val))
        end
    end
    return A
end

work! (generic function with 1 method)

The following data is **fixed** and **not supposed to be modified**!

In [2]:
using Random
Random.seed!(42)

const N = 4000
const A = zeros(N,N)
const B = rand(N,N)
const v = rand(Int, N);

const A_result = work!(A,B,v,N);

You can compare against `A_result` to test your implementation(s):

In [3]:
using Test

@test work!(A,B,v,N) ≈ A_result

Test Passed

You can benchmark as follows:

In [4]:
using BenchmarkTools
@btime work!($A, $B, $v, $N); # or use @benchmark for more information

  836.749 ms (0 allocations: 0 bytes)


## Your Optimizations

Your optimized variants go here!

**Hints** (hopefully):
* What is suboptimal about the code? What is it that you'd want to change (but can't directly)?
* Sometimes writing the code in a different way doesn't give direct speedups but enables further optimization.
* A >30x speedup should be possible on Noctua 2 😉

In [None]:
# Your code ...

In [11]:
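function work_opt!(A, B, v, N)
    # The factor depends only on the row index i, so compute it once per row
    # (N evaluations) instead of once per matrix element (N^2 evaluations).
    val = mod.(v,256)
    coef = (sin.(val).^2 .- cos.(val).^2)
    # Broadcasting the length-N vector `coef` scales row i of B by coef[i], in place.
    A .= B .* coef
    return A
end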

work_opt! (generic function with 1 method)
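
As a possible alternative to the broadcast version above, the same idea can be written with explicit loops. This is only a sketch under a few assumptions: the name `work_loop!` and the small test arrays are hypothetical and not part of the exercise, the inputs are assumed to be at least `N`×`N` (the code uses `@inbounds`), and the identity sin²(x) − cos²(x) = −cos(2x) replaces the two trigonometric calls, so results agree with the original only up to floating-point rounding. No timings were measured for this variant.

```julia
# Hypothetical loop-based variant; `work_loop!` is not part of the original exercise.
function work_loop!(A, B, v, N)
    # The coefficient depends only on the row index i: compute it N times, not N^2 times.
    # Uses sin(x)^2 - cos(x)^2 == -cos(2x).
    coef = Vector{Float64}(undef, N)
    @inbounds for i in 1:N
        coef[i] = -cos(2 * mod(v[i], 256))
    end
    # Julia arrays are column-major, so the row index i runs in the innermost loop
    # to access memory contiguously. Assumes A and B are at least N x N.
    @inbounds for j in 1:N
        for i in 1:N
            A[i, j] = B[i, j] * coef[i]
        end
    end
    return A
end

# Small usage example with hypothetical data (separate from the fixed A, B, v above)
n_small = 8
B_small = rand(n_small, n_small)
v_small = rand(Int, n_small)
A_small = work_loop!(zeros(n_small, n_small), B_small, v_small, n_small)
```

The design choice here is to trade one length-`N` temporary for a single pass over `A` and `B` in memory order; it can be checked against `A_result` with `@test` in the same way as the other variants.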

In [12]:
@test work_opt!(A,B,v,N) ≈ A_result

[32m[1mTest Passed[22m[39m

In [13]:
@benchmark work_opt!($A, $B, $v, $N)

BenchmarkTools.Trial: 211 samples with 1 evaluation.
 Range (min … max):  17.884 ms … 54.938 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.810 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.663 ms ±  4.662 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

## Bonus Question: Performance limit?

Look at your final optimized version of `work!`.

* What conceptually limits the performance: the compute capability or the memory transfer?
* Assuming that a single CPU-core in Noctua 2 can achieve a **maximal memory bandwidth of ~45 GB/s**, can you give a performance bound estimate, i.e. the minimal runtime that we could possibly hope to achieve?
  * Hint: how many flops are performed per iteration and how many bytes are transferred?
* How far off is your implementation from achieving the limit (in percent)?
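
One way to approach the estimate is sketched below. It rests on assumptions that are not stated in the exercise: the matrices hold `Float64` values (8 bytes each), the optimized kernel performs one multiplication per element while reading `B[i,j]` and writing `A[i,j]`, the traffic for the length-`N` coefficient vector is negligible, and 45 GB/s is read as 45e9 bytes per second. With roughly 1 flop per 16 bytes moved, the kernel would be limited by memory transfer rather than compute. The variable names are hypothetical, and the measured time is the minimum reported in the benchmark output above.

```julia
# Back-of-the-envelope memory bound; all names here are illustrative.
n = 4000                              # matrix dimension used in the exercise
bytes_per_elem = sizeof(Float64)      # assumption: Float64 entries, 8 bytes each
bandwidth = 45e9                      # assumption: 45 GB/s = 45e9 bytes per second

# Per element: one read of B[i,j] and one write of A[i,j]
traffic = 2 * bytes_per_elem * n^2    # total bytes moved
t_bound = traffic / bandwidth         # lower bound on the runtime in seconds

# More pessimistic assumption: stores first load the cache line (write-allocate),
# so A is effectively read once as well
traffic_wa = 3 * bytes_per_elem * n^2
t_bound_wa = traffic_wa / bandwidth

t_measured = 17.884e-3                # minimum time from the @benchmark output above
fraction = t_bound / t_measured       # share of the bound that was achieved

println("bound:                  ", round(t_bound * 1e3, digits = 2), " ms")
println("bound (write-allocate): ", round(t_bound_wa * 1e3, digits = 2), " ms")
println("achieved fraction:      ", round(100 * fraction, digits = 1), " %")
```

Under these assumptions the bound comes out at roughly 5.7 ms (about 8.5 ms with write-allocate), so the broadcast version would reach around a third of the limit; different assumptions about the traffic shift these numbers accordingly.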