# Exercise: Performance Optimization 1

Optimize the following code: reduce both the runtime and the number of allocations as much as you can.

In [1]:
function work!(A, b, c)
    D = zeros(N,N)
    for i in 1:N
        D = b[i]*c*A
        b[i] = sum(D)
    end
    return b
end

work! (generic function with 1 method)

The following data is **fixed** and **not supposed to be modified**!

In [2]:
using Random
Random.seed!(42)

N = 1000
A = rand(N,N)
b = rand(N)
c = 1.23

const b_result = work!(A, b, c);

You can compare against `b_result` to test your implementation(s):

In [3]:
using Test 

@test work!(A, b, c) ≈ b_result

[32m[1mTest Passed[22m[39m
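A caveat worth checking before relying on this test: `work!` mutates `b` and returns that same array, so `b_result` is bound to `b` itself and the comparison above ends up comparing an array with itself. The sketch below illustrates the effect on small, hypothetical inputs (`work_small!`, `wrong!`, `A_demo`, `b_demo`, and `c_demo` are illustrative names, not part of the exercise). It also shows one possible alias-free check that works on copies and leaves the fixed data untouched.

```julia
using Test

# Hypothetical stand-in with the same semantics as `work!`, on small inputs.
function work_small!(A, b, c)
    for i in eachindex(b)
        b[i] = sum(b[i] * c * A)
    end
    return b
end

# A deliberately wrong candidate: it just zeroes `b` and returns it.
wrong!(A, b, c) = fill!(b, 0.0)

A_demo = rand(20, 20)
b_demo = rand(20)
c_demo = 1.23

# Aliased reference: the returned array *is* the input array.
b_alias = copy(b_demo)
ref_alias = work_small!(A_demo, b_alias, c_demo)
@test ref_alias === b_alias
@test wrong!(A_demo, b_alias, c_demo) ≈ ref_alias    # passes although `wrong!` is wrong

# Independent reference: computed on a copy, candidate run on a fresh copy.
ref_copy = work_small!(A_demo, copy(b_demo), c_demo)
@test !(wrong!(A_demo, copy(b_demo), c_demo) ≈ ref_copy)
```

Comparing against a reference computed from `copy(b)` is only one way to avoid the aliasing. The key point is that the reference and the candidate must not share storage.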

You can benchmark as follows:

In [4]:
using BenchmarkTools

@benchmark work!($A, $b, $c)

BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.249 s …   2.433 s  ┊ GC (min … max): 7.06% … 7.28%
 Time  (median):     2.323 s              ┊ GC (median):    7.63%
 Time  (mean ± σ):   2.335 s ± 92.728 ms  ┊ GC (mean ± σ):  7.40% ± 0.40%
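Because `work!` overwrites `b`, every benchmark evaluation starts from the output of the previous one. If that matters for a measurement, one option is to hand each sample a fresh copy through the `setup` and `evals` options of BenchmarkTools. The following is a minimal sketch on small, hypothetical inputs (`work_small!`, `A_demo`, `b_demo`, and `c_demo` are assumed names), not a replacement for the benchmark above:

```julia
using BenchmarkTools

# Hypothetical stand-in with the same semantics as `work!`, on small inputs.
function work_small!(A, b, c)
    for i in eachindex(b)
        b[i] = sum(b[i] * c * A)
    end
    return b
end

A_demo = rand(50, 50)
b_demo = rand(50)
c_demo = 1.23
b_before = copy(b_demo)

# `setup` runs before each sample and is not timed; `evals=1` makes sure
# every fresh copy is used for exactly one evaluation.
trial = @benchmark work_small!($A_demo, b_fresh, $c_demo) setup=(b_fresh = copy($b_demo)) evals=1 samples=100

# The original vector is never passed to the mutating function.
@assert b_demo == b_before
```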

## Your Optimizations

Your optimized variants go here!

**Hints** (hopefully):
* Is the function self-contained?
* Is it efficient with respect to allocations?
* A >10000x speedup and 0 allocations are possible on Noctua 2 😉
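To make the allocation hint concrete, here is a small sketch of how allocations could be measured with `@allocated`. The helper names (`scaled_sum_temp`, `scaled_sum_lazy`, `measure_bytes`) and the 100×100 test matrix are assumptions for illustration only. The first helper materializes the scaled matrix before summing, while the second scales element by element inside `sum`.

```julia
# Hypothetical helpers on a small matrix; the names are illustrative only.
scaled_sum_temp(A, s) = sum(s * A)            # builds the temporary matrix `s * A` first
scaled_sum_lazy(A, s) = sum(x -> s * x, A)    # scales each element on the fly

# Measure inside a function so that global-scope overhead is not counted.
measure_bytes(f, A, s) = @allocated f(A, s)

A_demo = rand(100, 100)

# Warm-up calls, so that compilation does not show up in the measurement.
measure_bytes(scaled_sum_temp, A_demo, 2.0)
measure_bytes(scaled_sum_lazy, A_demo, 2.0)

bytes_temp = measure_bytes(scaled_sum_temp, A_demo, 2.0)
bytes_lazy = measure_bytes(scaled_sum_lazy, A_demo, 2.0)

println("with temporary:    ", bytes_temp, " bytes")
println("without temporary: ", bytes_lazy, " bytes")
```

The first number should be at least the 80,000 bytes needed for the temporary `Float64` matrix. The second is expected to be far smaller, typically zero, though the exact values can depend on the Julia version.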

In [5]:
# Your code ....
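function work2!(A, b, c)
    N = length(b)        # take the size from the argument instead of the global N
    sum_A = sum(A)       # loop-invariant, so compute it only once
    for i in 1:N
        # sum(b[i]*c*A) equals b[i]*c*sum(A) up to rounding; no temporary matrix needed
        b[i] = b[i] * c * sum_A
    end
    return b
end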

@test work2!(A, b, c) ≈ b_result

[32m[1mTest Passed[22m[39m
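The reasoning behind `work2!` can be written down in one line. Writing $A_{jk}$ for the entries of `A` and $b_i$ for the entries of `b` (notation introduced here for illustration), the scalar factors can be pulled out of the sum:

$$
\sum_{j=1}^{N} \sum_{k=1}^{N} \left( b_i \, c \, A_{jk} \right) \;=\; b_i \, c \sum_{j=1}^{N} \sum_{k=1}^{N} A_{jk}
$$

The double sum on the right does not depend on $i$, so it can be computed once outside the loop. This reduces the work from $N$ passes over an $N \times N$ matrix to a single pass plus $N$ scalar multiplications. In floating-point arithmetic the two sides agree only up to rounding, which is consistent with the test using `≈` rather than `==`.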

In [6]:
@benchmark work2!($A, $b, $c)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  122.917 μs … 530.269 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     128.950 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   133.257 μs ±  14.451 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [7]:
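# Speedup estimate: roughly 2.2 s for work! vs. roughly 123 μs for work2! (minimum times from above)
2.2 / (123 * 1e-6)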

17886.178861788623
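Instead of typing the timings in by hand, the ratio could also be computed from the benchmark results themselves. The sketch below assumes two hypothetical, non-mutating stand-in functions (`slow_variant`, `fast_variant`) and a small matrix `A_demo`. It uses `minimum` on each trial and `time` on the resulting estimate, which is reported in nanoseconds.

```julia
using BenchmarkTools

# Hypothetical, non-mutating stand-ins for a slow and a fast variant.
slow_variant(A, s) = sum(s * A)    # allocates a temporary matrix
fast_variant(A, s) = s * sum(A)    # no temporary

A_demo = rand(200, 200)

trial_slow = @benchmark slow_variant($A_demo, 1.23) samples=50
trial_fast = @benchmark fast_variant($A_demo, 1.23) samples=50

# Both times are in nanoseconds, so the units cancel in the ratio.
speedup = time(minimum(trial_slow)) / time(minimum(trial_fast))
println("speedup ≈ ", round(speedup; digits=1), "x")
```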