# Optimization techniques
 - Loop unrolling
 - Loop interchange
 - Vectorizing operations
 - Loop unswitching
 - Loop nest optimization
 - Loop invariant code motion
 - Inlining
 - Machine-specific optimizations

## Why optimization techniques are important
- Optimized code often runs 5 or more times faster than unoptimized code (a factor of 50 is not unheard of)
- The compiler can't always recognize when it can apply an optimization technique, so it becomes your responsibility
- If the compiler is too aggressive in its optimizations, it can create bugs

## Anatomy of a loop

In [2]:
import numpy as np
outA=np.zeros(1000000)
inA=np.zeros(1000000)
for i in range(outA.shape[0]):
    outA[i]=inA[i]*inA[i]

i=0
do{
  if i >=n: break
  outA[i]=inA[i]*inA[i]
  i=i+1
}
  

## Loop unrolling
 

Unrolling the loop means fewer jumps and conditional checks per element processed. The trade-off is that the longer loop body uses more registers.


In [None]:
import numpy as np
outA=np.zeros(1000000)
inA=np.zeros(1000000)
# Step by 4; range(outA.shape[0],4) would be empty, so the start and stop must both be given
for i in range(0,outA.shape[0],4):
    outA[i]=inA[i]*inA[i]
    outA[i+1]=inA[i+1]*inA[i+1]
    outA[i+2]=inA[i+2]*inA[i+2]
    outA[i+3]=inA[i+3]*inA[i+3]


i=0
do{
  if i >=n: break
  outA[i]=inA[i]*inA[i]
  outA[i+1]=inA[i+1]*inA[i+1]
  outA[i+2]=inA[i+2]*inA[i+2]
  outA[i+3]=inA[i+3]*inA[i+3]
  i=i+4
}
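The unrolled loops above assume the array length is a multiple of 4. A minimal sketch of how an arbitrary length could be handled, using a hypothetical helper `square_unrolled` (not part of the original notebook) with a short cleanup loop for the leftover elements:

```python
import numpy as np

def square_unrolled(in_a, out_a):
    """Square each element, four per loop pass, then finish the remainder."""
    n = in_a.shape[0]
    main_end = n - n % 4  # largest multiple of 4 that is <= n
    for i in range(0, main_end, 4):
        out_a[i] = in_a[i] * in_a[i]
        out_a[i + 1] = in_a[i + 1] * in_a[i + 1]
        out_a[i + 2] = in_a[i + 2] * in_a[i + 2]
        out_a[i + 3] = in_a[i + 3] * in_a[i + 3]
    # Cleanup loop: at most 3 elements are left over
    for i in range(main_end, n):
        out_a[i] = in_a[i] * in_a[i]
    return out_a

in_a = np.arange(10, dtype=np.float64)  # 10 is not a multiple of 4
out_a = square_unrolled(in_a, np.zeros(10))
```

Compilers that unroll automatically generate a similar remainder loop on their own.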


## Loop interchange

In [19]:
%%time
import numpy as np
import numba
inA=np.zeros((1000,10000))
outA=np.zeros((1000,10000))
@numba.jit()
def badLoop(inA,outA):
    for i1 in range(outA.shape[1]):
        for i2 in range(outA.shape[0]):
            outA[i2,i1]=inA[i2,i1]*inA[i2,i1]

for i in range(10):
    badLoop(inA,outA)

CPU times: user 1.49 s, sys: 25.4 ms, total: 1.52 s
Wall time: 1.53 s


In [20]:
%%time
import numba
inA=np.zeros((1000,10000))
outA=np.zeros((1000,10000))
@numba.jit()
def goodLoop(inA,outA):
    for i2 in range(outA.shape[0]):
        for i1 in range(outA.shape[1]):
            outA[i2,i1]=inA[i2,i1]*inA[i2,i1]

for i in range(10):
    goodLoop(inA,outA)

CPU times: user 225 ms, sys: 20.9 ms, total: 246 ms
Wall time: 247 ms


When we ask for one value from memory, a whole cache line of successive memory locations is transferred into the L1 cache. NumPy arrays are stored row by row by default, so in goodLoop the inner loop walks along a cache line and uses every value it brings in. In badLoop the inner loop jumps a full row between accesses, so by the time we are ready to use the next value in a cache line, that line has left the L1 cache and is now in L2 or L3 cache. Reading from these caches has significantly higher latency, which reduces performance. For simple loops, most compilers recognize at high enough optimization levels that they can interchange the loops, but add even a little complexity to the loop and the compiler will follow the code as you wrote it.
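The memory layout behind this can be inspected directly. The sketch below prints the byte distance between neighbouring elements along each axis of the array used above; the helper `inner_step_bytes` is a hypothetical name introduced only for illustration.

```python
import numpy as np

in_a = np.zeros((1000, 10000))  # float64, row-major (C order) by default

# Bytes between neighbouring elements along axis 0 and axis 1
row_stride, col_stride = in_a.strides
print(row_stride, col_stride)  # 80000 8

def inner_step_bytes(arr, inner_axis):
    """Bytes skipped between consecutive iterations of the inner loop."""
    return arr.strides[inner_axis]

bad_step = inner_step_bytes(in_a, 0)   # inner loop over the first index
good_step = inner_step_bytes(in_a, 1)  # inner loop over the last index
```

Assuming a 64-byte cache line (typical, but machine dependent), an 8-byte step uses all eight values in each line, while an 80000-byte step touches a new line on every iteration.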

## Vectorization

We've covered vectorization a little in this class so far. The basic idea is that we can do multiple operations simultaneously as long as they are all the same operation. Vectorization works best on 32-bit floats, because twice as many of them fit in a vector register as 64-bit floats. To give numba more freedom to vectorize, change *fastmath* from False to True in the following function declaration.

In [6]:
%%time
import numba
inA=np.zeros((1000,10000),np.float32)
outA=np.zeros((1000,10000),np.float32)
@numba.njit(fastmath=False)
def goodLoop(inA,outA):
    for i2 in range(outA.shape[0]):
        for i1 in range(outA.shape[1]):
            outA[i2,i1]=inA[i2,i1]*inA[i2,i1]

for i in range(10):
    goodLoop(inA,outA)

CPU times: user 257 ms, sys: 23.6 ms, total: 280 ms
Wall time: 294 ms


When we move onto compiled languages we will learn about optimized function calls for vectorization operations.
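The remark above about 32-bit floats comes down to simple arithmetic on register width. A small sketch, assuming 256-bit vector registers (an assumption; the actual width depends on the CPU) and a hypothetical helper `lanes`:

```python
import numpy as np

REGISTER_BITS = 256  # assumption: 256-bit vector registers

def lanes(dtype, register_bits=REGISTER_BITS):
    """Number of values of the given dtype that fit in one vector register."""
    return register_bits // (np.dtype(dtype).itemsize * 8)

print(lanes(np.float32))  # 8
print(lanes(np.float64))  # 4
```

Under that assumption, one vector instruction processes eight float32 values but only four float64 values.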

## Loop unswitching

Compare the following two cells. The difference is the order of the for and if statements. The first cell prevents vectorization and requires an if evaluation at every iteration of the loop. You will probably see little or no difference in the timing of these two cells because the jit compiler makes the change itself.

In [42]:
%%time
import numba
inA=np.zeros((1000*10000),np.float32)
outA=np.zeros((1000*10000),np.float32)
@numba.njit(fastmath=True)
def deriv1(adj,inA,outA):
    for i in range(1,outA.shape[0]):
        if adj:
            inA[i-1]-=outA[i]
            inA[i]+=outA[i]
        else:
            outA[i]+=inA[i]-inA[i-1]
for i in range(10):
    deriv1(False,inA,outA)


CPU times: user 259 ms, sys: 20.8 ms, total: 280 ms
Wall time: 288 ms


In [43]:
%%time
import numba
inA=np.zeros((1000*10000),np.float32)
outA=np.zeros((1000*10000),np.float32)
@numba.njit(fastmath=True)
def deriv2(adj,inA,outA):
    if adj:
        for i in range(1,outA.shape[0]):
            inA[i-1]-=outA[i]
            inA[i]+=outA[i]
    else:
        for i in range(1,outA.shape[0]):
            outA[i]+=inA[i]-inA[i-1]
for i in range(10):
    deriv2(False,inA,outA)

CPU times: user 253 ms, sys: 17.3 ms, total: 270 ms
Wall time: 272 ms


The following example resulted in a 20x difference in performance:


Note the if conditional:

   do imx=down%ax%b, down%ax%n+down%ax%b-1
                jxd = imx - jhx
                jxu = imx + jhx
                if ( (jxd.lt.1) .or. (jxd.gt.size(wfld_d,1)) .or. &
                (jxu.lt.1) .or. (jxu.gt.size(wfld_u,1)) )cycle
                   dsliceR(imx-down%ax%b+1, imy-down%ay%b+1, ihx, ihy,ith) =&
                   dsliceR(imx-down%ax%b+1, imy-down%ay%b+1, ihx, ihy,ith) +&
                    wfld_d(jxd, jyd, iws) * wfld_u(jxu, jyu, iws)
            end do




With the conditional removed from the loop body (20x speedup):

   do imx=down%ax%b, down%ax%n+down%ax%b-1
                dsliceR(imx-down%ax%b+1, imy-down%ay%b+1, ihx, ihy,ith) = &
                  & dsliceR(imx-down%ax%b+1, imy-down%ay%b+1, ihx, ihy,ith) + &
                  & wfld_d(jjxd, jjyd, iws) * wfld_u(jjxu, jjyu, iws)
            end do
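The Fortran excerpt does not show how the bounds test was eliminated. One common way to do it, sketched here in Python with hypothetical function and variable names, is to compute the valid index range once before the loop so that the body needs no test at all:

```python
import numpy as np

def correlate_checked(d, u, h):
    """Bounds test inside the loop body (the slow pattern)."""
    n = d.shape[0]
    out = 0.0
    for i in range(n):
        jd = i - h
        ju = i + h
        if jd < 0 or jd >= n or ju < 0 or ju >= n:
            continue
        out += d[jd] * u[ju]
    return out

def correlate_hoisted(d, u, h):
    """Same sum, with the valid range of i computed once before the loop."""
    n = d.shape[0]
    start = abs(h)      # smallest i with both i-h and i+h >= 0
    stop = n - abs(h)   # one past the largest i with both indices < n
    out = 0.0
    for i in range(start, stop):
        out += d[i - h] * u[i + h]
    return out

rng = np.random.default_rng(0)
d = rng.random(50)
u = rng.random(50)
```

Both functions visit exactly the same index pairs; the second simply never enters iterations that the first would skip.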
         

## Loop nest optimization


In [49]:
%%time
in1A=np.zeros((1000,1000),np.float32)
in2A=np.zeros((1000,1000),np.float32)
outA=np.zeros((1000,1000),np.float32)
@numba.jit()
def normalLoop(in1,in2,outA):
    # Use the arguments rather than the global arrays so the function works on whatever is passed in
    for i3 in range(outA.shape[0]):
        for i2 in range(outA.shape[1]):
            outA[i3,i2]=0
            for i1 in range(in2.shape[1]):
                outA[i3,i2]+=in2[i3,i1]*in1[i1,i2]
for i in range(5):
    normalLoop(in1A,in2A,outA)

CPU times: user 10 s, sys: 181 ms, total: 10.2 s
Wall time: 10.4 s


In [50]:
%%time
in1A=np.zeros((1000,1000),np.float32)
in2A=np.zeros((1000,1000),np.float32)
outA=np.zeros((1000,1000),np.float32)
@numba.jit()
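def toMuchWork(in1,in2,outA):
    # Compute a 2x2 block of outputs per pass (assumes both output dimensions are even)
    for i3 in range(0,outA.shape[0],2):
        for i2 in range(0,outA.shape[1],2):
            outA[i3,i2]=0
            outA[i3+1,i2]=0
            outA[i3,i2+1]=0
            outA[i3+1,i2+1]=0
            for i1 in range(in2.shape[1]):
                a00=in2[i3,i1]*in1[i1,i2]
                a01=in2[i3,i1]*in1[i1,i2+1]
                a10=in2[i3+1,i1]*in1[i1,i2]
                a11=in2[i3+1,i1]*in1[i1,i2+1]
                # Accumulate, as normalLoop does, instead of overwriting on each pass
                outA[i3,i2]+=a00
                outA[i3+1,i2]+=a10
                outA[i3,i2+1]+=a01
                outA[i3+1,i2+1]+=a11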
for i in range(5):
    toMuchWork(in1A,in2A,outA)

CPU times: user 8.33 s, sys: 127 ms, total: 8.45 s
Wall time: 8.57 s


You might see an improvement in the second example with numba. With a compiled language, the second example will almost always be slower, because restructuring the loop nest by hand disables higher-level optimizations that the compiler would otherwise apply.
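The 2x2 version above is a small case of blocking (tiling), where the loop nest works on sub-blocks that fit in fast memory. A minimal sketch of the general idea, using NumPy slices for each tile rather than numba; the function name `blocked_matmul` and the `block` parameter are hypothetical:

```python
import numpy as np

def blocked_matmul(a, b, block=2):
    """Matrix product computed tile by tile; slices clip at the edges."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, block):          # tile rows of the output
        for j0 in range(0, m, block):      # tile columns of the output
            for p0 in range(0, k, block):  # tiles along the summed axis
                out[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, p0:p0 + block]
                    @ b[p0:p0 + block, j0:j0 + block]
                )
    return out

rng = np.random.default_rng(1)
a = rng.random((5, 7))
b = rng.random((7, 3))
c = blocked_matmul(a, b)
```

This is an illustration of the access pattern only, not a performance recipe; in practice the block size would be tuned to the cache sizes of the target machine.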

## Loop invariant code motion

In [2]:
%%time
import numpy as np
import numba

in1A=np.random.rand(1000*100)

@numba.jit()
def realyDumbLoop(in1A):
    i=0
    tot=0
    while tot < in1A.sum()/5:
        tot+=in1A[i]
        i=i+1

realyDumbLoop(in1A)

CPU times: user 2.42 s, sys: 41.2 ms, total: 2.46 s
Wall time: 2.49 s


In [4]:
%%time
import numpy as np
import numba

in1A=np.random.rand(1000*100)

@numba.jit()
def notSoDumb(in1A):
    i=0
    tot=0
    mx=in1A.sum()/5
    while tot < mx:
        tot+=in1A[i]
        i=i+1

notSoDumb(in1A)

CPU times: user 96.1 ms, sys: 5.13 ms, total: 101 ms
Wall time: 106 ms


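The fix is to pre-compute, before the loop, any value that does not change inside it. Here in1A.sum()/5 is the same on every pass, so notSoDumb evaluates it once instead of on every test of the while condition.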
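One way to see the cost is to count how often the invariant expression is evaluated. The sketch below uses plain Python and hypothetical helper names; it mirrors the two cells above but returns the final index together with the number of times the sum was computed.

```python
def loop_unhoisted(values):
    """Recompute the limit on every test of the while condition."""
    calls = 0

    def limit():
        nonlocal calls
        calls += 1
        return sum(values) / 5

    tot = 0.0
    i = 0
    while tot < limit():
        tot += values[i]
        i += 1
    return i, calls

def loop_hoisted(values):
    """Compute the limit once, before the loop."""
    mx = sum(values) / 5
    tot = 0.0
    i = 0
    while tot < mx:
        tot += values[i]
        i += 1
    return i, 1

values = [1.0] * 10
print(loop_unhoisted(values))  # (2, 3)
print(loop_hoisted(values))    # (2, 1)
```

Each extra evaluation is a full pass over the array, so the unhoisted loop does work proportional to the array length on every iteration.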

## Function inlining

In [9]:
%%time 
@numba.jit()
def mypow(var):
    return var*var

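# seemsSmartOO itself is not jitted, so every mypow call crosses from the interpreter into compiled code
def seemsSmartOO(inA,outA):
    for i in range(inA.shape[0]):
        outA[i]=mypow(inA[i])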
        

in1A=np.random.rand(1000*1000)
outA=np.zeros(1000*1000)
for i in range(10):
    seemsSmartOO(in1A,outA)

CPU times: user 4.18 s, sys: 59.9 ms, total: 4.24 s
Wall time: 4.31 s


In [11]:
%%time 
@numba.jit()
def muchFaster(inA,outA):
    for i in range(inA.shape[0]):
        outA[i]=inA[i]*inA[i]
        

in1A=np.random.rand(1000*1000)
outA=np.zeros(1000*1000)
for i in range(10):
    muchFaster(in1A,outA)

CPU times: user 135 ms, sys: 11.2 ms, total: 146 ms
Wall time: 154 ms


## Machine-specific optimization
Many choices the compiler makes can be improved by knowing about the target CPU
- number of registers
- number of floating point units
- cache sizes

For highly optimized, but non-portable code, turn on these machine-specific optimizations. The speedup is often about a factor of 2.

## Relative importance

Too important to trust to the compiler
- Loop interchange
- Loop unswitching
- Loop invariant code motion
- Inlining

Worth it in many cases
- Vectorization

Not worth doing yourself; trust the compiler, but make its job doable
- Loop unrolling
- Loop nest optimizations
- Machine-specific optimizations