# Good practice for coding

Once you have written some initial code from your prototype with good documentation and a nice modular structure (and of course carefully tracked by git) there are two things you will want to do as you develop the code further

- Debug
- Optimise

For the first Python has a good inbuilt debugger that will help you, `pdg`

## Debugging
If you run code that has raised an exception in jupyter or ipython you can use the magic command `%debug` to launch an interactive debugger where you can examine the code line by line.  Here are the most useful commands

| Command | Desciption |
|---|---|
| u | Up |
| d | Down|
| p | Print |
| q | Quit |

This is best seen in an example:

In [None]:
def bottom_func(x):
    y = x**2
    return str(x) + " squared equals " + str(y)

def top_func(z):
    result = bottom_func(z)
    return result
        
top_func('7')

In [None]:
%debug

This can be turned on automatically with `%pdb on` so whenever an exception is raised the debugger is launched automatically

You can also run the code interactivly line by line using `%run -d` which can be more useful if your code is just wrong rather than breaking.  If you launch code with this then you can step through it with the following commands:

| Command | Desciption |
|---|---|
| n | Next line |
| s | Step into function|
| c | Continue to run normally |
| q | Quit |




In [None]:
%run -d Code/simple.py

You can also run the debugger from the command line with:

In [None]:
>> python3 -m pdb myscript.py

## Unit testing

The best way to debug your code is to catch them before they happen which you can do with unit testing.  The ideas is that you would set up a bunch of tests for the code then everytime you do a commit to master or after doing major edits you run them to check you haven't broken anything.  For the code calculating the integral in the previous notebook you may want to to create tests that check you polynomials are OK (by checking orthonormality) or that the final integral with Gauss-Legendre quadriture is correct for large l (ie set X = $\delta_{l,200}$ and see if you get the correct answer, 0.000018285996687338485).  

Having set up these tests you can then automate them to create a test package that runs, say, before a push to a central repository.  This is standard practise in commercial development enviroments and if you can include them in interview test questions this will put you above the majority of applicants.

It's a good idea to get into the habit of adding them for functions, ideally before you write it.  Then you can use them to check your code does what you thought it should. Luckly in python basic ones are easy to do, you can just add it to the docstring

In [None]:
import doctest

def function(x):
    """
    Calculate x + 2
    >>> function(5)
    7
    """
    return x+3

doctest.testmod()

This is fine for super simple tests but isn't much use once you write functions that process data rather than just a number.  There are alot of packages available but `pytest` is the standard.  This runs from the terminal and looks for any functions with the name `test_somefunction` or `somefunction_test` then run them.  These functions should contain some code to run then tests to apply to the outputs using the command `assert` which accepts any booleian argument.  If our function was:

In [None]:
def addtwo(x):
    """
        Add 2 to x
    """
    return x+1

Then our test could look like:

In [None]:
def test_addtwo():
    """
        Test addtwo
    """
    assert( addtwo(3)==5)

These are in the files simple.py in the directory `Code`. We can test them from the command line using

In [None]:
%%bash
pytest Code/simple.py

The tests can also be held in seperate files like test_simple.py:

In [None]:
%%bash
pytest Code/test_simple1.py

One thing to remember is that floating point arithmitic is not exact so the test of add02 in test_simple2.py fails

In [None]:
%%bash
pytest Code/test_simple2.py

To fix this pytest had a function called `approx` which by default allows a relative tolerance of 1e-6 which is mostly fine and it works on most data objects:

In [None]:
from pytest import approx
import numpy as np
print(0.1 + 0.2 == approx(0.3))
print((0.1 + 0.2, 0.2+0.4) == approx((0.3,0.6)))
print({'a': 0.1 + 0.2, 'b': 0.2 + 0.4} == approx({'a': 0.3, 'b': 0.6}))
print(np.array([0.1, 0.2]) + np.array([0.2, 0.4]) == approx(np.array([0.3, 0.6])))
print(np.array([0.1, 0.2]) + np.array([0.2, 0.1]) == approx(0.3))

However if you are testing things near zero relative tolerances are useless.  Luckly `approx` also allows you to change the tolerances and make them relative or absolute.  If you specify both, it is true if either are satisfied.

In [None]:
from pytest import approx
print(1.0001 == approx(1))
print(1.0001 == approx(1, rel=1e-3))
print(1.0001 == approx(1, abs=1e-3))
print(1.0001 == approx(1, rel=1e-5, abs=1e-3))

You can also specify a specific failure message with `fail` ie:

In [None]:
from pytest import fail
def test_something():
    x = somefunc()
    if x in badthings:
        fail('A bad thing came back from somefunc()')

Finally note that if we specify no arguments `pytest` looks for files names `test_fimename` or `filename_test` and runs them.  In general is is best practice to put your source code and you tests in different directories as it keeps them seperate and safe.  You should have one test file for each module.  Then running pytest in the test directory will check all your code or you can run on individual modules if you want.

In [None]:
%%bash
pytest Code/

Exercise: Game of life

Write a small module that runs Conway's game of life, https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life.  You start with a 2D grid where the cells are either 'alive' or 'dead'.  Then the rules for stepping in time are:

1. Overpopulation. Live cells with more than 3 neighbours dies
2. Underpopulation. Live cells with less than 2 neighbours dies
3. Reproduction. Dead cells with 3 neighbours becomes live

Boundaries are periodic

Your code should accept either a starting set of cells or dimensions for the board plus a number of iterations and output a plot of the game board at each step.

Try to use debug to fix errors and design unit tests for each of your functions.  Don't forget to prototype and document your code.

## Optimisation

Often the code we initiially write to solve a problem is primarily focused on accuracy but often once this is achieved we find the code runs too slowly to be used in the way we want it to.

In order to understand how to make code faster it can be useful to know the basics of how a computer works. Most computers have an architecture that looks a bit like this:

![](Plots/Computer_architecture.png)

Modern CPU's are so fast that actually most optimisatising for high performance computing comes from optimising memory usage to make sure the CPU is fed with enough data.  These are three numbers listed for each section in the above plot: Latency, which is how long it takes to respond to a command; Bandwidth, which is how past it can pass data up the chain; Size, which is how much data can be stored there.  The main problem we see is that memory latency is ~200 times longer than CPU latency (which is 1 cycle).  So, if you cant to multiple two numbers it takes 200 times as long to get them to teh CPU as it does to multiply them.  The response to this is to introduce multiple layers for faster and faster, but smaller and smaller, memory inbetween the CPU and main memory called 'cache'.  This helps keep data that we are working on close to the CPU so if we want to use it we can without delay.  This means that there are two things we can do to speed up our code.  Keep data local, ie near the CPU, and reuse it as much as possible while it is there.  Obviously you have no direct control over the cache but there are some things you can do that help the computer optimise this.

The best analogy for a computer is trying to learn something from textbooks.  Now the heirarchy looks like this:

![](Plots/Library.png)

And it is easy to get some basic lessons

- Get all the books you need from the library at once because it's a pain to keep going back there
- Constantly swapping books is annoying so try to finish workng on the open ones before you look at the next set

With this vauge idea in mind now lets look at some specifics cases.

There are always many different ways to do the same calculation so you should get into the habit of checking which is fastest.  For simple commands and functions we can just use `%timeit` (single line) and `%%timeit` (entire cell) which are useful for comparing equvilent code.  The best bit is they run the code multiple times to give you an accurate estimate of time.  Here are some examples of simple speedups you can do:

In [92]:
# 1. Squaring vs multiplication vs adding
x = 0.5
%timeit x**2
%timeit x*x
%timeit x/x

%timeit x+x
%timeit 2.0*x
%timeit 2.0*5.0
%timeit x/0.5

65.9 ns ± 0.0794 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
57.3 ns ± 0.204 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
56.3 ns ± 0.248 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
56.6 ns ± 0.254 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
37.2 ns ± 0.513 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
9.09 ns ± 0.00444 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
34.1 ns ± 0.0437 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


Strictly speaking on a CPU addition is faster than multiplication which is faster than division.  Exponentiation is not a basic operation for teh CPU and instead takes the log then multiplies.  However, in python we see that while exponentiation is slower than multiplying, multipliing is the same as adding.  Multiplying by a number is faster than multiplying by a variable as we don't have to go and find the number it in memory and multipliing two numbers much faster again.  We can roughly conclude taht it takes twice as long to find a variable as it does to multiply it (you can read faster than find the right bit of a book). In python division and muliplication take the same time.  This is because most of the time is lost in finding the variables rather than the operation itself.  But if we do the operation alot:

In [100]:
%timeit [0.5*x for x in range(10000)]
%timeit [x/2.0 for x in range(10000)]

645 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
684 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Then small differences can emerge.

In [None]:
# 2. Square root
import math
%timeit x**0.5
%timeit math.sqrt(x)

Python functions can be slower than basic operations (in compiled code the reverse would usually be true).   I think here `sqrt` most likely has higher accuracy.

In [None]:
# 3a. Math vs Numpy
import math
import numpy as np
%timeit math.sin(x)
%timeit np.sin(x)

So numpy is slower for scalars but...

In [None]:
# 3b. Math vs Numpy for vector
X = np.random.random(100)
Y1 = np.zeros(100)
Y2 = np.zeros(100)
Y3 = np.zeros(100)
%timeit Y1 = np.sin(X)
%timeit Y2 = list(map(lambda x: math.sin(x),X))

In [None]:
%%timeit
for i in range(100):
    Y3[i] = math.sin(X[i]) 

Numpy is faster for arrays, note the tiny scaling which shows that almost all the time is getting the function itself so gain most of the time is in finding things, not calculation.  Here we have used `map` which is a way of applying a function to a list with the format:

map(function,list) ie:  `list(map(lambda x: x**2, items)`

There is also `filter` which selects based on a condition:

filter(function,list) ie: `list(filter(lambda x: x < 0, items))`

These are faster than loops but can't compete with numpy.   Here is something similar for matrix multiplication:

In [102]:
# 4a. Math vs Numpy multiplication
x=2.3
y=3.4
%timeit np.dot(x,y)
%timeit x*y

851 ns ± 4.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
45.4 ns ± 0.126 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [None]:
# 4b. Math vs Numpy for dot product
X = np.random.random(100)
Y = np.random.random(100)
%timeit np.dot(X,Y)
%timeit sum(list(map(lambda x:x[0]*x[1],list(zip(X,Y)))))

Now the call time for numpy is all the time and much faster than `map` construct

In [98]:
# 5. Constructing lists
%timeit [x*x for x in range(100)]
%timeit [x**2 for x in range(100)]

5.95 µs ± 55.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.8 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [94]:
%%timeit
list1=[]
for x in range(100):
    list1.append(x*x)

9.63 µs ± 54.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [95]:
%%timeit
list1=[]
append = list1.append
for x in range(100):
    append(x*x)

7.66 µs ± 48.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [96]:
%%timeit
list1 = np.zeros((100))
for x in range(100):
    list1[x] = x*x

13.2 µs ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [97]:
%%timeit
np.fromfunction(lambda i,:i*i,(100,),dtype=int)

9.81 µs ± 76.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


So list comprehensions are fastest. Also note that using `append = list1.append` saves us about 20% in time.  This is due to not having to looking up the function, the assignment has make it more 'local' in memory.  Do be careful with this as it can make code very hard to read if you do it alot.  Also note that subsequent operations on list1 may be quicker for the last 2 methods as the list is more likely to be contiguous in memory (all in the same place) for very large lists so we can read it to cache faster.   Finally, note that using a list comprehension won't make you code fast if you use bad algorithms like `x**2` rather than `x*x` which costs ~5 times as much.  In general better algorithms will always win over better code.  This is why prototyping is such a good idea.

In [1]:
#7. Different indexing orders
import numpy as np
A = np.zeros((100,100))
B = np.random.random((100,100))

In [2]:
%%timeit
for i in range(100):
    for j in range(100):
        A[i,j] = B[i,j]

2.74 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
%%timeit
for i in range(100):
    for j in range(100):
        A[j,i] = B[j,i]

2.73 ms ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In C or Fortran this would have mattered alot, but here the loops are so slow you don't see any difference due to how you are accessing memory.  This could still be something to try for very large arrays as it may eventually win.  The problem is that python stores arrays as a single line of all the rows in order so the `(i,j)` loop reads the data in order, so the next value is just next to the last one.  The `(j,i)` has to jump the lenght of the row to find the next value slowing it down.

In [None]:
#8. Move stuff out of loops if possible
import math

def func1():
    x=math.sqrt(2)
    y=0
    for i in range(100):
        y*=x
        
def func2():
    y=0
    for i in range(100):
        y*=math.sqrt(2)
        
%timeit func1()
%timeit func2()

There is no point in working out the square root of 2 100 times. Again try to make stuff used in loops 'local' and avoid any calculation if possible

In [7]:
#8a. Move stuff out of loops if possible
import math
list1 = [range(100)]

def func1():
    y=0
    i=0
    while i<len(list1):
        y+=i
        i+=1
        
def func2():
    y=0
    i=0
    length = len(list1)
    while i<length:
        y+=i
        i+=1
        
%timeit func1()
%timeit func2()

264 ns ± 2.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
222 ns ± 0.553 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Same goes for test conditons for while loops.

In [None]:
#9. imports inside Vs outside of funtions

import math
def func1():
    math.sin(0.3)
    
def func2():
    import math
    math.sin(0.3)
    
    
%timeit func1()
%timeit func2()

It can be tempting to use imports inside functions so it looks tidy but this has significant cost. Best leave them at the top of the module as functions may be used multiple times but modules are (generally) only loaded once.

In [None]:
#10. Function overhead

def func1(i):
    return i*i

def func2():
    x = 0e0
    for i in range(100):
        x += func1(1)
        
def func3():
    x = 0e0
    for i in range(100):
        x += i*i
        
%timeit func2()
%timeit func3()

Modulisation is great but function calls can be expensive so don't go crazy (especially in loops!).  Again this is introducing an extra lookup which makes the data less likely to be local in cache

In [None]:
#11. Multiple assignment rather than tempory variables

def func1():
    a=0
    b=1
    for i in range(1000):
        a,b = b,a+b
        
def func2():
    a=0
    b=1
    for i in range(1000):
        c = a+b
        a = b
        b = c
        
%timeit func1()
%timeit func2()

A small saving (plus looks better)

In [None]:
# 12. Finding elements depends on the data type
list1 = [range(100)]
set1 = set(list1)

%timeit 5 in list1
%timeit 5 in set1

Sets and dictionaries are hash tables so are faster to search than lists. 

In [90]:
# 13. Read entire files at once if possible
import numpy as np
def read1():
    data1 = np.loadtxt('Data/Period1.txt')
    return data1

def read2():
    file = open('Data/Period1.txt','r')
    data1 = []
    for line in file:
        tmp = line.split()
        data1.append(float(tmp[0]))
    data2 = np.array(data1)
    return data2

def read3():
    file = open('Data/Period1.txt','r')
    data1 = file.read()
    data1 = data1.split('\n')
    data1 = [float(x) for x in data1[:-1]]
    data1 = np.array(data1)
    return data1

%timeit read1()
%timeit read2()
%timeit read3()

11.3 ms ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.3 ms ± 9.14 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.42 ms ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


So numpy is the cleanest to code but the slowest to run and reading the file in one block is better than line by line (remember getting all library books at once rather than making multiple trips).  This is especially true on shared systems where you may compete for bandwidth for I/O.  There the system may alternate read statements from competing programmes eg: someone reads 1Tb, you read a line, someone reads 1Tb, you read one line etc.. which can be crippling.

I could go on and on but it's impossible to list all the things to consider so how do we decide which bits of our code to focus on?  The answer is to profile your code and see which bits take the longest.

### Profiliing

Profilers analysis your code and tell you what parts are taking the most time (and memory) to run.  There are a few you can use.  Lets look at them with our little prim number generator we created


In [105]:
def primenum(N):
    primes = [2]
    for n in range(3,N):
        if all(n%p>0 for p in primes):
            primes.append(n)
    return primes 

%prun primenum(10000)

 

This indicates that we spent most of our time in `<genexp>` on line 4 with the next most time spent on the build in method `all`.  Which is what we would expect

Next we could install the `line_profiler` package (`conda install line_profiler`) and use this instead

In [111]:
%load_ext line_profiler

def primenum(N):
    primes = [2]
    for n in range(3,N):
        if all(n%p>0 for p in primes):
            primes.append(n)
    return primes 


%lprun -f primenum primenum(10000)

Which is a little easier to read.  As one of our main problems is likely to be memory useage we can also profile memory with the `memory_profiler` package (`conda install memory_profiler`) but this only works on files so we need to save this to a file first

In [123]:
%%file Tools/primetools.py
def primenum(N):
    primes = [2]
    for n in range(3,N):
        if all(n%p>0 for p in primes):
            primes.append(n)
    return primes

Writing Tools/primetools.py


In [124]:
#%load_ext memory_profiler
import sys
sys.path.append('./Tools') 
import primetools as pts 
%mprun -f pts.primenum pts.primenum(10000)




We can also use `%run -p` to profile scripts that we run.  Finally there is also another build in profiler `cProfile`

In [128]:
import cProfile
def primenum(N):
    primes = [2]
    for n in range(3,N):
        if all(n%p>0 for p in primes):
            primes.append(n)
    return primes 

cProfile.run('primenum(10000)')

         797856 function calls in 0.151 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005    0.151    0.151 <ipython-input-128-a151a3abc01c>:2(primenum)
   786627    0.089    0.000    0.089    0.000 <ipython-input-128-a151a3abc01c>:5(<genexpr>)
        1    0.000    0.000    0.151    0.151 <string>:1(<module>)
     9997    0.056    0.000    0.145    0.000 {built-in method builtins.all}
        1    0.000    0.000    0.151    0.151 {built-in method builtins.exec}
     1228    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




This can also be run from the command line with `python -m cProfile [-o output_file] [-s sort_order] myscript.py` which is usefull

Exercise:

Profile your 'game of life' code and see if you can get it to run any faster.