# 6. Performance Improvement
A typical programming procedure involves a sequence of three steps associated with their respective interim goals, i.e., "make it run" $\Rightarrow$ "make it right" $\Rightarrow$ "make it fast". This section discusses approaches to optimize Python code, i.e., make code run faster. The methods presented below are applicable to nearly every programming problem. 

## 6.0 Timing Code
Timing is the first step in measuring code performance.
1. In a Jupyter notebook, one can use the magic function $\%timeit$ to time a single statement or a single function call. It provides a simple way to time small bits of Python code. (1s = 1000ms, 1ms = 1000$\mu$s, 1$\mu$s = 1000ns)
2. To time more complicated code, one can use the "timeit" module.

In [4]:
import timeit
import numpy as np

A = np.random.random((100,100))

%timeit -n10 np.linalg.inv(A)

start = timeit.default_timer()
np.linalg.inv(A)
end = timeit.default_timer()
print("the execution time is", end - start) 


261 µs ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
the execution time is 0.0003791719791479409
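Outside a notebook the $\%timeit$ magic is not available, but "timeit.repeat" gives comparable repeated measurements. The following is a minimal sketch; the names `n_loops` and `n_runs` are illustrative choices that mirror the "-n10" option and the 7 runs reported above.

```python
import timeit
import numpy as np

A = np.random.random((100, 100))

n_loops, n_runs = 10, 7   # illustrative values: 10 loops per run, 7 runs

# timeit.repeat returns one total time (in seconds) per run;
# each run calls the function n_loops times
totals = timeit.repeat(lambda: np.linalg.inv(A), number=n_loops, repeat=n_runs)
per_loop = [t / n_loops for t in totals]

print(f"best of {n_runs} runs: {min(per_loop) * 1e6:.1f} µs per loop")
```

Reporting the minimum rather than the mean is a common choice, since slower runs are usually caused by other activity on the machine.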


## 6.1 Avoiding (Unnecessary) Object Copies
Generally speaking, creating and returning object copies have their time and space costs, as Python has to allocate memory for these copies. Whenever it is safe, operating directly on the original object is preferred from an efficiency point of view. 
### 6.1.1 Using Appropriate Indexing
Recall that elements of Numpy arrays can be selected in four ways: scalar selection, slicing, numerical (list-of-locations) indexing and logical (Boolean) indexing. Numerical and logical indexing create a copy of the selected data in memory, while slicing returns a view of the original array (selecting a single element returns a scalar). This can be verified with the following example.

In [5]:
a = np.arange(5)
a_slicing = a[0:3]
a_numerical = a[np.arange(3)]
a_logical = a[a < 3]
print(a, a_slicing, a_numerical, a_logical)

a *= 2
print(a, a_slicing, a_numerical, a_logical)


[0 1 2 3 4] [0 1 2] [0 1 2] [0 1 2]
[0 2 4 6 8] [0 2 4] [0 1 2] [0 1 2]
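The view/copy distinction can also be checked directly instead of being inferred from printed values. The sketch below uses "np.shares_memory" on the same three selections; only the slice is expected to share memory with the original array.

```python
import numpy as np

a = np.arange(5)
a_slicing = a[0:3]              # basic slicing
a_numerical = a[np.arange(3)]   # numerical (integer-array) indexing
a_logical = a[a < 3]            # logical (Boolean) indexing

for name, arr in [("slicing", a_slicing),
                  ("numerical", a_numerical),
                  ("logical", a_logical)]:
    # True means arr is a view onto a's memory, False means it is a copy
    print(name, np.shares_memory(a, arr))
```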


Since slicing does not produce copies, it is more efficient than the other indexing methods when handling large arrays.

In [8]:
# each function computes the mean of the same elements (values 1000 to 10000)
def mean_slicing(x):
    return np.mean(x[1000:10001])

def mean_numerical(x):
    return np.mean(x[np.arange(1000,10001)])

def mean_logical(x):
    return np.mean(x[(x >= 1000) & (x <= 10000)])

a = np.arange(100000)

%timeit -n100 mean_slicing(a)
%timeit -n100 mean_numerical(a)
%timeit -n100 mean_logical(a)


21.7 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
51.6 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
124 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 6.1.2 Using In-Place Algorithms
An in-place algorithm operates directly on its input, instead of creating and returning a new copy of the input. Working in-place is a good way to speed up computation and save memory. However, one should keep in mind that in-place algorithms are "destructive": the original input is overwritten when the output is produced.

In [9]:
A = np.random.random((1000,1000))
B = np.random.random((1000,1000))

%timeit -n10 global A; A *= B 
%timeit -n10 A*B


1.24 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
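The "destructive" nature of in-place operations can be made explicit with the "out" argument that Numpy ufuncs accept. This is a small sketch; `A_before` and `buffer_id` are names introduced here only to check the result.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 1000))
B = rng.random((1000, 1000))

A_before = A.copy()          # kept only so the example can verify the result
buffer_id = id(A)

np.multiply(A, B, out=A)     # same effect as "A *= B": the product is written into A

print(id(A) == buffer_id)              # A is still the same object
print(np.allclose(A, A_before * B))    # its contents are now the product
print(np.allclose(A, A_before))        # the original values are gone
```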


### 6.1.3 Utilizing Broadcasting 
When two arrays have different shapes during arithmetic operations, the smaller array is “broadcast” across the larger one so that they are compatible in shape. Broadcasting does not make needless copies of data and usually leads to efficient algorithm implementations. Furthermore, broadcasting provides a way to vectorize array operations so that looping occurs in C instead of Python.  

In [12]:
A = np.random.random((1000,1))
B = np.random.random((1,100))

# compute the Kronecker product of A and B
%timeit -n50 A*B
%timeit -n50 A@np.ones((1,100))*(np.ones((1000,1))@B)
%timeit -n50 np.kron(A,B)


121 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
260 µs ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
635 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
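To see that broadcasting does not copy data, one can inspect the array produced by "np.broadcast_to", which exposes the stretched operand explicitly. The sketch below uses the same shapes as the cell above; `A_wide` is a name introduced for illustration.

```python
import numpy as np

A = np.random.random((1000, 1))
B = np.random.random((1, 100))

# read-only view in which A's single column is "stretched" to 100 columns
A_wide = np.broadcast_to(A, (1000, 100))

print(A_wide.shape, A_wide.strides)    # a stride of 0 along axis 1: all columns reuse the same memory
print(np.shares_memory(A, A_wide))     # no new data was allocated
print(np.allclose(A * B, np.kron(A, B)))
```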


## 6.2 Avoiding Loops in Python

### 6.2.1 Vectorizing to Avoid Loops

In [13]:
A = np.random.random((100,10))
B = np.random.random((10,20))

# matrix multiplication
def f(X,Y):
    m,k = X.shape
    k,n = Y.shape
    Z = np.zeros((m,n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                Z[i,j] += X[i,l]*Y[l,j]                       
    return Z

print(np.allclose(f(A,B), A@B))

%timeit -n50 A@B
%timeit -n50 f(A,B)


True
5.68 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
11.1 ms ± 465 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
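Vectorization does not have to happen in one step. The sketch below removes the loops of the matrix multiplication one level at a time; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

A = np.random.random((100, 10))
B = np.random.random((10, 20))

def f_one_loop_removed(X, Y):
    """Innermost loop replaced by a dot product of a row of X and a column of Y."""
    m, n = X.shape[0], Y.shape[1]
    Z = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            Z[i, j] = X[i, :] @ Y[:, j]
    return Z

def f_two_loops_removed(X, Y):
    """Two inner loops replaced by one vector-matrix product per row of X."""
    Z = np.zeros((X.shape[0], Y.shape[1]))
    for i in range(X.shape[0]):
        Z[i, :] = X[i, :] @ Y
    return Z

print(np.allclose(f_one_loop_removed(A, B), A @ B))
print(np.allclose(f_two_loops_removed(A, B), A @ B))
```

Each stage moves more of the work from the Python interpreter into compiled code, with `A @ B` as the fully vectorized end point.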


### 6.2.2 Using (Generalized) Universal Functions 
Universal functions (ufunc) are functions that do element-by-element operations on arrays. Generalized universal functions (gufunc) extend ufunc to support “sub-array” by “sub-array” operations. Many of the built-in functions are implemented in compiled C code, and so the execution is fast. 

In [14]:
X = np.random.random((100,3,3))
Y = np.random.random((100,3,3))

print(np.allclose(
      np.array([x@y for x,y in zip(X,Y)]), 
      np.matmul(X,Y)))

%timeit -n50 np.matmul(X,Y)
%timeit -n50 np.array([x@y for x,y in zip(X,Y)])


True
68.1 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
140 µs ± 9.11 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
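The difference between a ufunc and a gufunc is visible in the "signature" attribute. The following sketch inspects "np.add" and "np.matmul"; the exact signature string printed for "np.matmul" may differ between Numpy versions.

```python
import numpy as np

# np.add is an ordinary ufunc: it acts element by element and has no core signature
print(isinstance(np.add, np.ufunc), np.add.signature)

# np.matmul is a generalized ufunc: its signature describes the sub-arrays it works on
print(isinstance(np.matmul, np.ufunc), np.matmul.signature)

X = np.random.random((100, 3, 3))
Y = np.random.random((100, 3, 3))

# the leading axis (length 100) is looped over in compiled code;
# each core operation is a 3x3 matrix product
print(np.matmul(X, Y).shape)
```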


### 6.2.3 Using Numpy Functions and Array Methods 
Numpy provides a high-performance multidimensional array object, and tools for working with these arrays. Numpy functions (and the corresponding array methods, if available) are typically faster than their Python built-in counterparts when processing Numpy arrays. 

In [23]:
x = np.random.random((100,1))

%timeit -n100 np.min(x)
%timeit -n100 x.min()
%timeit -n100 min(x)


8.22 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.71 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
64 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 6.2.4 Using List Comprehension and map( )
List comprehension and "map( )" are built-in ways to apply a function to every element of an iterable. Both are usually somewhat faster than an explicit for loop, but they are clearly outperformed by Numpy functions and ufunc/gufunc.

In [28]:
def f_loop(x,y):
    z = []
    for i in range(len(x)):
        z.append(max(x[i],y[i]))
    return z

X = np.random.randint(1, 9, (1000,1))
Y = np.random.randint(1, 9, (1000,1))

%timeit -n50 [np.maximum(x,y) for x,y in zip(X,Y)]         # list comprehension
%timeit -n50 list(map(np.maximum,X,Y))                     # map()
%timeit -n50 np.maximum(X,Y).tolist()                      # ufunc
%timeit -n50 f_loop(X,Y)                                   # for loop


912 µs ± 115 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
733 µs ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
53.1 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
1.05 ms ± 80.2 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
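The four approaches compute the same values and differ only in where the loop runs. A minimal sketch on small 1-D inputs (the variable names are chosen here for illustration):

```python
import numpy as np

x = np.random.randint(1, 9, 10)
y = np.random.randint(1, 9, 10)

z_loop = []
for i in range(len(x)):                      # explicit for loop
    z_loop.append(max(x[i], y[i]))

z_comp = [max(a, b) for a, b in zip(x, y)]   # list comprehension
z_map = list(map(max, x, y))                 # map() over two iterables
z_ufunc = np.maximum(x, y).tolist()          # ufunc: the loop runs in compiled code

print(z_loop == z_comp == z_map == z_ufunc)
```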


### 6.2.5 Using List and Set Methods
List/set methods are much faster than loops. Do check available list/set methods before writing a loop!

In [29]:
import itertools as it

def f_combination(x):
    z, n = [], len(x)
    
    for i in range(n-1):
        for j in range(i+1,n):
            z.append([x[i],x[j]])
    return z

v = np.array(range(100))
print(np.allclose(f_combination(v), 
                  list(it.combinations(v, 2))))

%timeit -n50 list(it.combinations(v, 2))
%timeit -n50 f_combination(v)

def f_count(X,x):
    c = 0
    for i in range(len(X)):
        c += X[i] == x
    return c

X = np.random.randint(0,100,1000)

print(f_count(X,9) == list(X).count(9))

%timeit -n50 list(X).count(9)
%timeit -n50 f_count(X,9)

def f_intersection(X,Y):
    Z = set()
    for x in X:
        if x in Y:
            Z.add(x)
    return Z

A = np.random.randint(0,1000,100)
B = np.random.randint(0,1000,100)

print(f_intersection(A,B) == set(A).intersection(B))

%timeit -n50 set(A).intersection(B)
%timeit -n50 f_intersection(A,B)


True
223 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
2.79 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
True
128 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
2.11 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
True
14.5 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
322 µs ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
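When the data are already stored in Numpy arrays, counting and intersection can also be expressed with Numpy functions, which avoids converting to a list or set first. The sketch below checks that "np.count_nonzero" and "np.intersect1d" agree with the list/set methods; it makes no claim about which is faster for a given input size.

```python
import numpy as np

X = np.random.randint(0, 100, 1000)
A = np.random.randint(0, 1000, 100)
B = np.random.randint(0, 1000, 100)

# counting: compare the whole array at once, then count the True entries
count_np = np.count_nonzero(X == 9)
print(count_np == list(X).count(9))

# intersection: np.intersect1d returns the sorted unique common values
common_np = np.intersect1d(A, B)
print(set(common_np.tolist()) == set(A.tolist()).intersection(B.tolist()))
```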


## 6.3 Compiling Python Functions
### 6.3.1 Profiling Code
For programmer productivity, it often makes sense to identify the code bottlenecks before optimizing the code. This can be done by code profiling. A profiling procedure generates a set of statistics that describe how often and for how long various parts of the code are executed. These statistics can then be formatted into reports via the "pstats" module. One of the basic Python profilers is provided by the "profile" module. 

In [31]:
A = np.random.random((300,10))
B = np.random.random((10,200))

def f1(X,Y):
    m,k = X.shape
    k,n = Y.shape
    Z = np.zeros((m,n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                Z[i,j] += X[i,l]*Y[l,j]                       
    return Z

def f2(X,Y):
    return X@Y

def test(X,Y):
    f1(X,Y)
    f2(X,Y)
    
import profile
profile.run("print(test(A,B))", "code.profile")


None


The standard report created by $profile.run()$ is not very flexible. Custom reports can be produced by saving the raw profiling data and processing it separately with the $pstats.Stats$ class.

In [32]:
import pstats
cp = pstats.Stats("code.profile")
cp.print_stats()
pass

Mon Mar  5 23:18:27 2018    code.profile

         40 function calls in 0.354 seconds

   Random listing order was used

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.354    0.354 profile:0(print(test(A,B)))
        1    0.000    0.000    0.352    0.352 :0(exec)
        1    0.000    0.000    0.352    0.352 <string>:1(<module>)
        1    0.000    0.000    0.352    0.352 <ipython-input-31-49b6bb79b942>:17(test)
        1    0.351    0.351    0.352    0.352 <ipython-input-31-49b6bb79b942>:4(f1)
        1    0.001    0.001    0.001    0.001 :0(zeros)
        1    0.000    0.000    0.000    0.000 <ipython-input-31-49b6bb79b942>:14(f2)
        1    0.000    0.000    0.000    0.000 :0(print)
        2    0.000    0.000    0.000    0.000 /Users/ouyangfu/anaconda3/lib/python3.6/site-packages/ipykernel/iostream.py:342(write)
        2    0.000    0.000    0.000    0.00

In a Jupyter notebook, profiling can also be done with the line magic $\%prun$.

In [33]:
%prun -q -D code.profile test(A,B)
cp = pstats.Stats("code.profile")
cp.print_stats()
pass

 
*** Profile stats marshalled to file 'code.profile'. 
Mon Mar  5 23:19:31 2018    code.profile

         7 function calls in 0.357 seconds

   Random listing order was used

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.357    0.357 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.zeros}
        1    0.357    0.357    0.357    0.357 <ipython-input-31-49b6bb79b942>:4(f1)
        1    0.000    0.000    0.357    0.357 <ipython-input-31-49b6bb79b942>:17(test)
        1    0.000    0.000    0.000    0.000 <ipython-input-31-49b6bb79b942>:14(f2)
        1    0.000    0.000    0.357    0.357 <string>:1(<module>)
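The reports above are listed in random order. A "pstats.Stats" object can sort and trim the listing, which makes the bottleneck easier to spot. The sketch below uses "cProfile" (the lower-overhead standard-library profiler with the same interface as "profile") rather than the "profile" module used above; `slow_matmul` is a hypothetical stand-in for `f1`.

```python
import cProfile
import io
import pstats

import numpy as np

def slow_matmul(X, Y):
    """Hypothetical stand-in for f1 above: triple-loop matrix product."""
    m, k = X.shape
    n = Y.shape[1]
    Z = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                Z[i, j] += X[i, l] * Y[l, j]
    return Z

A = np.random.random((30, 10))
B = np.random.random((10, 20))

profiler = cProfile.Profile()
profiler.enable()
slow_matmul(A, B)
profiler.disable()

# collect the report in a string, sorted by cumulative time, top 5 entries only
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```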




### 6.3.2 Numba
One way to speed up detected code bottlenecks is to compile the code to machine executables, often via an intermediate C or C-like stage. There are two common approaches to compiling Python code - using a Just-In-Time (JIT) compiler and using Cython for Ahead of Time (AOT) compilation. This note mostly covers the JIT approach provided by Numba as it can often significantly speed up Python code with minimal effort.

In [36]:
from numba import jit 

A = np.random.random((50,10))
B = np.random.random((10,50))

def F(X,Y):
    m,k = X.shape
    k,n = Y.shape
    Z = np.zeros((m,n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                Z[i,j] += X[i,l]*Y[l,j]                       
    return Z

Fjit = jit(F)

%timeit A@B
%timeit Fjit(A,B)
%timeit F(A,B)


5.31 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.6 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
13.5 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The pure Python loop is slow because
1. Python uses a very general approach to process operations. The strength is that many Python operators apply to a wide range of objects (arrays, strings, lists, etc.), while the drawback is that Python has to identify the type of the object and its associated functions/methods in each operation.
2. Python treats each loop iteration as independent, i.e., it does not learn from what happened before. This makes Python perform a lot of needless work that could mostly be avoided if it remembered what it did in the previous iteration.

The post-JIT code produced by Numba knows substantially more about the structure of the problem. It is natural to think that more efficiency gains can be obtained by providing Numba more information about the problem. This can be done by describing the inputs and outputs (signature specifications).

In [38]:
# both output and inputs are 2-D float64 arrays
Fjit_descr = jit("double[:,:](double[:,:],double[:,:])")(F)

%timeit Fjit_descr(A,B)
%timeit Fjit(A,B)


18.4 µs ± 623 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
20.3 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Numba can generate much faster code than pure Python partly because it supports only a small (but important) set of data types, functions, and operators useful for numerical work. Numba supports two modes of operation: object mode and nopython mode. Object mode is "robust" (it always works), but it is usually not noticeably faster than pure Python. Nopython mode requires that every operation in a function is supported by Numba. In the Numba version used for this note, Numba tries nopython mode first when compiling Python code and falls back to object mode if that fails. To prevent Numba from falling back, and raise an error instead, one can pass the keyword "nopython = True". (Recent Numba releases make nopython mode the default for "jit" and no longer fall back automatically.)

In [37]:
Fjit_nopy = jit(F, nopython = True)  

%timeit Fjit_nopy(A,B)
%timeit Fjit(A,B)


18.5 µs ± 483 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.3 µs ± 800 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
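Nopython mode is also available through the decorator "njit", which is shorthand for "jit" with "nopython = True". The sketch below uses it to show that the first call includes compilation while later calls reuse the compiled code. The fallback definition of `njit` is an addition for this example only, so that the code still runs (without any speed-up) when Numba is not installed.

```python
import time
import numpy as np

try:
    from numba import njit           # shorthand for jit(nopython=True)
except ImportError:                  # fallback for this sketch only: plain Python, no speed-up
    def njit(func):
        return func

@njit
def matmul_loops(X, Y):
    m, k = X.shape
    n = Y.shape[1]
    Z = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):
                Z[i, j] += X[i, l] * Y[l, j]
    return Z

A = np.random.random((50, 10))
B = np.random.random((10, 50))

t0 = time.perf_counter()
Z_first = matmul_loops(A, B)         # with Numba: compiles for 2-D float64 inputs, then runs
t1 = time.perf_counter()
Z_second = matmul_loops(A, B)        # with Numba: reuses the compiled version
t2 = time.perf_counter()

print(f"first call:  {t1 - t0:.6f} s")
print(f"second call: {t2 - t1:.6f} s")
print(np.allclose(Z_first, A @ B))
```

Because $\%timeit$ runs the statement many times, the one-off compilation cost is largely hidden in the timings above; it matters mainly for functions that are called only a few times.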


### 6.3.3 Cython
When pure Numpy and/or Numba cannot achieve the desired performance improvement, one can try Cython, which is a powerful, but more complex, “optimizing static compiler”. Cython is a superset of Python, and so valid Python code is also valid Cython code. Cython has a number of advantages over Numba’s JIT (dynamic) compiler:

1. Cython modules are statically compiled, and so using a Cython module does not incur a “warm-up” penalty due to JIT compilation. 
2. Numba is a relatively new, rapidly evolving project, and so code may encounter compatibility problems when executed under different versions.
3. A Python extension produced by Cython can be distributed to other users and does not require Cython to be installed. In contrast, Numba must be installed and performance gains may vary across Numba versions.
4. Cython can be used to interface with existing C/C++ code.

Since Numba often provides similar speed-ups with less work (and it is rapidly evolving!), Cython is not covered in this note. The following references might be helpful for studying Cython:

1. [Cython](http://docs.cython.org/en/latest/#)
2. [Cython: A Guide for Python Programmers](https://www.amazon.com/Cython-Kurt-W-Smith/dp/1491901551)