# High Performance Python

### by Ankush Chander
Research Engineer, Crimson Interactive

In [2]:
%load_ext memory_profiler

## Table of contents:
1. The Fundamental Computer System
2. Profiling techniques
3. Python lists
4. Dictionaries and namespaces












## Fundamental Computer System

### Computing units
Provide the ability to transform the bits they receive into other bits, or to change the state of the current process.
Eg: CPU (commonly used), GPU (auxiliary unit)

### Memory units
Used to store bits.
Eg: Spinning disk, solid-state drive (SSD), RAM, L1/L2/L3 cache, CPU registers

### Communication layers
Determine how fast data can move from one place to another.
Eg: Frontside bus (between RAM and the L1/L2 cache), external bus (from hardware devices to the CPU and main memory)


## Computing units
Provide the ability to transform the bits they receive into other bits, or to change the state of the current process.

#### Measurements:

How many cycles per second? (clock speed)

How many instructions per cycle? (IPC)

#### Commands:
```
lscpu
# example output (excerpt)
CPU MHz:                         1700.000   # 1 MHz means 1 million cycles per second
CPU max MHz:                     4700.0000
CPU min MHz:                     400.0000
BogoMIPS:                        3379.20    # crude kernel busy-loop calibration figure, not a true instructions-per-second measurement
```
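As a rough back-of-the-envelope sketch, the two measurements multiply into a theoretical upper bound on instruction throughput for a single core. The clock figure mirrors the sample `lscpu` output above; the IPC value of 2 and the helper name `peak_instructions_per_second` are assumptions made only for illustration, and real workloads usually fall well short of this bound.

```python
def peak_instructions_per_second(clock_hz, ipc):
    """Theoretical upper bound on instructions per second for one core."""
    return clock_hz * ipc


clock_hz = 1700e6  # 1700 MHz, taken from the sample lscpu output
ipc = 2.0          # assumed average instructions per cycle (hypothetical)

rate = peak_instructions_per_second(clock_hz, ipc)
print(f"{rate / 1e9:.1f} billion instructions per second")  # prints 3.4 billion ...
```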

## Memory units
Used to store bits.
#### Measurements
1. How much data can it hold?
2. How fast can data be read from or written to it?
3. How long does it take to locate data?

#### Tiered structure (slowest and largest to fastest and smallest):
1. Spinning hard drive
2. Solid-state drive (SSD)
3. RAM
4. L1/L2/L3 cache
5. CPU registers

## Communication layers
Determine how fast data can move from one place to another.

#### Measurements:
1. How much data can the bus move in one transfer? (bus width)
2. How many transfers can the bus make in one second? (bus frequency)
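The two bus measurements combine into a peak bandwidth figure: width times frequency. The sketch below uses a hypothetical 64-bit bus making 100 million transfers per second; both numbers and the function name are assumptions for illustration, not values from any particular machine.

```python
def bus_bandwidth_bytes_per_second(width_bits, transfers_per_second):
    """Peak bandwidth = bytes moved per transfer * transfers per second."""
    return width_bits / 8 * transfers_per_second


# Hypothetical bus: 64 bits wide, 100 million transfers per second.
bandwidth = bus_bandwidth_bytes_per_second(64, 100_000_000)
print(f"{bandwidth / 1e6:.0f} MB/s")  # prints 800 MB/s
```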

#### Types of bus:
1. Frontside bus: connects RAM to the L1/L2 cache
2. External bus: connects external devices to system memory and the CPU
3. PCI bus: connects peripheral devices


## Profiling techniques

### Simple approaches to timing - print and decorator

In [1]:
# time module
import time

s_time = time.time()
time.sleep(4)
print(f"took {time.time() - s_time} secs")

took 4.00078272819519 secs
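A possible variation on the cell above: `time.time()` follows the system clock, which can be adjusted while the program runs, whereas `time.perf_counter()` is a monotonic, high-resolution clock intended for measuring short intervals. The sketch below assumes a shorter sleep of 0.1 s simply to keep the example quick.

```python
import time

start = time.perf_counter()
time.sleep(0.1)  # stand-in for the code being timed
elapsed = time.perf_counter() - start

print(f"took {elapsed:.4f} secs")
```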


In [2]:
# log a function's execution time using a decorator

import time
from functools import wraps

def timer_func(func):
    # This function shows the execution time of 
    # the function object passed
    @wraps(func)
    def wrap_func(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        t2 = time.time()
        print(f'Function {func.__name__!r} executed in {(t2-t1):.4f}s')
        return result
    return wrap_func
  

@timer_func
def lengthy_function():
    time.sleep(4)
    
    
lengthy_function()

Function 'lengthy_function' executed in 4.0014s


### Simple Timing Using the Unix time Command

```
$ time python3 -u insertion_sort.py  10000

real	0m5.218s
user	0m5.206s
sys	0m0.004s
```
real records the wall-clock (elapsed) time.

user records the CPU time spent on your task outside of kernel functions.

sys records the CPU time spent in kernel-level functions.
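The distinction between real time and CPU time can also be observed from inside Python. In this sketch, `time.perf_counter()` plays the role of real and `time.process_time()` returns the sum of user and sys CPU time for the current process. The `measure` helper is a hypothetical name introduced for the example. Sleeping consumes wall-clock time but almost no CPU time, so the two numbers diverge.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall: {wall:.3f}s, cpu (user + sys): {cpu:.3f}s")
```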


### line_profiler
```
# installation: 
pip install line_profiler

# decorate the functions to be profiled with @profile

# how to run
kernprof -l -v  python_script.py
```

line_profiler reports the following for each line:

Line #: line number

Hits: number of times the line was executed

Time: total time spent on the line

Per Hit: average time taken by one execution of the line

% Time: percentage of the function's total time spent on the line

Line Contents: the source code of the line

% Time is the most useful column when looking for CPU bottlenecks.
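A minimal sketch of a script prepared for `kernprof -l -v`. When run under kernprof, `profile` is injected into the builtins; the fallback below is an assumption added so the same file also runs under plain `python`. The insertion sort is only a stand-in workload and is not taken from the `insertion_sort.py` script mentioned earlier.

```python
try:
    profile  # provided by kernprof -l at run time
except NameError:
    def profile(func):
        """No-op fallback so the script also runs without kernprof."""
        return func


@profile
def insertion_sort(items):
    """Return a sorted copy of items using insertion sort."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements one slot to the right.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


if __name__ == "__main__":
    import random

    data = [random.random() for _ in range(2000)]
    assert insertion_sort(data) == sorted(data)
```

In the resulting report, the lines inside the `while` loop would be expected to dominate the % Time column, since they run far more often than the rest.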

### perf tool
Probing how well memory is being moved to the CPU can be quite hard; however, in
Linux the perf tool can be used to get amazing amounts of insight into how the CPU
is dealing with the program being run.

```
# installation:
sudo apt update
sudo apt install linux-tools-common

# how to run?
sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python script_to_profile.py

# cycles: how many clock cycles the task took
# instructions: how many instructions the code issued to the CPU
# cache-references: memory accesses that were attempted via the cache
# cache-misses: references that were not found in the cache and had to be fetched from RAM
# branches: points in the code where the execution flow changes, e.g. if..else
# branch-misses: branches where the CPU's prediction of the flow (used to preload instructions) was wrong
# faults: page faults, i.e. accesses to memory pages that are not currently mapped for the process
# minor-faults: page faults that were resolved without reading from disk
```


## Strategies for profiling code 
1. Disable TurboBoost in the BIOS.

2. Disable the operating system’s ability to override the SpeedStep (you will find this in your BIOS if you’re allowed to control it).

3. Only use mains power (never battery power).

4. Disable background tools like backups and Dropbox while running experiments.

5. Run the experiments many times to obtain a stable measurement.

6. Possibly drop to run level 1 (Unix) so that no other tasks are running.

7. Reboot and rerun the experiments to double-confirm the results.
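Point 5 can be sketched with the standard-library `timeit` module, which repeats a measurement and returns one total per repeat. The `work` function and the repeat counts are assumptions chosen to keep the example fast. Reporting the minimum alongside the mean and standard deviation is one common choice, since the minimum is the run least disturbed by other activity on the machine.

```python
import statistics
import timeit


def work():
    """Stand-in workload; replace with the code under test."""
    return sum(i * i for i in range(10_000))


REPEATS = 5
CALLS_PER_REPEAT = 20

# timeit.repeat returns one total (in seconds) per repeat.
totals = timeit.repeat(work, repeat=REPEATS, number=CALLS_PER_REPEAT)
per_call = [total / CALLS_PER_REPEAT for total in totals]

print(f"best:  {min(per_call) * 1e6:.1f} µs per call")
print(f"mean:  {statistics.mean(per_call) * 1e6:.1f} µs per call")
print(f"stdev: {statistics.stdev(per_call) * 1e6:.1f} µs")
```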

## Data structures: tradeoffs and quirks
1. Dynamic lists and caching
2. Resizing and memory allocation
3. Difference between Python lists and NumPy arrays
4. Performance analysis of dictionaries

### Lists
Python lists are dynamic arrays: they resize automatically as elements are added.

#### Pros:
1. Users don't need to know the list size beforehand.
2. Users don't have to handle resizing.
#### Cons:
1. Resizing (in the background) makes appending to a list slower.
2. Because of over-allocation during resizing, a list usually takes more space than the elements it holds.
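The over-allocation can be observed directly, assuming CPython (where `sys.getsizeof` reports the size of the list's pointer table, not of the elements). The sketch below appends 64 items and records when the reported size changes; the exact growth pattern is an implementation detail and varies between versions, so no specific sizes are claimed here.

```python
import sys

items = []
sizes = []  # sizes[n] is the list's size in bytes after n + 1 appends
for i in range(64):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# Lengths at which the list had to grow its underlying storage.
resize_lengths = [n + 1 for n in range(1, len(sizes)) if sizes[n] != sizes[n - 1]]

print(f"appends: {len(sizes)}, resizes observed: {len(resize_lengths)}")
print(f"resized at lengths: {resize_lengths}")
```

Most appends leave the size unchanged because they fill slots reserved by an earlier resize, which is also why the list usually occupies more space than its current elements require.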

In [51]:
%timeit [i*i for i in range(10000)]

375 µs ± 9.89 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [52]:
%%timeit l = []
for i in range(10000):
    l.append(i*i)
    
# slower: every iteration looks up and calls l.append, on top of the cost of list resizing

647 µs ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [17]:
%memit [i*i for i in range(1000000)]

peak memory: 115.42 MiB, increment: 6.21 MiB


In [18]:
%%memit l = []
for i in range(1000000):
    l.append(i*i)
    
# shows a larger memory increment, attributed to over-allocation during list resizing

peak memory: 107.53 MiB, increment: 30.25 MiB


## Key Factors
### Memory allocations: 
Memory allocations are not cheap. Whenever you assign a value to a variable, the interpreter may need to ask the operating system for new space and then iterate over that space to access it. Try to allocate space fewer times and reuse it throughout the program.
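One way to apply the "allocate once, reuse" advice with NumPy is the `out=` argument of its ufuncs, which writes the result into an existing array instead of creating a new one. This is a minimal sketch; the `scratch` buffer and the function name `norm_square_reusing_buffer` are hypothetical, and the array size is an arbitrary choice.

```python
import numpy as np

vector = np.arange(100_000, dtype=np.float64)
scratch = np.empty_like(vector)  # allocated once, reused on every call


def norm_square_reusing_buffer(vector, scratch):
    """Sum of squares without allocating a temporary for vector * vector."""
    np.multiply(vector, vector, out=scratch)
    return scratch.sum()


for _ in range(3):
    # Each call overwrites scratch in place; no new array is created.
    result = norm_square_reusing_buffer(vector, scratch)

print(result)
```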

### Memory fragmentation:
Python lists don't natively support vectorization because a list stores pointers to its elements rather than the data itself, so the actual values can be scattered across memory.

### Cache misses
Since the data is fragmented across different memory locations, each piece must be moved separately instead of as one block, which forfeits the benefits of caching.
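The difference in memory layout can be made visible by comparing storage sizes, assuming CPython. A list holds a table of pointers plus one separate `int` object per element, while `array` and NumPy store the raw values back to back. The variable names are illustrative, and the list figures depend on the interpreter version, so only the contiguous sizes are stated in the comments.

```python
import sys
from array import array

import numpy as np

n = 1000
as_list = list(range(n))
as_array = array("q", range(n))          # signed 8-byte integers
as_numpy = np.arange(n, dtype=np.int64)  # 8-byte integers

# The list itself only holds pointers; the int objects live elsewhere.
pointer_table_bytes = sys.getsizeof(as_list)
element_bytes = sum(sys.getsizeof(v) for v in as_list)

# array and NumPy keep the values themselves in one contiguous block.
array_bytes = as_array.itemsize * len(as_array)  # 8 * 1000
numpy_bytes = as_numpy.nbytes                    # 8 * 1000

print(f"list:  {pointer_table_bytes} B of pointers + {element_bytes} B of int objects")
print(f"array: {array_bytes} B contiguous")
print(f"numpy: {numpy_bytes} B contiguous ({as_numpy.flags['C_CONTIGUOUS']})")
```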


In [27]:
from array import array
import numpy as np


def norm_square_list(vector):
    norm = 0
    for v in vector:
        norm += v*v
    return norm


def norm_square_list_comprehension(vector):
    return sum([v*v for v in vector])


def norm_square_array(vector):
    norm = 0
    for v in vector:
        norm += v*v
    return norm


def norm_square_numpy(vector):
    return np.sum(vector * vector)


def norm_square_numpy_dot(vector):
    return np.dot(vector, vector)



In [31]:
%%timeit
vector = list(range(1000000))
norm_square_list(vector)

70.2 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [32]:
%%timeit
vector = list(range(1000000))
norm_square_list_comprehension(vector)

81.1 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [33]:
%%timeit
vector_array = array("l", range(1000000))
norm_square_array(vector_array)

100 ms ± 454 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [35]:
%%timeit
vector_numpy = np.arange(1000000)
norm_square_numpy(vector_numpy)
# NumPy has specialized C code that takes advantage of any vectorization the CPU has enabled.
# NumPy arrays are also stored contiguously in memory as low-level data types.

1.94 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
%%timeit
vector_numpy = np.arange(1000000)
norm_square_numpy_dot(vector_numpy)
# uses specialized code that doesn't need to store the intermediate result of (vector * vector)

1.41 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Dictionaries and namespaces
Python's namespace management relies heavily on dictionaries for its lookups.
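A short sketch of what "namespaces are dictionaries" means in practice, assuming CPython. Module and builtin namespaces are exposed as ordinary `dict` objects, so a name lookup amounts to a dictionary lookup; `show_locals` is a hypothetical helper used only to display a function's local namespace.

```python
import builtins
import math

# The global namespace of the current module is a plain dict.
print(type(globals()))

# A module attribute lookup is a lookup in the module's __dict__.
print(vars(math)["sin"] is math.sin)

# Builtins such as len live in the builtins module's namespace.
print(vars(builtins)["len"] is len)


def show_locals(x):
    """Return a dict view of this function's local names."""
    y = x + 1
    return locals()


print(show_locals(1))
```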

In [6]:
import math
from math import sin
import dis

def test1(x):
    """
    resolve `math` from globals()
    resolve `sin` from math module
    local lookup for x
    
    """
    return math.sin(x)


def test2(x):
    """
    resolve sin from globals()
    resolve x from locals()
    """
    return sin(x)


def test3(x, sin=math.sin):
    """
    resolve sin from locals()
    resolve x from locals()
    """
    return sin(x)

In [2]:
%timeit test1(123456)
# 2 dictionary lookups, 1 local lookup


110 ns ± 0.653 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [3]:
%timeit test2(123456)
# 1 dictionary lookup, 1 local lookup

95.7 ns ± 7.78 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [4]:
%timeit test3(123456)
# 2 local lookups

88.9 ns ± 0.54 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [7]:
dis.dis(test1)

 12           0 LOAD_GLOBAL              0 (math)
              2 LOAD_METHOD              1 (sin)
              4 LOAD_FAST                0 (x)
              6 CALL_METHOD              1
              8 RETURN_VALUE


In [8]:
dis.dis(test2)

 20           0 LOAD_GLOBAL              0 (sin)
              2 LOAD_FAST                0 (x)
              4 CALL_FUNCTION            1
              6 RETURN_VALUE


In [9]:
dis.dis(test3)

 28           0 LOAD_FAST                1 (sin)
              2 LOAD_FAST                0 (x)
              4 CALL_FUNCTION            1
              6 RETURN_VALUE
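The saving per call is only a few nanoseconds, so it matters mainly inside hot loops. A common way to use it there is to bind the function to a local name once, before the loop. The sketch below is an illustration with hypothetical function names; the size of the gain depends on the Python version, and the opcode names shown above also differ between releases.

```python
import math


def sum_sines_global(values):
    """Looks up `math` in globals and `sin` in the module on every iteration."""
    total = 0.0
    for v in values:
        total += math.sin(v)
    return total


def sum_sines_local(values):
    """Resolves math.sin once; the loop then uses a fast local lookup."""
    sin = math.sin
    total = 0.0
    for v in values:
        total += sin(v)
    return total


values = [0.001 * i for i in range(10_000)]
assert sum_sines_global(values) == sum_sines_local(values)
```

Timing both with `%timeit` on the target interpreter is the only reliable way to tell whether the change is worth the loss in readability.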


## Further exploration
1. How can we write modules for Python using C/Fortran?
2. Multiprocessing
3. How can we write code that runs across a cluster and uses more machines?
4. How can we make programs use less RAM?

## References
1. [High Performance Python (book)](https://www.goodreads.com/book/show/49828191-high-performance-python)
2. [What Is Clock Speed?](https://www.intel.com/content/www/us/en/gaming/resources/cpu-clock-speed.html)
3. [BogoMIPS](https://en.wikipedia.org/wiki/BogoMips)
4. [Global Interpreter Lock](https://realpython.com/python-gil/)
5. [dis — Disassembler for Python bytecode](https://docs.python.org/3/library/dis.html)