# A Bit about Bytes: Understanding Python Bytecode

* Speaker: James Bennett
* [Youtube Link](https://www.youtube.com/watch?v=cSSpnq362Bk)
* Date: 12/05/2018
* [Con Info Page](https://us.pycon.org/2018/schedule/presentation/127/)

## Intro

Python is about readable code, but this makes it slow.

> You want to write human-friendly source code.
> 
> Your computer wants binary instructions ("machine code") for its CPU.

We need to get from one to the other:

* Can *compile* directly to CPU instructions.
* Can *interpret* source code while running.
* Can compile to an intermediate set of instructions, called *bytecode*, and implement a virtual machine that turns those into CPU instructions while running.
    * The instructions aren't for any real CPU; the virtual machine plays the part of that CPU on top of the actual processor.
    * Like Java bytecode for JVM.
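As a rough illustration of the third option, Python's built-in ````compile```` turns source text into a code object holding bytecode, and ````exec```` hands that object to the virtual machine. The source string and the ````<example>```` filename are made up for this sketch.

```python
# Compile one line of source into a code object (bytecode plus metadata).
source = "total = 2 + 3"
code_obj = compile(source, "<example>", "exec")

# The bytecode itself is plain bytes; it is not machine code for a real CPU.
print(type(code_obj.co_code))

# Hand the code object to the interpreter's virtual machine to run it.
namespace = {}
exec(code_obj, namespace)
print(namespace["total"])
```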

As an example, we're using this Fibonacci generator:

````
def fib(n):
    if n < 2:
        return n
    current, next = 0, 1
    while n:
        current, next = next, current + next
        n -= 1
    return current

fib(5)
````

````
5
````

This would be saved as *fibonacci.py*, but importing it may also produce *fibonacci.pyc*. Python 2 puts these directly in the folder with the *.py* file, but Python 3 puts them in the *\_\_pycache\_\_* folder. These files are sometimes called compiled code: the source pre-compiled down to bytecode.
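A minimal sketch of producing such a file by hand with the standard library's ````py_compile```` module, assuming Python 3. The temporary folder and the one-line module body are placeholders, not the code from the talk; normally importing the module triggers this step.

```python
import os
import py_compile
import tempfile

# Write a throwaway module to a temporary folder (a stand-in for fibonacci.py).
folder = tempfile.mkdtemp()
source_path = os.path.join(folder, "fibonacci.py")
with open(source_path, "w") as handle:
    handle.write("def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n")

# Compile it to bytecode explicitly and get back the path of the .pyc file.
pyc_path = py_compile.compile(source_path)
print(pyc_path)
```

With default settings the printed path should sit inside a *\_\_pycache\_\_* folder next to the source file.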

Every function has a ````__code__```` attribute, which holds its code object:

````
fib.__code__
````

````
<code object fib at 0x000001D3F9D5A270, file "<ipython-input-1-aada8b8bfe46>", line 1>
````

The speaker references [this talk](https://www.youtube.com/watch?v=XhWvz4dK4ng) for more info on code objects.

One useful part is ````co_consts````, which lists all constants referenced by the code:

````
fib.__code__.co_consts
````

````
(None, 2, 0, 1, (0, 1))
````

The presence of ````None```` is notable: it's needed in case the function ends without an explicit ````return````, in which case it returns ````None````.

Another useful part is ````co_varnames```` (all variables in the function):

````
fib.__code__.co_varnames
````

````
('n', 'current', 'next')
````

And ````co_names```` (all the non-local names). As the example function doesn't use any, it's an empty tuple:

````
fib.__code__.co_names
````

````
()
````
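For contrast, here is a sketch of a function that does use non-local names. The ````describe```` function is hypothetical and only exists to give ````co_names```` something to show.

```python
def describe(text):
    # 'text' and 'label' are locals; 'upper', 'str' and 'len' are not.
    label = text.upper()
    return label + str(len(text))

code = describe.__code__
print(code.co_varnames)  # the argument and the local variable
print(code.co_names)     # attribute and global names the function refers to
```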

And the part that the speaker's interested in is ````co_code````: this *is* the bytecode of the object. It isn't a string, but a series of bytes:

````
fib.__code__.co_code
````

````
b'|\x00d\x01k\x00r\x0c|\x00S\x00d\x04\\\x02}\x01}\x02x\x1e|\x00r2|\x02|\x01|\x02\x17\x00\x02\x00}\x01}\x02|\x00d\x038\x00}\x00q\x16W\x00|\x01S\x00'
````

The first character (the pipe) is standing in for a value:

````
ord('|')
````

````
124
````

So the first byte has a byte value of 124 - though that's still referencing something else.

The ````dis```` (disassemble) module is in the standard library and has information that can decode these, like the ````opname```` list:

````
import dis
dis.opname[ord('|')]
````

````
'LOAD_FAST'
````

So that first pipe is ````LOAD_FAST````. The next byte is 0, so the instruction is ````LOAD_FAST 0````: look up what's at index 0 in ````co_varnames```` (here ````n````) and push its value onto the evaluation stack.
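The same lookup can be repeated for every byte pair. This is a sketch assuming Python 3.6 or later, where each instruction is an opcode byte followed by an argument byte. Opcode names and numbers differ between Python versions, and newer releases add extra entries, so the printed names may not match the output shown in these notes.

```python
import dis

def fib(n):
    if n < 2:
        return n
    current, next = 0, 1
    while n:
        current, next = next, current + next
        n -= 1
    return current

raw = fib.__code__.co_code
decoded = []
# Step through the bytes two at a time: opcode first, then its argument.
for offset in range(0, len(raw), 2):
    opcode, argument = raw[offset], raw[offset + 1]
    decoded.append((offset, dis.opname[opcode], argument))

for offset, name, argument in decoded[:4]:
    print(offset, name, argument)
```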

Before going any further, we can actually use the easy way to look at bytecode:

````
dis.dis(fib)
````

````
  2           0 LOAD_FAST                0 (n)
              2 LOAD_CONST               1 (2)
              4 COMPARE_OP               0 (<)
              6 POP_JUMP_IF_FALSE       12

  3           8 LOAD_FAST                0 (n)
             10 RETURN_VALUE

  4     >>   12 LOAD_CONST               4 ((0, 1))
             14 UNPACK_SEQUENCE          2
             16 STORE_FAST               1 (current)
             18 STORE_FAST               2 (next)

  5          20 SETUP_LOOP              30 (to 52)
        >>   22 LOAD_FAST                0 (n)
             24 POP_JUMP_IF_FALSE       50

  6          26 LOAD_FAST                2 (next)
             28 LOAD_FAST                1 (current)
             30 LOAD_FAST                2 (next)
             32 BINARY_ADD
             34 ROT_TWO
             36 STORE_FAST               1 (current)
             38 STORE_FAST               2 (next)

  7          40 LOAD_FAST                0 (n)
             42 LOAD_CONST               3
````

````dis.dis```` will take and disassemble pretty much any Python object.

* The number on the far left is the line number in the source code.
* Each instruction has a number next to it (always even).
    * This is the offset into the bytecode.
    * The odd offsets hold the arguments: as of Python 3.6 every instruction gets an argument byte whether it wants one or not, making each instruction 2 bytes.
* Lines with the double-arrow (````>>````) next to the offset are jump targets (may be jumped to by another instruction).
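The same columns are available programmatically through ````dis.get_instructions````. A small sketch follows; the exact instructions printed depend on the Python version, so treat the output as illustrative only.

```python
import dis

def fib(n):
    if n < 2:
        return n
    current, next = 0, 1
    while n:
        current, next = next, current + next
        n -= 1
    return current

instructions = list(dis.get_instructions(fib))
for instruction in instructions:
    # Mirror the ">>" marker that dis.dis prints for jump targets.
    marker = ">>" if instruction.is_jump_target else "  "
    print(marker, instruction.offset, instruction.opname, instruction.argrepr)
```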


## [About Python's VM](https://youtu.be/cSSpnq362Bk?t=13m26s)

* Built in CPython?
* Python's VM is stack-oriented
    * Frames are pushed onto the call stack whenever a function is called.
        * Keeps track of every function being executed.
    * Then popped off the top when the function returns, and the return value is pushed onto the calling frame's evaluation stack.
* Each frame on the call stack has 2 more stacks of its own in CPython:
    * Evaluation stack / Data stack
        * Holds the values being worked on (arguments, intermediate results etc).
        * Most of the execution happens here.
    * Block stack
        * Keeps track of blocks (````with```` / ````try/except```` blocks etc).
        * Needed as some functionality needs to know what the current block is.
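The call stack can be observed from inside Python. A minimal sketch using ````inspect.currentframe````, assuming CPython; ````inner```` and ````outer```` are hypothetical functions used only to create two frames.

```python
import inspect

def inner():
    # Walk from the current frame back through the frames that called it.
    frame = inspect.currentframe()
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names

def outer():
    return inner()

call_chain = outer()
# The innermost frame comes first, followed by its caller.
print(call_chain[:2])
```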


## [Executing a Function](https://youtu.be/cSSpnq362Bk?t=15m25s)

As an example, take calling ````fib(8)```` to get the 8th Fibonacci number with our function. This becomes 3 bytecode instructions:

````
0  LOAD_GLOBAL    0 (fib)
2  LOAD_CONST     1 (8)
4  CALL_FUNCTION  1 
````

We start with an empty evaluation stack, and the three instructions work on it in turn:

* ````LOAD_GLOBAL```` - Load the global name ````fib```` - our Fibonacci function.
    * Looks in the ````co_names```` tuple.
* ````LOAD_CONST```` - Gets the nth item out of the tuple of constants.
    * Index 0 would be ````None````, so the argument is 1.
* ````CALL_FUNCTION```` with an argument 1 (the number of positional arguments).
    * When we're only using positional arguments, Python will push the function and all its arguments onto the stack until it hits ````CALL_FUNCTION```` when it pops all of them off.
    * Pushes a new frame onto the Call stack, executes the Fibonacci function in that frame.
    * Gets a return value of 21 and pops that frame off the Call stack.
    * Pushes the new value back onto the Evaluation stack where we called the Fibonacci function.

````CALL_FUNCTION```` is only for positional arguments. If keyword arguments are used, the compiler emits ````CALL_FUNCTION_KW```` instead, or ````CALL_FUNCTION_EX```` for calls that unpack arguments with ````*```` or ````**````.
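The call can be reproduced with ````compile```` and ````eval````. This is a sketch; the ````<call>```` filename is a placeholder, and the call opcodes have been renamed in later Python versions, so the names printed by the loop may differ from the ones above.

```python
import dis

def fib(n):
    if n < 2:
        return n
    current, next = 0, 1
    while n:
        current, next = next, current + next
        n -= 1
    return current

# Compile the call expression on its own and look at the names it refers to.
call_code = compile("fib(8)", "<call>", "eval")
print(call_code.co_names)

# Running the code object pushes a frame for fib and returns its result.
result = eval(call_code, {"fib": fib})
print(result)

# Compare which call instructions each calling style compiles to.
for source in ("fib(8)", "fib(n=8)", "fib(*args)"):
    code = compile(source, "<call>", "eval")
    names = [ins.opname for ins in dis.get_instructions(code)]
    print(source, [name for name in names if "CALL" in name])
```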

The [dis module documentation](https://docs.python.org/3/library/dis.html) has a lot of useful info on all the bytecode instructions. Other useful functions in the module include [distb](https://docs.python.org/3/library/dis.html#dis.distb), which disassembles the function at the top of a traceback and points to the instruction where the exception was raised.

The interpreter itself is written in C. The main body of it is a giant ````switch```` (case) statement.
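To give a feel for that dispatch loop, here is a toy stack machine in Python. It is not CPython's implementation: it borrows a handful of instruction names, takes names directly as arguments instead of indexes, and has no jumps, frames or block stack.

```python
def run(instructions, constants, namespace):
    """Toy stack machine: a loop that dispatches on the instruction name."""
    stack = []
    for opname, argument in instructions:
        if opname == "LOAD_CONST":
            stack.append(constants[argument])
        elif opname == "LOAD_NAME":
            stack.append(namespace[argument])
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "CALL_FUNCTION":
            # Pop the arguments (pushed last), then the function beneath them.
            arguments = [stack.pop() for _ in range(argument)][::-1]
            function = stack.pop()
            stack.append(function(*arguments))
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError("unknown instruction: " + opname)

# Equivalent to evaluating abs(-10 + 3).
program = [
    ("LOAD_NAME", "abs"),
    ("LOAD_CONST", 0),
    ("LOAD_CONST", 1),
    ("BINARY_ADD", None),
    ("CALL_FUNCTION", 1),
    ("RETURN_VALUE", None),
]
result = run(program, constants=(-10, 3), namespace={"abs": abs})
print(result)
```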

## [What Can we Learn From Bytecode](https://youtu.be/cSSpnq362Bk?t=19m28s)

The main actual use of Python bytecode is understanding what exactly's going on and how Python works. In particular from a performance point of view.

For example, two functions to calculate the number of seconds in a week:

````
def slow_week():
    SECONDS_PER_DAY = 86400
    return SECONDS_PER_DAY * 7

dis.dis(slow_week)
````

````
  2           0 LOAD_CONST               1 (86400)
              2 STORE_FAST               0 (SECONDS_PER_DAY)

  3           4 LOAD_FAST                0 (SECONDS_PER_DAY)
              6 LOAD_CONST               2 (7)
              8 BINARY_MULTIPLY
             10 RETURN_VALUE
````


````
def fast_week():
    return 86400 * 7

dis.dis(fast_week)
````

````
  2           0 LOAD_CONST               3 (604800)
              2 RETURN_VALUE
````


The second one doesn't have to store the value in a local variable and then load it back again.

One interesting point is that Python has folded the two numbers together at compile time, so the multiplication doesn't happen every time the function is called. Python also does a bit of branch prediction optimisation, along with a few other bits.
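One way to check the folding without reading the disassembly is to look for the product in ````co_consts````. This is a sketch assuming CPython; other implementations may not fold constants the same way.

```python
def slow_week():
    SECONDS_PER_DAY = 86400
    return SECONDS_PER_DAY * 7

def fast_week():
    return 86400 * 7

# The folded product is stored as a constant of fast_week...
print(604800 in fast_week.__code__.co_consts)
# ...but slow_week only holds the two separate numbers.
print(604800 in slow_week.__code__.co_consts)
```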

Other examples are the performance of different ways of creating dictionaries:

````
dis.dis('{}')
````

````
  1           0 BUILD_MAP                0
              2 RETURN_VALUE
````


````
dis.dis('dict()')
````

````
  1           0 LOAD_NAME                0 (dict)
              2 CALL_FUNCTION            0
              4 RETURN_VALUE
````


So calling ````dict```` takes more instructions, and one of them is a call (which is expensive as it puts another frame on the call stack).
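Instruction counts are only a hint, so it's worth measuring. A rough sketch with ````timeit```` follows; the repetition count is arbitrary and the timings will vary by machine and Python version.

```python
import timeit

# Time each way of building an empty dictionary over the same number of runs.
literal_time = timeit.timeit("{}", number=200_000)
call_time = timeit.timeit("dict()", number=200_000)

print(f"{{}}: {literal_time:.4f}s  dict(): {call_time:.4f}s")
```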

### [Example Function Optimisation](https://youtu.be/cSSpnq362Bk?t=24m11s)

With a function to calculate the first 10 perfect squares (note that this version starts counting from 0, so its result also includes 0):

````
def squares_while():
    squares = []
    i = 0
    while i <= 10:
        squares.append(i ** 2)
        i += 1
    return squares

dis.dis(squares_while)
````

````
  2           0 BUILD_LIST               0
              2 STORE_FAST               0 (squares)

  3           4 LOAD_CONST               1 (0)
              6 STORE_FAST               1 (i)

  4           8 SETUP_LOOP              34 (to 44)
        >>   10 LOAD_FAST                1 (i)
             12 LOAD_CONST               2 (10)
             14 COMPARE_OP               1 (<=)
             16 POP_JUMP_IF_FALSE       42

  5          18 LOAD_FAST                0 (squares)
             20 LOAD_ATTR                0 (append)
             22 LOAD_FAST                1 (i)
             24 LOAD_CONST               3 (2)
             26 BINARY_POWER
             28 CALL_FUNCTION            1
             30 POP_TOP

  6          32 LOAD_FAST                1 (i)
             34 LOAD_CONST               4 (1)
             36 INPLACE_ADD
             38 STORE_FAST               1 (i)
             40 JUMP_ABSOLUTE           10
        >>   42 POP_BLOCK

  7     >>   44 LOAD_FAST
````

Changing the while loop with a counter to a for loop with ````range````:

````
def squares_range():
    squares = []
    for i in range(1, 11):
        squares.append(i ** 2)
    return squares

dis.dis(squares_range)
````

````
  2           0 BUILD_LIST               0
              2 STORE_FAST               0 (squares)

  3           4 SETUP_LOOP              32 (to 38)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_CONST               1 (1)
             10 LOAD_CONST               2 (11)
             12 CALL_FUNCTION            2
             14 GET_ITER
        >>   16 FOR_ITER                18 (to 36)
             18 STORE_FAST               1 (i)

  4          20 LOAD_FAST                0 (squares)
             22 LOAD_ATTR                1 (append)
             24 LOAD_FAST                1 (i)
             26 LOAD_CONST               3 (2)
             28 BINARY_POWER
             30 CALL_FUNCTION            1
             32 POP_TOP
             34 JUMP_ABSOLUTE           16
        >>   36 POP_BLOCK

  5     >>   38 LOAD_FAST                0 (squares)
             40 RETURN_VALUE
````


We start to pull down the number of operations by doing fewer things explicitly.

Going further into idiomatic Python:

````
def squares_comprehension():
    return [i ** 2 for i in range(1, 11)]

dis.dis(squares_comprehension)
````

````
  2           0 LOAD_CONST               1 (<code object <listcomp> at 0x000001D3F9DE3DB0, file "<ipython-input-16-3b0ae924b76d>", line 2>)
              2 LOAD_CONST               2 ('squares_comprehension.<locals>.<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_CONST               3 (1)
             10 LOAD_CONST               4 (11)
             12 CALL_FUNCTION            2
             14 GET_ITER
             16 CALL_FUNCTION            1
             18 RETURN_VALUE
````


Not all calls are equal in cost, though: this version requires both ````MAKE_FUNCTION```` and ````CALL_FUNCTION````, which are both expensive.

This starts to run into the territory of micro-optimisations pretty quickly, in which case you probably should just use functions implemented in C. There's very little contest.
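Before acting on any of this, it helps to measure. This is a sketch that counts instructions and times the three variants. The counts only cover each function's top-level code object, the repetition count is arbitrary, and results vary by machine and Python version.

```python
import dis
import timeit

def squares_while():
    squares = []
    i = 0
    while i <= 10:
        squares.append(i ** 2)
        i += 1
    return squares

def squares_range():
    squares = []
    for i in range(1, 11):
        squares.append(i ** 2)
    return squares

def squares_comprehension():
    return [i ** 2 for i in range(1, 11)]

for function in (squares_while, squares_range, squares_comprehension):
    # Count the instructions, then time repeated calls of the same function.
    count = sum(1 for _ in dis.get_instructions(function))
    seconds = timeit.timeit(function, number=20_000)
    print(function.__name__, count, round(seconds, 4))
```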

## [General Guidelines](https://youtu.be/cSSpnq362Bk?t=26m16s)

* Local names are faster than globals.
    * ````LOAD_CONST```` > ````LOAD_FAST```` > ````LOAD_NAME```` or ````LOAD_GLOBAL````.
    * The wider the search space, the slower the search (particularly if we don't know which place to look).
* Loops and blocks are expensive.
    * e.g. ````SETUP_LOOP````, ````SETUP_WITH```` and ````SETUP_EXCEPT````.
    * I'm not sure what the balance of resilient code and quick code is here (e.g. closing files after an error).
* Attribute access, dictionary access etc. is expensive.
    * If something is looked up repeatedly (e.g. inside a loop), alias it to a local variable first.
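A sketch of the aliasing guideline follows. ````roots_lookup```` and ````roots_aliased```` are hypothetical functions written for this comparison. Whether the aliased version is measurably faster depends on the Python version, so the timing is there to check rather than to prove the point.

```python
import math
import timeit

def roots_lookup(values):
    results = []
    for value in values:
        # Looks up the global 'math', then its 'sqrt' attribute, on every pass.
        results.append(math.sqrt(value))
    return results

def roots_aliased(values):
    results = []
    # Do the global and attribute lookups once, outside the loop.
    sqrt = math.sqrt
    append = results.append
    for value in values:
        append(sqrt(value))
    return results

data = list(range(1000))
for function in (roots_lookup, roots_aliased):
    seconds = timeit.timeit(lambda: function(data), number=1_000)
    print(function.__name__, round(seconds, 4))
```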


## [Recommended Reading](https://youtu.be/cSSpnq362Bk?t=28m10s)

* [Inside the Python Virtual Machine, Obi Ike-Nwosu](https://leanpub.com/insidethepythonvirtualmachine)
* [A Python Interpreter Written in Python, Allison Kaptur](http://www.aosabook.org/en/500L/a-python-interpreter-written-in-python.html)
* [The CPython bytecode interpreter](https://github.com/python/cpython/blob/master/Python/ceval.c)
