In [1]:
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings = lambda *a, **kw: None
from IPython.core.display import HTML; HTML(open("../documents/custom.html", "r").read())

<span style="font-size:250%;">General Introduction</span>

# Why is my program slow?

There are many reasons why you might have to wait a week for your computation to finish:

1. Your implementation is inefficient or you chose the wrong algorithm.
2. The programming language you are using is slow.
3. Your program performs a huge number of operations, like adding a trillion numbers or multiplying huge matrices.
4. You are processing a huge amount of data.
5. Your computer's hardware is slow.


As you can see, Python itself is not always the reason why your program is slow. We will address the first four points during the rest of the workshop.

Point 5 requires some extra considerations:

# Evolution of computer speed since the 70s...

- The operations on a computer are controlled by an **internal clock**, and operations happen in so-called **clock cycles**.
- Typical instructions like 'add two integers' take one or a few cycles, depending on the CPU type.
- Thus clock speed is the main determinant of a computer's speed for single-core computers.
- `1 GHz` clock speed means `1 billion cycles per second`.
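
As a rough illustration of these numbers, the following sketch converts a cycle count into time. The function name `cycles_to_ns` is made up for this example, and the calculation ignores everything that makes real CPUs more complicated (pipelining, caches, ...).

```python
def cycles_to_ns(cycles, clock_ghz):
    """Return the time in nanoseconds needed for `cycles` clock cycles.

    1 GHz means 1e9 cycles per second, which is one cycle per nanosecond.
    """
    return cycles / clock_ghz


# one cycle on a 1 GHz CPU
print(cycles_to_ns(1, clock_ghz=1.0), "ns")
# one billion cycles on a 3 GHz CPU, converted from ns to s
print(cycles_to_ns(1e9, clock_ghz=3.0) / 1e9, "s")
```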

The following plot shows the average clock speed of a selected set of Intel processors over time. 

<img src="images/clockrates.png"/>

- Since this plot is logarithmic on the y-axis, you can see there was an **exponential growth over many years**, which stopped around the middle of the 2000s.
- The main reason is physical: higher clock rates require smaller electric circuits, and further miniaturization is limited by other physical and also economic constraints.

Until then, the easiest way to get your results faster was to buy the next generation CPU, but **during the 2000s the "free lunch" was over**.

What does this mean?

- Before that, buying a faster computer cost money, but you could still use the same software.
- Since the mid-2000s computers have become faster mainly by building processors with multiple cores, but to use these cores, programs need to be rewritten and require different models of computation.

Another approach to improve the speed of computations is the development of [GPUs](https://en.wikipedia.org/wiki/Graphics_processing_unit) (Graphics Processing Units), which can be much faster than `CPU`s but only for a limited set of applications. `GPU`s originated as special processors on graphics cards for graphics-related computations, where they offer significantly higher speed than traditional `CPU`s.

**The biggest advance in speed comes from development of better algorithms anyway**.



# Python is slower than ...

Python is known to be slower than many other programming languages. This is a chart with measured runtimes for one particular problem (computing $\pi$ using the Leibniz formula):

<center>
<img src="images/speed.png" width=70% />
</center>

1. `CPython` (this is the "default" Python interpreter) is in the second-to-last position,
2. ... but there is also `pypy` which we will learn about later.
3. Be careful with absolute numbers, the ranking might change for different computing problems.
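
For reference, the benchmark problem can be sketched in a few lines of Python. This is our own minimal version of the Leibniz series $\pi/4 = 1 - 1/3 + 1/5 - \dots$ and not the code which produced the chart; the function name and the number of terms are chosen for illustration only.

```python
def leibniz_pi(n_terms):
    """Approximate pi with the first `n_terms` terms of the Leibniz series."""
    total = 0.0
    sign = 1.0
    for k in range(n_terms):
        total += sign / (2 * k + 1)   # 1, -1/3, 1/5, -1/7, ...
        sign = -sign
    return 4.0 * total


print(leibniz_pi(1_000_000))
```

The series converges slowly, so the loop body runs very often. This makes the problem a typical test for the overhead of executing simple arithmetic in a loop.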


On the other hand, Python programs are very expressive, which means that you need on average only about 1/6 of the lines of code to achieve the same goal as in `C`:


|Language|Statements Ratio|Lines Ratio|
| --- | --- |--- |
|C| 1 | 1  |
|C++| 0.4 | 1 |
|Fortran | 0.5 | 0.125 |
|Java | 0.4 | 0.66 |
|Perl | 0.17 | 0.16 |
|Python | 0.17 | 0.15 | 


Source: https://en.wikipedia.org/wiki/Comparison_of_programming_languages#Expressiveness

The crucial question is not if something **is slow** but if a program is **too slow**!

1. If a program needs 50 ms in Python and 1 ms in C and you run the program once, you will not notice the difference.
2. But if you run the program 10,000 times, it will be 500 seconds (short coffee break) vs 10 seconds (not enough time to check emails), which is definitely noticeable.
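
The arithmetic behind this comparison, as a small sketch. The two runtimes are the example values from the text; the third run count is an additional made-up value to show how the gap grows.

```python
runtime_python_s = 0.050   # 50 ms per run
runtime_c_s = 0.001        # 1 ms per run

for n_runs in (1, 10_000, 1_000_000):
    total_python = n_runs * runtime_python_s
    total_c = n_runs * runtime_c_s
    print(f"{n_runs:>9} runs: Python {total_python:>9.1f} s, C {total_c:>7.1f} s")
```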

# Why is Python slower than...?

Programming languages can be distinguished and categorized by their "level". This is not a hard measure but rather a "soft" concept.

- Lower level programming languages are conceptually "close to the hardware". This means you have to access memory cells directly and manage memory yourself, and data structures such as lists and dictionaries are not part of the language.

    - Pros: fine grained control of code execution and memory usage, so the programmer can highly optimize code in terms of speed and memory consumption.
    - Cons: the programmer has to take care of many details, can make many mistakes and may eventually crash a computer.


- High level programming languages offer concepts which are more abstract, and the user usually does not have to care about details like memory management.
    - Pros: a more "human" programming model and more "human" concepts.
    - Cons: the programmer is not in control of all details of code execution.
  
<br />
<br />

<div style="font-size: 0.8em;">
     <center>
<img src="images/low_high_level_languages.png" width=50%  /><br/>
   
(C) 2021 ETH Zurich
    </center>
</div>
<br />
<br />
  


## About compilation

Compilation is the translation of source code to a lower level language or representation, such as assembler.

- E.g. `C`, `Fortran` are compiled to machine code.
- `C++` is compiled to machine code too, but the first compilers in the 1980s and early 1990s translated `C++` to `C`, which in turn was compiled to machine code.
- `Java` is compiled to so called *Java byte code* which is executed by the so called `Java virtual machine (JVM)`.
- The `CPython` interpreter first compiles the Python source code to *Python byte code* which is then executed by the *Python virtual machine*.

Compilation to machine code results in so called **executables**. 

- Compilation to machine code can take a while, depending on the size of the code base and the programming language.
- Compilation is only necessary when the source code changes.

## Example: Addition in Assembler, C and Python

We demonstrate how to implement addition in assembler, C and Python and explain what happens "under the hood" and how this affects performance.

### Addition in Assembler

The following four statements add two numbers from two memory locations and store the sum at another location (we use basic `x86` assembler here):

```
   MOV EAX, [2000]    ; move content of memory cell 2000 into EAX
   MOV EBX, [2001]    ; move content of memory cell 2001 into EBX
   ADD EAX, EBX       ; add content of EBX to EAX
   MOV [2002], EAX    ; store content of EAX in memory cell 2002
```

1. This is all about integer numbers; addition of floating point numbers would result in different assembler code.

2. `EAX` and `EBX` are registers. A register can be considered as a "hardware variable".
3. This full snippet only takes a few CPU cycles (the exact number depends on context). E.g. 2 cycles on a 3 GHz CPU take `2/3e9 s ~ 0.67 ns` (light travels ~20 cm in that time!).

Some **restrictions of assembler**:

1. There are no variables, only memory locations.

2. Branching and looping: you can only "jump" to another instruction depending on simple conditions.
3. Very restricted support for "function call": you can jump to a specific memory location and make sure that the referred code knows where to jump back to. Data layout for arguments and return values has to be handled by the programmer.

**Relation to machine code**:

1. Assembler is a text notation for machine code which is a sequence of zeros and ones. This translation is very simple, e.g. the first `MOV EAX, ...` corresponds to `10100001` followed by the binary representation of `2000`.
2. So machine code is a sequence of bytes in the memory.
3. During execution the CPU fetches code from one memory cell, executes it, fetches the next instruction, executes it and so on. This all happens "in hardware".
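
To make the "sequence of bytes" more tangible, the following sketch builds the bytes of the first instruction in Python. It assumes the 32-bit `x86` encoding in which the opcode `0xA1` ("load `EAX` from an absolute address") is followed by the address as a 4-byte little-endian integer; other CPU modes or assemblers may choose a different encoding. The constant name is made up for this example.

```python
import struct

OPCODE_MOV_EAX_FROM_ADDRESS = 0xA1   # assumed 32-bit x86 encoding
address = 2000

# opcode byte followed by the address as 4 bytes, least significant byte first
machine_code = bytes([OPCODE_MOV_EAX_FROM_ADDRESS]) + struct.pack("<I", address)

print(format(OPCODE_MOV_EAX_FROM_ADDRESS, "08b"))   # the opcode as zeros and ones
print(machine_code.hex(" "))                        # the full instruction as bytes
```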


<br/><br/>
<center>
<div>
<table><tr>

<td width=20%/>
<td>    
<a href="https://imgflip.com/i/4ter5v"><img src="./images/4ter5v.jpg" title="made at imgflip.com"  /></a><div><a href="https://imgflip.com/memegenerator">from Imgflip Meme Generator</a></div>

</td><td>


<img src="./images/Used_Punchcard_%285151286161%29.jpg" width=50% />
<br/>
Source: wikipedia.com
    
</td></tr></table>
    </div>
</center>
    
In earlier times the translation from assembler to a sequence of zeros and ones was done by the programmer, and the ones were punched as holes into cardboard!

### Addition in C



Like most languages which compile to machine code, `C` is *statically typed*, which means each variable has a defined type which must be declared by the programmer. This also applies to function arguments and return values.

Example in `C`:

```
int add(int a, int b) {
    int c = a + b;
    return c;
}
```

1. The first line declares a function `add` which takes two integer values `a` and `b` and returns an `int`. 
2. C does not use indentation to mark the body of the function but curly braces instead.
3. During function execution the sum is computed and stored in another variable `c` which is also of type `int`.

The `C` compiler translates this code to machine code. 

The `C` programmer 
- uses variables instead of memory addresses, 
- thinks in "types" instead of different assembler instructions for integers or floating point numbers,
- structures code by using functions.

So we can see that `C` already is much more user friendly, but one can still achieve the speed of assembler in most situations.

The cost for this user-friendliness is the extra compilation time before you can run the program. For every subsequent run compilation is not necessary any more.

### Addition in Python

In [1]:
def add(a, b):
    c = a + b
    return c

Python also has a compilation step involved: 
- It compiles a Python text file to *byte code*, which is finally executed by the *Python virtual machine*.
- The byte code is cached in `*.pyc` files which avoids this compilation step for subsequent runs. 
- This compilation step is fast compared to the later program execution.


We can inspect the byte code using the `dis` module from the standard library:

In [2]:
import dis
dis.dis(add)

  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 STORE_FAST               2 (c)

  3           8 LOAD_FAST                2 (c)
             10 RETURN_VALUE


Comments:
1. You can ignore the last two lines: this is the code to return `c`.
2. The rest looks very similar to the assembler example: we fetch (load) two values, add them and store the result.

**Why is this slow?**

The biggest difference to the assembler example is how `BINARY_ADD` works. For execution the Python interpreter has to figure out:

1. Which types are involved? (Do we add strings, numbers, lists?)

2. Is the addition well defined (we cannot add numbers and strings)? If not: raise an exception.
3. Maybe convert types for mixed operands, like adding integers and floats.
4. Look up the actual implementation of the addition (side note: you can implement your own addition operation, "overload addition", when you implement Python classes).
5. Call the actual implementation.
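
The following sketch makes these steps visible from the Python side: one and the same `add` function works for different types, uses an addition we implemented ourselves, and raises an exception when the addition is not defined. The class `Meters` is a made-up example.

```python
class Meters:
    """Tiny example class with its own addition ("operator overloading")."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # called by the interpreter for `Meters + something`
        if not isinstance(other, Meters):
            return NotImplemented   # tells Python: this combination is not supported
        return Meters(self.value + other.value)


def add(a, b):
    return a + b


print(add(1, 2))                        # int + int
print(add(1, 2.5))                      # mixed operands: the result is a float
print(add("ab", "cd"))                  # string concatenation
print(add(Meters(1), Meters(2)).value)  # our own __add__ is looked up and called

try:
    add(1, "a")                         # not well defined
except TypeError as error:
    print("TypeError:", error)
```

All these decisions are made again for every single addition at run time, whereas a `C` compiler makes them once during compilation.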

Beyond this complexity all these operations happen **in software** and not directly on the CPU.


<div class="alert alert-block alert-info">
    <p style="font-weight: bold; font-size:120%;"><i class="fa fa-info-circle"></i>&nbsp; Note</p>

To be correct: eventually everything happens on the CPU, but the CPython interpreter (which is implemented in `C` (sic!)), needs many more machine instructions for each of these operations. The factor depends on the byte code instruction and context but is in the order of tens or hundreds!


</div>

But Python offers many features which are not present in `C` or difficult to implement, e.g.

1. `C` for-loops just count from `a` to `b` with step size `c`. Python can loop over lists, file handles, strings, ...

2. You have to manage memory in `C`: for arbitrarily sized arrays you have to ask the OS for memory, you must not forget to release the memory when done, and if you write to non-allocated memory your program will crash! I guess you never thought about this when programming in Python.
3. `C` has no dictionaries, lists, no `import` and only basic and cumbersome string handling.
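
A short sketch of the features just listed. `io.StringIO` is used here as a stand-in for a real file handle so that the example runs without any file on disk.

```python
import io

# a list, a string and a file-like object: one and the same loop syntax
for item in [1, 2, 3]:
    print(item)

for character in "abc":
    print(character)

file_handle = io.StringIO("first line\nsecond line\n")   # stands in for open(...)
for line in file_handle:
    print(line.rstrip())

# dictionaries are built in and grow on demand, no manual memory management
squares = {}
for n in range(5):
    squares[n] = n * n
print(squares)
```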

<center>
<img src="./images/python.png" width=30% title="I wrote 20 short programs in Python yesterday.  It was wonderful.  Perl, I'm leaving you."/>
</center>

1. This is from https://xkcd.com/353/, so don't forget to check the hover text!
2. Run `import antigravity` in a Python console on your local computer! (will not work here in the notebook).



# Serial vs concurrent vs parallel 
Some definitions:

**serial execution** means that a task is executed from start to end without any interruption.

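**concurrent execution** means that several tasks are in progress during the same period of time: a task can be interrupted so that another task can run, and is continued later. This is the opposite of **serial execution**.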

**parallel execution** means that 2 or more tasks are executed at the same time. Concurrent execution can happen in parallel, but this is not a must.


The **CPU** (central processing unit) is the core of all computations and program execution on a computer.

Modern CPUs consist of multiple **cores** ($n = 4, 8, 16, ..$) which can **operate in parallel**. So the more cores a computer has the more operations can be executed at the same time. This is why you can expect that computers with more cores are generally faster.

<center>
<div style="font-size: 0.8em;">
<img src="images/serial_concurrent_parallel.png" width=66%  /><br/>
    <center>
(C) 2021 ETH Zurich
    </center>
</div>
    </center>
<br />
<br />

You can see that parallel execution on two cores finishes the tasks faster, but not by a factor of 2.


**To enable parallel computations code must be broken down into parts which can run independently in parallel**. This might require some effort ("free lunch is over").



# Local vs distributed computing

Using the power of multiple CPU cores is not limited to the CPU cores on a single **local** computer. High Performance clusters support parallel execution of tasks **distributed over multiple machines**.

# Check questions  <i class="fa fa-check-circle" aria-hidden="true" style="background:#80ff80; margin: 10px; padding: 20px;"></i> 

1. Why is the "free lunch over"? What are the consequences?
2. Why is addition in Python slower than in `C`?
3. What do `for` loops in Python offer compared to `for` loops in `C`?
4. What is sequential processing in contrast to concurrent processing?


# The infamous GIL 

Python has severe limitations in terms of parallel execution of a single program on multiple cores.

The so called **GIL** (global interpreter lock) prohibits running code **in a single Python interpreter** on different cores in parallel!

<center>
<a href="https://imgflip.com/i/4tf3pj"><img src="./images/4tf3pj.jpg" title="made at imgflip.com" width=30%/></a><div><a href="https://imgflip.com/memegenerator">from Imgflip Meme Generator</a></div>
    </center>

This does not mean that (pure) Python programs **cannot** benefit from multiple cores. 

1. One can run multiple Python interpreters which execute code in parallel (see later in the script when we introduce the `multiprocessing` library).
2. You can implement Python modules in `C` (like `numpy`, `scipy`, `tensorflow`, ... do), and such *C extensions* can "*release the GIL*". We will discuss such *C extensions* later.
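
You can observe the effect of the GIL with a small experiment. The sketch below runs the same CPU bound pure Python function twice in a row and then in two threads of one interpreter. The function name and the loop count are arbitrary choices. On a standard CPython build with the GIL you would expect both timings to be similar instead of the threaded version being twice as fast; the exact numbers depend on your machine and Python version, so none are shown here.

```python
import threading
import time


def count_down(n):
    """Pure Python, CPU bound busy loop."""
    while n > 0:
        n -= 1


N = 2_000_000

# run the work twice, one call after the other
started = time.perf_counter()
count_down(N)
count_down(N)
serial_seconds = time.perf_counter() - started

# run the same work in two threads of a single interpreter
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
started = time.perf_counter()
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
threaded_seconds = time.perf_counter() - started

print(f"serial:   {serial_seconds:.2f} s")
print(f"threaded: {threaded_seconds:.2f} s")
```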



<div class="alert alert-block alert-info">
<i class="fa fa-info-circle"></i>&nbsp;Why the GIL:
    <br/>
<ul>
<li> When Python was invented multi-core CPUs were not available, so nobody could consider the long-term effects of the GIL. </li>

<li> Using multiple cores within a single program requires so-called <i>locks</i> to avoid so-called <i>race conditions</i> (see e.g. <a href="https://stackoverflow.com/questions/34510/what-is-a-race-condition">this stackoverflow post</a>).</li>
<li> The GIL implements one such <i>global lock</i> instead of multiple fine grained locks, with the benefit that a global lock has less impact on the performance of non-threaded programs.</li>
<li> The GIL also makes it easier to use <i>C</i> code from Python, which is one of the major success factors of Python.</li>
<li> Implementing a GIL is also much easier than implementing fine grained locks. The GIL is deep in the implementation of CPython and several attempts to remove it failed in the past.</li>
</ul>


</div>



# The Python eco-system for science

Python has become very popular in science in recent years, and its success is based on a stack of libraries. We will discuss these below, and you will see that the previously discussed limitations of Python regarding speed don't apply to them.

This [2020 article from Nature](https://doi.org/10.1038/s41586-020-2649-2) includes a nice overview:


<div style="font-size: 0.8em;">
    <figure>
<img src="images/python_science.png" width=55% />
    </figure>
    <figcaption>
Source: https://doi.org/10.1038/s41586-020-2649-2
    </figcaption>
</div>

## A few selected libraries

[numpy](https://numpy.org) implements the fundamental data structures from linear algebra, such as vectors, matrices and higher order structures (tensors) (also called  `numpy` **arrays**). Beyond that `numpy` offers  some basic operations on such arrays.

[scipy](https://www.scipy.org) covers many fundamental algorithms beyond the implementations from `numpy`, e.g.
- mathematical optimization: linear optimization, function-fitting, ...
- adaptive numerical integration
- numerical solution of ordinary differential equations
- interpolation 

This [2020 Nature Methods article](https://doi.org/10.1038/s41592-019-0686-2) gives a more detailed introduction to `scipy` and also includes some facts about the overall project and its history.


[pandas](https://pandas.pydata.org) offers a tabular data structure called `DataFrame` and algorithms operating on it. The `pandas` `DataFrame` was inspired by the `data.frame` known from `R`.

[scikit-learn](https://scikit-learn.org) offers routines from machine learning. To be more specific, it offers routines from so-called *classical machine learning* and thus does not cover deep neural networks.

At the time of writing this script, both [TensorFlow](https://en.wikipedia.org/wiki/TensorFlow) and [PyTorch](https://en.wikipedia.org/wiki/PyTorch) are the "big players" in the field of deep neural networks.

## About their speed

Most `numpy` routines are implemented as a thin layer on top of the libraries [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and [LAPACK](https://en.wikipedia.org/wiki/LAPACK). Development of both started in the last decades of the 20th century, and they are known to be robust and fast. The original implementations were in `Fortran`, whereas today CPU-specific optimized implementations are developed in `C`.

Similar to `numpy`, most routines in `scipy` are implemented in `C` or `Fortran` and depend on established libraries such as [MINPACK](https://en.wikipedia.org/wiki/MINPACK), [QUADPACK](https://en.wikipedia.org/wiki/QUADPACK) or [NETLIB](https://en.wikipedia.org/wiki/Netlib).
These are known to be fast but also reliable and robust.

It will not surprise you that `scikit-learn` also implements many algorithms at least partially in `C` and uses `numpy` and `scipy` in many places. Beyond that many routines in `scikit-learn` can benefit from multiple cores.

Both `TensorFlow` and `PyTorch` implement their core in `C` or `C++`, and the `Python` interface was added to make this core more approachable. Both also benefit from the speed offered by specialized hardware such as `GPU`s. On a `GPU`, training of a deep neural network can run several hundred times faster than on a `CPU`!
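
The effect of moving a loop from Python byte code into compiled `C` code can be observed even without these libraries. The following sketch is only an analogy: it compares an explicit Python loop with the built-in `sum`, which CPython implements in `C`. The libraries above apply the same idea to whole arrays. The function name is made up, and timings depend on your machine, so none are shown here.

```python
import timeit

numbers = list(range(1_000_000))


def python_loop_sum(values):
    """Add up the values with an explicit loop executed as Python byte code."""
    total = 0
    for value in values:
        total += value
    return total


# both variants compute the same result ...
assert python_loop_sum(numbers) == sum(numbers)

# ... but the loop inside the built-in `sum` runs in compiled code
loop_seconds = timeit.timeit(lambda: python_loop_sum(numbers), number=5)
builtin_seconds = timeit.timeit(lambda: sum(numbers), number=5)
print(f"Python loop:  {loop_seconds:.3f} s")
print(f"built-in sum: {builtin_seconds:.3f} s")
```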


# A few words about computer architecture

Understanding the hardware architecture of a computer is important when you want to understand why some operations are fast or slow.

The explanations below are meant to give programmers a basic understanding. They are simplified and not 100% accurate, and some details depend on the CPU model used.

<center>

<div style="font-size: 0.8em;">
<img src="images/architecture.png" width="60%">
    <br/>
    <center>
(C) 2021 ETH Zurich
    </center>
</div>
</center>




## Cores

Each core consists of 

- A set of *registers*: modern CPUs have 16 of them. A register can be seen as a fixed variable implemented in hardware. Each can store a 64 bit integer number. Examples:
  - the *accumulator*  (named `EAX` or `RAX`) is used for computations.
  - the *instruction pointer* `EIP` which holds the memory address of the next instruction to be executed.
  - ...
- Arithmetic unit: this is where a CPU adds, multiplies, ... numbers.  
- Control logic: fetches next instruction from memory, runs it and determines next instruction. Plus general control of the work of the core.
- Level 1 cache: more about this in the next paragraph.

## Memory organization

<center>

<div style="font-size: 0.8em;">
<img src="images/architecture_memory.png" width="600px">
    <br/>
    <center>
(C) 2021 ETH Zurich
    </center>
</div>
</center>

The main memory (**RAM** = random access memory) holds the majority of data. Data also includes machine code. This memory is non-persistent which means that its content will be lost when you switch off your computer.

Since access to the main memory is relatively slow, modern CPUs offer a *cache hierarchy* of intermediate memories of different sizes and speed. A **cache** is a structure to *memorize* data:

- When we write data `X` to memory location `Y` this is also stored in the cache. When the cache is full we discard older data.
- When we fetch data from location `Y` we check the cache first. If data is present in the cache we fetch it from there.
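
The two rules above can be sketched as a toy model in Python. This is a strongly simplified illustration and not how a hardware cache works in detail (real caches operate on blocks of memory and use more refined strategies to decide what to discard). The class `ToyCache` and all its names are made up for this example; a dictionary plays the role of the slow main memory.

```python
from collections import OrderedDict


class ToyCache:
    """Very simplified cache in front of a slow 'main memory' (a dict here)."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory
        self.capacity = capacity
        self.entries = OrderedDict()   # remembers the order of insertion
        self.hits = 0
        self.misses = 0

    def _remember(self, location, data):
        self.entries[location] = data
        self.entries.move_to_end(location)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # cache is full: discard oldest entry

    def write(self, location, data):
        self.main_memory[location] = data      # data goes to main memory ...
        self._remember(location, data)         # ... and is also stored in the cache

    def read(self, location):
        if location in self.entries:           # check the cache first
            self.hits += 1
            return self.entries[location]
        self.misses += 1
        data = self.main_memory[location]      # slow path: go to main memory
        self._remember(location, data)
        return data


memory = {address: address * 10 for address in range(100)}
cache = ToyCache(memory, capacity=2)

cache.read(1)      # miss: not in the cache yet
cache.read(1)      # hit
cache.read(2)      # miss
cache.read(3)      # miss, the cache is full and location 1 is discarded
cache.read(1)      # miss again
print(cache.hits, cache.misses)
```

The last read shows why the order of memory accesses matters: data which was discarded in the meantime has to be fetched from the slow main memory again.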

About the different levels:

- Level 1 cache is private for each core, typical size is 64 kB.
- Level 2 cache is either shared between cores or private per core (depends on the architecture), typical size is 256 kB.
- Level 3 cache is 2 to 8 MB.

In low level languages a programmer is much more in control how to use memory and in which order to access memory to optimize cache usage and to improve speed.

**This is difficult in Python: the interpreter takes over control and Python data structures are large, thus don't efficiently use caches.**

Another issue with memory access is that **cores cannot access main memory in parallel**. Thus when two cores want to read or write at the same time, one of them has to wait for the other to finish!

## Input/Output

<table>
<tr>

<td style="font-size:120%; vertical-align:top;width:50%;">    
            
A computer also needs hardware for input and output of data

- A hard disk to persist data when a computer is switched off and to store more data than the main memory can handle.<br/><br/>
- A network card to communicate with other computers.<br/><br/>
- Mouse and keyboard for user input.<br/><br/>
- A graphics card to show data on the screen.<br/><br/>

**Special note about graphic cards**

Modern graphic cards have a specialized computing unit called **GPU**. This is a computing unit which historically was optimized for geometrical computations related to 2D and 3D objects in computer games.

Lately these GPUs evolved into more general (yet still specialized) computing units which can be used to speed up mathematical computations, e.g. with applications in training deep neural networks. More about this in the last section of this workshop!

</td>
<td style="width: 10%">
    </td>



<td style="width: 15%">
    
<img src="images/architecture_io.png" height=500>
    <br/>
    <center>
(C) 2021 ETH Zurich
    </center>
</td>
<td style="width: 25%">
    </td>
    
    
    </tr></table>




# Some timings

These numbers are from 2012 and demonstrate the orders of magnitude for different operations.

Source: https://gist.github.com/jboner/2841832

```
                                             Real time             
L1 cache reference ......................... 0.5 ns                
Execute typical instruction ................   1 ns                
L2 cache reference ........................... 7 ns                
Main memory reference ...................... 100 ns                
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs      
SSD random read ........................ 150,000 ns  = 150 µs      
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs      
Read 1 MB sequentially from SSD  ..... 1,000,000 ns  =   1 ms      
Send packet CA->Netherlands->CA .... 150,000,000 ns  = 150 ms      
```


Comments:
- light travels ~30 cm in one `ns`.
- "reference" can be read as "initiate access". So reading from memory includes "reference" plus some data transfer.
- a "typical instruction" is something like "add two numbers" or "compare two numbers".
- an "internet packet" usually has a size up to 1500 bytes.

You can see that operations which happen directly on the CPU are much faster than communication with memory, disk, the user or the network. 

There is also a positive correlation between *closeness to the core* and *execution speed*.


## Imagine a "human computer"

Since units like nano- and microseconds are far from the time spans we experience in our daily lives, we can make these relations more accessible by imagining a "human computer" which processes numbers:

We assume that a human can **add up two numbers per second** in case the data is in a register or in the L1 cache. Some of the previous timings now scale as follows:

```
                                             On human scale
Execute typical instruction .......          1.0 s
L2 cache reference ................          7.0 s
Main memory reference .............         ~1.7 minutes
SSD random read ...................          1.7 days
Send packet CA->Netherlands->CA ...          4.8 years
```

- If you have to fetch numbers from L2 cache you have to slow down by a **factor of 7**.
- If you need data from the main memory you can grab a **cup of coffee**.
- If you need 1 number from your SSD drive you **have to wait almost 2 days**.
- In case you need some data from the other end of the world over the internet you can **finish your PhD** and work on other projects before you continue to add up numbers.
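
The scaling itself is simple arithmetic: one nanosecond becomes one second. The following sketch recomputes the human scale values from the 2012 timings listed earlier. The helper name `human_scale` is made up, and a year is taken as 365 days.

```python
timings_ns = {
    "Execute typical instruction": 1,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "SSD random read": 150_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}


def human_scale(nanoseconds):
    """Scale 1 ns to 1 s and return a readable duration."""
    seconds = float(nanoseconds)
    for unit, length_in_seconds in (
        ("years", 365 * 24 * 3600),
        ("days", 24 * 3600),
        ("minutes", 60),
    ):
        if seconds >= length_in_seconds:
            return f"{seconds / length_in_seconds:.1f} {unit}"
    return f"{seconds:.1f} s"


for operation, nanoseconds in timings_ns.items():
    print(f"{operation:<35} {human_scale(nanoseconds)}")
```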

<a href="https://imgflip.com/i/4tg8oj"><img src="./images/4tg8oj.jpg" title="made at imgflip.com" width=25%
                                       /></a><div><a href="https://imgflip.com/memegenerator">from Imgflip Meme Generator</a></div>

# About swapping

Operating systems implement a technique named **paging** which provides programs with more *working memory* than the *main memory* can offer.

The idea is to write unused data to the disk to a so called **swap space** and to read it back when required. The programmer has the impression of having more memory available than hardware offers.


The reading and writing of such memory chunks from/to disk is called **swapping**. Since you now know that accessing a disk is much slower than accessing memory, it should be clear that swapping can slow down computations a lot.


# Check questions  <i class="fa fa-check-circle" aria-hidden="true" style="background:#80ff80; margin: 10px; padding: 20px;"></i> 

1. What is a register?
2. What is a cache?
3. What do some modern graphic cards offer?
4. How much slower are reading from main memory and reading from an SSD compared to an instruction (like adding two integers)?
5. What are paging and swapping? How does swapping affect speed?


# CPU and I/O bound computations

The term **I/O** is an abbreviation for **Input / Output** and refers to all operations related to user communication and transferring data to/from the disk or a network.

I/O is different to on-CPU operations:

1. Most operations run on dedicated hardware (e.g. network controller) in **parallel to CPU** tasks
2. I/O is much slower than on-CPU operations.

We talk about
- **CPU bound** operations, when the overall runtime of a program is dominated by operations on the CPU
- **I/O bound** operations, when the overall runtime of a program is dominated by operations involving I/O.

Examples:
- CPU bound: compress an image, train a neural network on a data set which fits into memory, align DNA sequences, ...
- I/O bound: download data from a web server, search through a document collection on disk, ...
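
One way to tell both kinds apart is to compare the elapsed "wall clock" time with the time the CPU actually spent on your program. The following sketch does this with two standard library timers. The task functions are made-up stand-ins: a sum of squares for CPU bound work, and `time.sleep` for waiting on a disk or the network.

```python
import time


def measure(task):
    """Return (wall clock seconds, CPU seconds) needed to run `task`."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def cpu_bound_task():
    sum(i * i for i in range(1_000_000))


def io_bound_task():
    time.sleep(0.2)   # stands in for waiting for a disk or the network


for task in (cpu_bound_task, io_bound_task):
    wall_seconds, cpu_seconds = measure(task)
    print(f"{task.__name__:<15} wall: {wall_seconds:.2f} s  cpu: {cpu_seconds:.2f} s")
```

For the CPU bound task both numbers should be close to each other, whereas for the simulated I/O bound task the CPU time should be only a small fraction of the wall clock time.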


Computing problems often involve both, especially when data sets and storage requirements are so big that they don't fit into memory.

Examples:
- weather forecasts run distributed on multiple computers and have to shuffle intermediate results between them,
- training a deep neural network on a huge image collection might also require reading images over and over again.


**Why does this distinction matter?**

CPU bound problems and I/O bound problems require different strategies to make them faster:

1. For a CPU bound task you can
    - try to buy a faster CPU, 
    - reduce the number of operations (aka "better implementation" or "use a better algorithm") 
    - or try to split your computations into independent tasks which can run in parallel on different cores or machines.

2. For an I/O bound task you can
    - buy a faster disk or get a faster internet connection (if the remote endpoint is fast enough)
    - run I/O operations concurrently 
    

I/O bound computations can be **sped up without parallelization**. 


To understand this we start with an analogy:

You volunteered to bake Christmas cookies for your whole department. Because of the amount **you order the same ingredients
from two suppliers**. 

Which strategy would you prefer:

1. Sequential approach:
   - order from supplier 1
   - wait until order 1 arrives at post office 
   - fetch order 1 
   - bake cookies from order 1
   - order from supplier 2
   - wait until order 2 arrives at post office
   - fetch order 2 
   - bake cookies from order 2
   
2. Concurrent approach
   - order from supplier 1
   - order from supplier 2
   - wait until orders 1 and 2 arrive at post office 
   - fetch order 1 
   - fetch order 2
   - bake cookies from order 1
   - bake cookies from order 2

<div class="alert alert-block alert-info">
    <p style="font-weight: bold; font-size:120%;"><i class="fa fa-question-circle"></i>&nbsp; Question to the audience</p>

Which approach is faster? Can you explain why?

</div>

The second approach is faster, because processing your orders and delivery happen in parallel!

So 
- for the sequential approach the overall time spent waiting is **the sum** of both waiting times, 
- for the concurrent approach it will be **the maximum** of both times.
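
The "sum vs maximum" rule can be checked with a small simulation. In the sketch below `time.sleep` stands in for waiting for a delivery, and the waiting times are made up. Threads from the standard library are used here as one possible way to wait for both orders at the same time; this choice is an assumption of this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELIVERY_SECONDS = {"supplier 1": 0.2, "supplier 2": 0.3}   # made-up waiting times


def order_and_wait(supplier):
    time.sleep(DELIVERY_SECONDS[supplier])   # waiting, no CPU work involved
    return f"ingredients from {supplier}"


# sequential approach: the waiting times add up
started = time.perf_counter()
sequential_orders = [order_and_wait(supplier) for supplier in DELIVERY_SECONDS]
sequential_seconds = time.perf_counter() - started

# concurrent approach: both orders are waiting at the same time
started = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as executor:
    concurrent_orders = list(executor.map(order_and_wait, DELIVERY_SECONDS))
concurrent_seconds = time.perf_counter() - started

print(f"sequential: {sequential_seconds:.2f} s")   # roughly 0.2 + 0.3
print(f"concurrent: {concurrent_seconds:.2f} s")   # roughly max(0.2, 0.3)
```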

Let's visualize this with a more programming related example: **fetching and processing data from the internet.**

In the following diagram:

- `C1` and `C2` are placeholders for "connecting to the remote web server".
- The green boxes represent "ask the network card if data arrived" (also called *polling*).
- The white parts are spare waiting time while the web server streams data to your computer. 
- `P1` and `P2` mean "process the delivered data".


<center>

<div style="font-size: 0.8em;">
<img src="images/concurrent_io.png" width=75% />
    <br/>
(C) 2021 ETH Zurich
</div>
</center>

<br/> 

The dimensions in the graphic do not reflect the reality: on a computer the white boxes will be *much longer* and the green boxes will be *much shorter* than the rest.

1. Similar to the Christmas cookies example you can see that the concurrent approach finishes earlier.
2. You can also see that running the solid green and dashed green boxes in parallel will not result in a significantly reduced run time. 
3. **An I/O bound program spends most of its time WAITING**!

During the waiting time your program "sleeps", which allows your operating system to schedule other tasks. Since the white area is reduced in the concurrent approach, you also use your computer's resources more efficiently.

When you generalize this example to fetching data from many web servers concurrently, the white area shrinks further and almost all of the run time is spent doing actual work!
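
The following sketch simulates this situation with `asyncio` from the standard library. It does not access the network: `asyncio.sleep` stands in for the waiting time, and the function names, the number of servers and the delays are assumptions made for this illustration.

```python
import asyncio
import time


async def fetch(server_id, delay):
    # asyncio.sleep stands in for waiting on a remote web server;
    # no real network access happens here.
    await asyncio.sleep(delay)
    return f"data from server {server_id}"


async def fetch_all(delays):
    # Start all simulated downloads and wait until every one has finished.
    tasks = [fetch(i, delay) for i, delay in enumerate(delays)]
    return await asyncio.gather(*tasks)


delays = [0.2] * 10  # ten hypothetical servers, 0.2 s waiting time each

started = time.perf_counter()
results = asyncio.run(fetch_all(delays))
elapsed = time.perf_counter() - started

print(f"fetched {len(results)} results in {elapsed:.2f} s")
print(f"sequential waiting would take about {sum(delays):.2f} s")
```

Everything here runs in a single thread, so the speed-up comes from overlapping the waiting times and not from parallel execution. Inside a Jupyter notebook an event loop is usually already running, so there you would likely write `results = await fetch_all(delays)` instead of calling `asyncio.run`.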

Another consequence is that **the GIL does not affect I/O bound computations**!
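
A minimal sketch of why this holds: CPython releases the GIL while a thread is blocked in a call such as `time.sleep` or a socket read, so several threads can wait at the same time. Here `time.sleep` stands in for a blocking I/O call, and the function name and delays are assumptions for illustration.

```python
import threading
import time


def wait_for_data(delay):
    # time.sleep stands in for a blocking I/O call such as reading from a socket.
    time.sleep(delay)


delays = [0.2, 0.2, 0.2, 0.2]  # hypothetical waiting times in seconds

started = time.perf_counter()
threads = [threading.Thread(target=wait_for_data, args=(d,)) for d in delays]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
threaded_elapsed = time.perf_counter() - started

print(f"four threads finished after {threaded_elapsed:.2f} s")
print(f"sum of the individual waiting times: {sum(delays):.2f} s")
```

On a typical machine the measured time should be close to the longest single delay and not to the sum of all delays.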

# Should I optimize at all?

This headline is supposed to be a bit provocative. If the answer were "**no!**", we could close this workshop now.

Nevertheless there are some **pragmatic** aspects which should be considered!

## The three laws of informatics

(source: https://blog.usejournal.com/rust-and-the-three-laws-of-informatics-4324062b322b)

1. Programs must be correct.
2. Programs must be maintainable, except where it would conflict with the First Law.
3. Programs must be efficient, except where it would conflict with the First or Second Law.

You will see during this workshop that many optimizations can require "tricks" or solutions which might not be as easy to understand as your initial approach. 

Running code on a high performance computer (HPC) also requires modifications which add complexity and will be specific to the HPC setup. This will make your program hard to reuse in another setting.

## But Donald Knuth said ...

[Donald Knuth](https://en.wikipedia.org/wiki/Donald_Knuth) is a famous computer scientist who, among many other contributions, developed the [TeX](https://en.wikipedia.org/wiki/TeX) computer typesetting system. He is also famous for [offering money for reported errors in his publications and software](https://en.wikipedia.org/wiki/Knuth_reward_check).


<center>
<a href="https://imgflip.com/i/4t7moh"><img src="./images/4t7moh.jpg" title="made at imgflip.com" width=20%/></a><div><a href="https://imgflip.com/memegenerator">from Imgflip Meme Generator</a></div>
</center>

This is a famous quote among programmers and computer scientists, and it is quite often misunderstood. To better understand it we have to read the full quote:



In his paper [Structured Programming with *go to* Statements](https://doi.org/10.1145/356635.356640), Donald Knuth wrote:

> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

This is often misunderstood as "optimization is bad". But it actually means that you can waste a lot of productive time by optimizing the wrong part of your code, and that optimizations complicate code and thus make it less maintainable and harder to debug.

We will learn in Section 2 how to spot the crucial 3% of your code!
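
One common way to locate such hot spots is a profiler. The following is a minimal sketch using `cProfile` and `pstats` from the standard library; the functions `cheap_part`, `expensive_part` and `main` are hypothetical, and this is not necessarily the approach taken later in the workshop.

```python
import cProfile
import io
import pstats


def cheap_part(n):
    # Runs in time proportional to n.
    return sum(range(n))


def expensive_part(n):
    # Deliberately wasteful: a nested loop with n * n iterations.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def main():
    return cheap_part(300), expensive_part(300)


profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Collect the report in a string and show the five most expensive entries,
# sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

In the printed report, `expensive_part` should account for almost all of the run time of `main`, which marks it as the part worth optimizing.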

## The Pareto principle

> 80% of the work is done in 20% of the time, 20% of the work needs 80% of the time.

Please don't argue about whether it is 80 to 20, 90 to 10 or something similar.

The message is: **a small part of your work will take a significant part of your available time!** E.g.


<table>
<tr>
    <td style="width:50%; padding:2em; background: #e0e0e0; border: 1px solid black;">Working implementation<br/>50%</td>
    <td style=" background: #e0e0e0; border: 1px solid black;">Optimization<br/>50%</td>
</tr>
</table>

So optimization can take a significant amount of time and you should consider whether it is worth the effort.

And remember: The crucial question is not whether your code **is slow** but whether it is **too slow**!



## Micro vs Macro optimization

Micro optimization is about saving a few percent of run time, whereas macro optimization aims to reduce run time significantly (for example from 1 hour to 5 minutes).

**Why and when to micro optimize**

1. Macro optimization is not possible.
2. A few percent here and there add up.
3. Micro optimization in large projects still reduces costs (think of 5% less power usage in a huge computing center).

This is why, for example, the Python developers apply many micro optimizations in different places to make CPython steadily faster.
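
To give a feeling for what a micro optimization looks like in your own code, here is a small sketch that times two equivalent ways of building a list with `timeit`. The function names and sizes are assumptions for illustration, and the measured difference depends on your machine and Python version.

```python
import timeit


def squares_loop(n):
    # Build the list with an explicit loop and repeated append calls.
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    # Same result with a list comprehension.
    return [i * i for i in range(n)]


n = 10_000
loop_time = timeit.timeit(lambda: squares_loop(n), number=200)
comprehension_time = timeit.timeit(lambda: squares_comprehension(n), number=200)

print(f"loop:          {loop_time:.3f} s")
print(f"comprehension: {comprehension_time:.3f} s")
```

The list comprehension is often somewhat faster, but the gain is modest and only matters if this code runs very frequently.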

# Check questions  <i class="fa fa-check-circle" aria-hidden="true" style="background:#80ff80; margin: 10px; padding: 20px;"></i> 

1. What is the difference between I/O bound and CPU bound computations?
2. Why don't you need parallel execution to speed up I/O bound computations?
3. What are the three laws of informatics?
4. What are common drawbacks of optimized code?