# 3 quick tips every Pythonista should know to speed up code on the fly

Tired of drinking 10 coffees while you wait for your code to finish?
In this post I give an overview of three easy Python performance tips that I wish I had known earlier, to reduce the C++ bullying a bit.

SPOILER ALERT: In my next post I will show how to apply these tips in the context of a Multi-Objective Genetic Algorithm.
So if you:
- have already written your first Genetic Algorithm (GA) and want to advance to more effective Multi-Objective GAs (MOGA),
- or understand the theory behind them and are wondering how to build your first one in an effective way,

stay tuned for my next post.

So shall we?

<img src="speed.jpg" />

## TIP 1 Local Parallelization using ProcessPoolExecutor 

ProcessPoolExecutor comes with the standard [library](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor) and offers user-friendly multiprocessing for your code, provided you have more than one core. Simply put, tasks that do not share memory can run in parallel, one on each core.

Below is an example of how easy it is to benefit from ProcessPoolExecutor. Please run it as a script rather than in Jupyter, where it raises a BrokenProcessPool error (the worker processes must be able to import the `__main__` module, which, according to the documentation, rules out interactive interpreters).
The function checks whether each number is prime (returns True) or not (returns False).

In [15]:
"""
1) Function processpool adapted from official python documentation:
https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor

Adapted by: DTKx
Github: https://github.com/DTKx
Date: 14/03/2021
"""
import concurrent.futures
import math
from time import perf_counter

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419,
    23515413413121213,
    2135438421313555,
    255131313111384,
    251251351351153,
    5151215132251221,
    5351313131121531,
    35131351513155351,
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419,
    23515413413121213,
    2135438421313555,
    255131313111384,
    251251351351153,
    5151215132251221,
    5351313131121531,
    35131351513155351,
    115797848077099,
    1099726899285419,
    23515413413121213,
    2135438421313555,
    255131313111384,
    251251351351153,
    5151215132251221,
    5351313131121531,
    35131351513155351,
]


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


def run_base():
    t0 = perf_counter()
    for number, prime in zip(PRIMES, map(is_prime, PRIMES)):
        # print("%d is prime: %s" % (number, prime))
        print("", end="")  # Prints nothing; a placeholder that keeps the console output clean
    t1 = perf_counter()
    print("Time Base Case", t1 - t0)


def run_process_pool():
    t0 = perf_counter()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print("", end="")
    t1 = perf_counter()
    print("Time with ProcessPoolExecutor", t1 - t0)


if __name__ == "__main__":
    run_base()
    run_process_pool()

Output when run in Jupyter:

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

On my machine, even with this small example, there was a difference of around 4 seconds.

1. Time Base Case 14.7976806
1. Time with ProcessPoolExecutor 10.7335183

Note, however, that with ProcessPoolExecutor only picklable objects can be executed and returned (please take a look at the [docs](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor)).
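
To make the pickling restriction concrete, here is a small sketch that uses only the standard `pickle` module. The names `square` and `can_pickle` are made up for this illustration. The point is that a function defined at module level can be sent to a worker process, while a `lambda` generally cannot.

```python
import pickle


def square(x):
    """Module-level function: pickled by reference to its name."""
    return x * x


def can_pickle(obj):
    """Hypothetical helper: True if `obj` survives pickle.dumps."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("square picklable:", can_pickle(square))
    print("lambda picklable:", can_pickle(lambda x: x * x))
```

If a function you want to pass to `executor.map` fails this check, moving it to module level is usually the first thing to try.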

If you want to learn more about local concurrency using the standard Python library, I recommend this amazing [tutorial](https://www.youtube.com/watch?v=fKl2JW_qrso) on multiprocessing.

## TIP 2 Compiling Python code with numba

My second tip is to use the [numba library](https://numba.pydata.org/), which compiles Python code to machine code on the fly. Yes, that's correct: we can save a loooooot of time here. We just need to understand a few things first.

Numba Summary

1. Numba works especially well with numpy and multiple loops.
1. You probably won't be able to add jit decorators to all of your code.
1. Try numba first on small, slow functions that use numpy and loops (spoiler: Tip 3 shows how to prioritize).
1. Check the supported [python](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html) and [numpy](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html) features.
1. Do not mix data types: numba will not convert between them automatically as Python does, for instance between int64 and int32.

Now, as always, let's test.
First let's [install](https://numba.readthedocs.io/en/stable/user/5minguide.html) the package:

In [None]:
# In your conda terminal
conda install numba  # or: pip install numba

Next, let's test our previous example using numba.

First, copy the is_prime function under a different name and add the numba decorator @jit(nopython=True, fastmath=True) on top.

In [None]:
from numba import jit


@jit(nopython=True, fastmath=True)
def is_prime_numba(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

That's it!
Now you just need to add the call to the numba function.

In [None]:
def run_numba():
    t0 = perf_counter()
    for number, prime in zip(PRIMES, map(is_prime_numba, PRIMES)):
        print("", end="")
    t1 = perf_counter()
    print("Time Numba", t1 - t0)


if __name__ == "__main__":
    run_base()
    run_process_pool()
    print("Time with Numba compiling")
    run_numba()
    print("Time with Numba after compiling")
    run_numba()

Now to the results:

1. Time Base Case 14.205312000000001
1. Time with ProcessPoolExecutor 9.930437499999998
1. Numba time with compiling
1. Time Numba 1.3300917000000005
1. Numba time after compiling
1. Time Numba 0.9918250000000022

Indeed my friends, compiled code is marvelous. I still love Python though, especially now with numba, given how straightforward it is to add numba decorators to existing code.

Note that the first time we run the numba code it takes longer, because it is being compiled; on the next run the code is even faster.
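
The compile-then-run behaviour can be measured directly by timing two consecutive calls. The sketch below is illustrative only: `count_divisors` and `timed` are names invented here, and the `try/except` fallback to a do-nothing decorator is an assumption added so the script still runs where numba is not installed (in that case both timings will be similar).

```python
from time import perf_counter

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for this sketch: fall back to plain Python if numba is missing.
    def njit(func):
        return func


@njit
def count_divisors(n):
    """Counts the divisors of n with a plain loop (easy for numba to compile)."""
    count = 0
    for i in range(1, n + 1):
        if n % i == 0:
            count += 1
    return count


def timed(func, *args):
    """Returns (result, elapsed seconds) for a single call."""
    t0 = perf_counter()
    result = func(*args)
    return result, perf_counter() - t0


if __name__ == "__main__":
    # First call: includes compilation when numba is available.
    result, first = timed(count_divisors, 1_000_000)
    # Second call with the same argument type: reuses the compiled code.
    _, second = timed(count_divisors, 1_000_000)
    print("divisors:", result, "first call:", first, "second call:", second)
```

When benchmarking numba code, it is therefore worth calling the function once before starting the timer.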

### Now if ProcessPoolExecutor leverages local parallelization and numba leverages compiled code, what if we mix them together?

Simple: all we have to do is copy our run_process_pool function again and change the function being called.

In [None]:
def run_process_pool_numba():
    t0 = perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:  # max_workers = maximum number of worker processes
        for number, prime in zip(PRIMES, executor.map(is_prime_numba, PRIMES)):
            print("", end="")
    t1 = perf_counter()
    print("Time with ProcessPoolExecutor with numba", t1 - t0)

Results:

1. Time Base Case 14.7415336
1. Time with ProcessPoolExecutor 10.2182718
1. Numba time with compiling
1. Time Numba 1.4119825000000006
1. Numba time after compiling
1. Time Numba 0.9829408000000015
1. Time with ProcessPoolExecutor with numba 2.9859480000000005

Well, that is disappointing: using ProcessPoolExecutor together with numba is slower than numba alone, but it is understandable. Our problem is too small to overcome the overhead of the process pool, and my machine only has 2 cores, so the result on your machine may be quite different.

So let's test this by increasing the number of values to check. We will add 100000 more numbers, using the function below:

In [None]:
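import random


def add_random_numbers(numbers):
    # `**` is the power operator. The original call used `10 ^ 15, 10 ^ 17`, but `^` is
    # bitwise XOR (10 ^ 15 == 5, 10 ^ 17 == 27); the timings below were presumably measured with that version.
    for _ in range(numbers):
        PRIMES.append(random.randint(10**15, 10**17))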

In [None]:
Time Base Case 15.024390200000001
Time with ProcessPoolExecutor 10.4288567
Numba time with compiling
Time Numba 1.4126227
Numba time after compiling
Time Numba 1.2121131999999974
Time with ProcessPoolExecutor with numba 2.2224521000000017
Add more 100000 numbers
100035
100035
Time Base Case 20.258951299999996
Time with ProcessPoolExecutor 122.43605180000002
Time Numba 1.3219542999999874
Time with ProcessPoolExecutor with numba 227.59259859999997

Wow, that is a lot of overhead with a larger list of numbers. It is probably related to the cost of a single is_prime call versus the per-task overhead of ProcessPoolExecutor (the base case needed only about 5 more seconds for 100000 more numbers, so each call is very cheap), or to the code sharing resources and requiring locks in this example. For more expensive functions the pool might still be worthwhile, which leads us to the final tip.
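
Before moving on, one knob worth trying in this situation is the `chunksize` argument of `executor.map`, which sends items to the workers in batches rather than one by one (the default is `chunksize=1`). The sketch below is not the benchmark used in this post: `is_small_prime`, `count_primes`, the input size and the chunk size of 500 are all choices made for illustration, and the effect on your machine may differ.

```python
import concurrent.futures
from time import perf_counter


def is_small_prime(n):
    """Cheap primality test, so the per-task overhead dominates."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(numbers, chunksize):
    """Counts primes in `numbers` using a process pool with the given chunksize."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        return sum(executor.map(is_small_prime, numbers, chunksize=chunksize))


if __name__ == "__main__":
    numbers = list(range(5_000))
    for chunksize in (1, 500):
        t0 = perf_counter()
        total = count_primes(numbers, chunksize)
        print(f"chunksize={chunksize}: {total} primes in {perf_counter() - t0:.2f}s")
```

As with the examples above, run this as a script rather than in a notebook.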

## TIP 3 Benchmark your code effectively with cProfile

As you saw, when it comes to performance, the same tool can lead to very different results from one case to another. So one of the key aspects of performance work is knowing how to benchmark your code effectively and understanding where the highest costs are, so that you know where to focus your precious efforts.
For this we have [cProfile](https://docs.python.org/3/library/profile.html#profile.run) from the standard Python library. Let's see it in practice by profiling our run_process_pool_numba and run_numba with 100035 numbers:

In [None]:
import cProfile
import io
import os
from pstats import Stats


def run_cprofile():
    """Runs the code under cProfile and exports the report."""
    numbers_to_add = 100000
    add_random_numbers(numbers_to_add)  # Adding more numbers

    pr = cProfile.Profile()
    # runctx enables and disables the profiler around the statement it runs
    pr.runctx("run_process_pool_numba()", globals(), locals())  # Add the function name here

    s = io.StringIO()
    ps = Stats(pr, stream=s).sort_stats("tottime")  # Sort by internal time
    ps.print_stats()
    # export_to_txt is not defined in this post; it is assumed to be a helper from the full script
    export_to_txt(s, os.path.join(os.getcwd(), "cprofile.txt"))  # Exports file

- Profile run_process_pool_numba()

In [None]:
         4688653 function calls (4688598 primitive calls) in 193.398 seconds 

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   296094  174.535    0.001  174.535    0.001 {method 'acquire' of '_thread.lock' objects}
   100036    8.230    0.000    8.230    0.000 {built-in method _winapi.WriteFile}
   ...

- Profile run_numba()

In [None]:
         496119 function calls (480498 primitive calls) in 2.429 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100035    1.288    0.000    1.288    0.000 c:\Users\...\processpool_example.py:84(is_prime_numba)
        1    0.383    0.383    2.429    2.429 c:\Users\...\processpool_example.py:120(run_numba)
...

- With run_process_pool_numba, our code spends most of its time (more than 90%) waiting for a lock to be released; please check the official documentation on [Lock Objects](https://docs.python.org/3/library/threading.html). Since the profile covers the main process, this is most likely time spent waiting for the workers rather than computing.

This indicates that we can either test other structures or confidently select our numba approach.
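
For a smaller starting point than the full script, here is a minimal sketch of the same profiling workflow. `slow_sum`, `work` and `profile_call` are hypothetical names for this illustration; only `cProfile.Profile`, `pstats.Stats` and `SortKey` come from the standard library.

```python
import cProfile
import io
from pstats import SortKey, Stats


def slow_sum(n):
    """Deliberately slow: sums squares with a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def work():
    """Function under test: calls slow_sum a few times."""
    return [slow_sum(200_000) for _ in range(5)]


def profile_call(func):
    """Runs func under cProfile and returns the report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stream = io.StringIO()
    # Sort by internal time (tottime), as in the listings above.
    Stats(profiler, stream=stream).sort_stats(SortKey.TIME).print_stats(5)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_call(work))
```

The function with the largest `tottime` is usually the first candidate for numba or for a different algorithm.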

I hope this overview gave you new insights on how to speed up your code on the fly, and that it saves you at least the 5 minutes of reading time :).

If you have any other tips, comments or improvements, or if this helped you, please comment below. The full code is available on my [Github]().

Next time we will use these tips in a Multi-Objective Genetic Algorithm, so if you are interested, please stay tuned.

Thank you, see ya!