Profiling Python code

This is a tutorial on how to get started with profiling Python code, by Christoph Deil.

This tutorial assumes that you have used a terminal, Python, ipython and Jupyter before. No experience with Python profiling is assumed; this tutorial will get you started and focus on the basics.

We will only cover timing and profiling CPU and memory use. Other kinds of profiling, and how to optimise, will not be covered.

Throughout the tutorial you will find short exercises marked with 👉. Usually the solution is given directly below. Please execute the examples and try things for yourself. Interrupt with questions at any time!

This is the first time I'm giving a tutorial on this topic. Please let me know if you have any suggestions to improve!



Please help me adjust the tutorial content and speed a bit:

  • How often do you profile Python code? (never, last year, all the time)?
  • Have you used psutil or psrecord?
  • Have you used time or %timeit?
  • Have you used cProfile or %prun?
  • Have you used line_profiler or %lprun?
  • Have you used memory_profiler or %memit or %mprun?
  • Have you used any other Python profiling tool?


Before we start, let's make sure we have everything set up.

👉 Check your setup!

You should have python (Python 3.5 or later), ipython, jupyter, psutil, psrecord, line_profiler, memory_profiler and snakeviz. Use these commands to check:

python --version
ipython --version
jupyter --version
python -c 'import psutil, psrecord, line_profiler, memory_profiler, snakeviz'

💡 If something is missing, you can install it with pip:

python -m pip install psutil psrecord line_profiler memory_profiler snakeviz

As part of this tutorial, we will go over the Profiling and Timing Code Jupyter notebook from the excellent Python Data Science Handbook by Jake VanderPlas. It's freely available online and is generally a great resource to learn from, so I wanted to introduce it.

👉 Get set up with the Python Data Science Handbook to execute the notebooks on your computer now.

Follow these steps:

  • Open a new terminal (because we'll run jupyter lab there and then it can't be used for anything else)
  • Change directory to where you have your repositories
  • Run these commands:
    conda activate school18
    git clone
    cd PythonDataScienceHandbook/notebooks
    jupyter lab Index.ipynb
  • Open the "Profiling and Timing Code" (01.07-Timing-and-Profiling.ipynb) notebook from Chapter 1.
  • Leave it open, but go back to this tutorial for now.

1. When to profile?

You've probably heard this before: "Premature optimization is the root of all evil".

Timing and profiling is very much related to optimisation: you only do it if your code is too slow or you run out of memory.

The general recommendation concerning profiling and optimisation:

  • Start by writing simple, clean, well-structured code.
  • Establish correctness via automated test cases.
  • Start using your code for your application.
  • Never profile or optimise!

Computers these days are fast and have a lot of memory. Your time is precious. For the vast majority of code you write, optimisation and profiling are simply not needed.

Python and its libraries are so high-level that you can write very advanced applications quickly. It's OK and advisable to refactor or completely rewrite using lessons learned from a first implementation, using better data structures, algorithms, or e.g. numba for the small performance-critical part.

Of course, today is one of the few days in your life where you need to do profiling (you joined this tutorial), so let's do it.

2. How to profile?

If you do need better performance, the general recommendation is to proceed systematically in these steps:

  • Write a "benchmark", a script that reflects a real use case where you want better performance.
  • Define the key performance numbers (often runtime or peak memory usage) that you care about.
  • Measure and write down current performance. Make sure your current code version is checked in to version control before starting the profiling and optimisation.
  • Time and profile to find the performance bottlenecks
  • Optimise only the parts where it matters.
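As a rough illustration of the first three steps, a benchmark script could look like the sketch below. The `analysis` function is a hypothetical stand-in for your real use case, and using `tracemalloc` from the standard library to get the peak memory of Python allocations is my assumption here, not something the tutorial prescribes.

```python
import time
import tracemalloc


def analysis(n=200_000):
    """Hypothetical stand-in for the real use case you want to speed up."""
    values = [i * 0.5 for i in range(n)]
    return sum(values)


def run_benchmark():
    """Run the workload once; return (result, runtime in s, peak memory in MB)."""
    tracemalloc.start()
    t_start = time.perf_counter()
    result = analysis()
    runtime = time.perf_counter() - t_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, runtime, peak_bytes / 1e6


result, runtime, peak_mb = run_benchmark()
print(f"result={result}, runtime={runtime:.3f} s, peak memory={peak_mb:.1f} MB")
```

Note that tracing memory allocations slows the code down, so for a real benchmark you would probably measure runtime and memory in separate runs. Write the numbers down before you start changing anything.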

The advice "measure first, using a real use case of your application" always holds. Python is a very dynamic language, so understanding even basic performance characteristics is hard, and the results are often surprising. E.g. attribute access and function calls are fast in most languages, but slow in Python. For real, complex applications it's impossible to tell where the time goes just from looking at the code.

3. What to profile?

In this section we will look a bit at the following components of your computer using Python examples:

  • CPU (usually multi-core)
  • memory
  • disk
  • network

In this tutorial we will mainly focus on CPU and memory. We will not cover I/O (disk and network) much, and not mention GPU or multi-CPU at all.

Note that your computer hardware and software are incredibly complex. It is quite common for performance bottlenecks and behaviour to be surprising and confusing. Also, results will differ, mostly depending on your hardware and operating system.

To see what your system is doing, the easiest way is to use your system monitor tool. I'm on Mac, where it's an app called Activity Monitor.

👉 Open up your system monitor tool.

💡 This tool is different for every operating system, so please use Google to find out how to do this.

Let's use as an example of a Python script that uses up a lot of CPU and memory.

👉 Run python and observe the process with your system monitor.

4. Measure CPU and memory usage

Running the Python script starts a process and within the process runs your code in a single thread.

So only one CPU core will be used by your Python script, unless you call into Python C extensions that can use multiple CPU cores.

We will not need this for the rest of this tutorial, but sometimes it can be useful to know how to figure out the ID of your Python process or thread.

To find out the ID of the current Python process:

>>> import os
>>> os.getpid()

To find out the active thread count, and thread identifier of the current thread:

>>> import threading
>>> threading.active_count()
>>> threading.get_ident()

To learn more about threads and processes, and how to create and control them from Python, see the Python standard library threading, multiprocessing and subprocess modules. The os and resource modules give you access to information about your operating system and system resources.

👉 Check how many CPU cores you have.

>>> import multiprocessing
>>> multiprocessing.cpu_count()

Note that sometimes the number reported does not reflect the number of hardware CPU cores. E.g. I have an Intel CPU with 4 cores, but here and in the activity monitor see 8. This is due to hyper-threading, where each physical CPU core appears as two logical cores.

In the Python standard library, things are pretty scattered and sometimes a bit cumbersome to use. Thankfully, there is a third-party Python package for process and system monitoring: psutil. To quote from the docs:

psutil (python system and process utilities) is a cross-platform library for retrieving information on running processes and system utilization (CPU, memory, disks, network, sensors) in Python. It is useful mainly for system monitoring, profiling, limiting process resources and the management of running processes.

Let's try out just a few of the things psutil can do:

import psutil

👉 How many CPU cores do you have? What frequency is your CPU?

psutil.cpu_count() # Number of logical cores
psutil.cpu_freq() # In MHz

👉 How much memory do you have? How much free?

psutil.virtual_memory().total / 1e9 # In GB
psutil.virtual_memory().free / 1e9 # In GB

👉 How much disk space do you have? How much free?

psutil.disk_usage('/').total / 1e9
psutil.disk_usage('/').free / 1e9

👉 How many processes are running?

len(psutil.pids())

The recipes section in the psutil docs contains examples of how to find, filter and control processes.

To get information about a specific process, you create a psutil.Process object. By default, the process will be the current process (with the number given by os.getpid())

>>> psutil.Process()
psutil.Process(pid=19651, name='python3.6', started='18:01:23')

But you can create a psutil.Process object for any process running on your machine, by giving the PID.

👉 Print the pid and name of the last 10 processes started.

for pid in psutil.pids()[-10:]:
    p = psutil.Process(pid)
    print(p.pid, p.name())

👉 Print the current CPU and memory use of the current process.

p = psutil.Process()
p.cpu_percent(interval=0.1) # in percent, measured over 0.1 seconds
p.memory_full_info().rss / 1e6 # in MB

Note that the CPU percent is given per core. So if you have 4 cores and a process that uses all of them, it will show up with cpu_percent = 400. To quote from the docs here:

The returned value is explicitly NOT split evenly between all available logical CPUs. This means that a busy loop process running on a system with 2 logical CPUs will be reported as having 100% CPU utilization instead of 50%.

The psutil docs are very good; one gotcha to watch out for is that some functions appear twice, e.g. there is psutil.cpu_percent for the whole system, and there is psutil.Process.cpu_percent for a given process.

You can use psutil directly from your script, or use tools built on top of psutil. There are some example applications, many projects using psutil and ports to other languages.

One tool I find nice is psrecord, which makes it simple to record and plot the CPU and memory activity of a given process.

👉 Run through psrecord.

psrecord --help
psrecord --interval 0.1 --plot compute_and_io.png --log compute_and_io.txt 'python'
head compute_and_io.txt
open compute_and_io.png

For me, this takes about 7 seconds:

$ psrecord --interval 0.1 --plot compute_and_io.png --log compute_and_io.txt 'python'
Starting up command 'python' and attaching to process
0.000 sec :  starting computation
0.352 sec :  starting network download
2.753 sec :  starting more computation
5.873 sec :  starting disk I/O
6.673 sec :  done
Process finished (7.09 seconds)

In the recorded compute_and_io.png one can nicely see the typical behaviour of Python processes:

  • One thread runs on one core at 100% CPU utilisation while you're doing computations. The Python interpreter is basically a while True: execute next byte code loop.
  • And when disk or network I/O happens, Python makes calls into the operating system, and the CPU utilisation is lower. It can be between 0% and 100%, depending on the I/O task and your computer.
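You can see the same effect from within Python by comparing wall clock time with CPU time. Here's a minimal sketch using only the standard library; the `compute` and `wait` functions are made-up examples, and `time.sleep` is only a stand-in for a process waiting on I/O.

```python
import time


def measure(func):
    """Return (wall time, CPU time) in seconds for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def compute():
    # Pure Python computation: keeps one core busy
    sum(i * i for i in range(300_000))


def wait():
    # Stand-in for I/O: the process sleeps and uses almost no CPU
    time.sleep(0.2)


for func in (compute, wait):
    wall, cpu = measure(func)
    print(f"{func.__name__}: wall={wall:.3f} s, cpu={cpu:.3f} s")
```

For `compute` the two numbers should be close to each other, for `wait` the CPU time should be close to zero.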

Note that Numpy, Scipy, Pandas or other libraries you use might use multiple CPU cores in some functions (see here). E.g. to compute the dot product of two arrays, Numpy calls into a linear algebra library, which often uses multiple CPU cores if they are available and if that would speed up the computation. Sometimes more cores don't help because the bottleneck is the data access from memory. The performance of a given script might be very different depending not just on your CPU, but also on your software (e.g. Numpy and BLAS), see e.g. here. Anaconda by default gives you the high-performance Intel MKL.

👉 Use numpy.show_config to see which linear algebra library (called BLAS and LAPACK) you are using.

👉 Run psrecord on

psrecord --interval 0.1 --plot multi_core.png 'python'

On my machine, numpy.random.random_sample uses one core, and uses all four available cores: multi_core.png.

If you want to write functions yourself that use multiple cores, this is possible with Numba, Cython or from any C extension, but not from normal Python code. If you're interested in this, see e.g. here or here.

psutil can also help you with disk or network I/O and monitoring or controlling subprocesses. We won't go into this here, just one quick example:

👉 Get a list of open files for your process.

import psutil
p = psutil.Process()
fh = open('')
p.open_files() # list of files currently opened by this process

5. Time code execution

Total runtime of your analysis is often the most important performance number you care about.

To time the execution of a Python script, you can use the Unix time command.

👉 Time the python interpreter startup. Time import numpy.

$ time python -c ''
real	0m0.038s
user	0m0.025s
sys	0m0.008s

$ time python -c 'import numpy'

real	0m0.141s
user	0m0.105s
sys	0m0.032s

Detailed information about the three times is given here and here. Basically:

  • The real time is the wall clock time. It's what you usually care about and want to be small.
  • The user and sys are the time spent in "user mode" and "in the kernel". You usually don't care about this breakdown.

Note that real time, i.e. wall clock time, doesn't depend on the number of CPU cores that were used. But user and sys do: for processes that use multiple cores, they can be larger than the real time. Here's an example:

$ time python

real	0m15.656s
user	0m51.727s
sys	0m0.895s
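If you want the same real / user / sys breakdown from within Python, one option is `os.times()` from the standard library. This is just a sketch; the `timed` helper and the workload are hypothetical examples.

```python
import os
import time


def timed(func):
    """Call func once; return elapsed (real, user, sys) time in seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    func()
    cpu_end = os.times()
    real = time.perf_counter() - wall_start
    user = cpu_end.user - cpu_start.user
    system = cpu_end.system - cpu_start.system
    return real, user, system


real, user, sys_time = timed(lambda: sum(i * i for i in range(500_000)))
print(f"real {real:.3f} s, user {user:.3f} s, sys {sys_time:.3f} s")
```

The resolution of `os.times()` is fairly coarse, so for short pieces of code the user and sys numbers can come out as zero.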

If you want to time only part of your Python script, you can use the Python standard library time module, specifically the time.time function, like this:

import time
t_start = time.time()
# start of code you want to time
# end of code you want to time
t_run = time.time() - t_start
print('t_run:', t_run)
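If you time several parts of a script this way, a small helper can save some typing. Here's one possible sketch as a context manager; the `timer` name and its `results` argument are my invention, and I've used `time.perf_counter` instead of `time.time` because it is monotonic and has a higher resolution.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, results=None):
    """Time the enclosed block; print the result and optionally store it."""
    t_start = time.perf_counter()
    try:
        yield
    finally:
        t_run = time.perf_counter() - t_start
        if results is not None:
            results[label] = t_run
        print(f"{label}: {t_run:.4f} s")


timings = {}
with timer("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```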

We already saw this above in the example using

If you want to do more precise timing of small bits of Python code (say less than 1 second) use timeit.

Both %time and %timeit are available from ipython and Jupyter.

👉 Let's work through the "Timing Code Snippets" section in the Profiling and Timing Code Jupyter notebook from the excellent Python Data Science Handbook by Jake VanderPlas. Instructions how to start it are in the Setup section.

Using %timeit for the first time is confusing. It's not obvious what is happening and what all the numbers mean:

In [1]: %%timeit a = 42 
   ...: a * a 
40.8 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  • The reported time is wall clock time (no distinction of user and sys made, number of cores used not considered)
  • The statement on the first line is setup, not included in the measured time. You can also do this in a previous input prompt instead of putting it after the %%timeit.
  • There were 7 runs and 10 million loops for each run. That means the statement a * a was timed 70 million times.
  • For each run, the time per loop is the total time of the run divided by the number of loops.
  • %timeit reports the mean and std. dev. of the runs. So the main number of interest is the first one on the line, in this case: multiplying two Python ints takes 40.8 ns. The std. dev. (1.5 ns in this case) isn't really of interest, except to judge the reliability of the measurement a little bit: it should be much smaller than the mean.
  • As you will see on the %timeit? help page, you can use -r to set the number of runs (default is 7) and -n to set the number of loops per run. The default for -n is to use an adaptive method, so that it will run very often if your code executes very quickly, and only once or a few times if your code takes a long time (e.g. 1 second) to execute.
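Outside of ipython you can get a summary of the same form with the standard library timeit module. The sketch below is my approximation of what the "mean ± std. dev." line contains, with the number of runs and loops fixed by hand instead of chosen adaptively.

```python
import statistics
import timeit

runs, loops = 7, 100_000

# One entry per run: total time for `loops` executions of the statement.
# The setup statement is executed once per run and is not timed.
run_totals = timeit.repeat(stmt="a * a", setup="a = 42", repeat=runs, number=loops)

per_loop = [total / loops for total in run_totals]
mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)
print(f"{mean * 1e9:.1f} ns ± {std * 1e9:.1f} ns per loop "
      f"(mean ± std. dev. of {runs} runs, {loops} loops each)")
```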

Using %time for the first time is confusing. Four times are given for each measurement:

In [5]: %time a = np.ones(int(1e6))                                                                             
CPU times: user 2.27 ms, sys: 3.76 ms, total: 6.03 ms
Wall time: 4.85 ms

In [8]: %time 3 * 3                                                                                             
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 8.11 µs
  • The number you care about is the wall time, ignore the others.
  • The "total" is the sum of the user and sys CPU times. It can be shorter than the wall time (e.g. when the process is waiting) or longer (e.g. when multiple CPU cores are used).
  • The Python timeit and the ipython %timeit usually report a shorter time than the Python time and ipython %time. Basically timeit will give the best possible execution time, by running multiple times and reporting the best, but also by avoiding interruptions from operating system calls, and the Python garbage collector. time just runs once and doesn't do those things. You have to decide which you want.

6. Function-level profiling

The examples above using psutil and psrecord use "sampling" at regular time intervals "from the outside" to measure the CPU and memory usage. This can be useful to see the overall performance of your process, but the connection to your Python code is lost. To understand and optimise your code, you need something different.

The Python standard library contains a deterministic function-level profiler. It traces the execution of your Python code, and records every function call and return (a more detailed explanation is here). Then at the end, you can examine the stats to find which functions were run how often and how much time is spent in each function.
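To get a feeling for what "tracing every function call" means, here is a minimal sketch using `sys.setprofile`, the standard library hook that gets called on every function call and return. It only illustrates the idea and is not how cProfile is actually implemented; `square` and `total` are made-up example functions.

```python
import sys
from collections import Counter

call_counts = Counter()


def tracer(frame, event, arg):
    # Called by the interpreter for call / return events in this thread
    if event == "call":
        call_counts[frame.f_code.co_name] += 1


def square(x):
    return x * x


def total(n):
    return sum(square(i) for i in range(n))


sys.setprofile(tracer)
try:
    result = total(100)
finally:
    sys.setprofile(None)  # always switch tracing off again

print(call_counts["total"], call_counts["square"])
```

A real profiler also records timestamps on call and return, which is where the timing information comes from, and also why deterministic profiling adds overhead to every function call.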

Note: 😲 The Python standard library has both profile and cProfile, which do the same thing. 😕 profile is implemented in Python and is slower; cProfile is implemented in C and is faster. So you should always use cProfile. 😌

👉 Run the script through the Python profiler.

$ python -m cProfile --help
Usage: [-o output_file_path] [-s sort] scriptfile [arg] ...

  -h, --help            show this help message and exit
  -o OUTFILE, --outfile=OUTFILE
                        Save stats to <outfile>
  -s SORT, --sort=SORT  Sort order when printing to stdout, based on
                        pstats.Stats class

$ python -m cProfile
         30 function calls in 0.114 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.114    0.114<module>)
        2    0.000    0.000    0.088    0.044
        1    0.001    0.001    0.113    0.113
        2    0.088    0.044    0.088    0.044<listcomp>)
        2    0.000    0.000    0.023    0.012
        1    0.000    0.000    0.114    0.114 {built-in method builtins.exec}
       20    0.023    0.001    0.023    0.001 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

The meaning of the columns is described here.
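The two columns that are most often confused are tottime (time spent in the function itself, excluding sub-calls) and cumtime (time including all sub-calls). Here's a small sketch to see the difference, using made-up `inner` and `outer` functions. Note that it reads the `stats` dictionary of `pstats.Stats`, which is not part of the documented API, so treat that part as an assumption about current Python versions.

```python
import cProfile
import pstats


def inner(n):
    return sum(i * i for i in range(n))


def outer():
    return [inner(20_000) for _ in range(5)]


profiler = cProfile.Profile()
profiler.runcall(outer)

# stats maps (filename, lineno, name) ->
# (primitive calls, ncalls, tottime, cumtime, callers)
stats = pstats.Stats(profiler)
by_name = {key[2]: value for key, value in stats.stats.items()}

for name in ("outer", "inner"):
    _, ncalls, tottime, cumtime, _ = by_name[name]
    print(f"{name}: ncalls={ncalls}, tottime={tottime:.4f}, cumtime={cumtime:.4f}")
```

`outer` does almost no work itself, so its tottime is tiny, but its cumtime includes all five calls to `inner`.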

👉 Profile and store the resulting stats in Use pstats to read and view the stats in different ways.

$ python -m cProfile -o
$ python -m pstats
Welcome to the profile statistics browser.
% help

Documented commands (type help <topic>):
EOF  add  callees  callers  help  quit  read  reverse  sort  stats  strip

% read stats
Tue Jun  5 17:07:43 2018

         30 function calls in 0.109 seconds

   Random listing order was used

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.109    0.109 {built-in method builtins.exec}
       20    0.023    0.001    0.023    0.001 {built-in method builtins.sum}
        2    0.084    0.042    0.084    0.042<listcomp>)
        1    0.001    0.001    0.107    0.107
        1    0.002    0.002    0.109    0.109<module>)
        2    0.000    0.000    0.023    0.011
        2    0.000    0.000    0.084    0.042
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

% sort ncalls
% stats
Tue Jun  5 17:07:43 2018

         30 function calls in 0.109 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.023    0.001    0.023    0.001 {built-in method builtins.sum}
        2    0.084    0.042    0.084    0.042<listcomp>)
        2    0.000    0.000    0.023    0.011
        2    0.000    0.000    0.084    0.042
        1    0.000    0.000    0.109    0.109 {built-in method builtins.exec}
        1    0.001    0.001    0.107    0.107
        1    0.002    0.002    0.109    0.109<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

% quit
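The same steps can also be scripted instead of typed into the interactive browser. Here's a sketch under a few assumptions: `work` is a made-up function, and the stats file name `example.prof` in a temporary directory is hypothetical.

```python
import cProfile
import io
import os
import pstats
import tempfile


def work():
    return sum(sum(range(1000)) for _ in range(200))


# Profile a call and save the stats, similar to `python -m cProfile -o <file>`
profiler = cProfile.Profile()
profiler.runcall(work)
stats_path = os.path.join(tempfile.mkdtemp(), "example.prof")  # hypothetical name
profiler.dump_stats(stats_path)

# Load the stats again, sort by call count and print the top 5 entries
report = io.StringIO()
stats = pstats.Stats(stats_path, stream=report)
stats.strip_dirs().sort_stats("ncalls").print_stats(5)
print(report.getvalue())
```

Other sort keys such as "tottime" or "cumtime" work the same way.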

cProfile and pstats are described in detail in the tutorials here or here. They are very powerful, but can be a bit cumbersome to use. Let's look at two user-friendly options that use cProfile under the hood.

From ipython or jupyter you can use the %prun or %%prun magic commands (see docs or bring up the help via %prun?).

👉 Import compute and %prun the compute.main() function.

$ ipython

In [1]: %prun?

In [2]: import compute

In [3]: %prun compute.main()
         30 function calls in 0.114 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.086    0.043    0.086    0.043<listcomp>)
       20    0.025    0.001    0.025    0.001 {built-in method builtins.sum}
        1    0.002    0.002    0.114    0.114 <string>:1(<module>)
        1    0.001    0.001    0.112    0.112
        2    0.000    0.000    0.025    0.013
        1    0.000    0.000    0.114    0.114 {built-in method builtins.exec}
        2    0.000    0.000    0.086    0.043
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

The functionality from pstats is OK to understand the profile results. But often being able to quickly visualise and browse the results is nicer. For this, you can use snakeviz.

👉 Open with snakeviz.

$ snakeviz 
snakeviz web server started on; enter Ctrl-C to exit

Make sure you explore the output a bit, especially try both "Sunburst" and "Icicle" for the style.

👉 Run snakeviz from ipython or jupyter using %load_ext snakeviz and then %snakeviz or %%snakeviz.

7. Line-level profiling

With function-level profiling you can find the functions that are relevant to the performance of your application. But what if you want to know which lines of code in the function are slow? The line_profiler package lets you measure execution time line by line, from Python, ipython or Jupyter.

👉 Let's continue with the Profiling and Timing Code notebook to see an example of line profiling from Python.

8. Memory profiling

You want your program to fit in main memory. If it doesn't, then the operating system will either start swapping to disk, which is slow, or kill your process.

In this section we look at how you can figure out how much memory your Python program uses, and where that memory is allocated.

👉 Write some Python code that runs out of memory, i.e. causes a MemoryError.

💡 See

This is surprisingly difficult, because depending on your operating system and configuration, it might start swapping to disk. So allocating more memory than you have RAM might work just fine. psutil.swap_memory can give some info on this.

👉 To find the peak memory use of a program, you can run it through psrecord and then take the max of the third column "Real (MB)"

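import pandas as pd
df = pd.read_csv(
      'compute_and_io.txt',  # assumption: the psrecord --log file recorded earlier
      skiprows=1, sep=r'\s+',  # skip the header line, columns are whitespace-separated
      names=['t', 'cpu', 'mem', 'vmem'],
)
mem_max = df['mem'].max()
print(f'Max memory: {mem_max} MB')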

Now what if you're using too much memory, and would like to know why?

First of all, it helps to know a bit about how Python stores data. As explained here, a Python int or float is more than just an int or float in C. Similarly, e.g. a list uses more memory than an array in C.

👉 Use sys.getsizeof to measure the size (in bytes) of a few objects:

>>> import sys
>>> sys.getsizeof(42) # int
>>> sys.getsizeof('spam') # str
>>> sys.getsizeof({}) # empty dict
>>> sys.getsizeof([]) # empty list
>>> sys.getsizeof([42]) # list with one int
>>> import numpy as np
>>> data = np.ones(1000) # default dtype is float with 64 bit = 8 byte
>>> sys.getsizeof(data)
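One thing to keep in mind is that sys.getsizeof only reports the size of the object itself, not of the objects it refers to. For a list that means the container, but not the elements. Here's a rough sketch of a recursive version; the `total_size` helper is my own and only handles a few built-in container types.

```python
import sys


def total_size(obj, seen=None):
    """Rough recursive size in bytes for nested lists, tuples, sets and dicts."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size


numbers = [float(i) for i in range(1000)]
print("container only:", sys.getsizeof(numbers), "bytes")
print("with elements: ", total_size(numbers), "bytes")
```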

Some Python objects have convenience methods to get their memory use. Especially there is numpy.ndarray.nbytes:

>>> import numpy as np
>>> array = np.ones(1000)
>>> array.nbytes
>>> array.size
>>> array.itemsize

Pandas shows the memory used by a data frame via

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3.3, 4.4]})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
a    2 non-null int64
b    2 non-null float64
dtypes: float64(1), int64(1)
memory usage: 112.0 bytes

There is also the memory_profiler which you can use to monitor memory usage or even look at line by line increment / decrement of memory usage.
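If you can't install memory_profiler, the standard library tracemalloc module can answer similar questions for memory allocated by Python. To be clear, this is a different tool, not what memory_profiler does internally, and `build_table` is a made-up example function. A minimal sketch:

```python
import tracemalloc


def build_table(n_rows):
    """Hypothetical memory-hungry function: a list of small lists."""
    return [[float(i), float(i) ** 2] for i in range(n_rows)]


tracemalloc.start()
before = tracemalloc.take_snapshot()
table = build_table(100_000)
after = tracemalloc.take_snapshot()
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes / 1e6:.1f} MB, peak: {peak_bytes / 1e6:.1f} MB")

# Source lines that allocated the most memory between the two snapshots
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```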

👉 Let's use the 01.07-Timing-and-Profiling.ipynb to try this out. (See last section how to get it.)


Let's try to apply what we have learned to a longer and more complex example.

How to make cash with Python, fast!

👉 Read and run

👉 Measure the execution time and peak memory use.

time python
psrecord --interval 0.02 --plot cash.png 'python'

👉 Profile the execution, and browse the profile stats to see where the time is spent.

python -m cProfile -o

👉 Create a line-by-line profile for the cash function.

Open IPython and use these commands:

%load_ext line_profiler
import cash
%lprun -f cash.cash cash.benchmark()
%lprun -f cash.benchmark -f cash.model -f cash.cash cash.benchmark()

👉 For x = np.ones(int(1e6)), use %timeit to check how long x + x, x * x, x ** 2 and np.log(x) take. How does the execution time change with array size (e.g. try 1 or 1000 elements)?

Do it in IPython or Jupyter

import numpy as np
x = np.ones(int(1e6), dtype=np.float64)
%timeit np.log(x)

👉 Try different data types (32 and 64 bit floats and integers) for the simple statements and Which data type is fastest?

👉 Try to optimise the implementation of the cash function for this test case. Is there a better way to implement it using Numpy? Maybe try numba, numexpr, Cython, Tensorflow or pytorch? Can you find a way to use multiple CPU cores or even a GPU? How big is the speedup compared to the reference implementation that you can achieve?

Note: I'm very interested in ideas or solutions here, if you get something, please share! In Gammapy analyses we spend a significant fraction of time evaluating models and the cash fit statistic.

Note: the task is to profile and optimise the function implementation. Don't change the function inputs (e.g. use different or adaptive binning) or outputs (the assert on the result value should still pass).

Things to remember


  • Profiling is hard. Performance depends on your code, data and parameters, but also on your CPU, C compiler, Python, libraries, ...
  • Only profile and optimise if needed. Most of the time you don't.
  • Always measure and profile a real use case before starting to optimise. Often the measure of interest is runtime, sometimes memory use or disk I/O or other things.
  • Usually data structures and algorithms are more important than micro-optimisations.


  • Python provides great tools for timing and profiling code.
  • Use psutil and psrecord to measure and record CPU and memory use. You can use this on any process, not just Python.
  • Use the Unix time, Python standard library time or timeit, and ipython / Jupyter %time, %timeit line and %%time, %%timeit cell magic commands to measure execution time.
  • To profile CPU usage, the Python standard library provides cProfile and pstats. The %prun and snakeviz make this nice to use. This is a "deterministic function-level profiler", i.e. works by tracing function calls.
  • Use line_profiler and the included kernprof, %lprun, %%lprun for line-by-line profiling of a given function. Again, this is a deterministic profiler tracing line execution.
  • There are also "sampling profilers" that sample a process at given time intervals. This is what psrecord does, and also what system monitor tools do. We didn't cover them here, but some links to other profilers are given in the next section.
  • Use memory_profiler to monitor the memory usage of Python code. %memit for a single statement, and %mprun for line-by-line profiling of a given function.
  • Use sys.getsizeof or Numpy array.nbytes or Pandas to see the memory usage for a given object.

Going further

If you'd like to learn more, here's how you can go further:

  • If you only read through this tutorial, go back to the start and type and execute the exercises to make it stick. They are marked with 👉.
  • We did not do a real-world complex example here. Take some of your application code or data analysis and time and profile it to practice. Is it I/O or CPU limited? How much peak memory does it use? Which functions or lines are the bottleneck?
  • The Profiling and Timing Code notebook from the Python data science handbook by Jake VanderPlas covers similar material, executing everything from the Jupyter notebook.
  • The How to optimize for speed page in the scikit-learn docs, which also contains some information on profiling C extensions.
  • The profile and pstats — Performance Analysis tutorial from the Python module of the week by Doug Hellmann is a very detailed overview of cProfile and pstats.
  • The README and docs of psutil give a good overview of all the things you can monitor and measure about your system and process.
  • The snakeviz docs contain descriptions and examples of how to visually explore the profile stats.
  • line_profiler documentation
  • memory_profiler documentation

There are other Python profiling and visualisation tools. I haven't tried them yet, but here's a list:

  • PyCharm has a profiler, but only in the non-free professional edition.
  • vmprof-python - a statistical program profiler
  • yappi - Yet Another Python Profiler
  • vprof - Visual profiler for Python
  • plop - Python Low-Overhead Profiler
  • pyinstrument - Call stack profiler for Python. Shows you why your code is slow!
  • gprof2dot - Converts profiling output to a dot graph
  • pyprof2calltree - Profile python programs and view them with kcachegrind
  • PyFlame - A Ptracing Profiler For Python
  • Intel VTune - Supports many languages, including Python

We did not have time to cover optimisation.

If you'd like to make your code faster, here's some things you could look at:

  • Get to know the data structures and the performance characteristics of Python types (numbers, lists, dicts, objects) as well as numpy and pandas. The slides from David Beazley here give a good overview.
  • Vectorise your code using numpy.
  • If your algorithm isn't easy to express in vectorised form with numpy, or if numpy is too slow or uses too much memory, try Numba or Cython. There are a lot of tutorials and comparisons available online. A recent one that contains a good summary and link collection at the top is The case for Numba by Matthew Rocklin.
  • To take advantage of multiple cores, try multiprocessing from the Python standard library. Also look at Dask.
  • If you're willing to consider other languages to write a Python C extension, you have options which language to use:
    • You can write in Cython and it will generate C.
    • If you use C, popular options to interface include CFFI and Cython (as well as others, see e.g. here)
    • For C++, traditionally SWIG and Cython and Boost.Python have been frequently used, but recently I think pybind11 has become the tool of choice.
    • Julia has very good interfacing with Python via PyCall. From what I've seen, it's not commonly used though to ship with Python libraries, probably because it's new and harder to support installation for the many different machines and distributions users have. For other modern languages like Rust or Go it's similar as far as I know.