Closed

Commits (23)
e503272: start course with optimisation section and finish with profiling tools (msarkis-icr, Jan 14, 2025)
618fb80: change the course narrative to start with the optimisation (msarkis-icr, Jan 14, 2025)
c074680: remove logo by disconnecting varnish (msarkis-icr, Jan 14, 2025)
a7a09ca: change carpentry type (msarkis-icr, Jan 14, 2025)
c73399a: Fixed US/British spelling mix-up and reduced the number of new paragr… (Jan 21, 2025)
010a273: Fixed comma issues and fixed grammar in the second last paragraph (Jan 21, 2025)
d18ba1c: Fix spelling mistakes and changed output/description order in vectori… (Jan 22, 2025)
ea7ba46: Fixed comma issues and added 'Later' specifier for one topic (Jan 22, 2025)
5a31eff: Fixed gammar issues (Jan 22, 2025)
45751ef: Fixed gammar issues (Jan 22, 2025)
e00e48e: Merge pull request #1 from ICR-RSE-Group/fix-issue-21 (stacyrse, Jan 23, 2025)
29e3e24: remove registration (msarkis-icr, Jan 27, 2025)
354f4fa: re-write the algo and diagram explanation for linear probing (msarkis-icr, Jan 27, 2025)
6375824: rephrase introduction for a smoother link with testing (msarkis-icr, Jan 27, 2025)
eb289fa: re-write memory section (msarkis-icr, Jan 27, 2025)
ee341e0: add a brief definition of CPython and simplify text (msarkis-icr, Jan 27, 2025)
bbfb7be: Merge pull request #2 from ICR-RSE-Group/fix_issue_21_bis (msarkis-icr, Jan 29, 2025)
f95d4df: added an explicit comment about search in dictionaries being average … (Jan 29, 2025)
58857a6: removed extra imports which already exist on the page (Jan 29, 2025)
a3271f2: added a visual representation of python lists resizing (Jan 29, 2025)
3d73d1d: added a summary table for all data structure at the end of the data s… (Jan 30, 2025)
0d03524: changed summary able of data structures (Jan 30, 2025)
3320f94: Merge pull request #3 from ICR-RSE-Group/fix-issue-21-ss (stacyrse, Jan 30, 2025)
26 changes: 12 additions & 14 deletions config.yaml
Expand Up @@ -3,15 +3,15 @@
#------------------------------------------------------------

# Which carpentry is this (swc, dc, lc, or cp)?
# swc: Software Carpentry
# swc: Software Carpentry -
# dc: Data Carpentry
# lc: Library Carpentry
# cp: Carpentries (to use for instructor training for instance)
# incubator: The Carpentries Incubator
carpentry: 'incubator'
carpentry: 'swc'
Member:

It remains incubator until it's graduated to Carpentries Lab (aka super stable, well tested); I think it then has to graduate from Lab (somehow?) to be officially adopted into Software Carpentry.


# Overall title for pages.
title: 'Performance Profiling & Optimisation (Python)'
title: 'Python Optimisation and Performance Profiling'
Member:

I've discussed the order switching with colleagues and we remain firmly in the camp that profiling comes before optimisation (find the bottleneck, then identify whether it's something that can be addressed).

There's now a slight plan to extend the profiling Predator Prey exercise into the optimisation half of the course (#53), as you may have noticed optimisation is very exercise light, which won't really work without profiling being introduced first.

Author:

Ok, I understand your point of view.
Switching the order mostly fits the timing we chose for delivering the course.
We always realise that students concentrate less after lunch time, so we wanted a lighter section post-lunch.

Member (@Robadob, Mar 20, 2025):

We always realise that students concentrate less after lunch time,

I'm aware that some carpentries courses instead run as two half-day sessions for this reason (our local Git course is normally structured that way).


# Date the lesson was created (YYYY-MM-DD, this is empty by default)
created: 2024-02-01~ # FIXME
Expand All @@ -27,13 +27,13 @@ life_cycle: 'alpha'
license: 'CC-BY 4.0'

# Link to the source repository for this lesson
source: 'https://github.com/RSE-Sheffield/pando-python'
source: 'https://github.com/ICR-RSE-Group/carpentry-pando-python'

# Default branch of your lesson
branch: 'main'

# Who to contact if there are any issues
contact: 'robert.chisholm@sheffield.ac.uk'
contact: 'mira.sarkis@icr.ac.uk'

# Navigation ------------------------------------------------
#
Expand All @@ -59,23 +59,21 @@ contact: 'robert.chisholm@sheffield.ac.uk'

# Order of episodes in your lesson
episodes:
- profiling-introduction.md
- profiling-functions.md
- short-break1.md
- profiling-lines.md
- profiling-conclusion.md
- optimisation-introduction.md
- optimisation-data-structures-algorithms.md
- long-break1.md
- optimisation-minimise-python.md
- optimisation-use-latest.md
- optimisation-memory.md
- optimisation-conclusion.md
- long-break1.md
- profiling-introduction.md
- profiling-functions.md
- profiling-lines.md
- profiling-conclusion.md

# Information for Learners
learners:
- setup.md
- registration.md
Member:

There's an unbranded fork of the course on carpentries incubator.

https://github.com/carpentries-incubator/pando-python

- acknowledgements.md
- ppp.md
- reference.md
Expand All @@ -91,5 +89,5 @@ profiles:
# This space below is where custom yaml items (e.g. pinning
# sandpaper and varnish versions) should live

varnish: RSE-Sheffield/uos-varnish@main
url: 'https://rse.shef.ac.uk/pando-python'
#varnish: RSE-Sheffield/uos-varnish@main
#url: 'https://icr-rse-group.github.io/carpentry-pando-python'
Binary file added episodes/fig/python_lists.png
Member:

This is a great figure in principle, however I'm concerned the handwritten text isn't widely accessible as some readers may struggle with it (all the learning support guidance at my institution is very adamant that we stick to clear sans-serif fonts).

I'm happy to reproduce it and include it in the course giving you credit with your blessing. Potentially with a few changes to the text too (e.g. more context to "continuous block of memory").

Author:

Yeah, feel free to reproduce and modify it, and thank you for giving us credit for it.

2 changes: 1 addition & 1 deletion episodes/long-break1.md
@@ -1,5 +1,5 @@
---
title: Break
title: Lunch Break
teaching: 0
exercises: 0
break: 60
Expand Down
49 changes: 33 additions & 16 deletions episodes/optimisation-data-structures-algorithms.md
Expand Up @@ -63,9 +63,12 @@ CPython for example uses [`newsize + (newsize >> 3) + 6`](https://github.com/pyt

![The relationship between the number of appends to an empty list, and the number of internal resizes in CPython.](episodes/fig/cpython_list_allocations.png){alt='A line graph displaying the relationship between the number of calls to append() and the number of internal resizes of a CPython list. It has a logarithmic relationship, at 1 million appends there have been 84 internal resizes.'}

![Visual note on resizing behaviour of Python lists.](episodes/fig/python_lists.png){alt='Small cheat note for better visualization of Python lists.'}


This has two implications:

* If you are creating large static lists, they will use upto 12.5% excess memory.
* If you are creating large static lists, they will use up to 12.5% excess memory.
* If you are growing a list with `append()`, there will be large amounts of redundant allocations and copies as the list grows.
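The over-allocation described above can be observed directly with `sys.getsizeof()`, which reports a list's allocated size rather than its logical length. A minimal sketch (the exact byte figures and resize points are CPython version and platform dependent):

```python
import sys

# Watch CPython resize a list as items are appended: the allocated size
# jumps in steps, not on every append, because of over-allocation.
items = []
last_size = sys.getsizeof(items)
for i in range(32):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:
        print(f"len={len(items):>2} resized: {last_size} -> {size} bytes")
        last_size = size
```

Running this should show far fewer resizes than appends, which is how `append()` stays cheap on average despite the occasional copy.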

### List Comprehension
Expand Down Expand Up @@ -151,21 +154,23 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t
### Hashing Data Structures

<!-- simple explanation of how a hash-based data structure works -->
Python's dictionaries are implemented as hashing data structures.
Within a hashing data structure each inserted key is hashed to produce a (hopefully unique) integer key.
The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary.
If that index doesn't already contain another key, the key (and any associated values) can be inserted.
When the index isn't free, a collision strategy is applied. CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) both use a form of open addressing whereby a hash is mutated and corresponding indices probed until a free one is located.
When the hashing data structure exceeds a given load factor (e.g. 2/3 of indices have been assigned keys), the internal storage must grow. This process requires every item to be re-inserted which can be expensive, but reduces the average probes for a key to be found.
Python's dictionaries are implemented using hashing as their underlying data structure. In this structure, each key is hashed to generate a (preferably unique) integer, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the hash value of a key, modulo the dictionary's length, determines its initial index. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index.

![An visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram demonstrating how the keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. This is followed by the insertion of 59, 80 and 39 which require linear probing to be inserted due to collisions."}
In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found.

To retrieve or check for the existence of a key within a hashing data structure, the key is hashed again and a process equivalent to insertion is repeated. However, now the key at each index is checked for equality with the one provided. If an empty index is found before an equivalent key, then the key must not be present in the data structure.
When a dictionary or hash table in Python grows, the underlying storage is resized, which necessitates re-inserting every existing item into the new structure. This process can be computationally expensive but is essential for maintaining efficient average probe times when searching for keys.
![A visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram showing how keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. The insertion of 59, 80, and 39 demonstrates linear probing to resolve collisions."}
To look up or verify the existence of a key in a hashing data structure, the key is re-hashed, and the process mirrors that of insertion. The corresponding index is probed to see if it contains the provided key. If the key at the index matches, the operation succeeds. If an empty index is reached before finding the key, it indicates that the key does not exist in the structure.

The above diagram shows a hash table of 5 elements within a block of 11 slots:
Member:

This explanation of the figure is great, something that was missed in the original.

However, we're now planning to migrate some of these more technical explanations to a technical appendix, as they distract a bit from the course materials.

I think this section will get moved, though that's yet to be confirmed.

Regardless, there's a note added to #61 to review this when that's processed, which I'm hoping to do within the next month.

Author:

Moving technical sections to an appendix is a great idea, as it makes the material more accessible for most people.
We found the course very dense and advanced: we learned a lot reading it.

1. We try to add element k=59. Based on its hash, the intended position is p=4. However, slot 4 is already occupied by the element k=37. This results in a collision.
2. To resolve the collision, the linear probing mechanism is employed. The algorithm checks the next available slot, starting from position p=4. The first available slot is found at position 5.
3. The number of jumps (or steps) taken to find the available slot is represented by i=1 (since we moved from position 4 to 5): the algorithm had to probe one slot before finding an empty position at index 5.

### Keys

Keys will typically be a core Python type such as a number or string. However multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.
Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.

You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function.
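As a sketch of that tuple trick (the class and field names here are illustrative, not from the course):

```python
class GridCell:
    """Illustrative compound key: delegates hashing to a tuple of fields."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, GridCell) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Reuse Python's built-in tuple hashing instead of writing
        # a bespoke hash function.
        return hash((self.x, self.y))

counts = {GridCell(0, 1): "occupied"}
print(counts[GridCell(0, 1)])  # equal keys hash equally, so the lookup succeeds
```

Because `__hash__()` and `__eq__()` agree, two distinct `GridCell` objects with the same coordinates behave as the same dictionary key.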

Expand Down Expand Up @@ -265,7 +270,7 @@ Constructing a set with a loop and `add()` (equivalent to a list's `append()`) c

The naive list approach is 2200x times slower than the fastest approach, because of how many times the list is searched. This gap will only grow as the number of items increases.

Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach. In part because around half of it's runtime is now spent sorting the list.
Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach. In part because around half of its runtime is now spent sorting the list.

```output
uniqueSet: 0.30ms
Expand All @@ -280,9 +285,9 @@ uniqueListSort: 2.67ms

Independent of the performance to construct a unique set (as covered in the previous section), it's worth identifying the performance to search the data-structure to retrieve an item or check whether it exists.

The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.
The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with single access, result in an average time complexity of a constant (which is very good!).
Member:

Course does not explain O-notation or time complexity, I have reworded this though.

In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing on the first attempt (without probing beyond the original hash).

Likewise, not sure why you removed "a", and "result" feels like it should be "resulting".

Author:

no problem dropping this change about O-notation.


In contrast if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found half way through the list, meaning that an average search will require checking half of the items.
In contrast, if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore, the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found halfway through the list, meaning that an average search will require checking half of the items.

If however the list or array is sorted, a binary search can be used. A binary search divides the list in half and checks which half the target item would be found in, this continues recursively until the search is exhausted whereby the item should be found or dismissed. This is significantly faster than performing a linear search of the list, checking a total of `log N` items every time.
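Python's standard library provides this via the `bisect` module; a minimal membership check over a sorted list might look like:

```python
from bisect import bisect_left

def binary_contains(sorted_items, target):
    """Return True if target is in sorted_items, probing ~log2(N) items."""
    i = bisect_left(sorted_items, target)   # leftmost insertion point
    return i < len(sorted_items) and sorted_items[i] == target

data = sorted([14, 37, 59, 64, 67, 80, 94])
print(binary_contains(data, 59))   # present
print(binary_contains(data, 40))   # absent
```

Note the precondition: the list must already be sorted, so the cost of sorting has to be weighed against the savings from faster searches.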

Expand Down Expand Up @@ -333,9 +338,7 @@ print(f"linear_search_list: {timeit(linear_search_list, number=repeats)-gen_time
print(f"binary_search_list: {timeit(binary_search_list, number=repeats)-gen_time:.2f}ms")
```

Searching the set is fastest performing 25,000 searches in 0.04ms.
This is followed by the binary search of the (sorted) list which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search.
The linear search of the list is more than 56,600x slower than the fastest, it really shouldn't be used!
Searching the set is the fastest, performing 25,000 searches in 0.04ms. This is followed by the binary search of the (sorted) list which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search. The linear search of the list is more than 56,600x slower than searching the set, it really shouldn't be used!

```output
search_set: 0.04ms
Expand All @@ -345,6 +348,20 @@ binary_search_list: 5.79ms

These results are subject to change based on the number of items and the proportion of searched items that exist within the list. However, the pattern is likely to remain the same. Linear searches should be avoided!

::::::::::::::::::::::::::::::::::::: callout

Dictionaries are designed to handle insertions efficiently, with average-case O(1) time complexity per insertion for a small dict, but this can become problematic for a large dict. In such cases, it may be better to use an alternative data structure, for example a list, NumPy array, or Pandas DataFrame. The table below summarises the best uses and performance characteristics of each data structure:
Member (@Robadob, Mar 16, 2025):

As above, attendees are not expected to be aware of O-notation (the vast majority are not formally trained programmers), this would need to provide alot more context (e.g. what does O(1) mean, what does O(N) mean, why O-notation is not the whole story [e.g. linked lists are rarely good in practice, due to scattered memory accesses]).

It may be suitable for the technical appendix in future.


| Data Structure | Small Size Insertion (O(1)) | Large Size Insertion | Search Performance (O(1)) | Best For |
|------------------|-----------------------------------|------------------------------------------|---------------------------|--------------------------------------------------------------------------|
| Dictionary | ✅ | ⚠️ Occasional O(n) (due to resizing) | ✅ O(1) (Hashing) | Fast insertions and lookups, key-value storage, small to medium data |
| List | ✅ Amortized (O(1) Append) | ✅ Efficient (Amortized O(1)) | ❌ O(n) (Linear Search) | Dynamic appends, ordered data storage, general-purpose use |
| Set | ✅ Average O(1) | ⚠️ Occasional O(n) (due to resizing) | ✅ O(1) (Hashing) | Membership testing, unique elements, small to medium datasets |
| NumPy Array | ❌ (Fixed Size) | ⚠️ Costly (O(n) when resizing) | ❌ O(n) (Linear Search) | Numerical computations, fixed-size data, vectorized operations |
| Pandas DataFrame | ❌ (if adding rows) | ⚠️ Efficient (Column-wise) | ❌ O(n) (Linear Search) | Column-wise analytics, tabular data, large datasets |
NumPy and Pandas, which we have not yet covered, are powerful libraries designed for handling large matrices and arrays. They are largely implemented in C to optimise performance, making them ideal for numerical computations and data analysis tasks.

:::::::::::::::::::::::::::::::::::::::::::::
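As a brief taste of why NumPy suits numerical work (a sketch assuming `numpy` is installed; it is not yet introduced at this point in the course):

```python
import numpy as np

# Vectorised arithmetic runs in compiled C loops rather than the Python
# interpreter, which is where NumPy's speed advantage comes from.
values = np.arange(100_000, dtype=np.float64)

python_total = sum(v * v for v in values)    # interpreted per-element loop
numpy_total = float(np.dot(values, values))  # single vectorised call

# Both compute the sum of squares; the vectorised form is typically
# orders of magnitude faster at this size.
print(np.isclose(python_total, numpy_total))
```

Timing the two lines with `timeit` is a worthwhile exercise; the gap grows with the array size.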

::::::::::::::::::::::::::::::::::::: keypoints

Expand Down
Loading