# Lists, Sets and High-Performance Collections

**September 08 2020**  
*Vincenzo Perri* 

A particularly appealing feature of Python is its powerful built-in support for lists, sequences, sets and dictionaries. Here we introduce some basics as well as advanced features such as high-performance collections, which are crucial when processing large data sets.

## Lists and List Comprehension

Different from many other languages, in `python` lists are first-class citizens of the language, which makes their use particularly convenient. A list consisting of elements with arbitrary (and different) types can be defined by enclosing them in square brackets:

In [1]:
l = [1, 42, 'test', True, 7.5, 'x']

The `append` function can be used to add an element to the end of a list:

In [2]:
l.append('one more item')
print(l)

[1, 42, 'test', True, 7.5, 'x', 'one more item']


We can access list elements via their zero-based index.

In [3]:
print(l[0])
print(l[2])

1
test


We can access the last entry by using the index -1:

In [4]:
print(l[-1])

one more item


The `insert` function allows us to insert an element at a given index.

In [5]:
l.insert(2, 42.0)
print(l)

[1, 42, 42.0, 'test', True, 7.5, 'x', 'one more item']


The `remove` function removes the first occurrence of a given value from the list. If the value is not contained in the list, a `ValueError` is raised.

In [6]:
l.remove('test')
print(l)

[1, 42, 42.0, True, 7.5, 'x', 'one more item']


Finally, we can use the `pop` function to remove (and return) an element at a given index. If no index is specified, the function will remove the last element in the list. This implies that we can simply use the `append` and `pop` functions of the `list` class if we need a LIFO stack data structure. If we need a FIFO queue, we can use `insert(0, x)` and `pop()` on a `list`, or equivalently `append(x)` and `pop(0)`, although both variants take O(n) time per operation because the remaining elements have to be shifted:

In [7]:
print(l.pop(0))
print(l)

print(l.pop())
print(l)

1
[42, 42.0, True, 7.5, 'x', 'one more item']
one more item
[42, 42.0, True, 7.5, 'x']
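
Because inserting or removing at the front of a list shifts all remaining elements, a list-based FIFO queue can become slow for long lists. As a minimal sketch, the `deque` class from the standard `collections` module supports adding and removing elements at both ends efficiently. The variable names below are only illustrative:

```python
from collections import deque

# LIFO stack based on a plain list: append and pop both work on the end
stack = []
stack.append('a')
stack.append('b')
stack.append('c')
last_in = stack.pop()        # removes the most recently added element

# FIFO queue based on a deque: append on the right, remove from the left
queue = deque()
queue.append('a')
queue.append('b')
queue.append('c')
first_in = queue.popleft()   # removes the oldest element

print(last_in, stack)
print(first_in, list(queue))
```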


Lists are iterable objects for which we can directly iterate through the elements by a for-loop:

In [8]:
for element in l:
    print(element)

42
42.0
True
7.5
x


A powerful concept in `python` is **list comprehension**. It allows us to define lists using an intuitive notation that resembles the mathematical definition of sequences. The general syntax is:

`[expression for element in iterable]`

This is useful to generate number sequences based on arbitrary mathematical expressions. For instance, we can define a list that contains the first nine values of the harmonic sequence $\frac{1}{x}$ as follows (note that `range(1,10)` stops before 10):

In [9]:
x = [1/x for x in range(1,10)]
print(x)

[1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125, 0.1111111111111111]
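
A comprehension can also carry an optional `if` clause that filters the elements of the iterable. The following sketch, with illustrative variable names, builds the squares of the even numbers below 10 and checks that an explicit loop produces the same list:

```python
# Squares of the even numbers between 0 and 9
even_squares = [n**2 for n in range(10) if n % 2 == 0]
print(even_squares)

# The same list built with an explicit loop
even_squares_loop = []
for n in range(10):
    if n % 2 == 0:
        even_squares_loop.append(n**2)
print(even_squares_loop == even_squares)
```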


## Dictionaries

Different from a list, a dictionary is a structure that contains key-value pairs, i.e. we have a set of (unique) keys, each of which is assigned exactly one value (though this value can be a list containing several items). We can create an empty dictionary either by using the constructor of the `dict` class or by writing empty curly brackets `{}`. We can then assign values to the keys in a dictionary `d` by using the bracket notation `d[key] = value`.

Let us define an empty dictionary and assign values to string keys:

In [10]:
d = {}
d['key1'] = 42
d['key2'] = 'some string'
d['key3'] = True
d['key4'] = 4.5
print(d)

{'key1': 42, 'key2': 'some string', 'key3': True, 'key4': 4.5}


We can alternatively use the curly bracket notation `{ key1: value1, key2: value2}` to define a dictionary with multiple key-value pairs.

In [11]:
d = {'key1': 42, 'key2': 'some string', 'key3': True, 'key4': 4.5}

If we assign a value to a key, the dictionary is either expanded, or the value of an existing key will be overwritten.

In [12]:
d['key4'] = 5
print(d)

d['key5'] = 'expanded'
print(d)

{'key1': 42, 'key2': 'some string', 'key3': True, 'key4': 5}
{'key1': 42, 'key2': 'some string', 'key3': True, 'key4': 5, 'key5': 'expanded'}


We can use the `in` operator to check whether a key is already contained in the dictionary. The implementation of python dictionaries internally hashes keys, which turns this into an O(1) operation on average.

In [13]:
print('key1' in d)
print('not present' in d)

True
False
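
Accessing a missing key with the bracket notation raises a `KeyError`. As a small sketch, the `get` function of the `dict` class combines the membership test and the lookup by returning a default value for missing keys. The dictionary `ages` is a made-up example:

```python
ages = {'alice': 31, 'bob': 27}

# The in operator checks keys, not values
print('alice' in ages)
print(31 in ages)

# get returns the value if the key exists, otherwise the given default
alice_age = ages.get('alice', 0)
carol_age = ages.get('carol', 0)
print(alice_age, carol_age)

# get does not add the missing key to the dictionary
print('carol' in ages)
```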


When we iterate over a dictionary, we actually iterate over the keys. Since Python 3.7, the keys are guaranteed to be returned in the order in which they were inserted; in older versions this order was not guaranteed. We can also use the `items()` function to iterate over key-value pairs.

In [14]:
for k in d:
    print('key {0} -> value {1}'.format(k, d[k]))
print('---')
for k, v in d.items():
    print('key {0} -> value {1}'.format(k, v))

key key1 -> value 42
key key2 -> value some string
key key3 -> value True
key key4 -> value 5
key key5 -> value expanded
---
key key1 -> value 42
key key2 -> value some string
key key3 -> value True
key key4 -> value 5
key key5 -> value expanded


## Sets and Tuples

Speaking of order, lists are ordered and elements are not necessarily unique. Thus, if we test two lists for equality, both the order of elements and multiple instances of the same element are considered. The following code creates three lists with integers 1,2,3,4. In lists 1 and 2 the elements are in different order, and in list 3 one integer occurs twice, which means that none of them compare as equal in `python`:

In [15]:
a = [1, 2, 3, 4]
b = [2, 1, 3, 4]
c = [2, 1, 3, 4, 4]
print(a==b)
print(b==c)

False
False


An important difference between lists and dictionaries is the performance of a lookup operation. A list keeps no index of its elements, so when we check whether a list contains a given element, python internally iterates over the list. Different from a dictionary, the `in` operator on a list is therefore in O(n), where $n$ is the number of list elements.

If we want to store unordered collections of unique elements, we can use the `set` class. This class offers the `add` function to add elements. In a `set`, the same element will not be added twice:

In [16]:
s = set()
s.add(1)
s.add(1)
print(s)

{1}


We can turn any iterable into a set by passing it to the constructor of the `set` class. This allows us to check, e.g. whether multiple lists contain the same elements (irrespective of ordering and not accounting for multiple occurrences of elements). Let us try this in the example of the three lists created above:

In [17]:
print(set(a)==set(b))
print(set(b)==set(c))

True
True
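
Beyond equality tests, sets support the usual operations from set theory through operators. A brief sketch with two illustrative sets `left` and `right`:

```python
left = {1, 2, 3, 4}
right = {3, 4, 5}

union = left | right                  # elements in either set
intersection = left & right           # elements in both sets
difference = left - right             # elements only in left
symmetric_difference = left ^ right   # elements in exactly one set
is_subset = {3, 4} <= left            # subset test

print(union)
print(intersection)
print(difference)
print(symmetric_difference)
print(is_subset)
```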


Finally, python comes with the `tuple` class. Just like lists, tuples are ordered collections where the same element can occur multiple times. But different from lists, python tuples are immutable, i.e. once they are created we cannot change them. We can create a tuple by using the round bracket notation `(elem1, elem2, elem3)`.

In [18]:
t = (1, 2, 3, 4, 4)
print(t)

(1, 2, 3, 4, 4)


Trying to change a tuple will yield an error:

In [19]:
t[2] = 5

TypeError: 'tuple' object does not support item assignment
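
One practical consequence of immutability is that tuples are hashable (as long as all their elements are), so they can serve as dictionary keys or set elements, which lists cannot. The following sketch uses a made-up dictionary `grid` keyed by coordinate pairs:

```python
grid = {}
grid[(0, 0)] = 'start'
grid[(2, 3)] = 'goal'
print(grid[(2, 3)])

# A list cannot be used as a key because it is mutable and therefore unhashable
try:
    grid[[1, 1]] = 'wall'
    list_key_accepted = True
except TypeError as error:
    list_key_accepted = False
    print('TypeError:', error)

# Tuples can be unpacked into individual variables
row, column = (2, 3)
print(row, column)
```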

Considering that they are more restricted than lists, why should we use tuples in the first place? Apart from the fact that we sometimes specifically want our data to be immutable, a simple reason is that tuples are cheaper to create than lists.

We can test this by using a great feature of `jupyter`, so-called magic commands. We can annotate code to be executed by the kernel using special commands that are prefixed by `%`. In general, this works in two modes: If we prefix a line by `%command` (a line magic), that command applies to a single line of code. If we prefix a cell by `%%command` (a cell magic), the command applies to all code in this cell.

There are numerous powerful magic commands and we refer you to the [documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html) for a complete list. One interesting thing that we can do is to profile the execution of a cell by repeatedly executing code and reporting the average execution time. We can use the argument `-n` to indicate how often the statements should be executed. The reported times are averages over multiple repetitions (controlled by the parameter `-r`) of this experiment. This simple approach allows us to quickly get an idea about the performance of `python` code snippets. Let us compare the time needed to create an integer, a tuple and a list:

In [20]:
%%timeit -n 10000000 -r 10

t = 1

13.6 ns ± 2.17 ns per loop (mean ± std. dev. of 10 runs, 10000000 loops each)


In [21]:
%%timeit -n 10000000 -r 10

t = (1, 2, 3, 4, 4)

12.8 ns ± 0.328 ns per loop (mean ± std. dev. of 10 runs, 10000000 loops each)


In [22]:
%%timeit -n 10000000 -r 10

t = [1, 2, 3, 4, 4]

92.5 ns ± 18.2 ns per loop (mean ± std. dev. of 10 runs, 10000000 loops each)


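Creating a list takes about seven times longer than creating a tuple. Moreover, the time needed to create a tuple of integers is about the same as the time needed to create a simple integer variable. One reason is that this tuple consists only of constants, so CPython can build it once when the code is compiled and simply reuse it, whereas a new list object has to be created on every execution.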
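
Outside of a notebook, the standard `timeit` module offers a comparable measurement. The sketch below is only an illustration under assumed settings: it builds the tuple and the list from variables rather than literals, and the values chosen for `number` and `repeat` are arbitrary. The measured times depend on the machine and the python version, so no particular result is implied:

```python
import timeit

# Variables used by the timed statements (defined once, not timed)
setup = 'a, b, c = 1, 2, 3'
number = 100000   # executions per repetition, similar to -n
repeat = 5        # repetitions, similar to -r

tuple_times = timeit.repeat('t = (a, b, c)', setup=setup, number=number, repeat=repeat)
list_times = timeit.repeat('t = [a, b, c]', setup=setup, number=number, repeat=repeat)

# Each entry is the total time in seconds for `number` executions
print('tuple: {0:.1f} ns per loop'.format(min(tuple_times) / number * 1e9))
print('list:  {0:.1f} ns per loop'.format(min(list_times) / number * 1e9))
```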

Above, we mentioned that the `in` operator is O(1) on average for a dictionary, and the same holds for a set, while it is O(n) for a list (a list keeps no hash index of its elements, so it has to be scanned). Let us use the `timeit` cell magic to test this:

In [3]:
x = list(range(10000))
y = set(x)
z =  { k:k for k in x }

import random

In [4]:
%%timeit -n 10000 -r 10
test = random.choice(x) in x

58.3 µs ± 1.11 µs per loop (mean ± std. dev. of 10 runs, 10000 loops each)


In [5]:
%%timeit -n 10000 -r 10
test = random.choice(x) in y

755 ns ± 249 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)


In [6]:
%%timeit -n 10000 -r 10
test = random.choice(x) in z

667 ns ± 132 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)
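
The difference can also be made visible without timing anything, by counting how many equality comparisons a membership test triggers. The class `CountingKey` below is a hypothetical helper written only for this illustration. In CPython, the list has to compare the target with one element after another, whereas the set uses the hash to jump to the right slot and typically needs a single comparison:

```python
class CountingKey:
    """Hypothetical helper: wraps a value and counts equality comparisons."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


n = 1000
keys = [CountingKey(i) for i in range(n)]
key_set = set(keys)

# Equal to the last list element, but a separate object
target = CountingKey(n - 1)

CountingKey.comparisons = 0
found_in_list = target in keys
list_comparisons = CountingKey.comparisons

CountingKey.comparisons = 0
found_in_set = target in key_set
set_comparisons = CountingKey.comparisons

print('list:', found_in_list, list_comparisons, 'comparisons')
print('set: ', found_in_set, set_comparisons, 'comparisons')
```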


## High-performance containers

Speaking of performance of collection types, we conclude this notebook by introducing the [**high-performance containers**](https://docs.python.org/2/library/collections.html) provided in the standard `python` module `collections`. They are crucial if you want to process and analyse large data sets, and considering what we introduced above, it is easy to motivate why we need them.

Something that you will come across frequently when analysing big data is the need to update or add to the value of a given key in a dictionary, e.g. when counting elements. Often, we don't know whether the key already exists, so with the standard `dict` class we have to check this first using an if statement. Consider, e.g., the following code snippet, which tries to count the frequencies of characters in a string:

In [32]:
string = 'Introduction to Data Science'
counts = {}

for c in string:
    counts[c] += 1
print(counts)

KeyError: 'I'

Clearly, we have to initialise the dictionary value of each key with zero whenever we first encounter a character, i.e. we need to write the following code:

In [33]:
string = 'Introduction to Data Science'
counts_dict = {}

for c in string:
    if c not in counts_dict:
        counts_dict[c] = 0
    counts_dict[c] += 1
print(counts_dict)

{'I': 1, 'n': 3, 't': 4, 'r': 1, 'o': 3, 'd': 1, 'u': 1, 'c': 3, 'i': 2, ' ': 3, 'D': 1, 'a': 2, 'S': 1, 'e': 2}


That not only makes the code more difficult to read (especially if you are working with multiple, possibly nested dictionaries), it also introduces overhead due to numerous unnecessary checks whether the key is in the dictionary or not. A smarter way would be to have a dictionary class that uses a given default value to initialise non-existing values.

The `defaultdict` class in the `collections` module does exactly this. We can initialise it by passing a callable without arguments, e.g. a lambda expression, that is called to create the value whenever a non-existing key is accessed. In the simplest case, this lambda expression just returns the initial value. In other cases, we could pass a function or the constructor of a class.

In [34]:
from collections import defaultdict

counts_defaultdict = defaultdict(lambda: 0)

for c in string:
    counts_defaultdict[c] += 1
print(counts_defaultdict)

defaultdict(<function <lambda> at 0x0000021EC8AEFB70>, {'I': 1, 'n': 3, 't': 4, 'r': 1, 'o': 3, 'd': 1, 'u': 1, 'c': 3, 'i': 2, ' ': 3, 'D': 1, 'a': 2, 'S': 1, 'e': 2})
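
Since any callable without arguments works as the default factory, built-in types can be passed directly. The following sketch uses `int` (which returns 0) for counting and `list` (which returns an empty list) for grouping. The list `words` is made-up example data:

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# int() returns 0, so int can be used directly as the default factory
first_letter_counts = defaultdict(int)
for word in words:
    first_letter_counts[word[0]] += 1

# list() returns an empty list, which is handy for grouping
words_by_letter = defaultdict(list)
for word in words:
    words_by_letter[word[0]].append(word)

print(dict(first_letter_counts))
print(dict(words_by_letter))
```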


In fact, for the special purpose of quickly counting frequencies the `collections` module introduces the special class [`Counter`](https://docs.python.org/2/library/collections.html#collections.Counter). We can create an instance of this class and just pass an iterable (e.g. a list, a set, or a string, which is an iterable of characters) as argument:

In [37]:
from collections import Counter

counts = Counter(string)
print(counts)

Counter({'t': 4, 'n': 3, 'o': 3, 'c': 3, ' ': 3, 'i': 2, 'a': 2, 'e': 2, 'I': 1, 'r': 1, 'd': 1, 'u': 1, 'D': 1, 'S': 1})


This is equivalent to writing the following code, i.e. the `Counter` class automatically iterates over any iterable that we pass into the constructor and increments the counter of each value it encounters.

In [38]:
counts_counter = Counter()
for c in string:
    counts_counter[c] += 1
print(counts_counter)

Counter({'t': 4, 'n': 3, 'o': 3, 'c': 3, ' ': 3, 'i': 2, 'a': 2, 'e': 2, 'I': 1, 'r': 1, 'd': 1, 'u': 1, 'D': 1, 'S': 1})


Let's see how much more efficient this is compared to our naive implementation based on the standard `dict` class. We use a longer string to test how the performance differs for larger iterables:

In [39]:
%%timeit -n 10000 -r 10

string = '''Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science
Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science
Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science
Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science
Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science Introduction to Data Science
'''

counts_dict = {}
for c in string:
    if c not in counts_dict:
        counts_dict[c] = 0
    counts_dict[c] += 1

83.1 µs ± 16 µs per loop (mean ± std. dev. of 10 runs, 10000 loops each)


In [40]:
%%timeit -n 10000 -r 10
counts_counter = Counter(string)

4.5 µs ± 853 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)


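On these measurements, the `Counter` version is faster by a factor of roughly 18. This number should be read with care, however: names assigned inside a `%%timeit` cell are not stored in the notebook's namespace, so the long `string` defined inside the first timing cell is not visible to the later cells, and `Counter(string)` processes the short `string` defined earlier. A fair comparison has to run all variants on the same input. Part of the gain nevertheless comes from saving the explicit test whether an element is already in the dictionary. Comparing this to the implementation based on the default dictionary, which also processes the short `string`, shows that there is no significant difference in terms of performance: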

In [41]:
%%timeit -n 10000 -r 10
counts_defaultdict = defaultdict(lambda: 0)

for c in string:
    counts_defaultdict[c] += 1

4.31 µs ± 486 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)
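
Names assigned inside a `%%timeit` cell are not kept in the notebook's namespace, so it is easy to time different variants on different inputs by accident. One way to avoid this, sketched here with the standard `timeit` module, is to wrap each variant in a function and pass the same input to all of them. The function names, the input `text` and the repetition settings are assumptions made for this illustration, and the resulting times depend on the machine:

```python
import timeit
from collections import Counter, defaultdict

# Assumed test input, shared by all variants
text = 'Introduction to Data Science ' * 30


def count_with_dict(s):
    counts = {}
    for c in s:
        if c not in counts:
            counts[c] = 0
        counts[c] += 1
    return counts


def count_with_defaultdict(s):
    counts = defaultdict(int)
    for c in s:
        counts[c] += 1
    return counts


def count_with_counter(s):
    return Counter(s)


# All three variants should agree before we compare their speed
reference = count_with_dict(text)
assert dict(count_with_defaultdict(text)) == reference
assert dict(count_with_counter(text)) == reference

number = 200   # executions per repetition
for func in (count_with_dict, count_with_defaultdict, count_with_counter):
    # best of 5 repetitions, converted to microseconds per call
    seconds = min(timeit.repeat(lambda: func(text), number=number, repeat=5))
    print('{0}: {1:.2f} microseconds per call'.format(func.__name__, seconds / number * 1e6))
```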


However, there is an important difference in terms of how those two classes treat missing data. In the default dictionary, every read access to a key that has not been observed will actually create a new zero value for that key.

In [43]:
print(counts_defaultdict['z'])
print(counts_defaultdict)

0
defaultdict(<function <lambda> at 0x0000021EC8AEFB70>, {'I': 1, 'n': 3, 't': 4, 'r': 1, 'o': 3, 'd': 1, 'u': 1, 'c': 3, 'i': 2, ' ': 3, 'D': 1, 'a': 2, 'S': 1, 'e': 2, 'z': 0})


This can cause problems if you perform many such lookups for a large number of elements, because your default dictionary will fill up with zero values. In other words, the number of keys in the default dictionary grows just by reading the values of elements that are not contained.

The counter class has a different behavior. Here, no zero values will be created if we ask for the count of an element that hasn't been observed.

In [44]:
print(counts_counter['z'])
print(counts_counter)

0
Counter({'t': 4, 'n': 3, 'o': 3, 'c': 3, ' ': 3, 'i': 2, 'a': 2, 'e': 2, 'I': 1, 'r': 1, 'd': 1, 'u': 1, 'D': 1, 'S': 1})


Nevertheless, the counter class will just return the default value of zero for elements that are not in the underlying dictionary. Note that this leads to the (for a dictionary) unintuitive behavior that you can get a (zero) value for keys that are actually not in the dictionary.

In [40]:
print(counts_counter['z'])
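
If we want to stay with a default dictionary but avoid creating keys by accident, we can read values with the `in` operator or the `get` function, neither of which calls the default factory. A small sketch with an illustrative dictionary `letter_counts`:

```python
from collections import defaultdict

letter_counts = defaultdict(int)
for c in 'banana':
    letter_counts[c] += 1

keys_before = len(letter_counts)

# Neither the in operator nor get() calls the default factory
contains_z = 'z' in letter_counts
z_count = letter_counts.get('z', 0)
keys_after_safe_reads = len(letter_counts)
print(contains_z, z_count, keys_after_safe_reads == keys_before)

# Bracket access does call it and inserts the key as a side effect
z_count_bracket = letter_counts['z']
keys_after_bracket = len(letter_counts)
print(z_count_bracket, keys_after_bracket == keys_before + 1)
```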

Finally, the counter class comes with additional functions that are useful in the collection of data for statistical analyses. We can easily check which are the $n$ most common elements:

In [45]:
counts_counter.most_common(5)

[('t', 4), ('n', 3), ('o', 3), ('c', 3), (' ', 3)]

Moreover, we can easily add or subtract multiple counters. Note that the result of these operators only contains elements with a positive count:

In [41]:
counts_2 = Counter('This is another counter')
print(counts_2)

print(counts_counter+counts_2)
print(counts_counter-counts_2)
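
To make the arithmetic easier to follow, here is a smaller sketch with two made-up counters. The `+` and `-` operators drop elements whose resulting count is zero or negative, while the `subtract` function works in place and keeps them:

```python
from collections import Counter

first = Counter('aab')     # a: 2, b: 1
second = Counter('abbc')   # a: 1, b: 2, c: 1

total = first + second
difference = first - second

print(total)
print(difference)

# subtract() modifies the counter in place and keeps zero and negative counts
signed = Counter(first)
signed.subtract(second)
print(signed)
```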