## 8.6 Bags in Python

Class `Counter` in module `collections` is a subclass of `dict`.
It implements a bag of hashable items as a dictionary of object–integer pairs:
each key is an item and the corresponding value is the multiplicity of the item.
The bag operations are written as follows. I'll comment on them further below.

Bag operation | Python
:-|:-
new  |  `bag = Counter()`
size  | not available
membership  |  `item in bag` or `item not in bag`
add  |  `bag[item] = bag[item] + 1`
remove  |  `bag[item] = bag[item] - 1`
intersection  | `bag1 & bag2`
difference  | `bag1 - bag2`
multiplicity | `bag[item]`
inclusion  | not available

Bags also have a union operation, written as `bag1 | bag2`.

A bag can be created from any iterable collection of hashable items, such as a string.

In [1]:
from collections import Counter

SINATRA = 'doo bee doo bee doo'
STING = 'de do do do de da da da'
Counter(SINATRA)

Counter({'d': 3, 'o': 6, ' ': 4, 'b': 2, 'e': 4})

The dictionary representation of the bag clearly shows
the frequency of each character in the string.

The union, intersection and difference operations work as expected,
in case you're interested in comparing profound lyrics.

In [2]:
Counter(SINATRA) & Counter(STING)   # common characters

Counter({'d': 3, 'o': 3, ' ': 4, 'e': 2})
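
The cell above only shows the intersection. As a small sketch of the other two operations on the same lyrics (the names `sinatra_bag`, `sting_bag`, `union` and `difference` are just for illustration), the union keeps the larger multiplicity of each item and the difference subtracts multiplicities:

```python
from collections import Counter

SINATRA = 'doo bee doo bee doo'
STING = 'de do do do de da da da'

sinatra_bag = Counter(SINATRA)
sting_bag = Counter(STING)

# Union keeps, for each character, the larger of the two multiplicities.
union = sinatra_bag | sting_bag
print(union['d'], union['o'], union['a'])   # 8 6 3

# Difference subtracts multiplicities and
# drops the items whose resulting multiplicity isn't positive.
difference = sinatra_bag - sting_bag
print(difference == Counter({'o': 3, 'b': 2, 'e': 2}))  # True
```

Note that 'd' and the space don't occur in the difference, because they occur more often in `STING` than in `SINATRA`.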

Unlike sets, the class doesn't provide an `add` or `discard` method.
Instead, we increment or decrement the counter associated with the item.
The counter starts automatically at zero.

In [3]:
letters = Counter()
for character in SINATRA:
    if character != ' ':
        letters[character] = letters[character] + 1
letters

Counter({'d': 3, 'o': 6, 'b': 2, 'e': 4})

If you replace `Counter` with `dict` in the first line and run the code,
you get a key error when trying to add the first letter ('d') because
a value can't be accessed before the key–value pair has been inserted.
Here's another example that raises a key error with a dictionary but
not with a `Counter` instance.

In [4]:
letters['x']    # multiplicity of letter 'x' in SINATRA

0
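
For comparison, here is a short sketch of what happens with an ordinary dictionary (the name `plain` is only illustrative). The increment fails for a missing key, so a default value has to be supplied explicitly, for example with the `get` method:

```python
plain = {}      # an ordinary dictionary, not a Counter

try:
    plain['d'] = plain['d'] + 1     # reading plain['d'] fails: no such key yet
except KeyError as error:
    print('key error for', error)   # key error for 'd'

# With a plain dictionary, the starting value zero must be given explicitly.
plain['d'] = plain.get('d', 0) + 1
print(plain)                        # {'d': 1}
```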

You can directly set the multiplicity of an item or change it by more than one.

In [5]:
word = Counter('Fahrtreppenbenutzungshinweise')
word['u'] = word['u'] - 2   # remove both u from the bag
word['ä'] = 5               # add some umlauts for no good reason

Because a bag is a dictionary, you can iterate over the keys (unique bag items) or
over the key–value pairs (unique items and their multiplicities).

In [6]:
for (letter, multiplicity) in word.items():
    if letter in 'aeiouäöü':
        print('vowel', letter, 'occurs', multiplicity, 'times')

vowel a occurs 1 times
vowel e occurs 5 times
vowel u occurs 0 times
vowel i occurs 2 times
vowel ä occurs 5 times


Note that setting the multiplicity to zero (as was done for 'u')
doesn't remove the item from the bag.
To remove the item, use the dictionary's `pop` method.

In [7]:
word.pop('e')               # remove all e from the bag
for letter in word:         # iterate over the keys this time
    if letter in 'aeiouäöü':
        print('vowel', letter, 'occurs', word[letter], 'times')

vowel a occurs 1 times
vowel u occurs 0 times
vowel i occurs 2 times
vowel ä occurs 5 times
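
One consequence is that the membership test still succeeds for an item with multiplicity zero. The following sketch shows this, together with two other ways of removing items that `Counter` supports besides `pop`: the `del` statement and the unary plus operator, which creates a new bag with only the items of positive multiplicity. The name `cleaned` is just for illustration.

```python
from collections import Counter

word = Counter('Fahrtreppenbenutzungshinweise')
word['u'] = word['u'] - 2   # multiplicity is now zero, but the key remains
print('u' in word)          # True

del word['e']               # like pop, but doesn't return the multiplicity
print('e' in word)          # False

cleaned = +word             # new bag without zero or negative multiplicities
print('u' in cleaned)       # False
```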


### 8.6.1 Mistakes

It's important to keep in mind that `Counter` is a subclass of `dict`,
not a superclass of `set`, and maybe that's why it's not called `Bag`.
Python's bags don't follow set notation,
don't provide union, intersection and difference methods (only operators),
don't remove items when their multiplicity reaches zero and
don't provide the inclusion operation.
So using a `Counter` instance as if it were a more general set is a mistake.

In [8]:
Counter('aeiou') <= Counter(SINATRA) # is each vowel in the lyrics?

TypeError: '<=' not supported between instances of 'Counter' and 'Counter'
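
Inclusion can still be checked with the operations that are available. The sketch below is one possible approach, not part of the `Counter` class: the hypothetical function `is_included` relies on the difference operator dropping all items that don't occur more often in the first bag, and assumes all multiplicities are natural numbers. To my knowledge, Python 3.10 and later do define `<=` for `Counter` instances, so the error above appears only in earlier versions.

```python
from collections import Counter

SINATRA = 'doo bee doo bee doo'

def is_included(small: Counter, large: Counter) -> bool:
    """Return True if each item occurs in large at least as often as in small.

    Assumes all multiplicities are natural numbers.
    """
    # The difference keeps only the items that small has in excess of large.
    return len(small - large) == 0

print(is_included(Counter('aeiou'), Counter(SINATRA)))  # False
print(is_included(Counter('doe'), Counter(SINATRA)))    # True
```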

It's also a mistake to use the `len` function to get the size of a bag: it
returns the size of the dictionary, which is the number of keys (unique items).
In our example, the size of the bag is the number of characters, i.e.
the length of the string the bag was created from.

In [9]:
print('size of the bag:', len(SINATRA))
print('unique characters:', len(Counter(SINATRA)))

size of the bag: 19
unique characters: 5
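
When the original collection is no longer at hand, the size can be computed from the bag itself by adding up the multiplicities. This is a minimal sketch with a hypothetical helper `bag_size`; it assumes all multiplicities are natural numbers. As far as I know, Python 3.10 and later also provide a `total` method on `Counter` for the same purpose.

```python
from collections import Counter

SINATRA = 'doo bee doo bee doo'

def bag_size(bag: Counter) -> int:
    """Return the number of items in the bag, counting repetitions.

    Assumes all multiplicities are natural numbers.
    """
    return sum(bag.values())

print(bag_size(Counter(SINATRA)))   # 19
print(bag_size(Counter()))          # 0
```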


Even worse, being a dictionary means you can associate any value with each item,
not just a natural number.

In [10]:
bonkers = Counter()
bonkers['hello'] = -5
bonkers['jggdggu'] = None
bonkers['algorithm'] = 'data structures'
bonkers

Counter({'hello': -5, 'jggdggu': None, 'algorithm': 'data structures'})

Now let's take the union with the empty bag,
which should return the same bag as `bonkers`. What can possibly go wrong?

In [11]:
bonkers | Counter()

TypeError: '<' not supported between instances of 'NoneType' and 'int'

The union, intersection and difference operations require
the multiplicities of items to be compared,
which raises an error when a multiplicity, like `None`, can't be compared to an integer.

Making `Counter` a subclass of `dict` views a bag as a map between
items and their multiplicity. Nothing wrong with that.
Actually, it provides a convenient notation to add and remove items. But
inheriting dictionary operations without adapting them to bags isn't so great.

It's always useful to know how to implement our own data types:
the built-in ones may not fully address our needs.

### 8.6.2 Using bags

Sets and bags are useful to solve problems where the order of the items doesn't matter.
If items can't occur more than once, then we should use a set, otherwise a bag.
Many problems don't mention sets or bags explicitly.
It's up to us to see that sets or bags and their operations can be of use.
Here's an example of such a problem.

Given two sequences of items, what is
the smallest number of items to be removed from both sequences
so that one becomes a permutation, i.e. rearrangement, of the other?
For example, if the sequences are _left_ = (1, 2, 3, 2) and
_right_ = (3, 2, 2, 5) then only two items must be removed:
number 1 from _left_ and number 5 from _right_.
If _left_ = ('a', 'man', 'a', 'plan') and _right_ = ('a', 'canal') then
four strings must be removed (three from _left_, one from _right_)
for both sequences to become ('a').

#### Exercise 8.6.1

Write further problem instances to test your algorithm later.

Case | _left_ | _right_ | Deletions
:-|:-|:-|:-
some overlap  | (1, 2, 3, 2)  | (3, 2, 2, 5)  |  2
one common  | ('a', 'man', 'a', 'plan')  | ('a', 'canal')  |  4

[Hint](../31_Hints/Hints_08_6_01.ipynb)
[Answer](../32_Answers/Answers_08_6_01.ipynb)

#### Exercise 8.6.2

Describe an algorithm to compute the number of deletions.

_Write your answer here._

[Hint](../31_Hints/Hints_08_6_02.ipynb)
[Answer](../32_Answers/Answers_08_6_02.ipynb)

#### Exercise 8.6.3

What's the complexity of the algorithm?

_Write your answer here._

[Hint](../31_Hints/Hints_08_6_03.ipynb)
[Answer](../32_Answers/Answers_08_6_03.ipynb)

#### Exercise 8.6.4

Complete the function and tests below and run the cell.
They use lists as the input sequences.
Feel free to use your own bag implementation instead of `Counter`.

In [12]:
from collections import Counter
%run -i ../m269_util

def deletions(left: list, right: list) -> int:
    """Return how many deletions make the lists have the same items.

    Postconditions:
    - the output is the smallest number of deletions necessary
    """
    pass

deletions_tests = [
    # case,             left,                   right,  deletions
    ('some overlap',    [1, 2, 3, 2],           [3, 2, 2, 5],   2),
    ('one common',      ['a','man','a','plan'], ['a','canal'],  4),
    # new tests
]
test(deletions, deletions_tests)

[Hint](../31_Hints/Hints_08_6_04.ipynb)
[Answer](../32_Answers/Answers_08_6_04.ipynb)

#### Optional exercises

The Kattis Guide lists further [problems on bags](https://mwermelinger.github.io/kattis-guide/unordered.html#bags).
