Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

https://github.com/pnavaro/big-data/blob/master/02.Containers.ipynb

- All approaches in notebook 01 load all the data into memory. A very large file might fill up memory. 
- Counting the words in one line is independent of counting them in any other line.
- We can process each piece of data and free its memory immediately afterwards.
- Data chunks should be small enough not to stress memory, but big enough to use the CPU efficiently.
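
The points above can be sketched as follows. This is a minimal illustration, not the method of notebook 01: the function name `count_words_by_line` and the sample text are assumptions, and `io.StringIO` stands in for a real file so the block runs on its own.

```python
import io

def count_words_by_line(lines):
    """Accumulate word counts one line at a time.

    `lines` is any iterable of strings, for example an open file object,
    which yields lines lazily instead of loading the whole file.
    """
    totals = {}
    for line in lines:
        # Each line is handled independently; only `totals` is kept in memory.
        for word in line.lower().split():
            totals[word] = totals.get(word, 0) + 1
    return totals

# Hypothetical content standing in for a large file.
sample = io.StringIO("the quick brown fox\njumps over the lazy dog\nthe end\n")
totals = count_words_by_line(sample)
print(totals["the"])  # 3
```

Passing an open file object instead of `sample` would work the same way, since iterating over a file yields one line at a time.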

In this notebook we will see how to divide the load between different processes.

# Container datatypes

The `collections` module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers, `dict`, `list`, `set`, and `tuple`.

- `namedtuple`: factory function for creating tuple subclasses with named fields
- `deque`: list-like container with fast appends and pops on either end
- `ChainMap`: dict-like class for creating a single view of multiple mappings
- `Counter`: dict subclass for counting hashable objects
- `defaultdict`: dict subclass that calls a factory function to supply missing values
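
The first three containers in the list are not used again in this notebook, so here is a brief sketch of each. The names `Point`, `queue`, `defaults`, `overrides` and `settings` are illustrative only.

```python
from collections import ChainMap, deque, namedtuple

# namedtuple: a tuple subclass whose fields can be read by name.
Point = namedtuple("Point", ["x", "y"])
p = Point(x=1, y=2)
print(p.x + p.y)  # 3

# deque: appends and pops are cheap at both ends.
queue = deque([1, 2, 3])
queue.appendleft(0)
queue.append(4)
print(queue.popleft(), queue.pop())  # 0 4

# ChainMap: lookups search the mappings in order, the first match wins.
defaults = {"color": "red", "size": "small"}
overrides = {"color": "blue"}
settings = ChainMap(overrides, defaults)
print(settings["color"], settings["size"])  # blue small
```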


## Counter

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts.
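
As a small sketch of the zero and negative counts mentioned above (the `stock` data is made up for illustration):

```python
from collections import Counter

stock = Counter(apples=3, pears=1)

# subtract() keeps entries whose count drops to zero or below.
stock.subtract({"apples": 3, "pears": 2})
print(stock["apples"], stock["pears"])  # 0 -1

# A key that was never counted reads as 0 instead of raising KeyError.
print(stock["kiwi"])  # 0

# Unary plus returns a new Counter without the zero and negative entries.
print(+stock)  # Counter()
```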

Elements are counted from an iterable or initialized from another mapping (or counter):

In [1]:
from collections import Counter

violet = dict(r=23, g=13, b=23)
cnt = Counter(violet)  # or Counter(r=23, g=13, b=23)
print(cnt['c'])
print(cnt['r'])

0
23


In [2]:
print(*cnt.elements())

b b b b b b b b b b b b b b b b b b b b b b b g g g g g g g g g g g g g r r r r r r r r r r r r r r r r r r r r r r r


In [3]:
cnt.most_common(2)

[('b', 23), ('r', 23)]

In [4]:
cnt.values()

dict_values([23, 13, 23])

### Exercise 2.1

Use a `Counter` object to count word occurrences in `text` produced by the `lorem` module.

The Counter class is similar to bags or multisets in some Python libraries or other languages. We will see later how to use Counter-like objects in a parallel context. 
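
One reason `Counter` suits a parallel setting is that two counters can be added key by key. The sketch below assumes three partial results that separate workers might have produced; the data is hypothetical and no actual parallelism is involved.

```python
from collections import Counter
from functools import reduce
from operator import add

# Hypothetical partial counts, one per worker.
partial_counts = [
    Counter(a=2, b=1),
    Counter(b=1, c=1),
    Counter(a=2, c=1),
]

# Counter addition sums the counts of matching keys,
# so partial results can be folded together with `+`.
total = reduce(add, partial_counts, Counter())
print(total.most_common(1))  # [('a', 4)]
```

Note that `+` on counters discards entries whose resulting count is zero or negative, which does not matter here because all counts are positive.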

## Partition data

In order to parallelize the **reduce** operation,
the data must first be grouped by key in a container. For this we will use
the `dict` subclass `defaultdict`.

![domain decomposition](https://computing.llnl.gov/tutorials/parallel_comp/images/domain_decomp.gif)

## defaultdict

A `defaultdict` is a `dict` subclass that calls a factory function to supply missing values.
Using `list` as the `default_factory`, it is easy to group a sequence of key-value pairs into a dictionary of lists:





In [5]:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

sorted(d.items())

[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
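
The factory can be any callable that takes no arguments. As a variation on the example above, the sketch below uses `set` to drop duplicate values, and also shows that merely reading a missing key stores a new default. The data and the names `pairs` and `groups` are illustrative.

```python
from collections import defaultdict

pairs = [("yellow", 1), ("blue", 2), ("yellow", 1), ("blue", 4)]
groups = defaultdict(set)
for key, value in pairs:
    groups[key].add(value)  # a missing key starts as an empty set

print(sorted(groups["yellow"]))  # [1]
print(sorted(groups["blue"]))    # [2, 4]

# Reading a missing key calls the factory and inserts the result.
print("green" in groups)  # False
groups["green"]
print("green" in groups)  # True
```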

### Exercise 2.2

- Change the `default_factory` to `int` in the example above.
- Every value in the dictionary will then be an integer instead of a list, so you must replace `append` with a suitable operator.
- Use the `defaultdict` to count words in a text created by the `lorem` module.


### Exercise 2.3

Create a function named `partition` that stores the key/value pairs from `words` (function created in notebook 01) into a `defaultdict` from the `collections` module. The output will be:
```python
[('word1', [1, 1]), ('word2', [1]), ('word3', [1, 1, 1])]
```

### Exercise 2.4
- [itertools.chain(*mapped_values)](https://docs.python.org/3.6/library/itertools.html#itertools.chain) can be used to treat consecutive sequences as a single sequence.
- [operator](https://docs.python.org/3.6/library/operator.html)`.itemgetter(1)` returns a callable object that fetches item 1 from its operand using the operand’s `__getitem__()` method. It can be used as a key to sort results.
```python
>>> import itertools, operator
>>> fruits = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
>>> vegetables = [('endive', 2), ('spinach', 1), ('celery', 5), ('carrot', 4)]
>>> getcount = operator.itemgetter(1)
>>> print(list(map(getcount, itertools.chain(fruits,vegetables) )))
[3, 2, 5, 1, 2, 1, 5, 4]
>>> print(sorted(itertools.chain(fruits,vegetables), key=getcount))
[('orange', 1), ('spinach', 1), ('banana', 2), ('endive', 2), ('apple', 3), ('carrot', 4), ('pear', 5), ('celery', 5)]
```
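
The example above chains two named lists. The `itertools.chain(*mapped_values)` form mentioned earlier unpacks a list of lists instead. In the sketch below, `mapped_values` is a hypothetical stand-in for the per-file output of a map step, not the actual data from notebook 01.

```python
import itertools

# Hypothetical map output: one list of (word, 1) pairs per file.
mapped_values = [
    [("sit", 1), ("amet", 1)],
    [("sit", 1)],
    [("dolor", 1), ("sit", 1)],
]

# Unpacking passes each inner list to chain as a separate argument.
flat = list(itertools.chain(*mapped_values))
print(len(flat))  # 5

# chain.from_iterable gives the same sequence without unpacking the outer list.
same = list(itertools.chain.from_iterable(mapped_values))
print(flat == same)  # True
```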

Write a program with the map, partition and reduce steps that computes
the list of words, with their number of occurrences, in the files sample[0-7].txt
created in notebook 01. Example of output:
```python
[('aliquam', 17),('voluptatem', 15),('tempora', 14),('sit', 13),
 ('quisquam', 13), ('non', 13),('eius', 13),('quiquia', 12), ('magnam', 12)]
```