Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

[Display on nbviewer](http://nbviewer.jupyter.org/github/pnavaro/big-data/blob/master/01.MapReduce.ipynb)

Some recommendations:
- *Please don't google the solution; use the Python documentation through the `help` function.*
- *Do not try to find a clever or optimized solution; do something that works.*
- *Please don't get the solution from your friends.*
- *Notebooks will be updated next week with solutions.*

# Data processing through MapReduce

![MapReduce](http://mm-tom.s3.amazonaws.com/blog/MapReduce.png)


## `map` function example

The `map(func, seq)` Python function applies the function `func` to every element of the sequence `seq`. In Python 3 it returns an iterator that lazily yields the transformed elements, not a new list.

In [16]:
def func(x):
    return x + 1

res = map(func, [2, 6, -3, 7])
res  # res is an iterator

<map at 0x10cf1e780>

In [17]:
print(*res)

3 7 -2 8
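
Because `map` returns an iterator, its values can be consumed only once. The following minimal sketch illustrates this; the names `add_one`, `first_pass` and `second_pass` are illustrative and not part of the notebook.

```python
def add_one(x):
    return x + 1

res = map(add_one, [2, 6, -3, 7])

first_pass = list(res)   # consumes the iterator
second_pass = list(res)  # the iterator is now exhausted, so this list is empty

print(first_pass)
print(second_pass)
```

If the results are needed more than once, storing them in a list (as `first_pass` does here) is one simple option.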


## `functools.reduce` example

The function `reduce(func, seq)` applies the two-argument function `func` cumulatively to the elements of the sequence `seq`, from left to right, and returns a single value. For example, `reduce(f, [1, 2, 3, 4, 5])` calculates `f(f(f(f(1, 2), 3), 4), 5)`.

In [18]:
def g(x,y):
    return x + y

from functools import reduce
reduce(g, [1, 2, 3, 4, 5]) # computes ((((1+2)+3)+4)+5). 

15
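
To make the accumulation explicit, here is a small sketch comparing `reduce` with an equivalent hand-written loop. It also shows the optional third argument of `functools.reduce`, an initial value. The names `add`, `values` and `acc` are chosen for illustration only.

```python
from functools import reduce

def add(x, y):
    return x + y

values = [1, 2, 3, 4, 5]

# Hand-written loop equivalent to reduce(add, values)
acc = values[0]
for v in values[1:]:
    acc = add(acc, v)

total = reduce(add, values)

# The optional third argument is an initial value placed before the sequence;
# it is also what reduce returns when the sequence is empty.
total_from_100 = reduce(add, values, 100)
empty_total = reduce(add, [], 0)

print(acc, total, total_from_100, empty_total)
```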

## Vector norm
We want to compute the vector norm $|v| = \sqrt{\sum_i v_i^2}$ with a Map-Reduce process:
- use the `map` function
- use the `reduce` function from `functools`

### Exercise 1.1
Write these two functions and compute the norm of the vector represented by the Python list `V` below.

In [19]:
V = [4,1,2,3]

## Wordcount Example

[WordCount](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0) is a simple application that counts the number of occurrences of each word in a given input set.

Each mapper takes a line from the input text files and breaks it into words. It then emits a key/value pair made of the word and 1 (separated by a tab). Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
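
The data flow described above can be sketched in plain Python on a few made-up lines. This is only an illustration of the tab-separated record format, not Hadoop code: the `mapper` and `reducer` functions and the `lines` input are assumptions, and a simple `sorted` call stands in for the shuffle/sort phase that brings identical keys together.

```python
from itertools import groupby

lines = ["the quick brown fox", "the lazy dog", "the fox"]  # made-up input

def mapper(line):
    # Emit one "word<TAB>1" record per word of the line
    return [f"{word}\t1" for word in line.split()]

def reducer(records):
    # Sum the counts of records sharing the same word.
    # Records must be sorted so that equal words are adjacent.
    parsed = (record.split("\t") for record in records)
    return [
        f"{word}\t{sum(int(count) for _, count in group)}"
        for word, group in groupby(parsed, key=lambda kv: kv[0])
    ]

mapped = [record for line in lines for record in mapper(line)]
shuffled = sorted(mapped)  # stands in for the shuffle/sort phase
counts = reducer(shuffled)

for record in counts:
    print(record)
```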

In [20]:
from lorem import text
t = text()

with open("sample.txt", "w") as sample:
    sample.write(t)

print(t[:150]) # print only the first 150 characters

Quaerat quaerat labore voluptatem. Est quaerat labore dolor adipisci amet ipsum. Porro dolore voluptatem aliquam amet dolor tempora. Consectetur eius 


### Exercise 1.2
Write a Python program that counts the number of words in that file.


## Map - Read file and return key/value pairs

### Exercise 1.3

Write a function `words` that takes a file name as input and returns a sorted sequence of (word, 1) tuples.

Hints: the `str.lower`, `str.maketrans` and `str.translate` methods can help to normalize case and remove punctuation (`string.punctuation`).
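
As a hedged illustration of these hints on a single string (not on a file), the three-argument form of `str.maketrans` builds a table that deletes every character listed in its third argument. The sample `sentence` is made up for this sketch.

```python
import string

# Translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)

sentence = "Quaerat quaerat labore voluptatem. Est quaerat labore dolor!"  # made-up sample
cleaned = sentence.lower().translate(table)
tokens = cleaned.split()

print(tokens)
```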

## Reduce

### Exercise 1.4

Write the function `reduce` that reads the results of `words`, sums the occurrences of each word to a final count, and outputs the results
as a list of (word, occurrences) tuples. Two steps:
- Group the (word, 1) pairs into a dictionary such as
```python
{word1 : [1, 1], word2 : [1, 1, 1], word3 : [1] }
```
- The reduce operation prints out each word and its number of occurrences.

### Exercise 1.5

Use the `words` and `reduce` functions to return the list of words in `sample.txt` with their number of occurrences.

Each item of this list is a tuple. Sort this list with `operator.itemgetter(1)` to use the second element of this tuple as key for the `sorted` function.
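
For illustration, here is how `operator.itemgetter(1)` behaves as a sort key on a small made-up list of (word, count) tuples. The data and variable names are assumptions for this sketch only.

```python
from operator import itemgetter

counts = [("dolor", 3), ("amet", 5), ("ipsum", 1)]  # made-up data

# itemgetter(1) returns a function that extracts the element at index 1
by_count = sorted(counts, key=itemgetter(1))

# reverse=True puts the largest counts first
most_common_first = sorted(counts, key=itemgetter(1), reverse=True)

print(by_count)
print(most_common_first)
```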

### Exercise 1.6

Create 8 files `sample[0-7].txt` and use the functions implemented above to
count (word, occurrences). Put the most common words at the top of the output list.