Pierre Navaro - [Institut de Recherche Mathématique de Rennes](https://irmar.univ-rennes1.fr) - [CNRS](http://www.cnrs.fr/)

[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](http://nbviewer.jupyter.org/github/pnavaro/big-data/blob/master/01.MapReduce.ipynb)

Some recommendations:
- *Please don't search the web for the solution; use the Python documentation through the `help` function.*
- *Do not try to find a clever or optimized solution, do something that works.*
- *Please don't get solutions from your friends.*
- *Notebooks will be updated next week with solutions*

# Data processing through MapReduce

![MapReduce](http://mm-tom.s3.amazonaws.com/blog/MapReduce.png)


## `map` function example

The `map(func, seq)` Python function applies the function `func` to every element of the sequence `seq`. In Python 3 it returns an iterator that yields the transformed elements, not a list.

In [1]:
def f(x):
    return x + 1

res = map(f, [2, 6, -3, 7])
res  # res is a lazy iterator, not a list

<map at 0x7fd24465b828>

In [2]:
print(*res)

3 7 -2 8
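
Because `map` returns an iterator, its values can be consumed only once. The short sketch below illustrates this with the same function `f`; the names `first_pass` and `second_pass` are only illustrative.

```python
def f(x):
    return x + 1

res = map(f, [2, 6, -3, 7])

first_pass = list(res)    # consumes the iterator
second_pass = list(res)   # the iterator is now exhausted

print(first_pass)   # [3, 7, -2, 8]
print(second_pass)  # []
```

Convert the result to a list (or call `map` again) if the values are needed more than once.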


## `functools.reduce` example

The function `reduce(func, seq)` repeatedly applies the two-argument function `func` to the elements of the sequence `seq` and returns a single value. For example, `reduce(f, [1, 2, 3, 4, 5])` calculates `f(f(f(f(1, 2), 3), 4), 5)`.

In [3]:
def g(x,y):
    return x + y

from functools import reduce
reduce(g, [1, 2, 3, 4, 5]) # computes ((((1+2)+3)+4)+5). 

15

## Vector norm
We want to compute the vector norm $|v| = \sqrt{\sum_i v_i^2}$ with a Map-Reduce process:
- use the `map` function
- use the `reduce` function from `functools`

### Exercise 1.1
Write these two functions and compute the norm of the vector represented by the Python list `V` below.

In [4]:
V = [4,1,2,3]


In [5]:
from operator import add
from functools import reduce
from math import sqrt

f = lambda x: x*x   # function applied to each element
L = map(f, V)       # map returns an iterator
s = reduce(add, L)  # reduce computes the sum
sqrt(s) == sqrt(sum(map(f, V)))  # check the result

True

## Wordcount Example

[WordCount](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0) is a simple application that counts the number of occurrences of each word in a given input set.

Each mapper takes a line of a text file as input and breaks it into words. It then emits a key/value pair made of the word and 1 (separated by a tab). Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
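
The description above can be sketched in plain Python. This is a minimal illustration, not the Hadoop implementation: the function names `mapper` and `reducer` and the two input lines are assumptions made for the example, and the sort stands in for the shuffle step that brings identical keys together.

```python
from itertools import groupby

def mapper(line):
    """Emit one 'word<TAB>1' record per word of the line."""
    return [f"{word}\t1" for word in line.split()]

def reducer(records):
    """Sum the counts of consecutive records that share the same key.

    The records must already be sorted by key.
    """
    pairs = (record.split("\t") for record in records)
    return [(word, sum(int(count) for _, count in group))
            for word, group in groupby(pairs, key=lambda pair: pair[0])]

lines = ["to be or not", "to be"]                      # hypothetical input
records = sorted(r for line in lines for r in mapper(line))
counts = reducer(records)
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```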

In [8]:
from lorem import text
t = text()

with open("sample.txt", "w") as sample:
    sample.write(t)

print(t[:150]) # print only the first 150 characters

Ut sed adipisci sed labore. Etincidunt dolor sed dolore magnam. Quisquam modi adipisci velit. Aliquam ipsum aliquam etincidunt etincidunt ut dolor. Qu


### Exercise 1.2
Write a Python program that counts the number of words in that file.


In [9]:
import re
splitter = re.compile(r'\w+') # raw-string regular expression matching each word
with open('sample.txt', 'r') as f:
    data = f.read()
result = len(splitter.findall(data))
result

298

## Map - Read file and return key/value pairs

### Exercise 1.3

Write a function `words` that takes a file name as input and returns a sorted sequence of `(word, 1)` tuples.

Hints: the `str.lower`, `str.maketrans` and `str.translate` methods can help to remove punctuation (`string.punctuation`).

In [10]:
import string

def words(file):
    """
    Read a text file and return a sorted list of (word, 1) values.
    """
    translator = str.maketrans('', '', string.punctuation)
    output = []
    with open(file) as f:
        for line in f:
            line = line.strip()
            line = line.translate(translator)
            for word in line.split():
                word = word.lower()
                output.append((word, 1))
    output.sort()
    return output

## Reduce

### Exercise 1.4

Write the function `reduce` that reads the results of `words`, sums the occurrences of each word to a final count, and then outputs the results
as a list of (word, occurrences). Two steps:
- Group (word, 1) pairs into a dictionary as
```python
{word1 : [1, 1], word2 : [1, 1, 1], word3 : [1] }
```
- Reduce operation outputs each word with its number of occurrences.

In [11]:
import operator
# Note: this name shadows functools.reduce imported earlier in the notebook
def reduce(words):
    """ Read the sorted list from map and return every word with
    its number of occurrences, most frequent first"""
    d = {}
    for w in words:
        try:
            d[w[0]] +=1
        except KeyError:
            d[w[0]] = 1 
    
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)
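
The solution above counts directly while iterating. The two steps listed in the exercise (group, then reduce) can also be written separately, as in the sketch below. The names `group` and `reduce_groups` and the sample pairs are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

def group(pairs):
    """Group (word, 1) pairs into a dictionary {word: [1, 1, ...]}."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(groups)

def reduce_groups(groups):
    """Sum each list of ones and sort with the most frequent word first."""
    counts = [(word, sum(ones)) for word, ones in groups.items()]
    return sorted(counts, key=lambda item: item[1], reverse=True)

pairs = [("a", 1), ("b", 1), ("b", 1), ("c", 1), ("b", 1), ("a", 1)]
groups = group(pairs)
print(groups)                 # {'a': [1, 1], 'b': [1, 1, 1], 'c': [1]}
print(reduce_groups(groups))  # [('b', 3), ('a', 2), ('c', 1)]
```

Keeping the grouping step separate mirrors the shuffle phase of MapReduce, where all values for one key are collected before the reducer runs.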

### Exercise 1.5

Use the `words` and `reduce` functions to return the list of words in `sample.txt` with their number of occurrences.

Each item of this list is a tuple. Sort this list with `operator.itemgetter(1)` to use the second element of this tuple as key for the `sorted` function.

In [12]:

reduce(words('sample.txt'))


[('etincidunt', 18),
 ('amet', 15),
 ('numquam', 15),
 ('sed', 15),
 ('adipisci', 14),
 ('ut', 14),
 ('est', 13),
 ('labore', 13),
 ('modi', 13),
 ('quiquia', 13),
 ('porro', 12),
 ('consectetur', 11),
 ('dolor', 11),
 ('ipsum', 11),
 ('dolorem', 10),
 ('magnam', 10),
 ('non', 10),
 ('sit', 10),
 ('voluptatem', 10),
 ('quaerat', 9),
 ('aliquam', 8),
 ('dolore', 8),
 ('neque', 8),
 ('eius', 7),
 ('quisquam', 7),
 ('tempora', 7),
 ('velit', 6)]

### Exercise 1.6

Create 8 files `sample0[0-7].txt` and use the functions implemented above to
count (word, occurrences). Put the most common words at the top of the output list.



In [13]:
from lorem import text
for i in range(8):
    with open("sample{0:02d}.txt".format(i), "w") as f:
        f.write(text())

### Solution 1 with a for loop

In [14]:
import glob
import collections
files = glob.glob('sample0*.txt')

mapped_values = []
for file in files:  # loop over files
    mapped_values += words(file)

reduce(mapped_values)

[('non', 68),
 ('adipisci', 64),
 ('consectetur', 64),
 ('dolorem', 64),
 ('quiquia', 62),
 ('dolore', 58),
 ('sit', 56),
 ('velit', 56),
 ('quaerat', 54),
 ('dolor', 53),
 ('modi', 52),
 ('numquam', 52),
 ('amet', 51),
 ('etincidunt', 51),
 ('quisquam', 51),
 ('sed', 51),
 ('neque', 50),
 ('porro', 50),
 ('ipsum', 49),
 ('magnam', 49),
 ('eius', 48),
 ('ut', 47),
 ('tempora', 45),
 ('aliquam', 44),
 ('voluptatem', 44),
 ('labore', 42),
 ('est', 34)]

### Solution 2 with `map` function
`itertools.chain(*mapped_values)` is used for treating consecutive sequences as a single sequence. It makes an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted.
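
As a small illustration of this flattening, the sketch below chains three hypothetical lists of `(word, 1)` pairs. It also shows `itertools.chain.from_iterable`, which gives the same result without unpacking the outer iterable with `*`.

```python
import itertools

# Hypothetical per-file results, shaped like the output of `words`
nested = [[("a", 1), ("b", 1)], [("a", 1)], []]

flat = list(itertools.chain(*nested))
same = list(itertools.chain.from_iterable(nested))

print(flat)          # [('a', 1), ('b', 1), ('a', 1)]
print(flat == same)  # True
```

Note that `chain(*mapped_values)` has to unpack the whole `map` iterator first, so every file is read before chaining starts, whereas `chain.from_iterable` pulls the inner iterables one at a time.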

In [15]:
import glob
import collections
import itertools
files = glob.glob('sample0*.txt')

mapped_values = map(words, files)

reduce(itertools.chain(*mapped_values))

[('non', 68),
 ('adipisci', 64),
 ('consectetur', 64),
 ('dolorem', 64),
 ('quiquia', 62),
 ('dolore', 58),
 ('sit', 56),
 ('velit', 56),
 ('quaerat', 54),
 ('dolor', 53),
 ('modi', 52),
 ('numquam', 52),
 ('amet', 51),
 ('etincidunt', 51),
 ('quisquam', 51),
 ('sed', 51),
 ('neque', 50),
 ('porro', 50),
 ('ipsum', 49),
 ('magnam', 49),
 ('eius', 48),
 ('ut', 47),
 ('tempora', 45),
 ('aliquam', 44),
 ('voluptatem', 44),
 ('labore', 42),
 ('est', 34)]

In [16]:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(int)
for k, v in s:
    d[k] += v

sorted(d.items())

[('blue', 6), ('red', 1), ('yellow', 4)]
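
With `defaultdict(int)`, a missing key starts at 0, so the `try`/`except KeyError` block used in `reduce` is no longer needed. The sketch below applies this idea to word counting; the name `reduce_defaultdict` and the sample pairs are assumptions made for the example, not part of the exercises.

```python
from collections import defaultdict
import operator

def reduce_defaultdict(pairs):
    """Sum the values of (word, count) pairs, most frequent word first."""
    counts = defaultdict(int)   # missing keys start at 0
    for word, count in pairs:
        counts[word] += count
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)

pairs = [("sed", 1), ("ut", 1), ("sed", 1), ("sed", 1), ("ut", 1), ("amet", 1)]
result = reduce_defaultdict(pairs)
print(result)  # [('sed', 3), ('ut', 2), ('amet', 1)]
```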