## 5. MapReduce

#### Big Data

For example, Google is trying to catalog and index all the books in the world. That is simply too much data to handle on one disk, and this is the kind of problem MapReduce is meant for.

#### Other examples of MapReduce Applications
- Discover new oil reserves
- Power an e-commerce website
- Identify malware and cyber attack patterns for online security
- Help doctors answer questions about patients' health

#### Basics of MapReduce

- Parallel programming model for processing large datasets across a cluster of computers
- The high-level computation is performed by two functions: a mapper and a reducer
- Google catalog & index example:
  - Step 1: Treating each book as a separate document, we send the documents to many mappers. Each mapper performs the same mapping on its own documents and produces a series of intermediate key-value pairs
  - Step 2: We shuffle these intermediate results and send all key-value pairs with the same key to the same reducer for processing
  - Step 3: Each reducer produces one final value for each key
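
The three steps above can be sketched in a single process. This is only an in-memory illustration, not how a cluster framework is implemented; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_mapper`, `sum_reducer`) and the two sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Step 1: apply the same mapper to every document and collect the pairs."""
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs


def shuffle_phase(pairs):
    """Step 2: group all values that share a key, as the shuffle would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Step 3: produce one final value per key."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(doc):
    # Emit (word, 1) for every whitespace-separated word
    return [(word.lower(), 1) for word in doc.split()]


def sum_reducer(key, values):
    return sum(values)


documents = ["the cat sat", "the dog sat"]  # hypothetical sample input
result = reduce_phase(shuffle_phase(map_phase(documents, word_mapper)), sum_reducer)
print(result)
```

On a real cluster the mappers and reducers run on different machines, and the shuffle moves data between them over the network.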

### Basic MapReduce - Word Count

"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do..."

For this exercise, write a program that serially counts the number of occurrences of each word in the book Alice in Wonderland. The text of Alice in Wonderland will be fed into your program line-by-line. Your program needs to take each line and do the following:
- Tokenize the line into string tokens by whitespace
  - Example: "Hello, World!" should be converted into "Hello," and "World!"
- Remove all punctuation
  - Example: "Hello," and "World!" should be converted into "Hello" and "World"
- Make all letters lowercase
  - Example: "Hello" and "World" should be converted to "hello" and "world"
    
- Store the number of times that each word appears in Alice in Wonderland in the word_counts dictionary, and then *print* (don't return) that dictionary
- In this exercise, print statements will be considered your final output. Because of this, printing a debug statement will cause the grader to break. Instead, you can use the logging module which we've configured for you. For example: logging.info("My debugging message")
- The logging module can be used to give you more control over your debugging or other messages than you can get by printing them. Messages logged via the logger we configured will be saved to a file. If you click "Test Run", then you will see the contents of that file once your program has finished running.
- The logging module also has other capabilities; see https://docs.python.org/2/library/logging.html for more information.
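
As a rough sketch of the logging advice above: the grader's own logger configuration is not shown in these notes, so the `basicConfig` call below is an assumption that sends messages to stderr instead of a file. The helper `count_tokens` is hypothetical.

```python
import logging
import sys

# Assumption: the course environment configures logging for you.
# Locally, send log messages to stderr so that stdout carries only real output.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(levelname)s: %(message)s",
)


def count_tokens(line):
    """Return the number of whitespace-separated tokens in a line."""
    tokens = line.split()
    # Debug message goes through logging, not print
    logging.info("read %d tokens", len(tokens))
    return len(tokens)


print(count_tokens("Hello, World!"))  # only this value is written to stdout
```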

In [71]:
import string

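def word_count(line):
    word_counts = {}  # maps each cleaned word to its number of occurrences
    # Translation table that deletes every punctuation character
    remove_punctuation = str.maketrans('', '', string.punctuation)
    for token in line.strip().split():  # tokenize on any whitespace
        key = token.translate(remove_punctuation).lower()
        if not key:  # skip tokens that consisted only of punctuation
            continue
        if key in word_counts:
            word_counts[key] += 1
        else:
            word_counts[key] = 1
    print(word_counts)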

word_lst =\
"Alice was beginning to get very tired of sitting by her sister \
on the bank, and of having nothing to do..."

word_count(word_lst)

{'and': 1, 'get': 1, 'by': 1, 'beginning': 1, 'her': 1, 'alice': 1, 'the': 1, 'on': 1, 'tired': 1, 'very': 1, 'sitting': 1, 'to': 2, 'nothing': 1, 'of': 2, 'having': 1, 'bank': 1, 'sister': 1, 'do': 1, 'was': 1}


#### Mapper

In [74]:
import sys
import string

def mapper():
    for line in sys.stdin:
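        data = line.strip().split()  # tokenize on any whitespace
        for i in data:
            # Strip punctuation and lowercase the token
            cleaned_data = i.translate(
                str.maketrans('', '', string.punctuation)).lower()
            if cleaned_data:  # skip tokens that were only punctuation
                # Emit a tab-separated key-value pair: <word>\t1
                print("{}\t{}".format(cleaned_data, 1))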

e.g. "Hello, my name is Dave. Dave is my name."

Running the example above through the mapper gives:
(hello, 1) (my, 1) (name, 1) (is, 1) (dave, 1) (dave, 1) (is, 1) (my, 1) (name, 1)

The key-value pairs emitted by the mapper are then passed on to multiple reducers.
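
A minimal sketch of that hand-off, assuming pairs are routed to reducers by hashing the key. The helpers `partition` and `shuffle` are hypothetical names, and `zlib.crc32` is used here only because it is deterministic across runs; an actual framework uses its own partitioner.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(lines, num_reducers):
    """Route each 'key<TAB>value' line to one bucket per reducer, sorted by key."""
    buckets = [[] for _ in range(num_reducers)]
    for line in lines:
        key, _ = line.split("\t")
        buckets[partition(key, num_reducers)].append(line)
    # Each reducer expects its own input sorted by key
    return [sorted(bucket) for bucket in buckets]


# Mapper output for: "Hello, my name is Dave. Dave is my name."
mapper_output = [
    "hello\t1", "my\t1", "name\t1", "is\t1", "dave\t1",
    "dave\t1", "is\t1", "my\t1", "name\t1",
]

for index, bucket in enumerate(shuffle(mapper_output, 3)):
    print(index, bucket)
```

Whichever bucket a key lands in, every pair carrying that key lands in the same one, which is what lets a single reducer compute the complete total for that key.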

#### Reducer

In [3]:
import sys

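def reducer():
    word_count = 0   # running total for the current key
    old_key = None   # key seen on the previous line

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # skip malformed lines
        this_key, count = data
        # Input arrives sorted by key, so a change of key means the
        # previous key is finished: emit its total and reset the counter
        if old_key and old_key != this_key:
            print("{}\t{}".format(old_key, word_count))
            word_count = 0

        old_key = this_key
        word_count += int(count)  # word counts are whole numbers

    # Emit the total for the last key
    if old_key is not None:
        print("{}\t{}".format(old_key, word_count))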
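
The reducer above only works if its input arrives sorted by key, because it emits a total as soon as the key changes. The sketch below shows the same grouping idea with `itertools.groupby` and what happens on unsorted input. `reduce_sorted` is a hypothetical helper that takes a list of lines rather than reading `sys.stdin`.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts of consecutive lines that share a key."""
    parsed = (line.strip().split("\t") for line in lines)
    valid = (fields for fields in parsed if len(fields) == 2)  # skip malformed lines
    results = []
    # groupby only merges *adjacent* items with equal keys
    for key, group in groupby(valid, key=lambda fields: fields[0]):
        results.append((key, sum(int(count) for _, count in group)))
    return results


sorted_input = [
    "dave\t1", "dave\t1", "hello\t1", "is\t1", "is\t1",
    "my\t1", "my\t1", "name\t1", "name\t1",
]
print(reduce_sorted(sorted_input))
# [('dave', 2), ('hello', 1), ('is', 2), ('my', 2), ('name', 2)]

unsorted_input = ["is\t1", "dave\t1", "is\t1"]
print(reduce_sorted(unsorted_input))
# [('is', 1), ('dave', 1), ('is', 1)]  <- 'is' is split into two partial totals
```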

### Aadhaar Data - MapReduce

In [4]:
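import sys

def mapper():
    for line in sys.stdin:
        data = line.strip().split(",")
        # Skip malformed rows and the header row
        if len(data) != 12 or data[0] == 'Registrar':
            continue
        # Emit column 3 as the key and column 8 as the value
        # (presumably the district and the number of Aadhaar generated)
        print("{}\t{}".format(data[3], data[8]))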

In [5]:
def reducer():
    aadhaar_generated = 0
    old_key = None
    
    for line in sys.stdin:
        data = line.strip().split("\t")
        
        if len(data) != 2:
            continue
        this_key, count = data
        
        # If this is a new key, let's print the final key-value pair
        if old_key and old_key != this_key:
            print("{}\t{}".format(old_key, aadhaar_generated))
            
            aadhaar_generated = 0
        
        old_key = this_key
        aadhaar_generated += float(count)
    
    if old_key is not None:
        print("{}\t{}".format(old_key, aadhaar_generated))
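
To try a mapper and reducer pair like this without a cluster, one option is to feed text through them in-process, with a sort in between standing in for the shuffle. This is a sketch under assumptions: `run_stage` and `run_pipeline` are hypothetical helpers, and `demo_mapper` / `demo_reducer` are simplified stand-ins operating on made-up two-column rows, not the Aadhaar functions or data.

```python
import io
import sys
from contextlib import redirect_stdout


def run_stage(stage, input_text):
    """Run a function that reads sys.stdin and prints; return what it printed."""
    original_stdin = sys.stdin
    sys.stdin = io.StringIO(input_text)
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            stage()
    finally:
        sys.stdin = original_stdin  # always restore the real stdin
    return buffer.getvalue()


def run_pipeline(mapper, reducer, input_text):
    """Mapper output is sorted by line before it reaches the reducer."""
    mapped = run_stage(mapper, input_text)
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return run_stage(reducer, shuffled)


def demo_mapper():
    # Stand-in mapper: turns "key,count" rows into "key<TAB>count"
    for line in sys.stdin:
        key, count = line.strip().split(",")
        print("{}\t{}".format(key, count))


def demo_reducer():
    # Stand-in reducer: sums counts per key
    totals = {}
    for line in sys.stdin:
        key, count = line.strip().split("\t")
        totals[key] = totals.get(key, 0) + int(count)
    for key in sorted(totals):
        print("{}\t{}".format(key, totals[key]))


sample = "A,3\nB,5\nA,4\n"  # made-up rows for illustration only
print(run_pipeline(demo_mapper, demo_reducer, sample), end="")
```

Because `run_stage` accepts any function that reads `sys.stdin` and prints, the same helper should work with the `mapper` and `reducer` defined in the cells above.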

#### MapReduce Programming Model

Hadoop-based products
- Hive (Facebook)
- Pig (Yahoo)