# Principles of MapReduce

Notes and images from [Mining of Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/book.pdf), by Leskovec, Rajaraman, Ullman

### What is MapReduce?
MapReduce is a programming model for running parallel, distributed algorithms on large datasets stored across a cluster.

### Is MapReduce the right tool?
**Yes, if:**
- datasets are large
- once generated, datasets are rarely updated in place
- computations are being performed on a cluster

### What is Hadoop?
Hadoop is an open-source framework from the Apache Software Foundation that includes an implementation of MapReduce. Its jobs typically read from and write to the Hadoop Distributed File System (HDFS).

# Anatomy of a MapReduce program

## Jobs
A job is composed of map tasks and reduce tasks. A program or query can involve multiple MapReduce jobs, e.g. if the query has a nested structure.

### Execution workflow
**Input Reader:** Splits the input data into chunks and assigns each chunk to a map node.

**Map function:** Takes a series of key-value pairs, processes each one, and returns another series of key-value pairs. Note that the output keys do not have to be unique.

**Partition function:** Shuffles (shards) the Map output to the Reducer nodes. One method is to hash each key and take the result modulo the number of Reducer nodes available, so that all pairs with the same key go to the same Reducer.

**Comparison function:** Collects the Map outputs routed to each Reducer by the Partition function and sorts them by key before they are passed to the Reduce function.

**Reduce function:** Performs an aggregate computation over all the values that share a key. A single Reducer may handle one or more keys.
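The hash-and-modulo method mentioned for the Partition function can be sketched as follows. This is a minimal illustration, not Hadoop's partitioner: the function name `partition` and the choice of `zlib.crc32` as the hash are assumptions. A stable hash is used because Python's built-in `hash()` is randomized per process for strings, which would make the assignment differ between runs.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer that should receive `key` (hypothetical helper)."""
    # crc32 gives the same value in every process, unlike hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
# Every occurrence of the same key maps to the same reducer index.
assignments = [(k, partition(k, 3)) for k in keys]
```

Because the index depends only on the key, both `"apple"` pairs land on the same reducer, which is what lets the Reduce function see all the values for a key together.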

<img src='../images/mapreduce_schematic.png' width='500' height='300'>
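To make the workflow concrete, here is a small single-process sketch of the five steps applied to word counting. It only imitates the data flow; nothing here is distributed, and all function names (`input_reader`, `map_fn`, `partition`, `reduce_fn`, `run_job`) are hypothetical, chosen to mirror the step names above.

```python
from collections import defaultdict
import zlib

def input_reader(text, num_chunks):
    """Split the input lines into chunks, one per (simulated) map task."""
    lines = text.splitlines()
    return [lines[i::num_chunks] for i in range(num_chunks)]

def map_fn(chunk):
    """Emit a (word, 1) pair for every word; keys may repeat."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1

def partition(key, num_reducers):
    """Pick a reducer by hashing the key modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)

def run_job(text, num_mappers=2, num_reducers=2):
    # Shuffle: route every map output pair to its reducer's bucket.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for chunk in input_reader(text, num_mappers):
        for key, value in map_fn(chunk):
            buckets[partition(key, num_reducers)][key].append(value)
    # Comparison step: visit each bucket's keys in sorted order, then reduce.
    results = {}
    for bucket in buckets:
        for key in sorted(bucket):
            word, total = reduce_fn(key, bucket[key])
            results[word] = total
    return results

counts = run_job("the quick brown fox\nthe lazy dog\nthe end")
```

In this sketch `counts["the"]` comes out as 3, since the three `("the", 1)` pairs emitted by different map chunks are routed to the same bucket before being summed.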

### Nodes
**Master Node:** Computes the necessary number of map and reduce tasks. Keeps track of the identity of each map and reduce node, as well as the status of each map and reduce task. Assigns new tasks to workers as they become available, and reschedules tasks when a node fails.

**Worker Nodes:** Can be either a Map worker or a Reduce worker, executing the actual map or reduce tasks.

<img src='../images/mapreduce_nodes.png' width='500' height='300'>
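The master's bookkeeping described above can be sketched as a small in-memory class. This is an illustrative toy, not how Hadoop or any real scheduler is implemented: the class name `Master`, its methods, and the status labels (`"idle"`, `"in_progress"`, `"completed"`) are all assumptions, and the failure handling is simplified to re-queueing only the unfinished tasks of a failed worker.

```python
from collections import deque

class Master:
    """Toy master: tracks task status and which worker holds each task."""

    def __init__(self, task_ids):
        self.status = {t: "idle" for t in task_ids}
        self.assigned = {}              # task id -> worker id
        self.pending = deque(task_ids)  # tasks waiting for a worker

    def assign(self, worker_id):
        """Give the next pending task to an available worker, if any."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.status[task] = "in_progress"
        self.assigned[task] = worker_id
        return task

    def complete(self, task):
        """Record that a worker finished its task."""
        self.status[task] = "completed"

    def worker_failed(self, worker_id):
        """Re-queue the failed worker's unfinished tasks (simplified)."""
        for task, worker in list(self.assigned.items()):
            if worker == worker_id and self.status[task] != "completed":
                self.status[task] = "idle"
                del self.assigned[task]
                self.pending.append(task)

master = Master(["map-0", "map-1", "reduce-0"])
first = master.assign("worker-a")   # worker-a takes the first pending task
second = master.assign("worker-b")  # worker-b takes the next one
master.complete(second)
master.worker_failed("worker-a")    # worker-a's task returns to the queue
```

After the simulated failure, the task held by `worker-a` is back in `master.pending` with status `"idle"`, ready to be handed to the next available worker.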

