# Map reduce

**Map reduce** is a programming pattern that is widely used in large-scale distributed data computation

**Map**: square each item in a list $L=[0,1,2,3]$, output is $[0,1,4,9]$

In [1]:
# traditional way

## for loop
L = [0, 1, 2, 3]
O = []
for i in L:
    O.append(i*i)
    
## list comprehension
[i*i for i in L]

[0, 1, 4, 9]

In [5]:
# map

list(map(lambda x: x*x, L))

[0, 1, 4, 9]

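The "traditional" way computes the items from first to last, in order, whereas in the map-reduce strategy the computation order is not specified. (Python's built-in `map` still runs sequentially; the point is that nothing in the pattern relies on that order.)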
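That the order does not matter for a map can be checked with a minimal sketch. The helper name `square` and the shuffled visiting order are assumptions made for illustration only: each item is squared independently and written to its own slot, so any visiting order gives the same output.

```python
import random

def square(x):
    return x * x

L = [0, 1, 2, 3]

# Visit the positions in a random order; each result goes into its own slot.
positions = list(range(len(L)))
random.shuffle(positions)

out = [None] * len(L)
for p in positions:
    out[p] = square(L[p])

print(out)  # [0, 1, 4, 9], whatever order the positions were visited in
```

Because no item depends on any other, the same work could be handed to different machines, one slice of positions each.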

**Reduce**: compute the sum of a list $L=[3,1,5,7]$, output is $16$

In [1]:
# traditional way

## use builtin
L = [3, 1, 5, 7]
sum(L)

## for loop
s = 0
for i in L:
    s += i
s

16

In [3]:
# reduce

from functools import reduce

reduce(lambda x, y: x + y, L)

16

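The traditional way computes everything from first to last, in order, whereas in the map-reduce strategy the computation order is not specified. (Python's `functools.reduce` itself folds from left to right; the pattern only requires that the result would be the same for any other grouping.)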
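A small sketch of what "order not specified" can mean for a reduce. The function `tree_reduce` is a hypothetical helper, not part of `functools`: it combines neighbouring pairs instead of folding left to right, and for an associative operation such as addition both groupings agree.

```python
from functools import reduce
from operator import add

def tree_reduce(op, items):
    """Combine neighbouring pairs repeatedly until one value is left."""
    items = list(items)
    if not items:
        raise ValueError("tree_reduce needs at least one item")
    while len(items) > 1:
        paired = [op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2 == 1:
            paired.append(items[-1])  # the odd one out is carried to the next round
        items = paired
    return items[0]

L = [3, 1, 5, 7]
left_to_right = reduce(add, L)  # ((3 + 1) + 5) + 7
pairwise = tree_reduce(add, L)  # (3 + 1) + (5 + 7)
print(left_to_right, pairwise)  # 16 16
```

In the pairwise version the two inner sums do not depend on each other, so they could be computed at the same time.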

**Map + Reduce**: compute the sum of squares from a list $L=[0,1,2,3]$, note the differences:

In [5]:
# traditional way

## for loop
L = [0, 1, 2, 3]
s = 0
for i in L:
    s += i*i
    
## list comprehension
sum([i*i for i in L])

14

In [6]:
# map-reduce

reduce(lambda x, y: x + y, map(lambda i: i*i, L))

14

The traditional way computes everything from first to last, in order: we describe exactly what should happen, thinking of the computer as executing one command at a time. In the map-reduce strategy the computation order is not specified; we only specify an execution plan

In [67]:
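# the WRONG way: the first item is never squared and the lambda is not associative
# (the result is 14 here only because the first item is 0, and 0*0 == 0)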
reduce(lambda x, y: x+y * y, L)

14

Map-reduce operations should not depend on the order of the items in the list (commutativity) or on the order of the operations (associativity)
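Why the cell labelled "the WRONG way" is wrong shows up as soon as the list does not start with 0. A minimal sketch, with hypothetical function names, comparing it to the map-then-reduce version on $L=[3,1,5,7]$ and on the reversed list:

```python
from functools import reduce

def wrong_sum_of_squares(items):
    # The accumulator x is never squared, so the first item enters unsquared.
    return reduce(lambda x, y: x + y * y, items)

def sum_of_squares(items):
    # Map first, then reduce with an associative and commutative operation.
    return reduce(lambda x, y: x + y, map(lambda i: i * i, items))

L = [3, 1, 5, 7]
print(sum_of_squares(L))              # 84
print(sum_of_squares(L[::-1]))        # 84
print(wrong_sum_of_squares(L))        # 78
print(wrong_sum_of_squares(L[::-1]))  # 42
```

The correct version gives 84 for both orders; the wrong one gives a different answer for each order, so its result would depend on how the work happened to be scheduled.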

**Order independence**: the result of map or reduce does not depend on the order of computation. The computation order can therefore be chosen by the compiler/optimizer, which allows sums of subsets to be computed in parallel. Modern hardware calls for parallel computation, but parallel programs are very hard to write by hand
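The "parallel computation of sums of subsets" can be sketched with the standard library. This is only an illustration of the structure, under the assumption that a thread pool stands in for separate machines (threads in CPython will not make this particular computation faster); the function names and the chunk count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum_of_squares(chunk):
    # Map and reduce on one subset only.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(items, n_chunks=4):
    items = list(items)
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(chunk_sum_of_squares, chunks))
    # Final reduce over one number per chunk.
    return sum(partial_sums)

total = parallel_sum_of_squares(range(1000))
print(total)  # 332833500
```

Splitting into chunks is valid only because addition is associative and commutative: the partial sums can be formed in any grouping and combined in any order.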

Map-reduce is the basis of many big data systems, including Hadoop and Spark

# Short history of map-reduce

**Google File System (GFS) + Map-reduce (2003)**

In 2003, Google had a lot of computers, but each was its own independent computer. So they designed a system called the Google File System (GFS), in which a master knows where all the data is and the data itself is distributed across many computers. A large file is chopped into smaller pieces, and each piece is replicated across two or three computers. This makes it possible to process data in parallel: each computer can run map operations on the pieces of data it holds and start reducing them locally, and it communicates only its final answer to other computers once its own reduce is finished.
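A toy simulation of the idea, not of GFS itself. The round-robin placement, the node names, and the rule "the first holder processes the chunk" are all assumptions made for illustration: a dictionary plays the role of the master's table, every chunk has three copies, and each node sends back a single number.

```python
def split_into_chunks(data, chunk_size):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_chunks(n_chunks, nodes, replicas=3):
    """Master's table: chunk id -> distinct nodes holding a copy (round-robin)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        c: [nodes[(c + r) % len(nodes)] for r in range(replicas)]
        for c in range(n_chunks)
    }

data = list(range(12))
chunks = split_into_chunks(data, chunk_size=2)
nodes = ["node0", "node1", "node2", "node3"]  # hypothetical machine names
placement = place_chunks(len(chunks), nodes, replicas=3)

# Each chunk is processed by the first node that holds it. Every node maps
# and reduces locally, so only one number per node travels over the network.
local_results = {node: 0 for node in nodes}
for chunk_id, holders in placement.items():
    worker = holders[0]
    local_results[worker] += sum(x * x for x in chunks[chunk_id])

total = sum(local_results.values())
print(placement[0])  # ['node0', 'node1', 'node2']
print(total)         # 506
```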

**Apache Hadoop (2006)**

An open-source implementation of Google's idea. The file system is called the Hadoop Distributed File System (HDFS); the compute system, called MapReduce at Google, is Hadoop MapReduce in Apache. Large ecosystem: Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache Zookeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm

**Apache Spark (2014)**

Matei Zaharia, AMPLab, Berkeley. Main difference from Hadoop: distributed memory instead of distributed files!

The native language of the Hadoop ecosystem is Java. Spark can be programmed in Java, but the code tends to be long. **Scala** (which runs on the Java Virtual Machine) allows parallel programming to be abstracted. It is the core language of Spark, but one of its problems is a small user base (you will want to learn Scala if you want to extend Spark). **PySpark** is a Python library for programming Spark; it is not the most efficient option, but it is easier to learn.

**Spark Architecture: SC and RDD**

SparkContext: control of other nodes is achieved through a special object called the **SparkContext** (usually named **sc**). A notebook can have only one SparkContext object. Initialization is usually `sc = SparkContext()`; pass parameters for a non-default configuration

Resilient Distributed Dataset (RDD): the main data structure in Spark. It is a list whose elements are distributed over several computers. While in RDD form, the elements of the list can be manipulated only through RDD-specific methods. RDDs are created from a list on the master node or from a file. RDDs can be translated back to a local list using `collect()`

**PySpark**: some basic examples

In [1]:
# import only the name that is used below instead of a wildcard import
from pyspark import SparkContext

In [2]:
sc = SparkContext()

In [3]:
# initialize an RDD
RDD = sc.parallelize([0,1,2])

In [6]:
# sum the squares of the items
RDD.map(lambda x: x*x).reduce(lambda x, y: x+y)

5

Operations such as `map` take an RDD and return a new RDD

In [7]:
# initialize RDD
RDD = sc.parallelize([0,1,2])
# square each item (no reduce here, so the result is still an RDD)
A = RDD.map(lambda x: x*x)
A.collect()

[0, 1, 4]

`collect()` gathers all the items of the RDD into a list on the master. If the RDD is large, this can take a long time

Checking the start of an RDD:

In [15]:
# initialize a largish RDD
n = 10000
B = sc.parallelize(range(n))

# get the first few elements of an RDD
print('first element =', B.first())
print('first 5 elements =', B.take(5))

first element = 0
first 5 elements = [0, 1, 2, 3, 4]


Sampling an RDD

In [29]:
n = 10000
B = sc.parallelize(range(n))

# sample about m elements into a new RDD
m = 5.
C = B.sample(False, m/n)
C.collect()

[5518, 7239]

Each run produces a different sample, and the sample size varies from run to run; its expected value is 5. The result of `sample` is an RDD, so `collect()` is needed to turn it into a list. Sampling is very useful in machine learning.
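Why the size varies can be seen in a pure-Python sketch. It assumes that sampling without replacement keeps each element independently with probability `m/n`; this mimics the behaviour described above and is not Spark's own sampler. The function name `bernoulli_sample` and the seeds are hypothetical.

```python
import random

def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

n, m = 10000, 5
sizes = [len(bernoulli_sample(range(n), m / n, seed=s)) for s in range(200)]

print(min(sizes), max(sizes))   # the size differs from run to run
print(sum(sizes) / len(sizes))  # the average is close to m
```

Fixing the seed makes a run repeatable. When an exact number of elements is needed, the RDD method `takeSample` returns a local list of the requested size rather than an RDD.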