# Map reduce

**Map reduce** is a programming pattern that is widely used in large-scale distributed data computation.

**Map**: square each item in a list $L=[0,1,2,3]$, output is $[0,1,4,9]$

In [1]:
# traditional way

## for loop
L = [0, 1, 2, 3]
O = []
for i in L:
    O.append(i*i)
    
## list comprehension
[i*i for i in L]

[0, 1, 4, 9]

In [2]:
# map

list(map(lambda x: x*x, L))

[0, 1, 4, 9]

The "traditional" way processes the items in order, from first to last, whereas the map-reduce strategy leaves the computation order unspecified.

**Reduce**: compute the sum of a list $L=[3,1,5,7]$, output is $16$

In [3]:
# traditional way

## use builtin
L = [3, 1, 5, 7]
sum(L)

## for loop
s = 0
for i in L:
    s += i
s

16

In [4]:
# reduce

from functools import reduce

reduce(lambda x, y: x + y, L)

16

The traditional way computes everything in order, from first to last, whereas the map-reduce strategy leaves the computation order unspecified.

**Map + Reduce**: compute the sum of squares from a list $L=[0,1,2,3]$, note the differences:

In [5]:
# traditional way

## for loop
L = [0, 1, 2, 3]
s = 0
for i in L:
    s += i*i
    
## list comprehension
sum([i*i for i in L])

14

In [6]:
# map-reduce

reduce(lambda x, y: x + y, map(lambda i: i*i, L))

14

The traditional way computes everything in order, from first to last: we describe exactly what should happen, thinking of the computer as executing one command at a time. In the map-reduce strategy the computation order is not specified; instead, we specify an execution plan.

In [7]:
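# the WRONG way: this reducer is neither commutative nor associative,
# so its result depends on the order in which items are combined
reduce(lambda x, y: x+y * y, L)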

14

Map-reduce operations should not depend on the order of the items in the list (commutativity) or on the order of operations (associativity).
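
To make the problem with cell `In [7]` concrete, the sketch below applies both approaches to the same items in two different orders. The helper names `wrong_reducer` and `sum_of_squares` are illustrative and not part of the notebook. The wrong reducer only happens to return 14 on `[0, 1, 2, 3]` because the first item, which is never squared, is 0.

```python
from functools import reduce

def wrong_reducer(x, y):
    # Squares only the right-hand argument, so the running total and the
    # next item are treated differently: not commutative, not associative.
    return x + y * y

def sum_of_squares(items):
    # Map squares every item first; reduce then only needs plain addition.
    return reduce(lambda x, y: x + y, map(lambda i: i * i, items))

L = [0, 1, 2, 3]
L_reordered = [3, 2, 1, 0]

wrong_original = reduce(wrong_reducer, L)
wrong_reordered = reduce(wrong_reducer, L_reordered)

right_original = sum_of_squares(L)
right_reordered = sum_of_squares(L_reordered)

print(wrong_original, wrong_reordered)   # 14 8
print(right_original, right_reordered)   # 14 14
```

Because addition is commutative and associative, the map-then-reduce version gives the same answer for any ordering of the items.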

**Order independence**: the result of a map or a reduce does not depend on the order of computation, so the order can be chosen by the compiler/optimizer. This allows the sums of subsets to be computed in parallel. Modern hardware calls for parallel computation, but parallel computation is very hard to program by hand.
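
As a rough local illustration of computing sums of subsets independently, the sketch below splits a list into chunks, reduces each chunk on its own, and then combines the partial results. The helpers `split_into_chunks` and `chunked_sum` are hypothetical names, the input is assumed to be non-empty, and a thread pool is used only to show that the chunks can be handled in any order; it is not how Hadoop or Spark schedule work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def add(x, y):
    return x + y

def split_into_chunks(items, n_chunks):
    # Ceiling division so that every item lands in exactly one chunk.
    size = max(1, -(-len(items) // n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]

def chunked_sum(items, n_chunks=4):
    chunks = split_into_chunks(items, n_chunks)
    # Each chunk is reduced independently of the others.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(lambda chunk: reduce(add, chunk), chunks))
    # Only the short list of partial results has to be combined at the end.
    return reduce(add, partial_sums)

data = list(range(100))
total = chunked_sum(data)
print(total)  # 4950
```

This regrouping is only valid because `add` is associative; with a reducer like the one in the "WRONG way" cell, the chunked result would differ from the sequential one.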

Map-reduce is the basis for many big-data systems, including Hadoop and Spark.

# Short history of map-reduce

**Google File System (GFS) + Map-reduce (2003)**

In 2003, Google had a lot of computers, but each was its own independent machine. So they designed the Google File System (GFS), in which a master knows where all the data is while the data itself is distributed across many computers. A large file is chopped into smaller pieces, and each piece is replicated across two or three computers. This makes it possible to process things in parallel: each computer can run map operations on the pieces of data it holds and start reducing locally, and it communicates only its final answer to the other computers once its own reduce is finished.

**Apache Hadoop (2006)**

An open-source implementation of Google's idea. The file system is called the Hadoop Distributed File System (HDFS), and the compute system, which Google called MapReduce, is Hadoop MapReduce in Apache. It has a large ecosystem: Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache Zookeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm.

**Apache Spark (2014)**

Matei Zaharia, AMPLab, UC Berkeley. Main difference from Hadoop: distributed memory instead of distributed files!

The native language of the Hadoop ecosystem is Java. Spark can be programmed in Java, but the code tends to be long. **Scala** (which runs on the Java Virtual Machine) allows parallel programming to be abstracted. It is the core language for Spark, but one of its problems is a small user base (you will want to learn Scala if you want to extend Spark). **PySpark** is a Python library for programming Spark; it is not the most efficient option, but it is easier to learn.

**Spark Architecture: SC and RDD**

SparkContext: control of other nodes is achieved through a special object called the **SparkContext** (usually named **sc**). A notebook can have only one SparkContext object. Initialization is usually `sc = SparkContext()`; pass parameters for a non-default configuration.

Resilient Distributed Dataset (RDD): a list whose elements are distributed over several computers. It is the main data structure in Spark. While in RDD form, the elements of the list can be manipulated only through RDD-specific methods. RDDs are created from a list on the master node or from a file, and they can be translated back to a local list using `collect()`.

**Pyspark**: some basic examples

In [8]:
# import only what is needed instead of a wildcard import
from pyspark import SparkContext

In [9]:
sc = SparkContext()

In [10]:
# initialize an RDD
RDD = sc.parallelize([0,1,2])

In [11]:
# sum the squares of the items
RDD.map(lambda x: x*x).reduce(lambda x, y: x+y)

5

Transformations such as `map` take an RDD and return a new RDD.

In [12]:
# initialize RDD
RDD = sc.parallelize([0,1,2])
# square each item (no reduce here, so nothing is summed)
A = RDD.map(lambda x: x*x)
A.collect()

[0, 1, 4]

`collect()` collects all the items in the RDD into a list on the master. If the RDD is large, this can take a long time.

Checking the start of an RDD:

In [13]:
# initialize a largish RDD
n = 10000
B = sc.parallelize(range(n))

# get the first few elements of an RDD
print('first element =', B.first())
print('first 5 elements =', B.take(5))

first element = 0
first 5 elements = [0, 1, 2, 3, 4]


Sampling an RDD

In [14]:
n = 10000
B = sc.parallelize(range(n))

# sample about m elements into a new RDD
m = 5.
C = B.sample(False, m/n)
C.collect()

[1045, 2014, 2488, 5180]

Each run produces a different sample, and the sample size varies around its expected value of 5. The result is itself an RDD, so it has to be collected to obtain a list. Sampling is very useful for machine learning.
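
A varying sample size is what one would see if every element were kept independently with probability `m/n`. The sketch below simulates that idea locally with Python's `random` module. It is an illustration under that assumption, not Spark's implementation, and `bernoulli_sample` is a hypothetical helper.

```python
import random

def bernoulli_sample(items, fraction, seed=None):
    # Keep each item independently with probability `fraction`.
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

n = 10000
m = 5.0

# Repeat the experiment with different seeds to see how the size fluctuates.
sizes = [len(bernoulli_sample(range(n), m / n, seed=s)) for s in range(200)]
average_size = sum(sizes) / len(sizes)

print(min(sizes), max(sizes), average_size)
```

Individual samples can be noticeably smaller or larger than 5, while the average over many runs stays close to it.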

# Spark architecture

In a local installation, cores serve as master and slaves. In terms of hardware organization, one computer is the master and the other computers are slaves: the master controls everything, while the slaves do the actual work and store all the data.

**Spatial software organization**: there is a head node (the Spark Driver and the Cluster Master, which can be Mesos, YARN or Standalone) and there are workers (the Cluster Worker and the Executor). Each side has a part that does the actual work (the Spark Driver and the Executor) and a management part (the Cluster Master and the Cluster Worker).

**Spark driver**: the driver runs on the master. It is the program you wrote, and it executes the `main()` code of your program.

**Cluster master**: manages computation resources

**Worker**: manages a single core. Each RDD is partitioned among the workers. Workers manage partitions and executors. Executors execute tasks on their own partition only (they are myopic).

The SparkContext is the abstraction that encapsulates the cluster for the driver node and the programmer. Worker nodes manage the resources of a single slave machine and communicate with the cluster manager. Executors are the processes that perform tasks. Cache refers to the local memory on the slave machine.

Materialization: we don't always need to store intermediate results. Consider RDD1 -> Map (x: x\*x) -> RDD2 -> Reduce (x,y: x+y) -> float (in the head node). RDD2 can be consumed as it is being generated, so it doesn't have to be materialized (stored in memory). By default, RDDs are not materialized.
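
Plain Python offers a loose analogy for the difference. In the sketch below a generator plays the role of a non-materialized RDD2 and a list plays the role of a materialized one. The variable names are illustrative, and this is a local analogy rather than Spark code.

```python
from functools import reduce

rdd1 = [0, 1, 2, 3]

# Not materialized: the generator yields each square on demand, so the
# intermediate "RDD2" never exists as a whole in memory.
rdd2_lazy = (x * x for x in rdd1)
total = reduce(lambda x, y: x + y, rdd2_lazy)

# The generator has been consumed; using it again would need recomputation.
leftover = list(rdd2_lazy)

# Materialized: the full intermediate list is stored and can be reused.
rdd2_materialized = [x * x for x in rdd1]
total_again = reduce(lambda x, y: x + y, rdd2_materialized)

print(total, leftover, rdd2_materialized, total_again)  # 14 [] [0, 1, 4, 9] 14
```

The trade-off is the usual one: skipping materialization saves memory, while storing the intermediate result pays off when it is needed more than once.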

**Temporal organization**: a **stage** is a set of operations that can be performed before a materialization is necessary. A stage ends when the RDD needs to be materialized.

Terms and concepts:

* RDDs are partitioned across workers
* RDD graph defines the lineage of the RDDs
* SparkContext divides the RDD graph into stages which define the execution plan (or physical plan)
* A task corresponds to one stage restricted to one partition
* An executor is a process that performs tasks
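
The terms above can be tied together with a toy model. The `ToyRDD` class below is a hypothetical stand-in written for this note, not part of PySpark. It stores partitions, records each `map` in a lineage without computing anything, and runs one task per partition only when `collect()` is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of PySpark."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # list of lists, one per "worker"
        self.lineage = lineage        # functions recorded so far, in order

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyRDD(self.partitions, self.lineage + (fn,))

    def run_task(self, partition):
        # One task = the whole stage applied to a single partition.
        result = []
        for item in partition:
            for fn in self.lineage:
                item = fn(item)
            result.append(item)
        return result

    def collect(self):
        # An action triggers one task per partition and gathers the results.
        collected = []
        for partition in self.partitions:
            collected.extend(self.run_task(partition))
        return collected

rdd = ToyRDD([[0, 1], [2, 3]])
squared_plus_one = rdd.map(lambda x: x * x).map(lambda x: x + 1)

print(len(squared_plus_one.lineage))  # 2
print(squared_plus_one.collect())     # [1, 2, 5, 10]
```

Both `map` calls are applied to each item in a single pass inside `run_task`, so the intermediate squared values are never stored as a separate collection.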

# Manipulating plain RDDs

Plain RDDs are parallelized lists of elements. For example:

In [15]:
sc.parallelize(range(4))

PythonRDD[10] at RDD at PythonRDD.scala:48

In [16]:
sc.parallelize(['you are my sunshine',
                'my only sunshine',
                'you make me happy'])

ParallelCollectionRDD[11] at parallelize at PythonRDD.scala:175

**Three groups of commands**

* Creation: RDD from files, databases, or data on driver node
* Transformations: RDD to RDD
* Actions: RDD to data on driver node, databases, files

In [17]:
sc.parallelize(range(4)).collect()

[0, 1, 2, 3]

In [18]:
sc.parallelize(range(4)).count()

4

In [19]:
A=sc.parallelize(range(4))
A.reduce(lambda x,y:x+y)

6

# Manipulating KeyVal RDD

Each element of the RDD is a pair (key,value). The key is an identifier and the value can be anything.

In [20]:
# each element must itself be a (key, value) pair
database = sc.parallelize([(55632, {'name': 'yoav', 'city': 'jerusalem'}),
                           (33421, {'name': 'homer', 'city': 'fairview'})])

In [21]:
car_count = sc.parallelize((('honda',3),
                            ('subaru',2),
                            ('honda',1)))

## Transformations in RDDs

In [22]:
A=sc.parallelize(range(4))\
    .map(lambda x: (x, x*x))
    
A.collect()

[(0, 0), (1, 1), (2, 4), (3, 9)]

**reduceByKey**: performs a reduce separately on the values of each key. Note that this is a transformation, not an action.

In [23]:
A=sc.parallelize([(1,3),(4,100),(1,-5),(3,2)])
A.reduceByKey(lambda x,y: x*y).collect()

[(4, 100), (1, -15), (3, 2)]

The keys are 1, 4, 1 and 3. The keys 4 and 3 are unique and only 1 repeats, so the operation multiplies the two values of key 1: $3\times(-5)=-15$.
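
The per-key behaviour can be reproduced locally with a dictionary. The function `reduce_by_key` below is a hypothetical helper that mimics the effect on a plain list of pairs. It is a sketch of the idea and does not call Spark.

```python
from operator import mul

def reduce_by_key(pairs, fn):
    # Hypothetical local helper mimicking the effect of reduceByKey.
    reduced = {}
    for key, value in pairs:
        if key in reduced:
            # Key seen before: combine the running result with the new value.
            reduced[key] = fn(reduced[key], value)
        else:
            # First value for this key is kept as is.
            reduced[key] = value
    return reduced

pairs = [(1, 3), (4, 100), (1, -5), (3, 2)]
products = reduce_by_key(pairs, mul)
print(products)  # {1: -15, 4: 100, 3: 2}
```

As with a plain reduce, the combining function should be commutative and associative so that the result does not depend on the order in which the values for a key arrive.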

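An aside on iterators, which `groupByKey` returns: an iterator produces its elements one at a time instead of storing them all. Compare the two loops below.

In the first loop, the whole temporary list is created first, then the loop iterates over it.

In the second loop, the elements are created and "destroyed" as needed, so no memory is wasted. (In Python 2 the lazy version was `xrange`; in Python 3, `range` itself is lazy.)

```
# waste of memory space: the full list is built before the loop starts
for i in list(range(1000000)):
    pass  # do something

# no waste: elements are produced one at a time
for i in range(1000000):
    pass  # do something
```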

**groupByKey**: returns a `(key,<iterator>)` pair for each key value. The iterator iterates over the values corresponding to the key.

In [24]:
A=sc.parallelize([(1,3),(3,100),(1,-5),(3,2)])
# each element is a (key, iterator) pair; convert the iterator to a list for display
A.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[(1, [3, -5]), (3, [100, 2])]

## Actions in RDDs

**countByKey**: returns a Python dictionary with the number of pairs for each key

In [25]:
A=sc.parallelize([(1,3),(3,100),(1,-5),(3,2)])
A.countByKey()

defaultdict(int, {1: 2, 3: 2})

**lookup(key)**: returns the list of all of the values associated with the key

In [26]:
A=sc.parallelize([(1,3),(3,100),(1,-5),(3,2)])
A.lookup(3)

[100, 2]

**collectAsMap()**: like `collect()`, but instead of returning a list of tuples it returns a Python dictionary. When a key appears more than once, only one of its values is kept.

In [27]:
A=sc.parallelize([(1,3),(3,100),(1,-5),(3,2)])
A.collectAsMap()

{1: -5, 3: 2}
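
The output above keeps a single value per key, which is what building a plain Python dictionary from the pairs would do. The local sketch below assumes that behaviour and does not call Spark. It contrasts `dict()` with grouping the values first, which is the safer choice when keys repeat.

```python
from collections import defaultdict

pairs = [(1, 3), (3, 100), (1, -5), (3, 2)]

# dict() keeps only the last value seen for each key.
as_map = dict(pairs)

# Grouping first preserves every value.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
all_values = dict(grouped)

print(as_map)      # {1: -5, 3: 2}
print(all_values)  # {1: [3, -5], 3: [100, 2]}
```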