MapReduce vs. Spark

2004 Google’s MapReduce paper
It was inspired by the map and reduce functions in functional programming: "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages"
Phases: map(key, value) + shuffle phase(partitioning + sorting + grouping) + reduce(key, list) slides 11-21
Data are always serialized and stored on the disk after the map phase.
Note: after the map phase, combiner can be defined. it can help to reduce the amount of data written to disk, and transmitted over the network by aggregating partial results. Input and output of the combiner have to be identical.

In this section, it is considered MapReduce in a meaning of a default processing engine of Hadoop.

Characteristics:

It is best suited for handling very large data that sit in permanent storage.

But there are some disadvantages:

constrained by two stages (map and reduce)
uncomfortable work with API -> skilled developer
heavily leverages permanent storage, reading and writing multiple times per task, it tends to be fairly slow

It has to be said that many concepts are good even today and Hadoop is frequently used as a building block for other software in many frameworks.

All what was mentioned plus ...

comfortable API
not only two stages
it is possible create data flow
work with operation memory -> better performace
batch and streaming (kind of) * doesn't have to rely on Hadoop (no HDFS, no YARN), warriaty of inputs/outpus

But there are may other frameworks that can be used. We focus on Apache Spark because it has huge community

Provide feedback