<img src="uva_seal.png" align="left">  

## MapReduce Framework

### University of Virginia
### [Course Number]: Big Data Analytics
### Last Updated: Sep 4, 2019

---  


### SOURCE: 

Hadoop: The Definitive Guide, Tom White

### OBJECTIVES
-  Introduction to the MapReduce framework

### CONCEPTS

- `MapReduce`
- Jobs
- Tasks
- Map operation
- Reduce operation
- Shuffle operation

---

**Introducing MapReduce**

`MapReduce` is a programming framework for processing large data sets with a parallel, distributed algorithm on a cluster.  

First, the data set is “mapped” into a collection of `<key, value>` pairs. The pairs are then “reduced”: all values that share a key are combined into a single result for that key.  

These two operations are surprisingly general, so the framework can handle a wide range of use cases.

The standard minimal example is counting how often each word occurs in some text.
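The word count idea can be sketched in plain Python, without a cluster, to make the map and reduce steps concrete. This is a minimal single-process illustration, not how Hadoop or Spark implement it: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample text are hypothetical, chosen only for this example.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group together all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical sample text, not taken from the course data.
sample_lines = ["the quick brown fox", "the lazy dog", "the fox"]

pairs = map_phase(sample_lines)    # [('the', 1), ('quick', 1), ...]
grouped = shuffle_phase(pairs)     # {'the': [1, 1, 1], ...}
counts = reduce_phase(grouped)     # {'the': 3, ...}
print(counts)
```

The map step looks at one line at a time and the reduce step looks at one key at a time, which is what lets a real framework run many of each in parallel.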

`MapReduce` was first published in 2004 by Jeffrey Dean and Sanjay Ghemawat at Google.

A `MapReduce` job is a unit of work that the client wants performed.  

A job consists of:  

- Input data
- `MapReduce` program
- Configuration information

The job is divided into *tasks*, of which there are two types: *map tasks* and *reduce tasks*.
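The division of a job into map tasks and reduce tasks can be sketched in plain Python. The sketch below makes several simplifying assumptions: the input is split round-robin (real systems split by storage block), the task counts `NUM_MAP_TASKS` and `NUM_REDUCE_TASKS` are hypothetical, and keys are routed to reduce tasks by a hash of the key modulo the number of reduce tasks, which mirrors the usual default partitioning idea rather than any specific implementation.

```python
import zlib
from collections import defaultdict

NUM_MAP_TASKS = 2      # hypothetical: number of input splits
NUM_REDUCE_TASKS = 2   # hypothetical: number of reduce tasks

def split_input(lines, num_splits):
    """Divide the input into roughly equal splits, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]

def map_task(split):
    """One map task: emit (word, 1) for each word in its split."""
    return [(word, 1) for line in split for word in line.split()]

def partition(key, num_reducers):
    """Choose the reduce task for a key using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_task(pairs):
    """One reduce task: sum the values for each key it received."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

input_lines = ["a b a", "b c", "a c c", "d"]

splits = split_input(input_lines, NUM_MAP_TASKS)
map_outputs = [map_task(s) for s in splits]

# Shuffle: route each pair to the reduce task that owns its key.
reducer_inputs = [[] for _ in range(NUM_REDUCE_TASKS)]
for output in map_outputs:
    for key, value in output:
        reducer_inputs[partition(key, NUM_REDUCE_TASKS)].append((key, value))

reduce_outputs = [reduce_task(pairs) for pairs in reducer_inputs]

# Merge the per-reducer results; a key never appears in two reduce tasks.
final_counts = {}
for result in reduce_outputs:
    final_counts.update(result)
print(final_counts)
```

Because every occurrence of a key is sent to the same reduce task, each reduce task can produce final totals for its keys without consulting the others.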

**Aside on `Hadoop`**  

`MapReduce` is the computation paradigm of `Hadoop`.  

`Hadoop` was created by Doug Cutting and Mike Cafarella.  It began as part of Nutch, an open-source web search engine project, before being split out as its own project.  

Yahoo! was instrumental in providing a dedicated team and resources to turn Hadoop into a system that ran at web scale.

Hadoop is developed and maintained by the Apache Software Foundation.


**Aside on Hadoop and Spark**  

Spark does not follow the `MapReduce` framework in the same way that `Hadoop` does, where `Mapper` and `Reducer` functions are detailed explicitly.  This is actually a big advantage of Spark.

Spark and Hadoop can be better together.  For example, some teams will use Hadoop data storage (HDFS) with Spark.  That is, a Spark job can read data from HDFS, and write results to HDFS.

Reported runtimes: Spark can be as much as 10 times faster than `MapReduce` for batch processing, and up to 100 times faster for in-memory analytics.

A large fraction of this course will focus on Spark for various tasks.

Next we illustrate the `MapReduce` framework with an example.

**MapReduce Word Count Process Diagram**

<img src="map_reduce_example.png">

Set up a minimal Spark session:
- using the local machine as master
- naming the app

In [25]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("pspark_test") \
        .getOrCreate()

In [26]:
# print info about the session
spark

In [108]:
sc = spark.sparkContext


### RDDs and Datasets

Before Spark 2.0, the main programming interface of Spark was the *Resilient Distributed Dataset (RDD)*.  

Since Spark 2.0, the RDD has been superseded by the *Dataset*, which is strongly typed like an RDD but benefits from richer optimizations under the hood. In Python, every Dataset is a `Dataset[Row]`, which is called a *DataFrame*.

The RDD interface is still supported.  

Using Dataset is recommended, since it performs better than RDD.

## Computing

### Example 1: Read lines from text file

In [27]:
import os

In [28]:
data_path = '/home/jovyan/work/data/'

In [78]:
data_filename = 'README.md'

In [101]:
lines = spark.read.text(os.path.join(data_path, data_filename))

In [102]:
lines.count()

103

In [103]:
lines.first()

Row(value='# Apache Spark')

In [None]:
lines.collect()

In [83]:
type(lines.collect())

list

In [87]:
type(lines.collect()[0])

pyspark.sql.types.Row

### Example 2: Text Search  - print all lines containing “Spark”

In [88]:
# keep only the lines whose text contains "Spark"
spark_lines = lines.filter(lines.value.contains("Spark"))

In [96]:
# the DataFrame has a single string column
spark_lines.columns

['value']

In [89]:
# return list of first 5 records
spark_lines.take(5)   

[Row(value='# Apache Spark'),
 Row(value='Spark is a fast and general cluster computing system for Big Data. It provides'),
 Row(value='rich set of higher-level tools including Spark SQL for SQL and DataFrames,'),
 Row(value='and Spark Streaming for stream processing.'),
 Row(value='You can find the latest Spark documentation, including a programming')]

In [90]:
type(spark_lines)

pyspark.sql.dataframe.DataFrame

### Example 3: Word Count

In [113]:
# Read the file into an RDD (this rebinds `lines`, which was a DataFrame above)
lines = sc.textFile(os.path.join(data_path, data_filename))

In [114]:
type(lines)

pyspark.rdd.RDD

In [116]:
# split each line on whitespace and flatten the pieces into one RDD of words
words = lines.flatMap(lambda x: x.split())

In [121]:
words.take(5)

['#', 'Apache', 'Spark', 'Spark', 'is']

In [122]:
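# 1. map each word to a (word, 1) pair
# 2. sum the 1s for each word
# 3. swap to (count, word) so the count becomes the key
# 4. sort by key in descending order
wordcounts = words.map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y) \
                  .map(lambda pair: (pair[1], pair[0])) \
                  .sortByKey(False)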

In [124]:
wordcounts.take(10)

[(24, 'the'),
 (17, 'to'),
 (16, 'Spark'),
 (12, 'for'),
 (9, 'and'),
 (9, '##'),
 (8, 'a'),
 (7, 'can'),
 (7, 'on'),
 (7, 'run')]
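To see what each transformation in the word count pipeline contributes, the same steps can be imitated with plain Python lists. This is only a single-machine sketch of the semantics, not Spark's implementation: `reduce_by_key` is a hypothetical helper written for this example, and the sample lines are made up rather than read from the README.

```python
from itertools import chain

# Hypothetical sample lines standing in for the README text.
sample_lines = [
    "# Apache Spark",
    "Spark is fast",
    "Spark runs on Hadoop and on Mesos",
]

# flatMap: split every line and flatten the results into one list of words.
words = list(chain.from_iterable(line.split() for line in sample_lines))

# map: turn each word into a (word, 1) pair.
pairs = [(word, 1) for word in words]

def reduce_by_key(func, pairs):
    """Combine the values of each key with a two-argument function."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# reduceByKey: add up the 1s for each distinct word.
counts = reduce_by_key(lambda x, y: x + y, pairs)

# map: swap to (count, word) so the count becomes the sort key.
swapped = [(count, word) for word, count in counts]

# sortByKey(False): sort by count, largest first.
ranked = sorted(swapped, key=lambda pair: pair[0], reverse=True)
print(ranked[:2])
```

The swap before sorting is needed because `sortByKey` orders records by the first element of each pair, which is the word until the pair is reversed.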