<img src="ucsb_logo_seal.png"> 

## Resilient Distributed Datasets (RDDs)

### PSTAT 135 / 235: Big Data Analytics
### University of California, Santa Barbara
### Last Updated: Sep 4, 2019

---  


**Sources:**  
Learning Spark, Chapter 3: Programming with RDDs   

### OBJECTIVES
-  Cover the basics of RDDs, including transformations and actions  
-  Discuss parallelization concepts  



### CONCEPTS

- RDD  
- Transformation  
- Action  
- Lazy evaluation  
- Lineage graph - Graph of dependencies between all involved RDDs  
- Set operations  
- Pipelining or chaining  
- Accumulator  
- `persist()` and `cache()`  
- `parallelize()`  
- `collect()` and `take()`  
- `map()`, `filter()`, `flatMap()`  
- `reduce()`, `fold()`, `aggregate()`  
- `count()`, `countByValue()`  
- `saveAsTextFile()`, `saveAsSequenceFile()`  

---

### RDD BASICS

An *RDD* (resilient distributed dataset) is an immutable, distributed collection of elements.  

All work consists of:  
- RDD creation  
- RDD transformation  
- RDD action (e.g., compute a result)  

Spark “magically” handles distributing the data and code across the cluster and parallelizing the operations.

Spark is lazy: it doesn’t actually do any work until it encounters an *action*, for example a `count()`.  
Until then, it only records the requested transformations as a plan or roadmap, which lets it optimize how the result is computed.

When testing/debugging code, it can be helpful to call `count()` to force Spark to evaluate results.
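The lazy behavior described above can be illustrated without a cluster. The following is a plain-Python analogy built on a lazy iterator, not Spark code, and the names `log`, `traced_double`, and `pipeline` are made up for this sketch. As with an RDD transformation, building the pipeline runs nothing; the work happens only when a result is requested.

```python
# Plain-Python analogy for lazy evaluation (this does NOT use Spark).
log = []  # records each time the "transformation" function actually runs


def traced_double(x):
    """Double x and note that the work happened."""
    log.append(x)
    return 2 * x


data = [1, 2, 3, 4]

# Like a transformation: map() returns a lazy iterator, so nothing runs yet.
pipeline = map(traced_double, data)
calls_before_action = len(log)

# Like an action: consuming the iterator forces the computation.
result = list(pipeline)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)  # 0 4 [2, 4, 6, 8]
```

In this sketch `list(pipeline)` plays the role that `count()` or `collect()` plays for an RDD.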

A *transformation* returns a new RDD.  
An *action* returns some other data type, such as a value delivered to the driver.  

RDDs are created by:  
1. Loading an external dataset (`textFile()`, for example)
2. Distributing a collection of objects from the driver program (`parallelize()`)


**Example of Transformation: Filter on text**

In [None]:
# Assumes `lines` is an existing RDD of strings (for example, one loaded with textFile())
pythonLines = lines.filter(lambda line: "Python" in line)
py = pythonLines.collect()

for i, p in enumerate(py):
    print('line: {} text: {}'.format(i,p))

#### Useful Functions on RDDs

Store or “persist” an RDD, so that it is not recomputed each time an action uses it, by calling  

`RDD.persist()` 

`cache()` is the same as `persist()` with the default storage level

`collect()`  
Retrieve the entire RDD on the driver.  
Be careful with large RDDs, as the results need to fit in memory on a single machine!

`take()`  
Retrieve a small number of elements from the RDD (the user specifies how many).  
NOTE: the values may NOT be returned in the order you expect

`saveAsTextFile()`, `saveAsSequenceFile()`, `…`  
Save the contents of an RDD to storage. The function to call depends on the file format.


#### Basic Transformations 

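`map()`  
Apply a function to each element of the RDD, returning a new RDD of the results  

`filter()`  
Return a new RDD containing only the elements that meet a condition

`parallelize()`  
Distribute a local collection from the driver to the workers, creating an RDD (strictly an RDD creation method on the SparkContext rather than a transformation)  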


In [None]:
nums = sc.parallelize([1,2,3,4])
nums

`flatMap()`  
Apply a function that returns a sequence for each element, then flatten the results into a single RDD of elements (e.g., tokenize sentences into words)  

**Learning Spark** page 36 shows the difference between `map()` and `flatMap()`
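As a rough illustration of that difference, here is a plain-Python sketch of the two semantics. It does not use Spark, and the `sentences` data and the `tokenize` helper are made up for this example.

```python
# Plain-Python sketch of map() versus flatMap() semantics (this does NOT use Spark).
from itertools import chain

sentences = ['hello world', 'big data analytics']


def tokenize(sentence):
    """Split a sentence into a list of words."""
    return sentence.split(' ')


# map(): one output element per input element, so the result is a list of lists.
mapped = [tokenize(s) for s in sentences]

# flatMap(): the per-element lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(tokenize(s) for s in sentences))

print(mapped)       # [['hello', 'world'], ['big', 'data', 'analytics']]
print(flat_mapped)  # ['hello', 'world', 'big', 'data', 'analytics']
```

On an RDD of sentences, the corresponding calls would be `rdd.map(tokenize)` and `rdd.flatMap(tokenize)`.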

#### Set Operations

In [None]:
list1 = sc.parallelize(['cat','dog','baby'])
list2 = sc.parallelize(['giraffe','baby'])
list1.union(list2).collect()

Notice this does not filter duplicates  

Also notice we can “chain” or “pipeline” commands in sequence  

Let’s get the distinct list from the union:

In [None]:
list1.union(list2).distinct().collect()

NOTE: `distinct()` is expensive as it requires shuffling all data over the network  

Shuffling: the process of redistributing data across partitions  
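To see why removing duplicates needs a shuffle, note that copies of the same value may sit in different partitions, so they have to be brought together before they can be dropped. The following plain-Python sketch mimics that idea. It does not use Spark and is not how Spark implements it; the two hand-made partitions and the `shuffle_by_hash` helper are assumptions made for illustration.

```python
# Plain-Python sketch of a shuffle (this does NOT use Spark).
# Two hand-made "partitions"; 'baby' appears in both of them.
partitions = [['cat', 'dog', 'baby'], ['giraffe', 'baby']]


def shuffle_by_hash(parts, num_partitions):
    """Redistribute records so that equal values land in the same partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in parts:
        for record in part:
            shuffled[hash(record) % num_partitions].append(record)
    return shuffled


shuffled = shuffle_by_hash(partitions, 2)

# After the shuffle, each partition can drop its duplicates independently.
distinct_parts = [sorted(set(part)) for part in shuffled]
distinct_values = sorted(v for part in distinct_parts for v in part)

print(distinct_values)  # ['baby', 'cat', 'dog', 'giraffe']
```

Every record may have to move to another partition, which on a real cluster means traffic over the network.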

Page 38 of **Learning Spark** summarizes the important basic RDD transformations


#### Actions

`reduce()`  
Combine the elements of an RDD using a function that takes two elements and returns a new element of the same type (e.g., add all the elements together)

In [None]:
l1 = sc.parallelize([1,2,3,4])
# Named `total` to avoid shadowing Python's built-in sum()
total = l1.reduce(lambda x,y: x+y)

In [None]:
print('sum: {}'.format(total))
print('l1 type: {}'.format(type(l1)))
print('sum type: {}'.format(type(total)))

`fold()`  
Similar to `reduce()`, but also takes a “zero value” that acts as the identity for the operation (e.g., 0 for addition, 1 for multiplication)  
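Why the zero value should be an identity becomes clearer when the partitions are written out by hand. The sketch below is plain Python, not Spark; `fold_partitions` is a hypothetical helper that mimics the documented behavior of folding each partition from the zero value and then folding the per-partition results from the zero value again.

```python
# Plain-Python sketch of fold() over partitions (this does NOT use Spark).
from functools import reduce
from operator import add


def fold_partitions(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)


# The elements 1..4 split across two hand-made "partitions".
partitions = [[1, 2], [3, 4]]

# 0 is the identity for addition, so the result is the plain sum.
sum_with_identity = fold_partitions(partitions, 0, add)

# 1 is NOT an identity for addition: it is added once per partition and once
# more in the final merge, so the result depends on the number of partitions.
sum_with_non_identity = fold_partitions(partitions, 1, add)

print(sum_with_identity, sum_with_non_identity)  # 10 13
```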

`aggregate()`  
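Similar to `reduce()` and `fold()`, but the result may have a different type than the elements. It uses:  
1. an initial (“zero”) value 
2. a combining function that folds elements into an accumulator within each partition (on each worker or node)
3. a combining function that merges the accumulators across workers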

Page 40 of **Learning Spark** has a good code example
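For a version that runs without a cluster, the three ingredients of `aggregate()` can be mimicked in plain Python. This sketch does not use Spark; `aggregate_partitions`, `seq_op`, and `comb_op` are names chosen for this illustration, and the partitions are made up. It computes an average by carrying a `(sum, count)` pair, which is a different type from the integer elements.

```python
# Plain-Python sketch of aggregate() over partitions (this does NOT use Spark).
from functools import reduce


def aggregate_partitions(partitions, zero, seq_op, comb_op):
    """Apply seq_op within each partition, then merge the results with comb_op."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)


def seq_op(acc, x):
    """Fold one element into a (sum, count) accumulator."""
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    """Merge two (sum, count) accumulators from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


# The elements 1..4 split across two hand-made "partitions".
partitions = [[1, 2], [3, 4]]

total, count = aggregate_partitions(partitions, (0, 0), seq_op, comb_op)
average = total / count

print(total, count, average)  # 10 4 2.5
```

On an RDD, the corresponding call would be `rdd.aggregate((0, 0), seq_op, comb_op)`.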

`countByValue()`  
Return the number of times each distinct value occurs in the RDD, as a dictionary on the driver

In [None]:
nums = sc.parallelize([1,2,3,3,4])
cv = nums.countByValue()

In [None]:
print(type(cv))  # <class 'collections.defaultdict'>
print('cv[1]: {}'.format(cv[1]))
print('cv[3]: {}'.format(cv[3]))