## Key Value Pairs

### University of California, Santa Barbara  
### PSTAT 135/235  
### Last Updated: Oct 23, 2018

---  

### Sources 

1. Learning Spark

### OBJECTIVES
1. Learn about properties and methods for pair RDDs


### CONCEPTS AND FUNCTIONS
- Pair RDDs  
- Partition  
- reduceByKey(), groupByKey(), combineByKey(), sortByKey()  
- mapValues(), flatMapValues()  
- keys(), values()  
- join(), subtractByKey(), rightOuterJoin(), leftOuterJoin(), cogroup()  
- countByKey()  
- collectAsMap()  
- lookup()  
- groupWith()  

---  

### PAIR RDD BASICS

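A pair RDD is an RDD whose elements are key/value tuples. Unlike a Python dictionary, the same key may appear more than once  

Useful for merging and aggregating data by key  

Calling map() with a function that returns a (key, value) tuple produces a pair RDD  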

In [15]:
import os

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()

sc = spark.sparkContext

In [4]:
lines = sc.parallelize(['french fries','chicken burrito'])

In [11]:
# collect() returns a Python list, so p is a list of (key, value) tuples, not an RDD
p = lines.map(lambda x: (x.split(" ")[0], x)).collect()

In [None]:
p

In [None]:
type(p)

Page 49 of Learning Spark lists the transformations on pair RDDs  

Some examples:

In [12]:
rdd = sc.parallelize([(1,2),(3,4),(3,6)])

In [None]:
# Extract the keys
rdd.keys().collect()

**Basic Transformations**

fold()  
Similar to reduce(), but takes a “zero value” that must be the identity element for the operation (e.g., 0 for addition, 1 for multiplication)
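
To see why the zero value should be an identity, here is a minimal plain-Python sketch of how fold() behaves. It is not Spark's implementation: `fold_sketch` and the two-way split of the data are hypothetical, and the sketch assumes the zero value is applied once in each partition and once more when the partial results are merged.

```python
from functools import reduce
from operator import add


def fold_sketch(partitions, zero, op):
    """Fold each partition starting from `zero`, then fold the partial results."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


# Hypothetical split of [1, 2, 3, 4] across two partitions
partitions = [[1, 2], [3, 4]]

total = fold_sketch(partitions, 0, add)     # 0 is the identity for +, so this is the plain sum
inflated = fold_sketch(partitions, 1, add)  # 1 is not: it is added once per partition and once more

print(total, inflated)
```

With a non-identity zero value the result depends on the number of partitions, which is why the zero value should leave the operation unchanged.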

reduceByKey()  
Runs several parallel reduce operations, one per key, and returns a new RDD of (key, reduced value) pairs  
Values for each key are first combined locally on each machine, and the partial results are then combined globally for that key
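
The two-stage combining can be sketched in plain Python. This only illustrates the idea and is not Spark's implementation: the helper names `combine_locally` and `reduce_by_key_sketch` and the split into two partitions are assumptions made for the example.

```python
def combine_locally(partition, func):
    """Stage 1: reduce the values of each key within a single partition."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def reduce_by_key_sketch(partitions, func):
    """Stage 2: merge the per-partition results into one value per key."""
    merged = {}
    for partition in partitions:
        for key, value in combine_locally(partition, func).items():
            merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


# Hypothetical layout of [(1,2),(3,4),(3,6)] over two partitions
partitions = [[(1, 2), (3, 4)], [(3, 6)]]
result = reduce_by_key_sketch(partitions, lambda x, y: x + y)
print(result)
```

Because each partition is reduced before anything is merged, only one value per key per partition has to be sent over the network.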


**Reduce (sum) by keys**

In [None]:
rdd.reduceByKey(lambda x,y: x+y) \
   .collect()

**Revisiting Word Count**

In [17]:
lines = sc.textFile('/home/jovyan/work/data/README.md')

In [20]:
wordcounts = lines.map(lambda x: x.replace(',',' ') \
                        .replace('.','   ').replace('-',' ').lower()) \
                        .flatMap(lambda x: x.split()) \
                        .map(lambda x: (x, 1)) \
                        .reduceByKey(lambda x,y:x+y) \
                        .map(lambda x:(x[1],x[0])) \
                        .sortByKey(False) 

In [None]:
wordcounts.take(10)

**Finding Frequent Word Bigrams**

In [22]:
bigrams = lines \
            .map(lambda x: x.split()) \
            .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])

In [None]:
bigrams.collect()

In [26]:
# including a reducer and sort by key

bigrams = lines \
          .map(lambda x: x.split()) \
          .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])\
          .reduceByKey(lambda x,y: x+y) \
          .map(lambda x: (x[1],x[0])) \
          .sortByKey(False)

In [None]:
bigrams.take(10)

**Partition**  

The number of partitions determines the degree of parallelism when executing operations on an RDD.  
Most operators in this chapter accept an optional parameter for the number of partitions.  

Example: rdd.reduceByKey(lambda x, y: x + y, 10) uses 10 partitions
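
As a rough illustration of what the partition count controls, the sketch below assigns pairs to partitions by hashing the key. It assumes a hash-based partitioner; `assign_partitions` is a hypothetical helper, and Python's built-in `hash()` stands in for whatever hash function Spark actually uses.

```python
def assign_partitions(pairs, num_partitions):
    """Place each (key, value) pair in the bucket given by hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


pairs = [(1, 'a'), (2, 'b'), (3, 'c'), (1, 'd')]
buckets = assign_partitions(pairs, 2)

# Every pair with the same key lands in the same bucket
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Each bucket can be processed by a separate task, so more partitions allow more work to run at once, at the cost of more scheduling overhead.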

**Join**  
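join() is an inner join: only keys present in both pair RDDs are kept  
leftOuterJoin() keeps every key from the source (left) RDD; values missing from the other RDD are None  
rightOuterJoin() keeps every key from the other (right) RDD; values missing from the source RDD are None  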
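
The difference between the three joins can be sketched on plain Python lists of pairs. The functions below are hypothetical stand-ins that only mimic the result shape, (key, (left value, right value)); the sample data is made up, and Spark does not guarantee the output order shown here.

```python
def inner_join(left, right):
    """Keep only keys that appear on both sides."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


def left_outer_join(left, right):
    """Keep every left pair; use None when the right side has no match."""
    out = []
    for k, v in left:
        matches = [w for k2, w in right if k2 == k]
        out.extend((k, (v, w)) for w in matches or [None])
    return out


def right_outer_join(left, right):
    """Keep every right pair; use None when the left side has no match."""
    out = []
    for k, w in right:
        matches = [v for k2, v in left if k2 == k]
        out.extend((k, (v, w)) for v in matches or [None])
    return out


left = [(1, 2), (3, 4), (3, 6)]
right = [(3, 9), (5, 7)]

print(inner_join(left, right))
print(left_outer_join(left, right))
print(right_outer_join(left, right))
```

Note that the `matches or [None]` shortcut assumes an empty match list means "no match", which holds here because matches are collected per key.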

**Sorting**  
sortByKey() takes a parameter for the sort direction (ascending=True by default).  
In PySpark, a key function (keyfunc) can be provided for custom sorting.  
Example of sorting integer keys as strings, so that they are compared lexicographically:  
rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))  
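
The effect of converting keys to strings before comparing can be previewed with Python's built-in `sorted()`, which takes a key function in the same spirit. This is a local sketch on a made-up list of pairs, not a Spark job.

```python
pairs = [(10, 'a'), (2, 'b'), (1, 'c'), (20, 'd')]

# Keys compared as integers
numeric_order = sorted(pairs, key=lambda kv: kv[0])

# Keys compared as strings: '1' < '10' < '2' < '20'
string_order = sorted(pairs, key=lambda kv: str(kv[0]))

# Descending numeric order, analogous to passing False for the sort direction
descending = sorted(pairs, key=lambda kv: kv[0], reverse=True)

print(numeric_order)
print(string_order)
print(descending)
```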

**Actions on Pair RDDs**  
All actions available on base RDDs are also available on pair RDDs  
Plus some additional ones, such as:  
countByKey()  
collectAsMap()  
lookup(key)  

In [28]:
rdd = sc.parallelize([(1,2),(3,4),(3,6)])

In [None]:
rdd.countByKey()

In [None]:
rdd.lookup(3)
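
For reference, the results of these three actions on the same data can be reproduced in plain Python. This is a sketch of the semantics only; in particular, it assumes that when a key is repeated, collectAsMap() keeps just one of its values, as converting the list of pairs to a dictionary does here.

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6)]

# countByKey(): number of elements for each key
count_by_key = dict(Counter(key for key, _ in pairs))

# collectAsMap(): a dictionary, so a repeated key keeps a single value
collect_as_map = dict(pairs)

# lookup(3): all values associated with key 3
lookup_3 = [value for key, value in pairs if key == 3]

print(count_by_key)
print(collect_as_map)
print(lookup_3)
```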