## Key Value Pairs

### University of California, Santa Barbara  
### PSTAT 135/235: Big Data Analytics
### Last Updated: January 29, 2019

---  

### Sources 

1. Learning Spark

### OBJECTIVES
1. Learn about properties and methods for pair RDDs


### CONCEPTS AND FUNCTIONS
- Pair RDDs  
- Partition  
- reduceByKey(), groupByKey(), combineByKey(), sortByKey()  
- mapValues(), flatMapValues()  
- keys(), values()  
- join(), subtractByKey(), rightOuterJoin(), leftOuterJoin(), cogroup  
- countByKey()  
- collectAsMap()  
- lookup()  
- groupWith()  

---  

### PAIR RDD BASICS

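A pair RDD is an RDD whose elements are key/value pairs, represented in Python as 2-tuples. Unlike a Python dictionary, the same key may appear more than once.  

Pair RDDs are useful for merging (joining) and aggregating data by key.  

Calling map() with a function that returns a (key, value) tuple produces a pair RDD.  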

In [1]:
import os

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()

sc = spark.sparkContext

In [3]:
lines = sc.parallelize(['french fries','chicken burrito'])

In [4]:
type(lines)

pyspark.rdd.RDD

In [8]:
lines.map(lambda x: x.split(" ")).collect()

[['french', 'fries'], ['chicken', 'burrito']]

In [9]:
lines.flatMap(lambda x: x.split(" ")).collect()

['french', 'fries', 'chicken', 'burrito']

In [15]:
lines.flatMap(lambda x: (x.split(" ")[0], x)).collect()

['french', 'french fries', 'chicken', 'chicken burrito']

In [17]:
lines.map(lambda x: (x.split(" ")[0], x)).collect()

[('french', 'french fries'), ('chicken', 'chicken burrito')]

In [22]:
p = lines.flatMap(lambda x: (x.split(" ")[0], x)).collect()

In [23]:
p

['french', 'french fries', 'chicken', 'chicken burrito']

In [24]:
[w for w in p if 'french' in w]

['french', 'french fries']

Page 49 of Learning Spark lists the transformations available on pair RDDs  

Some examples:

In [26]:
rdd = sc.parallelize([(1,2),(3,4),(3,6)])

In [27]:
rdd

ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:184

In [28]:
# Extract the keys
rdd.keys().collect()

[1, 3, 3]

**Basic Transformations**

fold()  
Similar to reduce(), but takes a “zero value” that must be the identity for the operation (e.g., 0 for addition, 1 for multiplication)
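
To make the role of the zero value concrete, here is a plain-Python sketch of how a fold proceeds. It is an emulation for illustration, not Spark's implementation: the helper `fold_partitions` and the hand-made partition layout are hypothetical. Each partition is folded starting from the zero value, and the per-partition results are then folded again, also starting from the zero value.

```python
from functools import reduce

def fold_partitions(partitions, zero, op):
    """Emulate a fold over a list of partitions (illustrative only)."""
    # Fold each partition locally, starting from the zero value.
    partials = [reduce(op, part, zero) for part in partitions]
    # Fold the per-partition results, again starting from the zero value.
    return reduce(op, partials, zero)

# Hypothetical layout: three partitions, one of them empty.
partitions = [[1, 2], [3], []]

total = fold_partitions(partitions, 0, lambda x, y: x + y)
print(total)   # 6

# A zero value that is not the identity is added once per partition,
# plus once more in the final combine: 6 + 4 * 10 = 46.
skewed = fold_partitions(partitions, 10, lambda x, y: x + y)
print(skewed)  # 46
```

The second call shows why the zero value should be an identity: otherwise the result depends on how many partitions the data happens to have.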

reduceByKey()  
Runs several parallel reduce operations, one for each key  
Values are combined locally on each machine for each key before the global combine for that key is computed
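
The two-stage combine described above can be sketched in plain Python. This is an emulation under stated assumptions, not Spark code: the helpers `combine_locally` and `reduce_by_key` and the partition layout are hypothetical, chosen only to show where the local and global combines happen.

```python
def combine_locally(partition, func):
    """Combine the values of each key within a single partition."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

def reduce_by_key(partitions, func):
    """Merge the per-partition results into one value per key."""
    merged = {}
    for part in partitions:
        for key, value in combine_locally(part, func).items():
            merged[key] = func(merged[key], value) if key in merged else value
    # Sorted only to make the printed result deterministic.
    return sorted(merged.items())

# Hypothetical layout: key 3 appears in both partitions.
partitions = [[(1, 2), (3, 4)], [(3, 6)]]

result = reduce_by_key(partitions, lambda x, y: x + y)
print(result)  # [(1, 2), (3, 10)]
```

Because only one partial value per key leaves each partition, the local combine reduces the amount of data that has to be moved before the global combine.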


**Reduce (sum) by keys**

In [29]:
rdd.reduceByKey(lambda x,y: x+y) \
   .collect()

[(1, 2), (3, 10)]

**Revisiting Word Count**

In [30]:
lines = sc.textFile('/home/jovyan/UCSB_BigDataAnalytics/data/README.md')

In [31]:
lines.take(5)

['# Apache Spark',
 '',
 'Spark is a fast and general cluster computing system for Big Data. It provides',
 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that',
 'supports general computation graphs for data analysis. It also supports a']

In [32]:
type(lines)

pyspark.rdd.RDD

In [33]:
lines.count()

103

In [36]:
wordcounts = lines.map(lambda x: x.replace(',',' ') \
                        .replace('.','   ').replace('-',' ').lower())
wordcounts.take(5)

['# apache spark',
 '',
 'spark is a fast and general cluster computing system for big data    it provides',
 'high level apis in scala  java  python  and r  and an optimized engine that',
 'supports general computation graphs for data analysis    it also supports a']

In [47]:
wordcounts = lines.map(lambda x: x.replace(',',' ') \
                        .replace('.','   ').replace('-',' ').lower()) \
                        .flatMap(lambda x: x.split()) \
                        .map(lambda x: (x, 1))

wordcounts.take(5)

[('#', 1), ('apache', 1), ('spark', 1), ('spark', 1), ('is', 1)]

In [49]:
wordcounts.count()

546

In [50]:
wordcounts = lines.map(lambda x: x.replace(',',' ') \
                        .replace('.','   ').replace('-',' ').lower()) \
                        .flatMap(lambda x: x.split()) \
                        .map(lambda x: (x, 1)) \
                        .reduceByKey(lambda x,y:x+y)

wordcounts.take(5)

[('#', 1), ('apache', 11), ('spark', 19), ('is', 6), ('a', 9)]

In [51]:
type(wordcounts)

pyspark.rdd.PipelinedRDD

In [52]:
wordcounts.count()

270

In [70]:
wordcounts = lines.map(lambda x: x.replace(',',' ') \
                        .replace('.','   ').replace('-',' ').lower()) \
                        .flatMap(lambda x: x.split()) \
                        .filter(lambda x: x == 'apache') \
                        .map(lambda x: (x, 1)) \
                        .reduceByKey(lambda x,y:x+y) \
                        .map(lambda x:(x[1],x[0])) \
                        .sortByKey(False)
wordcounts.take(5)

[(11, 'apache')]

In [18]:
wordcounts = lines.map(lambda x: x.replace(',',' ') \
                        .replace('.','   ').replace('-',' ').lower()) \
                        .flatMap(lambda x: x.split()) \
                        .map(lambda x: (x, 1)) \
                        .reduceByKey(lambda x,y:x+y) \
                        .map(lambda x:(x[1],x[0])) \
                        .sortByKey(False) 

In [19]:
wordcounts.take(10)

[(26, 'the'),
 (19, 'spark'),
 (19, 'to'),
 (15, 'for'),
 (11, 'apache'),
 (10, 'and'),
 (9, 'a'),
 (9, '##'),
 (8, 'you'),
 (7, 'can')]

In [71]:
bigrams = lines \
            .map(lambda x: x.split())
bigrams.take(5)
            

[['#', 'Apache', 'Spark'],
 [],
 ['Spark',
  'is',
  'a',
  'fast',
  'and',
  'general',
  'cluster',
  'computing',
  'system',
  'for',
  'Big',
  'Data.',
  'It',
  'provides'],
 ['high-level',
  'APIs',
  'in',
  'Scala,',
  'Java,',
  'Python,',
  'and',
  'R,',
  'and',
  'an',
  'optimized',
  'engine',
  'that'],
 ['supports',
  'general',
  'computation',
  'graphs',
  'for',
  'data',
  'analysis.',
  'It',
  'also',
  'supports',
  'a']]

In [90]:
lines2 = sc.parallelize(lines.take(1))
lines2.collect()

['# Apache Spark']

In [91]:
bigrams = lines2 \
            .map(lambda x: x.split()) \
            .flatMap(lambda x: [(x[i],x[i+1]) for i in range(0,len(x)-1)])
bigrams.take(5)

[('#', 'Apache'), ('Apache', 'Spark')]

In [92]:
type(bigrams)

pyspark.rdd.PipelinedRDD

In [95]:
bigrams = lines \
            .map(lambda x: x.split()) \
            .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])
bigrams.take(5)

[(('#', 'Apache'), 1),
 (('Apache', 'Spark'), 1),
 (('Spark', 'is'), 1),
 (('is', 'a'), 1),
 (('a', 'fast'), 1)]

In [75]:
for i in range(0,10):
    print(i)

0
1
2
3
4
5
6
7
8
9


In [20]:
bigrams = lines \
            .map(lambda x: x.split()) \
            .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])

In [21]:
bigrams.collect()

[(('#', 'Apache'), 1),
 (('Apache', 'Spark'), 1),
 (('Spark', 'is'), 1),
 (('is', 'a'), 1),
 (('a', 'fast'), 1),
 (('fast', 'and'), 1),
 (('and', 'general'), 1),
 (('general', 'cluster'), 1),
 (('cluster', 'computing'), 1),
 (('computing', 'system'), 1),
 (('system', 'for'), 1),
 (('for', 'Big'), 1),
 (('Big', 'Data.'), 1),
 (('Data.', 'It'), 1),
 (('It', 'provides'), 1),
 (('high-level', 'APIs'), 1),
 (('APIs', 'in'), 1),
 (('in', 'Scala,'), 1),
 (('Scala,', 'Java,'), 1),
 (('Java,', 'Python,'), 1),
 (('Python,', 'and'), 1),
 (('and', 'R,'), 1),
 (('R,', 'and'), 1),
 (('and', 'an'), 1),
 (('an', 'optimized'), 1),
 (('optimized', 'engine'), 1),
 (('engine', 'that'), 1),
 (('supports', 'general'), 1),
 (('general', 'computation'), 1),
 (('computation', 'graphs'), 1),
 (('graphs', 'for'), 1),
 (('for', 'data'), 1),
 (('data', 'analysis.'), 1),
 (('analysis.', 'It'), 1),
 (('It', 'also'), 1),
 (('also', 'supports'), 1),
 (('supports', 'a'), 1),
 (('rich', 'set'), 1),
 (('set', 'of'), 1),
 ...]


In [97]:
bigrams = lines \
          .map(lambda x: x.split()) \
          .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])\
          .reduceByKey(lambda x,y: x+y)
bigrams.take(5)

[(('#', 'Apache'), 1),
 (('Apache', 'Spark'), 1),
 (('Spark', 'is'), 4),
 (('is', 'a'), 1),
 (('a', 'fast'), 1)]

In [98]:
bigrams = lines \
          .map(lambda x: x.split()) \
          .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])\
          .reduceByKey(lambda x,y: x+y) \
          .map(lambda x: (x[1],x[0]))
bigrams.take(5)

[(1, ('#', 'Apache')),
 (1, ('Apache', 'Spark')),
 (4, ('Spark', 'is')),
 (1, ('is', 'a')),
 (1, ('a', 'fast'))]

In [None]:
bigrams.take(10)

In [101]:
# including a reducer and sort by key

bigrams = lines \
          .map(lambda x: x.split()) \
          .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])\
          .reduceByKey(lambda x,y: x+y) \
          .map(lambda x: (x[1],x[0])) \
          .sortByKey(False)
bigrams.take(10)

[(4, ('Spark', 'is')),
 (3, ('You', 'can')),
 (3, ('build', 'Spark')),
 (3, ('in', 'the')),
 (3, ('to', 'run')),
 (3, ('on', 'how')),
 (3, ('how', 'to')),
 (3, ('to', 'the')),
 (2, ('if', 'you')),
 (2, ('Spark', 'using'))]

**Partition**  

The number of partitions determines the degree of parallelism when operating on an RDD.  
Most of the operators in this chapter accept an optional parameter for the number of partitions.  

Example: reduceByKey(lambda x, y: x + y, 10) uses 10 partitions for the result.
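
As a rough illustration of what a partition count means for keyed data, the sketch below assigns each pair to a partition by hashing its key modulo the number of partitions. This is a simplified plain-Python model of hash partitioning, not Spark's own code, and the helper `assign_partitions` is hypothetical.

```python
def assign_partitions(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by its key's hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [(1, 2), (3, 4), (3, 6)]
parts = assign_partitions(pairs, 3)

for index, part in enumerate(parts):
    print(index, part)
# 0 [(3, 4), (3, 6)]
# 1 [(1, 2)]
# 2 []
```

All pairs that share a key land in the same partition, which is what lets a per-key operation finish its work for that key in one place.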

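**Join**  
join() is an inner join: only keys present in both pair RDDs appear in the result  
leftOuterJoin() keeps every key of the source (left) RDD; values missing from the other RDD are None  
rightOuterJoin() keeps every key of the other (right) RDD; values missing from the source RDD are None  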
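
The difference between the three joins can be seen in a small plain-Python emulation. The helper functions and the `right` data set below are hypothetical and only mimic the join semantics on lists of tuples; Spark itself does not guarantee the output order shown here.

```python
def inner_join(left, right):
    """Pair up values for keys that occur on both sides."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def left_outer_join(left, right):
    """Inner join plus left-only keys, padded with None on the right."""
    right_keys = {k for k, _ in right}
    return inner_join(left, right) + [
        (k, (v, None)) for k, v in left if k not in right_keys
    ]

def right_outer_join(left, right):
    """Inner join plus right-only keys, padded with None on the left."""
    left_keys = {k for k, _ in left}
    return inner_join(left, right) + [
        (k, (None, w)) for k, w in right if k not in left_keys
    ]

left = [(1, 2), (3, 4), (3, 6)]
right = [(3, 9), (5, 0)]   # hypothetical second data set

inner = inner_join(left, right)
left_outer = left_outer_join(left, right)
right_outer = right_outer_join(left, right)

print(inner)        # [(3, (4, 9)), (3, (6, 9))]
print(left_outer)   # [(3, (4, 9)), (3, (6, 9)), (1, (2, None))]
print(right_outer)  # [(3, (4, 9)), (3, (6, 9)), (5, (None, 0))]
```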

**Sorting**  
sortByKey() takes a parameter for the sort direction (ascending by default).  
A key function (keyfunc) can be supplied for custom sorting; it is applied to each key before comparison.  
Example of converting integer keys to strings so that they sort as strings:  
rdd.sortByKey(ascending=True, numPartitions=None, keyfunc = lambda x: str(x))  
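
The effect of a string-converting key function is easy to check with Python's built-in `sorted()`, which also accepts a key function. The sketch below uses a hypothetical list of pairs and plain Python rather than Spark, so it only illustrates the ordering, not the distributed sort.

```python
# Hypothetical pairs with integer keys.
pairs = [(10, 'a'), (2, 'b'), (1, 'c'), (20, 'd')]

# Default ordering: keys compared as integers.
numeric = sorted(pairs, key=lambda kv: kv[0])
print(numeric)     # [(1, 'c'), (2, 'b'), (10, 'a'), (20, 'd')]

# Keys converted to strings first: '10' sorts before '2'.
as_strings = sorted(pairs, key=lambda kv: str(kv[0]))
print(as_strings)  # [(1, 'c'), (10, 'a'), (2, 'b'), (20, 'd')]

# Descending order, comparable to passing ascending=False.
descending = sorted(pairs, key=lambda kv: kv[0], reverse=True)
print(descending)  # [(20, 'd'), (10, 'a'), (2, 'b'), (1, 'c')]
```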

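**Actions on Pair RDDs**  
All actions available on base RDDs are also available on pair RDDs  
Plus some additional actions, such as:  
countByKey(): counts the number of elements for each key  
collectAsMap(): collects the result as a dictionary for easy lookup  
lookup(key): returns all values associated with the given key  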

In [28]:
rdd = sc.parallelize([(1,2),(3,4),(3,6)])

In [None]:
rdd.countByKey()

In [None]:
rdd.lookup(3)
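
The cells above were not executed, so as a cross-check, here is a plain-Python emulation of what the three actions compute on the same pairs. This is a sketch of the semantics only and does not call Spark; the variable names are hypothetical.

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6)]

# countByKey(): number of elements per key.
count_by_key = dict(Counter(key for key, _ in pairs))
print(count_by_key)    # {1: 1, 3: 2}

# collectAsMap(): one entry per key; a later value overwrites an earlier one.
collect_as_map = dict(pairs)
print(collect_as_map)  # {1: 2, 3: 6}

# lookup(3): every value stored under key 3.
lookup_3 = [value for key, value in pairs if key == 3]
print(lookup_3)        # [4, 6]
```

Note that the dictionary form keeps only one value for the repeated key 3, whereas the lookup returns both.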