# Key-Value RDD


In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('example rdd').master('local[*]').getOrCreate()

In [6]:
words = "my name is shreyas is name".split(" ")
rdd = spark.sparkContext.parallelize(words)
print(rdd.collect())

# Create a key-value RDD from the RDD above by pairing each word with the value 1
rdd1 = rdd.map(lambda word:(word,1))
print(rdd1.collect())

['my', 'name', 'is', 'shreyas', 'is', 'name']
[('my', 1), ('name', 1), ('is', 1), ('shreyas', 1), ('is', 1), ('name', 1)]


In [12]:
# Spark also provides methods to create keys and to transform values.
# keyBy(func) - creates a key for each element of the original RDD by applying func to it; the element itself becomes the value.

rdd2 = rdd.keyBy(lambda word:word[0].upper())
print(rdd2.collect())

# To change only the values in the RDD, use mapValues().
# mapValues(func) - passes each value of the key-value RDD through func without changing the keys.

rdd3 = rdd2.mapValues(lambda word: word.upper())
# Printing an RDD without collect() shows only its description, not its elements.
print(rdd3)

[('M', 'my'), ('N', 'name'), ('I', 'is'), ('S', 'shreyas'), ('I', 'is'), ('N', 'name')]
PythonRDD[11] at RDD at PythonRDD.scala:53
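
To make the semantics of keyBy and mapValues concrete, here is a plain-Python sketch that mimics them on a local list. The helpers `key_by` and `map_values` are hypothetical names used only for illustration and are not part of PySpark; a real RDD applies the same logic lazily and per partition.

```python
words = "my name is shreyas is name".split(" ")

def key_by(items, func):
    # Hypothetical helper: pair each element with a key computed from it,
    # mirroring the shape of the pairs that rdd.keyBy(func) produces.
    return [(func(item), item) for item in items]

def map_values(pairs, func):
    # Hypothetical helper: transform only the value of each pair,
    # mirroring rdd.mapValues(func); keys are left untouched.
    return [(key, func(value)) for key, value in pairs]

keyed = key_by(words, lambda word: word[0].upper())
upper = map_values(keyed, lambda word: word.upper())

print(keyed)
print(upper)
```

The second list is what `rdd3.collect()` would contain, which is why the keys and values printed in the next cell are all upper case.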


In [14]:
# Get all keys and all values from an RDD

print(rdd3.keys().collect())
print(rdd3.values().collect())

# lookup(key) returns a list of all values associated with the given key
print(rdd3.lookup('N'))

['M', 'N', 'I', 'S', 'I', 'N']
['MY', 'NAME', 'IS', 'SHREYAS', 'IS', 'NAME']
['NAME', 'NAME']


# Aggregations

In [16]:
# countByKey() - counts the number of elements for each key and returns the result to the driver as a local dictionary

display(rdd3.countByKey())

# The result is a Python defaultdict, not an RDD

defaultdict(int, {'M': 1, 'N': 2, 'I': 2, 'S': 1})
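
As a rough illustration of what countByKey computes, the sketch below counts keys over a local list of pairs in plain Python. It is only a model of the result, not Spark's implementation; the `pairs` list is copied from the contents of `rdd3` shown earlier.

```python
from collections import defaultdict

# Same (key, value) pairs that rdd3 holds in the cells above.
pairs = [('M', 'MY'), ('N', 'NAME'), ('I', 'IS'),
         ('S', 'SHREYAS'), ('I', 'IS'), ('N', 'NAME')]

counts = defaultdict(int)
for key, _value in pairs:
    # Values are ignored; only the number of occurrences of each key matters.
    counts[key] += 1

print(dict(counts))
```

Because countByKey is an action that returns the whole dictionary to the driver, it is best suited to RDDs whose number of distinct keys is small enough to fit in driver memory.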

groupByKey() is a wide transformation: unless the data is already partitioned by key, it shuffles data across the executors. It takes key-value pairs (K, V) as input, groups the values by key (K), and in PySpark returns an RDD of (K, Iterable) pairs.
It is an expensive operation and can consume a lot of memory on large datasets, because all values for a key must be brought together on one executor. When the goal is an aggregate such as a sum or a count, prefer reduceByKey() or aggregateByKey().

https://sparkbyexamples.com/spark/spark-groupbykey/

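reduceByKey() is a transformation that merges the values of each key using an associative and commutative reduce function. It combines values within each partition before the shuffle, so less data moves across the network than with groupByKey().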
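
The difference between the two can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's actual shuffle: the two-partition layout, the sample data, and the helpers `group_by_key` and `reduce_by_key` are all hypothetical and exist only to count how many records would cross the shuffle boundary.

```python
from collections import defaultdict

# Hypothetical layout: one RDD split into two partitions.
partitions = [
    [("key1", 1), ("key2", 2), ("key1", 3)],
    [("key2", 4), ("key1", 5)],
]

def group_by_key(partitions):
    # Model of groupByKey: every record is shuffled as-is, then grouped.
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)

def reduce_by_key(partitions, func):
    # Model of reduceByKey: combine inside each partition first,
    # so at most one record per key per partition is shuffled.
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)

groups, moved_by_group = group_by_key(partitions)
sums_via_group = {key: sum(values) for key, values in groups.items()}
sums_via_reduce, moved_by_reduce = reduce_by_key(partitions, lambda a, b: a + b)

print(sums_via_group, moved_by_group)
print(sums_via_reduce, moved_by_reduce)
```

Both paths produce the same sums, but the reduce-style path moves fewer records in this model. The gap grows with the number of repeated keys per partition, which is the main reason to prefer reduceByKey for aggregations.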




In [31]:
rdd4=spark.sparkContext.parallelize((("key1", 1), ("key2", 2), ("key1", 3), ("key2", 4)))
#print(rdd.collect())

rdd5= rdd4.groupByKey()

print(rdd5.collect())
# The result pairs each key with an iterable over its values. Further functions can be applied to that iterable, e.g. rdd5.mapValues(list) to see the grouped values.


[('key2', <pyspark.resultiterable.ResultIterable object at 0x00000222E8AB4AD0>), ('key1', <pyspark.resultiterable.ResultIterable object at 0x00000222E8AB4B10>)]


In [32]:
rdd6=rdd4.reduceByKey(lambda a,b:a+b)
print(rdd6.collect())

# reduceByKey is simpler to use here: it directly returns the sum of the values for each key

[('key2', 6), ('key1', 4)]


In [36]:
# Joins on key-value RDDs: join() combines the values of matching keys; keys present in only one RDD are dropped
a_rdd = spark.sparkContext.parallelize((("key1", 1), ("key2", 2), ("key4", 4)))
b_rdd = spark.sparkContext.parallelize((("key1", 100), ("key2", 200), ("key3", 300),))

a_rdd.join(b_rdd).collect()

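# fullOuterJoin, leftOuterJoin and rightOuterJoin are called the same way; they keep unmatched keys and use None for the missing side.
# cartesian is different: it ignores keys and pairs every element of one RDD with every element of the other.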


[('key2', (2, 200)), ('key1', (1, 100))]
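
To show how the outer join variants differ from the inner join above, here is a plain-Python sketch over the same two datasets. The helper `join_pairs` and its `how` argument are hypothetical and only model the result shape; PySpark exposes these as separate RDD methods (join, leftOuterJoin, rightOuterJoin, fullOuterJoin) and does not guarantee output order, whereas this sketch sorts keys for readability.

```python
a = [("key1", 1), ("key2", 2), ("key4", 4)]
b = [("key1", 100), ("key2", 200), ("key3", 300)]

def join_pairs(left, right, how="inner"):
    # Hypothetical helper modelling the four key-based joins.
    left_keys = {key for key, _ in left}
    right_keys = {key for key, _ in right}
    if how == "inner":
        keys = left_keys & right_keys
    elif how == "left":
        keys = left_keys
    elif how == "right":
        keys = right_keys
    elif how == "full":
        keys = left_keys | right_keys
    else:
        raise ValueError(f"unknown join type: {how}")

    result = []
    for key in sorted(keys):
        # A side with no match contributes a single None placeholder.
        left_vals = [v for k, v in left if k == key] or [None]
        right_vals = [v for k, v in right if k == key] or [None]
        result.extend((key, (lv, rv)) for lv in left_vals for rv in right_vals)
    return result

for how in ("inner", "left", "right", "full"):
    print(how, join_pairs(a, b, how))
```

In this model, "key4" appears only in left and full joins, and "key3" only in right and full joins, each with None standing in for the missing value.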

In [37]:
# zip - pairs the elements of two RDDs by position, producing a pair RDD.
# The two RDDs must have the same number of partitions and the same number of elements in each partition.
# Unlike union, which concatenates the two RDDs, zip combines the i-th element of each RDD into one tuple.
a_rdd.zip(b_rdd).collect()

[(('key1', 1), ('key1', 100)),
 (('key2', 2), ('key2', 200)),
 (('key4', 4), ('key3', 300))]
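
The positional pairing shown above can be compared with concatenation using ordinary Python lists. This is only an analogy for the RDD methods zip and union, with the list contents copied from `a_rdd` and `b_rdd`.

```python
a = [("key1", 1), ("key2", 2), ("key4", 4)]
b = [("key1", 100), ("key2", 200), ("key3", 300)]

# Positional pairing: the i-th element of a with the i-th element of b.
zipped = list(zip(a, b))

# Concatenation: all elements of a followed by all elements of b.
unioned = a + b

print(zipped)
print(unioned)
```

One difference worth keeping in mind: Python's built-in zip silently truncates to the shorter input, while zipping RDDs of mismatched sizes fails at run time.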