#### Creating Pair RDDs

In [3]:

lines = sc.textFile('file:////home/cloudera/Desktop/pyspark/README.txt')

In [2]:
lines.take(2)

[u'This is a sample file',
 u'Python is a langauge developed by Guido Van Rosam']

In [4]:
# Take the 3 longest lines (a negated length sorts in descending order)
lines.takeOrdered(3, key=lambda x: -len(x))

[u'Python is a langauge developed by Guido Van Rosam',
 u'This is a sample file',
 u'']

In [12]:
# Create a pair RDD using the first word of each line as the key
pairRDD = lines.map(lambda x: (x.split(" ")[0], x))

In [13]:
pairRDD.collect()

[(u'This', u'This is a sample file'),
 (u'Python', u'Python is a langauge developed by Guido Van Rosam'),
 (u'', u'')]

Important note: when using the standard transformations and actions on a pair RDD, keep in mind that each element is a <b>tuple</b>, so the function receives the whole (key, value) pair.<br>
When using the pair-specific transformations and actions, Spark handles the keys for us, and the function typically receives only the values.
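
To make the distinction concrete, here is a plain-Python sketch (no Spark required) of the two styles. The helpers `map_elements` and `map_values` are hypothetical stand-ins for `map` and `mapValues`; they only model the behavior on a local list and are not part of the PySpark API.

```python
# Plain-Python model; `map_elements` and `map_values` are hypothetical helpers.

def map_elements(pairs, func):
    """Model of a standard transformation: func receives the whole (key, value) tuple."""
    return [func(pair) for pair in pairs]


def map_values(pairs, func):
    """Model of a pair transformation: func receives only the value; the key is kept."""
    return [(key, func(value)) for key, value in pairs]


pairs = [(1, 2), (3, 4), (3, 6)]

# Standard style: the lambda has to unpack and rebuild the tuple itself.
via_map = map_elements(pairs, lambda kv: (kv[0], kv[1] + 1))

# Pair style: the lambda only ever sees the value.
via_map_values = map_values(pairs, lambda v: v + 1)

print(via_map)         # [(1, 3), (3, 5), (3, 7)]
print(via_map_values)  # [(1, 3), (3, 5), (3, 7)]
```

On a pair RDD named `rdd`, the corresponding PySpark calls would be `rdd.map(lambda kv: (kv[0], kv[1] + 1))` and `rdd.mapValues(lambda v: v + 1)`.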

<b>reduceByKey(func)</b>

In [2]:
sampleRDD = sc.parallelize([(1,2),(3,4),(3,6)])

In [17]:
# Combine values that share the same key
# Here: multiply the values of each key
sampleRDD.reduceByKey(lambda x,y:x*y).collect()

[(1, 2), (3, 24)]

In [3]:
#Addition of values with the same keys
sampleRDD.reduceByKey(lambda x,y:x+y).collect()

[(1, 2), (3, 10)]

In [8]:
#Adding all the keys and the values separately
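# Index-based access works on both Python 2 and Python 3
# (tuple parameters in lambdas were removed in Python 3)
sampleRDD.reduce(lambda p, q: (p[0] + q[0], p[1] + q[1]))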

(7, 12)

In [10]:
#Adding all the keys and values
# The accumulator's first slot collects key + value; the second slot stays 0
sampleRDD.fold((0, 0), lambda acc, kv: (acc[0] + kv[0] + kv[1], acc[1]))

(19, 0)
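
The result `(19, 0)` is easier to follow with a plain-Python model of fold. The sketch assumes that fold first folds each partition starting from the zero value and then folds the per-partition results in the same way. `fold_model` and the two-partition split are hypothetical, chosen only for illustration.

```python
from functools import reduce


def fold_model(partitions, zero, op):
    """Fold each partition from `zero`, then fold the partial results from `zero`."""
    partials = []
    for part in partitions:
        acc = zero
        for item in part:
            acc = op(acc, item)
        partials.append(acc)
    return reduce(op, partials, zero)


def add_key_and_value(acc, kv):
    # First slot accumulates key + value; second slot is passed through unchanged.
    return (acc[0] + kv[0] + kv[1], acc[1])


# Hypothetical split of the three pairs across two partitions.
partitions = [[(1, 2)], [(3, 4), (3, 6)]]

fold_result = fold_model(partitions, (0, 0), add_key_and_value)
print(fold_result)  # (19, 0)
```

The same function is used both for adding an element to the accumulator and for merging two partial results. That works here only because every partial result has `0` in its second slot, so adding "key + value" of a partial result adds just its running total.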

Note that reduce returns a plain Python value rather than an RDD, whereas reduceByKey returns an RDD. Hence reduce is an action, whereas reduceByKey is a transformation.
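
The difference in result shape can be sketched in plain Python. `reduce_by_key` is a hypothetical helper that models only the per-key merging; it does not model Spark's lazy evaluation or its distribution across partitions.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Merge the values of each key with `func`; returns one (key, value) pair per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


pairs = [(1, 2), (3, 4), (3, 6)]

# Per-key reduction: still a collection of pairs (an RDD in Spark).
per_key_sum = reduce_by_key(pairs, lambda x, y: x + y)

# Whole-collection reduction: a single value (returned to the driver in Spark).
total = reduce(lambda p, q: (p[0] + q[0], p[1] + q[1]), pairs)

print(per_key_sum)  # [(1, 2), (3, 10)]
print(total)        # (7, 12)
```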

<b>groupByKey()</b>

In [22]:
# groupByKey() yields (key, ResultIterable) pairs; mapValues(list) makes the groups readable
sampleRDD.groupByKey().mapValues(list).collect()

[(1, [2]), (3, [4, 6])]
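
In PySpark, groupByKey() yields each group as a `ResultIterable` rather than a list; converting it, for example with `mapValues(list)`, is a common way to inspect the contents. The grouping itself can be modeled in plain Python. `group_by_key` below is a hypothetical helper, not a PySpark function.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


grouped = group_by_key([(1, 2), (3, 4), (3, 6)])
print(grouped)  # [(1, [2]), (3, [4, 6])]
```

Unlike reduceByKey, nothing is combined here: every value is kept, so the groups can grow large for keys that occur often.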

<b>mapValues(func)</b>

In [24]:
sampleRDD.mapValues(lambda x :x+1).collect()

[(1, 3), (3, 5), (3, 7)]

<b>keys()</b>

In [26]:
sampleRDD.keys().collect()

[1, 3, 3]

<b>values()</b>

In [8]:
valuesRDD = sampleRDD.values()
valuesRDD.collect()

[2, 4, 6]

Note: keys() and values() are transformations, not actions; each returns a new RDD.

<b>flatMapValues(func)</b>

In [7]:
sampleRDD.flatMapValues(lambda x:range(x)).collect()

[(1, 0),
 (1, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (3, 3),
 (3, 0),
 (3, 1),
 (3, 2),
 (3, 3),
 (3, 4),
 (3, 5)]

In [29]:
pairRDD.flatMapValues(lambda x:x.split(" ")).collect()

[(u'This', u'This'),
 (u'This', u'is'),
 (u'This', u'a'),
 (u'This', u'sample'),
 (u'This', u'file'),
 (u'Python', u'Python'),
 (u'Python', u'is'),
 (u'Python', u'a'),
 (u'Python', u'langauge'),
 (u'Python', u'developed'),
 (u'Python', u'by'),
 (u'Python', u'Guido'),
 (u'Python', u'Van'),
 (u'Python', u'Rosam'),
 (u'', u'')]
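
Both outputs above follow the same rule: the function is applied to each value, and every item it produces is paired with the original key. A minimal plain-Python sketch of that rule, using a hypothetical helper `flat_map_values`:

```python
def flat_map_values(pairs, func):
    """Apply `func` to each value and emit one (key, item) pair per produced item."""
    return [(key, item) for key, value in pairs for item in func(value)]


# A value of 2 expands to 0, 1; a value of 4 expands to 0..3; and so on.
expanded = flat_map_values([(1, 2), (3, 4), (3, 6)], lambda x: range(x))

# A sentence expands to one pair per word, all sharing the sentence's key.
words = flat_map_values([("This", "This is a sample file")], lambda s: s.split(" "))

print(expanded[:2])  # [(1, 0), (1, 1)]
print(words)
```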

<b>sortByKey()</b>

In [10]:
sortedRDD = sampleRDD.sortByKey()
sortedRDD.collect()

[(1, 2), (3, 4), (3, 6)]

### Transformations on two pair RDDs

In [18]:
otherRDD = sc.parallelize([(3,9),(1,2),(3,11),(3,745)])

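<b>subtract() and subtractByKey()</b>

In [13]:
# subtract() removes only the elements whose whole (key, value) tuple appears in otherRDD
sampleRDD.subtract(otherRDD).collect()

[(3, 4), (3, 6)]

In [14]:
# subtractByKey() removes every element whose key appears in otherRDD
sampleRDD.subtractByKey(otherRDD).collect()

[]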
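
The two operations differ in what they compare: subtract matches on the whole (key, value) tuple, while subtractByKey matches on the key alone. The plain-Python sketch below models this with two hypothetical helpers and a hypothetical right-hand list that contains only the pair `(3, 9)`.

```python
def subtract(left, right):
    """Keep the pairs of `left` whose full (key, value) tuple is absent from `right`."""
    right_pairs = set(right)
    return [pair for pair in left if pair not in right_pairs]


def subtract_by_key(left, right):
    """Keep the pairs of `left` whose key does not occur in `right` at all."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]


left = [(1, 2), (3, 4), (3, 6)]
right = [(3, 9)]  # hypothetical data, chosen to make the contrast visible

whole_tuple_diff = subtract(left, right)
key_only_diff = subtract_by_key(left, right)

print(whole_tuple_diff)  # [(1, 2), (3, 4), (3, 6)]  no tuple equals (3, 9)
print(key_only_diff)     # [(1, 2)]                  key 3 is removed entirely
```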

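<b>join()</b> performs an inner join: only keys present in both RDDs are kept, and each result pairs one value from the left RDD with one value from the right RDD.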

In [19]:
sampleRDD.join(otherRDD).collect()

[(1, (2, 2)),
 (3, (4, 9)),
 (3, (4, 11)),
 (3, (4, 745)),
 (3, (6, 9)),
 (3, (6, 11)),
 (3, (6, 745))]

In [17]:
otherRDD.join(sampleRDD).collect()

[(1, (2, 2)),
 (3, (9, 4)),
 (3, (9, 6)),
 (3, (11, 4)),
 (3, (11, 6)),
 (3, (745, 4)),
 (3, (745, 6))]
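
An inner join emits one result per matching combination of left and right values, which is why a key with two values on one side and three on the other yields six rows. The nested-loop sketch below is only a plain-Python model with a hypothetical helper `inner_join`; it says nothing about how Spark actually executes a join.

```python
def inner_join(left, right):
    """Pair every left value with every right value that has the same key."""
    return [
        (left_key, (left_value, right_value))
        for left_key, left_value in left
        for right_key, right_value in right
        if left_key == right_key
    ]


left = [(1, 2), (3, 4), (3, 6)]
right = [(3, 9), (1, 2), (3, 11), (3, 745)]

joined = inner_join(left, right)
swapped = inner_join(right, left)

print(len(joined))  # 7: one match for key 1, plus 2 * 3 matches for key 3
print(joined[0])    # (1, (2, 2))
```

Swapping the two inputs keeps the number of rows the same and only reverses the order inside each value tuple.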