In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PySparkLearning').getOrCreate()

In [2]:
rdd = spark.sparkContext.textFile("../Resources/test.txt")
print("Initial partition count:"+str(rdd.getNumPartitions()))
print('---------------------------')
for element in rdd.collect():
    print(element)

Initial partition count:2
---------------------------
Project Gutenberg’s
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
Project Gutenberg’s




`flatMap` – The flatMap() transformation applies a function to each element and flattens the results into a new RDD. In the example below, it splits each record on spaces and then flattens the resulting lists, so each record of the new RDD holds a single word.
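
The difference between mapping and flat-mapping can be sketched in plain Python, without a Spark session. This is only a local analogue of the RDD behaviour, and the sample `lines` list is an assumption made up from a few lines of the input file plus one blank line.

```python
from itertools import chain

# Sample lines from the input shown above, plus one blank line.
lines = ["by Lewis Carroll", "of anyone anywhere", ""]

# map-like: exactly one result per input line, so the words stay nested.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
# [['by', 'Lewis', 'Carroll'], ['of', 'anyone', 'anywhere'], ['']]
print(flat_mapped)
# ['by', 'Lewis', 'Carroll', 'of', 'anyone', 'anywhere', '']
```

Note that a blank line yields one empty-string token, because `"".split(" ")` returns `['']`. Calling `split()` with no argument returns an empty list for a blank line instead, which would drop such tokens.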

In [3]:
rdd2=rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)

Project
Gutenberg’s
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Project
Gutenberg’s




`map` – The map() transformation applies a function to every element, for example to add or update a field. Its output always has the same number of records as its input.

In our word count example, we pair each word with the value 1. The result is an RDD of key-value tuples, with the word (a str) as the key and 1 (an int) as the value.
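
As a rough local analogue, Python's built-in `map` shows the same one-in, one-out behaviour. The `words` list here is a made-up sample, not the notebook's RDD.

```python
# A small made-up sample of words (an assumption for illustration).
words = ["Project", "by", "Project"]

# Pair every word with 1, mirroring the lambda used on the RDD.
pairs = list(map(lambda word: (word, 1), words))

print(pairs)  # [('Project', 1), ('by', 1), ('Project', 1)]

# One output record per input record, duplicates included.
print(len(pairs) == len(words))  # True
```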

In [4]:
rdd3=rdd2.map(lambda x: (x,1))
for element in rdd3.collect():
    print(element)

('Project', 1)
('Gutenberg’s', 1)
('Alice’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('by', 1)
('Lewis', 1)
('Carroll', 1)
('This', 1)
('eBook', 1)
('is', 1)
('for', 1)
('the', 1)
('use', 1)
('of', 1)
('anyone', 1)
('anywhere', 1)
('at', 1)
('no', 1)
('cost', 1)
('and', 1)
('with', 1)
('Project', 1)
('Gutenberg’s', 1)
('', 1)
('', 1)


`reduceByKey` – reduceByKey() merges the values for each key using the specified function. In our example, it sums the 1s for each word, so the resulting RDD contains each unique word and its count.
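
The merging step can be sketched on a single machine as follows. The helper `reduce_by_key` is hypothetical and is not part of PySpark; it only imitates the result. Spark itself combines values within each partition before merging across partitions, which is why the function passed to reduceByKey should be associative and commutative.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical local analogue of reduceByKey: merge the values per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold each key's values into a single value with the binary function.
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up sample of (word, 1) pairs.
pairs = [("Project", 1), ("by", 1), ("Project", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)  # {'Project': 2, 'by': 1}
```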

In [5]:
rdd4=rdd3.reduceByKey(lambda a,b: a+b)
for element in rdd4.collect():
    print(element)

('Project', 2)
('Gutenberg’s', 2)
('Alice’s', 1)
('in', 1)
('Lewis', 1)
('Carroll', 1)
('is', 1)
('use', 1)
('of', 1)
('anyone', 1)
('anywhere', 1)
('at', 1)
('no', 1)
('', 2)
('Adventures', 1)
('Wonderland', 1)
('by', 1)
('This', 1)
('eBook', 1)
('for', 1)
('the', 1)
('cost', 1)
('and', 1)
('with', 1)


`sortByKey` – The sortByKey() transformation sorts the elements of a pair RDD by key. In our example, we first use map to convert each (word, count) tuple into (count, word), then apply sortByKey, which sorts on the integer count in ascending order. Finally, the for loop prints each count and word pair.
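
The swap-then-sort idea can be sketched in plain Python with `sorted`. The `counts` list is a made-up sample. On the RDD, descending order is available through the `ascending=False` argument of sortByKey.

```python
# Made-up sample of (word, count) pairs.
counts = [("Project", 2), ("by", 1), ("Alice", 1)]

# Swap each pair so that the count becomes the key.
swapped = [(count, word) for word, count in counts]

# Sort on the key only. Python's sort is stable, so ties keep their input order.
ascending = sorted(swapped, key=lambda pair: pair[0])
descending = sorted(swapped, key=lambda pair: pair[0], reverse=True)

print(ascending)   # [(1, 'by'), (1, 'Alice'), (2, 'Project')]
print(descending)  # [(2, 'Project'), (1, 'by'), (1, 'Alice')]
```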

In [6]:
rdd5 = rdd4.map(lambda x: (x[1],x[0])).sortByKey()
for element in rdd5.collect():
    print(element)

(1, 'Alice’s')
(1, 'in')
(1, 'Lewis')
(1, 'Carroll')
(1, 'is')
(1, 'use')
(1, 'of')
(1, 'anyone')
(1, 'anywhere')
(1, 'at')
(1, 'no')
(1, 'Adventures')
(1, 'Wonderland')
(1, 'by')
(1, 'This')
(1, 'eBook')
(1, 'for')
(1, 'the')
(1, 'cost')
(1, 'and')
(1, 'with')
(2, 'Project')
(2, 'Gutenberg’s')
(2, '')


`filter` – The filter() transformation keeps only the records that satisfy a predicate. In our example, we keep the words that start with “a”.
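
The choice of predicate matters here: "starts with a" and "contains a" select different words. A minimal plain-Python sketch, using a made-up sample of (count, word) pairs:

```python
# Made-up sample of (count, word) pairs, shaped like the sorted RDD's records.
pairs = [(1, "anyone"), (1, "at"), (1, "and"), (1, "Carroll"), (1, "the")]

# Keep words whose first character is "a".
starts_with_a = [pair for pair in pairs if pair[1].startswith("a")]

# Keep words that contain "a" anywhere, which also matches "Carroll".
contains_a = [pair for pair in pairs if "a" in pair[1]]

print(starts_with_a)  # [(1, 'anyone'), (1, 'at'), (1, 'and')]
print(contains_a)     # [(1, 'anyone'), (1, 'at'), (1, 'and'), (1, 'Carroll')]
```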

In [None]:
rdd6 = rdd5.filter(lambda x: x[1].startswith('a'))
for element in rdd6.collect():
    print(element)