In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.textFile("../Example_Sources/test.txt")

for element in rdd.collect():
    print(element)

Project Gutenberg’s
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
This eBook is for the use
of anyone anywhere
at no cost and with
Project Gutenberg’s
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
This eBook is for the use
of anyone anywhere
at no cost and with
Project Gutenberg’s
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
Alice’s Adventures in Wonderland
by Lewis Carroll
This eBook is for the use
of anyone anywhere
at no cost and with
This eBook is for the use
of anyone anywhere
at no cost and with
Project Gutenberg’s
Alice’s Adventures in Wonderland
by

## FlatMap

The `flatMap()` transformation applies a function to every record, flattens the results, and returns a new RDD. 
In the example below, it first splits each record on spaces and then flattens the resulting lists, 
so the resulting RDD holds a single word per record.
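
The difference between a `map`-style and a `flatMap`-style operation can be modeled without Spark. The sketch below is only an illustration: plain Python lists stand in for the RDD, and `lines`, `mapped`, and `flattened` are hypothetical names that do not appear in the notebook.

```python
from itertools import chain

# Plain-Python stand-in for an RDD of text lines (hypothetical sample data).
lines = ["by Lewis Carroll", "of anyone anywhere"]

# map-like: one output element per input element, so we get a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like: apply the same function, then flatten the result by one level.
flattened = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)     # [['by', 'Lewis', 'Carroll'], ['of', 'anyone', 'anywhere']]
print(flattened)  # ['by', 'Lewis', 'Carroll', 'of', 'anyone', 'anywhere']
```

Note that `split(" ")` produces empty strings when a line contains consecutive spaces; calling `split()` with no argument avoids that.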

In [3]:
print('FlatMap')
rdd2 = rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)

FlatMap
Project
Gutenberg’s
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Project
Gutenberg’s
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Project
Gutenberg’s
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Alice’s
Adventures
in
Wonderland
by
Lewis
Carroll
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with
Project
Gutenberg’s
Alice’s
Adventures
in
Wonde

## Map

The `map()` transformation applies a function to every record, for example to add or update a field. 
The output of `map()` always has the same number of records as its input.

In our word count example, we pair each word with the value 1. The result is an RDD of 
key-value pairs (Python tuples), with the word of type `str` as the key and the integer 1 as the value.
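
As a minimal sketch of the one-in, one-out property, the same pairing can be modeled on a plain Python list. The list `words` is hypothetical sample data standing in for the RDD of words.

```python
# Plain-Python stand-in for the RDD of words (hypothetical sample data).
words = ["This", "eBook", "is", "for", "the", "use"]

# map-like: every word becomes exactly one (word, 1) tuple.
pairs = [(word, 1) for word in words]

print(pairs[:2])                 # [('This', 1), ('eBook', 1)]
print(len(pairs) == len(words))  # True
```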

In [4]:
print('Map')
rdd3 = rdd2.map(lambda x: (x, 1))
for element in rdd3.collect():
    print(element)

Map
('Project', 1)
('Gutenberg’s', 1)
('Alice’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('by', 1)
('Lewis', 1)
('Carroll', 1)
('This', 1)
('eBook', 1)
('is', 1)
('for', 1)
('the', 1)
('use', 1)
('of', 1)
('anyone', 1)
('anywhere', 1)
('at', 1)
('no', 1)
('cost', 1)
('and', 1)
('with', 1)
('Alice’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('by', 1)
('Lewis', 1)
('Carroll', 1)
('This', 1)
('eBook', 1)
('is', 1)
('for', 1)
('the', 1)
('use', 1)
('of', 1)
('anyone', 1)
('anywhere', 1)
('at', 1)
('no', 1)
('cost', 1)
('and', 1)
('with', 1)
('This', 1)
('eBook', 1)
('is', 1)
('for', 1)
('the', 1)
('use', 1)
('of', 1)
('anyone', 1)
('anywhere', 1)
('at', 1)
('no', 1)
('cost', 1)
('and', 1)
('with', 1)
('Project', 1)
('Gutenberg’s', 1)
('Alice’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('by', 1)
('Lewis', 1)
('Carroll', 1)
('This', 1)
('eBook', 1)
('is', 1)
('for', 1)
('the', 1)
('use', 1)
('of', 1)
('anyone', 1)
('anywhere', 1)
('at', 1)
('no', 1)
('cost', 1)
('

## ReduceByKey

`reduceByKey()` merges the values for each key using the specified function. In our example, 
it sums the values for each word, so the resulting RDD contains 
each unique word and its count.
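
What `reduceByKey()` computes can be sketched locally as "group the values by key, then fold each group with the function". The helper `reduce_by_key` below is a hypothetical plain-Python model, not part of the PySpark API, and it ignores partitioning entirely.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Merge all values that share a key using func (local model only)."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Fold each group of values down to a single value.
    return {key: reduce(func, values) for key, values in grouped.items()}


# Hypothetical sample of (word, 1) pairs.
pairs = [("of", 1), ("anyone", 1), ("of", 1), ("at", 1), ("of", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)

print(counts)  # {'of': 3, 'anyone': 1, 'at': 1}
```

In Spark the function should be associative and commutative, because values are combined within each partition before the partial results are merged. Addition satisfies both properties.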

In [5]:
rdd4 = rdd3.reduceByKey(lambda a, b: a + b)
print('ReduceByKey')
for element in rdd4.collect():
    print(element)

ReduceByKey
('Project', 9)
('Gutenberg’s', 9)
('Alice’s', 18)
('in', 18)
('Lewis', 18)
('Carroll', 18)
('is', 27)
('use', 27)
('of', 27)
('anyone', 27)
('anywhere', 27)
('at', 27)
('no', 27)
('Adventures', 18)
('Wonderland', 18)
('by', 18)
('This', 27)
('eBook', 27)
('for', 27)
('the', 27)
('cost', 27)
('and', 27)
('with', 27)


## SortByKey

The `sortByKey()` transformation sorts the elements of an RDD by key. In our example, we first 
convert the `(word, count)` pairs into `(count, word)` pairs using `map()` and then apply 
`sortByKey()`, which sorts on the integer count in ascending order. Finally, a `for` loop over `collect()` 
prints each count and word as a key-value pair.
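
The swap-then-sort step can be sketched on a plain Python list. This is an illustrative model only; `counts`, `swapped`, `ascending`, and `descending` are hypothetical names, and the data is a small made-up sample.

```python
# Hypothetical (word, count) pairs, as produced by a word count.
counts = [("Project", 9), ("is", 27), ("in", 18), ("by", 18)]

# Swap each pair so that the count becomes the key.
swapped = [(count, word) for word, count in counts]

# Sort on the key only; Python's sort is stable, so ties keep their input order.
ascending = sorted(swapped, key=lambda pair: pair[0])
descending = sorted(swapped, key=lambda pair: pair[0], reverse=True)

print(ascending)   # [(9, 'Project'), (18, 'in'), (18, 'by'), (27, 'is')]
print(descending)  # [(27, 'is'), (18, 'in'), (18, 'by'), (9, 'Project')]
```

In PySpark, descending order is available through `sortByKey(ascending=False)`, which is usually what a "most frequent words first" report needs.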

In [6]:
rdd5 = rdd4.map(lambda x: (x[1], x[0])).sortByKey()
print('SortByKey')
for element in rdd5.collect():
    print(element)

SortByKey
(9, 'Project')
(9, 'Gutenberg’s')
(18, 'Alice’s')
(18, 'in')
(18, 'Lewis')
(18, 'Carroll')
(18, 'Adventures')
(18, 'Wonderland')
(18, 'by')
(27, 'is')
(27, 'use')
(27, 'of')
(27, 'anyone')
(27, 'anywhere')
(27, 'at')
(27, 'no')
(27, 'This')
(27, 'eBook')
(27, 'for')
(27, 'the')
(27, 'cost')
(27, 'and')
(27, 'with')


## Filter

The `filter()` transformation keeps only the records that satisfy a predicate. In our example, we 
keep all words that contain the lowercase letter `a`.
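
The predicate passed to `filter()` decides which records survive, and a substring test behaves differently from a prefix test. The sketch below models both on a plain Python list; `pairs`, `contains_a`, and `starts_with_a` are hypothetical names, and the data is a small made-up sample of `(count, word)` pairs.

```python
# Hypothetical (count, word) pairs.
pairs = [(18, "Carroll"), (18, "Wonderland"), (27, "anyone"), (27, "at"), (27, "is")]

# Substring test: keeps any word that has an 'a' anywhere in it.
contains_a = [pair for pair in pairs if "a" in pair[1]]

# Prefix test: keeps only words whose first letter is 'a'.
starts_with_a = [pair for pair in pairs if pair[1].startswith("a")]

print(contains_a)     # [(18, 'Carroll'), (18, 'Wonderland'), (27, 'anyone'), (27, 'at')]
print(starts_with_a)  # [(27, 'anyone'), (27, 'at')]
```

Both tests are case-sensitive, so a word such as `Adventures` matches neither.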

In [7]:
print('Filter')
rdd6 = rdd5.filter(lambda x: 'a' in x[1])
for element in rdd6.collect():
    print(element)

Filter
(18, 'Carroll')
(18, 'Wonderland')
(27, 'anyone')
(27, 'anywhere')
(27, 'at')
(27, 'and')



## Additional transformations

`mapPartitions()` - Similar to `map()`, but executes the transformation function once per partition rather than once per element. 
This can perform better than `map()` when each call needs expensive setup, such as opening a connection.

`mapPartitionsWithIndex()` - Similar to `mapPartitions()`, but also passes the function an integer value 
representing the index of the partition.

`randomSplit()` - Splits the RDD into several RDDs according to the list of weights given in the argument. For example, `rdd.randomSplit([0.7, 0.3])`.

`union()` - Combines the elements of the source dataset and the argument and returns the combined dataset. This is similar to the union 
operation on mathematical sets, except that duplicate elements are kept.

`sample()` - Returns a sampled subset of the dataset.

`intersection()` - Returns a dataset containing only the elements present in both the source dataset and the argument.

`distinct()` - Returns a dataset with all duplicate elements removed.

`repartition()` - Returns a dataset with the number of partitions specified in the argument. This operation performs a full shuffle 
of the data, and it can either increase or decrease the number of partitions.

`coalesce()` - Similar to `repartition()`, but more efficient when we want to decrease the number of partitions. It merges 
existing partitions, so it moves data from fewer nodes than the full shuffle that `repartition()` performs.
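
Several of the transformations above can be modeled in plain Python to make their semantics concrete. The sketch below is an illustration under stated assumptions, not PySpark code: a list of lists stands in for an RDD's partitions, and the helpers `map_partitions` and `map_partitions_with_index` are hypothetical names.

```python
# Hypothetical layout: six numbers stored in three partitions.
partitions = [[1, 2], [3, 4], [5, 6]]


def map_partitions(parts, func):
    """Call func once per partition, passing an iterator over its elements."""
    return [list(func(iter(part))) for part in parts]


def map_partitions_with_index(parts, func):
    """Like map_partitions, but func also receives the partition index."""
    return [list(func(index, iter(part))) for index, part in enumerate(parts)]


# One sum per partition: func runs three times, not six.
sums = map_partitions(partitions, lambda it: [sum(it)])

# Tag every element with the index of the partition that holds it.
tagged = map_partitions_with_index(partitions, lambda i, it: ((i, x) for x in it))

print(sums)    # [[3], [7], [11]]
print(tagged)  # [[(0, 1), (0, 2)], [(1, 3), (1, 4)], [(2, 5), (2, 6)]]

# Set-style transformations, modeled on flat lists.
left = [1, 2, 2, 3]
right = [2, 3, 4]

union = left + right                             # duplicates are kept
distinct = sorted(set(left))                     # duplicates removed
intersection = sorted(set(left) & set(right))    # common elements, no duplicates

print(union)         # [1, 2, 2, 3, 2, 3, 4]
print(distinct)      # [1, 2, 3]
print(intersection)  # [2, 3]
```

The results are sorted here only to make the output deterministic; Spark does not guarantee any particular element order for `distinct()` or `intersection()`.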
