Features Of RDDs:

#In-Memory Computation: Keeping data in memory can improve performance by an order of magnitude.
#Lazy Evaluation: All transformations on RDDs are lazy, i.e., they don't compute their results right away.
#Fault Tolerance: RDDs track data lineage information so that lost data can be rebuilt automatically.
#Immutability: An RDD can be created or retrieved at any time, but once defined, its contents can't be changed.
#Partitioning: A partition is the fundamental unit of parallelism in a PySpark RDD.
#Persistence: Users can reuse PySpark RDDs and choose a storage strategy for them.
#Coarse-Grained Operations: Operations such as map, filter, or groupBy are applied to all elements of a data set rather than to individual elements.
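
Lazy evaluation, immutability, and lineage can be made concrete with a small plain-Python model. This is only an analogy and not how Spark is implemented; the class `LazyDataset` and the helper `double` are hypothetical names invented for this sketch.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = list(source)      # the original data, never modified
        self._lineage = tuple(lineage)   # recorded transformation steps, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset and computes nothing yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new dataset.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []

def double(x):
    calls.append(x)   # record every real invocation
    return x * 2

base = LazyDataset([1, 2, 3, 4])
big = base.map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: nothing has run so far
result = big.collect()              # the action triggers the work
```

Because the result is always recomputed from the source plus the recorded steps, a lost result can be rebuilt by replaying the same lineage, which is the idea behind the fault tolerance described above.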

RDD Operations in PySpark:

1.Transformations: These are operations applied to an RDD to create a new RDD. Transformations are lazily evaluated, which means that execution does not start until an action is triggered. Spark only records the transformations, and you trigger the computation whenever you need the result by calling an action on the data. A few of the transformations provided by RDDs are:
1.map
2.flatMap
3.filter
4.distinct
5.reduceByKey
6.mapPartitions
7.sortBy

2.Actions: Actions are operations applied to an RDD that instruct Apache Spark to run the computation and pass the result back to the driver. A few of the actions are:
1.collect
2.collectAsMap
3.reduce
4.countByKey/countByValue
5.take
6.first

In [4]:
myRDD = sc.parallelize([('JK', 22), ('V', 24), ('Jimin',24), ('RM', 25), ('J-Hope', 25), ('Suga', 26), ('Jin', 27)])
myRDD.take(7)

[('JK', 22),
 ('V', 24),
 ('Jimin', 24),
 ('RM', 25),
 ('J-Hope', 25),
 ('Suga', 26),
 ('Jin', 27)]

In [5]:
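# The file is comma-separated, so split each line on "," (splitting on "\t" would leave every line as a single field)
CSV_RDD = sc.textFile("heroes_information.csv", minPartitions=4).map(lambda element: element.split(","))
CSV_RDD.take(3)

[['', 'name', 'Gender', 'Eye color', 'Race', 'Hair color', 'Height', 'Publisher', 'Skin color', 'Alignment', 'Weight'],
 ['0', 'A-Bomb', 'Male', 'yellow', 'Human', 'No Hair', '203.0', 'Marvel Comics', '-', 'good', '441.0'],
 ['1', 'Abe Sapien', 'Male', 'blue', 'Icthyo Sapien', 'No Hair', '191.0', 'Dark Horse Comics', 'blue', 'good', '65.0']]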
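
The rows of this file are comma-separated, and a plain string split on commas breaks as soon as a field contains a quoted comma. The sketch below compares a naive split with Python's standard `csv` module on a few in-memory lines. The third row is a hypothetical example made up for illustration and is not taken from the data set.

```python
import csv

lines = [
    ",name,Gender,Eye color",
    "0,A-Bomb,Male,yellow",
    '1,"Doe, Jane",Female,blue',   # hypothetical row with a quoted comma
]

# Naive approach: split every line on commas
naive = [line.split(",") for line in lines]

# csv.reader accepts any iterable of strings and honours quoting
parsed = list(csv.reader(lines))

print(naive[2])    # the quoted name is wrongly cut into two fields
print(parsed[2])   # the quoted name stays one field

# Assumption: in PySpark the same reader could be applied per partition,
# e.g. sc.textFile(path).mapPartitions(lambda part: csv.reader(part))
```

For files without quoted fields, both approaches give the same result.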

In [8]:
New_RDD = sc.textFile("Sample",minPartitions= 4)


In [15]:
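# Note: collect is a method. Without parentheses, this cell only displays the bound method
# (see the output below); write New_RDD.collect() to actually return the data to the driver.
New_RDD.collect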

<bound method RDD.collect of Sample MapPartitionsRDD[18] at textFile at NativeMethodAccessorImpl.java:0>

In [16]:
CSV_RDD.count()


735

In [20]:
def Func(lines):
    # Lowercase a line of text and split it into a list of words
    lines = lines.lower()
    lines = lines.split()
    return lines


In [21]:
Split_rdd = New_RDD.map(Func)


In [23]:
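# Stop words are written in lowercase because Func lowercases every word
stopwords = ['a','all','the','as','is','am','an','and','be','been','from','had','i',"i'd",'why','with']
# flatMap flattens the per-line word lists into a single RDD of words
RDD = New_RDD.flatMap(Func)
# Keep only the words that are not stop words
RDD1 = RDD.filter(lambda x: x not in stopwords)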
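
The difference between map and flatMap with the same function is easy to miss, so here is a plain-Python sketch of what each produces. The two input lines and the short stop word list are hypothetical sample data, and the list comprehensions only mimic the RDD operations; they are not Spark code.

```python
def Func(lines):
    # Same logic as the function above: lowercase, then split into words
    return lines.lower().split()

sample_lines = ["The quick brown fox", "jumps over the lazy dog"]  # hypothetical input
sample_stopwords = ["a", "the", "is"]                              # hypothetical, shortened

# Like New_RDD.map(Func): one list of words per input line
mapped = [Func(line) for line in sample_lines]

# Like New_RDD.flatMap(Func): all words in one flat sequence
flat = [word for line in sample_lines for word in Func(line)]

# Like RDD.filter(lambda x: x not in stopwords)
kept = [word for word in flat if word not in sample_stopwords]

print(mapped)
print(flat)
print(kept)
```

With map the result has as many elements as there are lines, while with flatMap it has as many elements as there are words.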

In [24]:
# Keep only the words that start with 'c' (no regular expression is needed for this)
filteredRDD = RDD.filter(lambda x: x.startswith('c'))

In [26]:
#Creating RDDs of key-value pairs
a = sc.parallelize([('a',2),('b',3)])
b = sc.parallelize([('a',9),('b',7),('c',10)])

In [27]:
c = a.join(b)
c.collect()

[('a', (2, 9)), ('b', (3, 7))]
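
The join result contains only the keys present in both RDDs, which is why 'c' is missing. A minimal plain-Python model of this inner join is sketched below; `inner_join` is a hypothetical helper written for illustration and says nothing about how Spark performs the join internally.

```python
def inner_join(left, right):
    # Index the right-hand pairs by key; a key may carry several values
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    # Emit one (key, (left_value, right_value)) pair per matching combination
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]

a_pairs = [('a', 2), ('b', 3)]
b_pairs = [('a', 9), ('b', 7), ('c', 10)]

joined = inner_join(a_pairs, b_pairs)
print(joined)
```

Keys that appear several times on either side would produce one output pair per combination of values.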

In [28]:
num_rdd = sc.parallelize(range(1,5000))
num_rdd.reduce(lambda x,y: x+y)

12497500

In [29]:
data_key = sc.parallelize([('a', 4),('b', 3),('c', 2),('a', 8),('d', 2),('b', 1),('d', 3)],4)
data_key.reduceByKey(lambda x, y: x + y).collect()

[('b', 4), ('c', 2), ('a', 12), ('d', 5)]
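
What reduceByKey computes can be modelled in plain Python with a dictionary. `reduce_by_key` below is a hypothetical helper for illustration only; in Spark the values are combined partition by partition, so the function passed in should be associative and commutative, and the order of keys in the collected result is not guaranteed.

```python
def reduce_by_key(pairs, fn):
    # Fold the values of each key together with fn
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc

pairs = [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)
```

The totals per key match the collected output above; only the ordering may differ.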

In [31]:
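#Saving the data as text: Spark creates a directory named "newoutput.txt" with one part file per partition,
#and the call fails if that path already exists
num_rdd.saveAsTextFile("newoutput.txt")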

In [32]:
#Sorting the data based on a key
test = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
sc.parallelize(test).sortByKey(True, 1).collect()

[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

In [33]:
#Performing Set Operations
rdd_a = sc.parallelize([1,2,3,4])
rdd_b = sc.parallelize([3,4,5,6])


In [34]:
rdd_a.intersection(rdd_b).collect()

[3, 4]

In [35]:
rdd_a.subtract(rdd_b).collect()

[1, 2]

In [36]:
rdd_a.cartesian(rdd_b).collect()

[(1, 3),
 (1, 4),
 (1, 5),
 (1, 6),
 (2, 3),
 (2, 4),
 (2, 5),
 (2, 6),
 (3, 3),
 (3, 4),
 (3, 5),
 (3, 6),
 (4, 3),
 (4, 4),
 (4, 5),
 (4, 6)]

In [37]:
rdd_a.union(rdd_b).collect()

[1, 2, 3, 4, 3, 4, 5, 6]
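
Note that union keeps duplicates, unlike a mathematical set union. The plain-Python sketch below mirrors the four operations on ordinary lists to make the semantics explicit. It is an analogy, not Spark code, and the variable names are chosen for this example.

```python
from itertools import product

list_a = [1, 2, 3, 4]
list_b = [3, 4, 5, 6]

union_all = list_a + list_b                      # like union: duplicates are kept
union_distinct = sorted(set(union_all))          # like union followed by distinct
common = sorted(set(list_a) & set(list_b))       # like intersection
only_a = [x for x in list_a if x not in list_b]  # like subtract
pairs = list(product(list_a, list_b))            # like cartesian

print(union_all)
print(union_distinct)
print(common)
print(only_a)
print(len(pairs))
```

In PySpark, duplicates can be dropped after a union with `rdd_a.union(rdd_b).distinct()`.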