# **Working with RDD (Resilient Distributed Dataset)**

**`Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark`**

**`Author: Amin Karami (PhD, FHEA)`**

---

**Resilient Distributed Dataset (RDD)**: The RDD is the fundamental data structure of Spark. It is a fault-tolerant (resilient), immutable, distributed collection of objects of any type.

source: https://spark.apache.org/docs/latest/rdd-programming-guide.html

source: https://spark.apache.org/docs/latest/api/python/reference/

In [None]:
########## ONLY in Colab ##########
!pip3 install pyspark
########## ONLY in Colab ##########

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=3ced7c8e0b357e1405ae909932802da3b120ee1f73a8fc3d4c40a987cba22e9b
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [None]:
########## ONLY in Ubuntu Machine ##########
# Load Spark engine
!pip3 install -q findspark
import findspark
findspark.init()
########## ONLY in Ubuntu Machine ##########

In [None]:
# Linking with Spark
from pyspark import SparkContext, SparkConf

In [None]:
# Initializing Spark
conf = SparkConf().setAppName("RDD_practice").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc)

<SparkContext master=local[*] appName=RDD_practice>


# **Part 1: Create RDDs and Basic Operations**
# **There are two ways to create RDDs:**

1.   Parallelizing an existing collection in your driver program
2.   Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

In [None]:
# Generate sample data: the integers 1 to 100 (sequential, not random)
data = list(range(1, 101))

In [None]:
# Create RDD:
rdd_data=sc.parallelize(data,10)
rdd_data.getNumPartitions()

10
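
The even split above can be pictured with a small plain-Python sketch. It assumes Spark cuts a local list into contiguous, near-equal slices; `slice_into_partitions` is a hypothetical helper written for illustration, not a PySpark function, and Spark's real slicing logic lives inside the engine.

```python
def slice_into_partitions(items, num_slices):
    """Cut items into num_slices contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

partitions = slice_into_partitions(list(range(1, 101)), 10)
print(len(partitions))   # number of chunks
print(partitions[0])     # first chunk
```

With 100 elements and 10 slices, every chunk in this model holds 10 consecutive integers.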

In [None]:
datacoll=rdd_data.collect()

In [None]:
# Data distribution in partitions:
rdd_data = rdd_data.repartition(5)
rdd_data.getNumPartitions()

5

In [None]:
rdd_data.glom().collect()

[[61,
  62,
  63,
  64,
  65,
  66,
  67,
  68,
  69,
  70,
  81,
  82,
  83,
  84,
  85,
  86,
  87,
  88,
  89,
  90],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100],
 [21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  78,
  79,
  80],
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
 [41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60]]

In [None]:
# Print last partition
rdd_data.glom().collect()[-1]

[41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60]

In [None]:
# count():
rdd_data.count()

100

In [None]:
# first():
rdd_data.first()

61

In [None]:
# top():
rdd_data.top(15)

[100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86]

In [None]:
# distinct(): a lazy transformation, so it returns a new RDD rather than values (call collect() to see them)
rdd_data.distinct()

PythonRDD[28] at RDD at PythonRDD.scala:53

In [None]:
# map():
my_rdd=rdd_data.map(lambda x:(x,1))
# for element in my_rdd.collect():
#   print(element)

In [None]:
# filter(): keep only the elements for which the predicate is true (here, the even numbers)
my_rdd1 = rdd_data.filter(lambda x: x % 2 == 0)
print(sorted(my_rdd1.collect()))

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]


In [None]:
# flatMap(): each input element yields several output elements, flattened into a single RDD
my_rdd2 = rdd_data.flatMap(lambda x: [x, x+1, x+2])
# my_rdd2.collect()
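
As a rough mental model, the three transformations above correspond to familiar plain-Python constructs. The sketch below uses ordinary lists rather than RDDs, so it shows only the element-level behaviour, not Spark's lazy, distributed execution; the variable names are illustrative.

```python
from itertools import chain

values = [1, 2, 3, 4]

# map: exactly one output per input element
mapped = [(x, 1) for x in values]

# filter: keep the elements whose predicate is truthy
evens = [x for x in values if x % 2 == 0]

# flatMap: each element yields an iterable, and the results are flattened
flat = list(chain.from_iterable([x, x + 1, x + 2] for x in values))

print(mapped)
print(evens)
print(flat)
```

Note that a predicate returning a non-empty tuple is always truthy, so a filter written that way would keep every element.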

In [None]:
# Descriptive statistics: stats() returns count, mean, stdev, max, and min in one operation
rdd_data.stats()

In [None]:
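# mapPartitions(): apply a function once per partition (here, sum each partition's elements)
rdd_data.mapPartitions(lambda part: [sum(part)]).collect()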


# **Part 2: Advanced RDD Transformations and Actions**

In [None]:
# union():
data1 = rdd_data.union(rdd_data)

In [None]:
# intersection():
print(rdd_data.intersection(rdd_data))

PythonRDD[57] at RDD at PythonRDD.scala:53
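
In terms of results, `union()` keeps duplicates while `intersection()` returns each common element only once. A plain-Python sketch with small lists (not RDDs, and with the intersection sorted only to make the printout stable, since Spark does not guarantee its order):

```python
left = [1, 2, 2, 3]
right = [2, 3, 4]

# union: simple concatenation, duplicates are kept
union_result = left + right

# intersection: common elements, de-duplicated
intersection_result = sorted(set(left) & set(right))

print(union_result)
print(intersection_result)
```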


In [None]:
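# Find empty partitions: count the elements in each partition and bring the counts back to the driver
def count_elements(iterator):
  yield sum(1 for _ in iterator)
partition_sizes = rdd_data.mapPartitions(count_elements).collect()
print(partition_sizes)  # a size of 0 marks an empty partition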

In [None]:
# coalesce(numPartitions): returns a new RDD with fewer partitions; rdd_data itself is unchanged
rdd_data.coalesce(2)

CoalescedRDD[3] at coalesce at NativeMethodAccessorImpl.java:0

In [None]:
# takeSample(withReplacement, num, [seed])
rdd_data.takeSample(True,10,12)

[46, 8, 13, 55, 97, 19, 64, 73, 77, 7]

In [None]:
# takeOrdered(n, [ordering])
rdd_data.takeOrdered(6,lambda x:-x)

[100, 99, 98, 97, 96, 95]
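
The `ordering` argument is a key function: elements are ranked by the key, and the `n` elements with the smallest keys are returned. On a local list the same result can be obtained with `heapq.nsmallest`, sketched here as an analogy rather than as a description of Spark's internals:

```python
import heapq

values = list(range(1, 101))

# Smallest six values (default ordering)
smallest = heapq.nsmallest(6, values)

# Negating the key ranks large values first, so this returns the six largest
largest = heapq.nsmallest(6, values, key=lambda x: -x)

print(smallest)
print(largest)
```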

In [None]:
# reduce():
rdd_data.reduce(lambda x,y:x+y)

5050
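
Because `reduce()` combines elements within each partition first and then merges the partial results, the function passed to it should be commutative and associative. The plain-Python sketch below models two partitions by hand (`partitioned_reduce` is a hypothetical helper) to show why addition is safe and subtraction is not:

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

partitions = [[1, 2], [3, 4]]
flat = [x for part in partitions for x in part]

add = lambda x, y: x + y
sub = lambda x, y: x - y

print(partitioned_reduce(add, partitions), reduce(add, flat))  # same result
print(partitioned_reduce(sub, partitions), reduce(sub, flat))  # results differ
```

With subtraction, the answer depends on how the data happens to be partitioned, which is why such functions are unsuitable for `reduce()`.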

In [None]:
new = sc.parallelize([("a", 1), ("b", 2), ("a", 1)])

In [None]:
# reduceByKey():
new20=new.reduceByKey(lambda x,y:x+y)
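
Conceptually, `reduceByKey()` merges all values that share a key using the given function. A plain-Python sketch of that behaviour on the same three pairs (`reduce_by_key` is a hypothetical helper; Spark additionally combines values within each partition before shuffling):

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, returning a dict."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

pairs = [("a", 1), ("b", 2), ("a", 1)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(totals.items()))
```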

In [None]:
# sortByKey(): raises an error if the data is not in (key, value) pair format
sorted_rdd=new.sortByKey()
sorted_rdd.collect()

[('a', 1), ('a', 1), ('b', 2)]

In [None]:
# countByKey()
count_rdd=new.countByKey()
count_rdd

defaultdict(int, {'a': 2, 'b': 1})

In [None]:
# groupByKey(): the grouped values are ResultIterable objects; convert them to lists to inspect them
group_rdd=new.groupByKey()
sorted(group_rdd.mapValues(list).collect())

[('a', [1, 1]), ('b', [2])]

In [None]:
# lookup(key):
new.lookup("b")

[2]

In [None]:
# cache:
# By default, each transformed RDD may be recomputed each time you run an action on it.
# However, you may also persist an RDD in memory using the persist (or cache) method,
# in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
rdd_data.cache()

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [None]:
# Persistence (https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)
caching = rdd_data.persist().is_cached
caching

True
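
The effect of caching can be illustrated without Spark. The toy class below (`LazyDataset`, a hypothetical name) recomputes its data on every action unless `cache()` has been called; it is a simplified model of the behaviour described above, not how Spark manages storage:

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset with optional caching."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._use_cache = False
        self._cached = None
        self.compute_calls = 0       # how many times the data was rebuilt

    def cache(self):
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


uncached = LazyDataset(lambda: [x * x for x in range(10)])
uncached.count()
uncached.total()

cached = LazyDataset(lambda: [x * x for x in range(10)]).cache()
cached.count()
cached.total()

print(uncached.compute_calls, cached.compute_calls)
```

Two actions on the uncached dataset trigger two computations, while the cached one computes its data once and reuses it.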