
# **RDD**: Resilient Distributed Dataset.
It is the fundamental data structure of Spark: a *fault-tolerant*, *immutable*, *distributed collection* of objects of any type.

https://spark.apache.org/docs/latest/rdd-programming-guide.html

In [2]:
!pip3 install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=a11a8801437e60418d5c6c12f103db77dcf4d040838a2e394fcfec479cc79b0c
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [3]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('RDD_Practice').setMaster('local[*]')
sc = SparkContext(conf = conf)
print(sc)

<SparkContext master=local[*] appName=RDD_Practice>


In [4]:
import random
sampleList = random.sample(range(0,40), 10)
print(sampleList)

[1, 6, 19, 13, 22, 28, 21, 29, 2, 10]


`parallelize` distributes a local collection as an RDD split into the requested number of partitions.

In [5]:
rdd1 = sc.parallelize(sampleList, 4)

In [6]:
rdd1.getNumPartitions()

4

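If the number of partitions is omitted, `parallelize` falls back to `sc.defaultParallelism`, which in `local[*]` mode is the number of cores available to Spark (2 in this environment).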

In [7]:
sc.parallelize(sampleList).getNumPartitions(), sc.defaultParallelism

(2, 2)

In [8]:
rdd1.collect()

[1, 6, 19, 13, 22, 28, 21, 29, 2, 10]

`.glom()` gathers the elements of each partition into a list, so we can see which data sits in which partition.

In [9]:
rdd1.glom().collect()

[[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]

The output shows the elements held by each of the 4 partitions.

Real datasets are often huge, and `collect()` sends every element to the driver, which is not advisable. Use `.take(n)` instead to inspect only the first n records.

In [10]:
print(rdd1.glom().take(2))
print(rdd1.take(2))

[[1, 6], [19, 13]]
[1, 6]
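
To illustrate why `take(n)` is cheaper than `collect()`, here is a minimal pure-Python sketch, not Spark code. It assumes partitions are plain lists scanned in order, and `take_sketch` is a hypothetical helper; the real `take` decides how many partitions to scan in a more elaborate way.

```python
def take_sketch(partitions, n):
    """Return the first n elements and the number of partitions that were read."""
    taken = []
    partitions_read = 0
    for partition in partitions:
        if len(taken) >= n:
            break  # enough elements: the remaining partitions are never touched
        partitions_read += 1
        taken.extend(partition[: n - len(taken)])
    return taken, partitions_read


# Partition layout copied from the glom() output above.
partitions = [[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
first_two, partitions_read = take_sketch(partitions, 2)
print(first_two, partitions_read)
```

In this sketch only one of the four partitions is read to answer `take(2)`, whereas a `collect()` would have to read all of them.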


What if we ask for more partitions than there are elements?

In [11]:
sc.parallelize(sampleList, 15).glom().collect()

[[], [1], [6], [], [19], [13], [], [22], [28], [], [21], [29], [], [2], [10]]

Some partitions are left empty, as the output above shows.
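
Both partition layouts seen so far (4 partitions and 15 partitions) can be reproduced with a small pure-Python sketch. It rests on an assumption about PySpark internals rather than on documented behaviour: the list is first cut into batches of `max(1, min(len(data) // numSlices, 1024))` elements, and the batches are then assigned to partitions by integer index arithmetic. `emulate_parallelize` is a hypothetical helper, not a PySpark function.

```python
def emulate_parallelize(data, num_slices):
    """Pure-Python emulation of how a list may end up split into partitions."""
    data = list(data)
    # Assumption: elements are grouped into batches first (1024 is an assumed cap).
    batch_size = max(1, min(len(data) // num_slices, 1024))
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    partitions = []
    for i in range(num_slices):
        # Assumption: partition i receives batches[start:end].
        start = i * len(batches) // num_slices
        end = (i + 1) * len(batches) // num_slices
        partitions.append([x for batch in batches[start:end] for x in batch])
    return partitions


sample = [1, 6, 19, 13, 22, 28, 21, 29, 2, 10]
four = emulate_parallelize(sample, 4)
fifteen = emulate_parallelize(sample, 15)
print(four)
print(fifteen)
```

With 15 slices the batch size drops to 1, so 10 single-element batches are spread over 15 partitions and every third partition stays empty.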

In [12]:
rdd1.count()

10

In [13]:
rdd1.glom().count()

4

In [14]:
rdd1.first()

1

In [15]:
rdd1.glom().first()

[1, 6]

`.top(n)` returns the n largest elements in descending order. If n exceeds the number of elements, all elements are returned.

In [17]:
rdd1.top(2)

[29, 28]

In [19]:
sorted(sampleList, reverse=True)[:2]

[29, 28]

In [20]:
print(rdd1.top(100))
print(sorted(sampleList, reverse=True))

[29, 28, 22, 21, 19, 13, 10, 6, 2, 1]
[29, 28, 22, 21, 19, 13, 10, 6, 2, 1]
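
`top(n)` does not need a full distributed sort. A minimal sketch of one way to compute it, assuming each partition first keeps only its own n largest elements and the partial results are then merged pairwise; `top_sketch` is a hypothetical pure-Python helper, and whether PySpark does exactly this is an assumption here.

```python
import heapq
from functools import reduce


def top_sketch(partitions, n):
    """Emulate top(n): n largest per partition, then merge the partial results."""
    per_partition = [heapq.nlargest(n, partition) for partition in partitions]
    return reduce(lambda a, b: heapq.nlargest(n, a + b), per_partition)


# Partition layout copied from the glom() output shown earlier.
partitions = [[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
top_two = top_sketch(partitions, 2)
top_all = top_sketch(partitions, 100)
print(top_two)
print(top_all)
```

The benefit of this shape is that at most n elements per partition ever need to travel to the driver.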


In [21]:
rdd1.distinct().collect()

[28, 1, 13, 21, 29, 6, 22, 2, 10, 19]
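
`distinct()` removes duplicates but does not preserve the input order. One plausible explanation of the order above, offered as an assumption rather than a statement about Spark's implementation, is that elements are shuffled into partitions by `hash(x) % numPartitions`. The pure-Python sketch below (with the hypothetical helper `distinct_sketch`) emulates that idea.

```python
def distinct_sketch(data, num_partitions):
    """Group elements by hash(x) % num_partitions, dropping duplicates."""
    # dicts keep insertion order, so each bucket lists elements in first-seen order
    buckets = [dict() for _ in range(num_partitions)]
    for x in data:
        buckets[hash(x) % num_partitions].setdefault(x, None)
    return [list(bucket) for bucket in buckets]


data = [1, 6, 19, 13, 22, 28, 21, 29, 2, 10]
buckets = distinct_sketch(data, 4)
flattened = [x for bucket in buckets for x in bucket]
print(buckets)
print(flattened)
```

Under this assumption the flattened buckets come out in the same order as the `collect()` result above; on a real cluster the order should still be treated as unspecified.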

`.map(func)` returns a new RDD in which each element is the result of applying `func` to the corresponding element of the original RDD.

In [23]:
def my_func(item):
  return (item + 1) * 3

rdd_map = rdd1.map(my_func)

print(rdd1.collect())
print(rdd_map.collect())

[1, 6, 19, 13, 22, 28, 21, 29, 2, 10]
[6, 21, 60, 42, 69, 87, 66, 90, 9, 33]


`map` works within each partition: every output element stays in the same partition as its input element.

In [25]:
print(rdd1.glom().collect())
print(rdd_map.glom().collect())

[[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
[[6, 21], [60, 42], [69, 87], [66, 90, 9, 33]]


In [26]:
rdd_filter = rdd1.filter(lambda x: x%2 == 0)
print(rdd1.collect())
print(rdd_filter.collect())

[1, 6, 19, 13, 22, 28, 21, 29, 2, 10]
[6, 22, 28, 2, 10]


`.filter()` keeps the elements that satisfy the condition, working within each partition. If no element of a partition passes the condition, that partition is left empty in the result.

In [27]:
print(rdd1.glom().collect())
print(rdd_filter.glom().collect())

[[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
[[6], [], [22, 28], [2, 10]]
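
The partition-preserving behaviour of `map` and `filter` can be pictured with plain nested lists. This is a pure-Python sketch, not Spark code; both helper names are hypothetical.

```python
def map_within_partitions(partitions, func):
    """Apply func to every element, keeping partition boundaries intact."""
    return [[func(x) for x in partition] for partition in partitions]


def filter_within_partitions(partitions, predicate):
    """Keep matching elements; a partition with no match remains, but empty."""
    return [[x for x in partition if predicate(x)] for partition in partitions]


# Partition layout copied from the glom() output above.
partitions = [[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
mapped = map_within_partitions(partitions, lambda x: (x + 1) * 3)
evens = filter_within_partitions(partitions, lambda x: x % 2 == 0)
print(mapped)
print(evens)
```

Because neither operation moves data between partitions, the number of partitions is the same before and after.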


`.flatMap(func)` is similar to `.map(func)`, but each input item can be mapped to zero or more output items. `func` must therefore return an iterable (for example a list, or a generator that uses `yield`) rather than a single value, and the results are flattened into one RDD.

In [28]:
rdd_flatMap = rdd1.flatMap(lambda x: [x+1, x+2])
print(rdd1.collect())
print(rdd_flatMap.collect())

[1, 6, 19, 13, 22, 28, 21, 29, 2, 10]
[2, 3, 7, 8, 20, 21, 14, 15, 23, 24, 29, 30, 22, 23, 30, 31, 3, 4, 11, 12]


Like `map` and `filter`, `flatMap` keeps its output elements in the partition of the input element that produced them.

In [29]:
print(rdd1.glom().collect())
print(rdd_flatMap.glom().collect())

[[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
[[2, 3, 7, 8], [20, 21, 14, 15], [23, 24, 29, 30], [22, 23, 30, 31, 3, 4, 11, 12]]


In [30]:
print(rdd1.collect())
print(rdd1.map(lambda x: [x+1, x+2]).collect())

[1, 6, 19, 13, 22, 28, 21, 29, 2, 10]
[[2, 3], [7, 8], [20, 21], [14, 15], [23, 24], [29, 30], [22, 23], [30, 31], [3, 4], [11, 12]]


In [33]:
import pprint
pprint.pprint(rdd1.map(lambda x: [x+1, x+2]).glom().collect())

[[[2, 3], [7, 8]],
 [[20, 21], [14, 15]],
 [[23, 24], [29, 30]],
 [[22, 23], [30, 31], [3, 4], [11, 12]]]
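
The difference between `map` and `flatMap` is a single level of flattening. The pure-Python sketch below emulates it with `itertools.chain` and shows that the function may either return a list or be a generator using `yield`. `flat_map_sketch` and `neighbours` are hypothetical names introduced for illustration.

```python
from itertools import chain


def neighbours(x):
    """Generator variant: yields the outputs one at a time."""
    yield x + 1
    yield x + 2


def flat_map_sketch(items, func):
    """Emulate flatMap: apply func to each item, then flatten one level."""
    return list(chain.from_iterable(func(x) for x in items))


items = [1, 6, 19]
as_map = [[x + 1, x + 2] for x in items]  # map keeps one list per input
with_list = flat_map_sketch(items, lambda x: [x + 1, x + 2])
with_yield = flat_map_sketch(items, neighbours)
# Returning an empty list drops the input, which is the "zero outputs" case.
only_even = flat_map_sketch(items, lambda x: [x] if x % 2 == 0 else [])
print(as_map, with_list, with_yield, only_even)
```

The list-returning lambda and the generator give the same flattened result, so the choice between `return` and `yield` is a matter of convenience.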


In [35]:
rdd_reduce = rdd1.reduce(lambda a,b: a + b)
rdd_reduce

151
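
`reduce` combines elements within each partition and then combines the partial results, so the function passed to it should be associative and commutative. The pure-Python sketch below emulates this with nested lists; `reduce_sketch` is a hypothetical helper, and the order in which Spark merges partial results is not something to rely on.

```python
from functools import reduce


def reduce_sketch(partitions, func):
    """Reduce each non-empty partition, then combine the partial results."""
    partials = [reduce(func, p) for p in partitions if p]
    return reduce(func, partials)


partitions = [[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
flat = [x for p in partitions for x in p]

total = reduce_sketch(partitions, lambda a, b: a + b)

# Subtraction is neither associative nor commutative, so the partitioned
# result differs from a left-to-right reduce over the flat list.
partitioned_diff = reduce_sketch(partitions, lambda a, b: a - b)
sequential_diff = reduce(lambda a, b: a - b, flat)
print(total, partitioned_diff, sequential_diff)
```

Addition gives the same answer however the data is partitioned; subtraction does not, which is why such functions are unsuitable for `reduce`.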

In [38]:
# Descriptive statistics

print([rdd1.max(), rdd1.min(), rdd1.sum(), rdd1.mean(), rdd1.stdev()])

[29, 1, 151, 15.1, 9.7]
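
The `9.7` reported by `stdev()` is the population standard deviation (dividing by n); RDDs also provide `sampleStdev()` for the n - 1 version. The difference can be checked without Spark using Python's standard library, as in this small sketch on the same ten numbers.

```python
import statistics

data = [1, 6, 19, 13, 22, 28, 21, 29, 2, 10]

population_sd = statistics.pstdev(data)  # divides by n, like rdd.stdev()
sample_sd = statistics.stdev(data)       # divides by n - 1, like rdd.sampleStdev()

print(round(population_sd, 4), round(sample_sd, 4))
```

The sample version is always slightly larger, and the gap shrinks as the dataset grows.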


What if we want to run an operation once per partition?
*For example: the sum of the elements in each partition, or the number of elements in each partition.*


In [41]:
def my_partition_func(partition):
  _sum = 0
  _count = 0
  for item in partition:
    _sum += item
    _count += 1

  yield {'sum': _sum, 'count': _count}


rdd2 = rdd1.mapPartitions(my_partition_func)

In [42]:
print(rdd1.glom().collect())
print(rdd2.collect())

[[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
[{'sum': 7, 'count': 2}, {'sum': 32, 'count': 2}, {'sum': 50, 'count': 2}, {'sum': 62, 'count': 4}]
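
The function given to `mapPartitions` is called once per partition and receives an iterator over that partition's elements. PySpark also offers `mapPartitionsWithIndex`, whose function additionally receives the partition index. The pure-Python sketch below emulates that variant on nested lists; `partition_summary` and `map_partitions_with_index_sketch` are hypothetical names, and the code does not call Spark.

```python
def partition_summary(index, iterator):
    """Summarise one partition; yields a single dict tagged with its index."""
    total = 0
    count = 0
    for item in iterator:
        total += item
        count += 1
    yield {'partition': index, 'sum': total, 'count': count}


def map_partitions_with_index_sketch(partitions, func):
    """Call func once per partition with (index, iterator) and flatten the results."""
    results = []
    for index, partition in enumerate(partitions):
        results.extend(func(index, iter(partition)))
    return results


# Partition layout copied from the glom() output above.
partitions = [[1, 6], [19, 13], [22, 28], [21, 29, 2, 10]]
summaries = map_partitions_with_index_sketch(partitions, partition_summary)
print(summaries)
```

Tagging each result with its partition index makes it easier to spot skewed or empty partitions.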
