
In [1]:
!pip3 install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
     310.8/310.8 MB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=90b5eee35c927df8067f77f81b2460fc29bc3952f44a4d94f3ac00b17baa1f29
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [3]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('RDD Part2').setMaster('local[*]')

sc = SparkContext(conf = conf)

sc

In [2]:
import random

dataset1 = random.sample(range(40), 10)
dataset2 = random.sample(range(40), 10)

In [4]:
rdd1 = sc.parallelize(dataset1, 4)
rdd2 = sc.parallelize(dataset2, 2)

In [5]:
print(rdd1.getNumPartitions())
print(rdd2.getNumPartitions())


4
2


In [6]:
print(rdd1.glom().collect())
print(rdd2.glom().collect())

[[25, 14], [26, 21], [39, 6], [0, 18, 27, 16]]
[[34, 27, 4, 16, 11], [35, 3, 39, 29, 2]]


`union` concatenates the partitions of both RDDs into a new RDD, so the result has 4 + 2 = 6 partitions. Duplicates are kept: 16, 27 and 39 each appear twice.

In [8]:
rdd_union = rdd1.union(rdd2)

print(rdd_union.glom().collect())
print(rdd_union.getNumPartitions())

[[25, 14], [26, 21], [39, 6], [0, 18, 27, 16], [34, 27, 4, 16, 11], [35, 3, 39, 29, 2]]
6
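
As a rough illustration of the partition layout above, the following pure-Python sketch models each RDD as a list of partitions. It is not Spark code; the helper name `union_partitions` is hypothetical and only mirrors what the `glom()` output shows.

```python
# Pure-Python model: each RDD is represented as a list of partitions (lists).
rdd1_parts = [[25, 14], [26, 21], [39, 6], [0, 18, 27, 16]]
rdd2_parts = [[34, 27, 4, 16, 11], [35, 3, 39, 29, 2]]


def union_partitions(left, right):
    """Concatenate the two partition lists; no element changes partition."""
    return [list(p) for p in left] + [list(p) for p in right]


union_parts = union_partitions(rdd1_parts, rdd2_parts)
print(len(union_parts))  # 6

# Duplicates survive the union.
flat = [x for part in union_parts for x in part]
duplicates = sorted({x for x in flat if flat.count(x) > 1})
print(duplicates)  # [16, 27, 39]
```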


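`intersection` returns the elements common to both RDDs, without duplicates. In this run the result has the same number of partitions as the union (6). Because the operation shuffles the elements by hash, some of those partitions can be empty.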

In [9]:
rdd_intersect = rdd1.intersection(rdd2)

print(rdd_intersect.collect())
print(rdd_intersect.glom().collect())

[39, 27, 16]
[[], [], [], [39, 27], [16], []]


In [11]:
count = 0
for partition in rdd_intersect.glom().collect():
  if len(partition) == 0:
    count += 1
print('Num of empty partitions:', count)

Num of empty partitions: 4
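
The placement seen above is consistent with simple hash partitioning. The sketch below is a pure-Python model, assuming each element goes to partition `hash(element) % num_partitions` and relying on CPython hashing small non-negative integers to themselves. The helper `hash_partition` is hypothetical and is not part of the PySpark API.

```python
def hash_partition(elements, num_partitions):
    """Place each element in partition hash(element) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        partitions[hash(element) % num_partitions].append(element)
    return partitions


common = [39, 27, 16]  # elements shared by rdd1 and rdd2
parts = hash_partition(common, 6)
print(parts)  # [[], [], [], [39, 27], [16], []]

empty = sum(1 for p in parts if not p)
print('Num of empty partitions:', empty)  # 4
```

With only three distinct elements spread over six partitions, at least three partitions must be empty regardless of the hash function.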


`.coalesce(n)` reduces the number of partitions in an RDD to `n`, by default without a shuffle. It is useful for running operations more efficiently after filtering down a large dataset.

In [12]:
rdd_intersect.coalesce(1).glom().collect()

[[39, 27, 16]]

In [13]:
rdd_intersect.coalesce(4).glom().collect()

[[], [39, 27], [], [16]]

Passing a number higher than the current partition count has no effect: without `shuffle=True`, `coalesce` cannot increase the number of partitions.

In [16]:
print(rdd_intersect.getNumPartitions())
print(rdd_intersect.coalesce(10).glom().collect())
print(rdd_intersect.coalesce(10).glom().getNumPartitions())

6
[[], [], [], [16], [39, 27], []]
6
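
A simplified pure-Python model of this behaviour is sketched below. The helper `coalesce_partitions` is hypothetical: it merges neighbouring partitions by index range, whereas the groups Spark actually picks may differ (compare the `coalesce(4)` output above). It does capture the two points shown here: whole partitions are merged rather than reshuffled, and the count never grows.

```python
def coalesce_partitions(partitions, n):
    """Merge neighbouring partitions into at most n groups, without a shuffle."""
    current = len(partitions)
    if n >= current:
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in partitions]
    merged = []
    for i in range(n):
        start = i * current // n
        end = (i + 1) * current // n
        merged.append([x for part in partitions[start:end] for x in part])
    return merged


parts = [[], [], [], [39, 27], [16], []]
print(coalesce_partitions(parts, 1))        # [[39, 27, 16]]
print(len(coalesce_partitions(parts, 4)))   # 4
print(len(coalesce_partitions(parts, 10)))  # 6
```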


`.takeSample(withReplacement, num)` returns a random sample of `num` elements. We can see that the result differs when the same code is executed multiple times.

In [17]:
rdd1.takeSample(False, 5)

[25, 18, 21, 0, 39]

In [18]:
rdd1.takeSample(False, 5)

[21, 0, 6, 18, 39]
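
`takeSample` also accepts an optional third argument, `seed`, which makes the sample repeatable. The effect of a seed can be sketched with the standard library alone. The helper `take_sample` below is hypothetical and does not reproduce Spark's sampling algorithm, so the values it returns will not match Spark's.

```python
import random

data = [25, 14, 26, 21, 39, 6, 0, 18, 27, 16]


def take_sample(values, num, seed=None):
    """Sample num distinct values; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(values, num)


print(take_sample(data, 5))           # varies from run to run
print(take_sample(data, 5, seed=42))  # identical on every call
print(take_sample(data, 5, seed=42))
```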

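`.takeOrdered(n)` returns the first `n` elements of the dataset in ascending order, or in the order defined by an optional `key` function. Unlike `.takeSample`, the result is deterministic, so repeated executions return the same result.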

In [20]:
print(rdd1.collect())
print(rdd1.takeOrdered(5))
print(rdd1.takeOrdered(5, key=lambda x: -x))

[25, 14, 26, 21, 39, 6, 0, 18, 27, 16]
[0, 6, 14, 16, 18]
[39, 27, 26, 25, 21]
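
One way to picture this is a per-partition top-`n` followed by a merge. The sketch below is a pure-Python model using `heapq.nsmallest`; the helper `take_ordered` is hypothetical and only approximates how such a result can be computed across partitions.

```python
import heapq
from functools import reduce

partitions = [[25, 14], [26, 21], [39, 6], [0, 18, 27, 16]]


def take_ordered(parts, num, key=None):
    """Keep the num smallest per partition, then merge the candidates."""
    per_partition = [heapq.nsmallest(num, part, key=key) for part in parts]
    return reduce(lambda a, b: heapq.nsmallest(num, a + b, key=key), per_partition)


print(take_ordered(partitions, 5))                    # [0, 6, 14, 16, 18]
print(take_ordered(partitions, 5, key=lambda x: -x))  # [39, 27, 26, 25, 21]
```

Each partition only has to hand over at most `num` candidates, which keeps the amount of data sent to the driver small.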


`.repartition(n)` reshuffles the data into `n` partitions. It can be used to either increase or decrease the number of partitions. A repartition always performs a full shuffle, which makes it more expensive than `.coalesce(n)` when the goal is only to reduce the partition count.

In [22]:
print(rdd1.getNumPartitions())
print(rdd1.glom().collect())

4
[[25, 14], [26, 21], [39, 6], [0, 18, 27, 16]]


In [23]:
print(rdd1.repartition(2).glom().collect())

[[39, 6, 0, 18, 27, 16], [25, 14, 26, 21]]


Even if the number of partitions equals the number of elements in the dataset, `repartition` does not guarantee one element per partition, and empty partitions can still be created.

In [25]:
print(rdd1.repartition(10).glom().collect())

[[], [], [], [], [26, 21], [], [], [], [0, 18, 27, 16], [25, 14, 39, 6]]


In [28]:
from pprint import pprint

In [29]:
rdd_kv = sc.parallelize([(1,10),(1,12),(2,15),(4,18),(2,20),(5,30),(1,19),(4,10)], 4)
pprint(rdd_kv.glom().collect())

[[(1, 10), (1, 12)], [(2, 15), (4, 18)], [(2, 20), (5, 30)], [(1, 19), (4, 10)]]


Unless `numPartitions` is passed, `reduceByKey` here creates the same number of partitions as the parent RDD. Keys are placed by hash, so the resulting RDD may contain empty partitions.

In [31]:
pprint(rdd_kv.reduceByKey(lambda x,y: x+y).glom().collect())

[[(4, 28)], [(1, 41), (5, 30)], [(2, 35)], []]
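
The layout above can be reproduced with a small pure-Python model, assuming each key is placed in partition `hash(key) % num_partitions` (in CPython, small non-negative integers hash to themselves). The helper `reduce_by_key` is hypothetical and is not Spark's implementation.

```python
def reduce_by_key(partitions, func, num_partitions=None):
    """Combine values per key, then place each key by hash(key) % num_partitions."""
    if num_partitions is None:
        num_partitions = len(partitions)  # default: same count as the parent
    totals = {}
    for part in partitions:
        for key, value in part:
            totals[key] = func(totals[key], value) if key in totals else value
    result = [[] for _ in range(num_partitions)]
    for key in sorted(totals):
        result[hash(key) % num_partitions].append((key, totals[key]))
    return result


kv_parts = [[(1, 10), (1, 12)], [(2, 15), (4, 18)],
            [(2, 20), (5, 30)], [(1, 19), (4, 10)]]
print(reduce_by_key(kv_parts, lambda x, y: x + y))
# [[(4, 28)], [(1, 41), (5, 30)], [(2, 35)], []]
```

Keys 1 and 5 both map to partition 1 and no key maps to partition 3, which is why the last partition stays empty.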


In [36]:
import pandas as pd
data = pd.DataFrame({'Key': rdd_kv.keys().collect(), 'Value': rdd_kv.values().collect()})
data

   Key  Value
0    1     10
1    1     12
2    2     15
3    4     18
4    2     20
5    5     30
6    1     19
7    4     10


In [37]:
print(rdd_kv.reduceByKey(lambda x,y: x+y).sortByKey().glom().collect())

[[(1, 41), (2, 35)], [(4, 28)], [(5, 30)], []]


In [38]:
rdd_kv.countByKey()

defaultdict(int, {1: 3, 2: 2, 4: 2, 5: 1})

In [45]:
rdd_grp = rdd_kv.groupByKey()

In [46]:
rdd_grp.glom().collect()

[[(4, <pyspark.resultiterable.ResultIterable at 0x7ff0c268a2f0>)],
 [(1, <pyspark.resultiterable.ResultIterable at 0x7ff0c2688430>),
  (5, <pyspark.resultiterable.ResultIterable at 0x7ff0c2688e50>)],
 [(2, <pyspark.resultiterable.ResultIterable at 0x7ff0c2689810>)],
 []]

In [47]:
for item in rdd_grp.collect():
  print('Key:', item[0], 'Values:', [value for value in item[1]])

Key: 4 Values: [18, 10]
Key: 1 Values: [10, 12, 19]
Key: 5 Values: [30]
Key: 2 Values: [15, 20]


In [48]:
rdd_kv.lookup(1)

[10, 12, 19]
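
For comparison, the results of `countByKey`, `groupByKey` and `lookup` shown above can be reproduced on a plain list of pairs. This is only a local, single-process sketch using `collections.defaultdict`. It says nothing about how Spark distributes the work.

```python
from collections import defaultdict

pairs = [(1, 10), (1, 12), (2, 15), (4, 18), (2, 20), (5, 30), (1, 19), (4, 10)]

counts = defaultdict(int)   # plays the role of countByKey
groups = defaultdict(list)  # plays the role of groupByKey
for key, value in pairs:
    counts[key] += 1
    groups[key].append(value)

print(dict(counts))  # {1: 3, 2: 2, 4: 2, 5: 1}
print(dict(groups))  # {1: [10, 12, 19], 2: [15, 20], 4: [18, 10], 5: [30]}
print(groups[1])     # plays the role of lookup(1): [10, 12, 19]
```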

https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

`.cache()` is shorthand for `.persist(StorageLevel.MEMORY_ONLY)`, which is also the level used when `.persist()` is called without arguments.

In [53]:
rdd1.persist()

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:287

In [54]:
rdd1.collect()

[25, 14, 26, 21, 39, 6, 0, 18, 27, 16]

In [56]:
from pyspark import StorageLevel

rdd2.persist(storageLevel = StorageLevel.MEMORY_AND_DISK)

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:287

In [57]:
rdd2.collect()

[34, 27, 4, 16, 11, 35, 3, 39, 29, 2]
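
Persisting only marks the RDD. The data is materialized by the first action and reused by later ones. The idea can be sketched with a toy class. `LazyDataset` below is entirely hypothetical, is not part of PySpark, and ignores storage levels. It only counts how often the underlying computation runs.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: computed lazily, optionally cached."""

    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._cached = None
        self.compute_calls = 0

    def persist(self):
        # Only sets a flag; nothing is computed until an action runs.
        self._persisted = True
        return self

    def collect(self):
        if self._persisted and self._cached is not None:
            return list(self._cached)
        self.compute_calls += 1
        result = self._compute()
        if self._persisted:
            self._cached = result
        return list(result)


squares = LazyDataset(lambda: [x * x for x in range(5)])
squares.collect()
squares.collect()
print(squares.compute_calls)  # 2: recomputed for each action

cached = LazyDataset(lambda: [x * x for x in range(5)]).persist()
cached.collect()
cached.collect()
print(cached.compute_calls)   # 1: the second action reads the cached result
```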

**Note:** In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, DISK_ONLY_2, and DISK_ONLY_3.