# Resilient Distributed Dataset


* The records in an RDD are divided into logical partitions, which can be computed on different nodes of the cluster

* An RDD is computed by several processes spread across multiple physical servers, called nodes

## Advantages

* In-Memory Processing
    * loads data from disk, processes it in memory, and keeps it there
    * can cache an RDD in memory for reuse

* Immutability
    
* Fault Tolerance

* Lazy Evaluation

* Partitioning
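
Immutability and lazy evaluation from the list above can be sketched in plain Python, without Spark. `LazySeq` is a hypothetical stand-in written for this note, not a Spark class: each transformation returns a new object that only records the step, and nothing runs until the `collect()` action. Because the recorded steps can be replayed from the source data, the same idea gives a rough picture of how lost results can be recomputed for fault tolerance.

```python
class LazySeq:
    """Hypothetical stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, source, steps=()):
        self._source = source        # original data, never modified
        self._steps = tuple(steps)   # recorded transformations (the "lineage")

    def map(self, fn):
        # Return a NEW object with one more recorded step; nothing runs yet.
        return LazySeq(self._source, self._steps + (fn,))

    def collect(self):
        # The "action": replay every recorded step over the source data.
        result = list(self._source)
        for fn in self._steps:
            result = [fn(x) for x in result]
        return result


calls = []

def double(x):
    calls.append(x)  # record each invocation so laziness is observable
    return x * 2

base = LazySeq([1, 2, 3])
doubled = base.map(double)        # transformation: no work done yet
calls_before_action = len(calls)  # still 0 at this point
result = doubled.collect()        # action: the recorded step runs now
```

After `collect()`, `base` still yields its original values, which is the immutability property in miniature.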


## Limitations

RDDs are not suitable for applications that update a state store, such as the storage system for a web app.


In [3]:
# RDDs are mainly created in two ways:

# 1. Parallelizing an existing collection
# 2. Referencing a dataset in external storage

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
        .appName("RDD") \
        .getOrCreate()
spark



In [5]:
# Parallelizing

data = list(range(1,13))
rdd = spark.sparkContext.parallelize(data)
rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [7]:
rdd2 = spark.sparkContext.textFile("endomondoHR.json")
rdd2

endomondoHR.json MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

In [9]:
# Creating empty RDD

rdd3 = spark.sparkContext.emptyRDD()
rdd3

EmptyRDD[5] at emptyRDD at NativeMethodAccessorImpl.java:0

In [10]:
# Empty RDD with partitions

rdd4 = spark.sparkContext.parallelize([], 10) # 10 partitions
rdd4

ParallelCollectionRDD[6] at readRDDFromFile at PythonRDD.scala:274

In [11]:
# When no partition count is given, parallelize() creates partitions automatically
# based on resource availability (the default comes from sparkContext.defaultParallelism)
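
To make the idea of partitions concrete, here is a minimal plain-Python sketch of cutting a collection into contiguous, near-equal slices. The helper `slice_into_partitions` is hypothetical and only approximates what a partitioner might do; it is not Spark's implementation.

```python
def slice_into_partitions(data, num_slices):
    """Split a list into num_slices contiguous, near-equal chunks."""
    n = len(data)
    return [
        data[(i * n) // num_slices : ((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

# Same data as the parallelize() example: the numbers 1..12
partitions = slice_into_partitions(list(range(1, 13)), 4)

# An empty collection still yields the requested number of (empty) partitions
empty_partitions = slice_into_partitions([], 10)
```

On a real RDD, `rdd.glom().collect()` returns the contents of each partition as a list, which is a convenient way to inspect the actual layout.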

In [12]:
# Repartition and Coalesce

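# repartition() -> performs a full shuffle across all nodes; can increase or decrease the partition count
# coalesce()    -> merges existing partitions while moving as little data as possible; by default it avoids a full shuffle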

reparRdd = rdd.repartition(4)
reparRdd.getNumPartitions()

4
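
The difference between the two operations can be pictured with a plain-Python sketch. Both helpers are hypothetical simplifications, assuming partitions are just lists: `coalesce_sketch` merges whole neighbouring partitions so elements stay with their partition-mates, while `repartition_sketch` redistributes every element, which stands in for a full shuffle. Spark's real placement of elements may differ.

```python
def coalesce_sketch(partitions, n):
    """Merge whole neighbouring partitions into n groups (little data movement)."""
    k = len(partitions)
    merged = []
    for i in range(n):
        group = partitions[(i * k) // n : ((i + 1) * k) // n]
        merged.append([x for part in group for x in part])
    return merged

def repartition_sketch(partitions, n):
    """Deal every element round-robin into n new partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for x in part:
            out[i % n].append(x)
            i += 1
    return out

parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
coalesced = coalesce_sketch(parts, 2)      # partitions merged pairwise
reshuffled = repartition_sketch(parts, 3)  # every element may move
```

In the coalesced result each original partition ends up intact inside one new partition, whereas in the reshuffled result the elements of every original partition are spread over all the new ones.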

In [13]:
# RDD transformations are lazy: they are not executed until an action is called.
# Each transformation returns a new RDD instead of updating the current one.

In [14]:
# df = spark.read.json("./endomondoHR.json")

                                                                                

In [15]:
# df

DataFrame[altitude: array<double>, gender: string, heart_rate: array<bigint>, id: bigint, latitude: array<double>, longitude: array<double>, speed: array<double>, sport: string, timestamp: array<bigint>, url: string, userId: bigint]

In [16]:
rdd = spark.sparkContext.textFile("./test.txt")


In [18]:
# flatMap applies the function to each element, then flattens the results into a new RDD

rdd2 = rdd.flatMap(lambda x: x.split(" "))
rdd2

PythonRDD[18] at RDD at PythonRDD.scala:53

In [25]:
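# View the first 10 records of rdd
# (collect() pulls the whole RDD to the driver; on large data, rdd.take(10) is cheaper)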
rdd.collect()[:10]

['Project Gutenberg’s',
 'Alice’s Adventures in Wonderland',
 'by Lewis Carroll',
 'This eBook is for the use',
 'of anyone anywhere',
 'at no cost and with',
 'Alice’s Adventures in Wonderland',
 'by Lewis Carroll',
 'This eBook is for the use',
 'of anyone anywhere']

In [26]:
rdd2.collect()[:10]

['Project',
 'Gutenberg’s',
 'Alice’s',
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 'This']

In [29]:
rdd3 = rdd.map(lambda x: x.split(" "))
rdd3.collect()[:10]

[['Project', 'Gutenberg’s'],
 ['Alice’s', 'Adventures', 'in', 'Wonderland'],
 ['by', 'Lewis', 'Carroll'],
 ['This', 'eBook', 'is', 'for', 'the', 'use'],
 ['of', 'anyone', 'anywhere'],
 ['at', 'no', 'cost', 'and', 'with'],
 ['Alice’s', 'Adventures', 'in', 'Wonderland'],
 ['by', 'Lewis', 'Carroll'],
 ['This', 'eBook', 'is', 'for', 'the', 'use'],
 ['of', 'anyone', 'anywhere']]
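
The contrast between `map` and `flatMap` shown in the outputs above has a direct plain-Python analogue. This is a sketch on two sample lines taken from the output, not Spark code: `map` keeps one result per input line, while `flatMap` concatenates the per-line results into a single flat sequence.

```python
from itertools import chain

lines = ["by Lewis Carroll", "of anyone anywhere"]

# Analogue of rdd.map(lambda x: x.split(" ")): one list per line
mapped = [line.split(" ") for line in lines]

# Analogue of rdd.flatMap(lambda x: x.split(" ")): one flat list of words
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
```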

In [31]:
rdd4 = rdd2.map(lambda x: (x,1))
rdd4.collect()[:10]

[('Project', 1),
 ('Gutenberg’s', 1),
 ('Alice’s', 1),
 ('Adventures', 1),
 ('in', 1),
 ('Wonderland', 1),
 ('by', 1),
 ('Lewis', 1),
 ('Carroll', 1),
 ('This', 1)]

In [32]:
# reduceByKey merges the values for each key using the provided function

rdd5 = rdd4.reduceByKey(lambda x,y: x+y)
rdd5.collect()[:10]

[('Project', 9),
 ('Gutenberg’s', 9),
 ('Alice’s', 18),
 ('Adventures', 18),
 ('in', 18),
 ('Wonderland', 18),
 ('by', 18),
 ('Lewis', 18),
 ('Carroll', 18),
 ('This', 27)]
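
For intuition, the merging that `reduceByKey` performs can be sketched in plain Python with a dictionary. The helper `reduce_by_key` is hypothetical and runs on a single machine; it ignores partitioning and shuffling, which Spark handles across the cluster, and it does not reproduce Spark's output ordering.

```python
def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, keeping first-seen key order."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in with fn
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())

words = "by Lewis Carroll by Lewis by".split(" ")
pairs = [(w, 1) for w in words]                    # same shape as rdd4
counts = reduce_by_key(pairs, lambda x, y: x + y)  # same function as above
```

Because the function is applied pairwise in no guaranteed order on a cluster, it should be associative and commutative, as addition is here.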