%md
## Overview
This notebook shows several ways to create an RDD (Resilient Distributed Dataset) in PySpark.

#### **Contents :**

- **Setting up Spark Context**
- **Create RDD [Resilient Distributed Datasets]**
- **Two ways to create RDD**
    1. Parallelized Collection
    2. External Datasets
1. RDD from List
2. RDD from Tuple
3. Empty RDD
4. RDD from an external text file
5. RDD from range() function
6. RDD from existing RDD
7. RDD from JSON data


This is a **Python** notebook, so the default cell type is Python. You can switch language in an individual cell with a `%LANGUAGE` magic command: `%python`, `%scala`, `%sql`, and `%r` are supported. `%fs` is also available, but it runs DBFS file system commands rather than selecting a language.

**Spark RDD Documentation Link**
- https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

In [0]:
# In Databricks, every notebook has a default SparkSession named `spark`, so PySpark code runs without creating a session or context explicitly.
# Display the default SparkSession (the summary includes the Spark version).
spark

#### Setting up the SparkContext 

In [0]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('MySparkApp').setMaster('local')
print(conf.get("spark.master"))
print(conf.get("spark.app.name"))

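# sc = SparkContext(conf=conf)  # raises an error if a SparkContext is already running
# getOrCreate() reuses the running SparkContext if there is one (as in Databricks);
# conf is applied only when a new context has to be created.
sc = SparkContext.getOrCreate(conf=conf)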

local
MySparkApp


In [0]:
# from pyspark.sql import SparkSession

# Create spark session
# spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

#### Create RDD [Resilient Distributed Datasets]

RDDs are commonly created by parallelizing a collection: an existing collection in the driver program (for example, a Python list) is passed to the SparkContext's `parallelize()` method. This approach is mostly used for testing and prototyping rather than production workloads, because the entire dataset must first fit in the driver node's memory.

A Resilient Distributed Dataset (RDD) is a collection of elements partitioned across the nodes of the cluster, so that it can be operated on in parallel. Its partitions can be kept in memory on the worker nodes.
- **Resilient :** Lost partitions can be recomputed after a failure.
- **Distributed :** Data is distributed among different nodes.
- **Dataset :** A collection of data items.

There are two ways to create RDDs:
1. **Parallelized Collection :** Call SparkContext's `parallelize()` method on an existing collection in the driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
2. **External Datasets :** Reference a dataset in an external storage system, such as the local file system, a shared filesystem, HDFS, Cassandra, HBase, or any data source offering a Hadoop InputFormat.

`rdd = spark.sparkContext.parallelize(data)` creates an RDD from the collection `data`. The `parallelize()` method splits the data into partitions and distributes them across the nodes of the Spark cluster.

`rdd.collect()` returns all the elements of the RDD to the driver program (in this case, the notebook) so that they can be printed.
Use `collect()` cautiously: because it brings all the data to the driver, it can cause out-of-memory errors on large datasets.

##### 1. RDD from List

In [0]:
list_data = [1, 2, 3, 4, 5]
rdd_list = sc.parallelize(list_data)

rddListCollect = rdd_list.collect()

print("Number of Partitions: " + str(rdd_list.getNumPartitions()))
print("Action: First element: " + str(rdd_list.first()))
print(rddListCollect)

Number of Partitions: 8
Action: First element: 1
[1, 2, 3, 4, 5]
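
The output above reports 8 partitions for a 5-element list, so some partitions must be empty. The pure-Python sketch below (no Spark required) illustrates one way a collection can be cut into contiguous slices by index arithmetic. `slice_collection` is a hypothetical helper written for this illustration, and the arithmetic is an assumption about how such a split can be done, not a statement of Spark's internal behavior.

```python
def slice_collection(items, num_slices):
    """Split items into num_slices contiguous slices of near-equal size."""
    items = list(items)
    n = len(items)
    # Slice i covers the index range [i * n // num_slices, (i + 1) * n // num_slices).
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


slices = slice_collection([1, 2, 3, 4, 5], 8)
empty_slices = sum(1 for s in slices if not s)

print(slices)
print("Empty slices: " + str(empty_slices))
```

On a real RDD, `rdd_list.glom().collect()` shows the actual contents of each partition.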


##### 2. RDD from Tuple

In [0]:
tuple_data = [("Java", 20000),("Python", 10000),("Scala", 30000)]
rdd_tuple = sc.parallelize(tuple_data)

rddTupleCollect = rdd_tuple.collect()

print("Number of Partitions: " + str(rdd_tuple.getNumPartitions()))
print("Action: First element: " + str(rdd_tuple.first()))
print(rddTupleCollect)

Number of Partitions: 8
Action: First element: ('Java', 20000)
[('Java', 20000), ('Python', 10000), ('Scala', 30000)]


##### 3. Empty RDD 

In [0]:
emptyRDD = sc.emptyRDD()
emptyRDD2 = sc.parallelize([])

print("is Empty RDD : " + str(emptyRDD.isEmpty()))
print("is Empty RDD : " + str(emptyRDD2.isEmpty()))

rddEmptyCollect = emptyRDD.collect()
print("Number of Partitions: " + str(emptyRDD.getNumPartitions()))
print(rddEmptyCollect)


is Empty RDD : True
is Empty RDD : True
Number of Partitions: 0
[]


##### 4. RDD from an external text file

In [0]:
rddFile = sc.textFile("/FileStore/tables/SparkText")
rddFileCollect = rddFile.collect()
print(rddFileCollect)

['Apache Spark is a lightning-fast cluster computing framework designed for real-time processing. Spark is an open-source project from Apache Software Foundation. Spark overcomes the limitations of Hadoop MapReduce, and it extends the MapReduce model to be efficiently used for data processing. Spark is a market leader for big data processing. It is widely used across organizations in many ways. It has surpassed Hadoop by running 100 times faster in memory and 10 times faster on disks.', '', 'Apache Spark is a fast, flexible, and developer-friendly leading platform for large-scale SQL, machine learning, batch processing, and stream processing. It is essentially a data processing framework that has the ability to quickly perform processing tasks on very large data sets. It is also capable of distributing data processing tasks across multiple computers, either by itself or in conjunction with other distributed computing tools.', '', 'Apache Spark is a unified engine designed for large-sca

In [0]:
# To read the entire content of each file as a single record, use wholeTextFiles() on the SparkContext; it returns (file path, file content) pairs.

rddWholeFile = sc.wholeTextFiles('/FileStore/tables/SparkText')
rddWholeFileCollect = rddWholeFile.collect()
print(rddWholeFileCollect)

[('dbfs:/FileStore/tables/SparkText', 'Apache Spark is a lightning-fast cluster computing framework designed for real-time processing. Spark is an open-source project from Apache Software Foundation. Spark overcomes the limitations of Hadoop MapReduce, and it extends the MapReduce model to be efficiently used for data processing. Spark is a market leader for big data processing. It is widely used across organizations in many ways. It has surpassed Hadoop by running 100 times faster in memory and 10 times faster on disks.\n\nApache Spark is a fast, flexible, and developer-friendly leading platform for large-scale SQL, machine learning, batch processing, and stream processing. It is essentially a data processing framework that has the ability to quickly perform processing tasks on very large data sets. It is also capable of distributing data processing tasks across multiple computers, either by itself or in conjunction with other distributed computing tools.\n\nApache Spark is a unified 

##### 5. RDD from range() function

In [0]:
rddRange = sc.parallelize(range(1, 6))
rddRangeCollect = rddRange.collect()
print(rddRangeCollect)

[1, 2, 3, 4, 5]


##### 6. RDD from existing RDD

In [0]:
# Transformations such as map(), flatMap(), and filter() create a new RDD from an existing one.
rdd = sc.parallelize([1, 2, 3, 4, 5]) 
newRdd = rdd.map(lambda x: x * 2)

rddNewCollect = newRdd.collect()
print(rddNewCollect)

[2, 4, 6, 8, 10]


##### 7. RDD from JSON data

In [0]:
import json
# json.loads() parses the JSON string into a Python dictionary; parallelize() then receives a one-element list containing that dictionary.

# Create RDD from JSON
json_data = '{"name": "Kumar", "age": 39, "city": "New York"}' 
rddJson = sc.parallelize([json.loads(json_data)])

rddJsonCollect = rddJson.collect()
print(rddJsonCollect)

[{'name': 'Kumar', 'age': 39, 'city': 'New York'}]
