<a href="https://colab.research.google.com/github/Dhaneshkp/Spark/blob/main/Creating%20RDDs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
---

<center> <h1> Creating RDDs </h1> </center>

A ***`Resilient Distributed Dataset`*** (***`RDD`***) is Spark's main abstraction: an immutable, fault-tolerant collection of elements that is partitioned so it can be processed in parallel. This notebook shows how to create RDDs.


We can create an RDD in two ways:

1. **Parallelizing an existing collection**
2. **Referencing an external dataset**


---


### `IMPORTING THE REQUIRED LIBRARIES`

---

In [1]:
# Install pyspark
!pip install pyspark

from pyspark import SparkContext

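# Create the SparkContext, the entry point for working with RDDs.
# getOrCreate() reuses a running context, so this cell can be re-run safely.
sc = SparkContext.getOrCreate()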



---

#### `1. Parallelizing existing collections`

![](https://github.com/Dhaneshkp/Spark/blob/main/images/sc_collection.png?raw=1)



---

In [2]:
# Create a list
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [3]:
# Type of my_list
type(my_list)

list

In [4]:
# Create an RDD from the list using the parallelize function
rdd_list = sc.parallelize(my_list)

In [None]:
# Type of rdd_list
type(rdd_list)

pyspark.rdd.RDD

####  `Check the default number of partitions`

---

In [5]:
# Check number of partitions
rdd_list.getNumPartitions()

2

In [6]:
# Stop the existing SparkContext
sc.stop()

# Recreate the SparkContext with increased timeout settings
from pyspark import SparkConf, SparkContext

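# Raise the Python worker and network timeouts and set memory limits.
# Note: in client mode spark.driver.memory set through SparkConf may not take
# effect, because the driver JVM has already started at this point.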
conf = SparkConf() \
    .set("spark.python.worker.reuse", "true") \
    .set("spark.python.worker.timeout", "600") \
    .set("spark.network.timeout", "800s") \
    .set("spark.executor.heartbeatInterval", "60s") \
    .set("spark.driver.maxResultSize", "2g") \
    .set("spark.executor.memory", "4g") \
    .set("spark.driver.memory", "4g")

sc = SparkContext(conf=conf)

# Check data inside each partition
rdd_list = sc.parallelize(my_list)
rdd_list.glom().collect()

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

#### `Define the number of partitions`

---

In [7]:
# Create a list
country_list = ["India", "USA", "South Africa", "Australia", "France"]

In [8]:
# Create an RDD with an explicit number of partitions
rdd_country = sc.parallelize(country_list, numSlices=2)

In [9]:
# Check number of partitions
rdd_country.getNumPartitions()

2

In [10]:
# Check inside each partition
rdd_country.glom().collect()

[['India', 'USA'], ['South Africa', 'Australia', 'France']]
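
Both `glom()` outputs above are consistent with cutting the list into contiguous, near-equal slices. The plain-Python sketch below reproduces that index arithmetic without Spark. `slice_bounds` and `split_into_slices` are hypothetical helper names, and the formula is an assumption chosen because it matches the outputs shown here; it is not a statement of PySpark's internals, which also batch elements during serialization.

```python
def slice_bounds(length, num_slices):
    """Return (start, end) index pairs for contiguous, near-equal slices."""
    return [
        (i * length // num_slices, (i + 1) * length // num_slices)
        for i in range(num_slices)
    ]


def split_into_slices(items, num_slices):
    """Cut a collection into num_slices contiguous lists (hypothetical helper)."""
    items = list(items)
    return [items[start:end] for start, end in slice_bounds(len(items), num_slices)]


my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
country_list = ["India", "USA", "South Africa", "Australia", "France"]

numbers_split = split_into_slices(my_list, 2)
countries_split = split_into_slices(country_list, 2)

print(numbers_split)
print(countries_split)
```

With five elements and two slices the boundaries are (0, 2) and (2, 5), so the extra element falls in the second slice, which is why `'France'` appears in the second partition.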

####  `2. External dataset`


![](https://github.com/Dhaneshkp/Spark/blob/main/images/sc_external_dataset.png?raw=1)



---

In [13]:
# Create an RDD from an external text file (each line becomes one element)
rdd_movies = sc.textFile("movies.csv", minPartitions=5)

In [14]:
# Check number of partitions
rdd_movies.getNumPartitions()

5
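
`minPartitions` is a lower bound passed to the underlying Hadoop input format as a hint, so the resulting partition count can be higher than the value requested. The sketch below is a simplified plain-Python model of how the classic Hadoop `FileInputFormat` sizes splits for a single uncompressed file. `estimate_splits` is a hypothetical helper, and the file sizes, the 32 MB block size and the 1.1 slop factor are assumptions for illustration, not measurements of `movies.csv`.

```python
def estimate_splits(total_bytes, min_partitions,
                    block_size=32 * 1024 * 1024, min_size=1, slop=1.1):
    """Simplified model of split sizing for one file (assumed defaults)."""
    # Target size per split if the file were divided evenly.
    goal_size = total_bytes // max(min_partitions, 1)
    # A split is never larger than a block and never smaller than min_size.
    split_size = max(min_size, min(goal_size, block_size))

    splits = []
    remaining = total_bytes
    # Emit full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > slop:
        splits.append(split_size)
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining:
        splits.append(remaining)
    return splits


# Hypothetical file sizes in bytes, both read with minPartitions=5.
splits_even = estimate_splits(1000, 5)
splits_small = estimate_splits(14, 5)

print(len(splits_even), splits_even)
print(len(splits_small), splits_small)
```

In this model the 1000-byte file yields exactly five splits, while the 14-byte file yields seven because integer division rounds the target split size down to 2 bytes. Getting back exactly 5 for `movies.csv` is therefore the expected common case, not a guarantee.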