## RDD example

This notebook provides a dummy example of a map on a Spark RDD, which can be used to check that parallelisation works on your setup.

It consists of creating an RDD with `n_partitions` partitions and applying a map function that waits for 2 seconds. Each partition holds a single element, so the function waits 2 seconds per partition.
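As a back-of-the-envelope check, the run time you should observe can be estimated before starting Spark. The sketch below assumes that every task sleeps for the full duration, that scheduling overhead is negligible, and that tasks run in "waves" of at most one task per core. The helper name `expected_duration` is hypothetical and not part of Spark.

```python
import math

def expected_duration(n_tasks, n_cores, task_seconds=2.0):
    """Estimate wall-clock seconds for n_tasks sleeping tasks on n_cores cores.

    Assumption: tasks run in waves of at most n_cores at a time and
    scheduling overhead is ignored.
    """
    waves = math.ceil(n_tasks / n_cores)
    return waves * task_seconds

# 8 partitions on 4 cores (the values used later in this notebook)
print(expected_duration(8, 4))  # 4.0
# The same job with no parallelism at all
print(expected_duration(8, 1))  # 16.0
```

If the job takes closer to 16 seconds than to 4 seconds, the tasks are most likely running one after another.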

#### General imports

In [1]:
import os 
import time

#### Start Spark session

A Spark session is created with the `pyspark.sql.SparkSession` object. See [here](https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession) for the API documentation on `SparkSession`.


In [2]:
# This is needed to start a Spark session from the notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=2g pyspark-shell"

from pyspark.sql import SparkSession


In [None]:
#Uncomment below to recreate a Spark session with other parameters
#spark.stop()
spark = SparkSession \
    .builder \
    .master("local[4]") \
    .appName("demoRDD") \
    .getOrCreate()
    
# When dealing with RDDs, we work with the SparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc = spark.sparkContext
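Before running any job, it can help to confirm that the context was really created with the settings you asked for. The sketch below is a small hypothetical helper (`describe_context` is not a Spark function) that reads a few standard `SparkContext` properties: `master`, `appName`, `defaultParallelism` and `uiWebUrl`.

```python
def describe_context(sc):
    """Collect the context settings that matter for this parallelism check.

    `sc` is expected to be a pyspark SparkContext; only its public
    properties are read here.
    """
    return {
        "master": sc.master,                            # e.g. "local[4]"
        "app_name": sc.appName,                         # e.g. "demoRDD"
        "default_parallelism": sc.defaultParallelism,   # tasks Spark runs at once by default
        "ui_url": sc.uiWebUrl,                          # address of the Spark UI for this session
    }
```

Calling `describe_context(sc)` in the next cell should report the master given to the builder. If a session already existed, `getOrCreate()` returns it unchanged, so a different master here is a hint to uncomment `spark.stop()` and recreate the session.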

#### Start dummy Spark jobs

In [None]:
# Wait function
def wait2s(x):
    time.sleep(2)
    return x

In [None]:
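n_partitions = 8

# One element per partition, so each task calls wait2s exactly once
data = range(0, n_partitions)
datardd = sc.parallelize(data, n_partitions)

# collect() triggers the job: one task per partition
datardd.map(wait2s).collect()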
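The Spark UI is the most detailed way to inspect the job, but timing the call is a quick first check. The sketch below wraps the map and collect step in a wall-clock timer. `timed_collect` is a hypothetical helper, and it only assumes that its first argument offers `map()` and `collect()`, as an RDD does.

```python
import time

def timed_collect(rdd, func):
    """Run rdd.map(func).collect() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = rdd.map(func).collect()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

In this notebook it could be called as `result, elapsed = timed_collect(datardd, wait2s)`. With `local[4]` and 8 partitions, an elapsed time of roughly 4 seconds plus some scheduling overhead suggests that four tasks ran at a time, while a value near 16 seconds suggests that they ran sequentially.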

### Open Spark UI and check parallelisation

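Open the Spark UI at `127.0.0.1:4040` and look at the stage of the job above. With `local[4]` and 8 partitions, you should see at most 4 tasks running at the same time.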

![](./DemoRDD.png)