# Creating RDDs

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable,
distributed collection of objects. Each RDD is divided into logical partitions, which may
be computed on different nodes of the cluster.
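
As a rough illustration of what "divided into logical partitions" means, the plain-Python sketch below splits a collection into contiguous slices. The helper name `split_into_partitions` is hypothetical, and the slicing rule is an assumption chosen for simplicity; it is not claimed to be Spark's own partitioning logic.

```python
def split_into_partitions(items, num_partitions):
    """Split a sequence into contiguous slices (hypothetical helper, not a Spark API)."""
    items = list(items)
    n = len(items)
    # Slice i covers indices [i*n // k, (i+1)*n // k), so slice sizes differ by at most one.
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(range(10), 2)
print(partitions)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

Each inner list stands in for one partition that a separate worker could process independently.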

There are two ways to create RDDs: 
- parallelizing an existing collection in your driver program, or 
- referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark uses RDDs to run MapReduce-style operations faster and more efficiently.

**Creating SparkContext:**

To execute any operation in Spark, you first have to create an object of the SparkContext class. A SparkContext represents the connection to an existing Spark cluster and provides the entry point for interacting with Spark.
We need a SparkContext instance so that we can interact with Spark and distribute our jobs.

In [None]:
#!pip install findspark

In [1]:
# Windows
import findspark
findspark.init()
findspark.find()

'C:\\Tools\\spark-3.3.0-bin-hadoop3'

In [None]:
#!pip install pyspark

In [2]:
import pyspark
sc = pyspark.SparkContext(appName='Spark RDDs')

**Examples:**

In [None]:
a = range(10)

In [None]:
data = sc.parallelize(a,2)

In [None]:
type(data)

In [None]:
data.count()

In [None]:
data.take(10)

## Operations on RDDs


Spark provides operations that can be performed on an RDD. An operation is a method applied to
an RDD to accomplish a certain task, and it can be as simple as sorting, filtering or
summarizing data. RDDs support two types of operations: transformations and actions.

- **Transformation:** A transformation is an operation applied to an RDD to create a new RDD. filter, groupBy and map are examples of transformations.
- **Action:** An action is an operation applied to an RDD that instructs Spark to perform the computation and send the result back to the driver. collect, count and reduce are examples of actions.
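
The split between the two kinds of operation matters because transformations are lazy: they only describe a new RDD, and nothing is computed until an action asks for a result. The toy class below is a plain-Python sketch of that idea. `LazyDataset` and `square` are hypothetical names introduced for illustration and are not part of PySpark.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not part of PySpark)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet applied

    # Transformations: return a new dataset and only record what to do.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Action: runs every recorded step and returns a plain Python list.
    def collect(self):
        result = list(self._source)
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


calls = []  # records every time the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
print(len(calls))         # 0: nothing has been computed yet
print(squares.collect())  # [0, 4, 16]
print(len(calls))         # 5: the action triggered the computation
```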

### Transformation: map and flatMap

When applying the map() transformation, each item in the parent RDD will map to one element in the new RDD.

In [3]:
data1 = ['Hello', 'Spark', 'Programming','Python'] # python list
rdd=sc.parallelize(data1)
rdd1 = rdd.map(lambda x: (x,1))
rdd1.collect()

[('Hello', 1), ('Spark', 1), ('Programming', 1), ('Python', 1)]

In [4]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
# parallelize() only defines the RDD; nothing is computed yet
wordsRDD = sc.parallelize(wordsList, 2)
# collect() is an action, so the RDD is actually evaluated here
wordsRDD.collect()

['cat', 'elephant', 'rat', 'rat', 'cat']

In [5]:
d= wordsRDD.distinct()
d.collect()

['cat', 'elephant', 'rat']

In [6]:
wordPairs = wordsRDD.map(lambda w: (w, 1))
wordPairs.collect()

[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

flatMap() is similar to map(), except that with flatMap() each input item can be mapped to zero
or more output elements.
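
The "zero or more" case can be seen without Spark. The sketch below is a plain-Python analogy, not PySpark code: a list comprehension plays the role of map(), and `itertools.chain.from_iterable` plays the role of flatMap(). The sample `lines` list is made up for illustration.

```python
from itertools import chain

lines = ["hello spark", "", "python"]

# map-like: exactly one output per input (here, a list of words per line).
mapped = [line.split() for line in lines]
print(mapped)  # [['hello', 'spark'], [], ['python']]

# flatMap-like: each input yields zero or more outputs, flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
print(flat_mapped)  # ['hello', 'spark', 'python']
```

The empty line still produces one (empty) element under the map-like version, but contributes nothing to the flattened result.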

**Comparing map and flatMap:**

In [7]:
# Use map
singularAndPluralWordsRDDMap = wordsRDD.map(lambda x: (x, x + 's'))
# View the results
print (singularAndPluralWordsRDDMap.collect())
# View the number of elements in the RDD
print (singularAndPluralWordsRDDMap.count())

[('cat', 'cats'), ('elephant', 'elephants'), ('rat', 'rats'), ('rat', 'rats'), ('cat', 'cats')]
5


In [8]:
# Use flatMap - maps zero or more times
singularAndPluralWordsRDD = wordsRDD.flatMap(lambda x: (x, x + 's'))
print (singularAndPluralWordsRDD.collect())
print (singularAndPluralWordsRDD.count())

['cat', 'cats', 'elephant', 'elephants', 'rat', 'rats', 'rat', 'rats', 'cat', 'cats']
10


In [9]:
sc.parallelize([3,4,5]).map(lambda x: [x, x*x]).collect()

[[3, 9], [4, 16], [5, 25]]

In [10]:
sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect()

[3, 9, 4, 16, 5, 25]

### Transformation: filter

The "filter" transformation returns a new RDD containing only the elements that satisfy a given condition.

In [11]:
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
# keep only the even numbers (the lambda parameter no longer shadows the RDD named x)
y = x.filter(lambda n: n % 2 == 0)
y.collect()

[2, 4, 6, 8, 10]

### Transformation: groupByKey / reduceByKey

We can apply the “groupByKey” and “reduceByKey” transformations to an RDD of (key, value) pairs. “groupByKey” groups the values for each key in the original RDD and produces a new pair RDD in which each key maps to the collection of its values. “reduceByKey” instead merges the values for each key using a binary function such as addition.
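
To make the difference concrete, here is a plain-Python sketch of the two ideas on a small list of pairs. It is only an analogy using `defaultdict` and `functools.reduce`, not how Spark executes these transformations across a cluster.

```python
from collections import defaultdict
from functools import reduce
from operator import add

pairs = [("a", 1), ("b", 1), ("a", 1)]

# groupByKey-like: collect every value under its key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
print(sorted(grouped.items()))  # [('a', [1, 1]), ('b', [1])]

# reduceByKey-like: fold the values of each key with a binary function.
reduced = {key: reduce(add, values) for key, values in grouped.items()}
print(sorted(reduced.items()))  # [('a', 2), ('b', 1)]
```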

In [15]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.groupByKey().collect()

[('b', <pyspark.resultiterable.ResultIterable at 0x1c13546aa30>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x1c13546aeb0>)]

In [12]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.groupByKey().mapValues(len).collect())

[('a', 2), ('b', 1)]

In [13]:
sorted(rdd.groupByKey().mapValues(list).collect())

[('a', [1, 1]), ('b', [1])]

In [16]:
wordCountsCollected = wordPairs.reduceByKey(lambda x, y: x + y)
wordCountsCollected.collect()

[('cat', 2), ('elephant', 1), ('rat', 2)]

In [17]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())

[('a', 2), ('b', 1)]

### Action: reduce

The reduce action aggregates all the elements of an RDD by repeatedly applying a user-supplied function to pairs of values.
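
Because an RDD is split into partitions, the function passed to reduce should be commutative and associative; otherwise the result can depend on how the data happens to be split. The plain-Python sketch below illustrates this under a simplifying assumption: each partition is reduced on its own and the partial results are then reduced again. `reduce_by_partitions` is a hypothetical helper, not a Spark API.

```python
from functools import reduce
from operator import add, sub


def reduce_by_partitions(partitions, func):
    """Reduce each partition, then reduce the partial results (hypothetical helper)."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


numbers = list(range(1, 101))
two_parts = [numbers[:50], numbers[50:]]
four_parts = [numbers[i:i + 25] for i in range(0, 100, 25)]

# Addition is associative and commutative, so the split does not matter.
print(reduce_by_partitions(two_parts, add))   # 5050
print(reduce_by_partitions(four_parts, add))  # 5050

# Subtraction is not associative, so the answer depends on the split.
print(reduce_by_partitions(two_parts, sub))   # 2400
print(reduce_by_partitions(four_parts, sub))  # 4096
```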

In [7]:
num_rdd = sc.parallelize(range(1,101))
num_rdd.reduce(lambda x,y: x+y)

5050

In [9]:
num_rdd.stats()

(count: 100, mean: 50.5, stdev: 28.86607004772212, max: 100.0, min: 1.0)

### Action: count

The count action returns the number of elements in the RDD.

In [19]:
num_rdd.count()

100

### Action: max, min, sum, variance and stdev

In [20]:
num_rdd.max(), num_rdd.min()

(100, 1)

In [21]:
num_rdd.sum(),num_rdd.variance(),num_rdd.stdev()

(5050, 833.25, 28.86607004772212)
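
The variance and stdev values above are population statistics (dividing by N); RDDs also provide sampleVariance() and sampleStdev() for the N - 1 versions. As a cross-check that needs no Spark, the sketch below recomputes the same figures with Python's `statistics` module.

```python
import math
import statistics

values = [float(n) for n in range(1, 101)]

pop_var = statistics.pvariance(values)    # divides by N
sample_var = statistics.variance(values)  # divides by N - 1

print(pop_var)                                               # 833.25
print(math.isclose(math.sqrt(pop_var), 28.86607004772212))   # True
print(sample_var > pop_var)                                  # True
```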

### Action: getNumPartitions

With “getNumPartitions”, we can find out how many partitions our RDD has.

In [22]:
num_rdd.getNumPartitions() # default number of partitions

4

In [23]:
x.getNumPartitions()

2

### Transformation: distinct

We can apply the “distinct” transformation to an RDD to get its distinct elements.

In [24]:
wd = wordsRDD.distinct()
print(wd.collect())

['cat', 'elephant', 'rat']


In [25]:
len(wd.collect())

3

## Creating RDD from Files

In [26]:
data = sc.textFile("sample.txt")
data.collect()

['Hi',
 'Welcome to hadoop big data spark programming',
 'using python programming',
 'hadoop spark']

In [27]:
type(data)

pyspark.rdd.RDD

## Creating RDD from HDFS File : WordCount

In [1]:
# load file from hdfs
contentRDD =sc.textFile("hdfs://localhost:9000/sample.txt")

In [30]:
# Omit empty lines
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0)
# Split each line into words on single spaces
words = nonempty_lines.flatMap(lambda x: x.split(' '))
# Count each word, swap to (count, word), then sort by count in ascending order
wordcount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).map(lambda x: (x[1], x[0])).sortByKey(True)
# Collect the (count, word) pairs and print them in ascending order of count
for word in wordcount.collect():
    print(word)

(1, 'Hi')
(1, 'Welcome')
(1, '')
(1, 'using')
(1, 'python')
(1, 'to')
(1, 'big')
(1, 'data')
(2, 'hadoop')
(2, 'programming')
(2, 'spark')
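
The `(1, '')` entry in the output above is an empty-string token. One likely cause, assumed here because the file content is not shown in full, is a line with a trailing or doubled space: `split(' ')` keeps the empty piece, while `split()` with no argument drops it. The sample `line` below is hypothetical.

```python
from collections import Counter

# Hypothetical line with a trailing space, standing in for the HDFS file content.
line = "hadoop spark "

print(line.split(' '))  # ['hadoop', 'spark', '']
print(line.split())     # ['hadoop', 'spark']

# Counting tokens shows the empty string only with split(' ').
print(Counter(line.split(' '))[''])  # 1
print(Counter(line.split())[''])     # 0
```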


**Example 2:**

In [None]:
lines = sc.textFile("war_and_peace.txt",2)
nonNullLines = lines.filter(lambda line: len(line)>0) 
words = nonNullLines.flatMap(lambda line: line.split())

In [None]:
upperWords = words.map(lambda word: word.upper())
pairedOnes = upperWords.map(lambda uw: (uw, 1))
wordCounts = pairedOnes.reduceByKey(lambda a, b: a + b)  # avoid shadowing the built-in next()

In [None]:
for word in wordCounts.take(5):
    print("*****", word)

In [None]:
wordCounts.count()

In [None]:
# Run myapp.py
# spark-submit --master local[*]  myapp.py
# spark-submit --master spark://hduser-VirtualBox:7077 myapp.py
# spark-submit --master yarn myapp.py

# http://localhost:4040/jobs/

In [4]:
from pyspark.sql import SparkSession
# Build (or reuse) a SparkSession and take its SparkContext
spark = SparkSession.builder.appName("sparkrdd example").getOrCreate()
sc = spark.sparkContext
words = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(words.first())

('a', 1)


takeOrdered(n) returns the first n elements of the RDD in ascending order, or in the order given by an optional key function.

In [6]:
print(words.takeOrdered(2))

[('a', 1), ('a', 1)]
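
For intuition, `heapq.nsmallest` from the standard library behaves much like takeOrdered on a local list. The sketch below is a plain-Python analogy on made-up data; the `numbers` list is hypothetical.

```python
import heapq

words = [("a", 1), ("b", 1), ("a", 1)]

# Smallest two elements in ascending order, like takeOrdered(2).
print(heapq.nsmallest(2, words))  # [('a', 1), ('a', 1)]

# A key function changes the ordering; negating a number gives descending order.
numbers = [10, 1, 2, 9, 3]
print(heapq.nsmallest(3, numbers))                    # [1, 2, 3]
print(heapq.nsmallest(3, numbers, key=lambda x: -x))  # [10, 9, 3]
```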
