# Chapter 12. Resilient Distributed Datasets (RDDs)

## Creating RDDs

Users need to be concerned with only two types of RDD: the "generic" RDD and the key-value RDD. Key-value RDDs have special operations as well as a concept of custom partitioning by key.
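To make "partitioning by key" concrete, here is a minimal plain-Python sketch rather than Spark code. The helper `partition_for` and the use of `zlib.crc32` are assumptions made for illustration; Spark uses its own hashing. The point is only that records sharing a key always land in the same partition.

```python
import zlib

def partition_for(key, num_partitions):
    """Hypothetical helper: map a key to a partition index with a stable hash."""
    # zlib.crc32 is chosen only because it is deterministic across runs;
    # it is an assumption of this sketch, not the hash Spark uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Illustrative key-value records (made up for this sketch)
pairs = [("spark", 1), ("guide", 2), ("spark", 3), ("data", 4)]
num_partitions = 2

# Route every record to the partition chosen by its key
partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    partitions[partition_for(key, num_partitions)].append((key, value))
```

Because the partition index depends only on the key, both `"spark"` records end up together, which is what lets key-based operations work without moving data between partitions.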




### From a Local Collection

In [2]:
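# Split the book title into a list of words
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")

# Distribute the local list as an RDD with 2 partitions
words = spark.sparkContext.parallelize(myCollection, 2)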

In [3]:
words.collect()

['Spark',
 'The',
 'Definitive',
 'Guide',
 ':',
 'Big',
 'Data',
 'Processing',
 'Made',
 'Simple']
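The second argument to `parallelize` above asks for 2 partitions. As a rough mental model, the local list is cut into that many pieces. The sketch below is plain Python, not Spark; the helper `slice_collection` is hypothetical, and contiguous, near-equal slices are an assumption made for illustration.

```python
def slice_collection(items, num_slices):
    """Hypothetical helper: cut a list into num_slices contiguous chunks."""
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# Two chunks, mirroring the "2" passed to parallelize
slices = slice_collection(my_collection, 2)

# Concatenating the chunks in order gives back the original list,
# which is what collect() does across partitions
flattened = [word for chunk in slices for word in chunk]
```

On a real RDD, `words.glom().collect()` returns the elements grouped by partition, which is a quick way to check how the data was actually distributed.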

In [4]:
words.setName("myWords")
words.name() # myWords

u'myWords'

## Manipulating RDDs


### Transformations

In [6]:
# distinct
words.distinct().count()

10

In [8]:
# filter: keep only the words that start with "S"
def startsWithS(individual):
  return individual.startswith("S")

words.filter(startsWithS).collect()

['Spark', 'Simple']

In [17]:
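# map: turn each word into a (word, first letter, starts with "S") tuple
words2 = words.map(lambda word: (word, word[0], word.startswith("S")))
# keep only the records whose third field is True
words2.filter(lambda record: record[2]).take(10)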

[('Spark', 'S', True), ('Simple', 'S', True)]

In [29]:
# flatMap: each word expands into its individual characters
words.flatMap(lambda word: list(word)).take(10)


['S', 'p', 'a', 'r', 'k', 'T', 'h', 'e', 'D', 'e']
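The difference between `map` and `flatMap` is easiest to see side by side. The following is a plain-Python analogue, not Spark code, and `words_local` is a local list standing in for the RDD: `map` yields one output per input (here, one list per word), while `flatMap` concatenates those per-word results into a single flat sequence.

```python
from itertools import chain

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# map-like: one output element per word, each of them a list of characters
mapped = [list(word) for word in words_local]

# flatMap-like: the per-word lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(mapped))

# Analogue of take(10)
first_ten = flat_mapped[:10]
```

Here `mapped` still has ten elements, one per word, whereas `flat_mapped` has one element per character.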

In [33]:
# sortBy: negate the length so the longest words come first
words.sortBy(lambda word: len(word) * -1).take(2)

['Definitive', 'Processing']
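Negating the key is one way to get a descending sort. A plain-Python analogue (not Spark code, with `words_local` as a stand-in list) shows that it is equivalent to sorting by length in reverse order. How Spark orders elements with equal keys is not something this sketch models; Python's `sorted` is stable, so ties keep their original order here.

```python
words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# Same trick as in the cell above: a negated key sorts longest-first
by_negated_length = sorted(words_local, key=lambda word: len(word) * -1)[:2]

# Equivalent formulation: sort by length and reverse the order
by_reverse_flag = sorted(words_local, key=len, reverse=True)[:2]
```

PySpark's `sortBy` also accepts an `ascending` parameter, so `words.sortBy(lambda word: len(word), ascending=False)` expresses the same intent without negating the key.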

In [37]:
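# randomSplit: the weights are sampling probabilities, not exact proportions,
# so the size of each split varies from run to run
fiftyFiftySplit = words.randomSplit([0.5, 0.5])
fiftyFiftySplit[0].collect()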

['Spark', 'Guide', 'Big', 'Data', 'Processing', 'Simple']
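The first split above holds six of the ten words rather than exactly five, because each element is assigned to a split at random. The sketch below is a plain-Python model of that behavior, not Spark's implementation; the helper `random_split` and the one-draw-per-element scheme are assumptions made for illustration.

```python
import random

def random_split(items, weights, seed):
    """Hypothetical helper: assign each item to one split, chosen by weight."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split's share of the [0, 1) interval
    bounds = []
    cumulative = 0.0
    for weight in weights:
        cumulative += weight / total
        bounds.append(cumulative)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        u = rng.random()  # uniform draw in [0, 1)
        for index, upper in enumerate(bounds):
            if u < upper:
                splits[index].append(item)
                break
    return splits

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
first, second = random_split(words_local, [0.5, 0.5], seed=42)
```

Every word lands in exactly one split, but the split sizes depend on the random draws. PySpark's `randomSplit` takes an optional `seed` argument as well, which makes a particular split reproducible.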