In [3]:
import findspark
findspark.init()
import pyspark

In [4]:
from pyspark import SparkContext
# getOrCreate() reuses a running context instead of raising an error on re-execution
sc = SparkContext.getOrCreate()

In [6]:
# The SparkContext can also be obtained from a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

---
## Creating RDDs:

**1) Interoperating Between DataFrames and RDDs**
Because Python has no Datasets, only DataFrames, calling `.rdd` on a DataFrame gives you an RDD of type
Row:

In [7]:
spark.range(50).rdd

MapPartitionsRDD[5] at javaToPython at NativeMethodAccessorImpl.java:0

To operate on this data, you will need to convert each Row object to the correct data type or extract
values out of it, as shown in the example that follows. The result is an RDD of plain values rather than Rows:

In [9]:
spark.range(50).toDF("id").rdd.map(lambda row: row[0])

PythonRDD[12] at RDD at PythonRDD.scala:53

In [10]:
# Convert an RDD of Rows back into a DataFrame
spark.range(50).rdd.toDF()

DataFrame[id: bigint]
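
The positional access `row[0]` used above works because a Row behaves like a tuple with named fields. As a rough plain-Python analogy (the `Row` defined here is a hypothetical `namedtuple` stand-in, not PySpark's `Row` class), extracting values might look like this:

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row with a single "id" field
Row = namedtuple("Row", ["id"])
rows = [Row(id=i) for i in range(5)]

# Extract the value by position, as in row[0] above
ids_by_position = [row[0] for row in rows]

# Or extract it by field name
ids_by_name = [row.id for row in rows]

print(ids_by_position)
print(ids_by_name)
```

Both forms yield the same plain values; access by name tends to be easier to read once a Row has several columns.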

**2) From a Local Collection** To create an RDD from a collection, use the `parallelize` method on a `SparkContext` (available within a SparkSession). \
*This turns a single-node collection into a parallel collection.* \
You can also explicitly state the number of partitions into which you would like to distribute the collection.

In [12]:
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
.split(" ")
words = sc.parallelize(myCollection, 2)
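
To make the idea of distributing a collection over partitions concrete, here is a minimal plain-Python sketch. The helper `split_into_slices` is hypothetical and only illustrates contiguous, near-equal slicing; it is an assumption for illustration and not necessarily how PySpark assigns elements internally.

```python
def split_into_slices(items, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# Two slices, mirroring the second argument passed to parallelize above
partitions = split_into_slices(my_collection, 2)
for index, chunk in enumerate(partitions):
    print(index, chunk)
```

On a real RDD, `words.glom().collect()` shows the actual contents of each partition.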

In [15]:
# You can also name this RDD so that it shows up
# in the Spark UI under the given name:
words.setName("myWords")
words.name()

'myWords'

**3) From Data Sources** You can read text files with `textFile`; each line of the file becomes one record in the RDD:

In [16]:
# spark.sparkContext.textFile("/some/path/withTextFiles")

---
## Manipulating RDDs:
You manipulate RDDs in much the same way that you manipulate DataFrames. 

### Transformations:
Most transformations mirror the functionality that you find in the Structured
APIs. Just as you do with DataFrames and Datasets, you specify transformations on one RDD to
create another. In doing so, you define one RDD as a dependency of another, along with some
manipulation of the data contained in that RDD.

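Note that RDDs have no `show` method: `words.show()` raises `AttributeError: 'RDD' object has no attribute 'show'`. Use `collect()` or `take(n)` to inspect the contents instead.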

* **Distinct:**

In [19]:
words.distinct().count()

10

* **Filter:**

In [23]:
def startsWithS(individual):
    return individual.startswith("S")

words.filter(lambda word: startsWithS(word)).collect()

['Spark', 'Simple']

* **Map:**

In [35]:
words2 = words.map(lambda word: (word, word[0], word.startswith("S")))
words2.collect()

[('Spark', 'S', True),
 ('The', 'T', False),
 ('Definitive', 'D', False),
 ('Guide', 'G', False),
 (':', ':', False),
 ('Big', 'B', False),
 ('Data', 'D', False),
 ('Processing', 'P', False),
 ('Made', 'M', False),
 ('Simple', 'S', True)]

In [37]:
words2.filter(lambda record: record[2]).take(5)

[('Spark', 'S', True), ('Simple', 'S', True)]

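* **FlatMap:** Sometimes each input row should produce multiple output rows. The function passed to `flatMap` returns an iterable for each row, and the results are flattened into a single RDD. Here each word is expanded into its characters: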

In [41]:
words.flatMap(lambda word: word).take(5)

['S', 'p', 'a', 'r', 'k']
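
The difference between `map` and `flatMap` can be sketched in plain Python. This is only an analogy on a local list (`words_local` is a made-up sample), not Spark code: `map` keeps one output per input, while `flatMap` concatenates the iterables returned for each input.

```python
from itertools import chain

words_local = ["Spark", "The"]

# map-like: exactly one output element (here a list) per input element
mapped = [list(word) for word in words_local]

# flatMap-like: each input yields an iterable, and the results are concatenated
flat_mapped = list(chain.from_iterable(list(word) for word in words_local))

print(mapped)
print(flat_mapped)
```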

* **Sort:**

In [43]:
words.sortBy(lambda word: len(word) * -1).take(5)

['Definitive', 'Processing', 'Simple', 'Spark', 'Guide']

* **Random Splits:**
We can also randomly split an RDD into a list of RDDs by using the `randomSplit` method,
which accepts a list of weights and an optional random seed:

In [47]:
words.randomSplit([0.5, 0.5])
# words.randomSplit([0.5, 0.5])[0].collect()

[PythonRDD[65] at RDD at PythonRDD.scala:53,
 PythonRDD[66] at RDD at PythonRDD.scala:53]
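
As a rough illustration of what a weighted random split does, the following plain-Python sketch assigns each element to one bucket according to normalized weights. The function `random_split_local` is hypothetical and is not Spark's implementation; it only mimics the observable behavior on a local list.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def random_split_local(items, weights, seed=None):
    """Assign each item to exactly one bucket, chosen by weighted random draw."""
    total = float(sum(weights))
    # Cumulative boundaries of the normalized weights, e.g. [0.5, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        # Guard against floating-point rounding in the last boundary
        index = min(bisect_right(bounds, draw), len(weights) - 1)
        splits[index].append(item)
    return splits

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
left, right = random_split_local(words_local, [0.5, 0.5], seed=42)
print(left)
print(right)
```

The weights describe expected proportions only, so on a small collection the two halves will often differ in size. Passing a seed makes the split reproducible.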

### Actions:
We specify actions to kick off our specified transformations. Actions either collect data to the driver or write to an external data source.

* **reduce:** used to specify a function that “reduces” an RDD of any kind of value to one value.\
For instance, given a set of numbers, you can reduce it to its sum by specifying a function that takes two values as input and combines them into one. If you have experience with functional programming, this should not be a new concept:

In [48]:
sc.parallelize(range(1, 21)).reduce(lambda x, y: x+y)

210

In [50]:
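# Find the longest word. "Definitive" and "Processing" are both 10 characters long,
# so the result can be either one, depending on how the partitions are combined.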
def wordLengthReducer(leftWord, rightWord):
    if len(leftWord) > len(rightWord):
        return leftWord
    else:
        return rightWord
words.reduce(wordLengthReducer)

'Processing'
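
Because `reduce` first combines values within each partition and then combines the partial results, a reducer that breaks ties by argument order can return different answers. The plain-Python sketch below simulates this with two hand-written partitions (`partition_a` and `partition_b` are assumptions chosen for illustration) and `functools.reduce`:

```python
from functools import reduce

def word_length_reducer(left_word, right_word):
    # Ties go to the right-hand word, exactly as in the notebook cell above
    return left_word if len(left_word) > len(right_word) else right_word

# Assumed partition contents, for illustration only
partition_a = ["Spark", "The", "Definitive", "Guide", ":"]
partition_b = ["Big", "Data", "Processing", "Made", "Simple"]

# Step 1: reduce inside each partition
best_a = reduce(word_length_reducer, partition_a)
best_b = reduce(word_length_reducer, partition_b)

# Step 2: combine the partial results in either order
a_then_b = word_length_reducer(best_a, best_b)
b_then_a = word_length_reducer(best_b, best_a)

print(best_a, best_b)
print(a_then_b, b_then_a)
```

The two combination orders pick different words, which is why the function passed to `reduce` should be commutative and associative when a deterministic result matters.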

* **count:**

In [51]:
words.count()

10

In [54]:
# countApprox: returns a possibly incomplete count if the timeout (in milliseconds) is exceeded
confidence = 0.80
timeoutMilliseconds = 400
words.countApprox(timeoutMilliseconds, confidence)

10

In [57]:
# countApproxDistinct: approximate number of distinct values; the argument is the relative accuracy
words.countApproxDistinct(0.05)

10

In [59]:
# countByValue: counts each distinct value and returns the result to the driver as a dictionary
words.countByValue()

defaultdict(int,
            {'Spark': 1,
             'The': 1,
             'Definitive': 1,
             'Guide': 1,
             ':': 1,
             'Big': 1,
             'Data': 1,
             'Processing': 1,
             'Made': 1,
             'Simple': 1})
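
For intuition, the counting actions above relate to each other in the same way as operations on a local `collections.Counter`. This is a plain-Python analogy on a local list, not Spark code:

```python
from collections import Counter

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# countByValue-like: occurrences of each distinct value
value_counts = Counter(words_local)

# count-like: total number of elements
total_count = sum(value_counts.values())

# distinct().count()-like: number of distinct values
distinct_count = len(value_counts)

print(value_counts)
print(total_count, distinct_count)
```

As with this local counter, `countByValue` brings the whole result to the driver, so it is best suited to RDDs with a small number of distinct values.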

* **first:**

In [61]:
words.first()

'Spark'

* **max and min:**

In [64]:
spark.sparkContext.parallelize(range(1, 20)).max()

19

In [65]:
sc.parallelize(range(1, 20)).min()

1

* **take:** `take` and its derivative methods (`takeOrdered`, `top`, `takeSample`) return a number of values from your RDD. `take` works by first
scanning one partition and then using the results from that partition to estimate the number of
additional partitions needed to satisfy the limit.

In [69]:
words.take(5)

['Spark', 'The', 'Definitive', 'Guide', ':']

In [70]:
words.takeOrdered(5)

[':', 'Big', 'Data', 'Definitive', 'Guide']

In [71]:
words.top(5)

['The', 'Spark', 'Simple', 'Processing', 'Made']

In [68]:
withReplacement = True
numberToTake = 6
randomSeed = 100
words.takeSample(withReplacement, numberToTake, randomSeed)

['Data', 'Definitive', 'Data', 'The', 'Definitive', 'Spark']
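
The ordering behavior of `take`, `takeOrdered`, and `top` can be sketched on a local list with slicing and the standard `heapq` module. This is a plain-Python analogy that assumes the default ordering (no key function), not Spark code:

```python
import heapq

words_local = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")

# take(5)-like: the first five elements in their existing order
first_five = words_local[:5]

# takeOrdered(5)-like: the five smallest elements, in ascending order
smallest_five = heapq.nsmallest(5, words_local)

# top(5)-like: the five largest elements, in descending order
largest_five = heapq.nlargest(5, words_local)

print(first_five)
print(smallest_five)
print(largest_five)
```

Strings compare by code point, which is why `':'` sorts before every capitalized word.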

### Saving Files:
* **saveAsTextFile:**
To save to a text file, you just specify a path and optionally a compression codec:

In [79]:
# Forward slashes avoid invalid escape sequences such as "\B" and "\d" in the path string
# words.saveAsTextFile("../Book_Exercises/data")

### Caching:
We can either cache or persist an RDD. By default, `cache` and `persist` keep the data in memory only. A cached RDD can be named with the `setName` function referenced earlier in this chapter:

In [80]:
words.cache()

myWords ParallelCollectionRDD[25] at readRDDFromFile at PythonRDD.scala:274

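We can specify a storage level as any of the storage levels in the singleton object
`org.apache.spark.storage.StorageLevel`, which are combinations of memory only, disk
only, and, separately, off heap. `getStorageLevel` reports the current level as the flags (useDisk, useMemory, useOffHeap, deserialized, replication):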

In [81]:
words.getStorageLevel()

StorageLevel(False, True, False, False, 1)

### Checkpointing
Checkpointing is a feature traditionally associated with RDDs rather than the DataFrame API (recent Spark releases also provide `DataFrame.checkpoint`). \
**Checkpointing is the act of saving an RDD to disk** so that future references to this RDD point to those intermediate partitions on disk rather than recomputing the RDD from its original source. \
**This is similar to caching except that the data is stored only on disk, not in memory. This can be helpful when performing iterative computation, similar to the use cases for caching.**

In [82]:
sc.setCheckpointDir("Book_Exercises/data")
# checkpoint() only marks the RDD; the data is written when the next action runs
words.checkpoint()