## Overview
1. Every Spark application consists of a driver program that runs
the user's main function and executes various parallel operations on
a cluster.
2. An RDD (resilient distributed dataset) is a collection of elements
partitioned across the nodes of the cluster that can be operated on in
parallel.
3. Users can ask Spark to persist an RDD in memory so that it can be
reused efficiently across parallel operations.
4. RDDs automatically recover from node failures.
5. Shared variables can be used in parallel operations.
6. By default, when Spark runs a function in parallel as a set of tasks
on different nodes, it ships a copy of each variable used in the
function to each task.
7. Spark supports two types of shared variables: broadcast variables,
    which can be used to cache a value in memory on all nodes, and
    accumulators, which are variables that are only "added" to, such
    as counters and sums.
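
Point 6 can be illustrated without a cluster. The sketch below is plain Python, not Spark's API: it assumes that shipping a variable to a task behaves like a pickle round trip, and the names `run_task`, `driver_state` and `shipped` are hypothetical. Each simulated task works on its own copy, so nothing it changes reaches the driver's variable.

```python
import pickle

def run_task(serialized_state, element):
    # Each task deserializes its own private copy of the shipped variable.
    state = pickle.loads(serialized_state)
    state["counter"] += element
    return state["counter"]

# The driver owns the original variable and ships a serialized copy of it.
driver_state = {"counter": 0}
shipped = pickle.dumps(driver_state)

task_results = [run_task(shipped, x) for x in [1, 2, 3, 4, 5]]

print(task_results)             # [1, 2, 3, 4, 5]
print(driver_state["counter"])  # 0, the driver's copy was never touched
```

This is the reason shared variables (broadcast variables and accumulators) exist: ordinary variables only travel from the driver to the tasks, never back.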

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
## The first thing a Spark program must do is to create a SparkContext
## which tells Spark how to access a cluster. To create a SparkContext
## you first need to build a SparkConf object that contains information
## about your application
conf = SparkConf().setAppName("RDD programing guide").setMaster("local")
sc = SparkContext(conf=conf)

## RDDs can be created in two ways
1. Calling the parallelize method on an existing iterable or collection in your
driver program
2. Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat

In [2]:
## parallelize collections
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
## We can perform parallel operations once the collection is distributed
distData.reduce(lambda a, b: a + b)

15
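
A rough picture of what happens behind `parallelize` and `reduce`: the collection is cut into partitions, each partition is reduced on its own, and the partial results are combined. The sketch below is plain Python rather than Spark code, and `split_into_partitions` is a hypothetical helper whose slicing rule is an assumption, not Spark's exact one. In PySpark itself the number of partitions can be passed as the second argument, for example `sc.parallelize(data, 10)`.

```python
from functools import reduce

def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices (hypothetical helper)."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

def add(a, b):
    return a + b

data = [1, 2, 3, 4, 5]
partitions = split_into_partitions(data, 2)

# Reduce every non-empty partition independently, then combine the partials.
partials = [reduce(add, p) for p in partitions if p]
total = reduce(add, partials)

print(partitions)  # [[1, 2, 3], [4, 5]]
print(partials)    # [6, 9]
print(total)       # 15
```

Because the partial results may be combined in any grouping, the function passed to `reduce` should be commutative and associative, as addition is here.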

In [3]:
## External Datasets
## Spark supports text files, SequenceFiles, and any other Hadoop
## InputFormat
distFile = sc.textFile("README.md")

In [4]:
distFile.map(lambda s: len(s)).reduce(lambda a, b: a + b)

3706

## Writable Support
When reading an RDD of key-value pairs from a SequenceFile, the PySpark
SequenceFile support loads an RDD of key-value pairs within Java,
converts Writables to base Java types, and pickles the resulting Java
objects using Pyrolite.

In [5]:
rdd = sc.parallelize(range(1,4)).map(lambda x: (x, "a" * x))
## When saving an RDD of key-value pairs to a SequenceFile, PySpark
## does the reverse. It unpickles Python objects into Java objects
## and then converts them to Writables.
rdd.saveAsSequenceFile("./saveSequence")
sorted(sc.sequenceFile("./saveSequence").collect())

[(1, 'a'), (2, 'aa'), (3, 'aaa')]

## Saving and Loading Other Hadoop Input/Output Formats
PySpark can also read any Hadoop InputFormat or write any Hadoop
OutputFormat, for both the 'new' and 'old' Hadoop MapReduce APIs. If
required, a Hadoop configuration can be passed in as a Python dict.

If you have custom serialized binary data (such as data loaded from
Cassandra / HBase), then you will first need to transform that data
on the Scala/Java side into something that can be handled by Pyrolite's
pickler.

A Converter trait is provided for this. Simply extend this trait and
implement your transformation code in the convert method.

## RDD Operations
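RDDs support two types of operations:
1. Transformations
    create a new dataset from an existing one.
2. Actions
    return a value to the driver program after running a computation
    on the dataset.

Transformations are lazy: they do not compute their results right away.
Spark only remembers the transformations applied to the base dataset and
computes them when an action requires a result to be returned to the driver program.

By default, each transformed RDD may be recomputed every time you run an action on it. However, you may also persist an RDD in memory, on disk, or replicated across multiple nodes, so that later actions can reuse it.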
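
To make the laziness concrete, here is a toy model in plain Python. `LazyDataset` is a hypothetical class written for this note and is not how Spark implements RDDs. Its `map` only records the function, and nothing runs until the `reduce` action is called.

```python
from functools import reduce

class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, source, funcs=()):
        self._source = source
        self._funcs = tuple(funcs)

    def map(self, func):
        # Transformation: only record the function, compute nothing yet.
        return LazyDataset(self._source, self._funcs + (func,))

    def reduce(self, func):
        # Action: apply the recorded functions, then combine the results.
        def apply_all(x):
            for f in self._funcs:
                x = f(x)
            return x
        return reduce(func, (apply_all(x) for x in self._source))

calls = []

def traced_len(s):
    # Record every invocation so we can see when the work really happens.
    calls.append(s)
    return len(s)

lines = LazyDataset(["spark", "rdd", "lazy"])
line_lengths = lines.map(traced_len)
calls_before_action = len(calls)

total_length = line_lengths.reduce(lambda a, b: a + b)
calls_after_action = len(calls)

print(calls_before_action)  # 0
print(total_length)         # 12
print(calls_after_action)   # 3
```

Calling `reduce` a second time on this toy object would run `traced_len` three more times, which mirrors why an unpersisted RDD may be recomputed for every action.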

In [10]:
## Basics
lines = sc.textFile("README.md")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

## If we wanted to use lineLengths again later, we could call persist().
## Called here, after reduce has already run, it marks lineLengths to be
## kept in memory the next time an action computes it.
lineLengths.persist()
totalLength

3706

## Passing Functions to Spark
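Spark recommends three ways to pass functions from the driver program
to run on the cluster:
1. Lambda expressions, for simple functions that can be written as an expression
2. Local defs inside the function calling into Spark, for longer code
3. Top-level functions in a module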

In [11]:
def myFunc(s):
    words = s.split(" ")
    return len(words)

sc.textFile("README.md").map(myFunc).reduce(lambda a,b: a + b)

566

## Understanding closures
One of the harder things about Spark is understanding the scope and
life cycle of variables and methods when executing code across a cluster.
RDD operations that modify variables outside of their scope can be
a frequent source of confusion. In the example below we'll look at code
that uses foreach() to increment a counter, but similar issues can
occur for other operations as well.

In [13]:
## Example
counter = 0
rdd = sc.parallelize(data)

# Wrong: Don't do this !!
def increment_counter(x):
    global counter
    counter += x

rdd.foreach(increment_counter)
print("Counter value:", counter)

Counter value: 0
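
For this kind of "add only" update, the overview points to accumulators. In PySpark they are created with `sc.accumulator(0)`, updated inside tasks with `add`, and read on the driver through `.value`. The sketch below is plain Python, not that API. It only models the idea under stated assumptions: `run_task` and the two-way split in `partitions` are hypothetical, each task adds into a local value, and only the driver merges the updates.

```python
def run_task(partition):
    # Each task adds into its own local accumulator, starting from zero.
    local_sum = 0
    for x in partition:
        local_sum += x
    return local_sum  # the update that is sent back to the driver

# Assumed split of data = [1, 2, 3, 4, 5] into two partitions.
partitions = [[1, 2, 3], [4, 5]]

counter = 0
for update in map(run_task, partitions):
    counter += update  # only the driver merges updates into the final value

print("Counter value:", counter)  # Counter value: 15
```

The difference from the global-variable version is the direction of the data flow: updates are explicitly sent back and merged on the driver instead of being applied to a task-local copy that is then discarded.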


## Local vs. cluster modes
The behavior of the above code is undefined, and it may not work as intended.
To execute jobs, Spark breaks up the processing of RDD operations into
tasks, each of which is executed by an executor. Prior to execution,
Spark computes the task's closure. The closure is those variables
and methods which must be visible for the executor