## Working with RDDs (ch 3)

You control a Spark process by means of a Spark Context. In Python, the Spark Context is available through the built-in variable `sc`, which is already bound to the Context. When using Java (or Scala), you need to do the binding yourself.

In [0]:
sc

You will use `sc` to ask Spark to load the content of a file into an RDD (see Example 3.1 below).

## RDD: Resilient Distributed Dataset ##

The basic data abstraction for all Spark programs is the RDD. Think of an RDD as an immutable, distributed collection of objects (data elements) of a certain type.

Note that Spark 2.x supports a higher-level abstraction for relational data, called **DataFrames**. If you are familiar with Python Pandas or the R language, you will recognise Spark DataFrames as a very similar abstraction for tabular data. We will look at DataFrames in detail later. For now, you may refer to this tutorial: [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)

Users create RDDs in two ways: by loading an external dataset (Example 3.1 below), or programmatically, by specifying that a collection of objects, for instance a list or set, is to be split into N partitions and processed by a set of *workers*:

In [0]:
# the second argument specifies the number of partitions the data is split into:
myData = sc.parallelize([1,2,3,5],3)
myData

In [0]:
# with no second argument, Spark chooses a default number of partitions:
sc.parallelize(["pandas", "i like pandas"])

In [0]:
#Example (3.1): create a new RDD by loading a text file using the SparkContext sc
lines = sc.textFile('/FileStore/tables/Dante_Inferno.txt') 

## Transformations and Actions ##

Spark operates on RDDs in two ways:

- through **transformations**. A transformation takes an input RDD and produces a new RDD. Example: filtering an RDD.
- through **actions**. An action takes an RDD and produces data of some other type, which typically encodes the result of some data analysis. Example: counting the number of lines in a file.

Example of a transformation:

The `myRDD.filter()` function filters `myRDD` according to a condition. The condition is specified by passing a function as the argument to `filter()`. Here is the general Python syntax for passing a lambda function to another function:

In [0]:
infernoLines = lines.filter(lambda x: "inferno" in x)
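The same syntax can be tried in plain Python, without Spark. The sketch below uses Python's built-in `filter()` on an ordinary list to show that a named function and a lambda are interchangeable as arguments. `sampleLines` and the function names are made up for illustration:

```python
# A named function that tests a single line.
def containsInferno(line):
    return "inferno" in line

# Made-up sample lines, for illustration only.
sampleLines = ["the inferno begins", "a dark wood", "inferno, canto one"]

# Python's built-in filter() accepts a named function...
withNamed = list(filter(containsInferno, sampleLines))

# ...or an equivalent anonymous (lambda) function written in place.
withLambda = list(filter(lambda line: "inferno" in line, sampleLines))

print(withNamed == withLambda)
```

The difference from `RDD.filter()` is that the built-in version works on a local sequence, whereas Spark applies the function to the elements of a distributed RDD.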

Action: a result computed from an RDD, which can be either returned to the driver program or saved to an external storage system (e.g., HDFS). Examples:

In [0]:
# operates on the lines RDD
lines.count()

In [0]:
# returns the first 20 elements of the RDD as a list
lines.take(20)

Actions are typically performed at the end of a pipeline consisting of one or more transformations.

Here is a new transformation that makes use of the `map()` higher-order function:

In [0]:
uppercaseLines = lines.map(lambda s: s.upper())

Like `filter()`, `map()` is a transformation. It takes a function as input (a named or a lambda function) and applies it to each element of the input RDD to produce an output RDD.

In [0]:
uppercaseLines

In [0]:
uppercaseLines.take(5)

In [0]:
infernoLines.first()

Note that the result of an action is no longer an RDD! Here, `take(n)` returns a list of strings, while `first()` returns a single string.

Note that you may chain the two transformations and the action into a single command using 'dot notation':

In [0]:
lines.filter(lambda x: "inferno" in x).map(lambda s: s.upper()).take(5)

Alternatively, you can wrap these operations in functions of your own:

In [0]:
def upperCase(doc):
    return doc.map(lambda s: s.upper())

In [0]:
def filterDocForTerm(doc, term):
    return doc.filter(lambda x: term in x)

In [0]:
upperCase(filterDocForTerm(lines, "inferno")).take(10)

We can also explicitly define functions that we can then pass to `map()`:

In [0]:
def removex92(s):
    return s.replace('\x92','\'')

In [0]:
lines.filter(lambda x: "inferno" in x).map(removex92).take(5)

or equivalently:

In [0]:
lines.filter(lambda x: "inferno" in x).map(lambda s: s.replace('\x92','\'')).take(5)

In [0]:
# another obvious action: counting the number of elements
infernoLines.count(), uppercaseLines.count()

## Lazy evaluation ##

Spark computes a pipeline in a **lazy** fashion: the resulting RDD remains virtual and no computation actually occurs until the RDD is used as input to an action.

Thus, actions have the potential to trigger the execution of an entire complex pipeline.

If Spark were to load and store all the lines in the file as soon as we wrote `lines = sc.textFile(...)`, it would waste a lot of storage space, given that we then immediately filter out many lines. Instead, once Spark sees the whole chain of transformations, it can compute just the data needed for its result. In fact, for the `first()` action, Spark scans the file only until it finds the first matching line; it doesn’t even read the whole file.
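A rough plain-Python analogy of this behaviour, not a description of how Spark is implemented, can be built with generators. In the sketch below, `readLines`, `data` and `touched` are made-up names. The generator records every line it actually reads, so we can check how much of the input was touched:

```python
def readLines(lines, log):
    # Generator: yields one line at a time and records each line it reads.
    for line in lines:
        log.append(line)
        yield line

# Made-up data standing in for the lines of a file.
data = ["a dark wood", "the inferno begins", "amore", "inferno again"]
touched = []

# Building the pipeline does no work yet: nothing has been read.
pipeline = (line for line in readLines(data, touched) if "inferno" in line)
touchedBefore = len(touched)

# Asking for the first match pulls only as many lines as needed.
firstMatch = next(pipeline)
touchedAfter = len(touched)

print(touchedBefore, touchedAfter, firstMatch)
```

Only the lines up to and including the first match are read. The rest of `data` is never visited.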

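Note that a virtual computation may be described by a graph. In the example below, `lines` feeds `cleanLines`, which in turn feeds two separate branches, `upperInferno` and `lowerAmore`.

Try changing the file path below to a non-existent name like `/FileStore/tables/Dante_Inferno-XXX.txt`.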

In [0]:
lines = sc.textFile('/FileStore/tables/Dante_Inferno.txt') 

In [0]:
cleanLines = lines.map(lambda s: s.replace('\x92',''))

In [0]:
upperInferno = cleanLines.filter(lambda x: "inferno" in x).map(lambda s: s.upper())

In [0]:
lowerAmore = cleanLines.filter(lambda x: "amore" in x).map(lambda s: s.lower())

At this point, no computation has actually occurred. In fact, the file hasn't even been opened (try changing the path to the file).

However, when an action is performed on *at least one* of the output RDDs, `upperInferno` or `lowerAmore`, Spark schedules the part of the execution graph that leads to that RDD in a distributed fashion, using the available workers, in order to produce a result:

In [0]:
upperInferno.collect()

In [0]:
lowerAmore.collect()

Have you tried using the wrong filename? At which point do you get a runtime error?

## Persistence ##

An RDD that is used in multiple actions is recomputed each time, as part of the pipeline that leads to each action. There is therefore a potential for re-computing the same RDD multiple times.

However, we may tell Spark that it should reuse a partial result when we know it is safe to do so. This is done using `RDD.persist()`. For example:

In [0]:
# without persistence, this action causes the entire pipeline to be executed again:
upperInferno.count()

In [0]:
# this marks cleanLines to be kept in memory once it is first computed, so that the file is only loaded once:
cleanLines.persist()

Obviously, there is a trade-off between the cost of reloading / recomputing and the memory needed for storing a persisted RDD.

In [0]:
upperInferno.persist()
lowerAmore.persist()

In [0]:
# the first run computes and caches each RDD; repeated invocations of these actions are then inexpensive:
upperInferno.count(), lowerAmore.count()

# Practice: loading and operating on a movie dataset #

We now practice these notions on a different dataset, and extend our overview of the set of transformations and actions available through Spark.

In [0]:
inputRDD = sc.textFile('/FileStore/tables/sample_movielens_movies.txt')

In [0]:
inputRDD.take(5)

Examples of **transformations**:

In [0]:
thrillerRDD = inputRDD.filter(lambda x: "Thriller" in x)

In [0]:
comedyRDD = inputRDD.filter(lambda x: "Comedy" in x)

You can take the **union** of two RDDs:

In [0]:
whatILikeRDD = thrillerRDD.union(comedyRDD)

Examples of **actions**: printing out the first few lines of each movie category

In [0]:
thrillerRDD.take(5)

In [0]:
comedyRDD.take(10)

In [0]:
whatILikeRDD.take(5)

Note: the `collect()` action retrieves the entire RDD and returns it to the driver program.

*Note that this forces Spark to execute the entire pipeline*, so it should not be used on large datasets unless you know that processing requires the entire dataset to be consumed outside of the RDD framework.

In [0]:
whatILikeRDD.count()

**Passing functions to Spark** (ch 3 pg. 30). We have seen this before...

In [0]:
def isThriller(s): 
    return "Thriller" in s

In [0]:
thrillers = whatILikeRDD.filter(isThriller)

In [0]:
thrillers.take(5)