# Lesson 10 - Lazy Evaluation and Persistence

## Prepare Environment

We will begin the lesson by importing some packages and creating `SparkSession` and `SparkContext` objects.

In [0]:
from pyspark.sql import SparkSession
from pyspark.mllib.random import RandomRDDs

import numpy as np
import time

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Lazy Evaluation

The classification of RDD methods as transformations and actions provides more than a conceptual organization of these methods. Spark treats transformations and actions in different ways, and it is important to understand this difference.

When a transformation is called on an RDD, Spark will not immediately perform the requested calculation. Instead, it will add the transformation to a queue of transformations. A queued transformation will not be performed until an action is called that depends on the output of that transformation. This evaluation strategy is referred to as **lazy evaluation**. 

Lazy evaluation allows Spark to postpone expensive transformations until they are actually needed. It also allows Spark to perform some behind-the-scenes optimization on the queued transformations when an action is called.
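
The queueing behavior described above can be imitated in a few lines of plain Python. The sketch below is only a toy model, not how Spark is implemented: the class name `ToyRDD`, the helper `add_one`, and the `calls` list are all hypothetical names introduced for illustration.

```python
class ToyRDD:
    """A toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, transforms=()):
        self._source = source          # zero-argument callable producing the raw values
        self._transforms = transforms  # tuple of queued functions

    def map(self, fn):
        # Transformation: return a new ToyRDD with fn added to the queue.
        return ToyRDD(self._source, self._transforms + (fn,))

    def _compute(self):
        # Generate the values and push each one through the queued functions.
        for value in self._source():
            for fn in self._transforms:
                value = fn(value)
            yield value

    def sum(self):
        # Action: only now is any work actually performed.
        return sum(self._compute())


calls = []  # records every value that add_one is applied to

def add_one(x):
    calls.append(x)
    return x + 1

base = ToyRDD(lambda: range(5))
mapped = base.map(add_one)
calls_before_action = len(calls)   # 0: map() only queued the function
total = mapped.sum()               # 1 + 2 + 3 + 4 + 5 = 15
calls_after_action = len(calls)    # 5: the function ran once per element
```

Calling `map()` leaves `calls` empty. The function is applied only when `sum()` forces the values to be produced.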

We will now consider a few examples to illustrate the effects of lazy evaluation. Note that these examples are not intended to be practical examples. Their only purpose is to provide you with a deeper understanding of how lazy evaluation works. 

In the next cell, we will construct an RDD containing 50 million values selected at random from the interval `[0,1]`. We will use this RDD in the examples in this section.

In [0]:
rand_rdd = RandomRDDs.uniformRDD(sc, size=50000000, numPartitions=8, seed=1)

### Example 1

In our first example, we will use `map()` to perform a simple transformation of our RDD, adding 1 to each element. We will then use the `sum()` action to sum the elements of the resulting RDD. We will measure the time required to perform each of these two RDD methods.

In [0]:
t1 = time.time()
new_rdd = rand_rdd.map(lambda x : x+1)   # Add 1 to each element of rand_rdd
t2 = time.time()
total = new_rdd.sum()                    # Sum elements of new_rdd
t3 = time.time()

print('Map Time:', t2 - t1)
print('Sum Time:', t3 - t2)


Notice that the `map()` transformation took almost no time at all to run, while the `sum()` action took far longer, on the order of 5 seconds (the exact figure depends on your cluster). We are seeing lazy evaluation in action (or inaction, as the case may be).

Since `map()` is a transformation, no calculations are performed when this method is called. Instead, the transformation is placed into a queue until an action is called that requires the results of the `map()`. As a matter of fact, the values in `rand_rdd` don't even exist until an action is called that requires these values to be generated.

No calculations are performed until the `sum()` action is called. At that point, the elements of the RDD are generated, the `map()` transformation is applied, and then the values are totaled, with the final sum being returned to the driver.

### Example 2

In the next example, we will add a second `map()` transformation prior to calling the `sum()` action. We will again measure the time required by each method.

In [0]:
t1 = time.time()
temp_rdd_1 = rand_rdd.map(lambda x : x+1)     # Add 1 to each element of rand_rdd
t2 = time.time()
temp_rdd_2 = temp_rdd_1.map(lambda x : x**2)  # Square each value in the new RDD
t3 = time.time()
temp_rdd_2.sum()                              # Sum elements in the final RDD
t4 = time.time()


print('Map 1 Time:', t2 - t1)
print('Map 2 Time:', t3 - t2)
print('Sum Time:  ', t4 - t3)

Notice that the time required by either of the `map()` transformations was again insubstantial. That is because neither transformation is actually performed until `sum()` is called.

The `sum()` method did take longer to execute in this example than in the previous one. That is because there are two `map()` transformations processed in this example rather than the single one in the previous example. 

However, consider what happens if you replace `temp_rdd_2.sum()` with `temp_rdd_1.sum()` in the code cell above. Try this now. You should notice that the amount of time required by the `sum()` action in this case is similar to that in the first example. This is because the sum in this modified version of the code is not dependent on the results of the second transformation. As a result, that transformation is not triggered by calling the `sum()` action. Only the first transformation is actually performed.
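
The same dependency behavior can be imitated in plain Python, whose built-in `map` is also lazy. This is only an analogy to the Spark example above, not Spark code: the names `add_one`, `square`, `temp_1`, `temp_2`, and `log` are hypothetical, and a tiny list stands in for `rand_rdd`.

```python
log = []  # records which function ran, in order

def add_one(x):
    log.append("add_one")
    return x + 1

def square(x):
    log.append("square")
    return x ** 2

values = [1, 2, 3]

def temp_1():
    # First "transformation": a lazy pipeline that adds 1 to each value.
    return map(add_one, values)

def temp_2():
    # Second "transformation": squares the output of the first pipeline.
    return map(square, temp_1())

# Summing the first pipeline never touches square().
total_1 = sum(temp_1())        # 2 + 3 + 4 = 9
log_after_sum_1 = list(log)    # ['add_one', 'add_one', 'add_one']

log.clear()

# Summing the second pipeline runs both functions, element by element.
total_2 = sum(temp_2())        # 4 + 9 + 16 = 29
log_after_sum_2 = list(log)
```

Only the functions that the summed pipeline depends on are ever called, which mirrors why `temp_rdd_1.sum()` does not pay for the second `map()`.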

## Persistence

By default, Spark will calculate the contents of an RDD only when required to do so by an action. While an action is being processed, the values of an RDD will be stored in the memory of the nodes in the cluster, with each node storing its own partitions of the RDD. However, the RDD values will be removed from memory as soon as the action has completed. This helps to avoid tying up valuable resources with RDDs that are no longer needed, but it also has a downside. Any time an RDD is needed for an action, it will have to be recalculated from the entire chain of transformations that defines that particular RDD. If we expect to perform multiple calculations with an RDD, it can be terribly inefficient to recalculate the RDD from scratch every time.

Fortunately, Spark provides a `.persist()` method that allows us to cache an RDD in memory for reuse. If we call this method on an RDD, then the next time the contents of that RDD are computed, they will be preserved in memory on the nodes in the cluster. This will speed up future calculations involving this RDD, but will also require additional use of limited memory resources. If we later decide to free up these memory resources, we can do so by calling the `.unpersist()` method of the RDD.
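
The effect of `.persist()` and `.unpersist()` on recomputation can be sketched with a small pure-Python model. This is a simplified illustration under stated assumptions, not Spark's caching mechanism: the class `ToyCachedRDD` and its `compute_count` attribute are hypothetical, and everything lives in a single process rather than across a cluster.

```python
class ToyCachedRDD:
    """Toy model: values are recomputed per action unless persisted."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable producing the values
        self._persist = False
        self._cache = None
        self.compute_count = 0    # how many times the values were rebuilt

    def persist(self):
        # Only marks the data for caching; nothing is computed yet.
        self._persist = True
        return self

    def unpersist(self):
        # Drop the cached values and stop caching.
        self._persist = False
        self._cache = None
        return self

    def _values(self):
        if self._cache is not None:
            return self._cache            # reuse the cached values
        self.compute_count += 1
        values = list(self._compute())    # rebuild from scratch
        if self._persist:
            self._cache = values
        return values

    def count(self):
        return len(self._values())

    def sum(self):
        return sum(self._values())


rdd = ToyCachedRDD(lambda: (x ** 2 for x in range(1000)))

rdd.count()
rdd.count()
runs_without_persist = rdd.compute_count      # 2: rebuilt for each action

rdd.persist()
rdd.count()                                   # computes once and caches
rdd.sum()                                     # served from the cache
runs_with_persist = rdd.compute_count - runs_without_persist   # 1

rdd.unpersist()
rdd.count()                                   # rebuilt again after unpersist
runs_total = rdd.compute_count                # 4
```

As in Spark, calling `persist()` does no work by itself. The first action after it pays the full cost, and later actions reuse the stored values.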

### Example 3

In the cell below, we will create an RDD using `map()`, and will then call `count()` on this RDD twice. As in the previous examples, we will calculate the time required by each method call.

In [0]:
t1 = time.time()
my_rdd = rand_rdd.map(lambda x : x**2)   # Square each element of rand_rdd
t2 = time.time()
my_rdd.count()                           # Count elements in my_rdd
t3 = time.time()
my_rdd.count()                           # Count elements in my_rdd
t4 = time.time()

print('Map Time:    ', t2 - t1)
print('Count 1 Time:', t3 - t2)
print('Count 2 Time:', t4 - t3)

Notice that each `count()` action in the cell above took roughly the same amount of time to run. Much of the time required to perform this action is actually spent performing the queued `map()` transformation. This transformation is performed from scratch both times `count()` is called.

To see this, we will modify the example above by persisting `my_rdd` after calling `map()`.

In [0]:
t1 = time.time()
my_rdd = rand_rdd.map(lambda x : x**2)   # Square each element of rand_rdd
t2 = time.time() 
my_rdd.persist()                         # Persist my_rdd
t3 = time.time()
my_rdd.count()                           # Count elements in my_rdd
t4 = time.time()
my_rdd.count()                           # Count elements in my_rdd
t5 = time.time()


print('Map Time:    ', t2 - t1)
print('Persist Time:', t3 - t2)
print('Count 1 Time:', t4 - t3)
print('Count 2 Time:', t5 - t4)


Notice that the second `count()` action takes significantly less time to run in this version of the example. The `map()` transformation defining `my_rdd` is performed when the first `count()` is called, and its result is then stored in memory. As a result, the second call to `count()` does not need to recalculate `my_rdd` and needs only to count the elements in it.

### Example 4

In the next example, we will work with an RDD that is expensive to create. To simulate an expensive operation for creating an RDD, one could define a function such as `expensive_fn()` that performs several pointless calculations but then returns the square root of its input. The cells below use a plain square-root lambda in that role.

In the cell below, we define `sqrt_rdd` using `map()` and a square-root lambda. We then define `filtered_rdd` based on `sqrt_rdd`, and call `count()` and `sum()` on it. When either of these two actions is called, both `sqrt_rdd` and `filtered_rdd` are recalculated from scratch.

In [0]:
t1 = time.time()
sqrt_rdd = rand_rdd.map(lambda x : x ** 0.5)          # Take the square root of each RDD element
t2 = time.time()
filtered_rdd = sqrt_rdd.filter(lambda x : x < 0.1)    # Keep only elements less than 0.1
t3 = time.time()
n = filtered_rdd.count()                              # Count elements in filtered RDD
t4 = time.time()
total = filtered_rdd.sum()                            # Sum elements in filtered RDD
t5 = time.time()

print('Map Time:   ', t2 - t1)
print('Filter Time:', t3 - t2)
print('Count Time: ', t4 - t3)
print('Sum Time:   ', t5 - t4)
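
The start of this example describes a function that does pointless work before returning the square root of its input. A minimal sketch of such a function is shown here. Its body, the `n_iter` parameter, and the default value are assumptions made purely for illustration, since the lesson does not show a definition.

```python
import math

def expensive_fn(x, n_iter=1000):
    """Hypothetical slow square root: burns CPU time, then returns x ** 0.5."""
    junk = 0.0
    for i in range(n_iter):
        # Pointless arithmetic whose result is never used.
        junk += math.sin(i) * math.cos(i)
    return x ** 0.5

example_value = expensive_fn(0.25)   # 0.5
```

A function like this could be passed to `map()` in place of the lambda, as in `rand_rdd.map(expensive_fn)`. With 50 million elements, even a modest `n_iter` adds a great deal of work, so a small value is advisable when experimenting.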

In the cell below, we call `persist()` on `filtered_rdd`. Notice that the `count()` still takes a long time to calculate (since neither `sqrt_rdd` nor `filtered_rdd` is calculated until this step). However, the `sum()` is now performed much more quickly.

In [0]:
t1 = time.time()
sqrt_rdd = rand_rdd.map(lambda x : x ** 0.5)                # Take the square root of each RDD element
t2 = time.time()
filtered_rdd = sqrt_rdd.filter(lambda x : x < 0.1)          # Keep only elements less than 0.1
t3 = time.time()
filtered_rdd.persist()                                      # Persist filtered RDD
t4 = time.time()
filtered_rdd.count()                                        # Count elements in filtered RDD
t5 = time.time()
filtered_rdd.sum()                                          # Sum elements in filtered RDD
t6 = time.time()

print('Map Time:    ', t2 - t1)
print('Filter Time: ', t3 - t2)
print('Persist Time:', t4 - t3)
print('Count Time:  ', t5 - t4)
print('Sum Time:    ', t6 - t5)

In [0]:
filtered_rdd.unpersist()   # Free up the memory used by the persisted RDD