# Spark Partitioning and Optimization (a) - Exercises with Results

## Exercise 1

#### Task 1
##### Create an RDD called ex_collection from an array with elements ranging from 1 to 50000.
##### Check the number of partitions the RDD was split into.

#### Result:

In [None]:
// Create an array with the elements 1 to 50000 (the end of Array.range is exclusive).
val sample_array = Array.range(1, 50001)

// Parallelize it into an RDD.
val ex_collection = sc.parallelize(sample_array)
println("Number of partitions by default: " + ex_collection.getNumPartitions)
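The default partition count comes from the cluster's default parallelism, and `parallelize` also accepts an explicit slice count. A minimal sketch, assuming the notebook's `sc` is available; the name `tenSlices` and the value 10 are arbitrary choices for illustration:

```scala
// defaultParallelism is what parallelize uses when no slice count is given.
println("Default parallelism: " + sc.defaultParallelism)

// Assumption for illustration: 10 is an arbitrary slice count.
val tenSlices = sc.parallelize(1 to 50000, numSlices = 10)
println("Partitions when requested explicitly: " + tenSlices.getNumPartitions)
```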

#### Task 2
##### Load the rdd-input-exercises text file into an RDD and save as ex_text_lines.
##### Check the number of partitions in our ex_text_lines variable.

#### Result:

In [None]:
val data_dir = "/FileStore/tables"

In [None]:
val ex_text_lines = sc.textFile(data_dir + "/rdd-input-exercises.txt")
ex_text_lines.getNumPartitions

#### Task 3
##### Load data from the file `creditcard.csv` as a DataFrame called creditcard.

#### Result:

In [None]:
val creditcard = spark.read.format("csv")          
  .option("inferSchema", "true")             
  .option("header", "true")                    
  .load(data_dir + "/creditcard.csv")

#### Task 4
##### Show the number of partitions and the number of records per partition in the DataFrame creditcard as we did in class.
##### What can you tell about the number vs size of partitions? How many? Are the partitions approximately equal in size?

#### Result:

In [None]:
creditcard.rdd
.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show

- There are 4 partitions total
- Each partition was split to be approximately equal in size
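The per-partition count is reused several times in these exercises, so it may be convenient to wrap it in a helper. The sketch below assumes the notebook's `spark` session and the `creditcard` DataFrame from Task 3; the name `recordsPerPartition` is hypothetical and not part of the class material. It uses the built-in `spark_partition_id()` function instead of `mapPartitionsWithIndex`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical helper: counts rows per partition with spark_partition_id().
// Caveat: groupBy only reports partitions that hold at least one row,
// so empty partitions do not appear in the output.
def recordsPerPartition(df: DataFrame): DataFrame =
  df.groupBy(spark_partition_id().as("partition_id"))
    .count()
    .withColumnRenamed("count", "number_of_records")
    .orderBy("partition_id")

recordsPerPartition(creditcard).show()
```

Because empty partitions are omitted, the `mapPartitionsWithIndex` version remains the better choice when the goal is to see them.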

#### Task 5
##### Create a new DataFrame called ex_filtered from the creditcard DataFrame with non-fraudulent transactions only and sort it in descending order of the transaction amount.
##### Take a look at the number of records per partition for ex_filtered variable.
##### What can you tell about the number of partitions now and what is the number of records per partition? Are they approximately equal in size?

#### Result:

In [None]:
val ex_filtered = creditcard.filter($"Class" < 1)   // non-fraudulent transactions only
.sort($"Amount".desc)                               // descending order of Amount

In [None]:
ex_filtered.rdd
.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show

- There are 200 partitions in total: sorting triggers a shuffle, and 200 is the default value of `spark.sql.shuffle.partitions`
- Partitions now differ widely in size: some hold a double-digit number of records, while others hold thousands
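The number of partitions after a shuffle is governed by the `spark.sql.shuffle.partitions` setting. A minimal sketch, assuming the notebook's `spark` session and the `creditcard` DataFrame; the value 16 and the name `resorted` are arbitrary choices for illustration. On Spark 3.x, adaptive query execution may merge small partitions, so the reported count can differ from the configured value.

```scala
// Read the current number of shuffle partitions.
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")
println(s"spark.sql.shuffle.partitions = $shufflePartitions")

// Assumption for illustration: 16 is an arbitrary smaller value.
spark.conf.set("spark.sql.shuffle.partitions", "16")
val resorted = creditcard.filter($"Class" < 1).sort($"Amount".desc)
println("Partitions after sort: " + resorted.rdd.getNumPartitions)

// Restore the previous value so later cells behave as before.
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)
```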

#### Task 6
##### Find the mean value for the Amount column in the creditcard DataFrame using the describe() function.
##### Using the value you got, create a new column named amount_level, where
- If the Amount is lower than the mean, the amount_level is set to low
- Otherwise the amount_level is set to high.
- Name this new DataFrame as creditcard2.

##### Show the DataFrame or use SQL interpreter to display the resulting table.

##### Hint: In this Task make use of the withColumn function

#### Result:

In [None]:
creditcard.describe("Amount")
.show()

In [None]:
val creditcard2 = creditcard.withColumn("amount_level", when($"Amount" < 88.35, "low")
                                        .otherwise("high"))

In [None]:
creditcard2.show(10)
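Typing the mean by hand rounds it, so rows whose Amount lies very close to the mean could land in the other level. As an alternative sketch, the mean can be computed once and reused. This assumes `inferSchema` read Amount as a double column; `meanAmount` and `creditcard2Alt` are hypothetical names chosen so that `creditcard2` is not overwritten.

```scala
import org.apache.spark.sql.functions.{avg, when}

// Compute the mean once and reuse it instead of typing 88.35 by hand.
// Assumes inferSchema read Amount as a double column.
val meanAmount = creditcard.agg(avg($"Amount")).first().getDouble(0)

// creditcard2Alt is a hypothetical name to avoid clobbering creditcard2.
val creditcard2Alt = creditcard.withColumn(
  "amount_level",
  when($"Amount" < meanAmount, "low").otherwise("high")
)

// Quick check of how many rows fall into each level.
creditcard2Alt.groupBy("amount_level").count().show()
```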

#### Task 7
##### Check the number of partitions and records per partition of creditcard2.

#### Result:

In [None]:
creditcard2.rdd
.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show

#### Task 8
##### Repartition creditcard2 by the amount_level column and save as creditcard3.
##### Get the number of partitions for creditcard3.

#### Result:

In [None]:
val creditcard3 = creditcard2.repartition($"amount_level")

creditcard3.rdd.getNumPartitions

#### Task 9
##### Take a look at the number of records per partition for creditcard3.
##### What can you say about partitioning behaviour now?

#### Result:

In [None]:
creditcard3.rdd
.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show

- The number of records per partition is very unbalanced. There are many empty partitions, and almost all of the data sits in the same partition. The amount_level column has only 2 levels, “low” and “high”, and the partitioner places all records with the same value in the same partition, so the remaining partitions stay empty even though they are available
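The effect can be illustrated without Spark. The following is a simplified model of hash partitioning in plain Scala, not Spark's actual implementation: it uses `String.hashCode`, whereas Spark hashes DataFrame columns differently, so the partition ids will not match what Spark reports. The helper `partitionFor` is hypothetical.

```scala
// Simplified model: partition = non-negative (hash(key) mod numPartitions).
// Assumption: Spark does not use String.hashCode for DataFrame columns,
// so the exact partition ids here will differ from Spark's.
def partitionFor(key: String, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val numPartitions = 200
val keys = Seq("low", "high")

// Every record with the same key maps to the same partition, so at most
// keys.size partitions can ever receive data.
val used = keys.map(k => partitionFor(k, numPartitions)).toSet

println(s"Distinct keys: ${keys.size}, partitions in use: ${used.size} of $numPartitions")
```

However many partitions are available, two distinct key values can never occupy more than two of them.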

#### Task 10
##### Use repartition by amount_level and set the number of partitions for the `creditcard3` to 50 and name it as `creditcard4`.
##### Show the number of partitions and also the number of records per partition for `creditcard4`.
##### What is the number of records per partition now? In this instance, does it make sense to partition by the amount_level? What function would you use and how many partitions would you create? 

#### Result:

In [None]:
val creditcard4 = creditcard3.repartition(50, $"amount_level")

In [None]:
creditcard4.rdd
.getNumPartitions

In [None]:
creditcard4.rdd
.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show

- Partitioning behaviour is not much better than in the previous Task: amount_level still has only 2 levels, while the number of partitions is now 50
- Most partitions are empty again
- In this instance it does not make sense to partition by the amount_level column. The only number of partitions that makes sense for it is 2, and we have 8 executors on the 2 worker nodes. Most of our executors would sit idle waiting for a task and wasting resources. With only 2 partitions we would not be taking advantage of parallelism, because Spark would be able to run at most 2 processes concurrently
- In this instance coalesce is probably the better option
- Spark documentation suggests that 2-3 tasks per CPU core in a cluster is optimal, which makes the number of partitions equal to 16-24 (given that the data is not very small or very large)
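The sizing arithmetic above can be sketched as follows. It assumes 8 executors with one core each, as in the answer, and the `creditcard2` DataFrame from Task 6; `creditcardEven` is a hypothetical name. The sketch calls `repartition(n)` without a column because `coalesce` can only reduce the number of partitions, so it would not help if the DataFrame currently has fewer partitions than the target.

```scala
// Assumption taken from the answer above: 8 executors, one core each.
val totalCores = 8
val minPartitions = 2 * totalCores   // 16
val maxPartitions = 3 * totalCores   // 24
println(s"Suggested partition count: $minPartitions to $maxPartitions")

// Repartitioning without a column distributes rows round-robin,
// which keeps the partitions roughly equal in size.
// creditcardEven is a hypothetical name.
val creditcardEven = creditcard2.repartition(minPartitions)
println("Partitions: " + creditcardEven.rdd.getNumPartitions)
```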

## Exercise 2

#### Task 1
##### Load data from the file `creditcard.csv` as a DataFrame called creditcard.

#### Result:

In [None]:
val creditcard = spark.read.format("csv")          
  .option("inferSchema", "true")             
  .option("header", "true")                    
  .load(data_dir + "/creditcard.csv")

#### Task 2
##### Run the following code as a user-defined function to measure the time it takes Spark to perform a computation (we used it in class).

```
def time[R](block: => R): R = {
    val t0 = System.nanoTime()
    val result = block    // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
    result
}
```

#### Result: 

In [None]:
def time[R](block: => R): R = {
    val t0 = System.nanoTime()
    val result = block    // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
    result
}

#### Task 3
##### Group the creditcard data by class and get the mean value of the Amount column.
##### Use the time function to see how long the computation takes.

#### Result:

In [None]:
time { 
    creditcard.select($"Amount", $"class")
    .groupBy($"class")
    .agg("Amount" -> "mean")
    .show() 
}
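A single measurement can be dominated by one-off costs such as JVM warm-up, so it may help to repeat the run a few times before comparing cached and uncached timings. A minimal sketch in plain Scala; `timeRuns` is a hypothetical helper, and the summation is only a stand-in workload for the query above.

```scala
// Hypothetical helper: runs a block several times and returns the
// elapsed milliseconds of each run, so that one-off warm-up costs
// can be told apart from steady-state timings.
def timeRuns[R](runs: Int)(block: => R): Seq[Long] =
  (1 to runs).map { _ =>
    val t0 = System.nanoTime()
    block                      // evaluated anew on every run (call-by-name)
    (System.nanoTime() - t0) / 1000000
  }

// Stand-in workload; in the notebook this would be the groupBy/agg/show query.
val timings = timeRuns(5) { (1 to 1000000).map(_.toLong).sum }
println("Elapsed per run (ms): " + timings.mkString(", "))
println("Median (ms): " + timings.sorted.apply(timings.size / 2))
```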

## Exercise 3

#### Task 1
##### Cache the creditcard dataset.
##### Also, perform any Spark action (you can also perform a transformation before the action) on the creditcard variable to make sure that Spark caches the dataset before we perform any other computations.

#### Result:

In [None]:
creditcard.cache()

creditcard.take(20) //<- take is an action
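To check that the DataFrame is actually marked for caching, its `storageLevel` can be inspected. Note also that `take` may read only as many partitions as it needs, so the cache can end up partially filled; an action that scans everything, such as `count`, is a safer way to materialize it. A minimal sketch, assuming the `creditcard` DataFrame above; `rowCount` is a hypothetical name.

```scala
import org.apache.spark.storage.StorageLevel

// storageLevel reports how the DataFrame is marked for caching;
// StorageLevel.NONE means it is not marked at all.
println("Marked for caching: " + (creditcard.storageLevel != StorageLevel.NONE))
println("Storage level: " + creditcard.storageLevel)

// count() scans every partition, so it fills the cache completely.
val rowCount = creditcard.count()
println("Rows scanned: " + rowCount)
```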

#### Task 2
##### Using the cached dataset, perform the same computation we did in Exercise 2 Task 3.
##### Use the time function to also track how long the computation takes.
##### How long did the computation take this time? How does it compare to the same code on the uncached data?

#### Result:

In [None]:
time { 
    creditcard.select($"Amount", $"class")
    .groupBy($"class")
    .agg("Amount" -> "mean")
    .show() 
}

#### Task 3
##### Let’s re-import the file from `creditcard.csv` and save as `creditcard_persist`.
##### After you save the file as a DataFrame, use the persist() function with StorageLevel.MEMORY_ONLY on the DataFrame.
##### Also, perform any Spark action (you can also perform a transformation before the action) on creditcard_persist variable to make sure that Spark caches the dataset before we perform any other computations.

#### Result:

In [None]:
val creditcard_persist = spark.read.format("csv")          
  .option("inferSchema", "true")             
  .option("header", "true")                    
  .load(data_dir + "/creditcard.csv")

In [None]:
import org.apache.spark.storage.StorageLevel

creditcard_persist.persist(StorageLevel.MEMORY_ONLY)
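// sample is a transformation, so an action (count) is needed to trigger caching.
creditcard_persist.sample(true, //<- with replacement
                          0.1,  //<- fraction of rows to sample
                          123)  //<- random seed
                  .count()      //<- count is an action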

#### Task 4
##### Using the persisted dataset, perform the same computation we did in Exercise 2 Task 3.
##### Use the time function to also track how long the computation takes.
##### How long did the computation take this time? How does it compare to the same computation on the dataset without persisting from Exercise 2 Task 3?

#### Result:

In [None]:
time { 
    creditcard_persist.select($"Amount", $"class")
       .groupBy($"class")
       .agg("Amount" -> "mean")
       .show() 
} 

#### Task 5
##### Now let’s unpersist the creditcard DataFrame, which is currently cached.
##### Remember you don’t have to take any additional action for Spark to execute the unpersist function.

#### Result: 

In [None]:
creditcard.unpersist()

#### Task 6
##### Perform the same computation as we did in Exercise 2 Task 3 again to see how long it takes with the unpersisted DataFrame.
##### How long did the computation take this time?

#### Result:

In [None]:
time { 
    creditcard.select($"Amount", $"class")
     .groupBy($"class")
     .agg("Amount" -> "mean")
     .show() 
}

#### Bonus Task

##### Use persist, but with another StorageLevel memory option to compare to MEMORY_ONLY. You can select one of the following:
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
- DISK_ONLY

##### Discuss the difference in performance (if any) within your group.
##### Make sure to unpersist when you are done.

#### Result:

In [None]:
creditcard.persist(StorageLevel.MEMORY_AND_DISK)
creditcard.first()

In [None]:
time { 
    creditcard.select($"Amount", $"class")
     .groupBy($"class")
     .agg("Amount" -> "mean")
     .show() 
}

In [None]:
creditcard.unpersist()
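To compare several storage levels in one go, the persist, time and unpersist steps can be wrapped in a loop. This is a sketch rather than a benchmark: it assumes the `time` function from Exercise 2 Task 2 and the `creditcard` DataFrame are defined, that `creditcard` is not currently persisted, and the selection of levels is arbitrary.

```scala
import org.apache.spark.storage.StorageLevel

// Storage levels to compare; the selection is an arbitrary example.
val levels = Seq(
  "MEMORY_ONLY"     -> StorageLevel.MEMORY_ONLY,
  "MEMORY_AND_DISK" -> StorageLevel.MEMORY_AND_DISK,
  "DISK_ONLY"       -> StorageLevel.DISK_ONLY
)

for ((name, level) <- levels) {
  creditcard.persist(level)
  creditcard.count()                     // action that materializes the cache
  println(s"--- $name ---")
  time {
    creditcard.groupBy($"Class").agg("Amount" -> "mean").show()
  }
  creditcard.unpersist(blocking = true)  // clear before trying the next level
}
```

Unpersisting between iterations matters because persisting a DataFrame that is already cached does not switch it to the new level.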