# Spark Data Structures and Parallelism - Exercises with results

## Exercise 1

#### Task 1
##### Create an RDD called RDD_1 from an array with elements 10,20,30,40.

#### Result:

In [None]:
val Array_1 = Array(10,20,30,40)
val RDD_1 = sc.parallelize(Array_1)

#### Task 2
##### Create an RDD called RDD_2 from the text file `rdd-input-exercises.txt`.
##### This text file describes some examples of types of credit card fraud.

#### Result:

In [None]:
val data_dir = "/FileStore/tables"

In [None]:
val RDD_2 = sc.textFile(data_dir + "/rdd-input-exercises.txt")

#### Task 3
##### Create a new RDD called RDD_identity from the lines in RDD_2 that contain the word identity.
##### Create a new RDD called RDD_cardholder from the lines in RDD_2 that contain the word cardholder.

#### Result:

In [None]:
val RDD_identity = RDD_2.filter(line => line.contains("identity"))
val RDD_cardholder = RDD_2.filter(line => line.contains("cardholder"))

#### Task 4
##### Combine RDD_identity and RDD_cardholder in an RDD called RDD_union by taking the union.

#### Result:

In [None]:
val RDD_union = RDD_identity.union(RDD_cardholder)

#### Task 5
##### Count the number of lines in RDD_union.

#### Result:

In [None]:
RDD_union.count()

#### Task 6
##### Print the first 3 lines in RDD_union.

#### Result:

In [None]:
RDD_union.take(3)
.foreach(println)
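A note on Tasks 4 and 5: `union` keeps duplicates, so a line that contains both words appears twice in the result and is counted twice. The sketch below illustrates this on a small in-memory collection. The sample sentences and the names `session`, `ctx`, `lines`, `withIdentity`, `withCardholder`, and `combined` are made up for illustration and are not taken from `rdd-input-exercises.txt`.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("union-sketch").getOrCreate()
val ctx = session.sparkContext

// Hypothetical sample lines; the first one contains both search words.
val lines = ctx.parallelize(Seq(
  "identity theft affects the cardholder",
  "stolen identity",
  "the cardholder reports a lost card",
  "skimming at a terminal"
))

val withIdentity = lines.filter(_.contains("identity"))
val withCardholder = lines.filter(_.contains("cardholder"))
val combined = withIdentity.union(withCardholder)

println(combined.count())            // 4: the line with both words is counted twice
println(combined.distinct().count()) // 3: distinct() removes the repeated line
```

If each line should be counted once, calling `distinct()` on the union (or filtering once with both conditions joined by `||`) is one possible approach.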

## Exercise 2

#### Task 1
##### Create a dataframe called creditcard from the file `creditcard.csv`.
##### Choose option("inferSchema", "true") and option("header", "true").
##### This dataset contains credit card transactions of different amounts, each flagged as fraudulent or not. It also contains 28 columns of PCA transformations of sensitive data that could not be shared.

#### Result:

In [None]:
val creditcard = spark.read.format("csv")       // read in CSV format
  .option("inferSchema", "true")                // infer column types from the data
  .option("header", "true")                     // use the first line as column names
  .load(data_dir + "/creditcard.csv")           // load the file

#### Task 2
##### View the schema of the dataframe.

#### Result: 

In [None]:
creditcard.printSchema()

#### Task 3
##### View the first 10 rows of the Amount and Class columns from the dataframe.

#### Result:

In [None]:
creditcard.select("Amount", "Class")
.show(10)

#### Task 4
##### Create a SQL TempView of the dataframe and name it creditcard_view.


#### Result:

In [None]:
creditcard.createOrReplaceTempView("creditcard_view")

#### Task 5
##### Using a SQL query, print the Amount and Class columns.

#### Result:

In [None]:
spark.sql("SELECT Amount, Class FROM creditcard_view")
.show()

#### Task 6
##### Let’s go back to the creditcard DataFrame.
##### Select the Amount column and show the first 10 entries.

#### Result: 

In [None]:
creditcard.select("Amount")
.show(10)

#### Task 7
##### Using the agg() function, aggregate and find the maximum value from the Amount column.

#### Result:

In [None]:
creditcard.agg("Amount"->"max")
.show()

#### Task 8
##### Filter the dataframe for rows with Class greater than 0 and sort these rows by descending order of Amount. Show the first 5 rows.

#### Result:

In [None]:
creditcard.filter($"Class" > 0)
.sort($"Amount".desc)
.show(5)

#### Task 9
##### Group by Class and take the mean of Amount for each group.

#### Result:

In [None]:
creditcard.select($"Amount", $"Class")
.groupBy($"Class")
.agg("Amount" -> "mean")
.show()
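The filter, sort, and group-by steps from Tasks 7 to 9 can also be written with the column functions in `org.apache.spark.sql.functions`. The sketch below is a minimal illustration on a made-up five-row DataFrame, not on `creditcard.csv`. The names `session`, `toy`, `fraudByAmount`, and `perClass`, as well as the sample values, are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, desc, max}

// Reuses the notebook's session if one exists; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()

// Hypothetical toy data with the same two column names as the exercise.
val toy = session.createDataFrame(Seq(
  (10.0, 0), (250.0, 1), (40.0, 0), (99.5, 1), (5.0, 1)
)).toDF("Amount", "Class")

// Rows with Class > 0, largest Amount first.
val fraudByAmount = toy.filter(col("Class") > 0).orderBy(desc("Amount"))
fraudByAmount.show(5)

// Mean and maximum Amount per Class, with readable column aliases.
val perClass = toy.groupBy("Class")
  .agg(avg("Amount").as("mean_amount"), max("Amount").as("max_amount"))
  .orderBy("Class")
perClass.show()
```

Compared with the `"Amount" -> "mean"` pair syntax, the function form allows several aggregates with aliases in one `agg` call.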

## Exercise 3

#### Task 1
##### Define a case class with the types for each of the variables in the DataFrame for the creditcard data and name it creditcard_class.
##### Refer to the schema printed in Exercise 2 Task 2.
- For the Time column, make sure to cast as Long type.

#### Result:

In [None]:
case class creditcard_class (
    Time: Double,
    V1: Double,
    V2: Double,
    V3: Double,
    V4: Double,
    V5: Double,
    V6: Double,
    V7: Double,
    V8: Double,
    V9: Double,
    V10: Double,
    V11: Double,
    V12: Double,
    V13: Double,
    V14: Double,
    V15: Double,
    V16: Double,
    V17: Double,
    V18: Double,
    V19: Double,
    V20: Double,
    V21: Double,
    V22: Double,
    V23: Double,
    V24: Double,
    V25: Double,
    V26: Double,
    V27: Double,
    V28: Double,
    Amount: Double, 
    `Class`: Integer)

#### Task 2
##### Load the file `creditcard.csv` as a DataFrame and type cast as creditcard_class to create a Dataset.
##### Save the object as creditcard_set Dataset.

#### Result:

In [None]:
val creditcard_set = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(data_dir + "/creditcard.csv")
  .as[creditcard_class]
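Regarding the note about the Time column: `.as[...]` matches columns to case class fields by name and, as far as we know, only performs safe up-casts, so a column inferred as Double would need an explicit `cast` before it can map to a Long field. The sketch below shows one way to do that on a made-up three-column DataFrame. `ToyTxn`, `session`, `raw`, `typed`, and the sample values are hypothetical and stand in for the full 31-column case.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical three-field stand-in for creditcard_class.
case class ToyTxn(Time: Long, Amount: Double, Class: Int)

// Reuses the notebook's session if one exists; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()

// Time starts out as Double here, as it might after schema inference.
val raw = session.createDataFrame(Seq(
  (0.0, 149.62, 0), (1.0, 2.69, 0), (406.0, 0.0, 1)
)).toDF("Time", "Amount", "Class")

val typed = raw
  .withColumn("Time", col("Time").cast("long")) // explicit cast from Double to Long
  .as[ToyTxn](Encoders.product[ToyTxn])         // encoder passed explicitly instead of via implicits

typed.printSchema()
typed.collect().filter(_.Class == 1).foreach(println)
```

The encoder is passed explicitly only to keep the sketch independent of `import spark.implicits._`; in the notebook, `.as[ToyTxn]` alone would normally be enough.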

#### Task 3
##### Print the first few rows of the creditcard_set Dataset.

#### Result:

In [None]:
creditcard_set.take(10)  
.foreach(println)

#### Task 4
##### Using `columns` and `schema`, print the column names and the schema of creditcard_set.

#### Result:

In [None]:
creditcard_set.columns

In [None]:
creditcard_set.schema

#### Task 5
##### Using select, print the values from the class column.

#### Result: 

In [None]:
creditcard_set.select("Class")
.take(20)
.foreach(println)

#### Task 8
##### Group creditcard_set by class and get the sum and the maximum value of Amount.

#### Result:

In [None]:
import org.apache.spark.sql.functions.{max, sum}  // aggregate functions used below

creditcard_set.groupBy("class")
.agg(sum("Amount"), max("Amount"))
.collect()
.foreach(println)

#### Task 9
##### Convert creditcard_set to a Spark RDD and name it creditcard_RDD.

#### Result:

In [None]:
val creditcard_RDD = creditcard_set.rdd

## Exercise 4

#### Task 1
##### Let’s go back to using RDD_2, which is a text file we loaded in exercise 1.
##### Print the default parallelism in the current Spark context and also the number of partitions for RDD_2.

#### Result: 

In [None]:
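println(sc.defaultParallelism)   // default parallelism of the Spark context
println(RDD_2.getNumPartitions)  // number of partitions of RDD_2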

#### Task 2
##### Reload the text file `rdd-input-exercises.txt` with 4 partitions.
##### Save the object as RDD_4.

#### Result: 

In [None]:
val RDD_4 = sc.textFile(data_dir + "/rdd-input-exercises.txt", 4)

#### Task 3
##### Count the number of records per partition for RDD_4 and save as ex_num_records.
##### Print ex_num_records.

#### Result: 

In [None]:
val ex_num_records = RDD_4.glom()
.map(_.length)
.collect()

ex_num_records.foreach(println)

#### Task 4
##### Make a dataframe listing partitions and number of records in each partition for RDD_4.

#### Result: 

In [None]:
RDD_4.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show
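Tasks 3 and 4 count records per partition in two ways: `glom()` turns each partition into an array and takes its length, while `mapPartitionsWithIndex` consumes the partition's iterator without building that array. The sketch below compares the two on a small in-memory RDD, so it does not depend on the text file. The names `session`, `ctx`, `numbers`, `perPartitionGlom`, and `perPartitionIter` are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()
val ctx = session.sparkContext

// Ten numbers spread over exactly 4 partitions.
val numbers = ctx.parallelize(1 to 10, 4)

// Approach 1: materialize each partition as an array and take its length.
val perPartitionGlom = numbers.glom().map(_.length).collect()

// Approach 2: count by consuming each partition's iterator, keeping the partition id.
val perPartitionIter = numbers
  .mapPartitionsWithIndex((id, records) => Iterator((id, records.size)))
  .collect()

println(numbers.getNumPartitions) // 4
println(perPartitionGlom.sum)     // 10: every record lands in exactly one partition

session.createDataFrame(perPartitionIter.toSeq)
  .toDF("partition_id", "number_of_records")
  .show()
```

For large partitions the iterator-based count is likely the lighter option, since it avoids holding a whole partition in memory as an array.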

#### Task 5
##### Check the default number of partitions of the creditcard DataFrame.

#### Result: 

In [None]:
creditcard.rdd
.getNumPartitions

#### Task 6
##### Make a dataframe listing partitions and number of records in each partition for creditcard.

#### Result: 

In [None]:
creditcard.rdd
.mapPartitionsWithIndex{
    case (id, records) => Iterator((id, records.size))
}.toDF("partition_id","number_of_records")
.show