# Datasets

The Datasets API, added in Spark 1.6, combines the benefits of RDDs (strong typing and the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.

A Dataset can be constructed from Scala objects or case classes and then manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.), much like an RDD. The benefit is that, unlike with an RDD, these transformations are applied to a structured, strongly typed distributed collection, which lets Spark apply several optimizations through its Catalyst optimizer.

Typically, Jupyter creates a SparkSession (`spark`) and a SparkContext (`sc`) for you, but in this case we need to recreate the SparkSession to be able to import its implicits:

In [2]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MakeLabeledCartesian").getOrCreate()
import spark.implicits._

In [9]:
val dataset = Seq(1, 2, 3).toDS()
dataset.show()

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+




Similarly, if you have a sequence of case class instances, calling `.toDS()` produces a Dataset whose columns correspond to the fields of the case class:

In [10]:
case class Person(name: String, age: Int)
val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()

+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+
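
As a minimal sketch of the functional transformations mentioned at the top of this section, the cell below filters and maps a Dataset of `Person` objects with ordinary Scala lambdas. The session setup and the case class are repeated so the cell stands alone; the `master("local[*]")` setting, the application name, and the `olderNames` variable are assumptions made for illustration and are not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable; in Jupyter, getOrCreate()
// returns the already running session instead.
val spark = SparkSession.builder()
  .appName("TypedTransformations")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()

// filter and map receive Person objects, so field access such as
// `_.age` is checked by the Scala compiler rather than at runtime.
val olderNames = personDS
  .filter(_.age > 32)
  .map(_.name.toUpperCase)

olderNames.show()
```

The result is a `Dataset[String]`, so later steps remain typed as well.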



# Creating datasets from RDD and DataFrames

Datasets can be easily converted to and from DataFrames and RDDs.
Calling `.toDS()` on an RDD converts it into a Dataset:

In [14]:
val rdd = sc.parallelize(Seq((1, "Spark"), (2, "Databricks")))
val integerDS = rdd.toDS()
integerDS.show()

+---+----------+
| _1|        _2|
+---+----------+
|  1|     Spark|
|  2|Databricks|
+---+----------+



In [23]:
val integerDF = rdd.toDF("Id","Name")
integerDF.show()

+---+----------+
| Id|      Name|
+---+----------+
|  1|     Spark|
|  2|Databricks|
+---+----------+



You can call `df.as[SomeCaseClass]` to convert a DataFrame to a Dataset. The fields of the case class must match the DataFrame's column names:

In [24]:
case class Entity(Id: Int, Name: String)
val integerDS = integerDF.as[Entity]
integerDS.show()

+---+----------+
| Id|      Name|
+---+----------+
|  1|     Spark|
|  2|Databricks|
+---+----------+



You can also convert a DataFrame to a Dataset of tuples, without defining a case class:

In [26]:
val dataset = integerDF.as[(Int, String)]
dataset.show()

+---+----------+
| Id|      Name|
+---+----------+
|  1|     Spark|
|  2|Databricks|
+---+----------+
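
The conversions shown so far all go towards a Dataset. The opposite direction can be sketched as follows, using `toDF()` and `.rdd` on a Dataset. The session setup is repeated so the cell stands alone; `master("local[*]")`, the application name, and the `entityDS`, `entityDF`, and `entityRDD` names are assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumption: a local session is acceptable; in Jupyter, getOrCreate()
// returns the already running session instead.
val spark = SparkSession.builder()
  .appName("DatasetConversions")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

case class Entity(Id: Int, Name: String)

val entityDS: Dataset[Entity] = Seq(Entity(1, "Spark"), Entity(2, "Databricks")).toDS()

// Dataset -> DataFrame: the static element type is dropped and
// each record becomes a generic Row with the same columns.
val entityDF: DataFrame = entityDS.toDF()

// Dataset -> RDD: the element type is preserved.
val entityRDD: RDD[Entity] = entityDS.rdd

entityDF.printSchema()
```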



# Working with Datasets

Let us consider a simple word count example. First, we prepare a Dataset from a sequence of strings. Next, we normalize and tokenize each sentence. Note that we have to use `flatMap` rather than `map` here: tokenizing a sentence yields a list of tokens, so `map` would produce a list of lists, whereas `flatMap` produces a single flat list of tokens across all sentences.

Finally, we filter out empty tokens and group by the `value` column:

In [27]:
val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am your father")).toDS()
val groupedDataset = wordsDataset.flatMap(_.toLowerCase.split(" ")).filter(_ != "").groupBy("value")
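
The difference between `map` and `flatMap` in the tokenization step can be illustrated with plain Scala collections, without Spark. This is only a sketch of the shape of the results; the `sentences`, `nested`, and `flat` names are introduced here for illustration.

```scala
val sentences = Seq("Spark I am your father", "May the spark be with you")

// map keeps one result per input sentence: a sequence of token arrays.
val nested: Seq[Array[String]] = sentences.map(_.toLowerCase.split(" "))

// flatMap concatenates the per-sentence token arrays into one flat sequence.
val flat: Seq[String] = sentences.flatMap(_.toLowerCase.split(" "))

println(nested.length) // prints 2 (one array per sentence)
println(flat.length)   // prints 11 (5 + 6 individual tokens)
```

`Dataset.flatMap` flattens in the same way, which is why the word count groups individual tokens rather than whole token lists.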

We can apply various aggregation functions to grouped data (more about this in the next section). In particular, `count` produces a column named `count` that holds the number of elements in each group:

In [28]:
val countsDataset = groupedDataset.count()
countsDataset.show()

+------+-----+
| value|count|
+------+-----+
|father|    2|
|   you|    1|
|  with|    1|
|    be|    1|
|  your|    2|
|   may|    1|
| spark|    3|
|   the|    1|
|     i|    2|
|    am|    2|
+------+-----+
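
Grouping with `groupBy("value")` refers to the column by name, and the counted result is a DataFrame. A typed alternative, sketched below, is `groupByKey`, which takes a function from element to key and yields a `Dataset[(String, Long)]` after `count()`. The session setup is repeated so the cell stands alone; `master("local[*]")`, the application name, and the `words` and `typedCounts` names are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session is acceptable; in Jupyter, getOrCreate()
// returns the already running session instead.
val spark = SparkSession.builder()
  .appName("TypedWordCount")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val words: Dataset[String] = Seq(
  "Spark I am your father",
  "May the spark be with you",
  "Spark I am your father"
).toDS()
  .flatMap(_.toLowerCase.split(" "))
  .filter(_ != "")

// The key function is plain Scala, so the grouping key and the
// resulting (word, count) pairs keep their static types.
val typedCounts: Dataset[(String, Long)] = words.groupByKey(word => word).count()

typedCounts.show()
```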



# Union, group by, join of datasets

The following example demonstrates:

 1. Taking the union of multiple datasets
 1. Performing an inner join on a condition
 1. Grouping by a specific column

The examples use only the Datasets API to demonstrate all the operations available. In practice, using DataFrames for aggregation would be simpler and faster than doing custom aggregation with `mapGroups`.

In [50]:
case class Employee(name: String, age: Int, departmentId: Int, salary: Double)

val employeeDataset1 = Seq(("Max", 22, 1, 100000.0), ("Adam", 33, 2, 93000.0), ("Eve", 35, 2, 89999.0), ("Muller", 39, 3, 120000.0)).toDF("name","age","departmentId","salary").as[Employee]
val employeeDataset2 = Seq(("John", 26, 1, 990000.0), ("Joe", 38, 3, 115000.0)).toDF("name","age","departmentId","salary").as[Employee]

val employeeDataset = employeeDataset1.union(employeeDataset2)
employeeDataset.show()

+------+---+------------+--------+
|  name|age|departmentId|  salary|
+------+---+------------+--------+
|   Max| 22|           1|100000.0|
|  Adam| 33|           2| 93000.0|
|   Eve| 35|           2| 89999.0|
|Muller| 39|           3|120000.0|
|  John| 26|           1|990000.0|
|   Joe| 38|           3|115000.0|
+------+---+------------+--------+
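
One detail worth keeping in mind with `union` is that it matches columns by position, not by name. The two datasets above share the same column order, so this does not matter there. The sketch below uses two small hypothetical DataFrames with swapped column order to show the difference from `unionByName` (available, to my knowledge, from Spark 2.3). The session setup is repeated so the cell stands alone; `master("local[*]")`, the application name, and all variable names are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable; in Jupyter, getOrCreate()
// returns the already running session instead.
val spark = SparkSession.builder()
  .appName("UnionByPositionVsName")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: the same two columns, declared in a different order.
val first = Seq(("Max", "Engineering")).toDF("name", "department")
val second = Seq(("Marketing", "Adam")).toDF("department", "name")

// union resolves columns by position, so "Marketing" ends up under `name`.
val positional = first.union(second)

// unionByName resolves columns by name, so "Adam" ends up under `name`.
val byName = first.unionByName(second)

positional.show()
byName.show()
```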



In [49]:
case class Department(id: Int, name: String)
case class Record(name: String, age: Int, salary: Double, departmentId: Int, departmentName: String)
case class ResultSet(departmentId: Int, departmentName: String, avgSalary: Double)

val departmentDataSet = Seq((1, "Engineering"), (2, "Marketing"), (3, "Sales")).toDF("id","name").as[Department]
departmentDataSet.show()

+---+-----------+
| id|       name|
+---+-----------+
|  1|Engineering|
|  2|  Marketing|
|  3|      Sales|
+---+-----------+



In [66]:
val exampleJoin = employeeDataset.joinWith(departmentDataSet, $"departmentId" === $"id", "inner")

In [67]:
exampleJoin.show()

+--------------------+---------------+
|                  _1|             _2|
+--------------------+---------------+
| [Max,22,1,100000.0]|[1,Engineering]|
| [Adam,33,2,93000.0]|  [2,Marketing]|
|  [Eve,35,2,89999.0]|  [2,Marketing]|
|[Muller,39,3,1200...|      [3,Sales]|
|[John,26,1,990000.0]|[1,Engineering]|
| [Joe,38,3,115000.0]|      [3,Sales]|
+--------------------+---------------+
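
For comparison with the Datasets-only approach, the DataFrame-style aggregation mentioned at the start of this section might look like the sketch below, which computes the average salary per department with `groupBy` and `avg`. It is not taken from the original notebook. The data is repeated so the cell stands alone; `master("local[*]")`, the application name, the `departmentName` column, and the `employees`, `departments`, and `avgSalaries` names are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumption: a local session is acceptable; in Jupyter, getOrCreate()
// returns the already running session instead.
val spark = SparkSession.builder()
  .appName("DataFrameAggregation")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val employees = Seq(
  ("Max", 22, 1, 100000.0), ("Adam", 33, 2, 93000.0), ("Eve", 35, 2, 89999.0),
  ("Muller", 39, 3, 120000.0), ("John", 26, 1, 990000.0), ("Joe", 38, 3, 115000.0)
).toDF("name", "age", "departmentId", "salary")

// The department name column is called departmentName here so that it
// does not clash with the employees' `name` column after the join.
val departments = Seq((1, "Engineering"), (2, "Marketing"), (3, "Sales"))
  .toDF("id", "departmentName")

val avgSalaries = employees
  .join(departments, $"departmentId" === $"id", "inner")
  .groupBy($"departmentId", $"departmentName")
  .agg(avg($"salary").as("avgSalary"))
  .orderBy($"departmentId")

avgSalaries.show()
```

Because the aggregation is expressed with built-in column functions, Spark can plan and optimize it, which is not possible for an arbitrary function passed to `mapGroups`.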

