## Spark DataFrame API

* Row Level Transformations or Projection of Data can be done using `select`, `selectExpr`, `withColumn`, and `drop` on a DataFrame.

* We typically apply functions from `org.apache.spark.sql.functions` to columns inside `select` and `withColumn`.

* Filtering is typically done using either `filter` or `where` on a DataFrame; `where` is an alias for `filter`.

* We can pass the condition to `filter` or `where` either in SQL Style (a string expression) or in Programming Language Style (a `Column` expression).

* Global Aggregations run over the whole DataFrame: `count` can be called directly on the DataFrame, while functions such as `min` and `max` are applied through `agg` or `select`.

* Grouping & Aggregations are typically performed by calling `groupBy` with the Keys and then passing Aggregate functions to `agg`.

* We can sort the data in DataFrame using `sort` or `orderBy`.

* We can use Window Functions for some advanced Aggregations and Ranking.
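
The operations listed above can be sketched in one place. This is a minimal, non-authoritative sketch: the `sales` data, its column names, and the `session` value are hypothetical and exist only for this illustration; it assumes Spark is on the classpath and that a local session is acceptable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, max, rank, sum}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("overview-sketch").getOrCreate()
import session.implicits._

// Hypothetical sample data, used only for this sketch.
val sales = Seq(
  ("east", "alice", 100.0),
  ("east", "bob", 250.0),
  ("west", "carol", 300.0)
).toDF("region", "rep", "amount")

// Filtering: SQL-style string condition vs. Column expression.
val bigSalesSql = sales.filter("amount > 150")
val bigSalesCol = sales.where(col("amount") > 150)

// Global aggregations over the whole DataFrame.
val rowCount = sales.count()
val maxAmount = sales.agg(max("amount")).first().getDouble(0)

// Grouped aggregation, then sorting by the aggregate.
val totals = sales.groupBy("region").agg(sum("amount").alias("total"), count("*").alias("n"))
val byRegion = totals.orderBy(col("total").desc)

// Window function: rank reps within each region by amount.
val byAmount = Window.partitionBy("region").orderBy(col("amount").desc)
val ranked = sales.withColumn("rank_in_region", rank().over(byAmount))
```

Calling `show()` on `byRegion` or `ranked` prints the results; both filter variants should select the same rows.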

In [1]:
// Create a List
val employees = List((1, "Scott", "Tiger", 1000.0, "united states"),
                     (2, "Henry", "Ford", 1250.0, "India"),
                     (3, "Nick", "Junior", 750.0, "united KINGDOM"),
                     (4, "Bill", "Gomes", 1500.0, "AUSTRALIA"))

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.138:4043
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1670442615859)
SparkSession available as 'spark'


employees: List[(Int, String, String, Double, String)] = List((1,Scott,Tiger,1000.0,united states), (2,Henry,Ford,1250.0,India), (3,Nick,Junior,750.0,united KINGDOM), (4,Bill,Gomes,1500.0,AUSTRALIA))


#### **Transform List to DataFrame**

In [2]:
val employeesDF = employees.toDF("employee_id", 
                                 "first_name", 
                                 "last_name", 
                                 "salary", 
                                 "nationality"
                                )

employeesDF: org.apache.spark.sql.DataFrame = [employee_id: int, first_name: string ... 3 more fields]
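
`toDF` on a local collection is provided by the implicits of a `SparkSession`. `spark-shell` imports them automatically, and the notebook session above evidently has them in scope as well. In a standalone application the import has to be written out. A minimal sketch, assuming a local session; `session`, `people`, and `peopleDF` are hypothetical names:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("to-df-sketch").getOrCreate()

// Brings toDF into scope for local collections of tuples or case classes.
import session.implicits._

val people = List((1, "Scott", 1000.0), (2, "Henry", 1250.0))
val peopleDF = people.toDF("employee_id", "first_name", "salary")
val columnNames = peopleDF.columns.toList
```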


In [3]:
employeesDF.printSchema

root
 |-- employee_id: integer (nullable = false)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = false)
 |-- nationality: string (nullable = true)



In [4]:
employeesDF.show(5, truncate=false)

+-----------+----------+---------+------+--------------+
|employee_id|first_name|last_name|salary|nationality   |
+-----------+----------+---------+------+--------------+
|1          |Scott     |Tiger    |1000.0|united states |
|2          |Henry     |Ford     |1250.0|India         |
|3          |Nick      |Junior   |750.0 |united KINGDOM|
|4          |Bill      |Gomes    |1500.0|AUSTRALIA     |
+-----------+----------+---------+------+--------------+



#### **Select Specific Columns From DataFrame**

* Project employee first name and last name

In [5]:
employeesDF.select("first_name", "last_name").show(truncate=false)

+----------+---------+
|first_name|last_name|
+----------+---------+
|Scott     |Tiger    |
|Henry     |Ford     |
|Nick      |Junior   |
|Bill      |Gomes    |
+----------+---------+
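
`select` also accepts `Column` expressions, and `selectExpr` accepts the same projection written as SQL expression strings. The sketch below is an illustration under assumptions: it rebuilds two rows of the employee data so that it runs on its own, and the `full_name` column and the names `session`, `sample`, `viaSelect`, and `viaSelectExpr` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, upper}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import session.implicits._

// Two rows copied from the employees list, without the salary field.
val rows = List((1, "Scott", "Tiger", "united states"), (3, "Nick", "Junior", "united KINGDOM"))
val sample = rows.toDF("employee_id", "first_name", "last_name", "nationality")

// Column style: build expressions with functions from org.apache.spark.sql.functions.
val viaSelect = sample.select(
  col("employee_id"),
  concat(col("first_name"), lit(" "), col("last_name")).alias("full_name"),
  upper(col("nationality")).alias("nationality")
)

// SQL style: the same projection as SQL expression strings.
val viaSelectExpr = sample.selectExpr(
  "employee_id",
  "concat(first_name, ' ', last_name) AS full_name",
  "upper(nationality) AS nationality"
)
```

Both variants should produce the same rows; which one to use is mostly a matter of whether the expression is easier to read as SQL or as Scala.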



#### **Select All Columns except Some**

* Project all the fields except for `nationality`

In [6]:
employeesDF.drop("nationality").show(truncate=false)

+-----------+----------+---------+------+
|employee_id|first_name|last_name|salary|
+-----------+----------+---------+------+
|1          |Scott     |Tiger    |1000.0|
|2          |Henry     |Ford     |1250.0|
|3          |Nick      |Junior   |750.0 |
|4          |Bill      |Gomes    |1500.0|
+-----------+----------+---------+------+
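
`drop` also takes several column names at once, and it leaves the DataFrame unchanged when a named column is not in the schema. `withColumn` goes the other way and adds a derived column. A minimal sketch under assumptions: two rows of the employee data are rebuilt so the block runs on its own, and the `bonus` column, its 10% rate, and the names `session`, `sample`, `namesOnly`, `unchanged`, and `withBonus` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("drop-sketch").getOrCreate()
import session.implicits._

// Two rows copied from the employees list.
val rows = List((1, "Scott", "Tiger", 1000.0, "united states"), (2, "Henry", "Ford", 1250.0, "India"))
val sample = rows.toDF("employee_id", "first_name", "last_name", "salary", "nationality")

// drop accepts several column names at once.
val namesOnly = sample.drop("salary", "nationality")

// Dropping a column that is not in the schema is a no-op.
val unchanged = sample.drop("no_such_column")

// withColumn appends a derived column (or replaces an existing one of the same name).
val withBonus = sample.withColumn("bonus", col("salary") * 0.1)
```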

