## Spark Data Projection

* Projection, that is, choosing which columns a DataFrame returns, is done with `select`, `selectExpr`, or `drop`.

* Functions from `org.apache.spark.sql.functions` can be applied to columns either programmatically, by passing `Column` expressions to `select` in Scala, or in SQL style, by passing expression strings to `selectExpr`.

In [1]:
// Create a List
val employees = List((1, "Scott", "Tiger", 1000.0, "united states"),
                     (2, "Henry", "Ford", 1250.0, "India"),
                     (3, "Nick", "Junior", 750.0, "united KINGDOM"),
                     (4, "Bill", "Gomes", 1500.0, "AUSTRALIA"))

SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1676925184077)
SparkSession available as 'spark'


employees: List[(Int, String, String, Double, String)] = List((1,Scott,Tiger,1000.0,united states), (2,Henry,Ford,1250.0,India), (3,Nick,Junior,750.0,united KINGDOM), (4,Bill,Gomes,1500.0,AUSTRALIA))


#### **Transform List to DataFrame**

In [2]:
// toDF on a local collection relies on spark.implicits._ (spark-shell imports it automatically)
val employeesDF = employees.toDF("employee_id", 
                                 "first_name", 
                                 "last_name", 
                                 "salary", 
                                 "nationality")

employeesDF: org.apache.spark.sql.DataFrame = [employee_id: int, first_name: string ... 3 more fields]


In [3]:
employeesDF.printSchema()

root
 |-- employee_id: integer (nullable = false)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = false)
 |-- nationality: string (nullable = true)



In [4]:
employeesDF.show(5, truncate=false)

+-----------+----------+---------+------+--------------+
|employee_id|first_name|last_name|salary|nationality   |
+-----------+----------+---------+------+--------------+
|1          |Scott     |Tiger    |1000.0|united states |
|2          |Henry     |Ford     |1250.0|India         |
|3          |Nick      |Junior   |750.0 |united KINGDOM|
|4          |Bill      |Gomes    |1500.0|AUSTRALIA     |
+-----------+----------+---------+------+--------------+



#### **Select Specific Columns From DataFrame**

* Project employee first name and last name

In [5]:
// using "select" with col() from org.apache.spark.sql.functions (import it if the session has not already done so)

employeesDF.select(col("first_name"),
                   col("last_name")).show(truncate=false)

+----------+---------+
|first_name|last_name|
+----------+---------+
|Scott     |Tiger    |
|Henry     |Ford     |
|Nick      |Junior   |
|Bill      |Gomes    |
+----------+---------+



In [6]:
// using "select" with the $"name" column syntax, which relies on spark.implicits._

employeesDF.select($"first_name",
                   $"last_name").show(truncate=false)

+----------+---------+
|first_name|last_name|
+----------+---------+
|Scott     |Tiger    |
|Henry     |Ford     |
|Nick      |Junior   |
|Bill      |Gomes    |
+----------+---------+
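
Besides `col("...")` and `$"..."`, `select` also accepts plain column-name strings, and a column can be looked up on the DataFrame itself. The following is a minimal sketch that assumes the `employeesDF` defined earlier in this notebook; the names `byName` and `byApply` are only illustrative.

```scala
// Assumes employeesDF from the earlier cell is in scope.

// Plain column-name strings
val byName = employeesDF.select("first_name", "last_name")

// Columns resolved against this specific DataFrame
val byApply = employeesDF.select(employeesDF("first_name"),
                                 employeesDF("last_name"))

byName.show(truncate = false)
byApply.show(truncate = false)
```

Strings and `Column` objects cannot be mixed in a single `select` call, because they belong to two separate overloads.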



In [7]:
// using "selectExpr"; each SQL expression string can also rename a column with AS

employeesDF.selectExpr("first_name AS fn",
                       "last_name AS ln").show(truncate=false)

+-----+------+
|fn   |ln    |
+-----+------+
|Scott|Tiger |
|Henry|Ford  |
|Nick |Junior|
|Bill |Gomes |
+-----+------+
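
The introduction mentions applying functions from `org.apache.spark.sql.functions` in either programmatic or SQL style. The sketch below shows one projection written both ways, assuming the `employeesDF` defined earlier in this notebook. The output column name `full_name` and the value names `viaSelect` and `viaSelectExpr` are illustrative choices, not part of the original notebook.

```scala
import org.apache.spark.sql.functions.{col, concat, lit, upper}

// Programmatic style: build Column expressions and pass them to select
val viaSelect = employeesDF.select(
  col("employee_id"),
  concat(col("first_name"), lit(" "), col("last_name")).alias("full_name"),
  upper(col("nationality")).alias("nationality")
)

// SQL style: the same projection written as SQL expression strings
val viaSelectExpr = employeesDF.selectExpr(
  "employee_id",
  "concat(first_name, ' ', last_name) AS full_name",
  "upper(nationality) AS nationality"
)

viaSelect.show(truncate = false)
viaSelectExpr.show(truncate = false)
```

Both calls should yield the same three columns. The programmatic form is checked by the Scala compiler for function names, while the SQL strings are only parsed when the query is analyzed.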



#### **Select All Columns except Some**

* Project every column except `nationality`

In [8]:
// Using "drop"

employeesDF.drop("nationality").show(truncate=false)

+-----------+----------+---------+------+
|employee_id|first_name|last_name|salary|
+-----------+----------+---------+------+
|1          |Scott     |Tiger    |1000.0|
|2          |Henry     |Ford     |1250.0|
|3          |Nick      |Junior   |750.0 |
|4          |Bill      |Gomes    |1500.0|
+-----------+----------+---------+------+
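
`drop` also accepts several column names at once, and the same result can be reached with `select` by filtering the DataFrame's column list. This is a minimal sketch assuming the `employeesDF` defined earlier in this notebook; `excluded`, `withoutPay`, and `keptViaSelect` are illustrative names.

```scala
import org.apache.spark.sql.functions.col

// Columns to leave out of the projection
val excluded = Set("salary", "nationality")

// drop takes several names; names missing from the schema are ignored
val withoutPay = employeesDF.drop(excluded.toSeq: _*)

// Equivalent projection: keep every column whose name is not excluded
val keptViaSelect = employeesDF.select(
  employeesDF.columns.filterNot(excluded.contains).map(col): _*
)

withoutPay.show(truncate = false)
keptViaSelect.show(truncate = false)
```

The `select` variant is handy when the list of columns to keep is computed, for example from a naming pattern, while `drop` reads more directly when only a few known columns need to go.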

