# Snowpark DataFrame Transformations

To specify which columns should be selected and how the results should be filtered, sorted, grouped, and so on, call the DataFrame methods that transform the dataset. To identify columns in these methods, use the `col` function or an expression that evaluates to a column. (See [Specifying Columns and Expressions](https://docs.snowflake.com/en/developer-guide/snowpark/working-with-dataframes.html#specifying-columns-and-expressions).) To retrieve the definition of the columns in the dataset for the DataFrame, call the `schema` method.

---


### Create a Session
---

Create a Snowpark Session by passing in the connection properties file created in the [first lab exercise](../A-Dataframes/01-Sessions.ipynb).

In [None]:
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.types._

// Build the path to the connection properties file
val pwd = sys.env.getOrElse("PWD", "")
val filename = s"$pwd/de_snowpark/connect.properties"

// Create the session from the properties file
val session = Session.builder.configFile(filename).create


---
### Create DataFrame from a Table

Create a DataFrame from the data in the `ONTIME_REPORTING` table.


In [None]:
val onTimeReportingDF = session.table("raw.ONTIME_REPORTING")



---
## Retrieving Column Definitions

To retrieve the definition of the columns in the dataset for the DataFrame, call the `schema` method. This method returns a `StructType` object that contains an `Array` of `StructField` objects. Each `StructField` object contains the definition of a column.

Retrieve the column definitions for the `onTimeReportingDF` DataFrame by calling the `schema` method.

---




In [None]:
onTimeReportingDF.schema

Examine the results. Each `StructField` object contains the definition of one column in the DataFrame. Notice that the displayed `Array` is truncated.

Use `schema.names` to print each column name.

---


In [None]:
onTimeReportingDF.schema.names.foreach { println }
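
Because `schema` returns a `StructType` that wraps an `Array` of `StructField` objects, you can also iterate over the fields to see each column's data type and nullability next to its name. The following is a minimal sketch, assuming the `onTimeReportingDF` DataFrame from the earlier cell; it relies on the `fields` member of `StructType` and the `name`, `dataType`, and `nullable` members of `StructField`. The output format is only illustrative.

```scala
import com.snowflake.snowpark.types._

// Assumes onTimeReportingDF was created in an earlier cell.
// Print one line per column: name, data type, and nullability.
onTimeReportingDF.schema.fields.foreach { field =>
  println(s"${field.name}: ${field.dataType} (nullable = ${field.nullable})")
}
```

Unlike the raw `schema` output, this prints every field on its own line, so nothing is truncated by the notebook display.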

### Specify the Columns to Select

To specify the columns that should be selected, call the `select` method on a DataFrame.

Create a DataFrame that contains the following columns:
* Flight Date (FL_DATE)
* Origin Airport (ORIGIN)
* Destination Airport (DEST)
* Flight Operating Carrier (OP_CARRIER)
* Flight Tail Number (TAIL_NUM)

In [None]:
val flightRoutesDF = onTimeReportingDF.select(col("FL_DATE"),
                                              col("ORIGIN"),
                                              col("DEST"),
                                              col("OP_CARRIER"),
                                              col("TAIL_NUM"))

flightRoutesDF.show()
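
The `select` method is not limited to `col(...)` arguments. As a hedged sketch, the cell below assumes the `onTimeReportingDF` DataFrame from above and shows two variations: passing column names as plain strings, and passing an expression that evaluates to a column. The variable names and the `ARR_DELAY_INT` alias are hypothetical and chosen only for illustration.

```scala
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.types._

// Select by column name strings instead of col(...) objects.
val routesByNameDF = onTimeReportingDF.select("FL_DATE", "ORIGIN", "DEST")

// Select an expression: cast ARR_DELAY to an integer and give it an alias.
// ARR_DELAY_INT is a hypothetical alias used only in this example.
val delayDF = onTimeReportingDF.select(
  col("FL_DATE"),
  col("ARR_DELAY").cast(IntegerType).as("ARR_DELAY_INT"))

delayDF.show()
```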

### Specify the Rows to Select

To specify which rows should be returned, call the `filter` method.

Create a DataFrame that contains only the rows that meet all of the following conditions:
* Year is 2019 (YEAR)
* First Quarter (QUARTER)
* Arrival Delay is less than 0 (ARR_DELAY)


In [None]:
val earlyArrivalDF = onTimeReportingDF.filter(col("YEAR") === 2019 && 
                                              col("QUARTER") === 1 &&
                                              col("ARR_DELAY").cast(IntegerType) < 0)

In [None]:
earlyArrivalDF.count()

## Chaining Method Calls

Because the `filter` and `select` methods each return a new DataFrame with the transformation applied, you can chain method calls together to build up a transformed DataFrame in a single expression.

Create a DataFrame by chaining calls that do the following:

* Query the `ONTIME_REPORTING` table.
* Return the rows with `YEAR = 2019` and `QUARTER = 1`.
* Select the `YEAR`, `QUARTER`, `FL_DATE`, `ORIGIN`, `DEST`, `OP_CARRIER`, and `TAIL_NUM` columns.

In [None]:
val flightRoutes2019Q1DF = session.table("raw.ONTIME_REPORTING")
                           .select(col("YEAR"),
                                   col("QUARTER"),
                                   col("FL_DATE"), 
                                   col("ORIGIN"), 
                                   col("DEST"),
                                   col("OP_CARRIER"),
                                   col("TAIL_NUM"))
                           .filter(col("YEAR") === 2019 && 
                                   col("QUARTER") === 1)

flightRoutes2019Q1DF.show()

When you chain method calls, keep in mind that the order of calls is important. Each method call returns a DataFrame that has been transformed. Make sure that subsequent calls work with the transformed DataFrame.
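
To illustrate why the order matters, the sketch below contrasts two chains on the `onTimeReportingDF` DataFrame. In the first, `filter` runs before `select`, so the `YEAR` column is still available when the condition is applied. In the second, which is left commented out, `select` drops `YEAR` first, so the later `filter` refers to a column that is no longer part of the transformed DataFrame. The variable names are hypothetical, and the expectation that the second chain fails with an invalid-identifier error when the query runs is an assumption rather than something demonstrated in this lab.

```scala
import com.snowflake.snowpark.functions._

// Works: YEAR is still present when the filter is applied.
val routes2019DF = onTimeReportingDF
  .filter(col("YEAR") === 2019)
  .select(col("ORIGIN"), col("DEST"))

routes2019DF.show()

// Problematic: after select, only ORIGIN and DEST remain,
// so the filter on YEAR has no column to refer to.
// val brokenDF = onTimeReportingDF
//   .select(col("ORIGIN"), col("DEST"))
//   .filter(col("YEAR") === 2019)
```

If you need to filter on a column that you do not want in the final result, apply the `filter` first, or keep the column in the `select` list as the cells in this section do.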

---

Using the `onTimeReportingDF` DataFrame object, create a new DataFrame that does the following:

* Return the rows from the first quarter of 2019 with `ORIGIN = 'SEA'`, `DEST = 'SFO'`, and `ARR_DELAY < 0`.


In [None]:
val earlyArrivalSeaSfoDF = onTimeReportingDF
  .select(col("YEAR"),
          col("QUARTER"),
          col("ARR_DELAY"),
          col("FL_DATE"),
          col("ORIGIN"),
          col("DEST"),
          col("OP_CARRIER"),
          col("TAIL_NUM"))
  .filter(col("YEAR") === 2019 &&
          col("QUARTER") === 1 &&
          col("ARR_DELAY").cast(IntegerType) < 0 &&
          col("ORIGIN") === "SEA" &&
          col("DEST") === "SFO")

earlyArrivalSeaSfoDF.show()