# Snowpark GroupBy Transformations

The Snowpark API provides the `RelationalGroupedDataFrame` class to represent an underlying DataFrame whose rows are grouped by common values.

You use a `RelationalGroupedDataFrame` object to define aggregations over those groups; applying an aggregation returns a regular DataFrame.
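
To make the group-then-aggregate idea concrete before touching Snowflake, here is a minimal sketch that uses plain Scala collections rather than Snowpark. The `Flight` case class and the sample rows are hypothetical stand-ins for a table; the sketch only illustrates the semantics, not how Snowpark executes the query.

```scala
// Hypothetical in-memory rows standing in for a table:
// (carrier, quarter, arrival delay in minutes).
final case class Flight(carrier: String, quarter: Int, arrDelay: Int)

val flights = Seq(
  Flight("AS", 1, 10),
  Flight("AS", 1, -4),
  Flight("AS", 2, 30),
  Flight("UA", 1, 7),
  Flight("UA", 1, 13)
)

// Step 1: group rows that share the same (carrier, quarter) key.
val grouped: Map[(String, Int), Seq[Flight]] =
  flights.groupBy(f => (f.carrier, f.quarter))

// Step 2: reduce each group to one summary row: (max, min, mean) of the delay.
val summary: Map[(String, Int), (Int, Int, Double)] =
  grouped.map { case (key, rows) =>
    val delays = rows.map(_.arrDelay)
    (key, (delays.max, delays.min, delays.sum.toDouble / delays.size))
  }

// Print one line per group, ordered by key.
summary.toSeq.sortBy(_._1).foreach(println)
```

Step 1 corresponds to what a grouped DataFrame represents, and step 2 corresponds to an aggregation that produces one output row per group.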

---


### Create a Session
---

Create a Snowpark Session by passing in the connection properties file created in the [first lab exercise](../A-Dataframes/01-Sessions.ipynb).

In [None]:
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.types._

// Build the path to the connection properties file from the current working directory
val pwd = sys.env.getOrElse("PWD", "")
val filename = s"$pwd/de_snowpark/connect.properties"

// Create the Snowpark session from the properties file
val session = Session.builder.configFile(filename).create


---
### Create DataFrame from a Table

Create a DataFrame as follows:

* Use the ONTIME_REPORTING table
* Return the rows with YEAR = 2019, ORIGIN = SEA, and DEST = SFO
* Select the YEAR, QUARTER, ARR_DELAY, DEP_DELAY, FL_DATE, ORIGIN, DEST, OP_CARRIER, and TAIL_NUM columns
* Retrieve the schema (the column definitions) of the resulting DataFrame


In [None]:
// Select the columns of interest, then keep only 2019 flights from SEA to SFO
val SEAtoSFO2019DF = session.table("raw.ONTIME_REPORTING")
  .select(col("YEAR"),
          col("QUARTER"),
          col("ARR_DELAY"),
          col("DEP_DELAY"),
          col("FL_DATE"),
          col("ORIGIN"),
          col("DEST"),
          col("OP_CARRIER"),
          col("TAIL_NUM"))
  .filter(col("YEAR") === 2019 &&
          col("ORIGIN") === "SEA" &&
          col("DEST") === "SFO")

SEAtoSFO2019DF.schema

Examine the results. Notice that the ARR_DELAY and DEP_DELAY columns are of the type `String`.  

Create a DataFrame to transform the ARR_DELAY and DEP_DELAY columns to the `Integer` type.

In [None]:
// Same query as before, but cast the two delay columns from String to Integer
val SEAtoSFO2019DF = session.table("raw.ONTIME_REPORTING")
  .select(col("YEAR"),
          col("QUARTER"),
          col("ARR_DELAY").cast(IntegerType) as "ARR_DELAY",
          col("DEP_DELAY").cast(IntegerType) as "DEP_DELAY",
          col("FL_DATE"),
          col("ORIGIN"),
          col("DEST"),
          col("OP_CARRIER"),
          col("TAIL_NUM"))
  .filter(col("YEAR") === 2019 &&
          col("ORIGIN") === "SEA" &&
          col("DEST") === "SFO")

SEAtoSFO2019DF.schema


---
## Group the Rows in a DataFrame

Create a `RelationalGroupedDataFrame` by calling the `groupBy` method to group the rows by OP_CARRIER and QUARTER.

In [None]:
val carrierGroupRoutes2019Q1DF = SEAtoSFO2019DF.groupBy("OP_CARRIER","QUARTER")
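
Calling `groupBy` only describes the grouping; the result becomes a regular DataFrame once an aggregation is applied to it. As a minimal sketch of that step, the cell below counts the flights in each group. It assumes the `SEAtoSFO2019DF` DataFrame and the Snowpark imports from the cells above; the name `flightsPerGroupDF` is a hypothetical choice for illustration.

```scala
// Assumes SEAtoSFO2019DF and the Snowpark imports from the cells above.
// count() on a RelationalGroupedDataFrame returns a DataFrame with one row per group.
val flightsPerGroupDF = SEAtoSFO2019DF
  .groupBy("OP_CARRIER", "QUARTER")
  .count()
  .sort(col("OP_CARRIER"), col("QUARTER"))

flightsPerGroupDF.show()
```

Sorting by the grouping columns is optional; it is included here only so that the rows for each carrier appear together, in quarter order.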

## Compute Aggregates on the Grouped Rows 

---
Create a DataFrame that computes the MAX, MIN, and MEAN of the ARR_DELAY column for each combination of carrier and quarter.

In [None]:
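// Build the aggregated DataFrame first; show() returns Unit, so call it separately
val aggArrSEAtoSFO2019Df = carrierGroupRoutes2019Q1DF.agg(max(col("ARR_DELAY")),
                                           min(col("ARR_DELAY")),
                                           mean(col("ARR_DELAY")))

aggArrSEAtoSFO2019Df.show()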
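
The aggregate columns above get generated names, which can be awkward to reference later. One possible variation, sketched below, gives each aggregate an explicit alias and sorts the result. It assumes the `SEAtoSFO2019DF` DataFrame and the Snowpark imports from the cells above; the alias names and `arrDelaySummaryDF` are illustrative choices, not part of the lab.

```scala
// Assumes SEAtoSFO2019DF and the Snowpark imports from the cells above.
// The alias names (MAX_ARR_DELAY, ...) are illustrative choices.
val arrDelaySummaryDF = SEAtoSFO2019DF
  .groupBy("OP_CARRIER", "QUARTER")
  .agg(
    max(col("ARR_DELAY")).as("MAX_ARR_DELAY"),
    min(col("ARR_DELAY")).as("MIN_ARR_DELAY"),
    mean(col("ARR_DELAY")).as("MEAN_ARR_DELAY")
  )
  .sort(col("OP_CARRIER"), col("QUARTER"))

arrDelaySummaryDF.show()
```

Keeping the DataFrame in its own `val` and calling `show()` separately means the aggregated result can still be reused in later cells.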
                                               

