<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Aggregation

1. Grouping data
1. Grouped data methods
1. Built-in aggregate functions

##### Methods
- DataFrame (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Scala</a>): `groupBy`
- Grouped Data (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=groupeddata#pyspark.sql.GroupedData" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/RelationalGroupedDataset.html" target="_blank">Scala</a>): `agg`, `avg`, `count`, `max`, `sum`
- Built-In Functions (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#module-pyspark.sql.functions" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html" target="_blank">Scala</a>): `approx_count_distinct`, `avg`, `sum`

In [0]:
%run ./Includes/Classroom-Setup

Let's use the BedBricks events dataset.

In [0]:
df = spark.read.parquet(eventsPath)
display(df)

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Grouping data
Use the DataFrame `groupBy` method to create a grouped data object.

This grouped data object is called `RelationalGroupedDataset` in Scala and `GroupedData` in Python. It is not a DataFrame: call an aggregate method on it to get a DataFrame back.

In [0]:
df.groupBy("event_name")

In [0]:
df.groupBy("geo.state", "geo.city")

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Grouped data methods
The grouped data object provides aggregate methods such as `count`, `avg`, and `sum`. Each returns a DataFrame with one row per group.

In [0]:
eventCountsDF = df.groupBy("event_name").count()
display(eventCountsDF)

In [0]:
avgStatePurchasesDF = df.groupBy("geo.state").avg("ecommerce.purchase_revenue_in_usd")
display(avgStatePurchasesDF)

In [0]:
cityPurchaseQuantitiesDF = df.groupBy("geo.state", "geo.city").sum("ecommerce.total_item_quantity")
display(cityPurchaseQuantitiesDF)
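
To make the semantics of these methods concrete, here is a minimal plain-Python sketch (not Spark code) of what `count`, `avg`, and `sum` compute for each group. The sample rows and their values are invented for illustration and do not come from the BedBricks data.

```python
from collections import defaultdict
from math import fsum
from statistics import mean

# Hypothetical sample rows: (state, purchase revenue in USD)
rows = [("CA", 100.0), ("CA", 300.0), ("NY", 50.0)]

# "groupBy": collect the values that share a key
groups = defaultdict(list)
for state, revenue in rows:
    groups[state].append(revenue)

# "count", "avg", "sum": reduce each group to a single value
counts = {state: len(values) for state, values in groups.items()}
averages = {state: mean(values) for state, values in groups.items()}
totals = {state: fsum(values) for state, values in groups.items()}

print(counts)
print(averages)
print(totals)
```

Each result has one entry per distinct key, which mirrors the one-row-per-group shape of the DataFrames displayed above.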

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Built-in aggregate functions
Use the grouped data method `agg` to apply built-in aggregate functions.

Because these functions return `Column` expressions, you can apply further column transformations to the results, such as `alias`.

In [0]:
# Note: this import shadows Python's built-in sum for the rest of the notebook
from pyspark.sql.functions import sum

statePurchasesDF = df.groupBy("geo.state").agg(sum("ecommerce.total_item_quantity").alias("total_purchases"))
display(statePurchasesDF)

Pass several aggregate expressions to `agg` to apply multiple aggregate functions to the grouped data in one call.

In [0]:
from pyspark.sql.functions import avg, approx_count_distinct

stateAggregatesDF = df.groupBy("geo.state").agg(
  avg("ecommerce.total_item_quantity").alias("avg_quantity"),
  approx_count_distinct("user_id").alias("distinct_users"))

display(stateAggregatesDF)
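
As a rough mental model of computing several aggregates per group at once, the plain-Python sketch below (not Spark code) keeps one accumulator per state and updates it in a single pass. The rows are invented for illustration. One deliberate difference: the sketch counts distinct users exactly with a set, whereas `approx_count_distinct` returns an estimate whose accuracy is governed by its optional `rsd` (maximum relative standard deviation) argument.

```python
from collections import defaultdict

# Hypothetical sample rows: (state, item quantity, user id)
rows = [
    ("CA", 2, "u1"),
    ("CA", 4, "u1"),
    ("CA", 3, "u2"),
    ("NY", 1, "u3"),
]

# One accumulator per group, updated in a single pass over the rows
acc = defaultdict(lambda: {"quantity_total": 0, "row_count": 0, "users": set()})
for state, quantity, user_id in rows:
    a = acc[state]
    a["quantity_total"] += quantity
    a["row_count"] += 1
    a["users"].add(user_id)

# Finalize both aggregates for every group
state_aggregates = {
    state: {
        "avg_quantity": a["quantity_total"] / a["row_count"],
        "distinct_users": len(a["users"]),  # exact count, unlike approx_count_distinct
    }
    for state, a in acc.items()
}

print(state_aggregates)
```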

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Revenue by Traffic Lab
Get the three traffic sources generating the highest total revenue.
1. Aggregate revenue by traffic source
2. Get top three traffic sources by total revenue
3. Limit revenue columns to two decimal places

##### Methods
- <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">DataFrame</a>: `groupBy`, `sort`, `limit`
- <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html" target="_blank">Column</a>: `alias`, `desc`, `cast`, operators
- <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html" target="_blank">Built-in Functions</a>: `avg`, `sum`

### Setup
Run the cell below to create the starting DataFrame **`df`**.

In [0]:
from pyspark.sql.functions import col

# purchase events logged on the BedBricks website
df = (spark.read.parquet(eventsPath)
  .withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
  .filter(col("revenue").isNotNull())
  .drop("event_name"))

display(df)

### 1. Aggregate revenue by traffic source
- Group by **`traffic_source`**
- Get sum of **`revenue`** as **`total_rev`**
- Get average of **`revenue`** as **`avg_rev`**

Remember to import any necessary built-in functions.

In [0]:
# TODO

trafficDF = (df.FILL_IN
)

display(trafficDF)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
# Note: this import shadows Python's built-in round for the rest of the notebook
from pyspark.sql.functions import round
expected1 = [(12704560.0, 1083.175), (78800000.3, 983.2915), (24797837.0, 1076.6221), (47218429.0, 1086.8303), (16177893.0, 1083.4378), (8044326.0, 1087.218)]
testDF = trafficDF.sort("traffic_source").select(round("total_rev", 4).alias("total_rev"), round("avg_rev", 4).alias("avg_rev"))
result1 = [(row.total_rev, row.avg_rev) for row in testDF.collect()]

assert(expected1 == result1)

### 2. Get top three traffic sources by total revenue
- Sort by **`total_rev`** in descending order
- Limit to first three rows
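
The idea behind "sort descending, then limit" can be sketched in plain Python (not Spark code). The source names and totals below are hypothetical and are not the lab's data.

```python
# Hypothetical (traffic_source, total_rev) pairs
totals = [
    ("email", 50.0),
    ("direct", 120.0),
    ("search", 80.0),
    ("social", 20.0),
]

# Sort by total revenue, largest first, then keep the first three entries
top_three = sorted(totals, key=lambda pair: pair[1], reverse=True)[:3]

print(top_three)
```

The order of the two operations matters: limiting before sorting would keep three arbitrary rows rather than the three largest.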

In [0]:
# TODO
topTrafficDF = (trafficDF.FILL_IN
)
display(topTrafficDF)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
expected2 = [(78800000.3, 983.2915), (47218429.0, 1086.8303), (24797837.0, 1076.6221)]
testDF = topTrafficDF.select(round("total_rev", 4).alias("total_rev"), round("avg_rev", 4).alias("avg_rev"))
result2 = [(row.total_rev, row.avg_rev) for row in testDF.collect()]

assert(expected2 == result2)

### 3. Limit revenue columns to two decimal places
- Modify columns **`avg_rev`** and **`total_rev`** to contain numbers with two decimal places
  - Use **`withColumn()`** with the same names to replace these columns
  - To limit to two decimal places, multiply each column by 100, cast to long, and then divide by 100 (this truncates the extra digits rather than rounding them)
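
The arithmetic of the multiply, cast, and divide recipe can be checked in plain Python (not Spark code). In this sketch `int()` stands in for the cast to long, `truncate_2dp` is a hypothetical helper name, and the sample value is invented to show how truncating differs from rounding.

```python
import builtins  # the check cells import pyspark's round, which shadows the built-in


def truncate_2dp(value: float) -> float:
    """Multiply by 100, drop the fractional part, then divide by 100."""
    return int(value * 100) / 100


sample = 1.239  # hypothetical value chosen so the two results differ

truncated = truncate_2dp(sample)      # the extra digit is dropped
rounded = builtins.round(sample, 2)   # the extra digit rounds the result up

print(truncated, rounded)
```

A value can therefore come out slightly smaller when truncated than when rounded, which is why the expected results for this step and for the bonus step are not identical.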

In [0]:
# TODO
finalDF = (topTrafficDF.FILL_IN
)

display(finalDF)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
expected3 = [(78800000.29, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result3 = [(row.total_rev, row.avg_rev) for row in finalDF.collect()]

assert(expected3 == result3)

### 4. Bonus: Rewrite using a built-in math function
Find a built-in math function that rounds to a specified number of decimal places, and use it instead of the multiply, cast, and divide approach from step 3.

In [0]:
# TODO
bonusDF = (topTrafficDF.FILL_IN
)

display(bonusDF)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
expected4 = [(78800000.3, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result4 = [(row.total_rev, row.avg_rev) for row in bonusDF.collect()]

assert(expected4 == result4)

### 5. Chain all the steps above

In [0]:
# TODO
chainDF = (df.FILL_IN
)

display(chainDF)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
expected5 = [(78800000.3, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result5 = [(row.total_rev, row.avg_rev) for row in chainDF.collect()]

assert(expected5 == result5)

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
