



# Revenue by Traffic Lab
Find the three traffic sources that generate the highest total revenue.
1. Aggregate revenue by traffic source
2. Get the top three traffic sources by total revenue
3. Limit revenue columns to two decimal places

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html" target="_blank">DataFrame</a>: **`groupBy`**, **`sort`**, **`limit`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html" target="_blank">Column</a>: **`alias`**, **`desc`**, **`cast`**, arithmetic operators
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html" target="_blank">Built-in Functions</a>: **`avg`**, **`sum`**

In [0]:
%run ../Includes/Classroom-Setup

Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "labuser6023680_cng7_da_asp" in the catalog "hive_metastore"...(1 seconds)

Predefined tables in "labuser6023680_cng7_da_asp":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03
| DA.paths.checkpoints: dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks/_checkpoints

Setup completed (5 seconds)




### Setup
Run the cell below to create the starting DataFrame **`df`**.

In [0]:
from pyspark.sql.functions import col

# Purchase events logged on the BedBricks website
df = (spark.read.format("delta").load(DA.paths.events)
      .withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
      .filter(col("revenue").isNotNull())
      .drop("event_name")
     )

display(df)

device,ecommerce,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Chrome OS,"List(595.0, 1, 1)",1593611100709726,1593611164590787,"List(Laredo, TX)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593606775008006,UA000000106493130,595.0
Windows,"List(595.0, 1, 1)",1593616541455837,1593616746268903,"List(Rowlett, TX)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",email,1593611153452789,UA000000106511039,595.0
Windows,"List(1195.0, 1, 1)",1593622510420631,1593622624564395,"List(Chino, CA)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593617895347938,UA000000106546589,1195.0
macOS,"List(850.5, 1, 1)",1593843139065128,1593843942849799,"List(Santa Barbara, CA)","List(List(NEWBED10, M_STAN_F, Standard Full Mattress, 850.5, 945.0, 1))",email,1593615761401281,UA000000106534551,850.5
Windows,"List(2240.0, 2, 2)",1593607132024445,1593607724527371,"List(Milwaukee, WI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1), List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593603943421455,UA000000106484108,2240.0
Chrome OS,"List(1195.0, 1, 1)",1593613298187795,1593614265394887,"List(Winston-Salem, NC)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593612666602870,UA000000106518194,1195.0
macOS,"List(1045.0, 1, 1)",1593615168536877,1593615321092049,"List(California City, CA)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",google,1593613976508065,UA000000106524918,1045.0
macOS,"List(1795.0, 1, 1)",1593612402314002,1593612726209589,"List(Bayonne, NJ)","List(List(null, M_PREM_Q, Premium Queen Mattress, 1795.0, 1795.0, 1))",direct,1593605699394476,UA000000106489440,1795.0
Android,"List(1045.0, 1, 1)",1593617613139576,1593617721738177,"List(Portland, OR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",google,1593613033791690,UA000000106519999,1045.0
Windows,"List(1795.0, 1, 1)",1593622944339060,1593623248104571,"List(Corpus Christi, TX)","List(List(null, M_PREM_Q, Premium Queen Mattress, 1795.0, 1795.0, 1))",facebook,1593616330759799,UA000000106537762,1795.0




### 1. Aggregate revenue by traffic source
- Group by **`traffic_source`**
- Get sum of **`revenue`** as **`total_rev`**
- Get average of **`revenue`** as **`avg_rev`**
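
As a rough mental model of what this grouping computes, the plain-Python sketch below groups the ten sample rows shown in the Setup output by traffic source and derives a total and an average for each group. It only illustrates the semantics and is not Spark code; `sample_rows` and `aggregate_by_source` are hypothetical names introduced here.

```python
from collections import defaultdict

# (traffic_source, revenue) pairs copied from the ten sample rows displayed in Setup
sample_rows = [
    ("google", 595.0), ("email", 595.0), ("google", 1195.0), ("email", 850.5),
    ("google", 2240.0), ("google", 1195.0), ("google", 1045.0), ("direct", 1795.0),
    ("google", 1045.0), ("facebook", 1795.0),
]

def aggregate_by_source(rows):
    """Return {source: (total_rev, avg_rev)}, mirroring groupBy + sum + avg."""
    groups = defaultdict(list)
    for source, revenue in rows:
        groups[source].append(revenue)
    return {
        source: (sum(values), sum(values) / len(values))
        for source, values in groups.items()
    }

summary = aggregate_by_source(sample_rows)
for source, (total_rev, avg_rev) in sorted(summary.items()):
    print(source, total_rev, avg_rev)
```

These figures cover only the ten displayed rows, so they differ from the totals computed over the full events table.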

Remember to import any necessary built-in functions.
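
One thing to keep in mind: importing a function named **`sum`** (or **`round`**) into the notebook rebinds that name, so Python's builtin of the same name is no longer what the bare name refers to. The sketch below imitates this with a hypothetical stand-in function rather than the real PySpark import, so it runs without Spark. A common way to sidestep the clash is to import the module under an alias such as `import pyspark.sql.functions as F`.

```python
import builtins

def sum(column_name):
    """Hypothetical stand-in for pyspark.sql.functions.sum; it only builds a label."""
    return f"sum({column_name})"

# After the definition (or an import of the same name), the bare name is rebound
shadowed_result = sum("revenue")

# The Python builtin stays reachable through the builtins module
python_total = builtins.sum([595.0, 1195.0])

print(shadowed_result)
print(python_total)
```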

In [0]:
# TODO

from pyspark.sql.functions import avg, col, sum

traffic_df = (df
              .groupBy("traffic_source")
              .agg(sum(col("revenue")).alias("total_rev"),
                   avg(col("revenue")).alias("avg_rev"))
             )

display(traffic_df)

traffic_source,total_rev,avg_rev
instagram,16177893.0,1083.437784623627
direct,12704560.0,1083.175036234973
youtube,8044326.0,1087.2180024327613
email,78800000.29999994,983.2915347084434
facebook,24797837.0,1076.6221074111058
google,47218429.0,1086.830295078949
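
The email total above ends in `.29999994` rather than `.3`. A likely explanation is ordinary binary floating-point accumulation: many decimal fractions have no exact double representation, and the small errors add up over a long sum. The sketch below reproduces the effect in plain Python with illustrative values that are not taken from the events data. It also suggests why the check cells in this lab round to four decimal places before comparing.

```python
import math
from decimal import Decimal

values = [0.1] * 10  # illustrative values, not taken from the events data

naive_total = sum(values)        # plain left-to-right float addition
exact_total = math.fsum(values)  # tracks partial sums to avoid precision loss
decimal_total = sum(Decimal("0.1") for _ in range(10))  # base-10 arithmetic

print(naive_total)
print(exact_total)
print(decimal_total)

# Rounding before comparing hides the accumulated error
print(round(naive_total, 4) == 1.0)
```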





**1.1: CHECK YOUR WORK**

In [0]:
from pyspark.sql.functions import round

expected1 = [(12704560.0, 1083.175), (78800000.3, 983.2915), (24797837.0, 1076.6221), (47218429.0, 1086.8303), (16177893.0, 1083.4378), (8044326.0, 1087.218)]
test_df = traffic_df.sort("traffic_source").select(round("total_rev", 4).alias("total_rev"), round("avg_rev", 4).alias("avg_rev"))
result1 = [(row.total_rev, row.avg_rev) for row in test_df.collect()]

assert expected1 == result1
print("All tests pass")

All tests pass




### 2. Get top three traffic sources by total revenue
- Sort by **`total_rev`** in descending order
- Limit to first three rows
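
The two bullets above amount to "order by total, largest first, then keep three". A minimal plain-Python sketch of the same idea, using the totals from the step 1 output (the `totals` and `top_three` names are hypothetical, and this is not Spark code):

```python
# (traffic_source, total_rev) pairs copied from the step 1 output
totals = [
    ("instagram", 16177893.0), ("direct", 12704560.0), ("youtube", 8044326.0),
    ("email", 78800000.29999994), ("facebook", 24797837.0), ("google", 47218429.0),
]

# Sort descending by total revenue, then keep the first three entries
top_three = sorted(totals, key=lambda pair: pair[1], reverse=True)[:3]

print([source for source, _ in top_three])
```

The order of the two operations matters: limiting before sorting would keep three arbitrary rows rather than the three largest.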

In [0]:
# TODO
top_traffic_df = traffic_df.sort(col("total_rev").desc()).limit(3)
display(top_traffic_df)

traffic_source,total_rev,avg_rev
email,78800000.29999994,983.2915347084434
google,47218429.0,1086.830295078949
facebook,24797837.0,1076.6221074111058





**2.1: CHECK YOUR WORK**

In [0]:
expected2 = [(78800000.3, 983.2915), (47218429.0, 1086.8303), (24797837.0, 1076.6221)]
test_df = top_traffic_df.select(round("total_rev", 4).alias("total_rev"), round("avg_rev", 4).alias("avg_rev"))
result2 = [(row.total_rev, row.avg_rev) for row in test_df.collect()]

assert expected2 == result2
print("All tests pass")

All tests pass




### 3. Limit revenue columns to two decimal places
- Modify columns **`avg_rev`** and **`total_rev`** to contain numbers with two decimal places
  - Use **`withColumn()`** with the same names to replace these columns
  - To truncate to two decimal places, multiply each column by 100, cast to long (which drops the remaining fraction), and then divide by 100
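
The multiply, cast, divide recipe drops the extra digits instead of rounding them, which is why this step can produce a different last digit than a rounding function. The plain-Python sketch below imitates the recipe, with `int()` standing in for the cast to long, and compares it with rounding. `truncate_2dp` is a hypothetical helper, and the last sample value is made up for illustration.

```python
def truncate_2dp(value):
    """Multiply by 100, drop the fraction with int(), then divide by 100."""
    return int(value * 100) / 100

# The first two values come from the step 2 output; the last one is hypothetical
samples = [78800000.29999994, 983.2915347084434, 10.999]

for value in samples:
    print(value, truncate_2dp(value), round(value, 2))
```

For the email total, truncation gives 78800000.29 while rounding gives 78800000.3. The average 983.2915347084434 becomes 983.29 either way.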

In [0]:
# TODO
final_df = (top_traffic_df
            .withColumn("avg_rev", (col("avg_rev") * 100).cast("long") / 100)
            .withColumn("total_rev", (col("total_rev") * 100).cast("long") / 100)
           )

display(final_df)

traffic_source,total_rev,avg_rev
email,78800000.29,983.29
google,47218429.0,1086.83
facebook,24797837.0,1076.62





**3.1: CHECK YOUR WORK**

In [0]:
expected3 = [(78800000.29, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result3 = [(row.total_rev, row.avg_rev) for row in final_df.collect()]

assert expected3 == result3
print("All tests pass")

All tests pass




### 4. Bonus: Rewrite using a built-in math function
Find a built-in math function that rounds to a specified number of decimal places.
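
When choosing a rounding function, it helps to know how ties are handled. The PySpark documentation describes **`round`** as using HALF_UP rounding and **`bround`** as using HALF_EVEN. The sketch below uses Python's `decimal` module, not Spark, to show how the two modes differ on a hypothetical value that sits exactly halfway between two cents.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

tie_value = Decimal("1086.825")  # hypothetical value exactly halfway between two cents
cent = Decimal("0.01")

half_up = tie_value.quantize(cent, rounding=ROUND_HALF_UP)
half_even = tie_value.quantize(cent, rounding=ROUND_HALF_EVEN)

print(half_up)
print(half_even)
```

None of the values in this lab is an exact tie, so either mode would give the same results here.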

In [0]:
# TODO
from pyspark.sql.functions import round

bonus_df = (top_traffic_df
            .withColumn("avg_rev", round("avg_rev", 2))
            .withColumn("total_rev", round("total_rev", 2))
           )

display(bonus_df)

traffic_source,total_rev,avg_rev
email,78800000.3,983.29
google,47218429.0,1086.83
facebook,24797837.0,1076.62





**4.1: CHECK YOUR WORK**

In [0]:
expected4 = [(78800000.3, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result4 = [(row.total_rev, row.avg_rev) for row in bonus_df.collect()]

assert expected4 == result4
print("All tests pass")

All tests pass




### 5. Chain all the steps above

In [0]:
# TODO
chain_df = (df
            .groupBy("traffic_source")
            .agg(sum(col("revenue")).alias("total_rev"),
                 avg(col("revenue")).alias("avg_rev"))
            .sort(col("total_rev").desc())
            .limit(3)
            .withColumn("avg_rev", round("avg_rev", 2))
            .withColumn("total_rev", round("total_rev", 2))
           )
display(chain_df)

traffic_source,total_rev,avg_rev
email,78800000.3,983.29
google,47218429.0,1086.83
facebook,24797837.0,1076.62





**5.1: CHECK YOUR WORK**

In [0]:
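# Either approach is accepted: method_a matches rounding (step 4), method_b matches truncation (step 3)
method_a = [(78800000.3,  983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
method_b = [(78800000.29, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]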
result5 = [(row.total_rev, row.avg_rev) for row in chain_df.collect()]

assert result5 == method_a or result5 == method_b
print("All tests pass")

All tests pass




### Clean up classroom

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "labuser6023680_cng7_da_asp"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>