
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Stream Aggregations Lab

### Activity by Traffic

Process streaming data to display total active users by traffic source.

##### Objectives
1. Read data stream
2. Get active users by traffic source
3. Execute query with display() and plot results
4. Execute the same streaming query with DataStreamWriter
5. View results being updated in the query table
6. List and stop all active streams

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html" target="_blank">StreamingQuery</a>

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.


### Setup
Run the cells below to generate data and create the **`schema`** string needed for this lab.

In [0]:
%run ./Includes/Classroom-Setup-03L


### 1. Read data stream
- Process 1 file per trigger
- Read from the Delta table at **`'/Volumes/dbacademy_ecommerce/v01/delta/events_hist'`**

Assign the resulting streaming DataFrame to **`events_df`**.

In [0]:
events_df = <FILL_IN>

In [0]:
%skip
events_df = (spark.readStream
             .option("maxFilesPerTrigger", 1)
             .format("delta")
             .load('/Volumes/dbacademy_ecommerce/v01/delta/events_hist'))
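
As a rough mental model of **`maxFilesPerTrigger`**, the plain-Python sketch below splits a list of pending files into micro-batches of at most N files. It is not Spark code: the helper **`micro_batches`** and the file names are hypothetical and exist only to illustrate the option.

```python
from typing import Iterator, List


def micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `max_files_per_trigger` files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names, used only for illustration.
pending_files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]

# With a limit of 1, each micro-batch picks up exactly one file.
for batch_id, batch in enumerate(micro_batches(pending_files, 1)):
    print(batch_id, batch)
```

With a limit of 1, the stream advances one file at a time, which makes the incremental updates in later steps easy to observe.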


**1.1: CHECK YOUR WORK**

In [0]:
# Define the list of required columns

events_required_columns = ["device", "ecommerce", "event_name", "event_previous_timestamp", "event_timestamp", "geo", "items", "traffic_source", "user_first_touch_timestamp", "user_id"]

In [0]:
DA.validate_dataframe(events_df, events_required_columns)

### 2. Get active users by traffic source
- Set the default number of shuffle partitions to the number of cores on your cluster (optional, but the query runs faster)
- Group by **`traffic_source`**
  - Aggregate the approximate count of distinct users and alias with "active_users"
- Sort by **`traffic_source`**

In [0]:
spark.<FILL_IN>

traffic_df = events_df.<FILL_IN>

In [0]:
%skip
from pyspark.sql.functions import approx_count_distinct

spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

traffic_df = (events_df
              .groupBy("traffic_source")
              .agg(approx_count_distinct("user_id").alias("active_users"))
              .sort("traffic_source")
             )


**2.1: CHECK YOUR WORK**

In [0]:
# Expected schema fields and types

expected_fields = {
    "traffic_source": "StringType",
    "active_users": "LongType"
}

In [0]:
DA.validate_schema(traffic_df.schema, expected_fields)
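
The implementation of **`DA.validate_schema`** is not shown in this lab. The sketch below is a hypothetical stand-in, not the course helper: it compares a mapping of column names to type names against the expected mapping and lists any differences. The function name **`schema_mismatches`** and the hand-written **`actual`** dictionary are assumptions made for illustration.

```python
from typing import Dict, List


def schema_mismatches(actual: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return one message per expected column that is missing or has the wrong type."""
    problems = []
    for name, expected_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected_type:
            problems.append(f"{name}: expected {expected_type}, found {actual[name]}")
    return problems


# In the notebook, `actual` could be built from the DataFrame schema:
#   actual = {f.name: type(f.dataType).__name__ for f in traffic_df.schema.fields}
# A hand-written stand-in is used here so the sketch runs without Spark.
actual = {"traffic_source": "StringType", "active_users": "LongType"}
expected = {"traffic_source": "StringType", "active_users": "LongType"}

print(schema_mismatches(actual, expected))  # an empty list means the schema matches
```

**`active_users`** is expected to be **`LongType`** because **`approx_count_distinct`** returns a long.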


### 3. Execute query with display() and plot results
- Show the results of **`traffic_df`** using display()
- Plot the streaming query results as a bar graph

In [0]:
display(<FILL_IN>)

In [0]:
%skip
display(traffic_df)

**3.1: CHECK YOUR WORK**
- Your bar chart should plot **`traffic_source`** on the x-axis and **`active_users`** on the y-axis
- The top three traffic sources in descending order should be **`google`**, **`facebook`**, and **`instagram`**.

### 4. Execute the same streaming query with DataStreamWriter
- Name the query "active_users_by_traffic"
- Set to "memory" format and "complete" output mode
- Set a trigger interval of 1 second

In [0]:
traffic_query = (traffic_df.<FILL_IN>
)

In [0]:
%skip
traffic_query = (traffic_df
                 .writeStream
                 .queryName("active_users_by_traffic")
                 .format("memory")
                 .outputMode("complete")
                 .trigger(processingTime="1 second")
                 .start())
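
To illustrate what "complete" output mode means for this aggregation, the plain-Python sketch below keeps per-source state across micro-batches and emits the whole result table after each one. It is a simplified model, not Spark: it counts distinct users exactly with sets, whereas **`approx_count_distinct`** returns an estimate, and the events are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (traffic_source, user_id)


def run_complete_mode(batches: Iterable[List[Event]]) -> List[List[Tuple[str, int]]]:
    """Return the full result table produced after each micro-batch."""
    users_by_source: Dict[str, Set[str]] = defaultdict(set)  # state kept across batches
    outputs = []
    for batch in batches:
        for traffic_source, user_id in batch:
            users_by_source[traffic_source].add(user_id)
        # "complete" mode: emit every group each time, sorted by traffic_source
        table = sorted((src, len(users)) for src, users in users_by_source.items())
        outputs.append(table)
    return outputs


# Made-up events, for illustration only.
batches = [
    [("google", "u1"), ("facebook", "u2")],
    [("google", "u1"), ("google", "u3")],
]

for table in run_complete_mode(batches):
    print(table)
```

The second table still includes **`facebook`** even though no new facebook events arrived, because complete mode rewrites the entire result on every trigger.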

**4.1: CHECK YOUR WORK**

In [0]:
DA.validate_traffic_query(traffic_query)

### 5. View results being updated in the query table
Run a query in a SQL cell to display the results from the **`active_users_by_traffic`** in-memory table, which the memory sink names after the query.

In [0]:
%sql
<FILL_IN>

In [0]:
%skip
%sql
SELECT * FROM active_users_by_traffic

### 6. List and stop all active streams
- Use the SparkSession to get the list of all active streams
- Iterate over the list and stop each query

In [0]:
<FILL_IN>

In [0]:
%skip
for s in spark.streams.active:
    print(s.name)
    s.stop()
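
A slightly more defensive variant of the loop above could wait for each query to terminate and report any that are still active. The sketch below is one possible shape under stated assumptions: **`stop_all_streams`** and **`FakeQuery`** are hypothetical names, and **`FakeQuery`** only mimics the **`name`**, **`isActive`**, **`stop()`**, and **`awaitTermination()`** members of **`StreamingQuery`** so the example runs without Spark.

```python
from typing import Iterable, List


def stop_all_streams(queries: Iterable, timeout_seconds: float = 30.0) -> List[str]:
    """Stop each query and return the names of any still active afterwards."""
    still_active = []
    for query in list(queries):  # copy: the active list can change as queries stop
        query.stop()
        query.awaitTermination(timeout_seconds)  # wait up to the timeout
        if query.isActive:
            still_active.append(query.name)
    return still_active


class FakeQuery:
    """Stand-in that mimics a few StreamingQuery members (illustration only)."""

    def __init__(self, name: str):
        self.name = name
        self.isActive = True

    def stop(self) -> None:
        self.isActive = False

    def awaitTermination(self, timeout=None) -> bool:
        return not self.isActive


queries = [FakeQuery("active_users_by_traffic"), FakeQuery("display_query")]
leftover = stop_all_streams(queries)
print(leftover)  # names of queries that did not stop in time
```

In the notebook, the call would be **`stop_all_streams(spark.streams.active)`**.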


**6.1: CHECK YOUR WORK**

In [0]:
DA.validate_query_state(traffic_query)

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>