
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# (Optional) - Stream-Stream Joins

##### Objectives
1. Create streams using the Rate source
1. Perform Stream-Stream Inner Join without Watermarking
1. Perform Stream-Stream Inner Join with Watermarking
1. Perform Stream-Stream Inner Join with Watermarking and Event Time Constraints

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>


## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

# Stream-Stream Joins
We are going to use the canonical example of ad monetization, where we want to find out which ad impressions led to user clicks. 
Typically, in such scenarios, there are two streams of data from different sources - ad impressions and ad clicks. 

Both types of events have a common ad identifier (say, `adId`), and we want to match clicks with impressions based on the `adId`.
In addition, each event also has a timestamp, which we will use to specify additional conditions in the query to limit the streaming state.



## Create two streams - Impressions and Clicks

We simulate live streams in a lab setup by using the built-in `rate` format, which generates data at a given fixed rate.

See <a href="https://spark.apache.org/docs/3.5.7/structured-streaming-programming-guide.html#input-sources" target="_blank">Stream Input Sources</a> for more information on stream sources for data generation.


In [0]:
from pyspark.sql.functions import rand

spark.conf.set("spark.sql.shuffle.partitions", "1")

impressions = (
  spark
    .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load()
    .selectExpr("value AS adId", "timestamp AS impressionTime")
)

clicks = (
  spark
    .readStream.format("rate").option("rowsPerSecond", "5").option("numPartitions", "1").load()
    .where((rand() * 100).cast("integer") < 10)      # keep roughly 10% of rows at random, i.e. about 10 clicks per 100 impressions
    .selectExpr("(value - 50) AS adId", "timestamp AS clickTime")      # subtract 50 so that the click for a given adId is generated after its impression (i.e. delayed data)
    .where("adId > 0")      # drop the non-positive ids produced by the offset
)

Let's see what data these two streaming DataFrames generate.


In [0]:
display(impressions, streamName="display_impressions")

##################################
## Once finished viewing, click  ##
## 'Interrupt' before proceeding ##
##################################

In [0]:
display(clicks, streamName="display_clicks")

##################################
## Once finished viewing, click  ##
## 'Interrupt' before proceeding ##
##################################

### Stream-Stream Inner Join without Watermark

Let's join these two data streams. The syntax is exactly the same as for joining two batch DataFrames/Datasets on their common key `adId`.

In [0]:
################################################
## Without Watermark, State continues to grow ##
################################################

display(impressions.join(clicks, "adId"), streamName="naive_streaming_join")

###################################
## Once finished viewing, click  ##
## 'Interrupt' before proceeding ##
###################################

After you start this query, you will start getting joined impressions and clicks within a minute. The delay occurs because each click is generated after its corresponding impression (with the settings above, about 10 seconds later: an offset of 50 rows at 5 rows per second).
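
The arithmetic behind that delay can be sketched in a few lines of plain Python. This is only an illustration: it assumes both `rate` sources start counting `value` from 0 at the same moment and emit rows evenly, and the helper name `click_delay_seconds` is hypothetical (it is not part of Spark).

```python
ROWS_PER_SECOND = 5   # matches the rowsPerSecond option used for both streams
ID_OFFSET = 50        # matches the "(value - 50) AS adId" expression in the clicks stream


def click_delay_seconds(rows_per_second: int, id_offset: int) -> float:
    """Approximate seconds between an impression and a click carrying the same adId.

    Assumption: the rate source emits `rows_per_second` consecutive values each
    second, so the click stream needs id_offset / rows_per_second seconds to
    reach the value that maps back to a given adId.
    """
    return id_offset / rows_per_second


delay = click_delay_seconds(ROWS_PER_SECOND, ID_OFFSET)
print(f"A click for a given adId is generated about {delay:.0f} seconds after its impression")
```

In the running notebook the observed delay will also include query start-up and micro-batch scheduling time, so treat the number as a lower bound rather than an exact figure.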

In addition, if you expand the details of the query above, you will find a few timelines of query metrics - the processing rates, the micro-batch durations, and the size of the state. 
If you keep running this query, you will notice that the state keeps growing without bound. This is because the query must buffer all past input, since any new input can match any input from the past.
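
If you prefer to read the state size programmatically rather than from the chart, the sketch below shows one possible way to pull it out of a streaming progress report. It assumes the report is dictionary-like and carries a `stateOperators` list whose entries include `numRowsTotal`; the helper name `total_state_rows` and the sample values are made up for illustration.

```python
from typing import Any, Mapping, Optional


def total_state_rows(progress: Optional[Mapping[str, Any]]) -> int:
    """Sum numRowsTotal across all stateful operators in one progress report.

    Returns 0 when no progress has been reported yet (progress is None or empty).
    """
    if not progress:
        return 0
    return sum(op.get("numRowsTotal", 0) for op in progress.get("stateOperators", []))


# Hand-written sample shaped like a progress report; the numbers are invented.
sample_progress = {
    "name": "naive_streaming_join",
    "stateOperators": [{"numRowsTotal": 1200, "memoryUsedBytes": 350000}],
}
print(total_state_rows(sample_progress))
```

While the join is running, calling something like `total_state_rows(q.lastProgress)` for each `q` in `spark.streams.active` every few seconds should show the row count climbing for this query.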


### Stream-Stream Inner Join with Watermarking

To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and therefore can be cleared from the state. In other words, you will have to do the following additional steps in the join.

1. Define watermark delays on both inputs such that the engine knows how delayed the input can be. 

1. Define a constraint on event time across the two inputs so that the engine can determine when old rows of one input are no longer required (i.e. can no longer satisfy the time constraint) for matches with the other input. This constraint can be defined in one of two ways:

    - Time range join conditions (e.g. `...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR`)

    - Join on event-time windows (e.g. `...JOIN ON leftTimeWindow = rightTimeWindow`)

Let's apply these steps to our use case. 

1. Watermark delays: Say, the impressions and the corresponding clicks can be delayed/late in event time by at most "10 seconds" and "20 seconds", respectively. This is specified in the query as watermark delays using `withWatermark`.

1. Event-time range condition: Say, a click can occur within a time range of 0 seconds to 1 minute after the corresponding impression. This is specified in the query as a join condition between `impressionTime` and `clickTime`.
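
As a rough, back-of-the-envelope illustration of why these two settings bound the state, the sketch below estimates how long each side would need to be buffered in event time. It follows the style of reasoning used in the Spark programming guide's discussion of stream-stream joins. The function name `state_retention_seconds` and its parameters are hypothetical, and the engine's actual cleanup logic (which also depends on how the global watermark is chosen) is more involved.

```python
from typing import Tuple


def state_retention_seconds(
    impression_delay: int,  # watermark delay on impressions, in seconds
    click_delay: int,       # watermark delay on clicks, in seconds
    min_gap: int,           # earliest a click may follow its impression, in seconds
    max_gap: int,           # latest a click may follow its impression, in seconds
) -> Tuple[int, int]:
    """Estimate how long (in event time) each side must stay buffered.

    An impression at time t can match clicks up to t + max_gap, and those clicks
    may arrive up to click_delay late, so it is needed for click_delay + max_gap.
    A click at time c can match impressions no later than c - min_gap, and those
    impressions may arrive up to impression_delay late, so it is needed for
    impression_delay - min_gap (never less than zero).
    """
    impression_retention = click_delay + max_gap
    click_retention = max(0, impression_delay - min_gap)
    return impression_retention, click_retention


# Values from this notebook: 10 s and 20 s watermarks, clicks within 0 to 60 s.
print(state_retention_seconds(impression_delay=10, click_delay=20, min_gap=0, max_gap=60))
```

Under these assumptions, impressions need to be kept far longer than clicks, because a late click may still refer back to an impression from more than a minute earlier.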



In [0]:
from pyspark.sql.functions import expr

# Define watermarks
impressionsWithWatermark = (impressions
  .selectExpr("adId AS impressionAdId", "impressionTime")
  .withWatermark("impressionTime", "10 seconds"))   # max 10 seconds late

clicksWithWatermark = (clicks
  .selectExpr("adId AS clickAdId", "clickTime")
  .withWatermark("clickTime", "20 seconds"))        # max 20 seconds late

In [0]:
# Inner join with Watermark 
display(impressionsWithWatermark.join(
    clicksWithWatermark,
    expr(""" clickAdId = impressionAdId""")), streamName="streaming_join_with_watermarks")

### Stream-Stream Join with Watermark and Event Time Constraint

Adding an event-time constraint to the join condition, on top of the watermarks, enables Structured Streaming to perform full state cleanup. Use this pattern for long-running streaming queries.

In [0]:
# Inner join with watermark + Time conditions - Required for full state cleanup
display(impressionsWithWatermark.join(
    clicksWithWatermark,
    expr(""" clickAdId = impressionAdId AND 
      clickTime >= impressionTime AND 
      clickTime <= impressionTime + interval 1 minute""")), streamName="streaming_join_with_watermarks_and_event_time_constraints")

The results are similar to those of the earlier simple join query. However, if you look at the query metrics now, you will find that after the query has run for a couple of minutes, the size of the state stabilizes as old buffered events are cleared.
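
To build intuition for why the state levels off, here is a toy, pure-Python model of the buffered rows only (it does not perform the join and is not how Spark is implemented). It assumes 5 impressions per second, a deterministic click on every tenth `adId` arriving 10 seconds after its impression, and a single watermark equal to the slower of the two watermark delays. The function name `simulate_state` is hypothetical.

```python
from typing import Dict, List


def simulate_state(total_seconds: int, cleanup: bool) -> List[int]:
    """Return the number of buffered rows after each simulated second."""
    rows_per_second = 5
    click_lag = 10                       # seconds between an impression and its click
    max_gap = 60                         # join condition: click within 60 s of the impression
    impression_wm, click_wm = 10, 20     # watermark delays in seconds

    impressions: Dict[int, int] = {}     # adId -> impressionTime
    clicks: Dict[int, int] = {}          # adId -> clickTime
    sizes: List[int] = []

    for now in range(total_seconds):
        # New impressions arriving this second.
        for k in range(rows_per_second):
            impressions[now * rows_per_second + k] = now

        # Clicks arriving now belong to impressions from click_lag seconds ago.
        source_second = now - click_lag
        if source_second >= 0:
            for k in range(rows_per_second):
                ad_id = source_second * rows_per_second + k
                if ad_id % 10 == 0:      # deterministic stand-in for the 10% sample
                    clicks[ad_id] = now

        if cleanup:
            # Use the slower of the two watermarks for both sides.
            watermark = min(now - impression_wm, now - click_wm)
            # An impression is useless once no click at or after the watermark can match it.
            impressions = {a: t for a, t in impressions.items() if t + max_gap >= watermark}
            # A click is useless once every future impression would be later than it.
            clicks = {a: t for a, t in clicks.items() if t >= watermark}

        sizes.append(len(impressions) + len(clicks))
    return sizes


unbounded = simulate_state(300, cleanup=False)
bounded = simulate_state(300, cleanup=True)
print("without cleanup:", unbounded[99], unbounded[199], unbounded[299])
print("with cleanup:   ", bounded[99], bounded[199], bounded[299])
```

In this model the row count without cleanup grows every second, while the count with cleanup flattens out once the oldest impressions fall outside the watermark plus the one-minute range, which mirrors the shape of the state chart for the query above.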


### Further Information
You can read more about stream-stream joins in the following places:

- Databricks blog post on stream-stream joins - https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
- Apache Spark Programming Guide on Structured Streaming - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins
- Talk at Spark Summit Europe 2017 - https://databricks.com/session/deep-dive-into-stateful-stream-processing-in-structured-streaming


In [0]:
# Stop every streaming query that is still active so the cluster can release its resources
for s in spark.streams.active:
    print(s.name)
    s.stop()

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>