### Join Operations

Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame as well as with another streaming Dataset/DataFrame. The result of a streaming join is generated incrementally, similar to the results of the streaming aggregations in the previous section. This section explores which types of joins (inner, outer, semi, etc.) are supported in each case. Note that for every supported join type, the result of a join with a streaming Dataset/DataFrame is exactly the same as if it were a join with a static Dataset/DataFrame containing the same data as the stream.
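
The equivalence described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the card lookup, the two micro-batches, and the helper `inner_join` are all hypothetical and exist only to show that joining each batch as it arrives accumulates the same rows as joining all of the data at once.

```python
# Plain-Python sketch (no Spark): join each micro-batch against a static lookup
# and compare the accumulated result with one batch join over all rows.

def inner_join(transactions, cards_by_number):
    """Inner-join transaction rows to card rows on Credit_Card_ID == Card_Number."""
    return [
        {**txn, **cards_by_number[txn["Credit_Card_ID"]]}
        for txn in transactions
        if txn["Credit_Card_ID"] in cards_by_number
    ]

# Hypothetical static side, keyed by card number.
cards = {
    "C1": {"Card_Family": "Gold", "Credit_Limit": 1000},
    "C2": {"Card_Family": "Platinum", "Credit_Limit": 5000},
}

# Hypothetical stream, arriving as two micro-batches.
batch_0 = [
    {"Transaction_ID": "T1", "Transaction_Value": 1500, "Credit_Card_ID": "C1"},
    {"Transaction_ID": "T2", "Transaction_Value": 200, "Credit_Card_ID": "C9"},  # no matching card
]
batch_1 = [
    {"Transaction_ID": "T3", "Transaction_Value": 7000, "Credit_Card_ID": "C2"},
]

# Incremental: join every batch on arrival and append the new rows.
incremental = []
for batch in (batch_0, batch_1):
    incremental.extend(inner_join(batch, cards))

# Static: join all of the data in one go.
all_at_once = inner_join(batch_0 + batch_1, cards)

same_result = incremental == all_at_once
```

Because the static side does not change between batches, each streamed row can be joined independently of every other streamed row, which is why no state has to be kept across batches in this case.
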
### Stream-static Joins

Since its introduction in Spark 2.0, Structured Streaming has supported joins (inner joins and some types of outer joins) between a streaming and a static DataFrame/Dataset. Here is a simple example.

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import *
from pyspark.sql.types import *

sparkSession = SparkSession.builder.config(conf=SparkConf() \
                        .setAppName('SS') \
                        .setMaster('local[4]')).enableHiveSupport().getOrCreate()



In [2]:
card = sparkSession.read.csv(path='/Datasets/CardBase.csv', inferSchema=True, header=True)
card = card.cache()

In [3]:
card.columns

['Card_Number', 'Card_Family', 'Credit_Limit', 'Cust_ID']

In [4]:
schema = StructType([StructField('Transaction_ID', StringType(), True) \
            , StructField('Transaction_Date', StringType(), True) \
            , StructField('Transaction_Value', IntegerType(), True) \
            , StructField('Transaction_Segment', StringType(), True) \
            , StructField('Credit_Card_ID', StringType(), True)])
trans = sparkSession.readStream.schema(schema).csv('/Spark_streaming/TransactionBasenew3/') \
            .select('Transaction_ID', \
                    to_timestamp('Transaction_Date').alias('Transaction_Date'), \
                    'Transaction_Value','Transaction_Segment','Credit_Card_ID')

In [None]:
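# Same stream as in the previous cell, with a 10-minute watermark on Transaction_Date (event time).
schema = StructType([StructField('Transaction_ID', StringType(), True) \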
            , StructField('Transaction_Date', StringType(), True) \
            , StructField('Transaction_Value', IntegerType(), True) \
            , StructField('Transaction_Segment', StringType(), True) \
            , StructField('Credit_Card_ID', StringType(), True)])
trans = sparkSession.readStream.schema(schema).csv('/Spark_streaming/TransactionBasenew3/') \
            .select('Transaction_ID', \
                    to_timestamp('Transaction_Date').alias('Transaction_Date'), \
                    'Transaction_Value','Transaction_Segment','Credit_Card_ID') \
            .withWatermark('Transaction_Date', '10 minutes')

In [5]:
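# Inner join: keep only streamed transactions whose card exists in the static card table.
df = trans.join(card, card.Card_Number==trans.Credit_Card_ID, 'inner')

# Keep transactions whose value exceeds the card's credit limit.
df = df.filter(df.Transaction_Value > df.Credit_Limit) \
            .select('Transaction_ID','Transaction_Value','Credit_Limit','Card_Family')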

query = df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start(truncate=False)

query.awaitTermination()

In [8]:
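# Left outer join: keep every streamed transaction; card columns are null when no card matches.
df = trans.join(card, card.Card_Number==trans.Credit_Card_ID, 'left')

# The comparison is null for unmatched rows (Credit_Limit is null), so this filter
# drops them and the output contains the same rows as the inner join.
df = df.filter(df.Transaction_Value > df.Credit_Limit) \
            .select('Transaction_ID','Transaction_Value','Credit_Limit','Card_Family')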

query = df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start(truncate=False)

query.awaitTermination()


-------------------------------------------
Batch: 0
-------------------------------------------
+-----------------+-----------------+------------+-----------+
|Transaction_ID   |Transaction_Value|Credit_Limit|Card_Family|
+-----------------+-----------------+------------+-----------+
|CTID2035953710722|36384            |24000       |Gold       |
|CTID2035953712306|40795            |6000        |Gold       |
|CTID2035953717060|22160            |6000        |Gold       |
|CTID2035953727360|39678            |11000       |Gold       |
|CTID2035953732113|47252            |30000       |Gold       |
|CTID2035953736075|7170             |2000        |Gold       |
|CTID2035953743205|36334            |27000       |Gold       |
|CTID2035953746374|49253            |17000       |Gold       |
|CTID2035953755090|39666            |26000       |Gold       |
|CTID2035953758259|44747            |15000       |Gold       |
|CTID2035953767766|12118            |10000       |Gold       |
|CTID2035953778858|43

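
Because `query.awaitTermination()` blocks until the query stops, a notebook cell that calls it has to be interrupted by hand. A possible alternative, sketched here under assumptions, is to wait for a bounded time and then stop the query explicitly. The helper name `run_for` is hypothetical; it relies only on `awaitTermination(timeout)` returning whether the query terminated within the timeout and on `stop()`, both of which PySpark's `StreamingQuery` provides.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    `query` is expected to behave like PySpark's StreamingQuery:
    awaitTermination(timeout) returns True if the query ended within the
    timeout, and stop() shuts the query down.
    """
    finished = query.awaitTermination(seconds)
    if not finished:
        # Still running after the timeout: stop it instead of blocking forever.
        query.stop()
    return finished
```

In the cells above, `run_for(query, 60)` could take the place of `query.awaitTermination()`, so that the console sink prints batches for about a minute and the cell then returns on its own.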