# Spark Structured Streaming - ignoreChanges

This notebook demonstrates what happens when the Delta table read by a Spark Structured Streaming query is overwritten or updated, with and without the `ignoreChanges` option.

## Setup

In [1]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("x")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

22/07/12 12:46:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
df_0 = spark.createDataFrame(data=[[1,1]], schema="a int, b int")
df_0.write.format("delta").save("delta/one")
spark.read.format("delta").load("delta/one").show()

                                                                                

+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+



In [4]:
console_0 = spark.readStream.format("delta").option("ignoreChanges", "true").load("delta/one")
console_0.writeStream.format("console").option("checkpointLocation", "delta/checkpoint/two_console").trigger(once=True).start()

<pyspark.sql.streaming.StreamingQuery at 0x7f25a877fbd0>

```
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
```

In [5]:
err_0 = spark.readStream.format("delta").load("delta/one")
err_0.writeStream.format("console").option("checkpointLocation", "delta/checkpoint/two_err").trigger(once=True).start()

<pyspark.sql.streaming.StreamingQuery at 0x7f25a87f3150>

```
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
```

In [6]:
two_0 = spark.readStream.format("delta").option("ignoreChanges", "true").load("delta/one")
two_0.writeStream.format("delta").option("checkpointLocation", "delta/checkpoint/two").trigger(once=True).start("delta/two")

22/07/12 12:47:26 WARN MicroBatchExecution: The read limit MaxFiles: 1000 for DeltaSource[file:/home/delta/one] is ignored when Trigger.Once() is used.


<pyspark.sql.streaming.StreamingQuery at 0x7f25b0088b90>

In [7]:
spark.read.format("delta").load("delta/two").show()

+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+



In [8]:
three_0 = spark.readStream.format("delta").option("ignoreChanges", "true").load("delta/one")
three_0 = three_0.dropDuplicates(["a"])
three_0.writeStream.format("delta").option("checkpointLocation", "delta/checkpoint/three").trigger(once=True).start("delta/three")

<pyspark.sql.streaming.StreamingQuery at 0x7f25a877f710>

22/07/12 12:47:50 WARN MicroBatchExecution: The read limit MaxFiles: 1000 for DeltaSource[file:/home/delta/one] is ignored when Trigger.Once() is used.


In [9]:
spark.read.format("delta").load("delta/three").show()

+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
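
In a streaming query, `dropDuplicates` keeps the keys it has already seen in the query's state, so deduplication spans micro-batches rather than a single batch. The following plain-Python sketch mimics that behavior under simplifying assumptions: `drop_duplicates_stream` is a hypothetical helper, not a Spark API, and the second batch uses made-up rows.

```python
def drop_duplicates_stream(batches, key_index=0):
    """Toy model of streaming deduplication: keys persist across batches."""
    seen = set()   # stands in for the state kept with the checkpoint
    output = []
    for batch in batches:
        emitted = []
        for row in batch:
            key = row[key_index]
            if key not in seen:
                # First time this key appears: remember it and emit the row.
                seen.add(key)
                emitted.append(row)
        output.append(emitted)
    return output


# Batch 0 holds (1, 1); a hypothetical later batch holds (1, 2) and (2, 4).
result = drop_duplicates_stream([[(1, 1)], [(1, 2), (2, 4)]])
```

In this model the row `(1, 2)` is dropped because the key `a = 1` was already seen in the first batch, so a deduplicated sink would keep the old value of `b` for that key.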



## Overwrite source table

In [10]:
df_1 = spark.createDataFrame(data=[[1,2],[2,4]], schema="a int, b int")
df_1.write.format("delta").mode("overwrite").save("delta/one")
spark.read.format("delta").load("delta/one").show()

                                                                                

+---+---+
|  a|  b|
+---+---+
|  2|  4|
|  1|  2|
+---+---+



In [11]:
console_1 = spark.readStream.format("delta").option("ignoreChanges", "true").load("delta/one")
console_1.writeStream.format("console").option("checkpointLocation", "delta/checkpoint/two_console").trigger(once=True).start()

<pyspark.sql.streaming.StreamingQuery at 0x7f25a8787e10>

```
-------------------------------------------
Batch: 1
-------------------------------------------
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  2|  4|
+---+---+
```

In [12]:
err_1 = spark.readStream.format("delta").load("delta/one")
err_1.writeStream.format("console").option("checkpointLocation", "delta/checkpoint/two_err").trigger(once=True).start()

<pyspark.sql.streaming.StreamingQuery at 0x7f25a877f350>

22/07/12 12:48:23 WARN MicroBatchExecution: The read limit MaxFiles: 1000 for DeltaSource[file:/home/delta/one] is ignored when Trigger.Once() is used.


### Error message
```
java.lang.UnsupportedOperationException: Detected a data update (for example part-00003-a001073f-99a3-4da2-894f-29f1dd1943e6-c000.snappy.parquet) in the source table at version 1.
This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected,
please restart this query with a fresh checkpoint directory.
```
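
The error above and the successful `ignoreChanges` run in `In [11]` can be mimicked with a toy model in plain Python. This is only a sketch of the observable behavior, not Delta Lake's implementation: `Commit`, `SourceChangedError`, and `read_commit` are hypothetical names, and the file names are made up.

```python
from dataclasses import dataclass, field


class SourceChangedError(Exception):
    """Stand-in for the UnsupportedOperationException shown above."""


@dataclass
class Commit:
    """One table version: the files it adds and the file names it removes."""
    added: dict = field(default_factory=dict)    # file name -> list of rows
    removed: list = field(default_factory=list)  # file names dropped by this version


def read_commit(commit, ignore_changes=False):
    """Return the rows a streaming reader would emit for one new commit."""
    if commit.removed and not ignore_changes:
        raise SourceChangedError(
            "data update detected; set ignoreChanges or use a fresh checkpoint"
        )
    # With ignore_changes, every added file is emitted in full, so rows that
    # were merely rewritten can reach the sink a second time.
    return [row for rows in commit.added.values() for row in rows]


# Version 1 of table `one`: the overwrite removes the old file and adds a new one.
overwrite = Commit(added={"part-new": [(1, 2), (2, 4)]}, removed=["part-old"])

rows_with_option = read_commit(overwrite, ignore_changes=True)

try:
    read_commit(overwrite)
    failed = False
except SourceChangedError:
    failed = True
```

With the option set, the model emits both rows of the overwritten table, matching batch 1 of the console stream. Without it, the read fails, matching the `two_err` stream.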

In [12]:
err_2 = spark.readStream.format("delta").load("delta/one")
err_2.writeStream.format("console").option("checkpointLocation", "delta/checkpoint/two_err_2").trigger(once=True).start()

<pyspark.sql.streaming.StreamingQuery at 0x7f7fc25dab90>

22/07/12 12:33:24 WARN MicroBatchExecution: The read limit MaxFiles: 1000 for DeltaSource[file:/home/delta/one] is ignored when Trigger.Once() is used.


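After changing the checkpoint location, the new stream reader starts from scratch and reads the current contents of the overwritten table `one`.
Note that the batch number is 0 instead of 1, because the fresh checkpoint carries no progress from the earlier runs.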
```
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  2|  4|
+---+---+
```
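
Why the batch number resets can be sketched with a small plain-Python model. It assumes, as a simplification, that a checkpoint location only remembers the id of the last committed batch and that every run finds new data. `checkpoints` and `run_once` are hypothetical names, not Spark APIs.

```python
# Toy model: each checkpoint location remembers the last batch it committed.
checkpoints = {}


def run_once(checkpoint_location):
    """Simulate one trigger-once run and return the batch id it processes."""
    last_batch = checkpoints.get(checkpoint_location, -1)  # -1 means no progress yet
    batch_id = last_batch + 1
    checkpoints[checkpoint_location] = batch_id
    return batch_id


first = run_once("delta/checkpoint/two_console")   # initial run
second = run_once("delta/checkpoint/two_console")  # same checkpoint after the overwrite
fresh = run_once("delta/checkpoint/two_err_2")     # new checkpoint location
```

Here `first` and `second` continue one sequence because they share a checkpoint location, while `fresh` starts a new sequence of its own.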

In [10]:
two_1 = spark.readStream.format("delta").option("ignoreChanges", "true").load("delta/one")
two_1.writeStream.format("delta").option("checkpointLocation", "delta/checkpoint/two").trigger(once=True).start("delta/two")

22/07/12 12:33:11 WARN MicroBatchExecution: The read limit MaxFiles: 1000 for DeltaSource[file:/home/delta/one] is ignored when Trigger.Once() is used.


<pyspark.sql.streaming.StreamingQuery at 0x7f7fc1d45910>