## Structured Streaming:

**Micro-Batch Processing** -> at each trigger interval (say, every 2 seconds), the data that has arrived is collected into a batch and sent for processing.  
Each batch should finish processing before the next one is due; otherwise the stream falls behind.  
New input is appended as rows to an unbounded input table and, once processed, the results are delivered to a results table.
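
The input-table / results-table idea can be sketched in plain Python. This is a toy model only, not how Spark is implemented; the function name `run_micro_batches` and the sample values are made up for illustration.

```python
from collections import Counter

def run_micro_batches(batches):
    """Toy model: every batch appends rows to an ever-growing input table,
    then the results table (a count per key) is updated from the new rows."""
    input_table = []          # grows with every micro-batch
    results_table = Counter() # running count per key
    snapshots = []            # results table as seen after each batch
    for batch in batches:
        input_table.extend(batch)    # new rows arrive
        results_table.update(batch)  # incremental update of the counts
        snapshots.append(dict(results_table))
    return input_table, snapshots

# Three hypothetical 2-second batches of traffic sources
batches = [["email", "ads"], ["ads"], ["email", "email"]]
input_table, snapshots = run_micro_batches(batches)
print(snapshots[-1])
```

Each snapshot is what a query over the whole input table would return at that point, even though only the new rows were processed.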


In [None]:
from pyspark.sql.functions import col

#Applying transformations to a streaming DataFrame
#(the chain is wrapped in parentheses so it can span several lines):
counts_df = (spark.readStream
    #<insert input configuration>
    .filter(col("col_1") == "finalize")
    .groupBy("col_2").count()
)

#Configuring a data stream writer:
writer = (spark.readStream
    #<insert input configuration>
    .filter(col("col_1") == "finalize")
    .groupBy("col_2").count()
    .writeStream
    #<insert sink configurations>
)

### Output Modes:
- **Append** -> Add new records only
- **Update** -> Update changed records in place
- **Complete** -> Rewrite full output
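
To make the difference between the three modes concrete, here is a small plain-Python sketch of what a sink would receive for one micro-batch. It is a simplified model: the function `emit` is hypothetical, and in Spark itself append mode is only allowed when emitted rows can never change afterwards, a constraint this toy ignores.

```python
def emit(mode, previous, current):
    """Return the rows a sink would receive for one micro-batch (toy model).

    previous / current are the results table before and after the batch."""
    if mode == "complete":
        return dict(current)  # the whole results table, every time
    if mode == "update":
        # only rows that are new or whose value changed
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # only rows that did not exist before
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")

previous = {"email": 1, "ads": 1}
current = {"email": 1, "ads": 2, "search": 1}
for mode in ("append", "update", "complete"):
    print(mode, emit(mode, previous, current))
```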


### Trigger Types:
- **Default** -> Process each micro-batch as soon as the previous one has been processed
- **Fixed interval** -> Micro-batch processing kicked off at the user-specified interval
- **One-time** -> Process all of the available data as a single micro-batch and then automatically stop the query
- **Continuous Processing** -> Long-running tasks that continuously read, process, and write data as soon as events are available (*experimental, Spark 2.3+*)
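
In PySpark these trigger types map to keyword arguments of `DataStreamWriter.trigger()`. The helper below is a hypothetical convenience (`trigger_kwargs` is not part of Spark) that only builds the arguments, so it runs without a Spark session.

```python
def trigger_kwargs(trigger_type, interval=None):
    """Map a trigger type from the list above to keyword arguments for
    DataStreamWriter.trigger(). An empty dict means: do not call trigger()."""
    if trigger_type == "default":
        return {}                              # no trigger() call at all
    if trigger_type == "fixed":
        return {"processingTime": interval}    # e.g. "2 seconds"
    if trigger_type == "once":
        return {"once": True}
    if trigger_type == "continuous":
        return {"continuous": interval}        # checkpoint interval, e.g. "1 second"
    raise ValueError(f"unknown trigger type: {trigger_type}")

kwargs = trigger_kwargs("fixed", "2 seconds")
print(kwargs)
# Assumed usage on a configured writer:
#   writer = writer.trigger(**kwargs) if kwargs else writer
```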





## End-to-End Fault Tolerance:
Guaranteed in Structured Streaming by:  
- Checkpointing and write-ahead logs  
- Idempotent sinks  
- Replayable data sources
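
How the three pieces cooperate can be sketched in plain Python. This is a deliberately simplified model, not Spark's actual protocol; `IdempotentSink`, `run`, and the dict used as a checkpoint are invented for the illustration.

```python
class IdempotentSink:
    """Keyed by batch id, so writing the same batch twice has no extra effect."""
    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

def run(source, sink, checkpoint, batch_size=2, fail_on_batch=None):
    """Process the source starting from the checkpointed offset.
    The offset is committed only after the sink write has succeeded."""
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        batch_id = start // batch_size
        rows = source[start:start + batch_size]  # replayable: re-read by offset
        sink.write(batch_id, rows)
        if batch_id == fail_on_batch:
            raise RuntimeError("simulated crash before the offset commit")
        checkpoint["offset"] = start + len(rows)

source = ["a", "b", "c", "d", "e"]
sink, checkpoint = IdempotentSink(), {"offset": 0}
try:
    run(source, sink, checkpoint, fail_on_batch=1)
except RuntimeError:
    pass  # on restart, batch 1 is replayed from the saved offset
run(source, sink, checkpoint)
result = [row for batch_id in sorted(sink.batches) for row in sink.batches[batch_id]]
print(result)
```

Batch 1 is written twice, but because the sink overwrites by batch id, every row appears exactly once in the output.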

## Streaming Query Operations
- **Stop stream**
- **Await termination**
- **Status**
- **Is active**
- **Recent progress**
- **Name, ID, runID**
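
On the `StreamingQuery` object returned by `.start()`, these operations correspond to `stop()`, `awaitTermination()`, `status`, `isActive`, `recentProgress`, `name`, `id`, and `runId`. The sketch below uses a hypothetical stand-in class (`FakeQuery`, with made-up values) so that it runs without a Spark session; the two helper functions are illustrative and not part of Spark.

```python
class FakeQuery:
    """Hypothetical stand-in exposing the same attribute names as StreamingQuery."""
    name, id, runId = "program_ratings", "q-1", "r-1"
    status = {"message": "Waiting for data to arrive",
              "isDataAvailable": False, "isTriggerActive": False}
    recentProgress = []  # list of progress dicts, one per reported update

    def __init__(self):
        self.isActive = True

    def stop(self):
        self.isActive = False

def summarize_query(query):
    """Collect the monitoring fields of a StreamingQuery-like object."""
    return {
        "name": query.name,
        "id": query.id,
        "runId": query.runId,
        "isActive": query.isActive,
        "status": query.status,
        "progress_updates": len(query.recentProgress),
    }

def stop_if_active(query):
    """Stop the query if it is still running; return its final isActive flag."""
    if query.isActive:
        query.stop()
    return query.isActive

query = FakeQuery()
summary = summarize_query(query)
still_active = stop_if_active(query)
print(summary["name"], summary["isActive"], still_active)
```

With a real query, `query.awaitTermination(timeout)` would block the calling thread until the query stops or the optional timeout in seconds elapses.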

In [None]:
#To create and start (execute) a streaming query:
query = (spark.readStream
    #<insert input configuration>
    .filter(col("col_1") == "finalize")
    .groupBy("col_2").count()
    .writeStream
    #<insert sink configurations>
    .start()
)

In [None]:
#Complete streaming query
query = (spark.readStream
    .schema(dataSchema)
    .option("maxFilesPerTrigger", 1)
    .parquet(eventsPath)
    .filter(col("event_name") == "finalize")
    .groupBy("traffic_source").count()
    .writeStream
    #Note: Spark rejects append mode for an aggregation without a watermark,
    #so this needs a watermarked event-time window (or no groupBy) to run
    .outputMode("append")
    .format("parquet")
    .queryName("program_ratings")
    .trigger(processingTime="3 seconds")
    .option("checkpointLocation", checkpointPath)
    .start(outputPathDir)
)
