# 🔄 Structured Streaming in Spark

## 📡 Understanding Data Streams
Data streams represent continuously growing data sources such as:
- New JSON log files landing in cloud storage
- Change Data Capture (CDC) from databases
- Event streams from systems like Kafka
  - Events queued in a pub/sub messaging feed

Instead of processing the entire dataset repeatedly, modern streaming systems focus on **incremental processing**, handling only **new data** since the last update.

---

## 🔄 Processing a Data Stream
There are two approaches:
1. Reprocess the entire source dataset each time.
2. Process only the new data added since the last update.
    - This is the approach taken by Structured Streaming.
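
To make the difference between the two approaches concrete, here is a minimal pure-Python sketch (no Spark involved). The names `IncrementalCounter` and `reprocess_all` are hypothetical and exist only for illustration; the source is modelled as a plain list that keeps growing.

```python
def reprocess_all(source):
    """Approach 1: recompute the result from the whole source every time."""
    return sum(source)


class IncrementalCounter:
    """Approach 2: remember how far we got and handle only new records."""

    def __init__(self):
        self.offset = 0  # index of the first record not yet processed
        self.total = 0   # running result built up so far

    def process(self, source):
        new_records = source[self.offset:]
        self.total += sum(new_records)
        self.offset = len(source)
        return len(new_records)  # how many records this call actually touched


source = [1, 2, 3]
counter = IncrementalCounter()
first_run = counter.process(source)   # touches all 3 records

source.extend([4, 5])                 # new data arrives
second_run = counter.process(source)  # touches only the 2 new records
```

Both approaches arrive at the same total, but the incremental one does work proportional to the new data only, which is the idea Structured Streaming builds on.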

---

## ⚙️ Spark Structured Streaming Overview
Spark Structured Streaming is a powerful engine that treats data streams as **unbounded tables**.  
You write queries on streaming data just like static tables — new records simply appear as new rows.

It lets you query an infinite data source: the engine automatically detects new data and persists the results incrementally into a data sink.

A **sink** is a durable storage target, such as files or tables.

---

## 💧 Delta Lake Integration
Delta Lake integrates seamlessly with Structured Streaming, allowing for efficient real-time ingestion and query capabilities.

- Use `spark.readStream` to read from Delta tables (it is a property, so it takes no parentheses).
- Write streaming results using `dataframe.writeStream` with various configurations.
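
As a rough illustration of the read and write sides together, the sketch below copies new rows from one table to another. It assumes a `spark` session where tables are Delta tables (as on Databricks, or with the Delta package configured) and Spark 3.1 or later for `table` and `toTable`. The function name and all arguments are hypothetical.

```python
def start_table_copy(spark, source_table, target_table, checkpoint_dir):
    """Continuously append new rows of `source_table` to `target_table`.

    `spark` is assumed to be an existing SparkSession.
    """
    # readStream is a property that returns a DataStreamReader.
    stream_df = spark.readStream.table(source_table)

    return (
        stream_df.writeStream  # property that returns a DataStreamWriter
        .option("checkpointLocation", checkpoint_dir)  # where progress is tracked
        .outputMode("append")
        .toTable(target_table)  # starts the query, returns a StreamingQuery
    )
```

A call such as `start_table_copy(spark, "orders_raw", "orders_clean", "/tmp/checkpoints/orders")` (table names and path made up) would return a `StreamingQuery` handle that can later be stopped with `.stop()`.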

---

## 🔧 Configuring Streaming Writes

Key configurations for writing data streams include:

- **Trigger Intervals** – Controls how often Spark processes new data.
- **Checkpointing** – Ensures fault tolerance by tracking the state and progress of a stream, allowing recovery in case of failure.

---

## ⏱️ Trigger Options

- **Micro-Batch Mode** – Processes data in small time-based intervals.
- **One-Time Batch Mode** – Executes a single batch to process available data.

Each mode suits different latency and throughput requirements.
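
The two trigger modes above map onto keyword arguments of `DataStreamWriter.trigger` in PySpark. The helper below is only a sketch: `apply_trigger` and its mode strings are hypothetical, and `writer` is assumed to be whatever `df.writeStream` returns.

```python
def apply_trigger(writer, mode, interval="1 minute"):
    """Return `writer` configured with the trigger for the given mode."""
    if mode == "micro-batch":
        # Start a new micro-batch every `interval`.
        return writer.trigger(processingTime=interval)
    if mode == "one-time":
        # Process the data available now in a single batch, then stop.
        return writer.trigger(once=True)
    raise ValueError(f"unknown trigger mode: {mode!r}")
```

Newer Spark releases also accept `trigger(availableNow=True)`, which likewise stops once the available data is consumed but may split the work into several batches; whether that is preferable depends on the Spark version in use.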

---

## 📤 Output Modes

- **Append Mode** – Writes only newly arrived rows.
- **Complete Mode** – Rewrites the full result every time data is processed.

Choice of mode depends on the use case and the type of transformation being applied.
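
For example, a running aggregation changes its earlier results as new data arrives, so it is a natural fit for complete mode, whereas row-by-row transformations such as filters fit append mode. The sketch below assumes a `spark` session, Delta tables, and a hypothetical `event_type` column; the function name is made up for illustration.

```python
def start_event_counts(spark, source_table, target_table, checkpoint_dir):
    """Maintain a per-type event count that is rewritten on every trigger."""
    counts = (
        spark.readStream.table(source_table)
        .groupBy("event_type")  # hypothetical column name
        .count()
    )
    return (
        counts.writeStream
        .outputMode("complete")  # counts change over time, so rewrite the full result
        .option("checkpointLocation", checkpoint_dir)
        .toTable(target_table)
    )
```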

---

## ✅ Fault Tolerance & Guarantees

Structured Streaming offers:
1. Fault Tolerance
   - Checkpointing + Write-ahead logs
     - Record the offset range of data being processed during each trigger interval
2. Exactly-once guarantee
   - Idempotent sinks
     - Multiple writes of the same data, identified by its offset range, do not result in duplicates being written to the sink

Overall:
- **Exactly-once processing semantics** – Prevents duplicates or data loss.
- **Checkpoints & Write-Ahead Logs** – Ensure resilience and recoverability of streaming jobs.
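
How offset tracking and an idempotent sink combine into an exactly-once result can be sketched in plain Python. This is a toy model and not Spark's actual implementation: `IdempotentSink`, `run_batch`, and the dictionary used as a checkpoint are all hypothetical.

```python
class IdempotentSink:
    """Ignores a batch whose offset range has already been written."""

    def __init__(self):
        self.rows = []
        self.written_ranges = set()

    def write(self, start, end, batch):
        if (start, end) in self.written_ranges:
            return False  # duplicate write: nothing is added
        self.rows.extend(batch)
        self.written_ranges.add((start, end))
        return True


def run_batch(source, checkpoint, sink, batch_size, crash_after_write=False):
    # After a restart, replay the range that was logged but never committed.
    start, end = checkpoint.get("pending") or (
        checkpoint["offset"],
        min(checkpoint["offset"] + batch_size, len(source)),
    )
    checkpoint["pending"] = (start, end)  # write-ahead: log the planned range first
    sink.write(start, end, source[start:end])
    if crash_after_write:
        return  # simulated failure before progress was committed
    checkpoint["offset"] = end
    checkpoint["pending"] = None


source = list("abcdef")
checkpoint = {"offset": 0, "pending": None}
sink = IdempotentSink()

run_batch(source, checkpoint, sink, 3, crash_after_write=True)  # writes a, b, c then "crashes"
run_batch(source, checkpoint, sink, 3)  # replays the same range; the sink skips it
run_batch(source, checkpoint, sink, 3)  # processes d, e, f
```

The logged offset range makes the retry reproducible, and the sink's duplicate check makes the retry harmless, so each record ends up in the sink exactly once despite the simulated failure.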

---

## 🔄 Operations on Streaming DataFrames

Most transformations supported on static DataFrames are also supported on streaming DataFrames.

⚠️ Some operations are **not supported** directly on a streaming DataFrame, because on unbounded data they would need to see or remember the entire input:
 - **Sorting** (global ordering)
 - **Deduplication**

---

## ⏳ Advanced Streaming Techniques

There are advanced streaming methods that can help achieve operations that are not otherwise supported:
- **Windowing** – Groups data into defined time windows for time-based analysis.
- **Watermarking** – Handles **late-arriving data**, helping avoid processing outdated records indefinitely.
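
The interaction between windows and a watermark can be sketched without Spark. The function below is a simplified model: it advances the watermark after every event, whereas Spark updates it per micro-batch. The name `tumbling_window_counts` and the integer timestamps are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(event_times, window_size, allowed_lateness):
    """Count events per fixed-size window, dropping events that arrive too late.

    `event_times` are event timestamps (e.g. seconds) in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")

    for ts in event_times:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness

        window_start = (ts // window_size) * window_size
        window_end = window_start + window_size

        if window_end <= watermark:
            dropped.append(ts)  # the window is already finalized
            continue
        counts[window_start] += 1

    return dict(counts), dropped


# Event 3 arrives late but within the allowed lateness; event 8 arrives too late.
counts, dropped = tumbling_window_counts(
    [1, 4, 12, 3, 27, 8], window_size=10, allowed_lateness=10
)
```

In PySpark the corresponding building blocks are `DataFrame.withWatermark` and `pyspark.sql.functions.window`; the watermark is what lets the engine discard state for old windows instead of keeping it forever.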

---

Structured Streaming enables **real-time analytics** on live data while offering the flexibility and expressiveness of traditional Spark SQL — making it ideal for modern, scalable, and fault-tolerant data pipelines.
