# Streaming in Spark:

Streaming is the capability that allows you to process real-time data at scale. Instead of processing a static "batch" of data (like a CSV file from last month), Spark Streaming processes data as it arrives from sources like Kafka, Flume, or IoT sensors.

**Spark primarily uses a micro-batch architecture.**

- It treats streaming as a series of very small, short-lived batch jobs.
- The system collects data arriving within a specific time interval (e.g., every 1 second) and creates a small batch.
- This batch is then processed by the Spark Engine using the same optimized logic used for batch processing.
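
The batching idea above can be sketched in plain Python. This is only a toy model, not Spark's implementation: the function name `micro_batches`, the `(timestamp, value)` event shape, and the in-memory list are all assumptions made for illustration, whereas real Spark plans each micro-batch from source offsets.

```python
from collections import defaultdict

def micro_batches(events, interval_seconds=1.0):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Toy model only: it illustrates the interval-based grouping,
    not how Spark actually schedules its jobs.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event landing in the same interval joins the same batch.
        batch_id = int(timestamp // interval_seconds)
        batches[batch_id].append(value)
    # Return batches in time order, one small "job" per interval.
    return [batches[batch_id] for batch_id in sorted(batches)]

# Hypothetical events: (arrival time in seconds, payload).
events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
for batch in micro_batches(events):
    print(batch)
```

With a 1-second interval, the five events fall into three batches, and each batch would be handed to the engine as its own short job.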

---

# Structured Streaming:

With Structured Streaming, instead of thinking about "batches," you think about a table that is continuously appended to.

### How it works:
- Input Table: Data arriving from the stream is viewed as new rows being appended to an unbounded input table.

- Query: You define a query (Select, Filter, Aggregate) on this input table as if it were a static table.

- Result Table: At every trigger interval, Spark checks for new data, processes it, and updates a "Result Table."

- Output: The updated result is pushed to a sink (like a database, console, or another Kafka topic).
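
The four steps above can be mimicked with a small pure-Python sketch. It is a toy model under stated assumptions, not Spark code: the class `ToyStreamingQuery`, its methods, and the `sensor-*` values are hypothetical, and the "query" is fixed to a count per key.

```python
from collections import Counter

class ToyStreamingQuery:
    """Toy model of the input-table / result-table idea (not Spark code)."""

    def __init__(self):
        self.input_table = []          # unbounded table: rows only get appended
        self.result_table = Counter()  # running answer to "count rows per key"
        self._processed = 0            # number of input rows already consumed

    def append(self, rows):
        # New stream data arrives as new rows at the end of the input table.
        self.input_table.extend(rows)

    def trigger(self):
        # Process only the rows added since the previous trigger,
        # then fold them into the existing result table.
        new_rows = self.input_table[self._processed:]
        self.result_table.update(new_rows)
        self._processed = len(self.input_table)
        # Output: hand the full, updated result to a sink (here, the caller).
        return dict(self.result_table)

query = ToyStreamingQuery()
query.append(["sensor-1", "sensor-2"])
first = query.trigger()
query.append(["sensor-1"])
second = query.trigger()
print(first)
print(second)
```

The second trigger reads only the one new row, yet the result it emits reflects every row seen so far, which is the behavior the result-table model describes.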

---

# Typical Workflow:

To build a Structured Streaming application, you follow these steps:

1. Define Input Source: Connect to a source like Kafka, a TCP socket, or a folder of files.
```python
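# The Kafka source also needs a topic to read; "my_topic" and "host:port" are placeholders.
lines = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:port").option("subscribe", "my_topic").load()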
```

2. Transform Data: Use standard Spark SQL or DataFrame operations.
```python
# Kafka delivers `value` as bytes: cast it to a string, then count how often each message occurs.
words = lines.selectExpr("CAST(value AS STRING)").groupBy("value").count()
```

3. Define Output Sink & Trigger: Decide where the data goes and how often to run the process.
```python
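# The trigger sets how often a micro-batch runs; the 1-second interval is only an example value.
query = words.writeStream.outputMode("complete").format("console").trigger(processingTime="1 second").start()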
```

---


# Streaming from Tables in Databricks:

In Databricks, the most common way to stream from a table in your catalog is to use the `table()` method, optionally specifying the delta format (since most Databricks tables are Delta tables).

```python
df = spark.readStream \
    .table("main.default.iot_data")

# Or using the delta format explicitly
df = spark.readStream \
    .format("delta") \
    .table("main.default.iot_data")
```

| Format Category | Format Name | Use Case |  
| :--- | :--- | :--- |  
| **Databricks Standard** | `delta` | The most common for Databricks. Best for "Bronze to Silver" pipelines. |  
| **Cloud Storage** | `cloudFiles` | **Auto Loader.** Highly recommended in Databricks for streaming files from S3/ADLS/GCS. |  
| **File Sources** | `parquet`, `json`, `csv`, `orc`, `text` | Standard file formats. Note: these require a schema to be defined. |  
| **Message Brokers** | `kafka` | Reading from Apache Kafka or Confluent. |  
| **Azure Specific** | `eventhubs` | Direct connector for Azure Event Hubs. |  
| **Testing/Dev** | `socket`, `rate` | `socket` reads text from a TCP port; `rate` generates synthetic rows for testing and benchmarking. |  

