## Spark Streaming

> *Spark Streaming* is an extension of the Apache Spark ecosystem designed for processing and analyzing real-time data. It enables developers to build scalable, fault-tolerant, and high-throughput stream processing applications using the familiar programming model of Apache Spark. Rather than processing large batches of data at once, Spark Streaming breaks the data into small, manageable chunks known as *micro-batches*, allowing for near-real-time processing.

Common use cases for Spark Streaming include:

- **Real-time Analytics**: Spark Streaming is widely used for real-time analytics, providing businesses with insights and intelligence as data is generated
- **Fraud Detection**: In financial services, Spark Streaming can be employed to detect fraudulent transactions in real time
- **IoT Data Processing**: Analyzing data from Internet of Things (IoT) devices as it streams in, enabling timely decision-making
- **Log Analysis**: Processing and analyzing logs and events as they occur, aiding in troubleshooting and monitoring

## Anatomy of Spark Streaming Applications

### 1. Micro-Batch Processing

**Micro-batch processing** is a fundamental concept in Spark Streaming, enabling the continuous handling of data in compact time-based intervals. This approach stands in contrast to traditional batch processing, as it allows for more frequent processing cycles, making it ideal for scenarios demanding low-latency analytics.

Spark Streaming operates on micro-batches of data. Each micro-batch represents a discrete unit of work and is processed independently.
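To make the idea concrete, here is a minimal plain-Python sketch (not Spark code) of how timestamped events could be grouped into micro-batches. The function name `to_micro_batches`, the sample events, and the 2-second batch interval are illustrative assumptions, not part of the Spark API.

```python
from collections import defaultdict

def to_micro_batches(events, batch_interval):
    """Group (arrival_time_seconds, payload) events into micro-batches.

    Batch k covers the interval [k * batch_interval, (k + 1) * batch_interval).
    """
    batches = defaultdict(list)
    for ts, payload in events:
        batch_start = (ts // batch_interval) * batch_interval
        batches[batch_start].append(payload)
    # Return the batches in time order, as a stream processor would see them
    return [batches[start] for start in sorted(batches)]

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
micro_batches = to_micro_batches(events, batch_interval=2)
print(micro_batches)
```

Each inner list is one discrete unit of work. A shorter batch interval gives lower latency, at the cost of more, smaller batches to schedule.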

### 2. DStreams (Discretized Streams)

> *DStreams*, or *Discretized Streams*, the fundamental abstraction in Spark Streaming, represent a continuous series of data divided into micro-batches. DStreams are built on the concept of *Resilient Distributed Datasets (RDDs)*, which are fault-tolerant, immutable collections of objects that can be processed in parallel across a distributed computing cluster. Each micro-batch in a DStream corresponds to an RDD, essentially creating a sequence or series of RDDs over time. These RDDs capture the state of the streaming data at each time interval.

DStreams provide a high-level API that enables developers to perform transformations and actions on streaming data. This API is similar to the one used in batch processing with Apache Spark.

### 3. Transformations and Actions

Spark Streaming applications consist of a series of transformations and actions applied to DStreams. Transformations modify the data within each micro-batch, while actions trigger the execution of computations and produce results.
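The difference between the two can be sketched in plain Python: a transformation only describes work, and nothing runs until an action asks for results. The helper names `transform` and `collect` below are hypothetical stand-ins; in the DStream API, output operations such as `pprint()` or `foreachRDD()` play the role of the action.

```python
def transform(batches, fn):
    """Transformation: lazily apply fn to each record of each micro-batch."""
    for batch in batches:
        yield [fn(record) for record in batch]

def collect(batches):
    """Action: force evaluation of the pipeline and return every result batch."""
    return list(batches)

calls = []  # records which inputs have actually been processed

def to_upper(record):
    calls.append(record)
    return record.upper()

# Hypothetical stream made of two micro-batches
stream = [["open", "close"], ["open"]]

pipeline = transform(stream, to_upper)  # describes the work, computes nothing
computed_before_action = len(calls)

results = collect(pipeline)             # the action triggers the computation
print(computed_before_action, results)
```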

### 4. Receivers and Executors

The success of a Spark Streaming application relies on the effective coordination between receivers and executors.

<p align="center">
    <img src="images/streaming-arch.jpg" width="900" height="300"/>
</p>

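*Receivers* collect data from various sources, such as Kafka, Flume, or Amazon Kinesis, and deliver it to the Spark Streaming application. They act as the entry point for ingesting streaming data. (File-based sources such as HDFS directories are read directly and do not need a receiver.)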

*Executors* process the received data in parallel across the Spark cluster. They perform the defined transformations and actions on micro-batches, ensuring efficient and scalable stream processing.

The combination of receivers and executors provides a balance between parallel processing and fault tolerance. Receivers collect data in parallel, and executors process micro-batches independently, ensuring resilience to node failures.

Let's summarize the process end-to-end:

<p align="center">
    <img src="images/streaming-flow.jpg" width="900" height="200"/>
</p>

1. **We hook up to a Data Stream Source**:

   - Spark Streaming integrates with various streaming sources, such as: Apache Kafka, Amazon Kinesis, Flume, etc. <br><br>

2. **Incoming Data is Divided into Batches**:

   - The streaming data is received in continuous streams and divided into micro-batches
   - These micro-batches represent the fundamental unit of processing in Spark Streaming <br><br>

3. **Batches are Processed by Spark Engine**:

   - The Spark engine processes each micro-batch independently using the same powerful transformations and actions available in batch processing. This ensures a consistent and scalable processing model. <br><br>

4. **Resulting Data Batches Created**:

   - The outcome of the processing is a series of result batches, each reflecting the analysis performed on the corresponding input micro-batch
   - These result batches can then be used for further analysis

> In the Databricks Lakehouse platform, Spark Streaming seamlessly integrates with Databricks notebooks. Databricks clusters support streaming workloads, allowing data engineers to create, experiment, and deploy Spark Streaming applications effortlessly. Through Databricks, developers can connect with streaming sources, define micro-batch processing intervals, and visualize streaming data output directly within notebooks.

## Initializing a Stream

The examples in the rest of this lesson use *Structured Streaming*, Spark's DataFrame-based streaming API, which by default also processes data in micro-batches. In Structured Streaming, a **data stream** is treated as a table that is continuously appended to. Think of the input data stream as an input table: every data item arriving in the stream is a new row appended to it. At every trigger interval (say, every 1 second), the newly arrived rows are appended to the input table, and the query updates the result table accordingly. Whenever the result table is updated, the changed result rows are written to an external sink.
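The input-table and result-table model can be illustrated with a small plain-Python simulation. This is only a conceptual sketch: the names `input_table`, `result_table`, and `on_trigger`, as well as the sample rows, are assumptions made for illustration, and the running count per action stands in for whatever query is defined on the stream.

```python
from collections import Counter

input_table = []          # unbounded input table: grows with every trigger
result_table = Counter()  # result table: running count per action

def on_trigger(new_rows):
    """Append the newly arrived rows, then update the result table incrementally."""
    input_table.extend(new_rows)
    for row in new_rows:
        result_table[row["action"]] += 1
    return dict(result_table)

# Hypothetical rows arriving at two consecutive triggers
after_first = on_trigger([{"action": "Open"}, {"action": "Open"}])
after_second = on_trigger([{"action": "Close"}])

print(after_first)
print(after_second)
```

Note that the result table is updated from the new rows only; the engine does not need to rescan the whole input table at every trigger.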

Databricks provides default datasets that can be used to emulate streaming scenarios. The default Databricks datasets can be found at `/databricks-datasets/`.

<p align="center">
    <img src="images/DatabricksDatasets.png" width="900" height="300"/>
</p>

As we can see, there is a dataset called `structured-streaming`. Let's examine the contents of the `/databricks-datasets/structured-streaming/events/` directory.

<p align="center">
    <img src="images/StreamingDataset.png" width="900" height="300"/>
</p>

We can see that this directory contains multiple `JSON` files. Let's run the code below to see the structure of one of these `JSON` files:

```python
# Define the path to the JSON file
jsonFilePath = "dbfs:/databricks-datasets/structured-streaming/events/file-0.json"

# Read the content of the JSON file into a DataFrame
jsonContentDF = spark.read.json(jsonFilePath)

# Show the content of the DataFrame
display(jsonContentDF)
```

<p align="center">
    <img src="images/JSONContent.png" width="900" height="450"/>
</p>

Each line in this `JSON` file (and in every other `JSON` file in this directory) contains two fields: `time` and `action`.

This might be confusing: Databricks calls this dataset `structured-streaming`, yet the sample data is clearly a set of static files. We can emulate a stream from these files by reading them one at a time, in chronological order:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

inputPath = "/databricks-datasets/structured-streaming/events/"

# Define the schema to speed up processing
jsonSchema = StructType([ StructField("time", TimestampType(), True), StructField("action", StringType(), True) ])

streamingInputDF = (
  spark
    .readStream
    .format("json")
    .schema(jsonSchema)               # Set the schema of the JSON data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .load(inputPath)
)

display(streamingInputDF)
```

In this example:

- The `readStream` method is used to initialize a streaming DataFrame by reading from a default Databricks dataset for structured streaming events
- `.option("maxFilesPerTrigger", 1)` limits each micro-batch to at most one new file, so the static files are consumed one at a time (oldest first by default), which emulates data arriving over time
- The schema of the `JSON` data is explicitly specified using `schema` for efficient processing

> When you run this code in a Databricks notebook, the `display` command will display the streaming DataFrame as a live updating dashboard. This dashboard updates as new files are read from the specified directory, providing a visual representation of the streaming data.

<p align="center">
    <img src="images/LiveDashboard.png" width="800" height="450"/>
</p>

Even though the source directory is a static set of files and no new files are added, the streaming query will not stop automatically. It processes the existing files one by one, one per trigger, and then keeps running while it waits for new files to arrive. To stop the stream manually, either interrupt the notebook execution with the **Interrupt** button at the top-right of the notebook UI, or stop the streaming query explicitly in your code. We will see how to automatically stop a streaming query later in this notebook.
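Stopping the queries explicitly in code could look like the following sketch. It assumes a `spark` session such as the one Databricks notebooks provide. The helper name `stop_all_streams` is hypothetical, while `spark.streams.active` and `StreamingQuery.stop()` are part of the PySpark API.

```python
def stop_all_streams(spark):
    """Stop every active streaming query on the given SparkSession.

    Returns the names of the stopped queries (None for unnamed queries).
    """
    stopped = []
    for query in spark.streams.active:
        query.stop()
        stopped.append(query.name)
    return stopped

# In a notebook with an active SparkSession:
# stop_all_streams(spark)
```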

### `readStream` method

> The `readStream` method can be generalized beyond the example above, and is used to initiate the process of reading streaming data from a specified source.

The general syntax of the method is:

```python
(
    spark.readStream
    .format("source_format")         # Specify the format of the streaming source (e.g., "kafka", "json", "parquet")
    .option("option_key", "value")   # Set specific options for reading from the streaming source
    .schema(custom_schema)           # Specify a custom schema for the streaming data
    .load("stream_source_path")      # Specify the path or location of the streaming data
)
```

In the syntax above:

- `spark.readStream`: This initiates the reading of streaming data. It is followed by a chain of methods to specify the details of the streaming source.

- `.format("source_format")`: Specifies the format of the streaming source. Examples include `kafka` for Apache Kafka, `json` for `JSON` files, `parquet` for Parquet files, etc.

- `.option("option_key", "value")`: Sets specific options for reading from the streaming source. These options vary depending on the source format. For example, in Kafka, you might specify the Kafka topic, group id, etc.

- `.schema(custom_schema)`: Specifies a custom schema for the streaming data. For file-based sources such as `JSON`, Structured Streaming requires a schema by default; schema inference has to be enabled explicitly through the `spark.sql.streaming.schemaInference` configuration. Even where inference is available, it is good practice to specify the schema yourself, as automatic inference might not capture the intended data types accurately.

- `.load("stream_source_path")`: Specifies the path or location of the streaming data. This could be a file path, directory path, or a connection URL depending on the source format.

We have seen an example using `JSON` files above. In another lesson, we will look at another commonly used streaming source: AWS Kinesis data streams.
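To show how the options change with the source, here is a hedged sketch of the same pattern applied to Apache Kafka. The function name `read_kafka_stream`, the broker address, and the topic name are placeholders; the option keys `kafka.bootstrap.servers`, `subscribe`, and `startingOffsets` are the ones used by Spark's Kafka source. No `.schema(...)` call is needed here, because the Kafka source returns a fixed set of columns (such as `key`, `value`, `topic`, and `timestamp`).

```python
def read_kafka_stream(spark, bootstrap_servers, topic):
    """Build a streaming DataFrame that subscribes to a single Kafka topic."""
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)  # broker address(es)
        .option("subscribe", topic)                            # topic to read from
        .option("startingOffsets", "latest")                   # only read new messages
        .load()                                                # no path for Kafka
    )

# In a notebook with an active SparkSession (placeholder values):
# kafka_df = read_kafka_stream(spark, "broker-host:9092", "events")
```

The `key` and `value` columns arrive as binary, so a typical next step is to cast `value` to a string and parse it with a schema.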

### Defining Structured Streaming Schemas

When working with Structured Streaming, defining a schema is important for processing streaming data. A schema specifies the structure of your streaming data, including the data types of each field. In PySpark, you can define a schema using the `StructType` class, `StructField` class, and related classes from the `pyspark.sql.types` module.

Let's break down the process of defining a schema:

1. **Import Necessary Classes**

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
```

2. **Define the Schema**

```python
# Define a streaming schema using StructType
streaming_schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", IntegerType(), True),
    StructField("field3", TimestampType(), True),
    # Add more fields as needed
])
```

   - `StructType`: Represents the structure of a Spark DataFrame. It is composed of a list of `StructField` objects.

   - `StructField`: Represents a field in a DataFrame. It takes three arguments: the name of the field, the data type of the field, and a boolean indicating whether the field can be null.

   - `StringType`, `IntegerType`, `TimestampType`: Represent the data types of the fields. PySpark supports various data types such as strings, integers, timestamps, etc.

3. **Use the Schema in Streaming Operations**

```python
# Read streaming data with the custom schema
streaming_df = (
    spark
    .readStream
    .format("json")
    .schema(streaming_schema)
    .option("maxFilesPerTrigger", 1)  # Example option for file-based streaming source
    .load("/path/to/streaming/data")
)
```

By defining a schema for your streaming data, you ensure that Spark processes the data correctly and consistently during the streaming operations. This becomes essential when dealing with semi-structured or `JSON` data in a streaming context. Adjust the schema definition based on the structure of your streaming data source.
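As an alternative to building `StructType` objects, `.schema(...)` also accepts a DDL-formatted string such as `"col0 INT, col1 DOUBLE"`. The sketch below builds such a string for the `time` and `action` fields of the events dataset; the helper name `to_ddl` is hypothetical and only exists to keep the field list readable.

```python
def to_ddl(fields):
    """Build a DDL-formatted schema string from (name, SQL type) pairs."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in fields)

events_ddl = to_ddl([("time", "TIMESTAMP"), ("action", "STRING")])
print(events_ddl)  # time TIMESTAMP, action STRING

# In a notebook with an active SparkSession, the string could be passed directly:
# spark.readStream.format("json").schema(events_ddl).load("/path/to/streaming/data")
```

The DDL form is compact for flat schemas, while `StructType` is usually easier to read for nested structures.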

## Window Operations

> Structured Streaming in PySpark provides powerful mechanisms for time-based operations through *window operations*. Windowing allows you to perform computations on data over specific time intervals, enabling you to gain insights into trends, patterns, and aggregates within those windows.

A **window** is a logical unit of time that you define for processing data. It involves grouping data within specified time boundaries, and then applying transformations or aggregations to that grouped data. Windows are particularly useful when dealing with time-series data, enabling you to analyze trends and patterns over fixed intervals.

Let's explore some common window operations that you can perform with Structured Streaming.

### Sliding Windows

*Sliding windows* capture data within a specified duration and slide forward by a specified interval. For example, you might want to compute the sum of values within a 1-hour window every 30 minutes.

```python
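from pyspark.sql.functions import window, sum  # note: this `sum` shadows Python's built-in

sliding_window_df = (
    streaming_df
    .groupBy(window("timestamp", "1 hour", "30 minutes"))  # 1-hour windows, sliding every 30 minutes
    .agg(sum("value"))
)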
```

In the example above:

- `window("timestamp", "1 hour", "30 minutes")`: This part defines the window specification. It indicates that we want to create windows based on the `timestamp` column. Each window has a duration of `1 hour`, and a new window starts every `30 minutes`. This means that we compute the sum of `value` over overlapping 1-hour windows, so each event falls into two windows.

- `.agg(sum("value"))`: This part specifies the aggregation operation. We use the `agg` function to compute the `sum` of the `value` column within each window. The resulting DataFrame, `sliding_window_df`, provides insights into the total sum of values within each sliding 1-hour window.
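To see why each event lands in two windows here, the following plain-Python sketch computes which sliding windows contain a given timestamp. It assumes window boundaries aligned to the Unix epoch, which matches the default behaviour of Spark's `window` function; the helper name `sliding_windows` and the sample timestamp are illustrative.

```python
from datetime import datetime, timedelta

def sliding_windows(ts, window, slide):
    """Return every [start, end) window of length `window`, advancing by `slide`, that contains ts."""
    epoch = datetime(1970, 1, 1)
    # Start of the latest window that begins at or before ts
    start = ts - ((ts - epoch) % slide)
    windows = []
    while start + window > ts:
        windows.append((start, start + window))
        start -= slide  # step back to the previous window start
    return sorted(windows)

event_time = datetime(2024, 1, 1, 12, 40)
for start, end in sliding_windows(event_time, timedelta(hours=1), timedelta(minutes=30)):
    print(start.time(), "->", end.time())
```

An event at 12:40 belongs to the windows 12:00 to 13:00 and 12:30 to 13:30, so its `value` contributes to both sums.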

### Tumbling Windows

*Tumbling windows* are non-overlapping windows of fixed duration. For instance, you might want to compute the average value within a 1-hour window:

```python
from pyspark.sql.functions import window, avg

tumbling_window_df = (
    streaming_df
    .groupBy(window("timestamp", "1 hour"))  # non-overlapping 1-hour windows
    .agg(avg("value"))
)
```

In the example above, similarly to the sliding window example, the `window("timestamp", "1 hour")` part defines the window specification. Here, we just indicate the column `timestamp`, and the fixed duration `1 hour`.
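Because tumbling windows do not overlap, every event contributes to exactly one window. The plain-Python sketch below mimics the aggregation on a handful of hypothetical `(event time in epoch seconds, value)` rows; the helper name `tumbling_average` is an assumption for illustration. Note that the grouping uses the event time stored in each row, not the time at which the row arrived.

```python
from collections import defaultdict

def tumbling_average(rows, window_seconds):
    """Average `value` per non-overlapping window; rows are (epoch_seconds, value)."""
    buckets = defaultdict(list)
    for ts, value in rows:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Hypothetical rows spread over two 1-hour windows
rows = [(0, 10.0), (1800, 20.0), (3600, 5.0), (7199, 7.0)]
averages = tumbling_average(rows, window_seconds=3600)
print(averages)
```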

> When using window operations, you can apply various types of aggregations to derive meaningful insights from your data. These aggregations include common aggregations also used in batch processing, such as `sum`, `avg`, `count`, etc.

Now, let's perform a window operation on our streaming DataFrame `streamingInputDF` from the previous section:

```python
from pyspark.sql.functions import window

streamingCountsDF = (
  streamingInputDF
    .groupBy(
      streamingInputDF.action,
      window(streamingInputDF.time, "1 hour"))
    .count()
)
```

In this example:

- `.groupBy(...)`: The `groupBy` operation is applied to group the streaming data based on two columns:

  - `action`: This is the categorical field representing different actions
  - `window(streamingInputDF.time, "1 hour")`: This specifies the window specification. It indicates that we want to create windows based on the `time` column with a fixed duration of 1 hour. Each window groups data within a distinct 1-hour interval. <br><br>

- `.count()`: The `count` aggregation calculates the number of records for each combination of `action` and window. This shows how often each action occurs within each 1-hour window.

So, the resulting DataFrame, `streamingCountsDF`, contains the count of occurrences of each unique action within non-overlapping 1-hour windows. It gives a temporal perspective on the distribution of actions over time, offering insights into how the frequency of actions changes within each hour. In the next section, we will learn how to visualize the output of such operations.

## Writing Streams

Writing streaming data to *external sinks* is a crucial aspect of real-time data processing.
 
**Sinks** are the destinations where the processed and transformed data is stored for further analysis or retrieval. Spark Streaming supports a variety of sinks, including *Delta Lake*, external databases, and more. The choice of the sink depends on the specific requirements of the application. Key considerations when selecting a sink include reliability, scalability, and the ability to handle real-time data updates.

To initiate writing streaming data to a sink, you can use the following general syntax:

```python
query = (
    streamingDataFrame  # Replace with your actual streaming DataFrame
    .writeStream
    .format("sink_format")      # Specify the sink format (e.g., "memory", "delta", "jdbc", etc.)
    .option("option_key", "value")   # Set specific options for the chosen sink
    .queryName("query_name")    # Name your streaming query
    .outputMode("output_mode")  # Specify the output mode ("complete", "append", or "update")
    .toTable("table_name")      # Start the query and write to a table (use .start() for other sinks)
)
```

### Output Modes in Streaming

The output of a streaming computation is defined by what gets written to the external storage. Spark Streaming provides three output modes:

- **Complete Mode**:

  - In this mode, the entire updated result table is written to external storage
  - The storage connector determines how to handle writing the complete table
  - Useful when the entire result set is relevant for downstream processing <br><br>

- **Append Mode**:

  - Only new rows appended to the result table since the last trigger are written to external storage
  - Applicable when existing rows in the result table are not expected to change
  - Efficient for scenarios where only new data is relevant <br><br>

- **Update Mode**:

  - Only the rows that were updated in the result table since the last trigger are written to external storage
  - Different from Complete Mode as it outputs only the changed rows since the last trigger
  - Equivalent to Append Mode if the query doesn't contain aggregations
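The three modes can be compared with a small plain-Python sketch that, given the result table before and after a trigger, returns the rows a sink would receive. The function name `emit` and the sample tables are hypothetical, and the model is deliberately simplified: in Spark, for example, Append Mode on an aggregation needs a watermark so that rows are only emitted once they can no longer change.

```python
def emit(previous, current, mode):
    """Return the rows written to the sink for one trigger under the given output mode.

    `previous` and `current` are the result table (as dicts) before and after the trigger.
    """
    if mode == "complete":
        return dict(current)  # the whole result table
    if mode == "update":
        # rows that are new or whose value changed since the last trigger
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # only brand-new rows; assumes existing rows never change
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")

before = {"Open": 2, "Close": 1}
after = {"Open": 3, "Close": 1, "Click": 4}

for mode in ("complete", "update", "append"):
    print(mode, emit(before, after, mode))
```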

### Checkpointing

*Checkpoints* are essential for fault-tolerance and ensuring the resiliency of streaming queries. In the context of Spark, a **checkpoint** is a snapshot of the distributed processing state of a streaming application. It includes metadata, configuration settings, and necessary information to recover the application's state in case of failures or restarts. To set a checkpoint location, add `.option("checkpointLocation", "path/to/checkpoint")` to the `writeStream` chain. Each streaming query should have its own checkpoint location.
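The recovery idea behind checkpointing can be sketched in plain Python: progress is persisted after each processed record, so a restarted run resumes where the failed one stopped. This is a conceptual illustration only. The function `process_with_checkpoint`, the JSON file layout, and the simulated failure are assumptions; Spark manages its own checkpoint format (offsets, commits, and state) in the checkpoint directory.

```python
import json
import os
import tempfile

def process_with_checkpoint(records, checkpoint_path, fail_at=None):
    """Process records in order, persisting the next offset after each one."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]  # resume where the last run stopped
    processed = []
    for i in range(offset, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        processed.append(records[i])
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)  # record progress durably
    return processed

records = ["r0", "r1", "r2", "r3"]
checkpoint = os.path.join(tempfile.mkdtemp(), "offset.json")

try:
    process_with_checkpoint(records, checkpoint, fail_at=2)  # fails before r2
except RuntimeError:
    pass

resumed = process_with_checkpoint(records, checkpoint)  # restart picks up at r2
print(resumed)
```

Without the checkpoint file, the second run would start again from `r0` and reprocess records that were already handled.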

### Example: Writing to Databricks Delta Lake

To illustrate the process of writing streaming data to a Delta Lake, let's continue our previous example using `streamingCountsDF`:

```python
query = (
  streamingCountsDF
    .writeStream
    .format("delta")          # Delta Lake sink for durable storage
    .queryName("streaming_query") # Can give the query a name
    .outputMode("complete")   # Complete mode: All counts should be stored in Delta Lake
    .option("checkpointLocation", "tmp/checkpoints")  # Add checkpoint location
    .toTable("delta_lake_table")  # Start the query and write to the named Delta Lake table
)
```

In this example:

- `format("delta")`: Specifies the sink type as Delta Lake for durable and reliable storage

- `outputMode("complete")`: Specifies the output mode as complete, ensuring that all counts are stored in Delta Lake

- `.option("checkpointLocation", "tmp/checkpoints")`: Adds a checkpoint location to ensure fault-tolerance and resiliency

The `query` handle represents the streaming query running in the background. This query continuously processes incoming data, updates the windowed counts, and writes the complete result set to the Delta table.

> IMPORTANT: Run all the example commands used so far in the same notebook cell: reading the stream, performing the window operation, and writing the stream should all be contained within one cell.

The cell output will report the status of the stream:

<p align="center">
    <img src="images/StreamOutput.png" width="800" height="200"/>
</p>

You can expand this output using the arrow next to the query name, `streaming_query`. You will get a dashboard of the number of records processed, batch statistics and the state of the aggregation:

<p align="center">
    <img src="images/QueryOutput.png" width="700" height="500"/>
</p>

You can either interrupt the cell to stop the operation, or stop the query programmatically after a set amount of time using the following syntax:

```python
import time
# .... the code from before

query = (
  streamingCountsDF
    .writeStream
    .format("delta")          # Delta Lake sink for durable storage
    .queryName("streaming_query") # Can give the query a name
    .outputMode("complete")   # Complete mode: All counts should be stored in Delta Lake
    .option("checkpointLocation", "tmp/checkpoints")  # Add checkpoint location
    .toTable("delta_lake_table")  # Start the query and write to the named Delta Lake table
)

time.sleep(60) # desired time after which the query should stop
query.stop()
```
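A possible alternative to `time.sleep` is `StreamingQuery.awaitTermination(timeout=...)`, which blocks for at most the given number of seconds and returns whether the query has terminated. The sketch below wraps it in a hypothetical helper named `run_for`; unlike a plain sleep, it returns early if the query ends on its own, and it raises if the query terminated with an error.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it if it is still active.

    Returns True if the query had already terminated by itself, False if it was stopped here.
    """
    finished = query.awaitTermination(timeout=seconds)
    if not finished:
        query.stop()
    return finished

# In a notebook, with `query` being the handle returned when the stream was started:
# run_for(query, 60)
```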

Once the query has stopped running, you should be able to access this streaming data using the **Data** explorer tab:

<p align="center">
    <img src="images/DeltaTable.png" width="700" height="350"/>
</p>

You can double-click on the table name to access its schema and its sample data:

<p align="center">
    <img src="images/DeltaData.png" width="800" height="450"/>
</p>


## Key Takeaways

- Spark Streaming is designed for real-time data processing, breaking data into micro-batches for scalable and fault-tolerant stream processing applications
- Micro-batch processing in Spark Streaming enables continuous handling of data in manageable intervals, providing lower-latency analytics than traditional batch processing
- DStreams, based on Resilient Distributed Datasets (RDDs), serve as the fundamental abstraction in Spark Streaming, allowing high-level transformations and actions on streaming data
- The `readStream` method initializes a streaming DataFrame, specifying the source format, options, schema, and location of the streaming data
- Schemas are crucial for processing streaming data accurately. Use `StructType`, `StructField`, and related classes to define a schema for structured streaming.
- Window operations enable time-based analysis, allowing insights into trends, patterns, and aggregates within specified intervals
- Use the `writeStream` method to write streaming data to external sinks. Output modes (Complete, Append, Update) determine what data is written.
- Checkpoints are crucial for fault-tolerance, providing snapshots of the distributed processing state to recover the application's state in case of failures or restarts