# WRITE SPARK STREAM

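In PySpark Structured Streaming, `.writeStream` is the DataFrame property that returns a `DataStreamWriter`, the builder used to define how and where streaming data is written. In the table below, the entries from `format()` through `start()` belong to the writer, while `awaitTermination()`, `stop()`, `status`, and `recentProgress` belong to the `StreamingQuery` object that `start()` returns.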

| Method             | Description                                                        |
|--------------------|---------------------------------------------------------------------|
| format()           | Sets the output format (e.g., "parquet", "json", "console", "kafka"). |
| option()           | Sets options specific to the output format (e.g., path, checkpoint location). |
| outputMode()       | Defines how data is written: "append", "complete", or "update".     |
| partitionBy()      | (Optional) Partitions output files by specified columns (works with file sinks). |
| trigger()          | Sets the trigger interval for micro-batches (e.g., every "10 seconds"). |
| queryName()        | Assigns a name to the query for Spark UI monitoring.                |
| foreachBatch()     | (Advanced) Processes each micro-batch as a DataFrame (best for databases, APIs). |
| foreach()          | (Advanced) Processes each row individually (best for row-based sinks like NoSQL). |
| start()            | Starts the streaming query.                                         |
| awaitTermination() | Waits until the query is stopped.                                   |
| stop()             | Stops the streaming query.                                          |
| status             | Shows the current status (e.g., active).                            |
| recentProgress     | Shows recent progress reports.                                      |
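
As a rough illustration of the monitoring entries in the table, the sketch below condenses `recentProgress` into a few fields. The helper name `summarize_progress` is hypothetical, and it assumes each progress entry can be indexed like a dict using the `batchId` and `numInputRows` keys found in Spark's progress reports.

```python
def summarize_progress(query, last_n=3):
    """Return (batchId, numInputRows) pairs for the latest progress reports.

    `query` is assumed to be the StreamingQuery returned by start().
    """
    # The list is expected to end with the newest report.
    reports = query.recentProgress[-last_n:]
    return [(p["batchId"], p["numInputRows"]) for p in reports]


# Usage (assumes `df` is a streaming DataFrame):
# query = df.writeStream.format("console").start()
# print(query.status)
# print(summarize_progress(query))
# query.stop()
```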




## METHOD DETAILS

### FORMAT

Common formats:

| Format     | Description                                         |
|------------|-----------------------------------------------------|
| **parquet**  | Reads/writes streaming data as Parquet files.      |
| **json**     | Reads/writes streaming data as JSON files.         |
| **csv**      | Reads/writes streaming data as CSV files.          |
| **delta**    | Reads/writes streaming data to Delta Lake tables (ACID). |
| **kafka**    | Reads/writes streaming data from/to Kafka topics.  |
| **console**  | Writes streaming data to console (debugging).      |
| **memory**   | Writes streaming data to an in-memory table (debugging). |
| **socket**   | Reads streaming text data from a TCP socket (testing). |


### OPTIONS

| Option Key              | Description                                       | Applies To             |
|-------------------------|---------------------------------------------------|------------------------|
| **path**                | Output or input path for files.                   | File formats, Delta    |
| **checkpointLocation**  | Path for saving checkpoints (needed for fault tolerance; console and memory sinks fall back to a temporary location). | All streaming sinks    |
| **header**              | Whether CSV files have headers (true/false).      | CSV format             |
| **delimiter**           | Delimiter character for CSV.                      | CSV format             |
| **kafka.bootstrap.servers** | Kafka brokers to connect to.                    | Kafka                  |
| **subscribe**           | Kafka topic(s) to subscribe.                      | Kafka (read)           |
| **topic**               | Kafka topic to write to.                          | Kafka (write)          |
| **startingOffsets**     | Where to start reading in Kafka ("earliest", "latest"). | Kafka (read)      |
| **failOnDataLoss**      | Whether to fail if data loss is detected (true/false). | Kafka, Delta       |
| **maxFilesPerTrigger**  | Max number of files to read per trigger interval. | File sources           |
| **partitionOverwriteMode** | Overwrite behavior ("dynamic" is common).       | Delta, Parquet          |
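
To show how the Kafka write options from the table fit together, here is a minimal sketch. The function name `start_kafka_sink` and its parameters are hypothetical. It assumes the stream has a `value` column (as the `rate` source does) and that the Spark Kafka connector package is available on the cluster.

```python
def start_kafka_sink(df, bootstrap_servers, topic, checkpoint_path):
    """Start a streaming write of `df` to a Kafka topic and return the query."""
    return (
        # The Kafka sink expects a string or binary `value` column.
        df.selectExpr("CAST(value AS STRING) AS value")
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", topic)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )


# Usage (broker address, topic, and path are placeholders):
# query = start_kafka_sink(df, "localhost:9092", "events", "/tmp/checkpoints/events")
```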


### OUTPUT MODE

The default output mode is `append`.

| Output Mode | Description                                           | Typical Use Case                     |
|-------------|-------------------------------------------------------|--------------------------------------|
| **append**  | Writes only **new rows** since the last trigger.      | File sinks, Kafka, Delta (most common). |
| **complete**| Writes **the entire result table** every time.        | Aggregations (e.g., counts, sums).   |
| **update**  | Writes **only rows that changed** since the last trigger. | Aggregations with watermarking.      |
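
The difference between `complete` and `update` is easier to see on a toy example. The sketch below is a plain-Python simulation, not Spark code: it mimics a running count per key and shows what each mode would emit after every micro-batch. The function name `simulate_counts` is hypothetical, and `append` is left out because, for aggregations, it depends on watermark finalization.

```python
from collections import Counter


def simulate_counts(batches, mode):
    """Yield what a running count-per-key query would emit after each batch."""
    totals = Counter()
    for batch in batches:
        changed = set(batch)   # keys touched by this micro-batch
        totals.update(batch)   # update the full result table
        if mode == "complete":
            yield dict(totals)                       # the whole result table
        elif mode == "update":
            yield {k: totals[k] for k in changed}    # only rows that changed
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode!r}")


batches = [["a", "b"], ["a"]]
print(list(simulate_counts(batches, "complete")))
print(list(simulate_counts(batches, "update")))
```

In the second batch only key `a` changes, so `update` emits a single row while `complete` re-emits both.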



### TRIGGER

If you do not define a trigger, Spark runs micro-batches back to back: each one starts as soon as the previous one finishes.

| Trigger Option         | Description                                              | Example                              |
|------------------------|----------------------------------------------------------|--------------------------------------|
| **processingTime**     | Runs micro-batches every given interval.                 | `.trigger(processingTime="10 seconds")` |
| **once**               | Processes all available data in **one batch**, then stops (deprecated since Spark 3.4 in favor of `availableNow`). | `.trigger(once=True)`                |
| **availableNow**       | Runs batches until all available data is processed, then stops. | `.trigger(availableNow=True)`       |
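
Since `trigger()` takes exactly one of these keyword arguments, a job that switches between scheduled and catch-up runs can build the argument from configuration. The following is a small sketch; the helper `trigger_kwargs` and its mode names are hypothetical and not part of Spark.

```python
def trigger_kwargs(mode, interval=None):
    """Map a simple mode name to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "interval":
        if not interval:
            raise ValueError("an interval such as '10 seconds' is required")
        return {"processingTime": interval}
    if mode == "available_now":
        return {"availableNow": True}
    if mode == "once":
        return {"once": True}
    raise ValueError(f"unknown trigger mode: {mode!r}")


# Usage (assumes `df` is a streaming DataFrame):
# query = (
#     df.writeStream.format("console")
#     .trigger(**trigger_kwargs("interval", "10 seconds"))
#     .start()
# )
```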


## EXAMPLES

In [0]:
from pyspark.sql.functions import window, col, count as _count

In [0]:
df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
df_aggregated = (
  df.withWatermark("timestamp", "1 minutes")
  .groupBy(window(col("timestamp"), "30 seconds"))
  .agg(_count("*").alias("counter"))
)

### CONSOLE

In [0]:
query = df_aggregated.writeStream.format("console").outputMode("complete").trigger(
    processingTime="10 seconds"
).start()


In [0]:
query.awaitTermination()

In [0]:
query.status

### MEMORY

In [0]:
query = (
    df_aggregated.writeStream.format("memory")
    .outputMode("complete")
    .trigger(processingTime="10 seconds")
    .queryName("training")
    .start()
)



In [0]:
query.awaitTermination()

In [0]:
%sql
select * from training

### FOR EACH

#### FUNCTION

In [0]:
def process_foreach(row):
    # Called once per row on the executors, so output typically appears in executor logs.
    print(f"row       : {row}")
    print(f"timestamp : {row.timestamp}")

In [0]:
query = (
    df.writeStream.foreach(process_foreach)
    .trigger(processingTime="10 seconds")
    .queryName("training")
    .start()
)


#### CLASS

In [0]:
class CustomSinkEach:
    def open(self, partition_id, epoch_id):
        """
            Initializes resources per partition (e.g., open DB connection).
        """
        print(f"partition_id : {partition_id}")
        print(f"epoch_id     : {epoch_id}")
        return True # Ready to process rows

    def process(self, row):
        """
            Processes each row (main logic goes here).
        """
        print(f"processing row: {row}")

    def close(self, error):
        """
            Cleans up resources (e.g., close DB connection), called after processing or on error.
        """
        pass

In [0]:
query = (
    df.writeStream.foreach(CustomSinkEach())
    .trigger(processingTime="10 seconds")
    .queryName("training")
    .start()
)
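
Row-by-row sinks often batch their writes to avoid one network call per row. The class below is a minimal sketch of that idea using the same `open` / `process` / `close` contract. The name `BufferedSink` and the `flush_fn` callback are hypothetical, and `flush_fn` must be picklable because Spark ships the writer object to the executors.

```python
class BufferedSink:
    """Collects rows per partition and hands them to `flush_fn` in groups."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def open(self, partition_id, epoch_id):
        self.buffer = []  # start each partition/epoch with an empty buffer
        return True       # True means the rows should be processed

    def process(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def close(self, error):
        # Flush the remainder only when the partition finished without an error.
        if error is None:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


# Usage (assumes `df` is a streaming DataFrame and `send_rows` is your own function):
# query = df.writeStream.foreach(BufferedSink(send_rows, batch_size=50)).start()
```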

### FOR EACH BATCH

In [0]:
def process_batch(batch_df, batch_id):
    print(f"batch_id : {batch_id}")
    print(f"counter  : {batch_df.count()}")
    batch_df.show()
    batch_df.printSchema()

In [0]:
query = (
    df.writeStream.foreachBatch(process_batch)
    .trigger(processingTime="10 seconds")
    .queryName("training")
    .start()
)
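
A micro-batch can be replayed after a failure, so `foreachBatch` functions are usually written to be idempotent. One way to sketch this is to write every batch to a folder derived from `batch_id` in overwrite mode. The factory `make_batch_writer` and the folder layout are assumptions for illustration.

```python
def make_batch_writer(base_path):
    """Build a foreachBatch function that writes each micro-batch to its own folder."""

    def write_batch(batch_df, batch_id):
        # A replayed batch_id overwrites the same folder instead of appending twice.
        target = f"{base_path}/batch_id={batch_id}"
        batch_df.write.mode("overwrite").json(target)

    return write_batch


# Usage (assumes `df` is a streaming DataFrame; the path is a placeholder):
# query = df.writeStream.foreachBatch(make_batch_writer("/tmp/streaming_batches")).start()
```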

### TO FILE

In [0]:
base_path: str = "/mnt/data/streaming_lab3"
dbutils.fs.rm(base_path, recurse=True)
dbutils.fs.mkdirs(base_path)

In [0]:
query = df.writeStream \
    .format("json") \
    .option("checkpointLocation", "/mnt/checkpoint") \
    .option("path",base_path) \
    .start()

In [0]:

dbutils.fs.ls(base_path)

### TO TABLE

In [0]:
# toTable() starts the query; each query needs its own checkpoint location.
query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoint/streaming_lab_table") \
    .toTable("default.streaming_lab")

In [0]:
%sql
SELECT * FROM default.streaming_lab

## DEDUPLICATION


Deduplication is the process of eliminating duplicate records from a real-time data stream. It matters when data arrives from multiple sources (e.g., sensors, databases, or distributed systems) and may contain duplicates caused by network failures, retransmissions, or the behavior of the source system itself.

`df_aggregated.dropDuplicates(["id_evento", "window"])`
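
On an unbounded stream, `dropDuplicates` has to remember every key it has seen unless a watermark bounds that state. The sketch below combines the two. The helper `deduplicate_events` is hypothetical and assumes the stream carries an identifier column (such as `id_evento`) plus an event-time column. With this approach, duplicates are matched only when they share the same event time.

```python
def deduplicate_events(df, id_col, event_time_col="timestamp", delay="10 minutes"):
    """Drop repeated events while keeping deduplication state bounded by a watermark."""
    return (
        df.withWatermark(event_time_col, delay)
        # Including the event-time column lets Spark discard state older than the watermark.
        .dropDuplicates([id_col, event_time_col])
    )


# Usage (assumes `df` is a streaming DataFrame with an `id_evento` column):
# deduped = deduplicate_events(df, "id_evento")
```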