### Task 1

Create a streaming DataFrame from the data in the following path:  
`/mnt/training/asa/flights/2007-01-stream.parquet/`

The schema should contain:
* DepartureAt (timestamp)
* UniqueCarrier (string)

Process only 1 file per trigger.  

Aggregate the data by count, using non-overlapping 30-minute windows.  
Ignore any data that is older than 6 hours.
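
The two requirements above (tumbling 30-minute windows and a 6-hour lateness cutoff) can be illustrated with a small plain-Python sketch. This is a simplified model for intuition only, not Spark's implementation: the helper names `window_start` and `is_too_late` are hypothetical, and the sketch assumes windows are aligned to the Unix epoch and that the watermark is the maximum event time seen so far minus the delay.

```python
from datetime import datetime, timedelta, timezone

# Assumption: tumbling windows are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(event_time, width=timedelta(minutes=30)):
    """Return the start of the non-overlapping window containing event_time."""
    elapsed = event_time - EPOCH
    # Subtract the part of the elapsed time that spills past the last boundary.
    return event_time - (elapsed % width)


def is_too_late(event_time, max_event_time_seen, delay=timedelta(hours=6)):
    """Return True if event_time falls behind the (simplified) watermark."""
    watermark = max_event_time_seen - delay
    return event_time < watermark


# Hypothetical sample values, chosen only for illustration.
latest = datetime(2007, 1, 1, 18, 0, tzinfo=timezone.utc)
departure = datetime(2007, 1, 1, 10, 47, tzinfo=timezone.utc)

print(window_start(departure))         # start of the window holding 10:47
print(is_too_late(departure, latest))  # 10:47 is more than 6 hours behind 18:00
```

Each timestamp belongs to exactly one window because the windows do not overlap, which is why the expected output has a single `startTime` per row.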

The output should have 3 columns: startTime (window start time), UniqueCarrier, count.  
The output should be sorted ascending by startTime.

Display the output, firing the trigger every 5 seconds.

Once the stream has produced some output, stop it by calling `stop()` on the streaming query.

In [0]:
# ANSWER
import pyspark.sql.functions as F

path = "/mnt/training/asa/flights/2007-01-stream.parquet/"

schema = "DepartureAt timestamp, UniqueCarrier string"

streamDF = (spark                   
  .readStream                       
  .format("parquet")                
  .schema(schema)            
  .option("maxFilesPerTrigger", 1)  
  .load(path)                   
)

countsDF = (streamDF                                               # Start with the streaming DataFrame
  .withWatermark("DepartureAt", "6 hours")                         # Specify the watermark: ignore data older than 6 hours
  .groupBy(F.window("DepartureAt", "30 minutes"), "UniqueCarrier") # Group by 30-minute tumbling window and carrier
  .count()                                                         # Produce a count for each aggregate
  .withColumn("startTime", F.col("window.start"))                  # Add the column "startTime", extracting it from "window.start"
  .drop("window")                                                  # Drop the window column
  .orderBy("startTime")                                            # Sort the output ascending by "startTime"
)
display(countsDF, processingTime = "5 seconds", streamName = "myStreamName")


In [0]:
for stream in spark.streams.active:
  if stream.name == "myStreamName":
    stream.stop()