# Analysing New York City Taxi Data with Spark Streaming

In [65]:
from pyspark.sql.session import SparkSession
from delta import configure_spark_with_delta_pip


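# configure_spark_with_delta_pip sets "spark.jars.packages" itself, which
# replaces any value set on the builder, so the Kafka connector is passed
# through extra_packages instead. The connector version should match the
# installed Spark version.
spark = configure_spark_with_delta_pip(
    SparkSession.builder.appName("MyApp").config(
        "spark.sql.extensions",
        "io.delta.sql.DeltaSparkSessionExtension"
    ).config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    ).config(
        "spark.sql.repl.eagerEval.enabled", True
    ),
    extra_packages=["org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0"]
).getOrCreate()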

Make sure the Kafka producer is already publishing to the topic before running the cells below!

In [66]:
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, BooleanType, TimestampType, DateType
)

schema = StructType(
      [
        StructField("medallion",          StringType(), False),
        StructField("hack_licence",       StringType(), False),
        StructField("vendor_id",          StringType(), False),
        StructField("rate_code",          IntegerType(), False),
        StructField("store_and_fwd_flag", StringType(), False),
        StructField("pickup_datetime",    StringType(), False),
        StructField("dropoff_datetime",   StringType(), False),
        StructField("passenger_count",    IntegerType(), False),
        StructField("trip_time_in_secs",  IntegerType(), False),
        StructField("trip_distance",      DoubleType(), False),
        StructField("pickup_longitude",   DoubleType(), False),
        StructField("pickup_latitude",    DoubleType(), False),
        StructField("dropoff_longitude",  DoubleType(), False),
        StructField("dropoff_latitude",   DoubleType(), False),
        StructField("timestamp",          TimestampType(), False)
      ]
    )

In [67]:
kafka_server = "kafka1:9092" 

kafka_df = (spark.readStream                        # Get the DataStreamReader
  .format("kafka")                                 # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafka_server) # Configure the Kafka server name and port
  .option("subscribe", "stock")                       # Subscribe to the "stock" Kafka topic 
  .option("startingOffsets", "earliest")           # The start point when a query is started
  .option("maxOffsetsPerTrigger", 100)             # Rate limit on max offsets per trigger interval
  .load() # Load the DataFrame
)

In [68]:
from pyspark.sql.functions import from_json, col

parsed_df = kafka_df.select(
    from_json(
        col("value").cast("string"), schema
    ).alias("parsed_value")
).select("parsed_value.*")

In [73]:
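table_name = "taxi_rides"
checkpoint_path = f"{table_name}_checkpoints"

# A Kafka sink would need a "value" column, a topic, and bootstrap servers.
# Writing to a Delta table instead (an assumption about the intent) makes
# the rows available to the SELECT on table_name that follows.
query = (parsed_df
         .writeStream
         .outputMode("append")
         .format("delta")
         .option("checkpointLocation", checkpoint_path)
         .queryName(table_name)
         .toTable(table_name))  # starts the query and writes to the named table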

The next code block triggers this error:
```
Py4JJavaError: 
An error occurred while calling o35.sql.: 
org.apache.spark.SparkException: 
Cannot find catalog plugin class for catalog 'spark_catalog': 
org.apache.spark.sql.delta.catalog.DeltaCatalog.
...
```
Can someone figure out why?

In [71]:
result_df = spark.sql(f"SELECT * FROM {table_name}")

## The project starts here

You can create a

## [Query 1] Utilization per taxi/driver over windows of 5, 10, and 15 minutes. This can be derived from the idle time per taxi. How does it change with the window size? Is there an optimal window?
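
As a starting point, here is a minimal plain-Python sketch of the arithmetic behind this query, not a streaming implementation. It assumes that utilization means the fraction of a window during which the taxi carries a passenger (so idle time is the rest of the window) and that the trips of one taxi do not overlap. The function name `utilization` and the sample trips are made up for illustration.

```python
from datetime import datetime, timedelta


def utilization(trips, window_start, window_minutes):
    """Fraction of the window during which the taxi is occupied.

    trips: list of (pickup, dropoff) datetimes for ONE taxi, assumed non-overlapping.
    """
    window_end = window_start + timedelta(minutes=window_minutes)
    window_secs = window_minutes * 60
    busy_secs = 0.0
    for pickup, dropoff in trips:
        # Clip each trip to the window so only the overlapping part counts.
        start = max(pickup, window_start)
        end = min(dropoff, window_end)
        if end > start:
            busy_secs += (end - start).total_seconds()
    # Idle time is window_secs - busy_secs, so utilization is busy / window.
    return busy_secs / window_secs


# Made-up sample: two trips of one taxi, starting shortly after 12:00.
t0 = datetime(2013, 1, 1, 12, 0)
trips = [
    (t0 + timedelta(minutes=1), t0 + timedelta(minutes=4)),
    (t0 + timedelta(minutes=8), t0 + timedelta(minutes=12)),
]

utilization_by_window = {m: utilization(trips, t0, m) for m in (5, 10, 15)}
print(utilization_by_window)
```

In the streaming version the same clipping would have to happen per window and per `medallion` (or `hack_licence`), and `pickup_datetime` and `dropoff_datetime` would first need to be converted from strings to timestamps.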

## [Query 2] The average time it takes for a taxi to find its next fare (trip) per destination borough. This can be computed as the time difference, e.g. in seconds, between a trip's drop-off and the next trip's pick-up within a given unit of time.
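
The gap computation described above can be sketched in plain Python before porting it to a stream. This is only an illustration under assumptions: each trip already carries a drop-off borough label (the schema only has coordinates, so that lookup is left out), the function name `avg_wait_by_borough` and the sample rows are hypothetical, and no time window is applied.

```python
from collections import defaultdict
from datetime import datetime


def avg_wait_by_borough(trips):
    """Average seconds between a drop-off and the same taxi's next pick-up,
    grouped by the borough of the drop-off.

    trips: iterable of (taxi_id, pickup, dropoff, dropoff_borough).
    """
    by_taxi = defaultdict(list)
    for taxi_id, pickup, dropoff, borough in trips:
        by_taxi[taxi_id].append((pickup, dropoff, borough))

    waits = defaultdict(list)
    for rides in by_taxi.values():
        rides.sort(key=lambda ride: ride[0])  # order each taxi's trips by pick-up time
        for current, following in zip(rides, rides[1:]):
            gap = (following[0] - current[1]).total_seconds()
            if gap >= 0:  # skip inconsistent records where trips overlap
                waits[current[2]].append(gap)

    return {borough: sum(gaps) / len(gaps) for borough, gaps in waits.items()}


def at(hour, minute):
    return datetime(2013, 1, 1, hour, minute)


# Made-up sample trips for two taxis.
sample = [
    ("A", at(12, 0), at(12, 10), "Manhattan"),
    ("A", at(12, 15), at(12, 30), "Brooklyn"),
    ("A", at(12, 40), at(12, 50), "Manhattan"),
    ("B", at(12, 5), at(12, 20), "Manhattan"),
    ("B", at(12, 21), at(12, 35), "Queens"),
]

result = avg_wait_by_borough(sample)
print(result)
```

The last trip of each taxi contributes nothing, because there is no following pick-up to measure against.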

In [None]:
# remember you can register another stream


## [Query 3] The number of trips that started and ended within the same borough in the last hour
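
One way to reason about this query is sketched below in plain Python. It rests on several assumptions: "in the last hour" is interpreted by drop-off time, the rectangles in `BOXES` are crude placeholders and not real borough boundaries, and the helper names `borough_of` and `same_borough_trips_last_hour` are invented for this example.

```python
from datetime import datetime, timedelta

# Placeholder rectangles (min_lon, max_lon, min_lat, max_lat).
# These are NOT real borough boundaries; replace them with proper polygons.
BOXES = {
    "Manhattan": (-74.02, -73.93, 40.70, 40.88),
    "Brooklyn": (-74.04, -73.86, 40.57, 40.70),
}


def borough_of(lon, lat):
    """Return the name of the first box containing the point, or None."""
    for name, (min_lon, max_lon, min_lat, max_lat) in BOXES.items():
        if min_lon <= lon <= max_lon and min_lat <= lat < max_lat:
            return name
    return None


def same_borough_trips_last_hour(trips, now):
    """Count trips dropped off in (now - 1h, now] that start and end in the same borough."""
    cutoff = now - timedelta(hours=1)
    count = 0
    for trip in trips:
        if not (cutoff < trip["dropoff_datetime"] <= now):
            continue
        start = borough_of(trip["pickup_longitude"], trip["pickup_latitude"])
        end = borough_of(trip["dropoff_longitude"], trip["dropoff_latitude"])
        if start is not None and start == end:
            count += 1
    return count


def trip(dropoff, p_lon, p_lat, d_lon, d_lat):
    return {
        "dropoff_datetime": dropoff,
        "pickup_longitude": p_lon, "pickup_latitude": p_lat,
        "dropoff_longitude": d_lon, "dropoff_latitude": d_lat,
    }


# Made-up sample trips.
now = datetime(2013, 1, 1, 13, 0)
sample = [
    trip(datetime(2013, 1, 1, 12, 30), -73.98, 40.75, -73.97, 40.78),  # same box
    trip(datetime(2013, 1, 1, 12, 45), -73.98, 40.75, -73.95, 40.65),  # different boxes
    trip(datetime(2013, 1, 1, 11, 30), -73.98, 40.75, -73.97, 40.78),  # older than one hour
    trip(datetime(2013, 1, 1, 12, 50), -73.95, 40.65, -73.99, 40.60),  # same box
]

same_borough_count = same_borough_trips_last_hour(sample, now)
print(same_borough_count)
```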

In [None]:
# remember you can register another stream


## [Query 4] The number of trips that started in one borough and ended in another one in the last hour
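
The counting step for this query could look like the plain-Python sketch below. It assumes the pick-up and drop-off boroughs have already been resolved from the coordinates and that the one-hour window is measured on drop-off time; the function name `cross_borough_counts` and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def cross_borough_counts(trips, now):
    """Count trips per (pickup_borough, dropoff_borough) pair where the two differ.

    trips: iterable of (dropoff_time, pickup_borough, dropoff_borough).
    Only trips dropped off in (now - 1h, now] are considered.
    """
    cutoff = now - timedelta(hours=1)
    pairs = Counter()
    for dropoff_time, start, end in trips:
        in_window = cutoff < dropoff_time <= now
        if in_window and start and end and start != end:
            pairs[(start, end)] += 1
    return pairs


# Made-up sample trips.
now = datetime(2013, 1, 1, 13, 0)
sample = [
    (datetime(2013, 1, 1, 12, 10), "Manhattan", "Brooklyn"),
    (datetime(2013, 1, 1, 12, 20), "Manhattan", "Brooklyn"),
    (datetime(2013, 1, 1, 12, 30), "Queens", "Manhattan"),
    (datetime(2013, 1, 1, 12, 40), "Bronx", "Bronx"),      # same borough, ignored
    (datetime(2013, 1, 1, 11, 50), "Brooklyn", "Queens"),  # older than one hour, ignored
]

pairs = cross_borough_counts(sample, now)
total_cross_borough = sum(pairs.values())
print(pairs, total_cross_borough)
```

Keeping the counts per pair rather than a single total makes it easy to report either the overall number or a breakdown by route.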