# Spark Streaming Flight Analysis Pipeline

This notebook implements a real-time flight data processing pipeline using Spark Structured Streaming, Kafka, and Cassandra.

## Architecture
1. **Ingest**: Read JSON flight data from Kafka topic `flights_topic`.
2. **Process**: 
    - Parse JSON payload.
    - Join with static airport and airline reference data.
    - Aggregate metrics for Airline Performance, Delay Reasons, and Routes.
3. **Store**: Write aggregated results to Cassandra tables.


## 1. Setup and Configuration
Initialize SparkSession with Cassandra and Kafka connectors.

In [1]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, current_timestamp, when, sum as _sum, avg, count
from pyspark.sql.types import StructType, StringType, IntegerType

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("flights_stream_notebook") \
    .config("spark.cassandra.connection.host", os.environ.get("CASSANDRA_HOST", "cassandra")) \
    .config("spark.cassandra.connection.port", os.environ.get("CASSANDRA_PORT", "9042")) \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.3,com.datastax.spark:spark-cassandra-connector_2.12:3.5.0") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print("Spark Session Created Successfully")

25/12/26 16:07:29 WARN Utils: Your hostname, DESKTOP-VS2UPJ4 resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/12/26 16:07:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/anhtu77/miniconda3/envs/datalab/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/anhtu77/.ivy2/cache
The jars for the packages stored in: /home/anhtu77/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
com.datastax.spark#spark-cassandra-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4fc38d75-1704-49d2-8340-af73f3f492a7;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.5.3 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.5.3 in central
	found org.apache.kafka#kafka-clients;3.4.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.apache.commons#commons-pool2;2.11.1

Spark Session Created Successfully
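
The `spark.jars.packages` value in the cell above is a comma-separated list of Maven coordinates. As a minimal sketch, the string can be assembled from its parts so that the Scala suffix is stated once. The helper name `maven_coordinate` is hypothetical; the group IDs, artifact names, and versions are copied from the cell above. The `_2.12` suffix is the Scala version each connector was built for, and it generally needs to match the Scala build of the installed Spark.

```python
def maven_coordinate(group: str, artifact: str, scala: str, version: str) -> str:
    """Return a Maven coordinate of the form group:artifact_scala:version."""
    return f"{group}:{artifact}_{scala}:{version}"


# Scala build shared by both connectors (taken from the session config above).
SCALA_VERSION = "2.12"

packages = ",".join([
    maven_coordinate("org.apache.spark", "spark-sql-kafka-0-10", SCALA_VERSION, "3.5.3"),
    maven_coordinate("com.datastax.spark", "spark-cassandra-connector", SCALA_VERSION, "3.5.0"),
])

print(packages)
```

The resulting string could then be passed to `.config("spark.jars.packages", packages)` in place of the literal.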


## 2. Data Source Definitions
Read streaming data from Kafka.

In [2]:
kafka_bootstrap = os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
kafka_topic = os.environ.get("KAFKA_TOPIC", "flights_topic")

print(f"Connecting to Kafka at {kafka_bootstrap}, topic: {kafka_topic}")

kafka_flights_df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", kafka_bootstrap)
    .option("subscribe", kafka_topic)
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .load())

print("Kafka Source Initialized")

Connecting to Kafka at localhost:9092, topic: flights_topic
Kafka Source Initialized


In [3]:
import time

# kafka_bootstrap is already set in the previous cell from KAFKA_BOOTSTRAP_SERVERS:
# use localhost:9092 when running outside Docker, or kafka:9092 inside Docker.

# Streaming DataFrames cannot use .show(); use a streaming sink such as the console for debugging.
query = kafka_flights_df.selectExpr(
    "CAST(key AS STRING) as key",
    "CAST(value AS STRING) as value",
    "topic", "partition", "offset", "timestamp"
).writeStream \
    .format("console") \
    .option("truncate", False) \
    .outputMode("append") \
    .start()

# Let the console sink run briefly to print a few batches, then stop.
try:
    time.sleep(5)
finally:
    query.stop()

25/12/26 16:07:46 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-125b399e-2b61-412f-b0f7-3567c01b6c12. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/12/26 16:07:46 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/12/26 16:07:46 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
25/12/26 16:07:56 WARN NetworkClient: [AdminClient clientId=adminclient-1] Error connecting to node kafka:9092 (id: 1 rack: null)
java.net.UnknownHostException: kafka: Temporary failure in name resolution
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Meth
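
The warnings above show the Kafka client failing to resolve the hostname `kafka` from the machine running the notebook. One possible explanation, which is an assumption and not confirmed by this output, is that the broker advertises `kafka:9092` as its listener address even though the client bootstraps through `localhost:9092`. Whatever the cause, a quick name-resolution check before starting a stream can make the failure easier to spot. The sketch below uses only the standard library; the helper names `parse_bootstrap_servers` and `can_resolve` are hypothetical.

```python
import socket


def parse_bootstrap_servers(value: str) -> list[tuple[str, int]]:
    """Split a 'host:port,host:port' string into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        # rpartition splits on the last colon; entries without a port are not handled here.
        host, _, port = entry.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers


def can_resolve(host: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


for host, port in parse_bootstrap_servers("localhost:9092"):
    status = "resolves" if can_resolve(host) else "does NOT resolve"
    print(f"{host}:{port} {status}")
```

This only checks DNS resolution of the bootstrap address. It does not verify that a broker is listening on the port, nor which addresses the broker advertises to clients.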

## 3. Schema Definition and Parsing
Define the schema for flight data and parse the JSON value column.

In [4]:
flights_schema = StructType() \
    .add("YEAR", IntegerType()) \
    .add("MONTH", IntegerType()) \
    .add("DAY", IntegerType()) \
    .add("DAY_OF_WEEK", IntegerType()) \
    .add("AIRLINE", StringType()) \
    .add("FLIGHT_NUMBER", IntegerType()) \
    .add("TAIL_NUMBER", StringType()) \
    .add("ORIGIN_AIRPORT", StringType()) \
    .add("DESTINATION_AIRPORT", StringType()) \
    .add("SCHEDULED_DEPARTURE", IntegerType()) \
    .add("DEPARTURE_TIME", IntegerType()) \
    .add("DEPARTURE_DELAY", IntegerType()) \
    .add("TAXI_OUT", IntegerType()) \
    .add("WHEELS_OFF", IntegerType()) \
    .add("SCHEDULED_TIME", IntegerType()) \
    .add("ELAPSED_TIME", IntegerType()) \
    .add("AIR_TIME", IntegerType()) \
    .add("DISTANCE", IntegerType()) \
    .add("WHEELS_ON", IntegerType()) \
    .add("TAXI_IN", IntegerType()) \
    .add("SCHEDULED_ARRIVAL", IntegerType()) \
    .add("ARRIVAL_TIME", IntegerType()) \
    .add("ARRIVAL_DELAY", IntegerType()) \
    .add("DIVERTED", IntegerType()) \
    .add("CANCELLED", IntegerType()) \
    .add("CANCELLATION_REASON", StringType()) \
    .add("AIR_SYSTEM_DELAY", IntegerType()) \
    .add("SECURITY_DELAY", IntegerType()) \
    .add("AIRLINE_DELAY", IntegerType()) \
    .add("LATE_AIRCRAFT_DELAY", IntegerType()) \
    .add("WEATHER_DELAY", IntegerType())

# Parse JSON
flights_df = kafka_flights_df.select(
    from_json(col("value").cast("string"), flights_schema).alias("data")
).select("data.*")

# Validation: Print Schema
print("Parsed Data Schema:")
flights_df.printSchema()

Parsed Data Schema:
root
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- ARRIVAL_

25/12/26 16:08:56 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
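
Every field in `flights_schema` is nullable, so a message whose values do not match the declared types may surface as nulls in `flights_df` rather than as an error. A minimal sketch of checking a payload on the producer side follows. It covers only a subset of the schema fields, the helper name `type_mismatches` is hypothetical, and the sample record values are made up for illustration.

```python
import json

# Subset of the fields declared in flights_schema, mapped to Python types.
EXPECTED_TYPES = {
    "YEAR": int, "MONTH": int, "DAY": int,
    "AIRLINE": str, "ORIGIN_AIRPORT": str, "DESTINATION_AIRPORT": str,
    "DEPARTURE_DELAY": int, "ARRIVAL_DELAY": int, "CANCELLED": int,
}


def type_mismatches(payload: str) -> dict:
    """Return {field: actual_type_name} for values that do not match EXPECTED_TYPES."""
    record = json.loads(payload)
    problems = {}
    for field, expected in EXPECTED_TYPES.items():
        value = record.get(field)
        if value is None:
            continue  # Missing or null values are allowed: the schema is nullable.
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems[field] = type(value).__name__
    return problems


# Made-up sample record; real messages come from the Kafka topic.
good = json.dumps({"YEAR": 2015, "MONTH": 1, "DAY": 1, "AIRLINE": "AA",
                   "ORIGIN_AIRPORT": "JFK", "DESTINATION_AIRPORT": "LAX",
                   "DEPARTURE_DELAY": 12, "ARRIVAL_DELAY": 5, "CANCELLED": 0})
bad = json.dumps({"YEAR": 2015, "AIRLINE": "AA", "DEPARTURE_DELAY": "12"})

print(type_mismatches(good))
print(type_mismatches(bad))
```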


In [7]:
for q in spark.streams.active:
    print(f"Stopping query: {q.name}")
    q.stop()
    
query_mem = flights_df.writeStream \
    .format("memory") \
    .queryName("flights_table") \
    .outputMode("append") \
    .start()

print("Streaming Query Started, writing to in-memory table 'flights_table'")
# query_mem.awaitTermination()

Stopping query: flights_table


25/12/26 16:11:01 WARN NetworkClient: [AdminClient clientId=adminclient-3] Error connecting to node kafka:9092 (id: 1 rack: null)
java.net.UnknownHostException: kafka: Temporary failure in name resolution
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:934)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1543)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:852)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
	at org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
	at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:110)
	at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(

KeyboardInterrupt: 

25/12/26 16:11:25 WARN NetworkClient: [AdminClient clientId=adminclient-3] Error connecting to node kafka:9092 (id: 1 rack: null)
java.net.UnknownHostException: kafka
	at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
	at org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
	at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:110)
	at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:510)
	at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:467)
	at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:173)
	at org.apache.kafka.clients.NetworkClient.

## 4. Enrich with Static Data
Load Airports and Airlines CSV data for joining.

In [4]:
# Load static reference data
try:
    airport_df = spark.read.csv('/home/anhtu77/Coding/BigDataProjectTeam1/data/airports.csv', header=True, inferSchema=True)
    airline_df = spark.read.csv('/home/anhtu77/Coding/BigDataProjectTeam1/data/airlines.csv', header=True, inferSchema=True)
    print("Static data loaded successfully")
    print(f"Airports count: {airport_df.count()}")
    print(f"Airlines count: {airline_df.count()}")
except Exception as e:
    print(f"Error loading static data: {e}")
    # The code below needs both DataFrames, so re-raise instead of continuing.
    raise

# Prepare Airline DF
airline_df = airline_df.withColumnRenamed("AIRLINE", "AIRLINES")

# Prepare Airport DFs for Origin and Destination
airport_origin_df = airport_df.withColumnRenamed("IATA_CODE", "ORIGIN_AIRPORT_CODE")\
    .withColumnRenamed("AIRPORT", "ORIGIN_AIRPORT_NAME")\
    .withColumnRenamed("CITY", "ORIGIN_CITY")\
    .withColumnRenamed("STATE", "ORIGIN_STATE")\
    .withColumnRenamed("COUNTRY", "ORIGIN_COUNTRY")\
    .withColumnRenamed("LATITUDE", "ORIGIN_LATITUDE")\
    .withColumnRenamed("LONGITUDE", "ORIGIN_LONGITUDE")

airport_destination_df = airport_df.withColumnRenamed("IATA_CODE", "DESTINATION_AIRPORT_CODE")\
    .withColumnRenamed("AIRPORT", "DESTINATION_AIRPORT_NAME")\
    .withColumnRenamed("CITY", "DESTINATION_CITY")\
    .withColumnRenamed("STATE", "DESTINATION_STATE")\
    .withColumnRenamed("COUNTRY", "DESTINATION_COUNTRY")\
    .withColumnRenamed("LATITUDE", "DESTINATION_LATITUDE")\
    .withColumnRenamed("LONGITUDE", "DESTINATION_LONGITUDE")

Static data loaded successfully
Airports count: 322
Airlines count: 14
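
The CSV paths in the cell above are absolute paths on one machine. A minimal sketch of resolving them from an environment variable instead, in the same way the Kafka and Cassandra settings are read, is shown below. The variable name `FLIGHTS_DATA_DIR`, the default directory `data`, and the helper name `reference_path` are assumptions for illustration, not part of the project.

```python
import os
from pathlib import Path


def reference_path(filename: str, default_dir: str = "data") -> str:
    """Build the path to a reference CSV under FLIGHTS_DATA_DIR (hypothetical variable)."""
    base = Path(os.environ.get("FLIGHTS_DATA_DIR", default_dir))
    return str(base / filename)


print(reference_path("airports.csv"))
print(reference_path("airlines.csv"))
```

The returned strings could be passed to `spark.read.csv(...)` in place of the hard-coded paths.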


## 5. Transformations and Joins
Join streaming flight data with static reference data.

In [6]:
# Join with Airline Info
flights_airlines_df = flights_df.join(airline_df, flights_df.AIRLINE == airline_df.IATA_CODE, "left") \
    .drop(airline_df.IATA_CODE)

# Join with Origin Airport Info
flights_airlines_airports_df = flights_airlines_df.join(
    airport_origin_df,
    flights_airlines_df.ORIGIN_AIRPORT == airport_origin_df.ORIGIN_AIRPORT_CODE,
    "left"
)

# Join with Destination Airport Info
flights_airlines_airports_df = flights_airlines_airports_df.join(
    airport_destination_df,
    flights_airlines_airports_df.DESTINATION_AIRPORT == airport_destination_df.DESTINATION_AIRPORT_CODE,
    "left"
)

print("Joins Defined")

Joins Defined


## 6. Aggregation Logic
Define aggregations for Airline Stats, Delay Reasons, and Routes.
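
The airline performance buckets in the next cell are built with `when(...).otherwise(0)`: a flight is cancelled when `CANCELLED == 1`, on time when it is not cancelled and `DEPARTURE_DELAY <= 15`, and delayed when it is not cancelled and `DEPARTURE_DELAY > 15`. The plain-Python sketch below mirrors that rule for a single record, to make the null case visible. The function name `classify` is hypothetical and is not used by the pipeline.

```python
from typing import Optional

# Threshold in minutes, taken from the aggregation conditions.
ON_TIME_THRESHOLD_MINUTES = 15


def classify(cancelled: Optional[int], departure_delay: Optional[int]) -> Optional[str]:
    """Return the bucket a flight is counted in, or None if it is counted in none."""
    if cancelled == 1:
        return "cancelled"
    if cancelled == 0 and departure_delay is not None:
        if departure_delay <= ON_TIME_THRESHOLD_MINUTES:
            return "on_time"
        return "delayed"
    # A null CANCELLED or DEPARTURE_DELAY makes the Spark condition null,
    # so `when` falls through to otherwise(0) in every bucket.
    return None


print(classify(1, None))
print(classify(0, 15))
print(classify(0, 16))
print(classify(0, None))
```

A consequence is that the three counts need not sum to the total number of flights per airline when delay values are missing.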

In [7]:
# 1. Airline Performance Stats
airline_stats = flights_airlines_airports_df \
    .filter(col("AIRLINE").isNotNull()) \
    .groupBy("AIRLINE") \
    .agg(
        _sum(when(col("CANCELLED") == 1, 1).otherwise(0)).alias("cancelled_flights"),
        _sum(when((col("CANCELLED") == 0) & (col("DEPARTURE_DELAY") <= 15), 1).otherwise(0)).alias("on_time_flights"),
        _sum(when((col("CANCELLED") == 0) & (col("DEPARTURE_DELAY") > 15), 1).otherwise(0)).alias("delayed_flights"),
        avg("DEPARTURE_DELAY").alias("avg_departure_delay"),
        avg("ARRIVAL_DELAY").alias("avg_arrival_delay")
    )

airline_stats_out = airline_stats.select(
    col("AIRLINE").alias("airline"),
    col("cancelled_flights"),
    col("on_time_flights"),
    col("delayed_flights"),
    col("avg_departure_delay"),
    col("avg_arrival_delay")
)

# 2. Delay by Reason Analysis
delay_cols = [
    "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY", 
    "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY"
]

delay_df = flights_airlines_airports_df.filter(
    (col("AIR_SYSTEM_DELAY") > 0) | 
    (col("SECURITY_DELAY") > 0) | 
    (col("AIRLINE_DELAY") > 0) | 
    (col("LATE_AIRCRAFT_DELAY") > 0) | 
    (col("WEATHER_DELAY") > 0)
)

stack_expr = "stack(5, " + ", ".join([f"'{c}', {c}" for c in delay_cols]) + ") as (delay_reason, duration)"

delay_unpivoted = delay_df.selectExpr(stack_expr).filter(col("duration") > 0)

delay_stats = delay_unpivoted \
    .groupBy("delay_reason") \
    .agg(
        count("*").alias("count"),
        avg("duration").alias("avg_duration")
    )

delay_stats_out = delay_stats.select(
    col("delay_reason"),
    col("count"),
    col("avg_duration")
)

# 3. Route Analysis
route_stats = flights_airlines_airports_df \
    .filter((col("ORIGIN_AIRPORT").isNotNull()) & (col("DESTINATION_AIRPORT").isNotNull())) \
    .groupBy(
        "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
        "ORIGIN_CITY", "ORIGIN_STATE", "ORIGIN_LATITUDE", "ORIGIN_LONGITUDE",
        "DESTINATION_CITY", "DESTINATION_STATE"
    ) \
    .agg(
        avg("ARRIVAL_DELAY").alias("avg_delay")
    )

route_stats_out = route_stats.select(
    col("ORIGIN_AIRPORT").alias("original_airport"),
    col("DESTINATION_AIRPORT").alias("destination_airport"),
    col("ORIGIN_CITY").alias("original_city"),
    col("ORIGIN_STATE").alias("original_state"),
    col("DESTINATION_CITY").alias("destination_city"),
    col("DESTINATION_STATE").alias("destination_state"),
    col("ORIGIN_LATITUDE").alias("original_latitude"),
    col("ORIGIN_LONGITUDE").alias("original_longitude"),
    col("avg_delay")
)

print("Aggregations Defined")

Aggregations Defined
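
The `stack(5, ...)` expression in the cell above hard-codes the number of delay columns, so it has to be edited by hand if `delay_cols` changes. A minimal sketch of deriving the count from the list is shown below; the helper name `build_stack_expr` is hypothetical.

```python
def build_stack_expr(columns, key_name="delay_reason", value_name="duration"):
    """Build a Spark SQL stack() expression that unpivots the given columns."""
    # Each column contributes a (label, value) pair: the quoted name and the column itself.
    pairs = ", ".join(f"'{c}', {c}" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({key_name}, {value_name})"


delay_cols = [
    "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY",
    "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY",
]

stack_expr = build_stack_expr(delay_cols)
print(stack_expr)
```

For the five columns used here this produces the same string as the original expression, so it can be passed to `selectExpr` unchanged.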


## 7. Data Validation (Schema Check)
Print the schema of each output DataFrame to confirm column names and types before writing to Cassandra.

In [8]:
print("Airline Stats Schema:")
airline_stats_out.printSchema()

print("Delay Stats Schema:")
delay_stats_out.printSchema()

print("Route Stats Schema:")
route_stats_out.printSchema()

Airline Stats Schema:
root
 |-- airline: string (nullable = true)
 |-- cancelled_flights: long (nullable = true)
 |-- on_time_flights: long (nullable = true)
 |-- delayed_flights: long (nullable = true)
 |-- avg_departure_delay: double (nullable = true)
 |-- avg_arrival_delay: double (nullable = true)

Delay Stats Schema:
root
 |-- delay_reason: string (nullable = true)
 |-- count: long (nullable = false)
 |-- avg_duration: double (nullable = true)

Route Stats Schema:
root
 |-- original_airport: string (nullable = true)
 |-- destination_airport: string (nullable = true)
 |-- original_city: string (nullable = true)
 |-- original_state: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- destination_state: string (nullable = true)
 |-- original_latitude: double (nullable = true)
 |-- original_longitude: double (nullable = true)
 |-- avg_delay: double (nullable = true)



In [9]:
query_mem = airline_stats_out.writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("airline_stats_debug") \
    .trigger(processingTime="10 seconds") \
    .start()

# Wait for some data to be processed
import time
time.sleep(15)

# Query the in-memory table
spark.sql("SELECT * FROM airline_stats_debug").show(truncate=False)

# Stop when done
query_mem.stop()

25/12/26 15:55:09 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-0970e5f5-8102-43c8-ad2b-d77a2d654cb0. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/12/26 15:55:09 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/12/26 15:55:20 WARN ClientUtils: Couldn't resolve server kafka:9092 from bootstrap.servers as DNS resolution failed for kafka
25/12/26 15:55:20 WARN KafkaOffsetReaderAdmin: Error in attempt 1 getting Kafka offsets: 
org.apache.kafka.common.KafkaException: Failed to create new KafkaAdminClient
	at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:551)
	at org.apache.kafka.clients.admin.Admin.create(Admin.java:144)
	at org.apache.s

+-------+-----------------+---------------+---------------+-------------------+-----------------+
|airline|cancelled_flights|on_time_flights|delayed_flights|avg_departure_delay|avg_arrival_delay|
+-------+-----------------+---------------+---------------+-------------------+-----------------+
+-------+-----------------+---------------+---------------+-------------------+-----------------+



## 8. Cassandra Write Logic
Define `foreachBatch` functions to write to Cassandra.

In [9]:
def write_airline_stats(batch_df, batch_id):
    if batch_df.isEmpty():
        print(f"Batch {batch_id}: No data to write for airline_stats")
        return
    print(f"Writing airline_stats batch {batch_id} - {batch_df.count()} rows")
    try:
        batch_df = batch_df.withColumn("updated_at", current_timestamp())
        batch_df.write \
            .format("org.apache.spark.sql.cassandra") \
            .mode("append") \
            .options(table="airline_stats", keyspace="flights_db") \
            .option("spark.cassandra.output.consistency.level", "LOCAL_QUORUM") \
            .save()
    except Exception as e:
        print(f"Error writing airline_stats batch {batch_id}: {e}")

def write_delay_stats(batch_df, batch_id):
    if batch_df.isEmpty():
        return
    print(f"Writing delay_stats batch {batch_id} - {batch_df.count()} rows")
    try:
        batch_df = batch_df.withColumn("updated_at", current_timestamp())
        batch_df.write \
            .format("org.apache.spark.sql.cassandra") \
            .mode("append") \
            .options(table="delay_by_reason", keyspace="flights_db") \
            .option("spark.cassandra.output.consistency.level", "LOCAL_QUORUM") \
            .save()
    except Exception as e:
        print(f"Error writing delay_stats batch {batch_id}: {e}")

def write_route_stats(batch_df, batch_id):
    if batch_df.isEmpty():
        return
    print(f"Writing route_stats batch {batch_id} - {batch_df.count()} rows")
    try:
        batch_df = batch_df.withColumn("updated_at", current_timestamp())
        batch_df.write \
            .format("org.apache.spark.sql.cassandra") \
            .mode("append") \
            .options(table="route_stats", keyspace="flights_db") \
            .option("spark.cassandra.output.consistency.level", "LOCAL_QUORUM") \
            .save()
    except Exception as e:
        print(f"Error writing route_stats batch {batch_id}: {e}")
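
The three functions above differ only in the target table name. As a sketch under that observation, a small factory can produce the `foreachBatch` callback for any table. The names `save_to_cassandra` and `make_batch_writer` are hypothetical. Two behaviors differ from the functions above and are design choices of this sketch: it does not call `batch_df.count()`, which would trigger an extra Spark job per batch, and it re-raises write errors so that a failing write stops the query instead of being silently skipped.

```python
def save_to_cassandra(batch_df, table, keyspace):
    """Append one micro-batch to a Cassandra table, stamping each row with updated_at."""
    # Imported lazily so the factory below can be exercised without a Spark installation.
    from pyspark.sql.functions import current_timestamp

    (batch_df.withColumn("updated_at", current_timestamp())
        .write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(table=table, keyspace=keyspace)
        .option("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
        .save())


def make_batch_writer(table, keyspace="flights_db", save=save_to_cassandra):
    """Return a foreachBatch callback that writes each non-empty batch to `table`."""
    def write_batch(batch_df, batch_id):
        if batch_df.isEmpty():
            print(f"Batch {batch_id}: no data to write for {table}")
            return
        try:
            save(batch_df, table, keyspace)
            print(f"Batch {batch_id}: wrote to {keyspace}.{table}")
        except Exception as e:
            print(f"Error writing {table} batch {batch_id}: {e}")
            raise  # Let the streaming query fail visibly (choice made in this sketch).
    return write_batch


# One callback per target table, using the table names from the functions above.
airline_stats_writer = make_batch_writer("airline_stats")
delay_stats_writer = make_batch_writer("delay_by_reason")
route_stats_writer = make_batch_writer("route_stats")
```

Each returned callback has the `(batch_df, batch_id)` signature that `foreachBatch` expects.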

## 9. Start Streaming
Start the three streaming queries.
**Note**: For a long-running job, call `awaitTermination()` without a timeout; it blocks the cell until the query stops or fails.
For verification, this notebook waits with a timeout and then stops all queries.

In [10]:
# Start all streaming queries
print("Starting Streams...")

query1 = airline_stats_out.writeStream \
    .outputMode("complete") \
    .foreachBatch(write_airline_stats) \
    .trigger(processingTime="10 seconds") \
    .start()

query2 = delay_stats_out.writeStream \
    .outputMode("complete") \
    .foreachBatch(write_delay_stats) \
    .trigger(processingTime="10 seconds") \
    .start()

query3 = route_stats_out.writeStream \
    .outputMode("complete") \
    .foreachBatch(write_route_stats) \
    .trigger(processingTime="10 seconds") \
    .start()

print("Streams Started. Waiting for termination...")
# For verification, wait up to 30 seconds so a few micro-batches can run, then stop.
# For a long-running job, call query.awaitTermination() without a timeout instead.
try:
    # Returns after 30 seconds, or raises StreamingQueryException if query1 fails earlier.
    # Only query1 is awaited here; query2 and query3 keep running in the background until stopped.
    query1.awaitTermination(30)
except KeyboardInterrupt:
    print("Stopping queries...")
finally:
    query1.stop()
    query2.stop()
    query3.stop()
    print("Queries Stopped")

Starting Streams...


25/12/26 14:30:24 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-a3aa3248-28ae-4c4d-bf42-04ba099c2135. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/12/26 14:30:24 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/12/26 14:30:24 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-bc0a3233-e79c-48a4-86a2-933832b3fa3e. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/12/26 14:30:24 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not support

Streams Started. Waiting for termination...


25/12/26 14:30:35 WARN ClientUtils: Couldn't resolve server kafka:9092 from bootstrap.servers as DNS resolution failed for kafka
25/12/26 14:30:35 WARN ClientUtils: Couldn't resolve server kafka:9092 from bootstrap.servers as DNS resolution failed for kafka
25/12/26 14:30:35 WARN ClientUtils: Couldn't resolve server kafka:9092 from bootstrap.servers as DNS resolution failed for kafka
25/12/26 14:30:35 WARN KafkaOffsetReaderAdmin: Error in attempt 1 getting Kafka offsets: 
org.apache.kafka.common.KafkaException: Failed to create new KafkaAdminClient
	at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:551)
	at org.apache.kafka.clients.admin.Admin.create(Admin.java:144)
	at org.apache.spark.sql.kafka010.ConsumerStrategy.createAdmin(ConsumerStrategy.scala:50)
	at org.apache.spark.sql.kafka010.ConsumerStrategy.createAdmin$(ConsumerStrategy.scala:47)
	at org.apache.spark.sql.kafka010.SubscribeStrategy.createAdmin(ConsumerStrategy.scala:102)
	at org.apache

Queries Stopped


StreamingQueryException: [STREAM_FAILED] Query [id = 3833d8f0-b528-4f33-8c24-732f55e5beb0, runId = 1ebec05b-2949-48ec-bab0-473e9b8fe7ab] terminated with exception: Failed to create new KafkaAdminClient
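
The cell above waits only on `query1`, so a failure in `query2` or `query3` would go unnoticed until the queries are stopped. A minimal sketch of watching several queries for a fixed duration and collecting their errors follows. The helper name `run_queries_for` is hypothetical; it relies only on the `isActive`, `exception()`, `stop()`, `name`, and `id` members of PySpark's `StreamingQuery`.

```python
import time


def run_queries_for(queries, seconds, poll_interval=1.0):
    """Let the queries run for up to `seconds`, then stop them all.

    Returns a dict mapping query name (or id) to the exception of each failed query.
    The wait ends early as soon as any query is no longer active.
    """
    deadline = time.monotonic() + seconds
    failures = {}
    try:
        while time.monotonic() < deadline and all(q.isActive for q in queries):
            time.sleep(poll_interval)
    finally:
        for q in queries:
            error = q.exception()  # None if the query has not failed.
            if error is not None:
                failures[q.name or str(q.id)] = error
            q.stop()
    return failures
```

In this notebook it could be called as `run_queries_for([query1, query2, query3], seconds=30)`, and the returned dict inspected to see which query failed and why.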