#### Fact: Airline Tickets – Quarterly Route Aggregate
Build a performance-optimized aggregate fact table derived from the base Gold fact.


###### Load Base Fact
Read the Gold transactional fact as the authoritative source for aggregation.

In [0]:
GOLD_FACT_AIRLINE_TICKETS_PATH = (
    "wasbs://gold@flightdatastorage.blob.core.windows.net/fact_airline_tickets/"
)

In [0]:
# Namespaced import avoids shadowing Python's built-in sum
from pyspark.sql import functions as F

df_fact = (
    spark.read
         .format("delta")
         .load(GOLD_FACT_AIRLINE_TICKETS_PATH)
)

In [0]:
df_fact.count()

com.databricks.backend.common.rpc.CommandCancelledException

###### Define Aggregate Grain
Fix the aggregate grain at one row per route, carrier, distance group, and quarter, so that measures cannot be double counted.


Grain
- origin_airport_key
- dest_airport_key
- reporting_carrier_key
- distance_group_key
- source_year
- source_quarter

Partition columns retained
- source_year
- source_quarter

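One way to guard the grain is to check that no key combination appears more than once in the aggregate. The sketch below is an assumption rather than part of the original notebook: `GRAIN_COLS` and `find_grain_violations` are hypothetical names, the column list mirrors the `groupBy` in the aggregation cell, and `df` is expected to be a Spark DataFrame.

```python
# Hypothetical constant: mirrors the groupBy columns of the aggregation cell.
GRAIN_COLS = [
    "origin_airport_key",
    "dest_airport_key",
    "reporting_carrier_key",
    "distance_group_key",
    "source_year",
    "source_quarter",
]


def find_grain_violations(df, grain_cols=GRAIN_COLS):
    """Return the key combinations that occur more than once in `df`.

    `df` is assumed to be a Spark DataFrame; an empty result means
    the grain holds.
    """
    return (
        df.groupBy(*grain_cols)
          .count()
          .filter("count > 1")
    )


# Possible usage once the aggregate exists:
# assert find_grain_violations(df_fact_quarterly_agg).limit(1).count() == 0
```

Because the aggregate is itself produced by a `groupBy` on these columns, the check should pass trivially right after aggregation; it is more useful after the table has been written and read back.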
###### Aggregate Metrics
Compute commonly queried quarterly metrics per route and carrier for BI and downstream analytics.


In [0]:
from pyspark.sql import functions as F

df_fact_quarterly_agg = (
    df_fact
    .groupBy(
        "origin_airport_key",
        "dest_airport_key",
        "reporting_carrier_key",
        "distance_group_key",  # supports analysis by distance band
        "source_year",
        "source_quarter"
    )
    .agg(
        F.sum("passenger_cnt").alias("total_passengers"),
        F.sum("market_fare_usd").alias("total_revenue"),
        F.avg("market_fare_usd").alias("avg_market_fare"),  # unweighted mean over base-fact rows
        F.sum("market_miles_flown").alias("total_market_miles"),
        F.count("*").alias("flight_count"),  # number of base-fact rows in the group
        F.avg("market_distance_miles").alias("avg_route_distance")
    )
)

In [0]:
df_fact_quarterly_agg.count()

1541492

###### Validate Aggregate Integrity
Confirm that additive measures (passengers, market miles, and row counts) reconcile exactly between the base fact and the aggregate.


In [0]:
df_fact.selectExpr("sum(passenger_cnt)").show()
df_fact_quarterly_agg.selectExpr("sum(total_passengers)").show()


+------------------+
|sum(passenger_cnt)|
+------------------+
|          75893539|
+------------------+

+---------------------+
|sum(total_passengers)|
+---------------------+
|             75893539|
+---------------------+



In [0]:
df_fact.selectExpr("sum(market_miles_flown)").show()
df_fact_quarterly_agg.selectExpr("sum(total_market_miles)").show()


+-----------------------+
|sum(market_miles_flown)|
+-----------------------+
|            51974345493|
+-----------------------+

+-----------------------+
|sum(total_market_miles)|
+-----------------------+
|            51974345493|
+-----------------------+



In [0]:
df_fact.selectExpr("count(*)").show()
df_fact_quarterly_agg.selectExpr("sum(flight_count)").show()


+--------+
|count(1)|
+--------+
|39876323|
+--------+

+-----------------+
|sum(flight_count)|
+-----------------+
|         39876323|
+-----------------+


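The three checks above can be folded into a single comparison that fails loudly on any mismatch. This is a minimal sketch, not the notebook's own method: `reconcile_totals` is a hypothetical helper, and the totals are copied from the cell outputs shown above.

```python
def reconcile_totals(base_totals, agg_totals):
    """Return {metric: (base, agg)} for every metric whose totals differ."""
    return {
        name: (base_totals[name], agg_totals.get(name))
        for name in base_totals
        if base_totals[name] != agg_totals.get(name)
    }


# Totals copied from the outputs above. In the notebook they could be
# collected in one pass, for example:
#   df_fact.selectExpr(
#       "sum(passenger_cnt) AS passengers",
#       "sum(market_miles_flown) AS market_miles",
#       "count(*) AS row_count",
#   ).first().asDict()
base_totals = {
    "passengers": 75893539,
    "market_miles": 51974345493,
    "row_count": 39876323,
}
agg_totals = {
    "passengers": 75893539,
    "market_miles": 51974345493,
    "row_count": 39876323,
}

mismatches = reconcile_totals(base_totals, agg_totals)
assert not mismatches, f"Aggregate does not reconcile: {mismatches}"
```

Exact equality suits these integer measures. A floating-point measure such as the fare total would likely need a tolerance instead.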

###### Persist Quarterly Aggregate Fact
Write the aggregate fact to Delta as a full overwrite of the target path.


In [0]:
GOLD_FACT_ROUTE_AGG_PATH = (
    "wasbs://gold@flightdatastorage.blob.core.windows.net/"
    "fact_route_quarterly_agg"
)


In [0]:
df_fact_quarterly_agg.write.format("delta").mode("overwrite").save(GOLD_FACT_ROUTE_AGG_PATH)

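The write above replaces the whole table on every run. If partition-aware, incremental writes are wanted, one option is Delta's `replaceWhere` option, which overwrites only the rows matching a predicate. The sketch below rests on assumptions not confirmed by the notebook: the target table was created partitioned by `source_year` and `source_quarter`, both columns are numeric, and the helper names `build_replace_where` and `overwrite_quarter` are hypothetical.

```python
def build_replace_where(year, quarter):
    """Build a predicate selecting one year/quarter partition.

    Assumes source_year and source_quarter are numeric columns.
    """
    return f"source_year = {int(year)} AND source_quarter = {int(quarter)}"


def overwrite_quarter(df, path, year, quarter):
    """Overwrite only the given year/quarter in the Delta table at `path`.

    `df` is assumed to be a Spark DataFrame holding the aggregate, and the
    target table is assumed to be partitioned by the two source_* columns.
    """
    predicate = build_replace_where(year, quarter)
    (
        df.filter(predicate)  # write only rows that satisfy the predicate
          .write.format("delta")
          .mode("overwrite")
          .option("replaceWhere", predicate)
          .partitionBy("source_year", "source_quarter")
          .save(path)
    )


# Possible usage (the year and quarter values are illustrative):
# overwrite_quarter(df_fact_quarterly_agg, GOLD_FACT_ROUTE_AGG_PATH, 2024, 1)
```

Filtering the DataFrame with the same predicate keeps the written rows consistent with `replaceWhere`, which Delta otherwise rejects.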
###### Post-write Validation
Confirm row-count reduction relative to the base fact as a sanity check.

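A minimal sketch of this sanity check follows, using the row counts reported earlier in this notebook (39,876,323 base rows and 1,541,492 aggregate rows). The helper `row_reduction` is a hypothetical name. In practice the aggregate count would be taken by reading the written table back, as the comment suggests.

```python
def row_reduction(base_rows, agg_rows):
    """Return (ratio, passed); passed means the aggregate is strictly smaller."""
    if agg_rows <= 0:
        raise ValueError("aggregate table has no rows")
    return base_rows / agg_rows, agg_rows < base_rows


# Counts reported earlier in this notebook. After the write, the aggregate
# count could instead come from the persisted table, for example:
#   spark.read.format("delta").load(GOLD_FACT_ROUTE_AGG_PATH).count()
ratio, passed = row_reduction(39_876_323, 1_541_492)
print(f"row reduction: {ratio:.1f}x, passed={passed}")
assert passed, "aggregate should have fewer rows than the base fact"
```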