# Notebook 06: Dashboard Data Export

**TerraFlow Analytics - Big Data Assessment**

This notebook pre-aggregates data for the interactive dashboard so that it loads quickly and stays responsive.

**Purpose:**
- Avoid heavy Spark computations in the dashboard
- Export small, aggregated datasets
- Enable real-time filtering and visualization

**Exports:**
1. **Congestion by Hour**: Temporal patterns
2. **Congestion by Route**: Route-level analysis
3. **Speed Trends**: Average speed by hour
4. **Summary Statistics**: KPIs for dashboard cards
5. **Route Performance**: Per-route speed, SRI, and congestion rate

## 1. Environment Setup

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import os

# Initialize Spark Session
spark = (
    SparkSession.builder
    .appName("TerraFlow_Dashboard_Export")
    .master("local[*]")
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)
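spark.sparkContext.setLogLevel("WARN")

# Create the local export directory so later to_parquet/JSON writes do not fail
os.makedirs("../data/processed", exist_ok=True)
print("✅ Spark Session Initialized")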

## 2. Load Silver Dataset

In [None]:
# Load processed data
SILVER_PATH = "hdfs://namenode:9000/terraflow/data/processed/gtfs_silver.parquet"
df = spark.read.parquet(SILVER_PATH)

print(f"Dataset loaded: {df.count():,} rows")
print("Columns:", df.columns)

# Identify route column
route_candidates = ["route_id", "route_short_name", "trip_id"]
route_col = next((c for c in route_candidates if c in df.columns), None)

if route_col is None:
    raise ValueError("No route identifier column found in dataset")

print(f"\n✅ Using route column: {route_col}")

## 3. Export 1: Congestion by Hour

Record counts and average speed per hour and congestion level, for temporal pattern visualization
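
A dashboard will often plot each congestion level as a share of that hour's records rather than as a raw count. The sketch below shows that conversion in plain Python on a few made-up rows shaped like the `hour`, `Degree_of_congestion`, and `count` columns of this export. The function name `hourly_shares`, the sample values, and the `"Low"` label are illustrative assumptions, not part of the pipeline.

```python
from collections import defaultdict


def hourly_shares(rows):
    """Return {(hour, level): share of that hour's records}."""
    # First pass: total record count per hour
    totals = defaultdict(int)
    for row in rows:
        totals[row["hour"]] += row["count"]
    # Second pass: divide each level's count by its hour total
    return {
        (row["hour"], row["Degree_of_congestion"]): row["count"] / totals[row["hour"]]
        for row in rows
        if totals[row["hour"]] > 0
    }


# Made-up sample rows (assumed labels and counts)
sample_rows = [
    {"hour": 8, "Degree_of_congestion": "Heavy congestion", "count": 30},
    {"hour": 8, "Degree_of_congestion": "Low", "count": 10},
    {"hour": 14, "Degree_of_congestion": "Low", "count": 20},
]
shares = hourly_shares(sample_rows)
print(shares[(8, "Heavy congestion")])  # 0.75
```

Because the export is already aggregated, this kind of post-processing runs on a few hundred rows at most and does not need Spark.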

In [None]:
# Aggregate congestion by hour
cong_by_hour = (
    df.groupBy("hour", "Degree_of_congestion")
    .agg(
        F.count("*").alias("count"),
        F.avg("speed").alias("avg_speed")
    )
    .orderBy("hour", "Degree_of_congestion")
)

print("Congestion by Hour (sample):")
cong_by_hour.show(10)

# Save to HDFS
HDFS_OUT_HOUR = "hdfs://namenode:9000/terraflow/data/processed/dashboard/congestion_by_hour"
cong_by_hour.write.mode("overwrite").parquet(HDFS_OUT_HOUR)
print(f"✅ Saved to HDFS: {HDFS_OUT_HOUR}")

# Also save locally for dashboard
LOCAL_OUT_HOUR = "../data/processed/congestion_by_hour.parquet"
cong_by_hour.toPandas().to_parquet(LOCAL_OUT_HOUR)
print(f"✅ Saved locally: {LOCAL_OUT_HOUR}")

## 4. Export 2: Congestion by Route

Route-level congestion breakdown for comparative analysis

In [None]:
# Aggregate congestion by route
cong_by_route = (
    df.groupBy(route_col, "Degree_of_congestion")
    .agg(
        F.count("*").alias("count"),
        F.avg("speed").alias("avg_speed"),
        F.avg("SRI").alias("avg_sri")
    )
    .orderBy(F.col("count").desc())
)

print("Congestion by Route (top 10):")
cong_by_route.show(10)

# Save to HDFS
HDFS_OUT_ROUTE = "hdfs://namenode:9000/terraflow/data/processed/dashboard/congestion_by_route"
cong_by_route.write.mode("overwrite").parquet(HDFS_OUT_ROUTE)
print(f"✅ Saved to HDFS: {HDFS_OUT_ROUTE}")

# Save locally
LOCAL_OUT_ROUTE = "../data/processed/congestion_by_route.parquet"
cong_by_route.toPandas().to_parquet(LOCAL_OUT_ROUTE)
print(f"✅ Saved locally: {LOCAL_OUT_ROUTE}")

## 5. Export 3: Speed Trends by Hour

Average speed trends for performance monitoring

In [None]:
# Aggregate speed trends
speed_trend = (
    df.groupBy("hour")
    .agg(
        F.avg("speed").alias("avg_speed"),
        F.min("speed").alias("min_speed"),
        F.max("speed").alias("max_speed"),
        F.stddev("speed").alias("std_speed"),
        F.count("*").alias("count")
    )
    .orderBy("hour")
)

print("Speed Trends by Hour:")
speed_trend.show(24)

# Save to HDFS
HDFS_OUT_SPEED = "hdfs://namenode:9000/terraflow/data/processed/dashboard/speed_trend"
speed_trend.write.mode("overwrite").parquet(HDFS_OUT_SPEED)
print(f"✅ Saved to HDFS: {HDFS_OUT_SPEED}")

# Save locally
LOCAL_OUT_SPEED = "../data/processed/speed_trend.parquet"
speed_trend.toPandas().to_parquet(LOCAL_OUT_SPEED)
print(f"✅ Saved locally: {LOCAL_OUT_SPEED}")

## 6. Export 4: Summary KPIs

High-level metrics for dashboard cards
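
One way a dashboard card might use the peak and off-peak averages is to report the relative change in speed during peak hours. The sketch below assumes a dictionary with the `peak_avg_speed` and `offpeak_avg_speed` keys that this section writes to `kpis.json`. The helper name `peak_speed_change_pct` and the sample numbers are hypothetical.

```python
def peak_speed_change_pct(kpis):
    """Relative change of peak speed versus off-peak speed, in percent.

    Returns None when either average is missing or the off-peak
    average is zero, so the card can show a placeholder instead.
    """
    peak = kpis.get("peak_avg_speed")
    offpeak = kpis.get("offpeak_avg_speed")
    if peak is None or offpeak is None or offpeak == 0:
        return None
    return (peak - offpeak) / offpeak * 100.0


# Made-up values for illustration only
sample_kpis = {"peak_avg_speed": 18.0, "offpeak_avg_speed": 24.0}
print(peak_speed_change_pct(sample_kpis))  # -25.0
```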

In [None]:
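# Calculate global KPIs in a single pass over the data
overall = df.agg(
    F.count("*").alias("total_records"),
    F.avg("speed").alias("avg_speed"),
    F.avg("SRI").alias("avg_sri"),
).collect()[0]
total_records = overall["total_records"]
avg_speed = overall["avg_speed"]
avg_sri = overall["avg_sri"]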

# Peak vs off-peak comparison
peak_stats = df.where(F.col("is_peak") == "Peak").agg(
    F.avg("speed").alias("peak_avg_speed"),
    F.count("*").alias("peak_count")
).collect()[0]

offpeak_stats = df.where(F.col("is_peak") == "Off-Peak").agg(
    F.avg("speed").alias("offpeak_avg_speed"),
    F.count("*").alias("offpeak_count")
).collect()[0]

# Most congested hour
most_congested = (
    df.where(F.col("Degree_of_congestion").isin(["Heavy congestion", "High", "Severe"]))
    .groupBy("hour")
    .count()
    .orderBy(F.col("count").desc())
    .first()
)

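# Spark returns None for averages over empty input, and .first() returns None
# when no row matches the congestion filter, so convert defensively.
def _num(value, cast=float):
    return None if value is None else cast(value)

# Create KPI dictionary
kpis = {
    "total_records": int(total_records),
    "avg_speed": _num(avg_speed),
    "avg_sri": _num(avg_sri),
    "peak_avg_speed": _num(peak_stats["peak_avg_speed"]),
    "offpeak_avg_speed": _num(offpeak_stats["offpeak_avg_speed"]),
    "peak_count": int(peak_stats["peak_count"]),
    "offpeak_count": int(offpeak_stats["offpeak_count"]),
    "most_congested_hour": _num(most_congested["hour"], int) if most_congested is not None else None,
    "most_congested_count": int(most_congested["count"]) if most_congested is not None else 0
}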

print("="*70)
print("SUMMARY KPIs")
print("="*70)
for key, value in kpis.items():
    print(f"{key:25s}: {value}")
print("="*70)

# Save as JSON
import json
LOCAL_OUT_KPIS = "../data/processed/kpis.json"
with open(LOCAL_OUT_KPIS, 'w') as f:
    json.dump(kpis, f, indent=2)
print(f"\n✅ KPIs saved to: {LOCAL_OUT_KPIS}")

## 7. Export 5: Route Performance Summary

Per-route averages, record counts, and congestion rate for quick top/bottom comparisons
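
The congestion rate in this export is the number of heavily congested records divided by all records for a route. The plain-Python sketch below computes that rate and ranks routes by it, skipping routes with too few records for the rate to be meaningful. The function name `rank_routes`, the `min_trips` threshold, and the sample routes are assumptions made for illustration.

```python
def rank_routes(rows, min_trips=1):
    """Sort routes from highest to lowest congestion rate."""
    ranked = []
    for row in rows:
        # Skip sparse routes; max(..., 1) also guards against division by zero
        if row["trip_count"] < max(min_trips, 1):
            continue
        rate = row["high_congestion_count"] / row["trip_count"]
        ranked.append({**row, "congestion_rate": rate})
    return sorted(ranked, key=lambda r: r["congestion_rate"], reverse=True)


# Made-up sample rows for illustration only
sample = [
    {"route": "A", "trip_count": 200, "high_congestion_count": 50},
    {"route": "B", "trip_count": 80, "high_congestion_count": 40},
    {"route": "C", "trip_count": 4, "high_congestion_count": 4},
]
ranked = rank_routes(sample, min_trips=10)
print([r["route"] for r in ranked])  # ['B', 'A']
```

Route C has a rate of 1.0 but only four records, which is why a minimum-count filter is worth considering before showing a "worst routes" list.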

In [None]:
# Route performance metrics
route_perf = (
    df.groupBy(route_col)
    .agg(
        F.avg("speed").alias("avg_speed"),
        F.avg("SRI").alias("avg_sri"),
        F.count("*").alias("trip_count"),
        F.sum(
            # Use the same "heavy congestion" labels as the most-congested-hour KPI
            F.when(F.col("Degree_of_congestion").isin(["Heavy congestion", "High", "Severe"]), 1).otherwise(0)
        ).alias("high_congestion_count")
    )
    .withColumn("congestion_rate", F.col("high_congestion_count") / F.col("trip_count"))
    .orderBy(F.col("trip_count").desc())
)

print("Route Performance (top 15):")
route_perf.show(15)

# Save locally
LOCAL_OUT_ROUTE_PERF = "../data/processed/route_performance.parquet"
route_perf.toPandas().to_parquet(LOCAL_OUT_ROUTE_PERF)
print(f"✅ Saved locally: {LOCAL_OUT_ROUTE_PERF}")

## 8. Export Summary

All aggregated datasets have been exported for dashboard consumption:

| Export | HDFS Path | Local Path | Purpose |
|--------|-----------|------------|----------|
| Congestion by Hour | `/terraflow/data/processed/dashboard/congestion_by_hour` | `../data/processed/congestion_by_hour.parquet` | Temporal patterns |
| Congestion by Route | `/terraflow/data/processed/dashboard/congestion_by_route` | `../data/processed/congestion_by_route.parquet` | Route comparison |
| Speed Trends | `/terraflow/data/processed/dashboard/speed_trend` | `../data/processed/speed_trend.parquet` | Performance monitoring |
| KPIs | N/A | `../data/processed/kpis.json` | Dashboard cards |
| Route Performance | N/A | `../data/processed/route_performance.parquet` | Route insights |

**Benefits:**
- **Fast Loading**: Pre-aggregated files are small and load quickly
- **Responsive UI**: No heavy Spark computations in the dashboard
- **Lightweight Filtering**: Small file sizes keep in-memory filtering smooth
- **Dual Storage**: HDFS copies of the first three exports for production, local files for development
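
Before starting the dashboard, it can help to confirm that every local export listed in the table is actually present. The sketch below does this with `pathlib` only. The file names come from the table above, while the helper name `missing_exports` is a hypothetical addition.

```python
from pathlib import Path

# Local export file names, taken from the summary table
EXPECTED_EXPORTS = [
    "congestion_by_hour.parquet",
    "congestion_by_route.parquet",
    "speed_trend.parquet",
    "kpis.json",
    "route_performance.parquet",
]


def missing_exports(base_dir, expected=EXPECTED_EXPORTS):
    """Return the expected export files that are not present in base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_file()]


missing = missing_exports("../data/processed")
if missing:
    print("Missing exports:", ", ".join(missing))
else:
    print("All local exports found.")
```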

In [None]:
# Clean up
spark.stop()
print("✅ All exports complete. Spark session stopped.")
print("\nDashboard is ready to run!")