# Notebook 02: Cleaning & Feature Engineering

**TerraFlow Analytics - Big Data Assessment**

This notebook processes the raw "bronze" data into a clean "silver" dataset. It covers the requirements for data cleaning, structuring, and feature engineering to support downstream analysis and machine learning.

**Objectives:**
1. **Data Cleaning**: Handle missing values, fix data types, and remove invalid records.
2. **Feature Engineering**: Create new variables for analysis (Peak/Off-Peak, Congestion Levels, Temporal Features).
3. **Reliability Analysis**: Engineer trip reliability indicators based on SRI.
4. **Save Silver Layer**: Store the processed dataset back to HDFS for efficient querying.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, hour, avg, count, lit, to_timestamp, stddev
from pyspark.sql.types import DoubleType, IntegerType

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("TerraFlow_DataCleaning") \
    .master("local[*]") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000") \
    .getOrCreate()

print("Spark Session Initialized")

Spark Session Initialized


In [2]:
# Configuration Paths
HDFS_NAMENODE = "hdfs://namenode:9000"
BRONZE_INPUT_PATH = f"{HDFS_NAMENODE}/terraflow/data/processed/gtfs_bronze.parquet"
SILVER_OUTPUT_PATH = f"{HDFS_NAMENODE}/terraflow/data/processed/gtfs_silver.parquet"
ROUTE_STATS_OUTPUT_PATH = f"{HDFS_NAMENODE}/terraflow/data/processed/route_stats.parquet"

print(f"Reading data from: {BRONZE_INPUT_PATH}")

Reading data from: hdfs://namenode:9000/terraflow/data/processed/gtfs_bronze.parquet


In [3]:
# 1. Load Bronze Data
df = spark.read.parquet(BRONZE_INPUT_PATH)
initial_count = df.count()

print(f"Initial Row Count: {initial_count:,}")

Initial Row Count: 66,913


In [4]:
# 2. Data Cleaning & Type Casting

# Convert columns to appropriate types
df_clean = df.withColumn("speed", col("speed").cast(DoubleType())) \
             .withColumn("SRI", col("SRI").cast(DoubleType())) \
             .withColumn("time", col("time").cast(DoubleType())) \
             .withColumn("Number_of_trips", col("Number_of_trips").cast(IntegerType())) \
             .withColumn("arrival_time", to_timestamp(col("arrival_time")))

# Handle Missing Values (Drop rows where critical metrics are null)
df_clean = df_clean.dropna(subset=["speed", "arrival_time", "SRI"])

# Remove invalid rows (negative speed or time).
# Rows with a null time are dropped here too, because a comparison against NULL is not true.
df_clean = df_clean.filter((col("speed") >= 0) & (col("time") >= 0))

cleaned_count = df_clean.count()
dropped_count = initial_count - cleaned_count

print(f"Cleaned Row Count: {cleaned_count:,}")
print(f"Rows Dropped: {dropped_count:,}")

Cleaned Row Count: 66,437
Rows Dropped: 476
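
The cleaning rules in the cell above can be sanity-checked without a cluster. The sketch below is a plain-Python mirror of the `dropna` plus `filter` logic, not the Spark code itself; `is_valid_row` and the sample rows are hypothetical. It makes one detail explicit: Spark's `filter` keeps only rows where the predicate is true, so a row with a null `time` is dropped even though `time` is not in the `dropna` subset.

```python
from typing import Any, Mapping


def is_valid_row(row: Mapping[str, Any]) -> bool:
    """Return True if a row would survive the cleaning step (plain-Python mirror)."""
    # dropna(subset=[...]): a null in any critical column drops the row.
    for name in ("speed", "arrival_time", "SRI"):
        if row.get(name) is None:
            return False
    # filter(speed >= 0 AND time >= 0): a comparison against null is not true,
    # so a missing time fails the check as well.
    time_value = row.get("time")
    if time_value is None:
        return False
    return row["speed"] >= 0 and time_value >= 0


# Hypothetical rows, for illustration only.
sample_rows = [
    {"speed": 25.0, "arrival_time": "2026-01-12 14:02:28", "SRI": 1.2, "time": 0.5},
    {"speed": -3.0, "arrival_time": "2026-01-12 14:05:00", "SRI": 1.2, "time": 0.5},
    {"speed": 25.0, "arrival_time": "2026-01-12 14:10:00", "SRI": None, "time": 0.5},
    {"speed": 25.0, "arrival_time": "2026-01-12 14:15:00", "SRI": 1.2, "time": None},
]

kept = [r for r in sample_rows if is_valid_row(r)]
print(f"kept {len(kept)} of {len(sample_rows)} rows")  # kept 1 of 4 rows
```

Only the first sample row passes; the others fail on negative speed, a null `SRI`, and a null `time` respectively.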


In [5]:
# 3. Feature Engineering

# A. Temporal Features (Hour of Day)
df_features = df_clean.withColumn("hour", hour("arrival_time"))

# B. Peak vs Off-Peak Classification
# Assuming peak hours 7-11 (morning) and 16-20 (evening), inclusive,
# i.e. arrivals from 07:00 to 11:59 and from 16:00 to 20:59
df_features = df_features.withColumn(
    "is_peak", 
    when(((col("hour") >= 7) & (col("hour") <= 11)) | 
         ((col("hour") >= 16) & (col("hour") <= 20)), 
         lit("Peak")
    ).otherwise(lit("Off-Peak"))
)

# C. Congestion Encoding (Ordinal Encoding)
df_features = df_features.withColumn(
    "congestion_level_encoded",
    when(col("Degree_of_congestion") == "Very smooth", 0)
    .when(col("Degree_of_congestion") == "Smooth", 1)
    .when(col("Degree_of_congestion") == "Moderate", 2)
    .when(col("Degree_of_congestion") == "Heavy congestion", 3)
    .otherwise(4) # Unknown or Extreme
)

# D. Speed Bands (Categorical Binning)
df_features = df_features.withColumn(
    "speed_band",
    when(col("speed") < 10, "Low (<10 km/h)")
    .when((col("speed") >= 10) & (col("speed") < 30), "Medium (10-30 km/h)")
    .otherwise("High (>30 km/h)")
)

# E. Trip Reliability Indicators (Requirement: trip reliability indicators)
# Reliability is classified from SRI (Service Reliability Index).
# In this dataset, heavily congested records have a high SRI (e.g. 5.14), while
# smooth records have lower values (e.g. 1.2 or -0.4), so a high SRI is treated
# as unreliable (delayed/congested). Rows with SRI > 2 are flagged as unreliable.

df_features = df_features.withColumn(
    "reliability_status",
    when(col("SRI") > 2, "Unreliable (Congested)")
    .otherwise("Reliable")
)

print("Features Engineered successfully (including Reliability Indicators).")
df_features.select("arrival_time", "is_peak", "SRI", "reliability_status", "speed_band").show(5, truncate=False)

Features Engineered successfully (including Reliability Indicators).
+-------------------+--------+------------+----------------------+-------------------+
|arrival_time       |is_peak |SRI         |reliability_status    |speed_band         |
+-------------------+--------+------------+----------------------+-------------------+
|2026-01-12 14:02:28|Off-Peak|-21.11111199|Reliable              |High (>30 km/h)    |
|2026-01-12 14:55:35|Off-Peak|8.490566166 |Unreliable (Congested)|Medium (10-30 km/h)|
|2026-01-12 14:35:35|Off-Peak|8.490566166 |Unreliable (Congested)|Medium (10-30 km/h)|
|2026-01-12 14:50:35|Off-Peak|8.490566166 |Unreliable (Congested)|Medium (10-30 km/h)|
|2026-01-12 14:22:48|Off-Peak|8.571428755 |Unreliable (Congested)|Medium (10-30 km/h)|
+-------------------+--------+------------+----------------------+-------------------+
only showing top 5 rows
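
The boundary behaviour of the `when`/`otherwise` chains above is easy to misread, so the sketch below restates them as plain-Python functions that can be unit-tested locally. The function names are hypothetical, and this is a mirror of the rules rather than the Spark implementation. Labels, thresholds, and codes are copied from the cell above.

```python
from typing import Optional

# Ordinal codes copied from the congestion encoding in the cell above.
CONGESTION_CODES = {
    "Very smooth": 0,
    "Smooth": 1,
    "Moderate": 2,
    "Heavy congestion": 3,
}


def classify_peak(hour: int) -> str:
    # Inclusive bounds, as in the cell: hours 7..11 and 16..20.
    if 7 <= hour <= 11 or 16 <= hour <= 20:
        return "Peak"
    return "Off-Peak"


def encode_congestion(label: Optional[str]) -> int:
    # Anything outside the four known labels (including a missing label) maps to 4.
    return CONGESTION_CODES.get(label, 4)


def speed_band(speed: float) -> str:
    if speed < 10:
        return "Low (<10 km/h)"
    if speed < 30:
        return "Medium (10-30 km/h)"
    # Exactly 30 km/h lands here, although the label reads ">30".
    return "High (>30 km/h)"


def reliability_status(sri: float) -> str:
    # Strictly greater than 2 is flagged; an SRI of exactly 2 stays "Reliable".
    return "Unreliable (Congested)" if sri > 2 else "Reliable"


# Second row of the sample output above: 14:55 arrival, SRI of about 8.49.
print(classify_peak(14), "|", reliability_status(8.490566166))
```

Two edge cases are worth noting: an arrival at 11:30 or 20:45 still counts as peak because the hour bounds are inclusive, and a speed of exactly 30 km/h falls into the "High" band.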



In [6]:
# 4. Route Level Aggregation (Requirement: route-level aggregates)
# Stats are computed per trip_id, which is used here as the route-level key,
# and saved as a separate dataset for dashboards
route_stats = df_features.groupBy("trip_id").agg(
    avg("speed").alias("avg_speed"),
    avg("SRI").alias("avg_sri"),
    stddev("SRI").alias("sri_volatility"),
    count("*").alias("total_records")
)

print("Sample Route Stats:")
route_stats.show(5)

Sample Route Stats:
+--------------------+------------------+-------------------+------------------+-------------+
|             trip_id|         avg_speed|            avg_sri|    sri_volatility|total_records|
+--------------------+------------------+-------------------+------------------+-------------+
|NORMAL_315_Bhosar...|23.314078886235293| 1.4234352720588235|7.1211235313060115|           17|
|NORMAL_64_Hadapsa...|22.961665740333334|  3.622047246722221|2.1446214256149636|           18|
|NORMAL_43_Katraj ...|    48.62421636025|-5.7648642178000005|22.019612649346772|           40|
|NORMAL_337_Bhakti...|28.021119715749997|      3.02426024225|3.7239498950763354|            8|
|NORMAL_319_Alandi...|39.565834718076935|       -4.555626899| 5.109305275344527|           26|
+--------------------+------------------+-------------------+------------------+-------------+
only showing top 5 rows
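
To make the aggregation concrete, here is a minimal plain-Python sketch of the same group-and-summarise step, using the standard library instead of Spark. `aggregate_by_trip` and the sample records are hypothetical. Spark's `stddev` is the sample standard deviation (an alias of `stddev_samp`), which is undefined for a group with a single record; the sketch returns `None` in that case, whereas Spark reports null or NaN depending on version.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_by_trip(records):
    """Group records by trip_id and compute the four stats used above."""
    groups = defaultdict(list)
    for record in records:
        groups[record["trip_id"]].append(record)

    stats = {}
    for trip_id, rows in groups.items():
        sri_values = [r["SRI"] for r in rows]
        stats[trip_id] = {
            "avg_speed": mean(r["speed"] for r in rows),
            "avg_sri": mean(sri_values),
            # Sample standard deviation needs at least two observations.
            "sri_volatility": stdev(sri_values) if len(sri_values) > 1 else None,
            "total_records": len(rows),
        }
    return stats


# Hypothetical records, for illustration only.
records = [
    {"trip_id": "TRIP_A", "speed": 20.0, "SRI": 1.0},
    {"trip_id": "TRIP_A", "speed": 30.0, "SRI": 3.0},
    {"trip_id": "TRIP_B", "speed": 40.0, "SRI": -4.5},
]

stats = aggregate_by_trip(records)
for trip_id, row in stats.items():
    print(trip_id, row)
```

For `TRIP_A` the volatility is the square root of 2 (about 1.414), because both SRI values sit one unit from the mean of 2.0. `TRIP_B` has a single record, so its volatility is left undefined.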



In [7]:
# 5. Save Datasets to HDFS

# Save Silver Layer (Main Dataset)
print(f"Saving Silver Dataset to HDFS: {SILVER_OUTPUT_PATH}")
df_features.write.mode("overwrite").partitionBy("is_peak").parquet(SILVER_OUTPUT_PATH)
print("✓ Silver layer saved successfully.")

# Save Route Stats (Aggregated Dataset for Dashboard)
print(f"Saving Route Stats to HDFS: {ROUTE_STATS_OUTPUT_PATH}")
route_stats.write.mode("overwrite").parquet(ROUTE_STATS_OUTPUT_PATH)
print("✓ Route Stats saved successfully.")

Saving Silver Dataset to HDFS: hdfs://namenode:9000/terraflow/data/processed/gtfs_silver.parquet
✓ Silver layer saved successfully.
Saving Route Stats to HDFS: hdfs://namenode:9000/terraflow/data/processed/route_stats.parquet
✓ Route Stats saved successfully.
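
`partitionBy("is_peak")` writes one subdirectory per distinct value, named `is_peak=<value>`, and the column is recovered from the directory name when the dataset is read back. The sketch below illustrates that naming convention in plain Python. `partition_values` is a hypothetical helper and the part-file name is made up; real Spark paths also escape some special characters, which this sketch ignores.

```python
def partition_values(path: str) -> dict:
    """Extract column=value pairs from a Hive-style partitioned path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            # Split on the first "=" only, so values may contain "=" themselves.
            column, _, value = segment.partition("=")
            values[column] = value
    return values


# Hypothetical file path under the silver dataset directory.
example_path = (
    "/terraflow/data/processed/gtfs_silver.parquet/"
    "is_peak=Off-Peak/part-00000.parquet"
)
print(partition_values(example_path))  # {'is_peak': 'Off-Peak'}
```

Because `is_peak` has only two values, the silver dataset ends up with two partition directories, and a query that filters on `is_peak` only needs to read one of them.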


In [8]:
# 6. Verification
print("Verifying Silver Layer integrity...")
df_silver = spark.read.parquet(SILVER_OUTPUT_PATH)
print(f"Total Rows: {df_silver.count():,}")

print("\nVerifying Route Stats integrity...")
df_stats = spark.read.parquet(ROUTE_STATS_OUTPUT_PATH)
print(f"Total Routes: {df_stats.count():,}")

# Spark Stop
spark.stop()

Verifying Silver Layer integrity...
Total Rows: 66,437

Verifying Route Stats integrity...
Total Routes: 5,332
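
The row counts printed across the notebook can be reconciled in one place. The sketch below is a small, hypothetical helper (`reconcile_counts` is not part of the notebook) that checks the silver row count against the cleaned count and reports how much data the cleaning step removed, using the figures shown in the outputs above.

```python
def reconcile_counts(initial: int, cleaned: int, written: int) -> dict:
    """Summarise how many rows were dropped and whether the write is complete."""
    dropped = initial - cleaned
    dropped_pct = round(100 * dropped / initial, 2) if initial else 0.0
    return {
        "dropped": dropped,
        "dropped_pct": dropped_pct,
        # The silver layer should contain exactly the cleaned rows.
        "write_ok": written == cleaned,
    }


# Counts taken from the notebook outputs: bronze, cleaned, and re-read silver.
report = reconcile_counts(initial=66_913, cleaned=66_437, written=66_437)
print(report)  # {'dropped': 476, 'dropped_pct': 0.71, 'write_ok': True}
```

With these figures, cleaning removed 476 rows, roughly 0.71% of the bronze data, and the silver layer read back from HDFS matches the cleaned count exactly.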
