# Notebook 01: Ingestion from HDFS using Spark

**TerraFlow Analytics - Big Data Assessment**

This notebook uses PySpark to load raw GTFS data from HDFS, inspect it with Spark DataFrames and RDDs, and save it as a Parquet bronze layer for further processing.

**Requirements Addressed:**
1. **Distributed data processing with PySpark**: Loading and parsing large GTFS files.
2. **Scalable storage with HDFS**: Reading from and writing to HDFS.
3. **Use Spark DataFrame and RDDs**: Handling temporal and spatial attributes.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

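# Initialize the Spark session for distributed data processing.
# 'local[*]' runs Spark in a single JVM using all available cores, which is enough for a demonstration;
# on a full cluster, .master() would point at the Spark Master instead.
# Note: in local mode the executors run inside the driver JVM, so spark.executor.memory has little effect here.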
spark = SparkSession.builder \
    .appName("TerraFlow_Ingestion") \
    .master("local[*]") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000") \
    .getOrCreate()

print("="*60)
print("SPARK SESSION INITIALIZED")
print("="*60)
print(f"Application Name: {spark.sparkContext.appName}")
print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")
print("="*60)

SPARK SESSION INITIALIZED
Application Name: TerraFlow_Ingestion
Spark Version: 3.5.0
Master: local[*]
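
The hard-coded `local[*]` master suits a demonstration. One way to let the same notebook run against a cluster is to read the master URL from the environment and fall back to local mode. The sketch below is only an illustration: the variable name `TERRAFLOW_SPARK_MASTER` and the helper `resolve_master` are hypothetical and not part of the assessment setup.

```python
import os

# Hypothetical environment variable name; not defined by the assessment setup.
MASTER_ENV_VAR = "TERRAFLOW_SPARK_MASTER"


def resolve_master(default="local[*]", env=None):
    """Return the Spark master URL from the environment, or the local default."""
    env = os.environ if env is None else env
    value = env.get(MASTER_ENV_VAR, "").strip()
    return value or default


print(resolve_master(env={}))  # no variable set, so the local default is used
```

In the builder chain this would replace the literal, for example `.master(resolve_master())`.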


In [2]:
# HDFS Configuration
# All paths are fully qualified HDFS URIs, as the assignment requires
HDFS_NAMENODE = "hdfs://namenode:9000"

# Path to the raw CSV file uploaded to HDFS
# Note: The file 'CPS6005-Assessment 2_GTFS_Data.csv' is stored as 'gtfs_data.csv' in HDFS
RAW_DATA_PATH = f"{HDFS_NAMENODE}/terraflow/data/raw/gtfs_data.csv"

print("HDFS Configuration:")
print(f"NameNode: {HDFS_NAMENODE}")
print(f"Raw Data Path: {RAW_DATA_PATH}")

HDFS Configuration:
NameNode: hdfs://namenode:9000
Raw Data Path: hdfs://namenode:9000/terraflow/data/raw/gtfs_data.csv
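
Because every path in this notebook hangs off the same NameNode URI, a small helper can keep the raw and bronze paths consistent. This is a minimal sketch; `hdfs_uri` is a hypothetical helper name, and the paths it builds are the ones already used in this notebook.

```python
import posixpath

HDFS_NAMENODE = "hdfs://namenode:9000"  # same value as in the cell above


def hdfs_uri(*parts, namenode=HDFS_NAMENODE):
    """Join path segments under the NameNode URI, always using forward slashes."""
    path = posixpath.join("/", *(p.strip("/") for p in parts))
    return f"{namenode.rstrip('/')}{path}"


RAW_DATA_PATH = hdfs_uri("terraflow", "data", "raw", "gtfs_data.csv")
BRONZE_OUTPUT_PATH = hdfs_uri("terraflow/data/processed", "gtfs_bronze.parquet")
print(RAW_DATA_PATH)
print(BRONZE_OUTPUT_PATH)
```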


In [3]:
# 1. Load data from HDFS into a Spark DataFrame
# header=True takes column names from the first line of the file
# inferSchema=True makes Spark scan the file an extra time to detect column types
print("Loading data from HDFS...")
df = spark.read.csv(RAW_DATA_PATH, header=True, inferSchema=True)

print("Data loaded successfully!")
print(f"Number of partitions: {df.rdd.getNumPartitions()}")

Loading data from HDFS...
Data loaded successfully!
Number of partitions: 2
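
Schema inference costs an extra pass over the file and can change if the data changes. An alternative is to declare the schema up front. The sketch below passes a DDL string to `spark.read.csv`; the column names and types simply mirror what `printSchema()` reports in the next cell, and `load_gtfs` is a hypothetical helper name. Whether these are the best types for the bronze layer is a separate question that this sketch does not settle.

```python
# Column types copied from the printSchema() output of this notebook.
GTFS_SCHEMA_DDL = (
    "stop_id_from INT, "
    "stop_id_to INT, "
    "trip_id STRING, "
    "arrival_time TIMESTAMP, "
    "time DOUBLE, "
    "speed STRING, "
    "Number_of_trips INT, "
    "SRI STRING, "
    "Degree_of_congestion STRING"
)


def load_gtfs(spark, path, schema=GTFS_SCHEMA_DDL):
    """Read the raw CSV with a declared schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=schema)
```

With the session from this notebook, the call would be `df = load_gtfs(spark, RAW_DATA_PATH)`.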


In [4]:
# 2. Inspect Schema and Row Count
print("Dataset Schema:")
df.printSchema()

total_rows = df.count()
print(f"\nTotal rows in dataset: {total_rows:,}")

print("\nSample Data (Top 5 Rows):")
df.show(5, truncate=False)

Dataset Schema:
root
 |-- stop_id_from: integer (nullable = true)
 |-- stop_id_to: integer (nullable = true)
 |-- trip_id: string (nullable = true)
 |-- arrival_time: timestamp (nullable = true)
 |-- time: double (nullable = true)
 |-- speed: string (nullable = true)
 |-- Number_of_trips: integer (nullable = true)
 |-- SRI: string (nullable = true)
 |-- Degree_of_congestion: string (nullable = true)


Total rows in dataset: 66,913

Sample Data (Top 5 Rows):
+------------+----------+------------------------------------------------------------+-------------------+-----------+-----------+---------------+-----------+--------------------+
|stop_id_from|stop_id_to|trip_id                                                     |arrival_time       |time       |speed      |Number_of_trips|SRI        |Degree_of_congestion|
+------------+----------+------------------------------------------------------------+-------------------+-----------+-----------+---------------+-----------+--------------------+
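
The schema above reports `speed` and `SRI` as `string`, although the values shown elsewhere in this notebook look numeric. That can happen when at least one value in a column does not parse as a number, but the file would need to be checked to confirm it. The sketch below is one way to look for such values. It assumes `spark.sql.ansi.enabled` is off (the default in Spark 3.5), so a failed `CAST` yields `NULL` rather than an error; the helper names are hypothetical.

```python
def unparseable_filter(column):
    """SQL predicate: value is present but does not cast to DOUBLE."""
    return f"`{column}` IS NOT NULL AND CAST(`{column}` AS DOUBLE) IS NULL"


def show_unparseable(df, column, limit=10):
    """Print distinct raw values of `column` that Spark cannot cast to DOUBLE."""
    bad = df.filter(unparseable_filter(column)).select(column).distinct()
    bad.show(limit, truncate=False)
    return bad


print(unparseable_filter("speed"))
```

Calling `show_unparseable(df, "speed")` and `show_unparseable(df, "SRI")` would list the values to clean before casting these columns in a later layer.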

In [5]:
# 3. Demonstrate usage of RDDs (Requirement: Use Spark DataFrame and RDDs)
# df.rdd exposes the DataFrame as an RDD of Row objects; take(3) returns the first three records to the driver
print("Converting DataFrame to RDD to demonstrate RDD handling...")
rdd_sample = df.rdd.take(3)

print("First 3 records from RDD:")
for i, row in enumerate(rdd_sample, 1):
    print(f"Row {i}: {row}")

Converting DataFrame to RDD to demonstrate RDD handling...
First 3 records from RDD:
Row 1: Row(stop_id_from=36156, stop_id_to=38709, trip_id='NORMAL_333_Pune Station To  Hinjawadi Maan Phase 3_Up-0855_0', arrival_time=datetime.datetime(2026, 1, 12, 9, 13, 54), time=0.027222222, speed='14.47956475', Number_of_trips=9, SRI='-0.40816322', Degree_of_congestion='Very smooth')
Row 2: Row(stop_id_from=36156, stop_id_to=38709, trip_id='NORMAL_115P_Pune Station to Hinjawadi Phase 3_Up-0845_0', arrival_time=datetime.datetime(2026, 1, 12, 9, 3, 1), time=0.032222222, speed='12.23273572', Number_of_trips=9, SRI='1.2068965', Degree_of_congestion='Smooth')
Row 3: Row(stop_id_from=36156, stop_id_to=38709, trip_id='NORMAL_100_Ma Na Pa to Hinjawadi Maan Phase 3_Up-0915_0', arrival_time=datetime.datetime(2026, 1, 12, 9, 15), time=0.058333333, speed='6.7571302', Number_of_trips=9, SRI='5.142857', Degree_of_congestion='Heavy congestion')
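
`take(3)` only inspects records. To show an RDD transformation as well, the sketch below counts records per `Degree_of_congestion` label with `map` and `reduceByKey`. The function names are hypothetical, and the result depends on the data, so no expected output is shown.

```python
def congestion_pair(row):
    """Map a record to a (congestion label, 1) pair for counting."""
    return (row["Degree_of_congestion"], 1)


def add(a, b):
    """Associative, commutative combiner, as reduceByKey requires."""
    return a + b


def congestion_counts(rdd):
    """Count records per Degree_of_congestion label using RDD operations."""
    return dict(rdd.map(congestion_pair).reduceByKey(add).collect())
```

With the DataFrame from this notebook, the call would be `congestion_counts(df.rdd)`. The same aggregation is usually written as `df.groupBy("Degree_of_congestion").count()` in the DataFrame API; the RDD form is shown here only to illustrate the lower-level interface.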


In [6]:
# 4. Save the ingested data back to HDFS (Bronze Layer)
# Parquet is a columnar format that stores the schema with the data, so later notebooks do not need to re-infer types
BRONZE_OUTPUT_PATH = f"{HDFS_NAMENODE}/terraflow/data/processed/gtfs_bronze.parquet"

print(f"Saving bronze layer to HDFS at: {BRONZE_OUTPUT_PATH}")

# mode("overwrite") allows re-running this notebook safely
df.write.mode("overwrite").parquet(BRONZE_OUTPUT_PATH)

print("✓ Bronze layer saved successfully!")

Saving bronze layer to HDFS at: hdfs://namenode:9000/terraflow/data/processed/gtfs_bronze.parquet
✓ Bronze layer saved successfully!


In [7]:
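# 5. Verify the saved file by reading it back and comparing row counts
# (matching counts show that no rows were lost; they do not compare individual values)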
print("Reading back the saved Bronze layer to verify integrity...")
df_bronze = spark.read.parquet(BRONZE_OUTPUT_PATH)
bronze_count = df_bronze.count()

print(f"Original Count: {total_rows:,}")
print(f"Saved Count:    {bronze_count:,}")

if total_rows == bronze_count:
    print("\n✅ SUCCESS: Data integrity verified.")
else:
    print("\n❌ FAILURE: Row counts do not match!")

Reading back the saved Bronze layer to verify integrity...
Original Count: 66,913
Saved Count:    66,913

✅ SUCCESS: Data integrity verified.
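
A matching row count does not show that the values survived the round trip. A stricter check, sketched below under the assumption that both DataFrames are small enough to compare in full, also compares column types and row content. `verify_round_trip` is a hypothetical helper; `dtypes` and `exceptAll` are standard DataFrame members.

```python
def verify_round_trip(original, saved):
    """Compare two DataFrames by schema, row count, and row content."""
    schema_ok = original.dtypes == saved.dtypes
    count_ok = original.count() == saved.count()
    # exceptAll keeps duplicates, so an empty result in both directions
    # means the two DataFrames hold the same multiset of rows.
    # It is only attempted when the schemas already match.
    rows_ok = (
        schema_ok
        and original.exceptAll(saved).count() == 0
        and saved.exceptAll(original).count() == 0
    )
    return {"schema": schema_ok, "count": count_ok, "rows": rows_ok}
```

With the names from this notebook, the call would be `verify_round_trip(df, df_bronze)`. This triggers several full scans, so it costs noticeably more than the count comparison.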


In [8]:
# Stop Spark Session
spark.stop()
print("Spark session stopped.")

Spark session stopped.
