##CASE STUDY 4
###Title: Smart City Traffic Sensor Analytics using PySpark
A city government has installed traffic sensors across major roads.
Every sensor sends data every few seconds about traffic density, speed, and congestion levels.

In [34]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

In [35]:
spark = SparkSession.builder.appName("Capstone Three").getOrCreate()

In [36]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [37]:
csv_path ="/content/drive/MyDrive/Colab Notebooks/traffic_data_large.csv"

##PHASE 1 – Ingestion

1. Read traffic_data_large.csv with every column as StringType.
2. Print the schema and count the records.
3. Identify data quality issues by inspection.

In [38]:
df_raw = spark.read.option("header", True).option("inferSchema", False) \
    .csv(csv_path)

df_raw.printSchema()
df_raw.count()
df_raw.show()

root
 |-- sensor_id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- road_name: string (nullable = true)
 |-- vehicle_count: string (nullable = true)
 |-- avg_speed: string (nullable = true)
 |-- temperature: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- status: string (nullable = true)

+---------+---------+---------------+-------------+---------+-----------+-------------------+--------+
|sensor_id| location|      road_name|vehicle_count|avg_speed|temperature|          timestamp|  status|
+---------+---------+---------------+-------------+---------+-----------+-------------------+--------+
|     S105|  Chennai|            OMR|      invalid|     NULL|         39|12/01/2026 06:00:00|INACTIVE|
|     S113|  Chennai|     Mount Road|          103|     73.5|         36|2026-01-12 06:00:05|  ACTIVE|
|     S228|    Delhi|        Janpath|           16|     20.0|         35|2026-01-12 06:00:10|  ACTIVE|
|     S160|Bangalore|        MG Road|         

##PHASE 2 – Cleaning

1. Trim all string columns

In [39]:
df_raw = df_raw.select([trim(col(c)).alias(c) if t == 'string' else col(c) for c, t in df_raw.dtypes])
df_raw.printSchema()

root
 |-- sensor_id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- road_name: string (nullable = true)
 |-- vehicle_count: string (nullable = true)
 |-- avg_speed: string (nullable = true)
 |-- temperature: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- status: string (nullable = true)



2. Clean vehicle_count:
- Replace invalid and empty with null
- Cast to IntegerType

In [40]:
# Map the 'invalid' sentinel and empty strings to null, then cast to integer.
df_raw = df_raw.withColumn('vehicle_count',
    when(col('vehicle_count').isin('invalid', ''), None)
    .otherwise(col('vehicle_count'))
    .cast(IntegerType())
)
df_raw.printSchema()

root
 |-- sensor_id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- road_name: string (nullable = true)
 |-- vehicle_count: integer (nullable = true)
 |-- avg_speed: string (nullable = true)
 |-- temperature: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- status: string (nullable = true)



3. Clean avg_speed:
- Replace empty with null
- Cast to DoubleType

In [41]:
# Map empty strings to null, then cast to double.
df_raw = df_raw.withColumn('avg_speed',
    when(col('avg_speed') == '', None)
    .otherwise(col('avg_speed'))
    .cast(DoubleType())
)
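
The two numeric cleaning rules above can be cross-checked on a few literal values without starting Spark. The sketch below is plain Python, not PySpark. The helper names `clean_int` and `clean_float` are hypothetical, and it assumes that any value that fails to parse should become null (`None`), which is broader than the sentinel list the notebook checks.

```python
from typing import Optional

# Hypothetical helpers mirroring the vehicle_count / avg_speed rules.
SENTINELS = {"invalid", ""}

def clean_int(raw: Optional[str]) -> Optional[int]:
    """Return an int, or None for null, sentinel, or unparseable input."""
    if raw is None:
        return None
    value = raw.strip()
    if value in SENTINELS:
        return None
    try:
        return int(value)
    except ValueError:
        return None

def clean_float(raw: Optional[str]) -> Optional[float]:
    """Return a float, or None for null, empty, or unparseable input."""
    if raw is None:
        return None
    value = raw.strip()
    if value == "":
        return None
    try:
        return float(value)
    except ValueError:
        return None

samples = ["103", " 16 ", "invalid", "", None]
cleaned = [clean_int(s) for s in samples]
print(cleaned)  # [103, 16, None, None, None]
```

Running a handful of known-bad values through a reference like this makes it easier to spot sentinels that the `isin('invalid', '')` check does not cover.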

4. Parse timestamp into event_time (TimestampType), supporting:
- yyyy-MM-dd HH:mm:ss
- dd/MM/yyyy HH:mm:ss
- yyyy/MM/dd HH:mm:ss

5. Keep original timestamp for audit

In [42]:
df_raw = df_raw.withColumn('event_time',
    coalesce(
        try_to_timestamp(col('timestamp'), lit('yyyy-MM-dd HH:mm:ss')),
        try_to_timestamp(col('timestamp'), lit('dd/MM/yyyy HH:mm:ss')),
        try_to_timestamp(col('timestamp'), lit('yyyy/MM/dd HH:mm:ss'))
    )
)

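# NOTE: dropping the raw column conflicts with step 5 (keep original timestamp for audit).
# Omit this line to retain it; the outputs that follow were produced with the column dropped.
df_raw = df_raw.drop('timestamp')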
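
The "try each format, keep the first that parses" logic of `coalesce(try_to_timestamp(...))` can be illustrated in plain Python. This is a sketch for checking individual strings, not the Spark implementation: `parse_event_time` is a hypothetical name, and Python's `strptime` directives are not identical to Spark's datetime patterns (for example, `strptime` accepts non-zero-padded fields).

```python
from datetime import datetime
from typing import Optional

# The notebook's three patterns, written as strptime directives.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M:%S", "%Y/%m/%d %H:%M:%S")

def parse_event_time(raw: Optional[str]) -> Optional[datetime]:
    """Try each format in order; return None if none of them matches."""
    if raw is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # fall through to the next format, like coalesce()
    return None

print(parse_event_time("12/01/2026 06:00:00"))  # 2026-01-12 06:00:00
print(parse_event_time("2026/01/12 06:00:10"))  # 2026-01-12 06:00:10
print(parse_event_time("not a date"))           # None
```

The three formats cannot be confused with one another (different separators or field order), so the order in which they are tried does not change the result here.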

##PHASE 3 – Validation

1. Count invalid vehicle_count rows

In [43]:
df_raw.filter(col('vehicle_count').isNull()).count()

49873

2. Count invalid timestamp rows

In [45]:
df_raw.filter(col('event_time').isNull()).count()

4853

3. Remove rows where status != "ACTIVE"

In [46]:
df_raw = df_raw.filter(col('status') == 'ACTIVE')

4. Validate row counts

In [47]:
print(f"Number of records after filtering by status: {df_raw.count()}")

Number of records after filtering by status: 475000


##PHASE 4 – Traffic Metrics

1. Average speed per location

In [48]:
df_raw.groupBy('location').agg(mean('avg_speed').alias('average_speed_per_location')).show(truncate=False)

+---------+--------------------------+
|location |average_speed_per_location|
+---------+--------------------------+
|Bangalore|47.455237585125005        |
|Chennai  |47.603255828145606        |
|Mumbai   |47.47210939162663         |
|Kolkata  |47.43352094200893         |
|Pune     |47.421643780726455        |
|Delhi    |47.615486368277594        |
|Hyderabad|47.536721714580516        |
+---------+--------------------------+
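
Because avg_speed contains nulls after cleaning, it helps to remember that Spark's `mean`/`avg` skips nulls instead of counting them as zero. The plain-Python sketch below shows the same per-group average on made-up rows; the data and variable names are illustrative only.

```python
from collections import defaultdict

# Toy rows (location, avg_speed); None stands for a null avg_speed.
rows = [("Chennai", 73.5), ("Chennai", None), ("Delhi", 20.0), ("Delhi", 40.0)]

totals = defaultdict(lambda: [0.0, 0])  # location -> [sum, non-null count]
for location, speed in rows:
    if speed is not None:               # nulls are skipped, not treated as 0
        totals[location][0] += speed
        totals[location][1] += 1

averages = {loc: s / n for loc, (s, n) in totals.items() if n > 0}
print(averages)  # {'Chennai': 73.5, 'Delhi': 30.0}
```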



2. Total vehicle count per road

In [49]:
df_raw.groupBy('road_name').agg(sum('vehicle_count').alias('total_vehicle_count_per_road')).show(truncate=False)

+---------------+----------------------------+
|road_name      |total_vehicle_count_per_road|
+---------------+----------------------------+
|University Rd  |1322004                     |
|Western Express|1334351                     |
|Eastern Express|1325865                     |
|FC Road        |1322292                     |
|Whitefield Rd  |1320360                     |
|Link Road      |1316848                     |
|Outer Ring Rd  |1339365                     |
|Gachibowli Rd  |1328605                     |
|Janpath        |1303498                     |
|Hitech City Rd |1338486                     |
|GST Road       |1333073                     |
|OMR            |1317171                     |
|NH48           |1335420                     |
|Ring Road      |1327408                     |
|Mount Road     |1329511                     |
|Howrah Rd      |1334512                     |
|Park Street    |1310784                     |
|EM Bypass      |1331117                     |
|Madhapur Rd 

3. Peak traffic time per location.

In [50]:
df_peak_traffic = df_raw.withColumn('event_hour', hour(col('event_time')))

window_spec_peak_time = Window.partitionBy('location').orderBy(desc('total_vehicle_count'))

df_peak_traffic = df_peak_traffic.groupBy('location', 'event_hour') \
    .agg(sum('vehicle_count').alias('total_vehicle_count')) \
    .withColumn('rank', rank().over(window_spec_peak_time)) \
    .filter(col('rank') == 1) \
    .select('location', 'event_hour', 'total_vehicle_count')

df_peak_traffic.show(truncate=False)

+---------+----------+-------------------+
|location |event_hour|total_vehicle_count|
+---------+----------+-------------------+
|Bangalore|4         |168298             |
|Chennai  |6         |169857             |
|Delhi    |22        |170234             |
|Hyderabad|3         |169636             |
|Kolkata  |0         |173778             |
|Mumbai   |17        |169548             |
|Pune     |1         |171539             |
+---------+----------+-------------------+
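
The peak-hour query is a two-stage aggregation: sum per (location, hour), then keep the hour with the largest sum in each location. A plain-Python sketch of that shape on made-up events follows. Two differences from the notebook are assumptions of the sketch: it skips rows with a missing timestamp (the notebook groups them under a null hour), and on a tie it keeps the first hour seen, whereas `rank()` would return every tied hour.

```python
from collections import defaultdict
from datetime import datetime

# Toy events: (location, event_time, vehicle_count). Values are illustrative.
events = [
    ("Pune", datetime(2026, 1, 12, 1, 5), 40),
    ("Pune", datetime(2026, 1, 12, 1, 50), 35),
    ("Pune", datetime(2026, 1, 12, 9, 0), 60),
    ("Delhi", datetime(2026, 1, 12, 22, 10), 50),
    ("Delhi", datetime(2026, 1, 12, 8, 0), 20),
]

# Stage 1: total vehicles per (location, hour).
hourly = defaultdict(int)
for location, ts, count in events:
    if ts is not None and count is not None:
        hourly[(location, ts.hour)] += count

# Stage 2: keep the hour with the highest total for each location.
peak = {}  # location -> (hour, total)
for (location, hour), total in hourly.items():
    if location not in peak or total > peak[location][1]:
        peak[location] = (hour, total)

print(peak)  # {'Pune': (1, 75), 'Delhi': (22, 50)}
```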



4. Roads with lowest average speed (most congestion)

In [51]:
df_raw.groupBy('road_name').agg(mean('avg_speed').alias('average_speed')).orderBy(asc('average_speed')).show(10, truncate=False)

+-------------+------------------+
|road_name    |average_speed     |
+-------------+------------------+
|EM Bypass    |47.305273508964476|
|Link Road    |47.3713610932772  |
|Whitefield Rd|47.40783038021416 |
|FC Road      |47.41019017359676 |
|Howrah Rd    |47.41786602740956 |
|MG Road      |47.42467508440872 |
|University Rd|47.42538415393851 |
|Nagar Rd     |47.429443229917716|
|GST Road     |47.44407903702021 |
|Gachibowli Rd|47.455523521874355|
+-------------+------------------+
only showing top 10 rows


##PHASE 5 – Window Function

1. Rank roads by congestion (lowest speed)

In [52]:
df_road_congestion = df_raw.groupBy('road_name').agg(mean('avg_speed').alias('average_speed'))

window_spec_congestion = Window.orderBy(asc('average_speed'))

df_road_congestion = df_road_congestion.withColumn('congestion_rank', rank().over(window_spec_congestion))

df_road_congestion.show(10, truncate=False)

+-------------+------------------+---------------+
|road_name    |average_speed     |congestion_rank|
+-------------+------------------+---------------+
|EM Bypass    |47.305273508964476|1              |
|Link Road    |47.3713610932772  |2              |
|Whitefield Rd|47.40783038021416 |3              |
|FC Road      |47.41019017359676 |4              |
|Howrah Rd    |47.41786602740956 |5              |
|MG Road      |47.42467508440872 |6              |
|University Rd|47.42538415393851 |7              |
|Nagar Rd     |47.429443229917716|8              |
|GST Road     |47.44407903702021 |9              |
|Gachibowli Rd|47.455523521874355|10             |
+-------------+------------------+---------------+
only showing top 10 rows
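
`rank()` gives tied values the same rank and then skips ahead, which would matter if two roads shared an average speed. The plain-Python sketch below contrasts that behaviour with dense ranking (PySpark also offers `dense_rank()` for the latter); the function names and the sample speeds are made up for illustration.

```python
def rank_with_gaps(values):
    """Ascending competition ranking (1, 2, 2, 4), like SQL RANK()."""
    ordered = sorted(values)
    ranks = []
    for i, v in enumerate(ordered):
        if i > 0 and v == ordered[i - 1]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(i + 1)       # position-based, so a gap follows a tie
    return list(zip(ordered, ranks))

def dense_rank(values):
    """Ascending ranking without gaps (1, 2, 2, 3), like SQL DENSE_RANK()."""
    ordered = sorted(values)
    ranks = []
    for i, v in enumerate(ordered):
        if i == 0:
            ranks.append(1)
        elif v == ordered[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return list(zip(ordered, ranks))

speeds = [47.3, 47.4, 47.4, 47.5]
print(rank_with_gaps(speeds))  # [(47.3, 1), (47.4, 2), (47.4, 2), (47.5, 4)]
print(dense_rank(speeds))      # [(47.3, 1), (47.4, 2), (47.4, 2), (47.5, 3)]
```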


2. For each location, rank roads by vehicle_count

In [53]:
df_road_vehicle_count = df_raw.groupBy('location', 'road_name').agg(sum('vehicle_count').alias('total_vehicle_count'))

window_spec_vehicle_count = Window.partitionBy('location').orderBy(desc('total_vehicle_count'))

df_road_vehicle_count = df_road_vehicle_count.withColumn('vehicle_count_rank', rank().over(window_spec_vehicle_count))

df_road_vehicle_count.show(truncate=False)

+---------+---------------+-------------------+------------------+
|location |road_name      |total_vehicle_count|vehicle_count_rank|
+---------+---------------+-------------------+------------------+
|Bangalore|Outer Ring Rd  |1339365            |1                 |
|Bangalore|Whitefield Rd  |1320360            |2                 |
|Bangalore|MG Road        |1303485            |3                 |
|Chennai  |GST Road       |1333073            |1                 |
|Chennai  |Mount Road     |1329511            |2                 |
|Chennai  |OMR            |1317171            |3                 |
|Delhi    |NH48           |1335420            |1                 |
|Delhi    |Ring Road      |1327408            |2                 |
|Delhi    |Janpath        |1303498            |3                 |
|Hyderabad|Hitech City Rd |1338486            |1                 |
|Hyderabad|Gachibowli Rd  |1328605            |2                 |
|Hyderabad|Madhapur Rd    |1324233            |3              

3. Identify top 3 congested roads per location.

In [54]:
df_road_vehicle_count.filter(col('vehicle_count_rank') <= 3).show(truncate=False)

+---------+---------------+-------------------+------------------+
|location |road_name      |total_vehicle_count|vehicle_count_rank|
+---------+---------------+-------------------+------------------+
|Bangalore|Outer Ring Rd  |1339365            |1                 |
|Bangalore|Whitefield Rd  |1320360            |2                 |
|Bangalore|MG Road        |1303485            |3                 |
|Chennai  |GST Road       |1333073            |1                 |
|Chennai  |Mount Road     |1329511            |2                 |
|Chennai  |OMR            |1317171            |3                 |
|Delhi    |NH48           |1335420            |1                 |
|Delhi    |Ring Road      |1327408            |2                 |
|Delhi    |Janpath        |1303498            |3                 |
|Hyderabad|Hitech City Rd |1338486            |1                 |
|Hyderabad|Gachibowli Rd  |1328605            |2                 |
|Hyderabad|Madhapur Rd    |1324233            |3              

##PHASE 6 – Anomaly Detection

1. Detect sudden drop in avg_speed

In [55]:
sensor_window = Window.partitionBy('sensor_id', 'location', 'road_name').orderBy('event_time')

traffic_with_lag = df_raw.withColumn('avg_speed_lag', lag('avg_speed', 1).over(sensor_window))

traffic_with_lag = traffic_with_lag.withColumn('speed_diff', col('avg_speed') - col('avg_speed_lag'))

traffic_with_lag = traffic_with_lag.withColumn('speed_drop_anomaly', when(col('speed_diff') < -10, True).otherwise(False))

traffic_with_lag.filter(col('speed_drop_anomaly') == True).show(truncate=False)


+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-------------+-------------------+------------------+
|sensor_id|location |road_name    |vehicle_count|avg_speed|temperature|status|event_time         |avg_speed_lag|speed_diff         |speed_drop_anomaly|
+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-------------+-------------------+------------------+
|S100     |Bangalore|Outer Ring Rd|12           |20.1     |27         |ACTIVE|2026-01-14 04:12:50|50.5         |-30.4              |true              |
|S100     |Bangalore|Outer Ring Rd|84           |54.6     |26         |ACTIVE|2026-01-17 13:21:15|75.4         |-20.800000000000004|true              |
|S100     |Bangalore|Outer Ring Rd|110          |16.2     |26         |ACTIVE|2026-01-17 15:48:15|54.6         |-38.400000000000006|true              |
|S100     |Bangalore|Outer Ring Rd|40           |27.1     |32         |ACTIVE|2026-01-19
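
The lag comparison can be traced by hand on a few rows. The sketch below is plain Python rather than PySpark: it sorts by sensor and time, carries the previous reading forward, and flags a drop of more than 10. The readings are made up (only the 50.5 to 20.1 drop echoes the output above), and `speed_drops` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

DROP_THRESHOLD = 10.0  # same cut-off the notebook uses

# Toy readings: (sensor_id, event_time as a sortable string, avg_speed)
readings = [
    ("S100", "2026-01-14 04:12:45", 50.5),
    ("S100", "2026-01-14 04:12:50", 20.1),
    ("S100", "2026-01-14 04:12:55", 18.0),
    ("S101", "2026-01-14 04:12:45", 30.0),
    ("S101", "2026-01-14 04:12:50", None),
    ("S101", "2026-01-14 04:12:55", 10.0),
]

def speed_drops(rows):
    flagged = []
    rows = sorted(rows, key=itemgetter(0, 1))        # partition key, then time
    for sensor, group in groupby(rows, key=itemgetter(0)):
        previous = None                              # the first row has no lag value
        for _, ts, speed in group:
            if previous is not None and speed is not None:
                if speed - previous < -DROP_THRESHOLD:
                    flagged.append((sensor, ts, previous, speed))
            previous = speed                         # lag uses the row just before, null or not
    return flagged

print(speed_drops(readings))
# [('S100', '2026-01-14 04:12:50', 50.5, 20.1)]
```

Note that a null reading breaks the chain: the row after it has a null lag value and is never flagged, which is also how the `when(...).otherwise(False)` expression behaves.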

2. Detect sudden spikes in vehicle_count.

In [56]:
sensor_window = Window.partitionBy('sensor_id', 'location', 'road_name').orderBy('event_time')

traffic_with_lag = df_raw.withColumn('vehicle_count_lag', lag('vehicle_count', 1).over(sensor_window))

traffic_with_lag = traffic_with_lag.withColumn('vehicle_count_diff', col('vehicle_count') - col('vehicle_count_lag'))

traffic_with_lag = traffic_with_lag.withColumn('vehicle_count_spike_anomaly', when(col('vehicle_count_diff') > 50, True).otherwise(False)) # Assuming a spike is > 50 vehicles

traffic_with_lag.filter(col('vehicle_count_spike_anomaly') == True).show(truncate=False)


+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-----------------+------------------+---------------------------+
|sensor_id|location |road_name    |vehicle_count|avg_speed|temperature|status|event_time         |vehicle_count_lag|vehicle_count_diff|vehicle_count_spike_anomaly|
+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-----------------+------------------+---------------------------+
|S100     |Bangalore|Outer Ring Rd|97           |37.8     |40         |ACTIVE|NULL               |30               |67                |true                       |
|S100     |Bangalore|Outer Ring Rd|79           |NULL     |39         |ACTIVE|2026-01-12 17:19:05|20               |59                |true                       |
|S100     |Bangalore|Outer Ring Rd|116          |26.6     |32         |ACTIVE|2026-01-23 01:15:30|28               |88                |true                       |
|S100     |Banga
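
The first flagged row above has a NULL event_time. Spark sorts nulls first in ascending order, so such rows are compared with an arbitrary neighbour rather than with the true previous event. One option, shown here as a plain-Python sketch on made-up readings, is to exclude rows without a timestamp before applying the lag comparison. The `spikes` helper is hypothetical, and whether to drop or repair those rows is a judgment call.

```python
SPIKE_THRESHOLD = 50  # same cut-off the notebook assumes

# Toy readings for one sensor: (event_time or None, vehicle_count)
readings = [
    (None, 97),
    (None, 30),
    ("2026-01-12 17:19:00", 20),
    ("2026-01-12 17:19:05", 79),
]

def spikes(rows, threshold=SPIKE_THRESHOLD):
    # Keep only rows that can be ordered in time, then sort them.
    timed = sorted((r for r in rows if r[0] is not None), key=lambda r: r[0])
    flagged = []
    # Pair each row with the one immediately before it (the lag value).
    for (_, prev_count), (ts, count) in zip(timed, timed[1:]):
        if prev_count is not None and count is not None and count - prev_count > threshold:
            flagged.append((ts, prev_count, count))
    return flagged

print(spikes(readings))  # [('2026-01-12 17:19:05', 20, 79)]
```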

3. Use the lag() window function to compare with the previous event.

In [57]:
sensor_window = Window.partitionBy('sensor_id', 'location', 'road_name').orderBy('event_time')

df_speed_anomalies = df_raw.withColumn('avg_speed_lag', lag('avg_speed', 1).over(sensor_window))

df_speed_anomalies = df_speed_anomalies.withColumn('speed_diff', col('avg_speed') - col('avg_speed_lag'))

df_speed_anomalies = df_speed_anomalies.withColumn('speed_drop_anomaly', when(col('speed_diff') < -10, True).otherwise(False))

df_speed_anomalies.filter(col('speed_drop_anomaly') == True).show(truncate=False)


+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-------------+-------------------+------------------+
|sensor_id|location |road_name    |vehicle_count|avg_speed|temperature|status|event_time         |avg_speed_lag|speed_diff         |speed_drop_anomaly|
+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-------------+-------------------+------------------+
|S100     |Bangalore|Outer Ring Rd|12           |20.1     |27         |ACTIVE|2026-01-14 04:12:50|50.5         |-30.4              |true              |
|S100     |Bangalore|Outer Ring Rd|84           |54.6     |26         |ACTIVE|2026-01-17 13:21:15|75.4         |-20.800000000000004|true              |
|S100     |Bangalore|Outer Ring Rd|110          |16.2     |26         |ACTIVE|2026-01-17 15:48:15|54.6         |-38.400000000000006|true              |
|S100     |Bangalore|Outer Ring Rd|40           |27.1     |32         |ACTIVE|2026-01-19

##PHASE 7 – Performance Engineering

1. Check number of partitions.

In [59]:
print(f"Number of partitions: {df_raw.rdd.getNumPartitions()}")

Number of partitions: 2


2. Use explain(True) on congestion queries

In [60]:
df_road_congestion.explain(True)

== Parsed Logical Plan ==
'Project [unresolvedstarwithcolumns(congestion_rank, 'rank() windowspecdefinition('average_speed ASC NULLS FIRST, unspecifiedframe$()), None)]
+- Aggregate [road_name#494], [road_name#494, avg(avg_speed#501) AS average_speed#681]
   +- Filter (status#499 = ACTIVE)
      +- Project [sensor_id#492, location#493, road_name#494, vehicle_count#500, avg_speed#501, temperature#497, status#499, event_time#502]
         +- Project [sensor_id#492, location#493, road_name#494, vehicle_count#500, avg_speed#501, temperature#497, timestamp#498, status#499, coalesce(try_to_timestamp(timestamp#498, Some(yyyy-MM-dd HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#498, Some(dd/MM/yyyy HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#498, Some(yyyy/MM/dd HH:mm:ss), TimestampType, Some(Etc/UTC), false)) AS event_time#502]
            +- Project [sensor_id#492, location#493, road_name#494, vehicle_count#500, cast(CASE WHEN (avg

3. Repartition by location.

In [61]:
df_raw = df_raw.repartition('location')

4. Cache cleaned DataFrame

In [62]:
df_raw.cache()

DataFrame[sensor_id: string, location: string, road_name: string, vehicle_count: int, avg_speed: double, temperature: string, status: string, event_time: timestamp]

5. Compare execution plans.

In [63]:
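# Trigger an action so the cache is materialized
df_raw.count()

# df_road_congestion was built before repartition()/cache(), so its plan does not
# reference the cached DataFrame and this explain() prints the same plan as before.
df_road_congestion.explain(True)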

== Parsed Logical Plan ==
'Project [unresolvedstarwithcolumns(congestion_rank, 'rank() windowspecdefinition('average_speed ASC NULLS FIRST, unspecifiedframe$()), None)]
+- Aggregate [road_name#494], [road_name#494, avg(avg_speed#501) AS average_speed#681]
   +- Filter (status#499 = ACTIVE)
      +- Project [sensor_id#492, location#493, road_name#494, vehicle_count#500, avg_speed#501, temperature#497, status#499, event_time#502]
         +- Project [sensor_id#492, location#493, road_name#494, vehicle_count#500, avg_speed#501, temperature#497, timestamp#498, status#499, coalesce(try_to_timestamp(timestamp#498, Some(yyyy-MM-dd HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#498, Some(dd/MM/yyyy HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#498, Some(yyyy/MM/dd HH:mm:ss), TimestampType, Some(Etc/UTC), false)) AS event_time#502]
            +- Project [sensor_id#492, location#493, road_name#494, vehicle_count#500, cast(CASE WHEN (avg

##PHASE 8 – RDD

1. Convert cleaned DataFrame to RDD

In [64]:
rdd_traffic = df_raw.rdd

2. Compute:
- Total vehicle count using reduce.
- Count of records per location using map-reduce.

In [65]:
df_speed_anomalies.filter(col('speed_drop_anomaly') == True).show(truncate=False)
traffic_with_lag.filter(col('vehicle_count_spike_anomaly') == True).show(truncate=False)

+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-------------+-------------------+------------------+
|sensor_id|location |road_name    |vehicle_count|avg_speed|temperature|status|event_time         |avg_speed_lag|speed_diff         |speed_drop_anomaly|
+---------+---------+-------------+-------------+---------+-----------+------+-------------------+-------------+-------------------+------------------+
|S100     |Bangalore|Outer Ring Rd|12           |20.1     |27         |ACTIVE|2026-01-14 04:12:50|50.5         |-30.4              |true              |
|S100     |Bangalore|Outer Ring Rd|84           |54.6     |26         |ACTIVE|2026-01-17 13:21:15|75.4         |-20.800000000000004|true              |
|S100     |Bangalore|Outer Ring Rd|110          |16.2     |26         |ACTIVE|2026-01-17 15:48:15|54.6         |-38.400000000000006|true              |
|S100     |Bangalore|Outer Ring Rd|40           |27.1     |32         |ACTIVE|2026-01-19
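
The cell above re-displays the Phase 6 anomaly tables, so the two aggregates listed in this step are sketched here. The code is plain Python on made-up rows so that it runs without Spark. The RDD calls named in the comments (`map`, `filter`, `reduce`, `reduceByKey`) exist in PySpark, but applying them to `rdd_traffic` in this way is an assumption, not something the notebook ran. `reduce_by_key` is a hypothetical local helper.

```python
from functools import reduce
from operator import add

# Toy stand-in for the rows of rdd_traffic: dicts instead of Row objects.
rows = [
    {"location": "Chennai", "vehicle_count": 103},
    {"location": "Delhi", "vehicle_count": 16},
    {"location": "Chennai", "vehicle_count": None},  # null after cleaning
    {"location": "Delhi", "vehicle_count": 40},
]

# Total vehicle count: map to the value, drop nulls, then reduce with +.
# RDD analogue (assumed): rdd_traffic.map(lambda r: r["vehicle_count"])
#                             .filter(lambda v: v is not None).reduce(add)
counts = filter(lambda v: v is not None, map(lambda r: r["vehicle_count"], rows))
total_vehicles = reduce(add, counts, 0)

# Records per location: map each row to (key, 1), then reduce by key.
# RDD analogue (assumed): rdd_traffic.map(lambda r: (r["location"], 1)).reduceByKey(add)
def reduce_by_key(pairs, fn):
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out

records_per_location = reduce_by_key(((r["location"], 1) for r in rows), add)

print(total_vehicles)        # 159
print(records_per_location)  # {'Chennai': 2, 'Delhi': 2}
```

Filtering out nulls before the reduce matters: adding `None` to an integer raises a `TypeError`, and vehicle_count is known to contain nulls after Phase 2.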

3. Explain why DataFrames are better for this case

In [67]:
# DataFrames are the better fit for this workload for several reasons:
# 1. Optimization: DataFrame queries are planned by the Catalyst optimizer and executed by the Tungsten engine, so filters, projections, and aggregations are optimized automatically. RDD code runs as written and must be tuned by hand.
# 2. Schema awareness: a DataFrame carries a schema, so Spark knows each column's name and type. That enables column-level checks at analysis time and compact storage for structured data such as these sensor readings.
# 3. Expressiveness: the DataFrame API is declarative; groupBy, window functions, and joins take a few lines, whereas the RDD API needs explicit map/reduce logic.
# 4. Interoperability: DataFrames read from and write to formats such as CSV, Parquet, and ORC directly and can be registered as SQL views.
# 5. Performance on structured data: for filtering, grouping, and aggregating structured records, DataFrames usually outperform RDDs, particularly in PySpark, where RDD lambdas run in Python worker processes.
# 6. SQL support: DataFrames can be queried with Spark SQL, which makes the analysis accessible to anyone who knows SQL.

##PHASE 9 – Sorting & Set Operations

1. Sort roads by highest congestion

In [68]:
df_road_congestion.orderBy(asc('average_speed')).show(truncate=False)

+---------------+------------------+---------------+
|road_name      |average_speed     |congestion_rank|
+---------------+------------------+---------------+
|EM Bypass      |47.305273508964476|1              |
|Link Road      |47.3713610932772  |2              |
|Whitefield Rd  |47.40783038021416 |3              |
|FC Road        |47.41019017359676 |4              |
|Howrah Rd      |47.41786602740956 |5              |
|MG Road        |47.42467508440872 |6              |
|University Rd  |47.42538415393851 |7              |
|Nagar Rd       |47.429443229917716|8              |
|GST Road       |47.44407903702021 |9              |
|Gachibowli Rd  |47.455523521874355|10             |
|Eastern Express|47.48790560471994 |11             |
|Hitech City Rd |47.51352853463453 |12             |
|Outer Ring Rd  |47.53179303464533 |13             |
|OMR            |47.548242457791424|14             |
|Ring Road      |47.55594236047568 |15             |
|Western Express|47.55637391185471 |16        

2. Create two sets:
- Roads with avg_speed < 25
- Roads with vehicle_count > 60

In [69]:
df_roads_low_speed = df_raw.filter(col('avg_speed') < 25).select('road_name').distinct()
df_roads_high_vehicle_count = df_raw.filter(col('vehicle_count') > 60).select('road_name').distinct()

3. Find:
- Roads in both sets
- Roads in only one set

In [70]:
# Roads in both sets (intersection)
df_roads_in_both = df_roads_low_speed.intersect(df_roads_high_vehicle_count)
print("Roads in both sets:")
df_roads_in_both.show(truncate=False)

# Roads in only one set (symmetric difference)
df_roads_only_in_low_speed = df_roads_low_speed.exceptAll(df_roads_high_vehicle_count)
df_roads_only_in_high_vehicle_count = df_roads_high_vehicle_count.exceptAll(df_roads_low_speed)
df_roads_in_only_one_set = df_roads_only_in_low_speed.union(df_roads_only_in_high_vehicle_count)
print("Roads in only one set:")
df_roads_in_only_one_set.show(truncate=False)

Roads in both sets:
+---------------+
|road_name      |
+---------------+
|Whitefield Rd  |
|Outer Ring Rd  |
|MG Road        |
|GST Road       |
|OMR            |
|Mount Road     |
|Western Express|
|Eastern Express|
|Link Road      |
|Howrah Rd      |
|Park Street    |
|EM Bypass      |
|University Rd  |
|FC Road        |
|Nagar Rd       |
|Janpath        |
|NH48           |
|Ring Road      |
|Gachibowli Rd  |
|Hitech City Rd |
+---------------+
only showing top 20 rows
Roads in only one set:
+---------+
|road_name|
+---------+
+---------+
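
The empty result follows from how the two sets are built: the filters apply to individual readings, so a road joins a set as soon as one of its readings qualifies. With this much data, every road appears to have at least one slow reading and one busy reading, so both sets hold every road and nothing is left in only one of them. The set algebra itself can be checked in plain Python. In the sketch below the road names come from the notebook output, but the membership is made up.

```python
# Toy road sets; membership is illustrative, not taken from the data.
low_speed = {"OMR", "MG Road", "Janpath"}
high_volume = {"OMR", "MG Road", "NH48"}

in_both = low_speed & high_volume    # intersection
only_one = low_speed ^ high_volume   # symmetric difference

# Same result built the way the notebook does it: (A - B) union (B - A).
only_one_two_step = (low_speed - high_volume) | (high_volume - low_speed)

print(sorted(in_both))   # ['MG Road', 'OMR']
print(sorted(only_one))  # ['Janpath', 'NH48']
```

Filtering on a per-road aggregate (for example the road's average speed) instead of on single readings would give sets that can actually differ.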



##PHASE 10 – Storage

1. Write cleaned traffic data to Parquet (partitioned by location)

In [71]:
df_raw.write.mode('overwrite').partitionBy('location').parquet('/content/traffic_data_cleaned.parquet')

2. Write congestion analytics to ORC

In [72]:
df_road_congestion.write.mode('overwrite').orc('/content/congestion_analytics.orc')

3. Read back and validate

In [73]:
print("Reading back Parquet data...")
df_traffic_cleaned_read = spark.read.parquet('/content/traffic_data_cleaned.parquet')
df_traffic_cleaned_read.printSchema()
df_traffic_cleaned_read.show(5, truncate=False)

print("Reading back ORC data...")
df_congestion_analytics_read = spark.read.orc('/content/congestion_analytics.orc')
df_congestion_analytics_read.printSchema()
df_congestion_analytics_read.show(5, truncate=False)

Reading back Parquet data...
root
 |-- sensor_id: string (nullable = true)
 |-- road_name: string (nullable = true)
 |-- vehicle_count: integer (nullable = true)
 |-- avg_speed: double (nullable = true)
 |-- temperature: string (nullable = true)
 |-- status: string (nullable = true)
 |-- event_time: timestamp (nullable = true)
 |-- location: string (nullable = true)

+---------+--------------+-------------+---------+-----------+------+-------------------+---------+
|sensor_id|road_name     |vehicle_count|avg_speed|temperature|status|event_time         |location |
+---------+--------------+-------------+---------+-----------+------+-------------------+---------+
|S197     |Hitech City Rd|97           |44.4     |26         |ACTIVE|2026-01-12 06:01:20|Hyderabad|
|S237     |Gachibowli Rd |90           |57.1     |39         |ACTIVE|2026-01-12 06:01:30|Hyderabad|
|S277     |Hitech City Rd|91           |37.4     |29         |ACTIVE|2026-01-12 06:02:30|Hyderabad|
|S267     |Hitech City Rd|31  