# CASE STUDY 4
Title: Smart City Traffic Sensor Analytics using PySpark

# PHASE 1 – Ingestion
1. Read traffic_data.csv (traffic_data_large.csv in this run) with every column as StringType.
2. Print schema and count records.
3. Identify data quality issues by inspection.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType

spark=SparkSession.builder.appName('SmartCityTrafficAnalytics').getOrCreate()

df_raw=spark.read.csv('traffic_data_large.csv',header=True,inferSchema=False)



In [3]:
df_raw.printSchema()
df_raw.count()

root
 |-- sensor_id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- road_name: string (nullable = true)
 |-- vehicle_count: string (nullable = true)
 |-- avg_speed: string (nullable = true)
 |-- temperature: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- status: string (nullable = true)



500000

In [4]:
df_raw.show(5)

+---------+---------+---------------+-------------+---------+-----------+-------------------+--------+
|sensor_id| location|      road_name|vehicle_count|avg_speed|temperature|          timestamp|  status|
+---------+---------+---------------+-------------+---------+-----------+-------------------+--------+
|     S105|  Chennai|            OMR|      invalid|     NULL|         39|12/01/2026 06:00:00|INACTIVE|
|     S113|  Chennai|     Mount Road|          103|     73.5|         36|2026-01-12 06:00:05|  ACTIVE|
|     S228|    Delhi|        Janpath|           16|     20.0|         35|2026-01-12 06:00:10|  ACTIVE|
|     S160|Bangalore|        MG Road|           27|     27.1|         32|2026-01-12 06:00:15|  ACTIVE|
|     S252|   Mumbai|Western Express|          115|     59.3|         39|2026-01-12 06:00:20|  ACTIVE|
+---------+---------+---------------+-------------+---------+-----------+-------------------+--------+
only showing top 5 rows


1. Incorrect Data Types: Every column is read as StringType because inferSchema=False. vehicle_count, avg_speed, and temperature should be numeric, and timestamp should be a timestamp type, before any analysis.
2. Invalid Values: The vehicle_count column contains the string 'invalid' (e.g., sensor_id S105), which cannot be cast to an integer.
3. Missing Values: The avg_speed column shows NULL for sensor_id S105. Recent Spark versions print true nulls as NULL in show(), so this may be an empty CSV field or the literal string 'NULL'. Either way it must end up as a proper null.
4. Inconsistent Timestamp Format: The timestamp column shows at least two formats. Row S105 uses dd/MM/yyyy HH:mm:ss (12/01/2026, consistent with the 2026-01-12 dates in the other rows), while the rest use yyyy-MM-dd HH:mm:ss. A single-format conversion would turn one of these groups into nulls.
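
The observations above can also be checked mechanically on the five rows printed by show(5). The following is a minimal plain-Python sketch with no Spark dependency. The helper name `profile_rows` and the issue labels are illustrative assumptions, not part of the notebook.

```python
import re
from collections import Counter

# The five rows printed by df_raw.show(5): (sensor_id, vehicle_count, avg_speed, timestamp).
# None stands for the NULL cell shown in the output.
SAMPLE_ROWS = [
    ("S105", "invalid", None, "12/01/2026 06:00:00"),
    ("S113", "103", "73.5", "2026-01-12 06:00:05"),
    ("S228", "16", "20.0", "2026-01-12 06:00:10"),
    ("S160", "27", "27.1", "2026-01-12 06:00:15"),
    ("S252", "115", "59.3", "2026-01-12 06:00:20"),
]

DIGITS_ONLY = re.compile(r"[0-9]+")
ISO_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def profile_rows(rows):
    """Count how many rows show each kind of data quality issue (hypothetical helper)."""
    issues = Counter()
    for _sensor_id, vehicle_count, avg_speed, timestamp in rows:
        if vehicle_count is None or not DIGITS_ONLY.fullmatch(vehicle_count):
            issues["non_numeric_vehicle_count"] += 1
        # Treat a real null, an empty string, and the literal 'NULL' as missing.
        if avg_speed is None or avg_speed.strip().upper() in ("", "NULL"):
            issues["missing_avg_speed"] += 1
        if not ISO_TIMESTAMP.fullmatch(timestamp):
            issues["non_iso_timestamp"] += 1
    return dict(issues)


print(profile_rows(SAMPLE_ROWS))
```

On this sample, each issue type is hit once, all by row S105.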

# PHASE 2 – Cleaning
1. Trim all string columns.
2. Clean vehicle_count:
   - Replace invalid and empty with null
   - Cast to IntegerType
3. Clean avg_speed:
   - Replace empty with null
   - Cast to DoubleType
4. Parse timestamp into event_time with TimestampType, supporting:
   - yyyy-MM-dd HH:mm:ss
   - dd/MM/yyyy HH:mm:ss
   - yyyy/MM/dd HH:mm:ss
5. Keep original timestamp for audit.

In [5]:
df_trimmed=df_raw.select(
    [
        trim(col(c)).alias(c) for c in df_raw.columns
    ]
)

In [6]:
df_clean=df_trimmed.withColumn(
    "vehicle_count_clean",
    when(col("vehicle_count").rlike("^[0-9]+$"),col("vehicle_count").cast("int")).otherwise(None)
)


In [8]:
df_clean=df_clean.withColumn(
    "avg_speed_clean",
    when(col("avg_speed")!="",col("avg_speed").cast("double")).otherwise(None)
)
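
To make the two cleaning rules easy to unit-test outside Spark, they can be restated as plain Python functions. This is a sketch only: the helper names `clean_vehicle_count` and `clean_avg_speed` are hypothetical, and treating any unparseable speed (such as the literal 'NULL') as missing is an assumption that goes slightly beyond the notebook's empty-string check.

```python
import re

DIGITS_ONLY = re.compile(r"[0-9]+")


def clean_vehicle_count(raw):
    """Return an int for digit-only strings, otherwise None (mirrors the rlike rule)."""
    if raw is None:
        return None
    value = raw.strip()
    return int(value) if DIGITS_ONLY.fullmatch(value) else None


def clean_avg_speed(raw):
    """Return a float for parseable strings, otherwise None."""
    if raw is None or raw.strip() == "":
        return None
    try:
        return float(raw.strip())
    except ValueError:
        # Assumption: anything that is not a number counts as missing.
        return None


print([clean_vehicle_count(v) for v in ["103", " 16 ", "invalid", "", None]])
print([clean_avg_speed(v) for v in ["73.5", "", "NULL", None]])
```

One difference to keep in mind: Python's int has no 32-bit limit, whereas Spark's IntegerType does, so very long digit strings behave differently in the two settings.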


In [9]:
df_clean=df_clean.withColumn(
    "event_time",
    coalesce(
        try_to_timestamp(col("timestamp"),lit("yyyy-MM-dd HH:mm:ss")),
        try_to_timestamp(col("timestamp"),lit("dd/MM/yyyy HH:mm:ss")),
        try_to_timestamp(col("timestamp"),lit("yyyy/MM/dd HH:mm:ss"))
    )
)
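
The coalesce above tries each format in turn and keeps the first successful parse. The same fallback logic can be sketched in plain Python with `datetime.strptime`. The function name `parse_event_time` is hypothetical, and the format strings are the strptime equivalents of the three Spark patterns.

```python
from datetime import datetime

# Same three layouts as the notebook cell, in strptime notation, tried in order.
FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y %H:%M:%S",
    "%Y/%m/%d %H:%M:%S",
)


def parse_event_time(raw):
    """Return the first successful parse, or None if no format matches."""
    if raw is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # fall through to the next format, like coalesce
    return None


print(parse_event_time("12/01/2026 06:00:00"))
print(parse_event_time("2026/01/12 06:00:00"))
print(parse_event_time("not a timestamp"))
```

The three layouts differ in separator or in where the four-digit year sits, so a given string can match at most one of them and the order of the attempts does not change the result.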

# PHASE 3 – Validation
1. Count invalid vehicle_count rows.
2. Count invalid timestamp rows.
3. Remove rows where status != "ACTIVE".
4. Validate row counts.

In [10]:
df_clean.filter(col("vehicle_count_clean").isNull()).count()


49873

In [11]:
df_active=df_clean.filter(col("status")=="ACTIVE")
df_active.count()

475000

In [12]:
df_raw.count()-df_active.count()

25000
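
The counts reported above can be reconciled with a few lines of arithmetic. The sketch below only restates the figures printed by the notebook; the variable names are illustrative. The invalid-timestamp count asked for in step 2 is not shown in the notebook output, so it is left out here. It would presumably follow the same isNull() pattern applied to event_time.

```python
# Figures reported by the notebook cells above.
total_rows = 500_000             # df_raw.count()
active_rows = 475_000            # df_active.count()
invalid_vehicle_count = 49_873   # rows with null vehicle_count_clean in df_clean

removed_rows = total_rows - active_rows
removed_share = removed_rows / total_rows
invalid_share = invalid_vehicle_count / total_rows

print(removed_rows)
print(f"{removed_share:.2%}")
print(f"{invalid_share:.2%}")
```

So the status filter removes 5% of the rows, and just under 10% of all rows carry an unusable vehicle_count.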

# PHASE 4 – Traffic Metrics
1. Average speed per location.
2. Total vehicle count per road.

3. Peak traffic time per location.
4. Roads with lowest average speed (most congestion).

In [13]:
avg_speed_location=df_active.groupBy("location").agg(avg("avg_speed_clean").alias("avg_speed"))


In [14]:
vehicle_per_road=df_active.groupBy("road_name").agg(sum("vehicle_count_clean").alias("total_vehicles"))

In [16]:
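# max_by returns the event_time of the reading with the highest vehicle count per location
peak_time=df_active.groupBy("location").agg(
    max_by("event_time","vehicle_count_clean").alias("peak_event_time"),
    max("vehicle_count_clean").alias("vehicles")
)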

In [17]:
congested_roads=df_active.groupBy("road_name").agg(avg("avg_speed_clean").alias("avg_speed")).orderBy("avg_speed")
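
One detail worth keeping in mind for these aggregations: Spark's avg() skips nulls, so readings whose avg_speed_clean was nulled during cleaning do not pull the average toward zero. A plain-Python sketch of that behavior follows. The readings are hypothetical and the function name `average_speed_by_road` is an assumption.

```python
from collections import defaultdict

# Hypothetical (road_name, avg_speed_clean) readings; None marks a cleaned-out value.
readings = [
    ("OMR", 20.0),
    ("OMR", None),
    ("OMR", 30.0),
    ("MG Road", 27.5),
    ("MG Road", 32.5),
]


def average_speed_by_road(rows):
    """Average speed per road, ignoring missing readings."""
    totals = defaultdict(lambda: [0.0, 0])  # road -> [sum, count of non-null values]
    for road, speed in rows:
        if speed is not None:
            totals[road][0] += speed
            totals[road][1] += 1
    # Roads with no valid reading are dropped in this sketch.
    averages = {road: s / n for road, (s, n) in totals.items() if n > 0}
    # Lowest average first, i.e. most congested road at the top.
    return sorted(averages.items(), key=lambda item: item[1])


print(average_speed_by_road(readings))
```

Here OMR averages 25.0 over its two valid readings rather than 16.7 over three.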

# PHASE 5 – Window Functions
1. Rank roads by congestion (lowest speed).
2. For each location, rank roads by vehicle_count.
3. Identify top 3 congested roads per location.

In [18]:
from pyspark.sql.window import Window

In [19]:
w=Window.orderBy("avg_speed")
congestion_rank=congested_roads.withColumn("congestion_rank",dense_rank().over(w))

In [20]:
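# Rank roads (not individual readings) by total vehicles within each location
w_loc=Window.partitionBy("location").orderBy(desc("total_vehicles"))
road_totals=df_active.groupBy("location","road_name").agg(sum("vehicle_count_clean").alias("total_vehicles"))
ranked_roads=road_totals.withColumn("road_rank",dense_rank().over(w_loc))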

In [23]:
top_3_congested=ranked_roads.filter(col("road_rank")<=3)
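
dense_rank() gives tied rows the same rank and leaves no gaps, so a filter on road_rank <= 3 can return more than three rows per location. The plain-Python sketch below illustrates this, assuming per-road totals as input. The data and the function name `dense_rank_by_location` are hypothetical.

```python
from collections import defaultdict


def dense_rank_by_location(rows):
    """rows: (location, road_name, total_vehicles). Returns rows with a dense rank appended."""
    by_location = defaultdict(list)
    for row in rows:
        by_location[row[0]].append(row)

    ranked = []
    for location in sorted(by_location):
        group = sorted(by_location[location], key=lambda r: r[2], reverse=True)
        rank, previous = 0, None
        for row in group:
            if row[2] != previous:  # the rank only advances when the value changes
                rank += 1
                previous = row[2]
            ranked.append(row + (rank,))
    return ranked


# Hypothetical totals with a tie for first place in Chennai.
totals = [
    ("Chennai", "OMR", 900),
    ("Chennai", "Mount Road", 900),
    ("Chennai", "GST Road", 700),
    ("Chennai", "Ring Road", 650),
    ("Chennai", "Link Road", 400),
    ("Delhi", "Janpath", 500),
]

top_3 = [row for row in dense_rank_by_location(totals) if row[3] <= 3]
for row in top_3:
    print(row)
```

Four Chennai roads pass the filter because two of them share rank 1. If exactly three rows per location are required, row_number() is the usual alternative, at the cost of breaking ties arbitrarily.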

# PHASE 6 – Anomaly Detection
1. Detect sudden drop in avg_speed.
2. Detect sudden spikes in vehicle_count.
3. Use the lag() window function to compare with previous event.

In [24]:
w_sensor=Window.partitionBy("sensor_id").orderBy("event_time")

In [25]:
df_anomaly=df_active.withColumn(
    "prev_speed",lag("avg_speed_clean").over(w_sensor)
).withColumn(
    "speed_drop",
    when(col("avg_speed_clean")<col("prev_speed")*0.7,True).otherwise(False)
)

In [26]:
df_anomaly=df_anomaly.withColumn(
    "prev_count",lag("vehicle_count_clean").over(w_sensor)
).withColumn(
    "vehicle_spike",
    when(col("vehicle_count_clean")>col("prev_count")*1.5,True).otherwise(False)
)
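
The lag-based rules above compare each reading with the previous reading from the same sensor: a speed below 70% of the previous value counts as a drop, and a count above 150% of the previous value counts as a spike. A plain-Python sketch of the same comparison follows. The events and the function name `flag_anomalies` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

SPEED_DROP_FACTOR = 0.7   # current speed below 70% of the previous reading
COUNT_SPIKE_FACTOR = 1.5  # current count above 150% of the previous reading


def flag_anomalies(events):
    """events: (sensor_id, event_time, avg_speed, vehicle_count) tuples.

    Returns (sensor_id, event_time, speed_drop, vehicle_spike) for every event.
    """
    flagged = []
    ordered = sorted(events, key=itemgetter(0, 1))  # partition by sensor, order by time
    for sensor_id, group in groupby(ordered, key=itemgetter(0)):
        prev_speed = prev_count = None  # the first event of a sensor has no predecessor
        for _, event_time, speed, count in group:
            speed_drop = (
                speed is not None
                and prev_speed is not None
                and speed < prev_speed * SPEED_DROP_FACTOR
            )
            vehicle_spike = (
                count is not None
                and prev_count is not None
                and count > prev_count * COUNT_SPIKE_FACTOR
            )
            flagged.append((sensor_id, event_time, speed_drop, vehicle_spike))
            prev_speed, prev_count = speed, count  # like lag(): previous row, even if null
    return flagged


events = [
    ("S113", "2026-01-12 06:00:05", 73.5, 103),
    ("S113", "2026-01-12 06:05:05", 40.0, 110),
    ("S113", "2026-01-12 06:10:05", 38.0, 180),
    ("S228", "2026-01-12 06:00:10", 20.0, 16),
]

for row in flag_anomalies(events):
    print(row)
```

As in the Spark version, the first reading of each sensor and any reading that follows a missing value are never flagged, because there is nothing valid to compare against.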

# PHASE 7 – Performance Engineering
1. Check number of partitions.
2. Use explain(True) on congestion queries.
3. Repartition by location.
4. Cache cleaned DataFrame.
5. Compare execution plans.

In [27]:
df_active.rdd.getNumPartitions()

2

In [28]:
congested_roads.explain(True)

== Parsed Logical Plan ==
'Sort ['avg_speed ASC NULLS FIRST], true
+- Aggregate [road_name#73], [road_name#73, avg(avg_speed_clean#80) AS avg_speed#199]
   +- Filter (status#78 = ACTIVE)
      +- Project [sensor_id#71, location#72, road_name#73, vehicle_count#74, avg_speed#75, temperature#76, timestamp#77, status#78, vehicle_count_clean#79, avg_speed_clean#80, coalesce(try_to_timestamp(timestamp#77, Some(yyyy-MM-dd HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#77, Some(dd/MM/yyyy HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#77, Some(yyyy/MM/dd HH:mm:ss), TimestampType, Some(Etc/UTC), false)) AS event_time#81]
         +- Project [sensor_id#71, location#72, road_name#73, vehicle_count#74, avg_speed#75, temperature#76, timestamp#77, status#78, vehicle_count_clean#79, CASE WHEN NOT (avg_speed#75 = ) THEN cast(avg_speed#75 as double) ELSE cast(null as double) END AS avg_speed_clean#80]
            +- Project [sensor_id#71, locati

In [29]:
df_perf=df_active.repartition("location")

In [30]:
df_perf.cache()

DataFrame[sensor_id: string, location: string, road_name: string, vehicle_count: string, avg_speed: string, temperature: string, timestamp: string, status: string, vehicle_count_clean: int, avg_speed_clean: double, event_time: timestamp]

In [31]:
df_active.explain(True)
df_perf.explain(True)

== Parsed Logical Plan ==
'Filter '`=`('status, ACTIVE)
+- Project [sensor_id#71, location#72, road_name#73, vehicle_count#74, avg_speed#75, temperature#76, timestamp#77, status#78, vehicle_count_clean#79, avg_speed_clean#80, coalesce(try_to_timestamp(timestamp#77, Some(yyyy-MM-dd HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#77, Some(dd/MM/yyyy HH:mm:ss), TimestampType, Some(Etc/UTC), false), try_to_timestamp(timestamp#77, Some(yyyy/MM/dd HH:mm:ss), TimestampType, Some(Etc/UTC), false)) AS event_time#81]
   +- Project [sensor_id#71, location#72, road_name#73, vehicle_count#74, avg_speed#75, temperature#76, timestamp#77, status#78, vehicle_count_clean#79, CASE WHEN NOT (avg_speed#75 = ) THEN cast(avg_speed#75 as double) ELSE cast(null as double) END AS avg_speed_clean#80]
      +- Project [sensor_id#71, location#72, road_name#73, vehicle_count#74, avg_speed#75, temperature#76, timestamp#77, status#78, CASE WHEN RLIKE(vehicle_count#74, ^[0-9]+$) THEN cast(v

# PHASE 8 – RDD
1. Convert cleaned DataFrame to RDD.
2. Compute:
   - Total vehicle count using reduce.
   - Count of records per location using map-reduce.
3. Explain why DataFrames are better for this case.

In [32]:
rdd=df_active.rdd

In [34]:
total_vehicles=rdd.map(lambda x:x.vehicle_count_clean or 0).reduce(lambda a,b:a+b)

In [35]:
location_count=rdd.map(lambda x:(x.location,1)).reduceByKey(lambda a,b:a+b)
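
For readers less familiar with the RDD API, the two computations above correspond to a fold with addition and a per-key merge. The sketch below shows both in plain Python. The records are hypothetical, and `reduce_by_key` is an illustrative stand-in for what reduceByKey does, not Spark's implementation.

```python
from functools import reduce

# Hypothetical (location, vehicle_count_clean) pairs; None marks an invalid count.
records = [("Chennai", 103), ("Delhi", 16), ("Chennai", None), ("Mumbai", 115), ("Delhi", 27)]

# map: replace missing counts with 0; reduce: fold the values pairwise with addition.
total_vehicles = reduce(lambda a, b: a + b, map(lambda r: r[1] or 0, records))


def reduce_by_key(pairs, func):
    """Merge all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


location_count = reduce_by_key(((loc, 1) for loc, _ in records), lambda a, b: a + b)

print(total_vehicles)
print(location_count)
```

Note that in the notebook, location_count is still an RDD after reduceByKey. It needs an action such as collect() before any counts are actually computed and returned.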

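DataFrames are the better fit here because:

Optimized Execution: DataFrame operations go through the Catalyst optimizer and run inside the JVM, whereas the RDD lambdas above are executed row by row in Python workers.

Schema Awareness: Column names and types are known up front, so Spark can reject a misspelled column at analysis time and read only the columns a query needs.

Easier API: groupBy, agg, and window functions express the aggregations in this case study more directly than hand-written map and reduce steps.

Performance with Structured Data: Tabular sensor readings map naturally onto typed columns, which Spark stores in a compact internal format instead of as generic Python objects.

Better Integration: DataFrames are the common interface for Spark SQL, the Parquet and ORC writers used in Phase 10, and MLlib's DataFrame-based API.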

# PHASE 9 – Sorting & Set Operations
1. Sort roads by highest congestion.
2. Create two sets:
   - Roads with avg_speed < 25
   - Roads with vehicle_count > 60
3. Find:
   - Roads in both sets
   - Roads in only one set

In [36]:
sorted_roads=congested_roads.orderBy(asc("avg_speed"))

In [37]:
slow_roads=df_active.filter(col("avg_speed_clean")<25)\
.select("road_name").distinct()

busy_roads=df_active.filter(col("vehicle_count_clean")>60)\
.select("road_name").distinct()

In [38]:
both=slow_roads.intersect(busy_roads)
# Roads in exactly one set: everything in either set minus the overlap
only_one=slow_roads.union(busy_roads).subtract(both)
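
"Roads in only one set" is a symmetric difference, which is not the same as a one-sided subtract. Python's built-in sets make the distinction easy to see. The road names below are hypothetical placeholders for the contents of slow_roads and busy_roads.

```python
# Hypothetical road sets; in the notebook these come from slow_roads and busy_roads.
slow = {"OMR", "MG Road", "Janpath"}
busy = {"MG Road", "Western Express", "Mount Road"}

both = slow & busy        # roads in both sets
only_one = slow ^ busy    # symmetric difference: roads in exactly one set
only_slow = slow - busy   # one-sided difference: slow but not busy

print(sorted(both))
print(sorted(only_one))
print(sorted(only_slow))
```

The one-sided difference misses the roads that are busy but not slow, which is why the symmetric form is needed for this task.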

# PHASE 10 – Storage
1. Write cleaned traffic data to Parquet (partitioned by location).
2. Write congestion analytics to ORC.
3. Read back and validate.

In [39]:
df_active.write\
.mode("overwrite")\
.partitionBy("location")\
.parquet("traffic_cleaned_parquet")

In [40]:
congested_roads.write\
.mode("overwrite")\
.orc("congestion_analytics_orc")

In [41]:
spark.read.parquet("traffic_cleaned_parquet").count()
spark.read.orc("congestion_analytics_orc").show()

+---------------+------------------+
|      road_name|         avg_speed|
+---------------+------------------+
|      EM Bypass|47.305273508964476|
|      Link Road|  47.3713610932772|
|  Whitefield Rd| 47.40783038021416|
|        FC Road| 47.41019017359676|
|      Howrah Rd| 47.41786602740956|
|        MG Road| 47.42467508440872|
|  University Rd| 47.42538415393851|
|       Nagar Rd|47.429443229917716|
|       GST Road| 47.44407903702021|
|  Gachibowli Rd|47.455523521874355|
|Eastern Express| 47.48790560471994|
| Hitech City Rd| 47.51352853463453|
|  Outer Ring Rd| 47.53179303464533|
|            OMR|47.548242457791424|
|      Ring Road| 47.55594236047568|
|Western Express| 47.55637391185471|
|    Park Street|47.579433971003745|
|           NH48|47.613086790393126|
|    Madhapur Rd| 47.64118452897408|
|        Janpath| 47.67768541905838|
+---------------+------------------+
only showing top 20 rows
