In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Reads Optimization").master("local[*]").getOrCreate()

spark

In [2]:


# In Spark, execution is broken into **Jobs, Stages, and Tasks**:

# * **Job**:
#   A job is triggered by an **action** (like `count()`, `show()`, `collect()`, `write()`).
#   Example: calling `df.count()` triggers one job.

# * **Stage**:
#   Each job is divided into stages, which are separated by **shuffle boundaries**.
#   If an operation requires data to be moved across partitions (e.g., `groupBy`, `join`), a new stage is created.
#   Example: `df.groupBy("city").count()` creates multiple stages because Spark must shuffle rows by `city`.

# * **Task**:
#   A task is the **smallest unit of execution**, and there is one task per partition of the data in a stage.
#   Tasks are what actually run on executors in parallel.
#   Example: if a stage operates on a file split into 8 partitions, Spark creates 8 tasks.


# **One job → multiple stages (if shuffles are needed) → multiple tasks (one per partition per stage).**

# If I read a CSV with 7 partitions and run `df.count()`, Spark creates **1 job** whose scan stage has **7 tasks** (one per partition). For a DataFrame, the Spark UI typically also shows a second, single-task stage that merges the per-partition partial counts.

# If I run an action on a `groupBy` of that DataFrame (e.g., `df.groupBy("city").count().show()`), the work splits into **2 stages**: one to read the data and pre-aggregate it, and one after the shuffle to finish the aggregation. Each stage again breaks into tasks, one per partition.

# ---

# ## ✅ Key Takeaway (good one-liner for interviews)

# * **Job = action**
# * **Stage = set of transformations without shuffle**
# * **Task = work done on a single partition**

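# This read creates 2 jobs, 2 stages and 7 tasks even though no action is called:
# `header=True` and `inferSchema=True` make Spark scan the file while the DataFrame is being defined.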

df = spark.read.csv("/opt/data/ncr_ride_bookings.csv", header=True, inferSchema=True)

In [3]:
df.show()

+----------+-------------------+----------------+--------------------+----------------+-------------+-------------------+-----------------+--------+--------+---------------------------+---------------------------------+-------------------------+--------------------------+----------------+-----------------------+-------------+-------------+--------------+---------------+--------------+
|      Date|               Time|      Booking ID|      Booking Status|     Customer ID| Vehicle Type|    Pickup Location|    Drop Location|Avg VTAT|Avg CTAT|Cancelled Rides by Customer|Reason for cancelling by Customer|Cancelled Rides by Driver|Driver Cancellation Reason|Incomplete Rides|Incomplete Rides Reason|Booking Value|Ride Distance|Driver Ratings|Customer Rating|Payment Method|
+----------+-------------------+----------------+--------------------+----------------+-------------+-------------------+-----------------+--------+--------+---------------------------+---------------------------------+-----

In [3]:
df.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Time: timestamp (nullable = true)
 |-- Booking ID: string (nullable = true)
 |-- Booking Status: string (nullable = true)
 |-- Customer ID: string (nullable = true)
 |-- Vehicle Type: string (nullable = true)
 |-- Pickup Location: string (nullable = true)
 |-- Drop Location: string (nullable = true)
 |-- Avg VTAT: string (nullable = true)
 |-- Avg CTAT: string (nullable = true)
 |-- Cancelled Rides by Customer: string (nullable = true)
 |-- Reason for cancelling by Customer: string (nullable = true)
 |-- Cancelled Rides by Driver: string (nullable = true)
 |-- Driver Cancellation Reason: string (nullable = true)
 |-- Incomplete Rides: string (nullable = true)
 |-- Incomplete Rides Reason: string (nullable = true)
 |-- Booking Value: string (nullable = true)
 |-- Ride Distance: string (nullable = true)
 |-- Driver Ratings: string (nullable = true)
 |-- Customer Rating: string (nullable = true)
 |-- Payment Method: string (nullable = true)



In [5]:
# Check DF Type
from pyspark.sql import DataFrame

# Assume df is your object
isinstance(df, DataFrame) 

True

In [6]:
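from pyspark.sql.functions import col, count, when

# Fraction of NULL values per column. count() skips NULLs, so only rows
# where the isNull() condition holds are counted.
# Note: cells holding the literal text 'null' (see the df.head(5) output)
# are ordinary strings, so they are not counted here.
total_count = df.count()
null_stats = df.select([
    (count(when(col(c).isNull(), c)) / total_count).alias(c) for c in df.columns
])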

null_stats.show()


+----+----+----------+--------------+-----------+------------+---------------+-------------+--------+--------+---------------------------+---------------------------------+-------------------------+--------------------------+----------------+-----------------------+-------------+-------------+--------------+---------------+--------------+
|Date|Time|Booking ID|Booking Status|Customer ID|Vehicle Type|Pickup Location|Drop Location|Avg VTAT|Avg CTAT|Cancelled Rides by Customer|Reason for cancelling by Customer|Cancelled Rides by Driver|Driver Cancellation Reason|Incomplete Rides|Incomplete Rides Reason|Booking Value|Ride Distance|Driver Ratings|Customer Rating|Payment Method|
+----+----+----------+--------------+-----------+------------+---------------+-------------+--------+--------+---------------------------+---------------------------------+-------------------------+--------------------------+----------------+-----------------------+-------------+-------------+--------------+---------

In [7]:
df.head(5)

[Row(Date=datetime.date(2024, 3, 23), Time=datetime.datetime(2025, 9, 13, 12, 29, 38), Booking ID='"""CNR5884300"""', Booking Status='No Driver Found', Customer ID='"""CID1982111"""', Vehicle Type='eBike', Pickup Location='Palam Vihar', Drop Location='Jhilmil', Avg VTAT='null', Avg CTAT='null', Cancelled Rides by Customer='null', Reason for cancelling by Customer='null', Cancelled Rides by Driver='null', Driver Cancellation Reason='null', Incomplete Rides='null', Incomplete Rides Reason='null', Booking Value='null', Ride Distance='null', Driver Ratings='null', Customer Rating='null', Payment Method='null'),
 Row(Date=datetime.date(2024, 11, 29), Time=datetime.datetime(2025, 9, 13, 18, 1, 39), Booking ID='"""CNR1326809"""', Booking Status='Incomplete', Customer ID='"""CID4604802"""', Vehicle Type='Go Sedan', Pickup Location='Shastri Nagar', Drop Location='Gurgaon Sector 56', Avg VTAT='4.9', Avg CTAT='14.0', Cancelled Rides by Customer='null', Reason for cancelling by Customer='null', Ca

In [4]:
df.rdd.getNumPartitions()

7
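
Where a number like 7 comes from can be approximated with a little arithmetic. The sketch below follows my understanding of how Spark sizes splits for a single splittable file, using the default values of `spark.sql.files.maxPartitionBytes` (128 MiB) and `spark.sql.files.openCostInBytes` (4 MiB). The function name and the example inputs (a 25 MiB file, 8 cores) are hypothetical; the real file size and core count of this notebook are not known, so treat the result as an estimate rather than an explanation of the output above.

```python
import math


def estimate_csv_partitions(file_size_bytes, default_parallelism,
                            max_partition_bytes=128 * 1024 * 1024,
                            open_cost_bytes=4 * 1024 * 1024):
    """Approximate the partition count Spark picks for one splittable file."""
    # Spark adds a fixed "open cost" per file, then spreads the bytes over the cores.
    bytes_per_core = (file_size_bytes + open_cost_bytes) // default_parallelism
    # The split size is capped by maxPartitionBytes and floored by the open cost.
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return math.ceil(file_size_bytes / max_split_bytes)


# Hypothetical inputs: a 25 MiB CSV read with local[*] on an 8-core machine.
print(estimate_csv_partitions(25 * 1024 * 1024, default_parallelism=8))
```

With small files the 4 MiB floor dominates, so the partition count tracks file size; with large files the 128 MiB cap takes over.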

In [None]:
df.write.partitionBy("Date", "Vehicle Type").parquet("/opt/output/partitioned_rides")
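
Several columns in this DataFrame have spaces in their names, and some Spark versions reject such names when writing Parquet. A minimal sketch of one workaround is to rename every column to snake_case before the write. The helper names `to_snake_case` and `with_safe_column_names` are hypothetical, and whether the rename is needed depends on the Spark version in use.

```python
import re


def to_snake_case(name):
    """Turn a header like 'Booking ID' into 'booking_id'."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()


def with_safe_column_names(df):
    """Return a DataFrame whose columns are renamed with to_snake_case."""
    return df.toDF(*[to_snake_case(c) for c in df.columns])
```

After `safe_df = with_safe_column_names(df)`, the partition columns would be referenced as `"date"` and `"vehicle_type"` in `partitionBy`.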


In [4]:
# bucketBy is only supported together with saveAsTable; a plain save()/parquet() write rejects it.
df.write.bucketBy(16, "Customer ID").sortBy("Customer ID").saveAsTable("bucketed_rides")