In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Reads Optimization").master("local[*]").getOrCreate()

spark

In [3]:


# In Spark, execution is broken into **Jobs, Stages, and Tasks**:

# * **Job**:
#   A job is triggered by an **action** (like `count()`, `show()`, `collect()`, or a write such as `df.write.csv(...)`).
#   Example: calling `df.count()` triggers one job.

# * **Stage**:
#   Each job is divided into stages, which are separated by **shuffle boundaries**.
#   If an operation requires data to be moved across partitions (e.g., `groupBy`, `join`), a new stage is created.
#   Example: `df.groupBy("city").count()` creates multiple stages because Spark must shuffle rows by `city`.

# * **Task**:
#   A task is the **smallest unit of execution**, and there is one task per partition of the data in a stage.
#   Tasks are what actually run on executors in parallel.
#   Example: if a stage operates on a file split into 8 partitions, Spark creates 8 tasks.


# **One job → multiple stages (if shuffles are needed) → multiple tasks (one per partition per stage).**

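# “If I read a CSV with 7 partitions and run `df.count()`, Spark will create **1 job** whose first stage has **7 tasks** (one per partition). The DataFrame `count()` is planned as a partial count per partition followed by a final sum over a single shuffle partition, so the Spark UI typically shows a second stage with **1 task**.

# If I do a `groupBy` on that DataFrame and then call an action, Spark will create **1 job**, split into **2 stages**: one stage to read and partially aggregate the data, and another stage after the shuffle to finish the aggregation. Each stage again breaks into tasks, based on its number of partitions.”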
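
# The job/stage/task arithmetic above can be sketched in plain Python. This is an illustrative model, not Spark's scheduler: `Stage` and `summarize_job` are hypothetical names, and the 200 post-shuffle partitions assume the default `spark.sql.shuffle.partitions` with no adaptive coalescing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """Hypothetical stand-in for a Spark stage (not a Spark class)."""
    name: str
    num_partitions: int  # Spark runs one task per partition in a stage


def summarize_job(stages: List[Stage]) -> Dict[str, int]:
    """Count stages, shuffle boundaries, and tasks for one job."""
    return {
        "stages": len(stages),
        # Consecutive stages are separated by one shuffle boundary.
        "shuffles": max(len(stages) - 1, 0),
        "tasks": sum(stage.num_partitions for stage in stages),
    }


# A job with no shuffle: a single stage over 7 input partitions.
narrow_job = [Stage("scan", 7)]

# A job with one shuffle: 7 input partitions, then 200 shuffle partitions
# (assumed default of spark.sql.shuffle.partitions).
shuffle_job = [
    Stage("scan + partial aggregate", 7),
    Stage("final aggregate", 200),
]

print(summarize_job(narrow_job))   # {'stages': 1, 'shuffles': 0, 'tasks': 7}
print(summarize_job(shuffle_job))  # {'stages': 2, 'shuffles': 1, 'tasks': 207}
```

# On a real cluster, the Spark UI (Jobs and Stages tabs) is the place to confirm these counts, since adaptive query execution can change the number of post-shuffle partitions.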

# ---

# ## ✅ Key Takeaway (good one-liner for interviews)

# * **Job = action**
# * **Stage = set of transformations without shuffle**
# * **Task = work done on a single partition**

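# This read triggers 2 jobs, 2 stages, and 7 tasks. `read` is not an action, but with `header=True` and `inferSchema=True` Spark scans the file eagerly to read the header and infer the column types.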

df = spark.read.csv("/opt/data/ncr_ride_bookings.csv", header=True, inferSchema=True)

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.9/dist-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving

Py4JError: An error occurred while calling o25.read

In [4]:
df.show()

ConnectionRefusedError: [Errno 111] Connection refused