In [1]:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("Spark core concepts demo")
    .getOrCreate()
)

In [3]:
data = [
    ("O001","Hyderabad","Electronics",1200,"Delivered"),
    ("O002","Delhi","Clothing",800,"Delivered"),
    ("O003","Mumbai","Electronics",1500,"Cancelled"),
    ("O004","Bangalore","Grocery",400,"Delivered"),
    ("O005","Hyderabad","Grocery",300,"Delivered"),
    ("O006","Delhi","Electronics",2000,"Delivered"),
    ("O007","Mumbai","Clothing",700,"Delivered"),
    ("O008","Bangalore","Electronics",1800,"Delivered"),
    ("O009","Delhi","Grocery",350,"Cancelled"),
    ("O010","Hyderabad","Clothing",900,"Delivered")
]

columns = ["order_id","city","category","order_amount","status"]
df=spark.createDataFrame(data,columns)
df.show()
df.printSchema()

+--------+---------+-----------+------------+---------+
|order_id|     city|   category|order_amount|   status|
+--------+---------+-----------+------------+---------+
|    O001|Hyderabad|Electronics|        1200|Delivered|
|    O002|    Delhi|   Clothing|         800|Delivered|
|    O003|   Mumbai|Electronics|        1500|Cancelled|
|    O004|Bangalore|    Grocery|         400|Delivered|
|    O005|Hyderabad|    Grocery|         300|Delivered|
|    O006|    Delhi|Electronics|        2000|Delivered|
|    O007|   Mumbai|   Clothing|         700|Delivered|
|    O008|Bangalore|Electronics|        1800|Delivered|
|    O009|    Delhi|    Grocery|         350|Cancelled|
|    O010|Hyderabad|   Clothing|         900|Delivered|
+--------+---------+-----------+------------+---------+

root
 |-- order_id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- category: string (nullable = true)
 |-- order_amount: long (nullable = true)
 |-- status: string (nullable = true)



In [4]:
df.rdd.getNumPartitions()

2

In [5]:
df_repart=df.repartition(4)
df_repart.rdd.getNumPartitions()

4

In [6]:
df_coalesce=df.coalesce(1)
df_coalesce.rdd.getNumPartitions()

1

# Transformation

In [7]:
filtered_df=df.filter(df.city=="Delhi")
selected_df=filtered_df.select("order_id","order_amount")

# Action

In [8]:
selected_df.show()

+--------+------------+
|order_id|order_amount|
+--------+------------+
|    O002|         800|
|    O006|        2000|
|    O009|         350|
+--------+------------+



Data lineage: the chain of transformations that Spark records for a DataFrame

In [9]:
df_lineage=(
    df.filter(df.status=="Delivered")
    .filter(df.order_amount > 500)
    .select("city","order_amount")
)

In [10]:
df_lineage.count()

6

In [11]:
df.explain(True)

== Parsed Logical Plan ==
LogicalRDD [order_id#0, city#1, category#2, order_amount#3L, status#4], false

== Analyzed Logical Plan ==
order_id: string, city: string, category: string, order_amount: bigint, status: string
LogicalRDD [order_id#0, city#1, category#2, order_amount#3L, status#4], false

== Optimized Logical Plan ==
LogicalRDD [order_id#0, city#1, category#2, order_amount#3L, status#4], false

== Physical Plan ==
*(1) Scan ExistingRDD[order_id#0,city#1,category#2,order_amount#3L,status#4]



df.explain(True) prints the extended query plan for a DataFrame: the three logical plans that Spark derives in turn, followed by the physical plan it will actually run. The four sections are:

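Parsed Logical Plan: This is the initial representation of your DataFrame operations, built directly from the code you wrote. In general it has not yet been validated against the catalog (e.g., checking that columns exist). Here df was created from an in-memory collection with no further operations, so the plan is a single LogicalRDD node.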

Analyzed Logical Plan: Spark resolves every column, table, and function reference against the catalog and determines the data type of each output column, which is why this section starts with a schema line (order_amount: bigint, and so on). A reference to a column that does not exist fails at this stage with an AnalysisException.

Optimized Logical Plan: Spark's Catalyst optimizer applies a series of rewrite rules to the analyzed plan. These include predicate pushdown (filtering data as early as possible), column pruning (dropping columns that are never used), and combining adjacent operations such as consecutive filters. The goal is a cheaper plan that returns exactly the same result. Because df has no transformations applied, there is nothing to rewrite and this section matches the analyzed plan.

Physical Plan: This is the final, executable plan. It describes how the optimized logical plan will run on the cluster, including the physical operators (e.g., Scan, HashAggregate, Sort), the partitioning, and any shuffles (shown as Exchange operators). In *(1) Scan ExistingRDD[order_id#0,city#1,category#2,order_amount#3L,status#4], the *(1) prefix marks the operator as part of whole-stage code generation stage 1, and Scan ExistingRDD means Spark reads the rows from the Resilient Distributed Dataset (RDD) that createDataFrame built from the Python list. No filter, projection, or shuffle appears because the plan is for df itself, not for a transformed DataFrame.