In [1]:
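# Explicit imports instead of `import *`; note that this `max` is Spark's
# column aggregate and shadows Python's built-in max() in this notebook.
from pyspark.sql.functions import col, max
from pyspark.sql.types import StructType, StructField, StringType, IntegerType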

In [2]:
import spark_env

spark = spark_env.create_spark_session('practice')

In [3]:
spark

In [4]:
# Raw rows as (name, age, city) tuples
data = [
    ('John', 24, 'Chicago'),
    ('Jack', 35, 'New York'),
    ('Doe', 28, 'San Francisco'),
    ('Kim', 32, 'Chicago'),
    ('Jerry', 42, 'Phoenix')
]

# Explicit schema; every column is nullable
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('city', StringType(), True)
])

df = spark.createDataFrame(data, schema)

# df.show(5)

In [5]:
filtered_df = df.filter(col('age') > 26)

In [6]:
filtered_df.show()

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
| Jack| 35|     New York|
|  Doe| 28|San Francisco|
|  Kim| 32|      Chicago|
|Jerry| 42|      Phoenix|
+-----+---+-------------+



In [7]:
filtered_df.explain('formatted')

== Physical Plan ==
* Filter (2)
+- * Scan ExistingRDD (1)


(1) Scan ExistingRDD [codegen id : 1]
Output [3]: [name#0, age#1, city#2]
Arguments: [name#0, age#1, city#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Filter [codegen id : 1]
Input [3]: [name#0, age#1, city#2]
Condition : (isnotnull(age#1) AND (age#1 > 26))




In [8]:
multiple_transformation_df = (
    df.filter(col('age') > 20)
      .select(col('name'), col('city'))
      .orderBy(col('city'))
)

In [9]:
multiple_transformation_df.show()

+-----+-------------+
| name|         city|
+-----+-------------+
|  Kim|      Chicago|
| John|      Chicago|
| Jack|     New York|
|Jerry|      Phoenix|
|  Doe|San Francisco|
+-----+-------------+



In [10]:
multiple_transformation_df.explain('formatted')

== Physical Plan ==
AdaptiveSparkPlan (6)
+- Sort (5)
   +- Exchange (4)
      +- Project (3)
         +- Filter (2)
            +- Scan ExistingRDD (1)


(1) Scan ExistingRDD
Output [3]: [name#0, age#1, city#2]
Arguments: [name#0, age#1, city#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Filter
Input [3]: [name#0, age#1, city#2]
Condition : (isnotnull(age#1) AND (age#1 > 20))

(3) Project
Output [2]: [name#0, city#2]
Input [3]: [name#0, age#1, city#2]

(4) Exchange
Input [2]: [name#0, city#2]
Arguments: rangepartitioning(city#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=56]

(5) Sort
Input [2]: [name#0, city#2]
Arguments: [city#2 ASC NULLS FIRST], true, 0

(6) AdaptiveSparkPlan
Output [2]: [name#0, city#2]
Arguments: isFinalPlan=false




### Must-Know Actions in Spark
---
- **collect()** – Retrieves the entire RDD/DataFrame to the driver. Be cautious: it can cause an out-of-memory (OOM) error on the driver if the data is too large.
- **count()** – Returns the number of elements in the dataset.
- **take(n)** – Returns the first n elements to the driver (a Python list in PySpark).
- **first()** – Returns the first element (like take(1)[0]).
- **show()** (DataFrame) – Displays the top rows in a tabular format (20 rows by default).
- **reduce(func)** (RDD) – Aggregates the dataset with a commutative and associative function (e.g., sum, max).
- **foreach(func)** – Applies a function to each element for its side effects; it runs on the executors, not the driver, and returns nothing.
- **saveAsTextFile(path)** (RDD) – Saves the RDD as text files (one part file per partition) under the specified path.

### Good-to-Know Actions in Spark
---
- **collectAsMap()** – For (key, value) RDDs; collects the pairs into a map (a dict in PySpark) on the driver.
- **countByValue()** – Returns the count of each unique value as a map on the driver.
- **takeSample(withReplacement, num)** – Returns a random sample of num elements to the driver.
- **top(n)** – Returns the n largest elements in descending order.
- **takeOrdered(n)** – Returns the n smallest elements in ascending order.
- **saveAsSequenceFile() / saveAsObjectFile()** – Specialized saving methods for RDDs.
- **foreachPartition()** – Runs a function once per partition, useful for batch inserts or for reusing one connection across many rows.

## 🔁 Spark Action Function Best Practices

---

### ✅ 1. Understanding the difference between `foreach()` and `foreachPartition()`

| Function             | Behavior                                                                 |
|----------------------|--------------------------------------------------------------------------|
| `foreach()`          | Applies a function to **each element** — one record at a time.           |
| `foreachPartition()` | Applies a function to **each partition** — one partition at a time.      |

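**Why it matters:**  
- `foreachPartition()` is usually **more efficient** for I/O-bound tasks such as writing to a database or calling external APIs, because setup costs are paid once per partition instead of once per row.
- Instead of opening a DB connection for every row (as a naive `foreach()` implementation would), you can open **one connection per partition** and reuse it for all rows in that partition.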

**✅ Use `foreachPartition()` when:**
- Writing to databases
- Performing bulk writes
- Making batched API calls

---

### ✅ 2. Knowing when to use `collect()` vs. `take()`

| Function      | Behavior                                | Risk                          |
|---------------|-----------------------------------------|-------------------------------|
| `collect()`   | Returns **all rows** to the driver      | Driver OOM if data is large   |
| `take(n)`     | Returns **only the first n rows**       | Low, as long as n is small    |

**Why it matters:**  
- `collect()` can crash your driver if your data is large.
- Use `take()` to **inspect samples** or debug pipelines without pulling full datasets.

**✅ Best practice:**
- Use `.take(5)` or `.limit(5).collect()` for small previews.
- Use `.collect()` **only** when you're **sure** the result fits in memory.

---

### ✅ 3. Using actions to trigger lazy evaluation properly

- Spark uses **lazy evaluation**: transformations like `map()`, `filter()`, and `select()` only record the operation in the plan; no data is processed until an **action** is called.
- Actions include: `show()`, `count()`, `collect()`, `foreach()`, and writes such as `df.write.save()`.

**Why it matters:**  
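- Lazy evaluation lets Spark see the whole chain of transformations and **optimize execution** (for example by pushing filters down and pruning unused columns) before running anything.
- It also means that forgetting an action leaves the work undone, and calling several actions on the same uncached DataFrame recomputes it each time.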

**✅ Remember:**
Transformations define the logic → **Actions trigger execution**

---

### ✅ 4. Choosing between `reduce()` and `aggregate()` for custom distributed aggregations

| Function      | Purpose                                                                    | Customization                    |
|---------------|----------------------------------------------------------------------------|----------------------------------|
| `reduce()`    | Aggregates using a commutative, associative function (e.g., sum, max)      | ❌ No zero value or type change  |
| `aggregate()` | Aggregates using two functions: one within partitions, one across them     | ✅ Zero value, result type may differ |

**Why it matters:**  
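- `reduce()` is simple, but it raises an error on an empty RDD and its result must have the same type as the elements.
- `aggregate()` lets you:
  - Define an **initial zero value**, which is also what you get back for an empty RDD.
  - Use **different logic** for combining values within a partition and for merging results across partitions.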

**✅ Use `aggregate()` when:**
- You need to return a different type (e.g., average = sum/count).
- You need robust, custom aggregation logic across partitions.

## Transformations in Spark

**Narrow Transformation:** A transformation where each partition of the parent RDD is used by only one partition of the child RDD. No data shuffle.

**Wide Transformation:** A transformation where data from multiple partitions may be required to compute a single partition of the child RDD. Involves shuffle.

#### Examples of Transformations
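- **map()** (narrow)
- **filter()** (narrow)
- **select()** (narrow)
- **withColumn()** (narrow)
- **groupBy()** followed by an aggregation (wide)
- **join()** (wide in general; a broadcast join avoids shuffling the large side)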
---
- Spark builds a lineage (DAG) of transformations and only evaluates it when an action is called.
- Skilled Spark engineers optimize performance by minimizing wide transformations and strategically placing actions.

In [11]:
# Example of Wide Transformation
# (`max` here is pyspark.sql.functions.max, not Python's built-in)

wide_transformation_df = df.groupBy('city').agg(max('age'))

In [12]:
wide_transformation_df.show()

+-------------+--------+
|         city|max(age)|
+-------------+--------+
|      Chicago|      32|
|     New York|      35|
|San Francisco|      28|
|      Phoenix|      42|
+-------------+--------+



In [13]:
wide_transformation_df.explain(True)

== Parsed Logical Plan ==
'Aggregate ['city], ['city, max('age) AS max(age)#34]
+- LogicalRDD [name#0, age#1, city#2], false

== Analyzed Logical Plan ==
city: string, max(age): int
Aggregate [city#2], [city#2, max(age#1) AS max(age)#34]
+- LogicalRDD [name#0, age#1, city#2], false

== Optimized Logical Plan ==
Aggregate [city#2], [city#2, max(age#1) AS max(age)#34]
+- Project [age#1, city#2]
   +- LogicalRDD [name#0, age#1, city#2], false

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[city#2], functions=[max(age#1)], output=[city#2, max(age)#34])
   +- Exchange hashpartitioning(city#2, 200), ENSURE_REQUIREMENTS, [plan_id=115]
      +- HashAggregate(keys=[city#2], functions=[partial_max(age#1)], output=[city#2, max#44])
         +- Project [age#1, city#2]
            +- Scan ExistingRDD[name#0,age#1,city#2]



### Repartition vs Coalesce

**Coalesce:**

coalesce() is a narrow transformation that reduces the number of partitions by merging existing partitions, without triggering a full shuffle. It cannot increase the partition count.
coalesce() is usually cheaper than repartition() when you are only reducing partitions, which makes it a common choice just before a write to avoid producing many small files. The trade-off is that the merged partitions can be uneven in size, and a drastic reduction such as coalesce(1) can also lower the parallelism of the stages feeding it.

**Repartition:**

repartition() is a wide transformation that increases or decreases the number of partitions by performing a full shuffle of the data across the cluster.
Unlike coalesce(), repartition(n) spreads rows roughly evenly across all partitions (round-robin, as the plan below shows), which helps before joins, caching, or other expensive operations where load balancing improves performance. The cost is the shuffle itself.

| Scenario                                                    | Partition Source                  | Number of Partitions                            |
|-------------------------------------------------------------|-----------------------------------|-------------------------------------------------|
| Creating a DataFrame from a Python list (`createDataFrame`) | `sparkContext.defaultParallelism` | Number of cores in local mode (12 on this machine) |
| After a wide transformation (`groupBy`, `join`, etc.)       | `spark.sql.shuffle.partitions`    | 200 by default (AQE may coalesce them at runtime) |

In [14]:
spark.conf.get('spark.sql.shuffle.partitions')

'200'

In [15]:
df.rdd.getNumPartitions()

12

In [16]:
df = df.repartition(3)
df.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange RoundRobinPartitioning(3), REPARTITION_BY_NUM, [plan_id=127]
   +- Scan ExistingRDD[name#0,age#1,city#2]




In [17]:
df.rdd.getNumPartitions()

3

In [18]:
df = df.coalesce(1)
df.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Coalesce 1
   +- Exchange RoundRobinPartitioning(3), REPARTITION_BY_NUM, [plan_id=146]
      +- Scan ExistingRDD[name#0,age#1,city#2]




In [19]:
df.rdd.getNumPartitions()

1

In [20]:
df.show()

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|  Kim| 32|      Chicago|
|Jerry| 42|      Phoenix|
| John| 24|      Chicago|
| Jack| 35|     New York|
|  Doe| 28|San Francisco|
+-----+---+-------------+

