# Lesson 6: Spark Fundamentals

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 2-3 hours  
**Difficulty**: Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand Distributed Computing concepts  
✅ Master the difference between **Transformations** and **Actions**  
✅ Explain **Lazy Evaluation** in Spark  
✅ Know when to switch from Pandas to PySpark  

---

## 📚 Table of Contents

1. [The Limit of Pandas](#1-pandas-limit)
2. [Spark Architecture](#2-spark-arch)
3. [Core Concepts: Lazy Evaluation](#3-lazy-eval)
4. [Hands-On: PySpark API Simulation](#4-hands-on)
5. [Interview Preparation](#5-interview-questions)

---

## 1. The Limit of Pandas

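**Pandas**:
- Runs on a single machine.
- Loads the entire dataset into RAM.
- If the dataset does not fit in RAM, operations typically fail with a `MemoryError` (or the OS kills the process).

**Spark**:
- Runs on N machines (a cluster).
- Splits data into chunks (partitions) and processes them in parallel, spilling to disk when memory runs short.
- Can scale to petabytes of data.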
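
The partition idea can be sketched in plain Python, without Spark. This is only an illustration under assumptions: the dataset is a hypothetical stream of integers, and `iter_partitions` and `partial_sum_and_count` are made-up helper names. Each chunk is reduced to a small partial result, and only those partials are combined at the end.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_partitions(rows: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    chunk: List[int] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly smaller, partition
        yield chunk


def partial_sum_and_count(chunk: List[int]) -> Tuple[int, int]:
    """Work done independently on each partition."""
    return sum(chunk), len(chunk)


# Hypothetical dataset: the integers 0..999_999, produced lazily by range().
partials = [
    partial_sum_and_count(p) for p in iter_partitions(range(1_000_000), 100_000)
]

# Combine the small per-partition results into the final answer.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
print(len(partials), mean)  # 10 499999.5
```

Spark applies the same pattern, except that the partitions are spread across machines and processed in parallel.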

## 2. Spark Architecture

1. **Driver**: The brain. Runs your `main()` function, builds the execution plan, and schedules tasks.
2. **Cluster Manager**: Allocates resources (e.g., YARN, Kubernetes, or Spark's standalone manager).
3. **Executors**: The workers. They hold data partitions and run tasks.

**Key Idea**: You simply write code on the Driver, and Spark automatically sends code to where the data lives (Executors).
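
As a rough mental model only (not how Spark is implemented), the driver/executor split can be imitated with a thread pool. The partitions, the `task` function, and the pool size below are all assumptions made for illustration. The "driver" ships one function to each partition and merges the small results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions, each imagined to live on a different executor.
executor_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def task(partition):
    """The code the driver ships out: it runs against one partition only."""
    return sum(x * x for x in partition)


# "Driver": schedules one task per partition, then merges the small results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(task, executor_partitions))

result = sum(partial_results)
print(partial_results, result)  # [14, 77, 194] 285
```

Only the function and the per-partition results travel between "driver" and "executors"; the partition data itself stays where it is.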

## 3. Core Concepts: Lazy Evaluation

In Pandas, every line executes immediately.
In Spark, nothing happens until you ask for a result.

### Transformations (Lazy)
- `df.filter()`, `df.select()`, `df.groupBy()`
- Spark just records the "Plan" (DAG).

### Actions (Eager)
- `df.count()`, `df.show()`, `df.collect()`, `df.write.parquet(path)` (note that `write` is an attribute, not a method)
- Spark optimizes the Plan (Catalyst Optimizer) and executes it.

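**Why?** Optimization. Because Spark sees the whole plan before running anything, it can reorder and prune work. If you filter 1 TB of data and then take 5 rows with `limit(5)`, Spark can push the filter toward the data source and stop scanning once it has 5 matching rows, instead of materializing the full filtered result.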
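
Python generators give a small-scale analogy for this behavior. The sketch below is an assumption-laden toy: `read_source` is a hypothetical stand-in for a large data source, and `rows_read` is a counter added to show how many rows were actually touched.

```python
from itertools import islice

rows_read = 0  # counts how many source rows were actually touched


def read_source(n):
    """Stand-in for a large data source; yields rows one at a time."""
    global rows_read
    for i in range(n):
        rows_read += 1
        yield {"id": i, "age": i % 100}


# "Transformations": building the pipeline reads nothing.
source = read_source(1_000_000)
adults = (row for row in source if row["age"] > 21)
ids = (row["id"] for row in adults)
rows_before_action = rows_read

# "Action": pulling 5 results drives just enough of the pipeline.
first_five = list(islice(ids, 5))
print(rows_before_action, first_five, rows_read)  # 0 [22, 23, 24, 25, 26] 27
```

Only 27 of the 1,000,000 rows are read. Spark's optimizer is far more sophisticated than a generator chain, but the principle of deferring work until a result is requested is the same.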

## 4. Hands-On: PySpark API Simulation

```python
# NOTE: In real code you would create a pyspark.sql.SparkSession.
# Here we simulate the syntax to learn the API structure.

print("---- PySpark Simulation ----")

class MockDataFrame:
    def __init__(self, plan=None):
        # Avoid a mutable default argument; each instance gets its own plan list.
        self.plan = list(plan) if plan is not None else []

    def filter(self, condition):
        # Transformation: record the step and return a new DataFrame.
        print(f"[Transform] Added Filter: {condition}")
        return MockDataFrame(self.plan + [f"Filter({condition})"])

    def select(self, *cols):
        # Transformation: record the step and return a new DataFrame.
        print(f"[Transform] Added Select: {cols}")
        return MockDataFrame(self.plan + [f"Select({cols})"])

    def count(self):
        # Action: this is the point where the recorded plan would run.
        print("\n[Action] Triggered Count!")
        print("Optimizing Plan...")
        print(f"Executing: {' -> '.join(self.plan)}")
        return 100  # placeholder value; no real data is counted

# 1. Read Data (Lazy)
df = MockDataFrame(["Read(data.parquet)"])

# 2. Transformations (Lazy - nothing calculates yet)
df_filtered = df.filter("age > 21")
df_final = df_filtered.select("name", "age")

print("\nHas any data been touched yet? NO.")

# 3. Action (Eager)
result = df_final.count()
```

## 5. Interview Preparation

### Common Questions

#### Q1: "What is the difference between `map` and `reduce`?"
**Answer**: "`map` transforms elements individually (1-to-1). `reduce` aggregates elements (many-to-1). In Spark, `reduceByKey` is an efficient way to aggregate distributed data because it combines values within each partition before shuffling."
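
A pure-Python sketch of the distinction, using the built-in `map` and `functools.reduce`. The `reduce_by_key` helper is a hypothetical stand-in for the idea behind Spark's `reduceByKey`, not its implementation, and the word list is made-up sample data.

```python
from collections import defaultdict
from functools import reduce

words = ["spark", "pandas", "spark", "dag", "spark", "dag"]

# map: one output per input element (1-to-1).
pairs = list(map(lambda w: (w, 1), words))

# reduce: fold many values into one (many-to-1).
total = reduce(lambda a, b: a + b, (n for _, n in pairs))


def reduce_by_key(pairs, fn):
    """Group values by key, then reduce each group to a single value."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}


counts = reduce_by_key(pairs, lambda a, b: a + b)
print(total, counts)  # 6 {'spark': 3, 'pandas': 1, 'dag': 2}
```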

#### Q2: "Explain Wide vs Narrow Dependencies."
**Answer**: 
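- **Narrow**: Each output partition depends on a single input partition, so data stays where it is (e.g., `filter`, `map`). Fast.
- **Wide**: Each output partition depends on many input partitions, so data must be shuffled across the network between executors (e.g., `groupBy`, and `join` unless one side is broadcast). Slow. Shuffles are often the bottleneck in Spark jobs.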
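
The difference can be made concrete with a toy simulation. Everything here is an assumption for illustration: the two-partition layout, the `shuffle` helper, and the `ord(key) % num_out` partitioner (Spark uses its own hash partitioner).

```python
# Hypothetical layout: two partitions of (key, value) records.
parts = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5), ("c", 6)],
]

# Narrow: each output partition is computed from exactly one input partition.
filtered = [[(k, v) for k, v in part if v > 1] for part in parts]


def shuffle(parts, num_out):
    """Wide: route every record to the partition that owns its key."""
    out = [[] for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(parts):
        for k, v in part:
            dest = ord(k) % num_out  # deterministic toy partitioner
            if dest != src:
                moved += 1  # this record would cross the network
            out[dest].append((k, v))
    return out, moved


shuffled, moved = shuffle(filtered, num_out=2)

# After the shuffle, all records for a key sit together and can be summed locally.
sums = []
for part in shuffled:
    acc = {}
    for k, v in part:
        acc[k] = acc.get(k, 0) + v
    sums.append(acc)

print(moved, sums)  # 2 [{'b': 6}, {'a': 8, 'c': 6}]
```

The filter touched no other partition, while the grouping step had to move records before any key could be aggregated.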

#### Q3: "What is a Broadcast Variable?"
**Answer**: "A broadcast variable is a read-only value that Spark copies once to every executor. If I have a huge table and a tiny dictionary, instead of doing a full shuffle join, I broadcast (send a copy of) the tiny dictionary to every executor's RAM. Then map-side joins can happen locally, without shuffling the huge table across the network."
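
A minimal pure-Python sketch of the map-side join idea. The order data, the `country_names` lookup, and the `join_partition` helper are all hypothetical; in PySpark the copying would be handled by the broadcast mechanism rather than by `dict(...)`.

```python
# Hypothetical data: a large table split into partitions, plus a tiny lookup.
order_partitions = [
    [(1, "US", 30.0), (2, "DE", 12.5)],
    [(3, "US", 7.0), (4, "FR", 99.0)],
]
country_names = {"US": "United States", "DE": "Germany"}  # small enough to copy


def join_partition(partition, lookup):
    """Map-side join: each partition joins against its own local copy."""
    return [
        (order_id, lookup.get(code, "unknown"), amount)
        for order_id, code, amount in partition
    ]


# "Broadcast": every partition's task receives a copy of the small dict,
# so no order record has to leave its partition.
joined = [join_partition(p, dict(country_names)) for p in order_partitions]
print(joined[1])  # [(3, 'United States', 7.0), (4, 'unknown', 99.0)]
```

The trade-off is memory: the lookup is duplicated on every executor, so this only pays off when the broadcast side is small.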