# Module 06: Introduction to Apache Spark

**Estimated Time:** 60-75 minutes

## Learning Objectives

By the end of this module, you will:
- Understand what Apache Spark is and when to use it
- Learn Spark architecture (driver, executors, cluster)
- Work with PySpark DataFrames
- Understand transformations vs actions
- Perform common data operations with Spark
- Know when to use Spark vs pandas

---

## 1. What is Apache Spark?

### Definition

**Apache Spark** is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python (PySpark), and R.

### Why Use Spark?

1. **Scale**: Process terabytes to petabytes of data
2. **Speed**: In-memory computing (the project has cited speedups of up to 100x over Hadoop MapReduce for workloads that fit in memory)
3. **Distributed**: Automatically distributes data and computation
4. **Versatile**: Batch, streaming, ML, graph processing
5. **Fault-tolerant**: Automatic recovery from failures

### When to Use Spark vs pandas?

| Criteria | Use pandas | Use Spark |
|----------|------------|----------|
| **Data Size** | < 10GB | > 10GB |
| **Memory** | Fits in RAM | Doesn't fit in RAM |
| **Processing** | Single machine | Distributed cluster |
| **Speed Need** | Fast enough | Need distributed speed |
| **Complexity** | Simple operations | Complex transformations |
| **Cost** | Low | Higher (cluster costs) |

**Rule of thumb**: Start with pandas, move to Spark when you outgrow it.
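
As a rough illustration, the rule of thumb can be folded into a small helper. This is a sketch, not a sizing tool: the name `recommend_engine`, the `ram_headroom` factor of 3, and the use of the table's 10 GB figure as a hard cutoff are all assumptions made for the example. The headroom factor reflects that pandas can need several times a dataset's size in memory while working on it.

```python
GB = 1024 ** 3


def recommend_engine(data_bytes, ram_bytes, cutoff_bytes=10 * GB, ram_headroom=3):
    """Return "pandas" or "spark" for a dataset of the given size.

    cutoff_bytes and ram_headroom are illustrative assumptions, not
    settings of pandas or Spark.
    """
    # Require spare memory beyond the raw data size before choosing pandas.
    fits_in_ram = data_bytes * ram_headroom <= ram_bytes
    if data_bytes < cutoff_bytes and fits_in_ram:
        return "pandas"
    return "spark"


print(recommend_engine(2 * GB, 16 * GB))   # small dataset, plenty of RAM
print(recommend_engine(50 * GB, 64 * GB))  # above the 10 GB cutoff
```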

In [None]:
# Check if PySpark is installed
try:
    import pyspark

    print(f"[OK] PySpark version: {pyspark.__version__}")
except ImportError:
    print("[FAIL] PySpark not installed")
    print("Install with: pip install pyspark")

In [None]:
# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import numpy as np

# Create Spark session
spark = (
    SparkSession.builder.appName("DataEngineeringFundamentals")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

print("[OK] Spark session created")
print(f"   Spark version: {spark.version}")
print(f"   Master: {spark.sparkContext.master}")

---

## 2. Spark Architecture

### Components

```
┌─────────────────────────────────────────┐
│         Driver Program                  │
│  (SparkContext, SparkSession)           │
└─────────────────┬───────────────────────┘
                  │
        ┌─────────┴─────────┐
        │  Cluster Manager  │
        │  (YARN, Mesos,    │
        │   Standalone)     │
        └─────────┬─────────┘
                  │
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼────┐    ┌───▼────┐    ┌───▼────┐
│Executor│    │Executor│    │Executor│
│ Tasks  │    │ Tasks  │    │ Tasks  │
│ Cache  │    │ Cache  │    │ Cache  │
└────────┘    └────────┘    └────────┘
```

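- **Driver**: Runs the application code, plans the work, and schedules tasks
- **Executors**: Worker processes that run tasks and cache data
- **Cluster Manager**: Allocates resources to the application (for example YARN, Kubernetes, or Spark standalone)
- **Tasks**: Units of work the driver sends to executors, typically one per data partition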
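
To connect these roles to a running session, the sketch below collects a few values the driver exposes. The helper name `describe_parallelism` is hypothetical; `sparkContext.master`, `sparkContext.defaultParallelism`, and `rdd.getNumPartitions()` are standard PySpark attributes. It assumes an existing `SparkSession` and Spark DataFrame are passed in.

```python
def describe_parallelism(spark, df):
    """Summarize how work on `df` would be split up.

    `spark` is an existing SparkSession and `df` a Spark DataFrame;
    nothing here starts a cluster or runs a job on the data.
    """
    sc = spark.sparkContext
    return {
        # Where the driver connects, e.g. "local[*]" or a cluster URL.
        "master": sc.master,
        # Default level of parallelism used when none is specified.
        "default_parallelism": sc.defaultParallelism,
        # Number of partitions backing df, which bounds tasks per stage.
        "num_partitions": df.rdd.getNumPartitions(),
    }
```

Once a DataFrame exists, for instance `df_spark` from the next section, it could be called as `describe_parallelism(spark, df_spark)`.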

---

## 3. Creating Spark DataFrames

In [None]:
# Method 1: From Python list
data = [
    (1, "Alice", 25, "USA", 50000),
    (2, "Bob", 30, "UK", 60000),
    (3, "Carol", 28, "Canada", 55000),
    (4, "David", 35, "Australia", 65000),
    (5, "Eve", 32, "USA", 70000),
]

columns = ["id", "name", "age", "country", "salary"]

df_spark = spark.createDataFrame(data, columns)
print("Created Spark DataFrame from list")
df_spark.show()

In [None]:
# Method 2: From pandas DataFrame
df_pandas = pd.DataFrame(
    {
        "product_id": ["P001", "P002", "P003", "P004"],
        "product_name": ["Laptop", "Mouse", "Keyboard", "Monitor"],
        "price": [999.99, 29.99, 79.99, 299.99],
        "stock": [50, 200, 150, 75],
    }
)

df_spark_from_pandas = spark.createDataFrame(df_pandas)
print("Created Spark DataFrame from pandas")
df_spark_from_pandas.show()

In [None]:
# Method 3: Read from file
# Create sample CSV first
sample_csv_data = pd.DataFrame(
    {
        "order_id": range(1, 11),
        "customer_id": np.random.randint(1, 6, 10),
        "amount": np.random.uniform(100, 1000, 10).round(2),
        "date": pd.date_range("2024-01-01", periods=10),
    }
)

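# Create the target directory if it does not exist yet
import os

os.makedirs("../data/raw", exist_ok=True)
sample_csv_data.to_csv("../data/raw/spark_orders.csv", index=False)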

# Read with Spark
df_spark_csv = spark.read.csv(
    "../data/raw/spark_orders.csv",
    header=True,
    inferSchema=True,  # Automatically detect data types
)

print("Read CSV with Spark:")
df_spark_csv.show(5)
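
`inferSchema=True` makes Spark take an extra pass over the file to guess column types. When the columns are known in advance, an explicit schema avoids that pass. The sketch below assumes the four columns written by the cell above; the names `ORDERS_SCHEMA` and `read_orders` are illustrative. The schema is given as a DDL string, which `spark.read.csv` accepts in place of a `StructType`.

```python
# Column names and types for spark_orders.csv, written as a DDL string.
# Backticks around `date` keep the column name distinct from the DATE type.
ORDERS_SCHEMA = "order_id INT, customer_id INT, amount DOUBLE, `date` DATE"


def read_orders(spark, path):
    """Read the orders CSV with a fixed schema instead of inferSchema."""
    return spark.read.csv(path, header=True, schema=ORDERS_SCHEMA)
```

With the session from this notebook, `read_orders(spark, "../data/raw/spark_orders.csv").printSchema()` would display the declared types.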

---

## 4. Transformations vs Actions

### Key Concept: Lazy Evaluation

Spark uses **lazy evaluation**: transformations are not executed immediately. Spark records them in a query plan and runs that plan only when an action requests a result.

**Transformations** (lazy):
- Create new DataFrames
- Not executed until an action is called
- Examples: `select()`, `filter()`, `groupBy()`, `join()`

**Actions** (eager):
- Trigger computation
- Return results to the driver or write them to storage
- Examples: `show()`, `count()`, `collect()`, `write.parquet()`

```
df.filter()      # Transformation (lazy)
  .select()      # Transformation (lazy)
  .groupBy()     # Transformation (lazy)
  .count()       # Still lazy: count() on grouped data returns a DataFrame
  .show()        # ACTION (triggers execution)
```
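
The idea can be modelled in a few lines of plain Python. This is a toy analogy, not how Spark is implemented: the `LazyRows` class and its methods are invented for illustration. Transformations only append a step to a plan, and the `collect()` action runs the whole plan.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: records steps, runs them on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    def filter(self, predicate):
        # Transformation: return a new object with one more planned step.
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def select(self, *keys):
        # Transformation: also lazy, nothing is computed here.
        def project(row):
            return {k: row[k] for k in keys}

        return LazyRows(self._rows, self._plan + (("map", project),))

    def collect(self):
        # Action: only now is the plan applied to the data.
        rows = self._rows
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
query = LazyRows(people).filter(lambda r: r["age"] > 28).select("name")
print(len(query._plan))  # steps planned so far, none executed yet
print(query.collect())   # the action runs every planned step
```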

In [None]:
# Transformations (lazy - not executed yet)
df_filtered = df_spark.filter(df_spark.age > 28)  # Lazy
df_selected = df_filtered.select("name", "age", "salary")  # Lazy

print("Transformations defined but not executed yet")

# Action (triggers execution)
print("\nNow executing with show() action:")
df_selected.show()  # Action - NOW it executes!

---

## 5. Common Spark Operations

In [None]:
# Select columns
df_spark.select("name", "salary").show(3)

# Filter rows
df_spark.filter(df_spark.salary > 60000).show()

# Add new column
df_with_bonus = df_spark.withColumn("bonus", df_spark.salary * 0.1)
df_with_bonus.show(3)

In [None]:
# Group by and aggregate
df_grouped = df_spark.groupBy("country").agg(
    F.count("*").alias("employee_count"),
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary"),
)

df_grouped.show()

In [None]:
# Sort
df_spark.orderBy(F.desc("salary")).show()

# Drop duplicates (keeps one arbitrary row per country)
df_spark.dropDuplicates(["country"]).show()

# Rename columns
df_renamed = df_spark.withColumnRenamed("salary", "annual_salary")
df_renamed.show(3)

---

## 6. SQL Queries with Spark

In [None]:
# Register DataFrame as temporary view
df_spark.createOrReplaceTempView("employees")

# Query with SQL
result = spark.sql(
    """
    SELECT country, 
           COUNT(*) as count,
           AVG(salary) as avg_salary
    FROM employees
    WHERE age > 28
    GROUP BY country
    ORDER BY avg_salary DESC
"""
)

result.show()
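
The same query can also be written with the DataFrame API; both forms are planned by the same optimizer. The sketch below mirrors the SQL above. The function name `avg_salary_by_country` and its `min_age` parameter are illustrative, and the import sits inside the function only to keep the snippet self-contained.

```python
def avg_salary_by_country(df, min_age=28):
    """DataFrame-API version of the employees query.

    `df` is expected to have the columns age, country, and salary.
    """
    from pyspark.sql import functions as F

    return (
        df.filter(F.col("age") > min_age)        # WHERE age > 28
        .groupBy("country")                      # GROUP BY country
        .agg(
            F.count("*").alias("count"),         # COUNT(*) as count
            F.avg("salary").alias("avg_salary"),  # AVG(salary) as avg_salary
        )
        .orderBy(F.desc("avg_salary"))           # ORDER BY avg_salary DESC
    )
```

Calling `avg_salary_by_country(df_spark).show()` should produce the same rows as `result.show()`.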

---

## 7. Spark vs pandas Performance Comparison

In [None]:
import time

# Create larger dataset
large_data = pd.DataFrame(
    {
        "id": range(100000),
        "value": np.random.randn(100000),
        "category": np.random.choice(["A", "B", "C", "D"], 100000),
    }
)

# pandas timing
start = time.time()
pandas_result = large_data.groupby("category")["value"].mean()
pandas_time = time.time() - start

# Spark timing
spark_df = spark.createDataFrame(large_data)
start = time.time()
spark_result = spark_df.groupBy("category").avg("value").collect()
spark_time = time.time() - start

print(f"pandas time: {pandas_time:.4f}s")
print(f"Spark time: {spark_time:.4f}s")
print("\nNote: For small data (<1M rows), pandas is often faster due to less overhead")
print("Spark shines with datasets that don't fit in a single machine's memory!")

---

## 8. Reading and Writing Data with Spark

In [None]:
# Write to Parquet (optimized columnar format)
df_spark.write.parquet(
    "../data/processed/spark_employees.parquet",
    mode="overwrite",  # overwrite, append, error, ignore
)

print("[OK] Written to Parquet")

# Read from Parquet
df_read = spark.read.parquet("../data/processed/spark_employees.parquet")
print("\nRead from Parquet:")
df_read.show(3)

In [None]:
# Write partitioned data
df_spark.write.partitionBy("country").parquet(
    "../data/processed/spark_employees_partitioned", mode="overwrite"
)

print("[OK] Written partitioned Parquet")
print("   Data is partitioned by country")
print("   Each country value gets its own folder, e.g. country=USA/")
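
Partitioned output uses one directory per distinct value, named `column=value`. The sketch below lists those directories with the standard library so the layout can be inspected without Spark; `list_partitions` is a hypothetical helper name.

```python
from pathlib import Path


def list_partitions(root, column):
    """Return the partition values found under `root` for `column`.

    Looks for sub-directories named like "country=USA"; marker files
    such as _SUCCESS are ignored.
    """
    prefix = f"{column}="
    return sorted(
        p.name[len(prefix):]
        for p in Path(root).iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )
```

After the write above, `list_partitions("../data/processed/spark_employees_partitioned", "country")` should return the four country names. Reading the directory back with `spark.read.parquet(...)` restores `country` as a column, and a filter on it lets Spark skip the other directories.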

---

## 9. Key Takeaways

[OK] **Spark Purpose**: Distributed data processing at scale

[OK] **When to Use**: Data > 10GB, distributed processing needed

[OK] **Architecture**: Driver + Executors + Cluster Manager

[OK] **Lazy Evaluation**: Transformations are lazy, actions trigger execution

[OK] **DataFrames**: Similar API to pandas, but distributed

[OK] **SQL Support**: Can query DataFrames with SQL

[OK] **File Formats**: Parquet is preferred for big data

### When to Use What?

- **pandas**: < 10GB, single machine, rapid prototyping
- **Spark**: > 10GB, distributed processing, production scale
- **Start with pandas, scale to Spark when needed**

---

## Next Steps

In **Module 07: Workflow Orchestration with Airflow**, we'll learn:
- What is workflow orchestration
- Apache Airflow concepts (DAGs, Operators, Tasks)
- Scheduling and dependencies
- Monitoring and alerting

---

**Ready to orchestrate pipelines?** Open `07_workflow_orchestration_airflow.ipynb`!

In [None]:
# Clean up
spark.stop()
print("[OK] Spark session stopped")