# Module 13: Spark on Clusters

**Difficulty**: ⭐⭐⭐  
**Estimated Time**: 70 minutes  
**Prerequisites**: 
- [Module 01: PySpark Setup and Spark Session](01_pyspark_setup_and_spark_session.ipynb)
- [Module 12: Performance Optimization](12_performance_optimization.ipynb)
- Understanding of distributed computing concepts

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand Spark cluster architecture (Driver, Executors, Cluster Manager)
2. Compare deployment modes (Local, Standalone, YARN, Kubernetes, Mesos)
3. Configure resource allocation (cores, memory, executors) for optimal performance
4. Submit Spark applications using spark-submit with appropriate configurations
5. Monitor and troubleshoot Spark jobs using the web UI and logs

## 1. Spark Cluster Architecture

**Components:**

1. **Driver Program**:
   - Runs the main() function
   - Creates SparkContext
   - Converts user code to tasks
   - Schedules tasks on executors
   - Collects results

2. **Cluster Manager**:
   - Allocates resources across applications
   - Types: Standalone, YARN, Kubernetes, Mesos (deprecated since Spark 3.2)
   - Starts and stops executors

3. **Executors**:
   - Run on worker nodes
   - Execute tasks assigned by driver
   - Store data in memory/disk
   - Return results to driver

4. **Tasks**:
   - Unit of work sent to executors
   - One task per partition, per stage
   - Multiple tasks run in parallel
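
Because each partition becomes one task, the partition count and the number of task slots (executors × cores per executor) together decide how many rounds ("waves") of tasks a stage needs. The sketch below is plain Python, not a Spark API; the function name `task_waves` and the example numbers are illustrative assumptions.

```python
import math

def task_waves(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Return how many rounds of tasks a stage needs.

    Assumes one task per partition and one core per task (spark.task.cpus=1).
    """
    slots = num_executors * cores_per_executor  # tasks that can run at the same time
    return math.ceil(num_partitions / slots)

# 200 partitions on 10 executors with 5 cores each: 50 slots
print(task_waves(200, 10, 5))  # → 4
# One extra partition forces a fifth, almost empty wave
print(task_waves(201, 10, 5))  # → 5
```

A partition count slightly above a multiple of the slot count leaves most cores idle during the last wave, which is one reason to pick partition counts with the cluster size in mind.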

**Execution Flow:**

```
User Code → Driver → Cluster Manager → Executors → Tasks
                ↓                                      ↓
            Schedule                              Results
                ↓                                      ↓
           Collect Results ←←←←←←←←←←←←←←←←←←←←←←←←←←←←
```

**Memory Structure:**

Each executor has:
- **Execution Memory**: For shuffles, joins, sorts, aggregations
- **Storage Memory**: For cached data
- **User Memory**: For user data structures
- **Reserved Memory**: For Spark internal use

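Default split: Spark reserves 300MB of the executor heap. Of the remainder, `spark.memory.fraction` (default 0.6) is shared by execution and storage, and the other 40% is user memory. Within the shared region, `spark.memory.storageFraction` (default 0.5) is the share of cached data that is protected from eviction; execution and storage can borrow from each other beyond that boundary.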
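
To make the memory regions concrete, the following sketch computes approximate region sizes for a given executor heap. It is a simplified model in plain Python: the function name is hypothetical, the 300MB reservation and the two fractions mirror `spark.memory.fraction` and `spark.memory.storageFraction`, and the storage/execution boundary is treated as fixed even though Spark lets the two borrow from each other at runtime.

```python
RESERVED_MB = 300  # fixed amount Spark sets aside for internal use

def executor_memory_regions(heap_mb: float,
                            memory_fraction: float = 0.6,
                            storage_fraction: float = 0.5) -> dict:
    """Approximate the memory regions of one executor, in MB."""
    usable = heap_mb - RESERVED_MB            # heap left after the reservation
    unified = usable * memory_fraction        # shared by execution and storage
    storage = unified * storage_fraction      # cached data protected from eviction
    return {
        "reserved_mb": RESERVED_MB,
        "unified_mb": round(unified, 1),
        "storage_mb": round(storage, 1),
        "execution_mb": round(unified - storage, 1),
        "user_mb": round(usable - unified, 1),
    }

# Example: a 4GB executor heap (spark.executor.memory=4g)
regions = executor_memory_regions(4096)
for name, mb in regions.items():
    print(f"{name:<13}: {mb}")
```

For a 4GB heap this leaves roughly 2.2GB for execution plus storage, which is noticeably less than the configured executor memory.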

## 2. Deployment Modes

### Local Mode

**Characteristics:**
- Runs on single machine
- Simulates cluster with threads
- Great for development and testing
- Limited by single machine resources

**Syntax:**
- `local`: 1 thread
- `local[4]`: 4 threads
- `local[*]`: All available cores

**When to use:**
- Development and debugging
- Small datasets that fit on one machine
- Testing code before cluster deployment

### Standalone Mode

**Characteristics:**
- Spark's built-in cluster manager
- Simple to set up
- Good for dedicated Spark clusters
- No resource sharing with other frameworks

**Components:**
- Master process: Manages cluster
- Worker processes: Run executors
- Drivers: Submit applications

**Start commands:**
```bash
# Start master
./sbin/start-master.sh

# Start worker
./sbin/start-worker.sh spark://master-host:7077
```

### YARN Mode

**Characteristics:**
- Hadoop's resource manager
- Share cluster with MapReduce, Hive, etc.
- Enterprise-grade resource management
- Long the most common choice for on-premises Hadoop clusters

**Modes:**
- **Client mode**: Driver runs on client machine
- **Cluster mode**: Driver runs inside the YARN ApplicationMaster on the cluster

**When to use:**
- Hadoop ecosystem integration
- Multi-tenant clusters
- Enterprise deployments

### Kubernetes Mode

**Characteristics:**
- Container orchestration
- Cloud-native deployments
- Dynamic scaling
- Integration with modern DevOps tools

**Benefits:**
- Resource isolation
- Auto-scaling
- Easy updates and rollbacks
- Cloud provider support

**When to use:**
- Cloud deployments (AWS EKS, GCP GKE, Azure AKS)
- Microservices architecture
- Modern CI/CD pipelines

In [None]:
# In this notebook, we'll demonstrate concepts in local mode
# But the principles apply to all cluster modes

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, sum as spark_sum
import time

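# Create SparkSession in local mode
# local[*] means use all available cores
# Note: in local mode the executor runs inside the driver JVM, so
# spark.executor.memory has no effect here; it is set only for illustration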
spark = SparkSession.builder \
    .appName("Spark on Clusters") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")
print(f"App Name: {spark.sparkContext.appName}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")

## 3. Resource Allocation

**Key Resources:**

1. **Number of Executors**
2. **Cores per Executor**
3. **Memory per Executor**
4. **Driver Memory**

**Configuration Parameters:**

```bash
--num-executors <N>           # Number of executors (YARN and Kubernetes)
--executor-cores <N>          # Cores per executor
--executor-memory <MEM>       # Memory per executor (e.g., 4g)
--driver-cores <N>            # Cores for driver (cluster deploy mode only)
--driver-memory <MEM>         # Memory for driver
--total-executor-cores <N>    # Total cores across all executors (Standalone)
```

**Resource Calculation Example:**

Cluster: 10 nodes, 16 cores/node, 64GB RAM/node

**Conservative Approach:**
- Leave 1 core and 1GB for OS/services per node
- Available: 15 cores, 63GB per node

**Option 1: More Executors (better parallelism)**
- Executors per node: 3
- Cores per executor: 5
- Memory per executor: 19GB (63/3 = 21GB per executor slot, minus roughly 10% for off-heap memory overhead)
- Total executors: 30 (3 * 10)
- Total cores: 150 (30 * 5)

**Option 2: Fewer, Larger Executors**
- Executors per node: 1
- Cores per executor: 15
- Memory per executor: 60GB
- Total executors: 10
- Total cores: 150

**Recommendation:**
- **Cores per executor**: 4-6 (sweet spot for HDFS throughput)
- **Memory per executor**: 8-16GB (not too large to avoid GC issues)
- **Executors**: More smaller executors > fewer large ones
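
The arithmetic above can be wrapped in a small helper so that different cluster shapes are easy to compare. This is a sketch under stated assumptions: the function name and return keys are made up for illustration, and the 10% overhead factor mirrors the default executor memory overhead (the larger of 384MB and 10% of executor memory) rather than a measured value.

```python
import math

def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_ram_gb=1, overhead_fraction=0.10):
    """Suggest executor sizing for a homogeneous cluster."""
    usable_cores = cores_per_node - reserved_cores
    usable_ram_gb = ram_gb_per_node - reserved_ram_gb
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("cores_per_executor exceeds the usable cores per node")
    # Each executor slot must hold the JVM heap plus the off-heap overhead
    slot_gb = usable_ram_gb / executors_per_node
    heap_gb = math.floor(slot_gb / (1 + overhead_fraction))
    total_executors = executors_per_node * nodes
    return {
        "executors_per_node": executors_per_node,
        "executor_memory_gb": heap_gb,
        "total_executors": total_executors,
        "total_cores": total_executors * cores_per_executor,
    }

# The example cluster: 10 nodes, 16 cores and 64GB RAM per node
plan = size_executors(nodes=10, cores_per_node=16, ram_gb_per_node=64)
print(plan)
```

With the defaults this reproduces Option 1: 3 executors per node, 19GB each, 30 executors and 150 cores in total. On YARN in cluster mode, one of those slots is typically taken by the driver.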

In [None]:
# View current resource configuration
print("=== Current Resource Configuration ===")

configs = [
    ("spark.driver.memory", "Driver Memory"),
    ("spark.executor.memory", "Executor Memory"),
    ("spark.executor.cores", "Executor Cores"),
    ("spark.driver.cores", "Driver Cores"),
    ("spark.executor.instances", "Number of Executors"),
    ("spark.dynamicAllocation.enabled", "Dynamic Allocation"),
    ("spark.default.parallelism", "Default Parallelism"),
    ("spark.sql.shuffle.partitions", "Shuffle Partitions")
]

for config_key, config_name in configs:
    try:
        value = spark.conf.get(config_key)
        print(f"{config_name:<25}: {value}")
    except Exception:  # raised when the key is unset and has no default
        print(f"{config_name:<25}: (not set / default)")

## 4. Submitting Applications with spark-submit

**spark-submit** is the standard way to submit Spark applications to a cluster.

**Basic Syntax:**

```bash
spark-submit \
  --master <master-url> \
  --deploy-mode <client|cluster> \
  --class <main-class> \
  --name <app-name> \
  --conf <key=value> \
  --executor-memory <MEM> \
  --num-executors <N> \
  --executor-cores <N> \
  application.jar [app-args]
```

**Examples:**

### Local Mode:
```bash
spark-submit \
  --master local[4] \
  my_script.py
```

### Standalone Cluster:
```bash
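# Standalone does not support cluster deploy mode for Python applications,
# so a PySpark script is submitted in client mode
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \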
  --executor-memory 4G \
  --total-executor-cores 8 \
  my_script.py
```

### YARN Cluster:
```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 5 \
  --executor-memory 8G \
  --driver-memory 4G \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.default.parallelism=100 \
  --files config.conf \
  --py-files dependencies.zip \
  my_script.py arg1 arg2
```

### Kubernetes:
```bash
spark-submit \
  --master k8s://https://kubernetes-api:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=spark:latest \
  local:///opt/spark/examples/jars/spark-examples.jar
```

**Common Options:**

- `--master`: Cluster manager URL
- `--deploy-mode`: Where to run the driver (client/cluster)
- `--name`: Application name (appears in UI)
- `--conf`: Spark configuration property
- `--files`: Files to distribute to executors
- `--py-files`: Python dependencies (.zip, .egg, .py)
- `--packages`: Maven coordinates for dependencies
- `--jars`: Additional JARs to include
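- `--class`: Main class (Java/Scala applications only; omit for Python scripts)
- `--driver-java-options`: JVM options for driver
- `--conf spark.executor.extraJavaOptions=...`: JVM options for executors (spark-submit has no dedicated flag for this)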
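
When jobs are launched from Python tooling (a scheduler, a CI step), it can help to assemble the spark-submit command as a list rather than a hand-written string. The helper below is a hypothetical sketch, not part of Spark; it only uses flags already shown in this section and leaves it to the caller to pass valid ones.

```python
import shlex

def build_spark_submit(app, master, deploy_mode="client",
                       options=None, conf=None, app_args=()):
    """Assemble a spark-submit command as a list of arguments."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Plain flags such as --num-executors or --executor-memory
    for flag, value in (options or {}).items():
        cmd += [f"--{flag}", str(value)]
    # Arbitrary Spark properties go through repeated --conf key=value pairs
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The application and its own arguments always come last
    cmd.append(app)
    cmd += [str(arg) for arg in app_args]
    return cmd

cmd = build_spark_submit(
    "my_script.py",
    master="yarn",
    deploy_mode="cluster",
    options={"num-executors": 10, "executor-cores": 5, "executor-memory": "8G"},
    conf={"spark.sql.shuffle.partitions": 200},
    app_args=["arg1", "arg2"],
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with values that contain spaces.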

In [None]:
# Example: Create a simple PySpark script
# In production, you'd submit this with spark-submit

# Raw string so the backslash line continuations are written to the file verbatim
script_content = r'''
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg
import sys

def main():
    # Create Spark session
    spark = SparkSession.builder \
        .appName("Example ETL Job") \
        .getOrCreate()
    
    # Read data (would be from HDFS/S3 in production)
    df = spark.range(0, 1000000).toDF("id") \
        .withColumn("value", col("id") * 2)
    
    # Process
    result = df.filter(col("value") > 1000) \
        .groupBy((col("value") % 100).alias("bucket")) \
        .agg(count("*").alias("count"), avg("value").alias("avg_value"))
    
    # Show results
    result.show(10)
    
    # Write output (would be to HDFS/S3 in production)
    result.write.mode("overwrite").parquet("/tmp/output")
    
    spark.stop()
    print("Job completed successfully!")

if __name__ == "__main__":
    main()
'''

# Save script
with open("/tmp/example_job.py", "w") as f:
    f.write(script_content)

print("Example script saved to /tmp/example_job.py")
print("\nTo submit this job, you would use:")
print("""\n
spark-submit \\
  --master yarn \\
  --deploy-mode cluster \\
  --num-executors 5 \\
  --executor-cores 4 \\
  --executor-memory 8G \\
  --driver-memory 4G \\
  /tmp/example_job.py
""")

## 5. Monitoring and Troubleshooting

**Spark Web UI:**

Default URLs:
- Standalone Master: http://master:8080
- Application UI: http://driver:4040
- History Server: http://server:18080
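
The same UIs also expose a JSON REST API under `/api/v1`, which is handy for scripted checks. The sketch below only builds the endpoint URLs and shows how they could be fetched; the helper names are assumptions, and the fetch calls are left commented out because they need a running application.

```python
import json
from urllib.request import urlopen

def rest_url(ui_base, *parts):
    """Build a Spark monitoring REST URL from a UI base address."""
    return "/".join([ui_base.rstrip("/"), "api/v1", *parts])

def fetch_json(url, timeout=5):
    """Fetch and decode one REST endpoint (requires a reachable Spark UI)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

base = "http://localhost:4040"
print(rest_url(base, "applications"))

# With a running application you could then call, for example:
# apps = fetch_json(rest_url(base, "applications"))
# stages = fetch_json(rest_url(base, "applications", apps[0]["id"], "stages"))
```

The History Server serves the same endpoints for completed applications, so the base address can point at port 18080 instead.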

**Important Tabs:**

1. **Jobs**: Overview of all jobs
   - Total time, stages, tasks
   - Failed/succeeded tasks

2. **Stages**: Detailed stage information
   - Task execution time
   - Shuffle read/write
   - GC time
   - Skewed tasks (identify stragglers)

3. **Storage**: Cached RDDs/DataFrames
   - Memory used
   - Partitions cached

4. **Environment**: Configuration settings
   - Verify resource allocation
   - Check property values

5. **Executors**: Executor metrics
   - Memory usage
   - Task distribution
   - Failed tasks
   - GC time

6. **SQL**: DataFrame/SQL queries
   - Execution plans
   - Query duration
   - Physical plans with metrics

**Key Metrics to Monitor:**

- **Task Time**: Should be relatively even across tasks
- **Shuffle Read/Write**: Large shuffles indicate optimization opportunities
- **GC Time**: >10% of task time indicates memory pressure
- **Spill (Memory/Disk)**: Indicates need for more memory or smaller partitions
- **Task Skew**: Some tasks taking much longer than others
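
These rules of thumb are easy to apply to numbers copied from the Stages tab. The following plain-Python sketch flags high GC share and skew for one stage; the function name and the 3x skew ratio are assumptions chosen for illustration, while the 10% GC limit comes from the list above.

```python
from statistics import median

def diagnose_stage(task_seconds, gc_seconds, skew_ratio=3.0, gc_limit=0.10):
    """Return human-readable warnings for one stage's task metrics."""
    findings = []
    total_time = sum(task_seconds)
    if not task_seconds or total_time == 0:
        return findings
    # GC share: total GC time relative to total task time
    gc_share = sum(gc_seconds) / total_time
    if gc_share > gc_limit:
        findings.append(f"GC time is {gc_share:.0%} of task time: memory pressure")
    # Skew: compare the slowest task with the typical (median) task
    slowest, typical = max(task_seconds), median(task_seconds)
    if typical > 0 and slowest > skew_ratio * typical:
        findings.append(f"slowest task is {slowest / typical:.1f}x the median: possible skew")
    return findings

# 19 tasks take about a minute, one straggler takes 15 minutes
durations = [60] * 19 + [900]
gc_times = [3] * 19 + [200]
for finding in diagnose_stage(durations, gc_times):
    print(finding)
```

In this example both checks fire: the straggler runs 15 times longer than the median task and GC accounts for about 13% of total task time.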

**Common Issues and Solutions:**

1. **Out of Memory**:
   - Increase executor memory
   - Reduce partition size
   - Avoid collect() on large DataFrames
   - Unpersist unused cached data

2. **Slow Shuffles**:
   - Use broadcast joins
   - Increase parallelism
   - Filter data early
   - Pre-partition data

3. **Task Skew**:
   - Repartition with salt keys
   - Use adaptive query execution
   - Filter outliers

4. **High GC Time**:
   - Increase executor memory
   - Tune GC settings
   - Reduce object creation
   - Use serialized storage

5. **Stragglers** (few slow tasks):
   - Enable speculative execution
   - Check for data skew
   - Look for failing nodes
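
Salting is the least intuitive of these fixes, so here is a small simulation of its effect. It uses plain Python instead of Spark: `zlib.crc32` stands in for Spark's hash partitioner, and the key names, partition count, and salt count are made-up values for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 10

def partition_of(key: str) -> int:
    """Assign a key to a partition by hashing (stand-in for hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Skewed data: one "hot" key owns 90% of the rows
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Salting appends a random suffix so one logical key maps to several physical keys
rng = random.Random(42)
salted_keys = [f"{key}_{rng.randrange(NUM_SALTS)}" for key in keys]

unsalted_load = Counter(partition_of(key) for key in keys)
salted_load = Counter(partition_of(key) for key in salted_keys)

print("largest partition without salt:", max(unsalted_load.values()))
print("largest partition with salt:   ", max(salted_load.values()))
```

Without salt, every row of the hot key lands in the same partition. With salt, the rows spread over several partitions, at the cost of a second aggregation step (or a replicated join side) to combine the salted groups again.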

In [None]:
# Demonstrate monitoring in local mode
# The Web UI defaults to http://localhost:4040 while the session is active
# (Spark moves to the next free port, 4041, 4042, ..., if 4040 is taken)

print("=== Spark Web UI ===")
print(f"\nApplication UI: {spark.sparkContext.uiWebUrl}")
print("\nNote: Open this URL in your browser while the Spark session is running")
print("\nYou can explore:")
print("  - Jobs: See all executed jobs")
print("  - Stages: Detailed stage metrics")
print("  - Storage: Cached DataFrames")
print("  - Environment: Configuration")
print("  - Executors: Resource usage")
print("  - SQL: DataFrame queries and plans")

# Create some activity to monitor
print("\nCreating sample workload to monitor...")

df = spark.range(0, 5000000).toDF("id") \
    .withColumn("value", col("id") % 1000)

# Cache to see in Storage tab
df.cache()

# Run some operations to see in Jobs/Stages
result1 = df.groupBy("value").count().count()
result2 = df.filter(col("value") > 500).count()

print(f"\nWorkload completed:")
print(f"  - Grouped aggregation: {result1} groups")
print(f"  - Filtered count: {result2:,} rows")
print("\nCheck the Web UI to see execution details!")

# Show what's in cache
print("\n=== Cached DataFrames ===")
print(f"Cached DataFrame has {df.rdd.getNumPartitions()} partitions")

In [None]:
# Get application metrics programmatically
sc = spark.sparkContext

print("=== Application Metrics ===")
print(f"\nApplication ID: {sc.applicationId}")
print(f"Application Name: {sc.appName}")
print(f"Master: {sc.master}")
print(f"Default Parallelism: {sc.defaultParallelism}")
print(f"\nSpark Version: {sc.version}")

# Get status tracker for detailed metrics
status = sc.statusTracker()
print(f"\nActive Jobs: {len(status.getActiveJobIds())}")
print(f"Active Stages: {len(status.getActiveStageIds())}")

# Cleanup
df.unpersist()

## 6. Production Best Practices

### Configuration Best Practices

**1. Resource Allocation:**
```bash
# Use configuration files instead of hardcoding
spark-submit \
  --properties-file cluster-config.conf \
  my_app.py
```
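
A properties file is also easy to inspect or validate from Python before a job is submitted. The parser below is a simplified sketch that only understands `key=value` lines and full-line `#` comments; spark-submit itself accepts more variants (for example whitespace-separated pairs), so treat the function as an illustration rather than a faithful reimplementation.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first "=" only, so values may contain "=" themselves
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; ignored in this sketch
        props[key.strip()] = value.strip()
    return props

sample = """
# Resources
spark.executor.memory=8g
spark.executor.cores=5
spark.executor.extraJavaOptions=-Dapp.env=prod
"""

props = parse_properties(sample)
for key, value in props.items():
    print(f"{key} -> {value}")
```

In a notebook, each parsed pair could then be applied with `SparkSession.builder.config(key, value)` to mirror what `--properties-file` does for a submitted job.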

**2. Dynamic Allocation** (requires the external shuffle service or, in Spark 3.0+, shuffle tracking):
```
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=20
spark.dynamicAllocation.initialExecutors=5
```
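
Roughly speaking, dynamic allocation asks for enough executors to run all pending and running tasks at once, clamped between the configured minimum and maximum. The sketch below models only that clamping; the function name is hypothetical, and Spark's real policy additionally ramps requests up over time and releases executors after idle timeouts.

```python
import math

def target_executors(pending_and_running_tasks, executor_cores,
                     min_executors=2, max_executors=20, task_cpus=1):
    """Simplified model of the executor count dynamic allocation aims for."""
    tasks_per_executor = executor_cores // task_cpus
    needed = math.ceil(pending_and_running_tasks / tasks_per_executor)
    # Never drop below the minimum or exceed the maximum
    return max(min_executors, min(max_executors, needed))

print(target_executors(0, executor_cores=5))     # idle application → 2
print(target_executors(37, executor_cores=5))    # 37 tasks, 5 per executor → 8
print(target_executors(1000, executor_cores=5))  # capped at the maximum → 20
```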

**3. Compression:**
```
spark.sql.parquet.compression.codec=snappy
spark.broadcast.compress=true
spark.shuffle.compress=true
```

**4. Serialization:**
```
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m
```

**5. Adaptive Query Execution:**
```
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
```

### Deployment Checklist

- [ ] Test with representative data volume
- [ ] Configure appropriate resources
- [ ] Enable logging and monitoring
- [ ] Set up alerting for failures
- [ ] Configure checkpointing for streaming
- [ ] Use external metastore (Hive)
- [ ] Implement retry logic
- [ ] Document dependencies
- [ ] Version control configuration
- [ ] Set up CI/CD pipeline

### Security Considerations

1. **Authentication**: Kerberos for YARN, RBAC for K8s
2. **Encryption**: Enable at-rest and in-transit
3. **Access Control**: Limit who can submit jobs
4. **Network Isolation**: Use VPCs/subnets
5. **Secrets Management**: Use vault services

### Logging and Monitoring

**Structured Logging:**
```python
import logging

logging.basicConfig(
    format='%(asctime)s %(levelname)s %(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

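row_count = 1_000_000  # placeholder; in a real job this could be df.count()

logger.info("Job started")
logger.info("Processing %d rows", row_count)
try:
    raise ValueError("example failure")  # stand-in for a real processing step
except ValueError as error:
    logger.error("Failed to process: %s", error)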
```

**Metrics Collection:**
- Use Spark metrics system
- Export to Prometheus/Grafana
- Track custom metrics
- Set up dashboards

**Log Aggregation:**
- Use ELK stack (Elasticsearch, Logstash, Kibana)
- Or Splunk for enterprise
- Centralize logs from all executors
- Enable log rotation

In [None]:
# Example: Production-ready configuration

production_config = """
# cluster-config.conf - Production Spark Configuration

# Application
spark.app.name=ProductionETLJob

# Resources
spark.executor.instances=10
spark.executor.cores=5
spark.executor.memory=8g
spark.driver.memory=4g
spark.driver.cores=2

# Performance
spark.sql.shuffle.partitions=200
spark.default.parallelism=100
spark.sql.autoBroadcastJoinThreshold=50m

# Adaptive Query Execution
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true

# Dynamic Allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=20
spark.dynamicAllocation.initialExecutors=5
spark.dynamicAllocation.executorIdleTimeout=60s

# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m

# Compression
spark.sql.parquet.compression.codec=snappy
spark.broadcast.compress=true
spark.shuffle.compress=true
spark.io.compression.codec=snappy

# Memory Management
spark.memory.fraction=0.6
spark.memory.storageFraction=0.5

# Speculation (for stragglers)
spark.speculation=true
spark.speculation.interval=100ms
spark.speculation.multiplier=1.5

# Logging
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs
spark.history.fs.logDirectory=hdfs:///spark-logs

# Checkpointing (for streaming)
spark.sql.streaming.checkpointLocation=/checkpoints

# Network
spark.network.timeout=300s
spark.executor.heartbeatInterval=10s
"""

print("Production Configuration File:")
print(production_config)

# Save configuration
with open("/tmp/cluster-config.conf", "w") as f:
    f.write(production_config)

print("\nConfiguration saved to /tmp/cluster-config.conf")
print("\nUse with: spark-submit --properties-file /tmp/cluster-config.conf app.py")

## 7. Exercises

These exercises focus on understanding concepts rather than hands-on cluster setup (which requires actual cluster infrastructure).

### Exercise 1: Resource Calculation

Calculate optimal resource allocation for a Spark cluster.

**Scenario:**
- Cluster: 20 nodes
- Per node: 32 cores, 128GB RAM
- Reserve: 1 core and 2GB per node for OS

**Tasks:**
1. Calculate total available resources
2. Determine optimal executor configuration (3 different options)
3. For each option, calculate: num executors, cores/executor, memory/executor
4. Justify which option you'd choose and why
5. Write the spark-submit command with your chosen configuration

In [None]:
# Your calculations here
# TODO: Calculate available resources
# TODO: Design 3 different resource allocation strategies
# TODO: Compare and choose the best option

### Exercise 2: Configuration Tuning

Create appropriate Spark configurations for different scenarios.

**Tasks:**
1. Configuration for streaming application (low latency)
2. Configuration for large batch ETL (high throughput)
3. Configuration for iterative ML training (lots of caching)
4. Explain your choices for each scenario
5. Identify which settings are most critical for each use case

In [None]:
# Your code here
# TODO: Create config for streaming
# TODO: Create config for batch ETL
# TODO: Create config for ML training
# TODO: Document reasoning

### Exercise 3: Troubleshooting Analysis

Analyze and diagnose common cluster issues.

**Scenarios:**

1. **Scenario A**: Job taking 10x longer than expected, Web UI shows:
   - 95% of tasks finish in 1 minute
   - 5% of tasks taking 15+ minutes
   - Large shuffle write size

2. **Scenario B**: Executors keep failing with OOM errors:
   - Executor memory: 4GB
   - GC time >50% of task time
   - Many cached DataFrames

3. **Scenario C**: Slow join operation:
   - Two large tables (1TB and 10GB)
   - Using SortMergeJoin
   - autoBroadcastJoinThreshold=10MB

**Tasks:**
- Diagnose the root cause for each scenario
- Propose 2-3 solutions for each
- Explain the trade-offs of each solution

In [None]:
# Your analysis here
# TODO: Analyze each scenario
# TODO: Propose solutions
# TODO: Discuss trade-offs

## 8. Exercise Solutions

### Solution 1: Resource Calculation

In [None]:
# Cluster specifications
num_nodes = 20
cores_per_node = 32
ram_per_node_gb = 128

# Reserved resources
reserved_cores = 1
reserved_ram_gb = 2

# Available resources per node
available_cores = cores_per_node - reserved_cores
available_ram = ram_per_node_gb - reserved_ram_gb

print("=== Cluster Resources ===")
print(f"Total nodes: {num_nodes}")
print(f"Per node: {cores_per_node} cores, {ram_per_node_gb}GB RAM")
print(f"Available per node: {available_cores} cores, {available_ram}GB RAM")
print(f"\nTotal available: {available_cores * num_nodes} cores, {available_ram * num_nodes}GB RAM")

print("\n=== Option 1: Balanced (Recommended) ===")
opt1_cores_per_exec = 5
opt1_execs_per_node = available_cores // opt1_cores_per_exec
opt1_mem_per_exec = (available_ram // opt1_execs_per_node) - 1  # Leave 1GB headroom (simplified; the default memoryOverhead is max(384MB, 10% of executor memory))
opt1_total_execs = opt1_execs_per_node * num_nodes

print(f"Cores per executor: {opt1_cores_per_exec}")
print(f"Executors per node: {opt1_execs_per_node}")
print(f"Memory per executor: {opt1_mem_per_exec}GB")
print(f"Total executors: {opt1_total_execs}")
print(f"Total cores: {opt1_total_execs * opt1_cores_per_exec}")
print("Pros: Good balance, optimal for HDFS throughput")

print("\n=== Option 2: High Parallelism ===")
opt2_cores_per_exec = 3
opt2_execs_per_node = available_cores // opt2_cores_per_exec
opt2_mem_per_exec = (available_ram // opt2_execs_per_node) - 1
opt2_total_execs = opt2_execs_per_node * num_nodes

print(f"Cores per executor: {opt2_cores_per_exec}")
print(f"Executors per node: {opt2_execs_per_node}")
print(f"Memory per executor: {opt2_mem_per_exec}GB")
print(f"Total executors: {opt2_total_execs}")
print(f"Total cores: {opt2_total_execs * opt2_cores_per_exec}")
print("Pros: More parallelism, better for many small tasks")
print("Cons: More overhead, less memory per executor")

print("\n=== Option 3: Memory Intensive ===")
opt3_cores_per_exec = 10
opt3_execs_per_node = available_cores // opt3_cores_per_exec
opt3_mem_per_exec = (available_ram // opt3_execs_per_node) - 2  # More overhead for large memory
opt3_total_execs = opt3_execs_per_node * num_nodes

print(f"Cores per executor: {opt3_cores_per_exec}")
print(f"Executors per node: {opt3_execs_per_node}")
print(f"Memory per executor: {opt3_mem_per_exec}GB")
print(f"Total executors: {opt3_total_execs}")
print(f"Total cores: {opt3_total_execs * opt3_cores_per_exec}")
print("Pros: Lots of memory for large aggregations")
print("Cons: Fewer executors, potential GC issues")

print("\n=== Recommendation ===")
print("Choose Option 1 (Balanced) for most workloads")
print("Reasoning:")
print("  - 5 cores/executor is sweet spot for HDFS throughput")
print(f"  - {opt1_mem_per_exec}GB/executor is reasonable (not too large for GC)")
print(f"  - {opt1_total_execs} executors provide good parallelism")

print("\n=== spark-submit Command ===")
print(f"""
spark-submit \\
  --master yarn \\
  --deploy-mode cluster \\
  --num-executors {opt1_total_execs} \\
  --executor-cores {opt1_cores_per_exec} \\
  --executor-memory {opt1_mem_per_exec}G \\
  --driver-memory 8G \\
  --driver-cores 4 \\
  --conf spark.sql.shuffle.partitions=400 \\
  --conf spark.default.parallelism=200 \\
  my_application.py
""")

### Solution 2: Configuration Tuning

In [None]:
# Configuration 1: Streaming Application (Low Latency)
streaming_config = """
# Streaming Application - Low Latency Configuration

# Small, frequent micro-batches
spark.sql.shuffle.partitions=50  # Fewer partitions for low latency
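spark.streaming.backpressure.enabled=true  # Handle rate fluctuations (legacy DStreams API only)
spark.streaming.kafka.maxRatePerPartition=1000  # Cap ingestion rate (DStreams; Structured Streaming uses the Kafka source option maxOffsetsPerTrigger)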

# Optimize for speed
spark.speculation=true  # Handle stragglers quickly
spark.speculation.interval=50ms
spark.task.cpus=1  # More tasks in parallel

# Memory (moderate, streaming usually has smaller state)
spark.executor.memory=4g
spark.executor.cores=2

# Checkpointing
spark.sql.streaming.checkpointLocation=/streaming-checkpoints
spark.cleaner.referenceTracking.cleanCheckpoints=true

# Network (important for streaming)
spark.network.timeout=120s

Rationale:
- Fewer shuffle partitions for lower latency
- Backpressure to handle variable input rates
- Speculation to quickly recover from slow tasks
- Moderate resources (streaming usually processes smaller batches)
"""

print(streaming_config)
print("\n" + "="*80 + "\n")

# Configuration 2: Batch ETL (High Throughput)
batch_config = """
# Batch ETL - High Throughput Configuration

# Large resources for processing big data
spark.executor.instances=50
spark.executor.cores=5
spark.executor.memory=16g
spark.driver.memory=8g

# High parallelism
spark.sql.shuffle.partitions=400  # Many partitions for large data
spark.default.parallelism=200

# Compression (save space and I/O)
spark.sql.parquet.compression.codec=snappy
spark.broadcast.compress=true
spark.shuffle.compress=true

# Adaptive execution (handle varying data sizes)
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true

# Dynamic allocation (scale up/down as needed)
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=10
spark.dynamicAllocation.maxExecutors=100

# Large broadcast threshold (common in ETL for dimension tables)
spark.sql.autoBroadcastJoinThreshold=100m

Rationale:
- Large resources to process big datasets quickly
- High parallelism to maximize throughput
- Compression to reduce I/O
- Adaptive execution to handle data skew
- Dynamic allocation for cost efficiency
"""

print(batch_config)
print("\n" + "="*80 + "\n")

# Configuration 3: ML Training (Caching Heavy)
ml_config = """
# ML Training - Caching Heavy Configuration

# Large memory for caching datasets
spark.executor.memory=24g
spark.driver.memory=12g
spark.executor.cores=6

# Memory management (favor storage for caching)
spark.memory.fraction=0.7  # More memory for execution/storage
spark.memory.storageFraction=0.6  # More storage within that

# Serialization (important for large cached objects)
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=1024m

# Moderate parallelism (ML often iterates on same data)
spark.sql.shuffle.partitions=100

# No dynamic allocation (want consistent resources)
spark.dynamicAllocation.enabled=false
spark.executor.instances=20

# Checkpointing for iterative algorithms
spark.cleaner.referenceTracking.cleanCheckpoints=true
spark.cleaner.periodicGC.interval=10min

# Larger broadcast join threshold (for joins against sizeable lookup/feature tables)
spark.sql.autoBroadcastJoinThreshold=200m

Rationale:
- Very large memory to cache training data in RAM
- High storage fraction to prioritize caching
- Kryo serialization for efficient object storage
- Fixed resources (no dynamic allocation for predictable performance)
- Regular GC to manage memory from iterations
"""

print(ml_config)

### Solution 3: Troubleshooting Analysis

In [None]:
troubleshooting = """
=== SCENARIO A: Task Skew and Slow Stragglers ===

Symptoms:
- 95% tasks finish in 1 minute
- 5% tasks taking 15+ minutes
- Large shuffle write size

Diagnosis:
ROOT CAUSE: Data skew - some partitions have much more data than others

The 5% slow tasks are processing disproportionately large partitions.
Large shuffle indicates data redistribution is happening.

Solutions:

1. Enable Adaptive Query Execution (AQE):
   spark.sql.adaptive.enabled=true
   spark.sql.adaptive.skewJoin.enabled=true
   
   Pros: Automatic skew handling, no code changes
   Cons: Requires Spark 3.0+ (on by default since 3.2); skew handling targets joins, not aggregations

2. Add salt to skewed keys:
   df.withColumn("salted_key", concat(col("key"), lit("_"), (rand() * 10).cast("int").cast("string")))
   
   Pros: Distributes skewed keys across partitions
   Cons: Requires code changes, more complex logic

3. Filter outliers if acceptable:
   df.filter(col("partition_key").isin(common_values))
   
   Pros: Simple, fast
   Cons: Loses data, not always acceptable

4. Enable speculation:
   spark.speculation=true
   
   Pros: Re-runs slow tasks on other executors
   Cons: Uses extra resources, doesn't fix root cause

Recommendation: Use AQE (#1) + salting (#2) for severe skew

===============================================================================

=== SCENARIO B: Out of Memory Errors ===

Symptoms:
- Executors failing with OOM
- Executor memory: 4GB
- GC time >50% of task time
- Many cached DataFrames

Diagnosis:
ROOT CAUSE: Insufficient memory + excessive caching + high GC overhead

The executors don't have enough memory for both caching and execution.
High GC time indicates memory pressure.

Solutions:

1. Increase executor memory:
   spark.executor.memory=12g
   
   Pros: More headroom for caching and execution
   Cons: Requires cluster resources, may hit GC issues if too large

2. Unpersist unused cached data:
   df.unpersist()
   
   Pros: Frees memory immediately
   Cons: Must track what's needed

3. Move large or rarely read caches to disk:
   df.persist(StorageLevel.DISK_ONLY)
   
   Pros: Frees executor memory (PySpark 3.x has no MEMORY_ONLY_SER; that level is Scala/Java only)
   Cons: Slower access (data is read back from disk)

4. Adjust memory fractions:
   spark.memory.fraction=0.7
   spark.memory.storageFraction=0.3
   
   Pros: Allocates more to execution vs storage
   Cons: May evict cached data more frequently

5. Increase number of partitions:
   spark.sql.shuffle.partitions=400
   
   Pros: Smaller partitions use less memory per task
   Cons: More tasks, higher overhead

Recommendation: Increase memory (#1) + unpersist unused cache (#2)

===============================================================================

=== SCENARIO C: Slow Join Operation ===

Symptoms:
- Two tables: 1TB and 10GB
- Using SortMergeJoin
- autoBroadcastJoinThreshold=10MB

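Diagnosis:
ROOT CAUSE: A full shuffle and sort of the 1TB table for the SortMergeJoin

The 10GB table is far above the 10MB threshold, so Spark picks
SortMergeJoin, which shuffles and sorts both tables. The 10GB table is
also above Spark's hard 8GB limit for broadcast tables, so it cannot be
broadcast as-is.

Solutions:

1. Shrink the small table, then broadcast it explicitly:
   slim_small = small_table.filter(col("active") == True).select("key", "needed_col")
   large_table.join(broadcast(slim_small), "key")
   
   Pros: Eliminates the shuffle of the 1TB table
   Cons: Only works if filtering and column pruning bring the table well
         under 8GB and it fits in driver and executor memory

2. Raise the broadcast threshold (only for tables that fit comfortably):
   spark.sql.autoBroadcastJoinThreshold=1g
   
   Pros: Broadcast join is chosen automatically
   Cons: Relies on size estimates; risks driver/executor OOM if set too high

3. Bucket both tables by the join key when writing them:
   large_table.write.bucketBy(512, "key").sortBy("key").saveAsTable("large_bucketed")
   small_table.write.bucketBy(512, "key").sortBy("key").saveAsTable("small_bucketed")
   
   Pros: Later joins on "key" can skip the shuffle
   Cons: One-time cost to rewrite the data; pays off only for repeated joins

4. Tune the shuffle join itself:
   spark.sql.adaptive.enabled=true
   spark.sql.shuffle.partitions=2000
   
   Pros: No data or code changes; smaller partitions spill less
   Cons: The 1TB shuffle still happens

Recommendation: Shrink and broadcast (#1) if the reduced table is small enough;
otherwise bucket (#3) for repeated joins or tune the shuffle (#4)

Trade-offs:
- Broadcast tables are collected on the driver first and are capped at 8GB
- Broadcast sends a full copy to every executor (network and memory cost)
- But it eliminates the shuffle of the 1TB table (huge savings!)
- If the small table cannot be reduced enough, use bucketing (#3)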

===============================================================================
"""

print(troubleshooting)

## 9. Summary

Congratulations! You've learned how to deploy and manage Spark on clusters.

### Key Takeaways:

1. **Cluster Architecture**:
   - Driver orchestrates, executors execute
   - Cluster manager allocates resources
   - Tasks are units of parallel work
   - Memory divided into execution and storage

2. **Deployment Modes**:
   - Local: Development and testing
   - Standalone: Simple dedicated clusters
   - YARN: Hadoop integration, most common in enterprise
   - Kubernetes: Cloud-native, containerized deployments

3. **Resource Allocation**:
   - Balance: cores, memory, number of executors
   - Sweet spot: 4-6 cores per executor
   - Memory: 8-16GB per executor (avoid GC issues)
   - Leave resources for OS and overhead

4. **spark-submit**:
   - Standard way to submit applications
   - Configure resources, dependencies, and properties
   - Use configuration files for production
   - Different syntax for different cluster managers

5. **Monitoring**:
   - Web UI provides detailed metrics
   - Watch for: task skew, GC time, shuffle size
   - Enable event logging and history server
   - Set up alerting for production jobs

6. **Best Practices**:
   - Enable adaptive query execution
   - Use dynamic allocation for variable workloads
   - Configure compression and serialization
   - Implement structured logging
   - Version control configurations

### Production Deployment Checklist:

- [ ] Calculate appropriate resource allocation
- [ ] Create production configuration file
- [ ] Test with representative data volume
- [ ] Set up monitoring and alerting
- [ ] Configure logging and log aggregation
- [ ] Enable checkpointing for streaming
- [ ] Implement retry logic for failures
- [ ] Document dependencies and versions
- [ ] Set up CI/CD pipeline
- [ ] Configure security (auth, encryption)

### Common Production Issues:

1. **Under-resourced**: Jobs slow or failing
   - Solution: Increase memory, cores, or executors

2. **Over-resourced**: Wasting cluster capacity
   - Solution: Enable dynamic allocation

3. **Memory pressure**: High GC time, OOM
   - Solution: More memory, unpersist cache, tune fractions

4. **Data skew**: Few slow tasks
   - Solution: AQE, salting, repartitioning

5. **Slow shuffles**: Large data movement
   - Solution: Broadcast joins, pre-partitioning, filtering

### What's Next?

In [Module 14: Final Project - ETL Pipeline with ML](14_final_project_etl_pipeline_with_ml.ipynb), you'll:
- Build an end-to-end production pipeline
- Integrate ETL, feature engineering, and ML
- Apply all concepts from previous modules
- Create a deployable Spark application
- Implement production best practices

### Additional Resources:

- [Cluster Mode Overview](https://spark.apache.org/docs/latest/cluster-overview.html)
- [Submitting Applications](https://spark.apache.org/docs/latest/submitting-applications.html)
- [Running on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html)
- [Running on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)
- [Monitoring and Instrumentation](https://spark.apache.org/docs/latest/monitoring.html)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Excellent work on understanding Spark clusters!")