# Key Considerations for Your Dataproc Cluster

1\. Cluster Resources:

- Master: n2-standard-4 (4 vCPUs, 16 GB RAM, 32GB disk)

- Workers (2x): n2-standard-4 (4 vCPUs, 16 GB RAM, 64GB disk each)

- Total: 8 worker vCPUs, ~32 GB RAM (excluding master node)
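
The totals above are plain arithmetic over the worker specs. A minimal sketch that recomputes them follows; `NodeSpec` and `worker_totals` are hypothetical names introduced only for illustration, and the numbers are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one cluster node (values from the list above)."""
    vcpus: int
    ram_gb: int
    disk_gb: int


def worker_totals(spec: NodeSpec, count: int) -> dict:
    """Sum raw resources across `count` identical workers (master excluded)."""
    return {
        "vcpus": spec.vcpus * count,
        "ram_gb": spec.ram_gb * count,
        "disk_gb": spec.disk_gb * count,
    }


worker = NodeSpec(vcpus=4, ram_gb=16, disk_gb=64)
totals = worker_totals(worker, count=2)
print(totals)
```

These are raw machine totals. The OS and YARN reserve part of each node, so the memory that executors can actually request is lower.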

2\. Dataproc Features and Settings:

- Disabled: autoscaling, Metastore, advanced execution layer, advanced optimizations

- Storage: pd-balanced persistent disks (no local SSDs, so I/O optimization is crucial)

- Networking: Internal IP enabled

3\. Optimization Strategy:

- Tune shuffle partitions, broadcast join threshold, and storage persistence

- Adjust parallelism based on 2 workers x 4 cores

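- Avoid excessive caching: executor memory is limited, and cached data that spills to disk lands on pd-balanced persistent disks rather than local SSDs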
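
One way to turn "2 workers x 4 cores" into a concrete partition count: the Spark tuning guide suggests roughly 2-3 tasks per CPU core. A minimal sketch under that assumption follows; `suggested_partitions` and its `tasks_per_core` parameter are hypothetical names, and the result is a starting point to measure against, not a fixed rule.

```python
def suggested_partitions(workers: int, cores_per_worker: int, tasks_per_core: int = 2) -> int:
    """Return workers * cores_per_worker * tasks_per_core.

    tasks_per_core defaults to 2, the low end of the 2-3 tasks-per-core
    guideline (an assumption to tune, not a measured value).
    """
    if min(workers, cores_per_worker, tasks_per_core) < 1:
        raise ValueError("all arguments must be positive integers")
    return workers * cores_per_worker * tasks_per_core


partitions = suggested_partitions(workers=2, cores_per_worker=4)
print(partitions)
```

The value could then be passed as a string to `spark.sql.shuffle.partitions`. If the executors are granted fewer cores than the machines have, use executors x cores per executor in place of the machine total.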

In [None]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Create a SparkConf object and set the configurations.
# The chain is wrapped in parentheses so each .set() call can carry a comment.
conf = (
    SparkConf()
    .setAppName("OptimizedSparkConfig")
    .set("spark.executor.memory", "4g")
    .set("spark.executor.cores", "2")
    .set("spark.executor.instances", "2")  # Number of executors
    # In client mode (e.g. a notebook) driver memory must be set before the
    # driver JVM starts (spark-submit or cluster properties), not through SparkConf.
    .set("spark.driver.memory", "4g")
    .set("spark.driver.cores", "1")
    .set("spark.sql.shuffle.partitions", "4")
    .set("spark.sql.files.maxPartitionBytes", "268435456")  # 256MB
    .set("spark.sql.files.openCostInBytes", "134217728")  # 128MB
    # Does not appear in the Spark configuration docs; likely ignored.
    .set("spark.sql.files.cacheSize", "268435456")  # 256MB
    .set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # Enable dynamic partition pruning
    .set("spark.sql.autoBroadcastJoinThreshold", "104857600")  # 100MB for broadcast joins
    # Does not appear in the Spark configuration docs either; CSV compression is
    # normally chosen per write, e.g. df.write.option("compression", "snappy").
    .set("spark.sql.csv.compressionCodec", "org.apache.hadoop.io.compress.SnappyCodec")
    .set("spark.sql.adaptive.enabled", "true")  # Enable Adaptive Query Execution
    .set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")  # Rows per batch for columnar caching
    .set("spark.locality.wait", "3s")  # Wait time for data locality
    .set("spark.network.timeout", "800s")  # Network timeout
    .set("spark.executor.heartbeatInterval", "60s")  # Executor heartbeat interval
    .set("spark.eventLog.enabled", "false")  # Disable event logging
    # Read only by the History Server. To record events, set
    # spark.eventLog.enabled to "true" and point spark.eventLog.dir at this path.
    .set("spark.history.fs.logDirectory", "gs://<your-bucket>/spark-events")
)

# Initialize SparkSession with the configuration
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Now, you can use Spark as usual
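
The two timeout settings in the cell above interact: the Spark configuration docs say `spark.executor.heartbeatInterval` should be significantly less than `spark.network.timeout`. A minimal sketch of a sanity check for that relationship follows. It is plain Python with no Spark dependency; `to_seconds`, `heartbeat_is_safe`, and the `min_ratio` default of 5 are hypothetical choices for illustration, not values taken from Spark.

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "min": 60, "h": 3600}


def to_seconds(value: str) -> float:
    """Convert a duration string such as '800s' or '2min' to seconds."""
    match = re.fullmatch(r"(\d+)\s*(ms|s|m|min|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


def heartbeat_is_safe(heartbeat: str, network_timeout: str, min_ratio: float = 5.0) -> bool:
    """True when the timeout is at least min_ratio times the heartbeat interval.

    min_ratio is an assumed safety margin, not a Spark-defined threshold.
    """
    return to_seconds(network_timeout) >= min_ratio * to_seconds(heartbeat)


print(heartbeat_is_safe("60s", "800s"))
```

With the values used above (60s heartbeat, 800s timeout) the check passes. A 60s heartbeat against a 120s timeout would fail it.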


In [1]:
# https://spark.apache.org/docs/latest/configuration.html