# Optimizing and Tuning Spark Applications


In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my_app").enableHiveSupport().getOrCreate()

In [6]:
spark.sql("SET -v").select("key", "value").show(n=5, truncate=False)

+---------------------------------------------------------+----------------------------------------------------------------+
|key                                                      |value                                                           |
+---------------------------------------------------------+----------------------------------------------------------------+
|spark.sql.adaptive.advisoryPartitionSizeInBytes          |<value of spark.sql.adaptive.shuffle.targetPostShuffleInputSize>|
|spark.sql.adaptive.autoBroadcastJoinThreshold            |<undefined>                                                     |
|spark.sql.adaptive.coalescePartitions.enabled            |true                                                            |
|spark.sql.adaptive.coalescePartitions.initialPartitionNum|<undefined>                                                     |
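|spark.sql.adaptive.coalescePartitions.minPartitionSize   |1MB                                                             |
+---------------------------------------------------------+----------------------------------------------------------------+
only showing top 5 rows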


In [7]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'
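
The value shown above can be changed for the running session. The following is a minimal sketch, assuming an active session like the `spark` object created earlier. The helper name `set_sql_conf` and the value 8 are hypothetical; `isModifiable`, `set`, and `get` are methods of PySpark's runtime config (`spark.conf`).

```python
def set_sql_conf(spark, key, value):
    """Set a runtime SQL config if it can be changed in a live session.

    Returns the value in effect afterwards (Spark stores it as a string).
    """
    # Static configs (for example spark.sql.warehouse.dir) cannot be changed
    # once the session is running; isModifiable() reports that up front.
    if spark.conf.isModifiable(key):
        spark.conf.set(key, str(value))
    return spark.conf.get(key)

# Usage with the session created above (8 is an arbitrary illustrative value):
# set_sql_conf(spark, "spark.sql.shuffle.partitions", 8)
```

Checking `isModifiable` first avoids an error when the key is a static setting that only takes effect at session startup.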

# Dynamic Resource Allocation
When you specify compute resources as command-line arguments to spark-submit, you fix an upper limit on what the application can use. This means that if more resources are needed later, as tasks queue up in the driver due to a larger-than-anticipated workload, Spark cannot accommodate or allocate extra resources. Dynamic resource allocation lifts this restriction by letting the driver request more or fewer executors as demand changes.

To enable and configure dynamic allocation, you can use settings like the following. The numbers here are arbitrary; the appropriate settings depend on the nature of your workload and should be adjusted accordingly. Some of these configs cannot be set inside a Spark CLI, so you will have to set them programmatically:

    spark.dynamicAllocation.enabled true
    spark.dynamicAllocation.minExecutors 2
    spark.dynamicAllocation.schedulerBacklogTimeout 1m
    spark.dynamicAllocation.maxExecutors 20
    spark.dynamicAllocation.executorIdleTimeout 2min

By default spark.dynamicAllocation.enabled is set to false. When enabled with the settings shown here, the Spark driver will request that the cluster manager create two executors to start with, as a minimum (spark.dynamicAllocation.minExecutors). As the task queue backlog increases, new executors will be requested each time the backlog timeout (spark.dynamicAllocation.schedulerBacklogTimeout) is exceeded. In this case, whenever there are pending tasks that have not been scheduled for over 1 minute, the driver will request that a new executor be launched to run the backlogged tasks, up to a maximum of 20 (spark.dynamicAllocation.maxExecutors). By contrast, if an executor finishes a task and is idle for 2 minutes (spark.dynamicAllocation.executorIdleTimeout), the Spark driver will terminate it.
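
One way to set these options programmatically is to pass them to the session builder before the session is created. The following is a minimal sketch under that assumption. The names `DYNAMIC_ALLOCATION`, `apply_settings`, and `to_submit_args` are hypothetical helpers; `builder.config(key, value)` is the standard `SparkSession.builder` method.

```python
# Settings from the text; the numbers are arbitrary, as noted there.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1m",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.executorIdleTimeout": "2min",
}


def apply_settings(builder, settings):
    """Chain .config(key, value) calls onto a SparkSession builder."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


def to_submit_args(settings):
    """Render the same settings as spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Usage (run before any session exists; getOrCreate() would otherwise return
# the existing session and these launch-time settings would have no effect):
# from pyspark.sql import SparkSession
# spark = apply_settings(SparkSession.builder.appName("my_app"),
#                        DYNAMIC_ALLOCATION).getOrCreate()
```

Keeping the settings in one dictionary lets the same values feed either the builder or a spark-submit command line. Depending on the cluster manager, dynamic allocation usually also needs the external shuffle service or shuffle tracking enabled, which this sketch does not cover.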


# Caching Data

In [12]:
# Create a DataFrame with two columns
from pyspark.sql.functions import col

df = spark.range(1 * 1000).toDF("id").withColumn("square", col("id") * col("id"))
df.cache().count()  # cache() is lazy; count() materializes the cache
# Check the Storage tab of the Spark UI to see where the data is stored

1000
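
Besides the Spark UI, a rough way to see the effect of caching is to time the same action twice. The following is a minimal sketch; `timed` is a hypothetical helper, not part of PySpark, and with only 1,000 rows the difference may be negligible.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Usage with the DataFrame above:
# df.cache()
# first, t1 = timed(df.count)    # this action materializes the cache
# second, t2 = timed(df.count)   # this action reads from the cache
```

Wall-clock timings like these are noisy, so treat them as an indication rather than a benchmark.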

In [13]:
# Unpersist first: df2 below has the same query plan as df, so Spark would treat it as already cached and not cache it again at the new storage level
df.unpersist()

DataFrame[id: bigint, square: bigint]

In [15]:
# Use persist() with an explicit storage level, here DISK_ONLY
from pyspark import StorageLevel

df2 = spark.range(1 * 1000).toDF("id").withColumn("square", col("id") * col("id"))
df2.persist(StorageLevel.DISK_ONLY).count()

1000

In [16]:
df2.unpersist()

DataFrame[id: bigint, square: bigint]

In [17]:
df.createOrReplaceTempView("dfTable")
spark.sql("CACHE TABLE dfTable")

DataFrame[]

In [18]:
spark.sql("SELECT count(*) FROM dfTable").show()

+--------+
|count(1)|
+--------+
|    1000|
+--------+
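
A table cached with CACHE TABLE stays cached until it is released. The following is a minimal sketch of releasing it through the catalog API, assuming the `dfTable` view registered above. The helper name `release_table_cache` is hypothetical; `spark.catalog.isCached` and `spark.catalog.uncacheTable` are PySpark catalog methods.

```python
def release_table_cache(spark, table_name):
    """Uncache a table or view if it is cached.

    Returns True if the table was cached before the call, False otherwise.
    """
    was_cached = spark.catalog.isCached(table_name)
    if was_cached:
        spark.catalog.uncacheTable(table_name)
    return was_cached


# Usage with the view registered above:
# release_table_cache(spark, "dfTable")
```

The SQL statement `UNCACHE TABLE dfTable` has the same effect as the catalog call.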



In [1]:
# Stop the session when you are done to release its resources
spark.stop()