## There are three ways you can get and set Spark properties.

<h3>1. By editing the configuration files in $SPARK_HOME/conf</h3>
<ul>
    <li>log4j.properties.template</li>
    <li>spark-defaults.conf.template</li>
    <li>spark-env.sh.template</li>
    </ul>
    
If you remove the <code>.template</code> suffix from a file name and change the values in that file, Spark uses those values.

<h3>2. During the spark-submit</h3>
<p>You can override configuration values by passing them with the <code>--conf</code> flag:</p>
<code>spark-submit --conf spark.sql.shuffle.partitions=5 --conf "spark.executor.memory=2g" --class main.scala.chapter7.SparkConfig_7_1 jars/mainscala-chapter7_2.12-1.0.jar</code>

<h3>3. While creating the SparkSession, or after it has been created</h3>
<p>Some properties cannot be changed once the SparkSession has been created. After creation you can only modify properties for which <code>spark.conf.isModifiable(key)</code> returns true.</p>

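<code>from pyspark.sql import SparkSession
# Parentheses let the builder chain span several lines in Python.
spark = (SparkSession.builder
    .config("spark.sql.shuffle.partitions", 5)
    .config("spark.executor.memory", "2g")
    .master("local[*]")
    .appName("SparkConfig")
    .getOrCreate())</code>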

<h3>How to see all properties of a Spark application</h3>
<ul>
    <li><code>spark.sparkContext.getConf().getAll()</code> in PySpark, or <code>spark.conf.getAll</code> in Scala</li>
    <li>Spark UI Environment Tab</li>
    </ul>
    
<h3>Precedence: values set in application code override command-line flags, which override configuration files</h3>
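Spark's configuration documentation gives the order as: properties set in the application take highest precedence, then flags passed to spark-submit, then entries in spark-defaults.conf. The sketch below only illustrates that merge order with plain dictionaries. The three dictionaries and the `resolve` helper are hypothetical and are not part of any Spark API.

```python
# Hypothetical property sources for one application; the values are illustrative.
defaults_file = {"spark.executor.memory": "1g", "spark.sql.shuffle.partitions": "200"}
submit_flags = {"spark.executor.memory": "2g"}
app_code = {"spark.sql.shuffle.partitions": "5"}


def resolve(defaults_file, submit_flags, app_code):
    """Merge the sources from lowest to highest precedence.

    Later sources overwrite earlier ones, so application code wins over
    spark-submit flags, which win over spark-defaults.conf.
    """
    merged = {}
    for source in (defaults_file, submit_flags, app_code):
        merged.update(source)
    return merged


effective = resolve(defaults_file, submit_flags, app_code)
print(effective)
```

In this example the executor memory comes from the command line, because the application code does not set it, while the shuffle partition count comes from the application code.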

<h3>Scaling Spark for Large Workloads</h3>
<p>To avoid job failures due to resource starvation
or gradual performance degradation, there are a handful of Spark configurations that
you can enable or alter. These configurations affect three Spark components: </p>
<ul>
    <li>Spark driver</li>
    <li>Executor</li>
    <li>Shuffle service running on the executor</li>
    </ul>
    
    
<h3>Dynamic Resource Allocation</h3>
<code>spark.dynamicAllocation.enabled true
 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.schedulerBacklogTimeout 1m
 spark.dynamicAllocation.maxExecutors 20
 spark.dynamicAllocation.executorIdleTimeout 2min
</code>
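As a rough illustration of what the minimum and maximum settings above mean, the toy model below clamps the number of executors a backlog of tasks would need. It is a simplification, not Spark's actual allocation algorithm: Spark requests executors gradually once tasks have been pending longer than the backlog timeout, and releases them after the idle timeout. The function name `target_executors` is hypothetical, and the model assumes one task per core.

```python
import math


def target_executors(pending_tasks, cores_per_executor, min_executors=2, max_executors=20):
    """Toy estimate of how many executors a task backlog calls for.

    Assumes each task occupies one core. The result is clamped to the
    configured minimum and maximum number of executors.
    """
    needed = math.ceil(pending_tasks / cores_per_executor)
    return max(min_executors, min(max_executors, needed))


# An idle application keeps the minimum; a large backlog is capped at the maximum.
for backlog in (0, 40, 1000):
    print(backlog, target_executors(backlog, cores_per_executor=4))
```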

# Configuring Spark executors’ memory and the shuffle service

<p>Even with dynamic allocation enabled, you still have to define the memory and CPU cores of a single executor. Dynamic allocation only acquires and releases executors: the memory and cores per executor stay fixed, and only the number of executors changes.</p>


<img src="../images/spark_memory_management-spark_yarn_memory_eng.png"/>




<p>Execution memory is used for Spark shuffles, joins, sorts, and aggregations. Since different
queries may require different amounts of memory, the fraction <strong>(spark.memory.fraction
is 0.6 by default)</strong> of the available memory to dedicate to this can be
tricky to tune but it’s easy to adjust. By contrast, storage memory is primarily used for
caching user data structures and partitions derived from DataFrames.</p>
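To make the fraction concrete, here is a small arithmetic sketch of how an executor heap is split. It assumes the defaults described in Spark's tuning documentation: 300 MB of reserved memory, `spark.memory.fraction` of 0.6, and `spark.memory.storageFraction` of 0.5. The helper name `unified_memory_mb` is hypothetical. The boundary between storage and execution is soft in Spark, so treat the two figures as starting shares rather than hard limits.

```python
RESERVED_MB = 300  # memory Spark sets aside before the fractions are applied


def unified_memory_mb(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the execution and storage regions of an executor heap, in MB."""
    unified = (executor_heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    execution = unified - storage
    return {"unified": unified, "storage": storage, "execution": execution}


# Example: an executor started with spark.executor.memory=2g (2048 MB heap).
regions = unified_memory_mb(2048)
for name, size in regions.items():
    print(f"{name}: {size:.1f} MB")
```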

## Recommended properties for shuffle operations, based on experience

<img src="../images/spark_recomended_props_for_shuffle_ops.jpg"/>

<p>Source: Learning Spark, O'Reilly, 2020</p>

In [2]:
! cat /opt/manual/spark/conf/spark-defaults.conf | grep spark

# Default system properties included when running spark-submit.
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.driver.memory                         512m
spark.shuffle.file.buffer                   1m
spark.file.transferTo                       false
spark.shuffle.unsafe.file.output.buffer     1m
spark.io.compression.lz4.blockSize          512k
spark.shuffle.service.index.cache.size      200m
spark.shuffle.registration.timeout          12000ms
spark.shuffle.registration.maxAttempts      5
spark.sql.warehouse.dir                     /user/hive/warehouse


## Spark parallelism

<img src="../images/spark_tasks_core_partitions.png"/>

Source: Learning Spark, O'Reilly, 2020

## How are partitions created?

<ul>
    <li>Data on disk is laid out in contiguous file blocks; on HDFS, for example, the default block size is 128 MB. A contiguous collection of these blocks constitutes a partition.</li>
    <li>The maximum size of a partition when reading files is set by <code>spark.sql.files.maxPartitionBytes</code> (128 MB by default).</li>
    <li>You can change the number of partitions with <code>repartition(n)</code>, either while reading the data or afterwards, for example <code>df.repartition(20)</code>.</li>
    <li>Finally, shuffle partitions are created during the shuffle stage. By default, the number
of shuffle partitions is set to <code>200</code> in <code>spark.sql.shuffle.partitions</code>. You can adjust this number depending on the size of your data set, to reduce the number of
small partitions being sent across the network to executors’ tasks. For smaller workloads, consider a lower value, such as the number of cores on the executors or fewer.</li>
</ul>
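The relationship between data size, partition size, and cores can be sketched with some simple arithmetic. This is a rough estimate only: Spark's real file-splitting logic also accounts for a per-file open cost and the available parallelism, which are ignored here. Both helper names are hypothetical, and the example assumes one task per core.

```python
import math

MIB = 1024 * 1024


def estimate_read_partitions(total_bytes, max_partition_bytes=128 * MIB):
    """Rough number of partitions created when reading total_bytes of files."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))


def task_waves(num_partitions, total_cores):
    """Number of rounds needed when each core processes one partition at a time."""
    return math.ceil(num_partitions / total_cores)


# Reading 1 GiB with the default 128 MB maximum partition size.
read_partitions = estimate_read_partitions(1024 * MIB)
print(read_partitions)

# The default 200 shuffle partitions on 8 cores, compared with 8 shuffle partitions.
print(task_waves(200, 8), task_waves(8, 8))
```

With few cores, the default of 200 shuffle partitions means many rounds of small tasks, which is why a lower value can suit small data sets.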
