In [1]:
# Importing SparkContext, SparkConf
from pyspark import SparkContext, SparkConf

In [2]:
# Stop any existing SparkContext; only one SparkContext can be active per JVM.
# Note: `sc` is predefined only when the notebook is started through the pyspark shell.
sc.stop()

# Understanding Configuration of Spark for local mode

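#### In local mode there is a single JVM that acts as both driver and executor (so we have one executor)
#### Task slots = executor cores = available threads; this number does not have to equal the number of CPU cores
#### Each slot runs one task at a time; when there are more tasks than slots, the remaining tasks wait and run as slots become free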

# Understanding your own machine

#### In Windows PowerShell
            wmic
            wmic:root\cli> CPU Get NumberOfCores,NumberOfLogicalProcessors /Format:List
#### It will show
            NumberOfCores=4
            NumberOfLogicalProcessors=8
            
#### Hyper-Threading is enabled. With Hyper-Threading, each physical core can execute two (rather than one) concurrent streams (threads) of instructions sent by the operating system, so the 4 physical cores appear as 8 logical processors
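
#### The same check can be made from Python on any operating system. This is a minimal sketch that uses only the standard library; `os.cpu_count()` reports logical processors (8 on the machine above), not physical cores.

```python
import os

# os.cpu_count() returns the number of logical CPUs, or None if it cannot be determined.
logical_cpus = os.cpu_count() or 1
print(f"Logical processors visible to Python: {logical_cpus}")
```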

## Setting up Spark Local Cluster configuration:


#### One way to set up local mode: keep 2 logical processors for the OS and give 6 to the Spark JVM, i.e. set 6 task slots.
#### If we need to utilize all the processors efficiently (CPU utilization),
#### set 12 or 18 task slots (2 or 3 per processor) and account for the memory each slot will need
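
#### The arithmetic above can be sketched as a small helper. `local_master` and its parameter names are hypothetical and not part of PySpark; it only builds the master string that is later passed to `setMaster`.

```python
def local_master(logical_cpus, reserved_for_os=2, slots_per_cpu=1):
    """Build a local master string (hypothetical helper, not a PySpark API)."""
    # Leave some logical processors to the OS, but always keep at least one for Spark.
    usable = max(1, logical_cpus - reserved_for_os)
    return f"local[{usable * slots_per_cpu}]"

# With 8 logical processors and 2 reserved for the OS:
print(local_master(8))                   # one slot per remaining processor
print(local_master(8, slots_per_cpu=2))  # two slots per remaining processor
print(local_master(8, slots_per_cpu=3))  # three slots per remaining processor
```

#### For the machine described above this gives `local[6]`, `local[12]` and `local[18]`.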

In [3]:
conf_spark = SparkConf().set("spark.driver.host", "127.0.0.1").setMaster("local[12]").setAppName("myapp")

In [4]:
conf_spark.getAll()

[('spark.app.name', 'myapp'),
 ('spark.master', 'local[12]'),
 ('spark.driver.host', '127.0.0.1'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.memory', '5g'),
 ('spark.ui.showConsoleProgress', 'true')]

In [5]:
sc = SparkContext(conf=conf_spark)

#### If we look at the Spark Web UI, we can see 12 slots, with some memory allocated
<img src="Data/Executors_SparkWebUI.PNG">

#### There are 2 other options for the local master; let us look at them

#### local[*] or local


In [None]:
sc.stop()
conf_spark = SparkConf().set("spark.driver.host", "127.0.0.1").setMaster("local[*]").setAppName("myapp")

In [None]:
sc = SparkContext(conf=conf_spark)

<img src="Data/Executors_SparkWebUI2.PNG">

In [None]:
## Shows all 8 available cores, with 366.3 MB memory

In [None]:
sc.stop()
conf_spark = SparkConf().set("spark.driver.host", "127.0.0.1").setMaster("local").setAppName("myapp")
sc = SparkContext(conf=conf_spark)

In [None]:
## Shows 1 core, with 366.3MB Memory

#### Standard: number of cores = number of concurrent tasks an executor can run (Source: Stack Overflow)
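
#### The three master strings tried above map to slot counts as follows. This is an illustrative sketch only: `task_slots` is a hypothetical helper, it handles just the `local`, `local[N]` and `local[*]` forms used in this notebook, and it is not how Spark itself parses the master URL.

```python
import os
import re

def task_slots(master, logical_cpus=None):
    """Return the number of task slots implied by a local master string (hypothetical helper)."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    if master == "local":
        return 1  # plain "local" runs with a single thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"Unsupported master string: {master!r}")
    value = match.group(1)
    # "*" means one slot per logical processor; otherwise use the given number.
    return logical_cpus if value == "*" else int(value)

for master in ("local", "local[12]", "local[*]"):
    print(master, "->", task_slots(master, logical_cpus=8))
```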

### Let us understand the memory of spark slots

#### "Storage" here does not mean a storage system. Spark has no storage layer of its own such as HDFS, although data can spill to disk. So we are only talking about cache memory and the overall memory the executor uses while executing tasks. This overall memory is allocated to the executor/driver. We have one JVM, which holds the single executor/driver, so we need to find driver.memory/executor.memory. The driver collects the results, and the executor keeps the partitions of data in memory for computation.

#### On-heap memory management: objects are allocated on the JVM heap and are subject to GC.
#### Off-heap memory management: objects are serialized and allocated in memory outside the JVM heap, managed by the application, and not subject to GC. This approach avoids frequent GC, but the disadvantage is that the logic for allocating and releasing memory has to be written explicitly.

#### Speed: on-heap > off-heap > disk

#### Unified Memory Manager mechanism:
#### The Storage memory and Execution memory share a memory area, and both can occupy each other's free area.
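
#### The size of this shared area can be estimated with a little arithmetic. The sketch below assumes the defaults described in the Spark tuning guide for recent versions (300 MB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5). The 910.5 MB input is an assumption about what a JVM started with a 1g heap reports as its maximum memory; the real value depends on the JVM. Under these assumptions the result is about 366.3 MB, which is in line with the figure shown in the Web UI above.

```python
RESERVED_MB = 300.0  # memory Spark sets aside before splitting the heap (assumed default)

def unified_memory_mb(jvm_max_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the unified (execution + storage) region and its storage share, in MB."""
    unified = (jvm_max_heap_mb - RESERVED_MB) * memory_fraction
    # The storage share is only a soft boundary: execution and storage can borrow from each other.
    storage = unified * storage_fraction
    return unified, storage

# 910.5 MB is an assumed JVM-reported maximum for a 1g heap, not a measured value.
unified, storage = unified_memory_mb(910.5)
print(f"Unified memory: {unified:.1f} MB, of which storage share: {storage:.1f} MB")
```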

In [None]:
# Setting memory for application
# Maximum heap size can be set with spark.driver.memory in cluster mode and through
# the --driver-memory command line option in client mode.
# Note: In client mode, this config must not be set through the SparkConf directly in your application,
# because the driver JVM has already started at that point.
# Instead, set it through the --driver-memory command line option or in your default properties file.
## Testing the above point?
# Yes, we need to change the memory in the conf file.
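
#### Besides adding a line such as `spark.driver.memory 5g` to `conf/spark-defaults.conf`, a notebook can pass the command line option through the `PYSPARK_SUBMIT_ARGS` environment variable, which PySpark reads when it launches the JVM. The sketch below is a hedged example: `set_driver_memory` is a hypothetical helper, and it only has an effect if it runs before the first SparkContext is created in the Python process.

```python
import os

def set_driver_memory(memory="5g"):
    """Request driver memory via PYSPARK_SUBMIT_ARGS (hypothetical helper).

    Must be called before the first SparkContext is created, because the
    driver JVM reads these arguments only once, at startup.
    """
    # The trailing "pyspark-shell" token is required by PySpark's JVM launcher.
    os.environ["PYSPARK_SUBMIT_ARGS"] = f"--driver-memory {memory} pyspark-shell"
    return os.environ["PYSPARK_SUBMIT_ARGS"]

print(set_driver_memory("5g"))
```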

In [7]:
sc.stop()
conf_spark = SparkConf().set("spark.driver.host", "127.0.0.1").setMaster("local[5]").setAppName("myapp")
sc = SparkContext(conf=conf_spark)
conf_spark.getAll()

[('spark.app.name', 'myapp'),
 ('spark.driver.host', '127.0.0.1'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.master', 'local[5]'),
 ('spark.driver.memory', '5g'),
 ('spark.ui.showConsoleProgress', 'true')]


<img src="Data/Executors_SparkWebUI3.PNG">

In [9]:
## How to choose the 5g of memory? It depends on the tasks and the size of the dataset.
## It will vary with the data size and with the number of operations such as shuffles, caching and computations.

In [None]:
## More can be found on https://spark.apache.org/docs/latest/tuning.html#memory-management-overview
## http://spark.apache.org/docs/latest/configuration.html#memory-management

##### References: Databricks, StackOverflow, Apache Spark Documentation, www.tutorialdocs.com
    