# Example notebook to showcase sparkMeasure APIs for Python
  
References:  
- [https://github.com/LucaCanali/sparkMeasure](https://github.com/LucaCanali/sparkMeasure)  
- sparkmeasure Python docs: [docs/Python_shell_and_Jupyter](https://github.com/LucaCanali/sparkMeasure/blob/master/docs/Python_shell_and_Jupyter.md)  

Author: Luca.Canali@cern.ch  
July 2018  
Last updated August 2022

Dependencies:  
- install the sparkmeasure Python wrapper package, if not already done:  
  `pip install sparkmeasure`  
- install Spark:  
  `pip install pyspark`


In [None]:
# Start the Spark Session
# This uses local mode for simplicity
# the use of findspark is optional

# import findspark
# findspark.init("/home/luca/Spark/spark-3.3.0-bin-hadoop3")

from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("Test sparkmeasure instrumentation of Python/PySpark code")
         .master("local[*]")
         .config("spark.jars.packages","ch.cern.sparkmeasure:spark-measure_2.12:0.18")
         .getOrCreate()
        )  


In [2]:
# Load the Python API for sparkmeasure package
# and attach the sparkMeasure Listener for stagemetrics to the active Spark session

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)

In [3]:
# Define a line and cell magic to wrap the instrumentation
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def sparkmeasure(line, cell=None):
    """Run and measure a Spark workload. Use: %sparkmeasure or %%sparkmeasure"""
    # Cell magic passes the cell body in `cell`; line magic passes the code in `line`
    val = cell if cell is not None else line
    stagemetrics.begin()
    # exec (rather than eval) also accepts statements; globals() is the notebook namespace
    exec(val, globals())
    stagemetrics.end()
    stagemetrics.print_report()
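
The begin/run/end/report pattern used by the magic can also be written as a context manager, which may be convenient outside of notebooks. The following is a minimal sketch, not part of sparkMeasure: `measured` and `FakeMetrics` are hypothetical names introduced here, and `FakeMetrics` only stands in for `StageMetrics` so that the sketch runs without a Spark session.

```python
from contextlib import contextmanager

@contextmanager
def measured(metrics):
    """Hypothetical helper: run the enclosed block between begin() and end()."""
    metrics.begin()
    try:
        yield metrics
    finally:
        # Runs even if the workload raises, so the measurement is always closed
        metrics.end()
        metrics.print_report()

class FakeMetrics:
    """Stand-in for StageMetrics so the sketch runs without a Spark session."""
    def __init__(self):
        self.calls = []
    def begin(self):
        self.calls.append("begin")
    def end(self):
        self.calls.append("end")
    def print_report(self):
        self.calls.append("print_report")

fake = FakeMetrics()
with measured(fake):
    fake.calls.append("workload")  # the Spark workload would go here
print(fake.calls)
```

With a live session, the same helper would presumably be used as `with measured(stagemetrics): spark.sql(...).show()`.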

In [4]:
%%sparkmeasure
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show()

+----------+
|  count(1)|
+----------+
|1000000000|
+----------+


Scheduling mode = FIFO
Spark Context default degree of parallelism = 8

Aggregated Spark stage metrics:
numStages => 3
numTasks => 17
elapsedTime => 1261 (1 s)
stageDuration => 1031 (1 s)
executorRunTime => 2816 (3 s)
executorCpuTime => 2248 (2 s)
executorDeserializeTime => 2630 (3 s)
executorDeserializeCpuTime => 979 (1.0 s)
resultSerializationTime => 10 (10 ms)
jvmGCTime => 72 (72 ms)
shuffleFetchWaitTime => 0 (0 ms)
shuffleWriteTime => 13 (13 ms)
resultSize => 16134 (15.0 KB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 0
recordsRead => 2000
bytesRead => 0 (0 Bytes)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 8
shuffleTotalBlocksFetched => 8
shuffleLocalBlocksFetched => 8
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 472 (472 Bytes)
shuffleLocalBytesRead => 472 (472 Bytes)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDis
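
The report above is plain text in the form `name => value (human-readable value)`. If the numbers are needed for further processing, one option is to parse that text into a dictionary. The sketch below assumes the line format shown above; `parse_report` is a hypothetical helper, not a sparkMeasure function, and the sample text is a subset of the lines printed above.

```python
import re

# Sample lines copied from the report printed above
report_text = """\
Aggregated Spark stage metrics:
numStages => 3
numTasks => 17
elapsedTime => 1261 (1 s)
executorRunTime => 2816 (3 s)
executorCpuTime => 2248 (2 s)
resultSize => 16134 (15.0 KB)
"""

def parse_report(text):
    """Hypothetical helper: map 'name => value (pretty)' lines to {name: int}."""
    metrics = {}
    for line in text.splitlines():
        # Keep the metric name and the leading integer; ignore the pretty-printed part
        match = re.match(r"^(.+?) => (\d+)", line)
        if match:
            metrics[match.group(1)] = int(match.group(2))
    return metrics

metrics = parse_report(report_text)
print(metrics["elapsedTime"], metrics["numTasks"])
```

Lines that do not match the pattern, such as the `Aggregated Spark stage metrics:` header, are skipped.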

In [5]:
# You can also explicitly wrap your Spark workload in stagemetrics instrumentation,
# as in this example
stagemetrics.begin()

spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show()

stagemetrics.end()
# Print a summary report
stagemetrics.print_report()

+----------+
|  count(1)|
+----------+
|1000000000|
+----------+


Scheduling mode = FIFO
Spark Context default degree of parallelism = 8

Aggregated Spark stage metrics:
numStages => 3
numTasks => 17
elapsedTime => 461 (0.5 s)
stageDuration => 401 (0.4 s)
executorRunTime => 2273 (2 s)
executorCpuTime => 1909 (2 s)
executorDeserializeTime => 63 (63 ms)
executorDeserializeCpuTime => 38 (38 ms)
resultSerializationTime => 0 (0 ms)
jvmGCTime => 0 (0 ms)
shuffleFetchWaitTime => 0 (0 ms)
shuffleWriteTime => 4 (4 ms)
resultSize => 16048 (15.0 KB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 0
recordsRead => 2000
bytesRead => 0 (0 Bytes)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 8
shuffleTotalBlocksFetched => 8
shuffleLocalBlocksFetched => 8
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 472 (472 Bytes)
shuffleLocalBytesRead => 472 (472 Bytes)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 
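
The two reports above measure the same query, yet the first run shows a much higher `executorDeserializeTime` and `elapsedTime`, plausibly due to warm-up effects on first execution. The sketch below compares a few values copied from the two reports and computes the ratio of `executorCpuTime` to `executorRunTime` as a rough indicator of how much of the run time was spent on CPU. The dictionary names and the `cpu_fraction` helper are introduced here for illustration only.

```python
# Values in milliseconds, copied from the two reports above
first_run = {"elapsedTime": 1261, "executorRunTime": 2816,
             "executorCpuTime": 2248, "executorDeserializeTime": 2630}
second_run = {"elapsedTime": 461, "executorRunTime": 2273,
              "executorCpuTime": 1909, "executorDeserializeTime": 63}

def cpu_fraction(run):
    """Fraction of executor run time that was spent on CPU."""
    return run["executorCpuTime"] / run["executorRunTime"]

# Per-metric change between the first and the second run
deltas = {name: second_run[name] - first_run[name] for name in first_run}
for name, delta in deltas.items():
    print(f"{name}: {first_run[name]} -> {second_run[name]} ms ({delta:+d} ms)")

print(f"CPU fraction, first run:  {cpu_fraction(first_run):.2f}")
print(f"CPU fraction, second run: {cpu_fraction(second_run):.2f}")
```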

In [6]:
# Another way to encapsulate code and instrumentation in a compact form

stagemetrics.runandmeasure(locals(), """
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show()
""")

+----------+
|  count(1)|
+----------+
|1000000000|
+----------+


Scheduling mode = FIFO
Spark Context default degree of parallelism = 8

Aggregated Spark stage metrics:
numStages => 3
numTasks => 17
elapsedTime => 510 (0.5 s)
stageDuration => 399 (0.4 s)
executorRunTime => 2255 (2 s)
executorCpuTime => 1930 (2 s)
executorDeserializeTime => 75 (75 ms)
executorDeserializeCpuTime => 43 (43 ms)
resultSerializationTime => 0 (0 ms)
jvmGCTime => 0 (0 ms)
shuffleFetchWaitTime => 0 (0 ms)
shuffleWriteTime => 37 (37 ms)
resultSize => 16048 (15.0 KB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 0
recordsRead => 2000
bytesRead => 0 (0 Bytes)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 8
shuffleTotalBlocksFetched => 8
shuffleLocalBlocksFetched => 8
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 472 (472 Bytes)
shuffleLocalBytesRead => 472 (472 Bytes)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk =

## Example of collecting data using Task Metrics
Collecting Spark task metrics at the granularity of each task completion has additional overhead
compared to collecting at the stage completion level. Use this option only if you need the finer granularity, for example
to study skew effects; otherwise, prefer stagemetrics aggregation.
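
To illustrate why per-task data matters for skew: aggregated totals hide a single slow task, while per-task durations make it visible. The following sketch uses hypothetical task durations (illustrative values, not taken from the runs in this notebook) and a hypothetical `skew_ratio` helper that compares the slowest task with the median task.

```python
from statistics import median

# Hypothetical per-task durations in milliseconds (illustrative values only,
# not measured in this notebook)
task_durations_ms = [120, 130, 125, 135, 128, 122, 131, 900]

def skew_ratio(durations):
    """Ratio of the slowest task to the median task; close to 1 means little skew."""
    return max(durations) / median(durations)

ratio = skew_ratio(task_durations_ms)
print(f"max/median task duration = {ratio:.1f}")
```

A ratio well above 1 suggests that one or a few tasks dominate the stage duration, which is the kind of effect that stage-level aggregates alone cannot show.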


In [7]:
from sparkmeasure import TaskMetrics
taskmetrics = TaskMetrics(spark)

taskmetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show()
taskmetrics.end()
taskmetrics.print_report()

+----------+
|  count(1)|
+----------+
|1000000000|
+----------+


Scheduling mode = FIFO
Spark Context default degree of parallelism = 8

Aggregated Spark task metrics:
numTasks => 17
successful tasks => 17
speculative tasks => 0
taskDuration => 2310 (2 s)
schedulerDelayTime => 85 (85 ms)
executorRunTime => 2168 (2 s)
executorCpuTime => 1952 (2 s)
executorDeserializeTime => 57 (57 ms)
executorDeserializeCpuTime => 28 (28 ms)
resultSerializationTime => 0 (0 ms)
jvmGCTime => 0 (0 ms)
shuffleFetchWaitTime => 0 (0 ms)
shuffleWriteTime => 0 (0 ms)
gettingResultTime => 0 (0 ms)
resultSize => 2667 (2.0 KB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 0
recordsRead => 2000
bytesRead => 0 (0 Bytes)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 8
shuffleTotalBlocksFetched => 8
shuffleLocalBlocksFetched => 8
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 472 (472 Bytes)
shuffleLocalBytesRead => 472 (472 Bytes)
shuffleR