
Commit

minor updates
LucaCanali committed Aug 14, 2018
1 parent 42f923d commit 2bc8419
Showing 3 changed files with 8 additions and 2 deletions.
4 changes: 4 additions & 0 deletions docs/Flight_recorder_mode.md
@@ -51,3 +51,7 @@ the Spark History Server.
See also this note with [a few tips on how to read event log files](https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_EventLog.md)

- For metrics analysis see also notes at [Notes_on_metrics_analysis.md](Notes_on_metrics_analysis.md) for a few examples.

+ - If you are deploying applications using cluster mode, note that the serialized metrics
+ are written by the driver and therefore the path is local to the driver process.
+ You could use a network filesystem mounted on the driver/cluster for convenience.
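
A minimal sketch of this point follows. It is a hedged example, not taken from the docs: the listener class name and the output-path configuration key should be checked against Flight_recorder_mode.md for the sparkMeasure version in use, and `/mnt/shared` is a hypothetical mount point for a network filesystem available on the driver.

```python
# Minimal sketch: flight recorder mode with the output directed to a shared filesystem.
# Assumptions: ch.cern.sparkmeasure.FlightRecorderStageMetrics is the listener class
# shipped with the sparkMeasure version in use (the jar must be on the driver classpath),
# and the configuration key for the output path (shown here as
# spark.sparkmeasure.outputFilename) matches the one documented in Flight_recorder_mode.md.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("flight recorder example")
         .config("spark.extraListeners", "ch.cern.sparkmeasure.FlightRecorderStageMetrics")
         .config("spark.sparkmeasure.outputFilename", "/mnt/shared/stageMetrics.serialized")
         .getOrCreate())

# Run the workload; in cluster mode the serialized metrics file is written by the
# driver process, hence on the driver's filesystem, when the application ends.
spark.range(1000).count()
spark.stop()
```
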
4 changes: 3 additions & 1 deletion docs/TODO_and_issues.md
@@ -26,9 +26,11 @@ If you plan to contribute to sparkMeasure development, please start by reviewing
TaskMetrics._updatedBlockStatuses is off by default.
* TODO (maybe) implement a removeSparkListener method in the sparkMeasure APIs, to allow stopping data collection
from sparkMeasure (note: this is only possible from Spark version 2.2 and above)
+ * TODO (maybe) add an additional sink for the flight recorder mode, in particular an HDFS sink
+ (currently metrics are written to the local filesystem of the driver)
* gatherAccumulables=true for taskMetrics(sparkSession: SparkSession, gatherAccumulables: Boolean)
currently only works on Spark 2.1.x and breaks starting from Spark 2.2.1. This is a consequence of
[SPARK PR 17596](https://github.com/apache/spark/pull/17596).
TODO (maybe): restore the functionality of measuring task accumulables for Spark 2.2.x
- * TODO (maybe): ost-processing of metrics data in scala, rather than Spark SQL?
+ * TODO (maybe): post-processing of metrics data in scala, rather than Spark SQL?
The advantage would be to avoid "polluting" the execution environment with additional SQL jobs, as is the case now.
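
Regarding that last TODO item, here is a hedged sketch of what driver-side post-processing could look like with the Python wrapper. The StageMetrics methods used (begin, end, create_stagemetrics_DF) are assumptions based on the sparkMeasure API and may differ across versions.

```python
# Hedged sketch for the post-processing TODO above: aggregate stage metrics on the
# driver in plain Python, instead of running additional Spark SQL jobs.
# Assumption: the Python wrapper exposes StageMetrics with begin()/end() and a
# create_stagemetrics_DF() method returning the collected metrics as a DataFrame;
# check the API of the sparkMeasure version in use.
from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

spark = SparkSession.builder.appName("post-processing sketch").getOrCreate()
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()

# Bring the (small) metrics DataFrame to the driver and aggregate it locally,
# so no extra SQL aggregation queries run on the cluster.
rows = stagemetrics.create_stagemetrics_DF("PerfStageMetrics").collect()
print("sum of executorRunTime (ms):", sum(r["executorRunTime"] for r in rows))
```
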
2 changes: 1 addition & 1 deletion examples/test_sparkmeasure_python.py
@@ -30,7 +30,7 @@ def run_my_workload(spark):


if __name__ == "__main__":
-    # crate spark session
+    # create spark session
    spark = SparkSession \
        .builder \
        .appName("Test sparkmeasure instrumentation of Python/PySpark code") \
