
Commit

minor updates
LucaCanali committed Aug 14, 2018
1 parent 42f923d commit 2bc8419
Showing 3 changed files with 8 additions and 2 deletions.
4 changes: 4 additions & 0 deletions docs/Flight_recorder_mode.md
@@ -51,3 +51,7 @@ the Spark History Server.
See also this note with [a few tips on how to read event log files](https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_EventLog.md)

- For metrics analysis see also notes at [Notes_on_metrics_analysis.md](Notes_on_metrics_analysis.md) for a few examples.

+ - If you are deploying applications using cluster mode, note that the serialized metrics
+ are written by the driver and therefore the path is local to the driver process.
+ You could use a network filesystem mounted on the driver/cluster for convenience.
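
A minimal sketch of this point follows. It is a hedged example, not taken from the docs: the listener class name and the output-path configuration key should be checked against Flight_recorder_mode.md for the sparkMeasure version in use, and `/mnt/shared` is a hypothetical mount point for a network filesystem available on the driver.

```python
# Minimal sketch: flight recorder mode with the output directed to a shared filesystem.
# Assumptions: ch.cern.sparkmeasure.FlightRecorderStageMetrics is the listener class
# shipped with the sparkMeasure version in use (the jar must be on the driver classpath),
# and the configuration key for the output path (shown here as
# spark.sparkmeasure.outputFilename) matches the one documented in Flight_recorder_mode.md.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("flight recorder example")
         .config("spark.extraListeners", "ch.cern.sparkmeasure.FlightRecorderStageMetrics")
         .config("spark.sparkmeasure.outputFilename", "/mnt/shared/stageMetrics.serialized")
         .getOrCreate())

# Run the workload; in cluster mode the serialized metrics file is written by the
# driver process, hence on the driver's filesystem, when the application ends.
spark.range(1000).count()
spark.stop()
```
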
4 changes: 3 additions & 1 deletion docs/TODO_and_issues.md
@@ -26,9 +26,11 @@ If you plan to contribute to sparkMeasure development, please start by reviewing
TaskMetrics._updatedBlockStatuses is off by default.
* TODO (maybe) implement a removeSparkListener method in the sparkMeasure APIs, to allow stopping data collection
from sparkMeasure (note: this is only possible from Spark version 2.2 and above)
+ * TODO (maybe) add an additional sink for the flight recorder mode, in particular an HDFS sink
+ (currently metrics are written to the local filesystem of the driver)
* gatherAccumulables=true for taskMetrics(sparkSession: SparkSession, gatherAccumulables: Boolean)
currently only works on Spark 2.1.x and breaks starting from Spark 2.2.1. This is a consequence of
[SPARK PR 17596](https://github.com/apache/spark/pull/17596).
TODO (maybe): restore the functionality of measuring task accumulables for Spark 2.2.x
- * TODO (maybe): ost-processing of metrics data in scala, rather than Spark SQL?
+ * TODO (maybe): post-processing of metrics data in scala, rather than Spark SQL?
The advantage would be to avoid "polluting" the execution environment with additional SQL jobs, as is the case now.
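
Regarding that last TODO item, here is a hedged sketch of what driver-side post-processing could look like with the Python wrapper. The StageMetrics methods used (begin, end, create_stagemetrics_DF) are assumptions based on the sparkMeasure API and may differ across versions.

```python
# Hedged sketch for the post-processing TODO above: aggregate stage metrics on the
# driver in plain Python, instead of running additional Spark SQL jobs.
# Assumption: the Python wrapper exposes StageMetrics with begin()/end() and a
# create_stagemetrics_DF() method returning the collected metrics as a DataFrame;
# check the API of the sparkMeasure version in use.
from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

spark = SparkSession.builder.appName("post-processing sketch").getOrCreate()
stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()

# Bring the (small) metrics DataFrame to the driver and aggregate it locally,
# so no extra SQL aggregation queries run on the cluster.
rows = stagemetrics.create_stagemetrics_DF("PerfStageMetrics").collect()
print("sum of executorRunTime (ms):", sum(r["executorRunTime"] for r in rows))
```
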
2 changes: 1 addition & 1 deletion examples/test_sparkmeasure_python.py
@@ -30,7 +30,7 @@ def run_my_workload(spark):


if __name__ == "__main__":
-    # crate spark session
+    # create spark session
    spark = SparkSession \
        .builder \
        .appName("Test sparkmeasure instrumentation of Python/PySpark code") \
