# Quickstart for debugging PySpark applications
A sample application for debugging Python Spark apps using Rookout.
Before following this guide we recommend reading the basic Python + Rookout guide.
## Running PySpark jobs with Rookout
- Clone the repository and install its dependencies:

  ```shell
  git clone https://github.com/Rookout/deployment-examples.git
  cd deployment-examples/python-spark
  pip install -r requirements.txt  # also on executor nodes, if running in a cluster
  ```
- Export your organization token:

  ```shell
  export ROOKOUT_TOKEN=<Your Rookout Token>
  ```
- Try placing breakpoints at these locations in `example.py`:
  - `main` on line 65
  - `map_partitions_handler` on line 48
  - `multiply_latlong` (a UDF) on line 24
- Run your program: use `spark-submit` to submit the job while loading Rookout into the Spark executors (Spark standalone):

  ```shell
  spark-submit --conf spark.python.daemon.module=rook.pyspark_daemon example.py
  ```
Have fun debugging: go to https://app.rookout.com and start debugging :)
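The breakpoint targets above follow a common PySpark job shape. A minimal pure-Python sketch of that structure (hypothetical and simplified; the real `example.py`, its line numbers, and its Spark wiring differ):

```python
def multiply_latlong(lat, lng):
    """Applied per record on the executors (e.g. as a UDF); a natural breakpoint target."""
    return lat * lng

def map_partitions_handler(rows):
    """Called once per partition on the executors, with an iterator of rows."""
    for lat, lng in rows:
        yield multiply_latlong(lat, lng)

def main():
    # In the real job: rook.start() loads Rookout into the driver here,
    # then a SparkSession is created, sample-data.csv is read, and the
    # functions above are applied via mapPartitions / a UDF.
    ...

if __name__ == "__main__":
    main()
```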
Note that you cannot place breakpoints in executor code defined in the `__main__` module, as a side effect of how PySpark serializes functions. Instead, try placing breakpoints at:
- `example.main` on line 36
- `example_executor_module.handle_record` on line 30
- `example_executor_module.multiply_latlong` (a UDF) on line 7
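In practice this means keeping all executor-side code in a separate, importable module. A hypothetical sketch of that layout (the names echo the example files, but the real code differs):

```python
# example_executor_module.py -- executors import this module from disk,
# so Rookout can map breakpoints back to its source lines.

def multiply_latlong(lat, lng):
    return lat * lng

def handle_record(record):
    lat, lng = record
    return multiply_latlong(lat, lng)
```

`example.py` would then reference these functions as `example_executor_module.handle_record` and `example_executor_module.multiply_latlong` rather than defining them in `__main__`.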
## Running under YARN (AWS EMR)
- Upload the `sample-data.csv` file to an S3 bucket, and modify the code to use `s3a://bucket-name/sample-data.csv` as the path, where `bucket-name` is replaced by the bucket that contains your uploaded file.
- Run `spark-submit`, specifying the Rookout token for both the YARN application master and the executors:

  ```shell
  spark-submit --conf spark.python.daemon.module=rook.pyspark_daemon --conf spark.yarn.appMasterEnv.ROOKOUT_TOKEN=[Your Rookout Token] --conf spark.executorEnv.ROOKOUT_TOKEN=[Your Rookout Token] example.py
  ```
## Rookout Integration explained
Rookout is loaded into the Spark driver on line 74 of `main.py` by calling `rook.start()` directly. This is the standard way of loading Rookout, but it does not load Rookout into the executor nodes.
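A sketch of how that driver-side loading might be wrapped (a hypothetical helper, not the real file; `rook.start()` with no arguments picks up the token from the `ROOKOUT_TOKEN` environment variable exported earlier):

```python
import os

def load_rookout_into_driver():
    # Fail fast if the token was not exported before spark-submit.
    if "ROOKOUT_TOKEN" not in os.environ:
        raise RuntimeError("export ROOKOUT_TOKEN before submitting the job")
    import rook  # the Rookout SDK, installed via requirements.txt
    rook.start()
```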
Specifying the configuration option `--conf spark.python.daemon.module=rook.pyspark_daemon` loads Rookout into the executor nodes: with this option, Rookout is automatically loaded into each worker process. This is necessary for placing breakpoints in any code that runs on executor nodes.
Rookout cannot currently place executor breakpoints in nested functions, lambdas, or static methods when those are used directly as a UDF or a `mapPartitions` callable, but you can place breakpoints in functions that are called by the UDF or the `mapPartitions` callable (as in the example).
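For example (a pure-Python sketch of the pattern, without Spark, using hypothetical names): hand Spark the lambda, but keep the actual logic in a named, module-level helper:

```python
def handle_value(x):
    # Named, module-level helper: executor breakpoints can be placed here.
    return x * 2

# The callable handed to Spark (as a UDF or mapPartitions argument) may be a
# lambda; breakpoints cannot go in the lambda itself, but they work in the
# helper it delegates to.
udf_callable = lambda x: handle_value(x)
```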
In addition, you cannot place breakpoints in functions defined in the `__main__` module. This is a side effect of how PySpark serializes functions before sending them to executors.
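A rough standard-library illustration of why (PySpark actually uses a cloudpickle variant, so the details differ): a named module-level function pickles as a reference to its module and name, which is what lets an executor re-import the original source file, while a lambda has no importable name and plain pickle refuses it outright:

```python
import pickle

def module_level(x):
    return x + 1

# Only the module path and function name are stored, not the code object:
payload = pickle.dumps(module_level)
assert b"module_level" in payload

# A lambda cannot be serialized by reference at all under plain pickle:
try:
    pickle.dumps(lambda x: x + 1)
    serialized = True
except Exception:
    serialized = False
assert serialized is False
```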