-
Notifications
You must be signed in to change notification settings - Fork 34
How GridMap works under the hood
To run a function on a bunch of arguments using GridMap, you either use the grid_map
or process_jobs
functions. grid_map
automatically creates a bunch of Job
objects and sends them to process_jobs
, so in the end, you're always using process_jobs
(albeit indirectly with grid_map
). When you create a Job
object it saves your function, its arguments, and the path to the module containing the function (since that needs to be in the sys.path
for unpickling later).
process_jobs
does the following with the Job
objects:
- It creates a
JobMonitor
instance that uses ØMQ to communicate with theheartbeat
andrunner
processes that will be running on the different machines. TheJobMonitor
keeps track of what the inputs and outputs are for all of the jobs, and kills/resubmits jobs that have stalled. It also sends error email reports when things go awry. - It submits a bunch of Grid Engine command-line jobs that call
python -m gridmap.runner HOME_ADDRESS JOB_PATH
, whereHOME_ADDRESS
is the URL of the ØMQJobMonitor
andJOB_PATH
is the path to the module that the job's function belongs to. When these jobs gets executed on the cluster, they:- Immediately add
JOB_PATH
tosys.path
- Spawn a separate heartbeat process that repeatedly monitors CPU/memory usage and reports those back to the
JobMonitor
. - Request the job's function and input from the
JobMonitor
, which are sent as bz2-compressed pickles. - Execute the function inside a
try/except
that catches all exceptions. - Sends the return value of the function back to the
JobMonitor
as a bz2-compressed pickle. If the job encountered an exception, that is considered the return value. The text of the stack trace is also sent back to aid in debugging in these cases. - Kill the heartbeat process after completion.
- Immediately add
- It waits until the
JobMonitor
has either received valid output from all of the jobs, or any one of the jobs has encountered an exception. If there was an exception, it is re-raised and all jobs are killed. - Tears down the
JobMonitor
(and its local heartbeat process) and returns the outputs from all the jobs in a list.
As a side note, I should also mention that the JobMonitor
class is a context manager, so if any exceptions are encountered when we're inside a with
statement that instantiates one, we can tell that that happened and automatically try to kill all of the jobs. This includes KeyboardInterrupt
exceptions when people hit CTRL-C.