How GridMap works under the hood

Dan Blanchard edited this page Mar 27, 2015 · 10 revisions

To run a function on a bunch of arguments using GridMap, you use either the grid_map function or the process_jobs function. grid_map automatically creates a bunch of Job objects and passes them to process_jobs, so in the end you're always using process_jobs, even if only indirectly through grid_map. When you create a Job object, it saves your function, its arguments, and the path to the module containing the function (since that path needs to be in sys.path for unpickling later).
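As a rough illustration of the bookkeeping described above, here is a minimal sketch of a Job-like object. The names SketchJob and sketch_map are hypothetical stand-ins rather than GridMap's own classes or functions, and creating one job per argument is an assumption made for the example.

```python
import os
import sys


class SketchJob:
    """Hypothetical stand-in for a Job; not GridMap's actual class."""

    def __init__(self, function, args=(), kwargs=None):
        self.function = function
        self.args = tuple(args)
        self.kwargs = dict(kwargs or {})
        # Remember the directory of the module that defines the function,
        # so a remote process could add it to sys.path before unpickling.
        module = sys.modules.get(function.__module__)
        module_file = getattr(module, "__file__", None)
        if module_file:
            self.path = os.path.dirname(os.path.abspath(module_file))
        else:
            # Built-in modules have no file; fall back to the working directory.
            self.path = os.getcwd()

    def execute(self):
        """Call the saved function with the saved arguments."""
        return self.function(*self.args, **self.kwargs)


def sketch_map(function, args_list):
    """Build one SketchJob per argument (an assumption for this sketch)."""
    return [SketchJob(function, [arg]) for arg in args_list]


jobs = sketch_map(abs, [-1, 2, -3])
results = [job.execute() for job in jobs]
```

Here the jobs are executed locally in a list comprehension; in GridMap that step is what process_jobs hands off to the cluster.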

process_jobs does the following with the Job objects:

  1. It creates a JobMonitor instance that uses ØMQ to communicate with the heartbeat and runner processes that will be running on the different machines. The JobMonitor keeps track of what the inputs and outputs are for all of the jobs, and kills/resubmits jobs that have stalled. It also sends error email reports when things go awry.
  2. It submits a bunch of Grid Engine command-line jobs that call python -m gridmap.runner HOME_ADDRESS JOB_PATH, where HOME_ADDRESS is the URL of the ØMQ JobMonitor and JOB_PATH is the path to the module that the job's function belongs to. When these jobs get executed on the cluster, they:
    1. Immediately add JOB_PATH to sys.path
    2. Spawn a separate heartbeat process that repeatedly monitors CPU/memory usage and reports those back to the JobMonitor.
    3. Request the job's function and input from the JobMonitor, which are sent as bz2-compressed pickles.
    4. Execute the function inside a try/except that catches all exceptions.
    5. Send the return value of the function back to the JobMonitor as a bz2-compressed pickle. If the job encountered an exception, the exception itself is treated as the return value, and the text of the stack trace is sent back as well to aid in debugging.
    6. Kill the heartbeat process after completion.
  3. It waits until the JobMonitor has either received valid output from all of the jobs, or any one of the jobs has encountered an exception. If there was an exception, it is re-raised and all jobs are killed.
  4. It tears down the JobMonitor (and its local heartbeat process) and returns the outputs from all the jobs in a list.
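The runner-side steps above (receive a compressed pickle, run the function inside a try/except, send back either the result or the exception plus its stack trace) can be sketched with the standard library alone. This is an illustration under stated assumptions, not GridMap's code: pack, unpack, run_job, and collect are hypothetical helper names, the ØMQ transport is left out entirely, and the sketch catches Exception rather than literally everything.

```python
import bz2
import pickle
import traceback


def pack(obj):
    """Serialize an object as a bz2-compressed pickle."""
    return bz2.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def unpack(payload):
    """Inverse of pack()."""
    return pickle.loads(bz2.decompress(payload))


def run_job(payload):
    """Run one job and return a packed (result, traceback_text) pair."""
    function, args = unpack(payload)
    try:
        result, tb_text = function(*args), None
    except Exception as exc:
        # The exception becomes the "return value"; the stack trace text
        # travels with it to help with debugging.
        result, tb_text = exc, traceback.format_exc()
    return pack((result, tb_text))


def collect(payloads):
    """Gather outputs in order, re-raising the first exception found."""
    outputs = []
    for payload in payloads:
        result, _tb_text = unpack(payload)
        if isinstance(result, Exception):
            raise result
        outputs.append(result)
    return outputs


# Built-in functions pickle by reference, which keeps the demo self-contained.
ok_payload = run_job(pack((divmod, (7, 2))))
err_payload = run_job(pack((divmod, (1, 0))))
ok_result, ok_tb = unpack(ok_payload)
err_result, err_tb = unpack(err_payload)
```

Treating the exception as an ordinary return value means the sending side needs only one message shape; the receiving side decides whether to re-raise, as collect does here.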

As a side note, the JobMonitor class is a context manager, so if an exception is raised inside the with statement that instantiates it, the monitor can detect that and automatically try to kill all of the jobs. This includes the KeyboardInterrupt raised when someone hits CTRL-C.
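The context-manager behavior can be illustrated with a small stand-in class. SketchMonitor, its job_ids list, and kill_all are all hypothetical names for this sketch; the real JobMonitor talks to Grid Engine and ØMQ, which are omitted here.

```python
class SketchMonitor:
    """Hypothetical monitor that kills its jobs if the with block fails."""

    def __init__(self, job_ids):
        self.job_ids = list(job_ids)
        self.killed = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        # exc_type is None on a clean exit. It is set for any exception,
        # including KeyboardInterrupt, which is not an Exception subclass.
        if exc_type is not None:
            self.kill_all()
        return False  # never swallow the exception; let it propagate

    def kill_all(self):
        # Stand-in for asking the scheduler to delete each job.
        self.killed = list(self.job_ids)


# Simulate a user pressing CTRL-C while jobs are running.
interrupted = SketchMonitor(["job-1", "job-2"])
try:
    with interrupted:
        raise KeyboardInterrupt
except KeyboardInterrupt:
    pass

# A clean run leaves the jobs alone.
with SketchMonitor(["job-3"]) as clean:
    pass
```

Returning False from __exit__ lets the original exception continue to the caller after cleanup, so the failure is still visible.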
