Release 0.12.2 #25

dan-blanchard · 2014-01-07T20:11:53Z

Just fixed a couple minor issues.

'exception' is now properly set as the cause of death when a job encounters an exception.
Fixed a potential memory leak in the qmaster process caused by not cleaning up job info as recommended in the DRMAA Python documentation.
Changed the default session_id in JobMonitor from -1 to None, which is more Pythonic.

dan-blanchard · 2014-01-07T20:13:38Z

Anyone want to volunteer to review this? 😄 It's only like 20 lines. @mheilman @desilinguist @dmnapolitano @aoifecahill

coveralls · 2014-01-07T20:50:26Z

Coverage decreased (-0.15%) when pulling c310aee on release/0.12.2 into 3c554a5 on master.

mheilman · 2014-01-07T21:41:01Z

I didn't see anything obviously bad, but I'm not familiar enough with this code to really give a useful review. If nobody else is either, maybe you should sit down with one of us and explain the changes or something... ?

dan-blanchard · 2014-01-08T15:14:25Z

Explaining it to at least one of you is a good idea. I don't know that I have time to meet about it this week, but here's a quick overview of how the whole GridMap system works.

To run a function on a bunch of arguments using GridMap, you use either the grid_map or the process_jobs function. grid_map automatically creates a bunch of Job objects and sends them to process_jobs, so in the end you're always using process_jobs (albeit indirectly with grid_map). When you create a Job object, it saves your function, its arguments, and the path to the module containing the function (since that needs to be in sys.path for unpickling later).
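A rough sketch of that relationship, not GridMap's actual code: SketchJob, sketch_grid_map, and sketch_process_jobs are hypothetical names made up for illustration, and the jobs here simply run locally instead of on a cluster.

```python
import inspect
import os


class SketchJob:
    """Hypothetical stand-in for a Job: stores the function, its arguments,
    and the directory of the module that defines the function."""

    def __init__(self, func, args):
        self.func = func
        self.args = args
        # Directory that would need to be on sys.path so the function can be
        # unpickled later on another machine. Falls back to the current
        # directory when the module has no file (e.g. interactive sessions).
        module = inspect.getmodule(func)
        module_file = getattr(module, "__file__", None)
        if module_file:
            self.path = os.path.dirname(os.path.abspath(module_file))
        else:
            self.path = os.getcwd()


def sketch_process_jobs(jobs):
    """Run every job and return the outputs in order (locally, for illustration)."""
    return [job.func(*job.args) for job in jobs]


def sketch_grid_map(func, args_list):
    """Wrap each argument in a job, then hand the jobs to sketch_process_jobs."""
    jobs = [SketchJob(func, [arg]) for arg in args_list]
    return sketch_process_jobs(jobs)


def square(x):
    return x * x


results = sketch_grid_map(square, [1, 2, 3])
```

The point is only that the map-style helper is a thin wrapper: it builds the job objects and delegates everything else.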

process_jobs does the following with the Job objects:

1. It creates a JobMonitor instance that uses 0MQ to communicate with the heartbeat and runner processes that will be running on the different machines. The JobMonitor keeps track of the inputs and outputs for all of the jobs, and kills/resubmits jobs that have stalled. It also sends error email reports when things go awry.
2. It submits a bunch of Grid Engine command-line jobs that call python -m gridmap.runner HOME_ADDRESS JOB_PATH, where HOME_ADDRESS is the URL of the 0MQ JobMonitor and JOB_PATH is the path to the module that the job's function belongs to. When these jobs get executed on the cluster, they:
   - Immediately add JOB_PATH to sys.path.
   - Request the job's function and its input from the JobMonitor, which are sent as bz2-compressed pickles.
   - Spawn a separate heartbeat process that repeatedly monitors CPU/memory usage and reports those back to the JobMonitor.
   - Execute the function inside a try/except that catches all exceptions.
   - Send the return value of the function back to the JobMonitor as a bz2-compressed pickle. If the job encountered an exception, that is considered the return value. The text of the stack trace is also sent back to aid in debugging in these cases.
   - Kill the heartbeat process after completion.
3. It waits until the JobMonitor has either received valid output from all of the jobs, or any one of the jobs has encountered an exception. If there was an exception, it is re-raised and all jobs are killed.
4. It tears down the JobMonitor (and its local heartbeat process) and returns the outputs from all the jobs in a list.
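The runner-side steps (unpack the job, execute it inside a try/except, send back either the result or the exception plus the stack trace text) can be sketched roughly as follows. This is a simplified illustration using only the standard library: pack, unpack, and run_job are hypothetical helpers, and the 0MQ transport and the heartbeat process are left out entirely.

```python
import bz2
import pickle
import traceback


def pack(obj):
    """Serialize an object as a bz2-compressed pickle."""
    return bz2.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def unpack(data):
    """Inverse of pack()."""
    return pickle.loads(bz2.decompress(data))


def run_job(packed_job):
    """Execute a packed (function, args) pair and return a packed result.

    If the function raises, the exception itself becomes the return value
    and the stack trace text is sent along with it.
    """
    func, args = unpack(packed_job)
    try:
        result, trace = func(*args), None
    except Exception as exc:  # assumption: a real runner may catch even more
        result, trace = exc, traceback.format_exc()
    return pack((result, trace))


# One job that succeeds and one that fails.
ok_value, ok_trace = unpack(run_job(pack((divmod, (7, 2)))))
bad_value, bad_trace = unpack(run_job(pack((divmod, (1, 0)))))
```

Treating the exception as the return value lets the receiving side decide what to do with it, such as re-raising it and killing the remaining jobs.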

As a side note, I should also mention that the JobMonitor class is a context manager, so if any exceptions are encountered when we're inside a with statement that instantiates one, we can tell that that happened and automatically try to kill all of the jobs. This includes KeyboardInterrupt exceptions when people hit CTRL-C. This release mostly addresses one of the clean up tasks that needs to take place when we exit the with statement context: we need to tell the Grid Engine to remove the job metadata from the qmaster process, or it will use much more memory than it needs to.
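To make the context-manager behaviour concrete, here is a minimal sketch under stated assumptions. SketchMonitor and the recorded event strings are hypothetical; the real JobMonitor talks to Grid Engine, while this version only records which clean-up steps would run.

```python
class SketchMonitor:
    """Hypothetical context manager that always cleans up on exit."""

    def __init__(self, session_id=None):
        # None means "no session has been set up yet".
        self.session_id = session_id
        self.events = []

    def __enter__(self):
        self.session_id = "session-1"  # placeholder for a real session handle
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        try:
            if exc_type is not None:
                # Something went wrong inside the with block (including
                # KeyboardInterrupt), so try to kill the outstanding jobs.
                self.events.append("kill jobs")
        finally:
            # Clean-up that must happen no matter what; skipped only if the
            # session was never set up.
            if self.session_id is not None:
                self.events.append("dispose job info")
        return False  # do not swallow the exception


monitor = SketchMonitor()
interrupted = False
try:
    with monitor:
        raise KeyboardInterrupt  # simulates someone hitting CTRL-C
except KeyboardInterrupt:
    interrupted = True
```

Putting the disposal step in a finally clause is one way to ensure the job info gets cleaned up even when killing the jobs itself fails.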

dan-blanchard · 2014-01-08T15:17:04Z

Since that explanation shouldn't get lost to time, I also put it on the wiki: https://github.com/EducationalTestingService/gridmap/wiki/How-GridMap-works-under-the-hood

desilinguist · 2014-01-08T15:18:19Z

That's quite useful. Thanks!


coveralls · 2014-01-10T17:04:48Z

Coverage increased (+0.33%) when pulling 9e0e651 on release/0.12.2 into 3c554a5 on master.

dan-blanchard added 12 commits December 20, 2013 15:21

Store 'exception' as cause_of_death when we encounter and exception.

a3aeb34

Fix memory leak caused by job info not being wiped out.

d2e23a8

Merge branch 'develop' of github.com:EducationalTestingService/gridmap into develop

0bdb61e

Add some more debug logging

bf52463

Fix issue where job list was not being passed to synchronize

9599ffd

Prevent additional exception when JobMonitor dies before self.session_id is set

4d35f09

Wrap JOB_IDS_SESSION_ALL in a list, because it needs to be that way

64208c5

Move job info disposal into JobMonitor.__exit__

3c7e951

Don't raise ExitTimeoutException when trying to kill jobs JobMonitor.__exit__

e297984

Refactor JobMonitor.__exit__ a little bit to ensure job info always gets cleaned up

096cec0

Remove extraneous comma

dd31a08

Bump version number

c310aee

Change default session_id to None, because that makes more sense

9e0e651

dan-blanchard merged commit 9e0e651 into master Jan 10, 2014

dan-blanchard deleted the release/0.12.2 branch January 10, 2014 20:00
