Release 0.12.2 #25

Merged
merged 13 commits into master from release/0.12.2 on Jan 10, 2014

Conversation

dan-blanchard
Contributor

Just fixed a couple minor issues.

  • The exception is now properly set as the cause of death when a job encounters an exception.
  • Fixed a potential memory leak in the qmaster process caused by not cleaning up job info as recommended in the DRMAA Python documentation.
  • Changed default session_id in JobMonitor to None to be more Pythonic, instead of -1 like it was before.

@dan-blanchard
Contributor Author

Anyone want to volunteer to review this? 😄 It's only like 20 lines. @mheilman @desilinguist @dmnapolitano @aoifecahill

@coveralls

Coverage Status

Coverage decreased (-0.15%) when pulling c310aee on release/0.12.2 into 3c554a5 on master.

@mheilman

mheilman commented Jan 7, 2014

I didn't see anything obviously bad, but I'm not familiar enough with this code to really give a useful review. If nobody else is either, maybe you should sit down with one of us and explain the changes or something... ?

@dan-blanchard
Contributor Author

Explaining it to at least one of you is a good idea. I don't know that I have time to meet about it this week, but here's a quick overview of how the whole GridMap system works.

To run a function on a bunch of arguments using GridMap, you use either the grid_map or the process_jobs function. grid_map automatically creates a bunch of Job objects and sends them to process_jobs, so in the end, you're always using process_jobs (albeit indirectly with grid_map). When you create a Job object, it saves your function, its arguments, and the path to the module containing the function (since that needs to be in sys.path for unpickling later).
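
To make that relationship concrete, here is a minimal local sketch. ToyJob, toy_grid_map, and toy_process_jobs are hypothetical stand-ins I made up for illustration, not the real GridMap classes or signatures, and everything runs in-process rather than on a cluster.

```python
import os
import sys


class ToyJob:
    """Hypothetical stand-in for a GridMap Job (not the real class)."""

    def __init__(self, function, args):
        self.function = function
        self.args = args
        # Remember where the function's module lives so a worker could add
        # that directory to sys.path before unpickling.
        module = sys.modules.get(function.__module__)
        module_file = getattr(module, "__file__", None)
        if module_file:
            self.path = os.path.dirname(os.path.abspath(module_file))
        else:
            # Interactive sessions have no module file; fall back to the cwd.
            self.path = os.getcwd()

    def execute(self):
        return self.function(*self.args)


def toy_process_jobs(jobs):
    """Run every job and return the outputs in order (locally, no cluster)."""
    return [job.execute() for job in jobs]


def toy_grid_map(function, args_list):
    """Wrap each argument in a job, then hand everything to toy_process_jobs."""
    jobs = [ToyJob(function, [arg]) for arg in args_list]
    return toy_process_jobs(jobs)


def square(x):
    return x * x
```

Calling toy_grid_map(square, [1, 2, 3]) builds three jobs and funnels them through toy_process_jobs, which is the same shape as the real grid_map to process_jobs hand-off described above.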

process_jobs does the following with the Job objects:

  1. It creates a JobMonitor instance that uses 0MQ to communicate with the heartbeat and runner processes that will be running on the different machines. The JobMonitor keeps track of the inputs and outputs for all of the jobs, and kills/resubmits jobs that have stalled. It also sends error email reports when things go awry.
  2. It submits a bunch of Grid Engine command-line jobs that call python -m gridmap.runner HOME_ADDRESS JOB_PATH, where HOME_ADDRESS is the URL of the 0MQ JobMonitor and JOB_PATH is the path to the module that the job's function belongs to. When these jobs get executed on the cluster, they:
       1. Immediately add JOB_PATH to sys.path.
       2. Request the job's function and its input from the JobMonitor, which are sent as bz2-compressed pickles.
       3. Spawn a separate heartbeat process that repeatedly monitors CPU/memory usage and reports it back to the JobMonitor.
       4. Execute the function inside a try/except that catches all exceptions.
       5. Send the return value of the function back to the JobMonitor as a bz2-compressed pickle. If the job encountered an exception, that is considered the return value. The text of the stack trace is also sent back to aid in debugging in these cases.
       6. Kill the heartbeat process after completion.
  3. It waits until the JobMonitor has either received valid output from all of the jobs, or any one of the jobs has encountered an exception. If there was an exception, it is re-raised and all jobs are killed.
  4. It tears down the JobMonitor (and its local heartbeat process) and returns the outputs from all the jobs in a list.
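
The runner-side "execute, catch, and ship back a bz2-compressed pickle" steps can be sketched roughly like this. The helper names (compress_pickle, decompress_pickle, run_and_package) and the dictionary layout are assumptions of mine for illustration, not GridMap's actual code, and the 0MQ transport and heartbeat process are left out entirely.

```python
import bz2
import pickle
import traceback


def compress_pickle(obj):
    """Pickle an object and bz2-compress the resulting bytes."""
    return bz2.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))


def decompress_pickle(data):
    """Inverse of compress_pickle."""
    return pickle.loads(bz2.decompress(data))


def run_and_package(function, args):
    """Run the function; on failure, the exception becomes the return value."""
    try:
        output = function(*args)
        trace = None
    except Exception as exc:  # does not catch KeyboardInterrupt/SystemExit
        output = exc
        # Ship the formatted stack trace too, since tracebacks don't pickle.
        trace = traceback.format_exc()
    return compress_pickle({"output": output, "traceback": trace})
```

On the receiving side, the monitor would decompress the payload and check whether the output is an exception instance before deciding to re-raise it.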

As a side note, I should also mention that the JobMonitor class is a context manager, so if any exceptions are encountered when we're inside a with statement that instantiates one, we can detect that and automatically try to kill all of the jobs. This includes KeyboardInterrupt exceptions when people hit CTRL-C. This release mostly addresses one of the cleanup tasks that need to take place when we exit the with statement context: we need to tell the Grid Engine to remove the job metadata from the qmaster process, or it will use much more memory than it needs to.
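
Here is a rough sketch of that context-manager pattern. ToyMonitor and its attributes are hypothetical and only record what would happen; the real JobMonitor talks to Grid Engine, whereas this version just flips flags so the control flow is visible.

```python
class ToyMonitor:
    """Hypothetical stand-in for JobMonitor (not the real class)."""

    def __init__(self, session_id=None):
        # None rather than -1 as the "no session yet" default.
        self.session_id = session_id
        self.jobs = []
        self.killed = []
        self.cleaned_up = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        if exc_type is not None:
            # Any exception, including KeyboardInterrupt, lands here:
            # record every outstanding job as killed.
            self.killed = list(self.jobs)
        # Cleanup runs whether or not there was an error. In the real
        # system this is where job metadata would be released.
        self.jobs.clear()
        self.cleaned_up = True
        return False  # never swallow the exception; let it propagate
```

Because __exit__ returns False, a CTRL-C inside the with block still reaches the caller after the jobs have been marked as killed and the bookkeeping has been cleared.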

@dan-blanchard
Contributor Author

Since that explanation shouldn't get lost to time, I also put it on the wiki: https://github.com/EducationalTestingService/gridmap/wiki/How-GridMap-works-under-the-hood

@desilinguist
Contributor

That's quite useful. Thanks!

On Wed, Jan 8, 2014 at 10:17 AM, Dan Blanchard notifications@github.com
wrote:

Since that explanation shouldn't get lost to time, I also put it on the wiki: https://github.com/EducationalTestingService/gridmap/wiki/How-GridMap-works-under-the-hood


@coveralls

Coverage Status

Coverage increased (+0.33%) when pulling 9e0e651 on release/0.12.2 into 3c554a5 on master.

@dan-blanchard dan-blanchard merged commit 9e0e651 into master Jan 10, 2014
@dan-blanchard dan-blanchard deleted the release/0.12.2 branch January 10, 2014 20:00