Commit

More final edit
richcar58 committed Nov 26, 2018
1 parent 1872dd0 commit e718d84
Showing 2 changed files with 5 additions and 3 deletions.
6 changes: 4 additions & 2 deletions docs/agave/guides/jobs/aloe-job-architecture.rst
@@ -107,7 +107,7 @@ This message routing algorithm allows requests to be segregated by workload char
Asynchronous Communication
""""""""""""""""""""""""""

The first thing a job worker thread does when it reads in a job submission message is to spawn a *job-specific command thread* to handle asynchronous communication to that job. The command thread creates a temporary *job-specific command topic* and waits for asynchronous messages to be sent to the job. The most common message sent to a job is a cancellation message, usually originating from a REST call sent by the user that originally submitted the job. The command thread communicates through shared memory to deliver messages to its parent worker thread.
The first thing a job worker thread does when it reads in a job submission message is to spawn a *job-specific command thread* to handle asynchronous communication to that job. The command thread creates a temporary *job-specific command topic* and waits for asynchronous messages to be sent to the job. The most common message sent to a job is a cancellation message, usually originating from a REST call sent by the user that originally submitted the job. The command thread communicates through shared memory to deliver messages to its parent worker thread. To cover cases when jobs are in recovery, asynchronous messages destined for jobs are also sent to the *recovery queue* so they can be handled by `Tenant Recovery Readers`_.
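
The worker/command-thread relationship described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Jobs service's Java implementation: an in-process ``queue.Queue`` stands in for the job-specific command topic, a ``threading.Event`` stands in for the shared memory between the two threads, and the names ``CANCEL``, ``command_thread`` and ``run_job`` are hypothetical.

```python
import queue
import threading

CANCEL = "cancel"  # hypothetical message type; the real message format is not shown here


def command_thread(topic: queue.Queue, cancelled: threading.Event) -> None:
    """Wait on the job-specific command topic and relay messages to the worker."""
    while True:
        msg = topic.get()      # blocks until an asynchronous message arrives
        if msg == CANCEL:
            cancelled.set()    # shared-memory signal to the parent worker thread
            return
        if msg is None:        # sentinel: the job ended and the topic is discarded
            return


def run_job(topic: queue.Queue, steps: int = 3, step_seconds: float = 0.05) -> str:
    """Worker side: spawn the command thread first, then process the job."""
    cancelled = threading.Event()
    listener = threading.Thread(target=command_thread, args=(topic, cancelled))
    listener.start()

    status = "FINISHED"
    for _ in range(steps):
        # Event.wait doubles as simulated work and as the cancellation check.
        if cancelled.wait(timeout=step_seconds):
            status = "CANCELLED"
            break

    topic.put(None)            # stop the command thread if it is still waiting
    listener.join()
    return status
```

Using an event rather than polling the topic from the worker keeps the worker loop independent of the messaging layer, which is one plausible reading of "communicates through shared memory".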

In addition to the command topic, the Jobs service designates an *events topic* for each tenant. The idea is that different system components can write well-defined events to the topic and interested parties can subscribe to the topic to receive some subset of those events. Eventually, a REST API will be developed to allow external subscriptions to the events topic. *The events topic is not used in the initial Jobs service release.*

@@ -138,10 +138,12 @@ When any of the above conditions are detected during job execution, the worker p

The recovery message contains information collected at the failure site and higher up in the executing job's call stack. This information characterizes the error condition and specifies how the job can be restarted. Specifically, the recovery message specifies the *policies* and *testers* used to recover the job. Policies determine when the next error condition check should be made; testers implement the code that actually makes the checks. New policies and testers can be easily plugged into the system, though at present they have to ship with the system.

The recovery reader is a multithreaded Java program that processes the tenant's recovery queue. Internally, recovery messages are organized into lists based on their error condition---jobs blocked by the same condition are put in the same list. Recovery jobs are ordered by next check time and the recovery reader waits until that time to test a blocking condition. Recovery information is kept in the MySQL database for resilience against reader failures.
The recovery reader is a multithreaded Java program that processes the tenant's recovery queue. Internally, recovery messages are organized into lists based on their error condition---jobs blocked by the same condition are put in the same list. Recovery jobs are ordered by next check time and the recovery reader waits until that time to test a blocking condition. Recovery information is kept in the MySQL database for resilience against reader failures.
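
The internal organization described above (one list per blocking condition, ordered by next check time) could be modeled along these lines. This is only a sketch under stated assumptions: the class and method names are hypothetical, a plain float stands in for the check time, the condition strings in the usage note are invented, and persistence to MySQL is omitted.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class BlockedList:
    """All jobs blocked by one error condition; ordered by next check time only."""
    next_check: float
    condition: str = field(compare=False)
    job_ids: list = field(default_factory=list, compare=False)


class RecoveryIndex:
    def __init__(self) -> None:
        self._by_condition = {}   # condition -> BlockedList
        self._heap = []           # min-heap keyed on next_check

    def add(self, job_id: str, condition: str, next_check: float) -> None:
        """Put the job in the list for its condition, creating the list if needed."""
        entry = self._by_condition.get(condition)
        if entry is None:
            entry = BlockedList(next_check, condition)
            self._by_condition[condition] = entry
            heapq.heappush(self._heap, entry)
        entry.job_ids.append(job_id)

    def due(self, now: float) -> list:
        """Remove and return every list whose next check time has arrived."""
        ready = []
        while self._heap and self._heap[0].next_check <= now:
            entry = heapq.heappop(self._heap)
            del self._by_condition[entry.condition]
            ready.append(entry)
        return ready

    def reschedule(self, entry: BlockedList, next_check: float) -> None:
        """Re-queue a list whose condition is still blocking (a policy picks the time)."""
        entry.next_check = next_check
        self._by_condition[entry.condition] = entry
        heapq.heappush(self._heap, entry)
```

A reader loop would call ``due(now)``, run the tester for each returned list, and either resubmit its jobs or call ``reschedule`` with the time chosen by the policy.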

When a test indicates that a blocking condition has cleared, all jobs blocked by that condition are resubmitted for execution. Resubmission entails (1) setting the job status to the value specified in the original recovery message, (2) creating a job submission message and placing it on the job's original submission queue, and (3) removing the job from the recovery subsystem and its persistent store. The job is immediately failed if it cannot be resubmitted. Resubmission transfers responsibility for the job back to the tenant workers. Care is again taken to ensure that a job cannot be both in recovery and executing.
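
The three resubmission steps can be illustrated with in-memory stand-ins. The sketch below assumes dictionaries in place of the job table, the message queues and the recovery store; ``RecoveryRecord``, ``resubmit`` and the status strings are hypothetical names, and removing a failed job from recovery is an assumption rather than something the text states.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRecord:
    job_id: str
    resume_status: str   # status specified in the original recovery message
    submit_queue: str    # the job's original submission queue


def resubmit(record: RecoveryRecord, job_status: dict, queues: dict,
             recovery_store: dict) -> str:
    """Hand a recovered job back to the tenant workers, or fail it."""
    try:
        # (1) restore the status named in the recovery message
        job_status[record.job_id] = record.resume_status
        # (2) place a new submission message on the original queue
        queues[record.submit_queue].append({"job_id": record.job_id})
    except KeyError:
        # the job cannot be resubmitted (here: unknown queue), so fail it at once
        job_status[record.job_id] = "FAILED"
    finally:
        # (3) drop the job from the recovery subsystem so that it is never
        # both in recovery and executing (assumed to apply on failure too)
        recovery_store.pop(record.job_id, None)
    return job_status[record.job_id]
```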

Recovery readers also handle asynchronous requests initiated by users, such as requests to cancel a job. These requests appear as messages on the recovery queue.

Site Alternate Readers
^^^^^^^^^^^^^^^^^^^^^^

2 changes: 1 addition & 1 deletion docs/agave/guides/jobs/aloe-job-changes.rst
@@ -513,7 +513,7 @@ By default, each tenant is assigned a job submission queue that conforms to the

::

aloe.jobq.<tenantId>.DefaultQueue
aloe.jobq.<tenantId>.submit.DefaultQueue
::

The Jobs service allows tenants to balance and segregate workloads by sending job requests to different queues, each with its own set of worker processes (see `Tenant Workers <aloe-job-architecture.html#tenant-workers>`_ for discussion). Administrators define new queues or update existing ones using the provided *ImportQueueDefinitions* utility program. This program reads tenant queue configuration files and creates or updates queue definition records in the *aloe_queues* database table. The configuration file content conforms to the JSON schema defined in the *JobQueueDefinitions.json* file that also ships with the Jobs service.
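
As a small illustration of the default queue naming convention, the helper below builds a name using the revised pattern from this change (the form that includes ``submit``). The function name and the tenant ID in the usage note are hypothetical; the Jobs service itself may construct the name differently.

```python
def default_queue_name(tenant_id: str) -> str:
    """Return the default job submission queue name for a tenant."""
    if not tenant_id:
        raise ValueError("tenant_id must be a non-empty string")
    # Pattern taken from the documentation: aloe.jobq.<tenantId>.submit.DefaultQueue
    return f"aloe.jobq.{tenant_id}.submit.DefaultQueue"
```

For a hypothetical tenant ``example.tenant`` this yields ``aloe.jobq.example.tenant.submit.DefaultQueue``.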