Merge branch 'version/2.5'
* version/2.5:
  Added a note about UNKNOWN workers
  Added a link to the newly-expanded Garbage Collection section
  "Out of the 0 Workers" sounds weird
  Expanded the section about Garbage Collection
muffato committed Jan 28, 2019
2 parents 3a1255a + a9f6e5d commit f365204
Showing 3 changed files with 46 additions and 3 deletions.
40 changes: 39 additions & 1 deletion docs/running_pipelines/management.rst
@@ -49,14 +49,52 @@ update semaphores by running ``beekeeper.pl`` with the -balance_semaphores optio

beekeeper.pl -url sqlite:///my_pipeline_database -balance_semaphores

.. _garbage-collection:

Garbage collection of dead Workers
----------------------------------

On occasion, Worker processes will end without having an opportunity to update
their status in the eHive database. The Beekeeper will attempt to find these
Workers and update their status itself. It does this by reconciling the list of
Worker statuses in the eHive database with information on Workers gleaned from
the meadow's process tables (e.g. ``ps``, ``bacct``, ``bjobs``). A manual
the meadow's process tables (e.g. ``ps``, ``bacct``, ``bjobs``). This
process is called "garbage collection". Typical output looks like this::

Beekeeper : loop #12 ======================================================
GarbageCollector: Checking for lost Workers...
GarbageCollector: [Queen:] out of 66 Workers that haven't checked in during the last 5 seconds...
GarbageCollector: [LSF/EBI Meadow:] RUN:66

In this case, 66 LSF Workers had not updated their status within the last 5
seconds, but they were in fact all running and listed by ``bjobs``, so the
garbage-collection process ends there.
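
To perform the same check by hand, here is a minimal sketch; the job id below
is just the one from the ``bacct`` example further down, so substitute the LSF
job id recorded as your Worker's ``process_id`` in the eHive database::

    # Ask LSF whether the Worker's job is still known to the batch system
    bjobs '4126850[15]'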

In other cases, the Beekeeper may find so-called ``LOST`` Workers::

Beekeeper : loop #15 ======================================================
GarbageCollector: Checking for lost Workers...
GarbageCollector: [Queen:] out of 45 Workers that haven't checked in during the last 5 seconds...
GarbageCollector: [LSF/EBI Meadow:] LOST:4, RUN:41

GarbageCollector: Discovered 4 lost LSF Workers
LSF::parse_report_source_line( "bacct -l '4126850[15]' '4126850[6]' '4126835[24]' '4126850[33]'" )
GarbageCollector: Found why 4 of LSF Workers died

In this case, ``bjobs`` only listed 41 of the 45 LSF Workers that had not
updated their status within the last 5 seconds. The Beekeeper then had to
resort to ``bacct`` to find out what happened to the 4 ``LOST`` Workers.
Most of the time, ``LOST`` Workers are Workers that have been killed by LSF
for exceeding their allocated resources (MEMLIMIT or RUNLIMIT).
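
To investigate such a Worker yourself, you can run the same accounting query
the Beekeeper uses. A minimal sketch, reusing a job id from the output above
(substitute the process id recorded for your own Worker)::

    # Ask LSF's accounting log why the job ended; kills due to resource
    # limits are typically reported as TERM_MEMLIMIT or TERM_RUNLIMIT
    bacct -l '4126850[15]'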

When no reason can be found, the cause of death recorded in the
``log_message`` table is UNKNOWN. This is known to happen when ``bacct`` is
executed too long after the Worker exited, since LSF's accounting records
only cover the most recent jobs, but it also seems to happen in other
circumstances that are not clearly understood. If you encounter this
UNKNOWN cause of death, re-run the Job locally in debug mode.
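
A hedged sketch of such a local re-run; the job id 1234 is a placeholder and
the exact option names may vary between eHive versions, so check
``runWorker.pl --help``::

    # Run the suspicious Job in the current terminal, with debug output,
    # instead of submitting it to the farm
    runWorker.pl -url sqlite:///my_pipeline_database -job_id 1234 -debug 1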

Garbage collection happens on every Beekeeper loop, but a manual
reconciliation and update of Worker statuses can be invoked by running
``beekeeper.pl`` with the -dead option.
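
For example, reusing the database URL from the ``-balance_semaphores``
example near the top of this section::

    beekeeper.pl -url sqlite:///my_pipeline_database -dead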

2 changes: 1 addition & 1 deletion docs/running_pipelines/troubleshooting.rst
@@ -57,7 +57,7 @@ The message log stores messages sent from the Beekeeper and from Workers. Messag

- WORKER_ERROR

- WORKER_ERROR messages are sent when a Worker encounters an abnormal condition related to its particular lifecycle of Job execution, and that condition causes it to end prematurely. Examples of conditions that can generate WORKER_ERROR messages include failure to compile a Runnable, or a Runnable generating a failure message.
- WORKER_ERROR messages are sent when a Worker encounters an abnormal condition related to its particular lifecycle of Job execution, and that condition causes it to end prematurely. Examples of conditions that can generate WORKER_ERROR messages include failure to compile a Runnable, the reason a Worker disappeared (as detected in the :ref:`Garbage collection<garbage-collection>` phase), or a Runnable generating a failure message.

The log can be viewed in guiHive's log tab, or by directly querying the eHive database. In the database, the log is stored in the ``log_message`` table. To aid with the discovery of relevant messages, eHive also provides a view called ``msg``, which includes Analysis logic_names. For example, to find all non-INFO messages for an Analysis with a logic_name of "align_sequences", one could run a query like the one sketched below.
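
A hedged sketch of such a query, assuming the ``msg`` view exposes
``logic_name`` and ``message_class`` columns (check the view definition in
your own schema before relying on it)::

    -- List every non-INFO message recorded for the align_sequences Analysis
    SELECT *
    FROM   msg
    WHERE  logic_name = 'align_sequences'
      AND  message_class != 'INFO';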

7 changes: 6 additions & 1 deletion modules/Bio/EnsEMBL/Hive/Queen.pm
@@ -430,7 +430,12 @@ sub check_for_dead_workers {    # scans the whole Valley for lost Workers (but i
my $signature_and_pid_to_worker_status = $valley->status_of_all_our_workers_by_meadow_signature( $reconciled_worker_statuses );
# this may pick up workers that have been created since the last fetch
my $queen_overdue_workers = $self->fetch_overdue_workers( $last_few_seconds ); # check the workers we have not seen active during the $last_few_seconds
print "GarbageCollector:\t[Queen:] out of ".scalar(@$queen_overdue_workers)." Workers that haven't checked in during the last $last_few_seconds seconds...\n";

if (@$queen_overdue_workers) {
print "GarbageCollector:\tOut of the ".scalar(@$queen_overdue_workers)." Workers that haven't checked in during the last $last_few_seconds seconds...\n";
} else {
print "GarbageCollector:\tfound none (all have checked in during the last $last_few_seconds seconds)\n";
}

my $this_meadow_user = whoami();

