Merge branch 'version/2.5'
* version/2.5:
  Added a note about UNKNOWN workers
  Added a link to the newly-expanded Garbage Collection section
  "Out of the 0 Workers" sounds weird
  Expanded the section about Garbage Collection
muffato committed Jan 28, 2019
2 parents 3a1255a + a9f6e5d commit f365204
Showing 3 changed files with 46 additions and 3 deletions.
40 changes: 39 additions & 1 deletion docs/running_pipelines/management.rst
@@ -49,14 +49,52 @@ update semaphores by running ``beekeeper.pl`` with the -balance_semaphores optio

beekeeper.pl -url sqlite:///my_pipeline_database -balance_semaphores

.. _garbage-collection:

Garbage collection of dead Workers
----------------------------------

On occasion, Worker processes will end without having an opportunity to update
their status in the eHive database. The Beekeeper will attempt to find these
Workers and update their status itself. It does this by reconciling the list of
Worker statuses in the eHive database with information on Workers gleaned from
the meadow's process tables (e.g. ``ps``, ``bacct``, ``bjobs``). A manual
the meadow's process tables (e.g. ``ps``, ``bacct``, ``bjobs``). This
process is called "garbage collection". Typical output looks like this::

Beekeeper : loop #12 ======================================================
GarbageCollector: Checking for lost Workers...
GarbageCollector: [Queen:] out of 66 Workers that haven't checked in during the last 5 seconds...
GarbageCollector: [LSF/EBI Meadow:] RUN:66

In this case, 66 LSF Workers had not updated their status within the last 5
seconds, but they were in fact all running and listed by ``bjobs``, so the
garbage-collection process ends there.
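
To perform the same check by hand, here is a minimal sketch; the job id below
is just the one from the ``bacct`` example further down, so substitute the LSF
job id recorded as your Worker's ``process_id`` in the eHive database::

    # Ask LSF whether the Worker's job is still known to the batch system
    bjobs '4126850[15]'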

In other cases, the Beekeeper may find so-called ``LOST`` Workers::

Beekeeper : loop #15 ======================================================
GarbageCollector: Checking for lost Workers...
GarbageCollector: [Queen:] out of 45 Workers that haven't checked in during the last 5 seconds...
GarbageCollector: [LSF/EBI Meadow:] LOST:4, RUN:41

GarbageCollector: Discovered 4 lost LSF Workers
LSF::parse_report_source_line( "bacct -l '4126850[15]' '4126850[6]' '4126835[24]' '4126850[33]'" )
GarbageCollector: Found why 4 of LSF Workers died

In this case, ``bjobs`` only listed 41 of the 45 LSF Workers that had not
updated their status within the last 5 seconds. The Beekeeper then had to
resort to ``bacct`` to find out what happened to the 4 ``LOST`` Workers.
Most of the time, ``LOST`` Workers are Workers that have been killed by LSF
for exceeding their allocated resources (MEMLIMIT or RUNLIMIT).
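
To investigate such a Worker yourself, you can run the same accounting query
the Beekeeper uses. A minimal sketch, reusing a job id from the output above
(substitute the process id recorded for your own Worker)::

    # Ask LSF's accounting log why the job ended; kills due to resource
    # limits are typically reported as TERM_MEMLIMIT or TERM_RUNLIMIT
    bacct -l '4126850[15]'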

When no reason can be found, the cause of death recorded in the
``log_message`` table is UNKNOWN. This is known to happen when ``bacct`` is
executed too long after the Worker exited, since LSF's accounting records
only cover the most recent jobs, but it also seems to happen in other
circumstances that are not clearly understood. If you encounter this
UNKNOWN cause of death, re-run the Job locally in debug mode.
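
A hedged sketch of such a local re-run; the job id 1234 is a placeholder and
the exact option names may vary between eHive versions, so check
``runWorker.pl --help``::

    # Run the suspicious Job in the current terminal, with debug output,
    # instead of submitting it to the farm
    runWorker.pl -url sqlite:///my_pipeline_database -job_id 1234 -debug 1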

Garbage collection happens on every Beekeeper loop, but a manual
reconciliation and update of Worker statuses can be invoked by running
``beekeeper.pl`` with the -dead option.
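
For example, reusing the database URL from the ``-balance_semaphores``
example near the top of this section::

    beekeeper.pl -url sqlite:///my_pipeline_database -dead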

2 changes: 1 addition & 1 deletion docs/running_pipelines/troubleshooting.rst
@@ -57,7 +57,7 @@ The message log stores messages sent from the Beekeeper and from Workers. Messag

- WORKER_ERROR

- WORKER_ERROR messages are sent when a Worker encounters an abnormal condition related to its particular lifecycle of Job execution, and that condition causes it to end prematurely. Examples of conditions that can generate WORKER_ERROR messages include failure to compile a Runnable, or a Runnable generating a failure message.
- WORKER_ERROR messages are sent when a Worker encounters an abnormal condition related to its particular lifecycle of Job execution, and that condition causes it to end prematurely. Examples of conditions that can generate WORKER_ERROR messages include failure to compile a Runnable, the reason a Worker disappeared (as detected in the :ref:`Garbage collection<garbage-collection>` phase), or a Runnable generating a failure message.

The log can be viewed in guiHive's log tab, or by directly querying the eHive database. In the database, the log is stored in the ``log_message`` table. To aid with the discovery of relevant messages, eHive also provides a view called ``msg``, which includes Analysis logic_names. For example, to find all non-INFO messages for an Analysis with a logic_name of "align_sequences", one could run a query like the one sketched below.
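
A hedged sketch of such a query, assuming the ``msg`` view exposes
``logic_name`` and ``message_class`` columns (check the view definition in
your own schema before relying on it)::

    -- List every non-INFO message recorded for the align_sequences Analysis
    SELECT *
    FROM   msg
    WHERE  logic_name = 'align_sequences'
      AND  message_class != 'INFO';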

7 changes: 6 additions & 1 deletion modules/Bio/EnsEMBL/Hive/Queen.pm
@@ -430,7 +430,12 @@ sub check_for_dead_workers {    # scans the whole Valley for lost Workers (but i
my $signature_and_pid_to_worker_status = $valley->status_of_all_our_workers_by_meadow_signature( $reconciled_worker_statuses );
# this may pick up workers that have been created since the last fetch
my $queen_overdue_workers = $self->fetch_overdue_workers( $last_few_seconds ); # check the workers we have not seen active during the $last_few_seconds
print "GarbageCollector:\t[Queen:] out of ".scalar(@$queen_overdue_workers)." Workers that haven't checked in during the last $last_few_seconds seconds...\n";

if (@$queen_overdue_workers) {
print "GarbageCollector:\tOut of the ".scalar(@$queen_overdue_workers)." Workers that haven't checked in during the last $last_few_seconds seconds...\n";
} else {
print "GarbageCollector:\tfound none (all have checked in during the last $last_few_seconds seconds)\n";
}

my $this_meadow_user = whoami();

