Added sections about resource usage and optimization, and extended the
monitoring section
muffato authored and ens-bwalts committed May 24, 2018
1 parent 7928df1 commit f0f8498
Showing 4 changed files with 259 additions and 10 deletions.
212 changes: 209 additions & 3 deletions docs/running_pipelines/management.rst
@@ -68,6 +68,160 @@ reconciliation and update of Worker statuses can be invoked by running
Tips for performance tuning and load management
-----------------------------------------------

Resource optimisation
+++++++++++++++++++++

eHive automatically gathers resource usage information about every
Worker and stores it in the ``worker_resource_usage`` table. Although this
table can be queried directly, the ``resource_usage_stats`` view gives a nicer
summary for each analysis::

> SELECT * FROM resource_usage_stats;
+---------------------------+-------------+--------------------------+-------------+---------+--------------+--------------+--------------+---------------+---------------+---------------+
| analysis | meadow_type | resource_class | exit_status | workers | min_mem_megs | avg_mem_megs | max_mem_megs | min_swap_megs | avg_swap_megs | max_swap_megs |
+---------------------------+-------------+--------------------------+-------------+---------+--------------+--------------+--------------+---------------+---------------+---------------+
| create_tracking_tables(1) | LSF | default(1) | done | 1 | 56.7109 | 56.71 | 56.7109 | NULL | NULL | NULL |
| MLSSJobFactory(2) | LSF | default_with_registry(5) | done | 1 | 67.5625 | 67.56 | 67.5625 | NULL | NULL | NULL |
| count_blocks(3) | LSF | default_with_registry(5) | done | 1 | 56.9492 | 56.95 | 56.9492 | NULL | NULL | NULL |
| initJobs(4) | LSF | default_with_registry(5) | done | 1 | 66.6445 | 66.64 | 66.6445 | NULL | NULL | NULL |
| createChrJobs(5) | LSF | default_with_registry(5) | done | 1 | 61.2188 | 61.22 | 61.2188 | NULL | NULL | NULL |
| createSuperJobs(6) | LSF | default_with_registry(5) | done | 2 | 61.1992 | 61.20 | 61.207 | NULL | NULL | NULL |
| createOtherJobs(7) | LSF | crowd_with_registry(4) | done | 1 | 100.398 | 100.40 | 100.398 | NULL | NULL | NULL |
| dumpMultiAlign(8) | LSF | crowd(2) | done | 52 | 108.848 | 695.60 | 1330.91 | NULL | NULL | NULL |
| emf2maf(9) | LSF | crowd(2) | done | 46 | 58.3008 | 132.88 | 150.695 | NULL | NULL | NULL |
| compress(10) | LSF | default(1) | done | 17 | 58.1758 | 58.24 | 58.2695 | NULL | NULL | NULL |
| md5sum(11) | LSF | default(1) | done | 2 | 58.2227 | 58.23 | 58.2344 | NULL | NULL | NULL |
| move_maf_files(12) | LSF | default(1) | done | 1 | 58.1914 | 58.19 | 58.1914 | NULL | NULL | NULL |
| readme(13) | LSF | default_with_registry(5) | done | 2 | 77.5859 | 83.13 | 88.668 | NULL | NULL | NULL |
| targz(14) | NULL | default(1) | NULL | 1 | NULL | NULL | NULL | NULL | NULL | NULL |
+---------------------------+-------------+--------------------------+-------------+---------+--------------+--------------+--------------+---------------+---------------+---------------+

In this example you can see how much memory each analysis used and decide
how much to allocate to each of them, e.g. 100 Mb for most analyses, 200 Mb for
"createOtherJobs" and "emf2maf", and 1,500 Mb for "dumpMultiAlign".
However, the average memory usage of "dumpMultiAlign" is below
700 Mb, meaning that more than half of the requested memory would be wasted
most of the time. You can get the actual breakdown with this query::

> SELECT 100*CEIL(mem_megs/100) AS mem_megs, COUNT(*) FROM worker_resource_usage JOIN role USING (worker_id) WHERE analysis_id = 8 GROUP BY CEIL(mem_megs/100);
+----------+----------+
| mem_megs | COUNT(*) |
+----------+----------+
| 200 | 1 |
| 300 | 4 |
| 400 | 6 |
| 500 | 10 |
| 600 | 9 |
| 700 | 8 |
| 800 | 5 |
| 900 | 4 |
| 1000 | 2 |
| 1200 | 2 |
| 1400 | 1 |
+----------+----------+

You can see that about three quarters of the jobs used less than 700 Mb, so
another strategy is to give 700 Mb to the analysis, *expect* some jobs to
fail (i.e. to be killed by the compute farm) and wire a copy of the
analysis with more memory on the -1 branch (MEMLIMIT), cf.
:ref:`resource-limit-dataflow`. You can chain as many analyses as required
with MEMLIMIT to provide the appropriate memory usage steps, e.g.

.. hive_diagram::

{ -logic_name => 'Alpha',
-flow_into => {
-1 => [ 'Alpha_moremem' ],
},
},
{ -logic_name => 'Alpha_moremem',
-flow_into => {
-1 => [ 'Alpha_himem' ],
},
},
{ -logic_name => 'Alpha_himem',
-flow_into => {
-1 => [ 'Alpha_hugemem' ],
},
},
{ -logic_name => 'Alpha_hugemem',
},
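
In a PipeConfig, each analysis in such a chain is typically attached to a
successively larger resource class via ``-rc_name``. Below is a minimal
sketch of what that wiring might look like; the class names (``1Gb_job``,
``2Gb_job``, ...) are illustrative and would have to exist in your own
``resource_classes()`` section::

    # Illustrative sketch: all analyses in the chain would normally run the
    # same -module, only the resource class (i.e. the memory request) changes.
    { -logic_name => 'Alpha',
      -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',   # placeholder module
      -rc_name    => '1Gb_job',
      -flow_into  => { -1 => [ 'Alpha_moremem' ] },             # MEMLIMIT branch
    },
    { -logic_name => 'Alpha_moremem',
      -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
      -rc_name    => '2Gb_job',
      -flow_into  => { -1 => [ 'Alpha_himem' ] },
    },
    # ... and similarly for 'Alpha_himem' and 'Alpha_hugemem', each with a
    # bigger resource class, and no -1 branch on the last one.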

Relying on MEMLIMIT can be inconvenient at times:

* The mechanism may not be available on all job schedulers (of the ones
  eHive supports, only LSF has that functionality).
* When LSF kills the jobs, the open file handles and database connections
  are interrupted, potentially leading to corrupted data and temporary
  files left behind.
* Since the processes are killed in a *random* order and not atomically,
  the child process (e.g. an external program your Runnable is running)
  will sometimes be killed first, leaving the Runnable enough time to
  record this job attempt as failed (but not as MEMLIMIT), take another job
  and *then* be killed, making eHive think it is the *second* job that has
  exceeded the memory requirement. On LSF we advise waiting 30 seconds after
  detecting that an external command has been killed, to give LSF enough
  time to kill the worker too (see the sketch after this list).
* It is time-expensive, since a job may be tried with several memory
  requirements before finally finding the right one.
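
Below is a minimal sketch of that 30-second wait, for a Runnable that
launches an external program through Perl's ``system()``; the command is
hypothetical and the exact policy is up to you::

    # Illustrative only -- inside a Runnable's run() method
    my $cmd = 'external_aligner --input big_chunk.fa';   # hypothetical command
    my $rc  = system($cmd);

    if ($rc == -1) {
        die "Could not run '$cmd': $!";
    } elsif ($rc & 127) {
        # The child was killed by a signal, quite possibly by LSF enforcing
        # MEMLIMIT. Sleep so that LSF has time to kill this worker too,
        # instead of letting it record a plain failure and take another job.
        sleep(30);
        die sprintf("'%s' was killed by signal %d", $cmd, $rc & 127);
    }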

Instead of relying on MEMLIMIT, a more efficient approach is to predict the
amount of memory each job is going to need. You would first need to
understand what is causing the high memory usage, and try to correlate it
with some input parameters (for instance, the length of the chromosome, the
number of variants, etc.). Then you can define several resource classes and
add ``WHEN`` conditions to the seeding dataflow to wire each job to the right
resource class.
Here is an example from an Ensembl Compara pipeline::

-flow_into => {
"2->A" => WHEN (
"(#total_residues_count# <= 3000000) || (#dnafrag_count# <= 10)" => "pecan",
"(#total_residues_count# > 3000000) && (#total_residues_count# <= 30000000) && (#dnafrag_count# > 10) && (#dnafrag_count# <= 25)" => "pecan_mem1",
"(#total_residues_count# > 30000000) && (#total_residues_count# <= 60000000) && (#dnafrag_count# > 10) && (#dnafrag_count# <= 25)" => "pecan_mem2",
"(#total_residues_count# > 3000000) && (#total_residues_count# <= 60000000) && (#dnafrag_count# > 25)" => "pecan_mem2",
"(#total_residues_count# > 60000000) && (#dnafrag_count# > 10)" => "pecan_mem3",
),
"A->1" => [ "update_max_alignment_length" ],
},
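
The ``pecan*`` targets above are copies of the same analysis attached to
increasingly large resource classes. A possible shape for the corresponding
``resource_classes()`` section is sketched below; the class names, memory
amounts and LSF options are purely illustrative and not taken from the
actual Compara pipeline::

    sub resource_classes {
        my ($self) = @_;
        return {
            %{ $self->SUPER::resource_classes },    # keep eHive's 'default' class
            # Illustrative values; the units of -M depend on your LSF configuration
            '2Gb_job'  => { 'LSF' => '-M 2000  -R "rusage[mem=2000]"' },
            '8Gb_job'  => { 'LSF' => '-M 8000  -R "rusage[mem=8000]"' },
            '16Gb_job' => { 'LSF' => '-M 16000 -R "rusage[mem=16000]"' },
            '32Gb_job' => { 'LSF' => '-M 32000 -R "rusage[mem=32000]"' },
        };
    }

    # ... and in pipeline_analyses(), the four targets only differ by -rc_name
    # (they all run the same Runnable):
    { -logic_name => 'pecan',      -rc_name => '2Gb_job'  },   # plus -module, etc.
    { -logic_name => 'pecan_mem1', -rc_name => '8Gb_job'  },
    { -logic_name => 'pecan_mem2', -rc_name => '16Gb_job' },
    { -logic_name => 'pecan_mem3', -rc_name => '32Gb_job' },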


Resource usage overview
+++++++++++++++++++++++

The same data can also be plotted with the :ref:`generate_timeline.pl
<script-generate_timeline>` script, which produces a graphical timeline::

generate_timeline.pl --url $EHIVE_URL --mode memory -output timeline_memory_usage.png
generate_timeline.pl --url $EHIVE_URL --mode cores -output timeline_cpu_usage.png

.. figure:: timeline_memory_usage.png

Timeline of the memory usage. The hatched areas represent the amount of
memory that has been requested but not used.

Since eHive forces you to bin jobs into a smaller number of analyses, each
analysis having a single resource class (i.e. a single memory requirement),
individual jobs may not run with the exact amount of memory they need. Some
level of memory over-reservation **is** expected (although the plot above
shows rather too much of it!).

.. figure:: timeline_cpu_usage.png

Timeline of the CPU usage. The hatched areas represent the fraction of
the wall time spent on sleeping or waiting (for I/O, for instance).

Most of the time it is expected that the CPUs will not be fully used, as
most jobs have to read some input data and write some results, both of
which are subject to I/O waits. You also need to consider that every SQL
query you send to a database server (either directly or via an Ensembl API)
shifts the work to the server and makes your own Runnable wait for the
result.
Finally, many job schedulers (such as LSF) can
only allocate whole CPU cores, meaning that even if you estimate you only
need 50% of a core, you might still be forced to allocate one core and
"waste" the other 50%.

.. _capacity-and-batch-size:

Capacity and batch size
@@ -288,12 +442,13 @@ determines more Workers are needed, and the ``-total_running_workers_max``
value hasn't been reached, it will submit more, up to the limit of
``-submit_workers_max``.

Database server choice and configuration
++++++++++++++++++++++++++++++++++++++++

SQLite can have issues when multiple processes are trying to access the
database concurrently, because each process locks the whole database. As a
result, it behaves poorly when the number of workers reaches a dozen or so.

MySQL is better in those scenarios and can handle hundreds of concurrent
active connections. In our experience, the most important parameters of
@@ -304,3 +459,54 @@ We have only used PostgreSQL in small-scale tests. If you get the chance to
run large pipelines on PostgreSQL, let us know! We will be interested
in hearing how eHive behaves.

Database connections
++++++++++++++++++++

The Ensembl codebase does, by default, a very poor job of managing database
connections, and how to solve the "MySQL server has gone away" error is a
recurrent discussion topic. Even though eHive's database connections
themselves are in theory immune to this error, Runnables often use the
Ensembl connection mechanism via various Ensembl APIs and might still get
into trouble. The Core API in particular has two evil parameters:

* ``disconnect_when_inactive`` (boolean). When set, the API will
  disconnect after every single query. This can exhaust the pool of
  ports available on the worker's machine, leading to *all* processes
  on this machine failing to open a network connection. You can spot
  this when MySQL fails with the error code 99 (*Cannot assign
  requested address*). Leave this parameter set to 0 unless you use other
  mechanisms such as ``prevent_disconnect`` to prevent this from happening
  (see below).

* ``reconnect_when_lost`` (boolean). When set, the API will ping the server
  before *every* query, which is expensive.

There are however some useful methods and tools in the Core API:

* ``disconnect_if_idle``. This method will ask the DBConnection object to
disconnect from the database if possible (i.e. if the connection is not
used for anything). Simple, but it does the job. Use this when you're
done with a database.

* ``prevent_disconnect``. This method runs a piece of code with
  ``disconnect_when_inactive`` unset. Together they can form a clean way of
  handling database connections (see the sketch after this list):

  1. Set ``disconnect_when_inactive`` to 1 in general -- this works as
     long as the database is only used for one query once in a while.
  2. Use ``prevent_disconnect`` blocks when you're going to use the
     database for multiple consecutive queries.

  This way, the connection is only open when it is needed, and closed the
  rest of the time.

* ``ProxyDBConnection``. In Ensembl, a database connection is bound to one
  database only. However, data can be spread across multiple databases on
  the same server (e.g. the Ensembl live MySQL server), and the API is
  going to create one connection for each database, potentially quickly
  exhausting the number of connections available. ``ProxyDBConnection`` is
  a way of pooling multiple database connections to the same server within
  the same object (connection). See an example in the `Ensembl-compara API
  <https://github.com/Ensembl/ensembl-compara/blob/release/93/modules/Bio/EnsEMBL/Compara/Utils/CoreDBAdaptor.pm#L73-L109>`__.
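
Below is a minimal sketch of how ``disconnect_when_inactive``,
``prevent_disconnect`` and ``disconnect_if_idle`` might be combined in a
Runnable; the adaptor calls and the list of stable IDs are illustrative
only::

    # Illustrative sketch, assuming a Core DBAdaptor ($dba) is already available
    my $dbc = $dba->dbc();

    # Allow the API to drop the connection between occasional queries
    $dbc->disconnect_when_inactive(1);

    # Group bursts of consecutive queries so the connection stays open
    my @gene_stable_ids = ('ENSG00000139618');               # hypothetical input
    $dbc->prevent_disconnect( sub {
        my $gene_adaptor = $dba->get_GeneAdaptor();
        foreach my $stable_id (@gene_stable_ids) {
            my $gene = $gene_adaptor->fetch_by_stable_id($stable_id);
            # ... process $gene ...
        }
    } );

    # Explicitly release the connection once we are done with this database
    $dbc->disconnect_if_idle();
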
57 changes: 50 additions & 7 deletions docs/running_pipelines/monitoring.rst
@@ -164,13 +164,56 @@ diagram <hive_schema.png>`__ and `eHive schema
description <hive_schema.html>`__) for details on these tables and
their relations.

In addition to the tables, there are a number of views which summarize the
activity and progression of work across the Analyses in a pipeline.

First of all, the ``beekeeper_activity`` view shows all the registered
beekeepers, with some information about the number of loops they have
executed, when they were last seen, etc. The example query below lists the
beekeepers that are alive and the ones that have "disappeared" (i.e. likely
interrupted with Ctrl+C)::

> SELECT * FROM beekeeper_activity WHERE cause_of_death IS NULL OR cause_of_death = "DISAPPEARED";
+--------------+-------------+-----------------------+---------------+------------+------------+----------------+----------------+---------------------+---------------------------+------------+
| beekeeper_id | meadow_user | meadow_host | sleep_minutes | loop_limit | is_blocked | cause_of_death | loops_executed | last_heartbeat | time_since_last_heartbeat | is_overdue |
+--------------+-------------+-----------------------+---------------+------------+------------+----------------+----------------+---------------------+---------------------------+------------+
| 1 | muffato | ebi-cli-002.ebi.ac.uk | 1 | NULL | 0 | DISAPPEARED | 7 | 2018-05-12 22:55:05 | NULL | NULL |
| 3 | muffato | ebi-cli-002.ebi.ac.uk | 1 | NULL | 0 | DISAPPEARED | 26 | 2018-05-12 23:22:37 | NULL | NULL |
| 4 | muffato | ebi-cli-002.ebi.ac.uk | 1 | NULL | 0 | DISAPPEARED | 86 | 2018-05-13 00:48:45 | NULL | NULL |
| 11 | muffato | ebi-cli-002.ebi.ac.uk | 1 | NULL | 0 | DISAPPEARED | 2425 | 2018-05-15 14:01:24 | NULL | NULL |
| 19 | muffato | ebi-cli-002.ebi.ac.uk | 1 | NULL | 0 | DISAPPEARED | 3 | 2018-05-19 10:44:10 | NULL | NULL |
| 20 | muffato | ebi-cli-002.ebi.ac.uk | 1 | NULL | 0 | NULL | 3180 | 2018-05-21 16:00:17 | 00:00:57 | 0 |
+--------------+-------------+-----------------------+---------------+------------+------------+----------------+----------------+---------------------+---------------------------+------------+

Then, you can dig a bit further into what is currently running with the
``live_roles`` view::

> SELECT * FROM live_roles;
+-------------+-------------+-------------------+---------------------+-------------+---------------------------------------+----------+
| meadow_user | meadow_type | resource_class_id | resource_class_name | analysis_id | logic_name | count(*) |
+-------------+-------------+-------------------+---------------------+-------------+---------------------------------------+----------+
| mateus | LSF | 7 | 2Gb_job | 88 | hmm_thresholding_searches | 1855 |
| mateus | LSF | 14 | 8Gb_job | 89 | hmm_thresholding_searches_himem | 10 |
| mateus | LSF | 18 | 64Gb_job | 90 | hmm_thresholding_searches_super_himem | 1 |
| muffato | LSF | 7 | 2Gb_job | 88 | hmm_thresholding_searches | 929 |
| muffato | LSF | 14 | 8Gb_job | 89 | hmm_thresholding_searches_himem | 2 |
| muffato | LSF | 18 | 64Gb_job | 90 | hmm_thresholding_searches_super_himem | 7 |
+-------------+-------------+-------------------+---------------------+-------------+---------------------------------------+----------+

This example shows a "collaborative" run of the pipeline, with two users
running about 2,800 jobs between them.

Finally, the ``progress`` view tells you how your jobs are doing::

> SELECT * FROM progress;
+----------------------+----------------+--------+-------------+-----+----------------+
| analysis_name_and_id | resource_class | status | retry_count | cnt | example_job_id |
+----------------------+----------------+--------+-------------+-----+----------------+
| chrom_sizes(1) | default | DONE | 0 | 1 | 1 |
| base_age_factory(2) | 100Mb | DONE | 0 | 1 | 2 |
| base_age(3) | 3.6Gb | DONE | 0 | 25 | 4 |
| big_bed(4) | 1.8Gb | DONE | 0 | 1 | 3 |
+----------------------+----------------+--------+-------------+-----+----------------+

If you see Jobs in :hivestatus:`<FAILED>[ FAILED ]` state or Jobs with
retry\_count > 0 (which means they have failed at least once and had
Binary file added docs/running_pipelines/timeline_cpu_usage.png
Binary file added docs/running_pipelines/timeline_memory_usage.png
