Merge pull request #105 from EGA-archive/docs
Docs
silverdaz committed Jun 14, 2020
2 parents 322467a + c69ff0c commit da08361
Showing 40 changed files with 1,353 additions and 1,070 deletions.
35 changes: 17 additions & 18 deletions README.md
@@ -9,26 +9,25 @@ The [code](ingestion/lega) is written in Python (3.7+).

You can provision and deploy the different components, locally, using [docker-compose](deploy).

Other provisioning methods are provided by our partners:
## Quick install

* on an [OpenStack cluster](https://github.com/NBISweden/LocalEGA-deploy-terraform), using `terraform`;
* on a [Kubernetes/OpenShift cluster](https://github.com/NBISweden/LocalEGA-deploy-k8s), using `kubernetes`;
* on a [Docker Swarm cluster](https://github.com/NBISweden/LocalEGA-deploy-swarm), using `gradle`.
cd deploy
make -C bootstrap
make -j 4 images
make up

# Architecture
After a few seconds, you then have a locally-deployed instance of
LocalEGA (using a fake Central EGA), and you can run the
[testsuite](tests).

LocalEGA is divided into several components, as docker containers.

| Components | Role |
|-------------|------|
| db | A Postgres database with appropriate schemas and isolations |
| mq | A (local) RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings, connected to the CentralEGA counter-part. |
| inbox | SFTP server, acting as a dropbox, where user credentials are fetched from CentralEGA |
| ingesters | Split the Crypt4GH header and move the remainder to the storage backend. No cryptographic task, nor access to the decryption keys. |
| verifiers | Decrypt the stored files and checksum them against their embedded checksum. |
| archive | Storage backend: as a regular file system or as a S3 object store. |
| finalizers | Handle the so-called _Stable ID_ filename mappings from CentralEGA. |
| outgesters | Front-facing checks for download permissions. |
| streamers | Fetch the files from the archive and re-encrypt its header for the given requester. |
## Architecture

Find the [LocalEGA documentation](http://localega.readthedocs.io) hosted on [ReadTheDocs.org](https://readthedocs.org/).

![Architecture](docs/static/overview.png)

Other provisioning methods are provided by [our partners](https://github.com/neicnordic/LocalEGA):

* on an [OpenStack cluster](https://github.com/NBISweden/LocalEGA-deploy-terraform), using `terraform`;
* on a [Kubernetes/OpenShift cluster](https://github.com/NBISweden/LocalEGA-deploy-k8s), using `kubernetes`;
* on a [Docker Swarm cluster](https://github.com/NBISweden/LocalEGA-deploy-swarm), using `gradle`.
30 changes: 15 additions & 15 deletions deploy/README.md
@@ -25,6 +25,7 @@ The credentials for the test users are found in `private/users`.
| make up | `docker-compose up -d` | Use `docker-compose up -d --scale ingest=3 --scale verify=5` instead, if you want to start 3 ingestion and 5 verification workers. |
| make down | `docker-compose down -v` | `-v`: removing networks and volumes |
| make ps | `docker-compose ps` | |
| make logs | `docker-compose logs -f` | very verbose output |

Note that, in this architecture, we use separate volumes, e.g. for the
inbox area, for the archive (be it a POSIX file system or backed by
@@ -36,35 +37,34 @@ S3). They will be created on-the-fly by docker-compose.

Create the base image by executing:

make image
make -j4 images

It takes some time. The result is an image, named `egarchive/lega-base`, and containing `python 3.6` and the LocalEGA services.
It takes some time. The result is an image, named `egarchive/lega-base`, and containing `python` and the LocalEGA services.

The following images are pulled from Docker Hub, when starting LocalEGA (only the first time, if not present):
The following images are also generated (or pulled from Docker Hub, when starting LocalEGA, only the first time, if not present):

* [`egarchive/lega-mq`](https://github.com/EGA-archive/LocalEGA-mq) (based on `rabbitmq:3.6.14-management`)
* [`egarchive/lega-db`](https://github.com/EGA-archive/LocalEGA-db) (based on `postgres:11.2`)
* `egarchive/lega-mq` (based on `rabbitmq:3.6.14-management`)
* `egarchive/lega-db` (based on `postgres:12.1`)
* [`egarchive/lega-inbox`](https://github.com/EGA-archive/LocalEGA-inbox) (based on OpenSSH version 7.8p1 and CentOS7)
* `python:3.8-alpine3.11`


> Important notice: The user inside the container is called `lega`,
> and its ID is by default 1000. When (re)building the image, the
> above target `make image` will make the ID match the current user
> calling the command. This is important to allow injected files to be
> target `make image` will make the ID match the current user calling
> the command. This is important to allow injected files to be
> readable by the `lega` user inside the containers.
If images are not available, docker-compose will try to pull them from docker hub or build them locally, if possible.

----

# Fake Central EGA

We use 2 stubbing services in order to fake the necessary Central EGA components (mostly for local or Travis tests).

| Container | Role |
|-------------:|------|
| `cega-users` | Sets up a small list of test users |
| `cega-mq` | Sets up a RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings |

If the `cega-users` image is not built, it will be built by docker-compose. If you want to build it yourself, you can run:
| Container | Role |
|-----------------:|------|
| `cega-users` | Sets up a small list of test users |
| `cega-mq` | Sets up a RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings |
| `cega-accession` | Sets up a non-persistent accession service |

docker-compose build cega-users
1 change: 0 additions & 1 deletion docs/CONTRIBUTING.md

This file was deleted.

195 changes: 195 additions & 0 deletions docs/amqp.rst
@@ -0,0 +1,195 @@
.. _cega_lega:

Connection to Central EGA
=========================

All Local EGA instances are connected to Central EGA using `AMQP, the
Advanced Message Queuing Protocol <http://www.amqp.org/>`_, which
allows application components to send and receive messages. Messages
are queued rather than lost, and are resent after a network failure or
connection problem. Naturally, this is configurable.


In practice, the `reference implementation
<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/mq>`_
uses the RabbitMQ message broker for each LocalEGA, henceforth called
*local broker*, which is the **only** component with the necessary
credentials to connect to the Central EGA message broker, henceforth
called *central broker*. The other LocalEGA components are connected
to their respective local broker.

.. note:: We pinned the RabbitMQ version to ``3.7.8``, so far, until
   both the central broker and the local brokers can be
   upgraded simultaneously to the latest version.


For each LocalEGA instance, the central broker configures a ``vhost``,
and creates the credentials to connect to that ``vhost`` in the form
of a *username/password* pair. The local brokers then use a connection
string with the following syntax:

.. code-block:: console

   amqps://<user>:<password>@<cega-host>:<port>/<vhost>

.. image:: /static/amqp.png
   :target: ./_static/amqp.png
   :alt: RabbitMQ setup
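
As a minimal sketch of the connection string syntax above, the following Python helper assembles the URL from its parts. The function name, the example host and the port ``5671`` (the conventional AMQPS port) are assumptions for illustration and are not taken from the LocalEGA code base. Percent-encoding is applied so that special characters in the password or vhost do not break the URL.

```python
from urllib.parse import quote


def build_cega_url(user, password, host, port, vhost):
    """Assemble an amqps:// connection string (hypothetical helper)."""
    # safe="" also encodes "/" and "@", so each part stays a single URL component
    return "amqps://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(vhost, safe=""),
    )


if __name__ == "__main__":
    # Example values only; real credentials are issued by Central EGA
    print(build_cega_url("lega1", "p@ss/word", "cega.example.org", 5671, "lega1"))
```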

The connection is a two-way connection using a combination of a
*federated queue* and a *shovel*.

The local broker registers a *federated queue* with the central broker
as *upstream*, named ``v1.files``, and listens to the incoming
messages. In order to minimize the number of connection sockets, all
Local EGAs only use *one* federated queue towards the central broker,
and all messages in the queue are distinguished with a ``type``.

Ingestion workers listen to the downstream queue of the local
broker. If there are no messages to work on, the local broker will ask
its upstream queue if it has messages. If so, messages are moved
downstream. If not, ingestion workers wait for messages to arrive.

.. note:: This allows a Local EGA instance to *also* ingest files from
   other sources than Central EGA. For example, a message, external to
   Central EGA, could be dropped in the local broker in order to
   ingest non-EGA files.


The central broker receives notifications from the local broker using
a *shovel*. The local broker has an exchange named ``cega``, configured
such that all messages published to it are forwarded to Central EGA
(using the same routing key). This is how we propagate the different
statuses of the workflow to the central broker, using the following
routing keys:

* ``files.verified`` for properly ingested files, ready to request an Accession ID
* ``files.completed`` for properly backed-up files, ready to be distributed
* ``files.error`` for user-related errors
* ``files.inbox`` for inbox file operations

The shovel is backed by a ``to_cega`` queue in case the central broker
is temporarily unavailable. This is similar to a (reverse) federated
queue.
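
The routing keys listed above can be captured in a small lookup, sketched here in Python. The ``Status`` enum and the ``cega_publication`` helper are hypothetical names introduced for illustration. Only the exchange name ``cega`` and the four routing keys come from the text. The helper merely returns what would be handed to an AMQP client's publish call and does not talk to a broker.

```python
from enum import Enum


class Status(Enum):
    """Workflow statuses mapped to the routing keys described above."""
    VERIFIED = "files.verified"    # ingested, ready to request an Accession ID
    COMPLETED = "files.completed"  # backed up, ready to be distributed
    ERROR = "files.error"          # user-related errors
    INBOX = "files.inbox"          # inbox file operations


def cega_publication(status, body):
    """Return the (exchange, routing_key, body) triple for a status update."""
    # Everything published to the local "cega" exchange is shovelled upstream
    return ("cega", status.value, body)


if __name__ == "__main__":
    print(cega_publication(Status.VERIFIED, '{"user": "john"}'))
```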


Message interface (API) CEGA |connect| LEGA
============================================

It is necessary to agree on the format of the messages exchanged
between Central EGA and any Local EGAs. All messages are
JSON-formatted. The `JSON schemas describing the message formats
<https://github.com/EGA-archive/LocalEGA/tree/docs/ingestion/schemas>`_
can be found in the repository.

When the brokers exchange messages, the message headers have the following properties:

- a content type: ``application/json``
- delivery mode: 2 (for persistence)
- and a **required** correlation id.

The correlation id is a uuid of 37 characters, generated by `uuid_generate <https://linux.die.net/man/3/uuid_generate>`_.
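
A minimal Python sketch of these message properties follows. The helper name is hypothetical, and the key names follow the usual AMQP basic-properties naming (``content_type``, ``delivery_mode``, ``correlation_id``), which is an assumption about how a client library would receive them. Note that Python's ``str(uuid.uuid4())`` yields 36 characters; a 37-character count would be consistent with a C buffer that includes the terminating NUL, though that reading is also an assumption.

```python
import uuid


def message_properties(correlation_id=None):
    """Build the header properties expected on exchanged messages."""
    return {
        "content_type": "application/json",
        "delivery_mode": 2,  # 2 marks the message as persistent
        # Reuse an incoming correlation id when replying, else create a new one
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }


if __name__ == "__main__":
    print(message_properties())
```

Reusing the incoming correlation id when answering a message keeps all messages about the same file traceable across both brokers.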


Central EGA |cegatolega| Local EGA
----------------------------------

Central EGA uses a unique upstream queue, to minimize the number of
connection sockets. In order to distinguish messages, Central EGA adds
a field named ``type`` to all outgoing messages. There are 5 types of
messages so far:

* ``type=ingest``: an ingestion trigger
* ``type=cancel``: an ingestion cancellation
* ``type=accession``: contains an accession id
* ``type=mapping``: contains a dataset to accession id mapping (they
  are known at the metadata release stage or when permissions are granted by a DAC)
* ``type=heartbeat``: a means to check if the Local EGA instance is "alive"
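
Since all five message types arrive on the same queue, a consumer has to branch on the ``type`` field. The following Python sketch shows one way to do so with a dispatch table. The handler functions and their return values are placeholders invented for illustration and do not reflect the actual LocalEGA workers.

```python
import json


def on_ingest(msg):
    return "ingest " + msg["filepath"]


def on_cancel(msg):
    return "cancel " + msg["filepath"]


def on_accession(msg):
    return "accession " + msg["accession_id"]


def on_mapping(msg):
    return "mapping"


def on_heartbeat(msg):
    return "alive"


# One handler per message type listed above
HANDLERS = {
    "ingest": on_ingest,
    "cancel": on_cancel,
    "accession": on_accession,
    "mapping": on_mapping,
    "heartbeat": on_heartbeat,
}


def dispatch(raw):
    """Decode a JSON message and route it according to its ``type`` field."""
    msg = json.loads(raw)
    handler = HANDLERS.get(msg.get("type"))
    if handler is None:
        raise ValueError("unknown message type: {!r}".format(msg.get("type")))
    return handler(msg)


if __name__ == "__main__":
    print(dispatch('{"type": "ingest", "user": "john", "filepath": "/inbox/user/dir1/file.txt.c4gh"}'))
```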

Refer to the complete JSON Schemas for `the ingestion trigger message
format
<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/schemas/ingestion-trigger.json>`_
and `the Accession ID message format
<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/schemas/ingestion-accession.json>`_.

For example, an ingestion trigger would have the following format:

.. code::

   {
      "type": "ingest",
      "user": "john",
      "filepath": "/inbox/user/dir1/file.txt.c4gh",
      "encrypted_checksums": [ { "type": "sha256",
                                 "value": "82E4e60e7beb3db2e06...f28c4c942703dabb6d6" }]
   }

and an accession id message from Central EGA would be:

.. code::

   {
      "type": "accession",
      "user": "john",
      "filepath": "/inbox/user/dir1/file.txt.c4gh",
      "accession_id": "EGAF00000123456",
      "decrypted_checksums": [ { "type": "sha256",
                                 "value": "7853c53a03ccfc38683e...533e68ab37b5b790074" },
                               { "type": "md5",
                                 "value": "ee25789673d8711563d5fcb7234f9a68" }]
   }

Central EGA |legatocega| Local EGA
----------------------------------

Messages from Local EGA to Central EGA are used in the following cases:

* Requesting an Accession ID
* Notifying of the completion of an ingestion
* Inbox operations
* User-related Errors

The message must contain the ``user`` or ``filepath``, and you can
refer to the `JSON Schemas for ingestion messages
<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/schemas/ingestion-to-cega.json>`_. Valid
checksum algorithms are "md5" and "sha256", where "sha256" is
preferred. For example, a request for an Accession ID could be:

.. code::

   {
      "user": "john",
      "filepath": "/inbox/user/dir1/file.txt.c4gh",
      "decrypted_checksums": [ { "type": "sha256",
                                 "value": "7853c53a03ccfc38683e...533e68ab37b5b790074" },
                               { "type": "md5",
                                 "value": "ee25789673d8711563d5fcb7234f9a68" }]
   }

.. note:: When requesting an Accession ID, the md5 decrypted_checksums field is, for the moment, mandatory.
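
To illustrate how such a request could be assembled, here is a Python sketch that computes both checksums with ``hashlib``. The function name is hypothetical, and hashing an in-memory ``bytes`` value is a simplification: a real verifier would presumably hash the decrypted stream in chunks.

```python
import hashlib
import json


def accession_request(user, filepath, content):
    """Build an Accession ID request for already-decrypted ``content`` (bytes)."""
    return {
        "user": user,
        "filepath": filepath,
        "decrypted_checksums": [
            {"type": "sha256", "value": hashlib.sha256(content).hexdigest()},
            # md5 is included because it is, for the moment, mandatory
            {"type": "md5", "value": hashlib.md5(content).hexdigest()},
        ],
    }


if __name__ == "__main__":
    request = accession_request("john", "/inbox/user/dir1/file.txt.c4gh", b"example payload")
    print(json.dumps(request, indent=2))
```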

The messages sent by the inbox hooks capture operations on files,
be it a (re)upload, a rename or a removal. They must contain the
fields ``user``, ``filepath`` and ``operation``, where the value of
``operation`` is either ``upload``, ``rename`` or ``remove``. In the
case of a file renaming, ``oldpath`` is also required. For
example, a file upload message could be:

.. code::

   {
      "user": "john",
      "filepath": "/inbox/user/dir1/file.txt.c4gh",
      "operation": "upload"
   }

Optional fields can be added, such as ``filesize``, or
``encrypted_checksums``.
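
The field rules for inbox messages can be checked with a few lines of Python, sketched below. The function is a hypothetical illustration and not a replacement for the JSON schemas in the repository, which remain the reference.

```python
REQUIRED_FIELDS = ("user", "filepath", "operation")
OPERATIONS = ("upload", "rename", "remove")


def check_inbox_message(msg):
    """Return a list of problems found in an inbox message (empty if valid)."""
    problems = ["missing field: " + name for name in REQUIRED_FIELDS if name not in msg]
    operation = msg.get("operation")
    if operation is not None and operation not in OPERATIONS:
        problems.append("unknown operation: " + str(operation))
    # A rename must also say where the file came from
    if operation == "rename" and "oldpath" not in msg:
        problems.append("missing field: oldpath")
    return problems


if __name__ == "__main__":
    upload = {"user": "john", "filepath": "/inbox/user/dir1/file.txt.c4gh", "operation": "upload"}
    print(check_inbox_message(upload))
```

Optional fields such as ``filesize`` are simply ignored by this check.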


.. |connect| unicode:: U+21cc .. <->
.. |cegatolega| unicode:: U+21C0 .. ->
.. |legatocega| unicode:: U+21BD .. <-
.. _RabbitMQ: http://www.rabbitmq.com
29 changes: 0 additions & 29 deletions docs/bootstrap.rst

This file was deleted.

63 changes: 0 additions & 63 deletions docs/code.rst

This file was deleted.
