Merge pull request #105 from EGA-archive/docs

Docs
EGA-archive · Jun 14, 2020 · da08361
2 parents 322467a + c69ff0c
commit da08361

Showing 40 changed files with 1,353 additions and 1,070 deletions.
diff --git a/README.md b/README.md
@@ -9,26 +9,25 @@ The [code](ingestion/lega) is written in Python (3.7+).
 
 You can provision and deploy the different components, locally, using [docker-compose](deploy).
 
-Other provisioning methods are provided by our partners:
+## Quick install
 
-* on an [OpenStack cluster](https://github.com/NBISweden/LocalEGA-deploy-terraform), using `terraform`;
-* on a [Kubernetes/OpenShift cluster](https://github.com/NBISweden/LocalEGA-deploy-k8s), using `kubernetes`;
-* on a [Docker Swarm cluster](https://github.com/NBISweden/LocalEGA-deploy-swarm), using `gradle`.
+	cd deploy
+	make -C bootstrap
+	make -j 4 images
+	make up
 
-# Architecture
+After a few seconds, you have a locally-deployed instance of
+LocalEGA (using a fake Central EGA), and you can run the
+[testsuite](tests).
 
-LocalEGA is divided into several components, as docker containers.
-
-| Components  | Role |
-|-------------|------|
-| db          | A Postgres database with appropriate schemas and isolations |
-| mq          | A (local) RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings, connected to the CentralEGA counter-part. |
-| inbox       | SFTP server, acting as a dropbox, where user credentials are fetched from CentralEGA |
-| ingesters   | Split the Crypt4GH header and move the remainder to the storage backend. No cryptographic task, nor access to the decryption keys. |
-| verifiers   | Decrypt the stored files and checksum them against their embedded checksum. |
-| archive     | Storage backend: as a regular file system or as a S3 object store. |
-| finalizers  | Handle the so-called _Stable ID_ filename mappings from CentralEGA. |
-| outgesters  | Front-facing checks for download permissions. |
-| streamers   | Fetch the files from the archive and re-encrypt its header for the given requester. |
+## Architecture
 
 Find the [LocalEGA documentation](http://localega.readthedocs.io) hosted on [ReadTheDocs.org](https://readthedocs.org/).
+
+![Architecture](docs/static/overview.png)
+
+Other provisioning methods are provided by [our partners](https://github.com/neicnordic/LocalEGA):
+
+* on an [OpenStack cluster](https://github.com/NBISweden/LocalEGA-deploy-terraform), using `terraform`;
+* on a [Kubernetes/OpenShift cluster](https://github.com/NBISweden/LocalEGA-deploy-k8s), using `kubernetes`;
+* on a [Docker Swarm cluster](https://github.com/NBISweden/LocalEGA-deploy-swarm), using `gradle`.
diff --git a/deploy/README.md b/deploy/README.md
@@ -25,6 +25,7 @@ The credentials for the test users are found in `private/users`.
 | make up          | `docker-compose up -d` | Use `docker-compose up -d --scale ingest=3 --scale verify=5` instead, if you want to start 3 ingestion and 5 verification workers. |
 | make down        | `docker-compose down -v` | `-v`: removing networks and volumes |
 | make ps          | `docker-compose ps` | |
+| make logs        | `docker-compose logs -f` | very verbose output |
 
 Note that, in this architecture, we use separate volumes, e.g. for the
 inbox area, for the archive (be it a POSIX file system or backed by
@@ -36,35 +37,34 @@ S3). They will be created on-the-fly by docker-compose.
 
 Create the base image by executing:
 
-	make image
+	make -j4 images
 
-It takes some time. The result is an image, named `egarchive/lega-base`, and containing `python 3.6` and the LocalEGA services.
+This takes some time. The result is an image named `egarchive/lega-base`, containing `python` and the LocalEGA services.
 
-The following images are pulled from Docker Hub, when starting LocalEGA (only the first time, if not present):
+The following images are also built (or, if not already present, pulled from Docker Hub the first time LocalEGA starts):
 
-* [`egarchive/lega-mq`](https://github.com/EGA-archive/LocalEGA-mq) (based on `rabbitmq:3.6.14-management`)
-* [`egarchive/lega-db`](https://github.com/EGA-archive/LocalEGA-db) (based on `postgres:11.2`)
+* `egarchive/lega-mq` (based on `rabbitmq:3.6.14-management`)
+* `egarchive/lega-db` (based on `postgres:12.1`)
 * [`egarchive/lega-inbox`](https://github.com/EGA-archive/LocalEGA-inbox) (based on OpenSSH version 7.8p1 and CentOS7)
-* `python:3.8-alpine3.11` 
 
 
 > Important notice: The user inside the container is called `lega`,
 > and its ID is by default 1000. When (re)building the image, the
-> above target `make image` will make the ID match the current user
-> calling the command. This is important to allow injected files to be
+> target `make image` will make the ID match the current user calling
+> the command. This is important to allow injected files to be
 > readable by the `lega` user inside the containers.
 
+If images are not available, docker-compose will try to pull them from Docker Hub or, if possible, build them locally.
+
 ----
 
 # Fake Central EGA
 
 We use 2 stubbing services in order to fake the necessary Central EGA components (mostly for local or Travis tests).
 
-| Container    | Role |
-|-------------:|------|
-| `cega-users` | Sets up a small list of test users |
-| `cega-mq`    | Sets up a RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings |
-
-If the `cega-users` is not built, it will be build by docker-compose. If you want to build yourself, you can run:
+| Container        | Role |
+|-----------------:|------|
+| `cega-users`     | Sets up a small list of test users |
+| `cega-mq`        | Sets up a RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings |
+| `cega-accession` | Sets up a non-persistent accession service |
 
-	docker-compose build cega-users
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
diff --git a/docs/amqp.rst b/docs/amqp.rst
@@ -0,0 +1,195 @@
+.. _cega_lega:
+
+Connection to Central EGA
+=========================
+
+All Local EGA instances are connected to Central EGA using `AMQP, the
+advanced message queueing protocol <http://www.amqp.org/>`_, that
+allows application components to send and receive messages. Messages
+are queued, not lost, and resent on network failure or connection
+problems. Naturally, this is configurable.
+
+
+In practice, the `reference implementation
+<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/mq>`_
+uses the RabbitMQ message broker for each LocalEGA, henceforth called
+*local broker*, which is the **only** component with the necessary
+credentials to connect to the Central EGA message broker, henceforth
+called *central broker*. The other LocalEGA components are connected
+to their respective local broker.
+
+.. note:: We pinned the RabbitMQ version to ``3.7.8``, so far, until
+          both the central broker and the local brokers can be
+          upgraded simultaneously to the latest version.
+
+
+For each LocalEGA instance, the central broker configures a ``vhost``,
+and creates the credentials to connect to that ``vhost`` in the form
+of a *username/password* pair. The local brokers then use a connection
+string with the following syntax:
+
+.. code-block:: console
+
+   amqps://<user>:<password>@<cega-host>:<port>/<vhost>
+
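As a rough illustration of the connection-string syntax above, the Python sketch below assembles such a URL. The function name, the example credentials, and the host are hypothetical. Port 5671 is the conventional AMQPS port, and a real deployment may well use another one. Each part is percent-encoded so that reserved characters in a password or vhost name cannot break the URL.

```python
from urllib.parse import quote


def cega_connection_url(user, password, host, port, vhost):
    """Build amqps://<user>:<password>@<cega-host>:<port>/<vhost>.

    Hypothetical helper, not part of the LocalEGA code base.
    """
    # Percent-encode credentials and vhost: ":", "@" and "/" are reserved
    # characters in a URL and would otherwise be misinterpreted.
    return "amqps://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        int(port),
        quote(vhost, safe=""),
    )


# Example values are made up for illustration only.
print(cega_connection_url("lega1", "p@ss/word", "cega.example.org", 5671, "lega1"))
```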
+
+.. image:: /static/amqp.png
+   :target: ./_static/amqp.png
+   :alt: RabbitMQ setup
+
+The connection is a two-way connection using a combination of a
+*federated queue* and a *shovel*.
+
+The local broker registers a *federated queue* with the central broker
+as *upstream*, named ``v1.files``, and listens to the incoming
+messages. In order to minimize the number of connection sockets, all
+Local EGAs only use *one* federated queue towards the central broker,
+and all messages in the queue are distinguished with a ``type``.
+
+Ingestion workers listen to the downstream queue of the local
+broker. If there are no messages to work on, the local broker will ask
+its upstream queue if it has messages. If so, messages are moved
+downstream. If not, ingestion workers wait for messages to arrive.
+
+.. note:: This allows a Local EGA instance to *also* ingest files from
+   sources other than Central EGA. For example, a message, external to
+   Central EGA, could be dropped in the local broker in order to
+   ingest non-EGA files.
+
+
+The central broker receives notifications from the local broker using
+a *shovel*. The local broker has an exchange named ``cega`` configured
+such that all messages published to it get forwarded to CentralEGA
+(using the same routing key). This is how we propagate the different
+statuses of the workflow to the central broker, using the following
+routing keys:
+
+* ``files.verified`` for properly ingested files, ready to request an Accession ID.
+* ``files.completed`` for properly backed-up files, ready to be distributed
+* ``files.error`` for user-related errors
+* ``files.inbox`` for inbox file operations
+
+The shovel is backed by a ``to_cega`` queue in case the central broker
+is temporarily unavailable. This is similar to a (reverse) federated
+queue.
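The routing keys listed above can be checked before a message is handed to the local broker. The Python sketch below is only an illustration under stated assumptions: the helper name is hypothetical and the actual publishing call is left out, since it depends on the AMQP client library in use. It returns the exchange, routing key and encoded body that a client would then publish.

```python
import json

CEGA_EXCHANGE = "cega"  # exchange name taken from the text above
ROUTING_KEYS = frozenset(
    ["files.verified", "files.completed", "files.error", "files.inbox"]
)


def prepare_for_cega(routing_key, message):
    """Return (exchange, routing_key, body) for a message bound for Central EGA.

    Hypothetical helper: it only validates and serializes, it does not publish.
    """
    if routing_key not in ROUTING_KEYS:
        raise ValueError("unknown routing key: {!r}".format(routing_key))
    body = json.dumps(message).encode("utf-8")
    return CEGA_EXCHANGE, routing_key, body


exchange, key, body = prepare_for_cega(
    "files.inbox",
    {"user": "john", "filepath": "/inbox/user/dir1/file.txt.c4gh", "operation": "upload"},
)
```

Rejecting unknown routing keys early keeps typos from silently producing messages that the shovel forwards but Central EGA does not expect.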
+
+
+Message interface (API) CEGA |connect| LEGA
+============================================
+
+It is necessary to agree on the format of the messages exchanged
+between Central EGA and any Local EGAs. All messages are
+JSON-formatted. The `JSON schemas describing the message formats
+<https://github.com/EGA-archive/LocalEGA/tree/docs/ingestion/schemas>`_
+can be found in the repository.
+
+When the brokers exchange messages, the message headers have the following properties:
+
+- a content type: ``application/json``
+- delivery mode: 2 (for persistence)
+- and a **required** correlation id.
+
+The correlation id is a UUID in its 36-character textual form (37 bytes in C, counting the terminating NUL), generated by `uuid_generate <https://linux.die.net/man/3/uuid_generate>`_.
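The three header properties above can be gathered in one place. The sketch below is a minimal Python illustration, with a hypothetical function name; it returns a plain dictionary instead of a client-specific properties object. It uses the standard `uuid` module in place of the C function `uuid_generate`. The textual form of a UUID is 36 characters long; the C API needs 37 bytes because of the terminating NUL.

```python
import uuid


def message_properties(correlation_id=None):
    """Return the header properties described above as a plain dict.

    Hypothetical helper: a real client would pass these values to its own
    properties object.
    """
    return {
        "content_type": "application/json",
        "delivery_mode": 2,  # 2 = persistent
        # Reuse the incoming correlation id when there is one, so that all
        # messages about the same file can be tied together.
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }


props = message_properties()
```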
+
+
+Central EGA |cegatolega| Local EGA
+----------------------------------
+
+Central EGA uses a unique upstream queue, to minimize the number of
+connection sockets. In order to distinguish messages, Central EGA adds
+a field named ``type`` to all outgoing messages. There are 5 types of
+messages so far:
+
+* ``type=ingest``: an ingestion trigger
+* ``type=cancel``: an ingestion cancellation
+* ``type=accession``: contains an accession id
+* ``type=mapping``: contains a dataset to accession id mapping (they
+  are known at the metadata release stage or when permissions are granted by a DAC)
+* ``type=heartbeat``: a means to check if the Local EGA instance is "alive"
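Since all messages arrive on a single queue, a consumer has to branch on the ``type`` field. The Python sketch below shows one possible way to do so with a small handler registry. All function names and return values are hypothetical, and only two of the five types are wired up, to keep the example short.

```python
import json

HANDLERS = {}


def handles(message_type):
    """Decorator registering a handler for one value of the ``type`` field."""
    def register(func):
        HANDLERS[message_type] = func
        return func
    return register


@handles("ingest")
def on_ingest(message):
    return "ingest requested for " + message["filepath"]


@handles("heartbeat")
def on_heartbeat(message):
    return "alive"


def dispatch(body):
    """Decode a JSON message and route it to the handler for its type."""
    message = json.loads(body)
    message_type = message.get("type")
    handler = HANDLERS.get(message_type)
    if handler is None:
        raise ValueError("unsupported message type: {!r}".format(message_type))
    return handler(message)


print(dispatch('{"type": "ingest", "user": "john", "filepath": "/inbox/user/dir1/file.txt.c4gh"}'))
```

A registry keeps the dispatch logic unchanged when a new message type is introduced: only a new decorated handler is needed.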
+
+Refer to the complete JSON Schemas for `the ingestion trigger message
+format
+<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/schemas/ingestion-trigger.json>`_
+and `the Accession ID message format
+<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/schemas/ingestion-accession.json>`_.
+
+For example, an ingestion trigger would have the following format:
+
+.. code::
+
+   {
+       "type": "ingest",
+       "user": "john",
+       "filepath": "/inbox/user/dir1/file.txt.c4gh",
+       "encrypted_checksums": [
+           { "type": "sha256", "value": "82E4e60e7beb3db2e06...f28c4c942703dabb6d6" }
+       ]
+   }
+
+and an accession id message from Central EGA would be:
+
+.. code::
+
+   {
+       "type": "accession",
+       "user": "john",
+       "filepath": "/inbox/user/dir1/file.txt.c4gh",
+       "accession_id": "EGAF00000123456",
+       "decrypted_checksums": [
+           { "type": "sha256", "value": "7853c53a03ccfc38683e...533e68ab37b5b790074" },
+           { "type": "md5", "value": "ee25789673d8711563d5fcb7234f9a68" }
+       ]
+   }
+
+
+Central EGA |legatocega| Local EGA
+----------------------------------
+
+Messages from Local EGA to Central EGA are used in the following cases:
+
+* Requesting an Accession ID
+* Notifying of the completion of an ingestion
+* Inbox operations
+* User-related Errors
+
+The message must contain the ``user`` or ``filepath``, and you can
+refer to the `JSON Schemas for ingestion messages
+<https://github.com/EGA-archive/LocalEGA/tree/master/ingestion/schemas/ingestion-to-cega.json>`_. Valid
+checksum algorithms are "md5" and "sha256", where "sha256" is
+preferred. For example, a request for an Accession ID could be:
+
+.. code::
+
+   {
+       "user": "john",
+       "filepath": "/inbox/user/dir1/file.txt.c4gh",
+       "decrypted_checksums": [
+           { "type": "sha256", "value": "7853c53a03ccfc38683e...533e68ab37b5b790074" },
+           { "type": "md5", "value": "ee25789673d8711563d5fcb7234f9a68" }
+       ]
+   }
+
+.. note:: When requesting an Accession ID, the md5 decrypted_checksums field is, for the moment, mandatory.
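To illustrate how the ``decrypted_checksums`` field of an Accession ID request might be filled in, the Python sketch below computes both digests in a single pass over the decrypted content. The function names are hypothetical and the content is passed as an iterable of byte chunks; how the chunks are obtained (decryption, storage access) is outside the scope of this example.

```python
import hashlib
import json


def decrypted_checksums(chunks):
    """Compute the sha256 and md5 digests of the decrypted content in one pass."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    for chunk in chunks:
        md5.update(chunk)
        sha256.update(chunk)
    return [
        {"type": "sha256", "value": sha256.hexdigest()},
        {"type": "md5", "value": md5.hexdigest()},  # md5 is currently mandatory
    ]


def accession_request(user, filepath, chunks):
    """Assemble a request for an Accession ID (hypothetical helper)."""
    return {
        "user": user,
        "filepath": filepath,
        "decrypted_checksums": decrypted_checksums(chunks),
    }


request = accession_request(
    "john", "/inbox/user/dir1/file.txt.c4gh", [b"decrypted ", b"content"]
)
print(json.dumps(request, indent=2))
```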
+
+The messages sent by the inbox hooks capture operation of the files,
+be it a (re)upload, a rename or a removal.  They must contain the
+fields: ``user``, ``filepath``, ``operation``, where the value is
+either ``upload``, ``rename`` or ``remove``.  In the case of a file
+renaming, the ``oldpath`` must be added to the required fields. For
+example, a file upload message could be:
+
+.. code::
+
+   {
+       "user": "john",
+       "filepath": "/inbox/user/dir1/file.txt.c4gh",
+       "operation": "upload"
+   }
+
+Optional fields can be added, such as ``filesize``, or
+``encrypted_checksums``.
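The rules for inbox messages described above (three required fields, a restricted set of operations, and ``oldpath`` for renames) can be sketched in Python as follows. The function name is hypothetical, and the authoritative definition remains the JSON schema in the repository.

```python
OPERATIONS = ("upload", "rename", "remove")


def inbox_message(user, filepath, operation, oldpath=None, **optional):
    """Build an inbox notification, enforcing the rules described above.

    Extra keyword arguments (e.g. filesize, encrypted_checksums) are added
    as optional fields.
    """
    if operation not in OPERATIONS:
        raise ValueError("operation must be one of {}".format(", ".join(OPERATIONS)))
    message = {"user": user, "filepath": filepath, "operation": operation}
    if operation == "rename":
        # A rename must also say where the file used to be.
        if oldpath is None:
            raise ValueError("a rename message requires 'oldpath'")
        message["oldpath"] = oldpath
    message.update(optional)
    return message


upload = inbox_message("john", "/inbox/user/dir1/file.txt.c4gh", "upload", filesize=1024)
```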
+
+
+.. |connect| unicode:: U+21cc .. <->
+.. |cegatolega| unicode:: U+21C0 .. ->
+.. |legatocega| unicode:: U+21BD .. <-
+.. _RabbitMQ: http://www.rabbitmq.com
diff --git a/docs/bootstrap.rst b/docs/bootstrap.rst
diff --git a/docs/code.rst b/docs/code.rst