# Building robust workflows with strong provenance

And we will do that using:

<img src="../../data/aiida-logo.png" width="500" style="height:auto; display:block; margin-left:auto; margin-right:auto;">

An open-source Python infrastructure to help researchers with:
- automating,
- managing,
- persisting,
- sharing, and
- reproducing
the complex workflows associated with modern computational science and all associated data.

***
## Provenance: A robust solution for process management and data traceability

What is a process, or what we call a *calculation*, fundamentally?

Well, it's just a data transformation!

<img src="../../data/aiida-calculation-recipe.jpg" width="500" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

When you run such a transformation through AiiDA, it stores:
- The data transformations or calculations
- The inputs and their metadata
- The outputs and their metadata
- Most crucially: The inter-connections

While doing so, AiiDA creates a directed acyclic graph (DAG) of the data flow and takes care of some important features:
- Once data is stored, it cannot be modified &rarr; **provenance**
- Data is queryable and can always be traced back &rarr; **reproducibility**
- Checkpointing allows for **continuation** (even if the computer is shut down)
- **Caching** prevents running the same calculation twice
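
To make these ideas concrete, here is a minimal, plain-Python sketch of a provenance store. It is a toy model only and does not use the AiiDA API: the names `DataNode`, `ToyProvenanceStore`, `store`, `run`, and `trace` are hypothetical, and the real AiiDA implementation (database-backed nodes, hashing, and caching) differs in many details. The sketch shows immutable data, links from outputs back to the calculation that created them, and a cache keyed on the calculation name plus a hash of its inputs.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    """Toy immutable data node: frozen dataclasses reject attribute assignment."""
    pk: int
    value: float


def _digest(value):
    """Hash a value so that identical inputs can be recognised later."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


class ToyProvenanceStore:
    """Hypothetical in-memory stand-in for a provenance database (not AiiDA)."""

    def __init__(self):
        self.nodes = {}      # pk -> DataNode
        self.creators = {}   # output pk -> (calculation name, input pks)
        self.cache = {}      # (calculation name, input digests) -> output pk
        self.executions = 0  # how many calculations were actually executed

    def store(self, value):
        node = DataNode(pk=len(self.nodes) + 1, value=value)
        self.nodes[node.pk] = node
        return node

    def run(self, func, *inputs):
        key = (func.__name__, tuple(_digest(node.value) for node in inputs))
        if key in self.cache:  # same calculation on the same inputs: reuse result
            return self.nodes[self.cache[key]]
        self.executions += 1
        output = self.store(func(*(node.value for node in inputs)))
        self.creators[output.pk] = (func.__name__, tuple(n.pk for n in inputs))
        self.cache[key] = output.pk
        return output

    def trace(self, node):
        """Follow the links backwards and list all upstream calculations."""
        if node.pk not in self.creators:
            return []  # an original input, not created by any calculation
        name, input_pks = self.creators[node.pk]
        history = []
        for pk in input_pks:
            history.extend(self.trace(self.nodes[pk]))
        return history + [name]


def add(x, y):
    return x + y


def multiply(x, y):
    return x * y


store = ToyProvenanceStore()
a, b, c = store.store(2), store.store(3), store.store(4)
total = store.run(add, a, b)
product = store.run(multiply, total, c)
again = store.run(add, a, b)  # served from the cache, nothing is re-executed

print(store.trace(product))
print(store.executions)
```

In this toy model, a repeated calculation simply returns the node that already exists; it is meant to convey the idea of caching, not to mirror how AiiDA handles cached results internally.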

***
## Scalability, interoperability, and high-throughput performance

### Learning by example: The LUMI hero run

AiiDA was built for high-throughput workloads, with the upcoming exascale era in mind:

<img src="../../data/lumi-hero-run.jpg" width="500" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

The hero run:

- Utilized a full partition of LUMI-C: **1,500** nodes with **128** cores each (**192k** cores in total)
- **~15k** simulations (geometry optimizations of inorganic compounds) orchestrated with AiiDA in **13** hours runtime
- **~8k** issues dealt with on the fly

During all of this, **AiiDA runs on the local machine**. So no need to:

- Mirror your local environment to the HPC
- Ask the HPC admin to install software for you
- Risk getting banned from the HPC because a background process is continuously running

***

## The cogs and wheels behind AiiDA

### System dependencies

To achieve performance for thousands of workflows and millions of data nodes, AiiDA requires two system services:

- The **RabbitMQ** message broker, which enables running multiple background daemon workers that orchestrate and
  monitor processes and write data to the database.
- The **PostgreSQL** database, which allows for concurrent write access by the daemon workers.

**Note** that for this tutorial, we will be using a simplified, service-less AiiDA installation that actually does not
require these two services. More information on the different ways to install AiiDA can be found in the [documentation](https://aiida.readthedocs.io/projects/aiida-core/en/latest/installation/index.html).

### Architecture

The other components of AiiDA are:
- An object-relational mapper (ORM), which links entries in the database to the Python objects we will be dealing with
- A custom [disk-objectstore](https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/storage/repository.html#the-disk-object-store) file repository, where raw files are stored
  in an efficient, machine-readable manner, and can be *packed* to reduce the number of files for quick backup and export
- A custom [daemon](https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/daemon.html#daemon) that handles
  the execution and retrieval of multiple simulations

### More on provenance

The main ORM entry point in AiiDA is the **Node** class, which provides the functionality to interact with the
underlying SQL database. From this, we branch off to the **Data** and **ProcessNode** classes, used to distinguish
between, you guessed it, **Data** and **Processes**.

For the latter, another important distinction is made. AiiDA defines:

- **Calculations** as processes that are able to **create** new data, and
- **Workflows** as processes that **orchestrate** other workflows and calculations; they **cannot create new
  data** and can only **return already existing data**.

This distinction allows for the conceptual separation of **data provenance** and **logical provenance**. In the
former case, due to the causality principle, a directed acyclic graph (DAG) must result, while in the latter case, as a
workflow can **return** its inputs, cycles can be present in the graph.
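
The difference between the two graphs can be illustrated with a small, plain-Python sketch. It does not use AiiDA: the node names (`D1`, `C1`, `W1`, ...) and the `has_cycle` helper are made up for this illustration, and the link labels are only meant to mirror the create/return distinction described above. The sketch checks that the data-provenance links form an acyclic graph, while a workflow that returns one of its own inputs introduces a cycle in the logical-provenance links.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = dict.fromkeys(graph, WHITE)

    def visit(node):
        colour[node] = GREY
        for neighbour in graph[node]:
            if colour[neighbour] == GREY:  # back edge: we closed a loop
                return True
            if colour[neighbour] == WHITE and visit(neighbour):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


# Hypothetical toy graph: D* are data nodes, C1 a calculation, W1 a workflow.
links = [
    ("D1", "input_calc", "C1"),  # D1 is an input of calculation C1
    ("C1", "create", "D2"),      # C1 creates the new data node D2
    ("D1", "input_work", "W1"),  # D1 is also an input of workflow W1
    ("W1", "return", "D1"),      # W1 returns its own input D1
]

DATA_LINKS = {"input_calc", "create"}
LOGICAL_LINKS = {"input_work", "return"}

data_edges = [(s, t) for s, kind, t in links if kind in DATA_LINKS]
logical_edges = [(s, t) for s, kind, t in links if kind in LOGICAL_LINKS]

data_is_acyclic = not has_cycle(data_edges)
logical_has_cycle = has_cycle(logical_edges)
print(data_is_acyclic, logical_has_cycle)
```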

The interested reader is referred to the [relevant documentation section on
provenance](https://aiida.readthedocs.io/projects/aiida-core/en/stable/topics/provenance/index.html), which provides an
in-depth discussion of the topic.

Importantly, AiiDA enforces **strict provenance**, and therefore when exporting/deleting entities of its database,
all connected Nodes necessary to keep the provenance consistent will also be exported/deleted.

## Implementation of Calculations and Workflows and where this tutorial will lead you

The two main classes that implement the aforementioned `Calculation`s and `Workflow`s are the `CalcJob`
and the `WorkChain` classes (to be precise, `CalcFunction` and `WorkFunction` also exist; however, we will not
cover them in this tutorial).

The `CalcJob` class is typically used to provide an interface for external codes, e.g. [Quantum
ESPRESSO (QE)](https://www.quantum-espresso.org) in materials science. It is at this stage that the first
external tool that we will cover in this workshop enters: [**`aiida-shell`**](https://aiida-shell.readthedocs.io/en/latest/).

Writing the `CalcJob` interface for an external code requires significant Python and AiiDA expertise and is typically a
task taken care of by an AiiDA plugin developer. The [AiiDA plugin
registry](https://aiidateam.github.io/aiida-registry/) currently contains almost 100 plugins with 163 `CalcJob`s
defined; however, these are mostly related to the field of materials science. If you'd like to start executing
a new external code from a different research domain in AiiDA, the next notebook will show you how `aiida-shell` can
help you get started doing just that!

Further, to define a workflow in AiiDA, the typical approach is to construct a `WorkChain` by inheriting from the
`WorkChain` class, like so: `class EuroSciPyWorkChain(WorkChain):`. However, constructing this class correctly again
requires advanced Python and AiiDA expertise, so to simplify the generation of workflows, the [**`aiida-workgraph`**](https://aiida-workgraph.readthedocs.io/en/stable/search.html)
package was created. In the third notebook, we will show you how you can use this tool to quickly construct your own workflows,
using existing AiiDA building blocks, external executables and scripts, or your own Python code.
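
To give a rough feeling for the "inherit and describe the steps" pattern, here is a deliberately simplified, plain-Python sketch. It does not import or reproduce AiiDA: `ToyWorkChain`, its `outline` attribute, and `AddMultiplyWorkChain` are hypothetical names chosen for this illustration. In AiiDA itself, a `WorkChain` subclass declares its inputs, outputs, and outline inside a `define` classmethod, and steps are tracked in the provenance graph, none of which is modelled here.

```python
class ToyWorkChain:
    """Hypothetical base class for illustration only; NOT aiida.engine.WorkChain."""

    outline = ()  # names of the step methods, executed in this order

    def __init__(self, **inputs):
        self.inputs = inputs
        self.ctx = {}      # scratch space shared between the steps
        self.outputs = {}

    def run(self):
        for step_name in self.outline:
            getattr(self, step_name)()
        return self.outputs


def add(x, y):
    """Stands in for a calculation that creates new data."""
    return x + y


def multiply(x, y):
    """Stands in for a second calculation."""
    return x * y


class AddMultiplyWorkChain(ToyWorkChain):
    """The workflow only orchestrates: the data is produced by the calculations."""

    outline = ("run_add", "run_multiply", "collect")

    def run_add(self):
        self.ctx["sum"] = add(self.inputs["x"], self.inputs["y"])

    def run_multiply(self):
        self.ctx["product"] = multiply(self.ctx["sum"], self.inputs["z"])

    def collect(self):
        # Return data that already exists instead of creating something new.
        self.outputs["result"] = self.ctx["product"]


results = AddMultiplyWorkChain(x=2, y=3, z=4).run()
print(results)
```

Even in this stripped-down form, the workflow author has to write a class, wire the steps together, and pass intermediate results around by hand, which hints at why a more lightweight way of composing workflows is attractive.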

Finally, we should mention that `aiida-shell` and `aiida-workgraph` are currently not part of the `aiida-core`
repository, and do not replace, but rather build on top of it:

<br>

<img src="../../data/aiida-core-shell-workgraph.jpg" width="800" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

For more in-depth information on how to write AiiDA workflows in the *classical* way, that is, by writing a custom
`WorkChain` class, we point you to the [relevant documentation
section](https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/write_workflows.html), as well as material from [past AiiDA virtual tutorials](https://aiida-tutorials.readthedocs.io/en/latest/sections/writing_workflows/index.html).

Lastly, it is important to note that, while the `aiida-shell` API has been quite stable for a while,
`aiida-workgraph` is still very much under active development. So any feedback you might have during this tutorial will
be very valuable to us!

So, let's install both tools in the next cell, and get started, shall we?!

!/apps/share64/debian10/anaconda/anaconda-7/envs/AIIDA/bin/python -m pip install aiida-shell==0.7.3
!/apps/share64/debian10/anaconda/anaconda-7/envs/AIIDA/bin/python -m pip install aiida-workgraph