# Building robust workflows with strong provenance

And we will do that using:

<img src="../../data/figs/aiida-logo.png" width="500" style="height:auto; display:block; margin-left:auto; margin-right:auto;">

An open-source Python infrastructure to help researchers with:
- automating,
- managing,
- persisting,
- sharing, and
- reproducing
the complex workflows associated with modern computational science and all associated data.

***
## Provenance: A robust solution for process management and data traceability

What is a process, or what we call a *calculation*, fundamentally?

Well, it's just a data transformation!

<img src="../../data/figs/aiida-calculation-recipe.jpg" width="500" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

When you run such a transformation through AiiDA, it stores:
- The data transformations or calculations
- The inputs and their metadata
- The outputs and their metadata
- Most crucially: The inter-connections

While doing so, AiiDA creates a directed acyclic graph (DAG) of the data flow and takes care of some important features:
- Once data is stored, it cannot be modified &rarr; **provenance**
- Data is queryable and can always be traced back &rarr; **reproducibility**
- Checkpointing allows for **continuation** (even if the computer is shut down)
- **Caching** prevents running the same calculation twice
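
To build some intuition for two of these features, the following is a minimal, pure-Python sketch of an append-only store combined with hash-based caching. It is a toy model and not how AiiDA implements either feature; `ToyStore`, `fingerprint`, `run_cached`, and `toy_multiply` are names made up for this illustration.

```python
import hashlib
import json


class ToyStore:
    """Toy append-only store: entries can be added but never overwritten."""

    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        if key in self._entries:
            raise ValueError(f"entry {key!r} is already stored and immutable")
        self._entries[key] = value

    def get(self, key):
        return self._entries[key]

    def __contains__(self, key):
        return key in self._entries


def fingerprint(func_name, inputs):
    """Hash the function name and its inputs into a stable cache key."""
    payload = json.dumps({"func": func_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_cached(store, func, **inputs):
    """Run `func` only if no result for the same inputs is stored yet."""
    key = fingerprint(func.__name__, inputs)
    if key in store:
        return store.get(key), True  # cache hit: nothing is recomputed
    result = func(**inputs)
    store.put(key, result)
    return result, False


def toy_multiply(x, y):
    return x * y


store = ToyStore()
first = run_cached(store, toy_multiply, x=2, y=3)   # computed and stored
second = run_cached(store, toy_multiply, x=2, y=3)  # served from the store
print(first, second)  # (6, False) (6, True)
```

The second call returns the stored result and flags it as a cache hit, while any attempt to overwrite an existing entry raises an error.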

***
## Scalability, interoperability, and high-throughput performance

### Learning by example: The LUMI hero run

AiiDA was built for high-throughput workloads, with the upcoming exascale era in mind:

<img src="../../data/figs/lumi-hero-run.jpg" width="500" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

The hero run:

- Utilized a full partition of LUMI-C: **1,500** nodes with **128** cores each (**192k** cores in total)
- **~15k** simulations (geometry optimizations of inorganic compounds) orchestrated with AiiDA in **13** hours runtime
- **~8k** issues dealt with on the fly

During all of this, **AiiDA runs on the local machine**. So there is no need to:

- Mirror your local environment to the HPC
- Ask the HPC admin to install software for you
- Risk getting banned from the HPC because a background process is continuously running

***

## The cogs and wheels behind AiiDA

### System dependencies

To achieve performance for thousands of workflows and millions of data nodes, AiiDA requires two system services:

- The **RabbitMQ** message broker, which enables running multiple background daemon workers that orchestrate and
  monitor processes
- The **PostgreSQL** database, which allows for concurrent write access by the daemon workers

**Note** that for this tutorial, we will be using a simplified, service-less AiiDA installation that actually does not
require these two services. More information on the different ways to install AiiDA can be found in the [documentation](https://aiida.readthedocs.io/projects/aiida-core/en/latest/installation/index.html).

### Architecture

The other components of AiiDA are:
- An object-relational-mapper (ORM) which links entries in the database to the Python objects we will be dealing with
- A custom [disk-objectstore](https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/storage/repository.html#the-disk-object-store) file repository, where raw files are stored
  in an efficient, machine-readable manner, and can be *packed* to reduce the number of files for quick backup and export
- A custom [daemon](https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/daemon.html#daemon) that handles
  the execution and retrieval of multiple simulations

### Quickly set up a running instance

#### Interacting with AiiDA and creating a profile

While AiiDA is already installed in the conda kernel of this deployment, for each project one must set up a **profile**,
which defines the connection to the data storage (SQLite or PostgreSQL database and file repository), configuration, and
other settings.

Overall, AiiDA can be controlled in two ways:

1. Using the `verdi` command line interface (CLI), or `%verdi` magic in Jupyter notebooks.
2. Using the `aiida` Python API

As of AiiDA **v2.6.1**, which was released on 2024-07-01, it is possible to create a profile without the
PostgreSQL and RabbitMQ services mentioned previously. For the sake of this tutorial, we will use this simplified
version, and we refer you to the [installation instructions on
RTD](https://aiida.readthedocs.io/projects/aiida-core/en/stable/installation/index.html) for more information on how to
set up a fully functional high-performance profile.

To set up our profile, we just need to run the following notebook cell:

In [None]:
!/apps/share64/debian10/anaconda/anaconda-7/envs/AIIDA/bin/verdi presto --profile-name euro-scipy-2024

Now that we have created a profile, we will load the AiiDA Jupyter extension for convenience. This will allow us
to use the `%verdi` Jupyter magic commands, rather than having to run them in a subshell with the full, absolute
path to the `verdi` executable as done in the cell above.

In addition, this makes the `%aiida` Jupyter magic command available that, when executed, will automatically load the
previously created `euro-scipy-2024` default profile. Alternatively, a specific profile can also be loaded as follows:
```python
from aiida import load_profile
load_profile('euro-scipy-2024')
```
which is the typical way to load a profile and what you will see in most code snippets.

In [2]:
%load_ext aiida
%aiida

Now, we set some configuration options for our profile:

In [3]:
%verdi config set warnings.development_version false
%verdi config set warnings.showdeprecations false



And verify that the profile was created successfully via:

In [4]:
%verdi status

 ✔ version:     AiiDA v2.6.2
 ✔ config:      /home/geiger_j/aiida_projects/fair-workflows-workshop/.aiida
 ✔ profile:     euro-scipy-2024
 ✔ storage:     SqliteDosStorage[/home/geiger_j/aiida_projects/fair-workflows-workshop/.aiida/repository/sqlite_dos_0973a1a1368743c7a12a11c420c85d70]: open,
 ✔ broker:      RabbitMQ v3.9.13 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
 ⏺ daemon:      The daemon is not running.


For the service-less profile created above, the output should look something like:

```shell
 ✔ version:     AiiDA v2.6.2
 ✔ config:      /home/nanohub/<your-user>/.aiida
 ✔ profile:     euro-scipy-2024
 ✔ storage:     SqliteDosStorage[/home/nanohub/<your-user>/.aiida/repository/sqlite_dos_b25c3582f65647beb068a3e50636a274]: open,
 ⏺ broker:      No broker defined for this profile: certain functionality not available. See https://aiida-core.readthedocs.io/en/stable/installation/guide_quick.html#quick-install-limitations
 ⏺ daemon:      No broker defined for this profile: daemon is not available. See {URL_NO_BROKER}
```

### Data nodes

Before running any calculations, let's create and store a *data node*.
AiiDA implements data node types for the most common types of data (int, float, str, etc.), which you can extend with your own (composite) data node types if needed.
For this tutorial, we'll keep it very simple, and start by initializing an `Int` node and assigning it to the `x` variable:

In [1]:
from aiida import orm

x = orm.Int(2)


We can check the contents of the `x` variable like this:

In [None]:
x

<Int: uuid: b313bd5c-b0d4-489d-97af-abb862d9102b (unstored) value: 2>

Quite a bit of information on our freshly created node is returned:

- The data node is of the type `Int`
- The node has a *universally unique identifier* (**UUID**), here `b313bd5c-b0d4-489d-97af-abb862d9102b`
- The node is currently not stored in the database `(unstored)`
- The integer value of the node is `2`

Let's store the node in the database:

In [None]:
x.store()

<Int: uuid: b313bd5c-b0d4-489d-97af-abb862d9102b (pk: 1) value: 2>

As you can see, the data node has now been assigned a *primary key* (**PK**), a number that identifies the node in your database `(pk: 1)`.
The PK and UUID both reference the node with the only difference that the PK is unique *for your local database only*, whereas the UUID is a globally unique identifier and can therefore be used between *different* databases.
Use the PK only if you are working within a single database, e.g. in an interactive session, and the UUID in all other cases.
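
The difference between the two identifiers can be sketched with a small, pure-Python toy model. It is not AiiDA code; `ToyDatabase` and its `store` method are hypothetical names used only to illustrate why PKs collide across databases while UUIDs do not.

```python
import itertools
import uuid


class ToyDatabase:
    """Toy database that hands out a local PK and a global UUID per node."""

    def __init__(self):
        self._pk_counter = itertools.count(1)
        self.nodes = {}  # maps UUID -> (PK, value)

    def store(self, value, node_uuid=None):
        # A freshly created node gets a random UUID; an imported node keeps its own
        node_uuid = node_uuid or str(uuid.uuid4())
        pk = next(self._pk_counter)
        self.nodes[node_uuid] = (pk, value)
        return pk, node_uuid


db_a = ToyDatabase()
db_b = ToyDatabase()

pk_a, uuid_a = db_a.store(2)
pk_b, uuid_b = db_b.store(3)

# Both databases start counting at 1, so the PKs of unrelated nodes collide
print(pk_a, pk_b)

# "Importing" the node from db_a into db_b: it gets a new local PK but keeps its UUID
pk_imported, uuid_imported = db_b.store(2, node_uuid=uuid_a)
print(pk_imported, uuid_imported == uuid_a)
```

In this toy model, PK 1 refers to two different nodes depending on the database, whereas the UUID keeps pointing to the same node after the import.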

> **Note**
> 
> The PK numbers shown throughout this tutorial assume that you start from a completely empty database.
> It is possible that the nodes' PKs will be different for your database!
>
> The UUIDs are generated randomly and are, therefore, **guaranteed** to be different.


Next, let's use the `verdi` command line interface (CLI) to check the data node we have just created:
(**Tip**: to dynamically access the PK of the node when using the `%verdi` magic command, you can use `{x.pk}`.)

In [None]:
%verdi node show {x.pk}

Once again, we can see that the node is of type `Int`, along with its PK and UUID.
Besides this information, the `verdi node show` command also shows the (empty) `label` and `description`, as well as the time the node was created (`ctime`) and last modified (`mtime`).

> **Note**
> AiiDA already provides many standard data types, but you can also [create your own](https://aiida.readthedocs.io/projects/aiida-core/en/stable/topics/data_types.html#topics-data-types-plugin).

### Calculation functions

Once your data is stored in the database, it is ready to be used for some computational task.
For example, let's say you want to multiply two `Int` data nodes.
The following Python function:

```python
def multiply(x, y):
    return x * y
```

will give the desired result when applied to two `Int` nodes, but the calculation will not be stored in the provenance graph.
However, we can use a [Python decorator](https://docs.python.org/3/glossary.html#term-decorator) provided by AiiDA to automatically make it part of the provenance graph, as shown below:

In [None]:
from aiida import engine

@engine.calcfunction
def multiply(x, y):
    return x * y

This converts the `multiply` function into an AiiDA *calculation function*, the most basic execution unit in AiiDA.
Next, let's create a new `Int` data node and assign it to the variable `y`, and then run the `multiply` function with the `x` and `y` data nodes as inputs:

In [None]:
y = orm.Int(3)

Now it's time to multiply the two numbers!

In [None]:
multiply(x, y)

Success!
The `calcfunction`-decorated `multiply` function has multiplied the two `Int` data nodes and returned a new `Int` data node whose value is the product of the two input nodes.
Note that by executing the `multiply` function, all input and output nodes are automatically stored in the database:

In [None]:
y

We had not yet stored the data node assigned to the `y` variable, but by providing it as an input argument to the `multiply` function, it was automatically stored with PK = 2.
Similarly, the returned `Int` node with value 6 has been stored with PK = 4.
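
To get a feeling for what such a decorator does behind the scenes, here is a heavily simplified, pure-Python stand-in. It is only a sketch of the idea and not AiiDA's implementation; `ToyNode`, `toy_calcfunction`, `toy_multiply`, and the `PROVENANCE` list are names invented for this example.

```python
import functools
import itertools

_pk_counter = itertools.count(1)
PROVENANCE = []  # one record per executed call


class ToyNode:
    """Toy data node wrapping a value; it gets a PK once it is stored."""

    def __init__(self, value):
        self.value = value
        self.pk = None

    def store(self):
        if self.pk is None:
            self.pk = next(_pk_counter)
        return self


def toy_calcfunction(func):
    """Store the inputs, run `func` on their values, then store and link the output."""

    @functools.wraps(func)
    def wrapper(**inputs):
        for node in inputs.values():
            node.store()  # unstored inputs are stored automatically
        process_pk = next(_pk_counter)
        values = {name: node.value for name, node in inputs.items()}
        result = ToyNode(func(**values)).store()
        PROVENANCE.append({
            "process": func.__name__,
            "pk": process_pk,
            "inputs": {name: node.pk for name, node in inputs.items()},
            "output": result.pk,
        })
        return result

    return wrapper


@toy_calcfunction
def toy_multiply(a, b):
    return a * b


a = ToyNode(2).store()           # stored explicitly, PK 1
b = ToyNode(3)                   # not stored yet
result = toy_multiply(a=a, b=b)  # stores b (PK 2), the process (PK 3), and the result (PK 4)
print(result.value, PROVENANCE[0])
```

The recorded entry links the process to the PKs of its inputs and its output, which is the kind of information a provenance graph is built from.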

Let's look for the process we have just run using the `verdi` CLI:

In [None]:
%verdi process list -a

We can see that our `multiply` calcfunction was created 1 minute ago, assigned the PK 3, and has `Finished`.

### The provenance graph
An AiiDA database does not only contain the results of your calculations, but also their inputs and each step that was executed to obtain them. All of this information is stored in the form of a directed acyclic graph (DAG).
Let's have a look at the provenance of this simple calculation.
The provenance graph can be automatically generated using the verdi CLI.
Let's generate the provenance graph for the `multiply` calculation function we have just run with PK = 3:

> **Note**
> Remember that the PK of the calculation function can be different for your database.

```console
$ verdi node graph generate 3
```

The command will write the provenance graph to a `.pdf` file.
Use your favorite PDF viewer to have a look.
It should look something like the graph shown below.

In [None]:
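from aiida.tools.visualization import Graph

# Look up the process node of the `multiply` call through the outgoing links of its input `y`,
# so that no PK (which may differ in your database) needs to be hard-coded
calc_node = y.base.links.get_outgoing(node_class=orm.CalcFunctionNode).first().node
graph = Graph()
graph.add_incoming(calc_node, annotate_links="both")
graph.add_outgoing(calc_node, annotate_links="both")
graph.graphviz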

In the provenance graph, you can see different types of *nodes* represented by different shapes.
The green ellipses are `Data` nodes, and the rectangles represent *processes*, i.e. the calculations performed in your *workflow*.

The provenance graph allows us to not only see what data we have, but also how it was produced.

### More on provenance

The main ORM entry point in AiiDA is the **Node** class, which provides the functionalities to interact with the
underlying SQL database. From this, we branch off to the **Data** and **ProcessNode** classes, used to distinguish
between, you guessed it, **Data** and **Processes**.

For the latter, another important distinction is then made. AiiDA defines:

- **Calculations** as processes that are able to **create** new data, and
- **Workflows** as processes that **orchestrate** other workflows and calculations, and that **cannot create new
  data**, but only **return already existing data**.

This distinction allows for the conceptual separation of the **data provenance** and the **logical provenance**. In the
former case, due to the causality principle, a directed acyclic graph (DAG) must result, while in the latter case, as a
workflow can **return** its inputs, cycles can be present in the graph.
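
The difference between the two graphs can be checked with a few lines of standard-library Python. This is only a sketch on hand-written toy graphs: the node names and the `is_acyclic` helper are made up for the illustration, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle."""
    sorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(target, source)  # `target` depends on `source`
    try:
        sorter.prepare()  # raises CycleError if the graph contains a cycle
    except CycleError:
        return False
    return True


# Data provenance: inputs feed a calculation, which creates a new data node
data_provenance = [
    ("x", "multiply"),
    ("y", "multiply"),
    ("multiply", "product"),
]

# Logical provenance: a workflow takes `x` as input and returns the very same node
logical_provenance = [
    ("x", "workflow"),
    ("workflow", "x"),
]

print(is_acyclic(data_provenance))     # True
print(is_acyclic(logical_provenance))  # False
```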

The interested reader is referred to the [relevant documentation section on
provenance](https://aiida.readthedocs.io/projects/aiida-core/en/stable/topics/provenance/index.html), which provides an
in-depth discussion of the topic.

Importantly, AiiDA enforces **strict provenance**, and therefore when exporting/deleting entities of its database,
all connected Nodes necessary to keep the provenance consistent will also be exported/deleted.
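
As a rough illustration of what keeping the provenance consistent means, the sketch below collects the nodes that travel along with a given node in a toy graph. It assumes a simplified picture in which an export pulls in everything upstream and a deletion removes everything downstream; the actual traversal rules in AiiDA are more fine-grained, and `LINKS` and `closure` are hypothetical names.

```python
from collections import deque

# Toy provenance graph as (source, target) links; all names are made up
LINKS = [
    ("a", "calc_1"),
    ("b", "calc_1"),
    ("calc_1", "c"),
    ("c", "calc_2"),
    ("calc_2", "d"),
]


def closure(start, links, direction):
    """Collect `start` plus everything reachable upstream or downstream of it."""
    if direction not in ("upstream", "downstream"):
        raise ValueError(f"unknown direction: {direction!r}")
    neighbours = {}
    for source, target in links:
        if direction == "downstream":
            neighbours.setdefault(source, []).append(target)
        else:
            neighbours.setdefault(target, []).append(source)
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first traversal along the chosen direction
        for nxt in neighbours.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Exporting `c` drags along everything needed to explain where it came from
print(sorted(closure("c", LINKS, "upstream")))    # ['a', 'b', 'c', 'calc_1']
# Deleting `c` also removes everything that was derived from it
print(sorted(closure("c", LINKS, "downstream")))  # ['c', 'calc_2', 'd']
```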

## Implementation of Calculations and Workflows and where this tutorial will lead you

The two main classes that provide the aforementioned implementation of `Calculation`s and `Workflow`s are the `CalcJob`
and the `WorkChain` classes (to be precise, `CalcFunction` and `WorkFunction` also exist; however, we will not
cover them further in this tutorial).

The `CalcJob` class is typically used to provide an interface for external codes, e.g. [Quantum
ESPRESSO (QE)](https://www.quantum-espresso.org) in materials science. This is where the first
external tool that we will cover in this workshop enters: [**`aiida-shell`**](https://aiida-shell.readthedocs.io/en/latest/).

Writing the `CalcJob` interface for an external code requires significant Python and AiiDA expertise, and is typically a
task taken care of by an AiiDA plugin developer. The [AiiDA plugin
registry](https://aiidateam.github.io/aiida-registry/) currently contains almost 100 plugins with 163 `CalcJob`s
defined; however, these are mostly related to the field of materials science. If you'd like to start executing
a new external code from a different research domain in AiiDA, the next notebook will show you how `aiida-shell` can
help you kickstart just that!

Further, to define a workflow in AiiDA, the typical approach is to construct a `WorkChain` by inheriting from the
`WorkChain` class, like so: `class EuroSciPyWorkChain(WorkChain):`. However, constructing this class correctly again
requires advanced Python and AiiDA expertise, so to simplify the generation of workflows, the [**`aiida-workgraph`**](https://aiida-workgraph.readthedocs.io/en/stable/search.html)
package was created. In the third notebook, we will show you how you can use this tool to quickly construct your own workflows,
using existing AiiDA building blocks, external executables and scripts, or your own Python code.

Finally, we should mention that `aiida-shell` and `aiida-workgraph` are currently not part of the `aiida-core`
repository, and do not replace, but rather build on top of it:

<br>

<img src="../../data/figs/aiida-core-shell-workgraph.jpg" width="800" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

For more in-depth information on how to write AiiDA workflows in the *classical* way, that is, by writing a custom
`WorkChain` class, we point you to the [relevant documentation
section](https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/write_workflows.html), as well as material from [past AiiDA virtual tutorials](https://aiida-tutorials.readthedocs.io/en/latest/sections/writing_workflows/index.html).

Lastly, it is important to note that, while the `aiida-shell` API has been quite stable for a while,
`aiida-workgraph` is still very much under active development. So any feedback you might have during this tutorial will
be very valuable to us!

So, let's install both tools in the next cell, and get started, shall we?!

In [None]:
!/apps/share64/debian10/anaconda/anaconda-7/envs/AIIDA/bin/python -m pip install aiida-shell==0.7.3
!/apps/share64/debian10/anaconda/anaconda-7/envs/AIIDA/bin/python -m pip install aiida-workgraph[widget]==0.3.22