# How to quickly create a workflow from a set of executables

**Please note, this notebook depends on successful execution of the first notebook `1-aiida-intro.ipynb`!**

To run the following Python cells, we need to make sure that we select the correct kernel `Python3.10 (AIIDA)`. If it is
not already selected, do so as follows:

<img src="../../data/figs/change_notebook_kernel.png" width="500" style="height:auto; display:block; margin-left:auto; margin-right:auto;">

***

## Concatenating several scripts into one workflow

### The workflow setup

Now that we have a working profile set up, assume we would like to execute a workflow that is composed of the following
steps:

1. Query a database that contains some matrices
2. Run a code that diagonalizes a matrix and writes the eigenvalues and eigenvectors to files on disk
3. Plot the eigenvalues obtained in the previous step

Each step of our workflow can be of arbitrary nature, e.g. an executable on your system, a shell script, Python code, etc. For the example workflow outlined above, we provide the steps as pre-compiled binaries. Their source code does not matter here, but if you are interested, you can find it under the `data` directory.

<img src="../../data/figs/dummy-workflow.png" width="400" style="height:auto; display:block; margin-left:auto;
margin-right:auto;">

Now, let's import the necessary modules, importantly the AiiDA ORM:

In [None]:
from pathlib import Path
import numpy as np
from IPython.display import Image, display

from aiida import orm
from aiida_shell.parsers import ShellParser
from aiida.tools.visualization import Graph

In [None]:
%load_ext aiida
%aiida

In [None]:
def provenance_graph(aiida_node):
    graph = Graph()
    graph.recurse_ancestors(aiida_node, annotate_links="both")
    graph.recurse_descendants(aiida_node, annotate_links="both")
    display(graph.graphviz)

Great, now we're ready to run our binaries through `aiida-shell`. They should already be accessible via the
`$PATH`. If not, uncomment these lines to make them available:

In [None]:
# !echo 'export PATH=$PATH:$HOME/fair-workflows-workshop/data/euro-scipy-2024/diag-wf/' >> ~/.bash_profile
# !echo 'export PATH=$PATH:$HOME/fair-workflows-workshop/data/euro-scipy-2024/diag-wf/bin/default' >> ~/.bash_profile
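Before editing your shell profile, it can help to check which of the three executables are actually missing. The following is a minimal sketch using only the standard library; the helper name `missing_executables` is made up for this illustration. Note that it inspects the `PATH` of the notebook kernel process, which may differ from the environment in which the jobs are eventually run.

```python
import shutil


def missing_executables(names, path=None):
    """Return the names that cannot be found on `path` (default: the kernel's PATH)."""
    return [name for name in names if shutil.which(name, path=path) is None]


# The three commands used in this notebook.
missing = missing_executables(['remote_query.py', 'diag', 'plot_eigvals.py'])
print('Missing from PATH:', missing or 'nothing')
```

If the list is empty, the commented-out lines above are most likely not needed.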

We then load the `launch_shell_job` function:

In [None]:
from aiida_shell import launch_shell_job

We pass it:

- The codes that we want to execute, which can be any of the following:
  - Common binaries available on Linux systems, such as `cat`, `echo`, etc.
  - Custom binaries, if they are discoverable, e.g. by adding them to `$PATH`
  - Full paths to custom binaries (although this leads to long code labels)
  - Previously created `Code` instances that are already registered in AiiDA
- The two required command line arguments, namely
  - The path to the mocked external database from which we want to obtain data, and
  - The matrix identifier (feel free to change that to a value between 0 and 100 to obtain different results)
- Lastly, we also specify the output filename of the file that our executable will create (note that `stdout` and
  `stderr` are automatically captured by `aiida-shell`)

In [None]:
db_path = str(Path('../../data/euro-scipy-2024/diag-wf/remote/matrices.db').resolve())
matrix_id = 0
matrix_file = f'matrix-{matrix_id}.npy'

# 1. Query a remote database for data

query_results, query_node = launch_shell_job(
    'remote_query.py',
    arguments=f'{db_path} {matrix_id}',
    outputs=[matrix_file]
)

That was simple, wasn't it?

Now, `aiida-shell` allows us to pass the output of one job as the input of another job, so let's do that for the next
step, and then unpack it:

In [None]:
# 2. Diagonalize the matrix

eigvals_file = f'matrix-{matrix_id}-eigvals.txt'
matrix_file_link_label = ShellParser.format_link_label(matrix_file)

diag_results, diag_node = launch_shell_job(
    'diag',
    arguments='{matrix_file}',
    nodes={
        'matrix_file': query_results[matrix_file_link_label]
    },
    outputs=[eigvals_file]
)
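The results dictionary is not keyed by the raw filename, which is why the cell above first passes `matrix_file` through `ShellParser.format_link_label`. The sketch below approximates that conversion with the standard library, assuming that characters other than letters, digits and underscores become underscores and that consecutive underscores are merged. `sketch_link_label` is a hypothetical helper for illustration only; in the notebook, keep using `ShellParser.format_link_label`.

```python
import re


def sketch_link_label(filename: str) -> str:
    """Approximate a filename-to-link-label conversion (illustrative only)."""
    # Replace every run of characters other than letters, digits and underscores.
    label = re.sub(r'[^0-9a-zA-Z_]+', '_', filename)
    # Merge consecutive underscores into a single one.
    return re.sub(r'_+', '_', label)


print(sketch_link_label('matrix-0.npy'))          # matrix_0_npy
print(sketch_link_label('matrix-0-eigvals.txt'))  # matrix_0_eigvals_txt
```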

In [None]:
# 3. Plot the eigenvalues

plot_type = 'violin'
eigvals_file_link_label = ShellParser.format_link_label(eigvals_file)
figure_file = f'matrix-{matrix_id}-eigvals-{plot_type}.png'
figure_file_link_label = ShellParser.format_link_label(figure_file)

plot_results, plot_node = launch_shell_job(
    'plot_eigvals.py',
    arguments='-i {eigenval_txt} -p {plot_type}',
    nodes={
        'eigenval_txt': diag_results[eigvals_file_link_label],
        'plot_type': orm.Str(plot_type)
    },
    outputs=[figure_file]
)
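The `{eigenval_txt}` and `{plot_type}` placeholders in `arguments` are filled in from the entries of `nodes` with the same names. The toy function below illustrates the idea with plain strings. It is a simplified stand-in and not `aiida-shell`'s actual implementation; the name `render_arguments` and the example values are assumptions made for this sketch.

```python
import shlex


def render_arguments(arguments: str, values: dict[str, str]) -> list[str]:
    """Split an argument string into tokens and fill `{placeholder}` tokens from `values`."""
    return [token.format(**values) for token in shlex.split(arguments)]


argv = render_arguments(
    '-i {eigenval_txt} -p {plot_type}',
    {'eigenval_txt': 'matrix-0-eigvals.txt', 'plot_type': 'violin'},
)
print(argv)  # ['-i', 'matrix-0-eigvals.txt', '-p', 'violin']
```

In the real job, the values come from AiiDA nodes rather than plain strings, so the link between each input and the command that consumed it is recorded in the provenance graph.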

In [None]:
%verdi process list -ap 1

Once all processes have (hopefully) finished successfully, we can visualize the final plotted result, as well as the
provenance graph that AiiDA has created from the execution of our workflow:

In [None]:
with plot_node.outputs[figure_file_link_label].as_path() as filepath:
    display(Image(filename=filepath))

In [None]:
provenance_graph(plot_node)

Normally, while a job is still running, the `verdi process list` command executed above shows it passing through the following states, in order:

```bash
# Step 1
  PK  Created    Process label                     ♻    Process State    Process status
----  ---------  --------------------------------  ---  ---------------  ----------------
   6  1s ago     ShellJob<remote_query@localhost>       ⏵ Waiting        Waiting for transport task: upload

# Step 2
  PK  Created    Process label                     ♻    Process State    Process status
----  ---------  --------------------------------  ---  ---------------  ----------------
   6  2s ago     ShellJob<remote_query@localhost>       ⏵ Waiting        Waiting for transport task: submit

# Step 3
  PK  Created    Process label                     ♻    Process State    Process status
----  ---------  --------------------------------  ---  ---------------  ----------------
   6  3s ago     ShellJob<remote_query@localhost>       ⏵ Waiting        Monitoring scheduler: job state QUEUED

# Step 4
  PK  Created    Process label                     ♻    Process State    Process status
----  ---------  --------------------------------  ---  ---------------  ----------------
   6  4s ago     ShellJob<remote_query@localhost>       ⏵ Waiting        Monitoring scheduler: job state RUNNING

# Step 5
  PK  Created    Process label                     ♻    Process State    Process status
----  ---------  --------------------------------  ---  ---------------  ----------------
   6  5s ago     ShellJob<remote_query@localhost>       ⏵ Waiting        Waiting for transport task: retrieve

# Step 6
  PK  Created    Process label                     ♻    Process State    Process status
----  ---------  --------------------------------  ---  ---------------  ----------------
   6  6s ago     ShellJob<remote_query@localhost>       ⏹ Finished [0]
```

In this tutorial, however, processes will always already be in the `Finished [0]` state by the time you list them. Without the RabbitMQ dependency, we
cannot `submit` them to the daemon in a non-blocking manner, so we `run` them in the notebook cells instead, which blocks until each job has completed.

## Custom parsing of your results

When running your matrix diagonalization, you might not only want to obtain the output file with the eigenvalues, but
also retrieve them as Python objects, so that you can then operate on them directly. This can be achieved in
`aiida-shell` by attaching a custom parser, like so:

In [None]:
# 1. Query a remote database for data

query_results, query_node = launch_shell_job(
    'remote_query.py',
    arguments=f'{db_path} {matrix_id}',
    outputs=[matrix_file]
)

# Custom parser that reads the created output file and returns the eigenvalues as an AiiDA data type

def parse_array(self, dirpath: Path) -> dict[str, orm.Data]:
    arr = np.loadtxt(dirpath / self.node.inputs.outputs[0])
    data = orm.ArrayData(arr)
    return {"eigvals": data}

# 2. Run matrix diagonalization with the parser attached

eigvals_file = f'matrix-{matrix_id}-eigvals.txt'
matrix_file_link_label = ShellParser.format_link_label(matrix_file)

diag_results, diag_node = launch_shell_job(
    'diag',
    arguments='{matrix_file}',
    nodes={
        'matrix_file': query_results[matrix_file_link_label]
    },
    outputs=[eigvals_file],
    parser=parse_array,  # Parser attached here
)

In [None]:
print(diag_results['eigvals'])
print(diag_results['eigvals'].get_array())
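Because a custom parser is an ordinary Python function, its file-reading logic can be tried out on a temporary directory before it is attached to a job. The sketch below does this with the standard library only. It assumes the eigenvalue file holds whitespace-separated real numbers, and it returns a plain list instead of an `orm.ArrayData`. `parse_values`, `fake_self` and the sample file contents are all made up for this illustration.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def parse_values(self, dirpath: Path) -> dict[str, list[float]]:
    """Read whitespace-separated numbers from the first declared output file."""
    filename = self.node.inputs.outputs[0]
    text = (dirpath / filename).read_text()
    return {'eigvals': [float(token) for token in text.split()]}


# Stand-in for the parser instance that is passed in as `self`.
fake_self = SimpleNamespace(
    node=SimpleNamespace(inputs=SimpleNamespace(outputs=['matrix-0-eigvals.txt']))
)

with tempfile.TemporaryDirectory() as tmp:
    dirpath = Path(tmp)
    # Made-up sample data, not real output of the `diag` executable.
    (dirpath / 'matrix-0-eigvals.txt').write_text('1.5\n-0.25\n3.0\n')
    parsed = parse_values(fake_self, dirpath)

print(parsed)  # {'eigvals': [1.5, -0.25, 3.0]}
```

Once the logic behaves as expected, the function body can be switched back to `np.loadtxt` and `orm.ArrayData`, as in `parse_array` above.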

We've now seen how we can parse and access the results of our `ShellJob`. This will become important in the next
notebook where we'll start creating more complex workflows. So let's go :fire: