__Description & purpose__: TODO 

__Author(s)__: Alastair Graham, Dusan Figala

__Date created__: 2024-11-08

__Date last modified__: 2024-11-08

__Licence__: This file is licensed under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/).  Any included code is released using the [BSD-2-Clause](https://www.tldrlegal.com/license/bsd-2-clause-license-freebsd) license.


<span style="font-size:0.75em;">
Copyright (c) , All rights reserved.</span>

<span style="font-size:0.75em;">
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:</span>

<span style="font-size:0.75em;">
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.</span>

# What is a Workflow

One of the provisioned services made available to users of the EODH is the Workflow Runner (WR). More information about the WR is provided later on this page, but the fundamental question that requires answering is...

<p style="text-align:center;"> In the context of the EODH, what is a workflow? </p>

Using the terminology of the EODH, a workflow is a file written in CWL (the Common Workflow Language) that creates a data processing chain. However, this CWL file only provides the orchestration of a wider set of tools that are brought together as an entity known as an Earth Observation Application Package (EOAP).

# What is the Workflow Runner?

The WR is a required component of the Hub designed to interpret CWL files and ensure that the algorithms wrapped within them are sensibly scaled across the available infrastructure. In the case of the EODH, the infrastructure is a Kubernetes cluster running on an AWS backend, but this is not a hard-set requirement.

The WR itself is a piece of software called ADES. If you want to understand the internal components of the WR then the ADES Design Document is available here: https://eoepca.github.io/proc-ades/master/.

The benefit is that EOAPs can be transferred from platform to platform, e.g. they could be developed on EODH and run on EOEPCA, or the other way round. Platforms run by DLR and NASA also utilise EOAPs, and processing algorithms can be shared between them.

A good introduction to CWL is available here: https://carpentries-incubator.github.io/cwl-novice-tutorial/aio/index.html



ADES is responsible for executing the processing service through both an OGC WPS 1.0 & 2.0 OWS service and an OGC Processes REST API. Processing requests are executed within the target Exploitation Platform (i.e. the one that is close to the data).

The ADES software uses ZOO-Project as the main framework for exposing the OGC compliant web services. The ZOO-kernel powering the web services is included in the software package.

ADES is designed to perform processing and chaining on a Kubernetes cluster using the Calrissian tool. Calrissian is a CWL runner for Kubernetes that executes each step in a workflow as a container. It provides simple, flexible mechanisms for specifying constraints between the steps in a workflow and artifact management for linking the output of any step as an input to subsequent steps.

# What is CWL?

### Execution sequence
The generic execution sequence of a CWL process (including Workflows and CommandLineTools) is as follows.

* Load input object.
* Load, process and validate a CWL document. The $namespaces present in the CWL document are also used when validating and processing the input object.
* If there are multiple process objects (due to $graph) then choose the process with the id of "#main" or "main".
* Validate the input object against the inputs schema for the process.
* Perform any further setup required by the specific process type.
* Execute the process.
* Capture results of process execution into the output object.
* Validate the output object against the outputs schema for the process.
* Report the output object to the process caller.
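
Part of the sequence above can be illustrated with a minimal Python sketch. It is not how `cwltool` or any other runner is implemented. The function names `select_process` and `missing_inputs`, the sample document and the simplified rule for optional inputs are all assumptions made for illustration, and only the `$graph` selection step and a basic required-input check are modelled.

```python
def select_process(document):
    """Choose the process object to run from a loaded CWL document (a dict)."""
    graph = document.get("$graph")
    if graph is None:
        # Only one process object is present, so there is nothing to choose.
        return document
    for process in graph:
        if process.get("id") in ("#main", "main"):
            return process
    raise ValueError("no process with id '#main' or 'main' in $graph")


def missing_inputs(process, input_object):
    """List required inputs that the input object does not supply.

    Assumption: inputs are given in map form, and an input is optional
    when it has a default or its type ends with '?'. A real runner
    validates the full schema instead.
    """
    missing = []
    for name, spec in process.get("inputs", {}).items():
        if isinstance(spec, str):
            spec = {"type": spec}  # shorthand form, e.g. bbox: string
        type_name = str(spec.get("type", ""))
        if "default" in spec or type_name.endswith("?"):
            continue
        if name not in input_object:
            missing.append(name)
    return missing


# Hypothetical packed document with two process objects.
document = {
    "cwlVersion": "v1.0",
    "$graph": [
        {"class": "CommandLineTool", "id": "clip", "inputs": {}},
        {
            "class": "Workflow",
            "id": "main",
            "inputs": {
                "bbox": {"type": "string"},
                "epsg": {"type": "string", "default": "EPSG:4326"},
            },
        },
    ],
}

process = select_process(document)
print(process["class"])             # Workflow
print(missing_inputs(process, {}))  # ['bbox']
```

In practice these checks are done for you by the CWL runner; the sketch only shows why an input object that omits a required input is rejected before anything is executed.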


## Scripting workflows
### Why CWL?
While shell scripts or other code scripts (e.g. Python) can meet the need of data processing workflows, using a formal workflow language (such as CWL) brings additional benefits such as abstraction and improved scalability and portability. 

Computational workflows explicitly create a divide between a user’s dataflow and the computational details which underpin the chain of tools. 

The workflow describes the dataflow, while separate tool descriptors specify how each tool is run, which keeps implementation detail out of the workflow itself.

Workflow managers such as `cwltool` help with the automation, monitoring and tracking of a dataflow. By producing computational workflows in a standardised format, and publishing them (alongside any data) with open access, the workflows become more FAIR (Findable, Accessible, Interoperable, and Reusable). The Common Workflow Language (CWL) standard has been developed to standardise workflow needs across different thematic areas.




cwltool is a great way to test workflows locally: https://cwltool.readthedocs.io/en/latest/cli.html

Good tutorial here: https://andrewjesaitis.com/posts/2017-02-06-cwl-tutorial/

Check your workflow: https://view.commonwl.org/

Examples - link to Lot 1 & 2
Other tools - link to Lot 2 tool


# How do I create a workflow?

The guidance around EOAPs is that handcrafting a package is possible but not recommended. These workflows can become very complex very quickly and it is recommended that some form of tooling is used to automate or semi-automate the generation of an EOAP. As part of the Pathfinder phase of the project, the EODH (via a 3rd party contractor, Oxidian) has created eoap-gen to help specialist technicians to create service workflows. 

# A simple example to introduce CWL
## Context
The first thing to do when designing a workflow is understand the context of what is desired, and how that may need to be referred to in the workflow. For this example workflow, we will take a list of Sentinel-2 ARD images, clip them to an area of interest, and stack them. The flow will look like the following:

```mermaid
graph LR;
  get_data -- S2_ARD --> clip -- clipped --> stack -- stacked--> Output ;

```

The next thing to do is access the data to be used in the Workflow. In this case we will download two bands of a Sentinel-2 image held on AWS. We will use the [curl](https://curl.se/) tool to do this, saving each downloaded image as `B0$.tif` (where `$` is the band number):

``` 
curl -o B04.tif https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/53/H/PA/2021/7/S2B_53HPA_20210723_0_L2A/B04.tif

curl -o B03.tif https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/53/H/PA/2021/7/S2B_53HPA_20210723_0_L2A/B03.tif 
```

The commands that we will use in the workflow are all available through [gdal](https://gdal.org/index.html).  

### Clip the image
We will use `gdal_translate` to clip the larger image to a smaller, more manageable dataset. The gdal command that we can test and that we will need to replicate in CWL is:

`gdal_translate -projwin ULX ULY LRX LRY -projwin_srs EPSG:4326 B04.tif B04_clipped.tif`

where UL and LR refer to the upper left and lower right corners, and X and Y to the coordinates of each corner.

### Stack the clips
Similarly, we will use `gdal_merge.py` to construct the stacked image from the clipped images. This can be tested using the following command:

`gdal_merge.py -separate B04_clipped.tif B03_clipped.tif -o stacked.tif` 

## Building the Workflow
### Required files
There are three main files that are required to construct a CWL Workflow. These are:

* Dockerfile or existing online container
* CWL file
* YAML file

It may be that other files, e.g. a `.sh` script or a Python script, are also needed, depending on how bespoke and/or complex the desired workflow is.

### cwltool
To run CWL workflows you will need a CWL runner. The most commonly used locally is `cwltool`, the reference implementation maintained by the CWL community. It supports everything in the current CWL specification. `cwltool` can be installed using `pip` or variants of `conda`. More information can be found [here](https://www.commonwl.org/user_guide/introduction/prerequisites.html) and [here](https://cwl-for-eo.github.io/guide/requirements/).

### Containers
For the purposes of this example, we will be pulling the GDAL container from the OSGeo repository (see [here](https://github.com/OSGeo/gdal/tree/master/docker)).

**NOTE**: There are a number of different images that can be accessed. To use the `.py` tools available through GDAL, an image that includes 'GDAL Python' is required.

If we wanted to, we could also build our own bespoke image using a Dockerfile and then run that. This is often done when data processing scripts need to be copied into the container.

We will also be using [Podman](https://podman.io/) as our container software. Podman is a drop-in replacement for Docker but does require the `--podman` argument in the `cwltool` command. Docker is the default container engine for `cwltool`, so if you are using Windows, or are more familiar with Docker, you can use it without any extra argument.



### CWL files
For this example we require a CWL CommandLineTool file for each of the clipping and stacking components of the workflow. We will also need a CWL Workflow file to bring these together and run the entire process. The next block of code outlines the overall Workflow file.

**Note**: This example is based on the example found [here](https://cwl-for-eo.github.io/guide/101/cwl-101/tutorial-2/workflow/). Some errors were found in the original CWL files and the version presented here has been tested and is known to work on a local Linux (Debian based) system.

```
class: Workflow
label: Sentinel-2 clipping and stacking
doc:  This workflow creates a stacked composite. File name: composite.cwl
id: main

requirements:
- class: ScatterFeatureRequirement

inputs:

  geotiff:
    doc: list of geotifs
    type: File[]

  bbox: 
    doc: area of interest as a bounding box
    type: string

  epsg:
    doc: EPSG code 
    type: string
    default: "EPSG:4326"

outputs:
  rgb:
    outputSource:
    - node_concatenate/composite
    type: File

steps:

  node_translate:

    run: gdal-translate.cwl

    in:

      geotiff: geotiff  
      bbox: bbox
      epsg: epsg

    out:
    - clipped_tif

    scatter: geotiff
    scatterMethod: dotproduct

  node_concatenate:

    run: concatenate2.cwl

    in: 
      tifs:
        source: node_translate/clipped_tif

    out:
    - composite


cwlVersion: v1.0

```
From this example, we can see that we require two CommandLineTool CWL files: `gdal-translate.cwl` and `concatenate2.cwl`. Let's deal with these in order.

```
class: CommandLineTool

cwlVersion: v1.0
doc:  This runs GDAL Translate to clip an image to bbox corner coordinates.

requirements: 
  InlineJavascriptRequirement: {}
  DockerRequirement: 
    dockerPull: ghcr.io/osgeo/gdal:ubuntu-small-latest

baseCommand: gdal_translate

arguments:
- -projwin 
- valueFrom: ${ return inputs.bbox.split(",")[0]; }
- valueFrom: ${ return inputs.bbox.split(",")[3]; }
- valueFrom: ${ return inputs.bbox.split(",")[2]; }
- valueFrom: ${ return inputs.bbox.split(",")[1]; }
- valueFrom: ${ return inputs.geotiff.basename.replace(".tif", "") + "_clipped.tif"; }
  position: 8

inputs:
  geotiff: 
    type: File
    inputBinding:
      position: 7
  bbox: 
    type: string
  epsg:
    type: string
    default: "EPSG:4326" 
    inputBinding:
      position: 6
      prefix: -projwin_srs
      separate: true

outputs:
  clipped_tif:
    outputBinding:
      glob: '*_clipped.tif'
    type: File


```
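
The JavaScript expressions in `arguments` reorder the comma-separated `bbox` string into the corner order that `-projwin` expects. The Python sketch below mirrors that logic outside CWL so the resulting command line can be inspected. It assumes, based on the index order used above, that `bbox` is written as `xmin,ymin,xmax,ymax`; the function names are hypothetical, and the sample values are those used in the parameters file for this example.

```python
import os


def projwin_from_bbox(bbox):
    """Reorder 'xmin,ymin,xmax,ymax' into [ULX, ULY, LRX, LRY]."""
    parts = [part.strip() for part in bbox.split(",")]
    if len(parts) != 4:
        raise ValueError("bbox must have four comma-separated values")
    xmin, ymin, xmax, ymax = parts
    # Upper left is (xmin, ymax); lower right is (xmax, ymin).
    return [xmin, ymax, xmax, ymin]


def gdal_translate_command(geotiff, bbox, epsg="EPSG:4326"):
    """Build the argument list that the CommandLineTool is expected to produce."""
    name = os.path.basename(geotiff)
    clipped = name.replace(".tif", "", 1) + "_clipped.tif"
    return (
        ["gdal_translate", "-projwin"]
        + projwin_from_bbox(bbox)
        + ["-projwin_srs", epsg, geotiff, clipped]
    )


command = gdal_translate_command("../B04.tif", "136.659,-35.96,136.923,-35.791")
print(" ".join(command))
```

Comparing the printed command with the one that `cwltool` reports at runtime is a quick way to confirm that the `position` values and the `bbox` ordering are doing what you intended.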

```
class: CommandLineTool

cwlVersion: v1.0
doc: This runs GDAL Merge to stack images together.

requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement: 
    dockerPull: ghcr.io/osgeo/gdal:ubuntu-small-latest

baseCommand: gdal_merge.py

arguments: 
- -separate 
- valueFrom: ${ return inputs.tifs; }
- -o
- composite.tif
# gdal_merge.py -separate 1.tif 2.tif 3.tif -o rgb.tif

inputs:

  tifs:
    type: File[]

outputs:

  composite:
    outputBinding:
      glob: '*.tif'
    type: File

```
**NOTE**: YAML does not permit tab characters for indentation, so use spaces to indent.



## Running the Workflow
Now that we have our CommandLineTool CWL component files, and the Workflow CWL file that brings the tools together, we need to specify the input parameters. This is done using a YAML parameters file. The file can be given any name; here we call it `composite-params.yml`. The contents should follow this layout:

```
bbox: "136.659,-35.96,136.923,-35.791"
geotiff: 
- { "class": "File", "path": "../B04.tif" }
- { "class": "File", "path": "../B03.tif" }
epsg: "EPSG:4326"
```
You will need to change the `path` parameter to match the location of your input files. 
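
If you have many input files, writing the parameters file by hand becomes tedious. The sketch below generates the same structure with Python's standard library. It writes JSON, which YAML parsers generally accept because JSON is valid YAML, and which `cwltool` accepts as an input object. The function names `build_params` and `write_params` are hypothetical helpers, not part of any CWL tooling.

```python
import json


def build_params(tif_paths, bbox, epsg="EPSG:4326"):
    """Build the input object expected by the workflow in this example."""
    return {
        "bbox": bbox,
        "geotiff": [{"class": "File", "path": path} for path in tif_paths],
        "epsg": epsg,
    }


def write_params(params, filename):
    """Write the input object to disk as JSON (also readable as YAML)."""
    with open(filename, "w", encoding="utf-8") as handle:
        json.dump(params, handle, indent=2)
        handle.write("\n")


params = build_params(
    ["../B04.tif", "../B03.tif"],
    "136.659,-35.96,136.923,-35.791",
)
print(json.dumps(params, indent=2))
```

Calling `write_params(params, "composite-params.yml")` would then create a file that can be passed to the runner in place of the hand-written one.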

Now we run it with the command:

`cwltool --podman composite.cwl composite-params.yml`

**Note**: remember that if you are using Docker then you do not need the `--podman` argument.
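
If you want to launch the workflow from a script rather than typing the command, one option is a thin wrapper around the same command line. This is a sketch only: `cwltool_command` and `run_workflow` are hypothetical names, and the wrapper assumes `cwltool` is installed and available on your `PATH`.

```python
import shutil
import subprocess


def cwltool_command(workflow, params, use_podman=False):
    """Assemble the cwltool command line as a list of arguments."""
    command = ["cwltool"]
    if use_podman:
        # Only needed for Podman; Docker is the default container engine.
        command.append("--podman")
    command += [workflow, params]
    return command


def run_workflow(workflow, params, use_podman=False):
    """Run the workflow and raise an error if cwltool exits with a failure."""
    if shutil.which("cwltool") is None:
        raise RuntimeError("cwltool was not found on PATH")
    return subprocess.run(
        cwltool_command(workflow, params, use_podman), check=True
    )


print(" ".join(cwltool_command("composite.cwl", "composite-params.yml", use_podman=True)))
```

Calling `run_workflow("composite.cwl", "composite-params.yml", use_podman=True)` would execute the same command shown above.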



## Outputs
This workflow takes a couple of minutes to run, during which time the executed commands and their runtime messages are displayed on the command line. Once the workflow completes, the output file will be found in the directory from which the workflow was run. Intermediate files that are not listed in the `outputs` block of the workflow are automatically deleted.

The output `.tif` file can now be opened in QGIS or a similar software application to check that the output is as expected (in this case a two-band image covering the clipped area of the input files).

## Tips
You can pass `--leave-tmpdir` to the `cwltool` command so that the temporary working directories are not deleted. This is often helpful for checking whether the outputs from a step are what you expect.