# GeoDN Course 2: Fundamentals of Geospatial Data and Modeling - Part 2 Geospatial Foundation Models and Workflows #
> Copyright (c) 2024 International Business Machines Corporation

> This software is released under the MIT License.
> https://opensource.org/licenses/MIT

# Section 2 - Workflow development/testing and sharing
This notebook will provide a walkthrough of using the GeoDN SDK to test and deploy a GeoDN modeling workflow.

We will begin from the example workflows (in the `example_workflows` folder).  In GeoDN Modeling, a workflow is defined by a Python function which takes a user payload (as a dict) and dynamically generates both the workflow graph specification and the required input config files, based on the user options.

You can take a look at the different examples; the documentation describes the anatomy of a workflow definition in detail.
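
To make the idea concrete, here is a minimal sketch of a function that turns a payload into a graph and a set of input files. Everything in it is hypothetical: the function name `example_workflow`, the payload keys (`region`, `variable`, `make_plot`), the step names and the dictionary layout are assumptions chosen for illustration, and the real GeoDN workflow format is the one shown in the `example_workflows` folder.

```python
def example_workflow(payload: dict) -> tuple[dict, dict]:
    """Toy workflow definition: payload in, (graph, input files) out.

    Hypothetical structure for illustration only; not the GeoDN schema.
    """
    region = payload["region"]
    variable = payload.get("variable", "pm10")

    # The graph maps each step name to the steps it depends on.
    graph = {
        "fetch_data": [],
        "run_model": ["fetch_data"],
    }
    # Optional steps are added only when the user asks for them.
    if payload.get("make_plot", False):
        graph["plot_forecast"] = ["run_model"]

    # Config files are rendered from the user options.
    input_files = {
        "model_config.txt": f"region={region}\nvariable={variable}\n",
    }
    return graph, input_files


graph, input_files = example_workflow({"region": "warrington", "make_plot": True})
print(sorted(graph))
print(input_files["model_config.txt"])
```

The point of the sketch is that the same function can produce different graphs for different payloads, which is why testing with a concrete payload is the first development step.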

The development/testing steps we will follow are to:
1) Try passing a payload into a workflow function and check the generated workflow
2) Send that payload to GeoDN Modeling to run
3) Monitor the workflow status and check the logs 
4) Download and check the files generated by the workflow
5) Submit workflow to the catalogue

If the developer is happy with the workflow and would like to share it with others, the final step is simply to push it to the GeoDN Modeling workflow catalogue.


### Prepare
Load the `geodn.modeling` module. We also need the Python packages `os` and `json`, and `Image` from `PIL`, so let's import those too.


In [None]:
from geodn.modeling import workflow

import os
import json
from PIL import Image

### Authenticate to GeoDN

To authenticate to the GeoDN services, add your username and password to the file `~/geodn-creds`, which can be found in the home folder. The format should be `your-username:your-password`. 

Once updated, we can use the `get_token()` function to get an authentication token, which will later be used to access the GeoDN services. The `get_token()` function takes three parameters, `username`, `password` and `geodn_modeling_url`, which are your username, your password and the URL of the GeoDN backend service to connect to, respectively.

These tokens expire after 24 hours. To also return a refresh token, pass `refresh=True` to `get_token()`.

In [None]:
with open("../.." + '/geodn-creds', 'r') as file:
    data = file.read().rstrip()
    # Split on the first colon only, so that passwords containing ':' are kept intact
    username, password = data.split(':', 1)

assert username and password

# Get the tokens
id_token, access_token = workflow.get_token(username, password, geodn_modeling_url=os.environ["GEODN_URL"])
assert id_token and access_token
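
If you want clearer error messages when the credentials file is malformed, the parsing step can be wrapped in a small helper. This is only a sketch: `parse_credentials` is a hypothetical name and is not part of the GeoDN SDK. It assumes the `your-username:your-password` format described above.

```python
def parse_credentials(line: str) -> tuple[str, str]:
    """Split a 'username:password' line into its two parts."""
    # Split on the first colon only, so passwords may contain ':'.
    username, sep, password = line.strip().partition(":")
    if not sep or not username or not password:
        raise ValueError("expected credentials in the form username:password")
    return username, password


username_demo, password_demo = parse_credentials("alice:s3cr:et\n")
print(username_demo)
```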

### Connect to GeoDN modeling

And finally, we can connect to the GeoDN Modeling service. Here you will pass the token and create the connection to the GeoDN Modeling APIs. This will allow you to submit models to the cluster, check status, access logs and download files.

To determine which backend service to connect to, we use the arguments `api_url`, `core_url` and `workflow_url`. Their values are read from environment variables here, but can be changed if you later need to connect to a different backend service.

In [None]:
geodn_modelling = workflow.GeoDN_Modeling(
    bearer_token=id_token,
    api_url=os.environ["GEODN_URL"],
    core_url=os.environ["GEODN_CORE_URL"],
    workflow_url=os.environ["GEODN_WORKFLOW_URL"],
)

Sometimes this function returns an error. If so, it may be that the `id_token` has expired and needs to be regenerated. Re-run the `get_token` function to generate a new token.
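
The "get a new token and try again" pattern can be automated. The sketch below is generic and does not call the GeoDN SDK: `call_with_token_refresh`, `fetch_token` and `action` are hypothetical names. In practice `fetch_token` would wrap your `get_token` call and `action` would create the connection, but which exception an expired token raises is not documented here, so the sketch catches any exception.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_token_refresh(
    fetch_token: Callable[[], str],
    action: Callable[[str], T],
    max_attempts: int = 2,
) -> T:
    """Run action(token); on failure fetch a new token and try again."""
    last_error = None
    for _ in range(max_attempts):
        # Each attempt starts by requesting a token.
        token = fetch_token()
        try:
            return action(token)
        except Exception as error:  # a real client would catch a narrower error
            last_error = error
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error
```

Keeping `max_attempts` small avoids hammering the authentication service when the failure has a different cause, such as a wrong URL.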

------------------------------------------------
# 1. Pass a payload into a workflow function and check the generated workflow


### Test compile a workflow definition from file

In order to generate a workflow to run in GeoDN Modeling, a user specifies their choices of parameters, datasets, etc. and submits them as a JSON/dict payload.

The first step is to pass a payload into the workflow function to see if it is interpreted correctly.  We want to see that the payload is successfully passed and that the function is able to generate a workflow graph and any input files (if required).

To do this we will choose a workflow and its corresponding payload (to start with use the examples, but feel free to start experimenting), and use the `parse_and_print` function to combine the two.  If the function generates a graph, it will then be compiled into the YAML which will eventually get sent to the orchestrator.  In general, you don't need to dig into this unless you are doing something very advanced, or there are bugs in a workflow which are otherwise hard to decipher.

`parse_and_print` takes a workflow (either from file, or function name) and a payload (from file or dict) and returns the compiled workflow yaml (`wfy` below), the workflow graph (`dag`) and input files.


In [None]:
# Choose the workflow (e.g. slope_calc, ifm_standard)
wf = 'air_pollution'


wfy, dag, input_files = geodn_modelling.parse_and_print(
    workflow_file='./example_workflows/workflow_' + wf + '.py', 
    payload_file='./example_workflows/payload_' + wf + '.json'
    )

The cell above should return a successful compilation and we can now visualise the workflow graph.  This will give you a first indication if you have constructed the graph you intended to.  In future, more detailed and interactive visualisations will be available.

In [None]:
# Optional step to view the DAG described by the workflow template and payload together
G = workflow.draw_workflow(dag)
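
Besides looking at the drawing, it can help to check a graph programmatically. The sketch below uses the standard library `graphlib` module to compute a valid execution order and to detect cycles. It assumes the graph is available as a plain dict mapping each step to the steps it depends on; the structure of the `dag` object returned by `parse_and_print` is not described here, so you may need to convert it first. `execution_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Return one valid run order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as error:
        # The second element of args holds the nodes forming the cycle.
        raise ValueError(f"workflow graph contains a cycle: {error.args[1]}") from error


steps = {"fetch": [], "model": ["fetch"], "plot": ["model"]}
print(execution_order(steps))
```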

------------------------------------------------

# 2. Send payload to GeoDN Modeling to run

### Submit workflow to GeoDN Modeling cluster to run

Once you have tested that your workflow function successfully ingests the user payload and generates the correct graph, it is time to run the workflow on a GeoDN Modeling cluster and check the outputs.

To do this we will use the `parse_and_submit` function.  This is very similar to `parse_and_print` but will send it to the cluster (specified by the url at the top of the notebook).  The only additional argument you need to provide is a name to give your workflow instance.

If the submission is successful, you should get a response which includes the `model_run_id` which is the main identifier linked to that run of the workflow.  You will use that later to check status, access logs and generated files.

In [None]:
response = geodn_modelling.parse_and_submit(name=wf, 
                      workflow_file='./example_workflows/workflow_' + wf + '.py', 
                      payload_file='./example_workflows/payload_' + wf + '.json',
                      )

model_run_id = response['data']['model_run_id']

In [None]:
print(model_run_id)
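
If the submission fails, the response may not contain the expected keys and the lookup above will raise a bare `KeyError`. A small helper can make that failure easier to read. This is a sketch only: `extract_model_run_id` is a hypothetical name, and the only thing it assumes about the response is the `response['data']['model_run_id']` layout used above.

```python
def extract_model_run_id(response: dict) -> str:
    """Return the model_run_id, or raise an error that shows the response."""
    try:
        return response["data"]["model_run_id"]
    except (KeyError, TypeError) as error:
        raise RuntimeError(
            f"submission did not return a model_run_id: {response!r}"
        ) from error


print(extract_model_run_id({"data": {"model_run_id": "example-run-id"}}))
```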

------------------------------------------------
#  3. Monitor the workflow and check outputs


Once you have submitted the workflow to the cluster to run, you can monitor its progress by:

### Checking the workflow status
The `workflowStatus` function will (for a given `model_run_id`) show the status of the workflow run.  At present, this is the status of any steps which are running or have run.  You will see `Pending`, `Running`, `Succeeded` or `Failed`.  In the case of a failed step it will show an error message (this will often refer you to the logs, see below).  The JSON returned (`opr` here) contains more detailed information, such as when each step ran.

In [None]:
model_run_id

In [None]:
opr = geodn_modelling.workflowStatus(model_run_id)
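
Rather than re-running the status cell by hand, you can poll until the run reaches a terminal state. The sketch below is deliberately independent of the SDK: `wait_for_completion` is a hypothetical helper, and `get_status` is assumed to be a function you write that calls `workflowStatus` and reduces its JSON to one of the four status strings listed above (the layout of that JSON is not covered here).

```python
import time
from typing import Callable

TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_completion(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demonstration with canned statuses and no real waiting.
fake_statuses = iter(["Pending", "Running", "Succeeded"])
print(wait_for_completion(lambda: next(fake_statuses), sleep=lambda s: None))
```

The `sleep` and `clock` parameters exist only so the loop can be tested without waiting; in normal use the defaults are fine.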

### Pulling the log files:

Log files for all steps in GeoDN Modeling workflows are archived in the same S3-compatible bucket where the data files reside.  This is an asynchronous process, so there is a delay before logs are available through the SDK (currently this can be up to 10 minutes).

The GeoDN SDK `grab_logs` function will find the archived logs related to a given `model_run_id`, load them and clean them.  The `print_logs` function will print the extracted logs.

In [None]:
logs = geodn_modelling.grab_logs(model_run_id)

In [None]:
pod_logs = workflow.print_logs(logs, level='content')
print(json.dumps(pod_logs, indent=1))
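
Because the logs can take several minutes to appear, one option is to retry with increasing waits instead of re-running the cell manually. The sketch below is an assumption-laden illustration: `backoff_delays` and `fetch_with_backoff` are hypothetical helpers, and it assumes that the fetch function returns an empty result while the logs are not yet archived, which has not been verified against `grab_logs`.

```python
import time
from typing import Callable, Iterable, Iterator


def backoff_delays(
    initial_s: float = 15.0, factor: float = 2.0, max_total_s: float = 600.0
) -> Iterator[float]:
    """Yield increasing wait times whose sum never exceeds max_total_s."""
    total, delay = 0.0, initial_s
    while total + delay <= max_total_s:
        yield delay
        total += delay
        delay *= factor


def fetch_with_backoff(
    fetch: Callable[[], list],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call fetch() until it returns a non-empty result or delays run out."""
    result = fetch()
    for delay in delays:
        if result:
            break
        sleep(delay)
        result = fetch()
    return result


print(list(backoff_delays()))
```

With the defaults the total wait stays within the 10 minute window mentioned above.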

------------------------------------------------
# 4. Download and check output files


In a GeoDN Modeling workflow, all the data files from a particular run of a workflow are stored in an S3-compatible bucket. This includes the input files, the final outputs and all intermediate files.  To check that the workflow has run correctly, we can check that the correct files have been generated, download them and take a look.

You can access a list of non-log files from the workflow with `geodn_modelling.listWorkflowFiles(model_run_id)` and download the ones you want to look at with `geodn_modelling.downloadFiles()`.

Alternatively, you can use the interactive downloader, where you select multiple files, specify the download path in the box at the bottom and click the download button. Once the workflow has completed and the `workflowStatus` method returns `Succeeded`, all the files will be available to download. This includes `'main_warrington_pm10_daily_forecast_plot.png'`, which we'll use in the next section.

In [None]:
geodn_modelling.listWorkflowFiles(model_run_id)
# workflow.fileDownloader(geodn_modelling, model_run_id)

In [None]:
for ofile in geodn_modelling.listWorkflowFiles(model_run_id):
    geodn_modelling.downloadFiles(ofile)
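
The loop above downloads every file, including intermediates. To fetch only some of them, you can filter the list first. The sketch below assumes that the file list is a list of file name strings, which is how the loop above treats it but is not otherwise confirmed here. `select_files` is a hypothetical helper, and the example file names other than the forecast plot are made up.

```python
from pathlib import PurePosixPath


def select_files(names: list[str], suffixes: tuple[str, ...] = (".png",)) -> list[str]:
    """Keep only the names whose extension is in suffixes (case-insensitive)."""
    wanted = {suffix.lower() for suffix in suffixes}
    return [name for name in names if PurePosixPath(name).suffix.lower() in wanted]


example_names = [
    "main_warrington_pm10_daily_forecast_plot.png",
    "intermediate/grid.nc",  # hypothetical file name
    "inputs/config.json",  # hypothetical file name
]
print(select_files(example_names))
```

The filtered list can then be passed, one name at a time, to the same download loop.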

### Plot output

In [None]:
img = Image.open('main_warrington_pm10_daily_forecast_plot.png').convert("RGB")
display(img)

------------------------------------------------
# 5. Submit workflow to the catalogue


Once you have checked that your workflow runs correctly and produces the outputs you want, you can push it to the workflow catalogue to share it with others.  Once a workflow is stored in the catalogue, a user can run the model by providing only the payload.


To upload the workflow, use the `upload_workflow` function, which takes a name for the workflow (the key in the catalogue), the workflow and payload (in the same way as the previous functions) and optional tags, which can help to categorise or search for workflows.  Finally, there is a flag to actually push it to the catalogue.  If the `upload` flag is not set to `True`, the function will grab the definitions, encode them and prepare the upload payload, but will not send it.  This is a useful check to do before uploading.

Set `upload` to `True` to upload your workflow.

In [None]:
upload = False

geodn_modelling.upload_workflow(
    'explain_air_pollution_v1', 
    workflow_file='example_workflows/workflow_' + wf + '.py', 
    payload_file='example_workflows/payload_' + wf + '.json', 
    tags=['explain'], 
    upload=upload)
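
The `upload` flag follows a common "dry run" pattern: build the full request, and send it only when explicitly asked. The sketch below illustrates that pattern in isolation and is not the SDK's implementation. `prepare_upload`, the `send` callable and the field names in `body` are all hypothetical, and the choice of base64 is an assumption, since the text above says only that the definitions are encoded.

```python
import base64
import json
from typing import Callable, Optional, Sequence


def prepare_upload(
    name: str,
    workflow_source: str,
    payload: dict,
    tags: Sequence[str] = (),
    upload: bool = False,
    send: Optional[Callable[[dict], None]] = None,
) -> dict:
    """Build the upload body; call send(body) only when upload is True."""
    body = {
        "name": name,
        # base64 is an assumed encoding, used here only for illustration.
        "workflow": base64.b64encode(workflow_source.encode("utf-8")).decode("ascii"),
        "payload": json.dumps(payload, sort_keys=True),
        "tags": list(tags),
    }
    if upload:
        if send is None:
            raise ValueError("upload=True requires a send callable")
        send(body)
    return body


dry_run = prepare_upload("demo_workflow", "def wf(p): ...", {"region": "x"}, tags=["explain"])
print(sorted(dry_run))
```

Returning the prepared body in both cases lets you inspect exactly what would be sent before committing to the upload.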

After uploading, you can check the workflow catalogue for the newly uploaded definition.  To do this you can either call `geodn_modelling.available_workflows()` and explore the dataframe that's returned, or use the searchable UI with `geodn_modelling.available_workflows_ui()`.

In [None]:
geodn_modelling.available_workflows_ui()

## Delete workflows
To remove a workflow from the catalogue once it has been uploaded, use the `delete_workflow()` function, which sends a request to the GeoDN API to delete the specified workflow. Identify the workflow key and version number of the workflow you wish to delete using the `workflow_name` and `version` parameters respectively. You will only be able to remove workflows that your user has previously uploaded.

In [None]:
geodn_modelling.delete_workflow(workflow_name="explain_air_pollution_v1", version="20231101-5")