# Data Processing System (DPS) Tutorial A to Z

Authors: Sujen Shah and Rob Tapella

Date: October, 2023

Description: This DPS tutorial is intended to demonstrate the steps needed to create, register, run, monitor and view outputs of algorithm jobs run at scale. It includes a template repository with various files needed to set up and run a job. 

## Importing and Installing Packages

Additional packages are installed inline where they are needed. Configuring a custom conda environment for use in DPS is covered later in this tutorial.

## Before Starting

- This tutorial assumes that you have at least run through the [Getting Started Guide](../../getting_started/getting_started.ipynb) and have set up your MAAP account.
- This tutorial is made for the Application Development Environment (ADE) v. 3.1.0 or later (August 2023 or later).
- This tutorial also assumes that you are familiar with using [GitHub with MAAP](../../system_reference_guide/work_with_git.ipynb).

## Overview of this Tutorial

- Clone the demo algorithm & make a personal working copy
- Edit and test your Algorithm code to make sure that it is working in its original form
- Prepare the Algorithm for DPS by setting up the runtime arguments and pre-run environment set-up
- Register the Algorithm with the Algorithm UI
- Run and Monitor the Algorithm using the Jobs UI
- View the outputs and errors from your run

## Clone the Demo Algorithm

1. For this tutorial, please use a Basic Stable workspace. 
2. Download a copy of the GitHub repository at https://github.com/MAAP-Project/dps_tutorial
3. Create a new repository in your personal GitHub account and clone it to your workspace. Copy the files from `dps_tutorial` into it, then push them to your personal GitHub repository. This way you can modify the tutorial files while you're practicing and register your own version of the demo algorithm.

It does not matter what you call the copy of the tutorial files, but it does need to be a public repository in order to register the algorithm with DPS. For this tutorial we will use the `gdal_wrapper` algorithm folder inside the tutorial repository code.

Anatomy of the `gdal_wrapper` algorithm folder in the `dps_tutorial` repo:

- `README.md`: describes the algorithm in GitHub
- `build-env.sh`: a shell script that is executed before the algorithm is run; it is used to set up any custom programming libraries used in the algorithm (e.g., a custom conda environment)
- `environment.yaml`: a configuration file used by conda to add any custom libraries; this is used by `build-env.sh`
- `gdal_wrapper.py`: a python script that contains the logic of the algorithm
- `run_gdal.sh`: a shell script that DPS will execute when a run is requested. It calls any relevant python files with the required inputs

![DPS Tutorial Git repository overview](_static/dps_tutorial_git_repo.png)


## Edit and Test your Code

Once you have an algorithm such as the `gdal_wrapper`, or your own Jupyter Notebook, test it to make sure that it is running properly by following the instructions in the README.md file. If it runs properly in a Terminal window, you are one step closer to registering and running your algorithm in the DPS.

Typically a Jupyter Notebook is run interactively. A DPS algorithm will take all inputs up-front, do the processing, and produce output files. The `gdal_wrapper` is set up like a DPS algorithm. Some aspects to note:

- Argument parsing (`argparse`) for the inputs passed on the command line
- The log file
- An example input URL and an example run command on the CLI
- The expected outputs
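
The first note above refers to argument parsing. As a minimal sketch (not the actual contents of `gdal_wrapper.py`), a script that takes all of its inputs up-front might declare its arguments like this. The three argument names come from the example run command in this tutorial; the help strings and the choice to parse `--outsize` as an integer are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare every input up-front so the script can run non-interactively."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of a DPS-style command-line interface."
    )
    parser.add_argument("--input_file", required=True, help="Path to the input GeoTIFF")
    parser.add_argument("--output_file", required=True, help="Path for the output file")
    # Assumption: --outsize is parsed as an integer; the real script may differ.
    parser.add_argument("--outsize", type=int, required=True, help="Output size setting")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    [
        "--input_file", "input/Copernicus_DSM_COG_10_N51_00_E007_00_DEM.tif",
        "--output_file", "output/Cop-30.tif",
        "--outsize", "30",
    ]
)
print(args.input_file, args.output_file, args.outsize)
```

Because every argument is marked as required, a run with a missing input fails immediately with a usage message instead of partway through processing.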

The `README.md` file has usage instructions for the command line. You will need a test GeoTIFF file as input; one is included with the repo for ease of use. Make a new folder to use for your test runs, at the same level as your copy of the `dps_tutorial` repository. In that folder, make two new folders: `input` and `output`. Something like this:
```
~/algorithms/maap-docs-test/test_run# ls -F
input/  output/
```
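
If you prefer to set this layout up from Python, a small sketch using `pathlib` could look like the following. The helper name `make_test_dirs` is hypothetical; only the `input` and `output` folder names come from this tutorial.

```python
from pathlib import Path
from typing import Tuple


def make_test_dirs(base: Path) -> Tuple[Path, Path]:
    """Create base/input and base/output, leaving existing folders untouched."""
    input_dir = base / "input"
    output_dir = base / "output"
    for folder in (input_dir, output_dir):
        # parents=True creates the base folder too; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
    return input_dir, output_dir
```

For example, `make_test_dirs(Path("test_run"))` creates `test_run/input` and `test_run/output` relative to the current folder.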

To run the script we will want to do something like this:

```
# python ../dps_tutorial_rt/gdal_wrapper/gdal_wrapper.py --input_file input/Copernicus_DSM_COG_10_N51_00_E007_00_DEM.tif --output_file output/Cop-30.tif --outsize 30
```

The output of that run should look like this:
```
Installed GDAL Version: 3.6.1
b'Input file size is 2400, 3600\n0...10...20...30...40...50...60...70...80...90...100 - done.\n'
```
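
The `b'...'` wrapper on the second line is how Python prints a `bytes` object. This suggests, as an assumption not confirmed from the script itself, that the wrapper prints the raw output captured from a child process. The sketch below reproduces that effect with a stand-in child process rather than a real GDAL command, and shows how decoding gives more readable text.

```python
import subprocess
import sys

# Stand-in for a GDAL command: a child Python process that prints one line.
child = [sys.executable, "-c", "print('Input file size is 2400, 3600')"]

raw = subprocess.check_output(child)  # captured stdout is returned as bytes
print(raw)  # printed with a b'...' wrapper and an escaped newline

text = raw.decode("utf-8").strip()  # decode to str for readable output
print(text)
```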

And if you look in your output folder, you will see your output file:
```
# ls output/
Cop-30.tif
```


## Prepare the Algorithm for DPS

Once your scripts are working locally, make sure that they will also work in DPS.

The `gdal_wrapper` files are already prepared for DPS. Some important things to notice:

File: `build-env.sh`
- Takes the conda environment definition from `environment.yaml`
- Have it activate the environment with `source activate base` (or the name of your own environment)

File: `run_gdal.sh`
- `run_gdal.sh` is a bash script that calls the `gdal_wrapper.py` algorithm. For DPS, make sure that your inputs and outputs are in the right places: if you look at `run_gdal.sh` you will see that it reads all the files from the `input/` folder and writes to the `output/` folder.

Run your scripts as if DPS is executing them:
- deactivate your conda environment
- run `build-env.sh`
- run `run_gdal.sh`
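
The build-then-run order above can be sketched in Python as follows. This is an illustration only, not how DPS itself invokes the scripts. The helper name `run_in_order` and invoking the scripts with `bash` are assumptions, and `dps_tutorial_rt` is the name of this author's copy of the repository. Deactivating your conda environment still has to happen in your shell first.

```python
import subprocess
from typing import Callable, List, Sequence


def run_in_order(
    commands: Sequence[List[str]],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run each command in turn; stop and return the first non-zero exit code."""
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            return result.returncode
    return 0


# Build step first, then the run step, mirroring the order listed above.
commands = [
    ["bash", "dps_tutorial_rt/gdal_wrapper/build-env.sh"],
    ["bash", "dps_tutorial_rt/gdal_wrapper/run_gdal.sh"],
]
```

Calling `run_in_order(commands)` from the folder that contains your copy of the repository runs the build script and, only if it succeeds, the run script.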

NOTES:
What happens with input and output in DPS (e.g. output is saved off somewhere else vs. the temp space when the job is running; what about “cloud” input data?)
- How does file management happen?
- Relative paths vs. absolute for input/output
- Mimic what’s happening on DPS (basedir)
- The wrapper run script (`run_gdal.sh` in this tutorial) needs to manage the input files the way that your Python script requires them (e.g., passing a single file at a time vs. multiple files at once)
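
As one illustration of the last note, the sketch below builds one command per file found in an input folder, so that a script expecting a single `--input_file` is called once per input. The helper name `build_commands`, the `out_` naming scheme, and the script path passed in are hypothetical; the argument names follow the example run command earlier in this tutorial. It is written in Python for illustration, whereas the tutorial's wrapper is a bash script.

```python
from pathlib import Path
from typing import List


def build_commands(
    input_dir: Path, output_dir: Path, script: str, outsize: int
) -> List[List[str]]:
    """Return one argv list per input file, in sorted order for repeatable runs."""
    commands = []
    for input_file in sorted(p for p in input_dir.iterdir() if p.is_file()):
        # Hypothetical naming scheme: reuse the input name with an "out_" prefix.
        output_file = output_dir / f"out_{input_file.name}"
        commands.append(
            [
                "python", script,
                "--input_file", str(input_file),
                "--output_file", str(output_file),
                "--outsize", str(outsize),
            ]
        )
    return commands
```

Using relative `input/` and `output/` folders here mirrors the layout used for the local test runs, which keeps the same wrapper logic usable in both places.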



## Register the Algorithm with DPS using the Algorithm UI

1. Commit and push to GitHub any changes that you have made to the Algorithm while testing. The registration process will pull the code from GitHub as part of registration.
2. Open up Launcher: Register Algorithm
3. Fill in the fields as described below.

#### First you fill in the public code-repository information:
![Code repo information](_static/tutorial_register_1.png)

- The Repository URL is the .git URL. In the tutorial copy this author created, it is `https://github.com/rtapella/dps_tutorial_rt.git`
- Repository Branch is used as a version when this algorithm is registered. For your test it is likely `main`
- The Run and Build Commands must be the full path of the scripts that will be used by the DPS to build and execute the algorithm. Typically these start with the repository name, followed by the path to the script inside the repository. In this case we have run command: `dps_tutorial_rt/gdal_wrapper/run_gdal.sh` and build command: `dps_tutorial_rt/gdal_wrapper/build-env.sh`.
- Press the "Add General Information" button.
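
To double-check the values you type into the Run and Build Command fields, the sketch below derives them from the repository URL and the script locations inside the repository. The helper names `repo_name_from_url` and `command_path` are hypothetical; the URL and script paths are the ones used in this tutorial.

```python
import posixpath
from urllib.parse import urlparse


def repo_name_from_url(repo_url: str) -> str:
    """Return the repository name from a .git URL, without the .git suffix."""
    name = posixpath.basename(urlparse(repo_url).path)
    return name[:-4] if name.endswith(".git") else name


def command_path(repo_url: str, script_in_repo: str) -> str:
    """Join the repository name and a script's path inside the repository."""
    return posixpath.join(repo_name_from_url(repo_url), script_in_repo)


repo_url = "https://github.com/rtapella/dps_tutorial_rt.git"
run_command = command_path(repo_url, "gdal_wrapper/run_gdal.sh")
build_command = command_path(repo_url, "gdal_wrapper/build-env.sh")
print(run_command)    # dps_tutorial_rt/gdal_wrapper/run_gdal.sh
print(build_command)  # dps_tutorial_rt/gdal_wrapper/build-env.sh
```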

#### Then fill in the rest of the algorithm information:
![Algorithm information](_static/tutorial_register_2.png)

- The Algorithm Name will be the unique identifier for the algorithm in the MAAP system. It can be whatever you want. 
- Algorithm Description is additional free-form text to describe what this algorithm does.
- Disk Space is the minimum amount of space you expect to need, including all inputs, scratch, and outputs. It gives the DPS an approximation to help optimize the run.
- The Container URL is a URL of the Stack (workspace image environment) you are using as a base for the algorithm. In this example we use: `mas.maap-project.org/root/maap-workspaces/base_images/vanilla:v3.1.1`
See [the Getting Started guide](../../getting_started/running_at_scale.ipynb#Container-URLs) for more information on Containers.

#### Finally you fill in the input section:
![Algorithm-Inputs information](_static/tutorial_register_3.png)

- There are File Inputs and Positional Inputs (command-line parameters to adjust how the algorithm runs). In our example we have a File Input called `input_file` and two Positional Inputs: an output file called `output_file` and a parameter called `outsize` describing how much file-size reduction we want to get. For each input you can add a Description, a Default Value, and mark whether it’s required or optional.

4. Press Register and there will be a popup dialog with a link to view the progress of the build. You should copy the link and paste it into a new page; once you press Register it will be difficult to edit the form and fix any mistakes.
![Register confirmation popup information](_static/tutorial_register_4.png)


## Running and Monitoring the Algorithm with the Jobs UI

1. Launcher: View & Submit Jobs
2. Choose the Submit tab
3. Run the job as described here:
- a, b, c
4. Submit the job and go back to the View tab
5. You can observe the progress of your job while it runs, and the status (complete or fail) when it completes

### Running and Monitoring using the HySDS Jobs UI (Figaro)

This will be described in a future update. HySDS is the data-processing system used to run the jobs. It has a full web application that is used by NASA missions to monitor jobs and data-outputs. If you would like to beta-test this UI with MAAP, please contact Sujen or George.

## Registering and Running the Algorithm using maap.py

This will be described in a future update. Often larger batch-jobs are run from Python Notebooks rather than the GUI.

## Getting the Outputs of the Job

- output folder
- stderr & stdout examples
- logfiles