# GEOtiled: A Scalable Workflow for Generating Large Datasets of High-Resolution Terrain Parameters

Terrain parameters such as slope, aspect, and hillshading can be computed from a Digital Elevation Model (DEM), a representation of the elevation of the Earth's surface. These parameters can be generated at different spatial resolutions and are fundamental in applications such as forestry and agriculture, hydrology, landscape ecology, land-atmosphere interactions, and soil moisture prediction.

GEOtiled comprises three stages: (i) the partition of the DEM into tiles, each with a buffer region; (ii) the computation of the terrain parameters for each individual tile; and finally, (iii) the generation of a mosaic for each parameter from the tiles by averaging the values of the pixels that overlap between the tiles (i.e., pixels within the buffer regions). 

<p align="center">
<img src="../../somospie_pngs/geotiled.png" width="500"/>
</p>

<p align="center">
<b>Figure 1: </b>GEOtiled workflow.
</p>
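
The three stages can be illustrated on a one-dimensional toy signal. The sketch below is only an illustration under simplifying assumptions: the helper names `tile_windows` and `average_mosaic`, the one-pixel buffer, and the identity step standing in for the terrain computation are hypothetical and are not taken from the GEOtiled scripts.

```python
def tile_windows(size, n_tiles, buffer):
    """Stage (i): split [0, size) into n_tiles windows, each widened by `buffer` pixels."""
    base = size // n_tiles
    windows = []
    for k in range(n_tiles):
        start = k * base
        # The last tile absorbs any remainder so the whole axis is covered.
        stop = size if k == n_tiles - 1 else (k + 1) * base
        windows.append((max(0, start - buffer), min(size, stop + buffer)))
    return windows


def average_mosaic(tiles, windows, size):
    """Stage (iii): rebuild the full signal, averaging pixels covered by several tiles."""
    total = [0.0] * size
    count = [0] * size
    for values, (start, _stop) in zip(tiles, windows):
        for offset, value in enumerate(values):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


elevation = [float(v) for v in range(10)]          # toy 1-D "DEM"
windows = tile_windows(len(elevation), n_tiles=2, buffer=1)
# Stage (ii) stand-in: a real run would compute slope, aspect, etc. per tile.
tiles = [elevation[start:stop] for start, stop in windows]
mosaic = average_mosaic(tiles, windows, len(elevation))
print(windows)
print(mosaic == elevation)
```

Because the stand-in computation leaves the values unchanged, averaging the overlapping pixels reproduces the original signal, which makes the sketch easy to check by hand.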

This notebook uses DEMs from [USGS 3DEP products](https://www.usgs.gov/3d-elevation-program/about-3dep-products-services) to compute three topographic parameters: aspect, hillshading, and slope. The parameters are returned as GeoTIFF files in the EPSG:4326 projection.

Before running the workflow in this notebook, go to the [USGS Data Download Application](https://apps.nationalmap.gov/downloader/#/elevation) and use the map to look for available DEM data. Once you have selected a region and resolution, you can download a txt file that lists the individual download links for the tiles in your selection. This txt file is the input to this notebook, which uses the links to download the tiles and compute the parameters. The workflow only works with GeoTIFF files.

## Environment Setup


In [None]:
from Pegasus.api import *
import os
from pathlib import Path
import logging

## Input parameters
In the code cell below, specify the inputs to the workflow:
* **links_file:** path to the txt file with download links for the DEM tiles you wish to use.
* **projection_file:** path to a WKT file that defines the projection. To compute terrain parameters correctly, the DEM must be in a projection whose x, y, and z coordinates are expressed in the same units. The Albers Equal Area USGS projection was used for CONUS, but you can change it depending on the region you are analyzing.
* **n_tiles:** number of tiles along each of the x and y axes; the total number of tiles is `n_tiles * n_tiles`.
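
The requirement that x, y, and z share the same units can be made concrete with a small worked example. The sketch below assumes a pixel size of 1/3 arc-second (roughly 10 m) and an approximate length of 111,320 m per degree at the equator; both numbers and the helper `slope_degrees` are illustrative assumptions, not values used by the workflow.

```python
import math

# Approximate length of one degree at the equator (assumption for illustration).
METERS_PER_DEGREE = 111_320.0


def slope_degrees(rise, run):
    """Slope angle in degrees for a given rise over a given horizontal run."""
    return math.degrees(math.atan2(rise, run))


pixel_deg = 1 / 10800                      # 1/3 arc-second expressed in degrees
pixel_m = pixel_deg * METERS_PER_DEGREE    # the same pixel expressed in meters
rise_m = 1.0                               # elevation change across one pixel

consistent = slope_degrees(rise_m, pixel_m)   # meters over meters
mixed = slope_degrees(rise_m, pixel_deg)      # meters over degrees (wrong)

print(f"consistent units: {consistent:.2f} degrees")
print(f"mixed units:      {mixed:.2f} degrees")
```

With mixed units, a gentle one-meter rise looks like a nearly vertical wall, which is why the DEM is reprojected before the terrain parameters are computed.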

In [None]:
links_file = 'OK_10m.txt'
projection_file = 'projection.wkt'
n_tiles = 2  # Tiles per axis (x and y); total number of tiles = n_tiles * n_tiles

# Read file with links
with open(links_file, 'r', encoding='utf8') as f:
    download_links = f.read().splitlines()
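
Since the workflow expects GeoTIFF inputs, it can help to screen the links before building the workflow. The following is a minimal sketch, assuming that a `.tif` or `.tiff` suffix is a good enough indicator; the helper `clean_links` and the example URLs are hypothetical.

```python
def clean_links(lines):
    """Split raw lines into usable GeoTIFF links and rejected entries."""
    links, rejected = [], []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines, e.g. a trailing newline in the file
        if url.lower().endswith((".tif", ".tiff")):
            links.append(url)
        else:
            rejected.append(url)
    return links, rejected


# Hypothetical example input, not real USGS links.
example_lines = [
    "https://example.com/dem/tile_a.tif ",
    "",
    "https://example.com/dem/readme.xml",
    "https://example.com/dem/tile_b.TIF",
]
links, rejected = clean_links(example_lines)
print(links)
print(rejected)
```

In the notebook, the same function could be applied to `download_links` right after the file is read.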

## OSN credentials and setup
Before running the workflow, specify your access key and secret key in the Pegasus credentials file at ~/.pegasus/credentials.conf with the format below.

```
[osn]
endpoint = https://sdsc.osn.xsede.org

[USER@osn]
access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
secret_key = abababababababababababababababab
```
**Note:** Replace USER with your ACCESS username
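
A quick structural check of the credentials file can catch a missing section before the workflow is submitted. The sketch below assumes the file can be read as a standard INI file with Python's `configparser`; the helper `check_osn_credentials`, the user name `alice`, and the dummy keys are hypothetical.

```python
import configparser


def check_osn_credentials(text, user):
    """Return a list of problems found in the credentials text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []
    if not parser.has_option("osn", "endpoint"):
        problems.append("missing [osn] endpoint")
    section = f"{user}@osn"
    for key in ("access_key", "secret_key"):
        if not parser.has_option(section, key):
            problems.append(f"missing [{section}] {key}")
    return problems


# Dummy content in the format shown above; the keys are placeholders.
sample = """\
[osn]
endpoint = https://sdsc.osn.xsede.org

[alice@osn]
access_key = XXXXXXXX
secret_key = YYYYYYYY
"""

print(check_osn_credentials(sample, "alice"))
print(check_osn_credentials(sample, "bob"))
```

To check the real file, its contents could be read with `Path("~/.pegasus/credentials.conf").expanduser().read_text()` and passed in together with your ACCESS username.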

In the following code cell, also specify the OSN bucket and your ACCESS username.

In [None]:
# Update to an OSN bucket you have access to, for example asc190064-bucket01
osn_bucket="BUCKET"
# Update to your ACCESS username
access_user="USER"

!chmod 600 ~/.pegasus/credentials.conf

## Pegasus logging and properties
This cell sets the logging level and the Pegasus properties for the workflow. In particular, the data staging configuration is set to `nonsharedfs` (non-shared file system) so that OSN can be used for the intermediate and output data.

In [None]:
logging.basicConfig(level=logging.DEBUG)
BASE_DIR = Path(".").resolve()

stg_out = False # Set to True to disable cleanup jobs on intermediate files

# --- Properties ---------------------------------------------------------------
props = Properties()
props["pegasus.monitord.encoding"] = "json"  
# props["pegasus.mode"] = "tutorial" # speeds up tutorial workflows - remove for production ones
props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"
props["pegasus.data.configuration"] = "nonsharedfs"
props["pegasus.transfer.threads"] = "10"
props["pegasus.transfer.lite.threads"] = "10"
#props["pegasus.transfer.bypass.input.staging"] = "true"
props.write() # written to ./pegasus.properties 

## Replica Catalog
The Replica Catalog lists the input files of the workflow: the DEM tiles that Pegasus will download from the USGS servers and the WKT file that describes the projection in which the terrain parameters are computed.

In [None]:
rc = ReplicaCatalog()
input_tiles = []
for f in download_links:
    input_tiles.append(File(os.path.basename(f.strip())))
    rc.add_replica(site="http", lfn=input_tiles[-1], pfn=f)

projection = File(projection_file)
rc.add_replica(site="local", lfn=projection, pfn=Path(".").resolve() / projection_file)

rc.write()
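
Each download link is registered under its base name, so two links that end in the same file name would map to the same logical file name. The sketch below shows one way to detect such collisions beforehand; the helper `duplicate_lfns` and the example URLs are hypothetical.

```python
import os
from collections import Counter


def duplicate_lfns(links):
    """Return the base names that occur more than once among the links."""
    names = [os.path.basename(link.strip()) for link in links if link.strip()]
    return sorted(name for name, n in Counter(names).items() if n > 1)


# Hypothetical links: two different folders that contain the same file name.
example_links = [
    "https://example.com/2021/n35w096.tif",
    "https://example.com/2023/n35w096.tif",
    "https://example.com/2023/n35w097.tif",
]
print(duplicate_lfns(example_links))
```

An empty result for `duplicate_lfns(download_links)` means every tile gets a distinct logical file name.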

## Transformation Catalog
The Transformation Catalog specifies the container in which the workflow runs and the scripts that implement each step of the workflow.

In [None]:
# --- Container ----------------------------------------------------------
base_container = Container(
                  "base-container",
                  Container.SINGULARITY,
                  image="docker://olayap/somospie-gdal")

# --- Transformations ----------------------------------------------------------
merge = Transformation(
                "merge.py",
                site="local",
                pfn=Path(".").resolve() / "code/merge.py",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="50GB", request_memory="5GB")

reproject = Transformation(
                "reproject.py",
                site="local",
                pfn=Path(".").resolve() / "code/reproject.py",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="40GB", request_memory="5GB")

crop = Transformation(
                "crop.py",
                site="local",
                pfn=Path(".").resolve() / "code/crop.py",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="30GB", request_memory="5GB")

compute = Transformation(
                "compute.py",
                site="local",
                pfn=Path(".").resolve() / "code/compute.py",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="20GB", request_memory="5GB")

merge_avg = Transformation(
                "merge_avg.py",
                site="local",
                pfn=Path(".").resolve() / "code/merge_avg.py",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="50GB", request_memory="10GB") # For CONUS@10m 1TB, 20GB


tc = TransformationCatalog()\
    .add_containers(base_container)\
    .add_transformations(merge, reproject, crop, compute, merge_avg)\
    .write() # written to ./transformations.yml

## Site Catalog
The Site Catalog specifies the OSN bucket where the workflow files are stored and the local site where the input files and scripts are located.

In [None]:
# --- Site Catalog ------------------------------------------------- 
osn = Site("osn", arch=Arch.X86_64, os_type=OS.LINUX)

# create and add a bucket in OSN to use for your workflows
osn_shared_scratch_dir = Directory(Directory.SHARED_SCRATCH, path="/" + osn_bucket + "/GEOtiled/work") \
    .add_file_servers(FileServer("s3://" + access_user +"@osn/" + osn_bucket + "/GEOtiled/work", Operation.ALL),)
osn_shared_storage_dir = Directory(Directory.SHARED_STORAGE, path="/" + osn_bucket + "/GEOtiled/storage") \
    .add_file_servers(FileServer("s3://" + access_user +"@osn/" + osn_bucket + "/GEOtiled/storage", Operation.ALL),)
osn.add_directories(osn_shared_scratch_dir, osn_shared_storage_dir)

# add a local site with an optional job env file to use for compute jobs
shared_scratch_dir = "{}/work".format(BASE_DIR)
local_storage_dir = "{}/storage".format(BASE_DIR)
local = Site("local") \
    .add_directories(
    Directory(Directory.SHARED_SCRATCH, shared_scratch_dir)
        .add_file_servers(FileServer("file://" + shared_scratch_dir, Operation.ALL)),
    Directory(Directory.LOCAL_STORAGE, local_storage_dir)
        .add_file_servers(FileServer("file://" + local_storage_dir, Operation.ALL)))

#job_env_file = Path(str(BASE_DIR) + "/../tools/job-env-setup.sh").resolve()
#local.add_pegasus_profile(pegasus_lite_env_source=job_env_file)

#condorpool_site = Site("condorpool")
#condorpool_site.add_condor_profile(request_cpus=1, request_memory="9 GB", request_disk="9 GB")

sc = SiteCatalog()\
   .add_sites(osn, local)\
   .write() # written to ./sites.yml
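
The file server URLs above are built by string concatenation, so stray whitespace in the username or bucket name would end up inside the URL. A small helper can normalize the values first. This is a sketch only; `osn_file_server_url` is a hypothetical name, and the URL layout simply mirrors the one used in the cell above.

```python
def osn_file_server_url(user, bucket, subdir):
    """Build the s3:// URL used for an OSN directory, stripping stray whitespace."""
    user, bucket = user.strip(), bucket.strip()
    if not user or not bucket:
        raise ValueError("both the ACCESS username and the OSN bucket must be set")
    return f"s3://{user}@osn/{bucket}/GEOtiled/{subdir}"


print(osn_file_server_url("USER ", "BUCKET", "work"))
print(osn_file_server_url("USER", " BUCKET", "storage"))
```

The returned strings could be passed to `FileServer(...)` in place of the concatenated expressions.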

## Workflow
The next code cell defines the workflow, including its input, intermediate, and output files. Intermediate files are declared with **stage_out=False** (through the `stg_out` variable), so they are not staged to the output site and are removed by cleanup jobs.

In [None]:
# --- Workflow -----------------------------------------------------------------
wf = Workflow("GEOtiled")

mosaic = File("mosaic.tif")
job_merge = Job(merge)\
                .add_args("-i", *input_tiles, "-o", mosaic)\
                .add_inputs(*input_tiles, bypass_staging=stg_out)\
                .add_outputs(mosaic, stage_out=stg_out)

wf.add_jobs(job_merge)

dem_m = File("elevation_m.tif")
job_reproject = Job(reproject)\
                    .add_args("-p", projection,"-i", mosaic, "-o", dem_m)\
                    .add_inputs(mosaic, projection)\
                    .add_outputs(dem_m, stage_out=stg_out)
    
wf.add_jobs(job_reproject)

# Crop tiles and compute each parameter for each tile
aspect_tiles = []
hillshading_tiles = []
slope_tiles = []

tile_count = 0
for i in range(n_tiles):
    for j in range(n_tiles):
        tile = File("tile_{0:04d}.tif".format(tile_count))
        aspect_tiles.append(File("aspect_tile_{0:04d}.tif".format(tile_count)))
        hillshading_tiles.append(File("hillshading_tile_{0:04d}.tif".format(tile_count)))
        slope_tiles.append(File("slope_tile_{0:04d}.tif".format(tile_count)))

        job_crop = Job(crop)\
                            .add_args("-n", n_tiles, "-x", i, "-y", j, "-i", dem_m, "-o", tile)\
                            .add_inputs(dem_m)\
                            .add_outputs(tile, stage_out=stg_out)
        
        job_compute = Job(compute)\
                            .add_args("-i", tile, "-o", aspect_tiles[-1], hillshading_tiles[-1], slope_tiles[-1])\
                            .add_inputs(tile)\
                            .add_outputs(aspect_tiles[-1], hillshading_tiles[-1], slope_tiles[-1], stage_out=stg_out)

        tile_count += 1
        wf.add_jobs(job_crop, job_compute)
        
# DEM reprojected to WGS84
dem = File("elevation.tif")
job_reproject = Job(reproject)\
                    .add_args("-p", 'EPSG:4326',"-i", mosaic, "-o", dem, "-n", "y")\
                    .add_inputs(mosaic)\
                    .add_outputs(dem, stage_out=True)

# Mosaic of each parameter
aspect = File("aspect.tif")
job_avg0 = Job(merge_avg)\
                .add_args("-i", *aspect_tiles, "-o", aspect)\
                .add_inputs(*aspect_tiles)\
                .add_outputs(aspect, stage_out=True)

hillshading = File("hillshading.tif")
job_avg1 = Job(merge_avg)\
                .add_args("-i", *hillshading_tiles, "-o", hillshading)\
                .add_inputs(*hillshading_tiles)\
                .add_outputs(hillshading, stage_out=True)

slope = File("slope.tif")
job_avg2 = Job(merge_avg)\
                .add_args("-i", *slope_tiles, "-o", slope)\
                .add_inputs(*slope_tiles)\
                .add_outputs(slope, stage_out=True)

wf.add_jobs(job_reproject, job_avg0, job_avg1, job_avg2)
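
As a sanity check for the workflow graph, the number of jobs added to `wf` and the tile numbering implied by the nested loops can be worked out by hand. The helpers below are hypothetical and only restate the structure of the cell above.

```python
def expected_job_count(n_tiles, n_parameters=3):
    """Count the jobs added to the workflow, grouped by transformation."""
    tiles = n_tiles * n_tiles
    return {
        "merge": 1,                 # mosaic of the downloaded tiles
        "reproject": 2,             # projected DEM plus the EPSG:4326 DEM
        "crop": tiles,              # one crop job per tile
        "compute": tiles,           # one compute job per tile
        "merge_avg": n_parameters,  # aspect, hillshading, slope
    }


def tile_name(i, j, n_tiles):
    """File name of the tile cropped at loop indices (i, j)."""
    # tile_count grows by one per inner-loop step, so it equals i * n_tiles + j.
    return "tile_{0:04d}.tif".format(i * n_tiles + j)


counts = expected_job_count(2)
print(counts, sum(counts.values()))
print(tile_name(1, 0, 2))
```

Pegasus adds its own auxiliary jobs (for example, data transfers) during planning, so the planned workflow contains more jobs than this count.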

## Visualizing the Workflow

In [None]:
try:
    wf.write()
    wf.graph(include_files=True, label="xform-id", output="graph.png")
except PegasusClientError as e:
    print(e)

# view rendered workflow
from IPython.display import Image
Image(filename='graph.png')

## Plan and submit the Workflow
The workflow is planned to run on the `condorpool` site, with OSN used as both the staging site and the output site.

In [None]:
try:
    wf.plan(staging_sites={"condorpool": "osn"}, sites=["condorpool"], output_sites=["osn"], cluster=['horizontal'], submit=True)\
        .wait()
except PegasusClientError as e:
    print(e)


## Analyze the workflow
Pegasus returns statistics from the run of the workflow.

In [None]:
try:
    wf.statistics()
except PegasusClientError as e:
    print(e)

## Debug the workflow
If the workflow fails, `wf.analyze()` helps identify the cause of the error.

In [None]:
try:
    wf.analyze()
except PegasusClientError as e:
    print(e)