# Harmony workflow

This notebook replicates the workflow for retrieving subsetted ATL06 (v006) data, using Harmony (via [harmony-py](https://github.com/nasa/harmony-py)) to fulfil the subsetting request.

## Required packages:

* notebook
* pyproj
* numpy
* harmony-py

## Other requirements:

A `.netrc` file with credentials for Earthdata Login (production).
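
Before submitting a request, it can help to confirm that the `.netrc` file actually contains an Earthdata Login entry. The following is a minimal sketch using only the standard library. The helper name `has_earthdata_credentials` is hypothetical, and the host name `urs.earthdata.nasa.gov` is assumed to be the production Earthdata Login host.

```python
import netrc
from pathlib import Path

# Assumption: host name of the production Earthdata Login service.
EARTHDATA_HOST = 'urs.earthdata.nasa.gov'


def has_earthdata_credentials(netrc_path=None):
    """Return True if the .netrc file holds a login and password for Earthdata Login."""
    path = Path(netrc_path) if netrc_path else Path.home() / '.netrc'
    if not path.is_file():
        return False
    try:
        entry = netrc.netrc(str(path)).authenticators(EARTHDATA_HOST)
    except netrc.NetrcParseError:
        # A malformed file is treated the same as a missing one.
        return False
    # entry is a (login, account, password) tuple, or None if the host is absent.
    return entry is not None and bool(entry[0]) and bool(entry[2])
```

Calling `has_earthdata_credentials()` with no argument checks `~/.netrc`. It only confirms that an entry exists, not that the credentials are valid.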

## Optional - install packages:

In [None]:
!pip install notebook harmony-py pyproj numpy

In [None]:
from datetime import datetime

import numpy as np
from harmony import BBox, Client, Collection, Environment, Request
from pyproj import CRS, Transformer

The cell below is a lightly edited version of a cell from the [ICESat-2 cookbook `dhdt_40km_tile` notebook](https://github.com/ICESAT-2HackWeek/icesat2-cookbook/blob/main/notebooks/draft/workflows/greenland_dhdt/dhdt_40km_tile.ipynb).

In [None]:
%%time

# Bounding box information:

# Tile centre in projected metres (the box extends width / 2 either side):
xy0 = [0., -3.0e6]

# CRS of source data (EPSG code):
crs = 3413

# Tile width in metres (40 km):
width = 4.e4

pad = np.array([-width / 2, width / 2])

# Generate bounding box in projected metres:
bbox_projected = [xy0[0] + pad[[0, 1, 1, 0, 0]], xy0[1] + pad[[0, 0, 1, 1, 0]]]

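# Generate a clipping polygon in lat/lon CRS. Without always_xy=True, pyproj
# follows the CRS axis order, so the result is (latitudes, longitudes):
polygon_geographic = Transformer.from_crs(CRS(crs), CRS(4326)).transform(*bbox_projected)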
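
The index patterns `[0, 1, 1, 0, 0]` and `[0, 0, 1, 1, 0]` in the cell above walk round the four corners of the tile and return to the starting corner, which gives a closed ring. A minimal pure-Python sketch of the same construction follows. The `square_ring` helper is hypothetical and is not part of the cookbook or this notebook.

```python
def square_ring(centre_x, centre_y, width):
    """Return x and y lists for a closed square ring centred on a point."""
    half = width / 2
    offsets = [-half, half]
    # Same corner order as the index patterns [0, 1, 1, 0, 0] and [0, 0, 1, 1, 0].
    xs = [centre_x + offsets[i] for i in (0, 1, 1, 0, 0)]
    ys = [centre_y + offsets[i] for i in (0, 0, 1, 1, 0)]
    return xs, ys


# Same centre and width as the notebook cell (projected metres).
xs, ys = square_ring(0.0, -3.0e6, 4.0e4)
corners = list(zip(xs, ys))

# The ring is closed: the last vertex repeats the first.
assert corners[0] == corners[-1]
```

The ring has five vertices, and the tile spans 40 km in both x and y around the centre point.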

## Generate variable list:

In [None]:
%%time

base_variable_names = [
    'land_ice_segments/h_li',
    'land_ice_segments/fit_statistics/n_fit_photons',
    'land_ice_segments/fit_statistics/dh_fit_dx',
    'land_ice_segments/fit_statistics/w_surface_window_final',
    'land_ice_segments/atl06_quality_summary',
    'land_ice_segments/geophysical/bsnow_conf',
    'land_ice_segments/h_li_sigma',
    'land_ice_segments/geophysical/r_eff',
    'land_ice_segments/geophysical/bsnow_h',
    'land_ice_segments/geophysical/tide_ocean',
    'land_ice_segments/ground_track/x_atc',
    'land_ice_segments/fit_statistics/h_robust_sprd',
    'land_ice_segments/segment_id',
    'land_ice_segments/ground_track/y_atc',
    'land_ice_segments/ground_track/seg_azimuth',
    'land_ice_segments/sigma_geo_h',
    'land_ice_segments/delta_time',
    'land_ice_segments/latitude',
    'land_ice_segments/longitude',
]

ground_tracks = ['gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r']

all_variables = [
    '/'.join([ground_track, base_variable_name])
    for ground_track in ground_tracks
    for base_variable_name in base_variable_names
]
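
The list comprehension above pairs every ground track with every base variable name. As a sketch of the same idea, the hypothetical helper below (not part of harmony-py) builds the full paths with `itertools.product` and guards against accidental duplicates in the input lists.

```python
from itertools import product


def build_variable_paths(ground_tracks, base_variable_names):
    """Return 'ground_track/variable' paths for every combination of inputs."""
    paths = ['/'.join(pair) for pair in product(ground_tracks, base_variable_names)]
    if len(set(paths)) != len(paths):
        raise ValueError('Duplicate variable paths found')
    return paths


# Shortened example using two of the base variable names from the cell above.
example_paths = build_variable_paths(
    ['gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r'],
    ['land_ice_segments/h_li', 'land_ice_segments/latitude'],
)
```

With 6 ground tracks and the 19 base variable names listed in the notebook, the full list contains 6 × 19 = 114 paths.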

## Run the Harmony request to process the data

Note: NSIDC has not populated UMM-Var records, so this request does not perform variable subsetting. (The option for that is commented out.)

In [None]:
%%time

# Concept ID of the ATL06 (v006) collection:
atl06_concept_id = 'C2670138092-NSIDC_CPRD'

# An explicit UTC offset is used because the 'Z' suffix is only accepted by
# datetime.fromisoformat in Python 3.11 and later:
start_datetime = datetime.fromisoformat('2018-10-13T00:00:00+00:00')
stop_datetime = datetime.fromisoformat('2025-10-13T00:00:00+00:00')

harmony_client = Client(env=Environment.PROD)

harmony_request = Request(
    collection=Collection(id=atl06_concept_id),
    # polygon_geographic[0] holds latitudes and polygon_geographic[1] longitudes:
    spatial=BBox(
        w=min(polygon_geographic[1]),
        s=min(polygon_geographic[0]),
        e=max(polygon_geographic[1]),
        n=max(polygon_geographic[0]),
    ),
    temporal={'start': start_datetime, 'stop': stop_datetime},
    # variables=all_variables,
    max_results=5,  # Limit the job so it completes; 214 granules match in total
)

harmony_job_id = harmony_client.submit(harmony_request)
harmony_client.wait_for_processing(harmony_job_id, show_progress=True)
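
The `BBox` arguments above depend on the axis order of the transformed polygon: latitudes come first and longitudes second. A small sketch of that step as a standalone function may make the ordering easier to check. The helper name `bounding_box_from_latlon` is hypothetical, and the coordinates in the example are made-up illustrative values, not the output of the transformation above.

```python
def bounding_box_from_latlon(latitudes, longitudes):
    """Return (west, south, east, north) from latitude and longitude sequences.

    Assumes the region does not cross the antimeridian, where a simple
    min/max of the longitudes would give the wrong extent.
    """
    return (min(longitudes), min(latitudes), max(longitudes), max(latitudes))


# Illustrative values only (degrees), in (latitudes, longitudes) order.
example_lats = [62.4, 62.4, 62.8, 62.8, 62.4]
example_lons = [-45.4, -44.6, -44.6, -45.4, -45.4]
west, south, east, north = bounding_box_from_latlon(example_lats, example_lons)
```

The four values map directly onto the `w`, `s`, `e` and `n` arguments of `BBox`.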

## Quick stats:

An individual granule takes about 10 to 15 seconds to process, of which 2 to 5 seconds is spent downloading the granule to the Harmony Docker container. The rest of the time is spent processing and staging the output. (Processing should be quicker with variable subsetting, as less of the file is spatially subsetted and staged to the output location.)

First run: wall time of 7 minutes 23 seconds. 7 granules were processed and 3 had no data. Possibly one more was in progress; the rest had not started.

Second run (`max_results=5`):

```
CPU times: user 1.44 s, sys: 679 ms, total: 2.12 s
Wall time: 3min 13s
```

However, Harmony may have cached some of the output from the first run, which would make this run quicker.

Third run: Harmony's parallelism seemed to kick in this time (the first two attempts ran one granule at a time). Total time spent by Harmony:

```
"createdAt": "2025-08-20T20:19:37.002Z",
"updatedAt": "2025-08-20T20:23:45.790Z",
```

So - total time: 4 minutes 8 seconds.
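
The elapsed time can be checked from the two job timestamps. This is a minimal sketch using the standard library. It assumes both timestamps are in UTC with millisecond precision, as in the values quoted above, and the helper name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

# Assumption: timestamps look like '2025-08-20T20:19:37.002Z' (UTC).
TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'


def elapsed_seconds(created_at, updated_at):
    """Return the number of seconds between two job timestamps."""
    start = datetime.strptime(created_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(updated_at, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


elapsed = elapsed_seconds('2025-08-20T20:19:37.002Z', '2025-08-20T20:23:45.790Z')
minutes, seconds = divmod(elapsed, 60)
print(f'{int(minutes)} min {seconds:.1f} s')  # prints "4 min 8.8 s"
```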

## Download all the results

In [None]:
# Job ID from an earlier run, so the download can be repeated without resubmitting:
harmony_job_id = '01534ac9-7a32-4c49-ac22-ac427f34af7f'
harmony_client = Client(env=Environment.PROD)

In [None]:
%%time

downloaded_files = [
    file_future.result()
    for file_future in harmony_client.download_all(harmony_job_id, overwrite=True)
]

print(downloaded_files)
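
To relate download time to data volume, the downloaded files can be summarised by count and total size. The sketch below assumes that `downloaded_files` is a list of local file paths, as printed above. The helper name `summarise_downloads` is hypothetical.

```python
from pathlib import Path


def summarise_downloads(file_paths):
    """Return (file count, total size in MiB) for the paths that exist on disk."""
    existing = [Path(path) for path in file_paths if Path(path).is_file()]
    total_bytes = sum(path.stat().st_size for path in existing)
    return len(existing), total_bytes / 2**20
```

For example, `count, size_mib = summarise_downloads(downloaded_files)` gives figures that can be set against the wall time of the download cell.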

## Quick stats:

```
CPU times: user 120 ms, sys: 164 ms, total: 284 ms
Wall time: 40.7 s
```

Approximately 10 seconds per download, although that is for outputs containing all variables. Variable subsetting could likely improve this by a factor of 2 or 3.