# Setup Gabon GEDI L4A Testing

1. Download Gabon outline in geospatial format
2. Save it to the workspace (probably not the repo)
3. Query CMR with the bbox of the polygon to find out how many granules are involved

Boundary file is available at `shared-buckets/alexdevseed/iso3/GAB-ADM0.json`

In [1]:
!pip install geopandas profilehooks

In [14]:
import concurrent.futures
import json
import os
import os.path
from typing import Any, Callable, Mapping, Optional, Iterable, TypeVar

import geopandas as gpd
import requests
from maap.maap import Granule, MAAP
from profilehooks import timecall

## Get Gabon GeoBoundary

In [3]:
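def get_geo_boundary(*, iso: str, level: int) -> gpd.GeoDataFrame:
    """Return the geoBoundaries outline for a country, caching it in the workspace."""
    file_path = f'/projects/my-public-bucket/iso3/{iso}-ADM{level}.json'
    
    # Download the boundary only if it is not already cached locally.
    if not os.path.exists(file_path):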
        r = requests.get(
            'https://www.geoboundaries.org/gbRequest.html',
            dict(ISO=iso, ADM=f'ADM{level}')
        )
        r.raise_for_status()
        dl_url = r.json()[0]['gjDownloadURL']
        geo_boundary = requests.get(dl_url).json()

        with open(file_path, 'w') as out:
            out.write(json.dumps(geo_boundary))
    
    return gpd.read_file(file_path)


gabon_gdf = get_geo_boundary(iso='GAB', level=0)
gabon_gdf

|   | shapeName | shapeISO | shapeID | shapeGroup | shapeType | geometry |
|---|-----------|----------|---------|------------|-----------|----------|
| 0 | Gabon | GAB | GAB-ADM0-3_0_0-B1 | GAB | ADM0 | MULTIPOLYGON (((8.83154 -0.92271, 8.83809 -0.9... |


## Find GEDI L4A Collection

In [4]:
nasa_cmr_host = 'cmr.earthdata.nasa.gov'
maap_cmr_host = 'cmr.maap-project.org'
maap = MAAP('api.ops.maap-project.org')
cmr_host = nasa_cmr_host

gedi_l4a_doi = '10.3334/ORNLDAAC/1986'
gedi_l4a = maap.searchCollection(cmr_host=cmr_host, doi=gedi_l4a_doi, limit=1)[0]
gedi_l4a_concept_id = gedi_l4a['concept-id']

## Find GEDI L4A Granules within Gabon Bounding Box

In [16]:
# TODO Handle cases where there are more than 2000 granules,
# as 2000 is the largest limit value allowed for a single query
# (a CMR constraint).
granules = timecall(maap.searchGranule)(
    cmr_host=cmr_host,
    collection_concept_id=gedi_l4a_concept_id,
    bounding_box=','.join(map(str, gabon_gdf.total_bounds)),
    limit=2000
)

print(f'Found {len(granules)} granules')


  searchGranule (/maap-py/maap/maap.py:104):
    31.632 seconds



Found 1009 granules
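The TODO above (more than 2000 granules) could be handled by paging through the results. The following is only a minimal sketch under stated assumptions: it queries the CMR search API directly with `requests` rather than going through `maap.searchGranule`, it relies on CMR's `page_size` parameter and `CMR-Search-After` header, and it returns raw JSON entries rather than `Granule` objects. The names `fetch_all` and `make_cmr_fetch_page` are hypothetical helpers introduced for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A page fetcher takes the previous "search after" token (None for the first
# page) and returns (items, next_token); next_token is None when done.
PageFetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def fetch_all(fetch_page: PageFetcher) -> List[dict]:
    """Collect items from every page until the fetcher reports no more."""
    items: List[dict] = []
    token: Optional[str] = None
    while True:
        page, token = fetch_page(token)
        items.extend(page)
        # Stop on an empty page or when no continuation token is returned.
        if not page or token is None:
            return items


def make_cmr_fetch_page(host: str, params: Dict[str, str]) -> PageFetcher:
    """Hypothetical fetcher that queries CMR directly (assumed JSON layout)."""
    import requests  # imported lazily so fetch_all has no hard dependency

    def fetch_page(token: Optional[str]) -> Tuple[List[dict], Optional[str]]:
        headers = {'CMR-Search-After': token} if token else {}
        r = requests.get(
            f'https://{host}/search/granules.json',
            params={**params, 'page_size': '2000'},
            headers=headers,
        )
        r.raise_for_status()
        return r.json()['feed']['entry'], r.headers.get('CMR-Search-After')

    return fetch_page
```

Usage would look something like `fetch_all(make_cmr_fetch_page(cmr_host, {'collection_concept_id': gedi_l4a_concept_id, 'bounding_box': bbox}))`, where `bbox` is the comma-separated bounding box string built from `gabon_gdf.total_bounds`. Separating the loop from the fetcher keeps the paging logic testable without network access.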


## Download Granule Files

Although the granule files vary greatly in size, from under 100 MB to over 900 MB, they are all relatively large. Downloading them serially would therefore be time-consuming, so we want some level of concurrency to shorten the total download time as much as possible.

A good concurrency level generally cannot be computed up front; it takes some trial. For blocking operations that each last several seconds or more, going much beyond a handful of concurrent threads tends to suffer from thread-management overhead. A good rule of thumb is therefore to start with trials centered on a concurrency level of 5.

Since this is an IO-bound operation, we'll simply use a `ThreadPoolExecutor`, as using multiple processes won't help.
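As a rough illustration of what such a trial might look like, the sketch below times the same batch of blocking tasks at several worker counts. It is not the notebook's benchmark: `fake_download` is a hypothetical stand-in that merely sleeps, and `time_with_workers` and `run_trials` are names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar('T')


def time_with_workers(fn: Callable[[T], object], items: List[T], max_workers: int) -> float:
    """Return the wall-clock seconds taken to apply `fn` to all items."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces completion and re-raises any worker exception.
        list(executor.map(fn, items))
    return time.perf_counter() - start


def fake_download(_: int) -> None:
    """Stand-in for a blocking download; sleeping releases the GIL as IO does."""
    time.sleep(0.05)


def run_trials(worker_counts: Iterable[int], n_items: int = 12) -> Dict[int, float]:
    """Time the same batch of fake downloads at each concurrency level."""
    items = list(range(n_items))
    return {n: time_with_workers(fake_download, items, n) for n in worker_counts}


timings = run_trials([1, 3, 6])
for workers, seconds in timings.items():
    print(f'{workers} workers: {seconds:.2f} s')
```

With a pure sleep, more workers always finish sooner. Real downloads share network bandwidth and disk, which is why the best level has to be measured rather than assumed.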

First, we need a function to download a single granule file, such that the function is suitable for use with a mapping function.

In [6]:
def download_granule(dest_dir: str, *, overwrite=False) -> Callable[[Granule], None]:
    os.makedirs(dest_dir, exist_ok=True)

    @timecall
    def do_download_granule(granule: Granule) -> None:
        granule.getData(dest_dir, overwrite)
    
    return do_download_granule

In addition to comparing the performance of varying numbers of threads in a `ThreadPoolExecutor`, it would be nice to compare a `ThreadPoolExecutor` with `asyncio.gather` combined with `asyncio.to_thread`. However, `asyncio.to_thread` was not introduced until Python 3.9, so we cannot attempt this at the moment.

In [20]:
## Requires Python 3.9+

# import asyncio

# async def download_all_granules(dest_dir: str, granules: Iterable[Granule]) -> None:
#     await asyncio.gather(*(asyncio.to_thread(download_granule(dest_dir), granule) for granule in granules))

## TODO add code for timings

## In script:
# asyncio.run(download_all_granules('/projects/my-public-bucket/gedi-l4a/gabon', granules))

## In Jupyter:
# await download_all_granules('/projects/my-public-bucket/gedi-l4a/gabon', granules)
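On Python versions before 3.9, a similar comparison might still be possible with `loop.run_in_executor`, which has been available alongside `asyncio.get_running_loop` since Python 3.7. The following is only a sketch of that alternative, not something that was run in this notebook; `gather_in_threads` is a hypothetical helper name.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar('T')
R = TypeVar('R')


async def gather_in_threads(
    fn: Callable[[T], R], items: Iterable[T], max_workers: int = 5
) -> List[R]:
    """Run blocking `fn` over `items` in worker threads and await all results."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each call is scheduled on the thread pool and wrapped in an
        # awaitable future that asyncio.gather can collect.
        futures = [loop.run_in_executor(executor, fn, item) for item in items]
        return await asyncio.gather(*futures)


# In a script (assuming `download_granule` and `granules` from this notebook):
# asyncio.run(gather_in_threads(download_granule(dest_dir), granules))

# In Jupyter:
# await gather_in_threads(download_granule(dest_dir), granules)
```

Since this still runs on a thread pool, any timing difference from plain `executor.map` would most likely come from event-loop overhead rather than from the downloads themselves.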

In [11]:
def make_concurrent_map(max_workers, *, timeout=None):
    @timecall
    def concurrent_map(fn, iterable) -> Iterable:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Consume the results so that worker exceptions and the timeout
            # surface here rather than being silently dropped.
            return list(executor.map(fn, iterable, timeout=timeout))
    
    return concurrent_map

download_granule_to_gabon_dir = download_granule('/projects/my-public-bucket/gedi-l4a/gabon')

In [21]:
# make_concurrent_map(3)(download_granule_to_gabon_dir, granules[700:750])

In [22]:
print(f'3 workers: {11634422576 / 232.740 / 1_000_000} MB/s')
print(f'4 workers: {12796077537 / 213.866 / 1_000_000} MB/s')
print(f'5 workers: {10054004580 / 184.119 / 1_000_000} MB/s')
print(f'6 workers: {10522743776 / 208.35 / 1_000_000} MB/s')
print(f'8 workers: {16232707819 / 347.881 / 1_000_000} MB/s')
print(f'10 workers: {10350974972 / 227.481 / 1_000_000} MB/s')

3 workers: 49.98892573687376 MB/s
4 workers: 59.83221988067294 MB/s
5 workers: 54.6060133935118 MB/s
6 workers: 50.50512971442284 MB/s
8 workers: 46.66166826874708 MB/s
10 workers: 45.50259130213073 MB/s
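The same arithmetic can be kept in one place so that the best-performing worker count is picked programmatically. This is a small sketch using the byte totals and timings copied from the print statements above; `throughput_mb_per_s` is a hypothetical helper name.

```python
# (total bytes downloaded, elapsed seconds) per worker count, copied from the
# print statements above.
trials = {
    3: (11634422576, 232.740),
    4: (12796077537, 213.866),
    5: (10054004580, 184.119),
    6: (10522743776, 208.35),
    8: (16232707819, 347.881),
    10: (10350974972, 227.481),
}


def throughput_mb_per_s(total_bytes: int, seconds: float) -> float:
    """Convert a byte count and duration to megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 1_000_000


rates = {workers: throughput_mb_per_s(*trial) for workers, trial in trials.items()}
best_workers = max(rates, key=rates.get)

for workers, rate in rates.items():
    print(f'{workers} workers: {rate:.1f} MB/s')
print(f'Best in these trials: {best_workers} workers')
```

The byte totals differ between trials, so each run appears to have covered a different set of files. The comparison is therefore only a rough guide, not a controlled benchmark.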
