# Convert HDF5 to FlatGeobuf using H5DataFrame

This notebook loads data from a nested ICESat-2 ATL03 HDF5 file
and converts it to the [FlatGeobuf](https://flatgeobuf.org) format.

Steps:
1. Read a variable from an HDF5 file into an `H5DataFrame`
2. Create a `geopandas.GeoDataFrame` with the columns `h_ph` and `geometry` (longitude/latitude)
3. Save the GeoDataFrame to FlatGeobuf

References:
- https://github.com/MAAP-Project/gedi-subsetter/blob/0.6.0/src/gedi_subset/gedi_utils.py#L139-L381

In [1]:
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd
import h5py
import s3fs
import tqdm

try:
    from gedi_subset.h5frame import H5DataFrame
except ImportError:
    !pip install git+https://github.com/MAAP-Project/gedi-subsetter.git@0.6.0
    from gedi_subset.h5frame import H5DataFrame

try:
    import pyogrio
except ImportError:
    !mamba install -y pyogrio
    import pyogrio

In [2]:
gpd.show_versions()


SYSTEM INFO
-----------
python     : 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
executable : /srv/conda/envs/notebook/bin/python
machine    : Linux-5.10.167-147.601.amzn2.x86_64-x86_64-with-glibc2.27

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.11.2
GEOS lib   : None
GDAL       : 3.7.0
GDAL data dir: /srv/conda/envs/notebook/share/gdal
PROJ       : 9.2.1
PROJ data dir: /srv/conda/envs/notebook/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.12.1
numpy      : 1.23.5
pandas     : 1.5.1
pyproj     : 3.6.0
shapely    : 2.0.1
fiona      : 1.9.4
geoalchemy2: None
geopy      : None
matplotlib : 3.6.2
mapclassify: 2.5.0
pygeos     : 0.14
pyogrio    : 0.6.0
psycopg2   : 2.9.6 (dt dec pq3 ext lo64)
pyarrow    : 12.0.1
rtree      : 1.0.1


## List files to convert

In [3]:
!aws s3 ls s3://nasa-cryo-scratch/h5cloud/original/

2023-08-08 23:45:34 7754735138 ATL03_20181120182818_08110112_006_02.h5
2023-08-08 23:47:04 6997123664 ATL03_20190219140808_08110212_006_02.h5
2023-08-08 23:47:04 6925710500 ATL03_20200217204710_08110612_006_01.h5
2023-08-08 23:47:04 8392279594 ATL03_20211114142614_08111312_006_01.h5
2023-08-08 23:47:04 7954039827 ATL03_20230211164520_08111812_006_01.h5


In [4]:
s3 = s3fs.S3FileSystem(anon=False)
s3urls = s3.glob(path="s3://nasa-cryo-scratch/h5cloud/original/*")
s3urls

['nasa-cryo-scratch/h5cloud/original/ATL03_20181120182818_08110112_006_02.h5',
 'nasa-cryo-scratch/h5cloud/original/ATL03_20190219140808_08110212_006_02.h5',
 'nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5',
 'nasa-cryo-scratch/h5cloud/original/ATL03_20211114142614_08111312_006_01.h5',
 'nasa-cryo-scratch/h5cloud/original/ATL03_20230211164520_08111812_006_01.h5']
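Note that `s3.glob` returns paths without the `s3://` scheme. `s3fs` accepts both forms, but tools such as the `aws` CLI expect the full URL. A minimal sketch of a normalising helper follows; the name `to_s3_url` is hypothetical and not part of `s3fs`.

```python
def to_s3_url(path: str) -> str:
    """Return `path` with an explicit s3:// scheme (hypothetical helper)."""
    return path if path.startswith("s3://") else f"s3://{path}"


# One of the paths listed by s3.glob above
key = "nasa-cryo-scratch/h5cloud/original/ATL03_20230211164520_08111812_006_01.h5"
url = to_s3_url(key)
print(url)
```

Calling the helper on a string that already starts with `s3://` returns it unchanged, so it is safe to apply to mixed lists.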

## Single file conversion

This section shows how to process a single HDF5 file.
Skip to the "Multiple file conversions" section at the bottom to convert several files at once.

### Read HDF5 variables into H5DataFrame

In [5]:
s3url = "s3://nasa-cryo-scratch/h5cloud/original/ATL03_20230211164520_08111812_006_01.h5"
h5 = h5py.File(name=s3.open(path=s3url, mode="rb"))

In [6]:
# Print top-level groups
print(h5.keys())

# Print all nested groups (very slow)
# h5.visit(func=lambda name: print(name))

# Print specific groups (slow)
h5["gt1l/heights"].visit(func=lambda name: print(name))

<KeysViewHDF5 ['METADATA', 'ancillary_data', 'atlas_impulse_response', 'ds_surf_type', 'ds_xyz', 'gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r', 'orbit_info', 'quality_assessment']>
delta_time
dist_ph_across
dist_ph_along
h_ph
lat_ph
lon_ph
pce_mframe_cnt
ph_id_channel
ph_id_count
ph_id_pulse
quality_ph
signal_conf_ph
weight_ph


#### Read in just the `gt1l/heights` group

In [7]:
df = H5DataFrame(group=h5["gt1l/heights"])
df

It looks like there are no columns because `H5DataFrame` is lazy:
a variable (e.g. `h_ph`) is only read once you access it explicitly.
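To illustrate what "lazy" means here, the sketch below uses a hypothetical stand-in class, `LazyColumns`, that only runs a loader function when a column is first requested. It is not the `gedi_subset` implementation, and the caching it does is an assumption made for the illustration; this notebook does not show whether `H5DataFrame` caches.

```python
from collections.abc import Callable, Mapping
from typing import Any


class LazyColumns:
    """Hypothetical stand-in: loads each column only on first access."""

    def __init__(self, loaders: Mapping[str, Callable[[], Any]]) -> None:
        self._loaders = loaders
        self._cache: dict[str, Any] = {}

    def __getitem__(self, name: str) -> Any:
        # Run the loader the first time a column is requested, then reuse it
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    @property
    def loaded(self) -> list[str]:
        """Names of the columns that have been read so far."""
        return sorted(self._cache)


calls: list[str] = []  # records every real "read"


def load_h_ph() -> list[float]:
    calls.append("h_ph")
    return [267.050385, 279.333771, 272.782135]


columns = LazyColumns({"h_ph": load_h_ph, "lat_ph": lambda: [-79.00597] * 3})
print(columns.loaded)    # no column has been read yet
heights = columns["h_ph"]
print(columns.loaded)    # only h_ph has been read
```

The practical consequence is the same as in the notebook: nothing is fetched from the file until a specific variable is asked for.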

In [8]:
df["h_ph"]

0           267.050385
1           279.333771
2           272.782135
3           226.964447
4           228.495010
               ...    
84812149    286.131195
84812150    282.127655
84812151    278.603241
84812152    155.884216
84812153    151.234497
Name: h_ph, Length: 84812154, dtype: float32

### Create `geopandas.GeoDataFrame`

#### Create geometry column with CRS

In [9]:
%%time
geometry = gpd.points_from_xy(x=df["lon_ph"], y=df["lat_ph"], crs="OGC:CRS84")

CPU times: user 46.2 s, sys: 11.4 s, total: 57.6 s
Wall time: 1min 54s


In [10]:
geometry

<GeometryArray>
[<POINT (-60.059 -79.006)>, <POINT (-60.059 -79.006)>,
 <POINT (-60.059 -79.006)>, <POINT (-60.059 -79.006)>,
 <POINT (-60.059 -79.006)>, <POINT (-60.059 -79.006)>,
 <POINT (-60.059 -79.006)>, <POINT (-60.059 -79.006)>,
 <POINT (-60.059 -79.006)>, <POINT (-60.059 -79.006)>,
 ...
 <POINT (-69.925 -50.011)>, <POINT (-69.925 -50.011)>,
 <POINT (-69.925 -50.011)>, <POINT (-69.925 -50.011)>,
 <POINT (-69.925 -50.011)>, <POINT (-69.925 -50.011)>,
 <POINT (-69.925 -50.011)>, <POINT (-69.925 -50.011)>,
 <POINT (-69.925 -50.011)>, <POINT (-69.925 -50.011)>]
Length: 84812154, dtype: geometry

#### Create `geopandas.GeoDataFrame` with h_ph and geometry columns

In [11]:
gdf = gpd.GeoDataFrame(data=df[["h_ph"]], geometry=geometry)
gdf

| | h_ph | geometry |
|---|---|---|
| 0 | 267.050385 | POINT (-60.05933 -79.00597) |
| 1 | 279.333771 | POINT (-60.05933 -79.00597) |
| 2 | 272.782135 | POINT (-60.05933 -79.00597) |
| 3 | 226.964447 | POINT (-60.05935 -79.00597) |
| 4 | 228.495010 | POINT (-60.05935 -79.00597) |
| ... | ... | ... |
| 84812149 | 286.131195 | POINT (-69.92508 -50.01094) |
| 84812150 | 282.127655 | POINT (-69.92508 -50.01094) |
| 84812151 | 278.603241 | POINT (-69.92508 -50.01094) |
| 84812152 | 155.884216 | POINT (-69.92513 -50.01094) |


In [12]:
gdf.crs

<Geographic 2D CRS: OGC:CRS84>
Name: WGS 84 (CRS84)
Axis Info [ellipsoidal]:
- Lon[east]: Geodetic longitude (degree)
- Lat[north]: Geodetic latitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

### Store as FlatGeobuf

To save time, we write the dataframe without a spatial index.
Writing should take about 3 minutes, compared to more than 1 hour with a spatial index.

References:
- https://gdal.org/drivers/vector/flatgeobuf.html#layer-creation-options

In [13]:
granule_name: str = os.path.basename(os.path.splitext(s3url)[0])
filename: str = f"{granule_name}_gt1l_heights_h_ph.fgb"
filename

'ATL03_20230211164520_08111812_006_01_gt1l_heights_h_ph.fgb'
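The filename derivation above can be wrapped in a small helper so that the same logic is reused for every granule. This is a sketch only: `flatgeobuf_name` and its `suffix` parameter are hypothetical names, and `posixpath` is used because S3 keys always use forward slashes.

```python
import posixpath


def flatgeobuf_name(s3url: str, suffix: str = "") -> str:
    """Derive a local .fgb filename from an S3 URL or key (hypothetical helper)."""
    # Keep only the last path component, then drop the .h5 extension
    stem, _extension = posixpath.splitext(posixpath.basename(s3url))
    return f"{stem}{suffix}.fgb"


s3url = "s3://nasa-cryo-scratch/h5cloud/original/ATL03_20230211164520_08111812_006_01.h5"
filename = flatgeobuf_name(s3url, suffix="_gt1l_heights_h_ph")
print(filename)
```

Taking the URL as an argument, rather than reading a notebook-level variable, keeps the helper safe to call inside a loop.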

In [14]:
%%time
gdf.to_file(filename=filename, driver="FlatGeobuf", engine="pyogrio", SPATIAL_INDEX="NO")

#### Add spatial index (optional)

In [15]:
%%time
!ogr2ogr -f FlatGeobuf ATL03_20230211164520_08111812_006_01.fgb ATL03_20230211164520_08111812_006_01_gt1l_heights_h_ph.fgb

CPU times: user 1.44 s, sys: 682 ms, total: 2.12 s
Wall time: 3min 43s
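Rewriting the file with `ogr2ogr` adds the index because the GDAL FlatGeobuf driver creates a spatial index by default (`SPATIAL_INDEX=YES`). The sketch below builds the same command from Python with the layer creation option spelled out. The helper name `build_index_command` is hypothetical; the command is only printed here, and running it (for example with `subprocess.run(cmd, check=True)`) assumes `ogr2ogr` is on the `PATH`.

```python
import shlex


def build_index_command(src: str, dst: str) -> list[str]:
    """Build an ogr2ogr call that rewrites `src` to `dst` with a spatial index."""
    return [
        "ogr2ogr",
        "-f", "FlatGeobuf",
        "-lco", "SPATIAL_INDEX=YES",  # the driver default, stated explicitly
        dst,  # ogr2ogr takes the destination before the source
        src,
    ]


src = "ATL03_20230211164520_08111812_006_01_gt1l_heights_h_ph.fgb"
dst = "ATL03_20230211164520_08111812_006_01.fgb"
cmd = build_index_command(src=src, dst=dst)
print(shlex.join(cmd))
```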


#### Check that FlatGeobuf can be loaded back into `geopandas.GeoDataFrame`

In [16]:
%%time
_gdf = gpd.read_file(filename=filename, engine="pyogrio")

CPU times: user 1min 31s, sys: 9.96 s, total: 1min 41s
Wall time: 1min 43s


#### Upload FlatGeobuf to s3 bucket

In [17]:
# !aws s3 cp ATL03_20230211164520_08111812_006_01_gt1l_heights_h_ph.fgb s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ATL03_20230211164520_08111812_006_01.fgb

## Multiple file conversions

Convert many HDF5 files to FlatGeobuf in a for-loop,
and upload them to an s3 bucket.

In [5]:
def generate_flatgeobuf(hdf5_path: str) -> str:
    """Convert gt1l/heights/h_ph of one ATL03 HDF5 file on S3 to a local FlatGeobuf file."""
    # Read into H5DataFrame (variables are only loaded when accessed)
    h5 = h5py.File(name=s3.open(path=hdf5_path, mode="rb"))
    df = H5DataFrame(group=h5["gt1l/heights"])

    # Create geopandas.GeoDataFrame
    geometry = gpd.points_from_xy(x=df["lon_ph"], y=df["lat_ph"], crs="OGC:CRS84")
    gdf = gpd.GeoDataFrame(data=df[["h_ph"]], geometry=geometry)

    # Save to FlatGeobuf, deriving the output filename from the input path
    granule_name: str = os.path.basename(os.path.splitext(hdf5_path)[0])
    filename: str = f"{granule_name}.fgb"
    gdf.to_file(filename=filename, driver="FlatGeobuf", engine="pyogrio", SPATIAL_INDEX="NO")

    return filename

In [6]:
%%time
flatgeobufs = []
for s3url in tqdm.tqdm(iterable=s3urls):
    flatgeobuf: str = generate_flatgeobuf(hdf5_path=s3url)
    flatgeobufs.append(flatgeobuf)
print(flatgeobufs)

100%|██████████| 5/5 [29:23<00:00, 352.61s/it]

['ATL03_20181120182818_08110112_006_02.fgb', 'ATL03_20190219140808_08110212_006_02.fgb', 'ATL03_20200217204710_08110612_006_01.fgb', 'ATL03_20211114142614_08111312_006_01.fgb', 'ATL03_20230211164520_08111812_006_01.fgb']
CPU times: user 22min 11s, sys: 1min 8s, total: 23min 20s
Wall time: 29min 23s





In [7]:
%%bash
mkdir -p sindex
for fgb in *.fgb; do
    echo "Adding spatial index to $fgb"
    time ogr2ogr -f FlatGeobuf -progress "sindex/$fgb" "$fgb"
done

Adding spatial index to ATL03_20181120182818_08110112_006_02.fgb
0...10...20...30...40...50...60...70...80...90...100 - done.



real	4m1.905s
user	2m40.654s
sys	0m12.627s


Adding spatial index to ATL03_20190219140808_08110212_006_02.fgb
0...10...20...30...40...50...60...70...80...90...100 - done.



real	6m28.763s
user	4m26.447s
sys	0m18.805s


Adding spatial index to ATL03_20200217204710_08110612_006_01.fgb
0...10...20...30...40...50...60...70...80...90...100 - done.



real	3m40.568s
user	2m43.953s
sys	0m11.046s


Adding spatial index to ATL03_20211114142614_08111312_006_01.fgb
0...10...20...30...40...50...60...70...80...90...100 - done.



real	9m7.924s
user	6m50.077s
sys	0m24.373s


Adding spatial index to ATL03_20230211164520_08111812_006_01.fgb
0...10...20...30...40...50...60...70...80...90...100 - done.



real	7m4.553s
user	5m2.343s
sys	0m21.201s
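To put the indexing cost next to the 29-minute conversion loop, the `real` times reported above can be summed. The snippet below is a small sketch with the five values copied from the output above; `to_seconds` is a hypothetical helper.

```python
import re

# "real" wall-clock times copied from the ogr2ogr output above
real_times = ["4m1.905s", "6m28.763s", "3m40.568s", "9m7.924s", "7m4.553s"]


def to_seconds(text: str) -> float:
    """Convert a bash `time` value such as '4m1.905s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"Unrecognised time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


total = sum(to_seconds(t) for t in real_times)
minutes, seconds = divmod(total, 60)
print(f"{int(minutes)}m{seconds:.3f}s")  # prints 30m23.713s
```

Adding spatial indexes to the five files therefore took about half an hour in total, roughly as long as the conversion itself.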


In [8]:
!aws s3 cp ./sindex/ s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ --recursive --exclude "*" --include "*.fgb"
!aws s3 ls s3://nasa-cryo-scratch/h5cloud/flatgeobuf/

upload: sindex/ATL03_20200217204710_08110612_006_01.fgb to s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ATL03_20200217204710_08110612_006_01.fgb
upload: sindex/ATL03_20181120182818_08110112_006_02.fgb to s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ATL03_20181120182818_08110112_006_02.fgb
upload: sindex/ATL03_20190219140808_08110212_006_02.fgb to s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ATL03_20190219140808_08110212_006_02.fgb
upload: sindex/ATL03_20230211164520_08111812_006_01.fgb to s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ATL03_20230211164520_08111812_006_01.fgb
upload: sindex/ATL03_20211114142614_08111312_006_01.fgb to s3://nasa-cryo-scratch/h5cloud/flatgeobuf/ATL03_20211114142614_08111312_006_01.fgb
2023-08-11 19:52:47 5702149992 ATL03_20181120182818_08110112_006_02.fgb
2023-08-11 19:52:47 9048510952 ATL03_20190219140808_08110212_006_02.fgb
2023-08-11 19:52:47 5622152472 ATL03_20200217204710_08110612_006_01.fgb
2023-08-11 19:52:47 10721045272 ATL03_20211114142614_08111312_006_01.fgb
2