# Convert HDF5 to FlatGeobuf using H5DataFrame

This notebook loads data from a nested ICESat-2 ATL03 HDF5 file
and converts it to the [FlatGeobuf](https://flatgeobuf.org) format.

References:
- https://github.com/MAAP-Project/gedi-subsetter/blob/0.6.0/src/gedi_subset/gedi_utils.py#L139-L381

In [1]:
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd
import h5py
import s3fs

try:
    from gedi_subset.h5frame import H5DataFrame
except ImportError:
    !pip install git+https://github.com/MAAP-Project/gedi-subsetter.git@0.6.0
    from gedi_subset.h5frame import H5DataFrame

In [2]:
gpd.show_versions()


SYSTEM INFO
-----------
python     : 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
executable : /srv/conda/envs/notebook/bin/python
machine    : Linux-5.10.167-147.601.amzn2.x86_64-x86_64-with-glibc2.27

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.11.2
GEOS lib   : None
GDAL       : 3.7.0
GDAL data dir: /srv/conda/envs/notebook/share/gdal
PROJ       : 9.2.1
PROJ data dir: /srv/conda/envs/notebook/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.12.1
numpy      : 1.23.5
pandas     : 1.5.1
pyproj     : 3.6.0
shapely    : 2.0.1
fiona      : 1.9.4
geoalchemy2: None
geopy      : None
matplotlib : 3.6.2
mapclassify: 2.5.0
pygeos     : 0.14
pyogrio    : None
psycopg2   : 2.9.6 (dt dec pq3 ext lo64)
pyarrow    : 12.0.1
rtree      : 1.0.1


## List files to convert

In [3]:
!aws s3 ls s3://nasa-cryo-scratch/h5cloud/original/

2023-08-08 23:45:34 7754735138 ATL03_20181120182818_08110112_006_02.h5
2023-08-08 23:47:04 6997123664 ATL03_20190219140808_08110212_006_02.h5
2023-08-08 23:47:04 6925710500 ATL03_20200217204710_08110612_006_01.h5
2023-08-08 23:47:04 8392279594 ATL03_20211114142614_08111312_006_01.h5
2023-08-08 23:47:04 7954039827 ATL03_20230211164520_08111812_006_01.h5


In [4]:
s3 = s3fs.S3FileSystem(anon=False)

## Read HDF5 variables into H5DataFrame

In [5]:
s3url = "s3://nasa-cryo-scratch/h5cloud/original/ATL03_20230211164520_08111812_006_01.h5"
h5 = h5py.File(name=s3.open(path=s3url, mode="rb"))

In [6]:
# Print top-level groups
print(h5.keys())

# Print all nested groups (very slow)
# h5.visit(func=lambda name: print(name))

# Print specific groups (slow)
h5["gt2l/heights"].visit(func=lambda name: print(name))

<KeysViewHDF5 ['METADATA', 'ancillary_data', 'atlas_impulse_response', 'ds_surf_type', 'ds_xyz', 'gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r', 'orbit_info', 'quality_assessment']>
delta_time
dist_ph_across
dist_ph_along
h_ph
lat_ph
lon_ph
pce_mframe_cnt
ph_id_channel
ph_id_count
ph_id_pulse
quality_ph
signal_conf_ph
weight_ph
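
The top-level keys printed above include six ground-track groups (`gt1l` to `gt3r`). As a minimal sketch, the snippet below picks those groups out of the key list and builds the corresponding `heights` paths. The helper name `beam_groups` is hypothetical, and the sketch assumes every beam has a `heights` subgroup like `gt2l` does, which this notebook only confirms for `gt2l`.

```python
import re
from collections.abc import Iterable

# Top-level keys copied from the output above.
top_level = [
    "METADATA", "ancillary_data", "atlas_impulse_response", "ds_surf_type",
    "ds_xyz", "gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r",
    "orbit_info", "quality_assessment",
]


def beam_groups(keys: Iterable[str]) -> list[str]:
    """Return the ground-track group names (gt1l ... gt3r), sorted.

    Hypothetical helper, not part of h5py or gedi_subset.
    """
    return sorted(k for k in keys if re.fullmatch(r"gt[1-3][lr]", k))


# Assumption: each beam group contains a 'heights' subgroup, as gt2l does.
heights_paths = [f"{beam}/heights" for beam in beam_groups(top_level)]
print(heights_paths)
```

With the open file from this notebook, `beam_groups(h5.keys())` should give the same list, since `h5.keys()` iterates over the group names as strings.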


### Read in just the `gt2l/heights` group

In [7]:
df = H5DataFrame(group=h5["gt2l/heights"])
df

`df` looks like it has no columns because `H5DataFrame` is lazy:
a variable (e.g. `h_ph`) is only read when you access it explicitly.

In [8]:
df["h_ph"]

0           331.053101
1           262.945648
2           271.216248
3           252.607193
4           322.731079
               ...    
68550685    371.084259
68550686    332.404785
68550687    451.218292
68550688    466.256348
68550689    347.798798
Name: h_ph, Length: 68550690, dtype: float32
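
To illustrate the lazy behaviour described above, here is a small, self-contained sketch of the general idea: a mapping that reads a column only on first access and caches it afterwards. The `LazyColumns` class and `make_loader` helper are hypothetical stand-ins written for this example; this is not how `H5DataFrame` is actually implemented, and the values are just the first rows shown in this notebook's outputs.

```python
from collections.abc import Callable, Iterator, Mapping


class LazyColumns(Mapping):
    """Hypothetical lazy container: a column is read on first access only."""

    def __init__(self, loaders: dict[str, Callable[[], list[float]]]) -> None:
        self._loaders = loaders  # column name -> zero-argument read function
        self._cache: dict[str, list[float]] = {}

    def __getitem__(self, name: str) -> list[float]:
        if name not in self._cache:  # read once, then serve from the cache
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._loaders)

    def __len__(self) -> int:
        return len(self._loaders)

    @property
    def loaded(self) -> list[str]:
        """Names of the columns that have actually been read so far."""
        return sorted(self._cache)


reads: list[str] = []  # records every simulated read


def make_loader(name: str, values: list[float]) -> Callable[[], list[float]]:
    """Return a function that simulates reading one variable from disk."""
    def load() -> list[float]:
        reads.append(name)
        return values
    return load


columns = LazyColumns({
    "h_ph": make_loader("h_ph", [331.053101, 262.945648]),
    "lat_ph": make_loader("lat_ph", [-79.00011, -79.00010]),
})

print(columns.loaded)  # []
columns["h_ph"]        # first access triggers the read
columns["h_ph"]        # second access is served from the cache
print(columns.loaded)  # ['h_ph']
```

The container knows which columns exist without reading any of them, which is why an empty-looking display is expected until a variable is requested.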

### Create geometry column

In [9]:
%%time
geometry = gpd.points_from_xy(x=df["lon_ph"], y=df["lat_ph"])

CPU times: user 36.8 s, sys: 6.36 s, total: 43.2 s
Wall time: 1min 16s


In [10]:
geometry

<GeometryArray>
[    <POINT (-59.903 -79)>,     <POINT (-59.903 -79)>,
     <POINT (-59.903 -79)>,     <POINT (-59.903 -79)>,
     <POINT (-59.903 -79)>,     <POINT (-59.903 -79)>,
     <POINT (-59.903 -79)>,     <POINT (-59.903 -79)>,
     <POINT (-59.903 -79)>,     <POINT (-59.903 -79)>,
 ...
 <POINT (-69.879 -50.008)>, <POINT (-69.879 -50.008)>,
 <POINT (-69.879 -50.008)>, <POINT (-69.879 -50.008)>,
 <POINT (-69.879 -50.008)>, <POINT (-69.879 -50.008)>,
 <POINT (-69.879 -50.008)>, <POINT (-69.879 -50.008)>,
 <POINT (-69.879 -50.008)>, <POINT (-69.879 -50.008)>]
Length: 68550690, dtype: geometry

### Create `geopandas.GeoDataFrame`

In [11]:
gdf = gpd.GeoDataFrame(data=df[["h_ph"]], geometry=geometry)

In [12]:
gdf

|  | h_ph | geometry |
|---|---|---|
| 0 | 331.053101 | POINT (-59.90285 -79.00011) |
| 1 | 262.945648 | POINT (-59.90285 -79.00010) |
| 2 | 271.216248 | POINT (-59.90285 -79.00010) |
| 3 | 252.607193 | POINT (-59.90285 -79.00010) |
| 4 | 322.731079 | POINT (-59.90286 -79.00010) |
| ... | ... | ... |
| 68550685 | 371.084259 | POINT (-69.87862 -50.00811) |
| 68550686 | 332.404785 | POINT (-69.87863 -50.00811) |
| 68550687 | 451.218292 | POINT (-69.87859 -50.00810) |
| 68550688 | 466.256348 | POINT (-69.87859 -50.00810) |


## Store as FlatGeobuf

In [13]:
granule_name: str = os.path.basename(os.path.splitext(s3url)[0])
filename: str = f"{granule_name}_gt2l_heights_h_ph.fgb"
filename

'ATL03_20230211164520_08111812_006_01_gt2l_heights_h_ph.fgb'
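
The naming scheme above could be applied to every granule in the bucket listing. The following is a sketch only: `fgb_name` is a hypothetical helper, and the group and variable are passed in as parameters rather than hard-coded, which is an assumption about how one might generalise this cell.

```python
import posixpath


def fgb_name(s3url: str, group: str, variable: str) -> str:
    """Build '<granule>_<group>_<variable>.fgb' from an S3 URL.

    Hypothetical helper; slashes in the group path become underscores.
    """
    granule_name = posixpath.splitext(posixpath.basename(s3url))[0]
    return f"{granule_name}_{group.replace('/', '_')}_{variable}.fgb"


prefix = "s3://nasa-cryo-scratch/h5cloud/original"
# Granule names copied from the `aws s3 ls` output earlier in this notebook.
granules = [
    "ATL03_20181120182818_08110112_006_02.h5",
    "ATL03_20190219140808_08110212_006_02.h5",
    "ATL03_20200217204710_08110612_006_01.h5",
    "ATL03_20211114142614_08111312_006_01.h5",
    "ATL03_20230211164520_08111812_006_01.h5",
]

names = [
    fgb_name(f"{prefix}/{granule}", group="gt2l/heights", variable="h_ph")
    for granule in granules
]
for name in names:
    print(name)
```

`posixpath` is used instead of `os.path` so that the forward slashes in S3 URLs are handled the same way on every platform.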

In [None]:
%%time
gdf.to_file(filename=filename, driver="FlatGeobuf")

### Check that FlatGeobuf can be loaded back into `geopandas.GeoDataFrame`

In [None]:
gpd.read_file(filename=filename)