# Preprocessing Time Analysis for Different File Formats

## Introduction

ICESat-2's rich, hierarchical photon data is primarily stored in HDF5, which brings both advantages and challenges. HDF5 files support diverse scientific workflows because they are self-describing and can hold heterogeneous data. In cloud object storage, however, their internal layout often means that users end up reading an entire ATL03 HDF5 file even when only a subset is needed. This contrasts with raster data formats such as cloud-optimized GeoTIFFs, which allow efficient partial data access.

To optimize data access, we explore several file formats and preprocessing techniques. These include traditional methods as well as newer approaches such as kerchunk, a library that indexes the chunks of existing files so they can be read efficiently in cloud environments.

## Objective

In this section, we analyze the preprocessing time required for different file formats, including kerchunk, original h5, repacked h5, kerchunk of repacked h5, flatgeobuf, and geoparquet. These benchmarks are intended to support data providers who are considering delivering HDF5 products in alternative formats.

## Server Context

The processing time will depend on the server you are using to generate the files. The server used to generate these benchmarks was one of the [CryoCloud](https://book.cryointhecloud.com/intro.html) JupyterHub instances. It is located in AWS us-west-2 and can be configured with different CPU and RAM options.

### Memory Information

In [60]:
import psutil

# Get Memory usage
memory_info = psutil.virtual_memory()

print(f"Total Memory: {memory_info.total / (1024**3):.2f} GB")
print(f"Available Memory: {memory_info.available / (1024**3):.2f} GB")
print(f"Used Memory: {memory_info.used / (1024**3):.2f} GB")
print(f"Memory Percentage in use: {memory_info.percent}%")
print("\nNOTE: JupyterHub is a multi-user server for Jupyter notebooks, and the above method gives the memory status of the whole server. If you have many users running tasks simultaneously, the available memory will be affected by all the tasks collectively.")

Total Memory: 498.38 GB
Available Memory: 490.67 GB
Used Memory: 4.48 GB
Memory Percentage in use: 1.5%

NOTE: JupyterHub is a multi-user server for Jupyter notebooks, and the above method gives the memory status of the whole server. If you have many users running tasks simultaneously, the available memory will be affected by all the tasks collectively.


### CPU Information

In [57]:
# Number of logical CPUs (including hardware threads)
logical_cpus = psutil.cpu_count(logical=True)
print(f"Number of logical CPUs: {logical_cpus}")

# Number of physical CPUs (or cores)
physical_cpus = psutil.cpu_count(logical=False)
print(f"Number of physical CPUs: {physical_cpus}")

# CPU frequencies
cpu_freq = psutil.cpu_freq()
print(f"Current Frequency: {cpu_freq.current}Mhz")
print(f"Max Frequency: {cpu_freq.max}Mhz")
print(f"Min Frequency: {cpu_freq.min}Mhz")

Number of logical CPUs: 64
Number of physical CPUs: 32
Current Frequency: 3106.564656249999Mhz
Max Frequency: 0.0Mhz
Min Frequency: 0.0Mhz


## Generating Kerchunk sidecar for ATL03 files

The Kerchunk library creates an accompanying metadata file (much like a motorbike sidecar that rides along with the primary file). A kerchunk sidecar contains the metadata, byte-range locations, compression information, and other details that allow efficient, chunked access to the main data file without reading the whole file.
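
To make the sidecar idea concrete, here is a minimal sketch of what a kerchunk reference set looks like once loaded in Python. The layout follows kerchunk's version 1 reference format, but the dataset name `heights/h_ph`, the file URL, the offsets and lengths, and the `summarize_refs` helper are all hypothetical and only illustrate the structure.

```python
import json

# A toy reference set in kerchunk's "version 1" layout.
# Keys are Zarr store keys; values are either inline metadata (strings)
# or [url, offset, length] triples pointing into the original file.
# All names, offsets and lengths below are made up for illustration.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "heights/h_ph/.zarray": json.dumps(
            {"shape": [200], "chunks": [100], "dtype": "<f4",
             "compressor": None, "fill_value": None, "filters": None,
             "order": "C", "zarr_format": 2}
        ),
        "heights/h_ph/0": ["s3://example-bucket/granule.h5", 4096, 400],
        "heights/h_ph/1": ["s3://example-bucket/granule.h5", 4496, 400],
    },
}


def summarize_refs(reference_set):
    """Count chunk references and the bytes they cover."""
    chunks = [v for v in reference_set["refs"].values() if isinstance(v, list)]
    total_bytes = sum(length for _url, _offset, length in chunks)
    return len(chunks), total_bytes


n_chunks, n_bytes = summarize_refs(refs)
print(f"{n_chunks} chunk references covering {n_bytes} bytes")
```

A reader only has to issue range requests for the triples it actually touches, which is why the whole file never needs to be downloaded.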

## Setting Up the Environment for Kerchunk Processing with ATL03 Files

In [3]:
%%capture
%pip install git+https://github.com/fsspec/kerchunk
# You may need to restart the kernel after installing kerchunk.

In [2]:
from kerchunk.hdf import SingleHdf5ToZarr
import fsspec
from pathlib import Path

import os
import ujson
import dask

In [4]:
# Initialize connection to the AWS S3 filesystem for reading files.
fs_read_files = fsspec.filesystem('s3')

# List the files that match the given S3 path. Here a single file is used.
flist = fs_read_files.glob('s3://nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5')

def gen_json(file_url):
    """
    Generate a kerchunk JSON sidecar describing the chunked structure of an HDF5 file.

    Args:
    - file_url (str): URL to the HDF5 file hosted in an S3 bucket.

    Returns:
    - str: Name of the JSON file written to the current directory.
    """
    
    # Configuration for opening the file: Avoid caching to reduce memory usage.
    so = dict(mode='rb', default_fill_cache=False, default_cache_type='first')
    
    # Initialize connection to the S3 filesystem.
    fs = fsspec.filesystem('s3')
    
    # Initialize connection to the local filesystem for saving JSON outputs.
    fs_local = fsspec.filesystem('')  
    
    # Open and process the HDF5 file from S3.
    with fs.open(file_url, **so) as infile:
        print(f"Processing: {file_url}")
        
        # Scan the HDF5 file and collect chunk references.
        # Chunks smaller than inline_threshold bytes are stored inline in the output.
        h5chunks = SingleHdf5ToZarr(infile, file_url, inline_threshold=300)
        
        # Build the output name from the file name (without extension) and the
        # bucket name, which is element 2 of an s3:// URL split on '/'.
        variable = file_url.split('/')[-1].split('.')[0]
        bucket = file_url.split('/')[2]
        outf = f'{bucket}_{variable}.json'
        
        # Save the references as a JSON file locally.
        with fs_local.open(f"./{outf}", 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())

    # Return the sidecar name so callers can locate the file.
    return outf

# Display the list of files to be processed.
flist


['nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5']

### Function Internals

- The function sets up the S3 filesystem and a local filesystem.
- It opens the provided HDF5 file from S3 and processes it with `SingleHdf5ToZarr`, which scans the file and builds a set of Zarr-compatible references (metadata plus chunk byte ranges) suited to cloud access.
- The resulting reference set is then saved locally as a JSON file in the current working directory.
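
The sidecar's file name is built by splitting the URL on `/`. For an `s3://` URL, element 2 of that split is the bucket name, which the sketch below makes explicit using only the standard library. `sidecar_name` is a hypothetical helper for illustration, not part of kerchunk or of the function above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def sidecar_name(file_url):
    """Build '<bucket>_<file stem>.json' from an s3:// URL (hypothetical helper)."""
    parsed = urlparse(file_url)
    bucket = parsed.netloc                   # the bucket part of the URL
    stem = PurePosixPath(parsed.path).stem   # file name without its last suffix
    return f"{bucket}_{stem}.json"


url = "s3://nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5"
name = sidecar_name(url)
print(name)
```

Parsing the URL rather than indexing into a split keeps the naming stable if the key prefix gains or loses a directory level.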


### Preprocessing Time Calculation

In [6]:
import time

original_data = 's3://nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5'

# file formats or files to process
file_url = original_data

# Start the timer
start_time = time.time()

# Call the preprocessing function
kerchunked_json = gen_json(file_url)

# End the timer and store the result
end_time = time.time()
preprocess_time_kerchunk = end_time - start_time

print(f"Time taken for preprocessing using Kerchunk: {preprocess_time_kerchunk:.2f} seconds")


Processing: s3://nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5
Time taken for preprocessing using Kerchunk: 1291.09 seconds
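
The start/stop timing pattern used here recurs for every format in this notebook. One way to avoid repeating it is a small context manager, sketched below under the assumption that wall-clock time is the quantity of interest. `timed` and the `timings` dictionary are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so failed runs remain visible.
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    time.sleep(0.01)  # stand-in for a call such as gen_json(file_url)

print(f"demo took {timings['demo']:.3f} seconds")
```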


## Generating Geoparquet Data Samples from ATL03 Samples

In this sub-section, ICESat-2 ATL03 data samples are converted to the geoparquet format. Geoparquet builds on Apache Parquet, a columnar storage format designed for large-scale data processing, and adds a standard way to store geospatial data. Converting ATL03 samples to geoparquet can make data access and processing more efficient, especially in cloud-based environments.

### Using SlideRule

SlideRule is a processing engine designed to work with ICESat-2 data. In this sub-section, we use SlideRule's geoparquet export option, which lets you request the desired ATL03 granules by specifying their IDs.

SlideRule writes the result to the S3 location given in the `output` parameter.

#### Reference Guide

For a detailed walkthrough of the process, especially the steps involving SlideRule, refer to the tutorial available at ICESat-2 Hackweek [SlideRule Tutorial](https://icesat-2-2023.hackweek.io/tutorials/sliderule/parquet-s3.html). This guide provides comprehensive instructions, tips, and best practices for working with ICESat-2 data and the geoparquet format.

In [8]:
os.environ['USE_PYGEOS'] = '0'
from sliderule import icesat2, earthdata, io
import boto3

In [9]:
# Initialize the connection to SlideRule service
icesat2.init("slideruleearth.io")

In [23]:
# Granule to process. A single granule is used for this test, but more can be processed.
granule = 'ATL03_20190219140808_08110212_006_02.h5'

In [24]:
# Fetch temporary AWS credentials; SlideRule needs them to write to S3.
client = boto3.client('sts')
with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)

ACCESS_KEY_ID = response['Credentials']['AccessKeyId']
SECRET_ACCESS_KEY_ID = response['Credentials']['SecretAccessKey']
SESSION_TOKEN = response['Credentials']['SessionToken']


In [25]:
# Function to convert a given ATL03 granule to geoparquet format and save to S3
def get_gpq(granule):
    asset = "icesat2"
    # Define the S3 path for the output geoparquet file
    output = f"s3://nasa-cryo-scratch/h5cloud/geoparquet/{granule}.gpq"
    
    # Configuration parameters for the conversion process
    params = {
        "output": {
            "path": output,
            "format": "parquet",
            "open_on_complete": False,
            "region": "us-west-2",
            "credentials": {
                 "aws_access_key_id": ACCESS_KEY_ID,
                 "aws_secret_access_key": SECRET_ACCESS_KEY_ID,
                 "aws_session_token": SESSION_TOKEN
             }
        }
    }
    
    # Convert ATL03 granule to geoparquet using the SlideRule service
    status = icesat2.atl03s(parm=params, resource=granule, asset=asset)
    
    return status

In [26]:
start_time = time.time()

# Call the SlideRule function
result = get_gpq(granule)

# End the timer and store the result
end_time = time.time()
preprocess_time_geoparquet_via_sliderule = end_time - start_time
print(f"Time taken for creating Geoparquet via SlideRule: {preprocess_time_geoparquet_via_sliderule:.2f} seconds")

Time taken for creating Geoparquet via SlideRule: 278.76 seconds


In [27]:
print(result)

s3://nasa-cryo-scratch/h5cloud/geoparquet/ATL03_20190219140808_08110212_006_02.h5.gpq


In [28]:
import geopandas as gpd
# Make sure it worked
gpd.read_parquet(result)

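ArrowInvalid: Could not open Parquet input source 'nasa-cryo-scratch/h5cloud/geoparquet/ATL03_20190219140808_08110212_006_02.h5.gpq': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The check fails: the object at the output path could not be read as a Parquet file, so the export could not be verified and the timing above should be treated with caution.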
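
The error message refers to the Parquet magic bytes: a Parquet file begins and ends with the 4-byte marker `PAR1`. A quick way to check a file before handing it to a reader is sketched below. `looks_like_parquet` is a hypothetical helper that works on a local copy and deliberately ignores encrypted Parquet files; for an object in S3, the same two 4-byte reads could be done with range requests.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    size = os.path.getsize(path)
    if size < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demonstrate on two throwaway local files (not real Parquet data).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bin")
    bad = os.path.join(tmp, "bad.bin")
    with open(good, "wb") as f:
        f.write(b"PAR1" + b"\x00" * 16 + b"PAR1")
    with open(bad, "wb") as f:
        f.write(b"not a parquet file")
    good_result = looks_like_parquet(good)
    bad_result = looks_like_parquet(bad)

print(good_result, bad_result)
```

Passing this check does not prove the file is valid Parquet, but failing it reproduces the condition reported in the error.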

## Repacking h5 dataset

In [29]:
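import subprocess

def repack_hdf5(input_file, output_file, options=None):
    """Run h5repack and report whether it succeeded."""
    cmd = ["h5repack"] + list(options or []) + [input_file, output_file]
    try:
        # check=True raises if h5repack exits with a non-zero status.
        subprocess.run(cmd, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("Error occurred while repacking")
        return False
    return True
    
# Start the timer
start_time = time.time()

# Usage
# NOTE: h5repack normally expects a local file path, so passing an s3:// URL
# is a likely cause of the error reported below.
input_file = "s3://nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5"
output_file = "h5repacked.h5"
# -S PAGE selects the paged file space strategy; -G sets the page size in bytes.
repack_hdf5(input_file, output_file, options=["-S", "PAGE", "-G", "8000000"])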

# End the timer and store the result
end_time = time.time()
preprocess_time_repacking = end_time - start_time
print(f"Time taken to repack: {preprocess_time_repacking:.2f} seconds")

Error occurred while repacking
Time taken to repack: 0.05 seconds
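
A run that ends after 0.05 seconds with an error almost certainly never started repacking, so the timing is only meaningful once the preconditions hold. The sketch below checks two of them before any timer starts. It assumes `h5repack` is used without a remote-capable file driver, so the input must be a local file; `repack_preflight` is a hypothetical helper.

```python
import os
import shutil
from urllib.parse import urlparse


def repack_preflight(input_file, tool="h5repack"):
    """Return a list of problems that would make a repack run fail early."""
    problems = []
    if shutil.which(tool) is None:
        problems.append(f"{tool} not found on PATH")
    scheme = urlparse(input_file).scheme
    if scheme in ("s3", "http", "https"):
        problems.append(f"input is a remote URL ({scheme}://); download it first")
    elif not os.path.isfile(input_file):
        problems.append(f"input file does not exist: {input_file}")
    return problems


issues = repack_preflight(
    "s3://nasa-cryo-scratch/h5cloud/original/ATL03_20200217204710_08110612_006_01.h5"
)
for issue in issues:
    print("preflight:", issue)
```

An empty list means the checks passed, and the repack can then be timed as in the cell above.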


## Generating Kerchunk sidecar for h5repacked data

In this sub-section, we create a Kerchunk sidecar metadata file for an HDF5 file that has already been repacked. This combination offers the benefits of both techniques:

* Efficient Data Structure: After repacking with `h5repack`, the HDF5 file itself is laid out for faster read operations.
* Optimized Access: The Kerchunk sidecar allows chunked, efficient access, which is especially beneficial in cloud-based workflows. Applications can pull specific chunks of data without downloading or reading the entire file.

In [30]:
# Initialize connection to the AWS S3 filesystem for reading files.
fs_read_files = fsspec.filesystem('s3')

In [31]:
h5repack_data = 's3://nasa-cryo-scratch/h5cloud/h5repack/ATL03_20200217204710_08110612_006_01_repacked.h5'

# file formats or files to process
file_url = h5repack_data


# Start the timer
start_time = time.time()

# Call the preprocessing function
kerchunked_json = gen_json(file_url)

# End the timer and store the result
end_time = time.time()
preprocess_time_kerchunk_repacked = end_time - start_time

In [32]:
print(f"Time taken for generating a Kerchunk sidecar for an existing repacked file: {preprocess_time_kerchunk_repacked:.2f} seconds")

Time taken for generating a Kerchunk sidecar for an existing repacked file: 760.89 seconds


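## Comparing Times

In this section, we combine all the preprocessing times into one DataFrame for comparison. If you are considering a kerchunk sidecar for repacked HDF5 files, you would need to sum the repacking time and the kerchunk time. Two caveats apply to the numbers below: the repacking cell reported an error, so its 0.05 seconds does not reflect a completed repack, and the geoparquet time was measured on a different granule (`ATL03_20190219140808_08110212_006_02.h5`) than the kerchunk times.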

In [33]:
all_times = {
    "Kerchunk": preprocess_time_kerchunk,
    "Repacking": preprocess_time_repacking,
    "Kerchunk of Repacked": preprocess_time_kerchunk_repacked,
    "Geoparquet": preprocess_time_geoparquet_via_sliderule,
}

In [46]:
# Convert each timing from seconds to minutes, rounded to three decimals.
all_times_in_minutes = {key: round(value / 60, 3) for key, value in all_times.items()}
    
all_times_in_minutes

{'Kerchunk': 21.518,
 'Repacking': 0.001,
 'Kerchunk of Repacked': 12.682,
 'Geoparquet': 4.646}

In [47]:
import pandas as pd

all_times_df = pd.DataFrame(all_times_in_minutes, index=[0])
all_times_df

| | Kerchunk | Repacking | Kerchunk of Repacked | Geoparquet |
|---|---|---|---|---|
| 0 | 21.518 | 0.001 | 12.682 | 4.646 |
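
The combined cost of repacking and then generating a sidecar is the sum of the two steps. The sketch below works through that arithmetic with the timings printed earlier in this notebook; `times_s`, `to_minutes`, and the `Repack + Kerchunk` label are hypothetical names. Because the repacking cell reported an error, its 0.05 seconds contributes almost nothing here, and the combined figure should be read as a lower bound rather than a measured total.

```python
# Timings in seconds, copied from the cell outputs in this notebook.
times_s = {
    "Kerchunk": 1291.09,
    "Repacking": 0.05,  # from a run that reported an error
    "Kerchunk of Repacked": 760.89,
    "Geoparquet": 278.76,
}


def to_minutes(seconds, ndigits=3):
    """Convert seconds to minutes, rounded for display."""
    return round(seconds / 60, ndigits)


# Sum the two steps needed for a kerchunk sidecar on a repacked file.
combined_s = times_s["Repacking"] + times_s["Kerchunk of Repacked"]
times_min = {name: to_minutes(value) for name, value in times_s.items()}
times_min["Repack + Kerchunk"] = to_minutes(combined_s)

for name, minutes in times_min.items():
    print(f"{name:<22} {minutes:>8.3f} min")
```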
