# Reducing ICESat-2 data files

Preparing data for large-scale processing and analysis 

![Splash](images/splash.png)

* Select files of interest (segment and time)
* Select area of interest (subset lon/lat)
* Reduce selected files with variables of interest
* Filter data, separate tracks into asc/des, reproject coords
* Read/Process each file in parallel
* Handle and plot millions of data points 

In [None]:
%matplotlib widget
#%matplotlib inline

## How ICESat-2 files are organized spatially 

ICESat-2 ground tracks are split into granules (individual files)

Granules are then grouped into latitudinal bands (segments)

![Segments](images/segments.png "Latitudinal bands (Segments)")

## File naming convention

`ATL06_20181120202321_08130101_001_01.h5`

`[ATL06]_[yyyy][mmdd][hhmmss]_[RGT][cc][ss]_[rrr]_[vv].h5`

![Naming](images/name-convention.png)  
Source: Figure from Ben Smith
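
The naming template above can be turned into a small parser, which is handy for selecting files by track, cycle or segment before reading them. The sketch below is a minimal illustration: the helper name `parse_granule_name` and the group names are assumptions, chosen to mirror the labels in the template.

```python
import re
from datetime import datetime

# Pattern follows the template shown above; group names are illustrative.
ATL06_PATTERN = re.compile(
    r"(?P<product>ATL\d{2})_"
    r"(?P<datetime>\d{14})_"
    r"(?P<rgt>\d{4})(?P<cycle>\d{2})(?P<segment>\d{2})_"
    r"(?P<release>\d{3})_"
    r"(?P<version>\d{2})\.h5$"
)


def parse_granule_name(fname):
    """Split a granule file name into its labelled fields (hypothetical helper)."""
    match = ATL06_PATTERN.search(str(fname))
    if match is None:
        raise ValueError(f"unexpected file name: {fname}")
    fields = match.groupdict()
    # Convert the 14-digit time stamp into a datetime object
    fields["datetime"] = datetime.strptime(fields["datetime"], "%Y%m%d%H%M%S")
    return fields


info = parse_granule_name("ATL06_20181120202321_08130101_001_01.h5")
print(info["rgt"], info["cycle"], info["segment"], info["datetime"])
```

With the fields separated, keeping only a given segment or cycle becomes a simple comparison such as `info["segment"] == "01"`.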

## Download data files

First let's define the product, area and time interval

In [None]:
from pathlib import Path

# Our data folder 
data_home = Path('/home/jovyan/tutorial-data/land_ice_applications/PIG_ATL06')

# Create folder if it doesn't exist
data_home.mkdir(exist_ok=True)

In [None]:
from icepyx import icesat2data as ipd

short_name = 'ATL06'
spatial_extent = [-102, -76, -98, -74.5]  # PIG
date_range = ['2018-10-14','2020-04-01']

# spatial_extent = [148, -81, 162, -80]  # Byrd
# date_range = ['2018-10-14','2018-12-22']

region = ipd.Icesat2Data(short_name, spatial_extent, date_range)

Query available data files without downloading

In [None]:
print('product:    ', region.dataset)
print('dates:      ', region.dates)
print('start time: ', region.start_time)
print('end time:   ', region.end_time)
print('version:    ', region.dataset_version)
print('extent:     ', region.spatial_extent)

print('\nDATA:')
print('\n'.join([str(item) for item in region.avail_granules().items()]))

region.visualize_spatial_extent()

Log in to Earthdata and download the data files

In [None]:
name = 'fspaolo'
email = 'fspaolo@gmail.com'

# Only download if data folder is empty
if not list(data_home.glob('*.h5')):
    region.earthdata_login(name, email)
    region.download_granules(data_home)

Let's check we got the files we wanted 

In [None]:
files = list(data_home.glob('*.h5'))

for f in files[:10]: print(f)
print('Total number of files:', len(files))

## Reducing ICESat-2 files

> **NOTE:** 
> - This is neither the only nor the best way to handle ICESat-2 data files.
> - This is *one* way that works well for large-scale processing (e.g. full continent) on parallel machines (e.g. HPC clusters).
> - The idea is to (a) simplify the I/O of a complex workflow and (b) take advantage of embarrassingly parallel processing.

Let's check the ICESat-2 file structure (!)

In [None]:
!h5ls -r {files[0]} 

Let's code a simple reader that:

- Selects variables of interest `(x, y, t, h, ...)`  
- Filters data points based on quality flag and bbox   
- Separates data into beams and ascending/descending tracks  
- Saves data to a simpler HDF5 structure (NOTE: redundancy vs. efficiency)

Some utility functions

In [None]:
import numpy as np
import pyproj
from astropy.time import Time

def gps2dyr(time):
    """Convert GPS time to decimal years."""
    return Time(time, format='gps').decimalyear


def orbit_type(time, lat, tmax=1):
    """Separate tracks into ascending and descending.
    
    Splits the data at the point of maximum absolute latitude
    and tests whether lat increases or decreases w/time in
    each part. NOTE: `tmax` is currently unused.
    """
    tracks = np.zeros(lat.shape)  # generate track segment
    tracks[0:np.argmax(np.abs(lat))] = 1  # set values for segment
    is_asc = np.zeros(tracks.shape, dtype=bool)  # output index array

    # Loop through individual segments
    for track in np.unique(tracks):
    
        i_track, = np.where(track == tracks)  # get all pts from seg
    
        if len(i_track) < 2: continue
    
        # Test if lat increases (asc) or decreases (des) w/time
        i_min = time[i_track].argmin()
        i_max = time[i_track].argmax()
        lat_diff = lat[i_track][i_max] - lat[i_track][i_min]
    
        # Determine track type
        if lat_diff > 0:  is_asc[i_track] = True
    
    return is_asc


def transform_coord(proj1, proj2, x, y):
    """Transform coordinates from proj1 to proj2 (EPSG num).

    Example EPSG projections:
        Geodetic (lon/lat): 4326
        Polar Stereo AnIS (x/y): 3031
        Polar Stereo GrIS (x/y): 3413
    """
    # Build a transformer between the two EPSG codes;
    # always_xy=True keeps the (lon, lat) / (x, y) axis order
    transformer = pyproj.Transformer.from_crs(
        "EPSG:" + str(proj1), "EPSG:" + str(proj2), always_xy=True)
    return transformer.transform(x, y)  # convert
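
To make the time conversion less of a black box, here is a rough standard-library sketch of what happens to an ATL06 `delta_time` value. It is not the `astropy` implementation used above. It assumes the epoch offset stored in `/ancillary_data/atlas_sdp_gps_epoch` is 1198800018 s (2018-01-01 00:00:00 UTC) and that the GPS-UTC offset is a constant 18 s, which only holds for dates after 2017-01-01 and until another leap second is introduced. The function names are illustrative.

```python
from datetime import datetime, timedelta

# Assumption: usual value of /ancillary_data/atlas_sdp_gps_epoch (GPS seconds)
ATLAS_SDP_GPS_EPOCH = 1198800018.0
GPS_EPOCH = datetime(1980, 1, 6)  # start of GPS time
GPS_MINUS_UTC = 18  # assumed constant leap-second offset (see text)


def delta_time_to_utc(delta_time):
    """Convert ATL06 delta_time (s) to an approximate UTC datetime."""
    t_gps = ATLAS_SDP_GPS_EPOCH + delta_time  # GPS seconds, as in the reader
    return GPS_EPOCH + timedelta(seconds=t_gps - GPS_MINUS_UTC)


def decimal_year(date):
    """Return the year plus the elapsed fraction of that calendar year."""
    start = datetime(date.year, 1, 1)
    end = datetime(date.year + 1, 1, 1)
    return date.year + (date - start) / (end - start)


utc = delta_time_to_utc(0.0)
print(utc, decimal_year(utc))
```

For real processing, `gps2dyr` remains the safer choice because `astropy` handles leap seconds properly.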


The simple reader

In [None]:
import h5py
import numpy as np

def read_atl06(fname, outdir='data', bbox=None):
    """Read one ATL06 file and output 6 reduced files. 
    
    Extract variables of interest and separate the ATL06 file 
    into each beam (ground track) and ascending/descending orbits.
    """

    # Each beam is a group
    group = ['/gt1l', '/gt1r', '/gt2l', '/gt2r', '/gt3l', '/gt3r']

    # Loop through beams
    for g in group:
    
        #-----------------------------------#
        # 1) Read in data for a single beam #
        #-----------------------------------#
        
        data = {}
    
        try:
            # Load vars into memory (include as many as you want)
            with h5py.File(fname, 'r') as fi:
                
                data['lat'] = fi[g+'/land_ice_segments/latitude'][:]
                data['lon'] = fi[g+'/land_ice_segments/longitude'][:]
                data['h_li'] = fi[g+'/land_ice_segments/h_li'][:]
                data['s_li'] = fi[g+'/land_ice_segments/h_li_sigma'][:]
                data['t_dt'] = fi[g+'/land_ice_segments/delta_time'][:]
                data['q_flag'] = fi[g+'/land_ice_segments/atl06_quality_summary'][:]
                data['s_fg'] = fi[g+'/land_ice_segments/fit_statistics/signal_selection_source'][:]
                data['snr'] = fi[g+'/land_ice_segments/fit_statistics/snr_significance'][:]
                data['h_rb'] = fi[g+'/land_ice_segments/fit_statistics/h_robust_sprd'][:]
                data['dac'] = fi[g+'/land_ice_segments/geophysical/dac'][:]
                data['f_sn'] = fi[g+'/land_ice_segments/geophysical/bsnow_conf'][:]
                data['dh_fit_dx'] = fi[g+'/land_ice_segments/fit_statistics/dh_fit_dx'][:]
                data['tide_earth'] = fi[g+'/land_ice_segments/geophysical/tide_earth'][:]
                data['tide_load'] = fi[g+'/land_ice_segments/geophysical/tide_load'][:]
                data['tide_ocean'] = fi[g+'/land_ice_segments/geophysical/tide_ocean'][:]
                data['tide_pole'] = fi[g+'/land_ice_segments/geophysical/tide_pole'][:]
                
                rgt = fi['/orbit_info/rgt'][:]                           # single value
                t_ref = fi['/ancillary_data/atlas_sdp_gps_epoch'][:]     # single value
                beam_type = fi[g].attrs["atlas_beam_type"].decode()      # strong/weak (str)
                spot_number = fi[g].attrs["atlas_spot_number"].decode()  # number (str)
                
        except Exception as err:
            print('skipping group:', g, '->', err)
            print('in file:', fname)
            continue
            
        #------------------------------------------------#
        # 2) Filter data according to region and quality #
        #------------------------------------------------#
        
        # Select a region of interest
        if bbox:
            lonmin, lonmax, latmin, latmax = bbox
            bbox_mask = (data['lon'] >= lonmin) & (data['lon'] <= lonmax) & \
                        (data['lat'] >= latmin) & (data['lat'] <= latmax)
        else:
            bbox_mask = np.ones_like(data['lat'], dtype=bool)  # get all
            
        # Only keep good data (quality flag + threshold + bbox)
        mask = (data['q_flag'] == 0) & (np.abs(data['h_li']) < 10e3) & bbox_mask
        
        # If no data left, skip
        if not mask.any(): continue
        
        # Update data variables
        for k, v in data.items(): data[k] = v[mask]
            
        #----------------------------------------------------#
        # 3) Convert time, separate tracks, reproject coords #
        #----------------------------------------------------#
        
        # Time in GPS seconds (secs since Jan 6, 1980)
        t_gps = t_ref + data['t_dt']

        # Time in decimal years
        t_year = gps2dyr(t_gps)

        # Determine orbit type
        is_asc = orbit_type(t_year, data['lat'])
        
        # Geodetic lon/lat -> Polar Stereo x/y
        x, y = transform_coord(4326, 3031, data['lon'], data['lat'])
        
        data['x'] = x
        data['y'] = y
        data['t_gps'] = t_gps
        data['t_year'] = t_year
        data['is_asc'] = is_asc
        
        #-----------------------#
        # 4) Save selected data #
        #-----------------------#
        
        # Define output dir and file
        outdir = Path(outdir)    
        fname = Path(fname)
        outdir.mkdir(exist_ok=True)
        outfile = outdir / fname.name.replace('.h5', '_' + g[1:] + '.h5')
        
        # Save variables
        with h5py.File(outfile, 'w') as fo:
            for k, v in data.items(): fo[k] = v
            print('out ->', outfile)
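
One detail that is easy to trip over: the reader unpacks `bbox` as `(lonmin, lonmax, latmin, latmax)`, while the PIG extent used for the download appears to be ordered `[lonmin, latmin, lonmax, latmax]`. The sketch below, with hypothetical helper names, reorders one into the other and applies the same inclusive test as the reader to a single point.

```python
def extent_to_bbox(spatial_extent):
    """Reorder [lonmin, latmin, lonmax, latmax] -> (lonmin, lonmax, latmin, latmax)."""
    lonmin, latmin, lonmax, latmax = spatial_extent
    return lonmin, lonmax, latmin, latmax


def in_bbox(lon, lat, bbox):
    """Inclusive lon/lat test, mirroring the reader's bbox mask for one point."""
    lonmin, lonmax, latmin, latmax = bbox
    return lonmin <= lon <= lonmax and latmin <= lat <= latmax


# PIG extent from the download step (assumed lower-left / upper-right order)
bbox = extent_to_bbox([-102, -76, -98, -74.5])
print(bbox)
print(in_bbox(-100.0, -75.0, bbox))  # point inside the box
print(in_bbox(-100.0, -70.0, bbox))  # point north of the box
```

Note that the reader compares `bbox` against longitude and latitude, so the limits must be geographic coordinates rather than projected x/y.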
                

## Simple parallelization

* If your problem is embarrassingly parallel, it's easy to parallelize
* We can use the very simple and lightweight `joblib` library
* There is no need to modify your code!

Read more: [https://joblib.readthedocs.io](https://joblib.readthedocs.io)
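
"Embarrassingly parallel" here means that each file can be processed without knowing anything about the others. The pattern is sketched below with the standard library's `concurrent.futures` instead of `joblib`, purely to show the idea: `process_file` is a stand-in for a real per-file reader, and a thread pool is used only to keep the example small. For CPU-heavy work, separate worker processes (which `joblib` uses by default) are usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(fname):
    """Stand-in for a per-file reader: each call depends only on its own input."""
    return fname.upper()


fnames = ["a.h5", "b.h5", "c.h5"]

# Map the same function over all files; results come back in input order
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_file, fnames))

print(results)
```

Because no task shares state with another, adding workers requires no changes to `process_file` itself.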

Let's check the available resources first

In [None]:
!python system-status.py

Let's run our reader

In [None]:
outdir = Path.home()/'shared/data-paolo'

njobs = 8

bbox = None  #[-1124782, 81623, -919821, -96334]  # Kamb bounding box

outdir.mkdir(exist_ok=True)


if njobs == 1:
    print('running in serial ...')
    for f in files: read_atl06(f, outdir, bbox)

else:
    print('running in parallel (%d jobs) ...' % njobs)
    from joblib import Parallel, delayed
    Parallel(n_jobs=njobs, verbose=5)(delayed(read_atl06)(f, outdir, bbox) for f in files)


Let's check our created files

In [None]:
#outfiles = !ls {outdir}/*.h5
outfiles = list(outdir.glob('*.h5'))

for f in outfiles[:10]: print(f)
print('Total number of files:', len(outfiles))

In [None]:
!h5ls -r {outfiles[0]}

## How to handle and visualize millions of points? 

* [Dask (DataFrame)](https://dask.org/) - Advanced parallelism for analytics, scaling Python (Pandas) workflows
* [Datashader](https://datashader.org/) - A graphics pipeline system for creating representations of large datasets quickly and flexibly

Reading data now becomes trivial!

In [None]:
import h5py
import numpy as np
import dask.dataframe as dd

import warnings
warnings.filterwarnings("ignore")


def read_h5(fname, vnames):
    """Read a list of vars [v1, v2, ..] -> 2D."""
    with h5py.File(fname, 'r') as f:
        return np.column_stack([f[v][()] for v in vnames])

    
# Get list of files to plot
#files = list(Path('/home/jovyan/tutorial-data/gridding-time-series/org').glob('*.h5'))
files = list(outdir.glob('*.h5'))

# Variables we want to plot
#vnames = ['lon', 'lat', 'h_elv']
vnames = ['x', 'y', 'h_li']

# List with one dataframe per file
dfs = [dd.from_array(read_h5(f, vnames), columns=vnames) for f in files]

# Single parallel dataframe (larger than memory)
df = dd.concat(dfs)

print('Number of files:', len(files))
print('Number of points:', len(df))
print(df.head())

If you like to work with CSV files, no problem!

In [None]:
#df.to_csv(str(outdir)+'/points-*.csv')  # -> N csv files

Plotting data also becomes trivial!

In [None]:
import datashader as ds
import datashader.transfer_functions as tf
from matplotlib.cm import terrain as cmap

#df = dd.read_csv(str(outdir)+'/*.csv')

pts = ds.Canvas(plot_width=600, plot_height=600)
#agg = pts.points(df, 'lon', 'lat', ds.mean('h_elv'))
agg = pts.points(df, 'x', 'y', ds.mean('h_li'))
img = tf.shade(agg, cmap=cmap, how='linear')
img
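
Conceptually, the `points(..., ds.mean('h_li'))` call reduces millions of points to one value per canvas pixel. The pure-Python sketch below illustrates that idea on a tiny grid. It is not how Datashader is implemented, the function name `mean_grid` is hypothetical, and it assumes the points span a non-zero range in both x and y.

```python
def mean_grid(xs, ys, values, nx, ny):
    """Average `values` per cell of an nx-by-ny grid spanning the data extent."""
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    sums = [[0.0] * nx for _ in range(ny)]
    counts = [[0] * nx for _ in range(ny)]
    for x, y, v in zip(xs, ys, values):
        # Map each coordinate to a cell index; clamp the upper edge into the last cell
        i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        sums[j][i] += v
        counts[j][i] += 1
    # Cells without any points are left empty (None)
    return [[s / c if c else None for s, c in zip(srow, crow)]
            for srow, crow in zip(sums, counts)]


grid = mean_grid(xs=[0, 1, 9, 10], ys=[0, 1, 9, 10],
                 values=[10, 20, 30, 50], nx=2, ny=2)
print(grid)
```

The output size depends only on the canvas resolution, not on the number of input points, which is why the plot stays responsive with very large datasets.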

## Single program from the command line

You can put all of the above (and more) into a single script and run it on the command line:

In [None]:
!python readatl06.py -h

Try reading the ICESat-2 files in parallel from the command line:

In [None]:
#!python readatl06.py {data_home}/*.h5 -o {outdir} -n 8
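
The contents of `readatl06.py` are not shown here, so the following is only a hypothetical sketch of a command-line interface that would accept a call like the one above: a list of input files, `-o` for the output folder and `-n` for the number of jobs. The defaults and help texts are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the call shown above (files, -o, -n)."""
    parser = argparse.ArgumentParser(description="Reduce ATL06 files")
    parser.add_argument("files", nargs="+", help="input ATL06 HDF5 files")
    parser.add_argument("-o", dest="outdir", default="data", help="output folder")
    parser.add_argument("-n", dest="njobs", type=int, default=1,
                        help="number of parallel jobs")
    return parser


# Parse an example argument list instead of the real command line
args = build_parser().parse_args(["a.h5", "b.h5", "-o", "out", "-n", "8"])
print(args.files, args.outdir, args.njobs)
```

In a real script, `parse_args()` would be called without arguments so that it reads the values from the command line.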

In [None]:
#!cat readatl06.py

> **NOTE:** Please remove your created files after you're done with the tutorial:

    cd ~/shared/data-lastname
    rm *.h5

> **And don't forget to check out our [CAPToolkit](https://github.com/fspaolo/captoolkit) package for processing and analyzing altimetry data :)**