# 06. Sentinel-6 MF L2 Altimetry Data Access (OPeNDAP) & Gridding

To-dos:
* Add tutorial objectives, e.g. How much data are we selecting/where? Over a certain Cycle?
* What is the goal of the tutorial? e.g. Grab ___ cycle(s)/pass(es) to plot satellite altimetry tracks from S6 using opendap in the cloud to only select data we are interested in, reducing the data volume and time-to-data.

## Getting Started

### Summary

*Description goes here...*

### Objectives

In this tutorial you will learn...

1. about level 2 radar altimetry data from the Sentinel-6 Michael Freilich mission;
2. how to efficiently download variable subsets using OPeNDAP;
3. how to grid the along-track altimetry observations produced by S6 at level 2.


### Requirements

This workflow was developed using Python 3.9 (and tested against versions 3.7, 3.8).

In [None]:
#

In [None]:
#

### Workspace

Create some directories inside a temporary user workspace. They will be used to write outputs.

In [None]:
#
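
A minimal sketch of the workspace setup, assuming the system temp directory is an acceptable root. The helper `make_workspace`, the root folder name `s6_tutorial`, and the `subsets` folder name are illustrative choices, not taken from the original notebook.

```python
import os
import tempfile

def make_workspace(root, names=("data", "subsets")):
    """Create one sub-directory per name under root and return their paths."""
    paths = {}
    for name in names:
        path = os.path.join(root, name)
        # exist_ok=True makes the call safe to repeat when the cell is re-run
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths

# Hypothetical workspace root inside the system temp directory
workspace = make_workspace(os.path.join(tempfile.gettempdir(), "s6_tutorial"))
print(workspace)
```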

>https://docs.python.org/3/library/os.html#os.makedirs

## Dataset(s)

This example operates on Level 2 Low Resolution Altimetry from Sentinel-6 Michael Freilich (the Near Real Time Reduced distribution). It is most easily identified by its collection *ShortName*, given below:

In [None]:
#

Search CMR using a simple function that wraps `requests.get`:

#### collection

In [None]:
#
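
One way such a wrapper could look is sketched below. The function names `cmr_request` and `cmr_search` are hypothetical, and the ShortName string and the `POCLOUD` provider are assumptions that should be checked against the collection's landing page. The request itself is left commented out so the sketch runs offline.

```python
CMR_SEARCH = "https://cmr.earthdata.nasa.gov/search"

# Assumed ShortName of the S6 NRT reduced low-resolution collection
short_name = "JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F"

def cmr_request(record_type, fmt="umm_json", **params):
    """Return the (url, params) pair for a CMR search of collections or granules."""
    url = f"{CMR_SEARCH}/{record_type}.{fmt}"
    return url, params

def cmr_search(record_type, fmt="umm_json", **params):
    """Send the search to CMR and return the requests.Response object."""
    import requests  # third-party; imported here so cmr_request works without it
    url, query = cmr_request(record_type, fmt, **params)
    response = requests.get(url, params=query)
    response.raise_for_status()  # fail early on HTTP errors
    return response

url, query = cmr_request("collections", short_name=short_name, provider="POCLOUD")
print(url, query)
# results = cmr_search("collections", short_name=short_name, provider="POCLOUD").json()
```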

Get the collection's *concept-id* from the record's `meta` object. It uniquely identifies the collection in the CMR and is a component of the OPeNDAP endpoints for its granules.

In [None]:
#
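
As a sketch, the lookup can be written against a trimmed stand-in for the parsed `umm_json` response. The dictionary below is hypothetical (including the placeholder concept-id), and `get_concept_id` is an illustrative helper name.

```python
# Trimmed, hypothetical stand-in for response.json() from a collections.umm_json search
results = {
    "hits": 1,
    "items": [
        {"meta": {"concept-id": "C0000000000-EXAMPLE", "provider-id": "EXAMPLE"},
         "umm": {"ShortName": "EXAMPLE_SHORTNAME"}},
    ],
}

def get_concept_id(results):
    """Return the concept-id of the first record in a umm_json search result."""
    items = results["items"]
    if not items:
        raise ValueError("the search returned no records")
    return items[0]["meta"]["concept-id"]

concept_id = get_concept_id(results)
print(concept_id)
```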

>https://docs.python-requests.org/en/latest/api/#requests.get

#### granules

In [None]:
#
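
The referenced functions suggest reading CSV text returned by CMR into a table. A minimal sketch of that step follows, using a hypothetical two-row excerpt in place of a real response body; the granule names, times, and sizes are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical excerpt standing in for the text of a granules.csv response
csv_text = """Granule UR,Start Time,End Time,Size
S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02,2021-07-13T16:26:44.000Z,2021-07-13T18:22:34.000Z,2.7
S6A_P4_2__LR_RED__NR_025_003_20210713T182234_20210713T201839_F02,2021-07-13T18:22:34.000Z,2021-07-13T20:18:39.000Z,2.8
"""

# StringIO lets read_csv treat the in-memory string like a file
granules = pd.read_csv(StringIO(csv_text), parse_dates=["Start Time", "End Time"])

# iloc[0] selects the first row by position
print(granules.iloc[0])
```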

>https://docs.python.org/3/library/io.html#io.StringIO    
>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html    
>https://docs.python-requests.org/en/latest/api/#requests.get     

In [None]:
#

>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

Display the list of items created by splitting a value in the `Granule UR` field.

In [None]:
#

References:    
https://docs.python.org/3/library/stdtypes.html#str.split    
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.iloc.html    

*Note that cycle and pass are items 8 and 9, respectively, after splitting the `Granule UR` field by `_`.* Add two columns containing the cycle/pass numbers for granules in the table.

In [None]:
#
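
The split-and-assign step can be sketched as follows. The granule names are hypothetical examples that follow the naming pattern, and `cycle_and_pass` is an illustrative helper name.

```python
import pandas as pd

def cycle_and_pass(granule_ur):
    """Return the (cycle, pass) strings found at positions 8 and 9 of the name."""
    parts = granule_ur.split("_")
    return parts[8], parts[9]

# Hypothetical granule names that follow the S6 naming pattern
granules = pd.DataFrame({"Granule UR": [
    "S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02",
    "S6A_P4_2__LR_RED__NR_026_003_20210723T142515_20210723T162128_F02",
]})

# Show every item produced by the split; double underscores yield empty strings
print(granules["Granule UR"].iloc[0].split("_"))

granules = granules.copy()  # work on a copy so the source table is left untouched
granules["cycle"] = granules["Granule UR"].apply(lambda ur: cycle_and_pass(ur)[0])
granules["pass"] = granules["Granule UR"].apply(lambda ur: cycle_and_pass(ur)[1])
print(granules[["cycle", "pass"]])
```

The numbers are kept as zero-padded strings so that they sort in the same order as the file names.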

>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html    
>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html    
>https://docs.python.org/3/reference/expressions.html#lambda    
>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html    

Now create a table with one row per cycle and with these columns:

1. start time
2. end time
3. granule names (list)

In [None]:
#
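
A sketch of the per-cycle table using `groupby` with named aggregations. The small input table is hypothetical; the output column names `start`, `end`, and `granules` are illustrative.

```python
import pandas as pd

# Hypothetical granule table with the columns built in the previous steps
granules = pd.DataFrame({
    "Granule UR": ["g1", "g2", "g3"],
    "Start Time": pd.to_datetime(["2021-06-22T00:00", "2021-06-22T02:00", "2021-07-02T00:00"]),
    "End Time": pd.to_datetime(["2021-06-22T02:00", "2021-06-22T04:00", "2021-07-02T02:00"]),
    "cycle": ["023", "023", "024"],
})

# One row per cycle: earliest start, latest end, and the list of granule names
cycles = granules.groupby("cycle").agg(
    start=("Start Time", "min"),
    end=("End Time", "max"),
    granules=("Granule UR", list),
)
print(cycles)
```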

>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html    
>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

#### Pick a cycle of interest

Pick one cycle that is nearly complete. Limit the options to cycles with at least 120 granules/files available (i.e. orbits, in the S6 context).

In [None]:
#

Choose from the nearly complete cycles that remain in the table. For S6 data in the public domain they begin with cycle number `023`, which starts on June 22, 2021. Slice the table of *granules* to exclude all rows that are not from the cycle of interest.

In [None]:
#
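
The filter and the slice can be sketched as below. The tables are tiny hypothetical stand-ins, and `nearly_complete` is an illustrative helper whose default threshold mirrors the 120-granule limit described above.

```python
import pandas as pd

def nearly_complete(cycles, min_granules=120):
    """Keep only the cycles whose granule list has at least min_granules entries."""
    counts = cycles["granules"].apply(len)
    return cycles[counts >= min_granules]

# Hypothetical stand-in tables: cycle 022 is sparse, cycle 023 is nearly complete
cycles = pd.DataFrame(
    {"granules": [["a"] * 3, ["b"] * 125]},
    index=pd.Index(["022", "023"], name="cycle"),
)
granules = pd.DataFrame({"Granule UR": ["a", "b"], "cycle": ["022", "023"]})

candidates = nearly_complete(cycles)
cycle = candidates.index[0]  # pick the first qualifying cycle

# Keep only the rows of the granule table that belong to the chosen cycle
selected = granules[granules["cycle"] == cycle].copy()
print(cycle, len(selected))
```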

#### OPeNDAP Access Endpoints

All endpoints for granules/files in OPeNDAP/Hyrax start with the server hostname and path to the parent collection, followed by *granules*. The collection is specified by the *concept-id* given right after *collections* in a valid url. The next cell formats a string giving the base url to which we will append granule filenames (stored in the `Granule UR` column of the *granules* table) to get the full url/endpoint for each granule.

*Fyi, the url printed by this cell will not be accessible from your web browser.*

In [None]:
#

>https://docs.python.org/3/library/string.html#format-string-syntax    

Appending the granule name (taken from the `Granule UR` column) to the end of the url above results in a valid endpoint. You can click the one printed by this cell to confirm. A new browser tab should open to the HTML access form served by Hyrax/OPeNDAP.

Make a new column with the `nc4` download url for all granules.

In [None]:
#
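
A sketch of the endpoint construction, assuming the cloud Hyrax server is hosted at `opendap.earthdata.nasa.gov`. The concept-id is a placeholder, the granule name is a hypothetical example, and `opendap_url` is an illustrative helper name.

```python
# Assumed hostname of the cloud Hyrax server; {} is filled with the collection concept-id
OPENDAP_BASE = "https://opendap.earthdata.nasa.gov/collections/{}/granules/"

def opendap_url(concept_id, granule_ur, suffix=""):
    """Return the OPeNDAP endpoint for a granule, e.g. with suffix '.nc4' or '.dds'."""
    return OPENDAP_BASE.format(concept_id) + granule_ur + suffix

concept_id = "C0000000000-EXAMPLE"  # hypothetical placeholder
granule_ur = "S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02"

form_url = opendap_url(concept_id, granule_ur)         # HTML access form
nc4_url = opendap_url(concept_id, granule_ur, ".nc4")  # netCDF4 download
print(form_url)
print(nc4_url)
```

With the granule table in hand, the same helper can be applied to the `Granule UR` column to fill the new url column.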

>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html    

##### Pick target data/coordinate variables

Display the url to access the DDS file for the first granule:

In [None]:
#

>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.iloc.html    

Assemble the list of target variables that you will subset from each granule/file using OPeNDAP.

In [None]:
#

>https://docs.python.org/3/library/stdtypes.html#list    

Join the list of variables with commas and append the resulting string to the end of each OPeNDAP `nc4` endpoint created in the steps above. The variables given after the `?` are subset from the source file on the server side and written to a new netCDF4 file, which OPeNDAP returns as the response content over HTTPS.

Here's an example for the first granule in the selected cycle. Clicking this link should download a netCDF4 file containing the target variables (4 of them, in my case). *You may be prompted to authenticate with your Earthdata Login account info.*

In [None]:
#
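
The constraint expression can be assembled as in the sketch below. The four variable names are assumptions (check the DDS response for the names present in your granules), the endpoint is a hypothetical placeholder, and `subset_url` is an illustrative helper name.

```python
# Assumed variable names; confirm them against the DDS response for a granule
variables = ["data_01_time", "data_01_latitude", "data_01_longitude", "data_01_ku_ssha"]

def subset_url(nc4_url, variables):
    """Append a constraint expression that selects only the listed variables."""
    return nc4_url + "?" + ",".join(variables)

# Hypothetical endpoint with placeholder collection and granule names
nc4_url = ("https://opendap.earthdata.nasa.gov/collections/"
           "C0000000000-EXAMPLE/granules/EXAMPLE_GRANULE.nc4")
print(subset_url(nc4_url, variables))
```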

>https://opendap.github.io/documentation/UserGuideComprehensive.html#Constraint_Expressions (Hyrax/OPeNDAP docs)    

#### Download subsets

This function downloads one granule from the remote `url` to a local `target` path, and will reliably manage simultaneous streaming downloads divided between multiple threads.

In [None]:
#
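
A minimal sketch of such a function, built from the calls referenced for this step. The optional `get` argument and the `.part` temporary file are design choices of this sketch, not features of the original notebook. It assumes Earthdata Login credentials are available to *requests*, for example through a `~/.netrc` file.

```python
import os

def download(url, target, get=None, chunk_size=1024 * 1024):
    """Stream url to the local target path unless the file already exists."""
    if os.path.isfile(target):
        return target  # skip work that was already done
    if get is None:
        import requests  # third-party; only needed for real downloads
        get = requests.get
    response = get(url, stream=True)
    if response.status_code != 200:
        raise RuntimeError(f"{response.status_code} for {url}: {response.text}")
    # Write to a temporary name first so a failed transfer never looks complete
    partial = target + ".part"
    with open(partial, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    os.replace(partial, target)
    return target
```

Each call writes only to its own `target`, so calls for different files do not share state and can run in separate threads.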

>https://docs.python.org/3/library/os.path.html#os.path.isfile    
>https://docs.python-requests.org/en/latest/api/#requests.Response.text    
>https://docs.python-requests.org/en/latest/api/#requests.Response.status_code    
>https://docs.python-requests.org/en/latest/api/#requests.Response.iter_content    

In [None]:
#

*This next cell assembles a list of local paths for the subset downloads.*

Calling `tolist` on the resulting Series converts it to a Python list (by way of the *numpy* method, *tolist*). Use *zip* to merge the lists of *subset* urls and local paths, itemwise. The result is a list of two-item tuples, each containing a remote url and a local path (corresponding to the two positional arguments of the *download* function defined above).

In [None]:
#

>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tolist.html    

*This next cell creates a pool of workers to divide the list of downloads between multiple threads.*

Use the `ThreadPoolExecutor` from `concurrent.futures` module (in the Python 3 standard library) to divide the 120+ download jobs between multiple threads and run them concurrently. This should take no more than a minute or two to process all subsets on the server side and download to the local host.

In [None]:
#
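
The pairing and the thread pool can be sketched with the standard library alone. The urls and paths are hypothetical, `run_downloads` is an illustrative helper, the default of 12 workers is arbitrary, and a stand-in worker replaces the real download function so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def run_downloads(download, jobs, max_workers=12):
    """Call download(url, target) for every (url, target) pair using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map returns results in the same order as the jobs
        return list(pool.map(lambda job: download(*job), jobs))

# Hypothetical urls and local paths; zip pairs them itemwise
urls = ["https://example.invalid/a.nc4", "https://example.invalid/b.nc4"]
local = ["/tmp/a.nc4", "/tmp/b.nc4"]
jobs = list(zip(urls, local))

# Stand-in worker that only echoes the target path; pass the real function instead
results = run_downloads(lambda url, target: target, jobs)
print(results)
```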

>https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor    

The source files range from 2.5MB to 3.0MB. These OPeNDAP subsets are ~100KB apiece. (anecdote: it took less than 10 minutes to download subsets for >1700 granules/files when I ran this routine for all cycles going back to 2021-06-22.) Here we call the shell *du* and *ls* utilities to get the size of the directory:

In [None]:
#

>https://www.gnu.org/software/coreutils/manual/html_node/du-invocation.html    

Confirm that a netcdf file exists on disk for all the file paths in the *local* column.

In [None]:
#

>https://docs.python.org/3/library/functions.html#sorted    
>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.all.html    

#### Aggregate to cycle

Make a dictionary to rename variables so that the `data_01_` prefix is removed from each one.

In [None]:
#
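
A sketch of the renaming dictionary built with `map` and `zip`. The variable names are assumptions, and `strip_prefix` is an illustrative helper name.

```python
# Assumed names of the subset variables
variables = ["data_01_time", "data_01_latitude", "data_01_longitude", "data_01_ku_ssha"]

def strip_prefix(name, prefix="data_01_"):
    """Remove the group prefix from a flattened variable name."""
    return name[len(prefix):] if name.startswith(prefix) else name

# Map each original name to its shortened form
new_names = dict(zip(variables, map(strip_prefix, variables)))
print(new_names)
```

Names that carry a further group prefix after `data_01_`, such as `ku_`, keep it under this rule.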

References:    
https://docs.python.org/3/library/functions.html#map    
https://docs.python.org/3/library/functions.html#zip    

Sort the list of local paths to the downloaded subsets to ensure they concatenate in proper order. Call `open_mfdataset` on the list to open all the subsets in memory as one dataset in xarray.

In [None]:
#

>https://tqdm.github.io/docs/tqdm/#pandas    
>https://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html    
>https://xarray.pydata.org/en/stable/generated/xarray.Dataset.rename.html  

### Render along-track altimetry data to the ECCO grid 

>**Acknowledgement** *This approach using `pyresample` was shared with me by Ian Fenty, ECCO Lead.*

ECCO V4r4 products are distributed in two spatial formats. One set of collections provides the ocean state estimates on the native model grid (LLC0090) and the other provides them after interpolating to a regular grid defined in geographic coordinates with a horizontal cell size of 0.5 degrees.

#### Download the ECCO V4r4 0.5-Deg Grid Geometry and Mask

It's distributed as its own dataset/collection containing just one file. We can access it over OPeNDAP as demonstrated above or simply download it from the HTTPS download endpoint, since the file size is inconsequential. The next cell downloads the file into the *data* folder from the granule's https endpoint.

In [None]:
#

>https://docs.python.org/3/library/os.path.html#os.path.basename    
>https://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html    

The `maskC` variable contains a boolean mask representing the wet/dry state of the area contained in each cell of the 3d grid defined by `Z` and `latitude` and `longitude`.

Here are the variable's attributes:

In [None]:
#

So, the mask derives from another variable `hFacC` that essentially describes 3d space/volume contained within each model grid cell by the fractional area representing the horizontal coverage (in the `longitude,latitude` dimensions) within each vertical/depth layer.

In [None]:
#

Select the 2d array from `maskC` that corresponds to the depth layer at ocean surface (i.e. at index `0` on the `Z` axis/dimension) and then produce a boolean array where True represents cells with a value greater than `0`. The resulting array/grid is our land/water mask for the 2d grids generated during the next few steps.

Plot the land/water mask:

In [None]:
#

>https://xarray.pydata.org/en/stable/generated/xarray.DataArray.isel.html    
>https://xarray.pydata.org/en/stable/generated/xarray.DataArray.plot.html    

### Grid ssha or a similar variable

*Jinbo's recommendation: wrap this logic for parameterization by workshop attendees.*

Get a single timestamp to represent the midpoint of the cycle.

In [None]:
#
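
One way to compute such a midpoint with *numpy* datetimes is sketched below. The observation times are hypothetical, and `midpoint` is an illustrative helper name.

```python
import numpy as np

def midpoint(times):
    """Return the timestamp halfway between the first and last observation."""
    start, end = np.min(times), np.max(times)
    return start + (end - start) / 2

# Hypothetical observation times spanning roughly one cycle
times = np.array(
    ["2021-06-22T00:00:00", "2021-06-27T12:00:00", "2021-07-02T00:00:00"],
    dtype="datetime64[ns]",
)
print(midpoint(times))
```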

Access the target variable, *ssha* in this case, and make a nan mask.

In [None]:
#
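
A sketch of the nan mask with *numpy*, using a few hypothetical along-track samples. The mask is applied to the coordinates as well, so the three arrays stay aligned.

```python
import numpy as np

# Hypothetical along-track samples; NaN marks a missing ssha value
lons = np.array([10.0, 10.5, 11.0, 11.5])
lats = np.array([-5.0, -4.5, -4.0, -3.5])
ssha = np.array([0.02, np.nan, -0.01, np.nan])

valid = ~np.isnan(ssha)  # True where ssha has a value

# Apply the same mask to values and coordinates
lons_v, lats_v, ssha_v = lons[valid], lats[valid], ssha[valid]
print(valid.sum(), "of", ssha.size, "observations kept")
```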

Define a simple function *get_grid_defn* to validate input arrays of longitudes/latitudes and return a *pyresample.geometry.SwathDefinition* object. (We use it twice to define source/target grids in the next steps.)

In [None]:
#

Define source grid/geometry for the input along-track data. (They are stored as 1-dimensional arrays.)

In [None]:
#

Define target grid based on the longitudes and latitudes from the ECCO grid geometry dataset. This time define the grid using two 2-dimensional arrays that give positions of all SSHA values in geographic/longitude-latitude coordinates.

In [None]:
#
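
The 2-dimensional coordinate arrays can be built with `numpy.meshgrid`, as sketched below. The cell centers are assumed values for a global 0.5-degree grid; in the tutorial they would come from the ECCO grid geometry dataset.

```python
import numpy as np

# Assumed cell centers for a global 0.5-degree grid
lon_centers = np.arange(-179.75, 180.0, 0.5)
lat_centers = np.arange(-89.75, 90.0, 0.5)

# meshgrid expands the 1d axes into 2d arrays with shape (n_lat, n_lon)
lons_2d, lats_2d = np.meshgrid(lon_centers, lat_centers)
print(lons_2d.shape, lats_2d.shape)
```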

Make the *pyresample* object for the target grid and proceed.

In [None]:
#

Show the help for `pyresample.kd_tree.resample_gauss` to aid during the hackathon.

In [None]:
#

Get the target grid definition defined by the 2d arrays of lons and lats created in the cell above. Apply gaussian resampling with some optional arguments (borrowed from Ian's implementation).

In [None]:
#
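
To illustrate what Gaussian resampling computes, here is a brute-force *numpy* sketch. It is a stand-in, not the *pyresample* implementation: each target cell takes a weighted mean of the observations within `radius` metres, with weights of the assumed form exp(-d²/σ²). The function names, the spherical Earth radius, and all input values are illustrative assumptions, and the all-pairs distance matrix is only practical for small inputs.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius

def great_circle_m(lon1, lat1, lon2, lat2):
    """Haversine distance in metres between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def gauss_grid(src_lons, src_lats, values, tgt_lons, tgt_lats, radius, sigma):
    """Gaussian-weighted mean of the observations within radius of each target cell."""
    # Distance from every target cell (rows) to every observation (columns)
    dist = great_circle_m(tgt_lons.ravel()[:, None], tgt_lats.ravel()[:, None],
                          src_lons[None, :], src_lats[None, :])
    weights = np.where(dist <= radius, np.exp(-dist ** 2 / sigma ** 2), 0.0)
    total = weights.sum(axis=1)
    out = np.full(total.shape, np.nan)  # cells with no nearby data stay NaN
    has_data = total > 0
    out[has_data] = (weights[has_data] @ values) / total[has_data]
    return out.reshape(tgt_lons.shape)

# Hypothetical inputs: three observations and a 2x2 target grid
src_lons = np.array([0.0, 0.5, 40.0])
src_lats = np.array([0.0, 0.0, 0.0])
values = np.array([1.0, 3.0, 10.0])
tgt_lons, tgt_lats = np.meshgrid([0.25, 20.0], [0.0, 0.5])

grid = gauss_grid(src_lons, src_lats, values, tgt_lons, tgt_lats,
                  radius=100e3, sigma=50e3)
print(grid)
```

Cells at longitude 0.25 sit midway between the first two observations and receive their mean, while cells at longitude 20 have no observation within the radius and remain NaN.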

Apply the land/water mask in the numpy array created from the ECCO layer in the steps above. Then, convert the masked numpy array to an xarray data array object named *gridded*. Print its header.

In [None]:
#
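
The masking step can be sketched with `numpy.where`. The small arrays are hypothetical stand-ins for the gridded field and the land/water mask; the conversion to an xarray data array is not shown here.

```python
import numpy as np

# Hypothetical 2x3 gridded field and a matching boolean mask (True = water)
grid = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]])
water = np.array([[True, False, True],
                  [False, True, True]])

# Keep values over water and set land cells to NaN
masked = np.where(water, grid, np.nan)
print(masked)
```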

In [None]:
#

In [None]:
#

In [None]:
#

**Additional References:**

* *numpy* (https://numpy.org/doc/stable/reference)    
    * [numpy.ndarray.data](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.data.html)    
    * [numpy.ravel](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html)    
    * [numpy.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html)    
    * [numpy.isnan](https://numpy.org/doc/stable/reference/generated/numpy.isnan.html)    
    * [datetimes](https://numpy.org/doc/stable/reference/arrays.datetime.html)    
* *xarray* (https://xarray.pydata.org/en/stable)    
    * [xarray.DataArray](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.html)    
    * [xarray.DataArray.values](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.values.html)    
    * [xarray.DataArray.mean](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.html)    
    * [xarray.DataArray.plot.pcolormesh](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.plot.pcolormesh.html) (matplotlib)    
* *pyresample* (https://pyresample.readthedocs.io/en/latest/api/pyresample)    
    * [pyresample.utils.check_and_wrap](https://pyresample.readthedocs.io/en/latest/api/pyresample.utils.html#pyresample.utils.check_and_wrap)    
    * [pyresample.kd_tree.resample_gauss](https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.kd_tree.resample_gauss)    
    * [pyresample.geometry.SwathDefinition](https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.geometry.SwathDefinition)    


**Bonus**: generate a grid for every cycle and get mean/std over all the cycles

```python
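import pandas as pd
import xarray as xr

# data['grid'] is assumed to hold one gridded DataArray per cycle
stack = xr.concat(data['grid'].tolist(), dim="time")

# Limit to latitudes between 66S and 66N
midlat = stack.sel(latitude=slice(-66., 66.))

# Stack the time-mean and time-std along a new "stat" dimension
stats = xr.concat(objs=[midlat.mean("time"),
                        midlat.std("time")],
                  dim=pd.Index(['mean', 'std'], name="stat"))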
```

>https://xarray.pydata.org/en/stable/generated/xarray.concat.html    

**Bonus**: calculate area-weighted mean

```python
import numpy as np

def to_area_weighted_mean(x):
    valid = ~np.isnan(x)                                       # True where the grid has data
    valid_area = np.sum(np.where(valid, ecco_grid.area, 0))    # total area of the cells that have data
    return float(np.nansum(x * ecco_grid.area) / valid_area)   # area-weighted mean over those cells
```

>https://numpy.org/doc/stable/reference/generated/numpy.sum.html    
>https://numpy.org/doc/stable/reference/generated/numpy.nansum.html    
