# Global Drifter Program (hourly data): example

In this notebook, we use the *hourly* data from the NOAA Global Drifter Program to illustrate the typical steps needed to preprocess a Lagrangian dataset into a ragged array format that can be ingested by the *CloudDrift* library.

## Dataformat module

In [None]:
import sys
sys.path.insert(0, '../')
from clouddrift import ragged_array

The `dataformat.py` module contains the class `ragged_array`. An instance is initialized with a series of dictionaries that map the variable names of the original data to their ragged representation and to their respective attributes.

In [None]:
ragged_array?

The length of each variable in the dataset equals either the number of trajectories (`nb_traj`) or the total number of observations (`obs`) measured along the trajectories.

The first three dictionaries, `coords`, `metadata`, and `data`, map the names of the coordinate, metadata, and data variables, respectively, to the ragged representation of the data.

- The coordinates are mandatory variables (length `obs`) for the ragged array to be used with the library and are always `time`, `lon`, `lat`, and `ids`.
- The metadata variables (length `nb_traj`) are constant values associated with a single trajectory, such as the number of observations (`rowsize`), the deployment information (`deploy_lon`, `deploy_lat`, `deploy_date`), the type of buoy (`typebuoy`), etc.
- The data variables (length `obs`) are quantities measured along the trajectories, such as the velocity components (`ve`, `vn`), the sea surface temperature and its uncertainty (`sst`, `err_sst`), the drogue presence flag (`drogue_status`), etc.

The last two dictionaries, `attrs_global` and `attrs_variables`, are optional and contain the attributes related to the dataset and to each variable, respectively.
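To make the two lengths concrete, here is a minimal sketch in plain Python. Lists stand in for the arrays, all values are invented for illustration, and `trajectory_slice` is a hypothetical helper that is not part of CloudDrift. Variables of length `obs` are stored end to end, and `rowsize` indicates where each trajectory starts and stops.

```python
from itertools import accumulate

# Metadata variables: one value per trajectory (length nb_traj).
ids_per_traj = [101, 102, 103]   # invented drifter identifiers
rowsize = [3, 1, 2]              # number of observations per trajectory

# Coordinate/data variables: one value per observation (length obs),
# with all trajectories stored end to end.
lon = [-40.0, -39.9, -39.8, 12.5, 150.1, 150.3]
sst = [18.2, 18.3, 18.1, 25.0, 27.4, 27.5]

# Start offset of each trajectory, plus the final end offset.
offsets = [0, *accumulate(rowsize)]


def trajectory_slice(j: int) -> slice:
    """Return the slice of the obs dimension that belongs to trajectory j."""
    return slice(offsets[j], offsets[j + 1])


nb_traj = len(rowsize)
obs = sum(rowsize)
print(nb_traj, obs)              # 3 6
print(lon[trajectory_slice(0)])  # [-40.0, -39.9, -39.8]
print(sst[trajectory_slice(2)])  # [27.4, 27.5]
```

In practice these would be NumPy or Awkward arrays rather than lists, but the indexing idea is the same.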

## Classmethod `ragged_array.from_files()`

This classmethod is available to create a `ragged_array` instance from a series of files.

In [None]:
ragged_array.from_files?

This class method was inspired by the [Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/) project, which aims to ease the extraction of data from traditional data archives and its deposition in cloud object storage. The parameters of `ragged_array.from_files()` are:

- a list of indices (or identification numbers) that will be iterated over to concatenate the files into the ragged array format
- a **preprocessing function** with the following signature:
    - `Signature: preprocess_func(index: int) -> xarray.core.dataset.Dataset`, where `index` identifies a trajectory (e.g. the identification number of an Argo float) and the function returns an [xarray Dataset](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.html).
- a dictionary mapping the mandatory coordinates to the names of those variables in the dataset, e.g.
    coords = {'ids': 'number', 'time': 't', 'lon': 'longitude', 'lat': 'latitude'}
- an optional list of variable names containing metadata information about the trajectory (size: 1 per trajectory)
- an optional list of variable names containing the data along the trajectory (size: number of observations per trajectory)
- an optional function that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`)

Because every dataset is unique, the preprocessing function performs operations such as formatting the dates, changing the types of the variables, and modifying the metadata. The class also needs to *initially* calculate the total number of observations to allocate memory. To *speed up* this step when the preprocessing is expensive, it is possible to provide a second function, `rowsize_func`, that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`). By default, this operation is performed using `lambda i: preprocess_func(i).dims['obs']`.
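The two-pass logic described above (size everything first, allocate once, then fill) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of `ragged_array.from_files()`: the `RAW` records and the `concatenate` helper are hypothetical, only two coordinates are mapped for brevity, and a plain dict replaces the `xarray.Dataset` that a real `preprocess_func` returns, so that the example has no dependencies.

```python
from typing import Dict, List

# Invented stand-in for per-trajectory files: index -> raw records.
RAW = {
    7: {"t": [0, 1, 2], "longitude": [10.0, 10.1, 10.2]},
    9: {"t": [5, 6], "longitude": [-3.0, -3.1]},
}


def preprocess_func(index: int) -> Dict[str, List[float]]:
    """Hypothetical preprocessing: return one trajectory keyed by raw names.

    The real function returns an xarray.Dataset; a dict is used here only
    to keep the sketch dependency-free.
    """
    return {name: list(values) for name, values in RAW[index].items()}


def rowsize_func(index: int) -> int:
    """Cheap shortcut: read the length without running the preprocessing."""
    return len(RAW[index]["t"])


def concatenate(indices, preprocess, coords, rowsize=None):
    # Fall back to the expensive path when no shortcut is supplied.
    if rowsize is None:
        rowsize = lambda i: len(preprocess(i)[coords["time"]])
    sizes = [rowsize(i) for i in indices]

    # First pass done: allocate the flat arrays once, at their final length.
    total = sum(sizes)
    out = {name: [None] * total for name in coords}

    # Second pass: copy each trajectory into its slot.
    start = 0
    for i, n in zip(indices, sizes):
        ds = preprocess(i)
        for name, raw_name in coords.items():
            out[name][start:start + n] = ds[raw_name]
        start += n
    return out, sizes


coords = {"time": "t", "lon": "longitude"}
ragged, sizes = concatenate([7, 9], preprocess_func, coords, rowsize_func)
print(sizes)          # [3, 2]
print(ragged["lon"])  # [10.0, 10.1, 10.2, -3.0, -3.1]
```

Without `rowsize_func`, the preprocessing runs twice per trajectory (once to size, once to fill), which is why the shortcut helps when preprocessing is costly.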

We provide preprocessing functions for different datasets in the `data/` folder (`gdp.py`, `gdp6h.py`, `parcels.py`, etc.); these can serve as a guide for defining a new set of functions for another dataset.

# Dataset-specific functions
The `gdp.py` module contains a number of specific functions for the current GDP files, including:
- `gdp.preprocess`: applies the preprocessing routine and returns an `xarray.Dataset` for a specific trajectory
- `gdp.download`: fetches NetCDF files from the GDP FTP server
- `gdp.rowsize` (optional): returns the number of observations of a specific trajectory to speed up the preprocessing

In [None]:
from data import gdp

## Download:

The `gdp.download` function stores the raw dataset in the `data/raw/gdp-v2.00/` folder (specified in the `gdp.py` module). By default, `download()` downloads the complete GDP dataset (containing 17,324 files for versions 1.04c and 2.00) from the AOML repository ([link](https://www.aoml.noaa.gov/ftp/pub/phod/lumpkin/hourly/v2.00/netcdf/)).

In [None]:
gdp.download?

With this function, it is also possible to retrieve a subset by passing a `drifter_ids` list, or to specify an integer `n_random_id` to randomly retrieve that number of trajectories. If both arguments are given, the function downloads `n_random_id` trajectories out of the list `drifter_ids`. The function returns the list of `drifter_ids` that were downloaded, which can then be passed to `ragged_array.from_files()` to create the ragged array.
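The selection rules described above can be sketched as follows. This is not the code of `gdp.download`: `select_ids`, its `available` and `seed` parameters, and the clamping of `n_random_id` to the number of candidates are assumptions made for illustration, and the identifiers are invented.

```python
import random
from typing import List, Optional


def select_ids(
    available: List[int],
    drifter_ids: Optional[List[int]] = None,
    n_random_id: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper mirroring the selection rules described above."""
    # Start from the requested list, or from everything if none is given.
    candidates = list(available) if drifter_ids is None else list(drifter_ids)

    # Optionally draw a random subset (without replacement) of the candidates.
    if n_random_id is not None:
        rng = random.Random(seed)
        n = min(n_random_id, len(candidates))  # assumption: clamp, do not fail
        candidates = rng.sample(candidates, n)
    return candidates


available = list(range(1000, 1010))  # invented identifiers

# No argument: the complete set.
print(select_ids(available))

# Explicit list only: exactly those identifiers.
print(select_ids(available, drifter_ids=[1001, 1005]))

# Both arguments: a random draw of n_random_id out of drifter_ids.
subset = select_ids(available, drifter_ids=[1001, 1005, 1007], n_random_id=2, seed=0)
print(len(subset), set(subset) <= {1001, 1005, 1007})  # 2 True
```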

In [None]:
drifter_ids = gdp.download(n_random_id=100)

Once the data are downloaded, the ragged array object can be created and then saved as a NetCDF file or a Parquet file, or converted to an [Awkward Array](https://github.com/scikit-hep/awkward) that can be used for analysis:

In [None]:
coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}
metadata = ['ID', 'rowsize', 'WMO', 'expno', 'deploy_date', 'deploy_lat', 'deploy_lon', 'end_date', 'end_lat', 'end_lon', 'drogue_lost_date', 'typedeath', 'typebuoy', 'location_type', 'DeployingShip', 'DeploymentStatus', 'BuoyTypeManufacturer', 'BuoyTypeSensorArray', 'CurrentProgram', 'PurchaserFunding', 'SensorUpgrade', 'Transmissions', 'DeployingCountry', 'DeploymentComments', 'ManufactureYear', 'ManufactureMonth', 'ManufactureSensorType', 'ManufactureVoltage', 'FloatDiameter', 'SubsfcFloatPresence', 'DrogueType', 'DrogueLength', 'DrogueBallast', 'DragAreaAboveDrogue', 'DragAreaOfDrogue', 'DragAreaRatio', 'DrogueCenterDepth', 'DrogueDetectSensor']
data = ['ve', 'vn', 'err_lat', 'err_lon', 'err_ve', 'err_vn', 'gap', 'sst', 'sst1', 'sst2', 'err_sst', 'err_sst1', 'err_sst2', 'flg_sst', 'flg_sst1', 'flg_sst2', 'drogue_status']

ra = ragged_array.from_files(
    drifter_ids,
    gdp.preprocess,
    coords,
    metadata,
    data,
    rowsize_func=gdp.rowsize
)

## Export to data files:

In [None]:
ra.to_parquet('../data/process/gdp_v2.00.parquet')
ra.to_netcdf('../data/process/gdp_v2.00.nc')

## Import from data files:

In [None]:
ra2 = ragged_array.from_parquet('../data/process/gdp_v2.00.parquet')

In [None]:
ra2

## Convert to Awkward Array:

In [None]:
ds = ra2.to_awkward()

In [None]:
ds.fields

In [None]:
ds.obs.fields

### Global attributes

In [None]:
ds.layout.parameters

### Variable attributes

In [None]:
ds.ID.layout.parameters

In [None]:
ds.obs.sst.layout.parameters