# Global Drifter Program (GDP)

In this Notebook, we use the *GDP historical dataset* to highlight the steps required to preprocess a dataset into a format that can be ingested by the *CloudDrift* library.

## Dataformat module

The `dataformat.py` module contains the class `create_ragged_array` to transform a series of archives into an *Awkward Array* where all variables are stored as *ragged arrays*. The module also contains `read_from_netcdf` and `read_from_parquet` to initialize the *Awkward Array* directly from a previously preprocessed archive. Right now, it *only* supports local arrays, but we will soon add the possibility of *lazy-loading* arrays stored in the Cloud.
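To make the ragged-array layout concrete, here is a minimal pure-Python sketch. It is not CloudDrift's implementation, only an illustration of the idea: the observations of all trajectories are stored back to back in one flat sequence, and a per-trajectory row size tells where each trajectory starts and ends. The names `offsets` and `trajectory`, and the sample values, are assumptions made for this example.

```python
from itertools import accumulate

# Three hypothetical trajectories with 3, 1, and 2 observations each.
rowsize = [3, 1, 2]

# All observations of one variable, stored contiguously.
sst = [20.1, 20.3, 20.2, 15.0, 18.4, 18.6]

# Start offset of every trajectory, plus the total length at the end.
offsets = [0, *accumulate(rowsize)]


def trajectory(values, offsets, i):
    """Return the observations belonging to trajectory i."""
    return values[offsets[i]:offsets[i + 1]]


for i in range(len(rowsize)):
    print(i, trajectory(sst, offsets, i))
```

Storing only the row sizes avoids padding short trajectories to the length of the longest one, which is the main appeal of this layout for drifter data.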

In [None]:
import sys

In [None]:
sys.path.insert(0, '../')
from clouddrift import dataformat

The main class of this module is *create_ragged_array* and is used to create a single archive that can be saved to a netCDF or Parquet file. The signature of the class is:

In [None]:
dataformat.create_ragged_array?

## Dataset-specific functions

Since each dataset is different, we have to create specific functions to preprocess the dataset (`preprocess_func`) and return the metadata and data of a single trajectory. This was inspired by the [Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/) project. The class *create_ragged_array* will use those functions to create the single archive of ragged arrays. More precisely, it requires:
- a list of indices (or identification numbers) that will be concatenated into the ragged array format
- a preprocessing function with the following signature:
    - `Signature: preprocess_func(index: int) -> xarray.core.dataset.Dataset`, where the `index` parameter is the identifier of a trajectory (e.g. the identification number of an Argo float) and the function returns an *xarray Dataset*.
- a dictionary mapping the mandatory coordinates list to the name of those variables in the dataset, e.g.
    coords = {'ids': 'number', 'time': 't', 'longitude': 'lon', 'latitude': 'lat'}
- an optional list of variable names containing metadata information about the trajectory (size: 1 per trajectory)
- an optional list of variable names containing the data along the trajectory (size: number of observations per trajectory)
- an optional function that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`)
    
The preprocessing function can perform all types of operations, such as formatting the date, changing the type of variables, modifying the metadata, etc. We provide preprocessing functions for different datasets in the `data/recipes/` folder. The class also needs to *initially* calculate the total number of observations. By default, this is performed using `lambda i: preprocess_func(i).dims['obs']`, which runs the full preprocessing of every trajectory just to count its observations. To *speed up* this step when the preprocessing is expensive, it is possible to provide a second function, `rowsize_func`, that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`).
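The two-pass procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of `create_ragged_array`: the stand-in `preprocess_func` returns a plain dict instead of an *xarray Dataset*, and `RAW` and `build_ragged` are hypothetical names introduced only for this example.

```python
from itertools import accumulate

# Hypothetical stand-in for raw files: trajectory id -> unparsed observations.
RAW = {101: ["20.1", "20.3", "20.2"], 205: ["15.0"], 309: ["18.4", "18.6"]}


def preprocess_func(index):
    """Stand-in recipe; a real one would return an xarray Dataset."""
    return {"id": index, "sst": [float(v) for v in RAW[index]]}


def rowsize_func(index):
    """Cheap shortcut: count observations without preprocessing them."""
    return len(RAW[index])


def build_ragged(indices, preprocess_func, rowsize_func=None):
    # Default mirrors the text: preprocess each trajectory just to count it.
    if rowsize_func is None:
        rowsize_func = lambda i: len(preprocess_func(i)["sst"])

    # Pass 1: row sizes give the total length and each trajectory's offset.
    rowsize = [rowsize_func(i) for i in indices]
    offsets = [0, *accumulate(rowsize)]
    sst = [None] * offsets[-1]  # preallocate the flat array

    # Pass 2: preprocess once per trajectory and copy into its slot.
    for k, i in enumerate(indices):
        sst[offsets[k]:offsets[k + 1]] = preprocess_func(i)["sst"]
    return rowsize, sst


rowsize, sst = build_ragged([101, 205, 309], preprocess_func, rowsize_func)
print(rowsize, sst)
```

Without `rowsize_func`, every trajectory is preprocessed twice (once to count, once to fill), which is why supplying the shortcut helps when the recipe is expensive.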

Finally, we import the `gdp` module, which contains the functions to download (or update) and preprocess the GDP dataset.

In [None]:
from data import gdp

## Download

The download function stores the raw dataset in the `data/raw/` folder specified in the `gdp.py` module. By default, `gdp.download()` downloads the complete GDP dataset (containing 17,324 files as of May 2022) from the AOML `https` server.

**Note**: this Notebook is very similar to the `data-glad.ipynb` Notebook because very few functions have to be created to transform a new dataset. We hope that this will encourage people to adopt this data format and use the CloudDrift library.

In [None]:
gdp.download?

It is possible to provide a list of `drifter_ids` to retrieve a subset and/or specify an integer `n_random_id` to randomly retrieve `n` trajectories. The function returns the list of `drifter_ids` that were downloaded, which can then be passed on to create the ragged array.
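The actual selection logic inside `gdp.download` is not shown in this Notebook. As a rough sketch of how a subset and a random draw could be combined, consider the hypothetical helper below; `select_ids`, `available_ids`, and `seed` are names invented for this illustration and are not part of the `gdp` module.

```python
import random


def select_ids(available_ids, drifter_ids=None, n_random_id=None, seed=None):
    """Hypothetical helper: pick which drifter IDs to download."""
    # Start from the requested subset, or from everything available.
    if drifter_ids is None:
        candidates = list(available_ids)
    else:
        wanted = set(drifter_ids)
        candidates = [i for i in available_ids if i in wanted]

    # Optionally draw n IDs at random, without replacement.
    if n_random_id is not None:
        n = min(n_random_id, len(candidates))
        candidates = random.Random(seed).sample(candidates, n)

    return candidates


# Draw 3 IDs at random from a requested subset of 5.
print(select_ids(range(1000), drifter_ids=[10, 20, 30, 40, 50], n_random_id=3, seed=0))
```

Returning the final list of IDs, as `gdp.download` does, lets the caller pass exactly the downloaded trajectories to `create_ragged_array`.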

In [None]:
drifter_ids = gdp.download(n_random_id=100)

Once the data are downloaded, it is possible to create the ragged array and either save it to a netCDF or Parquet file, or simply output an Awkward Array that can be used for analysis.

In [None]:
coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}
metadata = ['ID', 'rowsize', 'WMO', 'expno', 'deploy_date', 'deploy_lat', 'deploy_lon', 'end_date', 'end_lat', 'end_lon', 'drogue_lost_date', 'typedeath', 'typebuoy', 'location_type', 'DeployingShip', 'DeploymentStatus', 'BuoyTypeManufacturer', 'BuoyTypeSensorArray', 'CurrentProgram', 'PurchaserFunding', 'SensorUpgrade', 'Transmissions', 'DeployingCountry', 'DeploymentComments', 'ManufactureYear', 'ManufactureMonth', 'ManufactureSensorType', 'ManufactureVoltage', 'FloatDiameter', 'SubsfcFloatPresence', 'DrogueType', 'DrogueLength', 'DrogueBallast', 'DragAreaAboveDrogue', 'DragAreaOfDrogue', 'DragAreaRatio', 'DrogueCenterDepth', 'DrogueDetectSensor']
data = ['ve', 'vn', 'err_lat', 'err_lon', 'err_ve', 'err_vn', 'gap', 'sst', 'sst1', 'sst2', 'err_sst', 'err_sst1', 'err_sst2', 'flg_sst', 'flg_sst1', 'flg_sst2', 'drogue_status']

ra = dataformat.create_ragged_array(
    drifter_ids,
    gdp.preprocess,
    coords,
    metadata,
    data,
    rowsize_func=gdp.rowsize,
)

## Export

In [None]:
ra.to_parquet('../data/process/gdp_v2.00.parquet')
ra.to_netcdf('../data/process/gdp_v2.00.nc')

## Import

In [None]:
ds2 = dataformat.read_from_parquet('../data/process/gdp_v2.00.parquet')

In [None]:
ds2.ID

## Awkward Array

In [None]:
import awkward._v2 as ak

In [None]:
ds = ra.to_awkward()

In [None]:
ak.nanmean(ds.obs.err_lat, axis=1)

In [None]:
ds.fields

In [None]:
ds.obs.fields

### Global attributes

In [None]:
ds.layout.parameters

### Variable attributes

In [None]:
ds.ID.layout.parameters

In [None]:
ds.obs.sst.layout.parameters