# Global Drifter Program (GDP)

As part of this Notebook, we will use the *GDP historical dataset* to highlight the steps required to preprocess a dataset into a format that can be ingested by the *CloudDrift* library.

## Dataformat module

The `dataformat.py` module contains the class `create_ragged_array`, which transforms a series of archives into an *Awkward Array* where all variables are stored as *ragged arrays*. The module also contains `read_from_netcdf` and `read_from_parquet` to initialize the *Awkward Array* directly from a previously preprocessed archive. Right now, it *only* supports local arrays, but we will soon add the possibility of *lazy-loading* arrays stored in the Cloud.
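To make the term *ragged array* concrete, here is a minimal sketch in plain Python. It only illustrates the storage idea (one flat sequence of observations plus the number of observations per trajectory); the names `flat`, `rowsize`, `offsets` and `get_trajectory` are chosen for this example and are not taken from CloudDrift or Awkward Array.

```python
# Illustrative only: a ragged array stored as one flat list plus row sizes.
# This mirrors the idea, not CloudDrift's internal implementation.
from itertools import accumulate

# Three hypothetical trajectories with different numbers of observations.
trajectories = [
    [10.0, 10.5, 11.0],
    [20.0],
    [30.0, 30.5],
]

# Flatten the observations and remember how many belong to each trajectory.
flat = [value for traj in trajectories for value in traj]
rowsize = [len(traj) for traj in trajectories]

# Start offset of each trajectory in the flat list, plus the final end offset.
offsets = [0, *accumulate(rowsize)]


def get_trajectory(i):
    """Return the observations of trajectory i from the flat storage."""
    return flat[offsets[i]:offsets[i + 1]]
```

With this layout, trajectories of very different lengths can share a single contiguous array without padding, and any trajectory can still be recovered from its offsets.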

In [1]:
import sys

In [2]:
%load_ext autoreload
%autoreload 2

sys.path.insert(0, '../')
from clouddrift.dataformat import create_ragged_array, read_from_netcdf, read_from_parquet

The main class of this module is `create_ragged_array`, which is used to create a single archive that can be saved to a netCDF or Parquet file. The signature of the class is:

In [3]:
create_ragged_array?

Init signature:
create_ragged_array(
    indices: list,
    vars_coords: dict,
    vars_meta: list,
    vars_data: list,
    preprocess_func: collections.abc.Callable[[int], xarray.core.dataset.Dataset],
    rowsize_func: collections.abc.Callable[[int], int] = None,
)

## Dataset-specific functions

Since each dataset is different, we have to create specific functions to preprocess the dataset (`preprocess_func`) and return the metadata and data of a single trajectory. This was inspired by the [Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/) project. The class *create_ragged_array* will use those functions to create the single archive of ragged arrays. More precisely, it requires:
- a list of indices (or identification numbers) of the trajectories that will be concatenated into the ragged array format
- a dictionary mapping the list of mandatory coordinates to the names of those variables in the dataset, e.g.
    coords = {'ids': 'number', 'time': 't', 'longitude': 'lon', 'latitude': 'lat'}
- a list of variable names containing metadata information about the trajectory (size: 1 per trajectory)
- a list of variable names containing the data along the trajectory (size: number of observations per trajectory)
- a preprocessing function with the following signature:
    - `Signature: preprocess_func(index: int) -> xarray.core.dataset.Dataset`, where the `index` parameter is an identifier of a trajectory (e.g. the identification number of an Argo float) and the function returns an *xarray Dataset*.

This function can perform all types of operations, such as formatting the date, changing the type of variables, modifying the metadata, etc. We provide preprocessing functions for different datasets in the `data/recipes/` folder. The class also needs to *initially* calculate the total number of observations. By default, this is performed using `lambda i: self.preprocess_func(i).dims['obs']`. To *speed up* this process when a lot of preprocessing is performed, it is possible to provide a second function, `rowsize_func`, that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`).
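The role of the two functions can be sketched as a two-pass construction: count the observations first, then fill a preallocated array. The example below is a simplified stand-in written for this Notebook, not the actual `create_ragged_array` code. It assumes that a plain dict replaces the *xarray Dataset*, and `RAW`, `preprocess`, `rowsize` and `build_ragged` are hypothetical names with made-up values.

```python
# Hypothetical sketch of the two-pass construction; a plain dict stands in
# for the xarray Dataset that a real preprocess_func would return.
from itertools import accumulate

# Fake "raw archive": trajectory id -> longitudes (made-up values).
RAW = {101: [1.0, 2.0, 3.0], 102: [4.0], 103: [5.0, 6.0]}


def preprocess(index):
    """Stand-in for preprocess_func: return one trajectory's data."""
    return {"ids": index, "lon": list(RAW[index])}


def rowsize(index):
    """Stand-in for rowsize_func: observation count without preprocessing."""
    return len(RAW[index])


def build_ragged(indices, preprocess_func, rowsize_func=None):
    # Fall back to a full preprocessing pass if no shortcut is given.
    if rowsize_func is None:
        rowsize_func = lambda i: len(preprocess_func(i)["lon"])

    # Pass 1: count observations so the output can be allocated once.
    sizes = [rowsize_func(i) for i in indices]
    offsets = [0, *accumulate(sizes)]
    lon = [None] * offsets[-1]

    # Pass 2: copy each trajectory into its slot of the flat array.
    for k, i in enumerate(indices):
        lon[offsets[k]:offsets[k + 1]] = preprocess_func(i)["lon"]
    return sizes, lon


sizes, lon = build_ragged([101, 102, 103], preprocess, rowsize)
```

Without `rowsize_func`, every trajectory is preprocessed twice (once to count, once to fill), which is why a cheap counting function pays off when preprocessing is expensive.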

Finally, we included a function to download (or update) the GDP dataset.

In [4]:
from data.gdp import preprocess, rowsize, download

## Download

The download function will store the raw dataset in the `data/raw/` folder specified in the `gdp.py` module. By default, `download()` will download the complete GDP dataset (containing 17,324 files as of May 2022) from the AOML `https` server.

**Note**: this Notebook is very similar to the `data-glad.ipynb` Notebook because very few functions have to be created to transform a new dataset. We hope that this will encourage people to use this dataformat and utilize the CloudDrift library.

In [5]:
download?

Signature: download(drifter_ids: list = None, n_random_id: int = None)
Docstring:
Download individual netCDF files from the AOML server

:param drifter_ids [list]: list of drifter to retrieve (Default: all)
:param n_random_id [int]: randomly select n drifter netCDF files
:return drifters_ids [list]: list of retrived drifter
File:      ~/Library/CloudStorage/OneDrive-FloridaStateUniversity/projects/clouddrift/data/gdp.py
Type:      function


It is possible to provide a list of `drifter_ids` to retrieve a subset and/or specify an integer `n_random_id` to randomly retrieve `n` trajectories. The function returns the list of `drifters_ids` that were downloaded, which can then be passed on to create the ragged array.
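One way the two parameters could combine is sketched below. This is an assumption about the selection logic, written for illustration only, and not the code in `gdp.py`: `select_ids`, `available_ids` and `seed` are hypothetical, and no files are downloaded.

```python
import random


def select_ids(available_ids, drifter_ids=None, n_random_id=None, seed=None):
    """Hypothetical helper: pick which drifter IDs to fetch."""
    # Start from the requested subset, or from everything available.
    if drifter_ids is None:
        selected = list(available_ids)
    else:
        wanted = set(drifter_ids)
        selected = [i for i in available_ids if i in wanted]

    # Optionally draw a random sample (without replacement) from that selection.
    if n_random_id is not None:
        n = min(n_random_id, len(selected))
        selected = random.Random(seed).sample(selected, n)
    return selected
```

For example, `select_ids(all_ids, n_random_id=100)` would return 100 distinct IDs drawn from `all_ids`, which is convenient for testing the pipeline on a small subset before processing the full dataset.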

In [6]:
drifter_ids = download()

100%|███████████████████████████████████████████████████████████████████████████████████████████| 17324/17324 [00:00<00:00, 74506.15it/s]


Once the data are downloaded, it is possible to create the ragged array and either save it to a netCDF or Parquet file, or simply output an Awkward Array that can be used for analysis.

In [7]:
coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}
metadata = ['ID', 'rowsize', 'WMO', 'expno', 'deploy_date', 'deploy_lat', 'deploy_lon', 'end_date', 'end_lat', 'end_lon', 'drogue_lost_date', 'typedeath', 'typebuoy', 'location_type', 'DeployingShip', 'DeploymentStatus', 'BuoyTypeManufacturer', 'BuoyTypeSensorArray', 'CurrentProgram', 'PurchaserFunding', 'SensorUpgrade', 'Transmissions', 'DeployingCountry', 'DeploymentComments', 'ManufactureYear', 'ManufactureMonth', 'ManufactureSensorType', 'ManufactureVoltage', 'FloatDiameter', 'SubsfcFloatPresence', 'DrogueType', 'DrogueLength', 'DrogueBallast', 'DragAreaAboveDrogue', 'DragAreaOfDrogue', 'DragAreaRatio', 'DrogueCenterDepth', 'DrogueDetectSensor']
data = ['ve', 'vn', 'err_lat', 'err_lon', 'err_ve', 'err_vn', 'gap', 'sst', 'sst1', 'sst2', 'err_sst', 'err_sst1', 'err_sst2', 'flg_sst', 'flg_sst1', 'flg_sst2', 'drogue_status']

ra = create_ragged_array(drifter_ids,
                         coords, 
                         metadata, 
                         data,
                         preprocess_func=preprocess,
                         rowsize_func=rowsize
                        )

Calculating the number of observations: 100%|██████████████████████████████████████████████████████| 17324/17324 [12:55<00:00, 22.34it/s]
  condition |= data == fv
Filling the ragged array: 100%|████████████████████████████████████████████████████████████████████| 17324/17324 [25:00<00:00, 11.55it/s]


## Export

In [8]:
ra.to_parquet('../data/process/gdp_v2.00.parquet')
ra.to_netcdf('../data/process/gdp_v2.00.nc')

In [33]:
ds = ra.to_awkward()

In [34]:
from clouddrift import spectrum
import awkward._v2 as ak

In [35]:
ds = read_from_netcdf('../data/process/gdp_v2.00.nc')

In [37]:
%%time
ret = spectrum.periodogram_per_traj(ds.obs.ve + 1j*ds.obs.vn)

CPU times: user 31.8 s, sys: 5.12 s, total: 36.9 s
Wall time: 38.3 s


In [38]:
ak.nanmean(ds.obs.err_lat, axis=1)

<Array [0.0025, 0.00228, ..., 0.00042, 0.000416] type='17324 * ?float64'>

In [13]:
ds.fields

['ID',
 'rowsize',
 'WMO',
 'expno',
 'deploy_date',
 'deploy_lat',
 'deploy_lon',
 'end_date',
 'end_lat',
 'end_lon',
 'drogue_lost_date',
 'typedeath',
 'typebuoy',
 'location_type',
 'DeployingShip',
 'DeploymentStatus',
 'BuoyTypeManufacturer',
 'BuoyTypeSensorArray',
 'CurrentProgram',
 'PurchaserFunding',
 'SensorUpgrade',
 'Transmissions',
 'DeployingCountry',
 'DeploymentComments',
 'ManufactureYear',
 'ManufactureMonth',
 'ManufactureSensorType',
 'ManufactureVoltage',
 'FloatDiameter',
 'SubsfcFloatPresence',
 'DrogueType',
 'DrogueLength',
 'DrogueBallast',
 'DragAreaAboveDrogue',
 'DragAreaOfDrogue',
 'DragAreaRatio',
 'DrogueCenterDepth',
 'DrogueDetectSensor',
 'obs']

In [14]:
ds.obs.fields

['ids',
 'time',
 'lon',
 'lat',
 've',
 'vn',
 'err_lat',
 'err_lon',
 'err_ve',
 'err_vn',
 'gap',
 'sst',
 'sst1',
 'sst2',
 'err_sst',
 'err_sst1',
 'err_sst2',
 'flg_sst',
 'flg_sst1',
 'flg_sst2',
 'drogue_status']

## Read

In [15]:
ds2 = read_from_parquet('gdp.parquet')

In [16]:
ds2.ID

<Array [44136, 54680, ..., 63714570, 65510710] type='12 * int64[parameters=...'>