# Global Drifter Program (GDP 6-hourly data)

To process the 6-hourly dataset, we follow the same steps described in `dataformat-gdp.ipynb`. The only difference is that we have to define a new set of preprocessing functions (`data/gdp6h.py`) to handle the slight differences in the format of the individual NetCDF archives.

In [None]:
import sys
sys.path.insert(0, '../')
from clouddrift.dataformat import ragged_array

## Dataset-specific functions
- `gdp6h.preprocess`: applies the preprocessing routine and returns a `Dataset` for a specific trajectory
- `gdp6h.download`: fetches the NetCDF files from the server
- `gdp6h.rowsize [Optional]`: returns the dimension (number of observations) of a specific trajectory to speed up the preprocessing

In [None]:
from data import gdp6h

# Download

The download function stores the raw dataset in the `data/raw/gdp-6hourly/` folder specified in the `gdp6h.py` module. By default, `download()` retrieves the complete GDP dataset (25,587 files as of May 2022) from the AOML `https` server. It is also possible to specify a list of drifter IDs (`drifter_ids`) or a number of random IDs (`n_random_id`) to retrieve. The function returns the list of `drifter_ids` that were downloaded, which is needed to create the ragged array.
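The selection behaviour described above can be illustrated with a small, self-contained sketch. The helper `select_drifter_ids`, its `seed` argument, and the rule that an explicit `drifter_ids` list takes precedence over `n_random_id` are assumptions made for illustration; they are not taken from `gdp6h.py`, which may resolve these options differently.

```python
import random


def select_drifter_ids(all_ids, drifter_ids=None, n_random_id=None, seed=None):
    """Hypothetical helper: decide which drifter IDs to retrieve.

    Assumed precedence (not confirmed by gdp6h.py):
    explicit list > random subset > everything.
    """
    if drifter_ids is not None:
        # Keep only requested IDs that are actually available, in request order.
        available = set(all_ids)
        return [i for i in drifter_ids if i in available]
    if n_random_id is not None:
        # Draw without replacement; never ask for more IDs than exist.
        rng = random.Random(seed)
        n = min(n_random_id, len(all_ids))
        return sorted(rng.sample(list(all_ids), n))
    # Default: the complete dataset.
    return list(all_ids)


# Toy list of IDs standing in for the server's file index.
all_ids = [101, 102, 103, 104, 105, 106]
subset = select_drifter_ids(all_ids, n_random_id=3, seed=0)
explicit = select_drifter_ids(all_ids, drifter_ids=[104, 999, 101])
```

Returning the list of IDs that were actually selected matters because the same list is later used to build the ragged array, so it has to reflect what is on disk rather than what was requested.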

In [None]:
gdp6h.download?

In [None]:
drifter_ids = gdp6h.download(n_random_id=100)

In [None]:
drifter_ids[:5]

Once the data are downloaded, it is possible to create the ragged array and either save it as a NetCDF or Parquet file, or simply output an Awkward Array that can be used for analysis.

# Ragged array from a series of files

In [None]:
coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}
metadata = ['ID', 'rowsize', 'WMO', 'expno', 'deploy_date', 'deploy_lat', 'deploy_lon', 'end_date', 'end_lat', 'end_lon', 'drogue_lost_date', 'typedeath', 'typebuoy', 'DeployingShip', 'DeploymentStatus', 'BuoyTypeManufacturer', 'BuoyTypeSensorArray', 'CurrentProgram', 'PurchaserFunding', 'SensorUpgrade', 'Transmissions', 'DeployingCountry', 'DeploymentComments', 'ManufactureYear', 'ManufactureMonth', 'ManufactureSensorType', 'ManufactureVoltage', 'FloatDiameter', 'SubsfcFloatPresence', 'DrogueType', 'DrogueLength', 'DrogueBallast', 'DragAreaAboveDrogue', 'DragAreaOfDrogue', 'DragAreaRatio', 'DrogueCenterDepth', 'DrogueDetectSensor']
data = ['ve', 'vn', 'temp', 'err_lat', 'err_lon', 'err_temp', 'drogue_status']

ra = ragged_array.from_files(
    drifter_ids,
    gdp6h.preprocess,
    coords, 
    metadata,
    data,
    rowsize_func=gdp6h.rowsize
)
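To show why an optional `rowsize` function can speed up this step, here is a minimal NumPy sketch of the general idea: if the length of every trajectory is known up front, the flat output arrays can be allocated once and filled in place instead of being grown by repeated concatenation. The names `FAKE_FILES`, `build_ragged`, and the toy `preprocess`/`rowsize` functions are hypothetical stand-ins and do not reproduce the internals of `ragged_array.from_files`.

```python
import numpy as np

# Hypothetical in-memory stand-in for the per-drifter NetCDF files.
FAKE_FILES = {
    1: {"longitude": [10.0, 10.5, 11.0], "ve": [0.1, 0.2, 0.3]},
    2: {"longitude": [-30.0, -29.5], "ve": [0.4, 0.5]},
    3: {"longitude": [150.0, 150.2, 150.4, 150.6], "ve": [0.0, -0.1, -0.2, -0.3]},
}


def rowsize(drifter_id):
    """Toy analogue of a rowsize function: number of observations."""
    return len(FAKE_FILES[drifter_id]["longitude"])


def preprocess(drifter_id):
    """Toy analogue of a preprocess function: one trajectory as arrays."""
    return {k: np.asarray(v) for k, v in FAKE_FILES[drifter_id].items()}


def build_ragged(ids, preprocess_func, rowsize_func, variables):
    """Fill preallocated flat arrays, one contiguous slice per trajectory."""
    sizes = np.array([rowsize_func(i) for i in ids])
    # offsets[n]:offsets[n + 1] is the slice owned by trajectory n.
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    out = {v: np.empty(offsets[-1]) for v in variables}
    for n, i in enumerate(ids):
        traj = preprocess_func(i)
        for v in variables:
            out[v][offsets[n]:offsets[n + 1]] = traj[v]
    return out, sizes


flat, sizes = build_ragged([1, 2, 3], preprocess, rowsize, ["longitude", "ve"])
```

In this layout the observation variables live in one flat array each, and `sizes` plays the role of the `rowsize` metadata variable that records where each trajectory starts and ends.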

## Export

In [None]:
ra.to_parquet('../data/process/gdp_6h.parquet')
ra.to_netcdf('../data/process/gdp_6h.nc')

## Import

In [None]:
ra2 = ragged_array.from_parquet('../data/process/gdp_6h.parquet')

In [None]:
ra2

## Awkward Array

In [None]:
ds = ra2.to_awkward()

In [None]:
ds.ID

In [None]:
ds.fields

In [None]:
ds.obs.fields
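For readers who want to see what grouping observations by trajectory amounts to, the following NumPy-only sketch recovers per-trajectory views and per-trajectory means from a flat array and its row sizes. It assumes the flat-array-plus-`rowsize` layout and uses made-up values; it does not use the Awkward Array API, and every row size is assumed to be greater than zero.

```python
import numpy as np

# Made-up flat observation array and the number of observations per drifter.
rowsize = np.array([3, 2, 4])
ve = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.0, -0.1, -0.2, -0.3])

# Index of the first observation of each trajectory.
starts = np.concatenate(([0], np.cumsum(rowsize)[:-1]))

# One array per trajectory, split at the start of every trajectory but the first.
trajectories = np.split(ve, starts[1:])

# Per-trajectory mean computed in one pass over the flat array.
mean_ve = np.add.reduceat(ve, starts) / rowsize
```

The same idea applies to any observation variable (`vn`, `temp`, and so on), since they all share the same row sizes.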