# Global Drifter Program (6-hourly data): example

To process the NOAA Global Drifter Program (GDP) 6-hourly dataset, we follow the same steps described in `dataformat-gdp.ipynb` for the hourly product. The only difference is that we define a new preprocessing module (`data/gdp6h.py`) to handle the slight differences in the format of the individual NetCDF files.

## Dataformat module

In [1]:
import sys
sys.path.insert(0, '../')
from clouddrift import ragged_array

## Dataset-specific functions
The `gdp6h.py` module contains a number of functions specific to the current GDP 6-hourly files, including:
- `gdp6h.preprocess`: applies the preprocessing routine and returns an `xarray.Dataset` for a specific trajectory
- `gdp6h.download`: fetches NetCDF files from the GDP FTP server
- `gdp6h.rowsize` (optional): returns the dimension of a specific trajectory to speed up the preprocessing

In [None]:
from data import gdp6h

## Download:

The `gdp6h.download` function stores the raw dataset in the `data/raw/gdp-6hourly/` folder (specified in the `gdp6h.py` module). By default, `download()` retrieves the complete GDP dataset (25,587 files as of May 2022) from the AOML repository ([link](https://www.aoml.noaa.gov/ftp/pub/phod/lumpkin/netcdf/)).

In [None]:
gdp6h.download?

With this function, it is also possible to retrieve a subset by passing a list `drifter_ids`, or to specify an integer `n_random_id` to randomly retrieve that many trajectories. If both arguments are given, the function downloads `n_random_id` trajectories drawn from the list `drifter_ids`. The function returns the list of `drifter_ids` that were downloaded, which can then be passed on to create the ragged array.

In [None]:
drifter_ids = gdp6h.download(n_random_id=100)

In [None]:
drifter_ids[:5]
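
The selection rule described above (an explicit list, a random draw, or a random draw from the list) can be sketched in plain Python. This is only an illustration: `select_ids`, its `all_ids` and `seed` parameters, and the choice to cap the draw at the pool size are hypothetical and are not taken from `gdp6h.py`.

```python
import random


def select_ids(drifter_ids=None, n_random_id=None, all_ids=(), seed=None):
    """Hypothetical helper: pick which drifter IDs would be downloaded.

    - no arguments: the complete catalogue (`all_ids`)
    - `drifter_ids` only: exactly that list
    - `n_random_id` only: a random draw from the complete catalogue
    - both: a random draw of `n_random_id` IDs from `drifter_ids`
    """
    pool = list(drifter_ids) if drifter_ids is not None else list(all_ids)
    if n_random_id is None:
        return pool
    rng = random.Random(seed)
    # Assumption: never ask for more IDs than the pool contains.
    n = min(n_random_id, len(pool))
    return rng.sample(pool, n)


# Made-up catalogue of 20 drifter IDs.
catalog = list(range(1000, 1020))
subset = select_ids(drifter_ids=catalog[:10], n_random_id=3, seed=0)
```

Here `subset` holds three distinct IDs, all taken from the first ten entries of `catalog`.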

Once the data are downloaded, the ragged array object can be created and then saved as a NetCDF file or a Parquet file, or converted to an [Awkward Array](https://github.com/scikit-hep/awkward) for analysis:

In [None]:
# Coordinate variables
coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}

# Per-trajectory metadata variables
metadata = [
    'ID', 'rowsize', 'WMO', 'expno', 'deploy_date', 'deploy_lat', 'deploy_lon',
    'end_date', 'end_lat', 'end_lon', 'drogue_lost_date', 'typedeath', 'typebuoy',
    'DeployingShip', 'DeploymentStatus', 'BuoyTypeManufacturer', 'BuoyTypeSensorArray',
    'CurrentProgram', 'PurchaserFunding', 'SensorUpgrade', 'Transmissions',
    'DeployingCountry', 'DeploymentComments', 'ManufactureYear', 'ManufactureMonth',
    'ManufactureSensorType', 'ManufactureVoltage', 'FloatDiameter', 'SubsfcFloatPresence',
    'DrogueType', 'DrogueLength', 'DrogueBallast', 'DragAreaAboveDrogue',
    'DragAreaOfDrogue', 'DragAreaRatio', 'DrogueCenterDepth', 'DrogueDetectSensor',
]

# Per-observation data variables
data = ['ve', 'vn', 'temp', 'err_lat', 'err_lon', 'err_temp', 'drogue_status']

ra = ragged_array.from_files(
    drifter_ids,
    gdp6h.preprocess,
    coords, 
    metadata,
    data,
    rowsize_func=gdp6h.rowsize
)
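
To give an idea of why a row size per trajectory is useful, here is a minimal sketch of a ragged layout in plain Python. It assumes that observations from all trajectories are stored back to back in one flat array, with a `rowsize` vector recording how many observations belong to each trajectory. The values are made up, and this is not the `ragged_array` implementation.

```python
from itertools import accumulate

# Made-up trajectories of different lengths (longitudes only).
trajectories = {
    101: [-40.0, -39.5, -39.1],
    102: [12.3, 12.9],
    103: [150.2, 150.4, 150.9, 151.3],
}

ids = list(trajectories)

# Knowing each trajectory's length up front gives the total output size,
# so the flat array can be allocated once instead of grown file by file.
rowsize = [len(trajectories[i]) for i in ids]
lon = [0.0] * sum(rowsize)

# offsets[k] is the index in the flat array where trajectory k starts.
offsets = [0, *accumulate(rowsize)]
for k, i in enumerate(ids):
    lon[offsets[k]:offsets[k + 1]] = trajectories[i]


def trajectory(k):
    """Return the observations of the k-th trajectory from the flat array."""
    return lon[offsets[k]:offsets[k + 1]]
```

With this layout, `trajectory(1)` recovers the two observations of drifter `102` without scanning the rest of the array.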

## Export to data files:

In [None]:
ra.to_parquet('../data/process/gdp_6h.parquet')
ra.to_netcdf('../data/process/gdp_6h.nc')

## Import from data files:

In [None]:
ra2 = ragged_array.from_parquet('../data/process/gdp_6h.parquet')

In [None]:
ra2

## Convert to Awkward Array:

In [None]:
ds = ra2.to_awkward()

In [None]:
ds.ID

In [None]:
ds.fields

In [None]:
ds.obs.fields