# CARTHE GLAD Experiment

In this Notebook, we use the *CARTHE GLAD experiment dataset* to highlight the steps required to preprocess a dataset into a format that can be ingested by the *CloudDrift* library.

**Note**: this Notebook closely follows the `data-gdp.ipynb` Notebook, because only a few functions have to be written to transform a new dataset. We hope this will encourage people to adopt this data format and the CloudDrift library.

## Dataformat module

The `dataformat.py` module contains the class `create_ragged_array`, which transforms a series of archives into an *Awkward Array* where all variables are stored as *ragged arrays*. The module also contains `read_from_netcdf` and `read_from_parquet` to initialize the *Awkward Array* directly from a previously preprocessed archive. For now, it supports *only* local arrays, but we will soon add the possibility of *lazy-loading* arrays stored in the Cloud.
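To make the idea of a *ragged array* concrete, here is a minimal sketch in plain Python. It is not how CloudDrift or Awkward Array store data internally; the trajectories and the helper `get_trajectory` are hypothetical and only illustrate the layout: one flat array of observations plus the number of observations per trajectory.

```python
from itertools import accumulate

# Three hypothetical trajectories with different numbers of observations.
trajectories = [
    [10.0, 10.5, 11.0],        # trajectory 0: 3 observations
    [20.0],                    # trajectory 1: 1 observation
    [30.0, 30.2, 30.4, 30.6],  # trajectory 2: 4 observations
]

# Ragged layout: one flat array plus the number of observations per trajectory.
flat = [value for traj in trajectories for value in traj]
rowsize = [len(traj) for traj in trajectories]

# Start offset of each trajectory in the flat array (cumulative sum of rowsize).
offsets = [0, *accumulate(rowsize)]


def get_trajectory(i: int) -> list[float]:
    """Return the observations of trajectory i as a slice of the flat array."""
    return flat[offsets[i]:offsets[i + 1]]


print(rowsize)            # [3, 1, 4]
print(offsets)            # [0, 3, 4, 8]
print(get_trajectory(2))  # [30.0, 30.2, 30.4, 30.6]
```

Storing the data this way avoids padding short trajectories to the length of the longest one, which is why the total number of observations has to be known before the flat arrays are allocated.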

In [None]:
import sys
sys.path.insert(0, '../')
from clouddrift import dataformat

The main class of this module is *create_ragged_array*, which creates a single archive that can be saved to a netCDF or Parquet file. Its signature is:

In [None]:
dataformat.create_ragged_array?

## Dataset-specific functions

Since each dataset is different, we have to create specific functions to preprocess the dataset (`preprocess_func`) and return the metadata and data of a single trajectory. This was inspired by the [Pangeo Forge](https://pangeo-forge.readthedocs.io/en/latest/) project. The class *create_ragged_array* will use those functions to create the single archive of ragged arrays. More precisely, it requires:
- a list of indices (or identification numbers) of the trajectories that will be concatenated into the ragged array format
- a preprocessing function with the following signature:
    - `Signature: preprocess_func(index: int) -> xarray.core.dataset.Dataset`, where the `index` parameter identifies a trajectory (e.g. the identification number of an Argo float) and the function returns an *xarray Dataset*.
- a dictionary mapping the list of mandatory coordinates to the names of those variables in the dataset, e.g.
    `coords = {'ids': 'number', 'time': 't', 'longitude': 'lon', 'latitude': 'lat'}`
- an optional list of variable names containing metadata information about the trajectory (size: 1 per trajectory)
- an optional list of variable names containing the data along the trajectory (size: number of observations per trajectory)
- an optional function that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`)
    
The preprocessing function can perform all types of operations, such as formatting dates, changing the type of variables, modifying the metadata, etc. We provide preprocessing functions for different datasets in the `data/recipes/` folder. The class also needs to *initially* calculate the total number of observations across all trajectories. By default, this is done with a lambda function: `lambda i: self.preprocess_func(i).dims['obs']`. To *speed up* this step when the preprocessing is expensive, it is possible to provide a function `rowsize_func` that directly returns the number of observations of a trajectory (`Signature: rowsize_func(index: int) -> int`).
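The benefit of `rowsize_func` can be sketched as follows. This is an illustration under stated assumptions, not the CloudDrift implementation: `RAW`, `preprocess`, `rowsize` and `total_observations` are hypothetical names, and the stand-in preprocessing function returns a plain dictionary rather than an *xarray Dataset* so that the example runs without any dependency.

```python
# Hypothetical raw records keyed by trajectory id (stand-in for files on disk).
RAW = {
    1: [0.1, 0.2, 0.3],
    2: [0.4, 0.5],
    4: [0.6, 0.7, 0.8, 0.9],
}

# Counts how many times the (supposedly expensive) preprocessing runs.
calls = {"preprocess": 0}


def preprocess(index: int) -> dict:
    """Stand-in for preprocess_func: returns a dict, not an xarray Dataset."""
    calls["preprocess"] += 1
    values = [round(v * 100) for v in RAW[index]]  # pretend this is expensive
    return {"dims": {"obs": len(values)}, "ve": values}


def rowsize(index: int) -> int:
    """Stand-in for rowsize_func: reads the count without preprocessing."""
    return len(RAW[index])


def total_observations(indices, preprocess_func, rowsize_func=None):
    """Sum the number of observations over all trajectories."""
    if rowsize_func is None:
        # Default: run the full preprocessing just to read the 'obs' size.
        rowsize_func = lambda i: preprocess_func(i)["dims"]["obs"]
    return sum(rowsize_func(i) for i in indices)


indices = [1, 2, 4]
slow = total_observations(indices, preprocess)           # preprocesses every trajectory
fast = total_observations(indices, preprocess, rowsize)  # never calls preprocess
print(slow, fast, calls["preprocess"])  # 9 9 3
```

Both paths return the same total, but only the default path triggers the preprocessing, once per trajectory.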

Because the dataset consists of a single `dat` file, it is loaded automatically when the `glad` module is imported.

In [None]:
from data import glad

We can test the preprocessing function by calling it with one of the identification numbers.

In [None]:
glad.preprocess(1)

It is now possible to create the ragged array and either save it to a netCDF or Parquet file, or simply output an Awkward Array that can be used for analysis.

In [None]:
files = [  1,   2,   3,   4,   5,   6,   7,   8,  10,  11,  12,  13,  14,
        15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
        41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  52,  53,  54,
        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,
        68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,
        83,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
       123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
       136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
       149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161,
       162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174,
       175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187,
       188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200,
       201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213,
       214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226,
       227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
       240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
       253, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266,
       267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
       280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 301, 302,
       303, 304, 306, 307, 308, 310, 313, 314, 315, 317, 451]

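# Map the mandatory coordinates to the variable names used in the dataset
coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'}
# Metadata variables (one value per trajectory)
metadata = ['ID', 'rowsize']
# Data variables (one value per observation along the trajectory)
data = ['ve', 'vn', 'err_pos', 'err_vel']

ra = dataformat.create_ragged_array(
    files,
    glad.preprocess,
    coords,
    metadata,
    data,
)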

## Export

In [None]:
ra.to_parquet('../data/process/glad.parquet')

In [None]:
ak = ra.to_awkward()

In [None]:
ak.ID

In [None]:
ak.fields

In [None]:
ak.obs.fields

## Read

In [None]:
ak2 = dataformat.read_from_parquet('../data/process/glad.parquet')

In [None]:
ak2.ID