# Lagrangian simulation sample file

In this notebook, we use a simple output file from a Lagrangian simulation to highlight the steps required to preprocess a dataset into the ragged array format used by the *CloudDrift* library. The example dataset comes in a format very close to the output format of [Ocean Parcels](https://oceanparcels.org/) and [OpenDrift](https://opendrift.github.io/).

In [None]:
import sys
import numpy as np
import xarray as xr
sys.path.insert(0, '../')
from clouddrift import ragged_array

## Download

In [None]:
import os
from os.path import isfile, join, exists
import urllib.request

folder = "../data/raw/numerical/"
file = "example.nc"
os.makedirs(folder, exist_ok=True)  # create the raw data folder if it does not exist

if not isfile(join(folder, file)):
    url = "https://zenodo.org/record/6310460/files/global-marine-litter-2021.nc"
    print(f"Downloading ~1.1GB from {url}.")
    urllib.request.urlretrieve(url, join(folder, file))
    print(f"Dataset saved at {join(folder, file)}")
else:
    print(f"Dataset already at {join(folder, file)}.")

## Data

Numerical outputs from Lagrangian simulations are usually stored as two-dimensional arrays. In this dataset, each row is a trajectory and each column is an output time. This particular example contains 387,600 trajectories saved at daily intervals during the year 2021.

In [None]:
ds = xr.open_dataset(join(folder, file), decode_times=False)

In [None]:
ds

At the beginning of each month, 32,300 particles are released, and each trajectory is padded with `nan` before its release date.
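
To make the padding concrete, here is a minimal sketch on a toy array standing in for `ds.lon`. The sizes, release steps, and longitude values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

# Toy stand-in for ds.lon: 4 trajectories (rows) x 6 daily outputs (columns).
# Two particles are released at step 0 and two at step 3 (hypothetical values).
n_traj, n_obs = 4, 6
release_step = np.array([0, 0, 3, 3])

# Start from an all-NaN matrix and fill each row from its release step onward.
lon = np.full((n_traj, n_obs), np.nan)
for i, start in enumerate(release_step):
    lon[i, start:] = 10.0 * i + np.arange(n_obs - start)

# Number of valid (non-padded) observations per trajectory.
n_valid = np.isfinite(lon).sum(axis=1)
print(lon)
print(n_valid)  # [6 6 3 3]
```

The later a particle is released, the more of its row is padding, which is the storage overhead the ragged array format avoids.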

In [None]:
ds.lon[0,:]

In [None]:
ds.lon[32300,:]

In [None]:
ds.close()

## Preprocessing

To reorganize the data into a ragged array, one option is to write a preprocessing function and use the `ragged_array.from_files()` class method, as presented in the example notebook `dataformat-gdp.ipynb`. For numerical simulations, a *much faster* alternative is to build the required dictionaries manually and create the ragged array instance directly.
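
The manual conversion can be sketched on a small padded matrix using NumPy only. The array values below are hypothetical, and the variable names (`ragged_lon`, `ragged_time`, `ragged_ids`) are illustrative rather than part of the CloudDrift API.

```python
import numpy as np

# Hypothetical padded matrix: 3 trajectories x 4 time steps.
time = np.array([0.0, 1.0, 2.0, 3.0])
lon = np.array([
    [1.0, 1.1, 1.2, 1.3],
    [np.nan, np.nan, 2.0, 2.1],
    [np.nan, 3.0, 3.1, 3.2],
])

# Row (trajectory) and column (time) indices of the valid entries,
# returned in row-major order so observations stay grouped by trajectory.
row, col = np.nonzero(np.isfinite(lon))

# Trajectory ids and the number of observations in each trajectory.
ids, rowsize = np.unique(row, return_counts=True)

# Flatten each variable, keeping only the valid entries.
ragged_lon = lon[row, col]
ragged_time = time[col]
ragged_ids = np.repeat(ids, rowsize)

print(rowsize.tolist())     # [4, 2, 3]
print(ragged_ids.tolist())  # [0, 0, 0, 0, 1, 1, 2, 2, 2]
```

Indexing the 1D `time` vector with the column indices gives the same result as tiling it to 2D first, without allocating the full matrix.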

In [None]:
# initialize the dictionaries
coords = {}
metadata = {}
# note that this example dataset contains no data other than time, lon, lat, and ids;
# an empty dictionary "data" is initialized anyway
data = {}
attrs_global = {}
attrs_variables = {}

In [None]:
# decode_times=False keeps time as raw numeric values instead of converting to datetime
ds = xr.open_dataset(join(folder, file), decode_times=False)

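# locate valid (non-padded) observations; idx_finite holds (trajectory, time) indices
finite_values = np.isfinite(ds['lon'])
idx_finite = np.where(finite_values)

# ids of trajectories with at least one observation and their number of observations
# (unlike np.bincount, return_counts keeps both arrays aligned if a trajectory is empty)
unique_id, rowsize = np.unique(idx_finite[0], return_counts=True)
unique_id = unique_id.astype('int32')
rowsize = rowsize.astype('int32')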

# coordinates
coords["time"] = np.tile(ds.time.data, (ds.sizes['traj'], 1))[idx_finite]  # tile time to 2D, then keep valid entries to get ragged time
coords["lon"] = ds.lon.data[idx_finite].astype('float32')
coords["lat"] = ds.lat.data[idx_finite].astype('float32')
coords["ids"] = np.repeat(unique_id, rowsize)

# metadata variables
metadata["rowsize"] = rowsize
metadata["ID"] = unique_id

# attributes for each variable
attrs_variables = {
    "ID": {'long_name': 'Trajectory id', 'units':'-'},
    "time": {'long_name': 'Time in days', 'units': 'days since 2021-01-01'}, 
    "lon": {'long_name': 'longitude', 'units': 'degrees_east'}, 
    "lat": {'long_name': 'latitude', 'units': 'degrees_north'}, 
    "ids": {'long_name': 'Trajectory identification number repeated along observations', 'units': '-'},
    "rowsize": {'long_name': 'Number of observations per trajectory', 'sample_dimension': 'obs', 'units':'-'},
}

# global attributes
attrs_global = {
    'title': 'Marine Litter 2021',
    'institution': 'Florida State University Center for Ocean-Atmospheric Prediction Studies (COAPS)'
}

ds.close()

In [None]:
ra = ragged_array(coords, metadata, data, attrs_global, attrs_variables)

## Export

In [None]:
os.makedirs('../data/process', exist_ok=True)  # make sure the output folder exists
ra.to_parquet('../data/process/numerical_sample.parquet')

## Read

In [None]:
ra2 = ragged_array.from_parquet('../data/process/numerical_sample.parquet')

In [None]:
ra2
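
Once the data are in ragged form, a single trajectory can be recovered by slicing the flat arrays with offsets derived from `rowsize`. The sketch below uses plain NumPy on hypothetical values rather than the `ragged_array` object, and `get_trajectory` is an illustrative helper, not a CloudDrift function.

```python
import numpy as np

# Hypothetical ragged data: 3 trajectories with 4, 2 and 3 observations.
rowsize = np.array([4, 2, 3])
lon = np.array([1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 3.0, 3.1, 3.2])

# Start index of each trajectory in the flat array, plus the final end index.
traj_idx = np.insert(np.cumsum(rowsize), 0, 0)


def get_trajectory(values, traj_idx, i):
    """Return the observations of trajectory i from a flat ragged array."""
    return values[traj_idx[i]:traj_idx[i + 1]]


second = get_trajectory(lon, traj_idx, 1)
print(traj_idx.tolist())  # [0, 4, 6, 9]
print(second.tolist())    # [2.0, 2.1]
```

The same offsets apply to every variable that shares the observation dimension, such as `time`, `lat`, and `ids`.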