# TileDB-CF Example with ERA-5

## About this Example

### What it Shows

1. Converting a NetCDF file to a virtual TileDB Group keeping the arrays dense
2. Converting a NetCDF file to a virtual TileDB Group by converting NetCDF "coordinates" to TileDB sparse dimensions

## Example Dataset

This example converts a sample of ERA-5 data originally downloaded from Copernicus ([www.copernicus.eu](www.copernicus.eu)). The NetCDF file contains the following:

* Dimensions:
    * longitude: 1171
    * latitude: 1501
    * time: 24
* Variables:
    * t2m(time, latitude, longitude)
        * description: 2 metre temperature in kelvin
        * data type: 32-bit floating point values
    * sp(time, latitude, longitude)
        * description: surface air pressure in Pascals
        * data type: 32-bit floating point values     
    * tp(time, latitude, longitude)
        * description: total precipitation in metres
        * data type: 32-bit floating point values

## Set-up Requirements

This example requires the following Python libraries: tiledb, tiledb-cf, netCDF4, numpy, xarray

In [18]:
import netCDF4
import numpy as np
import tiledb
import tiledb.cf
from tiledb.cf.engines.netcdf4_engine import NetCDF4ConverterEngine
import xarray as xr

In [19]:
input_file = "../data/era5_sample_monthly.nc"

### Example: ERA-5 NetCDF to Dense TileDB Arrays

This example shows converting a NetCDF file with ERA-5 data to dense TileDB arrays with axis labels for the time, latitude, and longitude data. The conversion is completed in 3 steps.

1. Auto-generate a NetCDF-to-TileDB conversion schema from a NetCDF file.
    * For this example we use the `from_file` method with `coords_to_dims=False` and `collect_attrs=True`. This creates a conversion schema that maps NetCDF dimensions to TileDB dimensions and NetCDF variables to TileDB attributes, and collects attributes with the same underlying dimensions into a single array.
2. View and update the TileDB Group schema with desired properties.
    * In this example we rename the TileDB arrays to more meaningful names and set tile sizes and compression filters.
3. Create the schema and copy the data from NetCDF to TileDB.
    * Creating the TileDB group and copying the data from NetCDF to TileDB can be done in a single step using the `convert_to_group` method once we have defined our desired conversion schema.
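To make the idea of collecting attributes concrete, the following is a minimal pure-Python sketch of grouping variables by their shared dimensions. It is an illustration only, not the TileDB-CF implementation; the `group_by_dims` helper and the `variable_dims` mapping are hypothetical names, with the variable-to-dimension mapping taken from the dataset description and converter output in this example.

```python
from collections import defaultdict

# Variable name -> tuple of NetCDF dimension names for the ERA-5 sample.
variable_dims = {
    "latitude": ("latitude",),
    "longitude": ("longitude",),
    "time": ("time",),
    "t2m": ("time", "latitude", "longitude"),
    "sp": ("time", "latitude", "longitude"),
    "tp": ("time", "latitude", "longitude"),
}


def group_by_dims(var_dims):
    """Collect variable names that share the same tuple of dimensions."""
    groups = defaultdict(list)
    for name, dims in var_dims.items():
        groups[dims].append(name)
    return dict(groups)


# Each key corresponds to one array; each value lists the attributes it would hold.
arrays = group_by_dims(variable_dims)
for dims, names in arrays.items():
    print(dims, "->", names)
```

With this grouping, the three data variables end up together in one array, while each coordinate variable gets an array of its own, which matches the four arrays in the converter output.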

In [5]:
# 1. Auto-generate a NetCDF-to-TileDB conversion schema from a NetCDF file.
converter = NetCDF4ConverterEngine.from_file(input_file, coords_to_dims=False, collect_attrs=True)
converter

"NetCDFDimension(name=longitude, size=3600) → SharedDim(name=longitude, domain=(0, 3599), dtype='uint64')"
"NetCDFDimension(name=latitude, size=1801) → SharedDim(name=latitude, domain=(0, 1800), dtype='uint64')"
"NetCDFDimension(name=time, size=133) → SharedDim(name=time, domain=(0, 132), dtype='uint64')"

"NetCDFDimension(name=latitude, size=1801) → tiledb.Dim(name=latitude, domain=(0, 1800), dtype='uint64', tile=None)"

"NetCDFVariable(name=latitude, dtype=float32) → tiledb.Attr(name=latitude.data, dtype='float32', var=False, nullable=False)"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None

"NetCDFDimension(name=longitude, size=3600) → tiledb.Dim(name=longitude, domain=(0, 3599), dtype='uint64', tile=None)"

"NetCDFVariable(name=longitude, dtype=float32) → tiledb.Attr(name=longitude.data, dtype='float32', var=False, nullable=False)"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None

"NetCDFDimension(name=time, size=133) → tiledb.Dim(name=time, domain=(0, 132), dtype='uint64', tile=None)"

"NetCDFVariable(name=time, dtype=int32) → tiledb.Attr(name=time.data, dtype='int32', var=False, nullable=False)"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None

"NetCDFDimension(name=time, size=133) → tiledb.Dim(name=time, domain=(0, 132), dtype='uint64', tile=None)"
"NetCDFDimension(name=latitude, size=1801) → tiledb.Dim(name=latitude, domain=(0, 1800), dtype='uint64', tile=None)"
"NetCDFDimension(name=longitude, size=3600) → tiledb.Dim(name=longitude, domain=(0, 3599), dtype='uint64', tile=None)"

"NetCDFVariable(name=t2m, dtype=int16) → tiledb.Attr(name=t2m, dtype='int16', var=False, nullable=False)"
"NetCDFVariable(name=sp, dtype=int16) → tiledb.Attr(name=sp, dtype='int16', var=False, nullable=False)"
"NetCDFVariable(name=tp, dtype=int16) → tiledb.Attr(name=tp, dtype='int16', var=False, nullable=False)"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None


In [6]:
# 2a. Rename the auto-generated arrays to more meaningful names
converter.rename_array('array0', 'latitude')
converter.rename_array('array1', 'longitude')
converter.rename_array('array2', 'time')
converter.rename_array('array3', 'data')

In [9]:
# 2b. Set the desired array and attribute properties. Here we set the tile sizes
# and capacity for the "data" array and a Zstd compression filter on every attribute.
converter.set_array_properties("data", tiles=(32, 64, 64), capacity=100000)
filter_list = tiledb.FilterList([tiledb.ZstdFilter(level=7)])
for attr_name in converter.attr_names:
    converter.set_attr_properties(attr_name, filters=filter_list)
converter

"NetCDFDimension(name=longitude, size=3600) → SharedDim(name=longitude, domain=(0, 3599), dtype='uint64')"
"NetCDFDimension(name=latitude, size=1801) → SharedDim(name=latitude, domain=(0, 1800), dtype='uint64')"
"NetCDFDimension(name=time, size=133) → SharedDim(name=time, domain=(0, 132), dtype='uint64')"

"NetCDFDimension(name=latitude, size=1801) → tiledb.Dim(name=latitude, domain=(0, 1800), dtype='uint64', tile=None)"

"NetCDFVariable(name=latitude, dtype=float32) → tiledb.Attr(name=latitude.data, dtype='float32', var=False, nullable=False, filters=FilterList(FilterList([ZstdFilter(level=7)])))"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None

"NetCDFDimension(name=longitude, size=3600) → tiledb.Dim(name=longitude, domain=(0, 3599), dtype='uint64', tile=None)"

"NetCDFVariable(name=longitude, dtype=float32) → tiledb.Attr(name=longitude.data, dtype='float32', var=False, nullable=False, filters=FilterList(FilterList([ZstdFilter(level=7)])))"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None

"NetCDFDimension(name=time, size=133) → tiledb.Dim(name=time, domain=(0, 132), dtype='uint64', tile=None)"

"NetCDFVariable(name=time, dtype=int32) → tiledb.Attr(name=time.data, dtype='int32', var=False, nullable=False, filters=FilterList(FilterList([ZstdFilter(level=7)])))"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=False
coords_filters=None

"NetCDFDimension(name=time, size=133) → tiledb.Dim(name=time, domain=(0, 132), dtype='uint64', tile=32)"
"NetCDFDimension(name=latitude, size=1801) → tiledb.Dim(name=latitude, domain=(0, 1800), dtype='uint64', tile=64)"
"NetCDFDimension(name=longitude, size=3600) → tiledb.Dim(name=longitude, domain=(0, 3599), dtype='uint64', tile=64)"

"NetCDFVariable(name=t2m, dtype=int16) → tiledb.Attr(name=t2m, dtype='int16', var=False, nullable=False, filters=FilterList(FilterList([ZstdFilter(level=7)])))"
"NetCDFVariable(name=sp, dtype=int16) → tiledb.Attr(name=sp, dtype='int16', var=False, nullable=False, filters=FilterList(FilterList([ZstdFilter(level=7)])))"
"NetCDFVariable(name=tp, dtype=int16) → tiledb.Attr(name=tp, dtype='int16', var=False, nullable=False, filters=FilterList(FilterList([ZstdFilter(level=7)])))"

cell_order=row-major
tile_order=row-major
capacity=100000
sparse=False
coords_filters=None
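
As a rough sanity check on the tile sizes chosen for the dense `data` array, the sketch below estimates the uncompressed size of one attribute tile and the number of tiles needed to cover the array. It uses only the dimension sizes, tile extents, and `int16` attribute type shown in the output above; the variable names are hypothetical, and the figures are before any compression filter is applied.

```python
import math

# Dimension sizes and tile extents, copied from the converter output.
dim_sizes = {"time": 133, "latitude": 1801, "longitude": 3600}
tile_extents = {"time": 32, "latitude": 64, "longitude": 64}
bytes_per_cell = 2  # each attribute (t2m, sp, tp) is int16

# Uncompressed size of a single tile for one attribute.
cells_per_tile = math.prod(tile_extents.values())
tile_bytes = cells_per_tile * bytes_per_cell

# Number of tiles along each dimension (the last tile may be partially filled).
tiles_per_dim = {
    name: math.ceil(dim_sizes[name] / tile_extents[name]) for name in dim_sizes
}
total_tiles = math.prod(tiles_per_dim.values())

print(f"cells per tile: {cells_per_tile}")
print(f"bytes per attribute tile: {tile_bytes}")
print(f"tiles per dimension: {tiles_per_dim}")
print(f"total tiles: {total_tiles}")
```

Under these assumptions each attribute tile holds 131072 cells (256 KiB uncompressed), and the array is covered by 5 x 29 x 57 tiles.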


In [11]:
# 3. Create the TileDB group and copy the data from NetCDF to TileDB.
group_uri = "era5_monthly_dense"
if tiledb.object_type(group_uri) is None:
    converter.convert_to_group(group_uri)
else:
    print(f"No group created. A {tiledb.object_type(group_uri)} already exists at '{group_uri}'.")

No group created. A group already exists at 'era5_monthly_dense'.


In [28]:
# Open every array in the new ERA-5 group with xarray and merge them into one dataset
era5_sample = xr.merge([
    xr.open_dataset(f"{group_uri}/{array_name}", engine="tiledb")
    for array_name in tiledb.cf.GroupSchema.load(group_uri)
])
era5_sample

### Example: ERA-5 NetCDF to a Sparse TileDB Array

This example shows converting a NetCDF file with sample ERA-5 data to a sparse TileDB array where the NetCDF coordinate variables are converted directly to TileDB dimensions. As in the above dense example, the conversion is completed in 3 steps.

1. Auto-generate a NetCDF-to-TileDB conversion schema from a NetCDF file.
    * For this example we use the `from_file` method with `coords_to_dims=True`. This creates a conversion schema that maps NetCDF coordinates to TileDB dimensions and NetCDF variables to TileDB attributes. As the converter output shows, attributes with the same underlying dimensions are collected into a single array.
2. View and update the TileDB Group schema with desired properties.
    * In this example we rename the TileDB array to a more meaningful name and set the dimension domains, tile sizes, and capacity.
3. Create the schema and copy the data from NetCDF to TileDB.
    * Creating the TileDB group and copying the data from NetCDF to TileDB can be done in a single step using the `convert_to_group` method, as in the dense example above, once we have defined our desired conversion schema.
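
The effect of converting coordinates to dimensions can be pictured with a small pure-Python sketch: each cell that was addressed by integer indices in the dense layout is instead addressed by the coordinate values at those indices. This is only a conceptual illustration with hypothetical toy values and a hypothetical `dense_to_sparse` helper; it is not how TileDB-CF performs the conversion internally.

```python
from itertools import product

# Toy coordinate variables (hypothetical values, far smaller than ERA-5).
time = [0, 1]                  # hours since 1900-01-01
latitude = [90.0, 89.75]
longitude = [0.0, 0.25, 0.5]

# Toy dense variable indexed as t2m[time_index][lat_index][lon_index].
t2m = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]


def dense_to_sparse(values, *coords):
    """Yield (coordinate tuple, value) pairs for every cell of a nested list."""
    index_ranges = [range(len(c)) for c in coords]
    for idx in product(*index_ranges):
        cell = values
        for i in idx:
            cell = cell[i]
        yield tuple(c[i] for c, i in zip(coords, idx)), cell


# Map (time, latitude, longitude) coordinate values to the stored value.
sparse_cells = dict(dense_to_sparse(t2m, time, latitude, longitude))
print(sparse_cells[(1, 89.75, 0.5)])
```

In the sparse layout a query is expressed directly in hours, degrees latitude, and degrees longitude rather than in integer positions.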

In [32]:
converter_sparse = NetCDF4ConverterEngine.from_file(input_file, coords_to_dims=True)
converter_sparse

"NetCDFVariable(name=longitude, dtype=float32) → SharedDim(name=longitude, domain=None, dtype='float32')"
"NetCDFVariable(name=latitude, dtype=float32) → SharedDim(name=latitude, domain=None, dtype='float32')"
"NetCDFVariable(name=time, dtype=int32) → SharedDim(name=time, domain=None, dtype='int32')"

"NetCDFVariable(name=time, dtype=int32) → tiledb.Dim(name=time, domain=None, dtype='int32', tile=None)"
"NetCDFVariable(name=latitude, dtype=float32) → tiledb.Dim(name=latitude, domain=None, dtype='float32', tile=None)"
"NetCDFVariable(name=longitude, dtype=float32) → tiledb.Dim(name=longitude, domain=None, dtype='float32', tile=None)"

"NetCDFVariable(name=t2m, dtype=int16) → tiledb.Attr(name=t2m, dtype='int16', var=False, nullable=False)"
"NetCDFVariable(name=sp, dtype=int16) → tiledb.Attr(name=sp, dtype='int16', var=False, nullable=False)"
"NetCDFVariable(name=tp, dtype=int16) → tiledb.Attr(name=tp, dtype='int16', var=False, nullable=False)"

cell_order=row-major
tile_order=row-major
capacity=0
sparse=True
allows_duplicates=False
coords_filters=None


## Setting the sparse array properties

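### Renaming the array

The converter auto-generates the name `array0` for the single sparse array. We rename it to `sample`.
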
### Setting the domain

Here we need to explicitly set the longitude, latitude, and time domains.

* Longitude: the inclusive domain is (-180.0, 180.0)
* Latitude: the inclusive domain is (-90.0, 90.0)
* Time: time is stored in hours since 1900-01-01 00:00:00.0. We assume no data from before 1900-01-01 will be stored, so we restrict the domain to (0, max_int32).
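
To illustrate the time encoding, here is a small standard-library sketch that converts between a calendar timestamp and whole hours since 1900-01-01, and checks that the result fits in the chosen domain. The helper names `to_hours_since_epoch` and `from_hours_since_epoch` are hypothetical and are not part of TileDB-CF or netCDF4.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1900, 1, 1)   # reference time used by the ERA-5 time variable
INT32_MAX = 2**31 - 1          # upper bound of the time domain


def to_hours_since_epoch(moment):
    """Return the whole number of hours between EPOCH and `moment`."""
    return (moment - EPOCH) // timedelta(hours=1)


def from_hours_since_epoch(hours):
    """Return the timestamp that lies `hours` hours after EPOCH."""
    return EPOCH + timedelta(hours=hours)


hours = to_hours_since_epoch(datetime(2020, 1, 1))
print(hours)                          # 1051896
print(from_hours_since_epoch(hours))  # 2020-01-01 00:00:00
print(0 <= hours <= INT32_MAX)        # True
```

A timestamp in 2020 is only about one million hours after the epoch, so the `int32` upper bound leaves ample room for future data.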

### Picking the tiles

As a rule of thumb, we aim for a data tile size of around 1-10 MB. Unlike in the dense example above, the data tile size in a sparse array is determined by the capacity rather than the tile extents. The tile extents instead help determine the layout of the data on disk. Here we use tiles of 5 degrees x 5 degrees in space and 200 hours (approx. 8 days) in time. This allows for efficient slicing over either larger spatial queries or larger temporal queries.
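
The rule of thumb can be checked with simple arithmetic. The sketch below estimates the uncompressed data tile size for each attribute and dimension, assuming a capacity of 500,000 cells (the value used in the code cell that follows) and the data types shown in the sparse converter output. The variable names are hypothetical, and the estimate ignores compression.

```python
capacity = 500_000  # number of cells per data tile in the sparse array

# Bytes per cell, from the data types in the converter output.
attr_bytes = {"t2m": 2, "sp": 2, "tp": 2}                # int16
dim_bytes = {"time": 4, "latitude": 4, "longitude": 4}   # int32, float32, float32

MB = 1_000_000

# Uncompressed size in MB of one data tile for each attribute and dimension.
attr_tile_mb = {name: capacity * size / MB for name, size in attr_bytes.items()}
dim_tile_mb = {name: capacity * size / MB for name, size in dim_bytes.items()}

print("attribute data tiles (MB):", attr_tile_mb)
print("dimension data tiles (MB):", dim_tile_mb)
```

With these assumptions each attribute tile is about 1 MB and each dimension tile about 2 MB, which sits at the lower end of the 1-10 MB target.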

### Setting Compression Filters

In [33]:
# Rename array0 to sample
converter_sparse.rename_array("array0", "sample")
# Set the domain for the sparse dimensions
converter_sparse.set_dim_properties("longitude", domain=(-180.0, 180.0))
converter_sparse.set_dim_properties("latitude", domain=(-90.0, 90.0))
converter_sparse.set_dim_properties("time", domain=(0, np.iinfo(np.dtype("int32")).max))
# Set the tile sizes for the sample array. Tiles are given in dimension order
# (time, latitude, longitude), as listed in the converter output above.
converter_sparse.set_array_properties("sample", tiles=(200, 5.0, 5.0), capacity=500_000)

In [None]:
sparse_uri = "era5_monthly_sparse"
converter_sparse