# Converting a simple NetCDF file to a TileDB array

## About this Example

### What it Shows

The purpose of this example is to show the basics of converting a NetCDF file to a TileDB array.

This includes:

1. Options for auto-generating a converter from a NetCDF file.
2. Changing the TileDB schema settings before conversion.
3. Creating the TileDB group and copying data from the NetCDF file to the TileDB array.

### Example dataset

This example shows converting a small NetCDF file with 2 dimensions and 4 variables:

* Dimensions:
    * x: size=100
    * y: size=100
* Variables:
    * x(x)
        * description: evenly spaced values from -5 to 5
        * data type: 64-bit floating point
    * y(y)
        * description: evenly spaced values from -5 to 5
        * data type: 64-bit floating point
    * A1(x, y)
        * description: x + y
        * data type: 64-bit floating point
    * A2(x, y)
        * description: sin((x/2)^2 + y^2)
        * data type: 64-bit floating point

### Set-up Requirements

This example requires the following Python packages to be installed: netCDF4, numpy, tiledb, tiledb-cf, and matplotlib.


In [None]:
import netCDF4
import numpy as np
import tiledb
from tiledb.cf import AttrMetadata, Group, GroupSchema, NetCDF4ConverterEngine
import matplotlib.pyplot as plt

In [None]:
# Set names for the output generated by the example.
output_dir = "output/convert"
netcdf_file = f"{output_dir}/simple1.nc"
group_uri = f"{output_dir}/simple_netcdf_to_group_1"
array_uri = f"{output_dir}/simple_netcdf_to_array_1"

In [None]:
# Reset output folder
import os
import shutil

shutil.rmtree(output_dir, ignore_errors=True)
# `makedirs` also creates the parent `output` folder if it is missing.
os.makedirs(output_dir)

## Create an example NetCDF file

The following cell creates the small NetCDF file used in this example.

In [None]:
x_data = np.linspace(-5.0, 5.0, 100)
y_data = np.linspace(-5.0, 5.0, 100)
xv, yv = np.meshgrid(x_data, y_data, sparse=True)
with netCDF4.Dataset(netcdf_file, mode="w") as dataset:
    dataset.setncatts({"title": "Simple dataset for examples"})
    dataset.createDimension("x", 100)
    dataset.createDimension("y", 100)
    A1 = dataset.createVariable("A1", np.float64, ("x", "y"))
    A1.setncattr("full_name", "Example matrix A1")
    A1.setncattr("description", "x + y")
    A1[:, :] = xv + yv
    A2 = dataset.createVariable("A2", np.float64, ("x", "y"))
    A2[:, :] = np.sin((xv / 2.0) ** 2 + yv ** 2)
    A2.setncattr("full_name", "Example matrix A2")
    A2.setncattr("description", "sin((x/2)^2 + y^2)")
    x1 = dataset.createVariable("x", np.float64, ("x",))
    x1[:] = x_data
    y = dataset.createVariable("y", np.float64, ("y",))
    y[:] = y_data
print(f"Created example NetCDF file `{netcdf_file}`.")


## Auto-Generating Converter from File

The functions `NetCDF4ConverterEngine.from_file` and `NetCDF4ConverterEngine.from_group` auto-generate a `NetCDF4ConverterEngine` for an existing NetCDF file. The properties in the `NetCDF4ConverterEngine` can be modified after the converter is generated.

Parameters:

* Set the location of the NetCDF group to be converted.

    * In `from_file`: 
        * `input_file`: The input NetCDF file to generate the converter engine from.
        * `group_path`: The path to the NetCDF group to copy data from. Use `'/'` for the root group.
    * In `from_group`:
        * `input_netcdf_group`: The NetCDF group to generate the converter engine from. (Must be a `netCDF4.Dataset` or `netCDF4.Group`.)

* Set the array grouping. Each NetCDF variable maps to a TileDB attribute. The `collect_attrs` parameter determines whether each NetCDF variable is stored in a separate array or all NetCDF variables with the same underlying dimensions are stored in the same TileDB array. Scalar variables are always grouped together.

    * `collect_attrs`: If `True`, store all attributes with the same dimensions in the same array. Otherwise, store each attribute in a separate array.

* Set default properties for TileDB dimensions.

    * `unlimited_dim_size`: The default size of the domain for TileDB dimensions created from unlimited NetCDF dimensions. If `None`, the current size of the NetCDF dimension will be used.
    * `dim_dtype`: The default numpy dtype to use when converting a NetCDF dimension to a TileDB dimension.

* Set tile sizes for TileDB dimensions. Dimensions in different arrays of the TileDB group may share the same name, domain, and type, but have different tiles and compression filters. The `tiles_by_var` and `tiles_by_dims` parameters provide a way to set the tiles for the dimensions of different arrays. The `tiles_by_var` parameter is a mapping from a variable name to the tiles for the dimensions of the array that variable is stored in. The `tiles_by_dims` parameter is a mapping from the names of the dimensions of an array to the tiles for those dimensions. If using `collect_attrs=True`, then `tiles_by_dims` will overwrite `tiles_by_var`. If using `collect_attrs=False`, then `tiles_by_var` will overwrite `tiles_by_dims`.

    * `tiles_by_var`: A map from the name of a single NetCDF variable to the tiles of the dimensions of the variable in the generated TileDB array.
    * `tiles_by_dims`: A map from the name of NetCDF dimensions defining a variable to the tiles of those dimensions in the generated TileDB array.

* Convert NetCDF coordinate variables (1D variables that share their name with their dimension) to TileDB dimensions instead of TileDB attributes.

    * `coords_to_dims`: If `True`, convert the NetCDF coordinate variable into a TileDB dimension for sparse arrays. Otherwise, convert the coordinate dimension into a TileDB dimension and the coordinate variable into a TileDB attribute.
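
The grouping rule controlled by `collect_attrs` can be sketched in plain Python. This is only an illustrative model of the behavior described above, not the tiledb-cf implementation: the helper `group_variables` is hypothetical, and it does not reproduce the array or attribute names that the converter generates.

```python
# NetCDF variables from the example file, mapped to the names of their dimensions.
variables = {
    "A1": ("x", "y"),
    "A2": ("x", "y"),
    "x": ("x",),
    "y": ("y",),
}


def group_variables(variables, collect_attrs):
    """Return a list of (dimension names, variable names) pairs, one per array.

    Hypothetical helper that models the documented grouping rule only.
    """
    arrays = {}
    for name, dims in variables.items():
        # Scalars (no dimensions) always share one array. Other variables
        # share an array only when `collect_attrs` is True.
        shared = collect_attrs or len(dims) == 0
        key = ("dims", dims) if shared else ("var", name)
        arrays.setdefault(key, (dims, []))[1].append(name)
    return list(arrays.values())


for flag in (True, False):
    print(f"collect_attrs={flag}")
    for dims, names in group_variables(variables, flag):
        print(f"  dims={dims} variables={names}")
```

With `collect_attrs=True` the model places `A1` and `A2` in one array and leaves `x` and `y` in their own 1D arrays; with `collect_attrs=False` every variable gets its own array.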

### Examples: Collecting attributes and setting tiles

Below are some examples of how the parameters `collect_attrs`, `tiles_by_var`, and `tiles_by_dims` interact.

In [None]:
# Try changing the parameters `collect_attrs`, `tiles_by_dims`, and `tiles_by_var`
# and see how this affects the tile size for all dimensions.
def test_setting_tiles(input_file, **kwargs):
    converter = NetCDF4ConverterEngine.from_file(input_file, **kwargs)
    print(f"Keyword arguments: {kwargs}")
    print("Generated TileDB Arrays:")
    for array_creator in converter.array_creators():
        print(f"  * {array_creator.name}({', '.join(dim_creator.name for dim_creator in array_creator.domain_creator)})")
        print(f"      - attributes: {', '.join(attr_creator.name for attr_creator in array_creator)}")
        print(f"      - tiles: {array_creator.domain_creator.tiles}")

In [None]:
# 1. `collect_attrs=True`
#    * `A1` and `A2` are in the same array.
#    * `tile=None` for all dimensions.
test_setting_tiles(netcdf_file, collect_attrs=True)

In [None]:
# 2. `collect_attrs=True`, `tiles_by_dims={(x,y): (10, 20)}`
#     * `A1` and `A2` are in the same array.
#     * Only array with dimensions `(x,y)` has tiles set.
test_setting_tiles(netcdf_file, collect_attrs=True, tiles_by_dims={("x", "y"): (10, 20)})

In [None]:
# 3. `collect_attrs=True`, `tiles_by_var={'A1': (50, 50)}`
#    * `A1` and `A2` are in the same array.
#    * Only array with variable `A1` has tiles set.
test_setting_tiles(netcdf_file, collect_attrs=True, tiles_by_var={'A1': (50, 50)})

In [None]:
# 4. `collect_attrs=True`, `tiles_by_dims={(x,y): (10, 20)}`, `tiles_by_var={'A1': (50, 50)}`
#     * `A1` and `A2` are in the same array.
#     * Only array with dimensions `(x,y)` has tiles set. `tiles_by_dims` took priority over `tiles_by_var`.
test_setting_tiles(netcdf_file, collect_attrs=True, tiles_by_var={'A1': (50, 50)}, tiles_by_dims={("x", "y"): (10, 20)})

In [None]:
# 5. `collect_attrs=False`
#     * `A1` and `A2` are in separate arrays.
#     * `tile=None` for all dimensions.
test_setting_tiles(netcdf_file, collect_attrs=False)

In [None]:
# 6. `collect_attrs=False`, `tiles_by_dims={(x,y): (10, 20)}`
#     * `A1` and `A2` are in separate arrays.
#     * Only arrays with dimensions `(x,y)` have tiles set.
test_setting_tiles(netcdf_file, collect_attrs=False, tiles_by_dims={("x", "y"): (10, 20)})

In [None]:
# 7. `collect_attrs=False`, `tiles_by_var={'A1': (50, 50)}`
#     * `A1` and `A2` are in separate arrays.
#     * Only array with variable `A1` has tiles set.
test_setting_tiles(netcdf_file, collect_attrs=False, tiles_by_var={'A1': (50, 50)})

In [None]:
# 8. `collect_attrs=False`, `tiles_by_dims={(x,y): (10, 20)}`, `tiles_by_var={'A1': (50, 50)}`
#     * `A1` and `A2` are in separate arrays.
#     * The array with `A2` has tiles set by `tiles_by_dims`.
#     * The array with `A1` has tiles set. `tiles_by_var` took priority over `tiles_by_dims`. 
test_setting_tiles(netcdf_file, collect_attrs=False, tiles_by_var={'A1': (50, 50)}, tiles_by_dims={("x", "y"): (10, 20)})
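
The precedence seen in the eight cases above can be summarized with a small pure-Python model. It is a sketch of the documented rule under simplifying assumptions, not the library's code: the function `resolve_tiles` and its argument layout are hypothetical.

```python
def resolve_tiles(dims, variables, collect_attrs, tiles_by_var=None, tiles_by_dims=None):
    """Return the tiles for one array, or None if no tiles were requested.

    Hypothetical helper:
      * dims: names of the dimensions of the array.
      * variables: names of the NetCDF variables stored in the array.
    """
    tiles_by_var = tiles_by_var or {}
    tiles_by_dims = tiles_by_dims or {}
    from_dims = tiles_by_dims.get(tuple(dims))
    # Use the first variable in the array that has an entry in `tiles_by_var`.
    from_var = next((tiles_by_var[v] for v in variables if v in tiles_by_var), None)
    if collect_attrs:
        # Arrays are defined by their dimensions, so `tiles_by_dims` wins.
        return from_dims if from_dims is not None else from_var
    # Arrays are defined by their variable, so `tiles_by_var` wins.
    return from_var if from_var is not None else from_dims


both = {"tiles_by_var": {"A1": (50, 50)}, "tiles_by_dims": {("x", "y"): (10, 20)}}

# Case 4: one shared array for A1 and A2.
print("case 4:", resolve_tiles(("x", "y"), ["A1", "A2"], True, **both))
# Case 8: separate arrays for A1 and A2.
print("case 8, A1:", resolve_tiles(("x", "y"), ["A1"], False, **both))
print("case 8, A2:", resolve_tiles(("x", "y"), ["A2"], False, **both))
# The 1D coordinate arrays match neither mapping, so no tiles are set.
print("x array:", resolve_tiles(("x",), ["x"], True, **both))
```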

## Convert `simple1.nc` to a TileDB Group

In this example, we create a NetCDF4ConverterEngine from the NetCDF file and manually change properties of the TileDB arrays before conversion. NetCDF dimensions are mapped to TileDB dimensions, and NetCDF variables are mapped to TileDB attributes.

In [None]:
converter = NetCDF4ConverterEngine.from_file(netcdf_file, collect_attrs=True, dim_dtype=np.uint32)
converter

In [None]:
# Update properties manually by modifying the array creators
# 1. Update properties for x 
x_array = converter.get_array_creator_by_attr("x.data")
x_array.name = "x"
x_array.domain_creator.tiles = (20,)
x_array.attr_creator("x.data").filters = tiledb.FilterList([tiledb.ZstdFilter()])
# 2. Update properties for y
y_array = converter.get_array_creator_by_attr("y.data")
y_array.name = "y"
y_array.domain_creator.tiles = (20,)
y_array.attr_creator("y.data").filters = tiledb.FilterList([tiledb.ZstdFilter()])
# 3. Update properties for array of matrices
data_array = converter.get_array_creator_by_attr("A1")
data_array.name = "data"
data_array.domain_creator.tiles = (20, 20)
for attr_creator in data_array:
    attr_creator.filters = tiledb.FilterList([tiledb.ZstdFilter()])
converter
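
As a rough check on the tile extents chosen above, the arithmetic below counts the tiles in the `data` array and estimates the uncompressed size of one tile per attribute. It assumes each TileDB dimension spans exactly the 100 indices of the corresponding NetCDF dimension; the variable names are illustrative only.

```python
import math

dim_sizes = {"x": 100, "y": 100}   # NetCDF dimension sizes from the example
tile_extents = {"x": 20, "y": 20}  # tile extents set on the `data` array
bytes_per_cell = 8                 # 64-bit floating point values

# Number of tiles along each dimension, rounding up for partial tiles.
tiles_per_dim = {
    name: math.ceil(dim_sizes[name] / tile_extents[name]) for name in dim_sizes
}
num_tiles = math.prod(tiles_per_dim.values())
cells_per_tile = math.prod(tile_extents.values())
uncompressed_tile_bytes = cells_per_tile * bytes_per_cell

print(f"tiles per dimension: {tiles_per_dim}")
print(f"total tiles: {num_tiles}")
print(f"cells per tile: {cells_per_tile}")
print(f"uncompressed bytes per tile and attribute: {uncompressed_tile_bytes}")
```

Under these assumptions the array is split into a 5 by 5 grid of tiles with 400 cells each. The compression filters are applied to tiles, so the on-disk size is expected to be smaller than the uncompressed estimate.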

Run the conversion to create three dense TileDB arrays (`x`, `y`, and `data`):

In [None]:
converter.convert_to_group(group_uri)

### Examine the TileDB group schema

In [None]:
group_schema = GroupSchema.load(group_uri)
group_schema

### Examine the data in the arrays

Open the attributes from the generated TileDB group:

In [None]:
with Group(group_uri, attr="x.data") as group:
    with (
        group.open_array(attr="x.data") as x_array,
        group.open_array(attr="y.data") as y_array,
        group.open_array(array="data") as data_array,
    ):
        x = x_array[:]
        y = y_array[:]
        data = data_array[...]
        A1 = data["A1"]
        A2 = data["A2"]
        a1_description = AttrMetadata(data_array.meta, "A1")["description"]
        a2_description = AttrMetadata(data_array.meta, "A2")["description"]

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].contourf(x, y, A1);
axes[0].set_title(a1_description);
axes[1].contourf(x, y, A2);
axes[1].set_title(a2_description);

## Convert `simple1.nc` to a Single Sparse TileDB Array

In this example, we create a `NetCDF4ConverterEngine` from the NetCDF file and manually change properties of the TileDB array before conversion. Here we use `coords_to_dims` to convert the `x` and `y` variables to TileDB dimensions in a sparse array.

In [None]:
converter2 = NetCDF4ConverterEngine.from_file(netcdf_file, coords_to_dims=True, dim_dtype=np.uint32)
converter2

In [None]:
# Update properties for the array
converter2.get_shared_dim('x').domain = (-5.0, 5.0)
converter2.get_shared_dim('y').domain = (-5.0, 5.0)
data_array = converter2.get_array_creator("array0")
data_array.domain_creator.tiles = (1.0, 1.0)
data_array.capacity = 400
for attr_creator in data_array:
    attr_creator.filters = tiledb.FilterList([tiledb.ZstdFilter()])
converter2

In [None]:
converter2.convert_to_array(array_uri)
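
Conceptually, a sparse array built with `coords_to_dims=True` addresses each value by its coordinate values rather than by its integer index. The pure-Python sketch below illustrates that idea on a tiny 3-point grid; it does not use tiledb-cf, and the `linspace` helper and `sparse_cells` layout are hypothetical stand-ins for illustration.

```python
def linspace(start, stop, num):
    """Return `num` evenly spaced values from `start` to `stop`, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Coordinate values over the same range as the example file, on a smaller grid.
x_values = linspace(-5.0, 5.0, 3)
y_values = linspace(-5.0, 5.0, 3)

# Dense layout: values are addressed by the integer index pair (i, j).
dense = [[x + y for y in y_values] for x in x_values]

# Sparse layout: one cell per value, addressed by the coordinate values (x, y).
sparse_cells = [
    {"x": x_values[i], "y": y_values[j], "A1": dense[i][j]}
    for i in range(len(x_values))
    for j in range(len(y_values))
]

for cell in sparse_cells:
    print(cell)
```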