## Example: Preparing a data catalog

This example illustrates how to prepare your own HydroMT [DataCatalog](https://deltares.github.io/hydromt/latest/_generated/hydromt.data_catalog.DataCatalog.html) to reference your own data sources and start using them within HydroMT; see the [user guide](https://deltares.github.io/hydromt/latest/user_guide/data_prepare_cat.html) for details.

In [None]:
# import hydromt and setup logging
import os
from pprint import pprint
import hydromt
from hydromt.log import setuplog

logger = setuplog("prepare data catalog", log_level=10)

The steps to use your own data within HydroMT are in brief:

  1) **Have your (local) dataset ready** in one of the supported [raster](https://deltares.github.io/hydromt/latest/user_guide/data_types.html#raster-formats) (tif, ascii, netcdf, zarr...), 
   [vector](https://deltares.github.io/hydromt/latest/user_guide/data_types.html#vector-formats) (shp, geojson, gpkg...) or [geospatial time-series](https://deltares.github.io/hydromt/latest/user_guide/data_types.html#geo-formats) (netcdf, csv...) format.
  2) **Create your own [yaml file](https://deltares.github.io/hydromt/latest/user_guide/data_prepare_cat.html#data-yaml)** with a reference to your prepared dataset and properties (path, data_type, driver, etc.) following the HydroMT [data conventions](https://deltares.github.io/hydromt/latest/user_guide/data_conventions.html#data-convention). For this step, you can also start from an existing pre-defined catalog or use it for inspiration.

The existing pre-defined catalogs are:

In [None]:
# this downloads the artifact_data archive v0.0.6
data_catalog = hydromt.DataCatalog(data_libs=["artifact_data=v0.0.6"])
pprint(data_catalog.predefined_catalogs)

In this notebook, we will see how to create a data catalog for several types of input data. For this we have prepared several datasets that we will catalogue. Let's see which data we have available:

In [None]:
# the artifact data is stored in the following location
root = os.path.join(data_catalog._cache_dir, "artifact_data", "v0.0.6")
# let's print some of the files that are there
for item in os.listdir(root)[-10:]:
    print(item)

### RasterDataset from raster file

The first file we will use is a 'simple' raster file in tif format: **vito.tif**. This file contains a landuse classification raster. Before adding a new file to a data catalog, we first need to know what it contains, mainly:

  - **location of the file**: `path`.
  - **type of data**: `data_type`. `RasterDataset` for gridded data, `GeoDataFrame` for vector data, `GeoDataset` for point timeseries and `DataFrame` for tabular data.
  - **file format**: `driver`. The file format determines the driver, i.e. the python function that will be used to open the data. One of `raster`, `raster_tindex`, `netcdf`, `zarr`, `vector`, `vector_table`.
  - **crs**: `crs`. Coordinate system of the data. Optional, as it is usually encoded in the data itself.
  - **variables and their properties**: `rename`, `unit_mult`, `unit_add`. The names and units of the variables in the input data, so that we can convert them to the [HydroMT data conventions](https://deltares.github.io/hydromt/latest/user_guide/data_conventions.html).
  
There are more arguments or properties to look for, which are explained in more detail in the [documentation](https://deltares.github.io/hydromt/latest/user_guide/data_prepare_cat.html). To explore our data we can either use GIS tools like QGIS or GDAL, or simply try to open the data directly in python.
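
To make the `rename`, `unit_mult` and `unit_add` properties more concrete, the sketch below mimics their effect on a plain Python dictionary of variables. It is an illustration only, not HydroMT's implementation: the helper `apply_conventions` and the sample values are hypothetical, and it assumes that the conversion is applied as `value * unit_mult + unit_add`, keyed on the variable name after renaming.

```python
def apply_conventions(variables, rename=None, unit_mult=None, unit_add=None):
    """Rename variables, then rescale them as value * unit_mult + unit_add.

    Hypothetical helper for illustration; the conversion keys are assumed
    to refer to the variable names *after* renaming.
    """
    rename = rename or {}
    unit_mult = unit_mult or {}
    unit_add = unit_add or {}
    converted = {}
    for source_name, values in variables.items():
        name = rename.get(source_name, source_name)
        mult = unit_mult.get(name, 1.0)  # default: no scaling
        add = unit_add.get(name, 0.0)  # default: no offset
        converted[name] = [value * mult + add for value in values]
    return converted


# Made-up sample: temperature in Kelvin and precipitation in metres.
raw = {"t2m": [273.15, 283.15], "tp": [0.001, 0.0]}
converted = apply_conventions(
    raw,
    rename={"t2m": "temp", "tp": "precip"},
    unit_mult={"precip": 1000},  # m -> mm
    unit_add={"temp": -273.15},  # K -> degrees Celsius
)
print(converted)
```

In a catalog yaml file the same information would be written as `rename`, `unit_mult` and `unit_add` entries under the data source.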

Let's open our vito.tif file with xarray and rioxarray:

In [None]:
import xarray as xr
import rioxarray

da = xr.open_dataarray(os.path.join(root, "vito.tif"))
pprint(da)
print(f"CRS: {da.raster.crs}")
da.plot()

We can see that this is a simple raster with landuse data in EPSG:4326. Let's translate what we know into a data catalog.

In [None]:
yml_str = f"""meta:
  root: {root}
  
vito:
    path: vito.tif
    data_type: RasterDataset
    driver: raster
    crs: 4326
    meta:
        category: landuse
"""
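# make sure the output folder exists before writing the yaml file
os.makedirs("tmpdir", exist_ok=True)
data_lib = "tmpdir/vito.yml"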
with open(data_lib, mode="w") as f:
    f.write(yml_str)

Now let's check that HydroMT can properly read the file from the data catalog we prepared:

In [None]:
data_catalog = hydromt.DataCatalog(data_libs=[data_lib], logger=logger)
da = data_catalog.get_rasterdataset("vito")
da

### RasterDataset from several raster files

In [None]:
folder_name = os.path.join(root, "merit_hydro")
# let's see which files are there
for path, _, files in os.walk(folder_name):
    print(path)
    for name in files:
        print(f" - {name}")

In [None]:
yml_str = f"""meta:
  root: {root}

merit_hydro:
  data_type: RasterDataset
  driver: raster
  kwargs:
    chunks:
      x: 6000
      y: 6000
  meta:
    category: topography
  rename:
    hnd: height_above_nearest_drain
  path: merit_hydro/*.tif
"""
# write the data catalog to a new yaml file
data_lib = "tmpdir/merit_hydro.yml"
with open(data_lib, mode="w") as f:
    f.write(yml_str)

In [None]:
data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog
print(data_catalog.sources.keys())
ds = data_catalog.get_rasterdataset("merit_hydro")
ds

merit_hydro: rename vars + {variable} placeholder
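
As a rough illustration of the `{variable}` placeholder idea, the sketch below expands a path template into a glob pattern and recovers the variable name from each matching file name. It runs on empty dummy files in a temporary folder; the helper `find_variable_files` and the file names are hypothetical, and this is not how HydroMT itself resolves paths.

```python
import glob
import os
import re
import tempfile


def find_variable_files(root, template):
    """Map variable names to files matching a template with one '{variable}'.

    Hypothetical helper for illustration only.
    """
    full = os.path.join(root, template)
    prefix, suffix = full.split("{variable}")
    # Regex that captures whatever stands in place of the placeholder.
    pattern = re.compile(
        re.escape(prefix) + r"(?P<variable>.+)" + re.escape(suffix) + r"$"
    )
    matches = {}
    # Replace the placeholder by a wildcard to list candidate files.
    for path in sorted(glob.glob(glob.escape(prefix) + "*" + glob.escape(suffix))):
        found = pattern.match(path)
        if found:
            matches[found.group("variable")] = path
    return matches


with tempfile.TemporaryDirectory() as tmp:
    # Create empty dummy files that stand in for one raster per variable.
    os.makedirs(os.path.join(tmp, "merit_hydro"))
    for name in ("elevtn", "hnd", "flwdir"):
        open(os.path.join(tmp, "merit_hydro", f"{name}.tif"), "w").close()
    files = find_variable_files(tmp, os.path.join("merit_hydro", "{variable}.tif"))

print(sorted(files))  # ['elevtn', 'flwdir', 'hnd']
```

Compared with a plain `*.tif` wildcard, a named placeholder makes explicit that each file holds a single variable whose name is taken from the file name.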

### RasterDataset from a netcdf file

climate.nc: rename vars + unit_mult unit_add

### GeoDataFrame from a vector file

rivers_lin: rename vars

### GeoDataset from a netcdf file

gtsm: just open

### GeoDataset from vector files

stations.csv and stations_data.csv - use the reading point timeseries example: just open