## Example: Reading raster data

This example illustrates how to read raster data using the HydroMT [DataCatalog](https://deltares.github.io/hydromt/latest/_generated/hydromt.data_catalog.DataCatalog.html) with the `raster`, `netcdf` and `raster_tindex` drivers.

In [None]:
# import hydromt and setup logging
import hydromt
from hydromt.log import setuplog

logger = setuplog("read raster data", log_level=10)

In [None]:
# Download artifacts for the Piave basin to `~/.hydromt_data/`.
data_catalog = hydromt.DataCatalog(logger=logger, data_libs=["artifact_data=v0.0.8"])

## Raster driver

To read raster data and parse it into an [xarray Dataset or DataArray](https://xarray.pydata.org/en/stable/user-guide/data-structures.html) we use the [open_mfraster()](https://deltares.github.io/hydromt/latest/_generated/hydromt.io.open_mfraster.html) method. All `driver_kwargs` in the data catalog yaml file will be passed to this method. The `raster` driver supports all [GDAL data formats](http://www.gdal.org/formats_list.html), including the often used GeoTiff or Cloud Optimized GeoTiff (COG) formats. Tiled datasets can also be passed as a [virtual raster tileset (vrt) file](https://gdal.org/drivers/raster/vrt.html).

As an example we will use the [MERIT Hydro](http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_Hydro) dataset, which is a set of GeoTiff files with identical grids, one for each variable of the dataset.

In [None]:
# inspect data source entry in data catalog yaml file
data_catalog["merit_hydro"]

We can load any RasterDataset using [DataCatalog.get_rasterdataset()](https://deltares.github.io/hydromt/latest/_generated/hydromt.data_catalog.DataCatalog.get_rasterdataset.html). Note that if we don't provide any arguments, it returns the full dataset with nine data variables for the full spatial domain. Only the data coordinates are actually read; the data variables remain lazy [Dask arrays](https://docs.dask.org/en/stable/array.html).

In [None]:
ds = data_catalog.get_rasterdataset("merit_hydro")
ds

The data can be visualized with the [.plot()](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.plot.html) xarray method. We replace all nodata values with NaNs with [.raster.mask_nodata()](https://deltares.github.io/hydromt/latest/_generated/hydromt.DataArray.raster.mask_nodata.html).

In [None]:
ds["elevtn"].raster.mask_nodata().plot(cmap="terrain")
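The idea behind nodata masking can be sketched in plain Python. This is only an illustration of the concept on a small nested list, not how HydroMT or xarray implement it; the helper name `mask_nodata_sketch` and the nodata value `-9999` are assumptions made for the example.

```python
import math


def mask_nodata_sketch(values, nodata):
    """Return a copy of a 2D grid in which every nodata cell is replaced by NaN."""
    return [
        [math.nan if cell == nodata else float(cell) for cell in row]
        for row in values
    ]


# Hypothetical 2 x 3 elevation grid in which -9999 marks missing cells
grid = [
    [12, -9999, 15],
    [-9999, 20, 22],
]

masked = mask_nodata_sketch(grid, nodata=-9999)
print(masked)
```

Plotting libraries generally skip NaN cells, which is why masking before plotting keeps the nodata value from distorting the color scale.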

We can request a (spatial) subset of the data by providing additional `variables` and `bbox` / `geom` arguments. Note that this returns a smaller spatial extent and only the requested variables. The `variables` argument is especially useful if each variable of the dataset is saved in a separate file and the `{variable}` key is used in the path argument of the data source (see above), because it limits which files are actually read. If a single variable is requested, a DataArray instead of a Dataset is returned unless the `single_var_as_array` argument is set to False (True by default).

In [None]:
bbox = [11.70, 45.35, 12.95, 46.70]
ds = data_catalog.get_rasterdataset(
    "merit_hydro", bbox=bbox, variables=["elevtn"], single_var_as_array=True
)
ds
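To make the role of the `{variable}` key concrete, the sketch below fills a path template for only the requested variables. It is a plain-Python illustration of the idea, not HydroMT's internal implementation; the template string and the helper `expand_variable_paths` are hypothetical.

```python
def expand_variable_paths(path_template, variables):
    """Return one path per requested variable by filling the {variable} key."""
    return [path_template.format(variable=name) for name in variables]


# Hypothetical template, modelled on a one-GeoTiff-per-variable layout
template = "merit_hydro/{variable}.tif"

# Only the requested variable produces a path, so the other files are never opened
paths = expand_variable_paths(template, ["elevtn"])
print(paths)
```

With a single requested variable the list contains a single file, which is the reason a narrow `variables` selection reduces the amount of data touched.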

As mentioned earlier, all `driver_kwargs` in the data entry of the catalog yaml file will be passed to the [open_mfraster()](https://deltares.github.io/hydromt/latest/_generated/hydromt.io.open_mfraster.html) method. Here we show how these arguments can be used to concatenate multiple raster files along a new dimension.

In [None]:
# update data source entry (this is normally done manually before initializing the data catalog!)
data_catalog["merit_hydro"].driver_kwargs.update(concat_dim="variable", concat=True)
data_catalog["merit_hydro"]

In [None]:
# this returns a DataArray (single variable) with a new 'variable' dimension
da = data_catalog.get_rasterdataset("merit_hydro")
da

TIP: To write a dataset back to a stack of rasters in a single folder, use the [.raster.to_mapstack()](https://deltares.github.io/hydromt/latest/_generated/hydromt.Dataset.raster.to_mapstack.html) method.

## Netcdf driver

Many gridded datasets with a third dimension (e.g. time) are saved in netcdf or zarr files, which can be read with the `netcdf` and `zarr` drivers respectively. This data is read using the [xarray.open_mfdataset()](https://docs.xarray.dev/en/latest/generated/xarray.open_mfdataset.html) method. Because these formats are flexible, HydroMT is not always able to read geospatial attributes such as the CRS from the data; in that case they have to be set through the data catalog [yaml file](https://deltares.github.io/hydromt/latest/user_guide/data_prepare_cat.html).

If the data is stored per year or month, the `{year}` and `{month}` keys can be used in the path argument of a data source in the data catalog yaml file. This speeds up reading a temporal subset of the data with the `time_tuple` argument of [DataCatalog.get_rasterdataset()](https://deltares.github.io/hydromt/latest/_generated/hydromt.data_catalog.DataCatalog.get_rasterdataset.html), because only the files covering the requested period need to be opened (not in this example).
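The sketch below shows why `{year}` and `{month}` keys help: for a requested period, only the matching monthly files have to be listed and opened. It is a plain-Python illustration under stated assumptions, not HydroMT's implementation; the path template, the zero-padded month format and the helper `expand_monthly_paths` are hypothetical.

```python
from datetime import date


def expand_monthly_paths(path_template, start, end):
    """Return one path per calendar month between start and end (inclusive)."""
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        paths.append(path_template.format(year=year, month=month))
        # Advance to the next month, rolling over into the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


# Hypothetical layout with one file per month; the zero padding is an assumption
template = "era5/{year}/era5_{year}_{month:02d}.nc"

paths = expand_monthly_paths(template, date(2010, 11, 15), date(2011, 1, 3))
for path in paths:
    print(path)
```

A period of roughly seven weeks resolves to three monthly files here, regardless of how many years the full archive spans.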

As an example we use the [ERA5](https://doi.org/10.24381/cds.bd0915c6) dataset.

In [None]:
# Note the crs argument as this is missing in the original data
data_catalog["era5_hourly"]

In [None]:
# Note that some units are converted
ds = data_catalog.get_rasterdataset("era5_hourly")
ds

## Raster_tindex driver

If the raster data is tiled but each tile uses a different CRS (for instance a different UTM projection for each UTM zone), the dataset cannot be described using a VRT file. In this case a vector file that serves as a raster tile index can be built using [gdaltindex](https://gdal.org/programs/gdaltindex.html) and read using [open_raster_from_tindex()](https://deltares.github.io/hydromt/latest/_generated/hydromt.io.open_raster_from_tindex.html). To read the data into a single `xarray.Dataset`, the data needs to be reprojected and mosaicked to a single CRS while reading. As this type of data cannot be loaded lazily, the method is typically used with an area of interest for which the data is loaded and combined.
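The selection step that a tile index enables can be sketched as follows: given an area of interest, keep only the tiles whose footprint intersects it. This is a conceptual plain-Python sketch, not the logic of `open_raster_from_tindex()`; the tile records, their bounds and the helpers `intersects` and `select_tiles` are hypothetical, and the footprints are assumed to be expressed in one common CRS. Reprojection and mosaicking of the selected tiles are not covered.

```python
def intersects(a, b):
    """Return True if two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select_tiles(tile_index, aoi):
    """Return the raster paths of all tiles whose footprint intersects the area of interest."""
    return [tile["location"] for tile in tile_index if intersects(tile["bounds"], aoi)]


# Hypothetical tile index: one footprint and one raster path per tile
tile_index = [
    {"location": "tiles/a.tif", "bounds": (10.0, 44.0, 12.0, 46.0)},
    {"location": "tiles/b.tif", "bounds": (12.0, 44.0, 14.0, 46.0)},
    {"location": "tiles/c.tif", "bounds": (14.0, 44.0, 16.0, 46.0)},
]

aoi = (11.70, 45.35, 12.95, 46.70)
selected = select_tiles(tile_index, aoi)
print(selected)
```

Only the selected tiles would then need to be read and combined, which keeps the non-lazy loading step limited to the area of interest.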

As an example we use the [GRWL mask](https://doi.org/10.5281/zenodo.1297434) raster tiles, for which we have created a tile index using the aforementioned *gdaltindex* command line tool. Note that the path points to the GeoPackage output of the *gdaltindex* tool.

In [None]:
data_catalog["grwl_mask"]

In [None]:
# the tileindex is a GeoPackage vector file
# with an attribute column 'location' (see also the tileindex argument under driver_kwargs) containing the (relative) paths to the raster file data
import geopandas as gpd

fn_tindex = data_catalog["grwl_mask"].path
print(fn_tindex)
gpd.read_file(fn_tindex, rows=5)

In [None]:
# this returns a DataArray (single variable) with a mosaic of several files (see source_file attribute)
ds = data_catalog.get_rasterdataset("grwl_mask", bbox=bbox)
ds