In [1]:
import valenspy as vp
import xarray as xr
from pathlib import Path

import dask

## The Input Manager
The input manager aims to make accessing shared standard datasets easy.
Because the available data, data paths and variables are HPC specific, the input manager only works on the HPC systems listed in dataset_PATHS.yml. To add a dataset or a new machine, add your machine name, datasets and paths to the dataset_PATHS.yml file.
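As a rough illustration of the machine, dataset and path lookup described above, the sketch below uses a plain dictionary as a stand-in for the parsed dataset_PATHS.yml. The nesting (machine, then dataset, then path), the placeholder paths and the helper name ``dataset_root`` are all assumptions made for this example; the actual file layout in ValEnsPy may differ.

```python
# Hypothetical stand-in for the parsed contents of dataset_PATHS.yml.
# The real file layout may differ; all paths below are placeholders.
DATASET_PATHS = {
    "hortense": {
        "ERA5": "/path/on/hortense/to/era5",
        "EOBS": "/path/on/hortense/to/EOBS",
    },
    "my_machine": {
        "ERA5": "/data/era5",
    },
}


def dataset_root(machine: str, dataset: str) -> str:
    """Return the root directory of `dataset` on `machine`, or raise a clear error."""
    try:
        datasets = DATASET_PATHS[machine]
    except KeyError:
        raise ValueError(f"Machine {machine!r} is not listed") from None
    try:
        return datasets[dataset]
    except KeyError:
        raise ValueError(f"Dataset {dataset!r} is not available on {machine!r}") from None


print(dataset_root("my_machine", "ERA5"))  # prints /data/era5
```

The point of the two separate error messages is that an unknown machine and a dataset missing on a known machine call for different fixes in the configuration file.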

In [2]:
manager = vp.InputManager(machine='hortense')

### Usage

For everyday use, a dataset can be accessed through the ``load_data`` function. Variables are requested by their CORDEX variable name.

You do not need to know what the variable is called in the original dataset. For example, in ERA5 the 2 m temperature is called 't2m', but here it is accessed through the CORDEX variable name 'tas'. The original name is added to the attributes of the variable for reference.
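The renaming idea can be sketched with plain dictionaries. This is only an illustration, not ValEnsPy's implementation: the lookup table ``CORDEX_TO_RAW``, the helper ``to_cordex`` and the attribute key ``original_name`` are hypothetical names chosen for this example.

```python
# Hypothetical lookup from CORDEX names to the raw names used in one dataset.
# The raw names are illustrative assumptions, not ValEnsPy's actual table.
CORDEX_TO_RAW = {"tas": "t2m", "pr": "tp"}


def to_cordex(raw_vars: dict) -> dict:
    """Rename raw variables to CORDEX names and keep the raw name as an attribute."""
    raw_to_cordex = {raw: cordex for cordex, raw in CORDEX_TO_RAW.items()}
    converted = {}
    for raw_name, var in raw_vars.items():
        # Names without a known CORDEX equivalent pass through unchanged.
        cordex_name = raw_to_cordex.get(raw_name, raw_name)
        attrs = dict(var.get("attrs", {}))
        if cordex_name != raw_name:
            attrs["original_name"] = raw_name  # hypothetical attribute key
        converted[cordex_name] = {**var, "attrs": attrs}
    return converted


raw = {"t2m": {"attrs": {"units": "K"}}, "time_bnds": {"attrs": {}}}
ds_sketch = to_cordex(raw)
print(sorted(ds_sketch))           # ['tas', 'time_bnds']
print(ds_sketch["tas"]["attrs"])   # {'units': 'K', 'original_name': 't2m'}
```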

Note that the files found and used to load the data are printed, along with the CF status of the dataset. This helps with debugging when the data is not loaded as expected.


In [3]:
ds = manager.load_data("ERA5",["pr"], period=[2000,2001],freq="daily",region="europe", path_identifiers=["min"])
ds

File paths found:
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/total_precipitation/era5-daily_min-europe-total_precipitation-2001.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/total_precipitation/era5-daily_min-europe-total_precipitation-2000.nc
The file is ValEnsPy CF compliant.
50.00% of the variables are ValEnsPy CF compliant
ValEnsPy CF compliant: ['pr']
Unknown to ValEnsPy: ['time_bnds']


| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 11.42 kiB | 5.72 kiB | (731, 2) | (366, 2) | 2 chunks in 5 graph layers | datetime64[ns] numpy.ndarray |
| 262.72 MiB | 127.17 MiB | (731, 163, 289) | (362, 161, 286) | 16 chunks in 7 graph layers | float64 numpy.ndarray |


Note that, depending on your search criteria, metadata is added to the dataset.

With ``load_data`` you can also:
- load multiple variables simultaneously,
- **not** convert the dataset to the CF-compliant format, using ``cf_convert=False``, and/or
- add additional metadata to the dataset through the ``metadata_info`` dictionary.

However, with ``cf_convert=False`` the dataset does not follow the CF convention and applying diagnostics will not work.

In [4]:
ds = manager.load_data("EOBS",["tas","pr"], path_identifiers=["mean"], cf_convert=True, metadata_info={"creator":"ME"})
ds

File paths found:
/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/rr_ens_mean_0.1deg_reg_v29.0e.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/tg_ens_mean_0.1deg_reg_v29.0e.nc
The file is ValEnsPy CF compliant.
100.00% of the variables are ValEnsPy CF compliant
ValEnsPy CF compliant: ['pr', 'tas']


| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 131.94 GiB | 509.86 MiB | (27028, 930, 1409) | (102, 930, 1409) | 265 chunks in 15 graph layers | float32 numpy.ndarray |
| 131.94 GiB | 509.86 MiB | (27028, 930, 1409) | (102, 930, 1409) | 265 chunks in 15 graph layers | float32 numpy.ndarray |


The name of the dataset (e.g. "ERA5") must be listed in the dataset_PATHS.yml file for the data to be found. When the input converter is used, ``cf_convert=True`` (the default option), a corresponding input converter must also be available. Currently the following datasets have input converters:

In [5]:
vp.inputconverter.INPUT_CONVERTORS

{'ERA5': <valenspy.inputconverter.InputConverter at 0x145cb4d9e340>,
 'ERA5-Land': <valenspy.inputconverter.InputConverter at 0x145cb4d9e160>,
 'EOBS': <valenspy.inputconverter.InputConverter at 0x145cb4d9e370>,
 'CLIMATE_GRID': <valenspy.inputconverter.InputConverter at 0x145cb4d9e430>}

## A peek inside the manager

The input manager takes the path specified in dataset_PATHS.yml for the given dataset and machine and searches it for all .nc files whose file paths match the requested filters. The following function does all the "magic":

In [6]:
manager._get_file_paths("EOBS", ["tas", "pr"], path_identifiers=["mean"])  # The magic happens here!

[PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/rr_ens_mean_0.1deg_reg_v29.0e.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/tg_ens_mean_0.1deg_reg_v29.0e.nc')]

Above, all paths are selected that start with '/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/', contain the original name (long or short name) of 'tas' or 'pr', in this case tg and rr, and contain "mean". Other options are:
- region: e.g. europe, belgium
- period: [start_year, end_year]; more than the requested period may be covered (note that some datasets are not stored by year)!
- freq: the frequency, e.g. yearly, daily, monthly
- other: any other keywords to filter by are specified in path_identifiers, e.g. "mean" for monthly mean data or "min" for minimum daily temperatures
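The filtering described above can be approximated with simple substring matching on the file paths. The sketch below is an assumption about the general approach, not the code behind ``_get_file_paths``; the helper ``filter_paths`` is hypothetical and the example paths are shortened versions of the EOBS files shown above.

```python
from pathlib import PurePosixPath


def filter_paths(paths, raw_names, period=None, path_identifiers=()):
    """Keep paths that mention a raw variable name, every identifier and,
    if a period is given, at least one year of that period in the file name."""
    selected = []
    for path in map(PurePosixPath, paths):
        text = str(path)
        if not any(name in text for name in raw_names):
            continue  # none of the requested variables appears in the path
        if not all(ident in text for ident in path_identifiers):
            continue  # a required keyword such as "mean" is missing
        if period is not None:
            years = [str(year) for year in range(period[0], period[1] + 1)]
            if not any(year in path.name for year in years):
                continue  # file name carries no year inside the period
        selected.append(path)
    return selected


example_paths = [
    "EOBS/0.1deg/rr_ens_mean_0.1deg_reg_v29.0e.nc",
    "EOBS/0.1deg/tg_ens_mean_0.1deg_reg_v29.0e.nc",
    "EOBS/0.1deg/tg_ens_spread_0.1deg_reg_v29.0e.nc",
]
matches = filter_paths(example_paths, raw_names=["tg", "rr"], path_identifiers=["mean"])
print([p.name for p in matches])
# ['rr_ens_mean_0.1deg_reg_v29.0e.nc', 'tg_ens_mean_0.1deg_reg_v29.0e.nc']
```

Plain substring matching is deliberately naive: a short raw name can match unrelated parts of a path, and a period filter like this one would drop files that are not stored by year.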

For more information see the documentation on the input_manager and the load_data function.

## Experimental: load multiple datasets

Functionality to load multiple datasets at once is being developed. It is currently implemented as ``load_m_data``, but its usage is still slightly clunky and can be improved. The idea is to load multiple datasets at once into a datatree, using the ``load_data`` function for each dataset separately.
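The per-dataset loop behind this idea might look roughly like the sketch below. It is not the implementation of ``load_m_data``: ``load_many`` and ``fake_load`` are hypothetical names, and ``fake_load`` merely stands in for ``manager.load_data`` so that the example runs without access to the data.

```python
def load_many(load_one, data_request, variables):
    """Call `load_one` once per dataset and collect the results by dataset name."""
    return {
        name: load_one(name, variables, **kwargs)
        for name, kwargs in data_request.items()
    }


def fake_load(name, variables, **kwargs):
    # Stand-in for manager.load_data: returns a description instead of a dataset.
    return {"dataset": name, "variables": list(variables), "filters": kwargs}


request = {
    "EOBS": {"path_identifiers": ["mean"]},
    "ERA5": {"period": [2000, 2001], "freq": "daily", "region": "europe"},
}
loaded = load_many(fake_load, request, variables=["tas", "pr"])
print(list(loaded))                 # ['EOBS', 'ERA5']
print(loaded["ERA5"]["filters"])    # {'period': [2000, 2001], 'freq': 'daily', 'region': 'europe'}
```

With real datasets, the resulting dictionary of dataset name to dataset could then be wrapped in a tree structure, which is what the datatree mentioned above is meant to provide.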

In [7]:
data_request_dict={"EOBS":
                        {"path_identifiers":["mean"]},
                    "ERA5":
                        {"period":[2000,2001],
                         "freq":"daily",
                         "region":"europe",
                         "path_identifiers":["min"]}}


dt = manager.load_m_data(data_request_dict, variables=["tas","pr"])

Loading data for EOBS...
File paths found:
/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/rr_ens_mean_0.1deg_reg_v29.0e.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/tg_ens_mean_0.1deg_reg_v29.0e.nc
The file is ValEnsPy CF compliant.
100.00% of the variables are ValEnsPy CF compliant
ValEnsPy CF compliant: ['pr', 'tas']
Loading data for ERA5...
File paths found:
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/total_precipitation/era5-daily_min-europe-total_precipitation-2001.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily_min-europe-2m_temperature-2001.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily_min-europe-2m_temperature-2000.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/total_precipitation

In [8]:
dt

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 131.94 GiB | 509.86 MiB | (27028, 930, 1409) | (102, 930, 1409) | 265 chunks in 15 graph layers | float32 numpy.ndarray |
| 131.94 GiB | 509.86 MiB | (27028, 930, 1409) | (102, 930, 1409) | 265 chunks in 15 graph layers | float32 numpy.ndarray |
| 11.42 kiB | 5.72 kiB | (731, 2) | (366, 2) | 2 chunks in 10 graph layers | datetime64[ns] numpy.ndarray |
| 262.72 MiB | 127.17 MiB | (731, 163, 289) | (362, 161, 286) | 16 chunks in 5 graph layers | float64 numpy.ndarray |
| 262.72 MiB | 127.17 MiB | (731, 163, 289) | (362, 161, 286) | 16 chunks in 7 graph layers | float64 numpy.ndarray |


## Manual tests of the input manager
Finding exceptional and rare cases.

### 1. EOBS finding mean and spread files

In [9]:
manager._get_file_paths("EOBS", ["tas"], path_identifiers=[])  # The magic happens here!

[PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/tg_ens_mean_0.1deg_reg_v29.0e.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/0.1deg/tg_ens_spread_0.1deg_reg_v29.0e.nc')]

### 2. ERA5 not returning only the "mean" values when min and max files also exist (due to the naming of the files!)
-> I don't think we need to resolve this here, but rather when we give the ERA5 data a new structure.

In [10]:
manager._get_file_paths("ERA5", ["tas"], period=[2000, 2001], freq="daily", region="europe")  # The magic happens here!

[PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily_min-europe-2m_temperature-2001.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily-europe-2m_temperature-2000.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily-europe-2m_temperature-2001.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily_min-europe-2m_temperature-2000.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily_max-europe-2m_temperature-2001.nc'),
 PosixPath('/dodrio/scratch/projects/2022_200/project_input/External/observations/era5/europe/daily/2m_temperature/era5-daily_max-europe-2m_temperature-2000.nc')]

The same is true for ERA5-Land: here, files with different values for pr (mean, min and max) are loaded.
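Until the files are restructured, one possible workaround is to post-filter the returned paths on the naming pattern visible in the ERA5 file names, where aggregate files carry ``daily_min`` or ``daily_max`` and the plain daily files carry neither. The helper ``drop_aggregates`` below is hypothetical and not part of ValEnsPy; it is a sketch that relies on that naming pattern holding for all files.

```python
from pathlib import PurePosixPath


def drop_aggregates(paths, tokens=("daily_min", "daily_max")):
    """Drop files whose name contains one of the aggregate tokens."""
    return [
        path
        for path in map(PurePosixPath, paths)
        if not any(token in path.name for token in tokens)
    ]


found = [
    "2m_temperature/era5-daily_min-europe-2m_temperature-2001.nc",
    "2m_temperature/era5-daily-europe-2m_temperature-2000.nc",
    "2m_temperature/era5-daily-europe-2m_temperature-2001.nc",
    "2m_temperature/era5-daily_max-europe-2m_temperature-2000.nc",
]
plain_daily = drop_aggregates(found)
print([p.name for p in plain_daily])
# ['era5-daily-europe-2m_temperature-2000.nc', 'era5-daily-europe-2m_temperature-2001.nc']
```

Matching on the full ``daily_min`` and ``daily_max`` tokens rather than on ``min`` and ``max`` alone avoids accidentally dropping variables whose long name happens to contain those letters.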

In [43]:
ds = manager.load_data("ERA5-Land",[ "pr","hfls"], period=[2000,2001], freq="daily", region="belgium", path_identifiers=[])


File paths found:
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5-land/belgium/daily/surface_latent_heat_flux/era5-land-daily_max-belgium-surface_latent_heat_flux-2000.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5-land/belgium/daily/surface_latent_heat_flux/era5-land-daily_max-belgium-surface_latent_heat_flux-2001.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5-land/belgium/daily/surface_latent_heat_flux/era5-land-daily_min-belgium-surface_latent_heat_flux-2000.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5-land/belgium/daily/total_precipitation/era5-land-daily_max-belgium-total_precipitation-2000.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5-land/belgium/daily/total_precipitation/era5-land-daily-belgium-total_precipitation-2001.nc
/dodrio/scratch/projects/2022_200/project_input/External/observations/era5-land/belgium/daily/total_precip