In [1]:
import valenspy as vp
import xarray as xr
from pathlib import Path

import dask

## The Input Manager
The input manager aims to make accessing shared standard datasets easy.
Because the available data, data paths and variables are HPC-specific, the input manager only works on HPC systems listed in dataset_PATHS.yml. To add a dataset or a new machine, add the machine name, datasets and paths to the dataset_PATHS.yml file.

In [2]:
manager = vp.InputManager(machine='hortense')

In [3]:
df = manager.available_data

In [5]:
df

dataset,variable,base_path,files
ERA5,pr,/dodrio/scratch/projects/2022_200/project_inpu...,[europe/total_precipitation/hourly/era5-hourly...
ERA5,prsn,/dodrio/scratch/projects/2022_200/project_inpu...,[]
ERA5,prw,/dodrio/scratch/projects/2022_200/project_inpu...,[europe/total_column_water_vapour/hourly/era5-...
ERA5,snw,/dodrio/scratch/projects/2022_200/project_inpu...,[]
ERA5,snd,/dodrio/scratch/projects/2022_200/project_inpu...,[]
...,...,...,...
CCLM,hfls,/dodrio/scratch/projects/2022_200/RCS/CORDEXBE...,[EUR11_1993_temp_NU_TT_EC_TSO/LHFL_S/temp/LHFL...
CCLM,ts,/dodrio/scratch/projects/2022_200/RCS/CORDEXBE...,[EUR11_NU_TT_GC_TSO/T_S/temp/T_S_lffd199506180...
CCLM,prw,/dodrio/scratch/projects/2022_200/RCS/CORDEXBE...,[EUR11_CO_TA_GC_TSO/TQV/temp/TQV_lffd199505210...
CCLM,clivi,/dodrio/scratch/projects/2022_200/RCS/CORDEXBE...,[EUR11_NU_TT_GC_TSO/TQI/temp/TQI_lffd199506050...


In [7]:
df = df.reset_index()
df.loc[df['dataset'] == 'ERA5'].variable.unique()

array(['pr', 'prsn', 'prw', 'snw', 'snd', 'mrros', 'mrro', 'tas', 'ts',
       'evspsbl', 'evspsblpot', 'uas', 'sfcWind', 'hfls', 'hfss', 'rsds',
       'rlds', 'ps', 'psl', 'zmla', 'clt', 'cll', 'clm', 'clh', 'rsdt',
       'hurs', 'prhmax', 'tasmax', 'tasmin', 'sfcWindmax', 'wsgsmax',
       'mrso'], dtype=object)

In [9]:
df.loc[df['variable'] == 'pr']

,index,dataset,variable,base_path,files
0,0,ERA5,pr,/dodrio/scratch/projects/2022_200/project_inpu...,[europe/total_precipitation/hourly/era5-hourly...
32,32,ERA5-Land,pr,/dodrio/scratch/projects/2022_200/project_inpu...,[belgium/monthly/total_precipitation/era5-land...
67,67,EOBS,pr,/dodrio/scratch/projects/2022_200/project_inpu...,"[rr_ens_mean_0.1deg_reg_v29.0e.nc, rr_ens_spre..."
74,74,CLIMATE_GRID,pr,/dodrio/scratch/projects/2022_200/project_outp...,[PRECIP_QUANTITY_CLIMATE_GRID_1951_2023_daily....
86,86,CCLM,pr,/dodrio/scratch/projects/2022_200/RCS/CORDEXBE...,[EUR11_CO_TA_GC_TSO/TOT_PREC/temp/TOT_PREC_lff...
97,97,RADCLIM,pr,/dodrio/scratch/projects/2022_200/project_outp...,[2019/RADCLIM_precipitation_20190118_hourly.nc...


### Usage

For everyday use, a dataset can be loaded with load_data. Variables are requested by their CORDEX variable name.

You do not need to know what the variable is called in the original dataset. E.g. in ERA5, the 2 m temperature is called 't2m' but is accessed here through the CORDEX variable name 'tas'. The original name is added to the attributes of the variable for reference.

Note that the files that are found and used to load the data are printed, together with the CF_status of the ds. This helps with debugging when the data is not loaded as expected.
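
The name mapping described above can be pictured with a small stand-alone sketch. It is not ValEnsPy's implementation: the `ORIGINAL_NAMES` table, the `to_cordex` helper and the `original_name` attribute key are hypothetical, and plain dictionaries stand in for xarray objects. Only the ERA5 and EOBS names mentioned in this notebook are listed.

```python
# Hypothetical lookup table: CORDEX name -> original short name, per dataset.
# This is an illustration, not the table ValEnsPy actually uses.
ORIGINAL_NAMES = {
    "ERA5": {"tas": "t2m", "pr": "tp"},
    "EOBS": {"tas": "tg", "pr": "rr"},
}


def to_cordex(dataset, variables):
    """Rename {original_name: data} to CORDEX names, keeping the original name as metadata."""
    reverse = {orig: cordex for cordex, orig in ORIGINAL_NAMES[dataset].items()}
    renamed = {}
    for orig_name, data in variables.items():
        # Names without a known mapping pass through unchanged.
        cordex_name = reverse.get(orig_name, orig_name)
        renamed[cordex_name] = {"data": data, "attrs": {"original_name": orig_name}}
    return renamed


# Dummy values, only there to show the renaming.
era5 = to_cordex("ERA5", {"t2m": [281.3, 282.1], "tp": [0.0, 0.4]})
print(sorted(era5))
print(era5["tas"]["attrs"]["original_name"])
```

The point of the sketch is that the caller only ever uses 'tas' or 'pr', while the original name stays available in the attributes for reference.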


In [3]:
ds = manager.load_data("ERA5",["pr"], period=[2000],freq="daily",region="europe", path_identifiers=["min"])
ds

FileNotFoundError: No files found for dataset ERA5, variables ['pr'], period 2000, frequency daily, region europe and path_identifiers ['min'].

Note that, depending on your search criteria, metadata is added to the dataset.

With load_data you can also:
- load multiple variables simultaneously,
- skip the conversion of the ds to CF-compliant format with ``cf_convert=False``, and/or
- add additional metadata to the ds through the metadata_info dictionary.

Note that with ``cf_convert=False`` the ds is not in CF convention, so applying diagnostics will not work.

In [None]:
ds = manager.load_data("EOBS",["tas","pr"], path_identifiers=["mean"], cf_convert=True, metadata_info={"creator":"ME"})
ds


The name of the dataset (e.g. "ERA5") must be listed in the dataset_PATHS.yml file for the data to be found. When using the input convertor (``cf_convert=True``, the default option), a corresponding input convertor must be available for that dataset. Currently the following datasets have input convertors:

In [None]:
vp.inputconverter.INPUT_CONVERTORS


## A peek inside the manager

The input manager takes the path specified in dataset_PATHS.yml for the given dataset and machine, searches it for all .nc files, and keeps the file paths that match the requested filters. The following function does all the "magic":

In [None]:
manager._get_file_paths("EOBS",["tas","pr"], path_identifiers=["mean"])  # The magic happens here!


In the call above, all paths are selected that start with '/dodrio/scratch/projects/2022_200/project_input/External/observations/EOBS/', contain the original name (long or short name) of 'tas' or 'pr' (in this case tg and rr), and contain "mean". Other filter options are:
- region: e.g. europe, belgium
- period: [start_year, end_year]; the files found may cover more than the requested period (note that some datasets are not stored by year)
- freq: the frequency, e.g. yearly, daily, monthly
- path_identifiers: any other keywords to filter by, e.g. "mean" for monthly mean data or "min" for minimum daily temperatures
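
The filtering described above can be approximated with plain substring checks. The sketch below is an assumption about the general idea, not the actual `_get_file_paths` code: the `filter_paths` helper and the example file names are hypothetical and were made up for illustration.

```python
from pathlib import PurePosixPath


def filter_paths(paths, original_names, region=None, period=None, freq=None,
                 path_identifiers=()):
    """Keep the .nc paths that match every requested criterion (substring checks only)."""
    selected = []
    for path in paths:
        text = str(path)
        if PurePosixPath(text).suffix != ".nc":
            continue
        # At least one original variable name must appear in the path.
        if not any(name in text for name in original_names):
            continue
        if region is not None and region not in text:
            continue
        if freq is not None and freq not in text:
            continue
        # Every extra identifier (e.g. "mean", "min") must appear in the path.
        if not all(ident in text for ident in path_identifiers):
            continue
        # The path must mention at least one year of the requested period.
        if period is not None:
            years = range(period[0], period[-1] + 1)
            if not any(str(year) in text for year in years):
                continue
        selected.append(text)
    return selected


# Made-up relative paths, not real files on any machine.
example_paths = [
    "europe/daily/tg_mean_2000.nc",
    "europe/daily/tg_min_2000.nc",
    "belgium/daily/tg_mean_2000.nc",
    "europe/daily/rr_mean_2005.nc",
    "europe/daily/tg_mean_2000.txt",
]

strict = filter_paths(example_paths, ["tg", "rr"], region="europe",
                      period=[2000, 2001], freq="daily", path_identifiers=["mean"])
loose = filter_paths(example_paths, ["tg"])
print(strict)
print(len(loose))
```

With no path_identifiers, the looser call keeps the mean and min files alike, which is how a request can end up mixing different statistics when they only differ in the file name.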

For more information see the documentation on the input_manager and the load_data function.

## Experimental: load multiple datasets

Functionality to load multiple datasets at once is in development. It is currently implemented as load_m_data, but its usage is still slightly clunky and can be improved. The idea is to load multiple datasets at once into a datatree, using the load_data function for each dataset separately.

In [None]:
data_request_dict={"EOBS":
                        {"path_identifiers":["mean"]},
                    "ERA5":
                        {"period":[2000,2001],
                         "freq":"daily",
                         "region":"europe",
                         "path_identifiers":["min"]}}


dt = manager.load_m_data(data_request_dict, variables=["tas","pr"])
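
Conceptually, a request dictionary like the one above amounts to one load per dataset, each with its own keyword arguments. The following stand-alone sketch illustrates that idea under stated assumptions: `load_many` and `fake_load` are hypothetical names, a plain dict stands in for the datatree, and the stand-in loader only records how it was called instead of reading files.

```python
def load_many(data_request, variables, load_one):
    """Call load_one once per dataset and collect the results by dataset name."""
    results = {}
    for dataset_name, options in data_request.items():
        # Each dataset gets the shared variable list plus its own filter options.
        results[dataset_name] = load_one(dataset_name, variables, **options)
    return results


def fake_load(dataset_name, variables, **options):
    """Stand-in loader: returns a record of the call rather than a dataset."""
    return {"dataset": dataset_name, "variables": list(variables), "options": options}


request = {
    "EOBS": {"path_identifiers": ["mean"]},
    "ERA5": {"period": [2000, 2001], "freq": "daily", "region": "europe",
             "path_identifiers": ["min"]},
}

tree = load_many(request, ["tas", "pr"], fake_load)
print(list(tree))
print(tree["ERA5"]["options"]["freq"])
```

Passing the loader in as an argument keeps the sketch runnable without access to the data; in practice a bound method such as manager.load_data would presumably take its place.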


In [None]:
dt


## Manual tests of input manager 
Finding exceptional and rare cases

### 1. EOBS finding mean and spread files

In [None]:
manager._get_file_paths("EOBS",["tas"], path_identifiers=[])  # The magic happens here!


### 2. ERA5 does not return only "mean" values when min and max files also exist (due to the naming of the files!)
-> I don't think we need to resolve this here, but rather when we give the ERA5 data a new structure.

In [None]:
manager._get_file_paths("ERA5",["tas"], period=[2000,2001],freq="daily",region="europe")  # The magic happens here!


The same is true for ERA5-Land: here different values for pr (mean, min and max) are loaded.

In [None]:
ds = manager.load_data("ERA5-Land",[ "pr","hfls"], period=[2000,2001], freq="daily", region="belgium", path_identifiers=[])

