# Input 

The input step is the first step of any analysis and consists of loading all the datasets we want to use. 

In this notebook, we will explore what we can do with ValEnsPy to load our data and why this is important.


### 0. Loading packages & settings
As always, we first load the necessary packages.

In [1]:
import valenspy as vp

# 1. The Input Step
## 1.1 Why ValEnsPy?
1. Different gridded datasets have different naming conventions and units for the same variables. Therefore, to compare them, a translation and unit conversion is often necessary.
ValEnsPy automates this process, allowing you to focus on the analysis. 
 
2. ValEnsPy also simplifies access to standard datasets, such as [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=overview) and CLIMATE_GRID.
This allows you to load the data with a single line of code without even explicitly knowing where the data is stored.

## 1.2 Variable convention

ValEnsPy uses the [CMIP/CORDEX](https://cordex.org/experiment-guidelines/cordex-cmip6/data-request-cordex-cmip6-rcms/) variable naming conventions and the [Climate and Forecast (CF)](http://cfconventions.org/) metadata conventions.
[Here is an overview of all the variable names and their corresponding metadata](https://github.com/CORDEX-be2/ValEnsPy/blob/main/src/valenspy/ancilliary_data/CORDEX_variables.yml).

All functions within ValEnsPy are built to work with these conventions. This allows you to easily switch between different datasets and models.

## 1.3 Loading a dataset - The InputManager
The simplest way to load a dataset is with the InputManager. 
To start, we have to specify which machine we are using. This tells the InputManager where to look for the data.

In [2]:
manager = vp.InputManager("tier2_VO_geo")  # Name of the machine we are working on

Now that we have defined the machine, we can load some data.
Here we load the CLIMATE_GRID data for the variable "tas" (near-surface air temperature) on the "latlon_5km" grid.

In [3]:
ds_ref = manager.load_data("CLIMATE_GRID", variables="tas", path_identifiers=["latlon_5km"])

File paths found:
/data/gent/vo/002/gvo00202/master/data/Belgium/observations/climate_grid/TEMP_AVG_CLIMATE_GRID_1954_2023_daily_latlon_5km.nc
The file is ValEnsPy CF compliant.
100.00% of the variables are ValEnsPy CF compliant
ValEnsPy CF compliant: ['tas']


When loading data with the InputManager, the following steps take place:
1. The InputManager looks for the available datasets on the specified machine and uses the search terms to find the correct dataset.
2. The files that were found are loaded with xarray.
3. The variable names are translated and the units are converted.
4. Information about the dataset is printed, namely the files found and which variables were automatically converted to the correct naming and metadata conventions.
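
The report printed in step 4 can be pictured as a comparison between the variables in the loaded dataset and a table of expected names and units. The sketch below is a minimal illustration on plain Python dictionaries. The `CONVENTION` table and the `compliance_report` function are hypothetical names and not part of ValEnsPy, whose actual checks may be more extensive.

```python
# Hypothetical excerpt of a convention table: CORDEX variable name -> expected units
CONVENTION = {"tas": "K", "pr": "kg m-2 s-1"}


def compliance_report(variables):
    """Split variables (name -> units) into compliant and non-compliant ones.

    Returns the percentage of compliant variables and both lists of names.
    """
    compliant = [name for name, units in variables.items()
                 if CONVENTION.get(name) == units]
    non_compliant = [name for name in variables if name not in compliant]
    pct = 100.0 * len(compliant) / len(variables) if variables else 0.0
    return pct, compliant, non_compliant


# A dataset that was fully converted
report_ok = compliance_report({"tas": "K"})

# A dataset in which one variable kept its raw name and units
report_partial = compliance_report({"tas": "K", "TEMP_MAX": "Celsius"})

print(f"{report_ok[0]:.2f}% of the variables are compliant: {report_ok[1]}")
```

A variable only counts as compliant here when both its name and its units match the table, which is why a raw name such as "TEMP_MAX" ends up in the non-compliant list.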

Now let's take a look at the data we loaded:

In [4]:
ds_ref

| | Array | Chunk |
|---|---|---|
| Bytes | 1.00 GiB | 127.98 MiB |
| Shape | (25567, 70, 75) | (13334, 34, 37) |
| Dask graph | 18 chunks in 3 graph layers | 18 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


Note that near-surface air temperature (tas) is called "TEMP_AVG" in the raw CLIMATE_GRID dataset and is stored in degrees Celsius, not Kelvin! ValEnsPy automatically translated the name and converted the units for us.
ValEnsPy keeps track of this information in the metadata - can you find it in the xarray attributes above?

## 1.4 The InputManager - In depth

Now that we have loaded the data, we can take a closer look at the InputManager and the options it offers.

### 1.4.1 The InputManager - finding the dataset
The InputManager starts by looking for the available datasets on the specified machine. There is a [list of available machines and datasets within these machines](https://github.com/CORDEX-be2/ValEnsPy/blob/main/src/valenspy/ancilliary_data/dataset_PATHS.yml).
Then all search terms passed to `load_data` are used to find the correct dataset.
The options are:
- dataset_name: The name of the dataset you want to load.
- variables: The variable(s) you want to load - these should be the CMIP6/CORDEX [variable names](https://github.com/CORDEX-be2/ValEnsPy/blob/main/src/valenspy/ancilliary_data/CORDEX_variables.yml)!
- period: A year (1999) or a range of years [1999, 2000].
- region: Some data is available for different regions, such as "global" or "europe".
- path_identifiers: Some datasets have other identifiers in the path, such as "latlon_5km" for the CLIMATE_GRID dataset.

Note that not all options are mandatory! For example, the CLIMATE_GRID data is stored per variable for all available years, so specifying a period is not necessary.

For more info see the [documentation](https://cordex-be2.github.io/ValEnsPy/) or the docstring of the InputManager:

In [5]:
help(manager.load_data)

Help on method load_data in module valenspy.input.manager:

load_data(dataset_name, variables=['tas'], period=None, freq=None, region=None, cf_convert=True, path_identifiers=[], metadata_info={'path_identifiers': ['latlon_5km'], 'dataset': 'CLIMATE_GRID'}) method of valenspy.input.manager.InputManager instance
    Load the data for the specified dataset, variables, period and frequency and transform it into ValEnsPy CF-Compliant format.
    
    For files to be found and loaded they should be in a subdirectory of the dataset path and contain
    the raw_long_name or raw_name or CORDEX variable name, the year (optional), frequency and path_identifiers (optional) in the file name.
    
    A regex search is used to match any netcdf (.nc) file paths that start with the dataset_path from the dataset_PATHS.yml and contains:
    1) The raw_long_name of the CORDEX variables given the dataset_name_lookup.yml
    2) Any YYYY string within the period
    3) The frequency of the data (daily, mont
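
The file search described in the docstring above can be sketched with Python's `re` module. This is a simplified illustration, not the ValEnsPy implementation: the function `find_files`, its parameters and the example paths are hypothetical, and the frequency term is left out for brevity.

```python
import re


def find_files(paths, dataset_path, raw_names, period=None, path_identifiers=()):
    """Return the .nc paths under dataset_path that match all search terms."""
    # Normalise the period: None -> no year filter, 1999 -> [1999],
    # [1999, 2001] -> [1999, 2000, 2001]
    if period is None:
        years = []
    elif isinstance(period, int):
        years = [period]
    else:
        start, end = period
        years = list(range(start, end + 1))

    matches = []
    for path in paths:
        if not (path.startswith(dataset_path) and path.endswith(".nc")):
            continue
        # At least one raw variable name must occur in the path
        name_ok = any(re.search(re.escape(n), path) for n in raw_names)
        # If a period is given, at least one of its years must occur as a
        # standalone four-digit string
        year_ok = not years or any(
            re.search(rf"(?<!\d){y}(?!\d)", path) for y in years
        )
        # Every extra path identifier must occur in the path
        ids_ok = all(re.search(re.escape(i), path) for i in path_identifiers)
        if name_ok and year_ok and ids_ok:
            matches.append(path)
    return matches


# Hypothetical file listing
paths = [
    "/data/obs/climate_grid/TEMP_AVG_CLIMATE_GRID_1954_2023_daily_latlon_5km.nc",
    "/data/obs/climate_grid/TEMP_AVG_CLIMATE_GRID_1954_2023_daily_lambert_1km.nc",
    "/data/obs/climate_grid/PRECIP_CLIMATE_GRID_1954_2023_daily_latlon_5km.nc",
    "/data/obs/era5/2m_temperature_1999_daily.nc",
]

found = find_files(paths, "/data/obs/climate_grid", ["TEMP_AVG"],
                   path_identifiers=["latlon_5km"])
```

In this sketch a file whose name only contains the first and last year of a long record (such as 1954 and 2023) is not matched by a period in between, which illustrates why a period is best left out for datasets stored as a single multi-year file.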

### 1.4.2 The InputManager - automatic conversion and translation
Once the files are found, the InputManager automatically converts the variable names and units to the CMIP6/CORDEX conventions. This allows you to load near-surface air temperature (tas) from both ERA5 and CLIMATE_GRID without having to know that in ERA5 tas is called "t2m" and in CLIMATE_GRID it is called "TEMP_AVG".

This is not magic; rather, so-called [InputConvertors](#15-input-convertors) do this conversion.

## 1.5 Input Convertors
Input Convertors are used to convert the variable names and units of the loaded data to the CMIP6/CORDEX conventions.
There is an input convertor for each dataset type. Below is a list of the available input convertors:

In [6]:
vp.INPUT_CONVERTORS

{'ERA5': <valenspy.input.converter.InputConverter at 0x14cbc35fac50>,
 'ERA5-Land': <valenspy.input.converter.InputConverter at 0x14cbc35fadd0>,
 'EOBS': <valenspy.input.converter.InputConverter at 0x14cbc36cc070>,
 'CLIMATE_GRID': <valenspy.input.converter.InputConverter at 0x14cbc36cc220>,
 'CCLM': <valenspy.input.converter.InputConverter at 0x14cbc36cc2e0>,
 'ALARO_K': <valenspy.input.converter.InputConverter at 0x14cbc36cc0a0>,
 'RADCLIM': <valenspy.input.converter.InputConverter at 0x14cbc36cc190>}

Each input convertor uses a lookup table linking the original variable names and units to the CMIP6/CORDEX conventions.

For example, in the [ERA5_lookup.yml](https://github.com/CORDEX-be2/ValEnsPy/blob/main/src/valenspy/ancilliary_data/ERA5_lookup.yml) file, "tas" is linked to "t2m" and is already in Kelvin, whereas in the [CLIMATE_GRID_lookup.yml](https://github.com/CORDEX-be2/ValEnsPy/blob/main/src/valenspy/ancilliary_data/CLIMATE_GRID_lookup.yml) file, "tas" is linked to "TEMP_AVG" and is in degrees Celsius.

This information is used to convert the variables.
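
The idea of a lookup-driven conversion can be sketched on plain Python dictionaries. The structure of the lookup tables, the `UNIT_CONVERSIONS` table and the `convert` function below are assumptions made for illustration; the actual YAML files and the `InputConverter` class may be organised differently.

```python
# Hypothetical lookup tables: CORDEX name -> raw name and raw units per dataset
CLIMATE_GRID_LOOKUP = {"tas": {"raw_name": "TEMP_AVG", "raw_units": "Celsius"}}
ERA5_LOOKUP = {"tas": {"raw_name": "t2m", "raw_units": "K"}}

# Target units per CORDEX variable and the available unit conversions
TARGET_UNITS = {"tas": "K"}
UNIT_CONVERSIONS = {("Celsius", "K"): lambda value: value + 273.15}


def convert(raw, lookup):
    """Rename raw variables to CORDEX names and convert their units.

    raw maps a raw variable name to a dict holding a list of "values".
    """
    converted = {}
    for cordex_name, entry in lookup.items():
        raw_name = entry["raw_name"]
        if raw_name not in raw:
            continue  # this variable is not present in the raw data
        values = raw[raw_name]["values"]
        source, target = entry["raw_units"], TARGET_UNITS[cordex_name]
        if source != target:
            values = [UNIT_CONVERSIONS[(source, target)](v) for v in values]
        converted[cordex_name] = {
            "values": values,
            "units": target,
            # Keep track of where the data came from
            "original_name": raw_name,
            "original_units": source,
        }
    return converted


climate_grid = convert({"TEMP_AVG": {"values": [0.0, 10.0]}}, CLIMATE_GRID_LOOKUP)
era5 = convert({"t2m": {"values": [280.0]}}, ERA5_LOOKUP)
```

Both results expose the same variable name and units, so downstream code no longer needs to know which dataset the values came from.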

### 1.5.1 Example
If we were to load the CLIMATE_GRID dataset as is (using xarray) - the variable name would be "TEMP_AVG" and the units would be degrees Celsius.

In [7]:
import xarray as xr
# Get the raw location of the file
file = manager._get_file_paths("CLIMATE_GRID", variables=["tas"], path_identifiers=["latlon_5km"])[0]
ds = xr.open_dataset(file, chunks="auto")  # Open the raw file with xarray
ds

| | Array | Chunk |
|---|---|---|
| Bytes | 1.00 GiB | 127.98 MiB |
| Shape | (25567, 70, 75) | (13334, 34, 37) |
| Dask graph | 18 chunks in 2 graph layers | 18 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


However, we can convert it automatically with the matching input convertor from `vp.INPUT_CONVERTORS`:

In [8]:
ic = vp.INPUT_CONVERTORS["CLIMATE_GRID"]
ds_converted = ic.convert_input(ds)
ds_converted

The file is ValEnsPy CF compliant.
100.00% of the variables are ValEnsPy CF compliant
ValEnsPy CF compliant: ['tas']


| | Array | Chunk |
|---|---|---|
| Bytes | 1.00 GiB | 127.98 MiB |
| Shape | (25567, 70, 75) | (13334, 34, 37) |
| Dask graph | 18 chunks in 3 graph layers | 18 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


NOTE! The InputManager does all this automatically for you! 
This is just to show you how it works under the hood.

For more information on the Input Convertors see the [documentation](https://cordex-be2.github.io/ValEnsPy/).

If you want to add a new dataset to ValEnsPy (a new model, another observational product, etc.), a new input convertor and lookup (translation) file need to be created. Luckily, the existing input convertors and conversion functions provide plenty of good examples.