# Training models on Fluxnet data

To train Fluxnet models efficiently, we created a workflow runner.

Once you have specified a few directories, the datasets to include, and the labels from those datasets to use, you can run the workflow.

We'll start by importing the necessary modules and setting up Dask:

In [1]:
from pathlib import Path

from dask.distributed import Client

from excited_workflow.train_fluxnet_models import FluxnetExperiment
from excited_workflow.train_fluxnet_models import calculate_era5_derived_vars
from excited_workflow.train_fluxnet_models import collect_training_data
from excited_workflow.train_fluxnet_models import run_workflow


client = Client(n_workers=2, threads_per_worker=2)

Next we have to define some paths:

- Where is the pre-processed Fluxnet data stored?
- Where should the pre-processed ERA5 data be stored?
- Where do you want the trained models to be written to?

You also have to specify which additional (monthly) datasets are required:

In [2]:
ameriflux_file = Path("/data/volume_2/NEE_ameriflux_transcom2.nc")
preprocessed_dir = Path("/data/volume_2/preprocessed_site_data")
output_directory = Path("/data/volume_2/trained_models")

additional_datasets = [
    "biomass",
    "spei",
    "modis",
]
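
Before launching a long run, it can help to confirm that the input file exists and that the output location is writable. The sketch below is not part of `excited_workflow`: `check_paths` is a hypothetical helper, and it assumes that input paths must already exist while the output directory may be created on demand.

```python
from pathlib import Path


def check_paths(input_paths: list[Path], output_dir: Path) -> list[Path]:
    """Return the input paths that are missing and create the output directory.

    Hypothetical helper for illustration; not part of excited_workflow.
    """
    missing = [path for path in input_paths if not path.exists()]
    # Creating the output directory up front surfaces permission errors early.
    output_dir.mkdir(parents=True, exist_ok=True)
    return missing
```

With the paths defined above, this could be called as `check_paths([ameriflux_file], output_directory)`; an empty list means every input was found.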

If you want to know which variables will be available when you run this workflow, you can load the dataset that the workflow uses.
Note that loading in all this data takes some time, especially if the ERA5 data has not been pre-processed yet.

The `collect_training_data` function accepts an optional function that derives additional variables from the collected xarray Dataset. Here we use the `calculate_era5_derived_vars` function from `excited_workflow.train_fluxnet_models`.
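
If you want to supply your own derivation function, a minimal sketch could look like the following. It assumes that the function receives an `xarray.Dataset` and returns a `Dataset` with the extra variables added; that signature is inferred from the description above, not confirmed against the library. The function name `add_dewpoint_depression` and the tiny demo dataset are made up for illustration.

```python
import numpy as np
import xarray as xr


def add_dewpoint_depression(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical derivation function: add t2m - d2m as a new variable."""
    # assign returns a new Dataset and leaves the input untouched.
    return ds.assign(dewpoint_depression=ds["t2m"] - ds["d2m"])


# Tiny stand-in dataset; the real one comes from collect_training_data.
demo = xr.Dataset(
    {
        "t2m": ("time", np.array([280.0, 285.0])),
        "d2m": ("time", np.array([275.0, 284.0])),
    }
)
derived = add_dewpoint_depression(demo)
print(derived["dewpoint_depression"].values)
```

The cell below passes the packaged `calculate_era5_derived_vars` function instead.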

In [3]:
ds = collect_training_data(
    ameriflux_file, preprocessed_dir, additional_datasets,
    variable_derivation=calculate_era5_derived_vars,
)
ds

Valid file fluxnet-sites_era5_10m_v_component_of_wind_2004.nc already exists, skipping.                         

*Output: an xarray Dataset of Dask-backed `float32` arrays. Most variables have shape (61, 271755); a few have shape (271755,) or (61,).*


*If you also want the pandas DataFrame that is used for training the model, run the following command: `ds.to_dataframe().dropna().reset_index()`*
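
To illustrate what that conversion does, here is a small sketch on a made-up dataset. The variable names match ones used later in this notebook, but the values and the `time` dimension name are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

# Two sites, two time steps, with one missing value in each variable.
ds_demo = xr.Dataset(
    {
        "t2m": (("site", "time"), np.array([[280.0, np.nan], [275.0, 276.0]])),
        "resp": (("site", "time"), np.array([[1.2, 1.5], [0.8, np.nan]])),
    },
    coords={"site": ["A", "B"], "time": [0, 1]},
)

# to_dataframe: one row per (site, time) pair, indexed by both dimensions.
# dropna: discard rows where any variable is missing.
# reset_index: turn the site and time index levels into ordinary columns.
df = ds_demo.to_dataframe().dropna().reset_index()
print(df)
```

Only the rows where every variable has a value survive, so gaps in any single predictor reduce the number of training samples.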

## Defining the experiments

Now you can define your model training parameters. You will need to define:
- The `group_key`: the name of the variable used to split the data into groups for cross-validation (i.e. the site names).
- The predictor variables (`X_keys`).
- The target variable (`y_key`).
- The cross-validation method (see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators)).
- The name of the ML model you want to use. For the available models, see the [PyCaret documentation](https://pycaret.gitbook.io/docs/get-started/quickstart#compare-models-1).

In [4]:
# Define keys for models
group_key = "site"  # for fold groups
X_keys_resp = [
    "d2m", "t2m", "ssr", # era5
    "biomass", "spei", "NDVI", "NIRv", # other datasets
    "day_of_year", "t2m_1w_rolling", "mean_air_temperature",
    "mean_dewpoint_depression", "dewpoint_depression_1w_rolling"
]
y_key_resp = "resp"

X_keys_gpp = [
    "d2m", "mslhf", "msshf", "ssr", "ssr_6hr", "str", "t2m", # era5
    "biomass", "spei", "NDVI", "NIRv", # other datasets
]
y_key_gpp = "GPP_NT_VUT_REF"

from sklearn.model_selection import GroupShuffleSplit
# 10 random splits; each holds out about 40% of the sites (groups) for testing
cv_method = GroupShuffleSplit(n_splits=10, test_size=0.4)
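
To see what `GroupShuffleSplit` does with a group key such as `site`, the sketch below runs it on made-up site labels and collects the sites that land in each set. The labels, the number of splits, and `random_state` are illustrative assumptions, not values used by the workflow.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Ten samples from five made-up sites (two samples per site).
sites = np.array(["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"])
X = np.arange(len(sites)).reshape(-1, 1)

# test_size is a fraction of the groups (sites), not of the samples.
splitter = GroupShuffleSplit(n_splits=3, test_size=0.4, random_state=0)

folds = []
for train_idx, test_idx in splitter.split(X, groups=sites):
    train_sites = set(sites[train_idx].tolist())
    test_sites = set(sites[test_idx].tolist())
    folds.append((train_sites, test_sites))
    print("train:", sorted(train_sites), "test:", sorted(test_sites))
```

Because whole sites are held out, the scores reflect how well a model transfers to sites it has not seen. Passing `random_state` makes the splits reproducible between runs.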

All of this information is passed to the `FluxnetExperiment` dataclass.

We will train three models here: two for respiration (using different ML models) and one for GPP.

The experiment name is used to create the model's output directory; you are free to choose it.

In [5]:
models = [
    FluxnetExperiment(
        name="respiration",
        X_keys=X_keys_resp,
        y_key=y_key_resp,
        ml_model_name="ridge",
        cv_method=cv_method,
        cv_group_key=group_key,
        output_dir=output_directory
    ),
    FluxnetExperiment(
        name="respiration",
        X_keys=X_keys_resp,
        y_key=y_key_resp,
        ml_model_name="lightgbm",
        cv_method=cv_method,
        cv_group_key=group_key,
        output_dir=output_directory
    ),
    FluxnetExperiment(
        name="gpp",
        X_keys=X_keys_gpp,
        y_key=y_key_gpp,
        ml_model_name="lightgbm",
        cv_method=cv_method,
        cv_group_key=group_key,
        output_dir=output_directory
    ),
]
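
A misspelled key would only show up once the workflow is already running, so it can be worth checking the requested keys against the dataset first. The sketch below is a hypothetical helper (`find_missing_keys` is not part of `excited_workflow`) and uses a hard-coded set of names as a stand-in for the real dataset.

```python
def find_missing_keys(available: set[str], x_keys: list[str], y_key: str) -> list[str]:
    """Return the requested variable names that are not in `available`, in order."""
    requested = [*x_keys, y_key]
    return [key for key in requested if key not in available]


# Stand-in for the names in the training dataset, e.g. set(ds.variables).
available = {"d2m", "t2m", "ssr", "resp"}
print(find_missing_keys(available, ["d2m", "t2m", "NDVI"], "resp"))  # ['NDVI']
```

With the dataset loaded earlier, the same check could be run as `find_missing_keys(set(ds.variables), X_keys_resp, y_key_resp)`.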

## Executing the workflow

Now you can run the workflow. For a deeper look into the specific steps of the workflow, see the file `src/excited_workflow/train_fluxnet_models.py`.

In [8]:
run_workflow(
    fluxnet_file=ameriflux_file,
    preprocessing_dir=preprocessed_dir,
    additional_datasets=additional_datasets,
    models=models,
    variable_derivation=calculate_era5_derived_vars,
)

Valid file fluxnet-sites_era5_10m_v_component_of_wind_2004.nc already exists, skipping.                         

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| ridge | Ridge Regression | 1.41915 | 5.57636 | 2.3246 | 0.40921 | 0.46475 | 15009.15083 | 0.614 |

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 1.34189 | 5.08345 | 2.216 | 0.40873 | 0.45494 | 14587.10069 | 3.761 |

The maximum opset needed by this model is only 8.

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 2.44205 | 24.3207 | 4.91198 | 0.54952 | 0.63091 | 6.56505 | 3.032 |


The maximum opset needed by this model is only 8.
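
The MAPE values of the first two models are several orders of magnitude larger than that of the third. One plausible explanation (an assumption, not verified against this data) is that their target is sometimes very close to zero: MAPE divides each absolute error by the magnitude of the true value, so a few near-zero targets can dominate the mean. The sketch below shows the effect on made-up numbers.

```python
import numpy as np


def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, expressed as a fraction."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))


y_pred = np.array([0.5, 2.1, 2.9])

# All targets well away from zero.
mape_regular = mape(np.array([1.0, 2.0, 3.0]), y_pred)
# Same predictions, but one target is close to zero.
mape_near_zero = mape(np.array([0.001, 2.0, 3.0]), y_pred)

print(mape_regular, mape_near_zero)
```

For such targets, MAE, RMSE and R2 are usually the more informative columns.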
