# Training a model on CarbonTracker's carbon flux
This notebook walks through the full workflow for loading and preprocessing the following datasets so that an ML model can be trained on them:

- CarbonTracker
- ERA5 (monthly)
- ERA5-land (monthly)
- SPEI (monthly)
- MODIS (monthly)
- Biomass (yearly)
- Copernicus Landcover (yearly)

First follow the data downloading and config setup instructions.

If you run this notebook on SURF Research Cloud, this step should not be necessary.

In [1]:
import datetime
from pathlib import Path

import xarray as xr
from dask.distributed import Client

import excited_workflow
from excited_workflow import carbon_tracker
from excited_workflow.source_datasets import datasets


client = Client()

Define the paths to the CarbonTracker dataset and the regions dataset, and create the output directory. Also define which datasets to include, the input variables (`x_keys`) for the model, and the target variable (`y_key`).

In [None]:
cb_file = Path("/data/volume_2/EXCITED_prepped_data/CT2022.flux1x1-monthly.nc")
regions_file = Path("/data/volume_2/EXCITED_prepped_data/regions.nc")
output_path = Path.home()

time = datetime.datetime.now().strftime("%Y-%m-%d_%H_%M")
output_dir = output_path / f"carbon_tracker-{time}"
output_dir.mkdir(parents=True, exist_ok=True)

desired_data = [
    "biomass",
    "spei",
    "modis",
    "era5_monthly",
    "era5_land_monthly",
    "copernicus_landcover"
]

x_keys = ["d2m", "mslhf", "msshf", "ssr", "str", "t2m", "spei", "NIRv", "skt",
            "stl1", "swvl1", "lccs_class"]
y_key = "bio_flux_opt"

Load each desired dataset at monthly frequency on the CarbonTracker grid, then merge them into a single `xr.Dataset` with the same dimensions as the CarbonTracker dataset.

In [None]:
ds_cb = xr.open_dataset(cb_file)
ds_cb = excited_workflow.utils.convert_timestamps(ds_cb)
ds_input = xr.merge(
    [
        datasets[name].load(freq="monthly", target_grid=ds_cb)
        for name in desired_data
    ]
)

To limit the analysis to Transcom region 2 (North America), we need the `regions.nc` file:

In [None]:
df_na = carbon_tracker.mask_region(regions_file, cb_file, ds_input, 2)
df_na


| time | latitude | longitude | bio_flux_opt | transcom_regions | biomass | spei | NDVI | NIRv | d2m | mslhf | msshf | sp | ... | skt | stl1 | stl2 | stl3 | stl4 | swvl1 | swvl2 | swvl3 | swvl4 | lccs_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-08-01 | 16.5 | -97.5 | -6.010017e-07 | 2.0 | 45.289826 | 0.402637 | 0.756299 | 0.264352 | 291.160736 | -123.552444 | -29.256828 | 89704.265625 | ... | 293.800140 | 294.488708 | 294.490234 | 294.558624 | 294.885742 | 0.405656 | 0.409333 | 0.420406 | 0.421968 | 90.0 |
| 2000-08-01 | 16.5 | -96.5 | -5.498427e-07 | 2.0 | 42.524410 | 0.859432 | 0.635423 | 0.186314 | 287.451385 | -98.636139 | -44.350342 | 83853.054688 | ... | 291.346344 | 292.329987 | 292.415466 | 292.695587 | 293.209106 | 0.441099 | 0.443025 | 0.432267 | 0.446795 | 120.0 |
| 2000-08-01 | 16.5 | -95.5 | -3.203208e-07 | 2.0 | 40.658878 | -0.002091 | 0.714530 | 0.245570 | 292.602661 | -88.933502 | -76.770813 | 94853.679688 | ... | 298.976379 | 299.503906 | 299.532959 | 299.621643 | 299.661987 | 0.340303 | 0.338384 | 0.337199 | 0.330479 | 60.0 |
| 2000-08-01 | 17.5 | -100.5 | -7.650994e-07 | 2.0 | 46.187639 | 0.364373 | 0.768887 | 0.270905 | 290.568329 | -118.364532 | -51.205177 | 85797.492188 | ... | 293.096253 | 293.790924 | 293.769348 | 293.723419 | 293.783417 | 0.412808 | 0.414333 | 0.415583 | 0.423979 | 90.0 |
| 2000-08-01 | 17.5 | -99.5 | -2.389529e-06 | 2.0 | 41.231996 | 0.007593 | 0.748444 | 0.249498 | 290.078156 | -119.800644 | -42.374344 | 85094.835938 | ... | 291.973236 | 292.737579 | 292.763458 | 292.868774 | 293.306427 | 0.406332 | 0.408797 | 0.411133 | 0.398836 | 120.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019-12-01 | 55.5 | -114.5 | 5.078427e-07 | 2.0 | 23.803023 | 0.650443 | 0.168989 | 0.014420 | 259.794739 | -2.107529 | 16.971649 | 92106.984375 | ... | 261.254547 | 272.136719 | 272.757507 | 274.221741 | 277.282837 | 0.360522 | 0.363317 | 0.371951 | 0.383640 | 71.0 |
| 2019-12-01 | 55.5 | -113.5 | 7.799189e-07 | 2.0 | 29.709211 | 0.582704 | 0.181974 | 0.015521 | 259.467133 | -1.547745 | 15.787460 | 92649.859375 | ... | 261.202423 | 268.530670 | 269.509583 | 273.085327 | 277.388824 | 0.373655 | 0.367109 | 0.377971 | 0.379900 | 71.0 |
| 2019-12-01 | 55.5 | -112.5 | 8.044822e-07 | 2.0 | 15.008749 | 0.328202 | 0.176154 | 0.017731 | 258.919037 | -0.648941 | 16.144135 | 93281.093750 | ... | 260.399170 | 268.575684 | 270.025513 | 273.835937 | 278.872986 | 0.641833 | 0.636263 | 0.648265 | 0.633022 | 71.0 |
| 2019-12-01 | 56.5 | -131.5 | 5.528356e-07 | 2.0 | 64.364447 | -0.181532 | -0.066072 | -0.038799 | 269.274261 | -3.085175 | 0.307220 | 91528.421875 | ... | 265.012054 | 270.157807 | 270.336700 | 270.928772 | 271.985443 | 0.281438 | 0.286921 | 0.199295 | 0.362787 | 71.0 |
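
To illustrate what region masking amounts to, here is a minimal plain-Python sketch. It is not the implementation of `carbon_tracker.mask_region`: the small grid, the flux values, and the helper `select_region` are all made up for illustration, and the real function presumably operates on the xarray datasets and returns a dataframe indexed by time, latitude and longitude like the one above.

```python
# Hypothetical 3 x 4 grid of Transcom region codes
# (rows follow latitudes, columns follow longitudes).
latitudes = [16.5, 17.5, 18.5]
longitudes = [-100.5, -99.5, -98.5, -97.5]
regions = [
    [0, 2, 2, 2],
    [2, 2, 2, 0],
    [1, 1, 2, 0],
]

# Made-up flux values on the same grid for a single time step.
flux = [
    [1.0e-7, -6.0e-7, -5.5e-7, -3.2e-7],
    [-7.7e-7, -2.4e-6, 4.0e-7, 9.0e-8],
    [2.0e-7, 3.0e-7, -1.0e-7, 5.0e-8],
]


def select_region(regions, values, latitudes, longitudes, region_id):
    """Return one record per grid cell whose region code equals region_id."""
    records = []
    for i, lat in enumerate(latitudes):
        for j, lon in enumerate(longitudes):
            if regions[i][j] == region_id:
                records.append(
                    {
                        "latitude": lat,
                        "longitude": lon,
                        "bio_flux_opt": values[i][j],
                        "transcom_regions": region_id,
                    }
                )
    return records


rows = select_region(regions, flux, latitudes, longitudes, region_id=2)
for row in rows:
    print(row)
```

Cells outside region 2 are dropped entirely rather than kept as missing values, which matches the output above, where every row has `transcom_regions` equal to 2.0.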


Validate the model with 5-fold cross-validation: split the dataset into 5 groups, then iteratively train on 4 groups and predict on the remaining one. The resulting RMSE NetCDF files and scatter plots are stored in the output directory.

In [None]:
carbon_tracker.validate_model(df_na, 5, x_keys, y_key, output_dir)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.0 | 0.0 | 0.0 | 0.7738 | 0.0 | 4.9746 | 1.29 |
| lightgbm | Light Gradient Boosting Machine | 0.0 | 0.0 | 0.0 | 0.7672 | 0.0 | 4.6821 | 1.23 |
| lightgbm | Light Gradient Boosting Machine | 0.0 | 0.0 | 0.0 | 0.7663 | 0.0 | 4.5535 | 1.24 |
| lightgbm | Light Gradient Boosting Machine | 0.0 | 0.0 | 0.0 | 0.7621 | 0.0 | 3.9417 | 1.15 |
| lightgbm | Light Gradient Boosting Machine | 0.0 | 0.0 | 0.0 | 0.758 | 0.0 | 8.3723 | 1.18 |
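
The error columns above most likely read 0.0 because the flux values are of order 1e-7 while the table is rounded to four decimals, which leaves R2 as the informative column. The sketch below illustrates both the 5-fold procedure and this rounding effect on synthetic data. It is not what `validate_model` does internally: the shuffled split, the helper names, and the straight-line stand-in model are all assumptions made for illustration.

```python
import math
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds roughly equal groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic data: flux-like targets of order 1e-7 (made up for illustration).
rng = random.Random(42)
x = [rng.uniform(270.0, 300.0) for _ in range(200)]
y = [1e-8 * (xi - 285.0) + rng.gauss(0.0, 5e-8) for xi in x]

folds = k_fold_indices(len(x), n_folds=5)
fold_rmse = []
for held_out in folds:
    held = set(held_out)
    train = [i for i in range(len(x)) if i not in held]
    # Train on the four remaining groups, predict on the held-out group.
    slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
    preds = [slope * x[i] + intercept for i in held_out]
    fold_rmse.append(rmse([y[i] for i in held_out], preds))

for number, value in enumerate(fold_rmse, start=1):
    print(f"fold {number}: RMSE = {value:.3e}, rounded to 4 decimals: {round(value, 4)}")
```

Printing the errors in scientific notation, or rescaling the target before training, keeps such small values readable.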






Train the model on the entire dataset.

In [None]:
pycs, model = carbon_tracker.train_model(df_na, x_keys, y_key)

Save the model to ONNX in the output directory. 

In [None]:

carbon_tracker.save_model(pycs, model, output_dir)



| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.0 | 0.0 | 0.0 | 0.7691 | 0.0 | 3.9693 | 1.57 |


The maximum opset needed by this model is only 8.


Create a dataframe that contains only the input variables, to run the model on.

In [None]:
df = df_na[x_keys]
df.head()

| time | latitude | longitude | d2m | mslhf | msshf | ssr | str | t2m | spei | NIRv | skt | stl1 | swvl1 | lccs_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-08-01 | 16.5 | -97.5 | 291.160736 | -123.552444 | -29.256828 | 17632392.0 | -4539167.0 | 294.455078 | 0.402637 | 0.264352 | 293.80014 | 294.488708 | 0.405656 | 90.0 |
| 2000-08-01 | 16.5 | -96.5 | 287.451385 | -98.636139 | -44.350342 | 17875046.0 | -5688396.0 | 291.743347 | 0.859432 | 0.186314 | 291.346344 | 292.329987 | 0.441099 | 120.0 |
| 2000-08-01 | 16.5 | -95.5 | 292.602661 | -88.933502 | -76.770813 | 18379498.0 | -4277693.5 | 298.020874 | -0.002091 | 0.24557 | 298.976379 | 299.503906 | 0.340303 | 60.0 |
| 2000-08-01 | 17.5 | -100.5 | 290.568329 | -118.364532 | -51.205177 | 17707054.0 | -3109625.0 | 293.402252 | 0.364373 | 0.270905 | 293.096253 | 293.790924 | 0.412808 | 90.0 |
| 2000-08-01 | 17.5 | -99.5 | 290.078156 | -119.800644 | -42.374344 | 17862124.0 | -3914028.0 | 293.21994 | 0.007593 | 0.249498 | 291.973236 | 292.737579 | 0.406332 | 120.0 |


Load the saved model and run it on the dataframe to check that it was saved correctly.

In [None]:
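from onnxruntime import InferenceSession


# Read the serialized ONNX model from the output directory.
with open(output_dir / "lightgbm.onnx", "rb") as f:
    onnx_model = f.read()

# Run inference; "X" is the model's input name and [0] selects the first output.
sess = InferenceSession(onnx_model)
predictions_onnx = sess.run(None, {"X": df.to_numpy()})[0]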

In [None]:
predictions_onnx

array([[-1.1604789e-06],
       [-5.3657362e-07],
       [-4.3432399e-07],
       ...,
       [ 4.4027504e-07],
       [ 4.7860465e-07],
       [ 1.3122254e-07]], dtype=float32)
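
A stricter check than inspecting the output by eye is to compare the ONNX predictions against a reference, for example the in-memory model's predictions on the same rows. The sketch below shows one way to do that with a relative tolerance. The helper `predictions_match`, the tolerance of 1e-4, and the two candidate lists are assumptions made for illustration; only the three reference values are taken from the output above.

```python
import math


def predictions_match(reference, candidate, rel_tol=1e-4):
    """True if both sequences have equal length and agree element-wise within rel_tol."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=0.0)
        for r, c in zip(reference, candidate)
    )


# First three ONNX predictions printed above, used here as the reference.
reference = [-1.1604789e-06, -5.3657362e-07, -4.3432399e-07]

# Hypothetical outputs of a second run: one with tiny relative noise,
# one shifted by an absolute offset of 1e-8.
close_enough = [value * (1 + 1e-6) for value in reference]
clearly_off = [value + 1e-8 for value in reference]

print(predictions_match(reference, close_enough))  # True
print(predictions_match(reference, clearly_off))   # False
```

A purely relative tolerance is used because the fluxes are of order 1e-7 to 1e-6. An absolute tolerance of 1e-8, which is the default `atol` of `numpy.allclose`, would accept the second case even though it is off by roughly 1 to 2 percent.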