# Training a model on CarbonTracker's carbon flux
This notebook walks through the full workflow for loading and preprocessing the following datasets so that an ML model can be trained on them:

- CarbonTracker
- ERA5 (monthly)
- ERA5-land (monthly)
- SPEI (monthly)
- MODIS (monthly)
- Biomass (yearly)
- Copernicus Landcover (yearly)

First, follow the data downloading and config setup instructions.

If you run this notebook on SURF Research Cloud, this setup should already be in place.

In [1]:
from pathlib import Path

from dask.distributed import Client

from excited_workflow import carbon_tracker


client = Client()

Define the paths to the CarbonTracker dataset, the regions dataset, and the output directory. Also define the datasets to include, the input variables for the model (`x_keys`), and the target variable (`y_key`).

In [2]:
cb_file = Path("/data/volume_2/EXCITED_prepped_data/CT2022.flux1x1-monthly.nc")
regions_file = Path("/data/volume_2/EXCITED_prepped_data/regions.nc")
output_path = Path("/home/cdonnelly")

desired_data = [
    "biomass",
    "spei",
    "modis",
    "era5_monthly",
    "era5_land_monthly",
    "copernicus_landcover"
]

x_keys = [
    "d2m", "mslhf", "msshf", "ssr", "str", "t2m",
    "spei", "NIRv", "skt", "stl1", "swvl1", "lccs_class",
]
y_key = "bio_flux_opt"

Merge the desired datasets into a single `xr.Dataset` with the same dimensions as the CarbonTracker dataset.

In [3]:
ds_input = carbon_tracker.merge_datasets(desired_data, cb_file)

To limit the analysis to Transcom region 2 (North America), we need the `regions.nc` file:

In [4]:
ds_na = carbon_tracker.mask_region(regions_file, cb_file, ds_input)
ds_na
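
The internals of `carbon_tracker.mask_region` are not shown in this notebook. As a rough illustration of what region masking means, the plain-Python sketch below keeps the grid cells whose region code matches a chosen ID and sets all others to NaN. The function name `mask_by_region` and the toy grids are hypothetical; the real function operates on xarray objects and may work differently.

```python
import math


def mask_by_region(values, regions, region_id):
    """Return a copy of `values` with cells outside `region_id` set to NaN.

    `values` and `regions` are 2D lists (latitude x longitude) of equal shape.
    """
    return [
        [v if r == region_id else math.nan for v, r in zip(value_row, region_row)]
        for value_row, region_row in zip(values, regions)
    ]


# Toy 2x2 grid: region code 2 stands in for North America here.
values = [[1.0, 2.0], [3.0, 4.0]]
regions = [[2, 1], [2, 2]]
masked = mask_by_region(values, regions, region_id=2)
```

Only the cell with region code 1 is replaced by NaN; the other three keep their values.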

Validate the model by splitting the dataset into 5 groups. Iteratively train the model on 4 groups and predict on the remaining one. The resulting RMSE NetCDF files and scatter plots are stored in the output directory.

In [5]:
carbon_tracker.validate_model(ds_na, 5, x_keys, y_key, output_path)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,3.4,1.2


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,6.9,1.2


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,4.1,1.2


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,7.2,1.2


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,4.2,1.2


<xarray.DataArray (latitude: 42, longitude: 68)>
array([[           nan,            nan,            nan, ...,
                   nan,            nan,            nan],
       [           nan,            nan,            nan, ...,
                   nan,            nan,            nan],
       [           nan,            nan,            nan, ...,
                   nan,            nan,            nan],
       ...,
       [3.96923892e-07, 3.87105643e-07,            nan, ...,
                   nan,            nan,            nan],
       [           nan, 3.88582264e-07,            nan, ...,
                   nan,            nan,            nan],
       [1.08093237e-07,            nan,            nan, ...,
                   nan,            nan,            nan]])
Coordinates:
  * latitude   (latitude) float64 16.5 17.5 18.5 19.5 ... 54.5 55.5 56.5 57.5
  * longitude  (longitude) float64 -132.5 -131.5 -127.5 ... -64.5 -62.5 -61.5
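
The validation step described above (train on 4 groups, predict on the held-out group, repeat) can be sketched in plain Python as follows. This is a minimal illustration, not the code behind `carbon_tracker.validate_model`: how that function assigns samples to groups is not shown here, so the random split is an assumption, and `k_fold_indices`, `cross_validate`, and the toy mean-predicting model are hypothetical names used only for this example.

```python
import math
import random


def k_fold_indices(n_samples, n_groups, seed=0):
    """Shuffle the sample indices and deal them into n_groups groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_groups] for i in range(n_groups)]


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def cross_validate(x, y, n_groups, fit, predict):
    """Train on all groups but one, score on the held-out group, repeat."""
    scores = []
    for held_out in k_fold_indices(len(y), n_groups):
        held_out_set = set(held_out)
        train = [i for i in range(len(y)) if i not in held_out_set]
        model = fit([x[i] for i in train], [y[i] for i in train])
        predictions = predict(model, [x[i] for i in held_out])
        scores.append(rmse([y[i] for i in held_out], predictions))
    return scores


# Toy stand-in for a real regressor: always predict the training mean.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_test):
    return [model] * len(x_test)


x = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
scores = cross_validate(x, y, 5, fit_mean, predict_mean)
```

Each sample lands in exactly one held-out group, so every prediction used for scoring comes from a model that never saw that sample during training.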


Train the model on the entire dataset and save it in ONNX format in the output directory.

In [6]:
onnx_model = carbon_tracker.save_model(ds_na, x_keys, y_key, output_path)



Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,4.4,0.8


The maximum opset needed by this model is only 8.


Create a dataframe of the input variables to run the model on.

In [7]:
df = ds_na.to_dataframe().dropna()
df = df[x_keys]
df.head()

time,latitude,longitude,d2m,mslhf,msshf,ssr,str,t2m,spei,NIRv,skt,stl1,swvl1,lccs_class
2000-08-01,16.5,-97.5,291.160736,-123.552444,-29.256828,17632392.0,-4539167.0,294.455078,0.402637,0.264352,293.80014,294.488708,0.405656,90.0
2000-08-01,16.5,-96.5,287.451385,-98.636139,-44.350342,17875046.0,-5688396.0,291.743347,0.859432,0.186314,291.346344,292.329987,0.441099,120.0
2000-08-01,16.5,-95.5,292.602661,-88.933502,-76.770813,18379498.0,-4277693.5,298.020874,-0.002091,0.24557,298.976379,299.503906,0.340303,60.0
2000-08-01,17.5,-100.5,290.568329,-118.364532,-51.205177,17707054.0,-3109625.0,293.402252,0.364373,0.270905,293.096253,293.790924,0.412808,90.0
2000-08-01,17.5,-99.5,290.078156,-119.800644,-42.374344,17862124.0,-3914028.0,293.21994,0.007593,0.249498,291.973236,292.737579,0.406332,120.0


Load the saved model and run it on the dataframe to check that it was saved correctly.

In [8]:
import datetime

from onnxruntime import InferenceSession

# The path is rebuilt from the current time (to the hour), so it only
# resolves if the model was saved during the current clock hour.
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H")
model_file = output_path / f"carbon_tracker-{timestamp}/lightgbm.onnx"
with open(model_file, "rb") as f:
    model = f.read()

sess = InferenceSession(model)
predictions_onnx = sess.run(None, {"X": df.to_numpy()})[0]
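
Rebuilding the model path from the current time only works when the model was saved within the same clock hour. A more forgiving option, sketched below, is to search the output directory for the most recent `carbon_tracker-*` run directory. The helper `latest_model_file` is hypothetical and not part of `excited_workflow`; it assumes the directory naming seen in the cell above, where the zero-padded `%Y-%m-%d_%H` stamp makes alphabetical order match chronological order.

```python
from pathlib import Path


def latest_model_file(output_path, model_name="lightgbm.onnx"):
    """Return the model file from the most recent carbon_tracker-* directory."""
    # Zero-padded timestamps sort chronologically when sorted as strings.
    run_dirs = sorted(Path(output_path).glob("carbon_tracker-*"))
    candidates = [d / model_name for d in run_dirs if (d / model_name).is_file()]
    if not candidates:
        raise FileNotFoundError(f"No {model_name} found under {output_path}")
    return candidates[-1]
```

The returned path could then be opened in place of the hand-built one, for example `open(latest_model_file(output_path), "rb")`.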

In [9]:
predictions_onnx

array([[-1.2209808e-06],
       [-5.2906603e-07],
       [-3.8912000e-07],
       ...,
       [ 4.4822761e-07],
       [ 4.6443941e-07],
       [ 1.3892891e-07]], dtype=float32)