# Training a model on CarbonTracker's carbon flux
This notebook walks through the full workflow for loading and preprocessing the following datasets so that an ML model can be trained on them:

- CarbonTracker
- ERA5 (monthly)
- ERA5-land (monthly)
- SPEI (monthly)
- MODIS (monthly)
- Biomass (yearly)
- Copernicus Landcover (yearly)

First, follow the data downloading and config setup instructions.

If you run this notebook on SURF Research Cloud, this setup should already be in place.

In [1]:
from pathlib import Path

from dask.distributed import Client

from excited_workflow import carbon_tracker


client = Client()

Define the paths to the CarbonTracker dataset, the regions dataset, and the output directory. Also specify which datasets to include, the input variables (`x_keys`) for the model, and the target variable (`y_key`).

In [2]:
ds_cb = Path("/data/volume_2/EXCITED_prepped_data/CT2022.flux1x1-monthly.nc")
ds_regions = Path("/data/volume_2/EXCITED_prepped_data/regions.nc")
output_dir = Path("/home/cdonnelly")

desired_data = [
    "biomass",
    "spei",
    "modis",
    "era5_monthly",
    "era5_land_monthly",
    "copernicus_landcover"
]

x_keys = [
    "d2m", "mslhf", "msshf", "ssr", "str", "t2m",
    "spei", "NIRv", "skt", "stl1", "swvl1", "lccs_class",
]
y_key = "bio_flux_opt"

Merge the desired datasets into a single `xr.Dataset` with the same dimensions as the CarbonTracker dataset.

In [3]:
ds_input = carbon_tracker.merge_datasets(desired_data, ds_cb)
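
One part of putting monthly and yearly inputs on a common time axis is repeating each yearly value for every month of that year. The snippet below is a minimal NumPy sketch of that single idea on made-up numbers. It is an assumption about what such a merge involves, not a description of how `carbon_tracker.merge_datasets` works; all array names are hypothetical.

```python
import numpy as np

# Target time axis: monthly steps (hypothetical two-year span).
months = np.arange("2000-01", "2002-01", dtype="datetime64[M]")

# A yearly variable, one value per calendar year (made-up numbers).
yearly_time = np.array(["2000", "2001"], dtype="datetime64[Y]")
yearly_values = np.array([10.0, 12.0])

# Truncate each month to its calendar year, then look up that year's value.
month_years = months.astype("datetime64[Y]")
idx = np.searchsorted(yearly_time, month_years)
yearly_on_monthly = yearly_values[idx]

# Stack with a monthly variable so both share the same time axis.
monthly_values = np.linspace(0.0, 1.0, len(months))
merged = np.column_stack([monthly_values, yearly_on_monthly])
```

A real merge would also have to bring the spatial grids into agreement, which this sketch leaves out.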

To limit the analysis to Transcom region 2 (North America), we need the `regions.nc` file:

In [4]:
ds_na = carbon_tracker.mask_region(ds_regions, ds_cb, ds_input)
ds_na
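
Conceptually, a region mask keeps only the grid cells whose region code matches and blanks out the rest. The snippet below is a minimal NumPy sketch of that idea on a toy grid. The arrays `regions` and `flux` and the helper `mask_to_region` are hypothetical and do not reflect how `carbon_tracker.mask_region` is implemented.

```python
import numpy as np


def mask_to_region(values, region_codes, region_id):
    """Return a copy of `values` with cells outside `region_id` set to NaN."""
    return np.where(region_codes == region_id, values, np.nan)


# Toy 3x4 grid of region codes (hypothetical; 2 stands for North America here).
regions = np.array([
    [1, 2, 2, 3],
    [1, 2, 2, 3],
    [0, 0, 3, 3],
])
# Toy values on the same grid.
flux = np.arange(12, dtype="float32").reshape(3, 4)

flux_na = mask_to_region(flux, regions, region_id=2)
# Number of grid cells that survive the mask.
n_kept = int(np.count_nonzero(~np.isnan(flux_na)))
```

Setting masked cells to NaN rather than dropping them keeps the grid shape intact, so the missing cells can be removed later in one step, as the `dropna()` call further down does.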

Validate the model with 5-fold cross-validation: split the dataset into 5 groups, then iteratively train on 4 groups and predict on the remaining one. The resulting RMSE NetCDF files and scatter plots are stored in the output directory.

In [5]:
carbon_tracker.validate_model(ds_na, 5, x_keys, y_key, output_dir)

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,3.6,0.2

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,3.5,0.2

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,5.1,0.2

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.8,0.0,4.2,0.2

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.7,0.0,3.5,0.2
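
The train-on-four, predict-on-one loop can be sketched in a few lines of NumPy. This is a minimal illustration only: it uses ordinary least squares as a stand-in for the LightGBM model, shuffles rows at random (how `validate_model` actually forms its groups is not stated here), and runs on synthetic data. The function `k_fold_rmse` is hypothetical.

```python
import numpy as np


def k_fold_rmse(X, y, n_splits=5, seed=0):
    """Train on n_splits - 1 groups, predict the held-out group, return RMSEs."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into n_splits roughly equal groups.
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Stand-in model: ordinary least squares with an intercept column.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        residuals = A_test @ coef - y[test_idx]
        scores.append(float(np.sqrt(np.mean(residuals**2))))
    return scores


# Synthetic data: 200 samples, 3 features, linear target with small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

rmse_per_fold = k_fold_rmse(X, y)
```

Each sample ends up in the held-out group exactly once, so the per-fold scores together cover the whole dataset.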


Train the model on the entire dataset and save it in ONNX format to the output directory.

In [6]:
onnx_model = carbon_tracker.save_model(ds_na, x_keys, y_key, output_dir)

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.0,0.0,0.0,0.7,0.0,4.0,0.4


The maximum opset needed by this model is only 8.


Create a dataframe to run the model on, keeping every tenth row and converting the values to float32.

In [7]:
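# Flatten the dataset to a table and drop rows with missing values.
df = ds_na.to_dataframe().dropna()
# Keep the model inputs, take every 10th row, and cast to float32 for ONNX.
df = df[x_keys][::10].astype("float32")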
df

time,latitude,longitude,d2m,mslhf,msshf,ssr,str,t2m,spei,NIRv,skt,stl1,swvl1,lccs_class
2000-08-01,16.5,-97.5,291.160736,-123.552444,-29.256828,17632392.0,-4539167.0,294.455078,0.402637,0.264352,293.800140,294.488708,0.405656,90.0
2000-08-01,18.5,-101.5,293.513855,-93.125107,-56.438019,18068404.0,-5339288.5,299.283936,-0.300553,0.281680,299.144745,299.822601,0.334316,60.0
2000-08-01,19.5,-101.5,285.543335,-89.911835,-44.041389,15922805.0,-4355907.5,288.824677,-0.292989,0.243739,287.316406,287.972260,0.406064,90.0
2000-08-01,20.5,-100.5,285.119324,-68.271637,-76.867363,18879166.0,-6372623.5,292.068542,0.086668,0.152103,293.144592,293.650024,0.195828,11.0
2000-08-01,21.5,-98.5,296.115051,-104.443359,-43.687386,16755104.0,-4156947.5,300.844299,0.120198,0.252774,300.711945,301.543549,0.287784,50.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-01,51.5,-101.5,256.830872,-0.664703,11.963760,2044303.0,-3697116.5,258.765778,1.172457,-0.031652,258.316467,270.511261,0.444990,11.0
2019-12-01,52.5,-112.5,262.490692,-1.074692,15.552032,1481686.0,-3616690.5,264.937561,1.633417,-0.039948,263.270752,268.994141,0.311979,11.0
2019-12-01,53.5,-116.5,261.434357,-3.534576,20.831009,1808079.0,-3773900.0,264.205383,2.066818,0.017840,261.317444,272.363464,0.482138,71.0
2019-12-01,54.5,-114.5,260.227997,-2.052338,16.436630,1571385.0,-3568210.5,262.839630,0.981084,-0.007025,261.786438,269.405945,0.462469,11.0
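
The float32 conversion matters because NumPy and pandas default to float64, while a model exported with a float32 input type will generally reject float64 arrays at inference time. The snippet below is a small NumPy sketch of the subsample-and-cast step; the array `features` is a made-up stand-in for the feature table.

```python
import numpy as np

# Stand-in for the feature table: 25 rows, 3 columns, float64 by default.
features = np.arange(75, dtype="float64").reshape(25, 3)

# Keep every 10th row (rows 0, 10, 20), mirroring the [::10] slice.
subset = features[::10]

# Cast to float32 and make the array C-contiguous before handing it to a runtime.
model_input = np.ascontiguousarray(subset, dtype=np.float32)
```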


Open the saved model and run it on the dataframe to check that it was saved correctly.

In [8]:
from onnxruntime import InferenceSession

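# Read the serialized ONNX model from the output directory.
with open(output_dir / "lightgbm.onnx", "rb") as f:
    model = f.read()

# Run inference; "X" is the model's input name, and the first output holds the predictions.
sess = InferenceSession(model)
predictions_onnx = sess.run(None, {"X": df.to_numpy()})[0]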

In [9]:
predictions_onnx

array([[-9.1330406e-07],
       [-1.4581279e-06],
       [-1.0954077e-06],
       ...,
       [ 5.8456959e-07],
       [ 3.8920575e-07],
       [ 4.0000950e-07]], dtype=float32)
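
Printing the predictions confirms that the model loads and runs. A stricter check would compare them against predictions from the original in-memory model. The sketch below shows one way to do that comparison with NumPy, on made-up arrays; `predictions_match` and `reference` are hypothetical names. Because the values are on the order of 1e-6, the default absolute tolerance of `np.allclose` (1e-8) is relatively loose, so the sketch uses a purely relative tolerance instead.

```python
import numpy as np


def predictions_match(onnx_pred, reference_pred, rtol=1e-3):
    """Compare two prediction arrays using a purely relative tolerance."""
    onnx_pred = np.asarray(onnx_pred, dtype=np.float64).ravel()
    reference_pred = np.asarray(reference_pred, dtype=np.float64).ravel()
    # atol=0 so that tiny values (around 1e-6) are not trivially "close".
    return bool(np.allclose(onnx_pred, reference_pred, rtol=rtol, atol=0.0))


# Made-up values on the same scale as the fluxes printed above.
reference = np.array([-9.13e-07, -1.458e-06, 5.85e-07])
close_copy = reference * (1 + 1e-5)   # differs by 0.001 %
shifted = reference + 5e-09           # small absolute shift, large relative one

same = predictions_match(close_copy, reference)
different = predictions_match(shifted, reference)
```

With the default tolerances, `np.allclose(shifted, reference)` would report the two arrays as close, which is why the absolute tolerance is set to zero here.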