# Training a model on CarbonTracker's carbon flux
This notebook walks through the full workflow for loading and preprocessing the following datasets, so that an ML model can be trained on them:

- CarbonTracker
- ERA5 (monthly)
- ERA5-land (monthly)
- SPEI (monthly)
- MODIS (monthly)
- Biomass (yearly)

First, follow the data downloading and config setup instructions.

If you run this notebook on SURF Research Cloud, this setup should already be in place.

We start by setting up a Dask client. This will ensure that Dask can run efficiently to process the data:

In [1]:
from dask.distributed import Client
client = Client()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 32819 instead


In [2]:
import excited_workflow
from pathlib import Path
import xarray as xr
import xarray_regrid  # Importing this will make Dataset.regrid accessible.

We will load the CarbonTracker data into an xarray `Dataset` and convert the timestamps (middle point of each month) to a more standard format (1st day of the month), to allow merging with the other datasets.

In [3]:
ds_cb = xr.open_dataset("/data/volume_2/EXCITED_prepped_data/CT2022.flux1x1-monthly.nc")
ds_cb = excited_workflow.utils.convert_timestamps(ds_cb)
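
As a rough illustration of what such a conversion involves, mid-month timestamps can be snapped to the first day of their month with pandas. This is a sketch only: it is not the implementation of `excited_workflow.utils.convert_timestamps`, and the function name `to_month_start` and the example timestamps are hypothetical.

```python
import pandas as pd


def to_month_start(times):
    """Snap each timestamp to midnight on the first day of its month."""
    index = pd.DatetimeIndex(times)
    # Converting to monthly periods discards the day and time of day;
    # converting back yields the start of each period.
    return index.to_period("M").to_timestamp()


# Hypothetical mid-month timestamps, as a monthly-mean product might use.
mid_month = ["2010-01-16 12:00:00", "2010-02-15 00:00:00"]
month_start = to_month_start(mid_month)
print(month_start)
```

On an xarray dataset, the converted values could then be written back with `assign_coords(time=...)`, so that all datasets share identical time labels before merging.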

The other datasets can be found using the `excited_workflow.source_datasets` module:

In [4]:
from excited_workflow.source_datasets import datasets
datasets

{'biomass': <excited_workflow.source_datasets.biomass.Biomass at 0x7fca4acf5e50>,
 'era5_hourly': <excited_workflow.source_datasets.era5.ERA5Hourly at 0x7fca0534f890>,
 'era5_monthly': <excited_workflow.source_datasets.era5.ERA5Monthly at 0x7fca0534f350>,
 'era5_land_monthly': <excited_workflow.source_datasets.era5.ERA5LandMonthly at 0x7fca0534d310>,
 'copernicus_landcover': <excited_workflow.source_datasets.land_cover.LandCover at 0x7fca14a7b190>,
 'modis': <excited_workflow.source_datasets.modis.Modis at 0x7fca0534f690>,
 'spei': <excited_workflow.source_datasets.spei.Spei at 0x7fca0534f510>}

We can loop over the desired datasets and merge them into a single `xr.Dataset`:

In [5]:
desired_data = [
    "biomass",
    "spei",
    "modis",
    "era5_monthly",
    "era5_land_monthly",
    "copernicus_landcover"
]
ds_input = xr.merge(
    [datasets[name].load(freq="monthly", target_grid=ds_cb) for name in desired_data]
)

To limit the analysis to Transcom region 2 (North America), we need the `regions.nc` file:

In [6]:
ds_regions = xr.open_dataset("/data/volume_2/EXCITED_prepped_data/regions.nc")
# Uncomment the next line to preview the region:
#ds_regions["transcom_regions"].where(ds_regions["transcom_regions"]==2).plot()

Now we can merge everything together. From the CarbonTracker file we only require the `bio_flux_opt` variable:

In [7]:
ds_merged = xr.merge([
    ds_cb[["bio_flux_opt"]], 
    ds_regions["transcom_regions"],
    ds_input,
])

To make computations faster and less memory intensive, we can reduce the scope to only North America.

This `.sel` operation reduces the size of the dataset from worldwide to only a rectangular area around North America:

In [8]:
time_region_na = {
    "time": slice("2010-01", "2019-12"),
    "latitude": slice(15, 60),
    "longitude": slice(-140, -55),
}
ds_na = ds_merged.sel(time_region_na)
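
Label-based slicing with `.sel` follows the order of the coordinate values, so `slice(15, 60)` only selects data when latitude is stored in ascending order. The sketch below uses a toy 1x1 degree grid (an assumption for illustration, not the real data) to show the size of the selected box and what happens when the latitude axis is descending:

```python
import numpy as np
import xarray as xr

# Toy global grid with ascending coordinates at cell centres
# (an assumption about the layout; check your own dataset).
lat = np.arange(-89.5, 90.0, 1.0)
lon = np.arange(-179.5, 180.0, 1.0)
toy = xr.Dataset(
    {"value": (("latitude", "longitude"), np.zeros((lat.size, lon.size)))},
    coords={"latitude": lat, "longitude": lon},
)

# Same bounding box as used for North America above.
box = toy.sel(latitude=slice(15, 60), longitude=slice(-140, -55))
print(dict(box.sizes))

# With a descending latitude axis the same slice selects nothing;
# sorting the coordinate first restores the expected behaviour.
flipped = toy.isel(latitude=slice(None, None, -1))
empty = flipped.sel(latitude=slice(15, 60))
fixed = flipped.sortby("latitude").sel(latitude=slice(15, 60))
print(empty.sizes["latitude"], fixed.sizes["latitude"])
```

If a selection unexpectedly comes back empty, checking the order of the `latitude` coordinate is a good first step.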

In [None]:
ds_na = ds_na.compute()

Within this North American subset, we keep only the grid cells that belong to Transcom region 2 and preview the ERA5 2 m air temperature:

In [None]:
# Keep cells inside Transcom region 2; everything else becomes NaN.
ds_na = ds_na.where(ds_na["transcom_regions"] == 2)

import matplotlib.pyplot as plt
plt.figure(figsize=(5,3))
ds_na["t2m"].isel(time=0).plot()
plt.tight_layout()
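
To make the effect of the mask concrete: `where` keeps values where the condition is true and fills the rest with NaN, without changing the shape of the data. The NumPy sketch below mimics this on a toy grid; the region map and field values are made up for illustration. xarray matches the dimensions by name, whereas NumPy relies on the mask aligning with the trailing axes.

```python
import numpy as np

# Toy region map on a 2 x 3 (latitude, longitude) grid;
# the value 2 marks the region of interest.
regions = np.array([[1, 2, 2],
                    [2, 3, 1]])

# Toy field with dimensions (time, latitude, longitude).
field = np.arange(12, dtype=float).reshape(2, 2, 3)

# The 2-D mask broadcasts over the leading time axis; cells outside
# the region become NaN while the array keeps its shape.
masked = np.where(regions == 2, field, np.nan)

print(masked.shape)
print(int(np.isnan(masked).sum()), "cells masked out")
```

Because the masked cells are NaN rather than removed, they are only discarded later, when rows with missing values are dropped.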

To prepare the data for training, we convert it to a pandas `DataFrame` and drop all rows that contain missing values.

In [None]:
df_train = ds_na.to_dataframe().dropna()
df_train.columns
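
Note that `dropna()` removes a row as soon as any single column is missing there, so the variable with the poorest coverage determines how many samples remain. A small sketch with made-up values (the column names mirror variables used here, but the numbers are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the output of `to_dataframe()`: one row per
# (time, latitude, longitude) combination, one column per variable.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2010-01-01", "2010-02-01"]), [15.5, 16.5], [-100.5]],
    names=["time", "latitude", "longitude"],
)
df = pd.DataFrame(
    {
        "t2m": [280.0, np.nan, 281.0, 282.0],
        "biomass": [10.0, 12.0, np.nan, 11.0],
    },
    index=index,
)

# A row is dropped if any of its columns is NaN.
df_complete = df.dropna()
print(f"kept {len(df_complete)} of {len(df)} rows")

# Counting missing values per column shows which variable limits coverage.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Running `isna().sum()` on the real dataframe before dropping rows can help explain an unexpectedly small training set.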

Now we can train our ML models on the data. Here we use PyCaret to try a set of models and see which type performs best.

In [None]:
X_keys = ["d2m", "mslhf", "msshf", "ssr", "str", "t2m", "spei", "NIRv", "skt", "stl1", "swvl1", "lccs_class"]
y_key = "bio_flux_opt"

df_pycaret = df_train[X_keys + [y_key]]
# Keep every 10th row; copy so that the assignment below modifies an independent frame.
df_reduced = df_pycaret[::10].copy()

df_reduced[y_key] = df_reduced[y_key] * 1e6  # So RMSE etc. are easier to interpret.

import pycaret.regression
pycs = pycaret.regression.setup(df_reduced, target=y_key)
best = pycs.compare_models(n_select=5, round=2)
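
The scaling by `1e6` matters because error metrics such as RMSE scale linearly with the target, and the comparison table is rounded to two decimals. The sketch below illustrates this with synthetic numbers; the magnitude of order 1e-6 is an assumption suggested by the scaling factor, not a value taken from the CarbonTracker data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target values of order 1e-6 and noisy "predictions".
y_true = rng.normal(scale=1e-6, size=1000)
y_pred = y_true + rng.normal(scale=2e-7, size=1000)


def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


rmse_raw = rmse(y_true, y_pred)
rmse_scaled = rmse(y_true * 1e6, y_pred * 1e6)

# Rounded to two decimals, the unscaled error is indistinguishable from
# zero, while the scaled error remains readable.
print(round(rmse_raw, 2), round(rmse_scaled, 2))
```

Scale-free metrics such as R2 are unaffected by this rescaling, so the ranking of models by R2 does not change.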

PyCaret can also be used to evaluate the trained models, for example by plotting the feature importance of the best one:

In [None]:
pycs.plot_model(best[0], plot="feature")

Including biomass information (this feature set adds `biomass` and no longer contains `lccs_class`) gives slightly better model performance:

In [None]:
X_keys = ["biomass", "d2m", "mslhf", "msshf", "ssr", "str", "t2m", "spei", "NIRv", "skt", "stl1", "swvl1"]
y_key = "bio_flux_opt"

df_pycaret = df_train[X_keys + [y_key]]
# Keep every 10th row; copy so that the assignment below modifies an independent frame.
df_reduced = df_pycaret[::10].copy()

df_reduced[y_key] = df_reduced[y_key] * 1e6  # So RMSE etc. are easier to interpret.

import pycaret.regression
pycs = pycaret.regression.setup(df_reduced, target=y_key)
best = pycs.compare_models(n_select=5, round=2)  # n_select > 1 returns a list, which is indexed below.

In the feature importance plot, however, you can see that including biomass reduces the importance of NIRv:

In [None]:
pycs.plot_model(best[0], plot="feature")

Feature importance does depend on the model used, though. For a different well-performing model, it looks as follows:

In [None]:
pycs.plot_model(best[3], plot="feature")
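
Since model-specific importances are hard to compare across model types, one model-agnostic alternative is permutation importance: shuffle one feature at a time and measure how much the error increases. This is a different technique from the plots produced by `plot_model`, shown here only as a minimal sketch on synthetic data with an ordinary least-squares fit standing in for the trained model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: the target depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares with an intercept term.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict(features):
    """Apply the fitted linear model to a feature matrix."""
    return np.column_stack([features, np.ones(len(features))]) @ coef


def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


baseline = rmse(y, predict(X))

# Importance of a feature = increase in error after shuffling its column.
importance = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    importance.append(rmse(y, predict(X_shuffled)) - baseline)

print(importance)
```

For simplicity the sketch scores the model on its own training data; in practice the shuffling would be done on held-out data and repeated several times to average out the randomness.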