# Training a model on CarbonTracker's carbon flux
This notebook outlines the entire workflow for loading and preprocessing the ERA5 and CarbonTracker data, so that an ML model can be trained on them.

First download the CarbonTracker and monthly ERA5 data using the instructions in the README file.

The functions written to process the data live in the `src` package (for example `src/carbontracker.py`). We can import them with:

In [1]:
from src import carbontracker
from src import spei
from src import utils
from pathlib import Path
import xarray as xr
import xarray_regrid  # Importing this will make Dataset.regrid accessible.

We will load the CarbonTracker data into an xarray `Dataset` and convert the timestamps (middle point of each month) to a more standard format (1st day of the month), to allow merging with ERA5 data.

In [None]:
data_folder = Path("/home/yangliu/Excited/EXCITED_prepped_data")
ds_cb = xr.open_dataset(data_folder / "CT2022.flux1x1-monthly.nc")
ds_cb = utils.convert_timestamps(ds_cb)
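
The timestamp conversion described above can be pictured with a few lines of pandas. This is only a minimal sketch, not the implementation of `utils.convert_timestamps`; the helper name `to_month_start` and the sample dates are made up for illustration.

```python
import pandas as pd


def to_month_start(times: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Snap every timestamp to the first day of its month (hypothetical helper)."""
    # Converting to a monthly period drops the day and time information;
    # converting back yields the start of each period.
    return times.to_period("M").to_timestamp()


# Illustrative mid-month timestamps, similar in spirit to monthly-mean data.
mid_month = pd.DatetimeIndex(["2000-01-16 12:00", "2000-02-15 00:00", "2000-03-16 12:00"])
month_start = to_month_start(mid_month)
```

With both datasets stamped on the first of the month, their time coordinates match exactly and a merge does not produce duplicate months.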

Next we load the monthly ERA5 data. We have to convert the longitude values from the (0 -> 360 degrees) convention to (-180 -> 180 degrees).

Then we coarsen the data to a 1-degree grid, centered around the half values (e.g., [0.5, 1.5, ...]).

In [None]:
ds_era5 = xr.open_mfdataset(str(data_folder / "monthly_era5" / "*.nc"))
ds_era5 = carbontracker.shift_era5_longitude(ds_era5)
ds_era5 = carbontracker.coarsen_era5(ds_era5)
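
The longitude shift can be sketched with NumPy as below. This is an illustration under simple assumptions, not the code of `carbontracker.shift_era5_longitude`; the function name `shift_longitude` and the toy values are hypothetical.

```python
import numpy as np


def shift_longitude(lon, values):
    """Map longitudes from [0, 360) to [-180, 180) and reorder the data to match."""
    shifted = (np.asarray(lon, dtype=float) + 180.0) % 360.0 - 180.0
    # Sort so that the longitude axis is monotonically increasing again.
    order = np.argsort(shifted)
    return shifted[order], np.asarray(values)[..., order]


lon = np.array([0.0, 90.0, 180.0, 270.0])
values = np.array([10, 20, 30, 40])  # one value per longitude
new_lon, new_values = shift_longitude(lon, values)
```

The data are reordered together with the coordinate, so each value stays attached to the same physical location.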

Next we add the SPEI dataset. Its timestamps also need to be converted to a more standard format.

In [None]:
ds_spei = spei.load_spei_data(data_folder / "spei/spei06.nc").sel(time=slice("2000-01", "2020-12"))

In [None]:
# regrid spei dataset to desired era5 grid
ds_spei_regrid = ds_spei.regrid.regrid(ds_era5, method="linear")
ds_spei_regrid
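
To give an idea of what linear regridding does, here is a one-dimensional sketch using `np.interp`. The real regridding operates on the full latitude/longitude grid; the coordinates and values below are invented for illustration.

```python
import numpy as np

# Source grid: 0.5-degree spacing (illustrative values).
src_lat = np.array([0.25, 0.75, 1.25, 1.75])
src_val = np.array([1.0, 2.0, 3.0, 4.0])

# Target grid: 1-degree spacing, centered on the half values.
tgt_lat = np.array([0.5, 1.5])

# Each target value is a distance-weighted average of its two source neighbours.
tgt_val = np.interp(tgt_lat, src_lat, src_val)
```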

To limit the analysis to Transcom region 2 (North America), we require the `regions.nc` file:

In [None]:
ds_regions = xr.open_dataset(data_folder / "regions.nc")
# Uncomment the next line to preview the region:
#ds_regions["transcom_regions"].where(ds_regions["transcom_regions"]==2).plot()

Now we can merge all the datasets together. From the CarbonTracker file we only require the `bio_flux_opt` variable, and for SPEI we use the regridded version so that all inputs share the same grid:

In [None]:
ds_merged = xr.merge([ds_cb[["bio_flux_opt"]], ds_regions["transcom_regions"], ds_era5, ds_spei_regrid])
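
An outer-join merge takes the union of all coordinate values, so the inputs need to sit on the same grid beforehand. The toy example below sketches what happens otherwise; the datasets and the shifted grid are made up, and `join="outer"` is passed explicitly so the behavior does not depend on the installed xarray version's default.

```python
import numpy as np
import xarray as xr

lat = [0.5, 1.5]
lon = [-1.5, -0.5]
dims = ("latitude", "longitude")

flux = xr.Dataset(
    {"bio_flux_opt": (dims, np.ones((2, 2)))},
    coords={"latitude": lat, "longitude": lon},
)
t2m = xr.Dataset(
    {"t2m": (dims, np.full((2, 2), 280.0))},
    coords={"latitude": lat, "longitude": lon},
)
# The same data on a grid that is offset by a quarter degree.
t2m_offset = t2m.assign_coords(longitude=[-1.25, -0.25])

aligned = xr.merge([flux, t2m], join="outer")
misaligned = xr.merge([flux, t2m_offset], join="outer")
```

In the misaligned case the longitude axis doubles in length and both variables are padded with NaN, which later removes those rows when NaN values are dropped.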

To make computations faster and less memory intensive, we can reduce the scope to only North America.

The `.sel` operation below restricts the dataset to the years 2015 to 2020 and to a rectangular area around North America:

In [None]:
time_region_na = {
    "time": slice("2015-01", "2020-12"),
    "latitude": slice(15, 60),
    "longitude": slice(-140, -55),
}
ds_na = ds_merged.sel(time_region_na)
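
One thing to keep in mind with slice-based selection is that the slice bounds must follow the order of the coordinate. Some datasets store latitude from north to south, in which case `slice(15, 60)` selects nothing. The sketch below uses an invented latitude axis to show the effect and a possible remedy with `sortby`.

```python
import numpy as np
import xarray as xr

# Illustrative data with a north-to-south (descending) latitude axis.
da = xr.DataArray(
    np.arange(6.0),
    coords={"latitude": [62.5, 52.5, 42.5, 32.5, 22.5, 12.5]},
    dims="latitude",
)

# On a descending axis, an ascending slice matches no points.
empty = da.sel(latitude=slice(15, 60))

# Sorting the axis first makes the ascending slice behave as expected.
subset = da.sortby("latitude").sel(latitude=slice(15, 60))
```

Checking `ds_na.sizes` after the selection is a quick way to confirm that the result is not empty.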

We can now compute the dataset, i.e. load it into memory instead of leaving it lazy, as it is small enough to fit into RAM.
This operation should not take more than a minute or two.

In [None]:
ds_na = ds_na.compute()

In this North American dataset we can mask out everything outside the Transcom region, and preview the 2m air temperature of ERA5:

In [None]:
ds_na = ds_na.where(ds_na["transcom_regions"] == 2)  # Use the mask that is already loaded in memory.

import matplotlib.pyplot as plt
plt.figure(figsize=(5,3))
ds_na["t2m"].isel(time=0).plot()
plt.tight_layout()
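
The masking step keeps values where the condition holds and replaces everything else with NaN. A small NumPy sketch of the same idea, with invented region codes and temperatures:

```python
import numpy as np

# Illustrative region codes and temperatures on a 2x3 grid.
regions = np.array([[1, 2, 2], [2, 2, 3]])
t2m = np.array([[280.0, 281.0, 282.0], [283.0, 284.0, 285.0]])

# Keep values inside region 2, set all other grid cells to NaN.
masked = np.where(regions == 2, t2m, np.nan)
n_valid = int(np.count_nonzero(~np.isnan(masked)))
```

The grid keeps its rectangular shape; only the cells outside the region become NaN.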

To prepare the data for training, we convert it to a pandas `DataFrame`.

We remove all rows with NaN values and reset the index, so that the coordinates (time, latitude, longitude) become regular columns:

In [None]:
df_train = ds_na.to_dataframe().dropna().reset_index()
df_train.head(3)
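
The effect of `dropna().reset_index()` can be seen on a small hand-made table. The values below are invented; only the shape of the operation matters.

```python
import numpy as np
import pandas as pd

# A frame indexed by (time, latitude, longitude), like the output of `to_dataframe()`.
index = pd.MultiIndex.from_product(
    [["2015-01"], [0.5, 1.5], [-1.5, -0.5]],
    names=["time", "latitude", "longitude"],
)
df = pd.DataFrame(
    {
        "t2m": [280.0, np.nan, 282.0, 283.0],
        "bio_flux_opt": [1e-7, 2e-7, np.nan, 4e-7],
    },
    index=index,
)

# Drop every row with at least one NaN, then turn the index levels into columns.
flat = df.dropna().reset_index()
```

Because masked grid cells are NaN in every variable, this step also removes all points outside the selected region.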

Now we can train our ML models on the data. Here we use pycaret to try a set of models and see which type performs best.

In [None]:
X_keys = ["d2m", "mslhf", "msshf", "ssr", "str", "t2m"]
y_key = "bio_flux_opt"

df_pycaret = df_train[X_keys + [y_key]]
df_reduced = df_pycaret[::10].copy()  # Keep every 10th row; copy to avoid a SettingWithCopyWarning.

df_reduced[y_key] = df_reduced[y_key] * 1e6  # So RMSE etc. are easier to interpret.

import pycaret.regression
pycs = pycaret.regression.setup(df_reduced, target=y_key)  # Optional: normalize=True, normalize_method="robust"
best = pycs.compare_models(round=2)
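
Scaling the target by a constant scales the RMSE by the same constant, which is why the multiplication by `1e6` only changes how the scores read and not the ranking of the models. A minimal check with invented numbers:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff**2)))


# Illustrative targets and predictions with very small magnitudes.
y_true = np.array([1.0e-7, -2.0e-7, 3.0e-7])
y_pred = np.array([1.5e-7, -1.0e-7, 2.0e-7])

rmse_raw = rmse(y_true, y_pred)
rmse_scaled = rmse(y_true * 1e6, y_pred * 1e6)
```

With `round=2` in `compare_models`, unscaled errors of this magnitude would all be displayed as `0.0`, so the scaling keeps the comparison table readable.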