[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NOAA-EPIC/global-eagle/blob/feature/hello_world/examples/getting_started/colab_notebook_demo/pipeline_demo.ipynb)

# Welcome to the `ufs2arco` + `anemoi` + `wxvx` pipeline!

Before we start, let's go over a few Google Colab tips!

Q) Where are files located?!

A) You should see a navigation bar on the left of your screen. The bottom option is a folder icon. Click it to see all files in your workspace. If you have not run anything yet, you should only see a `sample_data` folder (Colab adds this to every notebook automatically). Throughout this notebook you can come back here to watch your files populate, look at plots, and edit YAML files if you wish to update any configurations on your own. Note: clicking through folders can sometimes be a little laggy.

Q) How do I connect to compute?!

A) You will need to connect to a runtime. Toward the top right of your screen you will see the words RAM and Disk, with a drop-down button next to them. Click it, then click "Change runtime type". Make sure "Python 3" is selected under runtime type and, if available, select a T4 GPU as your hardware accelerator. If no GPU is available, you can run this notebook on a CPU, but training will be very slow. If you happen to have credits for an A100, use that!
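To confirm which accelerator the runtime actually attached, a quick check like the one below can help. It is a minimal sketch that assumes the `nvidia-smi` command is on the path whenever a GPU runtime is attached; the function name `describe_accelerator` is ours and not part of any package used in this notebook.

```python
import shutil
import subprocess


def describe_accelerator() -> str:
    """Return the GPU name(s) reported by nvidia-smi, or a CPU-only notice."""
    # nvidia-smi is only present when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return "No GPU detected: running on CPU"
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    names = result.stdout.strip()
    # A failed call or empty output is treated the same as having no GPU.
    if result.returncode != 0 or not names:
        return "No GPU detected: running on CPU"
    return f"GPU detected: {names}"


print(describe_accelerator())
```

If this reports CPU only, revisit the runtime settings before starting the training step.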

Now that you are connected to compute and know where to find your files, let's do some Machine Learning!

This notebook will guide you through an entire ML pipeline.
1) Data preprocessing using `ufs2arco` to create training and validation datasets
2) Model training using `anemoi-core` modules to train a graph-based model
3) Creating a forecast with `anemoi-inference` to run inference from a model checkpoint
4) Verifying your forecast (or multiple!) with `wxvx` to verify against the Global Forecast System (GFS)

More information about each module, along with instructions, is provided within each step. You will also find guidance there if you wish to change any configurations yourself!

Acknowledgments:
- ufs2arco and Anemoi configurations were adapted from Tim Smith at NOAA Physical Sciences Laboratory
    - https://github.com/NOAA-PSL/anemoi-house
- ufs2arco: Tim Smith (NOAA Physical Sciences Laboratory)
    - https://github.com/NOAA-PSL/ufs2arco
- Anemoi: European Centre for Medium-Range Weather Forecasts
    - https://github.com/ecmwf/anemoi-core
    - https://github.com/ecmwf/anemoi-inference
- wxvx: Paul Madden (NOAA Global Systems Laboratory/Cooperative Institute for Research In Environmental Sciences)
    - https://github.com/maddenp-cu/wxvx

### Step 1: Environment Setup
Runtime: 1 minute

You will receive a popup after all packages are installed. Click "restart session" on the popup and continue on to the next step.

You may see some red warnings about numpy versions. You can ignore these.

In [None]:
!pip install anemoi-datasets==0.5.25 anemoi-graphs==0.6.2 anemoi-models==0.8.1 anemoi-training==0.5.1 anemoi-inference==0.6.3 trimesh 'numpy<2.3' 'earthkit-data<0.14.0' ufs2arco
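After restarting the session, it can be reassuring to confirm that the pinned versions were actually picked up. The sketch below uses only the standard library; the helper name `installed_versions` is ours, and the package list is copied from the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip command above.
pipeline_packages = [
    "anemoi-datasets", "anemoi-graphs", "anemoi-models",
    "anemoi-training", "anemoi-inference", "ufs2arco",
]
for name, ver in installed_versions(pipeline_packages).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```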

Clone repository

In [None]:
!git clone -b feature/hello_world https://github.com/NOAA-EPIC/global-eagle.git

#TODO -- right before merging to main we need to update this to not load branch.

### Step 2: Create training and validation datasets with ufs2arco

Runtime: 3 minutes

`ufs2arco` is a Python package developed by NOAA Physical Sciences Laboratory (PSL) that is designed to make NOAA forecast, reanalysis, and reforecast datasets more accessible for scientific analysis and machine learning model development. The name stems from its original intent, which was to transform output from the Unified Forecast System (UFS) into Analysis Ready, Cloud Optimized (ARCO; Abernathey et al., 2021) format. However, the package now pulls data from a number of non-UFS sources, including GFS/GEFS before UFS was created, and even ECMWF's ERA5 dataset.

To learn more about ufs2arco, check out the documentation: https://ufs2arco.readthedocs.io/en/latest/index.html

We are going to create the following dataset:
- NOAA Replay Reanalysis
- 3-hourly
- Training data dates: 2022-01-01T00 - 2022-02-04T21
- Validation data dates: 2022-01-03T00 - 2022-01-04T21
- 1-degree global resolution

For the purposes of running this notebook, we will not be creating a test set.
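To get a feel for how many samples a date range like the ones above yields, the number of 3-hourly time steps can be counted directly. This is a small standard-library sketch; the helper `count_time_steps` is ours, and it assumes both endpoints are included in the dataset.

```python
from datetime import datetime, timedelta


def count_time_steps(start: str, end: str, step_hours: int = 3) -> int:
    """Count time steps from start to end (both inclusive) at a fixed interval."""
    fmt = "%Y-%m-%dT%H"
    first = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    if last < first:
        raise ValueError("end must not be earlier than start")
    # Whole intervals between the endpoints, plus one for the first time step.
    return (last - first) // timedelta(hours=step_hours) + 1


# Validation period listed above, sampled every 3 hours.
print(count_time_steps("2022-01-03T00", "2022-01-04T21"))  # 16
```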

While this cell is running, go into the `global-eagle/examples/getting_started/colab_notebook_demo/data` folder and look at `logs/logs.serial.out`. This will provide more insight into the dataset creation. Additionally, open `global-eagle/examples/getting_started/colab_notebook_demo/data/replay.yaml` to see all configurations related to data preprocessing.

In [None]:
!ufs2arco global-eagle/examples/getting_started/colab_notebook_demo/data/replay.yaml

After the dataset has completed, let's view it!

You will notice that this format looks different from a "typical" gridded netCDF or Zarr file. The gridded data is flattened to be 1-dimensional, and we have calculated various statistics that will be used for normalization during training. These important details make the dataset ready to be used within an ML model.

In [None]:
import xarray as xr

ufs2arco_ds = xr.open_dataset("global-eagle/examples/getting_started/colab_notebook_demo/data/replay.zarr")
ufs2arco_ds
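The flattened layout and the normalization statistics described above can be illustrated on a toy field. The sketch below is plain Python so that it runs anywhere; the values are made up, and the row-by-row flattening order and the z-score normalization are assumptions for illustration, not a description of what `ufs2arco` or Anemoi do internally.

```python
from statistics import fmean, pstdev

# Toy 2 x 3 (latitude x longitude) temperature field in kelvin; values are made up.
grid = [
    [280.0, 282.0, 284.0],
    [290.0, 292.0, 294.0],
]

# Flatten row by row so each grid cell becomes one entry along a single dimension.
flat = [value for row in grid for value in row]

# Statistics like these are computed once and reused for normalization.
mean = fmean(flat)
stdev = pstdev(flat)

# Standard (z-score) normalization: zero mean, unit standard deviation.
normalized = [(value - mean) / stdev for value in flat]

print(len(flat), mean)
```

Normalizing every variable to a comparable range is what lets a single model train on fields with very different units and magnitudes.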

### Step 3: Train a model with anemoi-core modules

Runtime: 4 minutes

We train a graph-based model with the `anemoi-core` modules from the European Centre for Medium-Range Weather Forecasts (ECMWF). The modules include the following:
- `anemoi-graphs`: https://anemoi.readthedocs.io/projects/graphs/en/latest/
- `anemoi-training`: https://anemoi.readthedocs.io/projects/training/en/latest/
- `anemoi-models`: https://anemoi.readthedocs.io/projects/models/en/latest/

Training is executed using `anemoi-training`.

While training is running, go to the `global-eagle/examples/getting_started/colab_notebook_demo/train/training-output` folder. You will see folders containing checkpoints and plots from your run.

We will use the following configurations to train the model:
- Model task: Deterministic Forecasting (GraphForecaster)
- Model type: Graph Transformer Neural Network
- Graph: multi_scale encoder-processor-decoder configuration

In [None]:
import os
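# Fix the base random seed so that repeated training runs are comparable
os.environ["ANEMOI_BASE_SEED"] = "42"
# Placeholder job ID, since Colab does not run under SLURM
os.environ["SLURM_JOB_ID"] = "0"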

In [None]:
%cd global-eagle/examples/getting_started/colab_notebook_demo/train/

In [None]:
!anemoi-training train --config-name=config

### Step 4: Create a forecast with anemoi-inference

Runtime: 12 seconds

Documentation: https://anemoi.readthedocs.io/projects/inference/en/latest/

Next, we will run inference using `anemoi-inference`. We will create a 48-hour forecast from 2022-01-03 00Z to 2022-01-04 21Z using a checkpoint from the model we just trained.

Before executing the next two cells you will need to complete the following steps:
1) Go to the `global-eagle/examples/getting_started/colab_notebook_demo/train/training-output/checkpoint/` folder
2) Copy the long run ID used as the name of the subfolder inside it (e.g. `35a9632c-ab04-4284-af5e-4defcef37cff`)
3) Open `global-eagle/examples/getting_started/colab_notebook_demo/inference/inference_config.yaml`
4) Replace the checkpoint with the following: `"../train/training-output/checkpoint/<ENTER YOUR ID HERE>/inference-last.ckpt"`
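If you would rather not hunt for the run ID by hand, the lookup in the steps above can be sketched with `pathlib`. The helper `find_inference_checkpoint` is hypothetical (it is not part of Anemoi), and it assumes each run folder contains a file named `inference-last.ckpt`, as in the path shown in the last step.

```python
from pathlib import Path


def find_inference_checkpoint(checkpoint_root):
    """Return the most recently written inference-last.ckpt under checkpoint_root."""
    root = Path(checkpoint_root)
    # Each training run writes into its own sub-folder named after the run ID.
    candidates = [p for p in root.glob("*/inference-last.ckpt") if p.is_file()]
    if not candidates:
        raise FileNotFoundError(f"No inference-last.ckpt found under {root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Hypothetical usage; adjust the folder to match what you see in the file browser.
# ckpt = find_inference_checkpoint("../train/training-output/checkpoint")
# print(ckpt.parent.name)  # the run ID to paste into inference_config.yaml
```

You still need to paste the printed ID into `inference_config.yaml` yourself; the sketch only locates it.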

In [None]:
%cd /content/global-eagle/examples/getting_started/colab_notebook_demo/inference/

In [None]:
!anemoi-inference run inference_config.yaml

View inference

In [None]:
import xarray as xr
ds = xr.open_dataset("2022-01-03T00.nc")
ds

In [None]:
import matplotlib.pyplot as plt

# Index into the time dimension (not a forecast hour); 1 selects the second output time
time_idx = 1
temp = ds['tmp2m'].isel(time=time_idx).values
lat = ds['latitude'].values
lon = ds['longitude'].values

# The output is on a flattened grid, so draw each point at its own lon/lat
plt.figure(figsize=(10, 6))
plt.scatter(lon, lat, c=temp, s=10, cmap='coolwarm')
plt.colorbar(label='2m Temperature')
plt.title(f'2m Temperature at {ds["time"][time_idx].values}')
plt.show()

Postprocess inference

We perform some postprocessing to ensure that the output will work with the wxvx framework for verification. This includes making the data 2D or 3D again, and adding necessary attributes required by wxvx.

In [None]:
!python postprocess.py

In [None]:
ds_post = xr.open_dataset("2022-01-03T00_postprocessed.nc")
ds_post
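To illustrate what "making the data 2D again" means, the sketch below rebuilds a regular latitude-longitude grid from flattened points. It is not the code in `postprocess.py`; the helper `to_grid` and the sample values are made up, and it assumes the points fill a regular grid exactly.

```python
def to_grid(lats, lons, values):
    """Arrange flattened points on a regular (latitude, longitude) grid."""
    grid_lats = sorted(set(lats))
    grid_lons = sorted(set(lons))
    if len(grid_lats) * len(grid_lons) != len(values):
        raise ValueError("points do not fill a regular latitude-longitude grid")
    # Look up the row and column of each coordinate value.
    row = {lat: i for i, lat in enumerate(grid_lats)}
    col = {lon: j for j, lon in enumerate(grid_lons)}
    grid = [[None] * len(grid_lons) for _ in grid_lats]
    for lat, lon, value in zip(lats, lons, values):
        grid[row[lat]][col[lon]] = value
    return grid_lats, grid_lons, grid


# Four made-up points on a 2 x 2 grid, listed in arbitrary order.
lats = [10.0, -10.0, 10.0, -10.0]
lons = [0.0, 0.0, 90.0, 90.0]
values = [1.0, 2.0, 3.0, 4.0]
grid_lats, grid_lons, grid = to_grid(lats, lons, values)
print(grid)
```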

Consider saving this final postprocessed netCDF file locally. That way, if you get disconnected from this runtime, you can run wxvx at a later time without having to rerun the whole notebook.
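One way to save the file is Colab's `files.download` helper, which triggers a browser download. The wrapper below is a sketch: `download_if_colab` is our own name, and it simply does nothing when the `google.colab` module is unavailable (that is, outside Colab).

```python
def download_if_colab(path: str) -> bool:
    """Trigger a browser download in Colab; return False anywhere else."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import files
    except ImportError:
        return False
    files.download(path)
    return True


if not download_if_colab("2022-01-03T00_postprocessed.nc"):
    print("Not running in Colab; copy the file manually instead.")
```

You can also right-click the file in the file browser on the left and choose the download option.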

### Step 5: Verify the forecast against GFS with wxvx

Runtime: 4 minutes

`wxvx` is a workflow tool for verifying weather models. It leverages `uwtools` to drive `MET`. We are going to run grid-to-grid verification. Verification against observations is currently under development (coming soon!).

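Google Colab does not come with Conda preinstalled, so we have to install it first. The kernel restarts automatically once `condacolab` finishes installing; wait for it to reconnect before running the remaining cells. We will then run wxvx.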

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install() 

In [None]:
!conda create -y -n wxvx -c ufs-community -c paul.madden wxvx python=3.13

In [None]:
%cd global-eagle/examples/getting_started/colab_notebook_demo/verification/

In [None]:
import os
os.environ["MPLBACKEND"] = "agg"

In [None]:
!conda run -n wxvx wxvx -c wxvx_config.yaml -t plots

Now go to `global-eagle/examples/getting_started/colab_notebook_demo/verification/run/plots/20220103/00` and open some of the plots, which compare our model against GFS using root-mean-square error (RMSE) and mean error (ME) for numerous variables.
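For readers new to these two metrics, here is a minimal, unweighted sketch of how RMSE and ME are defined. The numbers are made up, and ME is taken as forecast minus reference; the real values in the plots come from `MET`, which may apply its own masking and weighting.

```python
import math


def mean_error(forecast, reference):
    """Mean error (bias): average of forecast minus reference."""
    return sum(f - r for f, r in zip(forecast, reference)) / len(forecast)


def rmse(forecast, reference):
    """Root-mean-square error between forecast and reference."""
    squared = [(f - r) ** 2 for f, r in zip(forecast, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up 2 m temperatures (K) at three grid points.
forecast = [281.0, 285.0, 290.0]
reference = [280.0, 285.0, 292.0]
print(f"ME = {mean_error(forecast, reference):.3f} K")
print(f"RMSE = {rmse(forecast, reference):.3f} K")
```

ME shows whether the forecast runs systematically warm or cold, while RMSE measures the typical size of the error regardless of sign.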