# Aurora inference & fine-tuning in Azure ML

This notebook explains the workflow and submits an Azure Machine Learning (AML) job that runs **Microsoft Aurora** on a GPU cluster.

## Files used in this workshop

We use the following components:

1. `notebooks/0_aurora_workshop.ipynb` *(this notebook)* - explains the workflow and submits jobs to AML:
   - <mark>Run this notebook with a CPU Compute Instance using the "Python 3.10 - SDK v2" kernel</mark>
2. `setup/components/inference` - contains Aurora inference logic:
   - `main.py`: a script with a CLI interface for running a simple inference loop.
   - `component.py`: AML component definition.
3. `setup/components/training` - contains Aurora fine-tuning logic:
   - `main.py`: a script with a CLI interface for running a simple fine-tuning loop.
   - `component.py`: AML component definition.
4. `setup/components/common/utils.py` - contains Aurora helper logic, including:
   - Loading model checkpoints in train or eval mode.
   - Loading data on disk into `aurora.Batch` objects for inference and fine-tuning.
   - Converting `aurora.Batch` objects into `xarray.Dataset` objects for analysis and writing of data.

**NOTE**: the inference and fine-tuning scripts in `setup/components/*/main.py` work in both local and remote environments, provided the hardware and dependencies required to run Aurora are present. The AML component definitions in `setup/components/*/component.py` serve only to deploy these scripts and make them executable in AML. The notebooks and common code in `setup/notebooks` and `setup/common`, respectively, are largely for workspace setup, including downloading and registering data assets and the pre-trained model.

In [None]:
# use Python 3.10 - SDK v2 kernel to avoid numpy / xarray version issues
# install necessary dependencies for this notebook and setup.common.utils
%pip install azure-ai-ml xarray

In [None]:
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import xarray as xr
import yaml
from azure.ai.ml import Input, Output, PyTorchDistribution
from azure.ai.ml.entities import Command, CommandJobLimits, Model
from azure.ai.ml.exceptions import JobException, MlException

# make the project root importable so that `setup` resolves as a package
sys.path.insert(0, str(Path.cwd().parent.resolve()))
from setup.common.utils import create_mlclient, get_latest_asset

# datetime.UTC only exists in Python 3.11+; this alias keeps the 3.10 kernel working
UTC = timezone.utc

In [None]:
PARTICIPANT_ID = input("Enter your participant ID (format: firstnamelastname): ").strip()
print(f"Hello, {PARTICIPANT_ID}!")

Create an `azure.ai.ml.MLClient` object to interact with the workspace, then use it to retrieve the compute cluster, model, and data required to run jobs. The code takes the first cluster it finds, so it assumes the workspace contains a single, GPU-enabled cluster.

**NOTE**: the `local` parameter is a boolean that decides which environment variables `create_mlclient` reads when configuring the `MLClient` object. `True` reads variables from a local `.env` file in the project root; `False` reads variables that are set automatically on Azure Machine Learning Compute Instances. See `setup/common/utils.py` for more.

In [None]:
ml_client = create_mlclient(local=True)
print(
    f"Connected to workspace: sub={ml_client.subscription_id}, "
    f"rg={ml_client.resource_group_name}, workspace={ml_client.workspace_name}",
)
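# the f-string and the later .format() call each halve the doubled braces,
# leaving the AML expression ${{default_datastore}} in the final path
OUTFILE_TEMPLATE = f"azureml://datastores/${{{{{{{{default_datastore}}}}}}}}/paths/aurora-workshop/{PARTICIPANT_ID}/{{experiment_name}}/{{display_name}}/{{filename}}"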

# get the name of the first AML compute cluster (type="amlcompute") in the workspace
CLUSTER_NAME = next(iter(ml_client.compute.list(compute_type="amlcompute"))).name

# get the latest pre-trained Aurora 0.25 model registered in the workspace
model = get_latest_asset(ml_client.models, name="aurora-0p25-pretrained")
MODEL_NAME = f"azureml:{model.name}:{model.version}"

# get the latest ERA5 subset data asset registered in the workspace
data = get_latest_asset(ml_client.data, name="gcp-era5-arco")
DATA_NAME = f"azureml:{data.name}:{data.version}"

print(f"Using assets: cluster={CLUSTER_NAME}, model={MODEL_NAME}, data={DATA_NAME}")
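
The run of braces in `OUTFILE_TEMPLATE` is easier to follow when it is resolved in two steps. The sketch below mirrors the template from the cell above but uses a hypothetical participant ID and hypothetical job names. The f-string fills in the participant ID and collapses each doubled brace once; `str.format` then fills in the job fields and collapses the braces again, leaving the `${{default_datastore}}` expression for AML to resolve.

```python
participant_id = "janedoe"  # hypothetical participant ID

# Step 1: the f-string fills {participant_id}; every "{{" collapses to "{".
template = f"azureml://datastores/${{{{{{{{default_datastore}}}}}}}}/paths/aurora-workshop/{participant_id}/{{experiment_name}}/{{display_name}}/{{filename}}"
print(template)

# Step 2: str.format fills the job fields; the remaining "{{" collapse to "{".
path = template.format(
    experiment_name="inference-test",  # hypothetical experiment name
    display_name="janedoe-20250101-000000",  # hypothetical display name
    filename="predictions.nc",
)
print(path)
```

Because the participant ID is baked in at step 1, re-running the cell that defines `OUTFILE_TEMPLATE` is needed if `PARTICIPANT_ID` changes.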

## Inference jobs

Here, we'll run inference and evaluation jobs using different data:
- Generated synthetic test data comprising a low-resolution tensor of random float values.
- Real, pre-loaded ERA5 data over the 2025-01-01T00 to 2025-01-31T18 period.

After inference, the final prediction is compared to its corresponding ground truth. We calculate the global difference and the RMSE, logging the former as a plot and the latter as a value. Both are logged with [MLflow](https://mlflow.org/), an open-source machine learning lifecycle framework [integrated in AML](https://learn.microsoft.com/en-us/azure/machine-learning/concept-mlflow?view=azureml-api-2).

First, load the inference configs defined in YAML into a dictionary.

In [None]:
with Path("inference_configs.yaml").open("r") as f:
    inference_configs = yaml.safe_load(f)

Then, specifying the name of a config defined in YAML, run the job.

**NOTE:** for the definition and logic of the Command Component used in this job, see either `setup/components/inference` or the registered component in AML.
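
The job cell looks up a config by name and passes it to the component as a JSON string. A minimal sketch of that pattern follows. The config keys (`steps`, `use_synthetic_data`) and the `select_config` helper are hypothetical, and the component-side `json.loads` call is an assumption about how the string is consumed; the real keys live in `inference_configs.yaml`.

```python
import json

# Hypothetical stand-in for the dictionary returned by yaml.safe_load.
configs = {
    "test": {"steps": 2, "use_synthetic_data": True},
    "empty": {},
}


def select_config(configs: dict, name: str) -> dict:
    """Return the named config, or raise if it is missing or empty."""
    if not (cfg := configs.get(name)):
        raise ValueError(f"Config not found: name={name}")
    return cfg


# Notebook side: serialise the config so it fits a single string input.
payload = json.dumps(select_config(configs, "test"))

# Component side (assumed): parse the string back into a dictionary.
restored = json.loads(payload)
print(restored)

# An empty config is falsy, so the same check rejects it as "not found".
try:
    select_config(configs, "empty")
except ValueError as err:
    error = str(err)
    print(error)
```

Note that the truthiness check treats a config that exists but is empty the same as a missing one, which is worth keeping in mind when adding entries to the YAML file.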

In [None]:
config_name = input("Enter a config name, e.g. test: ").strip()
if not (cfg := inference_configs.get(config_name)):
    msg = f"Config not found: name={config_name}"
    raise ValueError(msg)

display_name = f"{PARTICIPANT_ID}-{datetime.now(UTC).strftime('%Y%m%d-%H%M%S')}"
experiment_name = f"inference-{config_name}"
inference_component = get_latest_asset(ml_client.components, name="aurora_inference")

inference_command = Command(
    component=inference_component,
    display_name=display_name,
    experiment_name=experiment_name,
    compute=CLUSTER_NAME,
    inputs={
        "model": Input(type="custom_model", path=MODEL_NAME, mode="ro_mount"),
        "data": Input(type="uri_folder", path=DATA_NAME, mode="ro_mount"),
        # the initial state needs this timestamp and the one 6 hours earlier,
        # so both must exist in the data
        "start_datetime": "2025-01-01T06:00:00",
        "config": json.dumps(cfg),
    },
    outputs={
        "predictions": Output(
            type="uri_file",
            path=OUTFILE_TEMPLATE.format(
                experiment_name=experiment_name,
                display_name=display_name,
                filename="predictions.nc",
            ),
            mode="rw_mount",
        ),
    },
    limits=CommandJobLimits(timeout=7200),
    distribution=PyTorchDistribution(process_count_per_instance=1),
    environment=inference_component.environment,
)

print(f"Submitting inference job: name={display_name}, config={config_name}")
inference_job = ml_client.jobs.create_or_update(inference_command)

print("Streaming logs:")
ml_client.jobs.stream(inference_job.name)
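
The evaluation described at the start of this section, a difference field and an RMSE between the final prediction and its ground truth, can be sketched with NumPy. This is an unweighted illustration on a hypothetical 2 x 3 grid, not the component's implementation; the logic in `setup/components/inference` may differ, for example by weighting grid cells by latitude.

```python
import numpy as np


def rmse(prediction: np.ndarray, truth: np.ndarray) -> float:
    """Unweighted root-mean-square error over all grid points."""
    diff = prediction - truth
    return float(np.sqrt(np.mean(diff**2)))


# Hypothetical 2 x 3 (lat x lon) fields standing in for one variable.
truth = np.array([[280.0, 281.0, 282.0], [283.0, 284.0, 285.0]])
prediction = truth + np.array([[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0]])

difference = prediction - truth  # the field that would be plotted
print(rmse(prediction, truth))  # 1.0, since every grid point is off by exactly 1
```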

## Fine-tuning jobs

Here, we'll run fine-tuning jobs using different data:
- Generated synthetic test data comprising a low-resolution tensor of random float values.
- Real, pre-loaded ERA5 data over the 2025-01-01T00 to 2025-01-31T18 period.

First, load the fine-tuning configs defined in YAML into a dictionary.

In [None]:
with Path("finetune_configs.yaml").open("r") as f:
    finetune_configs = yaml.safe_load(f)

Then, specifying the name of a config defined in YAML, run the job.

For the definition and logic of the Command Component used in this job, see either `setup/components/training` or the registered component in AML.
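
Both job definitions pass a `start_datetime` and require that timestamp and the one six hours before it to exist in the data. The sketch below makes that constraint concrete. The `required_initial_times` helper is hypothetical, and the six-hour step is taken from the comments in the job cells.

```python
from datetime import datetime, timedelta

STEP = timedelta(hours=6)  # spacing between the two states, per the job comments


def required_initial_times(start_datetime: str) -> tuple[datetime, datetime]:
    """Return the two timestamps needed to build the initial state."""
    start = datetime.fromisoformat(start_datetime)
    return (start - STEP, start)


previous, current = required_initial_times("2025-01-01T06:00:00")
print(previous.isoformat(), current.isoformat())
```

With data that begins at 2025-01-01T00, this is why 2025-01-01T06:00:00 is the earliest usable `start_datetime`.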

In [None]:
config_name = input("Enter a config name, e.g. test_short_lead: ").strip()
if not (cfg := finetune_configs.get(config_name)):
    msg = f"Config not found: name={config_name}"
    raise ValueError(msg)

display_name = f"{PARTICIPANT_ID}-{datetime.now(UTC).strftime('%Y%m%d-%H%M%S')}"
experiment_name = f"finetuning-{config_name}"
train_component = get_latest_asset(ml_client.components, name="aurora_finetuning")

train_command = Command(
    component=train_component,
    display_name=display_name,
    experiment_name=experiment_name,
    compute=CLUSTER_NAME,
    inputs={
        "model": Input(type="custom_model", path=MODEL_NAME, mode="ro_mount"),
        "data": Input(type="uri_folder", path=DATA_NAME, mode="ro_mount"),
        # the initial state needs this timestamp and the one 6 hours earlier,
        # so both must exist in the data
        "start_datetime": "2025-01-01T06:00:00",
        # this timestamp is used, if at all, only as a training target
        "end_datetime": "2025-01-31T23:00:00",
        "config": json.dumps(cfg),
    },
    outputs={
        "loss": Output(
            type="uri_file",
            path=OUTFILE_TEMPLATE.format(
                experiment_name=experiment_name,
                display_name=display_name,
                filename="loss.npy",
            ),
            mode="upload",
        ),
        "prediction": Output(
            type="uri_file",
            path=OUTFILE_TEMPLATE.format(
                experiment_name=experiment_name,
                display_name=display_name,
                filename="prediction.nc",
            ),
            mode="rw_mount",
        ),
        "finetuned": Output(
            type="uri_file",
            path=OUTFILE_TEMPLATE.format(
                experiment_name=experiment_name,
                display_name=display_name,
                filename="finetuned.ckpt",
            ),
            mode="rw_mount",
        ),
    },
    limits=CommandJobLimits(timeout=7200),
    distribution=PyTorchDistribution(process_count_per_instance=1),
    environment=train_component.environment,
)

print(f"Submitting fine-tuning job: name={display_name}, config={config_name}")
train_job = ml_client.jobs.create_or_update(train_command)

print("Streaming logs:")
ml_client.jobs.stream(train_job.name)

Next, we register the fine-tuned model in AML using the job output location. This enables us to easily track, version, use, and deploy models.

**NOTE:** there are several ways to specify the location of model assets, [described in the documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-models?view=azureml-api-2&tabs=cli).

In [None]:
model = Model(
    name=f"{display_name}-aurora-finetuned",
    version="1",
    path=f"azureml://jobs/{train_job.name}/outputs/finetuned/paths/",
    description="Fine-tuned Aurora model.",
    tags={"author": PARTICIPANT_ID, "config_name": config_name},
)
ml_client.models.create_or_update(model)

## [Optional] Plotting and evaluating fine-tuning results

Here, we'll download the products of inference and fine-tuning from their respective AML jobs.

First, download the outputs and artefacts (logs etc.) of the inference and fine-tuning jobs. Both jobs must have completed successfully.

The following directories and files, marked with `*`, will be created:
```md
aurora-introductory-workshop/
└── notebooks/
    └── *outputs/
        ├── *inference/
        │   ├── *artifacts/: log files for the job, also visible in the job's "Outputs + logs" tab in the Studio UI.
        │   └── *named-outputs/
        │       └── *predictions/
        │           └── *predictions.nc: forecasts generated in inference with the pre-trained model and ERA5 data.
        └── *training/
            ├── *artifacts/: log files for the job, also visible in the job's "Outputs + logs" tab in the Studio UI.
            └── *named-outputs/
                ├── *loss/
                │   └── *loss.npy: loss history (loss values at each step) of fine-tuning.
                ├── *prediction/
                │   └── *prediction.nc: last forecast generated in inference with the fine-tuned model and ERA5 data.
                └── *finetuned/
                    └── *finetuned.ckpt: fine-tuned model checkpoint.
```
**NOTE:** download specific outputs by replacing `all=True` with `output_name=<name of output>`.

In [None]:
data_dir = Path("outputs")
data_dir.mkdir(exist_ok=True)
inference_out_dir = data_dir / "inference"
inference_out_dir.mkdir(exist_ok=True)
training_out_dir = data_dir / "training"
training_out_dir.mkdir(exist_ok=True)

try:
    print("Downloading inference job outputs for job:", inference_job.display_name)
    ml_client.jobs.download(
        name=inference_job.name,
        download_path=inference_out_dir,
        all=True,
    )
    print("Downloaded inference job outputs to:", inference_out_dir)

    print("Downloading training job outputs for job:", train_job.display_name)
    ml_client.jobs.download(
        name=train_job.name,
        download_path=training_out_dir,
        all=True,
    )
    print("Downloaded fine-tuning outputs to:", training_out_dir)

except (JobException, MlException) as e:
    print("Failed to download job outputs and logs; verify that both jobs succeeded.", e)
    raise

Second, load job outputs.

In [None]:
inference_ds_path = inference_out_dir / "named-outputs/predictions/predictions.nc"
finetune_ds_path = training_out_dir / "named-outputs/prediction/prediction.nc"
loss_arr_path = training_out_dir / "named-outputs/loss/loss.npy"

inference_ds = xr.open_dataset(inference_ds_path)
finetune_ds = xr.open_dataset(finetune_ds_path)
loss_arr = np.load(loss_arr_path)

Finally, plot the downloaded data.

In [None]:
import matplotlib.pyplot as plt
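
One possible starting point for the plotting step is the loss history. The sketch below assumes `loss.npy` holds a one-dimensional array with one loss value per step, as the directory listing above describes; `moving_average` and `plot_loss` are hypothetical helpers, and the example values are made up. Plotting the NetCDF forecasts is left out because the variable names inside those files are not shown here.

```python
import matplotlib.pyplot as plt
import numpy as np


def moving_average(values: np.ndarray, window: int) -> np.ndarray:
    """Smooth a 1-D series with a simple sliding mean."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def plot_loss(loss: np.ndarray, window: int = 3):
    """Plot the raw and smoothed loss history and return the figure."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(loss, label="loss per step")
    smoothed = moving_average(loss, window)
    # Offset the smoothed curve so each mean sits at the end of its window.
    ax.plot(np.arange(window - 1, len(loss)), smoothed, label=f"{window}-step mean")
    ax.set_xlabel("step")
    ax.set_ylabel("loss")
    ax.legend()
    return fig


# Hypothetical loss values; in the notebook, pass loss_arr instead.
example_loss = np.array([4.0, 2.0, 3.0, 1.0, 2.0])
fig = plot_loss(example_loss)
```

In the notebook, `plot_loss(loss_arr)` would draw the downloaded history, with a larger `window` for longer runs.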

## [Optional] Create an Aurora Batch

Using the downloaded NetCDFs produced by previous jobs, try to create an `aurora.Batch` object from the data within.

In [None]:
from aurora import Batch