# Aurora fine‑tuning workshop - master notebook

This is the **one notebook you run**.

It drives everything by submitting **one Azure ML job** to your GPU cluster.  
That job runs `run_aurora_job.py`, which then calls functions in `aurora_demo_core.py`.

You can use this notebook to run:

- **Toy mode** (quick sanity check - random inputs, dummy loss)
- **ERA5 short‑lead fine‑tuning**
- **ERA5 rollout / autoregressive inference**
- **ERA5 rollout fine‑tuning** (long‑lead) with **LoRA**
- **Add one extra variable**


## Files used in this workshop

You will mainly touch **three files**:

1) **`0_aurora_workshop.ipynb`** (this notebook)  
   The “remote control”. You pick a run profile and submit a job.

2) **`run_aurora_job.py`**  
   The Azure ML entrypoint that reads environment variables and decides what to run.

3) **`aurora_demo_core.py`**  
   The core logic: batch creation, inference, short‑lead fine‑tuning, rollout fine‑tuning, LoRA, extra variable support.
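
As a rough illustration of what "reads environment variables and decides what to run" can look like, here is a minimal sketch. It is not the actual code of `run_aurora_job.py`: the function name `read_job_config` and every default value are assumptions made for this example. Only the variable names (`FLOW`, `FT_MODE`, `FINETUNE_STEPS`, `USE_LORA`, `LR`) come from this notebook.

```python
import os


def read_job_config(env=None):
    """Parse a few job settings from environment variables.

    Hypothetical helper: names and defaults are illustrative assumptions,
    not the workshop's real entrypoint logic.
    """
    env = os.environ if env is None else env
    return {
        "flow": env.get("FLOW", "toy").lower(),
        "ft_mode": env.get("FT_MODE", "short").lower(),
        # Environment variables are always strings, so convert explicitly.
        "finetune_steps": int(env.get("FINETUNE_STEPS", "0")),
        "use_lora": env.get("USE_LORA", "0") == "1",
        "lr": float(env.get("LR", "3e-5")),
    }


# Example with a plain dict standing in for the job's environment.
cfg = read_job_config(
    {"FLOW": "era5", "FT_MODE": "rollout", "FINETUNE_STEPS": "5", "LR": "3e-05"}
)
print(cfg["flow"], cfg["ft_mode"], cfg["finetune_steps"], cfg["use_lora"])
```

This is why every value in the run profiles further down is written as a string such as `"5"` or `"1"`: the job side is expected to convert them back to numbers and flags.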

## Before you run

### 1) You need a GPU compute cluster
Set `COMPUTE_NAME` to the name of your Azure ML compute cluster.  
The cluster should be created with the **Standard_NC40ads_H100_v5** VM size.

### 2) You need an environment that has Aurora installed
Set `ENV_NAME` to the Azure ML custom environment.

### 3) If you want ERA5 runs, you need two data assets
- **Dynamic ERA5 subset**: Zarr folder (URI_FOLDER)  
- **Static file**: `era5_static.nc` (URI_FILE) containing `lsm`, `slt`, `z`
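
One quick way to confirm that a static file is usable is to check that it exposes all three variables. The helper below is a hypothetical sketch (`missing_static_vars` is not part of the workshop code). It accepts any collection of variable names, so it can be fed `ds.data_vars` from xarray.

```python
# Variables the static file is expected to contain, per the list above.
REQUIRED_STATIC_VARS = ("lsm", "slt", "z")


def missing_static_vars(available):
    """Return the required static variables that are absent from `available`."""
    have = set(available)
    return [name for name in REQUIRED_STATIC_VARS if name not in have]


# Typical use with xarray (not executed here):
#   import xarray as xr
#   ds = xr.open_dataset("era5_static.nc")
#   print(missing_static_vars(ds.data_vars))

print(missing_static_vars(["lsm", "z"]))  # ['slt']
```

An empty list means the file has everything the ERA5 flows need.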



In [1]:
# (Optional) Installs the packages this notebook needs. Comment this line out if your kernel already has them.
%pip -q install azure-ai-ml azure-identity numpy matplotlib xarray zarr gcsfs netcdf4



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/anaconda/envs/azureml_py310_sdkv2/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Set PARTICIPANT_ID below so that your jobs/experiments are trackable and you get a separate output path
PARTICIPANT_ID = "saadat"   # TODO: each participant changes this

In [3]:
# Connect to the Azure ML workspace

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

try:
    ml_client = MLClient.from_config(DefaultAzureCredential())
    print("Connected using MLClient.from_config()")
except Exception as e:
    print("MLClient.from_config() failed. Falling back to manual settings.")
    print("Error:", repr(e))

    # ---- Fallback: fill these in  ----
    SUBSCRIPTION_ID = "62118f5c-be37-400f-9f20-a8b77a2a7877"
    RESOURCE_GROUP  = "aurora-workshop-rg"
    WORKSPACE_NAME  = "aurora-workshop-aml-ws"

    ml_client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=SUBSCRIPTION_ID,
        resource_group_name=RESOURCE_GROUP,
        workspace_name=WORKSPACE_NAME,
    )

print("Workspace:", ml_client.workspace_name)

# ---- Set your compute + environment here ----
COMPUTE_NAME = "aurora-ws-gpu-ins"        # TODO: Azure ML compute cluster name
ENV_NAME     = "azureml:aurora-environment:3" # TODO: Azure ML environment name

print("Compute:", COMPUTE_NAME)
print("Environment:", ENV_NAME)


Found the config file in: /config.json


Connected using MLClient.from_config()
Workspace: aurora-workshop-aml-ws
Compute: aurora-ws-gpu-ins
Environment: azureml:aurora-environment:3


## (Optional) Register your ERA5 subset as Data Assets

**For this workshop the data assets are already registered, so you can skip this section.**

If you do need to register them (this is only required once), you will register:
- the **dynamic Zarr folder** as a **URI_FOLDER**
- the **static NetCDF** as a **URI_FILE**


In [4]:
# Optional: register local files as Azure ML Data Assets

# Only required to do this once. After that, you can just use the azureml:... references
# in the next cells.

DO_REGISTER_ASSETS = False  # The workshop data assets are already registered, so leave this as False

# If True, set these to your local paths 
LOCAL_ERA5_ZARR_PATH   = r"../era5_subsets/era5_aurora_2025-09_6hourly.zarr" # Folder (Zarr store)
LOCAL_ERA5_STATIC_PATH = r"../era5_subsets/era5_static.nc"                 # file (NetCDF file)

if DO_REGISTER_ASSETS:
    from azure.ai.ml.entities import Data
    from azure.ai.ml.constants import AssetTypes

    # Pick names
    DYNAMIC_NAME = "era5_aurora_2025-09_6hourly_zarr"
    STATIC_NAME  = "era5_static_nc"

    dynamic_asset = Data(
        name=DYNAMIC_NAME,
        version="2",
        type=AssetTypes.URI_FOLDER,
        path=LOCAL_ERA5_ZARR_PATH,
        description="ERA5 subset for Aurora workshop (dynamic variables, 6-hourly).",
    )

    static_asset = Data(
        name=STATIC_NAME,
        version="2",
        type=AssetTypes.URI_FILE,
        path=LOCAL_ERA5_STATIC_PATH,
        description="Static fields for Aurora workshop (lsm, slt, z).",
    )

    dynamic_asset = ml_client.data.create_or_update(dynamic_asset)
    static_asset  = ml_client.data.create_or_update(static_asset)

    print("Registered dynamic asset:", f"azureml:{dynamic_asset.name}:{dynamic_asset.version}")
    print("Registered static asset :", f"azureml:{static_asset.name}:{static_asset.version}")
else:
    print("Skipping asset registration (DO_REGISTER_ASSETS=False).")


Your file exceeds 100 MB. If you experience low speeds, latency, or broken connections, we recommend using the AzCopyv10 tool for this file transfer.

Example: azcopy copy '/mnt/batch/tasks/shared/LS_root/mounts/clusters/aurora-ws-cpu-cls/code/Users/saadatali/era5_subsets/era5_aurora_2025-09_6hourly.zarr' 'https://auroraworkshop7918090421.blob.core.windows.net/azureml-blobstore-fe6df4e0-19e0-41f0-9fdb-8eddf05a138a/LocalUpload/379a26ba761d6e084916a45b15f7223e7e3700590307d816a23fc98e17095560/era5_aurora_2025-09_6hourly.zarr' 

See https://learn.microsoft.com/azure/storage/common/storage-use-azcopy-v10 for more information.
Uploading era5_static.nc (< 1 MB): 100%|██████████| 12.5M/12.5M [00:00<00:00, 54.6MB/s]

Registered dynamic asset: azureml:era5_aurora_2025-09_6hourly_zarr:2
Registered static asset : azureml:era5_static_nc:2


## Pick a run profile

Pick a **profile** below, and the notebook will set sensible defaults.

### Profiles you can use
- `toy`  
  Quick sanity check. No ERA5 inputs needed; the job generates random dummy data.

- `era5_short`  
  Real short‑lead fine‑tuning on ERA5: `[t-6h, t] → t+6h`.

- `era5_short_lora`  
  Same as above, but trains LoRA only.

- `era5_rollout_infer_48h`  
  Autoregressive **inference** for 8 steps (48 hours at 6h lead).

- `era5_rollout_ft_safe`  
  Rollout **fine‑tuning** with LoRA, light settings.

- `era5_rollout_ft_2days`  
  Rollout fine‑tuning with LoRA for 8 steps (48h).

- `era5_add_variable_demo`  
  Shows how to add **one extra surface variable**.  
  You must point `ERA5_ZARR_ASSET` to a PLUS dataset that actually contains that variable.

- `era5_rollout_no_lora`  
  Rollout fine‑tuning for 8 steps (48h) with LoRA disabled, so the full model is updated.

### A note on “stages”
- In inference, stages = `INFER_ROLLOUT_STEPS`
- In rollout fine‑tuning, stages per update = `ROLLOUT_HORIZON_STEPS`

Runtime roughly scales like:
`FINETUNE_STEPS × ROLLOUT_HORIZON_STEPS`
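
To make that scaling concrete, the sketch below compares the rollout fine‑tuning profiles by the number of model stages they run. The helper `relative_cost` is a hypothetical name. The result is only a relative measure, not a time estimate, since actual runtime also depends on crop size, LoRA settings, and hardware.

```python
def relative_cost(finetune_steps, rollout_horizon_steps=1):
    """Number of forward stages executed during fine-tuning (relative cost only)."""
    return finetune_steps * rollout_horizon_steps


# (FINETUNE_STEPS, ROLLOUT_HORIZON_STEPS) taken from the profiles in this notebook.
rollout_profiles = {
    "era5_rollout_ft_safe": (2, 4),
    "era5_rollout_ft_2days": (3, 8),
    "era5_rollout_no_lora": (5, 8),
}

costs = {name: relative_cost(*cfg) for name, cfg in rollout_profiles.items()}
for name, cost in costs.items():
    print(f"{name}: {cost} stages")
# era5_rollout_ft_safe: 8 stages
# era5_rollout_ft_2days: 24 stages
# era5_rollout_no_lora: 40 stages
```

By this measure `era5_rollout_ft_2days` does about three times the work of `era5_rollout_ft_safe`, which is why the safe profile is the suggested starting point.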


In [8]:
# Choose your profile and configure assets
# ----------------------------------------
# Fill in your Data Asset references here.
#
# If you want to run the "add variable" demo, create a PLUS dataset (separate Zarr folder)
# that includes the extra variable, register it as a Data Asset, and point ERA5_ZARR_ASSET
# to that PLUS dataset.

# Base dataset (no extra variables)
ERA5_ZARR_ASSET   = "azureml:era5_aurora_2025-09_6hourly_zarr:2"   # TODO: replace
ERA5_STATIC_ASSET = "azureml:era5_static_nc:2"                  # TODO: replace

# Additional dataset with extra variable
ERA5_ZARR_ASSET_PLUS = "azureml:era5_aurora_2025-09_6hourly_PLUS_2m_dewpoint_temperature:1"  # TODO: replace

# Safe crop sizes (global 721x1440 is too big for workshop training)
CROP_LAT = 128
CROP_LON = 256   # keep lon multiple of 4

# Learning rate used for fine-tuning
# The learning rate (LR) controls how big each weight update is each step.
# Too big → updates jump too far, training can become unstable (loss spikes, model “breaks”).
# Too small → training is very slow (loss barely moves).
# 3e-5 means 3 × 10⁻⁵, which equals 0.00003

LR = 3e-5

# Profiles: env vars we pass to the Azure ML job


profiles = {

    # --- Toy mode ---
    # Creates a fake Aurora Batch (random data), runs inference,
    # then does 3 training updates.

    "toy": {                      
        "FLOW": "toy",           
        "FINETUNE_STEPS": "3",
        "DEVICE": "cuda",
    },

    # --- ERA5 short-lead ---
    # Short-lead training using ERA5 data (LoRA disabled)
    "era5_short": {
        "FLOW": "era5",
        "FT_MODE": "short",
        "FINETUNE_STEPS": "5",
        "ERA5_LEAD_HOURS": "6",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "ERA5_TIME_INDEX": "10",
        "AUTOCAST": "1",
        "LR": str(LR),
        "USE_LORA": "0",
        "LOG_MLFLOW": "1",
        "RUN_NAME": "era5_short",
    },

    "era5_short_lora": {
        "FLOW": "era5",
        "FT_MODE": "short",
        "FINETUNE_STEPS": "5",
        "ERA5_LEAD_HOURS": "6",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "ERA5_TIME_INDEX": "10",
        "AUTOCAST": "1",
        "LR": str(LR),
        "USE_LORA": "1",
        "TRAIN_LORA_ONLY": "1",
        "LORA_MODE": "single",
        "LORA_STEPS": "40",
        "LOG_MLFLOW": "1",
    },

    # --- Autoregressive inference (no training) ---
    "era5_rollout_infer_48h": {
        "FLOW": "era5",
        "FT_MODE": "short",              # fine-tune mode doesn't matter when FINETUNE_STEPS=0
        "FINETUNE_STEPS": "0",
        "ERA5_LEAD_HOURS": "6",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "ERA5_TIME_INDEX": "10",
        "AUTOCAST": "1",
        "INFER_ROLLOUT_STEPS": "8",      # 8 × 6h = 48 hours
    },

    # --- Rollout fine-tuning (safe starter) ---
    "era5_rollout_ft_safe": {
        "FLOW": "era5",
        "FT_MODE": "rollout",
        "FINETUNE_STEPS": "2",
        "ROLLOUT_HORIZON_STEPS": "4",    # 4 × 6h = 24h horizon per update (fast + safe)
        "ROLLOUT_LOSS_ON": "last",       # last = cheaper on memory (recommended)
        "ERA5_LEAD_HOURS": "6",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "ERA5_TIME_INDEX": "10",
        "AUTOCAST": "1",
        "LR": str(LR),
        "USE_LORA": "1",
        "TRAIN_LORA_ONLY": "1",
        "LORA_MODE": "all",
        "LORA_STEPS": "40",
        "LOG_MLFLOW": "1",
    },

    "era5_rollout_ft_2days": {
        "FLOW": "era5",
        "FT_MODE": "rollout",
        "FINETUNE_STEPS": "3",
        "ROLLOUT_HORIZON_STEPS": "8",    # 8 × 6h = 48h horizon per update
        "ROLLOUT_LOSS_ON": "last",
        "ERA5_LEAD_HOURS": "6",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "ERA5_TIME_INDEX": "10",
        "AUTOCAST": "1",
        "LR": str(LR),
        "USE_LORA": "1",
        "TRAIN_LORA_ONLY": "1",
        "LORA_MODE": "all",
        "LORA_STEPS": "40",
        "LOG_MLFLOW": "1",
    },

    # --- Add-variable demo (example: 2m dewpoint) ---
    "era5_add_variable_demo": {
        "FLOW": "era5",
        "FT_MODE": "short",
        "FINETUNE_STEPS": "3",
        "ERA5_LEAD_HOURS": "6",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "ERA5_TIME_INDEX": "10",
        "AUTOCAST": "1",
        "LR": str(LR),
        "USE_LORA": "1",
        "TRAIN_LORA_ONLY": "1",
        "LORA_MODE": "single",
        "LORA_STEPS": "40",

        # Extra variable config (surface var example)
        "EXTRA_KIND": "surf",
        "EXTRA_KEY": "d2m",                       # Aurora key you choose
        "EXTRA_SRC": "2m_dewpoint_temperature",   # must exist in the PLUS dataset
        "EXTRA_LOCATION": "0.0",                  # set real stats if you can
        "EXTRA_SCALE": "1.0",
    },

    # --- Rollout fine-tuning without LoRA ---
    "era5_rollout_no_lora": {
        "FLOW": "era5",
        "FT_MODE": "rollout",
        "FINETUNE_STEPS": "5",
        "ERA5_LEAD_HOURS": "6",
        "ERA5_TIME_INDEX": "10",
        "ERA5_CROP_LAT": str(CROP_LAT),
        "ERA5_CROP_LON": str(CROP_LON),
        "AUTOCAST": "1",
        "LR": str(LR),
        "ROLLOUT_HORIZON_STEPS": "8",    # 8 × 6h = 48h horizon per update
        "ROLLOUT_LOSS_ON": "last",
        "USE_LORA": "0",
        "TRAIN_LORA_ONLY": "0",
        "LOG_MLFLOW": "1",
    },

}

# Pick one:
RUN_PROFILE = "era5_rollout_no_lora"   # toy | era5_short | era5_short_lora | era5_rollout_infer_48h | era5_rollout_ft_safe | era5_rollout_ft_2days | era5_add_variable_demo | era5_rollout_no_lora

if RUN_PROFILE not in profiles:
    raise ValueError(f"Unknown RUN_PROFILE: {RUN_PROFILE}. Choose one of: {list(profiles)}")

env_vars = dict(profiles[RUN_PROFILE]) 

# Make sure we always set these
env_vars["PARTICIPANT_ID"] = PARTICIPANT_ID
env_vars["DEVICE"] = env_vars.get("DEVICE", "cuda")

# This helps reduce CUDA memory fragmentation in some training runs
env_vars["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

# Pick which dynamic asset to mount
if RUN_PROFILE == "era5_add_variable_demo":
    SELECTED_ERA5_ZARR_ASSET = ERA5_ZARR_ASSET_PLUS
else:
    SELECTED_ERA5_ZARR_ASSET = ERA5_ZARR_ASSET

print("RUN_PROFILE:", RUN_PROFILE)
print("Dynamic asset:", SELECTED_ERA5_ZARR_ASSET)
print("Static asset :", ERA5_STATIC_ASSET)
print("Env vars to job (preview):")
for k in sorted(env_vars.keys()):
    print(f"  {k}={env_vars[k]}")


RUN_PROFILE: era5_rollout_no_lora
Dynamic asset: azureml:era5_aurora_2025-09_6hourly_zarr:2
Static asset : azureml:era5_static_nc:2
Env vars to job (preview):
  AUTOCAST=1
  DEVICE=cuda
  ERA5_CROP_LAT=128
  ERA5_CROP_LON=256
  ERA5_LEAD_HOURS=6
  ERA5_TIME_INDEX=10
  FINETUNE_STEPS=5
  FLOW=era5
  FT_MODE=rollout
  LOG_MLFLOW=1
  LR=3e-05
  PARTICIPANT_ID=saadat
  PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
  ROLLOUT_HORIZON_STEPS=8
  ROLLOUT_LOSS_ON=last
  TRAIN_LORA_ONLY=0
  USE_LORA=0


In [9]:
# Submit the Azure ML job
# This creates one job that runs on the GPU cluster.
# The job executes:  python run_aurora_job.py

# NOTE: For ERA5 flows, we mount two inputs:
#   - era5_zarr   (URI_FOLDER)
#   - era5_static (URI_FILE)
# and pass their mounted paths to the script via env vars.

from datetime import datetime, timezone
from azure.ai.ml import command, Input, Output



# Make a unique job name every submission
run_suffix = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")

# Separate experiments so toy doesn't mix with ERA5
flow = env_vars.get("FLOW", "toy").lower()
if flow == "toy":
    EXPERIMENT_NAME = f"aurora_workshop_{PARTICIPANT_ID}_toy"
else:
    EXPERIMENT_NAME = f"aurora_workshop_{PARTICIPANT_ID}_era5"

JOB_NAME = f"aurora-{PARTICIPANT_ID}-{RUN_PROFILE}-{run_suffix}"

DISPLAY_NAME = f"Aurora | {PARTICIPANT_ID} | {RUN_PROFILE} | {run_suffix}"

# Build inputs only when needed
inputs = {}
if flow == "era5":  # use the normalized value computed above
    inputs = {
        "era5_zarr": Input(type="uri_folder", path=SELECTED_ERA5_ZARR_ASSET, mode="ro_mount"),
        "era5_static": Input(type="uri_file", path=ERA5_STATIC_ASSET, mode="ro_mount"),
    }


# Output folder (Azure ML will mount this)
outputs = {"out_dir": Output(type="uri_folder", mode="rw_mount")}

env_vars["PARENT_JOB_NAME"] = JOB_NAME

job = command(
    name=JOB_NAME,
    display_name=DISPLAY_NAME,
    experiment_name=EXPERIMENT_NAME,
    code=".",
    command=(
        "python -m pip uninstall -y mlflow mlflow-skinny mlflow-tracing || true && "
        "python -m pip install -q --upgrade 'zarr<3' numcodecs fasteners asciitree && "
        "python -m pip install -q --upgrade azureml-mlflow && "
        "export OUT_DIR='${{outputs.out_dir}}' && "
        + (
            # Only ERA5 flows have mounted inputs to export
            "export ERA5_ZARR_PATH='${{inputs.era5_zarr}}' && "
            "export ERA5_STATIC_NC='${{inputs.era5_static}}' && "
            if flow == "era5"
            else ""
        )
        + "PYTHONPATH=$PWD python run_aurora_job.py"
    ),
    environment=ENV_NAME,
    compute=COMPUTE_NAME,
    inputs=inputs,
    outputs=outputs,
    environment_variables=env_vars,
    tags={
        "participant_id": PARTICIPANT_ID,
        "run_profile": RUN_PROFILE,
        "flow": flow,
    },
    resources={"instance_count": 1},
)

print("Submitting job:", JOB_NAME)
returned_job = ml_client.jobs.create_or_update(job)
print("Job name:", returned_job.name)

print("Streaming logs…")
ml_client.jobs.stream(returned_job.name)


Submitting job: aurora-saadat-era5_rollout_no_lora-20260115-090019
Job name: aurora-saadat-era5_rollout_no_lora-20260115-090019
Streaming logs…
RunId: aurora-saadat-era5_rollout_no_lora-20260115-090019
Web View: https://ml.azure.com/runs/aurora-saadat-era5_rollout_no_lora-20260115-090019?wsid=/subscriptions/62118f5c-be37-400f-9f20-a8b77a2a7877/resourcegroups/aurora-workshop-rg/workspaces/aurora-workshop-aml-ws


Your file exceeds 100 MB. If you experience low speeds, latency, or broken connections, we recommend using the AzCopyv10 tool for this file transfer.

Example: azcopy copy '/mnt/batch/tasks/shared/LS_root/mounts/clusters/aurora-ws-cpu-cls/code/Users/saadatali/Updated' 'https://auroraworkshop7918090421.blob.core.windows.net/fe6df4e0-19e0-41f0-9fdb-8eddf05a138a-p8lfl15ycgwib8elz6zzrus74z/Updated' 

See https://learn.microsoft.com/azure/storage/common/storage-use-azcopy-v10 for more information.
Uploading Updated (5045.53 MBs): 100%|██████████| 5045530808/5045530808 [00:15<00:00, 319657244.05it/s]

pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored


## After the job finishes

The next cell downloads the job outputs to a local folder, then shows quick plots.

Files you’ll typically see:
- `inference_2t.npy`
- `finetune_last_2t.npy` (if FINETUNE_STEPS > 0)
- `inference_rollout_2t.npy` (if INFER_ROLLOUT_STEPS was set)
- `finetune_losses.npy` / `finetune_losses.json`
- `finetuned_state_dict.pt` (full model, or LoRA-only weights if TRAIN_LORA_ONLY=1)

If you’re doing rollout inference, the rollout file is a stack:
`(steps, batch, time, H, W)`.
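
As a small sketch of how to work with that layout, the example below builds a random array with the same axis order and slices it. The array is a stand‑in for `inference_rollout_2t.npy`. The sizes (8 steps, 128×256 crop) and the 6‑hour lead are taken from the profiles in this notebook, and singleton `batch` and `time` axes are assumed.

```python
import numpy as np

# Stand-in for np.load("inference_rollout_2t.npy"): (steps, batch, time, H, W)
rng = np.random.default_rng(0)
steps, batch, time, H, W = 8, 1, 1, 128, 256
roll = rng.normal(loc=280.0, scale=5.0, size=(steps, batch, time, H, W))

# Lead time of each rollout step, assuming a 6-hour lead per step.
lead_hours = 6 * np.arange(1, steps + 1)

# 2D field at the final step (first batch element, first time slot).
last_field = roll[-1, 0, 0]

# One summary value per step, useful for spotting drift over the rollout.
mean_per_step = roll.mean(axis=(1, 2, 3, 4))

print(last_field.shape)      # (128, 256)
print(int(lead_hours[-1]))   # 48
print(mean_per_step.shape)   # (8,)
```

Indexing with `[-1, 0, 0]` is the same selection the plotting cell uses for the "rollout last step" panel.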


In [34]:
# Download outputs + quick plots
# ------------------------------
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

download_root = Path("job_outputs") / returned_job.name
download_root.mkdir(parents=True, exist_ok=True)

print("Downloading outputs to:", download_root)

# Azure ML SDK has slightly different download() signatures across versions, so we try a couple.
# all=True is tried first: without it, download() fetches only the default artifacts (logs),
# not the named output "out_dir" where the job writes its .npy files.
downloaded = False
last_err = None
for kwargs in [
    {"name": returned_job.name, "download_path": str(download_root), "all": True},
    {"name": returned_job.name, "download_path": str(download_root)},
    {"job_name": returned_job.name, "download_path": str(download_root)},
]:
    try:
        ml_client.jobs.download(**kwargs)
        downloaded = True
        break
    except TypeError as e:
        # Signature mismatch for this SDK version; remember it and try the next variant
        last_err = e
        continue
    except Exception as e:
        last_err = e

if not downloaded:
    raise RuntimeError(f"Could not download job outputs. Last error: {repr(last_err)}")

# The job writes outputs under out_dir/<PARTICIPANT_ID>/...
# We'll try to find that folder. If we can't, we just use the download root.
participant_dir = None
for p in download_root.rglob(PARTICIPANT_ID):
    if p.is_dir():
        participant_dir = p
        break

if participant_dir is None:
    participant_dir = download_root

print("Using output folder:", participant_dir)

def load_npy(name: str):
    p = participant_dir / name
    if p.exists():
        return np.load(p)
    return None

inf_2t = load_npy("inference_2t.npy")
fin_2t = load_npy("finetune_last_2t.npy")
roll_2t = load_npy("inference_rollout_2t.npy")
losses = load_npy("finetune_losses.npy")

print("Found files:")
for fname in [
    "inference_2t.npy",
    "finetune_last_2t.npy",
    "inference_rollout_2t.npy",
    "finetune_losses.npy",
    "finetuned_state_dict.pt",
]:
    print(" -", fname, "✅" if (participant_dir / fname).exists() else "—")

def show_field(ax, arr, title):
    im = ax.imshow(arr, origin="lower", aspect="auto")
    ax.set_title(title)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

# Extract 2D fields for plotting
inf_field = inf_2t[0, 0] if inf_2t is not None else None
fin_field = fin_2t[0, 0] if fin_2t is not None else None

# For rollout inference, take the last step for display (shape: steps,1,1,H,W)
roll_last_field = roll_2t[-1, 0, 0] if roll_2t is not None else None

plots = []
if inf_field is not None:
    plots.append(("Inference (single-step)", inf_field))
if roll_last_field is not None:
    plots.append(("Inference (rollout last step)", roll_last_field))
if fin_field is not None:
    plots.append(("After fine-tuning", fin_field))
if (inf_field is not None) and (fin_field is not None):
    plots.append(("Difference (finetune - inference)", fin_field - inf_field))

if plots:
    fig, axes = plt.subplots(1, len(plots), figsize=(5 * len(plots), 4), constrained_layout=True)
    if len(plots) == 1:
        axes = [axes]
    for ax, (title, arr) in zip(axes, plots):
        show_field(ax, arr, title)
    plt.show()
else:
    print("No 2t output files were found to plot.")

# Plot loss curve if we have it
if losses is not None and len(losses) > 0:
    plt.figure(figsize=(6, 4))
    plt.plot(losses)
    plt.title("Fine-tuning loss (per step)")
    plt.xlabel("Step")
    plt.ylabel("Loss")
    plt.grid(True, alpha=0.3)
    plt.show()


Downloading outputs to: job_outputs/aurora-saadat-era5_short-20260111-214538
Using output folder: job_outputs/aurora-saadat-era5_short-20260111-214538
Found files:
 - inference_2t.npy —
 - finetune_last_2t.npy —
 - inference_rollout_2t.npy —
 - finetune_losses.npy —
 - finetuned_state_dict.pt —
No 2t output files were found to plot.


Downloading artifact azureml://datastores/workspaceartifactstore/ExperimentRun/dcid.aurora-saadat-era5_short-20260111-214538 to job_outputs/aurora-saadat-era5_short-20260111-214538/artifacts
