# Aurora Workshop User Guide

This notebook is a **documentation/wiki-style guide** for the workshop codebase:
- `0_aurora_workshop.ipynb` (job submission “remote control”)
- `run_aurora_job.py` (Azure ML entrypoint)
- `aurora_demo_core.py` (model/data “engine”)

It explains **what each piece does**, **how the flows work**, **what inputs/outputs are expected**, and **how to troubleshoot common errors**.



## Contents
1. Overview
2. Code layout and responsibilities
3. Concepts you need (Batch, variables, lead time, cropping)
4. Inputs and data contracts (ERA5 Zarr + static NetCDF)
5. Outputs (what files get written and how to interpret them)
6. How execution works end-to-end
7. Configuration reference (profiles & environment variables)
8. Flows explained
   - Toy inference / toy fine-tune
   - ERA5 short-lead inference
   - ERA5 short-lead fine-tuning
   - ERA5 rollout (autoregressive) fine-tuning
   - LoRA modes (short vs rollout; LoRA-only training)
   - “Add variable” demo (EXTRA_* env vars)
9. Recipes (copy/paste profiles)
10. Troubleshooting (the errors you actually hit)
11. Extending the workshop (new vars, new losses, bigger runs)



## 1) Overview

This workshop code is split into **three layers**:

### A) Notebook (controller)
You choose a **profile** (toy / era5 short / era5 rollout / LoRA variants), set up Azure ML inputs (datasets),
and submit a job. The notebook contains **no model logic**.

### B) Job runner: `run_aurora_job.py`
Azure ML executes this on the compute node. It:
- reads environment variables (FLOW, FT_MODE, USE_LORA, etc.)
- resolves mounted input paths for ERA5 data
- calls the appropriate function in `aurora_demo_core.py`
- saves outputs (`*.npz`, `*.npy`, `*.json`, `*.pt`) under `OUT_DIR/<participant_id>/`

### C) Core engine: `aurora_demo_core.py`
This is the “engine”. It:
- builds Aurora batches (toy and ERA5)
- constructs `AuroraPretrained` (optionally with LoRA and/or an extra variable)
- runs inference
- runs fine-tuning loops:
  - **short-lead** supervised (single step)
  - **rollout** supervised (autoregressive / unrolled)



## 2) Code layout and responsibilities

### Files you will touch most
- **`0_aurora_workshop.ipynb`**  
  - defines “profiles” (sets env vars)
  - mounts Azure ML inputs:
    - dynamic ERA5 Zarr folder (URI_FOLDER)
    - static NetCDF (URI_FILE)
  - submits the job

- **`run_aurora_job.py`**
  - reads env vars
  - validates required inputs (e.g., when FLOW=era5)
  - runs inference
  - optionally runs fine-tuning (`FINETUNE_STEPS > 0`)
  - writes outputs

- **`aurora_demo_core.py`**
  - implements toy and ERA5 logic
  - key functions:
    - `run_inference_era5(...)`
    - `run_finetuning_era5_short_lead(...)`
    - `run_finetuning_era5_rollout(...)`
    - `build_model(...)`
    - `make_era5_short_lead_pair(...)`



## 3) Key concepts

### 3.1 Aurora `Batch` and shapes
Aurora uses a `Batch` object with:
- `surf_vars`: dict of surface fields
- `static_vars`: dict of static fields (2D)
- `atmos_vars`: dict of multi-level fields
- `metadata`: lat/lon/time/levels

**Shape convention used in this workshop:**
- Surface history input: `(B, T, H, W)`  
- Atmos history input: `(B, T, C, H, W)` where `C` is number of pressure levels
- Static vars: `(H, W)` (no time/history dimension)

In the code, history length `T` is **2**: `[t-lead, t]`.
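
A quick way to sanity-check these shapes before handing arrays to Aurora is sketched below. It uses plain NumPy arrays and dictionaries as stand-ins for the real `Batch` fields; the helper name `check_shapes` and the sizes are illustrative assumptions, not part of the workshop code.

```python
import numpy as np

# Illustrative sizes only; T=2 is the history length used in this workshop.
B, T, C, H, W = 1, 2, 4, 8, 16

# Plain dicts standing in for Batch.surf_vars / static_vars / atmos_vars.
surf_vars = {"2t": np.zeros((B, T, H, W), dtype=np.float32)}
static_vars = {"lsm": np.zeros((H, W), dtype=np.float32)}
atmos_vars = {"t": np.zeros((B, T, C, H, W), dtype=np.float32)}


def check_shapes(surf, static, atmos):
    """Return a list of human-readable shape problems (empty if all is well)."""
    problems = []
    for name, arr in surf.items():
        if arr.ndim != 4:
            problems.append(f"surf '{name}' should be (B, T, H, W), got {arr.shape}")
    for name, arr in static.items():
        if arr.ndim != 2:
            problems.append(f"static '{name}' should be (H, W), got {arr.shape}")
    for name, arr in atmos.items():
        if arr.ndim != 5:
            problems.append(f"atmos '{name}' should be (B, T, C, H, W), got {arr.shape}")
    return problems


print(check_shapes(surf_vars, static_vars, atmos_vars))
```

A static field with a leftover leading dimension, such as `(1, H, W)`, would be reported by this check.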

### 3.2 “Dynamic” vs “Static” variables
- **Dynamic** variables change with time and live in your Zarr:
  - surface: 2m temperature, 10m winds, MSLP
  - atmos: temperature, u/v wind, humidity, geopotential (with levels)
- **Static** variables are 2D fields:
  - `lsm` (land-sea mask)
  - `slt` (soil type)
  - `z` (surface geopotential/orography-like field)

Aurora expands each 2D static var into `(B, T, H, W)` internally by repeating it.
That’s why static vars must be **exactly 2D**.

### 3.3 Lead time and indexing
You choose a forecast lead time like `ERA5_LEAD_HOURS=6`.
The code infers the dataset’s time step (`dt_hours`) from `ds.time[1] - ds.time[0]`.
Then it converts lead hours to an index step:

`step = lead_hours / dt_hours`
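
The conversion only makes sense when the lead time is a whole multiple of the dataset time step. A minimal sketch of that check follows; the function name `lead_to_step` is hypothetical, and whether the workshop code raises or rounds in the non-multiple case is not something this guide states.

```python
def lead_to_step(lead_hours: int, dt_hours: int) -> int:
    """Convert a lead time in hours to a number of time-index steps."""
    if dt_hours <= 0:
        raise ValueError(f"dt_hours must be positive, got {dt_hours}")
    if lead_hours <= 0 or lead_hours % dt_hours != 0:
        raise ValueError(
            f"lead_hours={lead_hours} must be a positive multiple of dt_hours={dt_hours}"
        )
    return lead_hours // dt_hours


# With 6-hourly data a 6 h lead is one index step; with hourly data it is six.
print(lead_to_step(6, 6), lead_to_step(6, 1))
```

For a chosen time index `t`, the input history is then read at `[t - step, t]` and the short-lead target at `t + step`.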

### 3.4 Cropping
To keep workshop runs small, the code center-crops the ERA5 subset:
- `ERA5_CROP_LAT` controls height `H`
- `ERA5_CROP_LON` controls width `W`

**Important:** static NetCDF is aligned to the (cropped) lat/lon grid.
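
Center-cropping along one axis can be sketched as below. The helper `center_slice` is a hypothetical name, and the 721 x 1440 grid is only an example size; your ERA5 subset may be smaller.

```python
def center_slice(size: int, crop: int) -> slice:
    """Slice selecting `crop` contiguous points centered in an axis of length `size`."""
    if not 0 < crop <= size:
        raise ValueError(f"crop={crop} must be in 1..{size}")
    start = (size - crop) // 2
    return slice(start, start + crop)


# Example: a 721 x 1440 grid cropped to ERA5_CROP_LAT=128, ERA5_CROP_LON=256.
lat_sl = center_slice(721, 128)
lon_sl = center_slice(1440, 256)
print(lat_sl, lon_sl)
```

Applying the same two slices to the dynamic and static fields is one way to keep both on an identical grid.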



## 4) Inputs and data contracts

### 4.1 Dynamic ERA5 Zarr input (required for FLOW=era5)
The job runner expects `ERA5_ZARR_PATH` to point to a **local mounted folder** that contains a Zarr store.
It opens it with `xr.open_zarr(..., consolidated=True)` and then:
- standardises lon to `[0, 360)` and sorts
- ensures latitude is decreasing (north→south)
- optionally selects a set of “Aurora levels” if the dataset has a `level` coordinate
- crops to center patch
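
The longitude and latitude conventions above can be illustrated with plain NumPy. This is a stand-in sketch (the workshop code operates on xarray datasets), and `standardise_grid` is a hypothetical helper name.

```python
import numpy as np


def standardise_grid(field, lat, lon):
    """Return (field, lat, lon) with lon in [0, 360) ascending and lat descending.

    `field` has shape (..., lat, lon).
    """
    lon = np.mod(lon, 360.0)           # map -180..180 onto 0..360
    lon_order = np.argsort(lon)        # ascending longitude
    lat_order = np.argsort(lat)[::-1]  # north -> south
    field = field[..., lat_order, :][..., :, lon_order]
    return field, lat[lat_order], lon[lon_order]


lat = np.array([-10.0, 0.0, 10.0])
lon = np.array([-180.0, -90.0, 0.0, 90.0])
field = np.arange(12, dtype=float).reshape(3, 4)
f2, lat2, lon2 = standardise_grid(field, lat, lon)
print(lat2, lon2)
```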

**Variables expected by default (from `aurora_demo_core.py`):**
Surface mapping (Aurora key → ERA5 name):
- `2t`  → `2m_temperature`
- `10u` → `10m_u_component_of_wind`
- `10v` → `10m_v_component_of_wind`
- `msl` → `mean_sea_level_pressure`

Atmos mapping (Aurora key → ERA5 name):
- `t` → `temperature`
- `u` → `u_component_of_wind`
- `v` → `v_component_of_wind`
- `q` → `specific_humidity`
- `z` → `geopotential`  (on levels)

### 4.2 Static NetCDF input (required for FLOW=era5)
The job runner expects `ERA5_STATIC_NC` to point to a **local mounted file** (NetCDF) that contains:
- `lsm`, `slt`, `z`

**Critical shape rule:** Each static variable must be **2D** `(latitude, longitude)`.
Your file *may* contain a singleton time dimension, but it must be removable.  
In `aurora_demo_core.py`, `_get_static_arrays()` drops `time` if it exists, but it does NOT drop `valid_time`.
So if CDS gave you `valid_time`, you must pre-process the file (or rename the dimension).
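
As an illustration of that pre-processing step, the sketch below strips leading singleton dimensions from an array and fails loudly if the result is not 2D. It works on NumPy arrays rather than on the NetCDF file itself, and `to_2d` is a hypothetical helper, not a function in `aurora_demo_core.py`.

```python
import numpy as np


def to_2d(arr, name="static"):
    """Drop leading singleton dimensions so `arr` becomes (latitude, longitude)."""
    arr = np.asarray(arr)
    while arr.ndim > 2 and arr.shape[0] == 1:
        arr = arr[0]
    if arr.ndim != 2:
        raise ValueError(f"{name}: expected a 2D field, got shape {arr.shape}")
    return arr


lsm = np.ones((1, 4, 8))  # e.g. (valid_time, latitude, longitude) as exported
print(to_2d(lsm, "lsm").shape)
```

If you edit the file through xarray instead, squeezing the singleton `valid_time` dimension (and dropping its coordinate) before saving should have the same effect.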

### 4.3 Mounting in Azure ML (what the notebook does)
Your notebook mounts:
- `era5_zarr` as `URI_FOLDER` → e.g. `.../wd/INPUT_era5_zarr`
- `era5_static` as `URI_FILE` → e.g. `.../wd/INPUT_era5_static`

Then it passes those mounted paths to the runner via:
- `ERA5_ZARR_PATH=<mounted folder>`
- `ERA5_STATIC_NC=<mounted file>`

If you run locally, you can set these env vars yourself and call:
`python run_aurora_job.py`
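
A local run could be prepared along these lines. The paths and the `build_local_env` helper are hypothetical; only the environment variable names come from this guide.

```python
import os


def build_local_env(zarr_path, static_nc, extra=None):
    """Return a copy of the current environment with the ERA5 inputs set."""
    env = dict(os.environ)
    env.update({
        "FLOW": "era5",
        "ERA5_ZARR_PATH": str(zarr_path),
        "ERA5_STATIC_NC": str(static_nc),
    })
    env.update(extra or {})
    return env


# Hypothetical local paths; replace them with your own.
env = build_local_env(
    "data/era5_subset.zarr",
    "data/era5_static.nc",
    extra={"FINETUNE_STEPS": "0", "PARTICIPANT_ID": "local-test"},
)
# To launch: subprocess.run([sys.executable, "run_aurora_job.py"], env=env, check=True)
print({k: env[k] for k in ("FLOW", "ERA5_ZARR_PATH", "ERA5_STATIC_NC")})
```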



## 5) Outputs produced by `run_aurora_job.py`

Outputs are written under:
`OUT_DIR/<PARTICIPANT_ID>/`

### Always (inference)
- `inference_full_prediction.npz`  
  Full model prediction saved as NPZ (surf, static, atmos, plus metadata when available).
- `inference_2t.npy`  
  Convenience dump of predicted surface 2m temperature (if present).

### Optional (inference rollout)
If `INFER_ROLLOUT_STEPS` is set:
- `inference_rollout_2t.npy`  
  Stack of predicted 2m temperature across rollout steps.

### If `FINETUNE_STEPS > 0` (fine-tuning)
- `finetune_last_prediction.npz`  
  Final prediction after the last training step.
- `finetune_last_2t.npy`  
  Final predicted 2m temperature after fine-tuning.
- `finetune_losses.npy` + `finetune_losses.json`  
  Loss history (one per training step).
- `finetuned_state_dict.pt`  
  Saved model weights:
  - if `USE_LORA=1` and `TRAIN_LORA_ONLY=1`: saved weights are filtered to only LoRA params (by name containing 'lora')
  - otherwise: full state_dict
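
The LoRA-only filter amounts to a name-based selection over the state dict. A minimal sketch, assuming a case-sensitive substring match and using made-up parameter names with plain values in place of tensors:

```python
def lora_only(state_dict):
    """Keep only entries whose parameter name contains 'lora'."""
    return {k: v for k, v in state_dict.items() if "lora" in k}


# Plain values stand in for tensors; the key names are made up for illustration.
state = {
    "backbone.layer0.weight": 0,
    "backbone.layer0.lora_A": 1,
    "backbone.layer0.lora_B": 2,
}
print(sorted(lora_only(state)))
```

A file saved this way holds only the adapters, so it needs the base checkpoint alongside it to rebuild the full model.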



## 6) End-to-end execution flow

### Step 1: You pick a profile in the notebook
A “profile” is just a dictionary of environment variables.
For example:
- `FLOW=toy` vs `FLOW=era5`
- `FT_MODE=short` vs `FT_MODE=rollout`
- `FINETUNE_STEPS=0` to skip training

### Step 2: Notebook submits Azure ML command job
The job runs:
`python run_aurora_job.py`

with the profile's env vars set and the ERA5 inputs mounted.

### Step 3: Runner reads env vars and calls core functions
The runner:
1) runs inference (toy or ERA5)
2) if `FINETUNE_STEPS>0`, runs fine-tuning:
   - short: `run_finetuning_era5_short_lead(...)`
   - rollout: `run_finetuning_era5_rollout(...)`

### Step 4: Runner writes outputs and exits
Everything you inspect later (loss curves, NPZs, weights) comes from the output directory.



## 7) Configuration reference (profiles & environment variables)

This section is your “reference manual”.

### 7.1 Top-level switches
- `FLOW`: `toy` or `era5`
- `FT_MODE` (ERA5 only): `short` or `rollout`
- `FINETUNE_STEPS`: 0 means “inference only”

### 7.2 LoRA switches
- `USE_LORA`:
  - default is **True for rollout**, False for short (the runner derives the default from `FT_MODE`)
  - explicitly set `USE_LORA=0` to disable LoRA in rollout runs
- `TRAIN_LORA_ONLY`:
  - if `USE_LORA=1` and `TRAIN_LORA_ONLY=1`, the code freezes base weights and trains only LoRA params
- `LORA_MODE`: `"single" | "from_second" | "all"`
- `LORA_STEPS`: passed into Aurora constructor (model configuration; not the training-loop length)
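
The way these defaults interact can be sketched as follows. The helper names and the set of accepted truthy strings are assumptions; the defaults themselves are taken from the environment variables table in this section.

```python
def env_flag(env, name, default):
    """Parse a 0/1-style flag from a mapping of environment variables."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_lora(env):
    """Resolve LoRA settings; USE_LORA and LORA_MODE defaults depend on FT_MODE."""
    rollout = env.get("FT_MODE", "short") == "rollout"
    return {
        "use_lora": env_flag(env, "USE_LORA", default=rollout),
        "train_lora_only": env_flag(env, "TRAIN_LORA_ONLY", default=True),
        "lora_mode": env.get("LORA_MODE", "all" if rollout else "single"),
        "lora_steps": int(env.get("LORA_STEPS", "40")),
    }


print(resolve_lora({"FT_MODE": "rollout"}))
print(resolve_lora({"FT_MODE": "rollout", "USE_LORA": "0"}))
```

The second call shows why `USE_LORA=0` has to be set explicitly for full-model rollout training: without it, rollout mode switches LoRA on.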

### 7.3 Rollout-specific switches
- `ROLLOUT_HORIZON_STEPS`: how many autoregressive steps to unroll
- `ROLLOUT_LOSS_ON`: `"last"` (cheaper) or `"sum"` (heavier)

### 7.4 “Add one variable” demo
If you want to extend the model + inputs with a new variable:
- `EXTRA_KIND`: `surf` | `atmos` | `static`
- `EXTRA_KEY`: the Aurora variable key you want to add (e.g., `2d` if you invent one)
- `EXTRA_SRC`: the variable name in your dataset (Zarr or NetCDF)
- `EXTRA_LOCATION`, `EXTRA_SCALE`: normalisation stats registered into Aurora’s normalisation tables
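
A sketch of how these variables might be parsed and used to extend the surface-variable tuple is shown below. The `ExtraVar` dataclass, `parse_extra`, and the example source name `2m_dewpoint_temperature` are illustrative assumptions; the default tuple mirrors the surface keys listed in section 4.1.

```python
from dataclasses import dataclass

DEFAULT_SURF_VARS = ("2t", "10u", "10v", "msl")  # Aurora keys from section 4.1


@dataclass(frozen=True)
class ExtraVar:
    kind: str        # "surf", "atmos" or "static"
    key: str         # Aurora key for the new variable
    src: str         # variable name in the Zarr/NetCDF
    location: float  # normalisation location
    scale: float     # normalisation scale


def parse_extra(env):
    """Return an ExtraVar if EXTRA_KIND is set, else None."""
    kind = env.get("EXTRA_KIND")
    if not kind:
        return None
    if kind not in {"surf", "atmos", "static"}:
        raise ValueError(f"EXTRA_KIND must be surf, atmos or static, got {kind!r}")
    return ExtraVar(
        kind=kind,
        key=env["EXTRA_KEY"],
        src=env["EXTRA_SRC"],
        location=float(env.get("EXTRA_LOCATION", "0.0")),
        scale=float(env.get("EXTRA_SCALE", "1.0")),
    )


extra = parse_extra({"EXTRA_KIND": "surf", "EXTRA_KEY": "2d",
                     "EXTRA_SRC": "2m_dewpoint_temperature"})
surf_vars = DEFAULT_SURF_VARS + ((extra.key,) if extra and extra.kind == "surf" else ())
print(surf_vars)
```

The defaults of 0.0 and 1.0 leave the new variable effectively unnormalised, so in practice you would likely compute a mean and standard deviation from your data and pass them in.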


### Environment variables table

| Variable | Default | Values | Used for |
|---|---|---|---|
| `FLOW` | toy | toy or era5 | Which flow to run. |
| `DEVICE` | cuda (if available) else cpu | cuda or cpu | Torch device. |
| `OUT_DIR` | ./outputs | path | Base output directory. |
| `PARTICIPANT_ID` | unknown | string | Subfolder under OUT_DIR. |
| `ERA5_ZARR_PATH` | (required for era5) | path | Mounted folder containing dynamic ERA5 subset Zarr. |
| `ERA5_STATIC_NC` | (required for era5) | path | Mounted NetCDF file containing lsm/slt/z. |
| `FT_MODE` | short | short or rollout | Fine-tuning mode (ERA5). |
| `FINETUNE_STEPS` | 0 | int | Number of training steps (optimizer updates). |
| `LR` | 3e-05 | float | Learning rate for AdamW. |
| `AUTOCAST` | True | 0/1 | Enable autocast (mixed precision) inside Aurora. |
| `ERA5_TIME_INDEX` | 10 | int | Time index chosen from subset for repeatability. |
| `ERA5_LEAD_HOURS` | 6 | int (hours) | Forecast lead time; must be a whole multiple of the dataset timestep. |
| `ERA5_CROP_LAT` | 128 | int | Center crop height. |
| `ERA5_CROP_LON` | 256 | int | Center crop width. |
| `INFER_ROLLOUT_STEPS` | unset | int or unset | If set, run autoregressive rollout during inference. |
| `USE_LORA` | True if `FT_MODE=rollout`, else False | 0/1 | Enable LoRA adapters. |
| `TRAIN_LORA_ONLY` | True | 0/1 | Train only LoRA params (freeze base weights). |
| `LORA_MODE` | all for rollout, single for short | single, from_second, or all | Where LoRA adapters apply inside the model. |
| `LORA_STEPS` | 40 | int | LoRA configuration passed into Aurora (not training loop length). |
| `STABILISE_LEVEL_AGG` | False | 0/1 | Enable level aggregation stabilisation in Aurora. |
| `ROLLOUT_HORIZON_STEPS` | 8 | int | Autoregressive unroll length during rollout fine-tuning. |
| `ROLLOUT_LOSS_ON` | last | last or sum | Loss strategy during rollout fine-tuning. |
| `EXTRA_KIND` | unset | surf, atmos, or static | Add one extra variable (optional). |
| `EXTRA_KEY` | unset | string | Aurora key name for extra variable. |
| `EXTRA_SRC` | unset | string | Dataset variable name for extra variable. |
| `EXTRA_LOCATION` | 0.0 | float | Normalisation location for new variable. |
| `EXTRA_SCALE` | 1.0 | float | Normalisation scale for new variable. |



## 8) Flows explained

### 8.1 Toy flow (`FLOW=toy`)
Toy flow is a sanity check:
- creates a small synthetic batch (`make_lowres_batch()`)
- runs inference (`run_inference()`)
- optionally runs a tiny fine-tune (`run_finetuning()`)

Use this when:
- you just built a new environment/Docker image
- you want to confirm CUDA works
- you want to confirm the job infrastructure works (outputs, logging)

### 8.2 ERA5 short-lead inference (`FLOW=era5`, `FT_MODE=short`, `FINETUNE_STEPS=0`)
- builds a single input batch `x` with history length `T=2`: `[t-lead, t]`
- runs `model(x)`
- saves prediction to NPZ and dumps `2t` to NPY

If you set `INFER_ROLLOUT_STEPS=N`, inference switches to a pure inference rollout:
- calls `aurora.rollout(model, x, steps=N)`
- saves a rollout stack for plotting

### 8.3 ERA5 short-lead fine-tuning (`FT_MODE=short`, `FINETUNE_STEPS>0`)
- builds `(x, y)` pair:
  - `x` contains `[t-lead, t]`
  - `y` contains `[t+lead]`
- trains for `FINETUNE_STEPS` optimizer updates on one sample
- loss used: `supervised_mae(pred, y)`
- optionally uses LoRA:
  - if `USE_LORA=1` and `TRAIN_LORA_ONLY=1`, freezes base params and trains LoRA params only
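
The idea behind a supervised MAE over a dictionary of fields can be sketched as below. This is an assumption about the general shape of such a loss, written with NumPy for illustration; the actual `supervised_mae` may aggregate or normalise variables differently.

```python
import numpy as np


def mae_over_vars(pred, target):
    """Mean absolute error, averaged over the variables present in `target`."""
    per_var = [np.mean(np.abs(pred[k] - target[k])) for k in target]
    return float(np.mean(per_var))


pred = {"2t": np.full((1, 1, 4, 8), 280.0), "msl": np.full((1, 1, 4, 8), 101000.0)}
target = {"2t": np.full((1, 1, 4, 8), 281.0), "msl": np.full((1, 1, 4, 8), 101003.0)}
print(mae_over_vars(pred, target))
```

Note how the pressure error dominates in raw units, which is one reason losses of this kind are often computed on normalised fields.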

This is the simplest “hello-world” fine-tune.

### 8.4 ERA5 rollout (autoregressive) fine-tuning (`FT_MODE=rollout`)
This is the autoregressive training loop.
It unrolls multiple steps and feeds predictions back as the next input state:

- Build initial batch `x0` with history `[t-lead, t]`.
- Build target batches for each rollout step:
  - targets at `t+lead`, `t+2*lead`, ..., `t+K*lead`
- For each training iteration:
  - predict step by step
  - update the batch history by shifting the time axis and appending predicted fields
  - compute loss either:
    - `sum`: loss at every step (bigger graph)
    - `last`: no-grad for first K-1 steps; loss only at final step (lighter)
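
The loop above can be illustrated with a toy, gradient-free version in which scalars stand in for batches. Everything here (`toy_model`, `rollout_loss`, the numbers) is made up for illustration; in real training the `last` strategy would additionally run the first K-1 steps without gradient tracking.

```python
def toy_model(history):
    """Stand-in for the network: linear extrapolation from [x(t-lead), x(t)]."""
    prev, cur = history
    return cur + (cur - prev)


def rollout_loss(history, targets, loss_on="last"):
    """Unroll len(targets) steps, feeding each prediction back as input."""
    if loss_on not in {"last", "sum"}:
        raise ValueError(f"loss_on must be 'last' or 'sum', got {loss_on!r}")
    total = 0.0
    for k, target in enumerate(targets):
        pred = toy_model(history)
        history = (history[1], pred)  # shift the time axis, append the prediction
        is_last = k == len(targets) - 1
        if loss_on == "sum" or is_last:
            total += abs(pred - target)
    return total


targets = [3.5, 4.5, 6.0]  # "truth" at t+lead, t+2*lead, t+3*lead
print(rollout_loss((1.0, 2.0), targets, "sum"))
print(rollout_loss((1.0, 2.0), targets, "last"))
```

With `sum`, every intermediate error contributes; with `last`, only the error at the end of the horizon does.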

**Important:** Rollout fine-tuning can be run with or without LoRA.
Set `USE_LORA=0` explicitly if you want full-model rollout training.

### 8.5 LoRA: `LORA_MODE` and `LORA_STEPS`
- `LORA_MODE` controls where LoRA adapters apply inside Aurora.
- `LORA_STEPS` is passed into the Aurora constructor as configuration.
In this workshop code, the training-loop length is always `FINETUNE_STEPS`, not `LORA_STEPS`.

### 8.6 “Add variable” demo
This code supports adding **one new variable** by:
1) extending the variable tuples passed to `AuroraPretrained`
2) registering normalisation stats in Aurora’s normalisation tables

You must:
- create a “PLUS” dataset (your Zarr/NetCDF expanded to contain the new variable)
- mount it
- set `EXTRA_*` env vars



## 9) Recipes: profiles you can copy/paste

### 9.1 ERA5 short-lead (no training)
```python
{
  "FLOW": "era5",
  "FT_MODE": "short",
  "FINETUNE_STEPS": "0",
  "ERA5_LEAD_HOURS": "6",
  "ERA5_TIME_INDEX": "10",
  "ERA5_CROP_LAT": "128",
  "ERA5_CROP_LON": "256",
  "AUTOCAST": "1",
}
```

### 9.2 ERA5 short-lead fine-tune (full model, no LoRA)
```python
{
  "FLOW": "era5",
  "FT_MODE": "short",
  "FINETUNE_STEPS": "50",
  "USE_LORA": "0",
  "LR": "3e-5",
  "ERA5_LEAD_HOURS": "6",
  "ERA5_TIME_INDEX": "10",
  "ERA5_CROP_LAT": "128",
  "ERA5_CROP_LON": "256",
  "AUTOCAST": "1",
}
```

### 9.3 ERA5 short-lead fine-tune (LoRA-only)
```python
{
  "FLOW": "era5",
  "FT_MODE": "short",
  "FINETUNE_STEPS": "100",
  "USE_LORA": "1",
  "TRAIN_LORA_ONLY": "1",
  "LORA_MODE": "single",
  "LORA_STEPS": "40",
  "LR": "3e-4",
  "AUTOCAST": "1",
}
```

### 9.4 ERA5 rollout fine-tune WITHOUT LoRA (full model autoregressive)
```python
{
  "FLOW": "era5",
  "FT_MODE": "rollout",
  "FINETUNE_STEPS": "20",
  "USE_LORA": "0",
  "ROLLOUT_HORIZON_STEPS": "8",
  "ROLLOUT_LOSS_ON": "last",
  "LR": "3e-5",
  "AUTOCAST": "1",
}
```

### 9.5 ERA5 rollout fine-tune WITH LoRA-only (fast adaptation)
```python
{
  "FLOW": "era5",
  "FT_MODE": "rollout",
  "FINETUNE_STEPS": "50",
  "USE_LORA": "1",
  "TRAIN_LORA_ONLY": "1",
  "LORA_MODE": "all",
  "LORA_STEPS": "40",
  "ROLLOUT_HORIZON_STEPS": "8",
  "ROLLOUT_LOSS_ON": "last",
  "LR": "3e-4",
  "AUTOCAST": "1",
}
```



## 10) Troubleshooting (common errors and fixes)

### A) Static vars shape error (repeat dims / wrong dims)
**Symptom:**
`RuntimeError: Number of dimensions of repeat dims can not be smaller...`

**Cause:**
Aurora expects static vars as 2D `(H, W)`, but you provided `(1, H, W)` or `(valid_time, H, W)`.

**Fix:**
Ensure the static NetCDF has 2D `lsm/slt/z` or at least a removable singleton time dimension.
If CDS gives you `valid_time`, drop it when creating the file.

### B) Static vars are NaN
**Likely causes:**
- reading the wrong variable from the ARCO store
- misaligned lon convention (-180..180 vs 0..360)
- selecting the wrong coordinates (empty selection or wrong nearest)

**Fix:**
Standardise lon to 0..360 on both sides and align to the dynamic grid.

### C) LoRA checkpoint loading “missing lora_A/lora_B keys”
**Symptom:**
`Missing key(s) in state_dict: ... lora_A ... lora_B ...`

**Cause:**
You built a LoRA-enabled model but tried to load a base checkpoint with `strict=True`.

**Fix:**
When LoRA is enabled, load base checkpoint with `strict=False` so LoRA weights can stay initialised.
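
Because `strict=False` also hides genuine mismatches, it can help to inspect what was skipped. PyTorch's `load_state_dict` returns the lists of missing and unexpected keys; the sketch below checks such lists in plain Python. The helper name and the key names are hypothetical.

```python
def check_missing_keys(missing_keys, unexpected_keys=()):
    """Raise unless every missing key is a LoRA parameter and nothing is unexpected."""
    non_lora = [k for k in missing_keys if "lora" not in k]
    if non_lora or unexpected_keys:
        raise RuntimeError(
            f"checkpoint mismatch: missing={non_lora}, unexpected={list(unexpected_keys)}"
        )
    return len(missing_keys)


# Key names below are made up for illustration.
n = check_missing_keys(["blocks.0.attn.lora_A", "blocks.0.attn.lora_B"])
print(f"{n} LoRA parameters left at their initial values")
```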

### D) Cross-device rename error
**Symptom:**
`Invalid cross-device link` when moving `/tmp/...` to `/mnt/...`

**Fix:**
Write temp files into the same directory as the final destination and use `os.replace()`.
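
A minimal sketch of that pattern, using only the standard library. The helper name `atomic_write_bytes` is hypothetical, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile


def atomic_write_bytes(dest, data):
    """Write `data` to `dest` via a temp file created in the same directory."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp_path, dest)  # same filesystem, so no cross-device rename
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


out_dir = tempfile.mkdtemp()  # stands in for OUT_DIR/<PARTICIPANT_ID>/
atomic_write_bytes(os.path.join(out_dir, "finetune_losses.json"), b"[0.5, 0.4]")
print(os.listdir(out_dir))
```

Because the temp file and the destination share a directory, the final rename never has to cross a mount boundary.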

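### E) “FLOW=era5 requires ERA5_ZARR_PATH and ERA5_STATIC_NC”
**Cause:**
The job didn’t mount the inputs or didn’t pass the env vars.

**Fix:**
Check the notebook job definition: both inputs must be mounted and their paths passed as `ERA5_ZARR_PATH` and `ERA5_STATIC_NC`.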



## 11) Extending the workshop

### 11.1 Larger crops / larger horizons
- Increase `ERA5_CROP_LAT/LON` carefully (memory grows with H×W×levels)
- Increase `ROLLOUT_HORIZON_STEPS` for longer autoregressive sequences

### 11.2 Multiple samples / real training
Right now short-lead and rollout fine-tune loops train on **one chosen time index** (repeatable workshop demo).
To make it “real”, you would:
- iterate over many `time_index` values
- shuffle and batch them
- checkpoint periodically
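
The sampling side of such a loop could look like the sketch below. Function names are hypothetical; the index bounds assume a history at `t - step` and targets up to `t + horizon * step`, as described in sections 3.3 and 8.4.

```python
import random


def valid_time_indices(n_times, step, horizon=1):
    """Indices t with room for history (t - step) and targets up to t + horizon*step."""
    return list(range(step, n_times - horizon * step))


def iter_batches(indices, batch_size, seed=0):
    """Yield shuffled mini-batches of time indices."""
    indices = list(indices)
    random.Random(seed).shuffle(indices)
    for i in range(0, len(indices), batch_size):
        yield indices[i:i + batch_size]


idx = valid_time_indices(n_times=20, step=1, horizon=8)
for batch in iter_batches(idx, batch_size=4):
    print(batch)
```

Each yielded index would then be turned into an `(x, y)` pair or a rollout target sequence by the existing batch-building functions.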

### 11.3 Add variables cleanly
Use the `EXTRA_*` mechanism and register normalisation stats. Then:
- expand your Zarr/NetCDF to contain the new variable
- mount the PLUS dataset
- set the `EXTRA_*` env vars in your profile

### 11.4 Alternative loss functions
Replace or extend `supervised_mae` with:
- weighted MAE by variable
- pressure-level weighting
- spatial masks (land-only, etc.)
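
As one possible shape for the first and third options, the sketch below combines per-variable weights with a spatial mask. It is a NumPy illustration under assumed conventions (boolean `(H, W)` mask, weights defaulting to 1.0), not a drop-in replacement for `supervised_mae`.

```python
import numpy as np


def weighted_masked_mae(pred, target, var_weights=None, mask=None):
    """Weighted average of per-variable MAE, optionally restricted to mask == True.

    `mask` is a boolean (H, W) array applied to the last two dimensions.
    """
    var_weights = var_weights or {}
    num, den = 0.0, 0.0
    for name, tgt in target.items():
        err = np.abs(pred[name] - tgt)
        if mask is not None:
            err = err[..., mask]  # keep only the selected grid points
        w = var_weights.get(name, 1.0)
        num += w * float(err.mean())
        den += w
    return num / den


land = np.zeros((4, 8), dtype=bool)
land[:, :4] = True  # pretend the western half of the patch is land
pred = {"2t": np.zeros((1, 1, 4, 8)), "msl": np.zeros((1, 1, 4, 8))}
target = {"2t": np.ones((1, 1, 4, 8)), "msl": np.full((1, 1, 4, 8), 4.0)}
print(weighted_masked_mae(pred, target, {"2t": 3.0, "msl": 1.0}, mask=land))
```

Pressure-level weighting would follow the same pattern, with a weight vector broadcast along the level axis of the atmospheric fields.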


## Appendix: quick output inspection

```python
from pathlib import Path
import json
import numpy as np

# Point this at your downloaded job output folder
participant_dir = Path("job_outputs/<JOB_NAME>/out_dir/<PARTICIPANT_ID>")  # <- edit

print("Participant dir:", participant_dir.resolve())
print("Files:", [p.name for p in participant_dir.iterdir()])

# Losses
loss_json = participant_dir / "finetune_losses.json"
if loss_json.exists():
    losses = json.loads(loss_json.read_text())
    print("Loss steps:", len(losses))
    print("First/last:", losses[0], losses[-1])

# 2m temperature arrays
t2_inf = participant_dir / "inference_2t.npy"
if t2_inf.exists():
    arr = np.load(t2_inf)
    print("inference_2t.npy shape:", arr.shape, "min/max:", float(arr.min()), float(arr.max()))
```
