# Creating a STAC Item for the NOAA GFS Forecast (dynamical.org Icechunk)

This notebook shows how to use `cloudify.stac` to generate a STAC item for the
[NOAA GFS Forecast](https://dynamical.org/catalog/noaa-gfs-forecast/) dataset
published by [dynamical.org](https://dynamical.org) as a public Icechunk store on AWS S3.

The dataset is a **regular** (non-virtual) Icechunk store — chunk data lives
directly in the repo, not in referenced external files.

## Install dependencies
```
pip install icechunk xarray zarr pystac xstac xpystac cloudify
```

In [None]:
import json

import icechunk
import pystac
import xarray as xr

from cloudify.stac import build_stac_item_from_icechunk

## 1. Open the Icechunk store

We open the `main` branch and capture the current snapshot ID.
The snapshot ID pins the STAC item to a specific, reproducible version of the data.

In [None]:
BUCKET = "dynamical-noaa-gfs"
PREFIX = "noaa-gfs-forecast/v0.2.7.icechunk/"
REGION = "us-west-2"
ICECHUNK_HREF = f"s3://{BUCKET}/{PREFIX}"

storage = icechunk.s3_storage(
    bucket=BUCKET,
    prefix=PREFIX,
    region=REGION,
    anonymous=True,
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

snapshot_id = session.snapshot_id
print(f"snapshot_id: {snapshot_id}")
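
As a small illustration of how the store location decomposes, the sketch below splits an `s3://` href back into the bucket and prefix passed to `icechunk.s3_storage`. It uses only the standard library; the helper name `split_s3_href` is hypothetical and not part of `icechunk` or `cloudify`.

```python
from urllib.parse import urlparse


def split_s3_href(href: str) -> tuple[str, str]:
    """Split an s3:// href into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(href)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// href: {href!r}")
    # The parsed path starts with "/", which is not part of the object prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_href(
    "s3://dynamical-noaa-gfs/noaa-gfs-forecast/v0.2.7.icechunk/"
)
print(bucket)  # dynamical-noaa-gfs
print(prefix)  # noaa-gfs-forecast/v0.2.7.icechunk/
```

The trailing slash on the prefix is preserved, so the two parts can be rejoined into the original href with `f"s3://{bucket}/{prefix}"`.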

In [None]:
# Icechunk stores do not use consolidated metadata, so skip that lookup.
ds = xr.open_zarr(session.store, chunks=None, consolidated=False)
ds

## 2. Define storage schemes and providers

`storage_schemes` describes the S3 backend so that tools like
[xpystac](https://github.com/stac-utils/xpystac) can reconstruct the
icechunk storage config from the STAC asset alone.

**Important:** the schemes are written to the top level of the item under
the `storage:schemes` key (`extra_fields` in pystac terms), not inside
`properties`. That is where `xpystac` looks for them when you call
`xr.open_dataset(asset)`.
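
To make the placement concrete, here is a minimal sketch using plain dictionaries. The item skeleton is hand-written and is not the output of `build_stac_item_from_icechunk`; the scheme values are the ones used in this section. Expanding the `platform` template with `str.format` is an assumption about how a consumer might resolve it, not a description of what `xpystac` does internally.

```python
# Stand-in scheme; values match this notebook but the dict is hand-written.
scheme = {
    "type": "aws-s3",
    "platform": "https://{bucket}.s3.{region}.amazonaws.com",
    "bucket": "dynamical-noaa-gfs",
    "region": "us-west-2",
    "anonymous": True,
}

item_skeleton = {
    "type": "Feature",
    "id": "example",
    "properties": {},  # storage:schemes does NOT go here
    "storage:schemes": {"aws-s3-dynamical-noaa-gfs": scheme},  # top level
}

# Expand the platform template with the scheme's own fields.
endpoint = scheme["platform"].format(
    bucket=scheme["bucket"], region=scheme["region"]
)
print(endpoint)  # https://dynamical-noaa-gfs.s3.us-west-2.amazonaws.com

print("storage:schemes" in item_skeleton)                # True
print("storage:schemes" in item_skeleton["properties"])  # False
```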

In [None]:
storage_schemes = {
    "aws-s3-dynamical-noaa-gfs": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": BUCKET,
        "region": REGION,
        "anonymous": True,
    }
}

providers = [
    pystac.Provider(
        name="dynamical.org",
        description="Analysis-ready, cloud-optimized weather forecast data",
        roles=["producer", "processor", "host"],
        url="https://dynamical.org",
    ),
    pystac.Provider(
        name="NOAA NCEP",
        description="National Oceanic and Atmospheric Administration — Global Forecast System",
        roles=["producer", "licensor"],
        url="https://www.ncep.noaa.gov",
    ),
]

## 3. Build the STAC item

Three settings differ from the defaults:

- `temporal_dimension="init_time"` — GFS uses `init_time` (forecast
  initialisation time) rather than the generic `time`
- `x_dimension="longitude"`, `y_dimension="latitude"` — full names, not
  the abbreviated `lon`/`lat`
- `virtual=False` — this is a regular Icechunk store, not a virtual one
  (chunk data lives in the repo, not in referenced external files)
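
For orientation, the sketch below shows how these dimension names could map onto a `cube:dimensions` object in the style of the STAC datacube extension. It is hand-written and stdlib-only: the helper `sketch_cube_dimensions` and the sample extents are hypothetical, and the exact fields that `cloudify.stac` emits may differ.

```python
def sketch_cube_dimensions(temporal, x, y, t_extent, x_extent, y_extent):
    """Build a datacube-style dimensions dict (illustrative only)."""
    return {
        temporal: {"type": "temporal", "extent": list(t_extent)},
        x: {"type": "spatial", "axis": "x", "extent": list(x_extent)},
        y: {"type": "spatial", "axis": "y", "extent": list(y_extent)},
    }


dims = sketch_cube_dimensions(
    temporal="init_time",
    x="longitude",
    y="latitude",
    # Sample extents made up for the example, not read from the store.
    t_extent=("2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"),
    x_extent=(-180.0, 179.75),
    y_extent=(-90.0, 90.0),
)
print(sorted(dims))  # ['init_time', 'latitude', 'longitude']
```

The point of the sketch is that the keys of the dimensions object are the dataset's own dimension names, which is why the builder needs to be told which one is temporal and which ones are the x and y axes.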

In [None]:
item_id = f"noaa-gfs-forecast-{snapshot_id.lower()}"

item = build_stac_item_from_icechunk(
    ds,
    item_id=item_id,
    icechunk_href=ICECHUNK_HREF,
    snapshot_id=snapshot_id,
    storage_schemes=storage_schemes,
    title="NOAA GFS Forecast (dynamical.org)",
    description=ds.attrs.get(
        "description",
        "NOAA Global Forecast System (GFS) weather forecast data, "
        "analysis-ready and cloud-optimized by dynamical.org. "
        "Global coverage at 0.25° resolution, initialized every 6 hours, "
        "with forecast lead times from 0 to 384 hours.",
    ),
    providers=providers,
    virtual=False,                   # regular icechunk, not virtual
    temporal_dimension="init_time",  # GFS-specific
    x_dimension="longitude",
    y_dimension="latitude",
)

print(json.dumps(item, indent=2))

## 4. Inspect key fields

In [None]:
print("bbox:           ", item["bbox"])
print("start_datetime: ", item["properties"]["start_datetime"])
print("end_datetime:   ", item["properties"]["end_datetime"])
print("variables:      ", list(item["properties"]["cube:variables"].keys()))
print()
print("assets:")
for key, asset in item["assets"].items():
    print(f"  {key}: {asset['href']}")
    print(f"    snapshot_id:  {asset.get('icechunk:snapshot_id')}")
    print(f"    roles:        {asset['roles']}")
print()
print("storage:schemes at top level:", "storage:schemes" in item)
print("storage:schemes in properties (should be False):",
      "storage:schemes" in item.get("properties", {}))
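
The same checks can be folded into a small reusable function. The sketch below is an assumption about which fields matter, based only on the keys inspected above; `check_item_layout` is a hypothetical helper, and the sample dictionary is a stand-in rather than real output.

```python
def check_item_layout(item: dict) -> list[str]:
    """Return a list of problems found in a STAC item dict (empty if none)."""
    problems = []
    props = item.get("properties", {})
    if "storage:schemes" not in item:
        problems.append("storage:schemes missing at top level")
    if "storage:schemes" in props:
        problems.append("storage:schemes must not be inside properties")
    for field in ("start_datetime", "end_datetime", "cube:variables"):
        if field not in props:
            problems.append(f"properties.{field} missing")
    for key, asset in item.get("assets", {}).items():
        if "icechunk:snapshot_id" not in asset:
            problems.append(f"asset {key!r} has no icechunk:snapshot_id")
    return problems


# Stand-in item with made-up values; the asset deliberately lacks a snapshot ID.
sample = {
    "storage:schemes": {"example-scheme": {"type": "aws-s3"}},
    "properties": {
        "start_datetime": "2024-01-01T00:00:00Z",
        "end_datetime": "2024-01-02T00:00:00Z",
        "cube:variables": {},
    },
    "assets": {
        "data@SNAPSHOT": {"href": "s3://example-bucket/example/", "roles": ["data"]}
    },
}
print(check_item_layout(sample))
# ["asset 'data@SNAPSHOT' has no icechunk:snapshot_id"]
```

Returning a list of messages rather than raising on the first failure makes it easy to report every problem in one pass.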

## 5. Save and round-trip via xpystac

Save the item to disk, reload it with pystac, and open the data directly
from the asset using `xr.open_dataset(asset)` — this exercises the full
round-trip that end users of the STAC catalog will follow.

In [None]:
out_path = f"{item_id}.json"
with open(out_path, "w") as f:
    json.dump(item, f, indent=2)
print(f"Written to {out_path}")
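
Before handing the file to pystac, it can be worth confirming that it reloads to the same dictionary. Here is a minimal, stdlib-only sketch; the `item_stub` contents are made up for the example and stand in for the real item.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the real item; any JSON-serializable dict behaves the same way.
item_stub = {
    "id": "noaa-gfs-forecast-example",
    "bbox": [-180.0, -90.0, 179.75, 90.0],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / f"{item_stub['id']}.json"
    path.write_text(json.dumps(item_stub, indent=2))
    reloaded = json.loads(path.read_text())

print(reloaded == item_stub)  # True
```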

In [None]:
# Reload and open via xpystac  (requires: pip install xpystac)
loaded_item = pystac.Item.from_file(out_path)

# The asset key is "{name}@{snapshot_id}"
asset_key = next(k for k in loaded_item.assets if "@" in k)
asset = loaded_item.assets[asset_key]
print(f"Opening asset: {asset_key}")

# xpystac detects the icechunk media type and reconstructs the repo
# from storage:schemes + icechunk:snapshot_id automatically
ds_from_stac = xr.open_dataset(asset)
ds_from_stac
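
If the asset name and snapshot ID are needed separately, the `"{name}@{snapshot_id}"` key can be split as sketched below. `split_asset_key` is a hypothetical helper, and splitting on the last `@` assumes that snapshot IDs never contain that character.

```python
def split_asset_key(key: str) -> tuple[str, str]:
    """Split "{name}@{snapshot_id}" into (name, snapshot_id)."""
    name, sep, snapshot = key.rpartition("@")
    if not sep or not name or not snapshot:
        raise ValueError(f"expected '<name>@<snapshot_id>', got {key!r}")
    return name, snapshot


# EXAMPLESNAPSHOT is a placeholder, not a real snapshot ID.
print(split_asset_key("noaa-gfs-forecast@EXAMPLESNAPSHOT"))
# ('noaa-gfs-forecast', 'EXAMPLESNAPSHOT')
```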