# Creating a STAC Item for the NLDAS-3 Virtual Icechunk Store

This notebook shows how to use `cloudify.stac` to generate a STAC item for the
NLDAS-3 meteorological forcing dataset stored as a **virtual** Icechunk repository
on AWS S3.

NLDAS-3 differs from the GFS and HRRR examples in two important ways:

1. **Virtual chunks** — the Icechunk repo holds only chunk *references*; the actual
   data bytes live in the original source files at
   `s3://nasa-waterinsight/NLDAS3/forcing/daily/`.  Opening the store therefore
   requires a `VirtualChunkContainer` config and anonymous credentials for the
   source bucket.  Pass `virtual=True` to `build_stac_item_from_icechunk` so the
   asset roles reflect this.

2. **Snapshot-pinned session** — rather than opening the `main` branch, we pin to
   a specific `snapshot_id` for reproducibility.  The snapshot ID is also used as
   part of the STAC item ID and asset key.

> **Note on the `storage:schemes` bug**: the original notebook that built this item
> placed `storage:schemes` inside `item.properties`, which causes
> `xr.open_dataset(asset)` via xpystac to fail with `KeyError: 'storage:schemes'`.
> `build_stac_item_from_icechunk` always places it in `item.extra_fields`
> (top-level), which is where xpystac looks.

## Install dependencies
```
pip install icechunk xarray zarr pystac xstac rioxarray cloudify xpystac
```

In [None]:
import json

import icechunk
import pystac
import rioxarray  # registers .rio accessor for CRS-aware bbox
import xarray as xr

from cloudify.stac import build_stac_item_from_icechunk

## 1. Open the virtual Icechunk store

Opening a virtual store requires two extra steps compared to a regular Icechunk store:

- A `VirtualChunkContainer` that tells the library where the source chunk data lives
- `containers_credentials` granting access to the virtual chunk source bucket

Both are needed at open time, but they don't affect the STAC item — they're an
implementation detail of how the data is read.

In [None]:
BUCKET = "nasa-waterinsight"
PREFIX = "virtual-zarr-store/NLDAS-3-icechunk/"
REGION = "us-west-2"
ICECHUNK_HREF = f"s3://{BUCKET}/{PREFIX}"

# Snapshot to pin to (from original dataset notebook)
SNAPSHOT_ID = "YTNGFY4WY9189GEH1FNG"

# Virtual chunk source — where the actual data bytes live
VIRTUAL_SOURCE = "s3://nasa-waterinsight/NLDAS3/forcing/daily/"

storage = icechunk.s3_storage(
    bucket=BUCKET,
    prefix=PREFIX,
    region=REGION,
    anonymous=True,
)

config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        VIRTUAL_SOURCE,
        icechunk.s3_store(region=REGION),
    )
)

virtual_credentials = icechunk.containers_credentials(
    {VIRTUAL_SOURCE: icechunk.s3_anonymous_credentials()}
)

repo = icechunk.Repository.open(
    storage=storage,
    config=config,
    authorize_virtual_chunk_access=virtual_credentials,
)

session = repo.readonly_session(snapshot_id=SNAPSHOT_ID)
print(f"snapshot_id: {session.snapshot_id}")

In [None]:
import warnings
warnings.filterwarnings(
    "ignore",
    # `message` is a regex matched against the start of the warning text
    message="Numcodecs codecs are not in the Zarr version 3 specification",
    category=UserWarning,
)

ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)
ds

## 2. Define storage schemes and providers

Both the Icechunk repo and the virtual chunk source live in the same S3 bucket
(`nasa-waterinsight`), so a single storage scheme entry covers both.

In [None]:
storage_schemes = {
    "aws-s3-nasa-waterinsight": {
        "type": "aws-s3",
        "platform": "https://{bucket}.s3.{region}.amazonaws.com",
        "bucket": BUCKET,
        "region": REGION,
        "anonymous": True,
    }
}

providers = [
    pystac.Provider(
        name="NLDAS",
        description="NASA Land Data Assimilation Systems",
        roles=["producer", "processor", "licensor"],
        url="https://ldas.gsfc.nasa.gov/nldas",
    )
]
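
The `platform` value is a URL template rather than a literal URL. As a rough sketch of how a consumer might expand it, assuming the placeholders are filled from the scheme's own `bucket` and `region` fields (`resolve_platform` is a hypothetical helper, not a cloudify or xpystac function):

```python
def resolve_platform(scheme: dict) -> str:
    """Fill the {bucket} and {region} placeholders from the scheme itself.

    Hypothetical helper for illustration only.
    """
    return scheme["platform"].format(
        bucket=scheme["bucket"],
        region=scheme["region"],
    )


# Same values as the storage scheme defined in this section.
example_scheme = {
    "type": "aws-s3",
    "platform": "https://{bucket}.s3.{region}.amazonaws.com",
    "bucket": "nasa-waterinsight",
    "region": "us-west-2",
    "anonymous": True,
}

endpoint = resolve_platform(example_scheme)
print(endpoint)  # https://nasa-waterinsight.s3.us-west-2.amazonaws.com
```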

## 3. Build the STAC item

Key arguments for NLDAS-3 that differ from the GFS/HRRR examples:

| Argument | Value | Reason |
|---|---|---|
| `virtual` | `True` | Chunks are virtual references, not embedded data |
| `temporal_dimension` | `"time"` | Standard time dimension (not `init_time`) |
| `x_dimension` | `"lon"` | Longitude coordinate name in this dataset |
| `y_dimension` | `"lat"` | Latitude coordinate name in this dataset |

The bounding box comes from `extract_spatial_extent_rio`, which uses
`ds.rio.bounds()`. Because this dataset is on a lat/lon grid, the bounds are
already in WGS84 and need no reprojection, matching the rioxarray approach in
the original notebook.
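
On a regular lat/lon grid the coordinate values usually mark cell centers, so the bounding box extends half a cell beyond the first and last coordinate. The sketch below illustrates that idea in plain Python on a tiny made-up grid. It assumes evenly spaced, center-registered coordinates and is not the actual `extract_spatial_extent_rio` implementation; `edge_bounds` is a hypothetical name.

```python
def edge_bounds(lons: list, lats: list) -> tuple:
    """Return (west, south, east, north) outer edges for cell-center coordinates.

    Assumes each axis is evenly spaced and has at least two values.
    """
    dx = abs(lons[1] - lons[0])  # cell width in degrees
    dy = abs(lats[1] - lats[0])  # cell height in degrees
    return (
        min(lons) - dx / 2,
        min(lats) - dy / 2,
        max(lons) + dx / 2,
        max(lats) + dy / 2,
    )


# Made-up 0.5-degree grid, not the real NLDAS-3 coordinates.
lons = [-100.0, -99.5, -99.0]
lats = [40.0, 40.5, 41.0]

bbox = edge_bounds(lons, lats)
print(bbox)  # (-100.25, 39.75, -98.75, 41.25)
```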

In [None]:
item_id = f"nldas-3-virtual-zarr-{SNAPSHOT_ID.lower()}"

item = build_stac_item_from_icechunk(
    ds,
    item_id=item_id,
    icechunk_href=ICECHUNK_HREF,
    snapshot_id=SNAPSHOT_ID,
    storage_schemes=storage_schemes,
    title=ds.attrs.get("title", "NLDAS-3 Virtual Zarr Store"),
    description=(
        "NLDAS-3 provides a fine-scale (1 km) meteorological forcing (precipitation) in "
        "both retrospective and near real-time over North and Central America, including "
        "Alaska, Hawaii, and Puerto Rico, by leveraging high-quality gauge, satellite, "
        "and model datasets through advanced data assimilation methods. "
        "Read more: https://ldas.gsfc.nasa.gov/nldas/v3"
    ),
    providers=providers,
    virtual=True,               # virtual chunk references
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
)

print(json.dumps(item, indent=2))

## 4. Inspect key fields

In [None]:
print("bbox:           ", item["bbox"])
print("start_datetime: ", item["properties"]["start_datetime"])
print("end_datetime:   ", item["properties"]["end_datetime"])
print("variables:      ", list(item["properties"]["cube:variables"].keys()))
print()
print("assets:")
for key, asset in item["assets"].items():
    print(f"  {key}: {asset['href']}")
    print(f"    snapshot_id: {asset.get('icechunk:snapshot_id')}")
    print(f"    roles:       {asset['roles']}")
print()
print("storage:schemes at top level:", "storage:schemes" in item)
print("storage:schemes in properties (should be False):",
      "storage:schemes" in item.get("properties", {}))

## 5. Save and round-trip via xpystac

Note: `xr.open_dataset(asset)` via xpystac reconstructs the full Icechunk
config, including the `VirtualChunkContainer`, from `storage:schemes`.  This only
works if `storage:schemes` is at the item top level (not in `properties`),
which is what `build_stac_item_from_icechunk` ensures.
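
The difference between the two layouts is easiest to see on plain dictionaries. The sketch below is not part of `cloudify.stac`. `hoist_storage_schemes` is a hypothetical helper that moves a misplaced `storage:schemes` entry from `properties` to the top level of an item dictionary, and the sample item is a minimal stand-in rather than a complete STAC item.

```python
import copy


def hoist_storage_schemes(item: dict) -> dict:
    """Return a copy of `item` with `storage:schemes` at the top level.

    Hypothetical helper for illustration; not part of cloudify.stac.
    """
    fixed = copy.deepcopy(item)
    schemes = fixed.get("properties", {}).pop("storage:schemes", None)
    if schemes is not None:
        # Keep an existing top-level entry if one is already present.
        fixed.setdefault("storage:schemes", schemes)
    return fixed


# Minimal stand-in for the misplaced layout (not a complete STAC item).
buggy_item = {
    "type": "Feature",
    "id": "example",
    "properties": {
        "datetime": None,
        "storage:schemes": {"aws-s3-example": {"type": "aws-s3"}},
    },
}

fixed_item = hoist_storage_schemes(buggy_item)
print("storage:schemes" in fixed_item)                # True
print("storage:schemes" in fixed_item["properties"])  # False
```

The helper works on a deep copy, so the input dictionary is left unchanged.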

In [None]:
out_path = f"{item_id}.json"
with open(out_path, "w") as f:
    json.dump(item, f, indent=2)
print(f"Written to {out_path}")

In [None]:
# Reload and open via xpystac  (requires: pip install xpystac)
loaded_item = pystac.Item.from_file(out_path)

# Pick the first asset whose key contains "@" (raises StopIteration if none match)
asset_key = next(k for k in loaded_item.assets if "@" in k)
asset = loaded_item.assets[asset_key]
print(f"Opening asset: {asset_key}")

ds_from_stac = xr.open_dataset(asset)
ds_from_stac
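
Selecting the asset by the presence of `@` in its key works when the item has a single matching asset. A slightly more explicit alternative, sketched below on a plain dictionary, is to match on the `icechunk:snapshot_id` field printed in section 4. The asset keys and the `find_asset_key` helper are hypothetical and only illustrate the lookup.

```python
from typing import Optional


def find_asset_key(item: dict, snapshot_id: str) -> Optional[str]:
    """Return the key of the first asset pinned to `snapshot_id`, or None."""
    for key, asset in item.get("assets", {}).items():
        if asset.get("icechunk:snapshot_id") == snapshot_id:
            return key
    return None


# Stand-in item with made-up asset keys; real keys come from the builder.
example_item = {
    "assets": {
        "thumbnail": {"href": "https://example.com/thumb.png"},
        "example@YTNGFY4WY9189GEH1FNG": {
            "href": "s3://nasa-waterinsight/virtual-zarr-store/NLDAS-3-icechunk/",
            "icechunk:snapshot_id": "YTNGFY4WY9189GEH1FNG",
        },
    }
}

found = find_asset_key(example_item, "YTNGFY4WY9189GEH1FNG")
missing = find_asset_key(example_item, "MISSING")
print(found)    # example@YTNGFY4WY9189GEH1FNG
print(missing)  # None
```

With a loaded `pystac.Item`, the same field is available through each asset's `extra_fields` dictionary.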