![Course header](../assets/img/header.png)

# 05 — STAC + xarray for Satellite Data
Catalog search → stack raster assets → compute NDVI → summarize and export

This notebook connects STAC discovery with real satellite pixel data using xarray.

## Learning Objectives

This notebook serves as a **quick reference** for STAC-to-xarray workflows. If you're already comfortable with STAC discovery, feel free to skim or skip ahead to the exercises.

By the end of this notebook, you will be able to:

- Query a STAC API for Sentinel-2 scenes (Items)
- Filter scenes by time, area, and cloud cover
- Stack raster assets into an xarray DataArray (time, band, y, x)
- Compute NDVI and summarize results
- Export outputs for later work

Tooling in this notebook:
- pystac-client for STAC search
- stackstac to create an xarray cube from STAC Items (COG assets)
- xarray for analysis
- numpy for math
- pandas for tabular inspection
- matplotlib for plotting

Requires: Internet access (Planetary Computer STAC).
To avoid large downloads, use a small AOI, few scenes, and coarse resolution.

---

## How to use this notebook

1. Run cells top to bottom.
2. Keep the AOI small and the number of scenes small.
3. If something breaks, restart the kernel and run all cells again.
4. Do not scale up until the workflow is working.

---

## Table of contents

1. Setup: imports and paths
2. Configure: STAC endpoint, AOI, time range, scene filters
3. Search STAC for Sentinel-2 L2A Items
4. Inspect results as a table (pandas)
5. Choose a small set of best scenes
6. Build an xarray cube from STAC assets with stackstac
7. Compute NDVI and summarize (time series + composite)
8. Plot quicklooks
9. Export outputs (NetCDF)
10. Exercises
11. Recap

---

## 1) Setup

### Imports

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr

import pystac_client
import planetary_computer
import stackstac

### Paths
This notebook lives in notebooks/. Outputs go to ../outputs/.

In [None]:
OUT_DIR = Path('..') / 'outputs'
OUT_DIR.mkdir(exist_ok=True)

print('Working directory:', Path.cwd())
print('Outputs dir:', OUT_DIR.resolve())

---

## 2) Configure the search (STAC endpoint, AOI, time range)

### 2.1 Choose a STAC API endpoint
We will use Microsoft Planetary Computer.

In [None]:
STAC_API_URL = 'https://planetarycomputer.microsoft.com/api/stac/v1'

### 2.2 Define an AOI bbox (keep it small)
Bbox format: (min_lon, min_lat, max_lon, max_lat)

Choose something small (a few km). Large AOIs will be slow.

In [None]:
AOI_BBOX = (9.95, 49.78, 10.05, 49.83)
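
To check that a bbox really is "a few km" across, a rough size estimate helps. The sketch below is not part of the workflow itself: `bbox_size_km` is a hypothetical helper, and it assumes a spherical Earth with about 111.32 km per degree of latitude. That is good enough for a sanity check, not for measurement.

```python
import math

def bbox_size_km(bbox):
    """Approximate east-west and north-south extent of a lon/lat bbox in km."""
    min_lon, min_lat, max_lon, max_lat = bbox
    km_per_deg_lat = 111.32  # assumed spherical-Earth value
    mid_lat = math.radians((min_lat + max_lat) / 2)
    # Degrees of longitude shrink with the cosine of the latitude
    width = (max_lon - min_lon) * km_per_deg_lat * math.cos(mid_lat)
    height = (max_lat - min_lat) * km_per_deg_lat
    return width, height

width_km, height_km = bbox_size_km((9.95, 49.78, 10.05, 49.83))
print(f"{width_km:.1f} km x {height_km:.1f} km")
```

For the AOI above this comes out at roughly 7 km by 5.5 km.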

### 2.3 Time range
Use a short time range first (weeks, not years).

In [None]:
DATE_RANGE = '2024-06-01/2024-07-15'

### 2.4 Scene filters
We will filter by cloud cover and limit the number of scenes we stack.

In [None]:
MAX_CLOUD = 20
MAX_ITEMS = 6

---

## 3) Search STAC for Sentinel-2 L2A Items

### 3.1 Open the STAC API

In [None]:
# sign_inplace attaches short-lived access tokens to asset URLs so the files can be read
catalog = pystac_client.Client.open(
    STAC_API_URL,
    modifier=planetary_computer.sign_inplace,
)
print(f'Connected to: {catalog.title}')

### 3.2 Search
The Planetary Computer collection ID for Sentinel-2 L2A is `sentinel-2-l2a`.

We query:
- collection
- bbox
- datetime range
- cloud cover constraint

In [None]:
search = catalog.search(
    collections=['sentinel-2-l2a'],
    bbox=AOI_BBOX,
    datetime=DATE_RANGE,
    query={'eo:cloud_cover': {'lt': MAX_CLOUD}},
)
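items = list(search.items())
print(f'Found {len(items)} items')
# Stop early: the pandas cells below fail on an empty result
if not items:
    raise RuntimeError('No items found: widen DATE_RANGE or raise MAX_CLOUD.')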

---

## 4) Inspect results as a table (pandas)

### 4.1 Extract common fields
We build a table of: item id, datetime, cloud cover, bbox, asset keys.

In [None]:
rows = []
for it in items:
    props = it.properties
    rows.append({
        'id': it.id,
        'datetime': props.get('datetime'),
        'cloud': props.get('eo:cloud_cover', np.nan),
        'bbox': it.bbox,
        'assets': list(it.assets.keys()),
    })

df = pd.DataFrame(rows)
df['datetime'] = pd.to_datetime(df['datetime'], utc=True, errors='coerce')
df['cloud'] = pd.to_numeric(df['cloud'], errors='coerce')
df.sort_values(['cloud', 'datetime'], ascending=[True, False]).head(10)

### 4.2 Basic sanity checks

In [None]:
df['cloud'].describe()

In [None]:
df['assets'].head(3)

### ✅ Try it — inspect asset keys

What asset keys look like bands? Do you see `B04` and `B08`?

<details><summary>Show solution</summary>

```python
# Pick the first item and list its asset keys
sample_item = items[0]
for key in sorted(sample_item.assets):
    print(key)
```

</details>

In [None]:
# ✅ List the asset keys of the first Item — look for B04 and B08


---

## 5) Choose a small set of best scenes
We will pick up to MAX_ITEMS scenes with the lowest cloud cover.

In [None]:
df_best = df.sort_values(['cloud', 'datetime'], ascending=[True, False]).head(MAX_ITEMS).copy()
df_best[['id', 'datetime', 'cloud']]

Now get the corresponding Item objects and sort them chronologically, so the time axis of the cube is in order.

In [None]:
best_ids = set(df_best['id'].tolist())
items_best = [it for it in items if it.id in best_ids]
items_best = sorted(items_best, key=lambda it: it.properties.get('datetime'))
[it.id for it in items_best]
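
The same two-step selection (lowest cloud cover first, then chronological order) can also be written without the pandas round trip. This is a sketch on mock objects, not on pystac Items: `pick_best` is a hypothetical helper, and the `SimpleNamespace` stand-ins only mimic the `id` and `properties` attributes used above. It assumes every datetime string uses the same ISO 8601 format, so that string order equals time order.

```python
from types import SimpleNamespace

def pick_best(items, max_items):
    """Keep the max_items least cloudy items, returned in chronological order."""
    by_cloud = sorted(
        items,
        # Items without a cloud value sort last
        key=lambda it: it.properties.get("eo:cloud_cover", float("inf")),
    )
    return sorted(by_cloud[:max_items], key=lambda it: it.properties["datetime"])

# Mock items with made-up values, for illustration only
mock_items = [
    SimpleNamespace(id="a", properties={"datetime": "2024-06-03T10:30:00Z", "eo:cloud_cover": 12.0}),
    SimpleNamespace(id="b", properties={"datetime": "2024-06-08T10:30:00Z", "eo:cloud_cover": 1.5}),
    SimpleNamespace(id="c", properties={"datetime": "2024-06-13T10:30:00Z", "eo:cloud_cover": 18.0}),
    SimpleNamespace(id="d", properties={"datetime": "2024-06-18T10:30:00Z", "eo:cloud_cover": 0.4}),
]

best_ids_demo = [it.id for it in pick_best(mock_items, max_items=2)]
print(best_ids_demo)
```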

---

## 6) Build an xarray cube from STAC assets with stackstac
We will stack two Sentinel-2 bands:
- B04 (red)
- B08 (NIR)

NDVI = (NIR - red) / (NIR + red)
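
To get a feel for the formula, here is a small worked example in plain Python. The reflectance values are illustrative assumptions, not measurements from this AOI, and `ndvi_value` is a hypothetical helper.

```python
def ndvi_value(nir: float, red: float) -> float:
    """NDVI for a single pixel; returns NaN when NIR + red is zero."""
    total = nir + red
    if total == 0:
        return float("nan")
    return (nir - red) / total

# Illustrative (assumed) surface reflectances as (red, NIR) pairs
samples = {
    "dense vegetation": (0.05, 0.45),
    "bare soil": (0.25, 0.30),
    "water": (0.05, 0.02),
}

results = {
    name: round(ndvi_value(nir, red), 3)
    for name, (red, nir) in samples.items()
}
for name, value in results.items():
    print(f"{name:>16}: {value:+.3f}")
```

Values near +1 indicate dense green vegetation, values near 0 indicate bare surfaces, and negative values typically indicate water.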

### 6.1 Choose assets and resolution
Keep resolution coarse at first (e.g., 60 meters) to make it fast.

In [None]:
ASSETS = ['B04', 'B08']
RESOLUTION = 60
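
Before raising the resolution, it helps to estimate how large the cube will be. The sketch below is back-of-the-envelope arithmetic, not a stackstac call: `estimate_cube` is a hypothetical helper, the 7.2 km by 5.6 km AOI size is a rounded assumption, and 8 bytes per value assumes 64-bit floats. The real grid will differ slightly because the data are reprojected and snapped to a pixel grid.

```python
import math

def estimate_cube(width_m, height_m, resolution_m, n_scenes, n_bands, bytes_per_value=8):
    """Return the approximate (time, band, y, x) shape and size in MB."""
    nx = math.ceil(width_m / resolution_m)
    ny = math.ceil(height_m / resolution_m)
    n_values = n_scenes * n_bands * ny * nx
    return (n_scenes, n_bands, ny, nx), n_values * bytes_per_value / 1e6

# Assumed AOI size in metres; 6 scenes and 2 bands as configured in this notebook
estimates = {
    res: estimate_cube(7200, 5600, res, n_scenes=6, n_bands=2)
    for res in (60, 20, 10)
}
for res, (shape, size_mb) in estimates.items():
    print(f"{res:>2} m: shape={shape}, about {size_mb:.1f} MB")
```

Going from 60 m to 10 m multiplies the pixel count by about 36, which is why the notebook starts coarse.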

### 6.2 Create the stacked DataArray
This produces an xarray.DataArray with dimensions (time, band, y, x).
It is backed by a lazy Dask array, so no pixels are read until you compute or plot.

In [None]:
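stack = stackstac.stack(
    items_best,
    assets=ASSETS,           # only the red and NIR bands
    bounds_latlon=AOI_BBOX,  # clip to the AOI (lon/lat bbox)
    resolution=RESOLUTION,   # pixel size in units of the output CRS (metres for UTM)
    chunksize=2048,          # Dask chunk size along y and x
)
stack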

Check dimensions and coordinates:

In [None]:
stack.dims, stack.shape

In [None]:
stack['time'].values[:5], stack['band'].values

---

## 7) Compute NDVI and summarize

### 7.1 Convert to float and separate bands

In [None]:
stack_f = stack.astype('float32')
red = stack_f.sel(band='B04')
nir = stack_f.sel(band='B08')
red, nir

### 7.2 Compute NDVI

In [None]:
ndvi = (nir - red) / (nir + red)
ndvi.name = 'ndvi'
ndvi
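
Two details can distort NDVI computed directly on the stacked values. First, a pixel where NIR + red is zero divides by zero. Second, as far as we know, Sentinel-2 L2A products from processing baseline 04.00 onward (early 2022) store digital numbers with a +1000 offset, so that reflectance is (DN - 1000) / 10000. Whether your provider has already removed that offset is something to verify, for example via the `s2:processing_baseline` Item property if your Items expose it. The sketch below runs on synthetic numbers; `BOA_OFFSET`, `SCALE`, `to_reflectance`, and `safe_ndvi` are names introduced here, and both constants are assumptions.

```python
import numpy as np
import xarray as xr

# Assumptions: +1000 DN offset and 1/10000 scale; verify for your data source
BOA_OFFSET = 1000.0
SCALE = 10000.0

def to_reflectance(dn: xr.DataArray) -> xr.DataArray:
    """Convert digital numbers to reflectance; NaN (nodata) stays NaN."""
    return (dn - BOA_OFFSET) / SCALE

def safe_ndvi(nir: xr.DataArray, red: xr.DataArray) -> xr.DataArray:
    """NDVI with zero denominators masked to NaN instead of dividing by zero."""
    total = nir + red
    return ((nir - red) / total.where(total != 0)).rename("ndvi")

# Synthetic digital numbers for a 2 x 2 pixel window
dn_red = xr.DataArray(np.array([[1500.0, 2000.0], [1000.0, np.nan]]), dims=("y", "x"))
dn_nir = xr.DataArray(np.array([[4000.0, 2000.0], [1000.0, 3000.0]]), dims=("y", "x"))

ndvi_raw = safe_ndvi(dn_nir, dn_red)
ndvi_corrected = safe_ndvi(to_reflectance(dn_nir), to_reflectance(dn_red))
print(ndvi_raw.values)
print(ndvi_corrected.values)
```

On these numbers the top-left pixel moves from about 0.45 to about 0.71 once the offset is removed, so the correction is not cosmetic. The scale factor cancels in the ratio, but the offset does not.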

### 7.3 Mean NDVI per scene (time series)
We reduce over the x and y dimensions to get one value per time step.

In [None]:
spatial_dims = [d for d in ndvi.dims if d != 'time']
ts_mean = ndvi.mean(dim=spatial_dims, skipna=True)
ts_mean
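
A mean over the AOI hides how many pixels actually contributed to it. As a sketch on synthetic data (the array is made up, and `valid_fraction` is a name introduced here), the share of non-NaN pixels per scene can be tracked next to the mean:

```python
import numpy as np
import xarray as xr

# Synthetic NDVI cube: scene 0 is complete, scene 1 has a single valid pixel
demo = xr.DataArray(
    np.array([
        [[0.2, 0.4], [0.6, 0.8]],
        [[0.1, np.nan], [np.nan, np.nan]],
    ]),
    dims=("time", "y", "x"),
    name="ndvi",
)

spatial = [d for d in demo.dims if d != "time"]
demo_mean = demo.mean(dim=spatial, skipna=True)
# notnull() gives booleans; their mean is the fraction of valid pixels
valid_fraction = demo.notnull().mean(dim=spatial)

print(demo_mean.values)
print(valid_fraction.values)
```

A scene whose mean rests on a small fraction of the AOI is a candidate for exclusion from the time series.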

Plot it:

In [None]:
ts_mean.plot(marker='o')
plt.title('Mean NDVI over AOI (per scene)')
plt.xlabel('time')
plt.ylabel('mean NDVI')
plt.tight_layout()
plt.show()

### 7.4 Median NDVI composite over time
This creates one representative 2D NDVI map over the AOI.

In [None]:
ndvi_median = ndvi.median(dim='time', skipna=True)
ndvi_median

Plot quicklook:

In [None]:
ndvi_median.plot(vmin=-0.2, vmax=0.9)
plt.title('NDVI median composite (AOI)')
plt.tight_layout()
plt.show()

---

## 8) Plot quicklooks for individual dates
Pick one timestep index and plot NDVI.

In [None]:
i = min(2, ndvi.sizes['time'] - 1)
ndvi_i = ndvi.isel(time=i)

ndvi_i.plot(vmin=-0.2, vmax=0.9)
plt.title(f"NDVI at timestep {i} ({str(ndvi['time'].values[i])[:10]})")
plt.tight_layout()
plt.show()

### ✅ Try it — compare scenes

Change `i` and compare scenes. Do some scenes show cloud artefacts in NDVI even though their scene-level cloud cover was low?

<details><summary>Show solution</summary>

```python
for i in range(min(4, ndvi.sizes["time"])):
    ndvi_i = ndvi.isel(time=i)
    ndvi_i.plot(vmin=-0.2, vmax=0.9)
    plt.title(f"NDVI at timestep {i} ({str(ndvi['time'].values[i])[:10]})")
    plt.tight_layout()
    plt.show()
```

</details>

In [None]:
# ✅ Change i and plot several scenes — look for cloud artefacts


---

## 9) Export outputs (NetCDF)
We will export the NDVI time series and the NDVI median composite.

### 9.1 Create a summary Dataset

In [None]:
summary = xr.Dataset(
    {
        'ndvi_mean_timeseries': ts_mean,
        'ndvi_median_composite': ndvi_median,
    }
)
summary

### 9.2 Save

In [None]:
out_nc = OUT_DIR / 'stac_s2_ndvi_summary.nc'
summary.to_netcdf(out_nc)
out_nc
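
Cubes built from STAC metadata often carry many extra, non-dimension coordinates, and some of them may hold Python objects that NetCDF cannot store. If `to_netcdf` fails for that reason, one possible workaround is to drop every coordinate that is not a dimension before saving. The sketch below uses a synthetic array; the `tags` coordinate is a made-up stand-in for such metadata.

```python
import numpy as np
import xarray as xr

# Made-up object-dtype coordinate (one Python list per row)
tags = np.empty(2, dtype=object)
tags[0] = ["a"]
tags[1] = ["b", "c"]

demo = xr.DataArray(
    np.zeros((2, 3)),
    dims=("y", "x"),
    coords={
        "y": [10.0, 20.0],
        "x": [1.0, 2.0, 3.0],
        "tags": ("y", tags),
    },
    name="ndvi_median_composite",
)

# Drop all non-index coordinates; the dimension coordinates y and x remain
clean = demo.reset_coords(drop=True)
print(sorted(clean.coords))
```

Applied to this notebook, that would read `summary.reset_coords(drop=True).to_netcdf(out_nc)`. The cost is that scene metadata stored in those coordinates is not written to the file.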

Also export the time series as CSV:

In [None]:
out_csv = OUT_DIR / 'stac_s2_ndvi_timeseries.csv'
ts_df = ts_mean.to_dataframe(name='mean_ndvi').reset_index()
ts_df.to_csv(out_csv, index=False)
out_csv

---

## 10) Exercises

### ✅ Try it — AOI and composites

**Exercise 1 — Make the AOI smaller and faster**
- Shrink the bbox to a smaller region (a few km).
- Re-run the search and stacking.
- Compare runtime and the smoothness of the NDVI time series.

**Exercise 2 — Compare mean vs median composite**
- Compute `ndvi_mean` and `ndvi_median`, then plot both side-by-side.

<details><summary>Show solution</summary>

```python
# Exercise 2 — mean vs median
ndvi_mean = ndvi.mean(dim="time", skipna=True)

ndvi_mean.plot(vmin=-0.2, vmax=0.9)
plt.title("NDVI mean composite")
plt.tight_layout()
plt.show()

ndvi_median.plot(vmin=-0.2, vmax=0.9)
plt.title("NDVI median composite")
plt.tight_layout()
plt.show()
```

</details>

In [None]:
# ✅ Exercise 1: Shrink the bbox, re-search and re-stack
# ✅ Exercise 2: Compute ndvi_mean and compare with ndvi_median


### ✅ Try it — scene selection & resolution

**Exercise 3 — Tighten scene selection rules**
- Add an additional filter in pandas (e.g., `cloud < 5`) before stacking.
- Re-run stacking and compare.

**Exercise 4 (optional) — Use a higher resolution**
- Set `RESOLUTION = 20` or `RESOLUTION = 10`. Re-run stacking and NDVI.
- Recompute the mean and median composites at this resolution. Which one is less sensitive to outliers?
- What is the trade-off for your teaching environment?

<details><summary>Show solution</summary>

```python
# Exercise 3 — tighter cloud filter
df_strict = df.loc[df["cloud"] < 5].sort_values("datetime")
strict_ids = set(df_strict["id"].tolist())
items_strict = sorted(
    [it for it in items if it.id in strict_ids],
    key=lambda it: it.properties.get("datetime"),
)
print(f"Strict selection: {len(items_strict)} items")
```

```python
# Exercise 4 — higher resolution
stack_hires = stackstac.stack(
    items_best,
    assets=ASSETS,
    bounds_latlon=AOI_BBOX,
    resolution=20,
    chunksize=2048,
)
print(f"High-res shape: {stack_hires.shape}")
```

</details>

In [None]:
# ✅ Exercise 3: Add a stricter cloud filter and re-stack
# ✅ Exercise 4: Set RESOLUTION = 20 and compare shape / runtime


---

## 11) Recap
You now have an end-to-end workflow:

- Query a STAC API for scenes over an AOI and time range
- Inspect and filter Items in pandas
- Choose a small set of scenes
- Stack raster band assets into an xarray cube with stackstac
- Compute NDVI, summarize, plot, and export results

Next steps (depending on course scope):
- Cloud masking using Sentinel-2 scene classification (SCL) assets
- Larger AOIs with chunking strategies and performance tuning
- Export to Zarr and work with xarray lazily across larger time ranges
- Use alternative STAC providers (the URL-signing step used here is specific to Planetary Computer)