# Searching for Sentinel-2 Scenes with `maji.search`

**Learning objectives**

1. Understand how `maji.search` queries the free CDSE STAC catalog for Sentinel-2 L2A imagery.
2. Inspect the returned GeoDataFrame — scene IDs, MGRS tiles, cloud cover, geometries, and S3 asset paths.
3. Visualise scene footprints on a map.
4. Compare scene-selection strategies (`least_cloudy` vs `most_recent`).
5. Plot a cloud-cover time series to understand temporal variability.

**Prerequisites**

* The `maji` conda environment is active (`conda activate maji`).
* Live network access is required — the notebook queries the [CDSE STAC endpoint](https://stac.dataspace.copernicus.eu/v1/), which is free and requires no authentication.

In [1]:
import folium
import geopandas as gpd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import box

from maji.search import (
    BANDS_OF_INTEREST,
    COLLECTION,
    STAC_URL,
    search_scenes,
    select_best_scenes,
)

print(f"STAC endpoint : {STAC_URL}")
print(f"Collection    : {COLLECTION}")
print(f"Bands         : {BANDS_OF_INTEREST}")

STAC endpoint : https://stac.dataspace.copernicus.eu/v1/
Collection    : sentinel-2-l2a
Bands         : ['B03', 'B04', 'B08', 'B8A', 'B11', 'B12', 'SCL']


## §1 Define the Search Area

The bounding box is specified as `(west, south, east, north)` in **EPSG:4326** (decimal degrees).

We will search the **Karoo region near Sutherland, South Africa** — adjacent to the SKA core site, semi-arid, and relatively cloud-free, which makes it a good test case for Sentinel-2 searches.

In [2]:
BBOX = (20.0, -32.5, 21.0, -31.5)
START = "2025-10-01"
END = "2025-12-31"
MAX_CLOUD = 40.0
MAX_ITEMS = 100

search_box = box(*BBOX)
print(f"Search bbox : {BBOX}")
print(f"Date range  : {START} to {END}")
print(f"Max cloud   : {MAX_CLOUD}%")
print(f"Max items   : {MAX_ITEMS}")
print(f"Search box  : {search_box}")

Search bbox : (20.0, -32.5, 21.0, -31.5)
Date range  : 2025-10-01 to 2025-12-31
Max cloud   : 40.0%
Max items   : 100
Search box  : POLYGON ((21 -32.5, 21 -31.5, 20 -31.5, 20 -32.5, 21 -32.5))
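
Because the bounding box is a bare tuple, a reversed pair of values is easy to miss. The helper below is a minimal sketch of a sanity check that could run before the search. `validate_bbox` is a hypothetical name, not part of `maji`. It catches out-of-range values and reversed min/max pairs, but it cannot detect swapped latitude and longitude when both orderings happen to be in range.

```python
def validate_bbox(bbox):
    """Check a (west, south, east, north) tuple given in EPSG:4326 degrees.

    Hypothetical helper for illustration; not part of maji.
    """
    if len(bbox) != 4:
        raise ValueError(f"expected 4 values, got {len(bbox)}")
    west, south, east, north = bbox
    if not (-180.0 <= west <= 180.0 and -180.0 <= east <= 180.0):
        raise ValueError("longitudes must lie in [-180, 180]")
    if not (-90.0 <= south <= 90.0 and -90.0 <= north <= 90.0):
        raise ValueError("latitudes must lie in [-90, 90]")
    if south >= north:
        raise ValueError("south must be less than north")
    if west >= east:
        # Boxes that cross the antimeridian are deliberately not handled here.
        raise ValueError("west must be less than east")
    return (float(west), float(south), float(east), float(north))


checked = validate_bbox((20.0, -32.5, 21.0, -31.5))
print(checked)
```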


## §2 Query the STAC Catalog

`search_scenes()` does the following under the hood:

1. Opens a `pystac_client.Client` pointing at the CDSE STAC endpoint.
2. Sends a filtered search: collection, bbox, date range, and cloud-cover threshold.
3. For each returned item it extracts the **scene ID**, **MGRS tile** (via `grid:code`), **datetime**, **cloud cover**, **footprint geometry**, and a dict of **S3 asset hrefs** for the bands of interest.
4. Returns a GeoDataFrame in EPSG:4326, sorted by `(mgrs_tile, datetime)`.
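
Steps 1 and 2 can be sketched with `pystac_client` directly. This is not the `maji` source. The function names are hypothetical, and expressing the cloud-cover threshold through the STAC `query` extension (with a strict `lt` comparison) is an assumption. The real implementation may use a CQL2 `filter` or a different operator.

```python
def build_search_params(bbox, start, end, max_cloud, max_items,
                        collection="sentinel-2-l2a"):
    """Assemble keyword arguments for pystac_client.Client.search().

    Hypothetical helper; the cloud-cover 'query' form is an assumption.
    """
    return {
        "collections": [collection],
        "bbox": list(bbox),
        "datetime": f"{start}/{end}",  # closed interval, RFC 3339 dates
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
        "max_items": max_items,
    }


def run_search(stac_url, params):
    """Execute the search. Needs network access and pystac-client installed."""
    # Imported lazily so that build_search_params() works offline.
    from pystac_client import Client

    client = Client.open(stac_url)
    return list(client.search(**params).items())


params = build_search_params(
    (20.0, -32.5, 21.0, -31.5), "2025-10-01", "2025-12-31", 40.0, 100
)
print(params)
```

Calling `run_search(STAC_URL, params)` would then return a list of `pystac.Item` objects, from which fields such as `item.id`, `item.geometry`, and `item.properties` can be read.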

In [3]:
%%time
scenes = search_scenes(
    bbox=BBOX,
    start=START,
    end=END,
    max_cloud=MAX_CLOUD,
    max_items=MAX_ITEMS,
)

print(f"Scenes returned : {len(scenes)}")
print(f"Unique tiles    : {scenes['mgrs_tile'].nunique()}")
scenes.head()

APIError: {"code":"SerializationError","description":"canceling statement due to conflict with recovery\nDETAIL:  User query might have needed to see row versions that must be removed."}
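
The `APIError` above appears to be a transient, server-side cancellation rather than a problem with the query, so re-running the cell is often enough. For unattended runs, a small retry wrapper with exponential backoff is one option. The sketch below is generic: `retry` is a hypothetical helper, not part of `maji`, and it is demonstrated offline with a stand-in function.

```python
import time


def retry(func, attempts=4, base_delay=1.0, retry_on=(Exception,),
          sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a stand-in that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient server error")
    return "ok"


delays = []  # record the requested waits instead of actually sleeping
result = retry(flaky, retry_on=(RuntimeError,), sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap the search call, for example `retry(lambda: search_scenes(bbox=BBOX, start=START, end=END, max_cloud=MAX_CLOUD, max_items=MAX_ITEMS), retry_on=(APIError,))`, assuming the exception raised is `pystac_client.exceptions.APIError`.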

## §3 Explore the Results

In [None]:
print("Columns:", scenes.columns.tolist())
print(f"CRS: {scenes.crs}\n")

tile_counts = scenes.groupby("mgrs_tile").size().rename("n_scenes")
print("Scenes per tile:")
print(tile_counts.to_string())

print("\nCloud cover statistics:")
print(scenes["cloud_cover"].describe().to_string())

In [None]:
first_row = scenes.iloc[0]
print(f"Scene  : {first_row['scene_id']}")
print(f"Tile   : {first_row['mgrs_tile']}")
print(f"Assets ({len(first_row['assets'])} bands):")
for band, href in sorted(first_row["assets"].items()):
    # Show only the tail of the S3 path for readability
    print(f"  {band:4s} -> ...{href[-60:]}")

## §4 Visualise Footprints

Each scene carries a `geometry` column with its footprint polygon.
The interactive map below uses [folium](https://python-visualization.github.io/folium/)
with OpenStreetMap tiles as the basemap. Scene footprints are grouped by MGRS tile —
use the layer control in the top-right corner to toggle individual tiles on and off.
Click any footprint to see scene metadata.

In [None]:
# Colour palette — one colour per MGRS tile (tab10 cycle)
_TAB10 = [
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
]
_tiles = sorted(scenes["mgrs_tile"].unique())
_tile_colours = {t: _TAB10[i % len(_TAB10)] for i, t in enumerate(_tiles)}

# Base map centred on the BBOX midpoint
_centre = [(BBOX[1] + BBOX[3]) / 2, (BBOX[0] + BBOX[2]) / 2]
m = folium.Map(location=_centre, zoom_start=9, control_scale=True)

# One FeatureGroup per MGRS tile (toggleable via LayerControl)
for tile in _tiles:
    fg = folium.FeatureGroup(name=f"MGRS {tile}")
    tile_scenes = scenes[scenes["mgrs_tile"] == tile]

    for _, row in tile_scenes.iterrows():
        dt_str = row["datetime"].strftime("%Y-%m-%d %H:%M")
        popup_html = (
            f"<b>Scene:</b> {row['scene_id']}<br>"
            f"<b>Tile:</b> {row['mgrs_tile']}<br>"
            f"<b>Date:</b> {dt_str}<br>"
            f"<b>Cloud:</b> {row['cloud_cover']:.1f}%"
        )
        folium.GeoJson(
            row["geometry"].__geo_interface__,
            style_function=lambda _feat, _c=_tile_colours[tile]: {
                "fillColor": _c,
                "color": _c,
                "weight": 1.5,
                "fillOpacity": 0.35,
            },
            popup=folium.Popup(popup_html, max_width=300),
            tooltip=f"{tile} — {row['cloud_cover']:.0f}% cloud",
        ).add_to(fg)

    fg.add_to(m)

# Search bounding box (red, dashed) — not in any FeatureGroup, always visible
folium.Rectangle(
    bounds=[[BBOX[1], BBOX[0]], [BBOX[3], BBOX[2]]],
    color="red",
    weight=2,
    dash_array="8",
    fill=False,
    popup="Search bounding box",
).add_to(m)

# Auto-zoom to AOI with 0.5° buffer
m.fit_bounds([[BBOX[1] - 0.5, BBOX[0] - 0.5], [BBOX[3] + 0.5, BBOX[2] + 0.5]])

# Layer control (expanded by default for discoverability)
folium.LayerControl(collapsed=False).add_to(m)

# Render inline HTML to avoid VS Code iframe trust issues
from IPython.display import display, HTML

display(HTML(m.get_root().render()))

## §5 Under the Hood

Two private helpers do the heavy lifting inside `search_scenes()`:

| Helper | What it does |
|--------|--------------|
| `_extract_mgrs_tile` | Reads `grid:code` (e.g. `MGRS-34HBJ`) and strips the prefix. Falls back to parsing the scene title/id for a `T`-prefixed 6-char token. |
| `_extract_band_assets` | Matches asset keys using `startswith(band + "_")` so that `B08` does **not** collide with `B8A`. Falls back to href substring matching. |
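
The logic described in the table can be approximated as follows. This is a sketch written from the description above, not the `maji` source. `extract_tile` and `match_bands` are hypothetical names, the regular expression is one plausible reading of "a `T`-prefixed 6-char token", returning `None` for a missing tile is an assumption, and the href-substring fallback is omitted.

```python
import re

# '_T' followed by a 2-digit UTM zone and 3 letters, e.g. '_T34HBJ_'.
_TILE_TOKEN = re.compile(r"_T(\d{2}[A-Z]{3})(?:_|$)")


def extract_tile(properties, item_id=""):
    """Return the MGRS tile (e.g. '34HBJ'), or None if it cannot be found."""
    code = properties.get("grid:code", "")
    if code.startswith("MGRS-"):
        return code[len("MGRS-"):]
    # Fallback: search the title, then the item id, for a tile token.
    for text in (properties.get("title", ""), item_id):
        match = _TILE_TOKEN.search(text)
        if match:
            return match.group(1)
    return None


def match_bands(assets, bands):
    """Map each band to the href of the first asset key starting '<band>_'."""
    found = {}
    for band in bands:
        for key, asset in assets.items():
            if key.startswith(band + "_"):
                found[band] = asset["href"]
                break
    return found


print(extract_tile({"grid:code": "MGRS-34HBJ"}))
print(extract_tile(
    {"title": "S2A_MSIL2A_20260110T081234_N0510_R035_T34HBJ_20260110T100000"}
))
print(match_bands(
    {"B08_10m": {"href": "s3://bucket/B08_10m.tif"},
     "B8A_20m": {"href": "s3://bucket/B8A_20m.tif"}},
    ["B08", "B8A"],
))
```

Anchoring the match on `band + "_"` keeps each band tied to its own asset key, which is the property the assertion in the next cell checks against the real helper.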

In [None]:
from maji.search import _extract_band_assets, _extract_mgrs_tile

# --- MGRS tile extraction ---
# Case 1: grid:code present (primary path)
print("grid:code  :", _extract_mgrs_tile({"grid:code": "MGRS-34HBJ"}))

# Case 2: fallback to title
print("title      :", _extract_mgrs_tile(
    {"title": "S2A_MSIL2A_20260110T081234_N0510_R035_T34HBJ_20260110T100000"}
))

# Case 3: nothing available
print("missing    :", repr(_extract_mgrs_tile({})))

print()

# --- Band asset extraction ---
# Demonstrate B08 vs B8A disambiguation
fake_assets = {
    "B03_10m": {"href": "s3://bucket/B03_10m.tif"},
    "B04_10m": {"href": "s3://bucket/B04_10m.tif"},
    "B08_10m": {"href": "s3://bucket/B08_10m.tif"},
    "B8A_20m": {"href": "s3://bucket/B8A_20m.tif"},
    "B11_20m": {"href": "s3://bucket/B11_20m.tif"},
    "B12_20m": {"href": "s3://bucket/B12_20m.tif"},
    "SCL_20m": {"href": "s3://bucket/SCL_20m.tif"},
}

extracted = _extract_band_assets(fake_assets)
for band, href in sorted(extracted.items()):
    print(f"  {band:4s} -> {href}")

# Verify B08 and B8A resolve to different assets
assert extracted["B08"] != extracted["B8A"], "B08/B8A collision!"
print("\nB08/B8A correctly disambiguated.")

## §6 Select Best Scene per Tile

`select_best_scenes()` takes one of three strategies. `least_cloudy` and `most_recent` reduce the search results to **one scene per MGRS tile**; `all` returns the results unfiltered:

| Strategy | Picks | Use-case |
|----------|-------|----------|
| `least_cloudy` | Lowest `cloud_cover` per tile | Best for clear imagery (default) |
| `most_recent` | Latest `datetime` per tile | Best for change detection |
| `all` | No filtering | Keep everything |
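
The per-tile selection can be illustrated without any dependencies. The sketch below works on plain dicts rather than a GeoDataFrame. `pick_per_tile` is a hypothetical stand-in for `select_best_scenes()`, the sample records are made up, and the tie-breaking rule (first record in sorted order wins) is an assumption.

```python
from datetime import datetime


def pick_per_tile(records, strategy="least_cloudy"):
    """Keep one record per 'mgrs_tile' according to the chosen strategy."""
    if strategy == "all":
        return list(records)
    if strategy == "least_cloudy":
        ordered = sorted(records, key=lambda r: r["cloud_cover"])
    elif strategy == "most_recent":
        ordered = sorted(records, key=lambda r: r["datetime"], reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    best = {}
    for rec in ordered:
        # The first record seen for a tile is the best one under this ordering.
        best.setdefault(rec["mgrs_tile"], rec)
    return [best[tile] for tile in sorted(best)]


# Made-up sample values, for illustration only.
records = [
    {"mgrs_tile": "34HBJ", "datetime": datetime(2025, 10, 3), "cloud_cover": 2.0},
    {"mgrs_tile": "34HBJ", "datetime": datetime(2025, 12, 27), "cloud_cover": 31.0},
    {"mgrs_tile": "34HCJ", "datetime": datetime(2025, 11, 12), "cloud_cover": 8.5},
]

for strategy in ("least_cloudy", "most_recent", "all"):
    picked = pick_per_tile(records, strategy)
    print(strategy, [(r["mgrs_tile"], r["cloud_cover"]) for r in picked])
```

With the sample data, the two strategies disagree on tile `34HBJ`: the clearest scene is the oldest one, while the most recent scene is much cloudier.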

In [None]:
best_cloud = select_best_scenes(scenes, strategy="least_cloudy")
best_recent = select_best_scenes(scenes, strategy="most_recent")

print(f"least_cloudy : {len(best_cloud)} scene(s)")
print(f"most_recent  : {len(best_recent)} scene(s)")
print()

comparison = pd.merge(
    best_cloud[["mgrs_tile", "datetime", "cloud_cover"]].rename(
        columns={"datetime": "date_cloud", "cloud_cover": "cc_cloud"}
    ),
    best_recent[["mgrs_tile", "datetime", "cloud_cover"]].rename(
        columns={"datetime": "date_recent", "cloud_cover": "cc_recent"}
    ),
    on="mgrs_tile",
)

comparison

## §7 Cloud Cover Over Time

Cloud cover varies considerably from one revisit to the next. The bar chart below shows every scene for the tile with the most observations, highlighting the best (least cloudy) scene in green.

In [None]:
tile_counts = scenes.groupby("mgrs_tile").size()
focus_tile = tile_counts.idxmax()
tile_df = scenes[scenes["mgrs_tile"] == focus_tile].copy()
tile_df = tile_df.sort_values("datetime")

best_idx = tile_df["cloud_cover"].idxmin()

colours = [
    "seagreen" if i == best_idx else "steelblue"
    for i in tile_df.index
]

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(tile_df["datetime"], tile_df["cloud_cover"], width=1.5, color=colours)
ax.axhline(MAX_CLOUD, color="red", linestyle="--", linewidth=1, label=f"Max cloud = {MAX_CLOUD}%")

ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b"))
fig.autofmt_xdate()

ax.set_xlabel("Date")
ax.set_ylabel("Cloud cover (%)")
ax.set_title(f"Cloud Cover Time Series — Tile {focus_tile}")
ax.legend()
plt.tight_layout()
plt.show()

best_row = tile_df.loc[best_idx]
print(f"Best scene: {best_row['scene_id']}")
print(f"  Date       : {best_row['datetime']}")
print(f"  Cloud cover: {best_row['cloud_cover']:.1f}%")

## §8 Putting It All Together

In a real workflow the search feeds directly into the download stage:

```python
from maji.search import search_scenes, select_best_scenes
from maji.download import download_tile

scenes = search_scenes(bbox=BBOX, start=START, end=END)
best   = select_best_scenes(scenes, strategy="least_cloudy")

for _, row in best.iterrows():
    download_tile(row["assets"], out_dir="DATA/")
```

**Key takeaways**

* **STAC** is an open standard for cataloguing geospatial data — no vendor lock-in.
* **MGRS tiles** are the delivery grid for Sentinel-2: each tile is a square of roughly 110 km × 110 km that slightly overlaps its neighbours, and each scene covers one tile.
* Cloud-cover filtering at search time avoids downloading unusable imagery.
* The `assets` dict gives you direct S3 paths for each band, ready for `download_tile()`.

## Exercises

1. **Change the search area.** Set `BBOX` to a region in Kenya (e.g. `(36.0, -2.0, 37.0, -1.0)`) and re-run the notebook. How does tile coverage change?
2. **Tighten the cloud filter.** Set `MAX_CLOUD = 10.0`. How many scenes survive?
3. **Compare strategy trade-offs.** For which tiles do `least_cloudy` and `most_recent` select different scenes? Why might that matter?
4. **Extend the date range.** Search a full 6 months. Plot the cloud-cover time series and look for seasonal patterns.