# Searching for Sentinel-2 Scenes from Scratch

This notebook reproduces what `maji.search` does, **step by step**,
using only widely used open-source libraries.  No `maji` imports appear anywhere.

**Why?**  Understanding the underlying STAC workflow makes it easier to
debug searches, adapt parameters, and extend the pipeline.

**Libraries used**

| Library | Role |
|---------|------|
| `pystac_client` | Query the CDSE STAC catalog |
| `geopandas` / `shapely` | Handle geospatial data |
| `pandas` | Tabular manipulation |
| `folium` | Interactive map |
| `matplotlib` | Static charts |

**Prerequisites**

* The `maji` conda environment is active (`conda activate maji`).
* Network access is required — we query the free [CDSE STAC endpoint](https://stac.dataspace.copernicus.eu/v1/) (no authentication needed).

In [None]:
import logging

import folium
import geopandas as gpd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from pystac_client import Client
from shapely.geometry import box, shape

# --- Constants (mirrors maji/search.py) ---
STAC_URL = "https://stac.dataspace.copernicus.eu/v1/"
COLLECTION = "sentinel-2-l2a"
#BANDS_OF_INTEREST = ["B03", "B04", "B08", "B8A", "B11", "B12", "SCL"]
BANDS_OF_INTEREST = ['B02', 'B03', 'B04', 'B08', 'B11', 'B12', 'SCL']  # Added B02 for inference

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("search_from_scratch")

print(f"STAC endpoint : {STAC_URL}")
print(f"Collection    : {COLLECTION}")
print(f"Bands         : {BANDS_OF_INTEREST}")

## §1 Define the Search Area

The bounding box is specified as `(west, south, east, north)` in **EPSG:4326** (decimal degrees).

We will search the **Faza region in coastal Kenya** — located in the Lamu Archipelago, this area includes Pate Island and surrounding coastline, with a mix of mangroves, coastal vegetation, and ocean.
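
As a quick sanity check on the `(west, south, east, north)` ordering, the sketch below validates a bounding box and estimates its size on the ground. The helper name `describe_bbox` and the 111.32 km-per-degree constant are illustrative assumptions (a spherical approximation), not part of `maji`.

```python
import math

# Assumption: one degree of latitude is about 111.32 km (spherical Earth)
KM_PER_DEGREE = 111.32


def describe_bbox(bbox: tuple[float, float, float, float]) -> dict[str, float]:
    """Validate a (west, south, east, north) bbox and estimate its extent in km."""
    west, south, east, north = bbox
    if not (-180 <= west < east <= 180):
        raise ValueError(f"Expected west < east within [-180, 180], got {west}, {east}")
    if not (-90 <= south < north <= 90):
        raise ValueError(f"Expected south < north within [-90, 90], got {south}, {north}")

    mid_lat = (south + north) / 2
    height_km = (north - south) * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude
    width_km = (east - west) * KM_PER_DEGREE * math.cos(math.radians(mid_lat))
    return {"width_km": width_km, "height_km": height_km}


extent = describe_bbox((40.6, -2.6, 41.6, -1.6))
print(f"~{extent['width_km']:.0f} km wide, ~{extent['height_km']:.0f} km tall")
```

Swapping latitude and longitude is a common mistake with STAC bounding boxes; a check like this fails fast when west and east are reversed, although it cannot catch every swap.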

In [None]:
BBOX = (40.6, -2.6, 41.6, -1.6)  # Faza region, Kenya
START = "2025-10-01"
END = "2025-12-31"
MAX_CLOUD = 40.0
MAX_ITEMS = 100

search_box = box(*BBOX)
print(f"Search bbox : {BBOX}")
print(f"Date range  : {START} to {END}")
print(f"Max cloud   : {MAX_CLOUD}%")
print(f"Max items   : {MAX_ITEMS}")
print(f"Search box  : {search_box}")

## §2 Extract MGRS Tile

Every Sentinel-2 scene belongs to a **Military Grid Reference System (MGRS)** tile.
CDSE stores the tile code in the `grid:code` property using the format `MGRS-<tile>`
(e.g. `MGRS-34HBJ`).

The function below mirrors `maji.search._extract_mgrs_tile`:

1. **Primary path** — read `grid:code` and strip the `MGRS-` prefix.
2. **Fallback** — if `grid:code` is absent (some older catalog entries), scan
   underscore-delimited segments of the `title` or `id` for a 6-character token
   starting with `T` (e.g. `T36MZE`), and strip the leading `T`.

In [None]:
def extract_mgrs_tile(properties: dict) -> str:
    """Extract MGRS tile ID from STAC item properties.

    Parameters
    ----------
    properties : dict
        STAC item ``properties`` dictionary.

    Returns
    -------
    str
        Five-character MGRS tile code (e.g. ``"37MFT"``), or an empty
        string if no tile code can be determined.
    """
    # `or ""` also covers an explicit null value for grid:code
    grid_code = properties.get("grid:code") or ""
    if grid_code.startswith("MGRS-"):
        return grid_code[5:]  # Strip "MGRS-" prefix

    # Fallback: parse from product/scene ID
    # S2A_MSIL2A_20260110T..._T36MZE_... -> "36MZE"
    scene_id = properties.get("title", "") or properties.get("id", "")
    for part in scene_id.split("_"):
        if part.startswith("T") and len(part) == 6:
            return part[1:]  # Strip leading "T"

    return ""

## §3 Extract Band Assets

Each STAC item has an `assets` dict mapping asset keys to download URLs.
Asset keys in CDSE follow the pattern `<band>_<resolution>` (e.g. `B03_10m`,
`B8A_20m`, `SCL_20m`).

### The B08 / B8A problem

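A bare `key.startswith(band)` check is too loose: any band name that is a
prefix of another asset key would match it.  For example, a band written as
`B8` would also match `B8A_20m`, and `B1` would match `B11_20m` and `B12_20m`.
`B08` and `B8A` are distinct bands whose names share nothing beyond the
leading `B`, so they are only confused when the zero padding is dropped.  We use
`key.startswith(band + "_")` so that the whole band name must match, up to
the underscore that precedes the resolution.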

If no key-prefix match is found, the function falls back to searching the
`href` path for `_<band>_` or `_<band>.` substrings.

This mirrors `maji.search._extract_band_assets`.

In [None]:
def extract_band_assets(assets: dict, bands: list[str] | None = None) -> dict[str, str]:
    """Extract S3 hrefs for requested bands from STAC item assets.

    Parameters
    ----------
    assets : dict
        STAC item ``assets`` mapping.
    bands : list[str] or None, optional
        Band names to extract (default: ``BANDS_OF_INTEREST``).

    Returns
    -------
    dict[str, str]
        ``{band_name: s3_href}`` for every band that was found.
    """
    if bands is None:
        bands = BANDS_OF_INTEREST

    result: dict[str, str] = {}
    for band in bands:
        band_upper = band.upper()
        # Try key-prefix match first (e.g. "B03_10m", "B8A_20m", "SCL_20m")
        for key, asset in assets.items():
            href = getattr(asset, "href", None) or asset.get("href", "")
            key_upper = key.upper()
            # Match "B03_..." but not "B08" matching "B8A_..."
            if key_upper.startswith(band_upper + "_") or key_upper == band_upper:
                result[band] = href
                break
        else:
            # Fallback: match on href path
            for key, asset in assets.items():
                href = getattr(asset, "href", None) or asset.get("href", "")
                # Use the upper-cased band name, consistent with the key match above
                if f"_{band_upper}_" in href or f"_{band_upper}." in href:
                    result[band] = href
                    break

    return result

## §4 Query the STAC Catalog

This is the core of `maji.search.search_scenes()`.  Four steps:

1. **Open the client** — `pystac_client.Client.open()` connects to the CDSE STAC endpoint.
2. **Build the search** — `catalog.search()` filters by collection, bbox, date range,
   cloud cover, and maximum number of items.
3. **Iterate items** — for each returned STAC item, we extract the MGRS tile,
   band assets, and metadata.  Items with missing tiles or incomplete bands are
   skipped (with a warning).
4. **Build a GeoDataFrame** — sorted by `(mgrs_tile, datetime)` with CRS EPSG:4326.
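
For orientation, the search parameters in step 2 correspond roughly to the JSON body of a STAC API `POST /search` request. The sketch below builds such a body by hand with the standard library; the helper `build_search_body` and the page size of 100 are assumptions for illustration, and the exact body that `pystac_client` sends may differ in detail.

```python
import json


def build_search_body(
    collection: str,
    bbox: tuple[float, float, float, float],
    start: str,
    end: str,
    max_cloud: float,
    page_size: int = 100,  # Assumed value, for illustration only
) -> dict:
    """Assemble an illustrative STAC item-search request body."""
    return {
        "collections": [collection],
        "bbox": list(bbox),  # (west, south, east, north)
        # A closed interval is written "<start>/<end>" with RFC 3339 timestamps
        "datetime": f"{start}T00:00:00Z/{end}T23:59:59Z",
        # Query extension: property filter evaluated by the server
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
        # "limit" is the page size, not a cap on the total number of results
        "limit": page_size,
    }


body = build_search_body(
    "sentinel-2-l2a", (40.6, -2.6, 41.6, -1.6), "2025-10-01", "2025-12-31", 40.0
)
print(json.dumps(body, indent=2))
```

Note that `max_items` has no counterpart in the body: in `pystac_client` it is a client-side cap applied while paging through results.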

In [None]:
%%time

# Step 1: Open the STAC client
catalog = Client.open(STAC_URL)

# Step 2: Build the search request
search = catalog.search(
    collections=[COLLECTION],
    bbox=list(BBOX),
    datetime=f"{START}/{END}",
    query={"eo:cloud_cover": {"lt": MAX_CLOUD}},
    max_items=MAX_ITEMS,
)

# Step 3: Iterate items and build rows
items = search.item_collection()
logger.info("STAC search returned %d items", len(items))

rows: list[dict] = []
for item in items:
    props = item.properties
    mgrs_tile = extract_mgrs_tile(props)
    if not mgrs_tile:
        logger.warning("Could not extract MGRS tile from item %s", item.id)
        continue

    band_assets = extract_band_assets(item.assets)
    if len(band_assets) < len(BANDS_OF_INTEREST):
        missing = set(BANDS_OF_INTEREST) - set(band_assets.keys())
        logger.warning(
            "Item %s missing bands: %s — skipping", item.id, missing
        )
        continue

    rows.append(
        {
            "scene_id": item.id,
            "mgrs_tile": mgrs_tile,
            "datetime": pd.Timestamp(props.get("datetime")),
            "cloud_cover": props.get("eo:cloud_cover", 100.0),
            "geometry": shape(item.geometry),
            "assets": band_assets,
        }
    )

# Step 4: Build the GeoDataFrame
if not rows:
    scenes = gpd.GeoDataFrame(
        columns=["scene_id", "mgrs_tile", "datetime", "cloud_cover", "geometry", "assets"],
        geometry="geometry",
        crs="EPSG:4326",
    )
else:
    scenes = gpd.GeoDataFrame(rows, geometry="geometry", crs="EPSG:4326")
    scenes = scenes.sort_values(["mgrs_tile", "datetime"]).reset_index(drop=True)

print(f"Scenes returned : {len(scenes)}")
print(f"Unique tiles    : {scenes['mgrs_tile'].nunique()}")
scenes.head()

## §5 Explore the Results

In [None]:
print("Columns:", scenes.columns.tolist())
print(f"CRS: {scenes.crs}\n")

tile_counts = scenes.groupby("mgrs_tile").size().rename("n_scenes")
print("Scenes per tile:")
print(tile_counts.to_string())

print("\nCloud cover statistics:")
print(scenes["cloud_cover"].describe().to_string())

print()

first_row = scenes.iloc[0]
print(f"Scene  : {first_row['scene_id']}")
print(f"Tile   : {first_row['mgrs_tile']}")
print(f"Assets ({len(first_row['assets'])} bands):")
for band, href in sorted(first_row["assets"].items()):
    # Show only the tail of the S3 path for readability
    print(f"  {band:4s} -> ...{href[-60:]}")

## §6 Visualise Footprints

Each scene carries a `geometry` column with its footprint polygon.
The interactive map below uses [folium](https://python-visualization.github.io/folium/)
with OpenStreetMap tiles as the basemap.  Scene footprints are grouped by MGRS tile —
use the layer control in the top-right corner to toggle individual tiles on and off.
Click any footprint to see scene metadata.

In [None]:
# Colour palette — one colour per MGRS tile (tab10 cycle)
_TAB10 = [
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
]
_tiles = sorted(scenes["mgrs_tile"].unique())
_tile_colours = {t: _TAB10[i % len(_TAB10)] for i, t in enumerate(_tiles)}

# Base map centred on the BBOX midpoint
_centre = [(BBOX[1] + BBOX[3]) / 2, (BBOX[0] + BBOX[2]) / 2]
m = folium.Map(location=_centre, zoom_start=9, control_scale=True)

# One FeatureGroup per MGRS tile (toggleable via LayerControl)
for tile in _tiles:
    fg = folium.FeatureGroup(name=f"MGRS {tile}")
    tile_scenes = scenes[scenes["mgrs_tile"] == tile]

    for _, row in tile_scenes.iterrows():
        dt_str = row["datetime"].strftime("%Y-%m-%d %H:%M")
        popup_html = (
            f"<b>Scene:</b> {row['scene_id']}<br>"
            f"<b>Tile:</b> {row['mgrs_tile']}<br>"
            f"<b>Date:</b> {dt_str}<br>"
            f"<b>Cloud:</b> {row['cloud_cover']:.1f}%"
        )
        folium.GeoJson(
            row["geometry"].__geo_interface__,
            style_function=lambda _feat, _c=_tile_colours[tile]: {
                "fillColor": _c,
                "color": _c,
                "weight": 1.5,
                "fillOpacity": 0.1,
            },
            popup=folium.Popup(popup_html, max_width=300),
            tooltip=f"{tile} \u2014 {row['cloud_cover']:.0f}% cloud",
        ).add_to(fg)

    fg.add_to(m)

# Search bounding box (red, dashed) — not in any FeatureGroup, always visible
folium.Rectangle(
    bounds=[[BBOX[1], BBOX[0]], [BBOX[3], BBOX[2]]],
    color="red",
    weight=2,
    dash_array="8",
    fill=False,
    popup="Search bounding box",
).add_to(m)

# Auto-zoom to AOI with 0.5° buffer
m.fit_bounds([[BBOX[1] - 0.5, BBOX[0] - 0.5], [BBOX[3] + 0.5, BBOX[2] + 0.5]])

# Layer control (expanded by default for discoverability)
folium.LayerControl(collapsed=False).add_to(m)

# Render inline HTML to avoid VS Code iframe trust issues
from IPython.display import display, HTML

# display() renders the map; a trailing bare `m` would draw it a second time
display(HTML(m.get_root().render()))

## §7 Select Best Scene per Tile

`select_best_scenes()` reduces the search results to **one scene per MGRS tile**
(or returns them unchanged) using one of three strategies:

| Strategy | Picks | Use-case |
|----------|-------|----------|
| `least_cloudy` | Lowest `cloud_cover` per tile | Best for clear imagery (default) |
| `most_recent` | Latest `datetime` per tile | Best for change detection |
| `all` | No filtering | Keep everything |

The implementation below mirrors `maji.search.select_best_scenes` exactly.
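
To make the two per-tile strategies concrete, here is a pandas-free sketch on made-up records. The function name `pick_per_tile` and the sample values are hypothetical; the notebook's own implementation uses `groupby(...).idxmin()` and `idxmax()` instead.

```python
from datetime import date


def pick_per_tile(records: list[dict], strategy: str = "least_cloudy") -> list[dict]:
    """Keep one record per MGRS tile according to ``strategy``."""
    if strategy == "all":
        return list(records)
    if strategy == "least_cloudy":
        key, pick = "cloud_cover", min
    elif strategy == "most_recent":
        key, pick = "datetime", max
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    # Group records by tile, then pick the best record in each group
    groups: dict[str, list[dict]] = {}
    for rec in records:
        groups.setdefault(rec["mgrs_tile"], []).append(rec)
    return [pick(groups[tile], key=lambda r: r[key]) for tile in sorted(groups)]


records = [  # Made-up sample data
    {"scene_id": "a", "mgrs_tile": "37MFT", "datetime": date(2025, 10, 3), "cloud_cover": 12.0},
    {"scene_id": "b", "mgrs_tile": "37MFT", "datetime": date(2025, 11, 2), "cloud_cover": 30.5},
    {"scene_id": "c", "mgrs_tile": "37MGT", "datetime": date(2025, 10, 8), "cloud_cover": 5.0},
]
print([r["scene_id"] for r in pick_per_tile(records, "least_cloudy")])  # ['a', 'c']
print([r["scene_id"] for r in pick_per_tile(records, "most_recent")])   # ['b', 'c']
```

The two strategies disagree on tile `37MFT`, where the clearest scene is not the latest one, which is the trade-off summarised in the table.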

In [None]:
def select_best_scenes(
    scenes: gpd.GeoDataFrame,
    strategy: str = "least_cloudy",
) -> gpd.GeoDataFrame:
    """For each MGRS tile, select the best scene.

    Parameters
    ----------
    scenes : geopandas.GeoDataFrame
        Search results with ``mgrs_tile``, ``cloud_cover``, and
        ``datetime`` columns.
    strategy : str, optional
        ``"least_cloudy"`` (default), ``"most_recent"``, or ``"all"``.

    Returns
    -------
    geopandas.GeoDataFrame
        Filtered DataFrame with one row per MGRS tile (or all rows
        if ``strategy="all"``).
    """
    if strategy == "all":
        return scenes

    if strategy == "least_cloudy":
        idx = scenes.groupby("mgrs_tile")["cloud_cover"].idxmin()
    elif strategy == "most_recent":
        idx = scenes.groupby("mgrs_tile")["datetime"].idxmax()
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    return scenes.loc[idx].reset_index(drop=True)


# Apply both strategies
best_cloud = select_best_scenes(scenes, strategy="least_cloudy")
best_recent = select_best_scenes(scenes, strategy="most_recent")

print(f"least_cloudy : {len(best_cloud)} scene(s)")
print(f"most_recent  : {len(best_recent)} scene(s)")
print()

# Side-by-side comparison
comparison = pd.merge(
    best_cloud[["mgrs_tile", "datetime", "cloud_cover"]].rename(
        columns={"datetime": "date_cloud", "cloud_cover": "cc_cloud"}
    ),
    best_recent[["mgrs_tile", "datetime", "cloud_cover"]].rename(
        columns={"datetime": "date_recent", "cloud_cover": "cc_recent"}
    ),
    on="mgrs_tile",
)

comparison

## §7b Save Search Results to Disk

We save the search results so that `download_from_scratch.ipynb` can load them
directly without re-running the STAC query.

**Selection criteria:**

1. Compute each scene's **coverage fraction** = intersection(scene, bbox).area / bbox.area
2. Sort by coverage (descending), then cloud_cover (ascending)
3. Pick the top scene

This selects the scene that covers the largest share of the AOI, with ties broken by lower cloud cover.
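
The ranking rule can be illustrated without any geometry library. The sketch below assumes axis-aligned rectangular footprints (real Sentinel-2 footprints are general polygons, which is why the notebook uses shapely), and the candidate values are made up.

```python
Rect = tuple[float, float, float, float]  # (west, south, east, north)


def coverage_fraction(footprint: Rect, aoi: Rect) -> float:
    """Share of the AOI rectangle covered by the footprint rectangle."""
    w0, s0, e0, n0 = footprint
    w1, s1, e1, n1 = aoi
    # Overlap along each axis, clamped to zero when the rectangles are disjoint
    width = max(0.0, min(e0, e1) - max(w0, w1))
    height = max(0.0, min(n0, n1) - max(s0, s1))
    return (width * height) / ((e1 - w1) * (n1 - s1))


aoi: Rect = (0.0, 0.0, 2.0, 2.0)
candidates = [  # Made-up scenes
    {"scene_id": "full_cloudy", "footprint": (-1.0, -1.0, 3.0, 3.0), "cloud_cover": 35.0},
    {"scene_id": "full_clear", "footprint": (0.0, 0.0, 2.0, 2.0), "cloud_cover": 4.0},
    {"scene_id": "half_clear", "footprint": (1.0, 0.0, 3.0, 2.0), "cloud_cover": 1.0},
]
for cand in candidates:
    cand["coverage"] = coverage_fraction(cand["footprint"], aoi)

# Coverage descending, then cloud cover ascending
ranked = sorted(candidates, key=lambda c: (-c["coverage"], c["cloud_cover"]))
print(ranked[0]["scene_id"])  # full_clear
```

The half-covering scene has the lowest cloud cover but still ranks last, because coverage is compared first and cloud cover only breaks ties.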

In [None]:
# --- Save search results to disk ---
from pathlib import Path
from shapely.geometry import box

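# Compute coverage fraction for each scene.
# Areas are in squared degrees (EPSG:4326), which is adequate for a ratio over
# a small AOI; geopandas may warn about computing area in a geographic CRS.
aoi = box(*BBOX)
aoi_area = aoi.area
scenes["coverage"] = scenes.geometry.intersection(aoi).area / aoi_area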

# Select best scene: most coverage, then least cloud
best_scene = (
    scenes
    .sort_values(["coverage", "cloud_cover"], ascending=[False, True])
    .iloc[0]
)

print(f"Selected scene: {best_scene['scene_id']}")
print(f"  Coverage : {best_scene['coverage']:.1%}")
print(f"  Cloud    : {best_scene['cloud_cover']:.1f}%")

# Save to disk
out_dir = Path("../DATA/notebook_search_results")
out_dir.mkdir(parents=True, exist_ok=True)  # Also create ../DATA if it is missing

# Save all scenes
scenes.to_parquet(out_dir / "scenes.parquet")
print(f"\nSaved {len(scenes)} scenes to {out_dir / 'scenes.parquet'}")

# Save selected scene as single-row GeoDataFrame
best_gdf = scenes.loc[[best_scene.name]]  # Keep as GDF to preserve geometry
best_gdf.to_parquet(out_dir / "best_scene.parquet")
print(f"Saved selected scene to {out_dir / 'best_scene.parquet'}")

## §8 Cloud Cover Over Time

Cloud cover varies considerably from one revisit to the next.  The bar chart below
shows every scene for the tile with the most observations, highlighting the best
(least cloudy) scene in green.

In [None]:
tile_counts = scenes.groupby("mgrs_tile").size()
focus_tile = tile_counts.idxmax()
tile_df = scenes[scenes["mgrs_tile"] == focus_tile].copy()
tile_df = tile_df.sort_values("datetime")

best_idx = tile_df["cloud_cover"].idxmin()

colours = [
    "seagreen" if i == best_idx else "steelblue"
    for i in tile_df.index
]

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(tile_df["datetime"], tile_df["cloud_cover"], width=1.5, color=colours)
ax.axhline(MAX_CLOUD, color="red", linestyle="--", linewidth=1, label=f"Max cloud = {MAX_CLOUD}%")

ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b"))
fig.autofmt_xdate()

ax.set_xlabel("Date")
ax.set_ylabel("Cloud cover (%)")
ax.set_title(f"Cloud Cover Time Series \u2014 Tile {focus_tile}")
ax.legend()
plt.tight_layout()
plt.show()

best_row = tile_df.loc[best_idx]
print(f"Best scene: {best_row['scene_id']}")
print(f"  Date       : {best_row['datetime']}")
print(f"  Cloud cover: {best_row['cloud_cover']:.1f}%")

## Wrap-up & Key Takeaways

This notebook walked through every step that `maji.search` performs:

1. **MGRS tile extraction** — `grid:code` with fallback to title parsing.
2. **Band-asset extraction** — key-prefix matching with B08/B8A disambiguation and href fallback.
3. **STAC query** — `pystac_client` handles pagination; we filter by collection, bbox, dates, and cloud cover.
4. **GeoDataFrame construction** — normalised columns, sorted by tile and date, CRS set.
5. **Scene selection** — `least_cloudy` vs `most_recent` groupby strategies.
6. **Save results** — GeoParquet files saved to `../DATA/notebook_search_results/` for use by the download notebook.

In production, use the packaged module instead of re-implementing these steps:

```python
from maji.search import search_scenes, select_best_scenes

scenes = search_scenes(bbox=BBOX, start=START, end=END)
best   = select_best_scenes(scenes, strategy="least_cloudy")
```

**Key points**

* **STAC** is an open standard for cataloguing geospatial data — no vendor lock-in.
* **MGRS tiles** follow the 100 km MGRS grid squares; each Sentinel-2 tile is roughly 110 km × 110 km, overlapping its neighbours, and each scene covers one tile.
* Cloud-cover filtering at search time avoids downloading unusable imagery.
* The `assets` dict gives you direct S3 paths for each band, ready for the download stage.
* Results are saved to the `../DATA/notebook_search_results/` directory as GeoParquet files for the download notebook.