# Downloading Sentinel-2 Bands from Scratch

This notebook reproduces what `maji.download` does, **step by step**,
using only general-purpose open-source libraries. No `maji` imports appear anywhere.

**What you'll learn**

1. How to authenticate with the CDSE S3-compatible API.
2. How to fetch JP2 band files from S3 and open them with rasterio.
3. How to resample 20 m bands to 10 m resolution.
4. How to stack multiple bands into a single Cloud-Optimized GeoTIFF.
5. Retry strategies for transient network errors.

**Libraries used**

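| Library | Role |
|---------|------|
| `boto3` | Authenticate with CDSE S3 and download JP2 files |
| `rasterio` | Open and resample JP2, write GeoTIFF |
| `numpy` | Array manipulation |
| `pystac_client` | Search STAC catalog (to get asset URLs) |
| `geopandas` / `pandas` | Tabular data |
| `matplotlib` | Visualise the downloaded bands |
| `python-dotenv` | Load credentials from the `.env` file |
| `tqdm` | Progress bars |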

**Prerequisites**

* The `maji` conda environment is active (`conda activate maji`).
* You have a CDSE account. Register for free at [dataspace.copernicus.eu](https://dataspace.copernicus.eu/).
* You have CDSE S3 credentials (access key + secret key). Generate them at
  [https://eodata-s3keysmanager.dataspace.copernicus.eu/](https://eodata-s3keysmanager.dataspace.copernicus.eu/),
  which is also linked from *User Settings → S3 Access*.
* The key and secret are saved in the `../.env` file as `CDSE_ACCESS_KEY` and `CDSE_SECRET_KEY`.
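
Before running anything, it can help to confirm that both credentials are actually set. The sketch below assumes the variable names used by the loading cell (`CDSE_ACCESS_KEY` and `CDSE_SECRET_KEY`); the helper `missing_credentials` is hypothetical and not part of `maji`.

```python
import os

# Variable names taken from the loading cell in this notebook;
# adjust them if your .env file uses different names.
REQUIRED_VARS = ("CDSE_ACCESS_KEY", "CDSE_SECRET_KEY")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"CDSE_ACCESS_KEY": "abc123", "CDSE_SECRET_KEY": ""}
example_missing = missing_credentials(example_env)
print(example_missing)
```

Calling `missing_credentials()` with no arguments checks the real process environment, so it can be run right after `load_dotenv`.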

In [None]:
import logging
import os
import time
import warnings
from pathlib import Path

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rasterio
from rasterio.enums import Resampling
from rasterio.session import AWSSession
from rasterio.transform import Affine
from dotenv import load_dotenv

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("download_from_scratch")

# --- Constants (mirrors maji/download.py) ---

# Bands we want to download
MODEL_BANDS = ["B03", "B04", "B08", "B8A", "B11", "B12"]
CLOUD_BAND = "SCL"
ALL_BANDS = MODEL_BANDS + [CLOUD_BAND]

# Native resolution of each Sentinel-2 band (metres)
BAND_RESOLUTION: dict[str, int] = {
    "B02": 10, "B03": 10, "B04": 10, "B08": 10,
    "B05": 20, "B06": 20, "B07": 20, "B8A": 20,
    "B11": 20, "B12": 20, "SCL": 20,
    "B01": 60, "B09": 60,
}

# Full MGRS tile dimensions at 10 m resolution
TARGET_HEIGHT = 10980
TARGET_WIDTH = 10980

# Retry configuration for transient S3 errors
MAX_RETRIES = 3
RETRY_BACKOFF = 2.0  # seconds, doubled each retry

# CDSE concurrency limit
_MAX_CDSE_WORKERS = 4

print("Band configuration:")
print(f"  Model bands : {MODEL_BANDS}")
print(f"  Cloud band  : {CLOUD_BAND}")
print(f"  Target size : {TARGET_WIDTH} x {TARGET_HEIGHT} pixels (10 m)")

# Load the credentials from the hidden file
path_env_file = "../.env"
success = load_dotenv(dotenv_path=path_env_file, override=True)
if success: 
    print(f"[INFO] Loaded environment from '{path_env_file}' file.")
    print(f"\tACCESS: {'CDSE_ACCESS_KEY' in os.environ}")
    print(f"\tSECRET: {'CDSE_SECRET_KEY' in os.environ}")
else:
    print(f"[ERROR] Failed to load environment from '{path_env_file}' file.")

## §1 Load Search Results

Instead of re-running the search, we load the results saved by
`search_from_scratch.ipynb`. This avoids hitting the STAC API
repeatedly and ensures both notebooks work with the same data.

Run `search_from_scratch.ipynb` first if the files don't exist.

In [None]:
RESULTS_DIR = Path("search_results")

# Load all scenes (for reference)
scenes = gpd.read_parquet(RESULTS_DIR / "scenes.parquet")
print(f"Loaded {len(scenes)} scenes from {RESULTS_DIR / 'scenes.parquet'}")

# Load selected scene
best_gdf = gpd.read_parquet(RESULTS_DIR / "best_scene.parquet")
best_scene = best_gdf.iloc[0]

print(f"\nSelected scene:")
print(f"  ID       : {best_scene['scene_id']}")
print(f"  Tile     : {best_scene['mgrs_tile']}")
print(f"  Date     : {best_scene['datetime']}")
print(f"  Coverage : {best_scene['coverage']:.1%}")
print(f"  Cloud    : {best_scene['cloud_cover']:.1f}%")

## §2 S3 Authentication

The CDSE eodata bucket uses an **S3-compatible API** but is hosted on
Copernicus infrastructure, not AWS.  We need to configure rasterio's
`AWSSession` with:

1. **Access key** and **secret key** from your CDSE account.
2. A custom **endpoint URL** pointing to CDSE: `https://eodata.dataspace.copernicus.eu`.
3. `aws_unsigned=False` so that signed requests are sent.

This mirrors `maji.download.create_s3_session()`.

**Security note:** Never hard-code credentials.  Use environment variables
or a secrets manager.
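
To make the endpoint configuration more concrete, the sketch below splits an `s3://` asset href into bucket and key and builds the two URL styles an S3 client can use. The helper names and the example key are made up for illustration; only the endpoint and the `eodata` bucket come from this notebook.

```python
CDSE_ENDPOINT = "eodata.dataspace.copernicus.eu"


def split_s3_href(href):
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not href.startswith(prefix):
        raise ValueError(f"Expected s3:// URL, got: {href}")
    bucket, _, key = href[len(prefix):].partition("/")
    return bucket, key


def path_style_url(endpoint, bucket, key):
    """Path-style addressing: the bucket is the first path segment."""
    return f"https://{endpoint}/{bucket}/{key}"


def virtual_hosted_url(endpoint, bucket, key):
    """Virtual-hosted addressing: the bucket is a subdomain of the endpoint."""
    return f"https://{bucket}.{endpoint}/{key}"


# Hypothetical key, for illustration only
bucket, key = split_s3_href("s3://eodata/Sentinel-2/example/B03_10m.jp2")
print(path_style_url(CDSE_ENDPOINT, bucket, key))
print(virtual_hosted_url(CDSE_ENDPOINT, bucket, key))
```

Setting `AWS_VIRTUAL_HOSTING=FALSE` asks GDAL for the first form, which is the style this notebook uses for CDSE.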

In [None]:
import boto3

def create_s3_session(
    access_key: str,
    secret_key: str,
    endpoint_url: str = "https://eodata.dataspace.copernicus.eu",
    region_name: str = "default",
) -> AWSSession:
    """Create a rasterio AWSSession configured for CDSE S3.

    Also sets GDAL environment variables for S3 access, which ensures
    the credentials are available to all rasterio operations.

    Parameters
    ----------
    access_key : str
        S3 access key for the CDSE ``eodata`` bucket.
    secret_key : str
        Corresponding S3 secret key.
    endpoint_url : str, optional
        S3-compatible endpoint URL.
    region_name : str, optional
        AWS region (can be any value for non-AWS endpoints).

    Returns
    -------
    rasterio.session.AWSSession
        Configured session for use with rasterio.Env().
    """
    # Set GDAL environment variables for S3 access
    # These are read by GDAL's /vsis3/ driver
    os.environ["AWS_ACCESS_KEY_ID"] = access_key
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret_key
    os.environ["AWS_S3_ENDPOINT"] = "eodata.dataspace.copernicus.eu"
    os.environ["AWS_HTTPS"] = "YES"
    os.environ["AWS_VIRTUAL_HOSTING"] = "FALSE"
    os.environ["AWS_NO_SIGN_REQUEST"] = "NO"
    
    # Create a boto3 session explicitly
    boto_session = boto3.Session(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name=region_name,
    )
    
    return AWSSession(
        session=boto_session,
        endpoint_url=endpoint_url,
        aws_unsigned=False,
    )


# Load credentials from environment variables
CDSE_ACCESS_KEY = os.environ.get("CDSE_ACCESS_KEY", "")
CDSE_SECRET_KEY = os.environ.get("CDSE_SECRET_KEY", "")

if not CDSE_ACCESS_KEY or not CDSE_SECRET_KEY:
    print("WARNING: CDSE credentials not found in environment.")
    print("Set CDSE_ACCESS_KEY and CDSE_SECRET_KEY to download data.")
    print("")
    print("For this notebook, you can set them here (don't commit!):")
    print('  CDSE_ACCESS_KEY = "your-access-key"')
    print('  CDSE_SECRET_KEY = "your-secret-key"')
else:
    print(f"Credentials loaded: access_key={CDSE_ACCESS_KEY[:8]}...")

# Create the session (this also sets GDAL env vars)
session = create_s3_session(CDSE_ACCESS_KEY, CDSE_SECRET_KEY)
print(f"Session endpoint: {session.endpoint_url}")
print(f"GDAL S3 endpoint: {os.environ.get('AWS_S3_ENDPOINT', 'NOT SET')}")

In [None]:
# Test S3 connectivity with a simple file listing
# This helps diagnose auth/endpoint issues before attempting downloads

print("Testing S3 connectivity...")
print(f"  Endpoint: {os.environ.get('AWS_S3_ENDPOINT')}")
print(f"  Access key: {os.environ.get('AWS_ACCESS_KEY_ID', '')[:8]}...")

# Try listing a known bucket path using boto3
try:
    s3_client = boto3.client(
        "s3",
        aws_access_key_id=CDSE_ACCESS_KEY,
        aws_secret_access_key=CDSE_SECRET_KEY,
        endpoint_url="https://eodata.dataspace.copernicus.eu",
    )
    # List a small portion of the Sentinel-2 prefix
    response = s3_client.list_objects_v2(
        Bucket="eodata",
        Prefix="Sentinel-2/MSI/L2A/2025/",
        MaxKeys=3,
    )
    if "Contents" in response:
        print(f"  ✓ S3 connection successful! Found {len(response['Contents'])} objects.")
        for obj in response["Contents"][:3]:
            print(f"    - {obj['Key'][:60]}...")
    else:
        print("  ⚠ Connected but no objects found at prefix.")
except Exception as e:
    print(f"  ✗ S3 connection failed: {e}")
    print("  Check your credentials and network connection.")

In [None]:
# Verify the specific file exists on S3 before trying rasterio
# This helps distinguish between "auth failed" vs "file doesn't exist"

print("Checking if the scene files exist on S3...")
print()

# Get the B03 asset URL
test_href = best_scene["assets"].get("B03")
print(f"B03 asset URL: {test_href}")
print()

if test_href:
    # Parse the S3 path
    # URL format: s3://eodata/Sentinel-2/MSI/L2A/...
    s3_path = test_href.replace("s3://eodata/", "")
    
    print(f"Checking if file exists: {s3_path[:70]}...")
    
    try:
        s3_client = boto3.client(
            "s3",
            aws_access_key_id=CDSE_ACCESS_KEY,
            aws_secret_access_key=CDSE_SECRET_KEY,
            endpoint_url="https://eodata.dataspace.copernicus.eu",
        )
        
        # Try to get the object metadata (head_object)
        response = s3_client.head_object(Bucket="eodata", Key=s3_path)
        print(f"  ✓ File EXISTS!")
        print(f"    Size: {response['ContentLength'] / 1e6:.1f} MB")
        print(f"    Last modified: {response['LastModified']}")
        print()
        print("The file exists, so any remaining read failure points to GDAL/rasterio configuration.")
        
    except s3_client.exceptions.ClientError as e:
        error_code = e.response['Error']['Code']
        if error_code == '404':
            print(f"  ✗ File NOT FOUND (404)")
            print()
            print("The file doesn't exist on CDSE. This scene may have been:")
            print("  - Archived or moved")
            print("  - The asset URL in the STAC catalog is stale")
            print()
            print("Try re-running the search notebook to get fresh scene URLs.")
        elif error_code == '403':
            print(f"  ✗ Access DENIED (403)")
            print("Your credentials don't have access to this file.")
        else:
            print(f"  ✗ Error: {error_code} - {e.response['Error']['Message']}")
    except Exception as e:
        print(f"  ✗ Error checking file: {e}")
else:
    print("No B03 asset found in scene")

In [None]:
# Try different GDAL virtual file system approaches
# /vsis3/ sometimes has issues with non-AWS endpoints
# Let's try /vsicurl/ with a signed URL instead

from botocore.config import Config

print("Testing alternative approaches to open the file with rasterio...")
print()

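test_href = best_scene["assets"].get("B03")
if test_href is None:
    # Fail early with a clear message instead of an AttributeError below
    raise KeyError("No B03 asset found in scene")
s3_path = test_href.replace("s3://eodata/", "")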

# Approach 1: Generate a presigned URL and use /vsicurl/
print("Approach 1: Presigned URL with /vsicurl/")
try:
    s3_client = boto3.client(
        "s3",
        aws_access_key_id=CDSE_ACCESS_KEY,
        aws_secret_access_key=CDSE_SECRET_KEY,
        endpoint_url="https://eodata.dataspace.copernicus.eu",
        config=Config(signature_version='s3v4'),
    )
    
    presigned_url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'eodata', 'Key': s3_path},
        ExpiresIn=3600,  # 1 hour
    )
    print(f"  Generated presigned URL: {presigned_url[:80]}...")
    
    # Try opening with /vsicurl/
    vsicurl_path = f"/vsicurl/{presigned_url}"
    with rasterio.Env(GDAL_HTTP_UNSAFESSL="YES", CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".jp2"):
        with rasterio.open(vsicurl_path) as src:
            print(f"  ✓ SUCCESS with presigned URL!")
            print(f"    Image size: {src.width} x {src.height}")
            print(f"    CRS: {src.crs}")
            PRESIGNED_URL_WORKS = True
except Exception as e:
    print(f"  ✗ Failed: {e}")
    PRESIGNED_URL_WORKS = False

print()

# Approach 2: /vsis3/ with the GDAL options passed explicitly to rasterio.Env
print("Approach 2: /vsis3/ with GDAL config options")
try:
    vsis3_path = test_href.replace("s3://", "/vsis3/")
    with rasterio.Env(
        AWS_ACCESS_KEY_ID=CDSE_ACCESS_KEY,
        AWS_SECRET_ACCESS_KEY=CDSE_SECRET_KEY,
        AWS_S3_ENDPOINT="eodata.dataspace.copernicus.eu",
        AWS_HTTPS="YES",
        AWS_VIRTUAL_HOSTING="FALSE",
        CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".jp2",
    ):
        with rasterio.open(vsis3_path) as src:
            print(f"  ✓ SUCCESS with /vsis3/!")
            print(f"    Image size: {src.width} x {src.height}")
            VSIS3_WORKS = True
except Exception as e:
    print(f"  ✗ Failed: {str(e)[:100]}")
    VSIS3_WORKS = False

print()
if PRESIGNED_URL_WORKS:
    print("✓ PRESIGNED URLs work! We'll use this approach for downloads.")
elif VSIS3_WORKS:
    print("✓ /vsis3/ works with explicit config!")
else:
    print("✗ Neither approach worked. May need to download via boto3 first.")

## §3 Band Configuration

Sentinel-2 bands have different native resolutions:

| Resolution | Bands |
|------------|-------|
| 10 m | B02, B03, B04, B08 |
| 20 m | B05, B06, B07, B8A, B11, B12, SCL |
| 60 m | B01, B09 |

For machine learning, we want all bands at the same resolution.
We **resample 20 m bands to 10 m** using:

* **Bilinear interpolation** for reflectance bands (smooth interpolation).
* **Nearest-neighbour** for SCL (Scene Classification Layer) to preserve discrete class values.

A full MGRS tile at 10 m is **10,980 x 10,980 pixels** (~110 km x 110 km).
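
The toy example below illustrates why the two resampling methods are used. It is written in plain Python rather than rasterio, and `midpoint` is a deliberate simplification of bilinear interpolation, so treat it as an illustration only. All helper names are hypothetical.

```python
TILE_SIZE_10M = 10980  # pixels per side at 10 m, as stated above


def native_size(resolution_m):
    """Pixels per side of a full tile at the given native resolution."""
    return TILE_SIZE_10M * 10 // resolution_m


def upsample_nearest(grid, factor):
    """Repeat every pixel `factor` times along both axes."""
    return [
        [row[col // factor] for col in range(len(row) * factor)]
        for row in grid
        for _ in range(factor)
    ]


def midpoint(a, b):
    """Value halfway between two neighbours (simplified linear blend)."""
    return (a + b) / 2


scl = [[4, 9], [6, 4]]  # a 2 x 2 patch of class codes
upsampled = upsample_nearest(scl, 2)
blended = midpoint(4, 9)

print(native_size(20))  # side length of a 20 m band
print(upsampled)        # only the original class codes appear
print(blended)          # 6.5 is not a class code
```

Nearest-neighbour only ever copies existing codes, while any blend of two codes can produce a value that belongs to neither class. For reflectance bands that blending is harmless and gives smoother results.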

In [None]:
# Show which bands we'll download and their native resolutions
print("Bands to download:")
print(f"{'Band':6s} {'Resolution':12s} {'Resampling':12s}")
print("-" * 32)
for band in ALL_BANDS:
    res = BAND_RESOLUTION.get(band, "?")
    resample = "none" if res == 10 else ("nearest" if band == "SCL" else "bilinear")
    print(f"{band:6s} {str(res) + ' m':12s} {resample:12s}")

print(f"\nTarget output: {TARGET_WIDTH} x {TARGET_HEIGHT} pixels (10 m resolution)")
print(f"Memory per band: {TARGET_WIDTH * TARGET_HEIGHT * 2 / 1e6:.1f} MB (uint16)")

## §4 Reading Bands with Retry

S3 reads can fail due to transient network issues.  The function below
implements **exponential backoff retry**:

1. Try to download and read the band.
2. On any exception (download or read), wait `RETRY_BACKOFF * 2^attempt` seconds.
3. Make up to `MAX_RETRIES` attempts in total.
4. If all attempts fail, raise a `RuntimeError` chained to the last error.

The function also handles **resampling** in a single read operation—rasterio
can resample on the fly using the `out_shape` parameter.

This mirrors `maji.download._read_band_with_retry()`.
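
The backoff schedule can be sketched independently of S3. The helpers below are hypothetical and use an injectable `sleep` so the example runs instantly; this sketch skips the wait after the final attempt, since no retry follows it.

```python
import time


def backoff_delays(base, max_retries):
    """Delay used after each failed attempt: base * 2**attempt."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(func, max_retries=3, base=2.0, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries - 1:
                sleep(base * (2 ** attempt))
    raise RuntimeError(f"Failed after {max_retries} attempts") from last_error


# Demo: a function that fails twice, then succeeds
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient error")
    return "ok"


result = call_with_retry(flaky, max_retries=3, base=2.0, sleep=waits.append)
print(result, waits, backoff_delays(2.0, 3))
```

Passing `waits.append` as `sleep` records the delays instead of waiting, which also makes the schedule easy to check in a unit test.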

In [None]:
import tempfile
from tqdm.auto import tqdm

def read_band_with_retry(
    href: str,
    target_shape: tuple[int, int],
    resampling: Resampling,
    max_retries: int = MAX_RETRIES,
    s3_client: "boto3.client" = None,
    progress_bar: tqdm = None,
) -> tuple[np.ndarray, dict]:
    """Read a single band from S3 with retry on transient errors.

    Downloads the file via boto3 first (which handles CDSE auth correctly),
    then opens it locally with rasterio.

    Parameters
    ----------
    href : str
        S3 path to the band file (e.g. ``s3://eodata/.../B03_10m.jp2``).
    target_shape : tuple of int
        ``(height, width)`` to resample to.
    resampling : rasterio.enums.Resampling
        Resampling method.
    max_retries : int, optional
        Number of retry attempts.
    s3_client : boto3.client, optional
        Pre-configured S3 client. If None, creates one using env vars.
    progress_bar : tqdm, optional
        Progress bar to update during download. If provided, the bar's
        total will be set to the file size and updated as bytes transfer.

    Returns
    -------
    data : numpy.ndarray
        2-D uint16 array of shape ``target_shape``.
    meta : dict
        Contains ``crs``, ``transform``, and ``native_shape``.
    """
    # Parse S3 path
    # href format: s3://eodata/Sentinel-2/MSI/L2A/...
    if href.startswith("s3://"):
        parts = href[5:].split("/", 1)
        bucket = parts[0]
        key = parts[1]
    else:
        raise ValueError(f"Expected s3:// URL, got: {href}")

    # Create S3 client if not provided
    if s3_client is None:
        s3_client = boto3.client(
            "s3",
            aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID", ""),
            aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
            endpoint_url="https://eodata.dataspace.copernicus.eu",
        )

    last_error: Exception | None = None
    
    for attempt in range(max_retries):
        try:
            # Download to a temp file, then open with rasterio
            with tempfile.NamedTemporaryFile(suffix=".jp2", delete=True) as tmp:
                # Get file size for progress bar
                if progress_bar is not None:
                    head = s3_client.head_object(Bucket=bucket, Key=key)
                    file_size = head['ContentLength']
                    progress_bar.total = file_size
                    progress_bar.refresh()
                    
                    # Track bytes for progress callback
                    bytes_transferred = [0]
                    
                    def progress_callback(bytes_amount):
                        bytes_transferred[0] += bytes_amount
                        progress_bar.update(bytes_amount)
                    
                    logger.info("Downloading %s (%.1f MB)...", key.split("/")[-1], file_size / 1e6)
                    s3_client.download_file(bucket, key, tmp.name, Callback=progress_callback)
                else:
                    logger.info("Downloading %s to temp file...", key.split("/")[-1])
                    s3_client.download_file(bucket, key, tmp.name)
                
                with rasterio.open(tmp.name) as src:
                    native_shape = (src.height, src.width)
                    meta = {
                        "crs": src.crs,
                        "transform": src.transform,
                        "native_shape": native_shape,
                    }

                    if native_shape == target_shape:
                        # No resampling needed
                        data = src.read(1)
                    else:
                        # Resample on the fly
                        data = src.read(
                            1,
                            out_shape=target_shape,
                            resampling=resampling,
                        )

                    return data, meta

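        except Exception as e:
            last_error = e
            if attempt == max_retries - 1:
                # Final attempt failed: skip the pointless wait and raise below
                logger.warning(
                    "S3 read failed for %s (attempt %d/%d), giving up: %s",
                    href, attempt + 1, max_retries, e,
                )
                break
            wait = RETRY_BACKOFF * (2 ** attempt)
            logger.warning(
                "S3 read failed for %s (attempt %d/%d), retrying in %.1fs: %s",
                href, attempt + 1, max_retries, wait, e,
            )
            # Reset progress bar on retry
            if progress_bar is not None:
                progress_bar.reset()
            time.sleep(wait)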

    raise RuntimeError(
        f"Failed to read {href} after {max_retries} attempts"
    ) from last_error

## §5 Download a Single Tile

The `download_tile()` function orchestrates the full download:

1. **Set up output path** — `data_dir / mgrs_tile / YYYY-MM-DD_S2L2A.tif`.
2. **Skip if exists** — unless `overwrite=True`.
3. **Get CRS/transform** — read a 10 m band to get the native coordinate system.
4. **Read and write each band** — one at a time to keep memory low.
5. **Write Cloud-Optimized GeoTIFF** — deflate compression, 512×512 internal tiles.

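The version below downloads each band with boto3 and opens the local copy,
so it needs no special rasterio environment. When reading through `/vsis3/`
instead, the relevant GDAL options are:

* `AWS_VIRTUAL_HOSTING=False`: use path-style URLs (required for CDSE).
* `GDAL_DISABLE_READDIR_ON_OPEN="TRUE"`: avoid listing the directory (faster).
* `CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".jp2"`: restrict to JP2 files.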

This mirrors `maji.download.download_tile()`.
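
Steps 1 to 3 can be sketched without touching S3. The helper names, the tile code `32TQM`, the date, and the transform values below are illustrative assumptions. The transform is represented as a plain `(a, b, c, d, e, f)` tuple rather than a rasterio `Affine`, and the scaling shows the fallback used when only a 20 m band is available.

```python
import tempfile
from pathlib import Path


def output_path(data_dir, mgrs_tile, scene_date):
    """Build data_dir / mgrs_tile / YYYY-MM-DD_S2L2A.tif."""
    return Path(data_dir) / mgrs_tile / f"{scene_date}_S2L2A.tif"


def should_download(path, overwrite=False):
    """Download unless the file already exists and overwrite is off."""
    return overwrite or not Path(path).exists()


def scale_transform(coeffs, native_height, target_height):
    """Rescale the pixel size (a, e) of an affine transform; keep the origin."""
    a, b, c, d, e, f = coeffs
    scale = native_height / target_height
    return (a * scale, b, c, d, e * scale, f)


with tempfile.TemporaryDirectory() as tmp:
    target = output_path(tmp, "32TQM", "2025-06-01")
    first = should_download(target)                   # nothing on disk yet
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()
    second = should_download(target)                  # file now exists
    forced = should_download(target, overwrite=True)  # overwrite wins

# A 20 m transform (5490 rows) rescaled to the 10 m grid (10980 rows)
upscaled = scale_transform((20.0, 0.0, 300000.0, 0.0, -20.0, 5000040.0), 5490, 10980)
print(first, second, forced, upscaled)
```

Only the pixel size changes when rescaling; the upper-left corner stays where it is, so the 10 m and 20 m grids cover the same footprint.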

In [None]:
def download_tile(
    scene_assets: dict[str, str],
    mgrs_tile: str,
    scene_date: str,
    data_dir: Path,
    session: AWSSession = None,  # Kept for API compatibility, not used
    bands: list[str] | None = None,
    overwrite: bool = False,
) -> Path:
    """Download all bands for one scene and write a multi-band GeoTIFF.

    Downloads files via boto3 (which works with CDSE), then processes
    with rasterio locally. Shows progress bars for each band download.

    Parameters
    ----------
    scene_assets : dict[str, str]
        ``{band_name: s3_href}`` mapping from search results.
    mgrs_tile : str
        Five-character MGRS tile code.
    scene_date : str
        ISO date string (used in output filename).
    data_dir : Path
        Root data directory.
    session : rasterio.session.AWSSession, optional
        Kept for API compatibility. Not used (boto3 handles auth).
    bands : list[str] or None, optional
        Bands to download (default: ALL_BANDS).
    overwrite : bool, optional
        Re-download existing files.

    Returns
    -------
    pathlib.Path
        Path to the saved GeoTIFF.
    """
    if bands is None:
        bands = list(ALL_BANDS)

    # Output path
    tile_dir = data_dir / mgrs_tile
    tile_dir.mkdir(parents=True, exist_ok=True)
    out_path = tile_dir / f"{scene_date}_S2L2A.tif"

    if out_path.exists() and not overwrite:
        logger.info("Skipping %s (already exists)", out_path)
        return out_path

    # Create S3 client for downloads
    s3_client = boto3.client(
        "s3",
        aws_access_key_id=CDSE_ACCESS_KEY,
        aws_secret_access_key=CDSE_SECRET_KEY,
        endpoint_url="https://eodata.dataspace.copernicus.eu",
    )

    target_shape = (TARGET_HEIGHT, TARGET_WIDTH)
    reference_crs = None
    reference_transform = None

    # Read first 10m band to get CRS/transform (no progress bar for this initial read)
    for band_name in bands:
        if BAND_RESOLUTION.get(band_name) == 10:
            href = scene_assets.get(band_name)
            if href is None:
                continue
            _, meta = read_band_with_retry(
                href, target_shape, Resampling.bilinear, s3_client=s3_client,
            )
            if meta["native_shape"] == target_shape:
                reference_crs = meta["crs"]
                reference_transform = meta["transform"]
                break

    # Fallback: derive 10m transform from a 20m band
    if reference_crs is None:
        first_href = scene_assets[bands[0]]
        _, meta = read_band_with_retry(
            first_href, target_shape, Resampling.bilinear, s3_client=s3_client,
        )
        reference_crs = meta["crs"]
        t = meta["transform"]
        native_h, native_w = meta["native_shape"]
        scale = native_h / TARGET_HEIGHT
        reference_transform = Affine(
            t.a * scale, t.b, t.c,
            t.d, t.e * scale, t.f,
        )

    # Cloud-Optimized GeoTIFF profile
    profile = {
        "driver": "GTiff",
        "dtype": "uint16",
        "width": TARGET_WIDTH,
        "height": TARGET_HEIGHT,
        "count": len(bands),
        "crs": reference_crs,
        "transform": reference_transform,
        "compress": "deflate",
        "tiled": True,
        "blockxsize": 512,
        "blockysize": 512,
    }

    total_bytes = 0
    
    with rasterio.open(out_path, "w", **profile) as dst:
        # Overall band progress bar
        band_pbar = tqdm(
            enumerate(bands, start=1),
            total=len(bands),
            desc="Downloading bands",
            unit="band",
        )
        
        for band_idx, band_name in band_pbar:
            href = scene_assets.get(band_name)
            if href is None:
                raise KeyError(
                    f"No S3 href found for band {band_name} in scene assets"
                )

            resampling = (
                Resampling.nearest if band_name == "SCL"
                else Resampling.bilinear
            )

            native_res = BAND_RESOLUTION.get(band_name, 0)
            band_pbar.set_postfix(band=band_name, res=f"{native_res}m")

            # Per-file progress bar for download
            with tqdm(
                unit='B',
                unit_scale=True,
                unit_divisor=1024,
                desc=f"  {band_name}",
                leave=False,
            ) as file_pbar:
                data, _ = read_band_with_retry(
                    href, target_shape, resampling, s3_client=s3_client,
                    progress_bar=file_pbar,
                )
                total_bytes += file_pbar.n
            
            dst.write(data, band_idx)
            dst.set_band_description(band_idx, band_name)

    size_mb = out_path.stat().st_size / 1e6
    logger.info(
        "Saved %s (%d bands, %.1f MB downloaded, %.1f MB on disk)",
        out_path, len(bands), total_bytes / 1e6, size_mb,
    )
    return out_path

## §6 Download Multiple Tiles

The `download_tiles()` function iterates over a DataFrame of scenes
and calls `download_tile()` for each one.

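**CDSE concurrency limit:** CDSE allows a maximum of **4 concurrent S3 connections**
per credential set.  The function clamps `max_workers` to this limit and
issues a warning if you try to exceed it.  In this notebook the downloads
still run sequentially; `max_workers` is reserved for future parallel use.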

Failures are logged but do not halt the loop—this lets you download
as many tiles as possible even if some fail.

This mirrors `maji.download.download_tiles()`.
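
For readers curious how a parallel variant might look, here is a minimal sketch using `concurrent.futures`. It is an assumption about one possible design, not what `maji` does: `clamp_workers`, `run_all`, and `fake_download` are hypothetical names, and the worker is a stand-in that never touches the network.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor, as_completed

CDSE_MAX_WORKERS = 4  # concurrency limit stated above


def clamp_workers(requested, limit=CDSE_MAX_WORKERS):
    """Clamp the requested worker count to the range [1, limit]."""
    if requested > limit:
        warnings.warn(
            f"max_workers={requested} exceeds limit of {limit}; clamping",
            stacklevel=2,
        )
        return limit
    return max(1, requested)


def run_all(jobs, worker, max_workers=1):
    """Run worker(job) for every job; return (results, failures)."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=clamp_workers(max_workers)) as pool:
        futures = {pool.submit(worker, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Record the failure and keep going with the other jobs
                failures.append((job, exc))
    return results, failures


def fake_download(name):
    """Stand-in for a real download; fails for one specific job."""
    if name == "bad":
        raise OSError("simulated failure")
    return f"{name}.tif"


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    clamped = clamp_workers(8)

results, failures = run_all(["a", "bad", "c"], fake_download, max_workers=2)
print(clamped, sorted(results), [job for job, _ in failures])
```

As in the sequential loop, a failed job is recorded rather than raised, so the remaining jobs still complete. Results arrive in completion order, which is why the example sorts them before printing.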

In [None]:
def download_tiles(
    scenes: pd.DataFrame,
    data_dir: Path,
    session: AWSSession,
    bands: list[str] | None = None,
    max_workers: int = 1,
    overwrite: bool = False,
) -> list[Path]:
    """Download multiple scenes sequentially.

    Parameters
    ----------
    scenes : pandas.DataFrame
        DataFrame with ``mgrs_tile``, ``datetime``, and ``assets`` columns.
    data_dir : Path
        Root data directory.
    session : rasterio.session.AWSSession
        Authenticated session.
    bands : list[str] or None, optional
        Bands to download.
    max_workers : int, optional
        Reserved for future parallel downloads.
    overwrite : bool, optional
        Re-download existing files.

    Returns
    -------
    list[pathlib.Path]
        Paths to successfully saved GeoTIFFs.
    """
    if max_workers > _MAX_CDSE_WORKERS:
        warnings.warn(
            f"max_workers={max_workers} exceeds CDSE limit of "
            f"{_MAX_CDSE_WORKERS} concurrent connections; clamping to "
            f"{_MAX_CDSE_WORKERS}",
            stacklevel=2,
        )
        max_workers = _MAX_CDSE_WORKERS

    paths: list[Path] = []
    for _, row in scenes.iterrows():
        scene_date = row["datetime"].strftime("%Y-%m-%d")
        try:
            path = download_tile(
                scene_assets=row["assets"],
                mgrs_tile=row["mgrs_tile"],
                scene_date=scene_date,
                data_dir=Path(data_dir),
                session=session,
                bands=bands,
                overwrite=overwrite,
            )
            paths.append(path)
        except Exception:
            logger.error(
                "Failed to download %s/%s",
                row["mgrs_tile"], scene_date,
                exc_info=True,
            )

    return paths

## §7 Run the Download

Now let's download the scene we selected earlier.

**Note:** This cell will fail if you haven't set your CDSE credentials.
Set the `CDSE_ACCESS_KEY` and `CDSE_SECRET_KEY` environment variables,
or uncomment the lines below to set them directly (don't commit!).

In [None]:
# Uncomment and fill in your credentials if not using environment variables:
# CDSE_ACCESS_KEY = "your-access-key"
# CDSE_SECRET_KEY = "your-secret-key"
# session = create_s3_session(CDSE_ACCESS_KEY, CDSE_SECRET_KEY)

# Output directory (relative to notebook)
DATA_DIR = Path("../data")

# Download the best scene
scene_date = best_scene["datetime"].strftime("%Y-%m-%d")

print(f"Downloading scene: {best_scene['scene_id']}")
print(f"  Tile: {best_scene['mgrs_tile']}")
print(f"  Date: {scene_date}")
print(f"  Bands: {ALL_BANDS}")
print(f"  Output: {DATA_DIR / best_scene['mgrs_tile']}")
print()

out_path = download_tile(
    scene_assets=best_scene["assets"],
    mgrs_tile=best_scene["mgrs_tile"],
    scene_date=scene_date,
    data_dir=DATA_DIR,
    session=session,
    bands=ALL_BANDS,
    overwrite=False,
)

print(f"\nDownload complete: {out_path}")

## §8 Inspect the Output

Let's verify the downloaded GeoTIFF has the expected structure:
7 bands, 10,980 x 10,980 pixels, correct CRS, and band descriptions.
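
The expectations above can also be written as an explicit check. The helper `structure_problems` is hypothetical and works on plain values so it can be run without opening a file; the expected band order is the `ALL_BANDS` list defined at the top of this notebook.

```python
EXPECTED_BANDS = ["B03", "B04", "B08", "B8A", "B11", "B12", "SCL"]
EXPECTED_SIZE = 10980


def structure_problems(width, height, descriptions,
                       expected_bands=EXPECTED_BANDS,
                       expected_size=EXPECTED_SIZE):
    """Return a list of human-readable mismatches (empty if all is well)."""
    problems = []
    if (width, height) != (expected_size, expected_size):
        problems.append(
            f"size is {width} x {height}, "
            f"expected {expected_size} x {expected_size}"
        )
    if list(descriptions) != list(expected_bands):
        problems.append(
            f"band descriptions are {list(descriptions)}, "
            f"expected {list(expected_bands)}"
        )
    return problems


good = structure_problems(10980, 10980, tuple(EXPECTED_BANDS))
bad = structure_problems(5490, 5490, ("B03", None))
print(good)
print(bad)
```

With an open rasterio dataset `src`, the call would be `structure_problems(src.width, src.height, src.descriptions)`.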

In [None]:
with rasterio.open(out_path) as src:
    print(f"File: {out_path.name}")
    print(f"Size: {out_path.stat().st_size / 1e6:.1f} MB")
    print(f"")
    print(f"Dimensions: {src.width} x {src.height} pixels")
    print(f"Bands: {src.count}")
    print(f"CRS: {src.crs}")
    print(f"Transform: {src.transform}")
    print(f"Dtype: {src.dtypes[0]}")
    print(f"")
    print("Band descriptions:")
    for i in range(1, src.count + 1):
        desc = src.descriptions[i - 1] or f"Band {i}"
        print(f"  {i}: {desc}")

## §9 Visualise the Bands

Let's show all bands as a grid and then build a quick false-color composite.
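
Both plots rely on a percentile stretch to improve contrast. The plain-Python sketch below shows the idea on a handful of values. It uses a nearest-rank percentile, whereas `np.percentile` interpolates linearly by default, so the exact cut-offs can differ slightly from the cells below. The helper names are hypothetical.

```python
def percentile_nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (integer pct)."""
    n = len(sorted_values)
    rank = max(1, -(-pct * n // 100))  # ceil(pct * n / 100) in integer math
    return sorted_values[rank - 1]


def stretch(values, pmin=2, pmax=98):
    """Clip to the [pmin, pmax] percentiles of valid pixels and scale to 0-1."""
    valid = sorted(v for v in values if v > 0)  # 0 is treated as no-data
    if not valid:
        return [0.0 for _ in values]
    lo = percentile_nearest_rank(valid, pmin)
    hi = percentile_nearest_rank(valid, pmax)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: avoid dividing by zero
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


# One no-data pixel (0) and one very bright outlier (5000)
pixels = [0, 100, 200, 300, 400, 5000]
stretched = stretch(pixels, pmin=20, pmax=80)
print(stretched)
```

The outlier is clipped to 1.0 instead of compressing every other pixel towards zero, which is why the stretched image shows more detail.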

In [None]:
with rasterio.open(out_path) as src:
    # Read all bands (subsample for display)
    step = 10  # Read every 10th pixel
    bands_data = src.read(
        out_shape=(src.count, src.height // step, src.width // step),
        resampling=Resampling.nearest,
    )
    band_names = [src.descriptions[i] or f"Band {i+1}" for i in range(src.count)]

# Create figure with subplots
n_bands = len(band_names)
n_cols = 4
n_rows = (n_bands + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, n_rows * 3.5))
axes = axes.flatten()

for i, (data, name) in enumerate(zip(bands_data, band_names)):
    ax = axes[i]
    # Clip to 2nd-98th percentile for better contrast
    vmin, vmax = np.percentile(data[data > 0], [2, 98])
    ax.imshow(data, cmap="viridis", vmin=vmin, vmax=vmax)
    ax.set_title(name)
    ax.axis("off")

# Hide empty subplots
for i in range(n_bands, len(axes)):
    axes[i].axis("off")

plt.suptitle(f"Downloaded Bands \u2014 Tile {best_scene['mgrs_tile']} ({scene_date})", y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# A true-color composite needs B04 (red), B03 (green) and B02 (blue),
# but B02 is not among the downloaded bands. Build a false-color composite instead:
# NIR-Red-Green (B08, B04, B03) - vegetation appears red

with rasterio.open(out_path) as src:
    # Find band indices
    band_idx = {src.descriptions[i]: i + 1 for i in range(src.count)}
    
    # Read at reduced resolution for display
    step = 10
    out_shape = (src.height // step, src.width // step)
    
    b08 = src.read(band_idx["B08"], out_shape=out_shape, resampling=Resampling.nearest)
    b04 = src.read(band_idx["B04"], out_shape=out_shape, resampling=Resampling.nearest)
    b03 = src.read(band_idx["B03"], out_shape=out_shape, resampling=Resampling.nearest)

def normalize_band(band, pmin=2, pmax=98):
    """Normalize band to 0-1 using percentile clipping."""
    valid = band[band > 0]
    if len(valid) == 0:
        return np.zeros_like(band, dtype=np.float32)
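    vmin, vmax = np.percentile(valid, [pmin, pmax])
    if vmax == vmin:
        # Constant band: avoid dividing by zero
        return np.zeros_like(band, dtype=np.float32)
    clipped = np.clip(band, vmin, vmax)
    return ((clipped - vmin) / (vmax - vmin)).astype(np.float32)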

# Stack as RGB
rgb = np.dstack([
    normalize_band(b08),  # NIR -> Red channel
    normalize_band(b04),  # Red -> Green channel
    normalize_band(b03),  # Green -> Blue channel
])

fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(rgb)
ax.set_title(f"False-Color Composite (NIR-Red-Green)\nTile {best_scene['mgrs_tile']} ({scene_date})")
ax.axis("off")
plt.tight_layout()
plt.show()

print("Vegetation appears in shades of red/orange.")
print("Bare soil/rock appears in brown/beige.")
print("Water appears dark blue/black.")

## Wrap-up & Key Takeaways

This notebook walked through every step that `maji.download` performs:

1. **S3 session creation** — configure rasterio for CDSE's S3-compatible endpoint.
2. **Band configuration** — understand native resolutions and resampling strategies.
3. **Retry logic** — exponential backoff for transient network errors.
4. **Tile download** — read bands one at a time, resample 20m→10m, write COG.
5. **Batch download** — iterate over scenes, handle failures gracefully.

In production, use the packaged module instead of re-implementing these steps:

```python
from maji.search import search_scenes, select_best_scenes
from maji.download import create_s3_session, download_tiles

scenes = search_scenes(bbox=BBOX, start=START, end=END)
best = select_best_scenes(scenes, strategy="least_cloudy")

session = create_s3_session(access_key, secret_key)
paths = download_tiles(best, data_dir=Path("data"), session=session)
```

**Key points**

* CDSE uses an **S3-compatible API** with path-style URLs (not AWS virtual hosting).
* Resampling happens on the fly via `src.read(out_shape=..., resampling=...)` on an open rasterio dataset.
* **SCL uses nearest-neighbour** resampling to preserve classification values.
* Output is a **Cloud-Optimized GeoTIFF** with internal tiling and compression.
* CDSE limits you to **4 concurrent connections** per credential set.