![Course header](../assets/img/header.png)

# A1: Cloud-Native Geospatial Formats
Optional reference: understand the file formats behind modern EO data

This appendix explains **why** the data you loaded in Notebooks 04 and 05 works so efficiently in the cloud.

## Learning Objectives

By the end of this notebook you will be able to:

- Explain what Analysis Ready Data (ARD) means
- Describe the key cloud-native raster and vector formats (COG, Zarr, GeoParquet)
- Read a Cloud-Optimised GeoTIFF (COG) with partial reads and overviews
- Open a Zarr store as an xarray Dataset
- Compare formats and choose the right one for a given task

Tooling:
- rasterio / rioxarray
- xarray (Zarr backend)
- geopandas (optional, for vector examples)

⏱️ Estimated time: **30–45 minutes**

---

## Table of contents

1. Analysis Ready Data (ARD)
2. Cloud-native format overview (and STAC discovery metadata)
3. Cloud-Optimised GeoTIFF (COG)
4. Zarr
5. GeoParquet & vector formats
6. Format comparison
7. Exercises
8. Recap & further reading

---

## 1) Analysis Ready Data (ARD)

> *“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data.”* (Hadley Wickham)

**Analysis Ready Data** means satellite imagery that has already been:

| Step | What it does |
|------|-------------|
| Radiometric calibration | Raw digital numbers (DN) → physical units (reflectance) |
| Atmospheric correction | Remove haze / aerosols |
| Geometric correction | Orthorectify to a known CRS |
| Co-registration | Align multi-date / multi-sensor images |
| Cloud / shadow masking | Provide quality flags for unusable pixels |

Sentinel-2 **Level-2A** (the collection we use in NB 04 / 05) is ARD: surface reflectance delivered with a Scene Classification Layer (SCL).
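
As a rough illustration of how the SCL is typically used, the sketch below masks cloud and shadow pixels in a small synthetic array. The arrays are invented for the example. The class codes follow the Sentinel-2 L2A scene classification (3 = cloud shadow, 8 and 9 = cloud medium/high probability, 10 = thin cirrus), but which classes count as "unusable" is an assumption to adapt to your own analysis.

```python
import numpy as np

# Synthetic 3x3 example (not real data): red-band reflectance and SCL codes.
red = np.array([[0.10, 0.12, 0.45],
                [0.11, 0.50, 0.48],
                [0.09, 0.10, 0.03]])
scl = np.array([[4, 4, 9],
                [5, 8, 9],
                [4, 4, 3]])

# Assumed "unusable" classes: 3 = cloud shadow, 8/9 = cloud (medium/high
# probability), 10 = thin cirrus. Adjust this set for your own analysis.
BAD_CLASSES = [3, 8, 9, 10]

bad = np.isin(scl, BAD_CLASSES)
red_clean = np.where(bad, np.nan, red)  # masked pixels become NaN

n_valid = int((~bad).sum())
mean_valid = float(np.nanmean(red_clean))
print(f"{n_valid} of {scl.size} pixels kept, mean reflectance {mean_valid:.3f}")
```

With real data, `red` and `scl` would be two bands of the same Level-2A scene resampled to a common grid.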

See [CEOS ARD for Land (CARD4L)](https://ceos.org/ard/) for the formal spec.

---

## 2) Cloud-native format overview

Traditional workflows download whole files, then process locally.
**Cloud-optimised formats** flip this:

- **Partial reads**: fetch only the bytes you need (HTTP range requests)
- **Internal tiling / chunking**: data is already split for parallel access
- **Overviews / pyramids**: coarse previews without reading full resolution

| Format | Data type | Key feature |
|--------|-----------|-------------|
| **COG** (Cloud-Optimised GeoTIFF) | Raster | Internal tiles + overviews |
| **Zarr** | N-D arrays | Chunked + compressed, cloud-native I/O |
| **GeoParquet** | Vector / tabular | Columnar, fast filtering |
| **FlatGeobuf** | Vector | Spatial index, streaming |
| **COPC** | Point cloud | Octree LOD |

📚 [Cloud-Native Geospatial Guide](https://guide.cloudnativegeo.org/)

### STAC (discovery metadata, not a file format)

Cloud-native workflows usually separate:
- **Data files** (COG, Zarr, GeoParquet, FlatGeobuf, …)
- **Discovery metadata** describing what exists and how to access it

**STAC** (SpatioTemporal Asset Catalog) is the most common way EO data is published and searched online. It standardizes *metadata* and links to the actual data files.

Key building blocks:
- **Catalog**: a container that organizes Collections
- **Collection**: a dataset definition (e.g., Sentinel-2 L2A)
- **Item**: one spatiotemporal “scene” (one acquisition)
- **Asset**: a link to a file for that item (often a COG, sometimes Zarr/NetCDF, etc.)
- **STAC API**: a web API that supports searching by space/time/filters

In Notebooks 04 and 05 you search a STAC API, then open the **assets** as cloud-native formats.
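
To make the Item/Asset relationship concrete, here is a minimal sketch using a hand-written, hypothetical Item as a plain Python dictionary. The ID, URLs, and property values are made up, and the Item is trimmed to the fields used here (a valid Item also carries `geometry` and `links`). The helper `cog_assets` is illustrative and not part of any STAC library; it relies on the media type commonly used to advertise COG assets.

```python
# A hand-written, hypothetical STAC Item (trimmed to the fields used here).
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20240101",
    "collection": "example-collection",
    "bbox": [10.0, 45.0, 11.0, 46.0],
    "properties": {"datetime": "2024-01-01T10:30:00Z", "eo:cloud_cover": 12.5},
    "assets": {
        "red": {
            "href": "https://example.com/scenes/example-scene/B04.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        },
        "thumbnail": {
            "href": "https://example.com/scenes/example-scene/thumb.jpg",
            "type": "image/jpeg",
        },
    },
}


def cog_assets(stac_item):
    """Return {asset_key: href} for assets whose media type marks them as COGs."""
    return {
        key: asset["href"]
        for key, asset in stac_item["assets"].items()
        if "profile=cloud-optimized" in asset.get("type", "")
    }


hrefs = cog_assets(item)
print(item["id"], item["properties"]["datetime"], hrefs)
```

The `href` values are what you would hand to a reader such as `rasterio.open()`; the metadata around them is what makes the scene searchable.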

---

## 3) Cloud-Optimised GeoTIFF (COG)

A GeoTIFF is a georeferenced raster format (it stores the CRS and pixel-to-map transform).
A **COG** is a GeoTIFF organised so that a client can read just the tiles and overview levels it needs via HTTP range requests.

Key properties:
- **Internal tiling** (typically 256×256 or 512×512)
- **Overviews** (pre-computed pyramids for quick preview)
- **HTTP range requests** (read only part of the file from cloud storage)
- **Compression** (DEFLATE, LZW, ZSTD, …)

In [None]:
import rasterio
import matplotlib.pyplot as plt

# A Sentinel-2 Red band stored as a COG on AWS
cog_url = (
    'https://sentinel-cogs.s3.us-west-2.amazonaws.com/'
    'sentinel-s2-l2a-cogs/32/T/PS/2024/12/'
    'S2B_32TPS_20241228_0_L2A/B04.tif'
)

with rasterio.open(cog_url) as src:
    print('COG properties')
    print(f'  Size:       {src.width} × {src.height}')
    print(f'  CRS:        {src.crs}')
    print(f'  Block size: {src.block_shapes}')
    print(f'  Overviews:  {src.overviews(1)}')

### Partial reads
You can request just a small window: the client fetches only the byte ranges of the tiles that cover it, not the whole file.

In [None]:
from rasterio.windows import Window

with rasterio.open(cog_url) as src:
    window = Window(col_off=5000, row_off=5000, width=500, height=500)
    subset = src.read(1, window=window)
    # Compare decoded array sizes: the window vs. the full band at the same dtype
    full_mb = src.width * src.height * subset.itemsize / 1024 / 1024
    print(f'Read {subset.nbytes / 1024:.1f} KB instead of {full_mb:.1f} MB')

fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(subset, cmap='Reds', vmin=0, vmax=3000)
ax.set_title('COG subset (500×500 px)')
plt.tight_layout()
plt.show()
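
The window read above only needs the internal tiles it overlaps. The sketch below works out which tiles those are for the same 500×500 window at offset (5000, 5000). The 10980×10980 raster size and 512×512 tile size are assumptions for illustration (check `src.block_shapes` for the real layout), and `tiles_for_window` is a hypothetical helper, not a rasterio function.

```python
import math


def tiles_for_window(col_off, row_off, width, height, tile_size):
    """Return (tile_row, tile_col) indices of internal tiles a pixel window touches."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# Assumed layout: a 10980 x 10980 raster with 512 x 512 internal tiles.
IMG_SIZE, TILE = 10980, 512
touched = tiles_for_window(col_off=5000, row_off=5000, width=500, height=500,
                           tile_size=TILE)
total = math.ceil(IMG_SIZE / TILE) ** 2
print(f"Window touches {len(touched)} of {total} tiles: {touched}")
```

Under these assumptions the reader needs 4 of 484 tiles, which is why the network transfer stays small even though the file is large.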

### Overview reads
Overviews let you get a quick thumbnail without touching full-resolution tiles.

In [None]:
with rasterio.open(cog_url) as src:
    ovr = src.overviews(1)[2]  # 3rd overview level
    thumb = src.read(1, out_shape=(src.height // ovr, src.width // ovr))
    print(f'Overview shape: {thumb.shape}  (1/{ovr} of full res)')

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(thumb, cmap='Reds', vmin=0, vmax=3000)
ax.set_title(f'Quick preview (overview level 2, 1/{ovr})')
ax.axis('off')
plt.tight_layout()
plt.show()
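
The saving from an overview read is simple arithmetic: each decimation factor divides both dimensions, so the pixel count drops with the square of the factor. The sketch below assumes a 10980×10980 band with factors 2, 4, 8 and 16; the real values come from `src.height`, `src.width` and `src.overviews(1)`. It rounds dimensions up, as overview builders commonly do, whereas the cell above uses floor division, which gives a nearly identical shape.

```python
import math


def overview_shapes(height, width, factors):
    """Map each decimation factor to its (rows, cols) shape, rounding up."""
    return {f: (math.ceil(height / f), math.ceil(width / f)) for f in factors}


# Assumed full-resolution size and decimation factors (illustrative only).
shapes = overview_shapes(10980, 10980, [2, 4, 8, 16])
for factor, (rows, cols) in shapes.items():
    share = rows * cols / (10980 * 10980)
    print(f"1/{factor}: {rows} x {cols} px, {share:.2%} of full-resolution pixels")
```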

---

## 4) Zarr

Zarr stores N-dimensional arrays in **chunks**, each as a separate object in cloud storage (S3, Azure Blob, GCS).

- **Chunked**: read only the chunks that overlap your query
- **Compressed**: multiple codecs supported
- **Parallel I/O**: each chunk is an independent object
- **Hierarchical**: a store can contain groups and multiple arrays
- **Metadata**: shape, dtype, chunks, and codecs are stored alongside the data
- **xarray native**: `xr.open_zarr()` returns a lazy Dataset
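
To see why chunking makes queries cheap, the sketch below computes which chunk objects a selection would touch. The array shape and chunk sizes are illustrative assumptions (41 years of daily data with 365-day years, one chunk per year), and `chunk_keys` is a hypothetical helper. The dot-separated key style matches the default layout of Zarr v2 stores; other versions and configurations name chunks differently.

```python
from itertools import product


def chunk_keys(selection, chunks):
    """List chunk keys ('i.j.k') overlapping a selection of (start, stop) index ranges."""
    per_dim = []
    for (start, stop), size in zip(selection, chunks):
        per_dim.append(range(start // size, (stop - 1) // size + 1))
    return [".".join(map(str, idx)) for idx in product(*per_dim)]


# Assumed array: 41 years of daily data (365-day years) on a 584 x 284 grid,
# chunked one year at a time. These numbers are illustrative only.
shape = (41 * 365, 584, 284)
chunks = (365, 584, 284)

# Select the last year (time indices 14600..14964) over the full grid.
selection = [(40 * 365, 41 * 365), (0, 584), (0, 284)]
keys = chunk_keys(selection, chunks)

n_chunks_total = 1
for dim, size in zip(shape, chunks):
    n_chunks_total *= -(-dim // size)  # ceiling division
print(f"Need {len(keys)} of {n_chunks_total} chunks: {keys}")
```

A selection that lines up with chunk boundaries touches a single object here; one that straddles a boundary (for example, time indices 360 to 369) would need two.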

In [None]:
import xarray as xr
import warnings
warnings.filterwarnings('ignore')

# Daymet daily weather data for Hawaii (hosted on Azure)
zarr_url = 'https://daymeteuwest.blob.core.windows.net/daymet-zarr/daily/hi.zarr'

ds = xr.open_zarr(zarr_url)
print('Zarr Dataset:')
ds

In [None]:
# Lazy temporal slice: only chunks overlapping 2020 are fetched, and only when the plot triggers computation
tmax_2020 = ds['tmax'].sel(time='2020').mean(dim='time')

fig, ax = plt.subplots(figsize=(8, 6))
tmax_2020.plot(ax=ax, cmap='RdYlBu_r')
ax.set_title('Mean max temperature: Hawaii 2020')
plt.tight_layout()
plt.show()

---

## 5) GeoParquet & vector formats

### GeoParquet

Apache Parquet extended with geometry columns:

- **Columnar**: read only the columns you need
- **Row-group filtering**: skip irrelevant data blocks
- **High compression**: much smaller than Shapefile or GeoJSON
- **Ecosystem**: DuckDB, Spark, pandas/geopandas
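
Row-group filtering can be pictured as a bounding-box test against per-group statistics. The sketch below only mimics that logic in plain Python: the row groups, their bounding boxes, and row counts are hypothetical, and real readers rely on Parquet column statistics (for example on a bounding-box column, where the file provides one) rather than a list of dictionaries.

```python
# Hypothetical row-group statistics: each entry holds the bounding box
# (xmin, ymin, xmax, ymax) of the geometries stored in that row group.
row_groups = [
    {"id": 0, "bbox": (1.40, 42.42, 1.50, 42.50), "rows": 6000},
    {"id": 1, "bbox": (1.50, 42.50, 1.60, 42.58), "rows": 6000},
    {"id": 2, "bbox": (1.60, 42.58, 1.79, 42.66), "rows": 6000},
]


def intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def groups_to_read(groups, query_bbox):
    """Return the ids of row groups whose bbox intersects the query bbox."""
    return [g["id"] for g in groups if intersects(g["bbox"], query_bbox)]


query = (1.52, 42.51, 1.55, 42.54)
selected = groups_to_read(row_groups, query)
rows_read = sum(g["rows"] for g in row_groups if g["id"] in selected)
rows_total = sum(g["rows"] for g in row_groups)
print(f"Read row groups {selected}: {rows_read:,} of {rows_total:,} rows")
```

The pruning only pays off when nearby features are stored together, so files sorted or partitioned spatially benefit most.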

### FlatGeobuf

A compact binary vector format with a built-in spatial index.
Great for streaming large feature collections over HTTP.

Key properties:
- **Binary**: smaller and faster than text formats like GeoJSON
- **Spatial indexing**: supports efficient spatial queries
- **Random access**: fetch only the features you need (no full-file scan)

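> 💡 **Tip:** FlatGeobuf can be read with `geopandas.read_file()` and GeoParquet with `geopandas.read_parquet()`. `read_file()` accepts an optional `bbox` filter; recent geopandas releases (1.0 and later) also accept `bbox` in `read_parquet()` for files that include a bounding-box column.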

In [None]:
# GeoParquet example: Overture Maps buildings for Andorra
# (only ~18 k rows, downloads fast)
import geopandas as gpd

gpq_url = 'https://data.source.coop/cholmes/overture/geoparquet-country-quad-2/AD.parquet'
%time buildings = gpd.read_parquet(gpq_url)

print(f'Loaded {len(buildings):,} buildings')
buildings.head(3)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
buildings.plot(ax=ax, facecolor='lightblue', edgecolor='navy',
              linewidth=0.3, alpha=0.7)
ax.set_title('Building footprints: Andorra (GeoParquet)')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.tight_layout()
plt.show()

---

## 6) Format comparison

| Format | Best for | Cloud-native? | Python access |
|--------|----------|---------------|---------------|
| **COG** | Single raster images | ✅ Yes | xarray via `rioxarray` |
| **Zarr** | Large N-D arrays / time series | ✅ Yes | xarray (native) |
| **NetCDF** | Climate / atmospheric data | ⚠️ With Kerchunk | xarray (native) |
| **GeoParquet** | Large vector datasets | ✅ Yes | `geopandas` |
| **FlatGeobuf** | Streaming vector with spatial queries | ✅ Yes | `geopandas` |

**Rule of thumb:** if you can choose, use **COG** for individual rasters, **Zarr** for large N-D cubes and time series, and **GeoParquet** for vectors.

---

## 7) Exercises

### ✅ Try it: Partial COG read

1. Change the `Window` parameters to read a **1000×1000** pixel patch from the top-left corner of the COG.
2. Display it with `imshow`.

In [None]:
# TODO: read a 1000×1000 patch starting at (0, 0)


### ✅ Try it: Zarr temporal query

Using the Daymet dataset opened above, compute and plot the **mean minimum temperature** (`tmin`) for **2015**.

In [None]:
# TODO: ds['tmin'].sel(time='2015').mean(dim='time').plot()


### 🧠 Checkpoint

**Q1.** What is a COG’s main advantage over a regular GeoTIFF?

- A) Smaller file size
- B) Internal tiling + overviews allow partial HTTP reads
- C) It stores vector data

**Q2.** Which format is best for storing a large 4-D (time, band, y, x) data cube in the cloud?

- A) Shapefile
- B) COG
- C) Zarr

**Q3.** What does `xr.open_zarr()` return?

- A) A NumPy array loaded into memory
- B) A lazy xarray Dataset backed by Dask arrays
- C) A pandas DataFrame

---

## 8) Recap & further reading

| Concept | Key takeaway |
|---------|-------------|
| ARD | Pre-processed data ready for analysis (e.g., Sentinel-2 L2A) |
| COG | GeoTIFF with internal tiles + overviews → partial HTTP reads |
| Zarr | Chunked N-D arrays → cloud-native parallel I/O |
| GeoParquet | Columnar vector format → fast filtering |

💡 **Don’t download everything:** cloud-native formats let you read only what you need.

### Further reading

- [Cloud-Native Geospatial Guide](https://guide.cloudnativegeo.org/)
- [COG specification](https://www.cogeo.org/)
- [Zarr docs](https://zarr.readthedocs.io/)
- [CEOS ARD (CARD4L)](https://ceos.org/ard/)
- [STAC specification](https://stacspec.org/)