<img align="right" src="https://github.com/eo2cube/eo2cube_book/blob/7880672deff906b41f993c856fe1a7eb38ed5b3a/images/banner_siegel.png?raw=true" style="width:1000px;">

# Earth Observation Data Access & Analysis with Xarray

This notebook consolidates three course notebooks into one coherent, presentable flow:

1. **Data lookup & loading** with a STAC API and `odc-stac`
2. **Xarray data structure fundamentals** (how EO data are represented)
3. **Advanced Xarray** (indexing, aggregation, reshaping, plotting patterns)

> **Tip for presenting:** Run top-to-bottom once to cache data, then re-run selected sections live.

---


## How to use this notebook

Run the notebook **top-to-bottom** the first time, because later parts reuse variables defined in earlier ones.

**Structure**
1. **Part 1** — Find EO data via STAC and load it into `xarray`
2. **Part 2** — Understand `xarray` data structures and core operations
3. **Part 3** — Advanced time handling, subsetting, statistics, resampling, interpolation

**Tip for teaching**
- Treat each Part as a mini-lecture: **Agenda → Live demo → Checkpoint**.


In [None]:
# Setup (run once)
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

from pystac_client import Client
import planetary_computer as pc
from odc.stac import stac_load

# Matplotlib defaults for notebooks
plt.rcParams["figure.figsize"] = (10, 4)


## Part 1 — Data lookup and loading (STAC + odc-stac)


### Agenda (Part 1)

- What STAC is (and why it matters for reproducible EO workflows)
- Connect to the **Microsoft Planetary Computer** STAC API
- Search for imagery with constraints (AOI, time range, cloud cover)
- Load pixels into an `xarray.Dataset` with `odc-stac`
- Minimal “sanity checks”: band rename/scale, cloud mask (SCL), RGB quicklook

**By the end you should be able to**: go from a STAC query → analysis-ready `xarray` dataset you can subset/plot.


This notebook demonstrates how to:

- Connect to the Microsoft Planetary Computer STAC API using `pystac-client`
- Search for Sentinel-2 L2A items using `catalog.search(...)`
- Load pixels into an `xarray.Dataset` using `odc.stac.stac_load(...)`

We do this **without importing custom course utility functions** — everything is written out explicitly for learning.

## Constants

We define a small set of constants used throughout the notebook.

In [None]:
STAC_URL = "https://planetarycomputer.microsoft.com/api/stac/v1"
COLLECTION = "sentinel-2-l2a"

# Würzburg, Bavaria (approx) in EPSG:4326
bbox = (9.88, 49.75, 10.00, 49.82)  # (min_lon, min_lat, max_lon, max_lat)

# Output grid
crs = "EPSG:32632"  # UTM zone 32N
resolution = 20

## Connect to the STAC API

`Client.open(...)` returns a STAC client which we use to list collections and run searches.

In [None]:
catalog = Client.open(STAC_URL)
catalog

## List collections (optional)

STAC catalogues organize data into **collections**.

In [None]:
collections = list(catalog.get_collections())
pd.DataFrame([{"id": c.id, "title": c.title} for c in collections]).sort_values("id").reset_index(drop=True).head(20)

## Search items

A STAC search filters by space/time and any additional metadata.

Typical filters for Sentinel-2 are:
- `datetime`: a string like `"2022-03"` or an interval like `"2022-03-01/2022-03-15"`
- `eo:cloud_cover`: e.g. less than 40

You can also add a tile filter (example):

```python
query={"s2:mgrs_tile": dict(eq="32UPU")}
```

In [None]:
datetime = "2022-03-01/2022-03-15"
cloud_cover_lt = 40

# Optional: uncomment to restrict to a single MGRS tile (example value)
# tile = "32UPU"

query = {"eo:cloud_cover": {"lt": cloud_cover_lt}}
# if tile is not None:
#     query["s2:mgrs_tile"] = {"eq": tile}

search = catalog.search(
    collections=[COLLECTION],
    bbox=bbox,
    datetime=datetime,
    query=query,
)

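# `items()` replaces the deprecated `get_items()` in recent pystac-client releases
items = list(search.items())
print(f"Found {len(items)} items")
items[0].id if items else None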
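
The optional tile filter that is commented out above can be folded into a small helper. This is only a sketch: `build_query` is a hypothetical name that is not part of `pystac-client`, and it merely assembles the dictionary that `catalog.search(query=...)` expects.

```python
def build_query(cloud_cover_lt, tile=None):
    """Assemble a STAC `query` dict (hypothetical helper for this notebook)."""
    query = {"eo:cloud_cover": {"lt": cloud_cover_lt}}
    if tile is not None:
        # Restrict the search to a single MGRS tile, e.g. "32UPU"
        query["s2:mgrs_tile"] = {"eq": tile}
    return query


query_all = build_query(40)
query_tile = build_query(40, tile="32UPU")
print(query_all)
print(query_tile)
```

Either dictionary can be passed as `query=` to the search call; leaving `tile` at `None` keeps the behaviour of the cell above.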

## Inspect assets/bands

STAC items expose **assets** (often Cloud-Optimized GeoTIFFs). Here we show the asset keys of one item.

In [None]:
item = items[0]
sorted(item.assets.keys())[:30]

## Load pixels with `odc-stac`

We now load a small multi-band cube into an `xarray.Dataset`.

Key points:
- We request the **STAC asset keys** (e.g. `B02`, `B03`, `B04`, `B08`, `SCL`).
- We use `patch_url=pc.sign` so every asset URL gets signed automatically during load.
- We use `dtype="uint16"` and `nodata=0` for reading, then convert reflectance to ~0..1.

In [None]:
bands = ["B02", "B03", "B04", "B08", "SCL"]

# Resampling: categorical SCL uses nearest; reflectance uses bilinear
resampling = {"*": "bilinear", "SCL": "nearest"}

ds_raw = stac_load(
    items,
    bands=bands,
    bbox=bbox,  # crop to the AOI; otherwise the full footprint of every item is loaded
    crs=crs,
    resolution=resolution,
    chunks={"x": 2048, "y": 2048},
    patch_url=pc.sign,
    dtype="uint16",
    nodata=0,
    groupby="solar_day",
    resampling=resampling,
)

ds_raw

## Rename bands and scale reflectance

Sentinel-2 reflectance values are stored as integers. We scale by $10^{-4}$ to get approximate reflectance in 0..1.

We also rename bands to short names used in later notebooks.

In [None]:
rename_map = {
    "B02": "blue",
    "B03": "green",
    "B04": "red",
    "B08": "nir",
    "SCL": "scl",
}

present = {k: v for k, v in rename_map.items() if k in ds_raw.data_vars}
ds = ds_raw.rename(present)

for name in list(ds.data_vars):
    if name == "scl":
        continue
    ds[name] = ds[name].astype("float32") * 1e-4

ds
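
Because the data were read with `nodata=0`, plain scaling turns nodata pixels into a reflectance of exactly 0. The following sketch, on a small synthetic array rather than the downloaded data, shows one possible way to mask nodata before scaling; the array `raw` and its values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one raw band: 0 marks nodata, other values are raw digital numbers
raw = xr.DataArray(
    np.array([[0, 1200], [2500, 10000]], dtype="uint16"),
    dims=("y", "x"),
    name="red",
)

# Mask nodata first (0 becomes NaN), then scale to approximate reflectance
scaled = raw.where(raw != 0).astype("float32") * 1e-4
print(scaled.values)
```

Whether nodata should become NaN at this stage depends on the later analysis; the cloud mask in the next step removes many of these pixels anyway.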

## Mask clouds using SCL (quick example)

SCL is a categorical layer. A common “keep” set is:
- 4 vegetation
- 5 not-vegetated
- 6 water
- 7 unclassified

In [None]:
# Keep only pixels whose SCL class is in the "keep" set.
# `DataArray.isin` also works on the dask-backed arrays created by `chunks=...`.
keep = ds["scl"].isin([4, 5, 6, 7])

ds_clear = ds.where(keep)

ds_clear
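
To verify a mask like this (see the checkpoint below), it helps to count how many pixels survive. A minimal sketch on synthetic data, assuming a 2×2 scene with made-up SCL classes and reflectance values:

```python
import numpy as np
import xarray as xr

# Made-up SCL classes and red reflectance for a 2x2 scene
scl_demo = xr.DataArray(np.array([[4, 9], [6, 3]], dtype="uint8"), dims=("y", "x"))
red_demo = xr.DataArray(
    np.array([[0.05, 0.60], [0.02, 0.01]], dtype="float32"), dims=("y", "x")
)

keep_demo = scl_demo.isin([4, 5, 6, 7])   # True where the class is in the keep set
red_clear = red_demo.where(keep_demo)     # everything else becomes NaN

clear_fraction = float(keep_demo.mean())  # share of pixels that pass the mask
valid_pixels = int(red_clear.count())     # count() ignores NaN
print(clear_fraction, valid_pixels)
```

The same two lines (`keep.mean()` and `ds_clear.red.count()`) can be applied to the real dataset, optionally per time step with `dim=["x", "y"]`.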


## Quick visual check (RGB)

In [None]:
rgb = ds_clear.isel(time=0)[["red", "green", "blue"]]
rgb_plot = xr.concat([rgb.red, rgb.green, rgb.blue], dim="band").transpose("y", "x", "band")

plt.figure(figsize=(6, 6))
plt.imshow(np.clip(rgb_plot.values, 0, 0.3) / 0.3)
plt.title("RGB (masked)")
plt.axis("off")
plt.show()


### Checkpoint (Part 1)

If you can do these without copy/pasting, you’re good:

1. **Change the AOI**: update `bbox` and re-run the search + load.
2. **Change the time window**: try a different month/season — does the cloud situation change?
3. **Tighten the search**: add / adjust a cloud-cover filter (or sorting) and observe the effect on results.
4. **Verify the mask**: compare `ds` vs `ds_clear` (pixel counts, quick RGB).

➡️ Next: Part 2 explains what you just created (`Dataset`, dims/coords) and how to work with it idiomatically.


***

## Additional information

<font size="2">This notebook is adapted from public EO teaching materials and updated for STAC/Planetary Computer access using `odc-stac`. Thanks!</font>

**Last modified:** 2026

## Part 2 — Xarray-I: Data structure


### Agenda (Part 2)

- `xarray.Dataset` vs `xarray.DataArray` (and when to use which)
- Dimensions, coordinates, attributes (how EO metadata lives in xarray)
- Indexing & slicing: `sel` vs `isel`
- Reshaping between Dataset/DataArray representations

**By the end you should be able to**: navigate a dataset confidently and extract meaningful subsets without getting lost in dimensions.


### Xarray-I: Data Structure 

## Background

The Python library **`xarray`** provides the data model in which Earth observation data are usually held when working with a datacube.
It is an open-source project and Python package that offers a toolkit for working with ***multi-dimensional arrays*** of data. An **`xarray.Dataset`** is an in-memory representation of a netCDF (Network Common Data Form) file. Understanding the structure of an **`xarray.Dataset`** is the key to working with these data, so this part is mainly dedicated to explaining that structure.

## Description

In this notebook, topics covered include:
* **What is inside an `xarray.Dataset` (the structure)?**
* **(Basic) Subset Dataset / DataArray**
* **Reshape a Dataset**

In [None]:
# If you jumped directly here, ensure the example dataset `ds` exists
try:
    ds
except NameError:
    raise NameError("Dataset `ds` is not defined. Run Part 1 first (data lookup & loading).")


In [None]:
da = ds.to_array().rename({"variable":"band"})
print(da)

<xarray.DataArray (band: 3, time: 24, y: 905, x: 977)>
array([[[[ 9640,  9624,  9536, ...,  9928,  9968,  9992],
         [ 9576,  9592,  9568, ...,  9928,  9920, 10000],
         [ 9648,  9576,  9600, ...,  9928,  9960, 10016],
         ...,
         [ 7676,  7612,  7636, ...,  9096,  9096,  9128],
         [ 7708,  7612,  7704, ...,  9040,  9096,  9056],
         [ 7676,  7568,  7656, ...,  9056,  9040,  9032]],

        [[ 2156,  1876,  1705, ...,  1531,  1570,  1626],
         [ 1700,  1602,  1626, ...,  1456,  1520,  1624],
         [ 2274,  2714,  1968, ...,  1417,  1475,  1598],
         ...,
         [ 1206,  1208,  1220, ...,  1292,  1317,  1290],
         [ 1191,  1189,  1174, ...,  1315,  1313,  1308],
         [ 1226,  1188,  1151, ...,  1323,  1291,  1338]],

        [[ 7208,  7468,  7676, ...,  9392,  9472,  9632],
         [ 6836,  7208,  7280, ...,  9400,  9424,  9512],
         [ 6180,  6556,  6684, ...,  9360,  9416,  9464],
         ...,
...
         ...,
         [ 

In [None]:
ds2 = da.to_dataset(dim="time")
ds2

## **What is inside a `xarray.dataset`?**
The figure below depicts the structure of an **`xarray.Dataset`** like the one we've just loaded. Read it alongside the text below, which explains the data structure step by step.

![xarray data structure](https://live.staticflickr.com/65535/51083605166_70dd29baa8_k.jpg)

The diagram shows a dataset with three ***Data Variables***, "blue", "green", and "red" (shown with colors in the diagram), one per spectral band. The dataset `ds` loaded in Part 1 additionally holds `nir` and `scl`.

Each data variable can be regarded as a **multi-dimensional *Data Array*** with the same structure. It is a **three-dimensional array** (shown as a 3D cube in the diagram) where `time`, `x`, and `y` are its ***Dimensions*** (shown as the axes of each cube in the diagram).

In the diagram's example, there are 49 ***coordinates*** under the `time` dimension, which means there are 49 time steps along the `time` axis. There are 1010 coordinates under the `x` dimension and 1031 coordinates under the `y` dimension, indicating 1010 pixels along the `x` axis and 1031 pixels along the `y` axis. Your own `ds` will have different sizes; `ds.sizes` reports them.

The term ***dataset*** refers to a *container* holding all the multi-dimensional arrays of the same structure (shown as the red-lined box containing all 3D cubes in the diagram).

So the example dataset has a spatial extent of 1010 by 1031 pixels at given x/y locations, spans 49 time stamps, and includes 3 spectral bands.

**In summary, an *`xarray.Dataset`* is essentially a container for multi-dimensional *`DataArray`* objects with common attributes (e.g., crs) attached:**
* **Data Variables (`data_vars`)**: Generally the first level to subset from a dataset. Each data variable contains a multi-dimensional array spanning all other dimensions.
* **Dimensions (`dims`)**: The named axes of the arrays, in order *(e.g. 'time', 'y', 'x')*.
* **Coordinates (`coords`)**: The labels along each dimension *(e.g. time steps along the 'time' dimension, latitudes or northings along 'y', longitudes or eastings along 'x')*.
* **Attributes (`attrs`)**: A dictionary (`dict`) containing metadata.
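
The four building blocks can be seen in isolation by constructing a tiny dataset by hand. This is a sketch with made-up values; the name `toy` and all numbers are illustrative and unrelated to the Sentinel-2 data.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2022-03-01", periods=2, freq="D")
y = np.array([40.0, 20.0, 0.0])        # descending, as in north-up rasters
x = np.array([0.0, 20.0, 40.0, 60.0])
shape = (time.size, y.size, x.size)

toy = xr.Dataset(
    data_vars={
        "blue": (("time", "y", "x"), np.zeros(shape, dtype="float32")),
        "red": (("time", "y", "x"), np.ones(shape, dtype="float32")),
    },
    coords={"time": time, "y": y, "x": x},
    attrs={"crs": "EPSG:32632"},
)

print(list(toy.data_vars))  # names of the data variables
print(dict(toy.sizes))      # length of each dimension
print(list(toy.coords))     # coordinate names
print(toy.attrs)            # metadata dictionary
```

The same four accessors work on `ds`, which is how the cells below inspect the real data.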

Now let's take the dataset we have just loaded apart a bit further to make things clearer! :D

* **To check the structure of the dataset**

In [None]:
# Display the dataset itself; `ds.values` without parentheses only shows a bound method
ds

<xarray.Dataset>
Dimensions:      (time: 24, y: 905, x: 977)
Coordinates:
  * time         (time) datetime64[ns] 2022-03-02T10:19:41.024000 ... 2022-05...
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) uint16 9640 9624 9536 9552 ... 12576 12312 11480
    green        (time, y, x) uint16 8992 8872 8896 8960 ... 11080 10680 10024
    red          (time, y, x) uint16 8488 8448 8408 8416 ... 10192 9824 9352
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref

* **To check existing dimensions of the dataset**

In [None]:
ds.dims

Frozen({'time': 24, 'y': 905, 'x': 977})

* **To check the coordinates of the dataset**

In [None]:
ds.coords

Coordinates:
  * time         (time) datetime64[ns] 2022-03-02T10:19:41.024000 ... 2022-05...
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734

* **To check all coordinates along a specific dimension**
<br>
<img src="https://live.staticflickr.com/65535/51115452191_ec160d4514_o.png" width="450">

In [None]:
ds.time
# OR
#ds.coords['time']

* **To check attributes of the dataset**

In [None]:
ds.attrs

{'crs': 'EPSG:32734', 'grid_mapping': 'spatial_ref'}

## **Subset Dataset / DataArray**

* **To select all data of "blue" band**
<br>
<img src="https://live.staticflickr.com/65535/51115092614_366cb774a8_o.png" width="350">

In [None]:
ds.blue
# OR
#ds['blue']

In [None]:
# Only print pixel values
ds.blue.values

array([[[1122, 1094, 1082, ...,  834,  839,  875],
        [1118, 1080, 1108, ...,  835,  839,  846],
        [1108, 1098, 1112, ...,  789,  828,  821],
        ...,
        [ 945,  944,  987, ...,  683,  772,  922],
        [ 982, 1036, 1042, ...,  658,  727,  866],
        [ 982, 1070, 1096, ...,  622,  700,  880]],

       [[1112, 1096, 1074, ...,    0,    0,    0],
        [1110, 1074, 1090, ...,    0,    0,    0],
        [1080, 1050, 1072, ...,    0,    0,    0],
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0]],

       [[7328, 7304, 7296, ..., 9064, 8952, 8824],
        [7368, 7296, 7296, ..., 9000, 8888, 8704],
        [7364, 7340, 7328, ..., 9000, 8920, 8760],
        ...,
        [6608, 6588, 6592, ..., 7208, 7216, 7184],
        [6596, 6644, 6620, ..., 7252, 7228, 7216],
        [6608, 6632, 6620, ..., 7260, 7220, 7252]],

       ...,

       [[ 465,  449,  44

* **To select blue band data at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51116131265_8464728bc1_o.png" width="350">

In [None]:
ds.blue[0]

* **To select blue band data at the first time stamp while the latitude is the largest in the defined spatial extent**
<img src="https://live.staticflickr.com/65535/51115337046_aeb75d0d03_o.png" width="350">

In [None]:
ds.blue[0][0]

* **To select the upper-left corner pixel**
<br>
<img src="https://live.staticflickr.com/65535/51116131235_b0cca9589f_o.png" width="350">

In [None]:
ds.blue[0][0][0]

### **subset dataset with `isel` vs. `sel`**
* Use `isel` when subsetting with **index**
* Use `sel` when subsetting with **labels**
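
The difference between the two is easiest to see on a small labelled array. The sketch below uses synthetic values; `da_demo` and its coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

da_demo = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("time", "x"),
    coords={
        "time": pd.date_range("2022-03-01", periods=3, freq="D"),
        "x": [100.0, 120.0, 140.0, 160.0],
    },
)

by_position = da_demo.isel(time=0, x=1)             # first time step, second column
by_label = da_demo.sel(time="2022-03-01", x=120.0)  # same pixel, addressed by its labels
nearest = da_demo.sel(x=125.0, method="nearest")    # snaps to the closest label, x=120.0

print(by_position.item(), by_label.item(), float(nearest["x"]))
```

`method="nearest"` is handy for real rasters, where pixel-centre coordinates are rarely round numbers.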

* **To select data of all spectral bands at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51114879732_7d62db54f4_o.png" width="750">

In [None]:
ds.isel(time=[0])

* **To select data of all spectral bands for the year 2022**
<br>
<img src="https://live.staticflickr.com/65535/51116281070_75f1b46a9c_o.png" width="750">

In [None]:
ds.sel(time='2022')

## **Reshape Dataset**

* **Convert the Dataset (subset to 2022-03) to a *4-dimension* DataArray**

In [None]:
da = ds.sel(time='2022-03').to_array().rename({"variable":"band"})
da

* **Convert the *4-dimension* DataArray back to a Dataset by setting the "time" as DataVariable (reshaped)**

![ds_reshaped](https://live.staticflickr.com/65535/51151694092_ca550152d6_o.png)

In [None]:
ds_reshp = da.to_dataset(dim="time")
print(ds_reshp)

<xarray.Dataset>
Dimensions:                     (band: 3, y: 905, x: 977)
Coordinates:
  * y                           (y) float64 1.558e+07 1.558e+07 ... 1.557e+07
  * x                           (x) float64 -3.002e+05 -3.002e+05 ... -2.904e+05
    spatial_ref                 int32 32734
  * band                        (band) object 'blue' 'green' 'red'
Data variables:
    2022-03-02 10:19:41.024000  (band, y, x) uint16 9640 9624 9536 ... 8108 8056
    2022-03-05 10:29:21.024000  (band, y, x) uint16 2156 1876 1705 ... 1520 1541
    2022-03-07 10:17:59.024000  (band, y, x) uint16 7208 7468 7676 ... 7280 7416
    2022-03-10 10:27:39.025000  (band, y, x) uint16 2118 1904 1750 ... 1479 1491
    2022-03-12 10:18:31.024000  (band, y, x) uint16 2346 2050 1885 ... 1581 1640
    2022-03-15 10:28:11.025000  (band, y, x) uint16 7172 7176 7172 ... 6640 6688
    2022-03-17 10:16:49.024000  (band, y, x) uint16 6024 6016 6040 ... 6724 6640
    2022-03-20 10:26:39.024000  (band, y, x) uint16 2040 18
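
A round trip makes the reshaping easier to follow. The sketch below uses a small synthetic dataset (`toy_ds`, made-up values) and splits along `band` rather than `time`, which recovers the original variables.

```python
import numpy as np
import xarray as xr

toy_ds = xr.Dataset(
    {
        "blue": (("time", "y", "x"), np.zeros((2, 3, 4))),
        "red": (("time", "y", "x"), np.ones((2, 3, 4))),
    },
    coords={"time": [0, 1]},
)

# Dataset -> DataArray: the variables are stacked along a new "band" dimension
stacked = toy_ds.to_array(dim="band")

# DataArray -> Dataset: splitting along "band" gives one variable per band again
restored = stacked.to_dataset(dim="band")

print(stacked.dims)
print(list(restored.data_vars))
```

Splitting along `time` instead, as in the cell above, produces one data variable per time step, each with dimensions `(band, y, x)`.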

### Checkpoint (Part 2)

1. **Slice like a pro**: pick one band and one time step using **both** `sel` and `isel`.
2. **Reduce**: compute a spatial mean time series for a band (mean over `x`/`y`).
3. **Reshape**: convert Dataset → DataArray → Dataset again and explain what changed (dims/coords).

➡️ Next: Part 3 goes deeper on time handling + higher-level operations (resampling, conditional ops, interpolation).


***
## Additional information

This notebook is intended for use in the Jupyter Notebook environment of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/).

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 


**Contact:** If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com).

**Last modified:** December 2023

## Part 3 — Advanced Xarray


### Agenda (Part 3)

- Working with `datetime64` time coordinates
- Alternative time dimensions and reshaping time
- Spatial and temporal subsetting (and combining both)
- Manipulation & statistics: conditional operations, resampling, grouping
- Interpolation and merging datasets

**By the end you should be able to**: build robust EO analysis pipelines that don’t break when time axes or missing values get messy.


### Datasets used in this notebook

- `ds` is loaded in **Part 1** (used in Part 2 examples).
- `ds_adv` is loaded at the start of **Part 3** (separate time range for advanced exercises).


### Advanced Xarray

## Description
The previous part, Part 2 (Xarray-I: Data structure), gave a first introduction to working with `xarray`.

In this notebook, we deepen the understanding of `xarray` as a container for remote sensing raster data and introduce additional `xarray` functions that are useful for analysis workflows.


## Setup
We will use `pystac-client` to search the Microsoft Planetary Computer STAC catalog and `odc-stac` (`stac_load`) to load the requested data into an `xarray.Dataset`. We use `NumPy` and `xarray` for the analysis steps.


### STAC search and load data
First, we search the Planetary Computer STAC catalog and load an example dataset using `odc-stac`.


In [None]:
# Load data from Planetary Computer (STAC)
STAC_URL = "https://planetarycomputer.microsoft.com/api/stac/v1"
COLLECTION = "sentinel-2-l2a"

# Area of interest: Würzburg (EPSG:4326)
bbox_adv = (9.88, 49.75, 10.0, 49.82)

# Output grid
crs = "EPSG:32632"
resolution = 20

# Search STAC items
catalog_adv = Client.open(STAC_URL)

datetime = "2021-03-01/2021-06-15"
query = {"eo:cloud_cover": {"lt": 40}}

# Note: the keyword expected by `search` is `bbox`
stac_search = catalog_adv.search(collections=[COLLECTION], bbox=bbox_adv, datetime=datetime, query=query)
items_adv = list(stac_search.items())
print(f"Found {len(items_adv)} items")

# Load pixels with odc-stac
bands_adv = ["B02", "B03", "B04", "B08"]
resampling = {"*": "bilinear"}

ds_raw_adv = stac_load(
    items_adv,
    bands=bands_adv,  # the keyword expected by `stac_load` is `bands`
    bbox=bbox_adv,    # crop to the AOI; otherwise the full footprint of every item is loaded
    crs=crs,
    resolution=resolution,
    groupby="solar_day",
    patch_url=pc.sign,
    dtype="uint16",
    nodata=0,
    resampling=resampling,
)

# Rename to match the variable names used throughout this notebook
rename_map = {"B02": "blue", "B03": "green", "B04": "red", "B08": "nir"}
ds_adv = ds_raw_adv.rename({k: v for k, v in rename_map.items() if k in ds_raw_adv.data_vars})

# Scale reflectance (Sentinel-2 L2A) to ~0..1
for name in list(ds_adv.data_vars):
    if name != "scl":
        ds_adv[name] = ds_adv[name].astype("float32") * 1e-4

ds_adv


<a id='index_array3'></a>
## **Advanced Indexing**
### 1) Temporal Subset

In the earlier part, we introduced `isel()` and `sel()` for indexing data. Both methods also accept a **slice**: if a `slice()` object is passed to the indexing method, the dataset is sliced along that dimension.
The first example slices by position to select the first five scenes in `ds_adv`. The start value (here, 0) is included and the stop value (here, 5) is excluded.

#### I. Using index number

In [None]:
ds_adv.isel(time=slice(0,5))
#ds_adv.isel(time = [0,1,2,3,4])

In [None]:
ds_adv.isel(time=slice(0,5)).time

#### II. Using `datetime64` data

This example slices by label to select the scenes between "2021-03-01" and "2021-03-10". Note that when using `slice()` with the `sel()` method, both the start and the stop value are included.

In [None]:
print(ds_adv.sel(time=slice("2021-03-01","2021-03-10"))) 

<xarray.Dataset>
Dimensions:      (time: 4, y: 905, x: 977)
Coordinates:
  * time         (time) datetime64[ns] 2022-03-02T10:19:41.024000 ... 2022-03...
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) uint16 9640 9624 9536 9552 ... 1350 1365 1336 1349
    green        (time, y, x) uint16 8992 8872 8896 8960 ... 1422 1440 1428 1430
    red          (time, y, x) uint16 8488 8448 8408 8416 ... 1516 1509 1479 1491
    nir          (time, y, x) uint16 8768 8720 8656 8616 ... 2476 2438 2342 2402
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref


In [None]:
ds_adv.sel(time=slice("2021-03-01","2021-03-10")).time
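
The different treatment of the stop value can be checked on a synthetic daily series. This is a sketch; `series` and its dates are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-03-01", periods=10, freq="D")
series = xr.DataArray(np.arange(10), dims="time", coords={"time": times})

# Positional slice: positions 0..4, the stop position 5 is excluded
by_index = series.isel(time=slice(0, 5))

# Label slice: 1 March through 5 March, the stop date is included
by_label = series.sel(time=slice("2021-03-01", "2021-03-05"))

print(by_index.sizes["time"], by_label.sizes["time"])
print(by_label.time.values[-1])
```

Both selections contain five scenes, although the positional stop is `5` while the label stop is the fifth date itself.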

#### III. Using other time dimensions

`xarray` also offers useful accessors for inspecting the time dimension, which help to extract additional information from a dataset efficiently. The following code labels every time step with its season ("DJF", "MAM", "JJA", "SON"). Many other datetime components are available through `time.dt`, e.g., `month`, `isocalendar().week`, `weekday`, `dayofyear`.

In [None]:
ds_adv.time.dt.season

In [None]:
ds_adv.time.dt.month

In [None]:
ds_adv.time.dt.weekday

It is also possible to extract the "day of year" for a time step.

In [None]:
ds_adv.time.dt.dayofyear

In [None]:
ds_adv.groupby('time.season')

DatasetGroupBy, grouped over 'season'
2 groups with labels 'JJA', 'MAM'.

In [None]:
# Call `.mean()` with parentheses to compute one mean per season
ds_adv.groupby('time.season').mean()
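
As a small, download-free illustration of season grouping, the sketch below builds a synthetic daily series covering spring and summer and averages it per season; the variable names and values are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-03-01", "2021-08-31", freq="D")
values = xr.DataArray(
    np.arange(times.size, dtype="float64"),
    dims="time",
    coords={"time": times},
    name="demo",
)

# One mean per season label found in the time coordinate
seasonal_mean = values.groupby("time.season").mean()
print(seasonal_mean.season.values)
print(seasonal_mean.values)
```

The result has a new `season` dimension in place of `time`, with one entry per season that occurs in the data.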

### 2) Spatial Subset
It is possible to index and **slice within the x and y dimensions**. The following example selects the values of all bands for the pixel in the third column and the sixth row of the raster (`x=2, y=5`; indices are zero-based).

In [None]:
ds_adv.isel(x=2, y=5)
#ds_adv.isel(x=[0,1,2], y=5)

### 3) Combining Temporal and Spatial Subset

We can subset temporally and spatially at the same time using the `slice()` operator. If you know the actual coordinate values (x, y) that bound the spatial subset, use the `sel()` function.

The following example subsets `ds_adv` by the temporal and spatial position of the pixels. Only the pixels in the first five columns and the first five rows are included in the output, and only the first five time steps are kept.

In [None]:
ds2 = ds_adv.isel(time=slice(0,5), x= slice(0,5), y=slice(0,5))
ds2

#ds2.time
#plt.scatter(ds2.x.values, ds2.y.values)
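
For the label-based variant with `sel()`, the order of the slice bounds matters, because the `y` coordinate of a north-up raster usually decreases from top to bottom. A sketch on a synthetic cube (all coordinates are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

cube = xr.DataArray(
    np.zeros((4, 5, 5)),
    dims=("time", "y", "x"),
    coords={
        "time": pd.date_range("2021-03-01", periods=4, freq="D"),
        "y": 5_000_080.0 - np.arange(5) * 20.0,  # descending: 5000080, 5000060, ...
        "x": 500_000.0 + np.arange(5) * 20.0,    # ascending: 500000, 500020, ...
    },
)

subset = cube.sel(
    time=slice("2021-03-01", "2021-03-02"),
    x=slice(500_000.0, 500_040.0),
    y=slice(5_000_080.0, 5_000_040.0),  # larger value first because y is descending
)

# Bounds given in ascending order select nothing on a descending coordinate
empty = cube.sel(y=slice(5_000_040.0, 5_000_080.0))

print(subset.shape, empty.sizes["y"])
```

If a spatial `sel()` unexpectedly returns an empty result, checking the direction of `ds_adv.y` is a good first step.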

## **Data Manipulation & Statistics**

This notebook presents some basic built-in functions of the `xarray` library to manipulate and transform data in a `xarray.Dataset`. Here, we show only a fraction of the available `xarray` functions. For a complete overview of all the available functions and tools of the `xarray` package, please visit the [documentation website](http://xarray.pydata.org/en/stable/). 

[Notebook 07](07_basic_analysis.ipynb) will cover this topic, focusing on an application-oriented remote sensing approach.

### 1) Statistical Operation

The simple built-in functions allow the user to do simple calculations with a `xarray.Dataset`.
The **basic math** built-in `xarray` functions are:
* `min()`, `max()`
* `mean()`, `median()`
* `sum()`
* `std()`

The following code demonstrates the use of the `max()` function to extract the maximum value of the red band in the `ds_adv` dataset.

In [None]:
print(ds_adv.red.max())

<xarray.DataArray 'red' ()>
array(19440, dtype=uint16)
Coordinates:
    spatial_ref  int32 32734


To reduce along specific dimensions only (e.g., to calculate one mean per time step), pass the dimension label(s) to the `dim` argument of the function.

This example averages the `red` band over all pixels (dimensions `x` and `y`) separately for each time step. The result is a one-dimensional time series that can be used for further visualization and analysis.

In [None]:
print(ds_adv.red.mean(dim=["x", "y"]))

#ds_adv.red.mean(dim=["x", "y"]).values
#plt.plot(ds_adv.red.mean(dim=["x", "y"]).values)

<xarray.DataArray 'red' (time: 42)>
array([ 1818.96105227,  8074.00635727,  1754.47787058,  5327.82408659,
        5052.20414958,  7466.13877413,  6652.25770399,  1943.12213507,
        5791.55179063,  2165.9254138 ,  6284.90501535,  1836.40194077,
        4318.13241799,  2913.1250926 , 10243.98031181,  2596.53914848,
        2306.90195604,  7080.55823498, 10008.62734835,  9668.67531795,
        3243.49281542,  1834.18206823,  1859.24376347,  8750.22720245,
        6678.98628002,  7958.82580229,  9407.54412708,  1739.37265052,
        6333.50257243,  6604.30961055,  3084.24980632,  8654.94368712,
        3030.41774063,  6430.65277289,  9552.01161069,  1698.20226649,
        1879.8146112 ,  9411.29344425,  8475.92199596,  1910.32194733,
        1629.76212444,  2132.29992705])
Coordinates:
  * time         (time) datetime64[ns] 2021-03-02T10:18:39.025000 ... 2021-06...
    spatial_ref  int32 32734


This example works the other way around. It calculates the standard deviation of every pixel (`x`, `y`) over all time steps of the dataset `ds_adv`.

In [None]:
print(ds_adv.red.std(dim="time"))

<xarray.DataArray 'red' (y: 905, x: 977)>
array([[2950.13795135, 3087.9106436 , 3190.4511719 , ..., 3660.84028099,
        3649.89349835, 3641.69284818],
       [2981.50697802, 3094.12890492, 3183.76733881, ..., 3678.63263811,
        3657.9010849 , 3639.44330671],
       [2897.76457591, 2874.01064669, 3061.91223283, ..., 3680.94274286,
        3674.93537613, 3657.93154202],
       ...,
       [3267.92346919, 3279.36118553, 3303.54414343, ..., 3657.32528726,
        3648.90419461, 3617.15589938],
       [3263.62266754, 3275.78591468, 3317.94272277, ..., 3667.18960309,
        3662.70165187, 3640.87214865],
       [3273.34765963, 3271.6850201 , 3323.72091445, ..., 3678.39480058,
        3666.36207741, 3660.46720002]])
Coordinates:
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
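
Which dimensions remain after a reduction is easy to verify on a tiny synthetic stack. The array below is made up for illustration.

```python
import numpy as np
import xarray as xr

stack = xr.DataArray(
    np.arange(24, dtype="float64").reshape(2, 3, 4),
    dims=("time", "y", "x"),
)

per_scene = stack.mean(dim=["x", "y"])  # reduces space: one value per time step
per_pixel = stack.std(dim="time")       # reduces time: one value per pixel

print(per_scene.dims, per_scene.values)
print(per_pixel.dims, per_pixel.shape)
```

The dimensions named in `dim` disappear from the result, and all others are kept.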


Remember, to access the raw `numpy` array that stores the values of the resulting `xarray.DataArray`, use the `.values` attribute. This allows you to work with the "actual" data values.

In [None]:
print(ds_adv.blue.sum(dim=["x","y"]).values)
#plt.plot(ds_adv.blue.sum(dim=["x","y"]).values)

[ 1378819884  7964658670  1272194029  4698605812  4541559804  6886255640
  6110182240  1482596331  5194190235  1712707186  6006349540  1342901595
  4120623844  2508944545 10428593244  2162029094  1887043287  6769489264
  8975718324  8915725986  2999159582  1392269227  1379476980  9055516168
  5978472916  7119185436  8538318172  1328425136  5701259768  6387376838
  2870126501  8182499220  2881210000  5764934137  9266444432  1380615879
  1568188829  9715135166  7933586832  1631484539  1335706615  1833185628]


### 2) Conditional Operation

Conditional operations are helpful when we only want to analyze the scenes or pixels that meet a criterion. The `where()` function **masks** an `xarray.Dataset` based on a logical condition. By default, it keeps all values where the condition is `True` and replaces all values where it is `False` with NaN. This is especially useful in combination with a boolean mask. The argument `other` sets the replacement value for positions where the condition is `False` (default is `nan`). The argument `drop=True` additionally removes coordinate labels along which the condition is `False` everywhere.
The following example masks the dataset `ds_adv` to only those pixels with a reflectance greater than 0.07 in the `red` band (700 in raw digital numbers, before the scaling by $10^{-4}$ applied when loading).

In [None]:
print(ds_adv.where(ds_adv.red > 0.07))
#print(ds_adv.where(ds_adv.red < 0.07))

<xarray.Dataset>
Dimensions:      (time: 42, y: 905, x: 977)
Coordinates:
  * time         (time) datetime64[ns] 2021-03-02T10:18:39.025000 ... 2021-06...
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) float32 1.938e+03 1.691e+03 ... 1.816e+03
    green        (time, y, x) float32 2.118e+03 1.872e+03 ... 1.927e+03
    red          (time, y, x) float32 2.274e+03 2.032e+03 ... 1.767e+03
    nir          (time, y, x) float32 3.493e+03 3.307e+03 ... 4.284e+03
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
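
The effect of `other` and `drop` can be compared side by side on a small synthetic band. The values and names below are made up for illustration.

```python
import numpy as np
import xarray as xr

band = xr.DataArray(
    np.array([[0.02, 0.10], [0.30, 0.05]]),
    dims=("y", "x"),
    coords={"y": [20.0, 0.0], "x": [0.0, 20.0]},
)
cond = band > 0.07

masked = band.where(cond)                    # positions where cond is False become NaN
filled = band.where(cond, other=-9999.0)     # positions where cond is False become -9999
trimmed = band.where(band > 0.2, drop=True)  # rows/columns that are False everywhere are dropped

print(masked.values)
print(filled.values)
print(trimmed.shape)
```

Note that `where()` never changes values where the condition is `True`; it only decides what happens to the rest.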


This code replaces all zeros in the red band of the first time step of `ds_adv` with the new value -9999.

In [None]:
red_first = ds_adv.red.isel(time=0)
# Build the condition from the same time step so the result stays two-dimensional
replace = red_first.where(red_first != 0, other=-9999)
#replace.values.min()

The `xarray` method `isin()` **tests each value** of an `xarray.Dataset` or `xarray.DataArray` for membership in a given set of elements. It returns a boolean array which can be used as a mask.
Because membership tests are exact, they suit integer data better than scaled floating-point reflectance. This example therefore checks the raw digital numbers of the red band (`B04` in `ds_raw_adv`) against the integers from 0 to 549.

In [None]:
mask_red = ds_raw_adv["B04"].isin(range(550))
print(mask_red)

#plt.imshow(mask_red) #error
#plt.imshow(mask_red.isel(time=3))

<xarray.DataArray 'red' (time: 21, y: 1031, x: 1010)>
array([[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ...,
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ...,
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ...,
...
        ...,
        [False, False, False, .

The created mask can easily be combined with the `where()` function to filter the dataset. In this case, the `ds_adv` dataset is masked with the previously defined mask `mask_red`, which tests whether the values of the `red` band lie within a specific range of values.

In [None]:
print(ds_adv.where(mask_red)) #masking

<xarray.Dataset>
Dimensions:      (time: 21, y: 1031, x: 1010)
Coordinates:
  * time         (time) datetime64[ns] 2020-10-01T08:28:17 ... 2020-12-30T08:...
  * y            (y) float64 6.807e+06 6.807e+06 ... 6.797e+06 6.797e+06
  * x            (x) float64 8.687e+05 8.687e+05 ... 8.787e+05 8.788e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) float64 nan nan nan nan nan ... nan nan nan nan
    green        (time, y, x) float64 nan nan nan nan nan ... nan nan nan nan
    red          (time, y, x) float64 nan nan nan nan nan ... nan nan nan nan
    nir          (time, y, x) float64 nan nan nan nan nan ... nan nan nan nan
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
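
The mask-then-filter pattern above can be sketched on a small synthetic array, which also shows how to quantify how many pixels survive the mask. The array `red_demo` and its values are made up for illustration. `np.arange(550)` is used here in place of `range(550)` and covers the same integers.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 2 "red" band with values on both sides of the threshold
red_demo = xr.DataArray(
    np.array([[100, 549], [550, 2000]]),
    dims=("y", "x"),
    name="red",
)

# True where the value is one of the integers 0..549
mask_demo = red_demo.isin(np.arange(550))

# Keep values where the mask is True; everything else becomes NaN (dtype turns float)
masked_demo = red_demo.where(mask_demo)

# Quantify the mask: number and fraction of pixels that remain
n_kept = int(mask_demo.sum())
fraction_kept = float(mask_demo.mean())

print(masked_demo.values)
print(n_kept, fraction_kept)
```

Because `isin()` tests for exact membership, it only matches integer values here. For a continuous range, a comparison such as `red_demo < 550` is usually the better choice.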


### 3) Resampling
Resampling regroups a time series onto regular temporal windows (for example, months). It is needed whenever a data product has to align with a fixed time interval.

 - **resample() method**

The **`resample()` method** regroups an `xarray.Dataset` into coarser or finer bins along a dimension, so it handles both upsampling and downsampling. The dimension passed as keyword (here `time`) must be a datetime-like coordinate. In the following example, we resample the `ds_adv` dataset to a monthly interval (`time="ME"`, month end; older pandas versions spell this alias `"M"`) and then calculate the median for every bin. _(this process takes a little while to run)_

In [None]:
print(ds_adv.resample(time="ME").median())  # month-end bins; on pandas < 2.2 use time="M"

<xarray.Dataset>
Dimensions:      (y: 905, x: 977, time: 4)
Coordinates:
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
  * time         (time) datetime64[ns] 2021-03-31 2021-04-30 ... 2021-06-30
Data variables:
    blue         (time, y, x) float64 2.828e+03 2.754e+03 ... 1.532e+03
    green        (time, y, x) float64 2.914e+03 2.85e+03 ... 1.723e+03 1.724e+03
    red          (time, y, x) float64 2.93e+03 2.882e+03 ... 1.475e+03 1.478e+03
    nir          (time, y, x) float64 3.782e+03 3.774e+03 ... 6.048e+03
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
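
A monthly median says nothing about how many scenes went into each month, so it is worth counting them alongside. The following is a minimal sketch on synthetic data: the array `red_demo`, its dates, and its values are invented for illustration. It uses the month-start alias `"MS"`, an assumption made here because that alias is accepted by both older and recent pandas versions, so the bin labels are the first day of each month rather than the last.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical series: one observation every 5 days, 6 per month for March to May 2021
time = pd.date_range("2021-03-02", periods=18, freq="5D")
data = np.arange(18.0)[:, None, None] * np.ones((1, 1, 2))
data[0, 0, 0] = np.nan  # pretend the first scene is cloud-masked at pixel x=0

red_demo = xr.DataArray(
    data, dims=("time", "y", "x"), coords={"time": time}, name="red"
)

# Monthly median per pixel (NaN values are skipped for float data)
monthly_median = red_demo.resample(time="MS").median()

# Number of valid (non-NaN) observations per pixel and month
monthly_count = red_demo.resample(time="MS").count()

print(monthly_median.isel(y=0).values)
print(monthly_count.isel(y=0).values)
```

Pixels with very few valid observations in a month produce a median that is little more than a single scene, so the count is a useful quality layer to keep next to the composite.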


 - **groupby() method**

The **`groupby()` method** can also be used within the `xarray` library to *aggregate data over time*. Time aggregation arguments can be e.g. "time.year", "time.season", "time.month", "time.week", "time.day".
The code below groups the `ds_adv` dataset by calendar year, which creates a new dimension `year` with one entry per year present in the data (only 2021 in the output shown below). Then the median for each `year` is calculated. _(this process takes a little while to run)_

In [None]:
print(ds_adv.groupby("time.year").median(dim="time"))

<xarray.Dataset>
Dimensions:      (y: 905, x: 977, year: 1)
Coordinates:
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
  * year         (year) int64 2021
Data variables:
    blue         (year, y, x) float64 4.183e+03 3.942e+03 ... 3.946e+03
    green        (year, y, x) float64 4.252e+03 4.079e+03 ... 3.842e+03
    red          (year, y, x) float64 4.073e+03 3.961e+03 ... 3.747e+03
    nir          (year, y, x) float64 4.974e+03 5.038e+03 ... 5.954e+03
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
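
To make the grouping visible, here is a small sketch with a synthetic series that spans two calendar years. The name `red_demo`, the dates, and the values are illustrative assumptions, not course data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly series from November 2020 to April 2021 with values 0..5
time = pd.date_range("2020-11-01", periods=6, freq="MS")
red_demo = xr.DataArray(
    np.arange(6, dtype="float64"), dims="time", coords={"time": time}, name="red"
)

# One group per calendar year: 2020 holds two values, 2021 holds four
by_year = red_demo.groupby("time.year").median(dim="time")

# One group per meteorological season (DJF, MAM, JJA, SON) present in the data
by_season = red_demo.groupby("time.season").mean(dim="time")

print(by_year.year.values, by_year.values)
print(by_season.season.values, by_season.values)
```

Unlike `resample()`, which keeps bins in chronological order, `groupby("time.month")` or `groupby("time.season")` pools the same month or season from every year into a single group. That is the right tool for climatologies but not for a time series of composites.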


### 4) Interpolation
Interpolation is a common way to deal with missing remote sensing data, whether caused by the coarse temporal resolution of the satellite, high cloud cover, or poor scene quality. For example, a scene for a specific date may not be available in the dataset. With the `interp()` method, it is possible to **interpolate data** to predefined time steps. It takes the nearest available scenes before and after the specified date and interpolates between their values (the default interpolation method is "linear") to build a new `xarray.Dataset`.

In this example, the `ds_adv` dataset has no scene for 2021-06-10. The `interp()` function builds a "new" scene for that date by linear interpolation between the observations before and after it.

In [None]:
print(ds_adv.time)

<xarray.DataArray 'time' (time: 42)>
array(['2021-03-02T10:18:39.025000000', '2021-03-05T10:28:09.024000000',
       '2021-03-07T10:20:21.024000000', '2021-03-10T10:30:21.024000000',
       '2021-03-12T10:17:29.024000000', '2021-03-15T10:27:09.024000000',
       '2021-03-17T10:20:21.024000000', '2021-03-20T10:30:21.024000000',
       '2021-03-22T10:16:49.024000000', '2021-03-25T10:26:39.024000000',
       '2021-03-27T10:20:21.024000000', '2021-03-30T10:30:21.024000000',
       '2021-04-01T10:15:59.024000000', '2021-04-04T10:25:59.024000000',
       '2021-04-06T10:20:21.025000000', '2021-04-09T10:30:21.024000000',
       '2021-04-11T10:15:59.024000000', '2021-04-14T10:25:59.024000000',
       '2021-04-16T10:20:21.024000000', '2021-04-19T10:30:21.024000000',
       '2021-04-21T10:15:49.024000000', '2021-04-24T10:25:49.024000000',
       '2021-04-26T10:20:21.024000000', '2021-04-29T10:30:21.024000000',
       '2021-05-01T10:15:59.024000000', '2021-05-04T10:25:59.025000000',
       '2021-0

In [None]:
ds_interp = ds_adv.interp(time=["2021-06-10"])
print(ds_interp)

<xarray.Dataset>
Dimensions:      (y: 905, x: 977, time: 1)
Coordinates:
  * y            (y) float64 1.558e+07 1.558e+07 ... 1.557e+07 1.557e+07
  * x            (x) float64 -3.002e+05 -3.002e+05 ... -2.905e+05 -2.904e+05
    spatial_ref  int32 32734
  * time         (time) datetime64[ns] 2021-06-10
Data variables:
    blue         (time, y, x) float64 4.545e+03 4.421e+03 ... 2.347e+03
    green        (time, y, x) float64 4.611e+03 4.519e+03 ... 2.374e+03
    red          (time, y, x) float64 4.595e+03 4.406e+03 ... 2.222e+03
    nir          (time, y, x) float64 5.99e+03 6.351e+03 ... 3.856e+03 4.634e+03
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
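
One way to validate an interpolated scene is to recompute the linear interpolation by hand from the two neighbouring observations. The sketch below does this on synthetic data; `red_demo`, the dates, and the values are assumptions chosen so the result is easy to check. It deliberately avoids calling `interp()` itself, which typically requires SciPy to be installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical series: two observations that bracket the target date
time = pd.to_datetime(["2021-06-05", "2021-06-15"])
red_demo = xr.DataArray(
    np.array([[[100.0, 200.0]], [[300.0, 600.0]]]),
    dims=("time", "y", "x"),
    coords={"time": time},
    name="red",
)
target = np.datetime64("2021-06-10")

# Nearest observations before and after the target date
before = red_demo.sel(time=target, method="ffill")
after = red_demo.sel(time=target, method="bfill")

# Fractional position of the target between the neighbours (0 = before, 1 = after)
weight = (target - before.time.values) / (after.time.values - before.time.values)

# Linear interpolation; the scalar time coordinates are dropped to avoid a conflict
b = before.drop_vars("time")
a = after.drop_vars("time")
expected = b + weight * (a - b)

print(float(weight))
print(expected.values)
```

On real data, comparing `expected` with the output of `interp()` for the same date (for example with `np.allclose`) is a quick sanity check. Keep in mind that if either neighbouring scene is cloudy, the interpolated scene inherits that contamination.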


The `merge()` function allows us to **merge/join** `xarray.Datasets` or variables. By default, the `merge()` function uses an "outer" join, so the result contains the union of the coordinate values of both inputs.
In our example, the interpolated `xarray.Dataset` created above is merged into the `ds_adv` dataset using the `merge()` function, which adds the interpolated date as one additional time step.

In [None]:
print(ds_adv.merge(ds_interp).time)

<xarray.DataArray 'time' (time: 43)>
array(['2021-03-02T10:18:39.025000000', '2021-03-05T10:28:09.024000000',
       '2021-03-07T10:20:21.024000000', '2021-03-10T10:30:21.024000000',
       '2021-03-12T10:17:29.024000000', '2021-03-15T10:27:09.024000000',
       '2021-03-17T10:20:21.024000000', '2021-03-20T10:30:21.024000000',
       '2021-03-22T10:16:49.024000000', '2021-03-25T10:26:39.024000000',
       '2021-03-27T10:20:21.024000000', '2021-03-30T10:30:21.024000000',
       '2021-04-01T10:15:59.024000000', '2021-04-04T10:25:59.024000000',
       '2021-04-06T10:20:21.025000000', '2021-04-09T10:30:21.024000000',
       '2021-04-11T10:15:59.024000000', '2021-04-14T10:25:59.024000000',
       '2021-04-16T10:20:21.024000000', '2021-04-19T10:30:21.024000000',
       '2021-04-21T10:15:49.024000000', '2021-04-24T10:25:49.024000000',
       '2021-04-26T10:20:21.024000000', '2021-04-29T10:30:21.024000000',
       '2021-05-01T10:15:59.024000000', '2021-05-04T10:25:59.025000000',
       '2021-0
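
The effect of the join type can be sketched with two small synthetic datasets. The names `ds_obs` and `ds_new` and all values are invented for illustration. The `join` and `compat` arguments are passed explicitly so the behaviour does not depend on library defaults.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical observed dataset with two time steps
ds_obs = xr.Dataset(
    {"red": (("time", "x"), np.array([[100.0, 200.0], [300.0, 600.0]]))},
    coords={"time": pd.to_datetime(["2021-06-05", "2021-06-15"]), "x": [0, 1]},
)

# Hypothetical interpolated scene for a date that is missing from ds_obs
ds_new = xr.Dataset(
    {"red": (("time", "x"), np.array([[200.0, 400.0]]))},
    coords={"time": pd.to_datetime(["2021-06-10"]), "x": [0, 1]},
)

# Outer join: union of the time steps from both datasets
merged_outer = ds_obs.merge(ds_new, compat="no_conflicts", join="outer")

# Inner join: only time steps present in both datasets (none in this case)
merged_inner = ds_obs.merge(ds_new, compat="no_conflicts", join="inner")

print(merged_outer.time.values)
print(merged_outer.sizes["time"], merged_inner.sizes["time"])
```

With an inner join the interpolated scene would be lost, because its date does not occur in the original dataset. The outer join is what inserts it into the time series.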

The `xarray` package contains a variety of other useful functions besides those shown here. For more information about the `xarray` package, visit the [documentation website](http://xarray.pydata.org/en/stable/).

### Checkpoint (Part 3)

1. **Resample**: produce a monthly median product and inspect how many observations contributed per month.
2. **Conditionals**: mask values based on a threshold (e.g., band or index) and quantify how many pixels remain.
3. **Interpolation**: interpolate to a new date and *validate* the result (compare neighbors / sanity check).

If this feels easy: implement a tiny helper function that takes a dataset and returns a clean, resampled, analysis-ready version.


***

## Additional information

<font size="2">This notebook is provided for teaching by the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/). It has been updated to use Planetary Computer STAC + `odc-stac`. </font>

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

**Data access:** Sentinel-2 L2A pixels are loaded from the Microsoft Planetary Computer via STAC using `odc-stac`.

**Data license:** See the dataset/collection metadata on Planetary Computer for license and attribution details.
