# Mosaics

Even though the processing algorithms for optical data have been designed with great care, the images can still contain imperfections, such as unmasked clouds or shadows. And even when these are masked, we can be left with images that have massive gaps! A common solution to both the gaps and the imperfections is the creation of multi-temporal mosaics. This term covers various techniques designed to “patch” areas covered by clouds or shadows whilst trying to avoid including abnormal values (called "artifacts"). In simple terms, it means using several observations so that the “holes” in one image are filled by another. A common technique to create annual mosaics is to calculate annual quantiles. This statistic is chosen because it depends only on the order of the values, so artifacts end up in a very low quantile (e.g., 0 or 0.05 for shadows) or a very high one (e.g., 0.95 or 1.0 for clouds). In contrast, statistics like the mean can be tainted by outliers, because they are affected by all the observations. Mosaics are [Level-3 products](https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels), that is, periodic (statistical) summaries.
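To make the contrast between quantiles and the mean concrete, here is a minimal sketch using NumPy. The reflectance values are invented for illustration (they do not come from any HLS image): eight clear observations of one pixel, one unmasked shadow and two unmasked clouds.

```python
import numpy as np

# Hypothetical reflectance time series for a single pixel (illustrative values)
clear = np.array([0.10, 0.11, 0.12, 0.10, 0.11, 0.12, 0.11, 0.10])
shadow = np.array([0.01])        # unmasked shadow: abnormally dark
cloud = np.array([0.85, 0.90])   # unmasked clouds: abnormally bright
series = np.concatenate([clear, shadow, cloud])

# The mean uses every value, so the artifacts pull it upwards
mean_value = series.mean()

# Quantiles depend only on the order of the values
median_value = np.quantile(series, 0.5)
lowest, highest = np.quantile(series, [0.0, 1.0])

print(f"mean={mean_value:.3f} median={median_value:.3f}")
print(f"q0.0={lowest:.2f} (shadow) q1.0={highest:.2f} (cloud)")
```

In this toy series the median stays at a clear-sky value (0.11), whereas the mean is pulled up to roughly 0.24 by the two cloudy observations. The artifacts themselves land in the extreme quantiles (0.0 and 1.0).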

## Imports

In [None]:
from pathlib import Path

import numpy as np
import rioxarray as rxr
import xarray as xr

from envrs.hls_tools import preprocess_bands, preprocess_fmask

## Set the directories

In [None]:
PARENT_DIR = Path(r"~/Downloads/hls").expanduser()

IN_DIR = PARENT_DIR / "automated"

IN_DIR.mkdir(exist_ok=True, parents=True)

## List the available files

First we look for the files containing the bands.

In [None]:
# We are looking for products containing...
product, tile, res, version = "HLS", "T36KXE", "30", "v2.0"

# And they also have to end with "bands.tif"
band_paths = sorted(IN_DIR.glob(f"{product}_{tile}_*{res}_{version}_bands.tif"))

Then we take only the matching `fmask` files (remember, the bands were not downloaded if the image was too cloudy).

In [None]:
fmask_paths = [p.parent / p.name.replace("_bands", "_fmask") for p in band_paths]
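Because the `fmask` paths are derived from the band file names rather than found on disk, it can be worth checking that each one actually exists before opening them. The sketch below does this in a temporary directory. The file names and the helper `fmask_for` are hypothetical; they only mimic the naming pattern used in the glob above.

```python
import tempfile
from pathlib import Path


def fmask_for(band_path: Path) -> Path:
    """Return the fmask path that sits next to a '*_bands.tif' file."""
    return band_path.with_name(band_path.name.replace("_bands", "_fmask"))


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Hypothetical file stems following the notebook's naming pattern
    stems = ["HLS_T36KXE_20230101_b30_v2.0", "HLS_T36KXE_20230106_b30_v2.0"]
    for stem in stems:
        (tmp_dir / f"{stem}_bands.tif").touch()

    # Only the first acquisition gets an fmask file in this demo
    (tmp_dir / f"{stems[0]}_fmask.tif").touch()

    demo_band_paths = sorted(tmp_dir.glob("HLS_T36KXE_*30_v2.0_bands.tif"))

    # Collect the names of the derived fmask files that are not on disk
    missing = [
        fmask_for(p).name for p in demo_band_paths if not fmask_for(p).exists()
    ]

print(missing)
```

An empty `missing` list means every band file has its companion `fmask`.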

## Stack the individual files into a data cube

The [`Xarray`](https://tutorial.xarray.dev/intro.html) library allows us to open many files together as a single [datacube](https://en.wikipedia.org/wiki/Data_cube), which can be very convenient for handling large volumes of data. To make our mosaics we will have to read not only the bands but also the `fmask` files, as we need to exclude the cloudy/shaded observations.

We will use [`xarray.open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html) to read the files and concatenate them along the `time` dimension. The `preprocess` argument allows us to transform each input file as needed, so that the datacube creation is as smooth as possible. The tool is run twice, once for the `fmask` files and once for the bands, because their preprocessing is different.

### Preprocess the Fmask

In [None]:
fmask = xr.open_mfdataset(
    fmask_paths,
    chunks="auto",
    concat_dim="time",
    combine="nested",
    preprocess=preprocess_fmask,
    mask_and_scale=False,
    engine="rasterio",
    parallel=True,
)

### Preprocess the bands

In [None]:
bands = xr.open_mfdataset(
    band_paths,
    chunks="auto",
    concat_dim="time",
    combine="nested",
    preprocess=preprocess_bands,
    engine="rasterio",
    parallel=True,
    band_as_variable=True,
)

## Make a raster definition

Our mosaics will need to "inherit" the same CRS and geotransform as the original files, and an `encoding` that ensures the image file volume is kept to a minimum. We will get these by using the first file as an example, and place this information in a dictionary to use it when writing.

In [None]:
# Open the first band file to use it as a template
example_bands = rxr.open_rasterio(band_paths[0])

crs = example_bands.rio.crs
transform = example_bands.rio.transform()

example_attrs = example_bands.attrs
example_encoding = example_bands.encoding

out_attrs = {}
out_encoding = {
    "dtype": example_encoding["rasterio_dtype"],
    "add_offset": example_attrs["add_offset"],
    "scale_factor": example_attrs["scale_factor"],
    "_FillValue": example_attrs["_FillValue"],
    "zlib": True,
}
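The `scale_factor`, `add_offset` and `_FillValue` entries follow the usual packing convention: floating-point values are stored as small integers, and missing values are replaced by the fill value. The NumPy sketch below illustrates the idea. The numbers used (`0.0001`, `0.0`, `-9999`, `int16`) are assumptions made for the example; the notebook reads the real ones from the first file.

```python
import numpy as np

# Assumed encoding values, for illustration only
scale_factor, add_offset, fill_value = 0.0001, 0.0, -9999

# Hypothetical reflectances; NaN marks a missing (masked) observation
reflectance = np.array([0.1234, 0.0567, np.nan])

# Pack: (value - offset) / scale, rounded to an integer; NaN -> fill value
packed = np.where(
    np.isnan(reflectance),
    fill_value,
    np.round((reflectance - add_offset) / scale_factor),
).astype("int16")

# Unpack: integer * scale + offset; fill value -> NaN
unpacked = np.where(
    packed == fill_value, np.nan, packed * scale_factor + add_offset
)

print(packed)
print(unpacked)
```

Storing 16-bit integers instead of 64-bit floats, combined with `zlib` compression, is what keeps the output files small.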

## Mask and make the yearly mosaics

Once we have everything ready we select the data from a `target_year`. If `any` of the `cloud_flags` in the `flag` dimension is `True`, the observation is ignored. The rest of the data is used to calculate the quantiles along the `time` dimension (from `0.0` to `1.0`, with a `0.1` step). The last steps are setting the `CRS`, the (geo)`transform`, the attributes and the encoding just before writing with `to_netcdf`.

In [None]:
# Determine which years should be processed
first_year = bands["time"].min().dt.year.item()
last_year = bands["time"].max().dt.year.item()

# What to exclude
cloud_flags = ["cloud shadow", "adjacent to cloud", "cloud", "cirrus cloud"]

# For every year
for target_year in range(first_year, last_year + 1):
    out_path = PARENT_DIR / f"{product}_{tile}_{target_year}_b{res}_{version}.nc"
    if out_path.exists():
        continue

    # define what to mask
    is_cloudy = (
        fmask["masks"]
        .sel(time=(fmask.time.dt.year == target_year), flag=cloud_flags)
        .any(dim="flag")
    )

    # Mask the bands, calculate the quantiles
    quantiles = (
        bands.sel(time=(bands.time.dt.year == target_year))
        .sortby("time")
        .where(np.logical_not(is_cloudy))
        .quantile(np.arange(0, 1.01, 0.1), dim="time", skipna=True)
        .sortby("y", ascending=False)
    )

    # Attach the CRS and the geotransform
    referenced = quantiles.rio.write_crs(crs).rio.write_transform(transform)

    # set the attributes and the encoding
    for band_name in quantiles:
        referenced[band_name].attrs.update(long_name=band_name, **out_attrs)
        referenced[band_name].encoding.update(**out_encoding)

    # Write
    referenced.to_netcdf(out_path)

    print(target_year)
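To see what the masking and quantile steps do to the data, here is a small NumPy-only sketch on a synthetic stack. It is an analogue of the `where(...)` and `quantile(..., skipna=True)` calls above, not the notebook's actual pipeline, and all values are invented.

```python
import numpy as np

# Synthetic single-band stack: 4 dates (time) x 1 row x 2 columns
values = np.array(
    [
        [[0.10, 0.30]],
        [[0.90, 0.32]],  # first pixel is cloudy (abnormally bright)
        [[0.12, 0.02]],  # second pixel is in shadow (abnormally dark)
        [[0.14, 0.34]],
    ]
)

# Boolean mask with the same shape, True where the observation is flagged
is_cloudy = np.array(
    [
        [[False, False]],
        [[True, False]],
        [[False, True]],
        [[False, False]],
    ]
)

# Flagged observations become NaN, so the quantile calculation skips them
masked = np.where(is_cloudy, np.nan, values)

# Eleven quantile levels from 0.0 to 1.0 along the time axis (axis 0)
levels = np.linspace(0, 1, 11)
quantiles = np.nanquantile(masked, levels, axis=0)

median = quantiles[5]  # level 0.5
print(quantiles.shape)
print(median)
```

The result has one layer per quantile level, and the flagged values (0.90 and 0.02) no longer appear even in the extreme quantiles, because they were removed before the statistics were computed.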