# <u>Cloud coverage and valid pixels statistics</u> <img align="right" src="../resources/csiro_easi_logo.png">

**Contents**

  - [Overview](#Overview)
  - [Notebook setup](#Notebook-setup)
    - [Load packages](#Load-packages)
    - [Dask](#Dask)
  - [Satellite data](#Satellite-data)
    - [Platforms and products](#Platforms-and-products)
    - [Datacube query](#Datacube-query)
  - [Data clean up](#Data-clean-up)
    - [Raw data display](#Raw-data-display)
    - [Valid pixel mask](#Valid-pixel-mask)
    - [Clear pixel mask](#Clear-pixel-mask)
  - [Statistics](#Statistics)
    - [Tabular results for each date](#Tabular-results-for-each-date)
    - [Time series plot](#Time-series-plot)
    - [Spatial plot of clear pixel percentage](#Spatial-plot-of-clear-pixel-percentage)


# Overview

This notebook explores a Landsat 8 dataset over the region of interest defined in the [Datacube query](#Datacube-query) section, and generates statistics on the percentage of valid and/or cloud-free pixels over the selected time window. These statistics can inform subsequent analyses applied to the dataset. For example, extensive cloud cover in a given season may significantly affect a resulting mosaic product or a derived dataset of index values. Alternatively, a user may want to find a single date with few clouds in order to assess land features.

This notebook considers two types of "valid pixel" statistics. On one hand, we define _valid pixels_ as those pixels that have valid data associated with them &ndash; these valid observations might include clear land and/or water, as well as clouds, cloud shadows, saturated pixels, etc. It may be that part of an image in a time slice over a given region of interest has not been imaged by the sensor, in which case the corresponding pixels are _not_ included in the set of _valid pixels_.

On the other hand, of those pixels providing valid observations, some may be affected by acquisition quality issues such as clouds, saturation, etc. and thus do not provide clear observations of the land and/or water over the region of interest. The pixels _not_ affected by these types of quality issues are here labelled as _clear pixels_.
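
As a minimal sketch of these two definitions, the toy example below uses a small NumPy array with made-up class codes (0 for not imaged, 1 for clear, 2 for cloud). These codes are assumptions for illustration only and do not correspond to the quality flags of any real product.

```python
import numpy as np

# Toy 3x3 scene with hypothetical class codes:
# 0 = not imaged (no data), 1 = clear observation, 2 = cloud
scene = np.array([[0, 1, 1],
                  [0, 2, 1],
                  [2, 2, 1]])

valid = scene != 0   # valid pixels: imaged by the sensor, cloudy or not
clear = scene == 1   # clear pixels: valid and free of quality issues

valid_fraction = valid.mean()   # 7 of 9 pixels
clear_fraction = clear.mean()   # 4 of 9 pixels
print(f"valid: {valid_fraction:.1%}, clear: {clear_fraction:.1%}")
```

In this sketch every clear pixel is also a valid pixel, which mirrors the relationship between the two masks built later in the notebook.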

This notebook is adapted from a [CEOS Open Data Cube Notebook](https://github.com/ceos-seo/data_cube_notebooks/blob/master/notebooks/DCAL/DCAL_Cloud_Statistics.ipynb) example.

**A note regarding compatibility**

As stated above, this notebook makes use of a specific dataset of remote sensing data, and is applied over a given region of interest. The code below will thus not run properly if the EASI deployment used to run this notebook does not contain a similar dataset. Users can nonetheless investigate the outputs provided in this demonstration notebook, and can modify certain variables in the code below (e.g. the remote sensing product, the region of interest, etc.) to allow the notebook to run with a different EASI deployment.

# Notebook setup

## Load packages

In the next cell, we load the key Python packages and supporting functions required for the subsequent analysis, and then connect to the EASI data cube.

In [None]:
### System
import sys, os
import re

### Datacube 
import datacube
from datacube.utils import masking  # https://github.com/opendatacube/datacube-core/blob/develop/datacube/utils/masking.py
from odc.algo import enum_to_bool   # https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py

### Data tools
import numpy as np
import pandas as pd

### Plotting
%matplotlib inline
import matplotlib.pyplot as plt

### EASI tools
sys.path.append(os.path.expanduser('../scripts'))
os.environ['USE_PYGEOS'] = '0'
import notebook_utils
from app_utils import display_map
from easi_tools import EasiDefaults
easi = EasiDefaults()

In [None]:
dc = datacube.Datacube(app="cloud_stats")

## Dask
We will use a local Dask cluster for this notebook.

In [None]:
cluster, client = notebook_utils.initialize_dask(use_gateway=False)
display(client)
display(cluster)

The `Dashboard` link provided above (`Client` section) can be used to monitor the status of the Dask cluster and associated processing tasks during the execution of various cells in the rest of this notebook.

#### AWS configuration
We will be using data in public requester-pays buckets, so the following configuration is required:

In [None]:
# configure_s3_access() obtains credentials for S3 access and passes them on to
# processing threads, either local or on the Dask cluster.
# Note that AWS credentials may need to be renewed between sessions or
# after a period of time.

from datacube.utils.aws import configure_s3_access
configure_s3_access(aws_unsigned=False, requester_pays=True, client=client)

# If not using a dask cluster then remove 'client':
# configure_s3_access(aws_unsigned=False, requester_pays=True)

# Satellite data

In this section, we select and load up the relevant dataset of Landsat-8 data for subsequent analyses in this notebook.

## Platforms and products

Users can refer to the [ODC Explorer](https://explorer.csiro.easi-eo.solutions/products/ga_ls8c_ard_3/extents) in order to investigate the various datasets on the current EASI deployment, as well as their respective temporal and spatial extents. From within the notebook, we can also load a list of all available products using the `dc.list_products` function, to the same effect.

In [None]:
products_info = dc.list_products()
products_info[:10]   # first 10 rows

We can subsequently filter the results to only display the products related to a given satellite sensor. Here, we are specifically interested in products related to Landsat-8.

In [None]:
# The output of list_products() changed between datacube versions 1.8.4 and 1.8.6
# See https://github.com/opendatacube/datacube-core/blob/develop/datacube/api/core.py#L93
# "platform" is not included. Would need to write a custom "list_products".
# Use the product name instead.

### Available Landsat-8 products 
pattern = re.compile('landsat8')
selected_products = [p for p in products_info.index if pattern.search(p)]
products_info.loc[selected_products]
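
The name-based filtering used above can be tried out without a data cube connection. The sketch below builds a small stand-in for the `list_products()` table; apart from `landsat8_c2l2_sr`, the product names and all descriptions are hypothetical and only serve to illustrate the pattern matching.

```python
import re
import pandas as pd

# Stand-in for the table returned by dc.list_products(), indexed by product name.
# Only 'landsat8_c2l2_sr' appears in this notebook; the other rows are made up.
products = pd.DataFrame(
    {"description": ["Landsat 8 surface reflectance",
                     "Landsat 8 surface temperature",
                     "Sentinel-2 surface reflectance"]},
    index=pd.Index(["landsat8_c2l2_sr", "landsat8_c2l2_st", "s2_l2a"], name="name"),
)

# Keep the rows whose product name matches the pattern
pattern = re.compile("landsat8")
selected = [name for name in products.index if pattern.search(name)]
subset = products.loc[selected]

# Equivalent selection using pandas string methods on the index
subset_alt = products[products.index.str.contains("landsat8")]
print(subset)
```

Both selections return the same rows; the list comprehension is closer to the cell above, while the `str.contains` form is more compact.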

We can further list the various spectral (and ancillary) measurements for a specific product, as follows.

In [None]:
product = 'landsat8_c2l2_sr'
product_measurements = dc.list_measurements().loc[product]
product_measurements

Further below, we will mask the data to identify the pixels containing valid information (i.e. not `no-data`), as well as those pixels not affected by clouds, cloud shadow, etc. (i.e. clear pixels). Information regarding the quality of each pixel is provided in the `qa_pixel` data layer, which can be investigated in more detail as follows.

In [None]:
QA_layer = 'qa_pixel'
product_measurements.loc[QA_layer]['flags_definition']
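
Pixel quality layers of this kind are typically bit-packed: each flag occupies one or more bits of an integer value. The sketch below shows how such flags can be decoded with NumPy. The bit positions used here (0 for fill, 3 for cloud, 4 for cloud shadow) are assumptions chosen for illustration and should be checked against the `flags_definition` output above before being relied upon.

```python
import numpy as np

# Assumed bit positions, for illustration only (verify against flags_definition)
FILL_BIT, CLOUD_BIT, SHADOW_BIT = 0, 3, 4

def bit_is_set(qa, bit):
    """Return a boolean array that is True where the given bit is set."""
    return ((qa >> bit) & 1).astype(bool)

# Four toy QA values: fill only, cloud only, cloud shadow only, no flag set
qa = np.array([0b00001, 0b01000, 0b10000, 0b00000], dtype=np.uint16)

fill = bit_is_set(qa, FILL_BIT)
cloud = bit_is_set(qa, CLOUD_BIT)
shadow = bit_is_set(qa, SHADOW_BIT)

# A pixel is treated as clear when none of the three flags is set
clear = ~fill & ~cloud & ~shadow
print(clear)
```

In the notebook itself this decoding is delegated to `masking.make_mask`, which looks up the bit positions from the product's flag definitions.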

## Datacube query

We can now create a query dictionary that will be used to load the desired dataset from the EASI data cube. Here we load the data in its native projection, which can be determined using the `mostcommon_crs` function. In addition to the pixel quality band (`qa_pixel`), we select the red, green and blue Landsat-8 bands so that the data can be plotted in a true-colour display later on.

In [None]:
query = { 'product': product,
          'lat': (37.75, 37.85),
          'lon': (-122.45, -122.55),
          'measurements': ['red', 'green', 'blue', 'qa_pixel'],
          'time': ('2020-01-01', '2020-07-01'),
          'group_by': 'solar_day' } # group acquisitions from the same solar day into one time slice

### Because we aren't using the default configuration, we need to discover the CRS based on the most common CRS in the data that we are loading and then add that configuration to the query.
native_crs = notebook_utils.mostcommon_crs(dc, query)
print(f"The dataset's native CRS is: {native_crs}")

query.update({ 'output_crs': native_crs,   # EPSG code
               'resolution': (-30, 30),     # target resolution
               'dask_chunks': {'x': 512, 'y': 512} })   # Dask processing
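
The idea behind selecting a "most common" CRS can be sketched in a few lines. This is not the implementation of `notebook_utils.mostcommon_crs`, which is not shown in this notebook; the helper name `most_common_value` and the EPSG codes below are hypothetical and only illustrate the counting step.

```python
from collections import Counter

def most_common_value(values):
    """Return the value that occurs most often in an iterable."""
    counts = Counter(values)
    return counts.most_common(1)[0][0]

# Hypothetical CRS strings, one per dataset matched by a query
dataset_crs = ["EPSG:32610", "EPSG:32610", "EPSG:32611"]

native = most_common_value(dataset_crs)
print(f"Most common CRS: {native}")
```

Choosing the most frequent CRS keeps most of the loaded scenes on their original pixel grid, so fewer scenes need to be reprojected.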

For insight, the selected spatial region can be visualised using the `display_map` utility function.

In [None]:
display_map(x=query['lon'], y=query['lat'])

And we can now load up the data, using the `query` parameters defined above.

In [None]:
data = dc.load(**query)
data

# Data clean up

Typically, a first step to perform after loading a dataset of satellite imagery is to clean up the data. In particular, this involves the removal (masking out) of pixels affected by clouds, cloud shadows, etc., as well as those pixels identified as having no valid data in specific bands and/or time slices.

## Raw data display

We start by creating a true-colour display of some time slices (here, the first six time indices) from the unprocessed dataset. This will allow us to better understand the various masks that will be produced in order to clean up the data.

In [None]:
### True-colour display
# vmin and vmax are deliberately used here instead of robust=True, because the clouds make the images very dark
image_array = data[['red', 'green', 'blue']].isel(time=slice(0, 6)).to_array()  # first six dates (or fewer if not available)
image_array.plot.imshow(vmin=7000, vmax=16000, col='time', col_wrap=3)

In these plots, we can clearly identify pixels with no valid data (shown in black, typically located outside a scene of satellite imagery in some time slices), as well as valid pixels affected by clouds (bright pixels). The rest are mostly clear pixels, i.e. pixels with valid data that are not affected by quality issues such as clouds, shadows, etc.

## Valid pixel mask

In the [04 - Masking Data](../EASI%20Training/04%20-%20Masking%20Data.ipynb) EASI Training notebook, examples are provided for masking clouds from data. The section below demonstrates how to work directly with cloud masks rather than using them to mask other bands.

First, we can create a mask layer to identify the `nodata` pixels, i.e. invalid pixels. This mask can be used to keep only those pixels corresponding to valid data, which will here correspond to clear observations, as well as clouds, shadows, etc.

In [None]:
### Valid (not 'no-data') masks, for all bands
data_valid_mask = masking.valid_data_mask(data)
data_valid_mask = data_valid_mask.persist()

data_valid_mask

We can see that `data_valid_mask` has a separate layer for each of the satellite bands: this is because different measurements (i.e. bands, such as `red`, `blue`, etc.) each have different pixels flagged as `nodata`, perhaps as a result of differing data acquisition or pre-processing procedures. 

> Note that the `masking.mask_invalid_data()` function (e.g. as in [04 - Masking Data](../EASI%20Training/04%20-%20Masking%20Data.ipynb)) automatically removes `nodata` values from each measurement. The function `valid_data_mask()` is used to generate a mask from these `nodata` values.
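
The difference between the two operations can be sketched with plain NumPy. The example below is a conceptual illustration only, not the datacube implementation, and it assumes a `nodata` value of 0 for a toy band.

```python
import numpy as np

NODATA = 0  # assumed nodata value for this toy band

red = np.array([[0, 8200, 9100],
                [0,    0, 15000]], dtype=np.uint16)

# Analogous to a per-band valid-data mask: True where the pixel holds data
valid = red != NODATA

# Analogous to removing invalid data: nodata values are replaced with NaN,
# which requires converting the band to a floating point type
masked = np.where(valid, red.astype("float32"), np.nan)

print(valid)
print(masked)
```

Working with the boolean mask keeps the original integer data untouched, which is convenient when the mask itself is the quantity of interest, as in this notebook.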

---
In this next step, the measurement-specific masks (one for each of the four loaded measurements) are combined into a single mask. This results in a mask of the pixels that have **valid data in all bands** in our dataset.

In [None]:
valid_mask = data_valid_mask['red'] & data_valid_mask['green'] & data_valid_mask['blue'] & data_valid_mask['qa_pixel']
valid_mask = valid_mask.persist()
valid_mask
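
When more measurements are loaded, chaining `&` by hand becomes tedious. A possible alternative, sketched below on toy NumPy arrays, is to fold the per-band masks with `functools.reduce`. The band names and values here are made up for illustration.

```python
from functools import reduce
import numpy as np

# Toy per-band validity masks for four pixels (values are made up)
band_masks = {
    "red":   np.array([True, True, False, True]),
    "green": np.array([True, False, False, True]),
    "qa":    np.array([True, True, True, True]),
}

# A pixel is valid overall only if it is valid in every band
combined = reduce(np.logical_and, band_masks.values())
print(combined)
```

The same pattern should carry over to the per-measurement layers of an xarray dataset, although that has not been verified here.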

---
We can now plot the first six masks in the `valid_mask` time series, corresponding to the same dates as the raw data plots displayed above.

In [None]:
valid_mask[:6].plot(col='time', col_wrap=3) # Note that these are just the first 6 dates

As can be seen here, the `valid_mask` layers clearly identify those pixels in the dataset containing valid data (yellow pixels with a value of `1`), as opposed to those with missing data (purple pixels with a value of `0`).
The missing data combines areas outside the scene footprint with bad data within each scene.

## Clear pixel mask

Based on the pixel quality flag (`qa_pixel` in the current dataset), we can also derive a mask of clear observations for each time slice, identifying those pixels not affected by clouds, cloud shadows, etc. Because such phenomena simultaneously affect all bands in the dataset, there will only be one such mask per time slice (for all measurements).

In [None]:
### L2_FLAGS mask
clear_pixel_flags = {
    'nodata': False,
    'cloud': 'not_high_confidence',
    'cloud_shadow': 'not_high_confidence'
}

clear_mask = masking.make_mask(data[QA_layer], **clear_pixel_flags)
clear_mask.name = "clear_mask"
clear_mask = clear_mask.persist()

clear_mask

---
Once again, let's plot the first six masks in the `clear_mask` time series, and compare them with the previous plots.

In [None]:
clear_mask[:6].plot(col='time', col_wrap=3) # Note that these are just the first 6 dates

As can be seen above, the pixels identified as providing clear observations (as per the `clear_mask` layers) are also only those which have valid data (as with `valid_mask`). Other valid pixels affected by pixel quality issues (here, mainly clouds) have also been removed from the clear pixel mask (`clear_mask`).

# Statistics

Based on these datasets derived from the raw Landsat-8 imagery, we can now calculate some statistics of interest that can be potentially used to inform further analyses of the data.

## Tabular results for each date

For each time slice in the time series, we can calculate the percentage of valid pixels, i.e. the number of valid pixels (all values which are not `nodata` values) divided by the total number of pixels in the image. Because each valid pixel is represented by `True` in the boolean `valid_mask` object (and `False` otherwise), and these values count as `1` and `0` respectively, we can simply use the `mean` function to implement this operation. The same can be done for the percentage of clear pixels (i.e. not affected by clouds, shadows, nodata, etc.).

Note that these values represent the percentage of clear (respectively valid) pixels as a proportion of the _total number of pixels over the whole imaging region_, as opposed to the clear percentage representing the fraction of clear pixels _as a percentage of valid pixels_. 

Because clear pixels are a subset of valid pixels, the clear pixel percentage (for each image, as well as overall) can never exceed the valid pixel percentage.

In [None]:
### Valid pixel percentage in each time slice
count_valid = valid_mask.sum(dim=['x','y']).values
percentage_valid = np.round( valid_mask.mean(dim=['x','y']).values*100.0, 2)

### Clear pixel percentage in each time slice
count_clear = clear_mask.sum(dim=['x','y']).values
percentage_clear = np.round( clear_mask.mean(dim=['x','y']).values*100.0, 2)

### Tabular results
tmp = {"date": data.time.values,
       "valid_count": count_valid, 
       "valid_percentage": percentage_valid,
       "clear_count": count_clear, 
       "clear_percentage": percentage_clear }
valid_DF = pd.DataFrame(data=tmp)
valid_DF
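
The two possible denominators discussed above can be compared on a toy example. The sketch below uses small made-up boolean arrays with dimensions `(time, y, x)` and plain NumPy instead of the notebook's xarray objects.

```python
import numpy as np

# Two time slices of 2x2 pixels; clear pixels are a subset of valid pixels
valid = np.array([[[1, 1], [1, 0]],
                  [[0, 0], [1, 1]]], dtype=bool)
clear = np.array([[[1, 0], [1, 0]],
                  [[0, 0], [0, 1]]], dtype=bool)

# Percentages relative to ALL pixels in each time slice (as in the table above)
pct_valid = 100.0 * valid.mean(axis=(1, 2))
pct_clear = 100.0 * clear.mean(axis=(1, 2))

# Clear percentage relative to the VALID pixels only
pct_clear_of_valid = 100.0 * clear.sum(axis=(1, 2)) / valid.sum(axis=(1, 2))

print(pct_valid, pct_clear, pct_clear_of_valid)
```

The mean of a boolean array equals the fraction of `True` values, which is why `mean` can stand in for an explicit count divided by the number of pixels.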

## Time series plot

The table above contains the numeric results related to the metrics of interest (clear and valid pixel percentages). We can also show these same results in a time series plot, where the percentages are displayed in a bar plot with different colours.

In [None]:
plt.figure(figsize=(15,5))
plt.bar(valid_DF["date"].values, valid_DF["valid_percentage"].values, 1.0)
plt.bar(valid_DF["date"].values, valid_DF["clear_percentage"].values, 1.0, color='orange')
plt.title("Percentage of valid pixels (blue) and clear pixels (orange)");
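
A table of this kind can also be queried directly, for instance to find the clearest date, as mentioned in the overview. The sketch below uses a small made-up table with the same column names as `valid_DF`; the dates and percentages are invented for illustration, and the 50% threshold is an arbitrary choice.

```python
import pandas as pd

# Made-up statistics with the same column names as the table above
stats = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-21", "2020-02-06"]),
    "valid_percentage": [100.0, 62.5, 100.0],
    "clear_percentage": [12.3, 60.1, 87.4],
})

# Row with the highest percentage of clear pixels
best = stats.loc[stats["clear_percentage"].idxmax()]

# Dates on which at least half of the pixels are clear (arbitrary threshold)
mostly_clear = stats[stats["clear_percentage"] >= 50.0]

print(best["date"].date(), len(mostly_clear))
```

A selection like this could be used to pick candidate dates for a closer visual inspection of land features.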

## Spatial plot of clear pixel percentage

Finally, we can also plot the overall percentage of clear observations (this time, as a _fraction of valid observations_) for each pixel over the whole time series. The result is a map indicating which regions are more or less affected by data quality issues, thus providing information on whether subsequent analyses might be affected by the lack of clear observations in certain areas.

In [None]:
### Valid & clear pixel counts (temporal aggregation)
valid_sum = valid_mask.sum(dim='time')
clear_sum = clear_mask.sum(dim='time')

### Fraction of clear pixels to valid pixels
percentage_clear = 100.0 * clear_sum / valid_sum
percentage_clear = percentage_clear.where(percentage_clear<=100.0)   # remove some potential data glitches...
percentage_clear = percentage_clear.persist()

percentage_clear.plot(figsize=(12,10), cmap="Spectral", vmin=0, vmax=100)
plt.title('Percentage of clear observations (as fraction of valid pixels)');
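
Pixels with no valid observation at all lead to a division by zero in the ratio above. One way to make this case explicit, sketched below on toy NumPy arrays with dimensions `(time, y, x)`, is to divide only where the valid count is positive and leave the remaining pixels as NaN. This is an illustration of the idea, not a replacement for the cell above.

```python
import numpy as np

# Three time slices of 2x2 pixels (made-up values); pixel (1, 1) is never valid
valid = np.array([[[1, 1], [1, 0]],
                  [[1, 0], [1, 0]],
                  [[1, 1], [0, 0]]], dtype=bool)
clear = np.array([[[1, 0], [1, 0]],
                  [[0, 0], [1, 0]],
                  [[1, 1], [0, 0]]], dtype=bool)

valid_sum = valid.sum(axis=0)   # number of valid observations per pixel
clear_sum = clear.sum(axis=0)   # number of clear observations per pixel

# Divide only where there is at least one valid observation; NaN elsewhere
pct = np.full(valid_sum.shape, np.nan)
np.divide(100.0 * clear_sum, valid_sum, out=pct, where=valid_sum > 0)

print(pct)
```

Pixels left as NaN are then simply not drawn by most plotting functions, rather than being reported as 0% or as an infinite value.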

---
It is also possible to use Holoviews to produce an interactive plot of this data:

In [None]:
# Holoviews, hvPlot, Datashader and Bokeh
import holoviews as hv
import hvplot.xarray  # registers the .hvplot accessor used in the next cell
import panel as pn
import cartopy.crs as ccrs
from datashader import reductions
from holoviews import opts
# import geoviews as gv
from holoviews.operation.datashader import rasterize
hv.extension('bokeh', logo=False)

In [None]:
options = {
    'title': 'Percentage of clear observations (as fraction of valid pixels)',
    'width': 800,
    'height': 800,
    'aspect': 'equal',
    'cmap': 'Spectral',
    'clim': (0, 100),                          # Limit the colour range to 0-100 %
    'colorbar': True,
    'tools': ['hover'],
}
    
percentage_clear.hvplot.image(
    x = 'x', y = 'y',                        # Dataset x,y dimension names
    rasterize = True,                        # Use Datashader
    # aggregator = reductions.mean(),        # Datashader selects mean value
    precompute = True,                       # Datashader precomputes what it can
    crs = native_crs,                        # Dataset crs
    projection = ccrs.PlateCarree(),         # Output projection (use ccrs.PlateCarree() when coastline=True)
    coastline='10m',                         # Coastline = '10m'/'50m'/'110m'
).options(opts.Image(**options)).hist(bin_range = options['clim'])

Further overall (whole-of-time-series) statistics can also be provided as follows.

In [None]:
print("Overall percentage of valid pixels: {:.2%}".format(valid_mask.mean().values))
print("Overall percentage of clear pixels: {:.2%}".format(clear_mask.mean().values))

In [None]:
print("Number of scenes which have no clear data:", np.sum((clear_mask.mean(dim=['x','y']).values==0)))
print("Number of pixels which have no clear data:", (percentage_clear==0).sum().values)
print("Total number of pixels in each time slice:", data.sizes['x']*data.sizes['y'])

In [None]:
### End notebook