# CLOUD PROBABILITY PRODUCT

In this notebook we explore how to create an annual cloud probability product using Sentinel-2 satellite imagery data. We describe how to use various parameters and configurations to obtain raw band data and product.

To calculate the sun exposure/sunniness/daylight (?) index, we would have to use many spatio-temporal variables, including the area cloudiness, or probability of retrieveing clouds in the particular point in space and time.

Py-STAC solution is based on the [carpentries guide](https://carpentries-incubator.github.io/geospatial-python/05-access-data.html).
This Notebook has been partially based on the [Sentinel Hub API example](https://sentinelhub-py.readthedocs.io/en/latest/examples/process_request.html).

### Dependencies
We use the base [Docker image](https://hub.docker.com/r/behzad89/geo-miniconda3) to run the Notebook and [helper functions](raster_utils.py) and also install [`pystac-client`](https://pystac-client.readthedocs.io/en/stable/tutorials/authentication.html) to access data.

#### DEPRECATED - Sentinel Hub API (Process API)

~~Process API requires Sentinel Hub account. Please check [configuration instructions](https://sentinelhub-py.readthedocs.io/en/latest/configure.html) about how to set up your Sentinel Hub credentials.~~

In [None]:
"""
from sentinelhub import SHConfig
from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session
# it's recommended to install sentinelhub[AWS]: pip install sentinelhub[AWS]

config = SHConfig()

CLIENT_ID = "..."
CLIENT_SECRET = "..."
TOKEN_URL = "https://services.sentinel-hub.com/auth/realms/main/protocol/openid-connect/token"

# set up credentials
client = BackendApplicationClient(client_id=CLIENT_ID)
oauth = OAuth2Session(client=client)

# get an authentication token
token = oauth.fetch_token(
    token_url=TOKEN_URL,
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET
)
"""

"""
if not config.sh_client_id or not config.sh_client_secret:
    print("Warning! To use Process API, please provide the credentials (OAuth client ID and client secret).")

if config.sh_client_id and config.sh_client_secret:
    print("config found")

print(config)

# to check where the configuration is 
SHConfig.get_config_location()"""

# TODO - to find out how adjust config.toml file

~~The access token is obtained through Copernicus Data Ecosystem: https://documentation.dataspace.copernicus.eu/APIs/SentinelHub/Overview/Authentication.html#:~:text=Client%20in%20your-,account%20settings,-.%20This%20is%20so~~

~~Here is the instruction to register a token: https://documentation.dataspace.copernicus.eu/APIs/SentinelHub/Overview/Authentication.html~~

#### Py-stac solution

Let's configure our pystac client first, find out all available collections and stick to one of them. It has been found out that cloud probabilities are described by the Sentinel-2 Collection 1 L2A product.

Usually, the cloud probability products are available through Sentinel Hub API which is subject to significant access restrictions. Although many hubs provide free access to Sentinel-2 products, they do not include additional bands, such as a cloud probability (Copernicus Data Space Ecosystem, Microsoft Planetary). However, Element84 (Earth Search) provides access to cloud probability products to kept on AWS storage. For details, see the registry key ([here](https://registry.opendata.aws/sentinel-2-l2a-cogs/)) and STAC catalogue of the relevant Sentinel-2 collection([here](https://radiantearth.github.io/stac-browser/#/external/earth-search.aws.element84.com/v1/collections/sentinel-2-c1-l2a)).

**TODO:** back-up plan if Element84 terminate access. Any limit rate for earth84?

In [3]:
# pip install pystac-client if not installed
from pystac_client import Client

api_url = "https://earth-search.aws.element84.com/v1"
# OTHER ENDPOINTS TO GET DATA
#api_url = "https://stac.dataspace.copernicus.eu/v1/" # NOTE: this doesn't contain cloud bands in the assets
# https://hub.openeo.org/ (not production ready)
client = Client.open(api_url)

collections = client.get_collections()
for collection in collections:
    print(collection)
collection = "sentinel-2-c1-l2a" # NOTE: do not use "sentinel-2-l2a" as it doesn't contain cloud probabilities

datetime = '2023-01-01/2023-12-31'

# NOTE: Connecting to client might freeze sometimes, kernel restart would help

<CollectionClient id=sentinel-2-pre-c1-l2a>
<CollectionClient id=cop-dem-glo-30>
<CollectionClient id=naip>
<CollectionClient id=cop-dem-glo-90>
<CollectionClient id=landsat-c2-l2>
<CollectionClient id=sentinel-2-l2a>
<CollectionClient id=sentinel-2-l1c>
<CollectionClient id=sentinel-2-c1-l2a>
<CollectionClient id=sentinel-1-grd>


DEPRECATED - EARTH DAILY:

~~Let's also try Earth Daily STAC:
ACCOUNT TO BE CREATED YET (https://console.earthdaily.com/platform/signin)~~

In [4]:
"""import os
import requests

from dotenv import load_dotenv
from pystac.client import Client

load_dotenv()  # take environment variables from .env.

CLIENT_ID = os.getenv("EDS_CLIENT_ID")
CLIENT_SECRET = os.getenv("EDS_SECRET")
EDS_AUTH_URL = os.getenv("EDS_AUTH_URL")
API_URL = os.getenv("EDS_API_URL")
STAC_API_URL = f"{API_URL}/platform/v1/stac"

# Setup requests session
session = requests.Session()
session.auth = (CLIENT_ID, CLIENT_SECRET)
"""

"""
def get_new_token(session):
    '''Obtain a new authentication token using client credentials.'''
    token_req_payload = {"grant_type": "client_credentials"}
    try:
        token_response = session.post(EDS_AUTH_URL, data=token_req_payload)
        token_response.raise_for_status()
        tokens = token_response.json()
        return tokens["access_token"]
    except requests.exceptions.RequestException as e:
        print(f"Failed to obtain token: {e}")

token = get_new_token(session)

client = Client.open(STAC_API_URL, headers={"Authorization": f"bearer {token}"}) 
"""

'\ndef get_new_token(session):\n    \'\'\'Obtain a new authentication token using client credentials.\'\'\'\n    token_req_payload = {"grant_type": "client_credentials"}\n    try:\n        token_response = session.post(EDS_AUTH_URL, data=token_req_payload)\n        token_response.raise_for_status()\n        tokens = token_response.json()\n        return tokens["access_token"]\n    except requests.exceptions.RequestException as e:\n        print(f"Failed to obtain token: {e}")\n\ntoken = get_new_token(session)\n\nclient = Client.open(STAC_API_URL, headers={"Authorization": f"bearer {token}"}) \n'

### 1. CLOUD PROBABILITY ACCESS

We will download Sentinel-2 imagery of Tyne and Wear Area. Let's try with a just one 20-km tile of the area of interest.
These tiles have already been prepared.

The bounding box of this tile in `WGS84` coordinate system is `[54.933089, -1.689407, 55.114004, -1.374500]` (longitude and latitude coordinates of lower left and upper right corners).

![area_of_interest_tile](illustrations/area_of_interest_tile.png)

In [5]:
from shapely.geometry import Polygon

# define polygon vertices as (lat, lon) tuples - correct for Shapely
tile = Polygon([
    (-1.689407, 54.933089),  # lat, lon swapped
    (-1.374500, 54.933089),
    (-1.374500, 55.114004),
    (-1.689407, 55.114004),
    (-1.689407, 54.933089)
])

search = client.search(
    collections=[collection],
    intersects=tile,
    datetime=datetime
)

print(f"Number of matched scenes: {search.matched()}")

# TODO - to write a function to transform the tile extent into Shapely polygon


Number of matched scenes: 713


For a tile 20x20 km in the UK we will usually have hundreds of acquisitions per year. For example, for the sample tile we got 713 items.

In [6]:
items = search.item_collection()
print(f"Number of items: {len(items)}")
for item in items:
    print(item)

Number of items: 713
<Item id=S2A_T30UWF_20231230T112455_L2A>
<Item id=S2A_T30UXF_20231230T112455_L2A>
<Item id=S2A_T30UWG_20231230T112455_L2A>
<Item id=S2A_T30UXG_20231230T112455_L2A>
<Item id=S2B_T30UWG_20231228T113409_L2A>
<Item id=S2A_T30UWF_20231227T111453_L2A>
<Item id=S2A_T30UXF_20231227T111453_L2A>
<Item id=S2A_T30UWG_20231227T111453_L2A>
<Item id=S2A_T30UXG_20231227T111453_L2A>
<Item id=S2B_T30UWF_20231225T112407_L2A>
<Item id=S2B_T30UXF_20231225T112407_L2A>
<Item id=S2B_T30UWG_20231225T112407_L2A>
<Item id=S2B_T30UXG_20231225T112407_L2A>
<Item id=S2A_T30UWF_20231223T113502_L2A>
<Item id=S2A_T30UWG_20231223T113502_L2A>
<Item id=S2B_T30UWF_20231222T111432_L2A>
<Item id=S2B_T30UXF_20231222T111432_L2A>
<Item id=S2B_T30UWG_20231222T111432_L2A>
<Item id=S2B_T30UXG_20231222T111432_L2A>
<Item id=S2A_T30UWF_20231220T112457_L2A>
<Item id=S2A_T30UXF_20231220T112457_L2A>
<Item id=S2A_T30UWG_20231220T112457_L2A>
<Item id=S2A_T30UXG_20231220T112457_L2A>
<Item id=S2B_T30UWF_20231218T113411_

For inspection, let's check out one of the scenes:

In [7]:
index = 250
item = items[index]
try:
    print(item.datetime)
    print(item.geometry)
    print(item.properties)
    print(item.properties.get('proj:code') or item.properties.get('proj:epsg'))
except Exception as e:
    print(f"Error checking item[{index}]: {e}")

2023-08-24 11:16:13.143000+00:00
{'type': 'Polygon', 'coordinates': [[[-1.4855325379228823, 55.037555983560885], [-1.9429687247198262, 54.05556822592197], [-1.3232267769574286, 54.04852390275896], [-1.2822817967468176, 55.034856954918745], [-1.4855325379228823, 55.037555983560885]]]}
{'created': '2024-01-09T08:58:01.293Z', 'platform': 'sentinel-2b', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'eo:cloud_cover': 99.922079, 'proj:centroid': {'lat': 54.46115, 'lon': -1.52953}, 'mgrs:utm_zone': 30, 'mgrs:latitude_band': 'U', 'mgrs:grid_square': 'WF', 'grid:code': 'MGRS-30UWF', 'view:azimuth': 109.29815214038962, 'view:incidence_angle': 10.685709790274707, 'view:sun_azimuth': 160.718125785181, 'view:sun_elevation': 45.2099639285788, 's2:tile_id': 'S2B_OPER_MSI_L2A_TL_2BPS_20230824T153429_A033768_T30UWF_N05.09', 's2:degraded_msi_data_percentage': 0.0076, 's2:nodata_pixel_percentage': 75.513142, 's2:saturated_defective_pixel_percentage': 0, 's2:dark_features_percentage': 0, 's2:clou

Let's try to download publicly available bands in the Sentinel collection:

In [8]:
assets = item.assets
print(assets.keys())
print(assets["thumbnail"].href)

dict_keys(['red', 'green', 'blue', 'visual', 'nir', 'swir22', 'rededge2', 'rededge3', 'rededge1', 'swir16', 'wvp', 'nir08', 'scl', 'aot', 'coastal', 'nir09', 'cloud', 'snow', 'preview', 'granule_metadata', 'tileinfo_metadata', 'product_metadata', 'thumbnail'])
https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/30/U/WF/2023/8/S2B_T30UWF_20230824T110622_L2A/L2A_PVI.jpg


In [9]:
cloud = assets["cloud"]
if cloud is None:
    raise KeyError("The asset 'cloud' not found")
print(type(cloud))

print(cloud.href)   # URL to the asset
print(cloud.media_type)
print(cloud.roles) # roles might be non-intuitive
print(cloud.title)

<class 'pystac.asset.Asset'>
https://e84-earth-search-sentinel-data.s3.us-west-2.amazonaws.com/sentinel-2-c1-l2a/30/U/WF/2023/8/S2B_T30UWF_20230824T110622_L2A/CLD_20m.tif
image/tiff; application=geotiff; profile=cloud-optimized
['data', 'cloud']
Cloud Probabilities


### EXPORT
First, we would like to export the full image locally to check it out:

In [10]:
import rioxarray
cloud_href = assets["cloud"].href
cloud = rioxarray.open_rasterio(cloud_href)
print(cloud)

<xarray.DataArray (band: 1, y: 5490, x: 5490)> Size: 30MB
[30140100 values with dtype=uint8]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 44kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
  * y            (y) float64 44kB 6.1e+06 6.1e+06 6.1e+06 ... 5.99e+06 5.99e+06
    spatial_ref  int64 8B 0
Attributes:
    OVR_RESAMPLING_ALG:        AVERAGE
    AREA_OR_POINT:             Area
    STATISTICS_MAXIMUM:        100
    STATISTICS_MEAN:           97.506272511084
    STATISTICS_MINIMUM:        1
    STATISTICS_STDDEV:         10.967665157928
    STATISTICS_VALID_PERCENT:  24.27
    _FillValue:                0
    scale_factor:              1.0
    add_offset:                0.0


In [18]:
target_id="S2B_T30UWF_20230116T111315_L2A" # another ID - S2A_T30UWG_20230117T113415_L2A

items_by_id = {it.id: it for it in items}
item = items_by_id.get(target_id)
if item is None:
    raise ValueError(f"Item with id {target_id} not found")

print(item.id)
print(item.properties)

assets = item.assets
print(assets.keys())
print(assets["thumbnail"].href)

snow = assets["snow"]
if snow is None:
    raise KeyError("The asset 'snow' not found")
print(type(snow))

print(snow.href)   # URL to the asset
print(snow.media_type)
print(snow.roles) # roles might be non-intuitive
print(snow.title)

import rioxarray
snow_href = assets["snow"].href
snow = rioxarray.open_rasterio(snow_href)
print(snow)

# save whole image to disk
snow.rio.to_raster(f"data/snow_{target_id}.tif")

# Number of snowy tile - S2A_T30UWG_20230117T113415_L2A

S2B_T30UWF_20230116T111315_L2A
{'created': '2024-01-05T10:38:43.970Z', 'platform': 'sentinel-2b', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'eo:cloud_cover': 40.400359, 'proj:centroid': {'lat': 54.45592, 'lon': -1.51934}, 'mgrs:utm_zone': 30, 'mgrs:latitude_band': 'U', 'mgrs:grid_square': 'WF', 'grid:code': 'MGRS-30UWF', 'view:azimuth': 109.43686600784984, 'view:incidence_angle': 10.68987277116687, 'view:sun_azimuth': 165.103752471263, 'view:sun_elevation': 13.341900559606003, 's2:tile_id': 'S2B_OPER_MSI_L2A_TL_2BPS_20230116T124654_A030622_T30UWF_N05.09', 's2:degraded_msi_data_percentage': 0, 's2:nodata_pixel_percentage': 76.868796, 's2:saturated_defective_pixel_percentage': 0, 's2:dark_features_percentage': 0.241603, 's2:cloud_shadow_percentage': 19.930786, 's2:vegetation_percentage': 3.727176, 's2:not_vegetated_percentage': 24.767007, 's2:water_percentage': 6.099357, 's2:unclassified_percentage': 1.956721, 's2:medium_proba_clouds_percentage': 13.74151, 's2:high_proba_clo

Note that snow probability product has the same issue. On the sample image you can see two types of pixels=0:
1. Pixels not covered by snow (these pixels are useful for further analysis)
2. Pixels outside of the satellite coverage, included into the grid, but divided from the snow pixels with a sharp straight line. These pixels are not useful for further analysis!

<img src="illustrations/nodata_issue_snow.png" alt="nodata_snow_issue" style="width:50%;">
<img src="illustrations/nodata_issue_snow_legend.png" alt="nodata_snow_issue_legend" style="width:30%;">


Take a note that some scenes might have a small number of valid pixels (see **'STATISTICS_VALID_PERCENT'**). That doesn't mean these pixels are not suitable for follow-up analysis as this attribute describes the entire whole scene area whereas the actual area of interest might have larger share of valid pixels.

Moreover, no data values in this product usually mean that these pixels just belong to other non-cloudy categories (eg, vegetation, water, or snow).


In [19]:
# save whole image to disk
cloud.rio.to_raster(f"data/cloud_{index}.tif")

Let's also download another band for a visual comparison with cloud probability - SCL, which divide scene by rough 'land-cover' categories, including clouds. As you can see, no data value (0) in the cloud probability product are usually observed in non-cloudy SCL categores, such as vegetation or water.

Therefore, no data value in a pixel doesn't mean we can't say for sure if it's a cloud or not - it usually means that it's not a cloud. In other words, `STATISTICS_VALID_PERCENT` is not a quality metric of cloud probability product and doesn't describe accuracy.

In [20]:
scl_href = assets["scl"].href
scl = rioxarray.open_rasterio(scl_href)
print(scl)

# save whole image to disk
scl.rio.to_raster(f"data/scl_{index}.tif")

<xarray.DataArray (band: 1, y: 5490, x: 5490)> Size: 30MB
[30140100 values with dtype=uint8]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 44kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
  * y            (y) float64 44kB 6.1e+06 6.1e+06 6.1e+06 ... 5.99e+06 5.99e+06
    spatial_ref  int64 8B 0
Attributes:
    OVR_RESAMPLING_ALG:        MODE
    AREA_OR_POINT:             Area
    STATISTICS_MAXIMUM:        11
    STATISTICS_MEAN:           6.434488415921
    STATISTICS_MINIMUM:        2
    STATISTICS_STDDEV:         2.5714036867869
    STATISTICS_VALID_PERCENT:  23.13
    _FillValue:                0
    scale_factor:              1.0
    add_offset:                0.0


Now, we would like to work with particular tiles. Let's call a separate function which will clip the scene by the extent of the tile of interest. 

In [21]:
import importlib # to reload external changes
import raster_utils

importlib.reload(raster_utils)

<module 'raster_utils' from '/app/raster_utils.py'>

Let's find out the National Grid tiles Tyne and Wear area intersects. For that purpose, we are going to use [20x20km grid](https://github.com/OrdnanceSurvey/OS-British-National-Grids?tab=readme-ov-file). Let's call the external function to find out the intersected tiles:

In [81]:
aoi_path = "data/NewcastleUponTyne.gpkg"
tile_path = "data/uk_20km_grid.gpkg"

touched, aoi_crs, tile_crs = raster_utils.touched_tiles(aoi_path, tile_path)
print(touched, aoi_crs, tile_crs )

    tile_name  country                                           geometry
413      NZ04  England  POLYGON ((400000 540000, 420000 540000, 420000...
414      NZ06  England  POLYGON ((400000 560000, 420000 560000, 420000...
418      NZ24  England  POLYGON ((420000 540000, 440000 540000, 440000...
419      NZ26  England  POLYGON ((420000 560000, 440000 560000, 440000...
423      NZ44  England  POLYGON ((440000 540000, 460000 540000, 460000...
424      NZ46  England  POLYGON ((440000 560000, 460000 560000, 460000... EPSG:27700 EPSG:27700


For now, we will limit calculations with one tile, mentioned above. Let's call the external function to clip the scene:

In [82]:
clipped_cloud=raster_utils.clip_scene_by_one_tile(cloud, touched, index=3)
clipped_cloud

Again, you will see the **'STATISTICS_VALID_PERCENT'**, but this attribute inherited value from the initial image and is not correct anymore. If you wish, you can recalculate stats and find out how many non-cloudy pixels you have in your tile of interest now. It should be different from what you found out earlier in the metadata.

In [None]:
nodata_val=clipped_cloud.rio.nodata
print(f"Nodata value is {nodata_val}")
total_pixels=clipped_cloud.size
print(f"Total number of pixels is {total_pixels}")
nodata_pixels=((clipped_cloud == nodata_val).sum().item())
print(f"Number of pixels=0 is {nodata_pixels}")

nodata_share=(nodata_pixels/total_pixels)*100 if total_pixels >0 else 0
print(f"Share of pixels=0 is {nodata_share} %")

Nodata value is 0
Total number of pixels is 648000
Number of pixels=0 is 397834
Share of pixels=0 is 61.394135802469144 %


#### CHECKING NO DATA VALUES
The problem is that there are two types of pixels with cloud probability encoded as 0:
1. within area covered by satellite, where there are no clouds. We want these pixels!
2. outside area covered by satellite, but included in the product image. We don't want these pixels!

For the valid calculations, we have to eliminate pixels of type 2.

<img src="illustrations/nodata_issue.png" alt="nodata_issue" style="width:50%;">
<img src="illustrations/nodata_issue_legend.png" alt="nodata_issue_legend" style="width:10%;">

# TODO - to write a function on separating true no data values and 0

Now, let's export the output locally:

In [110]:
clipped_cloud.rio.to_raster(f"data/clipped_cloud_{index}.tif")

### 2. CLOUD PROBABILITY CALCULATION **(ONE tile, ONE year)**

So far, everything was relatively simple, wasn't it?

Now, to understand how frequent you can enjoy sun (or suffer) in a particular pixel, we can try to extract the average value of cloud probability over a year. 

First, we could try it for only one tile of interest, for one year.
So, we will use all scenes (or timestamps, or items) available as we sure so far it won't take too long to calculate average values across <1000 of images.

**NOTE**: some images do not cover tiles entirely as satellites provide images in so-called swaths (MGRS grids), so swaths may slice the area of interest, partly leaving it without values.

In [22]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

2.1. First, let's try just loop over all items. 
We could loop over all STAC items, open them, create one enormously giant array and calculate cloud probability, but that's definitely not the best day to do it.

So, we will test another way:
* loop over scenes (items)
* open first scene and extract value in each pixel
* store values in each pixel
* open next scene and extract value in each pixel
* add value in each pixel to previous value
* once all values across all scenes are cumulated, divide the cumulated value by the number of scenes

Just be aware before running cells, it may take quite a while.
Non-cached calculation for 2023 year (713 scenes) took 29 minutes (all tiles).

In [None]:
import xarray as xr
import matplotlib.pyplot as plt

# Example: average snow probability across all items
arrays = []
print ("Looping over all scenes (items):")

for item in items:   # your filtered STAC items
    cloud_asset = item.assets.get("cloud")
    if cloud_asset:
        da = rioxarray.open_rasterio(cloud_asset.href, masked=True)
        arrays.append(da)

# stack them along a new "time" dimension
stack = xr.concat(arrays, dim="time")
print(stack)

KeyboardInterrupt: 

Now, calculate the mean average for each pixel:

In [None]:
mean_raster = stack.mean(dim="time", skipna=True)

Export and visualise:

In [None]:
mean_raster.rio.to_raster("data/snow_mean.tif")
mean_raster.plot(figsize=(6,6), cmap="Blues")

In [None]:
import numpy as np

mean_accum = None
count = 0
print (f"Looping over all scenes ({len(items)} items):")

for i,item in enumerate(items):
    
    cloud_asset = item.assets.get("cloud")
    if cloud_asset:
        da = rioxarray.open_rasterio(cloud_asset.href)
        data = da.squeeze().values.astype("float32")

        if mean_accum is None:
            mean_accum = np.zeros_like(data)
            ref_da = da  # keep reference for spatial metadata

        mean_accum += data
        # print(mean_accum)
        count += 1
        print(f"{i} - ID {item.id}")

mean_accum /= count # calculate average # it's numpy ndarray

NameError: name 'items' is not defined

In [120]:
print(mean_accum)

[[41.196354 41.41655  41.681625 ... 41.056103 40.92146  41.008415]
 [41.659187 41.841515 41.71669  ... 40.87377  40.80645  41.068726]
 [42.004208 41.79944  41.527348 ... 41.       40.90042  41.10659 ]
 ...
 [43.4993   43.518932 43.3885   ... 42.225807 42.430576 42.37167 ]
 [43.741936 43.670406 43.664795 ... 42.196354 42.401123 42.039272]
 [43.17812  43.45442  43.67321  ... 42.41094  42.431976 41.68303 ]]


Export and visualise the output:

In [4]:
import rasterio
import matplotlib.pyplot as plt

transform = ref_da.rio.transform()
crs = ref_da.rio.crs
height, width = ref_da.shape[-2:]

profile = {
    "driver": "GTiff",
    "dtype": "float32",
    "count": 1,
    "height": height,
    "width": width,
    "crs": crs,
    "transform": transform,
    "compress": "lzw"
}

with rasterio.open("data/cloud_mean.tif", "w", **profile) as dst:
    dst.write(mean_accum, 1)

print("✅ Exported mean raster to data/cloud_mean.tif")

# to drop the 'band' dimension (not needed, we have only one possible band)
# wrap numpy array with spatial metadata from referenece
mean_da = ref_da.squeeze().copy(data=mean_accum)

mean_da.plot(
    figsize=(6, 6),
    cmap="Blues",
    cbar_kwargs={"label": "Mean cloud probability (%)"}
)
plt.title("Mean yearly cloud probability")
plt.show()

NameError: name 'ref_da' is not defined

Takes an enormous amount of time, isn't it? And that's all just for one year.

### TILE OF INTEREST
We can test the performance for our tile of interest. Let's define parameters:


In [68]:
%reload_ext autoreload
%autoreload 2

# parameters
aoi_path = "data/NewcastleUponTyne.gpkg"
tile_path = "data/uk_20km_grid.gpkg"

collection="sentinel-2-c1-l2a"
datetime='2023-01-01/2023-12-31'
asset="cloud"

touched, aoi_crs, tile_crs = raster_utils.touched_tiles(aoi_path, tile_path)
print(touched)
spatial_extent= touched.iloc[[3]].copy()  # double brackets to keep as DataFrame
print(spatial_extent) # THAT'S OUR TILE
print(type(spatial_extent)) 

Area of interest is EPSG:27700
    tile_name  country                                           geometry
413      NZ04  England  POLYGON ((400000 540000, 420000 540000, 420000...
414      NZ06  England  POLYGON ((400000 560000, 420000 560000, 420000...
418      NZ24  England  POLYGON ((420000 540000, 440000 540000, 440000...
419      NZ26  England  POLYGON ((420000 560000, 440000 560000, 440000...
423      NZ44  England  POLYGON ((440000 540000, 460000 540000, 460000...
424      NZ46  England  POLYGON ((440000 560000, 460000 560000, 460000...
    tile_name  country                                           geometry
419      NZ26  England  POLYGON ((420000 560000, 440000 560000, 440000...
<class 'geopandas.geodataframe.GeoDataFrame'>


Now, calculate cloud probability, masking the scenes with the tile of interest. We avoid loading the full raster into memory.

In [None]:
import numpy as np
from rasterio.windows import from_bounds
from rasterio.warp import reproject, Resampling, calculate_default_transform
import geopandas as gpd

mean_accum=None
count = 0
ref_profile=None
print(f"Looping over all scenes ({len(items)} items):")
print("-" * 40)

spatial_extent.to_file("data/spatial_extent.gpkg", driver="GPKG")

for i,item in enumerate(items):
    cloud_asset = item.assets.get("cloud")
    if cloud_asset:
        # da = da.rio.clip(spatial_extent.geometry, spatial_extent.crs, drop=False)  # NOTE: another masking option (still loads everything to memory)
        with rasterio.open(cloud_asset.href) as src:            
            window = from_bounds(*spatial_extent.total_bounds, transform=src.transform)

            data=src.read(1,window=window).astype("float32")

            """# AOI bounds in raster CRS
            minx, miny, maxx, maxy = spatial_extent.total_bounds
            print(minx,miny,maxx,maxy)
            print(spatial_extent.crs)
            print(src.crs)"""

            if spatial_extent.crs != src.crs:
                transform, width, height = calculate_default_transform(
                    src.crs, spatial_extent.crs, src.width, src.height, *src.bounds
                )
                data_reproj = np.empty((height, width), dtype=np.float32)

                reproject(
                    source=rasterio.band(src, 1),
                    destination=data_reproj,
                    src_transform=src.transform,
                    src_crs=src.crs,
                    dst_transform=transform,
                    dst_crs=spatial_extent.crs,
                    resampling=Resampling.nearest #NOTE:nearest or bilinear
                )

                data = data_reproj
            else:
                data = src.read(1).astype("float32")
                #NOTE: crucial because we need cartesian CRS (recorded in tile, not in Sentinel scenes)
                # TODO - to reproject rasterio object to spatial extent crs 

            # NOTE: DEBUG
            # AOI bounds in raster CRS
            minx, miny, maxx, maxy = spatial_extent.total_bounds
            '''print(minx,miny,maxx,maxy)'''
            # Raster resolution (pixel size)
            res_x, res_y = src.res
            width = int((maxx - minx) / res_x)
            height = int((maxy - miny) / res_y)
            print(f"AOI width: {width}, height: {height}, resolution: {res_x}")

            print(f"{i} - ID {item.id}, shape: {data.shape}, bounds: {spatial_extent.total_bounds}")
            print("-"*40)

            # keep reference profile for export
            if ref_profile is None:
                ref_profile = src.profile.copy()
                ref_profile.update({
                    "height": data.shape[0],
                    "width": data.shape[1],
                    "transform": src.window_transform(window),
                    "dtype": "float32",
                    "count": 1,
                    "compress": "lzw"
                })
                mean_accum = np.zeros_like(data)
                'valid_count = np.zeros_like(data) # TODO - to consider later when no data values and true ZEROs are separated'

            mean_accum += data #TODO - shape of arrays is different because they can cover the AOI only partially. Resample to the area of tile?
            count += 1
        
        if mean_accum is None:
            mean_accum = np.zeros_like(data)

    
mean_accum /= count # calculate average # it's numpy ndarray
print(f"Processed {len(items)} scenes.")

Looping over all scenes (713 items):
AOI width: 1000, height: 1000, resolution: 20.0
0 - ID S2A_T30UWF_20231230T112455_L2A, shape: (5568, 5568), bounds: [420000. 560000. 440000. 580000.]
----------------------------------------
AOI width: 1000, height: 1000, resolution: 20.0
1 - ID S2A_T30UXF_20231230T112455_L2A, shape: (5568, 5568), bounds: [420000. 560000. 440000. 580000.]
----------------------------------------
AOI width: 1000, height: 1000, resolution: 20.0
2 - ID S2A_T30UWG_20231230T112455_L2A, shape: (5568, 5568), bounds: [420000. 560000. 440000. 580000.]
----------------------------------------
AOI width: 1000, height: 1000, resolution: 20.0
3 - ID S2A_T30UXG_20231230T112455_L2A, shape: (5568, 5568), bounds: [420000. 560000. 440000. 580000.]
----------------------------------------
AOI width: 1000, height: 1000, resolution: 20.0
4 - ID S2B_T30UWG_20231228T113409_L2A, shape: (5568, 5568), bounds: [420000. 560000. 440000. 580000.]
----------------------------------------
AOI widt

Surprisingly, the time required is comparable to previous calculation because computational performance probably depends mostly on the loading scenes, not on calculating the cloud probability. Still, we have to download the full scenes, and calculating the cloud probability in the smaller area doesn't help much to reduce computation time.

In [None]:
with rasterio.open("data/cloud_mean_bbox.tif", "w", **profile) as dst:
    dst.write(mean_accum, 1)

print("Exported mean raster to data/cloud_mean.tif")

# to drop the 'band' dimension (not needed, we have only one possible band)
# wrap numpy array with spatial metadata from referenece
mean_da = ref_da.squeeze().copy(data=mean_accum)

mean_da.plot(
    figsize=(6, 6),
    cmap="Blues",
    cbar_kwargs={"label": "Mean cloud probability (%)"}
)
plt.title("Mean yearly cloud probability")
plt.show()

Upon checking the results, we can spot some issues with buildings roofs

<img src="illustrations/issue_roofs.png" alt="roofs_issue" style="width:80%;">

#### TODO - RASTER STATS

It would be nice to provide a quick statistical analysis of cloud probability distribution. For example, range of values is not large because clouds persistently cover all of the pixels - there are no regions where clouds are very rare (at first glance)

#### 2.2. Parallelised calculation

# TODO - Test Dask locally for CPU parallelisation