## **Cloud probability per LSOA**

This notebook explores ways to translate per-pixel, raster-based cloud probability into polygon-based cloud probability. Aggregating to polygons makes the data easier to interpret and to use in regional and local policy-making.

In this workflow we use LSOA geographies from [here](https://github.com/Imago-SDRUK/geographies?tab=readme-ov-file#list-of-data) **(n=46844)**. The previous workflow produces xarray Datasets exported as tiled GeoTIFFs **(n=858)**. There are many ways to calculate LSOA-based statistics ([see here](https://gis.stackexchange.com/questions/481295/fastest-way-to-calculate-zonal-statistics-on-millions-of-small-polygons-and-stor)), but we explore just a few:

In [59]:
import numpy as np
import pandas as pd
import xarray as xr

from pystac_client import Client
import geopandas as gpd
import rioxarray
from rioxarray import merge

import dask.array as da
import dask.dataframe as dd

from rasterstats import zonal_stats
import xvec
from exactextract import exact_extract

import sys

from dask.distributed import Client as DaskClient, LocalCluster
from dask import delayed, compute

print(xvec.__version__)
sys.path.append('/src/external_libs')

0.4.0


In [79]:
# input
lsoa="data/uk_datazones_within_newcastle.gpkg"  # "data/uk_datazones.gpkg" if you want all LSOAs
lsoa_id_col='LSOA21CD' # column name with unique LSOA IDs, as in `geographies`
lsoa_gdf=gpd.read_file(lsoa) 

# output
lsoa_stats_exactextract="data/lsoa_stats_exactextract.gpkg"
lsoa_stats_rasterstats="data/lsoa_stats_rasterstats.gpkg"
lsoa_stats_xvec_datasets="data/lsoa_stats_xvec_datasets.gpkg"
lsoa_stats_sjoin="data/lsoa_stats_sjoin.gpkg"

In this code we run the computations on a **data subset**:
- six cloud probability datasets (Newcastle tiles)
- all LSOAs entirely covered by the tiles (`within`). We don't include LSOAs that cover pixels in other tiles. This way we **replicate the real data relations** of the UK scale at the local scale: all LSOAs are contained within the tiles.

For the sake of clarity, we open the previously exported GeoTIFF cloud probability rasters as xarray Datasets. In the final version of the workflow this step can be omitted, as **we don't need any intermediate exports**, which consume considerable resources.

In [80]:
# Open input
rasters = [
    "data/Newcastle_NZ26_2024.tif",
    "data/Newcastle_NZ04_2024.tif",
    "data/Newcastle_NZ24_2024.tif",
    "data/Newcastle_NZ06_2024.tif",
    "data/Newcastle_NZ44_2024.tif",
    "data/Newcastle_NZ46_2024.tif",
]

datasets = []
for r in rasters:
    ds = xr.open_dataset(r, engine="rasterio")
    datasets.append(ds)
print(f"Successfully opened xarray Datasets")

Successfully opened xarray Datasets


### 1. Zonal statistics
Here we use the classic `rasterstats.zonal_stats` tool to calculate cloud probability statistics per LSOA.
The output contains a column for each raster, band and statistic. Computing it takes significant time, even without weighting the mean values by the share of LSOA area in each raster tile.

In [60]:
all_stats=[]
stats_list=['mean','median','min','max','std','count']

# 1. Calculate stats for each raster (only first band)
for raster in rasters:
    zs = zonal_stats(
        lsoa_gdf,
        raster,
        band=1, # only first band
        stats=stats_list,
        all_touched=True, # True or False without partial cell weighting
        nodata=None,
        geojson_out=False  # returns list of dictionaries
    )

    # 2. Convert to dataframe and add LSOA id
    df = pd.DataFrame(zs)
    df[lsoa_id_col] = lsoa_gdf[lsoa_id_col].values

    # 3. Rename columns to include raster name
    stat_cols = [col for col in df.columns if col != lsoa_id_col]
    raster_name = raster.split('/')[-1].split('.')[0]  # extract raster filename
    df.rename(columns={col: f"{raster_name}_band_1_{col}" for col in stat_cols}, inplace=True)

    # 4. Append to list
    all_stats.append(df)
    print(f"Processed raster: {raster}")

# 5. Set LSOA ID as index for each dataframe, concat side by side
df_stats = pd.concat([df.set_index(lsoa_id_col) for df in all_stats], axis=1).reset_index()

# 6. Merge with original geodataframe to retain geometries
gdf_with_stats = lsoa_gdf.merge(df_stats, on=lsoa_id_col, how='left')

# 7. convert and export
gdf_with_stats = gpd.GeoDataFrame(gdf_with_stats, geometry='geometry', crs=lsoa_gdf.crs)
gdf_with_stats.to_file(lsoa_stats_rasterstats, driver="GPKG")
print(f"Saved final LSOA stats to {lsoa_stats_rasterstats}")

Processed raster: data/Newcastle_NZ26_2024.tif
Processed raster: data/Newcastle_NZ04_2024.tif
Processed raster: data/Newcastle_NZ24_2024.tif
Processed raster: data/Newcastle_NZ06_2024.tif
Processed raster: data/Newcastle_NZ44_2024.tif
Processed raster: data/Newcastle_NZ46_2024.tif
Saved final LSOA stats to data/lsoa_stats_rasterstats.gpkg


**Time**: 10s

**Conclusion:**
- does not accept an xarray Dataset/DataArray; it reads either a GeoTIFF path or `rasterio` objects - **not recommended to use at all**
- relatively slow (even without weighting)
- extracts stats from every raster even when it doesn't intersect an LSOA (replacing values with 0), so additional processing is needed to weight each tile's contribution per LSOA
- no partial cell-weighting
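
The additional processing mentioned above could look roughly like the sketch below. It assumes the wide column layout produced by the loop (`{raster_name}_band_1_{stat}`) and uses made-up values for two hypothetical LSOAs; a tile that does not overlap an LSOA is assumed to contribute a count of 0 and no usable mean. The output column name `band_1_mean` is an illustrative choice, not part of the workflow.

```python
import pandas as pd

# Toy stand-in for the wide rasterstats output (values are invented).
wide = pd.DataFrame({
    "LSOA21CD": ["A", "B"],
    "Newcastle_NZ26_2024_band_1_mean": [60.0, 50.0],
    "Newcastle_NZ26_2024_band_1_count": [30, 20],
    "Newcastle_NZ24_2024_band_1_mean": [40.0, None],  # LSOA "B" is not in this tile
    "Newcastle_NZ24_2024_band_1_count": [10, 0],
})

# Pair every per-tile mean column with its count column.
mean_cols = [c for c in wide.columns if c.endswith("_band_1_mean")]
count_cols = [c[: -len("mean")] + "count" for c in mean_cols]

# Missing means get weight 0 anyway, so fill them to keep the arithmetic finite.
means = wide[mean_cols].fillna(0).to_numpy()
counts = wide[count_cols].to_numpy()

# Pixel-count-weighted mean across tiles, one value per LSOA.
wide["band_1_mean"] = (means * counts).sum(axis=1) / counts.sum(axis=1)
print(wide[["LSOA21CD", "band_1_mean"]])
```

An LSOA with a total count of 0 across all tiles would yield a division by zero (NaN), which would need separate handling.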

### 2. `Exactextract`

A library written in C++ that works perfectly for single tiles (see Shaonlee's Notebook). It provides fewer custom statistics than `rasterstats`, but enough for our purposes.

[Documentation here](https://isciences.github.io/exactextract/exactextract.html)

The code below runs the calculations and, when aggregating per LSOA, weights each tile's mean by its pixel count.

In [68]:
results = []
stats = ['mean','min','max','count']

# 1. extract stats for each dataset
for ds in datasets:
    stat_df = exact_extract(
        rast=ds,
        vec=lsoa_gdf,
        ops=stats,
        include_cols=[lsoa_id_col],
        include_geom=True,
        output='pandas',
        strategy="feature-sequential"
    )
    results.append(stat_df)
    # NOTE: this will create a dataframe with columns like "band_data_band_1_mean"

# 2. combine all results
combined = pd.concat(results)

# 3. identify statistic columns
# since we only have one band, we just define its column names
mean_col = "band_data_band_1_mean"
min_col  = "band_data_band_1_min"
max_col  = "band_data_band_1_max"
count_col = "band_data_band_1_count"

# 4. Pre-index geometries to avoid repeated access
geom_map = lsoa_gdf.set_index(lsoa_id_col)['geometry'].to_dict()

# 5. Compute weighted sum per row for weighted mean
combined['weighted_sum'] = combined[mean_col] * combined[count_col]

# 6. Aggregate by LSOA (vectorized)
agg_df = combined.groupby(lsoa_id_col).agg(
    weighted_sum=('weighted_sum', 'sum'),
    total_count=(count_col, 'sum'),
    min_val=(min_col, 'min'),
    max_val=(max_col, 'max')
).reset_index()

# 7. Compute final weighted mean
agg_df[mean_col] = agg_df['weighted_sum'] / agg_df['total_count']
agg_df[count_col] = agg_df['total_count']
agg_df = agg_df.drop(columns=['weighted_sum','total_count'])

# 8. Add geometries using pre-indexed map
agg_df['geometry'] = agg_df[lsoa_id_col].map(geom_map)

# 9. convert and save
out_gdf = gpd.GeoDataFrame(agg_df, geometry='geometry', crs=lsoa_gdf.crs)
out_gdf.to_file(lsoa_stats_exactextract, driver="GPKG")
print(f"Saved weighted aggregated LSOA statistics to {lsoa_stats_exactextract}")

Saved weighted aggregated LSOA statistics to data/lsoa_stats_exactextract.gpkg


**Time**: 3s (with weighting)

**Conclusion:**
- fast, but it is unclear whether it has built-in parallelism
- can be wrapped in Dask `delayed`
- supports partial cell-weighting
- handles LSOAs covering more than one tile. To confirm, see the values of `band_data_band_2_mean`: the mean pixel value is not an integer in LSOAs covering more than one tile (it was an integer in the original dataset)
- `exactextract` doesn't accept a list of xarray Datasets as input, only a single one, so we have to loop over the datasets and then aggregate the results per LSOA
- requires installing C++ extensions and might not be available through `pip`. It has issues with the Python version in the current image (probably requires <=3.11)
- `exact_extract` can keep the geometry, in which case it returns a GeoDataFrame. There is no need to merge the output with the original dataframe, which is very efficient
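
To illustrate what partial cell-weighting means in practice, here is a minimal NumPy sketch. The pixel values and coverage fractions are invented, and the "at least half covered" rule is only a rough stand-in for a centre-in-polygon test; it is not how any of the libraries above decide which pixels to include.

```python
import numpy as np

# Hypothetical 2x2 block of cloud-probability pixels under one polygon.
values = np.array([[60.0, 80.0],
                   [40.0, 20.0]])

# Assumed fraction of each pixel's area covered by the polygon (0..1).
coverage = np.array([[1.0, 0.5],
                     [0.25, 0.0]])

# Coverage-weighted mean: partially covered pixels contribute proportionally.
weighted_mean = (values * coverage).sum() / coverage.sum()

# Unweighted alternative: a pixel either counts fully or not at all.
# Here a pixel is selected when at least half of it is covered (an assumption).
selected = coverage >= 0.5
unweighted_mean = values[selected].mean()

print(round(weighted_mean, 2), unweighted_mean)
```

The two means differ because the weighted version lets the quarter-covered pixel pull the result down slightly, while the unweighted version ignores it and counts the half-covered pixel in full.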

### 3. `xvec` with Datasets

The `xvec` library can be used to create a 'vector data cube' and calculate zonal statistics.
This method is similar to classic zonal statistics: it keeps the structure of the original cube with its attributes, but indexes it by polygon geometry. Below is the implementation of `xvec.zonal_stats` with Datasets. Another option is `xvec.to_geodataframe` followed by a spatial join, but that approach ran into issues.

[Documentation is here](https://xvec.readthedocs.io/en/stable/generated/xarray.Dataset.xvec.zonal_stats.html)

The code below calculates the unweighted average for each LSOA across tiles. Weighting could be implemented later.

In [78]:
results = []
stats_list=['mean','median','min','max','count']

# 1. Iterate over datasets to calculate stats, extracting single-band DataArray (first band only)
for i, ds in enumerate(datasets):
    raster_name = getattr(ds, "name", f"raster_{i}")

    # use only the first band; named `band_da` to avoid shadowing `dask.array as da`
    band_da = ds["band_data"].isel(band=0).expand_dims("band")
    #print(band_da)
    #print(band_da.coords)
    
    stat_ds = band_da.xvec.zonal_stats(
        geometry=lsoa_gdf["geometry"],
        x_coords="x",
        y_coords="y",
        stats=stats_list,
        method="rasterize"
    )
    
    # 2. linking each geometry to its LSOA code and transform to geodataframe
    stat_ds = stat_ds.assign_coords({lsoa_id_col: ("geometry", lsoa_gdf[lsoa_id_col].values)})
    stat_ds.name = "band_data"
    df = stat_ds.xvec.to_geodataframe(geometry="geometry").reset_index()  # required to reset, otherwise `zonal_stats` will fly away to index
    df["source_raster"] = raster_name
    df["band"] = 1  # first band
    
    results.append(df)

# 3. Combine all stats into one dataframe
combined = pd.concat(results, ignore_index=True)
print(combined.head(10))

# 4. pivot table to multiindex format: rows = LSOA, columns = band + stat, values = band_data
lsoa_stats_df = combined.pivot_table(
    index=lsoa_id_col,
    columns=['band', 'zonal_statistics'],  # multi-index columns
    values='band_data',
    aggfunc='mean'  # average across multiple occurrences
)

# 5. flatten multiindex columns: band_1_mean, band_1_max, etc.
lsoa_stats_df.columns = [f"band_{b}_{stat}" for b, stat in lsoa_stats_df.columns]
lsoa_stats_df = lsoa_stats_df.reset_index()

# 6. convert and export
lsoa_stats_df['geometry'] = combined.groupby(lsoa_id_col).geometry.first().values
lsoa_stats_gdf = gpd.GeoDataFrame(lsoa_stats_df, geometry='geometry', crs=combined.crs)
lsoa_stats_gdf.to_file(lsoa_stats_xvec_datasets, driver="GPKG")
print(f"Saved final LSOA stats to {lsoa_stats_xvec_datasets}")

  zonal_statistics  band                                           geometry  \
0             mean     1  MULTIPOLYGON (((426698.655 562983.803, 426704....   
1           median     1  MULTIPOLYGON (((426698.655 562983.803, 426704....   
2              min     1  MULTIPOLYGON (((426698.655 562983.803, 426704....   
3              max     1  MULTIPOLYGON (((426698.655 562983.803, 426704....   
4            count     1  MULTIPOLYGON (((426698.655 562983.803, 426704....   
5             mean     1  MULTIPOLYGON (((425968.068 562574.153, 425971....   
6           median     1  MULTIPOLYGON (((425968.068 562574.153, 425971....   
7              min     1  MULTIPOLYGON (((425968.068 562574.153, 425971....   
8              max     1  MULTIPOLYGON (((425968.068 562574.153, 425971....   
9            count     1  MULTIPOLYGON (((425968.068 562574.153, 425971....   

   spatial_ref   LSOA21CD    band_data source_raster  
0            0  E01008162    60.460705      raster_0  
1            0  E010

**Time varies across methods**:

- `exactextract` - 3s
- `rasterize` - 3s (results identical to `exactextract`)
- `iterate` - 29s

**Conclusion:**
- fast, but the output is not clean: quirkily, it saves stats to rows rather than columns
- newer versions work directly with xarray Datasets (not just DataArrays), which is good
- might conflict with the current container (currently using `xvec==0.4.0`)
- `xvec.zonal_stats` uses `exactextract` under the hood as one of its methods
- like the other methods, it cannot aggregate zonal stats for each LSOA across multiple rasters; you need to do that yourself, as with `exactextract`
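
As a rough sketch of that manual aggregation, the example below starts from the long layout `xvec` produces (one row per LSOA, raster and statistic) and contrasts a plain mean of per-tile means with a pixel-count-weighted mean. The column names mirror the output above, but the values and the two LSOA codes are made up.

```python
import pandas as pd

# Toy long-format output: stats are stored in rows, as xvec returns them.
long_df = pd.DataFrame({
    "LSOA21CD": ["A", "A", "A", "A", "B", "B"],
    "source_raster": ["raster_0", "raster_0", "raster_1", "raster_1",
                      "raster_0", "raster_0"],
    "zonal_statistics": ["mean", "count", "mean", "count", "mean", "count"],
    "band_data": [70.0, 30.0, 30.0, 10.0, 50.0, 20.0],
})

# One row per (LSOA, raster), one column per statistic.
per_tile = long_df.pivot(
    index=["LSOA21CD", "source_raster"],
    columns="zonal_statistics",
    values="band_data",
)

# Unweighted: every tile counts equally, however few pixels it contributes.
unweighted = per_tile["mean"].groupby(level="LSOA21CD").mean()

# Weighted: each tile mean is scaled by the number of pixels behind it.
weighted_sum = (per_tile["mean"] * per_tile["count"]).groupby(level="LSOA21CD").sum()
total_count = per_tile["count"].groupby(level="LSOA21CD").sum()
weighted = weighted_sum / total_count

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

For the LSOA split across two tiles the results differ (50 unweighted against 60 weighted), which is why averaging per-tile statistics without weights should be treated as an approximation.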

### 4. Spatial join and `dask_geopandas`

The idea is simple:
- create a geodataframe from pixel centroids
- perform a spatial join between the LSOA geodataframe and the cloud geodataframe
- calculate zonal statistics

Additionally, it's possible to use `dask_geopandas`. In this case the geodataframes should be converted to Dask geodataframes.

In [55]:
# Try without Dask
all_stats = []
for i, ds in enumerate(datasets):
    raster_name = f"raster_{i}"
    pixel_da = ds["band_data"]  # named `pixel_da` to avoid shadowing `dask.array as da`

    # 1. convert to dataframe (one row per pixel)
    df = pixel_da.to_dataframe(name="value").reset_index()
    df["raster"] = raster_name 
    # 2. make geodataframe
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["x"], df["y"]), crs="EPSG:27700")
    # 3. spatial join
    joined = gpd.sjoin(gdf, lsoa_gdf, how="inner", predicate="intersects")
    # 4. calculate zonal stats per LSOA, per band, per raster
    # named `lsoa_stats` to avoid shadowing the imported `rasterstats.zonal_stats`
    lsoa_stats = joined.groupby([lsoa_id_col, "band", "raster"])["value"].agg(
        mean="mean",
        median="median",
        std="std",
        min="min",
        max="max",
        count="count"
    ).reset_index()

    all_stats.append(lsoa_stats)

# 5. combine all raster stats into one dataframe
all_stats_df = pd.concat(all_stats, ignore_index=True)

# 6. merge with LSOA polygons without aggregating raster names
lsoa_stats_gdf = lsoa_gdf.merge(all_stats_df, on=lsoa_id_col, how="inner")

# 7. convert to geodataframe and save
lsoa_stats_gdf = gpd.GeoDataFrame(lsoa_stats_gdf, geometry="geometry", crs=lsoa_gdf.crs)
lsoa_stats_gdf.to_file(lsoa_stats_sjoin, driver="GPKG")
print(f"Saved final LSOA stats to {lsoa_stats_sjoin}")

Saved final LSOA stats to data/lsoa_stats_sjoin.gpkg


**Time:** 10s for non-Dasked version

**Conclusion**:
- not fast without Dask
- Dask fails because the code refers to `geopandas` instead of `dask_geopandas`: `geopandas` requires a GeoDataFrame, while our objects are Dask-backed
- can ingest the `chunks_size` argument from the cloud probability raster workflow
- output partitions are the intersection of the left and right geodataframes. When passing `chunks_size`, it's recommended to partition the left geodataframe only; otherwise the number of chunks may become excessively large
- only includes pixels whose centroids are within LSOAs
- Dask doesn't work with operations like `merge`

`dask-geopandas` is experimental (e.g. only `inner` join is supported), so it could require maintenance.