Compare different zonal statistics method to quantify the effect of area weights and cross-validate the methods #209

j08lue · 2023-10-19T22:02:42Z

To complete our goal of providing grid-cell area weighted zonal statistic

https://github.com/NASA-IMPACT/veda-architecture/issues/334

we want to make sure that our zonal statistics calculations are accurate.

For that, we need to quantify

What difference grid cell area weighting makes
Whether the new implementation in TiTiler is accurate

j08lue · 2023-10-19T22:03:04Z

Here is a notebook comparing different methods: https://gist.github.com/j08lue/ac09ed69e5dd58f4058466ec3b108b99

It runs without extra AWS S3 configuration in https://hub.ghg.center/ - please download it into the hub and run it there, if you want to try out or fix things.

Once the validation is completed, we will add this notebook to the GHG Center docs.

Main takeaways

The zonal average calculated with the current TiTiler (v0.2.3) implementation is identical to a plain (unweighted) zonal average computed with Xarray / rioxarray. So far so good.

Grid cell area weighting (here implemented by calculating area on a spherical Earth) makes a difference (of on average 35%) for CASA-GFED3 averaged over the Northern Hemisphere. The difference is negligible though for an area spanning ~2000 km north-south near the equator (Dem Rep Congo bbox, not shown here).

Next steps

We need to repeat this calculation with the new TiTiler implementation, either running locally or deployed.

If you have any comments on this comparison or want to further iterate on it, please go ahead - contributions welcome!

vincentsarago · 2023-11-02T10:40:12Z

@j08lue

I've replaced the titiler endpoint with a direct rio-tiler call

from rio_tiler.io import Reader

def generate_stats(item, geojson, asset_name):
    with Reader(item.assets[asset_name].href) as src:
        feature = geojson["features"][0]
        
        data = src.feature(feature, dst_crs=src.crs)

        # Get the coverage % array
        coverage_array = data.get_coverage_array(
            feature,
        )

        stats = data.statistics(
            coverage=coverage_array,
        )       

    return {
        **stats["b1"].model_dump(exclude_none=True),
        "start_datetime": item.properties["start_datetime"],
    }

but I'm getting the same results as with titiler.

Looking at the data and the AOI, it seems they are aligned

url = "s3://ghgc-data-store/casagfed-carbonflux-monthgrid-v3/GEOSCarb_CASAGFED3v3_NPP_Flux_Monthly_x720_y360_201712.tif"
with Reader(url) as src:
    print(src.bounds)
    print(src.dataset.transform)

(-180.0, -90.0, 180.0, 90.0)
| 0.50, 0.00,-180.00|
| 0.00,-0.50, 90.00|
| 0.00, 0.00, 1.00|

Because the AOI is perfectly aligned with the pixels, there is no reason for the weight to be something different than 1. which explain why I don't see difference between weighted and non-weighted statistics.

vincentsarago · 2023-11-02T10:50:09Z

😓 looking at the notebook I think there is a miss understood about what we are trying to achieve here!

in TiTiler/rio-tiler we added a method to take sub-pixel shape intersection in order to weight the statistics by the pixel coverage.

In the notebook you're using the area created from an external dataset (area.nc) to apply weights.

with xr.open_dataset("area.nc") as ds:
    area_cdo = ds["cell_area"].values
...

        data = xds_clip["band_data"].isel(band=0).to_masked_array()
        weights = xds_clip["area"].to_masked_array()
        weights.mask = data.mask
        timeseries["average_weighted"].append((data * weights).sum() / weights.sum())

Those are two different things and applying weights from another dataset is not possible via titiler/rio-tiler directly!

j08lue · 2023-11-02T11:03:54Z

Thanks for checking! The external weights (CDO-generated area.nc) were just for comparison / validation. We very much expect the API to take care of the weighting (or reprojection?).

j08lue · 2023-11-02T12:42:44Z

Had a clarifying call. Conclusion was that I'll try out the rio-tiler snippet above and make use of the dst_crs parameter to reproject to an equal-area projection before taking the statistics. If it works, we can expose that parameter in TiTiler, where it currently is not.

Furthermore, we discussed that climate data (mostly NetCDF or Zarr in lat/lon projection) specific methods might be better served by a package that is optimized for that kind of data, possibly titiler-xarray or a future evolution of a Xarray-serving service.

j08lue · 2023-11-03T14:56:54Z

I did some testing with dst_crs and a few different equal-area projections - not successful yet, unfortunately:

Updated notebook:
https://gist.github.com/j08lue/ac09ed69e5dd58f4058466ec3b108b99

The very different behavior of different equal-area projections makes me think that the method of reprojecting the image and geometry to some different projection is perhaps not safe enough. Not a surprise - there probably does not exist one projection that works for all regions and sizes.

Since this method is most important for climate datasets (mostly on regular geographic coordinate grids), maybe we should focus on the approach to implement this for that specific case only.

j08lue · 2023-11-09T23:30:58Z

Ok, turns out the rio-tiler based function works well when I pick a good projection. Equal Area Cylindrical ("+proj=cea") gave the expected results as the only equal area projection I found in PROJ or EPSG.

I take my true / benchmark to be the lat/lon grid cell area weighted average (with weights from CDO or an equivalent function). For a dataset with great variation over latitude like global net primary production (NPP), the methods compare like this:

Where average_titiler is the plain /cog/statistics endpoint and underestimates NPP, because high-latitude low production is falsely weighted the same as higher production in lower latitudes.

Both the rio-tiler method above and a rasterio function I coded up that uses WarpedVRT in a few lines give the expected results.

tl;dr

@vincentsarago, I need to use this dst_crs=rasterio.crs.CRS.from_proj4("+proj=cea") in the VEDA TiTiler /cog/statistics in one way or the other, so the application can calculate accurate statistics.

How do we do this best? Expose dst_crs with some heuristics to find out whether the user passed a PROJ, WKT, or EPSG code? 🤷 Or adding an area_weighted=True flag? Anything is fine. I saw that nearest neighbor is hard-coded as the method for reprojecting to dst_crs. That may not be ideal, but it was fine for my tests with large areas so far.

I'll do some more testing with other source image reference systems than geographic coordinates and smaller geometries, to be on the safe side, but let's move forward making a plan for the implementation in VEDA and (most importantly) GHG Center backend, please.

vincentsarago · 2023-11-09T23:35:32Z

@j08lue I might have an idea!

We could have a custom algorithm which could be applied on the image before calculating the statistics. This will only be available for the /statistics [POST] endpoint

Basically you would tell the tiler to apply algorithm=area_weight (naming TBD). The area_weigh algorithm will take most of the code from your notebook and will have to be registered in the algorithm dependency within the application (just like the colormap we just added)

This will be possible with next version of titiler developmentseed/titiler#726 (but compatible with the next veda-backend release)

Algorithm sketch:

import numpy
from rio_tiler.models import ImageData
from rio_tiler.constants import WGS84_CRS

from titiler.core.algorithm.base import BaseAlgorithm

def _get_coord_specs(arr):
    assert numpy.ndim(arr) == 1
    delta = _get_unique_diff(arr)
    return ((arr[0] - delta / 2), (arr[-1] + delta / 2), len(arr))


def surfaceAreaGrid(lat_specs, lon_specs):
    """
    Construct a 2D array (nlat, nlon) where each element is the surface area of the corresponding grid in square meters.
    lat_specs is a tuple, (starting latitude, ending latitude, latitude divisions)
    lon_specs is a tuple, (starting longitude, ending longitude, longitude divisions)

    Example: surfaceAreaGrid((-30,60,45), (20,80,120)) should return a (45,120) array with the surface areas of the
    (2 degree lat x 0.5 degree lon) cells in square meters.
    """
    lat_beg, lat_end, lat_div = lat_specs
    lon_beg, lon_end, lon_div = lon_specs

    R_e = 6371000.0 # Radius of Earth in meters
    dLon = (numpy.pi / 180.) * (lon_end - lon_beg) / lon_div

    dS = numpy.zeros((lat_div + 1, lon_div), numpy.float64)

    Lats = (numpy.pi / 180.) * numpy.linspace(lat_beg, lat_end, lat_div + 1)
    for i, lat in enumerate(Lats):
        dS[i] = R_e * R_e * dLon * numpy.sin(lat)

    dS = numpy.diff(dS, axis=0)
    return dS


class WeightedArea(BaseAlgorithm):
    """Apply Area Weight."""

    # metadata
    input_nbands: int = 1
    output_nbands: int = 1

    def __call__(self, img: ImageData) -> ImageData:
        """apply weight."""
        assert img.crs == WGS84_CRS, "'WeighedArea' algorithm only work with dataset in EPSG:4326 projection"
        
        b1 = img.array[0].astype("float32")

        # Compute weight array using img.bounds and shape
        lat = numpy.linspace(img.bounds[1], img.bounds[3], img.height)

        area_lat = _spherical_grid_cell_area(lat)
        area_lat_2d = numpy.ones((img.height, img.width)) * area_lat[:, numpy.newaxis]

        weights = surfaceAreaGrid(
            (img.bounds[1], img.bounds[3], img.height),
            (img.bounds[0], img.bounds[2], img.width),
        )

        arr = numpy.where(img.mask, img.array, img.array / weights)
        
        return ImageData(
            arr,
            assets=img.assets,
            crs=img.crs,
            bounds=img.bounds,
            band_names=img.band_names,
        )

I'll do some more testing with our main use case - datasets with WGS84 coordinates. Scientific accuracy is key. If I find that the dst_crs method is not reliable enough, we may need to consider the custom algorithm option you suggested. But for now I expect that we are very close to having all we need.

j08lue · 2023-11-10T10:19:35Z

Hmm, getting a 500 error with a POST request to

https://test-raster.delta-backend.com/cog/statistics?url=s3%3A%2F%2Fveda-data-store-staging%2Fno2-monthly%2FOMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif&dst_crs=%2Bproj%3Dcea

while the request without dst_crs works. Simple example reproducing the issue:

RASTER_API_URL = "https://test-raster.delta-backend.com"

ITEM_ASSET_URL = "s3://veda-data-store-staging/no2-monthly/OMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif"

AOI = {
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "coordinates": [
          [
            [
              -115,
              82
            ],
            [
              -115,
              -82
            ],
            [
              -43,
              -82
            ],
            [
              -43,
              82
            ],
            [
              -115,
              82
            ]
          ]
        ],
        "type": "Polygon"
      }
    }
  ]
}

result = requests.post(
    f"{RASTER_API_URL}/cog/statistics",
    params={
        "url": ITEM_ASSET_URL,
        "dst_crs": "+proj=cea"
    },
    json=AOI,
)
result.raise_for_status()
result.json()

Do you have access to the function logs to see what is going wrong, @vincentsarago?

vincentsarago · 2023-11-10T11:03:23Z

Note:

https://test-raster.delta-backend.com/cog/statistics?url=s3%3A%2F%2Fveda-data-store-staging%2Fno2-monthly%2FOMI_trno2_0.10x0.10_202109_Col3_V4.nc.tif&dst_crs=%2Bproj%3Dcea

dst_crs is not available for /statistics [GET] only for the [POST] geojson endpoint

There might be a bug in rio-tiler or titiler

[ERROR]	2023-11-10T10:51:37.886Z	49d3191d-81ad-4546-801f-0ad3e7855b5d	An error occurred running the application.
Traceback (most recent call last):
  File "/tmp/pip-target-f75zho4z/lib/python/mangum/protocols/http.py", line 58, in run
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/applications.py", line 1106, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/applications.py", line 122, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/errors.py", line 184, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/errors.py", line 162, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/titiler/core/middleware.py", line 63, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette_cramjam/middleware.py", line 112, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette_cramjam/middleware.py", line 142, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/cors.py", line 83, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/exceptions.py", line 79, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/middleware/exceptions.py", line 68, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/middleware/asyncexitstack.py", line 20, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/middleware/asyncexitstack.py", line 17, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 718, in __call__
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 276, in handle
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/routing.py", line 66, in app
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/routing.py", line 274, in app
  File "/tmp/pip-target-f75zho4z/lib/python/fastapi/routing.py", line 193, in run_endpoint_function
  File "/tmp/pip-target-f75zho4z/lib/python/starlette/concurrency.py", line 41, in run_in_threadpool
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/to_thread.py", line 33, in run_sync
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
  File "/tmp/pip-target-f75zho4z/lib/python/anyio/_backends/_asyncio.py", line 807, in run
  File "/tmp/pip-target-f75zho4z/lib/python/titiler/core/factory.py", line 482, in geojson_statistics
  File "/tmp/pip-target-f75zho4z/lib/python/rio_tiler/models.py", line 826, in get_coverage_array
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/env.py", line 398, in wrapper
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/env.py", line 630, in wrapper
  File "/tmp/pip-target-f75zho4z/lib/python/rasterio/warp.py", line 104, in transform_geom
  File "rasterio/_warp.pyx", line 114, in rasterio._warp._transform_geom
  File "rasterio/_warp.pyx", line 71, in rasterio._warp._transform_single_geom
  File "rasterio/_features.pyx", line 693, in rasterio._features.OGRGeomBuilder.build
ValueError: Unsupported geometry type Feature

☝️ this is a log from titiler.xyz (devseed aws)

vincentsarago · 2023-11-10T11:15:58Z

fix cogeotiff/rio-tiler#653

Sadly this will require a new deployment to catch newest rio-tiler version

j08lue · 2023-11-13T18:20:24Z

I can confirm that with this bug fix (testing against dev.ghg.center/api/raster), the method to reproject the data before extracting statistics gives the same average as calculating weights by hand with the geographic coordinates formula.

I will prepare the notebooks for our (GHG Center and VEDA) docs, so we have somewhere to point to for others to validate our methods.

j08lue assigned j08lue and vincentsarago Oct 19, 2023

vincentsarago mentioned this issue Nov 10, 2023

feat: update titiler and titiler-pgstac requirement NASA-IMPACT/veda-backend#246

Closed

j08lue mentioned this issue Nov 14, 2023

Add notebook for titiler zonal statistics method validation NASA-IMPACT/veda-docs#107

Merged

10 tasks

j08lue closed this as completed Nov 15, 2023

j08lue mentioned this issue Dec 7, 2023

Validate API results for grid cell area weighted zonal statistics #251

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compare different zonal statistics method to quantify the effect of area weights and cross-validate the methods #209

Compare different zonal statistics method to quantify the effect of area weights and cross-validate the methods #209

j08lue commented Oct 19, 2023

j08lue commented Oct 19, 2023 •

edited

Loading

vincentsarago commented Nov 2, 2023

vincentsarago commented Nov 2, 2023

j08lue commented Nov 2, 2023

j08lue commented Nov 2, 2023

j08lue commented Nov 3, 2023

j08lue commented Nov 9, 2023

vincentsarago commented Nov 9, 2023 •

edited

Loading

j08lue commented Nov 10, 2023 •

edited

Loading

vincentsarago commented Nov 10, 2023

j08lue commented Nov 10, 2023

j08lue commented Nov 10, 2023 •

edited

Loading

vincentsarago commented Nov 10, 2023

vincentsarago commented Nov 10, 2023

j08lue commented Nov 13, 2023

Compare different zonal statistics method to quantify the effect of area weights and cross-validate the methods #209

Compare different zonal statistics method to quantify the effect of area weights and cross-validate the methods #209

Comments

j08lue commented Oct 19, 2023

j08lue commented Oct 19, 2023 • edited Loading

Main takeaways

Next steps

vincentsarago commented Nov 2, 2023

vincentsarago commented Nov 2, 2023

j08lue commented Nov 2, 2023

j08lue commented Nov 2, 2023

j08lue commented Nov 3, 2023

j08lue commented Nov 9, 2023

tl;dr

vincentsarago commented Nov 9, 2023 • edited Loading

Next

j08lue commented Nov 10, 2023 • edited Loading

vincentsarago commented Nov 10, 2023

j08lue commented Nov 10, 2023

j08lue commented Nov 10, 2023 • edited Loading

vincentsarago commented Nov 10, 2023

vincentsarago commented Nov 10, 2023

j08lue commented Nov 13, 2023

j08lue commented Oct 19, 2023 •

edited

Loading

vincentsarago commented Nov 9, 2023 •

edited

Loading

j08lue commented Nov 10, 2023 •

edited

Loading

j08lue commented Nov 10, 2023 •

edited

Loading