
Dask load returning all nan values #1479

Open · JackLidge opened this issue Jul 24, 2023 · 3 comments

@JackLidge
Expected behaviour

I'm trying to load a long timeseries with dask to compute mean values for the area over time. Here's the code for loading the data and computing the mean without using dask:

dc = datacube.Datacube()
query = {
    'x': [-7.43862660215137, -4.377843103769999],
    'y': [54.02144007175762, 56.83505068441397],
    'time': ['2017-01-01', '2017-02-01'],
    'output_crs': 'EPSG:32630',
    'resolution': [100, 100],
    'group_by': 'solar_day',
    # 'dask_chunks': {'time': 1}
}
rainfall = dc.load(product='cmorph_rainfall', **query)
rainfall_list = rainfall.rainfall.mean(dim=['x', 'y']).values

and rainfall_list returns:

array([1.2073916e-03, 0.0000000e+00, 1.3340176e+00, 0.0000000e+00,
       6.5343648e-01, 1.4796243e+01, 5.5638492e-01, 3.7439067e+00,
       7.6677237e+00, 3.8516700e+00, 5.1759934e-01, 1.3966684e-01,
       5.4357775e-02, 8.6581379e-01, 4.9658670e+00, 1.5595733e+00,
       1.6349301e-01, 1.8843752e-01, 1.0391021e-04, 8.2616024e-02,
       1.3581085e-01, 1.2322235e-01, 1.0261511e-04, 2.4435787e+00,
       3.7526662e+00, 0.0000000e+00, 4.2349246e-01, 3.7233862e-01,
       9.1776066e-02, 3.3019252e+00, 1.0951433e+01, 1.4380445e+00],
      dtype=float32)

Actual behaviour

When I load the data using dask and then run compute, it returns all NaNs, both for rainfall_list and when selecting a single timestep.

dc = datacube.Datacube()
query = {
    'x': [-7.43862660215137, -4.377843103769999],
    'y': [54.02144007175762, 56.83505068441397],
    'time': ['2017-01-01', '2017-02-01'],
    'output_crs': 'EPSG:32630',
    'resolution': [100, 100],
    'group_by': 'solar_day',
    'dask_chunks': {'time': 1}
}
rainfall = dc.load(product='cmorph_rainfall', **query)
rainfall_list = rainfall.rainfall.mean(dim=['x', 'y']).values

this returns:

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan], dtype=float32)

Steps to reproduce the behaviour

I've loaded a Sentinel-2 image for the same area and time period using dask, and that loads correctly, so either I am doing something wrong when using dask to calculate a 2D mean, or the images themselves are registered oddly.

The product yaml for cmorph_rainfall is:

name: cmorph_rainfall
description: NOAA Climate Data Record (CDR) of CPC Morphing Technique (CMORPH) High Resolution Global Precipitation Estimates, Version 1. From https://www.ncei.noaa.gov/access/metadata>
metadata_type: eo3

metadata:
  product:
    name: cmorph_rainfall

measurements:
  - name: rainfall
    dtype: float32
    nodata: NaN
    units: "degree"

and one of the dataset yamls looks as follows:

---
# Dataset
$schema: https://schemas.opendatacube.org/dataset
id: 1b668dbc-89f0-4b8f-bf0c-e3a16fd43dcb

label: cmorph_rainfall_2022-01-01
product:
  name: cmorph_rainfall

crs: epsg:4326
geometry:
  type: Polygon
  coordinates: [[[-180.0, 60.0], [180.0, 60.0], [180.0, -60.0], [-180.0, -60.0], [
        -180.0, 60.0]]]
grids:
  default:
    shape: [480, 1440]
    transform: [0.25, 0.0, -180.0, 0.0, 0.25, -60.0, 0.0, 0.0, 1.0]

properties:
  datetime: 2022-01-01 00:00:00Z
  odc:processing_datetime: 2023-03-28 14:09:45.581911Z
  odc:product_family: cmorph_rainfall

measurements:
  rainfall:
    path: cmorph_0.25_daily_20220101.tif

accessories: {}

lineage: {}
...

I used eodatasets3 to generate the yamls (both the data and the yamls are in the same S3 bucket), with the following script being used to create them:

with DatasetPrepare(metadata_path=Path(local_yamlpath)) as p:
    p.product_family = 'cmorph_rainfall'
    p.datetime = start_date.strftime('%Y%m%dT%H%M%S')
    p.processed_now()

    p.note_measurement('rainfall', Path(local_tifpath))
    p.done()

Environment information

datacube version: 1.8.12

Any pointers on this greatly appreciated!


@Kirill888 (Member)

Probably due to this issue: #957

@SpacemanPaul (Contributor)

I agree with Kirill. Datasets that span -180 to 180 longitude in native coordinates are problematic. You can hack around it by manually changing the dataset extent to something like -179.99999 to 180. It's a bit ugly, but it works.
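For reference, a minimal sketch of that workaround applied to the geometry polygon from the dataset yaml above. The helper name and the epsilon value are mine, not part of datacube or eodatasets3; the idea is just to nudge the western edge slightly east so the polygon no longer spans the full -180..180 range before re-indexing the dataset:

```python
# Hypothetical helper: shift any exact -180.0 longitude slightly east so
# the polygon no longer spans the full antimeridian-to-antimeridian range.
EPS = 1e-5  # arbitrary small nudge, per the suggestion above

def nudge_global_lon(coords):
    """Return a copy of GeoJSON-style polygon coordinates with -180.0
    longitudes replaced by -180.0 + EPS."""
    return [
        [[lon + EPS if lon == -180.0 else lon, lat] for lon, lat in ring]
        for ring in coords
    ]

# Polygon from the dataset yaml in the issue body:
polygon = [[[-180.0, 60.0], [180.0, 60.0], [180.0, -60.0],
            [-180.0, -60.0], [-180.0, 60.0]]]
nudged = nudge_global_lon(polygon)
```

The nudged coordinates would then replace the `geometry.coordinates` entry in the dataset yaml (and the dataset re-indexed) for the dask-backed load to read actual data.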

@JackLidge (Author)

I'll give it a go with that hack!
