Datacube #27

Merged · 21 commits merged into main from datacube_fusion · Nov 15, 2023

Conversation

@lillythomas (Contributor):

This script collects overlapping Sentinel-1, Sentinel-2 and DEM data for an AOI and time range, aligns them to a common extent, and merges the data arrays. (A rough sketch of the search-and-stack flow follows the to-do list.)

To do:

  • fix the search_sentinel1_calc_max_area() function
  • choose the time to reset the merged DataArray's time attribute to (right now it holds all of the respective times of its child arrays in the time dimension)
  • add the tiler
  • filter cloudiness on tile level
  • filter out tiles with high percent of nodata for any specific channel
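
For orientation, here is a minimal sketch of the search-and-stack flow, assuming pystac-client against a STAC API (the endpoint, collection name, AOI and dates below are illustrative, not the script's actual values):

```python
# Sketch only: STAC search + lazy stacking with stackstac.
import pystac_client
import stackstac

# Hypothetical endpoint and AOI; the real script's values may differ.
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects=aoi_geojson,  # hypothetical AOI geometry (GeoJSON dict)
    datetime="2018-01-01/2018-12-31",
)
s2_items = search.item_collection()

# Stack the matching items into one lazy, dask-backed DataArray.
da_sen2 = stackstac.stack(items=s2_items, resolution=10)
```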

@srmsoumya srmsoumya (Collaborator) left a comment

Looks super neat, the datacube pipeline is coming along well.

datacube.py (outdated):

```python
"args": [
    {
        "op": "s_intersects",
        "args": [{"property": "geometry"}, geom_CENTROID.__geo_interface__],
```

@srmsoumya (Collaborator):

CENTROID is a Point object; maybe we can use that directly instead of geom_CENTROID.
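
i.e., something like this (sketch; assumes CENTROID is a shapely Point, which already exposes __geo_interface__):

```python
# Sketch: pass the Point's GeoJSON mapping straight into the CQL2 filter.
filter_arg = {
    "op": "s_intersects",
    "args": [{"property": "geometry"}, CENTROID.__geo_interface__],
}
```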

@lillythomas (Contributor, Author):

Nice! Yes we can. Implemented.

datacube.py (outdated):

```python
"""
da_sen2: xr.DataArray = stackstac.stack(
    items=s2_items[0],
    epsg=26910,  # UTM Zone 10N
```

@srmsoumya (Collaborator):

Can we read the projection from the s2 item collection?

@lillythomas (Contributor, Author):

I assigned this to the same projection used for the S2 item.

datacube.py (outdated), comment on lines 296 to 307:
da_sen1.attrs["spec"] = str(da_sen1.spec)

# To fix ValueError: unable to infer dtype on variable None
for key, val in da_sen1.coords.items():
if val.dtype == "object":
print("Deleting", key)
da_sen1 = da_sen1.drop_vars(names=key)

# Create xarray.Dataset datacube with VH and VV channels from SAR
da_vh: xr.DataArray = da_sen1.sel(band="vh", drop=True).rename("vh")
da_vv: xr.DataArray = da_sen1.sel(band="vv", drop=True).rename("vv")
ds_sen1: xr.Dataset = xr.merge(objects=[da_vh, da_vv], join="override")
@srmsoumya (Collaborator):

`da_sen1 = stackstac.mosaic(da_sen1, dim="time")` will create a composite of the S1 tiles overlapping the S2 bbox. We can use that directly in the merge and don't have to worry about a pixel-picking strategy.

@lillythomas (Contributor, Author):

Word. I like this. Thanks.

@weiji14 (Contributor):

So stackstac.mosaic returns an xarray.DataArray, whereas xarray.merge returns an xarray.Dataset. The main difference is that xarray.Dataset allows us to clearly show named data variables (e.g. VV, VH, B1, B2, etc), whereas xarray.DataArray stores these in the 'band' dimension. Do we have a preference for either format?

@srmsoumya (Collaborator):

Having named data vars helps. stackstac.mosaic takes care of merging the pixels along the given dim (time in our case), which we don't have to handle at our end.
What if we create the data arrays and then concat them together as a dataset with vars attached?
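
A sketch of that idea, assuming da_sen1 is the stacked S1 DataArray with a "band" dimension (xarray's to_dataset can promote a dimension to named data variables):

```python
# Sketch: composite along time, then promote the band dim to named data vars.
da_sen1 = stackstac.mosaic(da_sen1, dim="time")
ds_sen1 = da_sen1.to_dataset(dim="band")  # yields named "vh" and "vv" variables
```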

@lillythomas (Contributor, Author):

We still retain the named data vars, I think. Merged data array using the mosaic method for Sentinel-1:

```
Merged datarray:  <xarray.Dataset>
Dimensions:                                     (time: 1, band: 14, x: 11365,
                                                 y: 11363)
Coordinates: (12/67)
  * time                                        (time) datetime64[ns] 2018-06...
  * band                                        (band) <U4 'B02' 'B03' ... 'vv'
  * x                                           (x) float64 2.973e+05 ... 4.1...
  * y                                           (y) float64 4.202e+06 ... 4.0...
    id                                          (time) <U54 'S2B_MSIL2A_20180...
    s2:generation_time                          <U24 '2020-10-12T02:49:45.926Z'
    ...                                          ...
    sar:instrument_mode                         <U2 'IW'
    sar:product_type                            <U3 'GRD'
    s1:product_timeliness                       <U8 'Fast-24h'
    s1:resolution                               <U4 'high'
    description                                 (band) object nan ... 'Terrai...
    proj:shape                                  object {3600}
Data variables:
    stackstac-95bc7f7975c3ca621cbffb094f33316e  (time, band, y, x) float32 dask.array<chunksize=(1, 1, 1024, 1024), meta=np.ndarray>
    stackstac-80df347b660c0b771e7ba977f16777bc  (band, y, x) float32 dask.array<chunksize=(13, 1024, 1024), meta=np.ndarray>
    stackstac-602439b1960e897b8044c8056eb3ce96  (band, y, x) float32 dask.array<chunksize=(14, 1024, 1024), meta=np.ndarray>
Attributes:
    spec:        RasterSpec(epsg=32611, bounds=(297340, 4088320, 410990, 4201...
    crs:         epsg:32611
    transform:   | 10.00, 0.00, 297340.00|\n| 0.00,-10.00, 4201950.00|\n| 0.0...
    resolution:  10
```

datacube.py (outdated), comment on lines 182 to 184:
```python
# TODO: Find a way to get minimal number of S1 tiles to cover the entire S2 bbox
s1_gdf["overlap"] = s1_gdf.intersection(box(*BBOX)).area
s1_gdf.sort_values(by="overlap", inplace=True)
```
@srmsoumya (Collaborator):

Pick tiles from the ascending or descending capture based on their overlap with the S2 bbox:

```python
# Pick whichever orbit state (ascending or descending) has the larger total overlap.
state = (
    s1_gdf[["sat:orbit_state", "overlap"]]
    .groupby(["sat:orbit_state"])
    .sum()
    .sort_values(by="overlap", ascending=False)
    .index[0]
)

# Then filter s1_gdf down to that orbit state.
s1_gdf = s1_gdf[s1_gdf["sat:orbit_state"] == state]
```

@lillythomas (Contributor, Author):

Implemented

datacube.py (outdated)

@weiji14 (Contributor):

Do we want to have this file in the top-level directory, or in another folder?

@lillythomas (Contributor, Author):

No, I think we can have this in a more appropriate subdir. I just put it here since we haven't decided on the structure yet. I'm happy to make the call, and we can modify as we see fit.

@weiji14 (Contributor):

Maybe put it in a subdir; I'll leave the naming up to you 🙂

@weiji14 (Contributor):

Actually, I think @yellowcap is putting things under scripts/ in #29, so we could follow that?

datacube.py (outdated):

```python
return da_merge


def process(year1, year2, aoi, resolution, epsg):
```
@weiji14 (Contributor):

Not sure about needing to pass in an EPSG code here. For wide areas (in the West-to-East direction) that can span multiple UTM zones, wouldn't this force a single EPSG instead of multiple?

@srmsoumya (Collaborator):

We are trying to overlay data from S1 & DEMs onto S2 tiles. As S2 tiles follow the MGRS format, each tile will fall under a single UTM zone (this is my understanding, correct me if I am wrong).
S1 & DEMs might span multiple UTM zones, but stackstac should reproject all of them together.

@lillythomas (Contributor, Author):

That makes sense to me. I guess I should switch this to just read the projection based on the MGRS tile's UTM zone instead of having a user argument for it.

@lillythomas (Contributor, Author):

Reading the EPSG code directly from the Sentinel-2 item properties now.
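
Something along these lines (sketch; assumes the items carry the STAC projection extension's proj:epsg property, which Sentinel-2 L2A catalogs typically do):

```python
# Sketch: derive the EPSG code from the first S2 item instead of a user argument.
epsg = s2_items[0].properties["proj:epsg"]
```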

datacube.py (outdated):

```python
    s2_items, s1_items, dem_items, BBOX, resolution, epsg
)

da_merge = merge_datarrays(da_sen2, da_sen1, da_dem)
```
@weiji14 (Contributor):

As a sanity check, could you plot each individual band from this datacube and see if they look OK? I've had issues with striped outputs when running xr.merge before on Sentinel-1 and Copernicus DEM, and just want to double-check.
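
For a quick visual check, something like this sketch (assumes da_merge is a DataArray with a "band" dimension):

```python
# Sketch: plot every band to eyeball striping or nodata artifacts.
import matplotlib.pyplot as plt

for band in da_merge.band.values:
    da_merge.sel(band=band).squeeze().plot.imshow(robust=True)
    plt.title(str(band))
    plt.show()
```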

@weiji14 (Contributor):

Should be addressed in #27 (comment)

@lillythomas (Contributor, Author):

Ok. As of now, we are able to confidently generate a datacube with S2, S1 and DEM bands concatenated. The time dimension is that of the "best" S2 image in terms of cloud coverage, and the S1 images are all of the same orbit state.
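
The "best" pick can be as simple as this sketch (assumes the S2 items expose the standard eo:cloud_cover property):

```python
# Sketch: keep the timestamp of the least-cloudy Sentinel-2 item.
best_s2 = min(s2_items, key=lambda item: item.properties["eo:cloud_cover"])
```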

Some validation results:

All bands:
[screenshot: list of all band names]

Time var:
[screenshot: time coordinate value]

Some subset plots from each sensor:
[four screenshots: subset plots from the S2, S1 and DEM bands]

@weiji14 @srmsoumya if it looks good to you both, we can merge and move on to the tiling part.

@weiji14 weiji14 (Contributor) left a comment

Thanks Lilly, I think we should try to merge this by today. Just a few minor suggestions, and we can improve on stuff in follow-up PRs.

@weiji14 weiji14 mentioned this pull request Nov 14, 2023
@weiji14 weiji14 (Contributor) left a comment

Linter is complaining a bit, but we can fix those later since the script will need some refining. Thanks again @lillythomas, and also @srmsoumya for reviewing!

@lillythomas lillythomas merged commit 3608d6d into main Nov 15, 2023
1 of 2 checks passed
@lillythomas lillythomas deleted the datacube_fusion branch November 15, 2023 00:15
@weiji14 weiji14 added the data-pipeline Pull Requests about the data pipeline label Nov 20, 2023
Successfully merging this pull request may close these issues:

  • DEM specs and retrieval
  • Sentinel 1 input spec and retrieval
  • Sentinel 2 input spec and retrieval