Let LightningDataModule return spatiotemporal metadata #66

Merged
weiji14 merged 5 commits into main from datamodule/spatiotemporal-metadata
Dec 6, 2023

Contributor

@weiji14 weiji14 commented Dec 4, 2023

Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as:

  • bounding box (bbox): (xmin, ymin, xmax, ymax)
  • coordinate reference system (epsg): EPSG code
  • date: YYYY-MM-DD string format

Note:

  • The bbox and epsg refer to the raster image's native UTM projection for now, rather than lon/lat coordinates.

This is part 1/2 of adding spatiotemporal metadata to the output embedding table later, as mentioned at #35 (comment).
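Since the bbox is expressed in the raster's native UTM projection, a consumer needs the EPSG integer to interpret the coordinates. Below is a minimal sketch, assuming the rasters use WGS 84 / UTM zone codes (EPSG 32601 to 32660 for the northern hemisphere, 32701 to 32760 for the southern). `describe_utm_epsg` is a hypothetical helper for illustration and is not part of this PR.

```python
def describe_utm_epsg(epsg: int) -> tuple[int, str]:
    """Return (zone number, hemisphere) for a WGS 84 / UTM EPSG code.

    Hypothetical helper: it only recognises the WGS 84 / UTM ranges and
    raises for any other code (e.g. EPSG:4326 lon/lat).
    """
    if 32601 <= epsg <= 32660:
        return epsg - 32600, "N"
    if 32701 <= epsg <= 32760:
        return epsg - 32700, "S"
    raise ValueError(f"EPSG:{epsg} is not a WGS 84 / UTM zone code")


zone, hemisphere = describe_utm_epsg(32633)
print(f"UTM zone {zone}{hemisphere}")
```

Two tiles with different EPSG codes have bbox values in different coordinate systems, so they should not be compared directly without reprojecting.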

Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.
@weiji14 weiji14 self-assigned this Dec 4, 2023
Comment on lines +32 to +33
# Get date
date: str = pathlib.Path(filepath).name[15:25] # YYYY-MM-DD format
Contributor Author

@weiji14 weiji14 Dec 4, 2023

I don't like that we have to parse the date from the filename in a hardcoded way. @yellowcap, I hinted at this in #54 (comment), but would it be possible to save the datetime information in the GeoTIFF's metadata somehow?
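To illustrate the concern with the fixed `[15:25]` slice, here is a minimal sketch of a less position-dependent alternative: search the filename for a YYYY-MM-DD pattern and validate it. The function name and the example filename are hypothetical; the actual tile naming scheme is not shown in this PR.

```python
import datetime
import pathlib
import re

# Matches the first YYYY-MM-DD run in a filename, wherever it appears.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def date_from_filename(filepath: str) -> str:
    """Extract a validated YYYY-MM-DD string from a file name (hypothetical helper)."""
    name = pathlib.Path(filepath).name
    match = DATE_PATTERN.search(name)
    if match is None:
        raise ValueError(f"No YYYY-MM-DD date found in {name!r}")
    # fromisoformat raises ValueError for impossible dates such as 2023-13-40.
    datetime.date.fromisoformat(match.group())
    return match.group()


# Hypothetical filename, chosen so that the date sits at characters 15 to 25.
filepath = "data/claytile_12ABC_2023-12-04_v01.tif"
print(date_from_filename(filepath))
```

For this example filename the result matches the hardcoded slice, but the regex version keeps working if the prefix length changes.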

Member

Addressed in #72

Member

After the next pipeline run you'll be able to get the date using

dataset.tags()["date"]
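A minimal sketch of how the datamodule could prefer the tag and fall back to the filename for tiles produced before that pipeline run. It assumes the `date` tag holds the same YYYY-MM-DD string; `get_date` and `FakeDataset` are hypothetical names, with `FakeDataset` standing in for an opened rasterio dataset so the snippet runs without any GeoTIFF on disk.

```python
import pathlib
from typing import Any


def get_date(dataset: Any, filepath: str) -> str:
    """Prefer the GeoTIFF 'date' tag; fall back to slicing the filename."""
    tags = dataset.tags()  # rasterio datasets return a dict of tag name -> value
    if "date" in tags:
        return tags["date"]
    # Older tiles without the tag: same hardcoded slice as in this PR.
    return pathlib.Path(filepath).name[15:25]


class FakeDataset:
    """Stand-in for an opened rasterio dataset, used only for illustration."""

    def __init__(self, tags: dict):
        self._tags = tags

    def tags(self) -> dict:
        return dict(self._tags)


# Hypothetical filename; the tag wins when present.
filepath = "claytile_12ABC_2023-12-04_v01.tif"
print(get_date(FakeDataset({"date": "2023-12-05"}), filepath))
print(get_date(FakeDataset({}), filepath))
```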

Contributor Author

Awesome, can't wait!

Reduce duplicate code by using pytest.mark.parametrize, looping over fit and predict stages.
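The parametrization described in this commit can be sketched roughly as follows. This is not the actual test from the repository: `make_batch` is a hypothetical stand-in for setting up the datamodule and pulling one batch, and plain Python values replace the tensors.

```python
import pytest


def make_batch(stage: str) -> dict:
    """Hypothetical stand-in for datamodule.setup(stage) plus fetching one batch.

    A real test would pick the train or predict dataloader based on `stage`.
    """
    return {
        "image": [[0.0]],
        "bbox": [0.0, 0.0, 1.0, 1.0],  # xmin, ymin, xmax, ymax
        "epsg": 32633,
        "date": "2023-12-04",
    }


# One test body runs once per stage instead of two near-identical tests.
@pytest.mark.parametrize("stage", ["fit", "predict"])
def test_batch_has_metadata(stage):
    batch = make_batch(stage)
    assert set(batch) == {"image", "bbox", "epsg", "date"}
    assert len(batch["bbox"]) == 4
```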
Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.
@weiji14 weiji14 marked this pull request as ready for review December 5, 2023 01:16
@weiji14 weiji14 added the data-pipeline Pull Requests about the data pipeline label Dec 5, 2023
# Get date
date: str = pathlib.Path(filepath).name[15:25] # YYYY-MM-DD format

return {"image": tensor, "bbox": bbox, "crs": crs, "date": date}
Contributor Author

Tempted to store this in an Arrow Table with FixedShapeTensorArray for the image and bbox 'columns'. Possibly revisit this in the future.

Contributor Author

weiji14 commented Dec 5, 2023

Gonna leave this up for review for a day or so before merging. Once merged, I'll proceed to work on part 2/2, which is to get the model to output embeddings with spatiotemporal metadata columns!

Member

@yellowcap yellowcap left a comment

Looks good to me. Just one small comment on variable name.

bbox: torch.Tensor = torch.as_tensor(  # xmin, ymin, xmax, ymax
    data=dataset.bounds, dtype=torch.float64
)
crs: torch.Tensor = torch.as_tensor(data=dataset.crs.to_epsg(), dtype=torch.int32)
Member

This variable could be called epsg to make clear it's the CRS as an EPSG integer.

Contributor Author

Was considering this actually, I've changed it in 6a0e8cf!

Since we're storing the EPSG integer and not the CRS representation.
@weiji14 weiji14 merged commit bf485fe into main Dec 6, 2023
2 checks passed
@weiji14 weiji14 deleted the datamodule/spatiotemporal-metadata branch December 6, 2023 23:44
weiji14 added a commit that referenced this pull request Dec 8, 2023
Should have updated the type hints in #66, but might as well do it here. Also added some more inline comments and fixed a typo.
weiji14 added a commit that referenced this pull request Dec 8, 2023
* ✨ Save embeddings with spatiotemporal metadata to GeoParquet

Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.

* 📝 Document how embeddings are generated and saved to geoparquet

Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.

* 📝 Mention in main README.md that embeddings are saved to geoparquet

Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.

* 🎨 Update type hint of batch inputs, and add some inline comments

Should have updated the type hints in #66, but might as well do it here. Also added some more inline comments and fixed a typo.
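To make the date32 and WKB storage formats mentioned in the first commit above more concrete, here is a standard-library sketch of what those two encodings look like for one row. In the actual commit geopandas and Arrow handle this; the helper names below are hypothetical and the sketch only illustrates the byte-level formats.

```python
import datetime
import struct


def date_to_date32(date_string: str) -> int:
    """Arrow date32 stores a date as the number of days since 1970-01-01."""
    epoch = datetime.date(1970, 1, 1)
    return (datetime.date.fromisoformat(date_string) - epoch).days


def bbox_to_wkb_polygon(xmin: float, ymin: float, xmax: float, ymax: float) -> bytes:
    """Encode a bounding box as a little-endian WKB Polygon with a single ring."""
    # Closed ring: the first and last points are identical.
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    # 1 = little-endian byte order, 3 = Polygon type, 1 ring, then the point count.
    wkb = struct.pack("<BIII", 1, 3, 1, len(ring))
    for x, y in ring:
        wkb += struct.pack("<dd", x, y)
    return wkb


days = date_to_date32("2023-12-04")
geometry = bbox_to_wkb_polygon(499980.0, 5300040.0, 609780.0, 5409840.0)
print(days, len(geometry))
```

The bounding box values in the usage lines are made-up UTM-style coordinates, not taken from any real tile.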
brunosan pushed a commit that referenced this pull request Dec 27, 2023
* 🗃️ Let LightningDataModule return spatiotemporal metadata

Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.

* ♻️ Refactor test_geotiffdatapipemodule to use parametrization

Reduce duplicate code by using pytest.mark.parametrize, looping over fit and predict stages.

* 📝 Document returned outputs from _array_to_torch function

Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.

* 🚚 Rename crs to epsg

Since we're storing the EPSG integer and not the CRS representation.
brunosan pushed a commit that referenced this pull request Dec 27, 2023
Labels
data-pipeline Pull Requests about the data pipeline
2 participants