Allow ClayDataModule to load GeoTIFF files directly from s3 #92

Merged
weiji14 merged 7 commits into main from predict-from-s3 on Dec 19, 2023

Conversation

weiji14

@weiji14 weiji14 commented Dec 19, 2023

Following the work done in #85 on the GeoTIFFDataPipeModule, this PR lets ClayDataModule load GeoTIFF files from an s3 bucket as well, plus a few minor tweaks to align the two LightningDataModules.

The implementation uses torchdata's S3FileLister to get the files, but returns a list instead of an iterator.
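
A minimal sketch of the listing step, not the PR's actual code. The function name `list_chip_paths`, the `*.tif` mask, and the injectable `s3_lister` argument are assumptions made for illustration. The s3 branch also assumes a torchdata release that still ships datapipes and whose `S3FileLister` accepts a `masks` argument.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def _torchdata_s3_lister(url: str) -> Iterable[str]:
    """List *.tif objects under an s3 prefix (assumes torchdata with datapipes)."""
    # Imported lazily so the local pathway works without torchdata installed.
    from torchdata.datapipes.iter import IterableWrapper, S3FileLister

    return S3FileLister(IterableWrapper([url]), masks="*.tif")


def list_chip_paths(
    data_path: str,
    s3_lister: Optional[Callable[[str], Iterable[str]]] = None,
) -> List[str]:
    """Return GeoTIFF paths under a local folder or an s3:// prefix as a list."""
    if data_path.startswith("s3://"):
        lister = s3_lister if s3_lister is not None else _torchdata_s3_lister
        # Materialise the iterator so the result can be indexed and measured.
        return list(lister(data_path))
    # Local pathway: a plain (non-recursive) glob, chosen here for illustration.
    return [str(p) for p in Path(data_path).glob("*.tif")]
```

Returning a list rather than an iterator means the result supports `len()` and integer indexing, which a map-style dataset needs.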

TODO:

Continuing on from #91, this PR is part 2/3 of the work towards generating new embeddings from the model developed in #47.

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.
Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule
@weiji14 weiji14 added the data-pipeline Pull Requests about the data pipeline label Dec 19, 2023
@weiji14 weiji14 self-assigned this Dec 19, 2023
The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
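
A small sketch of the pitfall and the fix described in this commit. The helper name `resolve_source_url` is hypothetical, and the `getattr` line shows only one plausible form of the earlier attempt, not necessarily the code that was replaced.

```python
from pathlib import Path
from typing import Union


def resolve_source_url(chip_path: Union[Path, str]) -> str:
    """Absolute path for local files, unchanged string for s3 URLs."""
    try:
        # Path.absolute is a method, so it has to be called with brackets.
        return str(chip_path.absolute())
    except AttributeError:
        # Plain strings (e.g. s3 URLs) have no .absolute() method.
        return str(chip_path)


# The getattr pitfall (assumed form): this returns the bound method itself,
# not an absolute path, because the attribute is looked up but never called.
not_a_path = getattr(Path("chip.tif"), "absolute", "chip.tif")
```

With this helper, `resolve_source_url("s3://bucket/chip.tif")` passes the URL through untouched, while a `pathlib.Path` is expanded to its absolute form.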
Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.
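
As a rough illustration of the settings named in this commit, the sketch below builds the keyword arguments for a prediction `DataLoader`. The function name `predict_loader_kwargs` and its parameters are hypothetical; only the disabled `shuffle` and `pin_memory` flags come from the commit message.

```python
from typing import Any, Dict


def predict_loader_kwargs(batch_size: int, num_workers: int) -> Dict[str, Any]:
    """Keyword arguments for a prediction DataLoader (hypothetical helper)."""
    return {
        "batch_size": batch_size,
        "num_workers": num_workers,
        # Keep chips in listing order so outputs can be matched to inputs.
        "shuffle": False,
        # Disabled for prediction, as stated in the commit message.
        "pin_memory": False,
    }


# Intended use inside a LightningDataModule (assumes torch is installed):
#     def predict_dataloader(self):
#         return torch.utils.data.DataLoader(
#             self.prd_ds, **predict_loader_kwargs(self.batch_size, self.num_workers)
#         )
# `self.prd_ds` is an assumed attribute name for the prediction dataset.
```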
Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.
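
To make the determinism point concrete, here is a hedged sketch of a configurable train/validation split. The function name `split_chips`, the default ratio of 0.8, and the decision to sort inside the function are all assumptions for illustration; the commit message only states that `split_ratio` became configurable and that the test compares sorted lists.

```python
from typing import List, Sequence, Tuple


def split_chips(
    chips: Sequence[str], split_ratio: float = 0.8
) -> Tuple[List[str], List[str]]:
    """Split chip paths into (train, validation) lists by split_ratio."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    # Sorting first (an assumption of this sketch) removes any dependence
    # on the order in which the filesystem or bucket listed the files.
    ordered = sorted(chips)
    split = int(len(ordered) * split_ratio)
    return ordered[:split], ordered[split:]
```

Because the input is sorted before slicing, two listings of the same files in different orders produce identical splits, which is what makes an equality assertion in a test reliable.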
@weiji14 weiji14 marked this pull request as ready for review December 19, 2023 08:25
Not just testing one, but two different LightningDataModules now!
Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
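
One way to apply these two options from Python is sketched below. Where the PR actually sets them is not shown here, so the helper name `configure_gdal_env` and the choice to set them through `os.environ` are assumptions. The variables should be in place before any GDAL-based reader opens a dataset.

```python
import os

# The two options named in the commit message.
GDAL_S3_OPTIONS = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
}


def configure_gdal_env(overwrite: bool = False) -> None:
    """Set the GDAL options, keeping user-provided values unless overwrite=True."""
    for key, value in GDAL_S3_OPTIONS.items():
        if overwrite or key not in os.environ:
            os.environ[key] = value


configure_gdal_env()
```

Leaving existing values alone by default lets a user override either option from the shell without editing code.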
@@ -80,13 +80,16 @@ def __getitem__(self, idx):
         cube = self.read_chip(chip_path)
 
         # remove nans and convert to tensor
-        cube["pixels"] = torch.nan_to_num(torch.as_tensor(data=cube["pixels"]), nan=0.0)
+        cube["pixels"] = torch.as_tensor(data=cube["pixels"], dtype=torch.float16)

Heads up @srmsoumya that I've removed the NaN to 0 clipping here, since the new batch of GeoTIFF files shouldn't have NaNs anymore per #68.

@weiji14 weiji14 merged commit df9aff5 into main Dec 19, 2023
2 checks passed
@weiji14 weiji14 deleted the predict-from-s3 branch December 19, 2023 08:40
brunosan pushed a commit that referenced this pull request Dec 27, 2023
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

* 🚚 Rename datacube's path key to source_url

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

* 🚑 Use try-except to get absolute chip_path or fallback to str

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).

* ✨ Implement predict_dataloader for ClayDataModule

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

* ✅ Add parametrized test for checking ClayDataModule

Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.

* 📝 Edit docstrings in test_datamodule.py to be more generic

Not just testing one, but two different LightningDataModules now!

* 🔧 Add GDAL environment variables that might help with s3 loading

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.