
Adapt model to load 512x512 images from s3 bucket #85

Merged. 7 commits merged into main from model/input-512-images on Dec 11, 2023.

Conversation

@weiji14 (Contributor) commented Dec 11, 2023

Modify the ViT MAE model to accept input images of size 512x512 pixels, following #78. This PR also makes a few small enhancements to the datapipe.

TODO:

  • Change hardcoded image_size from 256 to 512 and patch_size from 32 to 64.
  • Allow loading of GeoTIFF files directly from an s3 bucket instead of the local drive.
  • Optimize data loading using a sharding filter.

Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.
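As a quick sanity check on this change (a hypothetical sketch, not code from the PR): doubling both image_size and patch_size together keeps the ViT's sequence length unchanged, since the patch grid stays 8x8 either way.

```python
# Hypothetical sanity check: doubling both image_size and patch_size
# leaves the number of patches (the ViT sequence length) unchanged.

def num_patches(image_size: int, patch_size: int) -> int:
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    side = image_size // patch_size
    return side * side

old = num_patches(image_size=256, patch_size=32)  # 8x8 patch grid
new = num_patches(image_size=512, patch_size=64)  # still an 8x8 patch grid
print(old, new)  # 64 64
```

This is likely why both values were doubled in lockstep: the model sees the same number of tokens, just covering a larger footprint per patch.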
@weiji14 weiji14 self-assigned this Dec 11, 2023
Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72.
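The TIFF DateTime tag stores timestamps as "YYYY:MM:DD HH:MM:SS". A minimal sketch of converting such a tag value into the YYYY-MM-DD string described above (the function name and exact tag-reading mechanism are assumptions; the PR itself reads the tag from the GeoTIFF's metadata):

```python
# Hypothetical sketch: turn a TIFF DateTime tag value ("YYYY:MM:DD HH:MM:SS"
# per the TIFF spec) into a YYYY-MM-DD date string, instead of parsing
# the date out of the filename.
from datetime import datetime

def date_from_tiff_tag(tag_value: str) -> str:
    parsed = datetime.strptime(tag_value, "%Y:%m:%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d")

print(date_from_tiff_tag("2023:12:11 04:12:00"))  # 2023-12-11
```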
New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.
@weiji14 weiji14 changed the title Adapt model to 512x512 input image sizes Adapt model to load 512x512 images from s3 bucket Dec 11, 2023
Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.
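The idea behind a sharding filter can be sketched in plain Python (this mimics the round-robin split across workers; it is not torchdata's actual implementation): each of N workers keeps only every N-th element, so no file is loaded twice.

```python
# Plain-Python sketch of what a sharding filter does: each worker keeps
# every num_workers-th element, so no element is processed by two workers.

def shard(items, worker_id: int, num_workers: int):
    return [item for i, item in enumerate(items) if i % num_workers == worker_id]

files = ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"]
print(shard(files, worker_id=0, num_workers=2))  # ['a.tif', 'c.tif', 'e.tif']
print(shard(files, worker_id=1, num_workers=2))  # ['b.tif', 'd.tif']
```

Without this split, every DataLoader worker would iterate the full file list, duplicating the decoding work.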
Ensure that *.ckpt files in sub-folders are ignored too.
Prevents messages like: "You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance."
@weiji14 weiji14 marked this pull request as ready for review December 11, 2023 04:12
Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
Comment on lines +101 to +108

```python
# Step 1 - Get list of GeoTIFF filepaths from s3 bucket or data/ folder
if self.data_path.startswith("s3://"):
    dp = torchdata.datapipes.iter.IterableWrapper(iterable=[self.data_path])
    self.dp_paths = dp.list_files_by_s3(masks="*.tif")
else:  # if self.data_path is a local data path
    self.dp_paths = torchdata.datapipes.iter.FileLister(
        root=self.data_path, masks="*.tif", recursive=True
    )
```
weiji14 (Contributor, Author):
Note that loading data from s3 is still a bit slower than a local data folder, but it can be useful for cases like one-off inference/prediction when we don't want to store large amounts of data locally.

@weiji14 (Contributor, Author) commented Dec 11, 2023

Gonna merge this PR directly, as it consists of mostly minor tweaks (which I've accumulated over the past few weeks). Some of the options (e.g. patch_size) can be changed later, but thought it'd be good to have a datapipe/model that works with the 512x512 images soon-ish.

@weiji14 weiji14 merged commit 9757bbf into main Dec 11, 2023
2 checks passed
@weiji14 weiji14 deleted the model/input-512-images branch December 11, 2023 04:28
weiji14 added a commit that referenced this pull request Dec 19, 2023
Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.
weiji14 added a commit that referenced this pull request Dec 19, 2023
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

* 🚚 Rename datacube's path key to source_url

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

* 🚑 Use try-except to get absolute chip_path or fallback to str

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
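A sketch of the try-except fallback described in that commit message (the helper name is hypothetical): local chips are pathlib.Path objects with an .absolute() method, while s3 chips are plain strings, so the AttributeError branch keeps them as-is.

```python
# Sketch of the try-except fallback: pathlib.Path objects get resolved to
# an absolute path, while plain strings (e.g. "s3://..." URLs) have no
# .absolute() method, so the AttributeError branch returns them unchanged.
from pathlib import Path

def source_url(chip_path) -> str:
    try:
        return str(chip_path.absolute())  # local pathlib.Path
    except AttributeError:
        return str(chip_path)  # plain str, e.g. an s3:// URL

print(source_url("s3://copernicus-dem-30m/tile.tif"))  # s3://copernicus-dem-30m/tile.tif
```

This avoids the broken getattr approach, which returned the bound method itself because .absolute was never called with parentheses.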

* ✨ Implement predict_dataloader for ClayDataModule

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

* ✅ Add parametrized test for checking ClayDataModule

Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.

* 📝 Edit docstrings in test_datamodule.py to be more generic

Not just testing one, but two different LightningDataModules now!

* 🔧 Add GDAL environment variables that might help with s3 loading

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
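A minimal sketch of applying those two GDAL config options as environment variables (they must be set before GDAL or rasterio opens any file): EMPTY_DIR skips a directory listing on open, and merging consecutive HTTP range requests reduces round-trips when reading Cloud-Optimized GeoTIFFs over s3.

```python
# Sketch: set the two GDAL config options mentioned above as environment
# variables, before GDAL/rasterio opens anything.
import os

os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"

print(os.environ["GDAL_DISABLE_READDIR_ON_OPEN"])  # EMPTY_DIR
```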
brunosan pushed a commit that referenced this pull request Dec 27, 2023
brunosan pushed a commit that referenced this pull request Dec 27, 2023
```python
# Step 1 - Get list of GeoTIFF filepaths from s3 bucket or data/ folder
if self.data_path.startswith("s3://"):
    dp = torchdata.datapipes.iter.IterableWrapper(iterable=[self.data_path])
```
Noting datapipes are deprecated, is this a long term solution?
