
Adapt model to load 512x512 images from s3 bucket #85

Merged. 7 commits merged into main from model/input-512-images on Dec 11, 2023.

Conversation

@weiji14 (Contributor) commented Dec 11, 2023

Modify the ViT MAE model to accept input images of size 512x512 pixels, following #78. This PR also makes a few small enhancements to the datapipe.

TODO:

  • Change hardcoded image_size from 256 to 512 and patch_size from 32 to 64.
  • Allow loading of GeoTIFF files directly from an s3 bucket instead of the local drive.
  • Optimize data loading using a sharding filter.

Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.
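As a quick sanity check on this change (a hypothetical sketch, not code from the PR): doubling both image_size and patch_size together keeps the ViT's sequence length unchanged, since the patch grid stays 8x8 either way.

```python
# Hypothetical sanity check: doubling both image_size and patch_size
# leaves the number of patches (the ViT sequence length) unchanged.

def num_patches(image_size: int, patch_size: int) -> int:
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    side = image_size // patch_size
    return side * side

old = num_patches(image_size=256, patch_size=32)  # 8x8 patch grid
new = num_patches(image_size=512, patch_size=64)  # still an 8x8 patch grid
print(old, new)  # 64 64
```

This is likely why both values were doubled in lockstep: the model sees the same number of tokens, just covering a larger footprint per patch.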
@weiji14 weiji14 self-assigned this Dec 11, 2023
Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72.
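The TIFF DateTime tag stores timestamps as "YYYY:MM:DD HH:MM:SS". A minimal sketch of converting such a tag value into the YYYY-MM-DD string described above (the function name and exact tag-reading mechanism are assumptions; the PR itself reads the tag from the GeoTIFF's metadata):

```python
# Hypothetical sketch: turn a TIFF DateTime tag value ("YYYY:MM:DD HH:MM:SS"
# per the TIFF spec) into a YYYY-MM-DD date string, instead of parsing
# the date out of the filename.
from datetime import datetime

def date_from_tiff_tag(tag_value: str) -> str:
    parsed = datetime.strptime(tag_value, "%Y:%m:%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d")

print(date_from_tiff_tag("2023:12:11 04:12:00"))  # 2023-12-11
```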
New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.
@weiji14 weiji14 changed the title Adapt model to 512x512 input image sizes Adapt model to load 512x512 images from s3 bucket Dec 11, 2023
Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.
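The idea behind a sharding filter can be sketched in plain Python (this mimics the round-robin split across workers; it is not torchdata's actual implementation): each of N workers keeps only every N-th element, so no file is loaded twice.

```python
# Plain-Python sketch of what a sharding filter does: each worker keeps
# every num_workers-th element, so no element is processed by two workers.

def shard(items, worker_id: int, num_workers: int):
    return [item for i, item in enumerate(items) if i % num_workers == worker_id]

files = ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"]
print(shard(files, worker_id=0, num_workers=2))  # ['a.tif', 'c.tif', 'e.tif']
print(shard(files, worker_id=1, num_workers=2))  # ['b.tif', 'd.tif']
```

Without this split, every DataLoader worker would iterate the full file list, duplicating the decoding work.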
Ensure that *.ckpt files in sub-folders are ignored too.
Prevents messages like: "You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance."
@weiji14 weiji14 marked this pull request as ready for review December 11, 2023 04:12
Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
Comment on lines +101 to +108

```python
# Step 1 - Get list of GeoTIFF filepaths from s3 bucket or data/ folder
if self.data_path.startswith("s3://"):
    dp = torchdata.datapipes.iter.IterableWrapper(iterable=[self.data_path])
    self.dp_paths = dp.list_files_by_s3(masks="*.tif")
else:  # if self.data_path is a local data path
    self.dp_paths = torchdata.datapipes.iter.FileLister(
        root=self.data_path, masks="*.tif", recursive=True
    )
```
weiji14 (Contributor, Author):
Note that loading data from s3 is still a bit slower than a local data folder, but it can be useful for cases like one-off inference/prediction when we don't want to store large amounts of data locally.

@weiji14 (Contributor, Author) commented Dec 11, 2023

Gonna merge this PR directly, as it consists of mostly minor tweaks (which I've accumulated over the past few weeks). Some of the options (e.g. patch_size) can be changed later, but thought it'd be good to have a datapipe/model that works with the 512x512 images soon-ish.

@weiji14 weiji14 merged commit 9757bbf into main Dec 11, 2023
2 checks passed
@weiji14 weiji14 deleted the model/input-512-images branch December 11, 2023 04:28
weiji14 added a commit that referenced this pull request Dec 19, 2023
Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.
weiji14 added a commit that referenced this pull request Dec 19, 2023
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

* 🚚 Rename datacube's path key to source_url

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

* 🚑 Use try-except to get absolute chip_path or fallback to str

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
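A sketch of the try-except fallback described in that commit message (the helper name is hypothetical): local chips are pathlib.Path objects with an .absolute() method, while s3 chips are plain strings, so the AttributeError branch keeps them as-is.

```python
# Sketch of the try-except fallback: pathlib.Path objects get resolved to
# an absolute path, while plain strings (e.g. "s3://..." URLs) have no
# .absolute() method, so the AttributeError branch returns them unchanged.
from pathlib import Path

def source_url(chip_path) -> str:
    try:
        return str(chip_path.absolute())  # local pathlib.Path
    except AttributeError:
        return str(chip_path)  # plain str, e.g. an s3:// URL

print(source_url("s3://copernicus-dem-30m/tile.tif"))  # s3://copernicus-dem-30m/tile.tif
```

This avoids the broken getattr approach, which returned the bound method itself because .absolute was never called with parentheses.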

* ✨ Implement predict_dataloader for ClayDataModule

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

* ✅ Add parametrized test for checking ClayDataModule

Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.

* 📝 Edit docstrings in test_datamodule.py to be more generic

Not just testing one, but two different LightningDataModules now!

* 🔧 Add GDAL environment variables that might help with s3 loading

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
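A minimal sketch of applying those two GDAL config options as environment variables (they must be set before GDAL or rasterio opens any file): EMPTY_DIR skips a directory listing on open, and merging consecutive HTTP range requests reduces round-trips when reading Cloud-Optimized GeoTIFFs over s3.

```python
# Sketch: set the two GDAL config options mentioned above as environment
# variables, before GDAL/rasterio opens anything.
import os

os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"

print(os.environ["GDAL_DISABLE_READDIR_ON_OPEN"])  # EMPTY_DIR
```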
brunosan pushed a commit that referenced this pull request Dec 27, 2023
brunosan pushed a commit that referenced this pull request Dec 27, 2023
```python
# Step 1 - Get list of GeoTIFF filepaths from s3 bucket or data/ folder
if self.data_path.startswith("s3://"):
    dp = torchdata.datapipes.iter.IterableWrapper(iterable=[self.data_path])
```
Noting datapipes are deprecated, is this a long term solution?
