
Add SolarPlantsBrazil dataset for photovoltaic panel detection (semantic segmentation) #2797


Open: wants to merge 35 commits into base: main

Conversation

@FederCO23 (Author) commented May 24, 2025

This PR adds the SolarPlantsBrazil dataset to TorchGeo.

SolarPlantsBrazil is a geospatial dataset curated for binary semantic segmentation, focused on detecting photovoltaic (PV) power stations in satellite imagery. It includes 272 manually annotated 256×256 GeoTIFF image patches from various regions of Brazil. Each image patch comes with:

  • A 4-band image (RGB + NIR, float32)
  • A binary mask indicating solar panel regions
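As a rough sketch (plain NumPy, not the actual TorchGeo class), a sample following this description would look like:

```python
import numpy as np

# Hypothetical sketch of the sample format described above:
# a 4-band float32 patch plus an integer mask of panel labels.
image = np.zeros((4, 256, 256), dtype=np.float32)  # R, G, B, NIR bands
mask = np.zeros((256, 256), dtype=np.int64)        # 0 = background, 1 = panel
sample = {"image": image, "mask": mask}
print(sample["image"].shape, sample["mask"].shape)
```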

Changes introduced:

  • Added SolarPlantsBrazil dataset class under torchgeo.datasets
  • Added SolarPlantsBrazilDataModule datamodule class under torchgeo.datamodules
  • Integrated automatic download from Hugging Face: FederCO23/solar-plants-brazil
  • Parsed RGB + NIR float32 GeoTIFFs and binary mask labels
  • Implemented a plot() method using RGB bands for sample visualization
  • Registered dataset in __init__.py and non_geo_datasets.csv
  • Added full unit test coverage under tests/datasets/test_solar_plants_brazil.py

Example output from plot():

[image: Sample_from_Solar_Plants_Brazil]
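For illustration, a helper along these lines (a hypothetical sketch, not the PR's actual plot() implementation) can turn a 4-band patch into an RGB array for display:

```python
import numpy as np

def rgb_for_plot(image: np.ndarray) -> np.ndarray:
    """Take the first three (RGB) bands of a channels-first float32 patch
    and percentile-stretch them to [0, 1] for display."""
    rgb = image[:3].astype(np.float32)             # drop the NIR band
    lo, hi = np.percentile(rgb, (2, 98))           # robust contrast stretch
    rgb = np.clip((rgb - lo) / (hi - lo + 1e-8), 0, 1)
    return rgb.transpose(1, 2, 0)                  # HWC, as matplotlib expects

patch = np.random.rand(4, 256, 256).astype(np.float32)
out = rgb_for_plot(patch)
print(out.shape)  # (256, 256, 3)
```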


@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets testing Continuous integration testing labels May 24, 2025
@FederCO23 (Author)

@microsoft-github-policy-service agree


@robmarkcole (Contributor) commented May 25, 2025

@FederCO23 care to add a datamodule too? Also this dataset is only 4 channel, don't suppose you investigated adding the other channels too?

@adamjstewart adamjstewart added this to the 0.8.0 milestone May 25, 2025
@adamjstewart (Collaborator) left a comment

Not a bad data loader, but it doesn't yet follow the style of other TorchGeo datasets/tests. Take a look at https://torchgeo.readthedocs.io/en/latest/tutorials/contribute_non_geo_dataset.html or any other builtin dataset for inspiration.

@github-actions github-actions bot added the datamodules PyTorch Lightning datamodules label May 26, 2025
@FederCO23 (Author) commented May 27, 2025

Hi @adamjstewart, just circling back on your previous review.

The requested changes have been addressed in the latest commits:

  • Replaced unittest.mock.patch with pytest.MonkeyPatch
  • Moved dummy data generation to tests/data/solar_plants_brazil/data.py
  • Removed unused citation block and huggingface_hub dependency
  • Switched to relative imports
  • Updated exception handling to use DatasetNotFoundError
  • Refactored tests into a single TestSolarPlantsBrazil class following TorchGeo style
  • Added full unit test coverage for dataset + datamodule (100%)
  • Lint clean (ruff, mypy)

The dataset currently uses 4 bands (RGB + NIR), which matched the original training setup. For now, I’ve kept it limited to these bands for consistency, but I’m open to extending it in future updates.

I’ve also replied to your earlier comments individually with a bit more context, in case further clarification is helpful.

Please let me know if there’s anything else you’d like me to adjust, and thanks again for your time and feedback!
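The pytest.MonkeyPatch swap mentioned above can be sketched like this (FakeDataset and _download are hypothetical stand-ins, not the PR's real class):

```python
import pytest

class FakeDataset:
    def _download(self) -> str:
        # Pretend network call; tests must never hit the network.
        raise RuntimeError("network disabled in tests")

# MonkeyPatch replaces the method for the duration of the test,
# then undo() restores the original behaviour.
mp = pytest.MonkeyPatch()
mp.setattr(FakeDataset, "_download", lambda self: "fixture.zip")
result = FakeDataset()._download()
print(result)  # fixture.zip
mp.undo()
```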

@FederCO23 FederCO23 requested a review from adamjstewart May 27, 2025 21:26
@robmarkcole (Contributor)
Test failures same as #2707 (comment)

@robmarkcole (Contributor)

@FederCO23 hit the Update branch button as the failing test is fixed on main

@FederCO23 (Author)

> @FederCO23 hit the Update branch button as the failing test is fixed on main

@robmarkcole, I’ve updated the branch as requested and all checks are passing.
I also addressed all the requested changes from @adamjstewart and left a comment summarizing them above.
Let me know if anything else is needed on my end. Thanks again!

@adamjstewart (Collaborator)

Swamped with deadlines, will review soon (Tue or Wed).

@FederCO23 (Author)

@adamjstewart, thanks again for the detailed feedback!
All 16 requested changes have been addressed. I replied to each one directly above.
Let me know if there’s anything else needed on my end. Happy to contribute!

@robmarkcole, just tagging you here to keep you in the loop.

@adamjstewart (Collaborator)

Accidentally deleted my notification, but remind me to review this again this week.

@adamjstewart adamjstewart self-assigned this Jun 15, 2025
@adamjstewart (Collaborator) left a comment

Still quite a lot of differences between this dataset and the other builtin datasets.


"""

url = 'https://huggingface.co/datasets/FederCO23/solar-plants-brazil/resolve/main/solarplantsbrazil.zip'
@adamjstewart (Collaborator)

Suggested change:
- url = 'https://huggingface.co/datasets/FederCO23/solar-plants-brazil/resolve/main/solarplantsbrazil.zip'
+ url = 'https://huggingface.co/datasets/FederCO23/solar-plants-brazil/resolve/1dc13a453ef6acabf08a1781c523fd1db3d9bcc5/solarplantsbrazil.zip'

Use a stable commit hash so that the MD5 doesn't change even if the dataset is later updated.
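A small sketch of why pinning matters: the recorded MD5 stays valid only if the bytes behind the URL never change. The payloads below are made up for illustration, not the dataset's real contents or checksum.

```python
import hashlib

# Two versions of a file: any change to the bytes changes the digest,
# so a URL pinned to a commit keeps the recorded MD5 valid forever.
payload_v1 = b"solarplantsbrazil.zip contents at the pinned commit"
payload_v2 = b"solarplantsbrazil.zip contents after a later update"

md5_v1 = hashlib.md5(payload_v1).hexdigest()
md5_v2 = hashlib.md5(payload_v2).hexdigest()
assert md5_v1 != md5_v2  # an updated file would fail a pinned checksum
```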

@FederCO23 (Author)

Ok. I updated the dataset URL to use a fixed commit hash to ensure a stable MD5.

@@ -55,6 +55,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`SkyScript`_,IC,"NAIP, orthophotos, Planet SkySat, Sentinel-2, Landsat 8--9",MIT,5.2M,-,100--1000,0.1--30,RGB
`So2Sat`_,C,Sentinel-1/2,"CC-BY-4.0","400,673",17,32x32,10,"SAR, MSI"
`SODA`_,OD,Aerial,"CC-BY-NC-4.0","2513",9,"~2700x~4800","RGB"
`Solar Plants Brazil`_,S,Aerial,"CC-BY-NC-4.0","272",2,256x256,10,"RGB + NIR"

Comment on lines 38 to 39
sample = dataset[0].copy()
sample['prediction'] = sample['mask'].clone()
@adamjstewart (Collaborator)

I don't think a copy or clone is actually needed, is it?

@FederCO23 (Author)

You're right: since prediction is not being modified, both .clone() and .copy() are unnecessary and can be removed.
For reference, I had initially followed a similar pattern used in other datasets like cloud_cover and chesapeake, which use .clone() in their test_plot() methods.

sample = dataset[0]
assert torch.all(sample['image'] > 0)

def test_download_called(
@adamjstewart (Collaborator)

I don't think we test this for any other datasets; it's better to test that it's possible to "download" files from the tests/data directory to the test directory. See all of the other dataset tests.
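The pattern described here can be sketched as follows; every path and filename is a hypothetical stand-in for the real fixtures:

```python
import shutil
import tempfile
from pathlib import Path

# Tests "download" by copying a small fixture archive from tests/data
# into the dataset root instead of hitting the network.
tmp = Path(tempfile.mkdtemp())
fixtures = tmp / "tests" / "data" / "solar_plants_brazil"
fixtures.mkdir(parents=True)
(fixtures / "solarplantsbrazil.zip").write_bytes(b"dummy archive bytes")

root = tmp / "root"
root.mkdir()
# This copy stands in for the real download step.
shutil.copy(fixtures / "solarplantsbrazil.zip", root)
print((root / "solarplantsbrazil.zip").exists())  # True
```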

@FederCO23 (Author)

Ok. Removed test_download_called and added test_already_downloaded and test_not_downloaded instead, following the structure used in other datasets.


assert called['triggered']

def test_missing_dataset_triggers_error(self, tmp_path: Path) -> None:
@adamjstewart (Collaborator)

Isn't this identical to test_missing_dataset_raises?

@FederCO23 (Author)

Removed test_missing_dataset_triggers_error (a duplicate of test_not_downloaded) and renamed tests for consistency with other datasets.

Comment on lines 114 to 115
Returns:
None
@adamjstewart (Collaborator)

Skip

Comment on lines 104 to 105
if len(self.image_paths) == 0:
raise DatasetNotFoundError(self)
@adamjstewart (Collaborator)

Isn't this checked by _verify?

@FederCO23 (Author)

I've now removed the if len(self.image_paths) == 0 check from __init__() to align with the pattern used by the other datasets.
That said, I originally added this check to catch cases where the folder exists but is empty (e.g. after a failed unzip). In that situation, _verify() would silently pass, but indexing into the dataset would fail later in a less clear way.
I have transferred that check to _verify().
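A sketch of that _verify() flow, with a stand-in exception class (the real one is torchgeo's DatasetNotFoundError) and an assumed file layout:

```python
import tempfile
import zipfile
from pathlib import Path

class DatasetNotFoundError(Exception):
    """Stand-in for torchgeo.datasets.DatasetNotFoundError."""

def verify(root: Path, split: str = "train") -> None:
    # Extracted data wins; an empty split folder (e.g. after a failed
    # unzip) is treated as missing, which is the extra check described above.
    split_dir = root / split
    if split_dir.is_dir() and any(split_dir.rglob("*.tif")):
        return
    # Fall back to a local archive if one is present.
    archive = root / "solarplantsbrazil.zip"
    if archive.is_file():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)
        return
    raise DatasetNotFoundError(f"dataset not found in {root}")
```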

Comment on lines 128 to 129
Returns:
None
@adamjstewart (Collaborator)

Skip

"""Retrieve an image-mask pair by index.

Args:
index (int): Index of the sample to retrieve.
@adamjstewart (Collaborator)

Suggested change:
- index (int): Index of the sample to retrieve.
+ index: Index of the sample to retrieve.

Already gotten from type hints

@FederCO23 (Author)

Ok, this has been addressed.

Comment on lines 145 to 147
dict: A dictionary with the following keys:
- 'image': A float32 tensor of shape (C, H, W)
- 'mask': A long tensor of shape (1, H, W), containing binary labels
@adamjstewart (Collaborator)

This doesn't seem to format correctly, maybe try:

Suggested change:
- dict: A dictionary with the following keys:
-     - 'image': A float32 tensor of shape (C, H, W)
-     - 'mask': A long tensor of shape (1, H, W), containing binary labels
+ A dictionary with the following keys:
+     - 'image': A float32 tensor of shape (C, H, W)
+     - 'mask': A long tensor of shape (1, H, W), containing binary labels

or just copy-n-paste the same description used by the other 125+ builtin datasets.

@FederCO23 (Author)

Got it. I'll simplify the docstring to match the style used by the other datasets. Thanks for the tip!


This datamodule wraps the SolarPlantsBrazil dataset, which contains
predefined train/val/test splits. This design ensures spatial separation
between samples by solar plant, preventing data leakage during training.
@adamjstewart (Collaborator)

.. versionadded:: 0.8

@FederCO23 (Author)

Ok, included 🙌
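The predefined-split design described in the datamodule docstring can be sketched roughly like this (folder names are assumptions, not the dataset's actual layout):

```python
from pathlib import Path
from typing import Literal

def split_dir(root: Path, split: Literal['train', 'val', 'test']) -> Path:
    # Each solar plant's patches live under exactly one split folder,
    # so no plant leaks across train/val/test.
    if split not in ('train', 'val', 'test'):
        raise ValueError(f'invalid split: {split!r}')
    return root / split

print(split_dir(Path('data'), 'val').name)  # val
```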

FederCO23 added 29 commits June 20, 2025 17:57
…PlantsBrazil

- Deleted  as requested
- Created  for trainer config
- Registered solar_plants_brazil in
- Updated dummy dataset and dataset test to match 256x256 input requirement
- Return mask as [H, W] to match loss function expectations
- Updated test_getitem assertion to reflect correct shape
- Trainer test now passes with correct mask format
- Enforced cross-platform compatibility using os.path.join in tests.
- Used Literal typing for 'split' argument to improve static type safety.
- Silenced mypy type check for the 'invalid split' test using # type: ignore[arg-type].
- Fully updated all docstrings in dataset and datamodule files to comply with standards (Args and Returns included).
…ring in datasets/solar_plants_brazil.py (__getitem__ method)
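The mask-shape convention mentioned in these commit notes can be illustrated with plain NumPy; the shapes follow the usual channels-first convention and are not code from the PR:

```python
import numpy as np

# A (H, W) mask, once batched, gives the (N, H, W) class-index target
# that a pixelwise cross-entropy loss pairs with (N, C, H, W) logits;
# a (1, H, W) mask would batch to (N, 1, H, W) and be rejected.
mask = np.zeros((256, 256), dtype=np.int64)   # [H, W], no channel dim
batch = np.stack([mask, mask, mask])          # [N, H, W]
print(batch.shape)  # (3, 256, 256)
```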
Labels
datamodules PyTorch Lightning datamodules datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing