
Dict specified val_dataloader, test_dataloader and predict_dataloader #16015

Closed
kotchin opened this issue Dec 12, 2022 · 10 comments · Fixed by #17163 or #17402
Labels
data handling Generic data-related topic discussion In a discussion stage feature Is an improvement or enhancement

Comments

kotchin commented Dec 12, 2022

Description & Motivation

Conveniently, a DataModule's train_dataloader() can return a dict, where each key names a dataset and the value is its dataloader.
This is unfortunately not possible with val_dataloader(), test_dataloader(), and predict_dataloader(). Yet it could be particularly useful from a logging perspective, letting the DataModule name the datasets and the LitModule reuse those names when logging.

Perhaps there is a specific reason for why this hasn't been implemented or can't be implemented, but it's not obvious to me.

I have tested the dict-based approaches above, and the documentation for DataModule.from_datasets() also reflects this limitation.

Pitch

No response

Alternatives

No response

Additional context

No response

Edit: typo

cc @Borda @justusschock @awaelchli

@kotchin kotchin added the needs triage Waiting to be triaged by maintainers label Dec 12, 2022
@awaelchli (Member)
Hi, thanks for your interest in this feature!

Perhaps there is a specific reason for why this hasn't been implemented or can't be implemented, but it's not obvious to me.

Yes, sort of. Multiple-dataloader support for validation existed before it was added to the training loop. Until then, validation ran sequentially over the multiple dataloaders, which made the most sense because validation was typically just about collecting metrics. When multiple-dataloader support was considered for training_step, we found that the most practical behavior would be to fetch data from the dataloaders individually and give the collated batches to training_step. This is why there is a difference.

As you say, perhaps in some cases it also makes sense to do the same for validation, where you could log the outputs side-by-side as we iterate through the dataloaders jointly.

@justusschock probably knows a bit more about this and the edge cases here.

@awaelchli awaelchli added discussion In a discussion stage data handling Generic data-related topic feature Is an improvement or enhancement and removed needs triage Waiting to be triaged by maintainers labels Dec 13, 2022
stale bot commented Jan 21, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!


stale bot commented Apr 14, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 14, 2023

awaelchli commented Apr 14, 2023

Support for dictionaries in the eval steps was added in #16726 :)
(release notes: https://github.com/Lightning-AI/lightning/releases/tag/2.0.0)

@awaelchli awaelchli reopened this Apr 14, 2023
@stale stale bot removed the won't fix This will not be worked on label Apr 14, 2023
@awaelchli (Member)

Support for this will be added with #17163
cc @carmocca

@carmocca (Contributor)

It's not clear to me what this issue requests. Support for multiple dataloaders has existed for a while, but they are run sequentially during evaluation.

#17163 removes this sequential constraint for .validate() and .test()

But the top post doesn't explicitly state that the problem is the sequential execution.


kotchin commented Apr 17, 2023

@carmocca this issue requests the same dict-based dataloader support for val_dataloader, test_dataloader, and predict_dataloader as exists for train_dataloader.

Specifically, when setting up a LightningDataModule, train_dataloader can return a dict where each key corresponds to a dataloader.
For val_dataloader, test_dataloader, and predict_dataloader, this is not possible. At best, one can return a list of dataloaders, but that loses the ability to name each dataloader in the list.

For example, if I want my val_dataloader to contain three different dataloaders, "val_dataloader A", "val_dataloader B", and "val_dataloader C", all I can return is a list of dataloaders. I can't attach a name or definition to each one, so if the order of the dataloaders ever changes, this can't be automatically detected or accounted for by the LightningModule.

This is why the dict-based train_dataloader is convenient: it allows returning

{"dataloader A": dataloader_A, "dataloader B": dataloader_B, "dataloader C": dataloader_C}

instead of
[dataloader_A, dataloader_B, dataloader_C]

The question is mostly why this dict based dataloader organization is only available for train_dataloader and not the other methods.
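The ordering fragility described above can be illustrated without Lightning at all; the placeholder lists below stand in for real DataLoader objects:

```python
# Positional access: reordering a list silently changes what each
# index refers to. (Placeholder lists stand in for real DataLoaders.)
loaders_v1 = [["imgs"], ["text"], ["audio"]]
loaders_v2 = [["text"], ["imgs"], ["audio"]]  # A and B were swapped
assert loaders_v1[0] != loaders_v2[0]  # index 0 now means a different dataset

# Named access: the key travels with the loader, so consumers that
# look things up by name are unaffected by the ordering.
named_v1 = {"A": ["imgs"], "B": ["text"], "C": ["audio"]}
named_v2 = {"B": ["text"], "A": ["imgs"], "C": ["audio"]}
assert named_v1["A"] == named_v2["A"]
```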

Edit: This is also reflected in the documentation for the LightningDataModule:
https://lightning.ai/docs/pytorch/latest/api/lightning.pytorch.core.LightningDataModule.html#lightning.pytorch.core.LightningDataModule.from_datasets

Here val_dataset, test_dataset, and predict_dataset can't be of type Mapping[str, Dataset], unlike train_dataset.

Edit2: this person is also reporting the limitation I'm talking about: #16830 (comment)

@carmocca (Contributor)

A dict can be returned at least in version 2.0. Try playing around with this script:

import os

import torch
from torch.utils.data import DataLoader, Dataset

from lightning.pytorch import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def test_step(self, batch, batch_idx, dataloader_idx=0):
        print(dataloader_idx, batch_idx, batch)

    def test_dataloader(self):
        return {
            "A": DataLoader(RandomDataset(32, 64), batch_size=2),
            "B": DataLoader(RandomDataset(32, 64), batch_size=2),
        }


def run():
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_test_batches=1,
        barebones=True,
    )
    trainer.test(model)


if __name__ == "__main__":
    run()

But the dataloaders will be consumed sequentially (during validate, test and predict), and the dictionary keys are not included in the batch. Perhaps you meant that you are interested in consuming them using a different mode: https://lightning.ai/docs/pytorch/stable/data/iterables.html#multiple-iterables

Example:

    def test_dataloader(self):
        from lightning.pytorch.utilities import CombinedLoader
        dataloaders = {"A": DataLoader(RandomDataset(32, 64), batch_size=2), "B": DataLoader(RandomDataset(32, 64), batch_size=2)}
        return CombinedLoader(dataloaders, mode="max_size_cycle")

Which in 2.0.1 raises:

ValueError: `trainer.test()` only supports the `CombinedLoader(mode="sequential")` mode.

But I just implemented support on master with #17163, in case this is blocking you.
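For intuition, the max_size_cycle mode mentioned above can be sketched in plain Python. This is a simplification of the idea, not CombinedLoader's actual implementation, and the names are illustrative:

```python
from itertools import cycle, islice

def max_size_cycle(iterables):
    """Yield dicts of batches, one per named iterable, cycling the
    shorter iterables until the longest one is exhausted -- a sketch of
    the idea behind CombinedLoader(mode="max_size_cycle")."""
    longest = max(len(it) for it in iterables.values())
    cycled = {name: islice(cycle(it), longest) for name, it in iterables.items()}
    return [{name: next(it) for name, it in cycled.items()} for _ in range(longest)]

batches = max_size_cycle({"A": [1, 2, 3], "B": [10, 20]})
# -> [{'A': 1, 'B': 10}, {'A': 2, 'B': 20}, {'A': 3, 'B': 10}]
```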


carmocca commented Apr 17, 2023

The from_datasets method needs an update, though. Good catch. Opened #17402


kotchin commented Aug 23, 2023

@carmocca it seems your proposal goes in the direction I'm pitching.

If I can indeed consume the validation and test data as ("dataloader name", batch), that works for me. It helps, when validating and testing on multiple dataloaders, to know which dataloader each batch comes from, and it lets the logging reuse the dataloader's name.

Thank you for your help.

Edit: we may close this issue as resolved.
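As a closing note, one way to recover a dataloader's name from the positional dataloader_idx in an eval step relies on Python dicts preserving insertion order, which matches the sequential iteration order. The names and stand-in lists below are illustrative, not part of the Lightning API:

```python
# Stand-ins for real DataLoader objects, keyed by name.
loaders = {"val_A": ["a1", "a2"], "val_B": ["b1"], "val_C": ["c1", "c2"]}

# Python dicts preserve insertion order, so positional indices can be
# mapped back to the names the loaders were registered under.
idx_to_name = dict(enumerate(loaders))  # {0: "val_A", 1: "val_B", 2: "val_C"}

def validation_step(batch, batch_idx, dataloader_idx=0):
    # Derive a per-dataloader metric key from the registered name.
    return f"{idx_to_name[dataloader_idx]}/loss"

print(validation_step("c1", 0, dataloader_idx=2))  # -> val_C/loss
```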
