
Glue MNLI task fails due to missing 'validation' key in dataset #213

Closed
mariomeissner opened this issue Dec 1, 2021 · 10 comments · Fixed by #214
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed)

@mariomeissner (Contributor)

🐛 Bug

MNLI has two validation sets and two test sets, called validation_matched, validation_mismatched, test_matched and test_mismatched. I assume this was not taken into account in the datamodule.
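
For reference, loading the dataset directly shows the split names (a quick check with HuggingFace datasets; the printed keys are from memory, but easy to verify):

    from datasets import load_dataset

    # Inspect the split names that GLUE/MNLI actually ships with
    ds = load_dataset("glue", "mnli")
    print(list(ds.keys()))
    # ['train', 'validation_matched', 'validation_mismatched',
    #  'test_matched', 'test_mismatched']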

To Reproduce

Steps to reproduce the behavior:

Run the following command:

python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli

Expected behavior

I would expect the datamodule to handle the special case of MNLI and load validation_matched and test_matched by default, with perhaps an option to additionally evaluate on test_mismatched when desired.

Environment

A standard pip install from source, as of 2021.12.01. Fails with or without GPU.

@mariomeissner added the bug / fix and help wanted labels on Dec 1, 2021
@mariomeissner (Contributor, Author)

HuggingFace Datasets does have a dataset called mnli_matched where the two dictionary keys are simply called validation and test, which would solve the issue. Unfortunately, this dataset does not have a train set, so we are stuck with using mnli.
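
A quick check (assuming it's exposed as a GLUE config; keys are again from memory):

    from datasets import load_dataset

    ds = load_dataset("glue", "mnli_matched")
    print(list(ds.keys()))  # ['validation', 'test'] (no 'train' split)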

@mariomeissner (Contributor, Author)

I'd submit a PR if I knew which approach is recommended to accommodate this edge case.

@mathemusician (Contributor) commented Dec 1, 2021

To anyone else trying to reproduce this: it only reproduces if you run pip install pytorch-lightning==1.4 first.
My environment: Google Colab

I'd change lightning_transformers/core/data.py to something like this:

    def train_dataloader(self) -> DataLoader:
        # Prefer an explicit subset name from the config, falling back to
        # the conventional "train" key; fail loudly if neither exists.
        if hasattr(self.cfg, "train_dataset"):
            dataset_name = self.cfg.train_dataset
        elif "train" in self.ds:
            dataset_name = "train"
        else:
            raise KeyError("'train' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

    def val_dataloader(self) -> DataLoader:
        # Same pattern: cfg.valid_dataset overrides the default "validation" key.
        if hasattr(self.cfg, "valid_dataset"):
            dataset_name = self.cfg.valid_dataset
        elif "validation" in self.ds:
            dataset_name = "validation"
        else:
            raise KeyError("'validation' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

    def test_dataloader(self) -> DataLoader:
        # cfg.test_dataset overrides the default "test" key; raising here
        # (instead of returning None) surfaces a missing subset immediately.
        if hasattr(self.cfg, "test_dataset"):
            dataset_name = self.cfg.test_dataset
        elif "test" in self.ds:
            dataset_name = "test"
        else:
            raise KeyError("'test' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

This way, I can specify subset names explicitly when needed, and it should work on any dataset, not just MNLI.

You'd have to document somewhere that this is possible, but you get the gist.

Specifying this in hydra should be as easy as:

!pl-transformers-train                       \
      task=nlp/text_classification           \
      dataset=nlp/text_classification/glue   \
      dataset.cfg.dataset_config_name=mnli   \
      ++dataset.cfg.valid_dataset=validation_matched

@mariomeissner (Contributor, Author)

Interesting, when I run your notebook with lightning >= 1.5, I get:

TypeError: Error instantiating 'pytorch_lightning.trainer.trainer.Trainer' : __init__() got an unexpected keyword argument 'truncated_bptt_steps'

I assume that once that error is fixed, the next error will be the missing validation key.
When I ran it locally (git clone and pip install .) with lightning >= 1.5, I didn't get this error.
Regardless, it's probably better to open a new issue for the above error...

@mathemusician (Contributor)

> Interesting, when I run your notebook with lightning >= 1.5, I get:
>
> TypeError: Error instantiating 'pytorch_lightning.trainer.trainer.Trainer' : __init__() got an unexpected keyword argument 'truncated_bptt_steps'
>
> I assume that once that error is fixed, the next error will be the missing validation key. When I ran it locally (git clone and pip install .) with lightning >= 1.5, I didn't get this error. Regardless, it's probably better to open a new issue for the above error...

There already is an issue open for the above error: #212

That said, were you still looking to do a pull request for this feature? If not, I might just do it myself.

@mariomeissner (Contributor, Author) commented Dec 3, 2021 via email

@mariomeissner (Contributor, Author)

Sorry for the delay, I've been through a tough week.
I've been looking into this again, and it seems the suggested approach would not work if we set the train_val_split argument (which we honestly shouldn't when using a dataset that already has a split) or, more importantly, limit_[subset]_samples, because we also hard-code "validation" there.

I've thought about two approaches to solve this more universally.

  1. Store the subset names as attributes and always access subsets through those attributes instead of the hard-coded strings "validation"/"test".
  2. Rename any odd subset names to the standard convention, so that downstream code can keep working with hard-coded subset names as usual. I personally think this approach is better because it avoids further issues if people don't realize they should be using the attribute key.

I'll PR the second approach, and welcome any comments.
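
Roughly, what I have in mind for the second approach (a minimal sketch against HuggingFace datasets; the actual hook point inside the datamodule may differ):

    from datasets import load_dataset

    ds = load_dataset("glue", "mnli")

    # Map the non-standard split names back to the conventional ones so that
    # downstream code can keep using hard-coded "validation"/"test" keys.
    for standard, odd in [("validation", "validation_matched"), ("test", "test_matched")]:
        if standard not in ds and odd in ds:
            ds[standard] = ds.pop(odd)

    print(list(ds.keys()))  # now includes 'validation' and 'test'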

@mathemusician (Contributor)

@mariomeissner, I agree with the second approach, and I'm curious to see how you implement it.

@mariomeissner (Contributor, Author)

Created a PR: #214

@mathemusician (Contributor)

> Created a PR: #214

I like it! Very neat.
