
Glue MNLI task fails due to missing 'validation' key in dataset #213

Closed
mariomeissner opened this issue Dec 1, 2021 · 10 comments · Fixed by #214
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed)

@mariomeissner (Contributor)

🐛 Bug

MNLI has two validation sets and two test sets, called validation_matched, validation_mismatched, test_matched and test_mismatched. I assume this was not taken into account in the datamodule.
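
For reference, loading the dataset directly shows the split names (a quick check with HuggingFace datasets; the printed keys are from memory, but easy to verify):

    from datasets import load_dataset

    # Inspect the split names that GLUE/MNLI actually ships with
    ds = load_dataset("glue", "mnli")
    print(list(ds.keys()))
    # ['train', 'validation_matched', 'validation_mismatched',
    #  'test_matched', 'test_mismatched']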

To Reproduce

Steps to reproduce the behavior:

Run the following command:

python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli

Expected behavior

I would expect the datamodule to handle the special case of MNLI and load validation_matched and test_matched by default, with perhaps an option to additionally evaluate on test_mismatched when desired.

Environment

A standard pip install from source, as of 2021.12.01. Fails with or without GPU.

@mariomeissner added the bug / fix and help wanted labels on Dec 1, 2021
@mariomeissner (Contributor, Author)

HuggingFace Datasets does have a dataset called mnli_matched where the two dictionary keys are simply called validation and test, which would solve the issue. Unfortunately, this dataset does not have a train set, so we are stuck with using mnli.
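
A quick check (assuming it's exposed as a GLUE config; keys are again from memory):

    from datasets import load_dataset

    ds = load_dataset("glue", "mnli_matched")
    print(list(ds.keys()))  # ['validation', 'test'] (no 'train' split)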

@mariomeissner (Contributor, Author)

I'd submit a PR if I knew which approach is recommended to accommodate this edge case.

@mathemusician (Contributor) commented Dec 1, 2021

To anyone else trying to reproduce this: it only reproduces if you run pip install pytorch-lightning==1.4 first.
My environment: Google Colab

I'd change lightning_transformers/core/data.py to something like this:

    def train_dataloader(self) -> DataLoader:
        # Prefer an explicit subset name from the config, falling back to
        # the conventional "train" key; fail loudly if neither exists.
        if hasattr(self.cfg, "train_dataset"):
            dataset_name = self.cfg.train_dataset
        elif "train" in self.ds:
            dataset_name = "train"
        else:
            raise KeyError("'train' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

    def val_dataloader(self) -> DataLoader:
        # Same pattern: cfg.valid_dataset overrides the default "validation" key.
        if hasattr(self.cfg, "valid_dataset"):
            dataset_name = self.cfg.valid_dataset
        elif "validation" in self.ds:
            dataset_name = "validation"
        else:
            raise KeyError("'validation' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

    def test_dataloader(self) -> DataLoader:
        # cfg.test_dataset overrides the default "test" key; raising here
        # (instead of returning None) surfaces a missing subset immediately.
        if hasattr(self.cfg, "test_dataset"):
            dataset_name = self.cfg.test_dataset
        elif "test" in self.ds:
            dataset_name = "test"
        else:
            raise KeyError("'test' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

This way, I can specify subset names explicitly when needed, and it should work on any dataset, not just MNLI.

You'd have to document somewhere that this is possible, but you get the gist.

Specifying this in hydra should be as easy as:

!pl-transformers-train                       \
      task=nlp/text_classification           \
      dataset=nlp/text_classification/glue   \
      dataset.cfg.dataset_config_name=mnli   \
      ++dataset.cfg.valid_dataset=validation_matched

@mariomeissner (Contributor, Author)

Interesting, when I run your notebook with lightning >= 1.5, I get:

TypeError: Error instantiating 'pytorch_lightning.trainer.trainer.Trainer' : __init__() got an unexpected keyword argument 'truncated_bptt_steps'

I assume that once that error is fixed, the next error will be the missing validation key.
When I ran it locally (git clone and pip install .) with lightning >= 1.5, I didn't get this error.
Regardless, it's probably better to open a new issue for the above error...

@mathemusician (Contributor)

> Interesting, when I run your notebook with lightning >= 1.5, I get:
>
> TypeError: Error instantiating 'pytorch_lightning.trainer.trainer.Trainer' : __init__() got an unexpected keyword argument 'truncated_bptt_steps'
>
> I assume that once that error is fixed, the next error will be the missing validation key. When I ran it locally (git clone and pip install .) with lightning >= 1.5, I didn't get this error. Regardless, it's probably better to open a new issue for the above error...

There already is an issue open for the above error: #212

That said, were you still looking to do a pull request for this feature? If not, I might just do it myself.

@mariomeissner (Contributor, Author) commented Dec 3, 2021 via email

@mariomeissner (Contributor, Author)

Sorry for the delay, I've been through a tough week.
I've been looking into this again, and it seems the suggested approach would not work if we set the train_val_split argument (which we honestly shouldn't when using a dataset that already has a split) or, more importantly, limit_[subset]_samples, because we also hard-code "validation" there.

I've thought about two approaches to solve this more universally.

  1. Store the subset names as attributes and always access subsets through those attributes instead of the hard-coded strings "validation"/"test".
  2. Rename any odd subset names to the standard convention, so that downstream code can keep working with hard-coded subset names as usual. I personally think this approach is better because it avoids further issues if people don't realize they should be using the attribute key.

I'll PR the second approach, and welcome any comments.
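
Roughly, what I have in mind for the second approach (a minimal sketch against HuggingFace datasets; the actual hook point inside the datamodule may differ):

    from datasets import load_dataset

    ds = load_dataset("glue", "mnli")

    # Map the non-standard split names back to the conventional ones so that
    # downstream code can keep using hard-coded "validation"/"test" keys.
    for standard, odd in [("validation", "validation_matched"), ("test", "test_matched")]:
        if standard not in ds and odd in ds:
            ds[standard] = ds.pop(odd)

    print(list(ds.keys()))  # now includes 'validation' and 'test'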

@mathemusician (Contributor)

@mariomeissner, I agree with the second approach, and I'm curious to see how you implement it.

@mariomeissner (Contributor, Author)

Created a PR: #214

@mathemusician (Contributor)

> Created a PR: #214

I like it! Very neat.
