
Abstract datasets intro base-classes #19

Merged
merged 7 commits into master on Mar 2, 2021

Conversation

AndreasMadsen
Owner

@AndreasMadsen AndreasMadsen commented Feb 21, 2021

I need the datasets to be subclasses for TorchScript support, and we will need it anyway if we add more datasets. Unfortunately the cache will have to be rebuilt, because this changes some of the .pkl file formats and filenames.

  • This fixes ROAR dataset is loaded from cache when result is missing #14.
  • This sets the seed for some of the split generation. That could probably be done better, but at least it is set.
  • For babi it just hardcodes the labels instead of computing them dynamically. That simplifies things a lot. I noticed there were some odd labels; I don't know if they are actually used, but it may be worth looking into.
  • If the encode dataset and vocab exists, it will not build any of the intermediate files. Should make syncing less expensive.
  • The test split now uses the test dataset instead of the validation dataset. (Recently introduced bug.)

work in progress: Everything should work. I ran a few iterations locally and rebuilt the cache. I need to rerun experiments on Compute Canada to make sure it works completely.

@AndreasMadsen AndreasMadsen added the wip Work In Progress label Feb 21, 2021
@AndreasMadsen AndreasMadsen removed the wip Work In Progress label Feb 22, 2021
@AndreasMadsen
Owner Author

@ncmeade @vaibhavad This appears to work now. Please review when you have time.

@AndreasMadsen AndreasMadsen mentioned this pull request Feb 23, 2021
@ncmeade
Collaborator

ncmeade commented Mar 2, 2021

I tried running download.sh on Beluga to test the dataset pre-processing, but while running the command:

pip3 install --no-index --find-links $HOME/python_wheels \
    'numpy>=1.19.0' 'tqdm>=4.53.0' 'torch>=1.7.0' 'pytorch-lightning>=1.0.0' \
    'spacy>=2.2.0' $HOME/python_wheels/en_core_web_sm-2.2.0.tar.gz 'torchtext>=0.6.0' \
    'scikit-learn>=0.23.0' 'nltk>=3.5' 'gensim>=3.8.0' 'pandas>=1.1.0'

I get the requirements error:

ERROR: Could not find a version that satisfies the requirement fsspec[http]>=0.8.1 (from pytorch-lightning>=1.0.0) (from versions: 0.6.1, 0.8.0)
ERROR: No matching distribution found for fsspec[http]>=0.8.1 (from pytorch-lightning>=1.0.0)

It looks like newer versions of pytorch-lightning require fsspec[http]>=0.8.1 and the newest version available on Beluga is 0.8.0. When we were running experiments in December we were likely using version 1.1.0 of pytorch-lightning (when we first downloaded the wheel files).

We can fix this issue on Beluga by either constraining pytorch-lightning<=1.1.0, or by pre-downloading the wheel for fsspec[http] (and one of its requirements) via:

pip3 download --no-deps 'fsspec[http]>=0.8.1' 'idna-ssl>=1.0'

I can't remember, but is there a reason why we don't pre-download all of the dependencies for pytorch-lightning (i.e. pip download pytorch-lightning vs. pip install --no-deps pytorch-lightning)?

@AndreasMadsen
Owner Author

We can fix this issue on Beluga by either constraining pytorch-lightning<=1.1.0, or by pre-downloading the wheel for fsspec[http] (and one of its requirements) via:

Constrain it to pytorch-lightning<=1.1.0 for now. In #23 I'm updating to 1.2.0 because we need it for a metric, and there the issue is fixed.
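For concreteness, a sketch of what the constrained install line would look like (flags, paths, and sibling pins are copied from the original download.sh command above; only the pytorch-lightning specifier changes, and the exact bound is an assumption until #23 lands):

```shell
# Sketch: pin pytorch-lightning at or below 1.1.0, since newer releases
# require fsspec[http]>=0.8.1, which is not available in Beluga's wheelhouse.
pip3 install --no-index --find-links $HOME/python_wheels \
    'numpy>=1.19.0' 'tqdm>=4.53.0' 'torch>=1.7.0' 'pytorch-lightning>=1.0.0,<=1.1.0' \
    'spacy>=2.2.0' $HOME/python_wheels/en_core_web_sm-2.2.0.tar.gz 'torchtext>=0.6.0' \
    'scikit-learn>=0.23.0' 'nltk>=3.5' 'gensim>=3.8.0' 'pandas>=1.1.0'
```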

I can't remember, but is there a reason why we don't pre-download all of the dependencies for pytorch-lightning (i.e. pip download pytorch-lightning vs. pip install --no-deps pytorch-lightning)?

Some of the dependencies are already available via the shared wheelhouse. If we download all the dependencies, I don't think pip will use the shared wheelhouse. The shared wheelhouse is preferred because its wheels are optimized for the CPU architecture, etc.

@ncmeade
Collaborator

ncmeade commented Mar 2, 2021

Minor and unrelated to this PR: I think we should also change the constraint on spacy in setup.py from spacy>=2.2.0 to spacy>=2.2.0,<3.0.0. The version of en_core_web_sm we are using expects a 2.x.x version of spacy and if you are using a newer version of spacy like 3.0.0, it throws compatibility warnings.

This is only an issue when running the code on my local machine as the latest version of spacy available on Beluga is 2.2.0.

@AndreasMadsen
Owner Author

AndreasMadsen commented Mar 2, 2021

@ncmeade Yes, it should just be spacy>=2.2.0,<2.3.0 everywhere. spacy is not really semver-compatible, because a specific version of spacy needs a specific version of en_core_web_sm, so >= doesn't make sense to use.
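To illustrate which versions the proposed range admits, here is a small self-contained check (the `satisfies` helper is hypothetical, written for this comment, not project code; it only handles plain dotted versions, unlike pip's full specifier grammar):

```python
# Hypothetical helper: test a dotted version string against the proposed
# spacy range >=2.2.0,<2.3.0 by comparing integer tuples.
def satisfies(version, low="2.2.0", high="2.3.0"):
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(low) <= as_tuple(version) < as_tuple(high)

print(satisfies("2.2.0"))  # True: matches en_core_web_sm-2.2.0
print(satisfies("2.3.5"))  # False: excluded by the <2.3.0 bound
print(satisfies("3.0.0"))  # False: spacy 3.x is incompatible with the 2.2.0 model
```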

@AndreasMadsen
Owner Author

Thanks for the fixes, the changes look good to me.

@ncmeade
Collaborator

ncmeade commented Mar 2, 2021

This looks good to merge to me. I ran the dataset pre-processing in a clean environment and trained baseline models for SST, IMDB, and Babi as a sanity check and everything looked reasonable.

@AndreasMadsen AndreasMadsen merged commit 11f74b0 into master Mar 2, 2021
@AndreasMadsen AndreasMadsen deleted the dataset-abstraction branch March 2, 2021 22:51