
Abstract datasets intro base-classes #19

Merged
merged 7 commits into master on Mar 2, 2021

Conversation

AndreasMadsen
Owner

@AndreasMadsen AndreasMadsen commented Feb 21, 2021

I need the datasets to be subclasses for TorchScript support, and we will need it anyway if we add more datasets. Unfortunately the cache will have to be rebuilt, because this changes some of the .pkl file formats and filenames.

  • This fixes ROAR dataset is loaded from cache when result is missing #14.
  • This sets the seed for some of the split generation. That could probably be done better, but at least it is set.
  • For babi it just hardcodes the labels instead of computing them dynamically. That simplifies things a lot. I noticed there were some odd labels; I don't know if they are actually used, but it may be worth looking into.
  • If the encode dataset and vocab exists, it will not build any of the intermediate files. Should make syncing less expensive.
  • The test split now uses the test dataset instead of the validation dataset. (Recently introduced bug.)

work in progress: Everything should work. I ran a few iterations locally and rebuilt the cache. I need to rerun experiments on Compute Canada to make sure it works completely.

@AndreasMadsen AndreasMadsen added the wip Work In Progress label Feb 21, 2021
@AndreasMadsen AndreasMadsen removed the wip Work In Progress label Feb 22, 2021
@AndreasMadsen
Owner Author

@ncmeade @vaibhavad This appears to work now. Please review when you have time.

@AndreasMadsen AndreasMadsen mentioned this pull request Feb 23, 2021
@ncmeade
Collaborator

ncmeade commented Mar 2, 2021

I tried running download.sh on Beluga to test the dataset pre-processing, but while running the command:

pip3 install --no-index --find-links $HOME/python_wheels \
    'numpy>=1.19.0' 'tqdm>=4.53.0' 'torch>=1.7.0' 'pytorch-lightning>=1.0.0' \
    'spacy>=2.2.0' $HOME/python_wheels/en_core_web_sm-2.2.0.tar.gz 'torchtext>=0.6.0' \
    'scikit-learn>=0.23.0' 'nltk>=3.5' 'gensim>=3.8.0' 'pandas>=1.1.0'

I get the requirements error:

ERROR: Could not find a version that satisfies the requirement fsspec[http]>=0.8.1 (from pytorch-lightning>=1.0.0) (from versions: 0.6.1, 0.8.0)
ERROR: No matching distribution found for fsspec[http]>=0.8.1 (from pytorch-lightning>=1.0.0)

It looks like newer versions of pytorch-lightning require fsspec[http]>=0.8.1 and the newest version available on Beluga is 0.8.0. When we were running experiments in December we were likely using version 1.1.0 of pytorch-lightning (when we first downloaded the wheel files).

We can fix this issue on Beluga by either constraining pytorch-lightning<=1.1.0, or by pre-downloading the wheel for fsspec[http] (and one of its requirements) via:

pip3 download --no-deps 'fsspec[http]>=0.8.1' 'idna-ssl>=1.0'

I can't remember, but is there a reason why we don't pre-download all of the dependencies for pytorch-lightning (i.e. pip download pytorch-lightning vs. pip install --no-deps pytorch-lightning)?

@AndreasMadsen
Owner Author

We can fix this issue on Beluga by either constraining pytorch-lightning<=1.1.0, or by pre-downloading the wheel for fsspec[http] (and one of its requirements) via:

Constrain it to pytorch-lightning<=1.1.0 for now. In #23 I'm updating to 1.2.0 because we need it for a metric, and there the issue is fixed.
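For concreteness, a sketch of what the constrained install line would look like (flags, paths, and sibling pins are copied from the original download.sh command above; only the pytorch-lightning specifier changes, and the exact bound is an assumption until #23 lands):

```shell
# Sketch: pin pytorch-lightning at or below 1.1.0, since newer releases
# require fsspec[http]>=0.8.1, which is not available in Beluga's wheelhouse.
pip3 install --no-index --find-links $HOME/python_wheels \
    'numpy>=1.19.0' 'tqdm>=4.53.0' 'torch>=1.7.0' 'pytorch-lightning>=1.0.0,<=1.1.0' \
    'spacy>=2.2.0' $HOME/python_wheels/en_core_web_sm-2.2.0.tar.gz 'torchtext>=0.6.0' \
    'scikit-learn>=0.23.0' 'nltk>=3.5' 'gensim>=3.8.0' 'pandas>=1.1.0'
```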

I can't remember, but is there a reason why we don't pre-download all of the dependencies for pytorch-lightning (i.e. pip download pytorch-lightning vs. pip install --no-deps pytorch-lightning)?

Some of the dependencies are already available via the shared wheelhouse. If we download all the dependencies, I don't think pip will use the shared wheelhouse. The shared wheelhouse is preferred because its wheels are optimized for the CPU architecture, etc.

@ncmeade
Collaborator

ncmeade commented Mar 2, 2021

Minor and unrelated to this PR: I think we should also change the constraint on spacy in setup.py from spacy>=2.2.0 to spacy>=2.2.0,<3.0.0. The version of en_core_web_sm we are using expects a 2.x.x version of spacy and if you are using a newer version of spacy like 3.0.0, it throws compatibility warnings.

This is only an issue when running the code on my local machine as the latest version of spacy available on Beluga is 2.2.0.

@AndreasMadsen
Owner Author

AndreasMadsen commented Mar 2, 2021

@ncmeade Yes, it should just be spacy>=2.2.0,<2.3.0 everywhere. spacy is not really semver-compatible, because a specific version of spacy needs a specific version of en_core_web_sm, so >= doesn't make sense to use.
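To illustrate which versions the proposed range admits, here is a small self-contained check (the `satisfies` helper is hypothetical, written for this comment, not project code; it only handles plain dotted versions, unlike pip's full specifier grammar):

```python
# Hypothetical helper: test a dotted version string against the proposed
# spacy range >=2.2.0,<2.3.0 by comparing integer tuples.
def satisfies(version, low="2.2.0", high="2.3.0"):
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(low) <= as_tuple(version) < as_tuple(high)

print(satisfies("2.2.0"))  # True: matches en_core_web_sm-2.2.0
print(satisfies("2.3.5"))  # False: excluded by the <2.3.0 bound
print(satisfies("3.0.0"))  # False: spacy 3.x is incompatible with the 2.2.0 model
```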

@AndreasMadsen
Owner Author

Thanks for the fixes, the changes look good to me.

@ncmeade
Collaborator

ncmeade commented Mar 2, 2021

This looks good to merge to me. I ran the dataset pre-processing in a clean environment and trained baseline models for SST, IMDB, and Babi as a sanity check and everything looked reasonable.

@AndreasMadsen AndreasMadsen merged commit 11f74b0 into master Mar 2, 2021
@AndreasMadsen AndreasMadsen deleted the dataset-abstraction branch March 2, 2021 22:51