Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Fix dataset downloading errors #387

Conversation

akihironitta
Copy link
Contributor

@akihironitta akihironitta commented Nov 20, 2020

What does this PR do?

As pointed out in #377 (comment) by @Borda, the tests try to download datasets, which sometimes fail with the following error:

UNEXPECTED EXCEPTION: RuntimeError('Failed download from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz')

Description of the changes

  1. It seems that those failing tests are often doctest, so this PR simply removes the doctest from ci_test-full.yml as we still have doctest in ci_test-base.yml.
    https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/b8ac85154465956b06fd1005b21b071af5493f11/.github/workflows/ci_test-full.yml#L86
    https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/b8ac85154465956b06fd1005b21b071af5493f11/.github/workflows/ci_test-base.yml#L69
  2. This PR also includes minor changes in some tests using LitMNIST to utilize dataset caching since they currently download and store MNIST datasets in ./ instead of in ./datasets/ (datadir fixture). See tests: Use cached datasets in LitMNIST and the doctests #414.

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov
Copy link

codecov bot commented Nov 20, 2020

Codecov Report

Merging #387 (308129c) into master (b8ac851) will decrease coverage by 0.30%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
- Coverage   81.18%   80.88%   -0.31%     
==========================================
  Files         100      100              
  Lines        5714     5701      -13     
==========================================
- Hits         4639     4611      -28     
- Misses       1075     1090      +15     
Flag Coverage Δ
cpu 25.24% <ø> (+0.96%) ⬆️
pytest 25.24% <ø> (+0.96%) ⬆️
unittests 80.24% <ø> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pl_bolts/datasets/base_dataset.py 84.09% <0.00%> (-11.37%) ⬇️
pl_bolts/datamodules/__init__.py 77.35% <0.00%> (-11.33%) ⬇️
...l_bolts/models/rl/vanilla_policy_gradient_model.py 91.96% <0.00%> (-2.68%) ⬇️
pl_bolts/datasets/ssl_amdim_datasets.py 73.91% <0.00%> (-1.09%) ⬇️
pl_bolts/datasets/cifar10_dataset.py 96.80% <0.00%> (-1.07%) ⬇️
..._bolts/models/self_supervised/moco/moco2_module.py 78.74% <0.00%> (-0.50%) ⬇️
pl_bolts/optimizers/lars_scheduling.py 95.34% <0.00%> (-0.40%) ⬇️
...lts/models/self_supervised/simclr/simclr_module.py 71.21% <0.00%> (-0.28%) ⬇️
pl_bolts/models/rl/common/gym_wrappers.py 89.65% <0.00%> (-0.27%) ⬇️
pl_bolts/models/rl/per_dqn_model.py 86.66% <0.00%> (-0.22%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8ac851...005db66. Read the comment docs.

@akihironitta
Copy link
Contributor Author

It seems that those failing tests are often doctest

This is a false statement... Some tests are still failing other than doctest.

@Borda
Copy link
Member

Borda commented Nov 20, 2020

It seems that those failing tests are often doctest

This is a false statement... Some tests are still failing other than doctest.

I agree with @akihironitta the point why doctests may seem to be failing more is they are executed before other tests in the tests folder :]

@Borda Borda added fix fixing issues... ci/cd Continues Integration and delivery labels Nov 20, 2020
@Borda Borda added this to In progress in Code Health / Refatoring Nov 20, 2020
@akihironitta
Copy link
Contributor Author

akihironitta commented Nov 20, 2020

@Borda I actually have no idea why the tests still fail even after removing the doctests from the full testing CI... e.g. here

tests/models/test_autoencoders.py::test_vae[cifar10] FAILED
...

Is there anything else to do here?

@akihironitta akihironitta changed the title ci: Fix dataset caching [wip] ci: Fix dataset downloading errors Nov 20, 2020
@akihironitta akihironitta marked this pull request as ready for review November 20, 2020 12:31
@Borda Borda self-assigned this Nov 22, 2020
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just temporary in one run print what is content of the Dataset folder so see if it restored correctly and eventually fro each DataModule show the source folder...

.github/workflows/ci_test-full.yml Outdated Show resolved Hide resolved
@Borda Borda added datamodule Anything related to datamodules Priority High priority task labels Nov 22, 2020
@akihironitta
Copy link
Contributor Author

akihironitta commented Nov 24, 2020

In docstrings of datasets/datamodules, we're not using data_dir, so the doctests try to download and store datasets in ./ instead of using cached datasets in ./datasets/.

For example, the following test command tries to download and save ./CIFAR10/ instead of reusing cached one in ./datasets/CIFAR10/ which slows down the tests and may be the cause of CI failure.

coverage run --source pl_bolts -m pytest pl_bolts/datasets/cifar10_dataset.py -v


Fixed in faeebf2.

@akihironitta
Copy link
Contributor Author

Not sure why the tests on windows tend to fail very often...

@akihironitta akihironitta marked this pull request as draft November 26, 2020 13:28
@akihironitta
Copy link
Contributor Author

akihironitta commented Nov 26, 2020

Might be off topic, but I just realised that CI full testing / pytest (ubuntu-20.04, 3.7, latest) particularly tend to fail with the error (not sure why):

/home/runner/work/_temp/5ef79e81-ccef-44a4-91a6-610886c324a6.sh: line 2:  1855 Killed                  coverage run --source pl_bolts -m pytest pl_bolts tests --exitfirst -v --junitxml=junit/test-results-Linux-3.7-latest.xml
Error: Process completed with exit code 137.

WIP in #409.

@akihironitta
Copy link
Contributor Author

akihironitta commented Nov 26, 2020

Not sure why the tests on windows tend to fail very often...

Seems the last number of ci runs on windows went good without any problems... Not sure why


WIP in #409.

@Borda
Copy link
Member

Borda commented Dec 3, 2020

@akihironitta how is it going, can I help somehow? 🐰

@akihironitta
Copy link
Contributor Author

@Borda I was trying to reproduce the error, RuntimeError('Failed download...) and see the content of the directory datasets/ when the error occurs, but the error didn't occur at all for the last tens of runs...

Now, I'll be making several empty commits again to run the tests in parallel to reproduce the error and to see if the cifar10 is correctly cached.

@akihironitta
Copy link
Contributor Author

Now, all say Cache not found (even for pip):


This is only a side note, but it is strange that on runs on the same commit 30a7f6d, the hash value on windows is different from the ones on macOS and Ubuntu.

@akihironitta
Copy link
Contributor Author

datasets/ is only 1.1GB (confirmed locally with du -hs datasets), so caches should not exceed the GitHub limit which is 5GB.

A repository can have up to 5GB of caches. Once the 5GB limit is reached, older caches will be evicted based on when the cache was last accessed. Caches that are not accessed within the last week will also be evicted.

https://github.com/actions/cache#cache-limits

@akihironitta
Copy link
Contributor Author

It seems ci on the above commit (03fdeab) managed to find the caches. I don't understand why the runs on 30a7f6d and on other commits weren't able to find the caches even though the hash value for the caches was the same because tests/conftest.py has never been modified on this PR.

@Borda
Copy link
Member

Borda commented Dec 3, 2020

looks strange, they say that GH actions are still in development but as much... :]

@akihironitta
Copy link
Contributor Author

akihironitta commented Dec 5, 2020

looks strange, they say that GH actions are still in development but as much... :]

@Borda Right... so what else do you think we can do here? I don't think I can solve this problem because I cannot reproduce the error UNEXPECTED EXCEPTION: RuntimeError('Failed download...') for quite a number of runs on GitHub Actions in this branch...

Also, shall we try to fix the below error which happens very often in #409? (All the runs on the two commits right above this comment failed due to this.)

/home/runner/work/_temp/5ef79e81-ccef-44a4-91a6-610886c324a6.sh: line 2:  1855 Killed                  coverage run --source pl_bolts -m pytest pl_bolts tests --exitfirst -v --junitxml=junit/test-results-Linux-3.7-latest.xml
Error: Process completed with exit code 137.

@Borda Borda removed the Priority High priority task label Dec 7, 2020
@Borda Borda force-pushed the master branch 2 times, most recently from feb8f77 to 3d49fd4 Compare January 22, 2021 21:30
@akihironitta
Copy link
Contributor Author

Closing this PR as there isn't anything to investigate on this further.

Code Health / Refatoring automation moved this from In progress to Done Mar 30, 2021
@akihironitta akihironitta deleted the ci/fix-dataset-caching branch March 30, 2021 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/cd Continues Integration and delivery datamodule Anything related to datamodules fix fixing issues...
Development

Successfully merging this pull request may close these issues.

None yet

2 participants