Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don’t download model files for every test run #238

Open
JostMigenda opened this issue May 16, 2023 · 5 comments
Open

Don’t download model files for every test run #238

JostMigenda opened this issue May 16, 2023 · 5 comments
Assignees

Comments

@JostMigenda
Copy link
Member

Right now, tests on GitHub Actions need to download all model files for every test run. This has multiple disadvantages:

  • It breaks tests on PRs that add new model files, see e.g. Missing directions/masses in Walk and Tamborra models #231 and in particular this comment
  • It means tests running on GitHub may behave subtly different than on a developer’s local computer (where the model files are likely already downloaded and available in snewpy.model_path)
  • Needing to download 100s of MB of model files slows down tests and wastes energy and bandwidth.

At least for model files that are in the GitHub repository anyway, the files are already available on the test runner; it would be sufficient to change snewpy.model_path to point to the models/ directory in the repo. We could then also eliminate a lot of duplicate testing code in tests/test_0{1,2}*.py.
(Instead, we would probably need one download test running as part of the release workflow—after tagging, but before uploading the release to PyPI.)

Open questions: How to deal with models whose files are not in the GitHub repo? (E.g. on Zenodo?)

@JostMigenda JostMigenda self-assigned this May 16, 2023
@Sheshuk
Copy link
Contributor

Sheshuk commented May 17, 2023

Hi Jost, this looks like a good solution for the time being.
But I thought the idea in the long run was to eventually move the data files to Zenodo or elsewhere, no?
Currently having the data strictly tied to the current release of SNEWPY on github leads to problems like this - when we change the data, but have no possibility to work with it properly, since it is not in the release yet.

I think moving all the data to an outside repository/Zenodo entry would solve all of the mentioned problems.
Also developer won't have to download the data that is not needed, and cloning SNEWPY will be fast. And when the data files are updated, we can just make a new release of the data repository, decoupled from the development of the software

@JostMigenda
Copy link
Member Author

Moving the data files elsewhere would solve the first point; but not the second and third one. (In particular, having the files outside of GitHub would potentially make the third point much worse—downloading files over the internet from e.g. Zenodo is much slower than downloading them from another GitHub server that’s potentially a few meters away in the same data centre.)

For adding new models, yes, being able to use them directly from Zenodo (where modeling groups have hopefully made them available themselves) would be good. For existing model files, I’m not sure whether it’s worth the effort to move them elsewhere. But as you say, that’s a discussion for the long run; if we agree that the proposal above is a good solution in the meantime & since it doesn’t impede other long-term solutions, I think that’s good enough for now.

@Sheshuk
Copy link
Contributor

Sheshuk commented May 17, 2023

For the second problem: having all the models decoupled from the main SNEWPY repository means that the developer will be in exactly the same position as the testing script in the GitHub job. Download the models, or reuse the ones that are already in cache.

In particular, having the files outside of GitHub would potentially make the third point much worse—downloading files over the internet from e.g. Zenodo is much slower than downloading them from another GitHub server that’s potentially a few meters away in the same data centre.

I agree, this might be the case (although it's worth checking if this problem is real).
But then what we can do is to put all the models we have into a separate GitHub repository, and version it any way we like.

@JostMigenda
Copy link
Member Author

For Zenodo, we’ve previously observed issues where downloads would fail completely or tests would take ~40 min to run, largely due to slow downloads. It’s possible that was some temporary hiccup for a few days; but even if it’s reliable and faster now, it would be good to avoid even a few min extra wait on every test run.

@Sheshuk
Copy link
Contributor

Sheshuk commented May 17, 2023

I see, thanks for pointing to the issue.
But what do you think about having models just in a separate repository (or several)?
This can be done real quick.

One more thought about running tests: it's possible to cache the directories between the runs, I think this is done especially for such use case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants