Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove input data folder from SpeedyWeather.jl #178

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dmey
Copy link
Contributor

@dmey dmey commented Nov 7, 2022

This PR removes the input_data folder from SpeedyWeather.jl and adds functions to check and download input data at the first run of SpeedyWeather.jl. As more input data are likely to be added soon (e.g. surface fluxes param), I thought this is a good time to add something like this before this repo gets too bloated. For now, I use the same licence as in SpeedyWeather.jl but we should eventually correct the copyright and licencing information to that of the data. For now, and for simplicity, the input_data archive is stored on GitHub at https://github.com/dmey/speedy-weather-input-data but we should eventually move it onto HTTP storage. @milankl more immediately, if you like the sound of this, I would like to move the input_data repo under your username or on org for SpeedyWeather.jl.

@dmey dmey changed the title Remove input data folder from Remove input data folder from SpeedyWeather.jl Nov 7, 2022
@maximilian-gelbrecht
Copy link
Member

maximilian-gelbrecht commented Nov 7, 2022

Maybe that's now a good time to start an org for Speedy and move the repo to it, as this would also help with setting up GPU CI. Then the input data repo could also be there.

@milankl
Copy link
Member

milankl commented Nov 8, 2022

Just set up github.com/SpeedyWeather and therein a repository called SpeedyWeatherInputData

@milankl
Copy link
Member

milankl commented Nov 8, 2022

I think we should agree on an interface between SpeedyWeather.jl and SpeedyWeatherInputData.jl, because this might be also the time to think about what to do if someone wants to use their own orography for example. One way could be do officially register SpeedyWeatherInputData.jl and then have as a default

run_speedy(input_data=SpeedyWeatherInputData)

but someone could just load another module

import MyInputData
using SpeedyWeather
run_speedy(input_data=MyInputData)

For orography for example we need just an array on a grid that is supported by the spectral transform in SpeedyWeather.jl, so SpeedyWeatherInputData could provide a function like get_orography and the transform is then done in SpeedyWeather.jl. What do you think?

Downloads input files as required by SpeedyWeather.jl
"""
function download_input_data(input_data_path)
input_data_url = "https://github.com/dmey/speedy-weather-input-data/archive/refs/heads/main.zip"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we shouldn't hardcode the url. But I'm not sure what the best way is to provide a nice interface, which remains compatible or at least provides some control between input data version and SpeedyWeather.jl version. I like the idea to just create a download some netcdf files and create a folder once. But not sure what undesired consequences of that are compared to having a registered package with version and we can then adapt and adjust version compatibility in SpeedyWeather.jl.

@milankl
Copy link
Member

milankl commented Nov 8, 2022

Regarding orga see #182

@milankl milankl added input data 💾 Input data used by default (Orography, land-sea mask, etc) user interface 🎹 How users use our user interface dependency 🦮 Dependencies on other packages labels Nov 8, 2022
@dmey
Copy link
Contributor Author

dmey commented Nov 8, 2022

I'll come back to this tomorrow but some food for thought... How do we want the data to be consumed eventually. Storing data on GitHub maybe fine initially but that's not really what we should do in the long-er term. We could theoretically store the data as artefacts in the release etc but I wonder if by registering a package like SpeedyWeatherInputData.jl you were thinking of permanently keeping it as a git repo...

@milankl
Copy link
Member

milankl commented Nov 8, 2022

I'm honestly thinking to postpone this idea. While I agree with it in general, moving the data somewhere else and finding a good interface should probably only be done once we've had the chance to play around with it more. This is to understand what we would want from such a code/input data separation and what interface would make most sense. At the moment, I wanted to have primitive equations (but not necessarily all the parameterizations) for v0.5 and only if the surface parameterizations are implemented we will know better how to structure this separation.

Does that make sense?

@dmey
Copy link
Contributor Author

dmey commented Nov 9, 2022

I think we should agree on an interface between SpeedyWeather.jl and SpeedyWeatherInputData.jl, because this might be also the time to think about what to do if someone wants to use their own orography for example. One way could be do officially register SpeedyWeatherInputData.jl and then have as a default

run_speedy(input_data=SpeedyWeatherInputData)

but someone could just load another module

import MyInputData
using SpeedyWeather
run_speedy(input_data=MyInputData)

For orography for example we need just an array on a grid that is supported by the spectral transform in SpeedyWeather.jl, so SpeedyWeatherInputData could provide a function like get_orography and the transform is then done in SpeedyWeather.jl. What do you think?

Something simpler--why not just move the data for now and we can see if there is demand for adding an interface in the future? A standard dataset should be provided by default but my point was about not making it part of the codebase.

Yeah, we shouldn't hardcode the url. But I'm not sure what the best way is to provide a nice interface, which remains compatible or at least provides some control between input data version and SpeedyWeather.jl version. I like the idea to just create a download some netcdf files and create a folder once. But not sure what undesired consequences of that are compared to having a registered package with version and we can then adapt and adjust version compatibility in SpeedyWeather.jl.

For now it's fine to keep it on GitHub they are not much data anyway. Re hard coding I don't see an issue as I would create a release for the dataset at https://github.com/SpeedyWeather/SpeedyWeatherInputData an use that in one or more SpeedyWeather releases. I think data releases will be very few so it will not create much overhead.

I'm honestly thinking to postpone this idea. While I agree with it in general, moving the data somewhere else and finding a good interface should probably only be done once we've had the chance to play around with it more. This is to understand what we would want from such a code/input data separation and what interface would make most sense. At the moment, I wanted to have primitive equations (but not necessarily all the parameterizations) for v0.5 and only if the surface parameterizations are implemented we will know better how to structure this separation.

Does that make sense?

I was not thinking of creating an interface, I only meant it as a way to remove static artefacts from the current code repo. What you propose about interface sounds good but could be done at a later stage. The additional point I made about not storing static artifacts on GitHub still applies but I meant it as a future step--moving the data to HTTP storage would make little sense resource-wise for us now. In other words, moving the data to https://github.com/SpeedyWeather/SpeedyWeatherInputData will do for now...

@milankl
Copy link
Member

milankl commented Nov 9, 2022

Another point is reproducibility. If we turn some SpeedyWeather.jl release into a doi (say with zenodo) in the future then we would have a dependency on another repository which should be clearly tagged with a version number. What I like about having the smallest default data set within the repository itself is that we don't have to deal with that.

@dmey
Copy link
Contributor Author

dmey commented Nov 9, 2022

Another point is reproducibility. If we turn some SpeedyWeather.jl release into a doi (say with zenodo) in the future then we would have a dependency on another repository which should be clearly tagged with a version number.

Yes but you'd do anyway. In fact I would have suggested to put both repos on Zenodo as soon as you can and then pull the data with the DOI. Reproducibility has little to do with it as you would version it anyway as you do with dependencies and that does not make SpeedyWeather unreproducible. I outlined this briefly at https://doi.org/10.1029/2019MS001961, Chapter 5.

What I like about having the smallest default data set within the repository itself is that we don't have to deal with that.

Yes I agree for 2 something megabytes it does not make sense this is to start getting things ready for the future.

@milankl
Copy link
Member

milankl commented Nov 10, 2022

Alright, I give this PR a v0.6 milestone for now and we can pick that up again once the primitive equation model is up and running and we actually use all that input data.

@milankl milankl added this to the v0.6 milestone Nov 10, 2022
@milankl milankl modified the milestones: v0.6, v0.7 Aug 30, 2023
@milankl milankl removed this from the v0.7 milestone Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependency 🦮 Dependencies on other packages input data 💾 Input data used by default (Orography, land-sea mask, etc) user interface 🎹 How users use our user interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants