
Integrate hadronsampledata into hadron #295

Open · wants to merge 4 commits into master

Conversation

@martin-ueding (Contributor) commented Aug 3, 2020

Closes #292.

I have made the repository hadron_example_data into a real R package. It needs to be built using R CMD build . and then deployed, as described below.

Then I have created a new repository, r-repo, which holds the data and serves it over HTTP. The example data package is deployed into it via drat::insertPackage('hadronexampledata_0.0.0.9000.tar.gz', repodir = '../r-repo/', action = 'prune'). Then go to r-repo, commit, and push.
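
For reference, the whole deployment cycle then consists of a build step in the shell followed by the drat call, roughly like this (a sketch consolidating the steps above):

# In the hadron_example_data checkout, build the tarball first (shell):
#   R CMD build .
# Then, from R, insert the tarball into the drat repository:
drat::insertPackage('hadronexampledata_0.0.0.9000.tar.gz',
                    repodir = '../r-repo/', action = 'prune')
# Finally, commit and push r-repo so that GitHub Pages serves the update.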

On r-repo I have enabled GitHub Pages to serve the content at https://hiskp-lqcd.github.io/r-repo/, so that we have our own R package repository (like CRAN).

This pull request adds a little more metadata to hadron, so that one can run install.packages('hadronexampledata') after loading the hadron library. This seems to work just fine after installing and loading the version from this pull request:

> install.packages('hadronexampledata')
Installing package into ‘/home/mu/R/x86_64-redhat-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://hiskp-lqcd.github.io/r-repo/src/contrib/hadronexampledata_0.0.0.9000.tar.gz'
Content type 'application/gzip' length 18625262 bytes (17.8 MB)
==================================================
downloaded 17.8 MB

* installing *source* package ‘hadronexampledata’ ...
** using staged installation
** inst
** help
No man pages found in package  ‘hadronexampledata’ 
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (hadronexampledata)

The downloaded source packages are in
        ‘/tmp/RtmpkeqDrH/downloaded_packages’

The data that was already present in the example data package was not in an R format (Rdata/rds), so I have moved it into the inst directory. One can now add further Rdata/rds files to that package and have them shipped with it.
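
To illustrate, files placed in inst/ are installed along with the package and can be located with system.file(); a minimal sketch, assuming a hypothetical file inst/extdata/sample_cf.rds:

# Locate a file shipped under inst/extdata of the installed data package
# ('extdata/sample_cf.rds' is a made-up name for illustration).
path <- system.file('extdata', 'sample_cf.rds',
                    package = 'hadronexampledata')
cf <- readRDS(path)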

The size limit on GitHub is 1 GB, so we should be able to use this approach even for larger files.

@kostrzewa (Member)

This is a very nice solution!

@martin-ueding (Contributor, Author)

I've just implemented what you have linked in the blog posts, so it isn't my original idea 😄.

Carsten suggested in the linked issue that we should rather use a web server; I guess we would then offer a bunch of ZIP archives with data for download. Perhaps we should settle on one approach, and then this whole PR might well be obsolete.

@urbach (Member) commented Aug 3, 2020 via email

@martin-ueding (Contributor, Author)

CRAN has a 5 MB limit; the idea is specifically not to have it on CRAN and to put a bunch of data here instead.

Just because I put in the work does not mean that we have to keep it. If a loose collection of ZIP files is easier in the long run, we should discard the work here.

@urbach (Member) commented Aug 3, 2020 via email

@urbach (Member) commented Aug 3, 2020 via email

@martin-ueding (Contributor, Author)

We just have it in Suggests, where we already have soft dependencies on Bioconductor. That itself does not seem to be a problem. And the Additional_repositories entry ensures that its existence can be verified during R CMD check.

We could also just use .onLoad and not list the package in the DESCRIPTION, if that is preferable. Perhaps we just wait for the next round of feedback from CRAN when we want to publish an update?
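
For comparison, the DESCRIPTION route amounts to the two entries Suggests: hadronexampledata and Additional_repositories: https://hiskp-lqcd.github.io/r-repo/, while the .onLoad variant would look roughly like this sketch (the repository nickname 'hiskp' is made up, and this is not the code in the PR):

# Hypothetical .onLoad hook in hadron: register the drat repository so that
# install.packages('hadronexampledata') can find the example data package.
.onLoad <- function(libname, pkgname) {
  repos <- getOption('repos')
  if (!'hiskp' %in% names(repos)) {
    repos['hiskp'] <- 'https://hiskp-lqcd.github.io/r-repo/'
    options(repos = repos)
  }
}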

@martin-ueding (Contributor, Author)

The dependency is a weak one: users need to install the example data package manually, and it lives in a different namespace.

If we limited ourselves to 5 MB, we could make a data package, publish it on CRAN, and make it a hard dependency. For the data that we have in mind, though, I thought 5 MB would not be enough.

@urbach (Member) commented Aug 3, 2020 via email

@martin-ueding (Contributor, Author)

It seems that if one promises to update a package less often, one can exceed the 5 MB limit. What exactly do we want to offer? So far I have only seen the pion form factor data, which would be used to demonstrate the reading routines. That data would be more useful as a ZIP archive than installed into some location in the R library, I think. What other data would we want to provide to end users?

@kostrzewa (Member)

I want to add gradient flow and loop files, as well as some stuff for raw_cf examples. These are all data files which will be several hundred MB in total.

@kostrzewa (Member)

As far as I have understood the literature that I posted, CRAN data packages should allow for this.

@martin-ueding (Contributor, Author)

If we just have it on our own web server (GitHub Pages), we have maximum flexibility.

If the package is several hundred MB, then we will have a problem after a few releases, as GitHub only allows repositories up to 1 GB in size. We would need to truncate the history of the r-repo repository and force-push, which isn't too bad. If we just opt for a directory of archives, this issue would only arise once the history of that data gets too long or the data becomes too large in total.

I guess the most interesting question is whether we want the data to be available via the data() mechanism in R or whether using readRDS or load would be fine as well.
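
To make the two options concrete (object and file names are hypothetical):

# Variant 1: data() mechanism -- requires an .rda file in the data/ directory
# of hadronexampledata; 'pion_ff' is a made-up object name.
data('pion_ff', package = 'hadronexampledata')

# Variant 2: a plain file under inst/, read explicitly.
pion_ff <- readRDS(system.file('extdata', 'pion_ff.rds',
                               package = 'hadronexampledata'))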

@kostrzewa (Member)

If I use the data in examples, it must be available in inst/ in some way, right? This means that it probably has to be a package.

You are right that this will basically be impossible to do on GitHub. Pulling it as a ZIP file might be an option, although I'm not sure whether the examples would continue to work.

@kostrzewa (Member)

I need to see how small I can make the example data for it to still be practically useful.

@martin-ueding (Contributor, Author)

There must be additional possibilities for hosting files; Carsten could, for instance, use his home directory at the institute. Or we could ask for web space, which should not be that hard. Otherwise, 1 GB on GitHub would get us pretty far; we could also use multiple repositories to circumvent the limit, or just ask for an upgrade as part of our academic plan.

In machine learning notebooks with Python, one usually downloads the data from some website when it is needed. I guess we can do the same thing in R with a package like curl or similar. The disadvantage of a data package would be that all data needs to live in one place.
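
A minimal sketch of that download-on-demand pattern, using only base R (the URL is hypothetical):

# Fetch an archive when needed and cache it in the session's temp directory.
url  <- 'https://hiskp-lqcd.github.io/r-repo/data/pion_ff.zip'  # made-up URL
dest <- file.path(tempdir(), 'pion_ff.zip')
if (!file.exists(dest)) {
  download.file(url, dest, mode = 'wb')
}
unzip(dest, exdir = file.path(tempdir(), 'pion_ff'))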

If you want to use the data in examples of hadron, we will need the data package to be on CRAN. Installing it from external sources only seems to work after the fact, and I believe we cannot have install.packages('hadronexampledata') in the examples. I am not sure whether one could download files via HTTPS while CRAN builds the package, so that might not be possible either.
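
(For what it's worth, the usual way to use a Suggests-only package in examples is to guard the code, roughly like this sketch with a hypothetical file name; CRAN still needs to be able to install the package for checks, which is what the Additional_repositories entry is for.)

# Only run the example if the data package happens to be installed.
if (requireNamespace('hadronexampledata', quietly = TRUE)) {
  path <- system.file('extdata', 'pion_ff.rds',
                      package = 'hadronexampledata')
  cf <- readRDS(path)
}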

A data package on CRAN would be the most formal variant and the easiest to use afterwards. The process is just the most tedious and the limitations the strictest.

@urbach (Member) commented Aug 3, 2020

Having such large data sets in regular examples is not a good idea anyhow, because they would presumably run for too long!
