
Integrate hadronsampledata into hadron #295

Open · wants to merge 4 commits into master

Conversation

@martin-ueding (Contributor) commented Aug 3, 2020

Closes #292.

I have made the repository hadron_example_data into a real R package. It needs to be built using R CMD build . and then deployed, as described below.

Then I have created a new repository, r-repo, which holds the data and serves it over HTTP. The example data package is deployed into it via drat::insertPackage('hadronexampledata_0.0.0.9000.tar.gz', repodir = '../r-repo/', action = 'prune'). Then go to r-repo, commit, and push.
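
For reference, the whole deployment cycle then consists of a build step in the shell followed by the drat call, roughly like this (a sketch consolidating the steps above):

# In the hadron_example_data checkout, build the tarball first (shell):
#   R CMD build .
# Then, from R, insert the tarball into the drat repository:
drat::insertPackage('hadronexampledata_0.0.0.9000.tar.gz',
                    repodir = '../r-repo/', action = 'prune')
# Finally, commit and push r-repo so that GitHub Pages serves the update.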

On r-repo I have enabled GitHub Pages to serve the content at https://hiskp-lqcd.github.io/r-repo/, so that we have our own R package repository (like CRAN).

This pull request adds a little more metadata to hadron, so that one can run install.packages('hadronexampledata') after loading the hadron library. This seems to work just fine after installing and loading the version from this pull request:

> install.packages('hadronexampledata')
Installing package into ‘/home/mu/R/x86_64-redhat-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://hiskp-lqcd.github.io/r-repo/src/contrib/hadronexampledata_0.0.0.9000.tar.gz'
Content type 'application/gzip' length 18625262 bytes (17.8 MB)
==================================================
downloaded 17.8 MB

* installing *source* package ‘hadronexampledata’ ...
** using staged installation
** inst
** help
No man pages found in package  ‘hadronexampledata’ 
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (hadronexampledata)

The downloaded source packages are in
        ‘/tmp/RtmpkeqDrH/downloaded_packages’

The data that was already present in the example data package was not in an R format (Rdata/rds), so I have moved it into the inst directory. One can now add further Rdata/rds files to that package and have them shipped with it.
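
To illustrate, files placed in inst/ are installed along with the package and can be located with system.file(); a minimal sketch, assuming a hypothetical file inst/extdata/sample_cf.rds:

# Locate a file shipped under inst/extdata of the installed data package
# ('extdata/sample_cf.rds' is a made-up name for illustration).
path <- system.file('extdata', 'sample_cf.rds',
                    package = 'hadronexampledata')
cf <- readRDS(path)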

The size limit on GitHub is 1 GB, so we should be able to use this approach even for larger files.

@kostrzewa (Member)

This is a very nice solution!

@martin-ueding (Contributor, Author)

I've just implemented what you have linked in the blog posts, so it isn't my original idea 😄.

Carsten suggested in the linked issue that we should rather use a web server; I guess we would then offer a bunch of ZIP archives with data for download. Perhaps we should settle on one approach, and then this whole PR might well be obsolete.

@urbach (Member) commented Aug 3, 2020 via email

@martin-ueding (Contributor, Author)

CRAN has a 5 MB limit; the idea is specifically not to have it on CRAN and to put a bunch of data here instead.

Just because I put in the work does not mean that we have to keep it. If a loose collection of ZIP files is easier in the long run, we should discard the work here.

@urbach (Member) commented Aug 3, 2020 via email

@urbach (Member) commented Aug 3, 2020 via email

@martin-ueding (Contributor, Author)

We just have it in Suggests, where we already have soft dependencies on Bioconductor. That itself does not seem to be a problem. And the Additional_repositories entry ensures that its existence can be verified during R CMD check.

We could also just use .onLoad and not list the package in the DESCRIPTION, if that is preferable. Perhaps we just wait for the next round of feedback from CRAN when we want to publish an update?
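
For comparison, the DESCRIPTION route amounts to the two entries Suggests: hadronexampledata and Additional_repositories: https://hiskp-lqcd.github.io/r-repo/, while the .onLoad variant would look roughly like this sketch (the repository nickname 'hiskp' is made up, and this is not the code in the PR):

# Hypothetical .onLoad hook in hadron: register the drat repository so that
# install.packages('hadronexampledata') can find the example data package.
.onLoad <- function(libname, pkgname) {
  repos <- getOption('repos')
  if (!'hiskp' %in% names(repos)) {
    repos['hiskp'] <- 'https://hiskp-lqcd.github.io/r-repo/'
    options(repos = repos)
  }
}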

@martin-ueding (Contributor, Author)

The dependency is a weak one: users need to install the example data package manually, and it lives in a different namespace.

If we limited ourselves to 5 MB, we could make a data package, publish it on CRAN, and make it a hard dependency. For the data that we have in mind, though, I thought 5 MB would not be enough.

@urbach (Member) commented Aug 3, 2020 via email

@martin-ueding (Contributor, Author)

It seems that if one promises to update a package less often, one can exceed the 5 MB limit. What exactly do we want to offer? So far I have only seen the pion form factor data, which would be used to demonstrate the reading routines. That data would be more useful as a ZIP archive than installed into some location in the R library, I think. What other data would we want to provide to end users?

@kostrzewa (Member)

I want to add gradient flow and loop files, as well as some stuff for raw_cf examples. These are all data files which will be several hundred MB in total.

@kostrzewa (Member)

As far as I have understood the literature that I posted, CRAN data packages should allow for this.

@martin-ueding (Contributor, Author)

If we just have it on our own web server (GitHub Pages), we have maximum flexibility.

If the package is several hundred MB, then we will have a problem after a few releases, as GitHub only allows repositories up to 1 GB in size. We would need to truncate the history of the r-repo repository and force-push, which isn't too bad. If we just opt for a directory of archives, this issue would only arise once the history of that data gets too long or the data becomes too large in total.

I guess the most interesting question is whether we want the data to be available via the data() mechanism in R or whether using readRDS or load would be fine as well.
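
To make the two options concrete (object and file names are hypothetical):

# Variant 1: data() mechanism -- requires an .rda file in the data/ directory
# of hadronexampledata; 'pion_ff' is a made-up object name.
data('pion_ff', package = 'hadronexampledata')

# Variant 2: a plain file under inst/, read explicitly.
pion_ff <- readRDS(system.file('extdata', 'pion_ff.rds',
                               package = 'hadronexampledata'))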

@kostrzewa (Member)

If I use the data in examples, it must be available in inst/ in some way, right? This means that it probably has to be a package.

You are right that this will basically be impossible to do on GitHub. Pulling it as a ZIP file might be an option, although I'm not sure whether the examples would continue to work.

@kostrzewa (Member)

I need to see how small I can make the example data for it to still be practically useful.

@martin-ueding (Contributor, Author)

There must be additional possibilities for hosting files; Carsten could, for instance, use his home directory at the institute. Or we could ask for web space, which should not be that hard. Otherwise, 1 GB on GitHub would get us pretty far; we could also use multiple repositories to circumvent the limit, or just ask for an upgrade as part of our academic plan.

In machine learning notebooks with Python, one usually downloads the data from some website when it is needed. I guess we can do the same thing in R with a package like curl or similar. The disadvantage of a data package would be that all data needs to live in one place.
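
A minimal sketch of that download-on-demand pattern, using only base R (the URL is hypothetical):

# Fetch an archive when needed and cache it in the session's temp directory.
url  <- 'https://hiskp-lqcd.github.io/r-repo/data/pion_ff.zip'  # made-up URL
dest <- file.path(tempdir(), 'pion_ff.zip')
if (!file.exists(dest)) {
  download.file(url, dest, mode = 'wb')
}
unzip(dest, exdir = file.path(tempdir(), 'pion_ff'))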

If you want to use the data in examples of hadron, we will need the data package to be on CRAN. Installing it from external sources only seems to work after the fact, and I believe we cannot have install.packages('hadronexampledata') in the examples. I am not sure whether one could download files via HTTPS while CRAN builds the package, so that might not be possible either.
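
(For what it's worth, the usual way to use a Suggests-only package in examples is to guard the code, roughly like this sketch with a hypothetical file name; CRAN still needs to be able to install the package for checks, which is what the Additional_repositories entry is for.)

# Only run the example if the data package happens to be installed.
if (requireNamespace('hadronexampledata', quietly = TRUE)) {
  path <- system.file('extdata', 'pion_ff.rds',
                      package = 'hadronexampledata')
  cf <- readRDS(path)
}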

A data package on CRAN would be the most formal variant and the easiest to use afterwards. The process is just the most tedious and the limitations the strictest.

@urbach (Member) commented Aug 3, 2020

Having such large data sets in regular examples is not a good idea anyhow, because they would presumably run for too long!
