
Question about R packages #2951

Closed

adamjstewart opened this issue Jan 27, 2017 · 15 comments
Labels: documentation, question, R

Comments

@adamjstewart
Member

adamjstewart commented Jan 27, 2017

I've never used R before, but a user asked for a few R modules, which happen to depend on hundreds of others, so I've found myself adding a lot of new packages. Since I don't really know much about R, I have a few questions on how I should proceed with these packages:

  1. I noticed there are some modules that come with R (methods, grid, stats) that don't exist on CRAN and can be imported easily. But there are also a few modules that come with R (rpart, survival) that are on CRAN and print a message when imported. Should I add packages for/dependencies on the latter?

  2. I noticed there are several types of dependencies (depends, imports, linkingTo, suggests). What are the differences? If I had to guess, I would say depends==build/run, imports==run, linkingTo==build/link, and suggests==build/run but is optional. Is this correct?

  3. Should I add suggested dependencies as long as they don't create a circular dependency? I imagine that build failures are rare and things build quickly, just like with Python, so I'm inclined to add them.

@glennpj @JavierCVilla

@adamjstewart
Member Author

adamjstewart commented Jan 27, 2017

On second thought, packages like lava make me not want to add suggested dependencies. Most of its suggested dependencies aren't in Spack, and concretization of R packages is incredibly slow as it is.

@citibeth
Member

I would imagine that which suggested dependencies you add should depend on your users' requirements.

@HenrikBengtsson
Contributor

I'd say I know lots about R - it's been my main development environment for the last 16+ years.

I noticed there are some modules that come with R (methods, grid, stats) that don't exist on CRAN and can be imported easily.

These packages are part of the core R distribution and are tied to the R version installed. You can basically consider them to be "R itself" (non-official mirror: https://github.com/wch/r-source/tree/trunk/src/library). They are so essential to R that it would not make sense for them to be updated via CRAN; if they were, you would basically get a different version of R. Thus, they're updated when R is updated.

But there are also a few modules that come with R (rpart, survival) that are on CRAN and print a message when imported. Should I add packages for/dependencies on the latter?

These are so-called "Recommended" packages (a legacy term). For historical reasons they are quite tied to the core R distribution, being developed by the R core team or people closely related to it. The core R distribution "knows" about these packages (cf. https://github.com/wch/r-source/blob/trunk/share/make/vars.mk), but they are indeed distributed via CRAN. Because they're distributed via CRAN, they can also be updated between R version releases.

I noticed there are several types of dependencies (depends, imports, linkingTo, suggests). What are the differences? If I had to guess, I would say depends==build/run, imports==run, linkingTo==build/link, and suggests==build/run but is optional. Is this correct?

Basically, all of Depends, Imports, and LinkingTo are required dependencies during install and run-time. Anything under Suggests is optional (e.g. rarely used features) at both install and run-time.

Depends is required and will cause those R packages to be attached, that is, their APIs are exposed to the user. Imports loads packages so that the importing package can access their APIs without exposing them to the user. When a user calls library(foo), s/he attaches package foo and all of the packages under its Depends. Any function in one of these packages can then be called directly as bar(). If there are conflicts, the user can also write pkgA::bar() and pkgB::bar() to distinguish between them. Historically, there was only Depends and Suggests, hence the confusing names; today, Depends would probably have been named Attaches.
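To make the attach-versus-import distinction concrete, here's a minimal illustrative sketch (using MASS and nlme, two packages that ship with R, purely as examples):

library(MASS)                              # attach, as Depends would
"package:MASS" %in% search()               # TRUE; fitdistr() etc. can now be called directly

requireNamespace("nlme", quietly = TRUE)   # load without attaching, as Imports effectively does
exists("lme")                              # FALSE: nlme is loaded but not on the search path
nlme::lme                                  # still reachable via the :: operator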

LinkingTo is not perfect, and there was recently an extensive discussion about API/ABI (among other things) on the R-devel mailing list among very skilled R developers.

Should I add suggested dependencies as long as they don't create a circular dependency? I imagine that build failures are rare and things build quickly, just like with Python, so I'm inclined to add them.

No, I would not install Suggests:ed packages.

As a rule of thumb, you can assume that dependencies under Depends and Imports (and LinkingTo) are all available on CRAN (*), whereas those under Suggests may exist elsewhere (or even need to be installed manually - they're getting stricter about this but it's still allowed).

(*) Both CRAN and Bioconductor are actually considered mainstream R repositories, so you can have dependencies from a CRAN package to a Bioconductor package, e.g. https://cran.r-project.org/package=PSCBS (imports DNAcopy from Bioconductor - yeah, I'm guilty of that one).

Now to something I've wanted to ask for a while. As you might be aware, CRAN hit 10,000 R packages yesterday. Do you intend to provide Spack package.py:s for all of them? I think it's going to be very hard to keep up. There are about 6 new packages added per day at the current rate, and the rate keeps increasing(!) (https://stat.ethz.ch/pipermail/r-devel/2017-January/073676.html). That's not to mention the number of updates that appear each day. It sounds like you need to automate this process (possibly with a set of manual patches) in order not to go insane.

There exist a few efforts targeting reproducibility (package version dependencies etc.) for R that build on top of the R framework to handle this (https://cran.r-project.org/web/views/ReproducibleResearch.html), e.g.

It might be a better idea to have the R community worry about this, especially since that's most likely where users are going to get support for these types of needs.

The way I can see Spack being most valuable for R, in addition to installing R itself, is to provide / install the necessary compilers and libraries on the PATH / LD_LIBRARY_PATH / ... so that packages install out of the box when the user calls:

install.packages("foo")

I'd assume this will be the 99.9% use case everyone has. Personally, I don't think that I'll be installing R packages via spack install r-foo. I could see it being a fallback for when it's not clear what the lib* dependencies are or how to install them on your particular platform. Some R packages provide very helpful error messages during configure that users can basically cut'n'paste, whereas others assume the user is much more savvy. But the general trend is that the information in LinkingTo and SystemRequirements (e.g. https://cran.r-project.org/package=stringi) is getting better and better.

@citibeth
Member

This is a really good writeup. I've tagged this thread so it can make it into our documentation.

citibeth added the documentation label Jan 28, 2017
@adamjstewart
Member Author

@HenrikBengtsson Thanks for the thorough explanation! So it sounds like as far as Spack is concerned, depends/imports/linkingTo are all basically equivalent, and all of these dependencies are needed at build/run time but aren't needed for linking (with RPATH).

I agree with you that it is a total pain in the ass to package all of these dependencies manually. I do plan on making automatic creation of R packages simpler, where you could run something like:

$ spack create --template r rpart

and Spack would create something that looked like:

from spack import *


class RRpart(RPackage):
    """FIXME: Put a proper description of your package here."""

    homepage = "https://cran.r-project.org/package=rpart"
    url      = "https://cran.r-project.org/src/contrib/rpart_4.1-10.tar.gz"
    list_url = "https://cran.r-project.org/src/contrib/Archive/rpart"

    version('4.1-10', '15873cded4feb3ef44d63580ba3ca46e')

Then, you would just have to add a description and add dependencies. This really isn't that bad, but updating all of the packages to the latest version is kind of annoying. We do need a better way of automating that.

I'm very new to the R ecosystem (about a week into it), but I see a lot of parallels with the difficulty I've had installing Python packages with Spack. Basically, Spack is an incredible build system when it comes to C/C++/Fortran packages that need to be compiled and linked, but it's a bit overboard for packages that don't require linking (Python, R). For these packages, it isn't hard to upgrade a single package without breaking others, and I've yet to have a user who really cared exactly how it was built or what version it was built with. For things like HDF5 or NetCDF, people really care about building with certain compilers or MPI libraries, but for Python and R, no one really cares. I'm torn between investing time in Spack's R/Python packages and just giving up and using Anaconda. Commands like spack activate frequently break, and without activation I can't realistically expect my users to know that r-rminer requires them to load a hundred other R packages to actually use it.

We've flirted with the idea of using other package managers like pip internally in Spack to install things, but we've never committed. One thing I can say is that for users who do not have internet access (on restricted clusters), Spack makes it easy for them to download an entire mirror of R packages and install them on their own. And for R packages that require non-R dependencies (not sure how common this is) Spack makes it easy to install them.

As someone much more familiar with R, how likely is it that different users will want different versions of an R package, or ones built in a different way? Can I safely assume that they really don't care and just want the latest and greatest of everything they require? Can I run install.packages("rminer") and R will install all of the dependencies for me? How do I update all installed packages?

@HenrikBengtsson
Contributor

HenrikBengtsson commented Jan 29, 2017

$ spack create --template r rpart

CRAN provides a single file https://cran.r-project.org/src/contrib/PACKAGES, which is used by R's install.packages() et al. For instance, the entry for RcppArmadillo looks like this:

Package: RcppArmadillo
Version: 0.7.600.1.0
Imports: Rcpp (>= 0.11.0), stats, utils
LinkingTo: Rcpp
Suggests: RUnit, Matrix, pkgKitten
License: GPL (>= 2)
NeedsCompilation: yes

This should allow you to pull down dependencies too.
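For what it's worth, the same index can be queried from within R; a minimal sketch (assuming internet access, with RcppArmadillo purely as an example):

db <- available.packages(repos = "https://cloud.r-project.org")   # parses PACKAGES
db["RcppArmadillo", c("Version", "Depends", "Imports", "LinkingTo", "Suggests")]

# Recursively resolve the hard dependencies from the same database:
tools::package_dependencies("RcppArmadillo", db = db,
                            which = c("Depends", "Imports", "LinkingTo"),
                            recursive = TRUE)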

I'm actually surprised that it does not provide the full package DESCRIPTION; I was hoping you could pull out the package Title and even the Description field as well.

A more modern alternative that also provides this information would be to use the METACRAN (https://r-pkg.org/services#api & https://github.com/metacran/crandb#readme) API. For instance, compare https://cran.r-project.org/package=RcppArmadillo with:

curl https://crandb.r-pkg.org/RcppArmadillo
{"Package":"RcppArmadillo","Type":"Package","Title":"'Rcpp' Integration for the 'Armadillo' Templated Linear Algebra\u000aLibrary","Version":"0.7.600.1.0","Date":"2016-12-16","Author":"Dirk Eddelbuettel, Romain Francois and Doug Bates","Maintainer":"Dirk Eddelbuettel <edd@debian.org>","Description":"'Armadillo' is a templated C++ linear algebra library (by Conrad\u000aSanderson) that aims towards a good balance between speed and ease of use. Integer,\u000afloating point and complex numbers are supported, as well as a subset of\u000atrigonometric and statistics functions. Various matrix decompositions are\u000aprovided through optional integration with LAPACK and ATLAS libraries.\u000aThe 'RcppArmadillo' package includes the header files from the templated\u000a'Armadillo' library. Thus users do not need to install 'Armadillo' itself in\u000aorder to use 'RcppArmadillo'. 'Armadillo' is licensed under the MPL 2.0, while\u000a'RcppArmadillo' (the 'Rcpp' bindings/bridge to Armadillo) is licensed under the\u000aGNU GPL version 2 or later, as is the rest of 'Rcpp'.  Note that Armadillo\u000arequires a fairly recent compiler; for the g++ family at least version 4.6.*\u000ais required.","License":"GPL (>= 2)","LazyLoad":"yes","LinkingTo":{"Rcpp":"*"},"Imports":{"Rcpp":">= 0.11.0","stats":"*","utils":"*"},"Suggests":{"RUnit":"*","Matrix":"*","pkgKitten":"*"},"URL":"http://dirk.eddelbuettel.com/code/rcpp.armadillo.html","BugReports":"https://github.com/RcppCore/RcppArmadillo/issues","NeedsCompilation":"yes","Packaged":"2016-12-16 11:55:10.107195 UTC; edd","Repository":"CRAN","Date/Publication":"2016-12-18 10:31:12","crandb_file_date":"2016-12-18 09:32:52","date":"2016-12-18T09:31:12+00:00","releases":[]}

The guy behind METACRAN is a solid active long-term contributor to the R community. I'd consider this source reliable and sustainable.
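If it helps, that JSON is easy to consume from R as well; a small sketch, assuming the jsonlite package is installed (the field names mirror DESCRIPTION):

meta <- jsonlite::fromJSON("https://crandb.r-pkg.org/RcppArmadillo")
meta$Title          # package title, usable as a description
meta$Imports        # named list, e.g. Rcpp = ">= 0.11.0"
meta$LinkingTo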

As someone much more familiar with R, how likely is it that different users will want different versions of an R package, or ones built in a different way? Can I safely assume that they really don't care and just want the latest and greatest of everything they require?

Yes, there are quite a few people who use different versions of R in parallel. For instance, the Bioconductor Project provides a "release" and a "devel" branch of R packages. The "release" set is frozen twice a year (and only allows bug fixes). People who need access to the latest bioinformatics methods are likely to use the "devel" branch. Now, the "release" branch is tied to the most recent R version (e.g. R 3.3.2) whereas the "devel" branch requires the user to run the development version of R (e.g. R 3.4.0 devel). (It's actually a bit more complicated than this depending on the time of year, but let's ignore that.)

When a user installs a package in R, it defaults to installing under ~/R/ and in subdirectories that are architecture and R x.y version specific. For instance, I have:

  • /home/hb/R/x86_64-pc-linux-gnu-library/3.2
  • /home/hb/R/x86_64-pc-linux-gnu-library/3.3
  • /home/hb/R/x86_64-pc-linux-gnu-library/3.4

R will automatically take care of which of these a package is installed into; the user doesn't have to worry about it, e.g.

$ Rscript --version
R scripting front-end version 3.3.2 (2016-10-31)
$ Rscript -e ".libPaths()[1]"
[1] "/home/hb/R/x86_64-pc-linux-gnu-library/3.3"

It is possible to control these paths via different environment variables, so you could imagine that you extend the above directory structure to reflect various compiler options etc.
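For example, a session-level sketch of prepending such a custom library directory (the compiler-specific path here is purely hypothetical):

lib <- "~/R/x86_64-pc-linux-gnu-library/3.3-gcc-6.1.0"   # hypothetical compiler-specific path
dir.create(lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(lib, .libPaths()))                           # installs and library() lookups see it first
install.packages("rpart", lib = lib)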

Can I run install.packages("rminer") and R will install all of the dependencies for me? How do I update all installed packages?

Yes, that will automatically install dependencies.

install.packages("rminer")

defaults to

install.packages("rminer", dependencies = c("Depends", "Imports", "LinkingTo"))

An R user can update all installed R packages using:

update.packages(ask = FALSE)

That's all. Note that this will only update packages within the same R x.y.* series. That is, I can update my R 3.3.* packages this way, but when R is updated to the next major release (e.g. R 3.4.0), I have to reinstall all packages again.
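One way to ease that reinstall, sketched under the assumption that the old 3.3 library directory is still on disk:

old <- installed.packages(lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.3")[, "Package"]
new <- rownames(installed.packages())        # what the new R version already has
install.packages(setdiff(old, new))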

2017-01-29: Updated a link to point directly to https://r-pkg.org/services#api and https://github.com/metacran/crandb#readme

@adamjstewart
Member Author

@HenrikBengtsson A couple more questions for you.

  1. Can R packages be built in different stages?

For example, Autotools has:

configure
make
make check
make install
make installcheck

while Python has:

python setup.py build
python setup.py test
python setup.py install

We are currently using:

R CMD INSTALL

but I noticed there are a few other phases as well. There is a build CMD and a check CMD. Not sure why they are lowercase, but can we run:

R CMD build
R CMD check
R CMD INSTALL

for each package to separate things out? I'm hoping the check phase could be a reliable way to tell whether or not the package was built correctly.

  2. Can R packages be built in parallel?

I saw somewhere that someone recommended:

MAKE='make -j8' R CMD INSTALL

How reliable is this?

@adamjstewart
Member Author

It looks like R has some built-in support for testing installed packages. We might want to add that. We can also attempt to import (require/library) these packages after installation.
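Something like this, perhaps (a sketch using tools::testInstalledPackage(), with rpart as an arbitrary example):

# Cheap sanity check: does the package even load?
stopifnot(requireNamespace("rpart", quietly = TRUE))

# Heavier check: re-run the package's shipped examples/tests against the installed copy;
# returns 0L on success.
tools::testInstalledPackage("rpart", types = c("examples", "tests"))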

@adamjstewart
Member Author

adamjstewart commented Jan 30, 2017

I finally got all of my packages to install, only to find that I can't activate them. R comes with a few pre-installed libraries:

$ l /blues/gpfs/home/software/spack-0.10.0/opt/spack/linux-centos6-x86_64/gcc-6.1.0/r-3.3.2-puezz6voxkdfcnjbq7jxcmraojulsw72/rlib/R/library/
base     codetools  graphics    lattice  mgcv      rpart    stats4    translations
boot     compiler   grDevices   MASS     nlme      spatial  survival  utils
class    datasets   grid        Matrix   nnet      splines  tcltk
cluster  foreign    KernSmooth  methods  parallel  stats    tools

Due to this, I can't activate any of these packages, or packages that depend on them. Perhaps we should remove these packages from Spack and always depend on the versions that come with R?

@HenrikBengtsson
Contributor

  1. Package developers use R CMD build <path>/ to build the source tarball from a package directory. The output of this is a <pkg>_<ver>.tar.gz file, e.g. future_1.2.0.tar.gz.

  2. Package developers check this packaged package using R CMD check <pkg>_<ver>.tar.gz, which runs through very extensive tests and reports ERRORs, WARNINGs, and NOTEs.

  3. Packages that are submitted to CRAN need to pass an even stricter set of tests, run via R CMD check --as-cran <pkg>_<ver>.tar.gz. A package must not have any of the above flags reported, that is, it must pass with all OKs in order to be accepted on CRAN. This is true for updates as well. Moreover, you are requested to pass these tests on the current stable release of R (e.g. R 3.3.2), on the previous stable release (R 3.2.5), and on a recent R devel version (R 3.4.0 devel). They also ask you to check on Windows (https://win-builder.r-project.org/). (Lots of developers these days use Travis CI (Linux + macOS) and AppVeyor CI (Windows) to run these tests, as well as the new rhub::check_for_cran() service.)

  4. When submitted to CRAN, the package / update typically appears online within 24 hours. However, before that, CRAN runs the tests on their server farm, which covers many different OSes (including Solaris), cf. https://cran.r-project.org/web/checks/check_results_future.html. If you pass these tests, your package / update goes live.

When you install an R package from source, you install the <pkg>_<ver>.tar.gz file, which is basically the same file that the developer uploaded (it has actually been annotated with a bit more information inside, e.g. MD5 sums and time stamps), but in principle it's the same build. You can install a package by either:

  • downloading the tar.gz file manually and running R CMD INSTALL <pkg>_<ver>.tar.gz, or
  • calling install.packages("<pkg>", type = "source").

If you want a specific version, you can pass the tar.gz URL as in install.packages("<url>", type = "source").

To run tests post-installation, you would run them on the downloaded tar.gz file, i.e. R CMD check --as-cran <pkg>_<ver>.tar.gz. Note, though, that CRAN itself has already run lots of these tests for you, just maybe not on your specific architecture.

I'm not aware of a way to install in parallel from the command line using R CMD INSTALL <pkg>_<ver>.tar.gz, but I also haven't looked for one. From R, you can use install.packages("<pkg>", type = "source", Ncpus = 4). If you don't specify Ncpus, it defaults to getOption("Ncpus", 1). If you want to call this from the command line and install from a URL, you can do:

Rscript -e "install.packages('<url>', type = 'source', Ncpus = 4)"

For more information on how R does parallel builds, see help("install.packages"). For instance, it mentions how R uses the dependency DAG to decide how to parallelize, and how it can also pass the -j flag to make.

FYI, R CMD <cmd> basically calls the shell script / executable <cmd> that is part of the R installation. The reason INSTALL is in upper case is that install would clash with /usr/bin/install (say).

@adamjstewart
Member Author

@HenrikBengtsson @tgamblin @glennpj @JavierCVilla Ok, I need an executive decision here. R comes with a few pre-installed packages:

$ ls /blues/gpfs/home/software/spack-0.10.0/opt/spack/linux-centos6-x86_64/gcc-6.1.0/r-3.3.2-puezz6voxkdfcnjbq7jxcmraojulsw72/rlib/R/library/
base     codetools  graphics    lattice  mgcv      rpart    stats4    translations
boot     compiler   grDevices   MASS     nlme      spatial  survival  utils
class    datasets   grid        Matrix   nnet      splines  tcltk
cluster  foreign    KernSmooth  methods  parallel  stats    tools

These packages are also available on CRAN, and some of them are in Spack. The problem is that since the packages are already present in the R installation, I am unable to activate any of them or any of the packages that depend on them due to conflicts. I can think of two options:

1. Remove these packages from Spack

Never depend on Spack-built versions of these packages. Instead of removing them completely, I might just leave them present but raise an error during install saying to never depend on this fake package. That will prevent this problem from creeping back.

Pros: Less building, less concretization
Cons: Can't pick a specific version for these packages, ugly fake packages

2. Keep them but ignore everything when symlinking

Users can still link to whatever version they want, but when they activate the package, they'll get the versions that come with R.

Pros: Can pick a specific version if you really want to, no fake packages
Cons: Non-deterministic behavior? If you activate a package, you will get the version from R, not the version from Spack. May be more trouble than it's worth

Until we make a decision on this, spack activate + R is a no-go.

@citibeth
Member

citibeth commented Feb 2, 2017 via email

@JavierCVilla
Contributor

I'd also say Option 1. Since these packages work as R extensions, I think there's no reason to provide them twice, as part of R core and as R packages.

Users can still link to whatever version they want, but when they activate the package, they'll get the versions that come with R.

In case a user needs a specific version of one of these packages, Spack should suggest using a different R version that may include it. Giving the option to specify a version and then activating a different one could end up being a mess for the user.

@trevorld

trevorld commented May 16, 2018

Can I run install.packages("rminer") and R will install all of the dependencies for me?

Technically, R will automatically (try to) install all R package dependencies on CRAN, but it does not automatically install the non-R system requirements listed in the SystemRequirements field of some packages' DESCRIPTION files. This sometimes causes install.packages to fail. For example, I remember install.packages("devtools") failing in the past on a vanilla version of Ubuntu Linux: I needed to manually install a libcurl dependency for one of its R package dependencies and then re-try by running install.packages("devtools") again. Sometimes I find I need to do this a few times, installing a different missing system dependency each time, before install.packages finally succeeds in installing all R package dependencies.

Many R packages don't have non-R system requirements and don't depend on any R packages with system requirements; in those cases install.packages should be fairly robust, so automation of this fairly large subset of CRAN packages should be straightforward. However, automation of R packages with SystemRequirements fields might need more care - the people trying to build Debian/Ubuntu packages of as much of CRAN as possible might have done some of the legwork for this.
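For anyone wanting to scope that out, here's a rough sketch of listing which CRAN packages declare SystemRequirements, using tools::CRAN_package_db() (available in recent R versions; requires internet access):

db <- tools::CRAN_package_db()
sysreq <- db[!is.na(db$SystemRequirements), c("Package", "SystemRequirements")]
nrow(sysreq)     # how many CRAN packages declare non-R requirements
head(sysreq)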

@adamjstewart
Member Author

Thanks @trevorld, this will be documented here.
