Conda as package manager for R in rstudio-server as replacement for packrat #9423

Open
curious-odd-man opened this issue May 22, 2018 · 6 comments

@curious-odd-man curious-odd-man commented May 22, 2018

As suggested by @mingwandroid, I'm moving this post here:

We are currently working on combining rstudio-server with conda, and we also plan to use conda for Python package management (one environment.yml per Data Science application). There are many questions and answers about this topic scattered around the internet, but we could not find a complete description of how to do it or how to troubleshoot the problems that might come up along the way.

We would like to share our experience, hear opinions and criticism of our approach, and start a discussion on the topic. We would also like to encourage everyone to collect references to other material on the topic, so that we can eventually create a complete guide for anyone who wants to do the same.

Environment and overview

For each user in our system we have an independent container with rstudio-server. Data Scientists use both R and Python for development. For R we’ve used packrat as the package manager, but the following concerns arose:

  1. Packrat compiles each package the first time it is installed, which means that recreating the environment for a project usually takes a long time. Creating a clean testing environment is time consuming as well.
  2. Packrat is usable only with R. Python developers can’t use it, which means that different projects use different package managers.
  3. Packrat does not handle package dependencies on system libraries (e.g. the RODBC package depends on unixodbc).
  4. Packrat cannot handle binary packages that are not available on CRAN (such as catboost).

Conda is a language-agnostic package and environment manager that supports a number of R packages, it is gaining influence in the Python world, and it has a constantly growing community of contributors who add new packages. For these reasons we decided to try it out. The solution we’ve thought about looks like this:

[Image: diagram of the proposed solution (2018-05-18 13_32_49-conda graphml - yed)]
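
As a rough illustration of the "one environment.yml per Data Science application" idea mentioned above, here is a minimal sketch that writes such a file from the shell. The environment name `my-ds-app`, the helper `write_environment_yml` and the package list are assumptions made for the example, not the setup actually used in this post. `unixodbc` is listed next to `r-rodbc` only to show that conda can carry a system-library dependency in the same file.

```shell
#!/usr/bin/env bash
# Write an illustrative environment.yml for one application.
# The name and the package list are assumptions, not the authors' real environment.
write_environment_yml() {
    local target="$1"
    cat > "$target" <<'EOF'
name: my-ds-app
channels:
  - conda-forge
dependencies:
  - r-base
  - r-rodbc
  - unixodbc
  - python
EOF
}

write_environment_yml environment.yml

# The environment could then be created with:
#   conda env create -f environment.yml
```

Keeping R, Python and system libraries in one file is what lets a single tool recreate the whole project environment.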

We formulated the following questions:

  1. How to combine conda and the containerized rstudio-server?
  2. How to handle packages that are not available in conda?
  3. Is it possible to combine conda with install.packages() or devtools::install_github() for missing packages, at least to try out packages?
  4. Is it possible to combine conda with packrat?
  5. How to use conda with spark (sparklyr, pyspark)?
  6. Is it feasible to use conda?

Below are the answers we have so far:

1. How to combine conda and the containerized rstudio-server?


This is quite simple:

First, put the line .libPaths(paste0(R.home(), "/library")) inside the ~/.Rprofile file. This excludes every library directory except the one located in the R home directory. This way, whenever the R home directory is switched, the R library directory changes with it.
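
A minimal sketch of this first step as a shell helper, assuming the goal is to add the line only once even if the setup is run repeatedly. The function name `ensure_libpaths_line` is made up for this example.

```shell
#!/usr/bin/env bash
# Append the .libPaths() line to an .Rprofile unless it is already there.
# ensure_libpaths_line is a hypothetical helper name.
ensure_libpaths_line() {
    local rprofile="$1"
    local line='.libPaths(paste0(R.home(), "/library"))'
    # Create the file if it does not exist yet
    touch "$rprofile"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq "$line" "$rprofile"; then
        printf '%s\n' "$line" >> "$rprofile"
    fi
}

# Typical call (commented out so that sourcing this file has no side effects):
# ensure_libpaths_line "$HOME/.Rprofile"
```

The sketch assumes that an existing .Rprofile ends with a newline; otherwise the new line would be glued to the last existing one.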

Second, change the line rsession-which-r=path/to/r/home in the /etc/rstudio/rserver.conf file so that it points to the R executable within the conda environment.

We’ve developed the following script to do it automatically:

# First argument is the name of the conda environment that needs to be activated
echo "Target environment is: $1"

# Find the path to the environment (second column of the command output)
new_env_path=$(conda info --envs | grep -E "^$1\s+" | awk '{print $2}')

# If the environment is currently active, conda prints "*" in the second
# column, so the path has to be read from the third column
if [ "$new_env_path" = "*" ]; then
    echo "Env is active!"
    new_env_path=$(conda info --envs | grep -E "^$1\s+" | awk '{print $3}')
else
    echo "Env is not active"
fi

# Stop before touching the configuration if the environment was not found
if [ -z "$new_env_path" ]; then
    echo "Environment '$1' not found" >&2
    exit 1
fi

echo "New path = $new_env_path"

# sudo is required to edit the configuration file
sudo sed -i "s|rsession-which-r=.*|rsession-which-r=$new_env_path/bin/R|" /etc/rstudio/rserver.conf

# ... and to restart the service
sudo rstudio-server restart

This will successfully switch the R interpreter and libraries to a selected conda environment.
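
The path lookup in the script above can also be sketched as a small function that reads the last column of the matching row, so the active and inactive cases need no separate handling. This is only an illustration: the function name `env_path_from_listing` is made up, and it assumes that environment paths contain no spaces.

```shell
#!/usr/bin/env bash
# Read a `conda info --envs` style listing on stdin and print the path of the
# named environment. The path is taken from the last column, so the "*" marker
# of the active environment does not matter.
# env_path_from_listing is a hypothetical helper name.
env_path_from_listing() {
    local env_name="$1"
    # Exact comparison on the first column; header lines start with "#"
    awk -v name="$env_name" '$1 == name { print $NF; exit }'
}

# Illustrative usage (requires conda on PATH):
# new_env_path=$(conda info --envs | env_path_from_listing "$1")
# [ -x "$new_env_path/bin/R" ] || { echo "No R found in $new_env_path" >&2; exit 1; }
```

Comparing the first column exactly, rather than with a regular expression, also avoids surprises with environment names that contain characters such as a dot.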

2. How to handle packages that are not available in conda?


Here are the 2 solutions that we’ve tried:

Conda skeleton – for packages that are available on CRAN or in some other package managers. It automatically creates a recipe for the package, conda build then compiles it, and the result can be uploaded (even automatically) to your own channel on anaconda.org. Using conda skeleton is easy:

conda skeleton cran <package_name>
conda build <package>

The guide on how to use it is here.

The error we got while doing this was:

Undefined Jinja2 variables remain (['cran_mirror', 'cran_mirror']). Please enable source downloading and try again.

This issue can be resolved by creating a file with the address of the CRAN mirror:

echo "cran_mirror: https://cloud.r-project.org/" > cfg.yam

Then pass this file to the build command with the -m argument:

conda build <package> -m cfg.yam

For packages that are not available on CRAN, it’s possible to use conda-forge and contribute to the community by creating recipes for the missing packages. The process is well documented and available here.

3. Is it possible to combine conda with install.packages() or devtools::install_github() for missing packages at least to tryout packages?


It seems that it should be possible for pure R packages, but there are currently 2 errors with packages that need to be compiled, as the conda build environment is not used automatically:

  1. x86_64-conda_cos6-linux-gnu-cc not found – when trying to compile a package. There is a workaround for it: using Sys.setenv(), prepend the path to the conda environment bin folder to the PATH variable:
    Sys.setenv(PATH=paste0("path/to/conda/env/bin:", Sys.getenv("PATH")))
  2. The <somlibrary> library that is required to build <someotherlibrary> was not found. – There is currently no known fix for this, but I assume that there should be a complete solution that fixes both this and the previous error.
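
The workaround for the first error can also be applied from the shell, before R is started. The sketch below is an assumption-laden illustration rather than a tested fix: `prepend_to_path` is a made-up helper, and `/path/to/conda/env/bin` is a placeholder for the real environment path.

```shell
#!/usr/bin/env bash
# Print a PATH value that contains the given directory, adding it at the front
# only when it is not already one of the entries.
# prepend_to_path is a hypothetical helper name.
prepend_to_path() {
    local dir="$1" current="$2"
    case ":$current:" in
        *":$dir:"*)
            # Already present: leave PATH unchanged
            printf '%s\n' "$current"
            ;;
        *)
            # Not present: put it first (no trailing ":" when PATH was empty)
            printf '%s%s\n' "$dir" "${current:+:$current}"
            ;;
    esac
}

# Illustrative usage: start R with the conda compilers visible
# PATH=$(prepend_to_path "/path/to/conda/env/bin" "$PATH") R
```

This only addresses the missing compiler. It does not solve the second error about libraries that are not found.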

4. Is it possible to combine conda with packrat?


It does not look like it, as packrat does not recognize packages already installed via conda, and it faces the same build issues as in question 3.

5. How to use conda with spark (sparklyr, pyspark)?


This question is about how to bring the conda environment to the executors. A similar question has already been discussed here, so that’s where I’ll start my investigation.

6. Is it feasible to use conda?


It seems so. More than 90-95% of the required R packages are available via conda-forge anyway. The rest can be handled locally via conda skeleton cran and conda build, optionally followed by an upload to our own anaconda channel or by contributing to conda-forge. Conda + packrat does not seem to work, and may not be required either. devtools::install_github() would be nice to have, but how can we get it running if the code needs to be compiled?
Is there any other solution for handling additional packages on top of conda, such as a shell or R setup script?

@MichaelPeibo MichaelPeibo commented Dec 10, 2018

Hi,
I have an issue using RStudio Server via Anaconda. In short, I get a configuration error about rsession-which-r. I did not install Anaconda in the root account, so the R version in the root account is quite old.
Instead, I installed it in my own account.

Are there any suggestions on how to fix this error? I would like to use RStudio Server remotely.

Thanks!

@qamaraden qamaraden commented Aug 16, 2019

Would this work for a server not connected to the internet?

@grst grst commented Jun 13, 2021

Linking the results from my experiments here: https://github.com/grst/rstudio-server-conda

Using install.packages or install_github works fine. For that, I needed to wrap the rsession when running a local rstudio server instance, or to run the entire rserver process in the conda env when using the containerized version.

I have been using this approach in production for a while now and have not encountered any major issues.
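
For readers who want a picture of what "wrapping the rsession" could look like, here is a minimal sketch. It is not taken from the linked repository. The helper name `make_rsession_wrapper` and all three paths in the usage line are assumptions, and how rserver is pointed at the wrapper (for example through an rsession-path setting) should be checked against the RStudio Server documentation.

```shell
#!/usr/bin/env bash
# Print a wrapper script that activates a conda environment and then hands
# control over to the real rsession binary.
# make_rsession_wrapper is a hypothetical helper; all paths are placeholders.
make_rsession_wrapper() {
    local conda_sh="$1" env_name="$2" rsession_bin="$3"
    cat <<EOF
#!/usr/bin/env bash
# Load conda's shell functions, activate the environment, then start rsession
source "$conda_sh"
conda activate "$env_name"
exec "$rsession_bin" "\$@"
EOF
}

# Illustrative usage:
# make_rsession_wrapper /opt/conda/etc/profile.d/conda.sh myenv \
#     /usr/lib/rstudio-server/bin/rsession > rsession.sh && chmod +x rsession.sh
```

Because the environment is activated before rsession starts, the R session inherits the environment's PATH and related variables, which is one plausible reason why compiling packages works in this setup.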

@h-vetinari h-vetinari commented Jun 13, 2021

@grst - cool! :)

Not sure if you're up for it, but in case you're interested - there's also an attempt to build rstudio server directly for conda-forge: conda-forge/staged-recipes#13760

If you want to contribute, you'd be more than welcome - currently it's a bit stalled.

@grst grst commented Jun 14, 2021

Hi @h-vetinari,

That looks interesting! Do I get it right that this approach would require rstudio server to be installed within the same conda env that is used for the analysis?

@h-vetinari h-vetinari commented Jun 14, 2021

Do I get it right that this approach would require rstudio server to be installed within the same conda env that is used for the analysis?

Not necessarily - conda supports stacking environments - but generally, the idea would be that everything (including rstudio-server) could be installed by conda and work "out of the box".
