Package list for a robust Base R workspace for geoscience applications on MAAP #742

Closed
pahbs opened this issue Jun 23, 2023 · 20 comments
Labels
ADE Algorithm Development Environment Subsystem Enhancement New feature or request JPL JPL related issues

@pahbs

pahbs commented Jun 23, 2023

Is your feature request related to a problem? Please describe.
MAAP scientists need a robust Base R workspace that will limit the need for repeatedly installing many basic add-ons upon each workspace re-start. While we recognize the need to do this for a variety of special-case packages that are not commonly used, we should not do this for packages that are routinely used across almost all R notebooks.

Routine packages would include those that support, for example:

  • data frame filtering and manipulation,
  • AWS file access,
  • geospatial file access,
  • spatial analysis,
  • statistical analysis, and
  • map generation.

We can keep and update a list of package requests here:
https://docs.google.com/spreadsheets/d/1mrQ3gdcxZHZNTksUmLz6qqqNSNxhAoonB9znLU0c0pk/edit?usp=sharing

Describe the solution you'd like
We would like:

  1. a robust list of R packages that UWG R users routinely use
  2. this list installed, tested, and formalized as the Base R workspace

Describe alternatives you've considered
The alternative is for each user to add these packages manually every time they restart a workspace running R.

@wildintellect
Collaborator

Other notes:

  • Should this be in the base environment for the workspace, so users don't always have to switch environments?
  • R should be version 4.x or newer.
  • rgdal and the related sp packages are being retired in 2023, so we might want to skip them to push people toward newer methods. This raises the question of having a legacy R workspace for compatibility with existing code, or of using Packrat to generate a list of old versions for particular projects.
  • R workspaces need to support maap-py and python. #684
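
One way to act on the retirement note when assembling the package list is to filter the retired packages out before installing. The sketch below assumes that rgdal, rgeos, and maptools are the packages to skip (the note above names only rgdal and "related sp packages"); the variable `RETIRED_PACKAGES` and the function `drop_retired` are hypothetical names used for illustration.

```shell
# Assumption: the packages treated as retired; adjust to the team's decision.
RETIRED_PACKAGES="rgdal rgeos maptools"

# Read package names on stdin (one per line) and print those not retired.
drop_retired() {
    local pkg
    while IFS= read -r pkg; do
        [ -z "$pkg" ] && continue
        case " $RETIRED_PACKAGES " in
            *" $pkg "*) ;;                 # retired: skip it
            *) printf '%s\n' "$pkg" ;;     # still maintained: keep it
        esac
    done
}

# Example: printf '%s\n' raster rgdal sf terra | drop_retired
```

Keeping the retired names in one variable makes it straightforward to reuse the same list for a separate legacy workspace later.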

@wildintellect wildintellect pinned this issue Aug 16, 2023
@wildintellect
Collaborator

Rocker is often a good starting point for comparison: https://github.com/rocker-org/geospatial/blob/master/Dockerfile

@wildintellect
Collaborator

I checked and VEDA does in fact use Rocker. The full list to match that would be:

BiocManager
classInt
covr
DBI
deldir
dplyr
geoR
geosphere
gstat
hdf5r
knitr
lidR
lwgeom
mapdata
maps
maptools
mapview
microbenchmark
ncdf4
odbc
pool
proj4
protolite
RandomFields
raster
RColorBrewer
rgdal
rgeos
rlas
rmarkdown
RNetCDF
RPostgres
RPostgreSQL
RSQLite
sf
sfarrow
sp
spacetime
spatialreg
spatstat
spdep
stars
terra
testthat
tibble
tidync
tidyr
tmap
units

Along with the various binary libraries required:

    gdal-bin \
    lbzip2 \
    libfftw3-dev \
    libgdal-dev \
    libgeos-dev \
    libgsl0-dev \
    libgl1-mesa-dev \
    libglu1-mesa-dev \
    libhdf4-alt-dev \
    libhdf5-dev \
    libjq-dev \
    libpq-dev \
    libproj-dev \
    libprotobuf-dev \
    libnetcdf-dev \
    libsqlite3-dev \
    libssl-dev \
    libudunits2-dev \
    netcdf-bin \
    postgis \
    protobuf-compiler \
    sqlite3 \
    tk-dev \
    unixodbc-dev

@grallewellyn grallewellyn self-assigned this Dec 5, 2023
@grallewellyn
Collaborator

I have an image building successfully for this, but I have a couple of questions, @wildintellect:

  1. I was not able to find these packages on conda or pip: fgmutils, leaflet_proxy, nlraa, sfarrow, tools. cc @pahbs, who requested leaflet_proxy, nlraa, and tools.
  2. For every recommended package that has an r- version (for example, biocmanager becomes r-biocmanager), is it okay that I used the r- version even when a "normal" version exists? I already had to convert many packages to the r- version because no "normal" version existed (for example, there is no biocmanager package), so I wanted to be consistent.

@grallewellyn
Collaborator

Also, I cannot find the binary library libgsl0-dev, so I cannot pin its version. Do we want to skip it?

@wildintellect
Collaborator

@grallewellyn I'll look into these. Generally, in conda/conda-forge all R packages start with the r- prefix.
As with Python, some packages may not be available in conda-forge at all, in which case we need to check whether they are on CRAN (the main R package archive) or must be installed from GitHub directly (least desired).
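
As a rough illustration of the naming convention, the sketch below derives a candidate conda-forge name from a CRAN name by lower-casing it and adding the r- prefix. The function name `to_conda_name` is hypothetical, and the result is only a candidate: whether the package actually exists in conda-forge still has to be checked, for example with `conda search -c conda-forge <name>`.

```shell
# Convert a CRAN package name to the conda-forge naming convention:
# lower-case the name and add the "r-" prefix (e.g. devEMF -> r-devemf).
to_conda_name() {
    printf 'r-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

# Example: to_conda_name RSQLite
```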

@grallewellyn
Collaborator

I see Fgmutils, nlraa, and sfarrow on CRAN, but I don't see leaflet_proxy or tools.
What is the best way to install these packages from CRAN? Something like RUN R -e "install.packages('Fgmutils', dependencies=TRUE, repos='http://cran.rstudio.com/')"?

@wildintellect
Collaborator

  • Use repos='https://cloud.r-project.org/', which automatically redirects to the best mirror (see https://cran.r-project.org/mirrors.html).
  • Use dependencies=NA, as we don't want 'Suggests'.
  • I would usually use Rscript install_code.R, so that you can write a more sophisticated install script that runs in a single RUN step.
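
Putting the first two points together, a minimal sketch might look like the following. It only builds the R expression as text, so it can be reviewed before being passed to Rscript; the function name `cran_install_expr` and the variable `CRAN_REPO` are hypothetical.

```shell
CRAN_REPO='https://cloud.r-project.org/'

# Print an R expression that installs the given packages from CRAN with
# dependencies=NA (Depends, Imports, and LinkingTo, but not Suggests).
cran_install_expr() {
    local quoted="" pkg
    for pkg in "$@"; do
        # Build a comma-separated list of single-quoted package names.
        quoted="${quoted:+$quoted, }'$pkg'"
    done
    printf "install.packages(c(%s), repos='%s', dependencies=NA)\n" "$quoted" "$CRAN_REPO"
}

# In an image build this could be run as:
#   Rscript -e "$(cran_install_expr Fgmutils nlraa sfarrow)"
```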

@wildintellect
Collaborator

Have a look at how Rocker does it, this is our reference, as it's what VEDA deploys and very similar to our target goal. https://github.com/rocker-org/rocker-versioned2/blob/master/dockerfiles/geospatial_4.3.1.Dockerfile

@wildintellect
Collaborator

wildintellect commented Dec 20, 2023

From CRAN:

  • Fgmutils
  • sfarrow
  • nlraa

Maybe from Conda:

  • r-biocmanager (in Rocker this is specifically to install rhdf5). However, conda-forge has r-hdf5r (not sure it's the same package, but it's already on our list), and conda also has the bioconda repo: https://anaconda.org/bioconda/bioconductor-rhdf5 🤔 Maybe let's skip it this time around, until we untangle what users actually need to read HDF5 files aside from ncdf4, which is already included as part of terra.

A note about -dev libraries: they are only needed if a package that depends on them is not in conda-forge. They provide the C headers for compiling C/C++ packages from source.

@grallewellyn don't wait on a response from @pahbs, just go ahead with everything else for this release, and we'll work to clear this up for next release.

  • leaflet_proxy doesn't exist but leafletproxy does?
  • tools doesn't exist, did you mean a different package? there are several possibilities

@grallewellyn
Collaborator

@wildintellect Where is the API for install_code? I am not sure how to use this.

I am not sure what you mean by copying the way this link does it: https://github.com/rocker-org/rocker-versioned2/blob/master/dockerfiles/geospatial_4.3.1.Dockerfile. Why would we want a Dockerfile for every version?

Regarding r-biocmanager, I was just using it as an example. Nothing is wrong with it, and it seems fine to just install r-biocmanager, unless you want it removed?

The image builds fine without all the binary libraries. Are you saying to remove all of them, or is it fine to just remove libgsl0-dev?

I don't see leafletproxy on CRAN, conda, or pip. Are you referring to a GitHub repo?

@wildintellect
Collaborator

Oh, leafletProxy is a function in the leaflet package.

What I mean is: look at how that Dockerfile works. It calls https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_geospatial.sh, which then uses https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/bin/install2.r to actually do the install.

We can drop biocmanager if rhdf5 is provided another way.
For clarity and size, yes drop any binary/dev that isn't needed since conda removes the need to build.

@grallewellyn
Collaborator

@wildintellect So something like

install_cran_packages_r.sh contains

#!/bin/bash
set -e

# always set this for scripts but don't declare as ENV..
export DEBIAN_FRONTEND=noninteractive

## build ARGs
NCPUS=${NCPUS:--1}

# a function to install apt packages only if they are not installed
function apt_install() {
    if ! dpkg -s "$@" >/dev/null 2>&1; then
        if [ "$(find /var/lib/apt/lists/* | wc -l)" = "0" ]; then
            apt-get update
        fi
        apt-get install -y --no-install-recommends "$@"
    fi
}

install2.r --error --skipmissing --skipinstalled -n "$NCPUS" \
    Fgmutils \
    nlraa \
    sfarrow 

# Clean up
rm -rf /var/lib/apt/lists/*
rm -r /tmp/downloaded_packages

strip /usr/local/lib/R/site-library/*/libs/*.so

And https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/bin/install2.r stays the same, and then we call
RUN /rocker_scripts/install_cran_packages_r.sh?

@wildintellect
Collaborator

@grallewellyn that's the basic idea. You can probably drop the apt section, since we used conda to install the system libraries.
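
With the apt helper dropped, the script reduces to roughly the sketch below. It assumes install2.r (from the Rocker scripts) is already on the PATH in the image; the function name `install_cran_packages` and its `--dry-run` option are hypothetical additions for illustration, so the command can be inspected without installing anything.

```shell
# Number of parallel build processes; -1 is the default used in the script above.
NCPUS=${NCPUS:--1}
# CRAN-only packages that are not available from conda-forge.
CRAN_PACKAGES="Fgmutils nlraa sfarrow"

install_cran_packages() {
    if [ "${1:-}" = "--dry-run" ]; then
        # Print the command instead of running it.
        printf '%s\n' "install2.r --error --skipmissing --skipinstalled -n $NCPUS $CRAN_PACKAGES"
    else
        # Word splitting on $CRAN_PACKAGES is intentional: one argument per package.
        install2.r --error --skipmissing --skipinstalled -n "$NCPUS" $CRAN_PACKAGES
        # Clean up the download cache left behind by R.
        rm -rf /tmp/downloaded_packages
    fi
}

# Example: install_cran_packages --dry-run
```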

@grallewellyn
Collaborator

I have this working, but I am not able to install Fgmutils via the R script, and I am still looking into this.
How do we install maap-py in an R workspace again? This is the current behavior in ops. cc @anilnatha
[Screenshot 2023-12-27 at 2:10:17 PM]

@wildintellect
Collaborator

@grallewellyn

  • Are you installing to the base conda environment?
  • What kernel do you have active in that screen shot?
  • I'm looking into the Fgmutils, can you provide a build log error?

@wildintellect
Collaborator

wildintellect commented Jan 2, 2024

@grallewellyn for Fgmutils, I have verified that all its dependencies (which require compilation) are in conda-forge:

conda install -c conda-forge --solver=libmamba  r-rsqlite r-chron r-png r-devemf r-plyr r-sqldf r-proto r-gsubfn r-plogr

The following NEW packages will be INSTALLED:

  r-chron            conda-forge/linux-64::r-chron-2.3_61-r42h57805ef_1 
  r-devemf           pkgs/r/linux-64::r-devemf-4.0_2-r42h884c59f_0 
  r-gsubfn           conda-forge/noarch::r-gsubfn-0.7-r42hc72bb7e_1004 
  r-plogr            conda-forge/noarch::r-plogr-0.2.0-r42hc72bb7e_1005 
  r-plyr             conda-forge/linux-64::r-plyr-1.8.9-r42ha503ecb_0 
  r-png              conda-forge/linux-64::r-png-0.1_8-r42h81d01c5_1 
  r-proto            conda-forge/noarch::r-proto-1.0.0-r42ha770c72_2005 
  r-rsqlite          conda-forge/linux-64::r-rsqlite-2.3.4-r42ha503ecb_0 
  r-sqldf            conda-forge/noarch::r-sqldf-0.4_11-r42hc72bb7e_4 

So add those to the environment yaml for conda, and then the install.packages for Fgmutils should work.
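
To double-check that nothing was missed when editing the environment file, a small helper along these lines could list which of the required packages are not yet in it. This is only a sketch: the function name `missing_from_env` is hypothetical, and it assumes the file lists one dependency per line in the usual "- name" or "- name=version" form.

```shell
# Print each named package that does not appear as a dependency entry
# in the given conda environment file.
missing_from_env() {
    local env_file=$1 pkg
    shift
    for pkg in "$@"; do
        # Match "- pkg" followed by a version pin, whitespace, or end of line.
        if ! grep -Eq "^[[:space:]]*-[[:space:]]*${pkg}([=<>[:space:]]|\$)" "$env_file"; then
            printf '%s\n' "$pkg"
        fi
    done
}

# Example: missing_from_env environment.yml r-chron r-devemf r-sqldf
```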

@grallewellyn
Collaborator

@wildintellect

  1. Anil and I changed the workspaces so that the default conda environment isn't base but r for the r workspaces, pangeo for pangeo, etc. so that is what I was trying to install Fgmutils into
  2. That screenshot is from an R kernel in an R workspace
  3. Sorry, it wasn't giving me any build error besides:
also installing the dependencies ‘chron’, ‘sqldf’, ‘devEMF’


Warning message in install.packages("Fgmutils", repos = "http://cran.us.r-project.org/"):
“installation of package ‘chron’ had non-zero exit status”
Warning message in install.packages("Fgmutils", repos = "http://cran.us.r-project.org/"):
“installation of package ‘devEMF’ had non-zero exit status”
Warning message in install.packages("Fgmutils", repos = "http://cran.us.r-project.org/"):
“installation of package ‘sqldf’ had non-zero exit status”
Warning message in install.packages("Fgmutils", repos = "http://cran.us.r-project.org/"):
“installation of package ‘Fgmutils’ had non-zero exit status”
Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

But I was struggling to figure out why the dependencies weren't installing.

Thanks for the conda install -c conda-forge --solver=libmamba r-rsqlite r-chron r-png r-devemf r-plyr r-sqldf r-proto r-gsubfn r-plogr command, that worked!! I didn't realize I could just install those packages via conda and that doing so would satisfy the dependencies for the CRAN package Fgmutils, since install.packages("chron") was failing with no error message.
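
Since the warnings above are the only signal of failure, one hedged way to surface them in a build is to scan the captured install output for the "non-zero exit status" lines and list the affected packages. The function name `failed_packages` is hypothetical, and the sketch assumes the output has been saved to a file or piped in.

```shell
# Read R install output on stdin and print the name of each package whose
# installation reported a non-zero exit status (first occurrence only).
failed_packages() {
    sed -n 's/.*installation of package \(.*\) had non-zero exit status.*/\1/p' |
        tr -cd 'A-Za-z0-9.\n' |   # strip the quotes around the package name
        awk '!seen[$0]++'          # de-duplicate while keeping log order
}

# Example: failed_packages < install.log
```

Each name printed this way is a candidate to look up in conda-forge under its r- prefixed name before retrying the CRAN install.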

@wildintellect
Collaborator

wildintellect commented Jan 2, 2024

maap-py won't work in an R kernel, since it needs Python. Try running the Python kernel in the r conda env. The real test will be to use the R reticulate package to call the Python code from R. If you have an image with it all installed on DIT, I can help test this out.

@grallewellyn
Collaborator

maap-py does work with a Python kernel.
I have an image on DIT, 'mas.dit.maap-project.org/root/maap-workspaces/jupyterlab3/r:7edbac3', which is my most current R image with Fgmutils working, and it has r-reticulate installed.
This works for me in an R notebook:

library(reticulate)
maap <- import("maap.maap")
maap_obj <- maap$MAAP(maap_host='api.dit.maap-project.org')

result <- maap_obj$submitJob()

print(result)

@grallewellyn grallewellyn added this to the 3.1.4 milestone Jan 10, 2024
@anilnatha anilnatha added Enhancement New feature or request JPL JPL related issues ADE Algorithm Development Environment Subsystem labels Jan 22, 2024
@wildintellect wildintellect unpinned this issue Feb 5, 2024