
WISH/BUG: Respect CPU resource limitations set by Linux CGroups to avoid CPU overuse and slowdown #5620

HenrikBengtsson opened this issue Mar 23, 2023 · 2 comments
HenrikBengtsson commented Mar 23, 2023

Issue

data.table::getDTthreads() is not aware of Linux CGroups settings. If CGroups limits the number of CPU cores, then data.table will overuse the CPU resources that are available to the R process.

For example, the 'Free' Posit Cloud plan gives you a single CPU core to play with. They use CGroups v1 to limit the CPU resource. Running the following from within their RStudio server reveals this:

> total <- as.integer(readLines("/sys/fs/cgroup/cpu/cpu.cfs_period_us"))
> total
[1] 100000
> quota <- as.integer(readLines("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"))
> quota
[1] 100000
> cores <- quota / total
> cores
[1] 1

A user on the 'Premium' plan has 4 CPUs to play with, so they would get quota = 400000 and cores = 4 above.
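
For reference, here is a minimal sketch of how such a quota could be detected from R, assuming the CGroups v1 files shown above and the CGroups v2 'cpu.max' layout; the helper name cgroups_cpu_quota() is made up for illustration:

cgroups_cpu_quota <- function() {
  ## CGroups v1: quota and period live in separate files; quota == -1 means "no limit"
  quota_v1  <- "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
  period_v1 <- "/sys/fs/cgroup/cpu/cpu.cfs_period_us"
  if (file.exists(quota_v1) && file.exists(period_v1)) {
    quota  <- as.numeric(readLines(quota_v1))
    period <- as.numeric(readLines(period_v1))
    if (quota > 0) return(quota / period)
  }
  ## CGroups v2: a single 'cpu.max' file with "<quota> <period>" or "max <period>"
  max_v2 <- "/sys/fs/cgroup/cpu.max"
  if (file.exists(max_v2)) {
    parts <- strsplit(readLines(max_v2), " ", fixed = TRUE)[[1]]
    if (parts[1] != "max") return(as.numeric(parts[1]) / as.numeric(parts[2]))
  }
  NA_real_  ## no quota detected
}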

The defaults of data.table do not pick this up:

> data.table::getDTthreads(verbose = TRUE)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            16
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          16
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 8 threads with throttle==1024. See ?setDTthreads.
[1] 8

This means multi-threaded data.table tasks will overuse the CPU resources by 800%, which results in lots of overhead from context switching (unless there are other low-level mechanisms in data.table that detect this). CPU overuse will slow down performance.

The overuse problem becomes worse the more CPU cores the host has. For example, the Posit Cloud instances currently run with 16 vCPUs, but if they upgrade to, say, 64 vCPUs, the overuse will be 3200%. In research HPC environments, it's now common to see 192 CPUs, and I'd expect this number to grow over time.
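
To make the arithmetic explicit, here is roughly how the default thread count and the resulting overuse factor come about (values taken from the verbose output above; the 50% rule is the default of R_DATATABLE_NUM_PROCS_PERCENT shown there):

> procs <- 16L                          # omp_get_num_procs() on this host
> threads <- as.integer(procs * 0.50)   # 50% default => 8 threads
> quota <- 1                            # CPU cores allowed by the CGroups quota ('Free' plan)
> threads / quota                       # overuse factor, i.e. 800%
[1] 8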

FWIW, parallelly::availableCores() also queries CGroups v1 and CGroups v2, e.g.

> parallelly:::availableCores()
cgroups.cpuquota 
               1 

> parallelly:::availableCores(which = "all")
          system   cgroups.cpuset cgroups.cpuquota            nproc 
              16               16                1               16 
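
As a user-side workaround, the two can be combined early in a session or in a startup file; this is just an illustration of the idea, and the reported value assumes the single-core quota above:

> data.table::setDTthreads(parallelly::availableCores())
> data.table::getDTthreads()
[1] 1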

Session info

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_4.2.3 tools_4.2.3   
tdhock commented Mar 27, 2023

Similar to #5573 about using data.table on a Slurm cluster.
Currently we assume this kind of configuration should be handled by the user. For example, the user can set the R_DATATABLE_NUM_THREADS environment variable.
In terms of dev/maintenance time, how many types of environment variables like this should we support (SLURM, CGroups, ...)? How would we test each of them? Given constraints on dev time, I would argue that it would be better to keep asking users to handle this.
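
For completeness, a sketch of that user-side configuration (the value 1 merely assumes the single-core CGroups quota from the report above):

## e.g. in .Renviron, or from R before data.table is loaded:
Sys.setenv(R_DATATABLE_NUM_THREADS = "1")
library(data.table)
getDTthreads()  ## should now report 1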

tdhock added the omp label on Mar 27, 2023
HenrikBengtsson (Author) commented:

"it would be better to keep asking users to handle this"

Given that data.table is such a central infrastructure package used internally by many packages and pipelines, I wonder how many users even know they are using data.table, let alone know that they need to configure the number of threads it should use.

For the problem reported here, CGroups throttling, I believe there are lots of data.table instances out there running slower than a single-threaded version would, without anyone even noticing the problem. Only a savvy user would know that this could be a problem and that it should be fixed.
