
[8.100] Thread-safety of the read access to elements of sparse matrix #179

Closed
dselivanov opened this issue Oct 12, 2017 · 20 comments

@dselivanov
I have this chunk of code where I read elements of an arma::sp_mat sparse matrix from many threads. With the Armadillo 7.* series it worked fine; with the latest 8.100 it crashes with a weird backtrace. Any thoughts?

*** longjmp causes uninitialized stack frame ***: /usr/lib64/R/bin/exec/R terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fb6a6844d87]
/lib64/libc.so.6(+0x10fcad)[0x7fb6a6844cad]
/lib64/libc.so.6(__longjmp_chk+0x29)[0x7fb6a6844c09]
/usr/lib64/R/lib/libR.so(+0x2111c6)[0x7fb6a961b1c6]
/lib64/libc.so.6(+0x35270)[0x7fb6a676a270]
/lib64/libpthread.so.0(+0xe42b)[0x7fb6a6b0642b]
/lib64/libpthread.so.0(+0x9dcb)[0x7fb6a6b01dcb]
/lib64/libpthread.so.0(pthread_mutex_lock+0x68)[0x7fb6a6b01c98]
/usr/lib64/libjemalloc.so.1(+0x267c2)[0x7fb6a9a587c2]
/usr/lib64/libjemalloc.so.1(free+0x350)[0x7fb6a9a36640]
/home/dselivanov/R/x86_64-redhat-linux-gnu-library/3.4/reco/libs/reco.so(_ZNSt8_Rb_treeIySt4pairIKydESt10_Select1stIS2_ESt4lessIyESaIS2_EE8_M_eraseEPSt13_Rb_tree_nodeIS2_E+0x13a)[0x7fb68b7242aa]
/home/dselivanov/R/x86_64-redhat-linux-gnu-library/3.4/reco/libs/reco.so(_ZN4arma6MapMatIdEaSERKNS_5SpMatIdEE+0x10f)[0x7fb68b73afdf]
/home/dselivanov/R/x86_64-redhat-linux-gnu-library/3.4/reco/libs/reco.so(+0x306fb)[0x7fb68b7396fb]
/home/dselivanov/R/x86_64-redhat-linux-gnu-library/3.4/reco/libs/reco.so(_Z13dotprod_top_kRKN4arma3MatIdEES3_jjRN4Rcpp8NullableIKNS_5SpMatIdEEEE+0x4f3)[0x7fb68b73a5d3]
/home/dselivanov/R/x86_64-redhat-linux-gnu-library/3.4/reco/libs/reco.so(_reco_dotprod_top_k+0x2b5)[0x7fb68b71e095]
/usr/lib64/R/lib/libR.so(+0xdcedd)[0x7fb6a94e6edd]
/usr/lib64/R/lib/libR.so(+0x1106bd)[0x7fb6a951a6bd]
/usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fb6a952a218]
/usr/lib64/R/lib/libR.so(+0x12228f)[0x7fb6a952c28f]
/usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fb6a952a3d4]
/usr/lib64/R/lib/libR.so(+0x12409e)[0x7fb6a952e09e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x585)[0x7fb6a952a605]
/usr/lib64/R/lib/libR.so(Rf_ReplIteration+0x232)[0x7fb6a9553e42]
/usr/lib64/R/lib/libR.so(+0x14a221)[0x7fb6a9554221]
/usr/lib64/R/lib/libR.so(run_Rmainloop+0x4f)[0x7fb6a95542df]
/usr/lib64/R/bin/exec/R(main+0x1b)[0x40080b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb6a6756c05]

@dselivanov
Author

Thanks, tested, problem gone.

@eddelbuettel
Member

Good report, and a helpful and very timely answer.

Any idea how we could better document the need for const on our side? Time to start a FAQ vignette for RcppArmadillo, just as we do for Rcpp?
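For a future FAQ entry, a minimal sketch of the const point (an editorial illustration, not code from this thread): in Armadillo 8.x, non-const element access on sp_mat can update internal cache state -- the MapMat assignment visible in the backtrace above -- which is not safe to do from several threads, whereas access through a const reference selects a read-only overload.

#include <armadillo>

// const overload: intended as a pure read, avoids the writable proxy
// (names here are illustrative)
double read_elem(const arma::sp_mat& x, arma::uword i, arma::uword j) {
  return x(i, j);
}

// non-const overload: may mutate internal bookkeeping, so concurrent
// calls can crash with a backtrace like the one reported above
double risky_read(arma::sp_mat& x, arma::uword i, arma::uword j) {
  return x(i, j);
}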

@coatless
Contributor

'Tis always a good time to start an FAQ ;-)

@eddelbuettel
Member

Point 1: Don't buy a mac.

Point 2: Seriously, just don't.

Point 3: Reference to

@dselivanov
Author

Unfortunately I'm going to reopen this ticket because of two problems with random read access to sparse matrix elements (not sure what happened since last time, but I remember everything worked fine!):

  1. Significant single-thread performance degradation (0.7.960 -> 0.8.100), 10x and more
  2. Crash of the R session when multiple threads perform concurrent reads (0.8.100)

Minimal reproducible example (on a gist, and reproduced here):

#include <RcppArmadillo.h>
#include <queue>
#include <iostream>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

#define GRAIN_SIZE 10

using namespace Rcpp;
using namespace RcppArmadillo;
using namespace arma;

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
double test_spmat(const arma::sp_mat &x, IntegerVector I, IntegerVector J, int n_threads) {
  int *i_ptr = I.begin();
  int *j_ptr = J.begin();

  double sum = 0;
  #ifdef _OPENMP
  #pragma omp parallel for num_threads(n_threads) schedule(dynamic, GRAIN_SIZE) reduction(+:sum)
  #endif
  for(int k = 0; k < J.size(); k++) {
    // adjust to 0-based indexes
    int i = i_ptr[k] - 1;
    int j = j_ptr[k] - 1;
    sum += x(i, j);
  }
  return sum;
}
And the R script driving it:

library(Rcpp)
library(Matrix)

n = 100000
m = 10000
nnz = 0.001 * n * m
set.seed(1)

x = sparseMatrix(i = sample(n, nnz, T), j = sample(m, nnz, T), x = 1, dims = c(n, m))
i = sample(n, nnz * 10, T)
j = sample(m, nnz * 10, T) 

install.packages("~/Downloads/RcppArmadillo_0.7.960.1.2.tar.gz", repos = NULL, type = "source")
sourceCpp("~/Downloads/tst-arma.cpp", rebuild = T)

system.time(temp <- test_spmat(x, i, j, 1))
# user  system elapsed 
# 0.568   0.003   0.572
temp
# 9830
system.time(temp <- test_spmat(x, i, j, 4))
# user  system elapsed 
# 0.636   0.004   0.164 
temp
# 9830

install.packages("~/Downloads/RcppArmadillo_0.8.100.1.0.tar.gz", repos = NULL, type = "source")
sourceCpp("~/Downloads/tst-arma.cpp", rebuild = T)

system.time(temp <- test_spmat(x, i, j, 1))
# user  system elapsed 
# 6.199   0.037   6.253 
temp
# 9830

# this one crashes the R session
system.time(temp <- test_spmat(x, i, j, 4))

@dselivanov dselivanov reopened this Nov 8, 2017
@eddelbuettel
Member

Can you force a deep copy via clone() or similar into a new, non-R-allocated variable, and then try again?

That would essentially be the lesson from RcppParallel -- any R-owned memory object may get gc()-ed.
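A hedged sketch of the suggested experiment (function name is illustrative): copy-construct the matrix so the threads work on Armadillo-owned storage rather than memory that R's gc() could reclaim.

#include <RcppArmadillo.h>

// [[Rcpp::export]]
double read_from_copy(const arma::sp_mat& x, arma::uword i, arma::uword j) {
  const arma::sp_mat x2(x);   // deep copy into non-R-allocated storage
  return x2(i, j);            // subsequent reads touch only the copy
}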

@dselivanov
Author

dselivanov commented Nov 8, 2017

Same crash with const arma::sp_mat x2 = arma::sp_mat(x). And this doesn't explain the slow-down... so my guess is that this is not related to R's gc(). I have a feeling that something goes wrong inside Armadillo/RcppArmadillo.

PS I realize that in general element-by-element access to a sparse matrix should be avoided, but in my case, according to benchmarks, it wasn't the bottleneck (initially I planned to convert it to a hash map of triplets; see the sketch below).
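A hedged sketch of that triplet idea (hypothetical helper names, not code from this thread): extract the nonzeros once into a hash map keyed on (row, col) packed into a 64-bit integer, so random reads never go through sp_mat's element accessor.

#include <armadillo>
#include <unordered_map>
#include <cstdint>

// build the map once, single-threaded
std::unordered_map<std::uint64_t, double> to_triplet_map(const arma::sp_mat& x) {
  std::unordered_map<std::uint64_t, double> m;
  m.reserve(x.n_nonzero);
  for (arma::sp_mat::const_iterator it = x.begin(); it != x.end(); ++it) {
    // pack (row, col) into one key; fine while both indices fit in 32 bits
    std::uint64_t key = (std::uint64_t(it.row()) << 32) | std::uint64_t(it.col());
    m[key] = *it;
  }
  return m;
}

// read-only lookups are then safe from many threads; absent entries read as zero
inline double get(const std::unordered_map<std::uint64_t, double>& m,
                  arma::uword i, arma::uword j) {
  auto it = m.find((std::uint64_t(i) << 32) | std::uint64_t(j));
  return it == m.end() ? 0.0 : it->second;
}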

@eddelbuettel
Member

I can't help you here. There is nothing, as far as I can see, that the package does to get in the way.

"If it doesn't work, it doesn't work." Use an older (Rcpp)Armadillo or do something else. Multithreading and R require a lot of care.

I suggest we close this, and I would propose you work out whether a plain C++ example (no R) also crashes. In that case you need to talk to Conrad.

@dselivanov
Author

dselivanov commented Nov 8, 2017 via email

@eddelbuettel
Member

I was also thinking that... maybe the fact that you use threading, and that Conrad switched to more OpenMP use, means the two can get in each other's way?
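One way to test that hypothesis, sketched under the assumption that the two OpenMP layers interfere: Armadillo provides the config macro ARMA_DONT_USE_OPENMP to switch off its internal OpenMP paths, leaving only the user-level parallel loop.

// define before the first include so Armadillo's own OpenMP is disabled;
// if the crash then disappears, the two OpenMP layers were likely clashing
#define ARMA_DONT_USE_OPENMP
#include <RcppArmadillo.h>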

@dselivanov
Author

dselivanov commented Nov 8, 2017 via email

@kevinushey
Contributor

kevinushey commented Nov 8, 2017

FWIW it all runs fine on my macOS machine with clang-5.0 + sanitizers. I do have OpenMP disabled, though, and I'm not using or linking to the R-LLVM toolchain.

> system.time(temp <- test_spmat(x, i, j, 4))
   user  system elapsed 
  2.043   0.017   2.072 
> sessionInfo()
R Under development (unstable) (2017-10-12 r73548)
Platform: x86_64-apple-darwin17.0.0 (64-bit)
Running under: macOS High Sierra 10.13.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/lapack/3.7.1/lib/liblapack.3.7.1.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Matrix_1.2-11 Rcpp_0.12.13.2 testthat_1.0.2.9000 rmarkdown_1.6 knitr_1.17 roxygen2_6.0.1 devtools_1.13.3

loaded via a namespace (and not attached):
[1] xml2_1.1.1 magrittr_1.5 lattice_0.20-35 R6_2.2.2 RcppArmadillo_0.8.200.1.1
[6] rlang_0.1.2 stringr_1.2.0 tools_3.5.0 grid_3.5.0 withr_2.0.0
[11] htmltools_0.3.6 commonmark_1.4 yaml_2.1.14 rprojroot_1.2 digest_0.6.12
[16] crayon_1.3.4 memoise_1.1.0 evaluate_0.10.1 stringi_1.1.5 compiler_3.5.0
[21] backports_1.1.1

@dselivanov
Author

I think the problem is in Armadillo 0.8.100, because the code works fine with the latest 0.8.200.
Here is pure C++ code which works fine with Armadillo 0.8.200 and produces a segmentation fault with 0.8.100:

#include "include/armadillo"
#include <stdio.h>
#define GRAIN_SIZE 10

int main() {
  int n_threads = 4;
  int n = 1000000;
  int m = 10000;
  double nnz_prop = 0.001;
  int nnz = n * m * nnz_prop;
  const arma::sp_mat x = arma::sprandu(n, m, nnz_prop);
  arma::ivec i = arma::randi<arma::ivec>(nnz, arma::distr_param(0, n - 1));
  arma::ivec j = arma::randi<arma::ivec>(nnz, arma::distr_param(0, m - 1));

  double sum = 0;
  #pragma omp parallel for num_threads(n_threads) schedule(dynamic, GRAIN_SIZE) reduction(+:sum)
  for(int k = 0; k < nnz; k++) {
    sum += x.at(i[k], j[k]);
  }
  printf("%f\n", sum);
  return(0);
}

But it is still roughly 5-10x slower than 0.7.960.
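For completeness, a hedged workaround sketch (editorial, not from this thread): read the public CSC arrays col_ptrs, row_indices and values directly, which touches only immutable storage and bypasses whatever per-access bookkeeping changed between 7.x and 8.x. Assumes no thread modifies the matrix while others read.

#include <armadillo>
#include <algorithm>

inline double csc_at(const arma::sp_mat& x, arma::uword i, arma::uword j) {
  const arma::uword* begin = x.row_indices + x.col_ptrs[j];
  const arma::uword* end   = x.row_indices + x.col_ptrs[j + 1];
  // row indices within each column are stored sorted, so binary search works
  const arma::uword* it = std::lower_bound(begin, end, i);
  return (it != end && *it == i) ? x.values[it - x.row_indices] : 0.0;
}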

FYI @conradsnicta

@eddelbuettel
Member

That can happen. I'll get to 0.8.200.* when I have a moment.

@eddelbuettel
Member

@dselivanov I just pushed a new branch with 0.8.200.2.0 -- untested as of now -- but with small changes. I may have time to put it through the test harness tomorrow and then merge to master.

Feel free to experiment in the interim.

@dselivanov
Author

@eddelbuettel thank you, I can confirm that the code works fine with the 0.8.200.2.0 branch, so I think the part of this issue related to RcppArmadillo can be closed.

@conradsnicta I checked two different code chunks:

From the last pure C++ example:

On my system (OS X), single thread:
g++-7:

# 8.200.2 (43c38c2e99952aaa99fba2c27fa0e06795de1f7c)
real	0m7.398s
user	0m7.276s
sys	0m0.114s

# 7.960 (057c1e84a64476f2cb9388733e0a4972c904614e)
real	0m7.206s
user	0m7.087s
sys	0m0.114s

clang 4:

# 8.200.2 (43c38c2e99952aaa99fba2c27fa0e06795de1f7c)
real	0m9.685s
user	0m9.383s
sys	0m0.295s

# 7.960 (057c1e84a64476f2cb9388733e0a4972c904614e)
real	0m5.773s
user	0m5.647s
sys	0m0.118s

From my initial example:

clang 4:

# 8.200.2 (43c38c2e99952aaa99fba2c27fa0e06795de1f7c)
1 CORE
user  system elapsed 
5.838   0.050   5.892 

4 CORE
user  system elapsed 
6.223   0.052   5.573

# 7.960 (057c1e84a64476f2cb9388733e0a4972c904614e)
1 CORE
user  system elapsed 
0.677   0.004   0.682 

4 CORE
user  system elapsed 
1.842   0.006   0.466 

@conradsnicta I'm not sure why the difference is so huge in the second case. I can provide the sparse matrix in Matrix Market triplet format, along with the subsetting indices, if that would help. How can I help with the investigation?

@conradsnicta
Contributor

@dselivanov - I don't know what would be causing this. It works properly under gcc, so I suspect it's an issue with the OpenMP implementation in clang and/or macOS. Apple has been a bit iffy about providing OpenMP as part of the standard compiler on macOS, which also suggests that OpenMP in clang, or its interaction with macOS, is problematic.

@dselivanov
Author

dselivanov commented Nov 9, 2017 via email

@eddelbuettel
Member

The 0.8.200.2.0 release candidate looks good otherwise; I will merge it into master later, and probably prepare a drat release too.

@eddelbuettel
Member

FWIW the 0.8.200.2.0 tarball is now in the drat repo of the RcppCore organization so you can install it via your preferred R package tool just like prior pre-releases.
