Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Use ALTREP system to return C++-allocated arrays #6213

Merged
merged 17 commits into from
May 29, 2024

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Nov 26, 2023

This PR modifies the R interface to return some C++-allocated objects as ALTREP'd R vectors, thereby avoiding redundant memory copies.

I think the most useful thing to ALTREP here would be the externalptr handles through R_make_altlist_class introduced in R 4.3, so that serialization (calls to LGBM_BoosterSaveModelToString and LGBM_BoosterLoadModelFromString) would only be triggered on calls to functions like saveRDS / readRDS (like the python interface does), but seeing as the handle serialization requires metadata like importance_type, the interface would first need to be modified to make these parameters fixed instead of user-modifyiable.

@david-cortes
Copy link
Contributor Author

david-cortes commented Nov 26, 2023

I unfortunately am unable to find why the CI jobs are failing. Seems to run fine in my machine with r-debug builds (RDsan, RDcsan, RDvalgrind), and the jobs that would give a hint of what went wrong (sanitizers) are not failing here.

@jameslamb
Copy link
Collaborator

why the CI jobs are failing

The lint job is failing with errors from cppliint. Those errors with line numbers are printed in the logs (build link), and you can run https://github.com/microsoft/LightGBM/blob/master/.ci/lint-cpp.sh to reproduce them.


The r-package (debian, R-devel, clang) job is failing with the issue reported in #6212. That's some interaction between the newest R-devel as of a few days ago and clang 15.0.7+. That's unrelated to this PR's changes and I'll try to get a fix up this week, sorry for the inconvenience.


For all the other R-package job failures... they are not showing up on master or other PRs, so I strongly suspect they're related to this PR's changes. Note that they're all for the CMake-based, not CRAN-style, builds of the R package.

Screen Shot 2023-11-26 at 11 08 01 PM

(build links)

Those can be reproduced by using the build_r.R script at the root of this repo to build a package tarball. Here's how that's done in those CI jobs:

if [[ $R_BUILD_TYPE == "cmake" ]]; then
Rscript build_r.R -j4 --skip-install || exit -1

I haven't reviewed this PR yet so I can't provide more specific guidance than that yet.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most useful thing to ALTREP here would be the externalptr handles through R_make_altlist_class introduced in R 4.3,

Before we go too far with this, I want to be clear... I don't support dropping support for R < 4.3 for the benefit "avoiding redundant memory copies" at this time, unless you can make the case that those redundant copies are significantly problematic for a majority of LightGBM's users.

The first R 4.3.x, v4.3.0, has only been out for about 7 months (release history).

r-oldrel on CRAN is still on R 4.2.3, for example:

rversions::r_oldrel()
#     version                date         nickname
# 130   4.2.3 2023-03-15 08:06:01 Shortstop Beagle

Also, just for my understanding, are the changes in this PR the same type of uses of ALTREP that you referred to as "rather hacky" over in dmlc/xgboost#9734 (comment)?

I'm not familiar with ALTREP and will need to do some reading to effectively review this. Any evidence or documentation you could provide to help us understand what this mechanism is, how it works, and what the benefit of these changes are to LightGBM would be helpful.

@jameslamb jameslamb changed the title [R-packages] Use ALTREP system to return C++-allocated arrays [R-package] Use ALTREP system to return C++-allocated arrays Nov 27, 2023
@david-cortes
Copy link
Contributor Author

Note that they're all for the CMake-based, not CRAN-style, builds of the R package.

From some investigation, this is because the cmake build uses name lib_lightgbm for the shared object instead of lightgbm like the CRAN build does.

It can be fixed for the cmake builds by changing the name of the init function from R_init_lightgbm to R_init_lib_lightgbm and changing the registration for the ALTREP classes to lib_lightgbm, but then this wouldn't run on the CRAN build. Also I suspect this discrepancy might already be affecting current cmake builds, but I'm not sure what kind of things could break from this mismatch.

Is there perhaps some macro that is defined for the cmake builds but not for the CRAN ones or vice-versa? Otherwise, I don't know what would be the right way to solve this.

Also, just for my understanding, are the changes in this PR the same type of uses of ALTREP that you referred to as "rather hacky" over in dmlc/xgboost#9734 (comment)?

No, this is standard usage of ALTREP as it was designed to, and as can be found on articles and examples elsewhere:
https://purrple.cat/blog/2018/10/14/altrep-and-cpp/
https://www.r-project.org/dsc/2017/slides/dsc2017.pdf
https://github.com/ALTREP-examples/Rpkg-simplemmap

unless you can make the case that those redundant copies are significantly problematic for a majority of LightGBM's users.

Currently, lightbm creates a serialized version of models on every call. Lots of models created in interactive sessions don't end up being saved to disk. This increases fitting times and memory usage in the R interface in a way which doesn't happen in other interfaces.

Then, it also makes models read from a serialized file (like readRDS) differ in timings from the first call to predict vs. the subsequent ones, and I have to guess that there might be a sizable subset of users who might be inadvertently serving a model in a forked process (as done through e.g. Rserve and RestRserve) where each model prediction de-serializes the object without restoring it in the original process.

But that'd be for using ALTREP to handle serialization of externalptr, which is not in the scope of this PR. This one should work with R >= 3.6 which I understand is already required.

@jameslamb
Copy link
Collaborator

Sorry for the long delay! I'm ready to come back and help with reviews to move this forward.

Is there perhaps some macro that is defined for the cmake builds but not for the CRAN ones or vice-versa?

There is not, currently.

Elsewhere in the R package, we handle this particular thing by having "lib_lightgbm" in source code and then using sed to replace that with lightgbm when preparing a CRAN-style source distribution.

That's done here:

# When building an R package with 'configure', it seems
# you're guaranteed to get a shared library called
# <packagename>.so/dll. The package source code expects
# 'lib_lightgbm.so', not 'lightgbm.so', to comply with the way
# this project has historically handled installation
echo "Changing lib_lightgbm to lightgbm"
for file in R/*.R; do
sed \
-i.bak \
-e 's/lib_lightgbm/lightgbm/' \
"${file}"
done
sed \
-i.bak \
-e 's/lib_lightgbm/lightgbm/' \
NAMESPACE

Could you try:

  • adding src/lightgbm_R.cpp to that part of the script
  • using "lib_lightgbm" instead of "lightgbm" in code added in this PR

No, this is standard usage of ALTREP as it was designed to, and as can be found on articles and examples ...

Excellent, thanks very much for the links! That's helpful.


that'd be for using ALTREP to handle serialization of externalptr

Thanks very much for the explanation of the benefits of more ALTREP changes. I learned a lot from your descriptions in dmlc/xgboost#9847.

At this time I still don't support dropping support for R 3.6 in future PRs in exchange for those benefits.

If it's possible to make thosechanges that require R >= 4.3 in a way that's backwards compatible (e.g. by detecting the R version in configure/configure.win and passing through something like -DR_GTE_43 to drive a preprocessor macro), I'd support that.

@david-cortes
Copy link
Contributor Author

Modified to use lib_lightgbm as the shared object name, and then make the substitutions also for .cpp files.

@david-cortes
Copy link
Contributor Author

Now it looks like only the msvc jobs are failing while other cmake jobs are passing.

I cannot find any error message in those msvc jobs so don't know what went wrong.


void R_init_lightgbm(DllInfo *dll) {
void R_init_lib_lightgbm(DllInfo *dll) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this and the similar change above it are necessary. I suspect those might be the cause of the MSVC failures you're seeing.

Can you please revert them? This function should be called R_init_lightgbm, for both the CRAN (autotools) package and the CMake-based builds.

As described in https://cran.r-project.org/doc/manuals/R-exts.html

It is good practice for DLLs to register their symbols (see Registering native routines), restrict visibility (see Controlling visibility) and not allow symbol search (see Registering native routines). It should be possible for a DLL to have only one visible symbol, R_init_pkgname, on suitable platforms84, which would completely avoid symbol conflicts.

See also:

and #4494

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would fail in my setup if I rename the init function to not have lib_. Changing it in the altrep registration name doesn't seem to generate any failures though, so I've changed it there to see if it makes a difference.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be very surprised to learn that this function name needs to change as a part of this PR. It's already a part of the C API for lib_lightgbm.dll (from the perspective of the R package built with MSVC), and the changes in this PR are all otherwise not exposed as C API changes.

At a minimum, I suspect the CI failures you're seeing for MSVC are because you're changing this function name but not changing the lightgbm-win.def file I linked above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other cmake-based jobs that were tested here, leaving the name as R_init_lightgbm makes the job fail, while the same jobs succeed when changing it to R_init_lib_lightgbm.

The -win.def file that you mentioned AFAIK is not used in the msvc build from cmake.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmmm ok. Thanks for looking into that!

I still don't understand though what about these changes would require that function name to be changed to R_init_lib_lightgbm when it is working on master as R_init_lightgbm. Given that you're just adding new behavior inside that function.

Anyway, maybe this and the .def file are not the right place to look anyway.... looking more closely at the failing builds, I see that quite a few unit tests are passing before those jobs eventually fail.

e.g. on https://github.com/microsoft/LightGBM/actions/runs/7413984613/job/20174104011?pr=6213

----- end of build and install logs -----
Running tests with testthat.R
basic: ........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
custom_objective: ..........
dataset: ................................S.............................................
learning_to_rank: ....................................................
lgb.Booster: ..........................................................................
Error: Process completed with exit code -1.

I have a Windows laptop with MSVC and R on it.... I can pull this branch and test. Maybe I'll at least be able to get some logs pointing to what is happening. Powershell provided on GitHub Actions runners (which we use to orchestrate those tests, see https://github.com/microsoft/LightGBM/blob/master/.ci/test_r_package_windows.ps1) is very picky about how it manages stderr and we've put in many changes to work around that in the future. For example:

# External utilities like R.exe / Rscript.exe writing to stderr (even for harmless
# status information) can cause failures in GitHub Actions PowerShell jobs.
# See https://github.community/t/powershell-steps-fail-nondeterministically/115496
#
# Using standard PowerShell redirection does not work to avoid these errors.
# This function uses R's built-in redirection mechanism, sink(). Any place where
# this function is used is a command that writes harmless messages to stderr
function Run-R-Code-Redirect-Stderr {
param(
[string]$rcode
)
$decorated_code = "out_file <- file(tempfile(), open = 'wt'); sink(out_file, type = 'message'); $rcode; sink()"
Rscript --vanilla -e $decorated_code
}

Maybe some useful error or warning is being written to stderr and we're not seeing it in CI logs as a result of those changes.

Thanks for all your efforts so far! I can take over the investigation for a bit. I think it's likely the issue is about the peculiarities of LightGBM's CI setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-msvc cmake ones, before the name change to R_init_lib_lightgbm, the tests would fail when they try to do something with the variables representing altrep classes introduced here (R_altrep_class_t), which would not have been initialized by the time the .Call uses them and thus have invalid pointers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe a start could be to verify whether lines like this one are run before .Call:

lgb_altrepped_char_vec = R_make_altraw_class("lgb_altrepped_char_vec", "lightgbm", dll);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iiiiinteresting, thanks for that!

Since you mention this being maybe related to the timing of when those statements run, I think it's important to point out a difference between the MSYS/MinGW CMake CI jobs and the MSVC ones.

The MSYS/MinGW ones run R CMD check.

if ($env:COMPILER -ne "MSVC") {

Run-R-Code-Redirect-Stderr "result <- processx::run(command = 'R.exe', args = $check_args, echo = TRUE, windows_verbatim_args = FALSE, error_on_status = TRUE)" ; $check_succeeded = $?

The MSVC ones just run Rscript --vanilla R-package/tests/testthat.R

if ($env:COMPILER -eq "MSVC") {
Write-Output "Running tests with testthat.R"
cd R-package/tests
# NOTE: using Rscript.exe intentionally here, instead of Run-R-Code-Redirect-Stderr,
# because something about the interaction between Run-R-Code-Redirect-Stderr
# and testthat results in failing tests not exiting with a non-0 exit code.
Rscript.exe --vanilla "testthat.R" ; Check-Output $?
}

That could result in different code paths being followed and a different order of things being initialized? I'm not sure. If that's the case, maybe we'll have to add a .onLoad() or .onAttach() to the package (though I really hope not).

I'll look into that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to work on removing this difference in the library name (lib_lightgbm vs. lightgbm) in the different types of R builds here. Work in progress at #6432. Hopefully it'll help with this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#6432 has been merged, so:

  • all this lib_lightgbm vs. lightgbm stuff in the R package is gone
  • the CMake-based builds now should do routine registration and invoke R_init_lightgbm() just like the CRAN-style build do

Let's see in CI if it helps.

@jameslamb
Copy link
Collaborator

Sorry for the VERY long delay in reviews. As I'm sure you've noticed, we are struggling with a lack of maintainer availability here.

I have a Windows laptop set up for development now, will look into this today.

@david-cortes
Copy link
Contributor Author

I am unable to build the library with MSVC after the latest updates and cannot find full error traces in the failing jobs, so not sure what is failing now.

@jameslamb
Copy link
Collaborator

Ok @david-cortes I have some more debugging information to share, hopefully it'll help.

On my Windows laptop (Windows 10 Home, R 4.4.0, Rtools44), on latest master, I was able to reproduce the build failures you were seeing with Visual Studio.

The change in #6451 appears to fix them. I think the root cause was CMake detecting an old target Windows version and therefore Windows SDK.

With that change applied on this branch, I was also able to reproduce the test failure we're seeing for MSVC jobs in CI, and narrow it down to a specific test case.

test_that("Saving a model with unknown importance type fails", {

To reproduce, you could apply the changes from #6451 to this branch, then install the package with build_r.R.

Rscript build_r.R --no-build-vignettes

Then try the following in an R session.

library(lightgbm)
    
set.seed(708L)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
bst <- lightgbm::lightgbm(
    data = as.matrix(train$data)
    , label = train$label
    , params = list(
        num_leaves = 4L
        , objective = "binary"
        , verbose = 1
        , num_threads = 2
    )
    , nrounds = 2L
)

# NOTE: only values of 0 or 1 are supported in the underlying API called here
bst$save_model_to_string(
    feature_importance_type = 2L
)

On master or #6451, it successfully and correctly raises this R error:

[LightGBM] [Fatal] Unknown importance type: only support split=0 and gain=1
Error in bst$save_model_to_string(feature_importance_type = 2L) : 
  Unknown importance type: only support split=0 and gain=1

And the session does not crash.

On this branch, installing the same way, that code emits that error text to stderr and then the R session crashes.

In case it helps, here's the place where that error is raised from:

Log::Fatal("Unknown importance type: only support split=0 and gain=1");

@david-cortes
Copy link
Contributor Author

I can confirm that the code under R_init_lightgbm is being executed as expected after the latest changes in master and the cmake build fix.

No idea why it'd be crashing though, and am unfortunately not familiar with debugging tools for MSVC.

Could this perhaps have something to do with the compile define STRUCT_SUBTYPES for altrep classes? It doesn't seem to be user-configurable, but it looks like it has an alternative definition of altrep class structure for what I assume could be problematic compilers.

@david-cortes
Copy link
Contributor Author

After some testing, it's actually not anyhow related to STRUCT_SUBTYPES - forcbly modifying the R built-in header to use the alternative definition still leads to the same crash.

Only solution I can think of would be to disable this feature if it detects that it's being compiled with MSVC.

@david-cortes
Copy link
Contributor Author

After some testing, it is actually a matter of having the alternative definition of data structures when STRUCT_SUBTYPES is not defined.

After manually disabling it in the R Altrep.h header, it doesn't crash anymore, so it should be a matter of having it undefined so that this line would be reached.

Unfortunately, the macro is forcibly defined in the same R header:
https://github.com/wch/r-source/blob/531d0fd05d3bdc0f8a0350e3ecd3e101d315ab59/src/include/R_ext/Altrep.h#L33

So I cannot think of any mechanism that could be used to prevent its definition in the lines below that one.

@jameslamb
Copy link
Collaborator

Thank you so much for investigating that!

I read back through all the links you provided (#6213 (comment)) and did my own searches for MSVC-specific discussions about ALTREP... didn't find anything suggesting another path forward.

I'd support modifying this PR such that the feature is disabled when compiling with MSVC.

I know that adds some complexity to the code, but maybe it'd be worth it for a while to keep that separation... in case there's some specific planned breaking changes to the ALTREP API that led Dr. Tierney to push this 3 weeks ago: wch/r-source@fb52ac1

/* 
   Not part of the API, subject to change at any time.
   This API is experimental and may change on short notice.
   Package authors using this API should be prepared to adapt to changes
   when they occur.
*/

@david-cortes
Copy link
Contributor Author

I couldn't find any better solution, so I've now disabled usage of altrep classes when compiling with msvc.

@david-cortes
Copy link
Contributor Author

Did some additional investigation about the failing MSVC tests. It looks like there's some additional issue apart from the altrep struct declaration.

The failing test from the previous snippet doesn't even reach the Altrep section - it crashes somewhere inside the CHECK_CALL mechanism (function LGBM_BoosterSaveModelToString returns an error code correctly).

If the std::unique_ptr<std::vector<char>> type is changed back to std::vector<char>, then things work correctly.

I am guessing there must either be a compiler bug somewhere, or there must be some additional memory corruption happening before the call to LGBM_BoosterSaveModelToString_R, given that changing the Altrep header somehow solves it.

So I'm reverting back the whole function for MSVC this time.

@jameslamb
Copy link
Collaborator

jameslamb commented May 16, 2024

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/9105975044

Status: failure ❌.

@jameslamb jameslamb self-requested a review May 16, 2024 03:23
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working through this! And for starting a thread on r-pkg-devel (link).

This looks great to me, and now that I'm caught up on how ALTREP works (thanks for the links), I'm excited for the potential.

So I'm reverting back the whole function for MSVC this time.

I support isolating this function for MSVC this way.

I've triggered the valgrind checks on this branch. Assuming that those pass, I support merging this and getting it into the v4.4.0 release (#6439).

I'd appreciate if we could get one more C++ review @shiyu1994 @guolinke @btrotta

#ifndef _MSC_VER
SEXP LGBM_BoosterSaveModelToString_R(SEXP handle,
SEXP num_iteration,
SEXP feature_importance_type) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just merged #6398. In that, @mayer79 exposed start_iteration on LGBM_BoosterSaveModelToString_R(). Could you add that on this implementation?

Sorry for that, it's unfortunate timing that these two PRs both modifying LGBM_BoosterSaveModelToString_R() reached the ready-to-be-merged state at roughly the same time.

@david-cortes
Copy link
Contributor Author

From the valgrind logs, I see two of them mention:

GOMP_parallel

Which is most likely a false positive since valgrind doesn't work correctly with libgomp.

The other doesn't mention GOMP, but it's the exact same size (368 bytes) as the others, and mentions:

pthread_create@@GLIBC_2.34
std::thread::_M_start_thread

So I'd say there's a very high chance that it's also coming from libgomp.

@jameslamb
Copy link
Collaborator

Which is most likely a false positive since valgrind doesn't work correctly with libgomp.

I totally agree with you. And even if it isn't a false positive.... it's not a result of this PR's changes. I just ran these tests on a different branch (#6463 (comment)) and see the exact same findings from valgrind.

I don't have permissions to admin-merge over a failing CI check here. I'll get those tests passing on #6463 then do a rebuild here and merge this.

@jameslamb
Copy link
Collaborator

jameslamb commented May 28, 2024

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/9276361940

Status: success ✔️.

@jameslamb
Copy link
Collaborator

Thanks very much for this!!

@jameslamb jameslamb merged commit dee8a18 into microsoft:master May 29, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants