[cuda] fail with clear error when device_type changes after Dataset construct #2
Merged
BelixRogner merged 1 commit on May 9, 2026
A Dataset constructed without device_type=cuda and then trained with
device_type=cuda silently leaves the Dataset's cuda_metadata_ unset and
SIGSEGVs inside CUDAObjectiveInterface::Init when it dereferences a
null metadata.cuda_metadata(). Repro: lgb.Dataset(X, y).construct()
followed by lgb.train({"device_type": "cuda", ...}, ds, ...).
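The repro above can be written out as a short script. This is a minimal sketch, assuming a CUDA-enabled LightGBM build; the synthetic data, the `regression` objective, and the helper names `build_train_params` and `run_repro` are illustrative choices, not part of the original report.

```python
def build_train_params(device_type="cuda"):
    """Training params; only device_type matters for the repro."""
    return {"objective": "regression", "device_type": device_type, "verbose": -1}


def run_repro(num_rows=1000, num_features=10, seed=0):
    """Construct a Dataset without device_type, then train with device_type=cuda."""
    # Imported lazily so this file can be loaded without LightGBM installed.
    import numpy as np
    import lightgbm as lgb

    # Synthetic data (an assumption; any X, y should do).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(num_rows, num_features))
    y = rng.normal(size=num_rows)

    # No device_type here, so the Dataset is built with the default setting.
    ds = lgb.Dataset(X, y).construct()

    # Training asks for CUDA against the already-constructed Dataset.
    return lgb.train(build_train_params(), ds, num_boost_round=1)
```

Calling `run_repro()` on a CUDA build is the call sequence described above: a crash before this change, a LightGBMError after it.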
- Add device_type to CheckDatasetResetConfig so changing it after
construction triggers the standard "free handle and reconstruct"
path used by other unchangeable params such as max_bin. Reuses the
existing UpdateParamChecking flow on the Python side, which will
reconstruct the Dataset if its raw data is still in memory or surface
a clear LightGBMError otherwise.
- Add defensive null checks in CUDAObjectiveInterface::Init and
CUDAMetricInterface::Init that emit a clear error if any other code
path reaches them with a null cuda_metadata, instead of segfaulting.
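The two changes above can be modelled in a few lines of plain Python. This is a toy sketch only: `ToyDataset`, `ToyLightGBMError`, `check_dataset_reset_config`, and `init_cuda_objective` are hypothetical stand-ins that mirror the description, not LightGBM's C++ or Python source, and the error messages are invented for illustration.

```python
# Toy model of the behaviour described above; all names are hypothetical.

DEFAULTS = {"device_type": "cpu", "max_bin": 255}
ALIASES = {"device": "device_type"}
UNCHANGEABLE = ("max_bin", "device_type")


class ToyLightGBMError(RuntimeError):
    """Stand-in for LightGBMError."""


def normalize(params):
    """Map aliases (device -> device_type) onto canonical names."""
    return {ALIASES.get(key, key): value for key, value in params.items()}


def check_dataset_reset_config(dataset_params, new_params):
    """Raise if new_params changes a parameter fixed at Dataset construction."""
    old = {**DEFAULTS, **normalize(dataset_params)}
    for name, value in normalize(new_params).items():
        if name in UNCHANGEABLE and value != old[name]:
            raise ToyLightGBMError(f"cannot change {name} after Dataset construction")


class ToyDataset:
    def __init__(self, raw_data, params=None, free_raw_data=True):
        self.raw_data = raw_data
        self.params = normalize(params or {})
        self.free_raw_data = free_raw_data
        self.handle = None

    def construct(self):
        if self.handle is None:
            device = {**DEFAULTS, **self.params}["device_type"]
            # CUDA metadata only exists when the Dataset itself was built for CUDA.
            cuda_metadata = {"label": "labels"} if device == "cuda" else None
            self.handle = {"cuda_metadata": cuda_metadata}
            if self.free_raw_data:
                self.raw_data = None
        return self

    def update_params(self, new_params):
        if self.handle is not None:
            try:
                check_dataset_reset_config(self.params, new_params)
                return self  # nothing fixed at construction time has changed
            except ToyLightGBMError:
                if self.raw_data is None:
                    raise  # raw data already freed: surface the clear error
                self.handle = None  # free the handle; construct() will rebuild
        self.params.update(normalize(new_params))
        return self


def init_cuda_objective(dataset):
    """Defensive guard: fail loudly instead of dereferencing missing metadata."""
    metadata = dataset.handle["cuda_metadata"]
    if metadata is None:
        raise ToyLightGBMError("Dataset was not constructed with device_type=cuda")
    return metadata["label"]
```

In this model, `ToyDataset(data).construct().update_params({"device": "cuda"})` raises because the raw data was freed, while the same call on a Dataset created with `free_raw_data=False` drops the handle so the next `construct()` rebuilds it with CUDA metadata.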
Add test_cuda_dataset_device_type_unchangeable_after_construct that
asserts the new clean error.
Cross-references upstream PR lightgbm-org#7265 — same fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Thanks Max for contributing this, and thanks to Claude Code for helping trace it. The c_api change is exactly the right place: it slots into the existing … Minor optional follow-up: the two 5-line null checks in …
Summary
Port of upstream lightgbm-org#7265. Different scope from #1: this is the SIGSEGV that fires before any histogram kernel runs, at CUDAObjectiveInterface::Init, when a Dataset was constructed without device_type=cuda and the user then trains with device_type=cuda.
Repro on this fork (numerai-cuda-fast, base 006bf32c, before this PR)
The dereference is at include/LightGBM/cuda/cuda_objective_function.hpp:33:
cuda_labels_ = metadata.cuda_metadata()->cuda_label();
where metadata.cuda_metadata() is nullptr because Dataset::FinishLoad only calls metadata_.CreateCUDAMetadata when device_type_ == "cuda", and the train-side device_type=cuda never propagated to the Dataset because there's no entry for it in Booster::CheckDatasetResetConfig.
After this PR
Same call sequence raises a clean LightGBMError:
… and the existing Python-side _update_params flow lets a user with free_raw_data=False automatically re-construct the Dataset for CUDA.
What's in this PR
Three code changes plus a test:
- src/c_api.cpp: add device_type (and its device alias) to Booster::CheckDatasetResetConfig so changing it after Dataset construction triggers the standard "free handle and reconstruct" path used by other unchangeable params (max_bin, categorical_feature, etc.).
- include/LightGBM/cuda/cuda_objective_function.hpp: defensive null check in CUDAObjectiveInterface::Init. If anything else in the codebase ever lands here with a null cuda_metadata, a clear Log::Fatal message fires instead of a segfault.
- include/LightGBM/cuda/cuda_metric.hpp: same null check in CUDAMetricInterface::Init.
- tests/python_package_test/test_engine.py: test_cuda_dataset_device_type_unchangeable_after_construct runs the repro under CUDA and asserts the clean error.
Why this isn't fixed in PR #1
Different bug. PR #1 is about the discretized histogram path's leaf-state propagation, which only fires once training is in flight. This SIGSEGV fires at booster construction, before CUDAGradientDiscretizer is even touched.
Verified
Built and tested on Linux + CUDA 13.2 + RTX 5090 (sm_120). Required a local-only CMAKE_CUDA_ARCHITECTURES="120" patch to skip dropped Pascal arches; that's not part of this PR.
Same disclosure as #1: Claude Code helped trace this; the bug and fix are real. Let me know if any part wants more detail.