[cuda] fail with clear error when device_type changes after Dataset construct by maxwbuckley · Pull Request #2 · BelixRogner/ExaBoost

maxwbuckley · 2026-05-09T16:05:54Z

Summary

Port of upstream lightgbm-org#7265. The scope differs from #1: this PR addresses the SIGSEGV that fires before any histogram kernel runs, in CUDAObjectiveInterface::Init, when a Dataset was constructed without device_type=cuda and the user then trains with device_type=cuda.

Repro on this fork (`numerai-cuda-fast`, base `006bf32c`, before this PR)

import numpy as np, lightgbm as lgb
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5)).astype(np.float32)
y = rng.uniform(size=100).astype(np.float32)
ds = lgb.Dataset(X, label=y).construct()           # CPU-side construct
lgb.train({"device_type": "cuda", "objective": "regression", "verbose": -1},
          ds, num_boost_round=1)                   # SIGSEGV (exit 139)

The dereference is at include/LightGBM/cuda/cuda_objective_function.hpp:33:

cuda_labels_ = metadata.cuda_metadata()->cuda_label();

Here metadata.cuda_metadata() is nullptr for two reasons. First, Dataset::FinishLoad calls metadata_.CreateCUDAMetadata only when device_type_ == "cuda". Second, the train-side device_type=cuda never propagates to the Dataset, because Booster::CheckDatasetResetConfig has no entry for it.
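The mechanism can be modeled in a few lines of plain Python. This is a toy sketch, not LightGBM code: ToyCudaMetadata, ToyDataset, and toy_objective_init are hypothetical stand-ins for the CUDA metadata object, Dataset::FinishLoad, and CUDAObjectiveInterface::Init, and Python's AttributeError on None stands in for the native null dereference.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToyCudaMetadata:
    """Hypothetical stand-in for the device-side copy of the labels."""
    cuda_label: Sequence[float]


class ToyDataset:
    """Hypothetical stand-in for a constructed Dataset."""

    def __init__(self, label, device_type="cpu"):
        self.label = list(label)
        self.device_type = device_type
        self.cuda_metadata: Optional[ToyCudaMetadata] = None

    def construct(self):
        # Mirrors the behavior described above: device metadata is created
        # only when the Dataset itself was built with device_type == "cuda".
        if self.device_type == "cuda":
            self.cuda_metadata = ToyCudaMetadata(self.label)
        return self


def toy_objective_init(dataset):
    # Unguarded access, analogous to metadata.cuda_metadata()->cuda_label().
    return dataset.cuda_metadata.cuda_label


cpu_ds = ToyDataset([0.1, 0.2]).construct()
cuda_ds = ToyDataset([0.1, 0.2], device_type="cuda").construct()

print(toy_objective_init(cuda_ds))  # metadata exists, so the access succeeds
try:
    toy_objective_init(cpu_ds)      # metadata was never created
except AttributeError as err:
    print("toy analog of the SIGSEGV:", err)
```

The point of the model is that nothing between construction and objective init re-checks device_type, so the mismatch is only discovered at the dereference.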

After this PR

Same call sequence raises a clean LightGBMError:

LightGBMError: Cannot change device_type after constructed Dataset handle.

… and the existing Python-side _update_params flow lets a user with free_raw_data=False automatically re-construct the Dataset for CUDA.
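The two outcomes (automatic re-construct versus a clean error) can be sketched with a toy model. Everything here is an assumption made for illustration: ToyResettableDataset, ToyError, and update_params are hypothetical names that only imitate the described flow, and the real _update_params in the Python package may differ in detail.

```python
class ToyError(Exception):
    """Hypothetical stand-in for LightGBMError."""


class ToyResettableDataset:
    """Toy model of a Dataset whose device_type is fixed at construction."""

    def __init__(self, raw_data, device_type="cpu", free_raw_data=True):
        self.raw_data = raw_data
        self.device_type = device_type
        self.free_raw_data = free_raw_data
        self.handle = None  # stands in for the native Dataset handle

    def construct(self):
        if self.handle is None:
            self.handle = ("handle", self.device_type)
            if self.free_raw_data:
                self.raw_data = None  # raw data is dropped once constructed
        return self

    def update_params(self, params):
        requested = params.get("device_type", self.device_type)
        # Nothing to reset if no handle exists yet or the value is unchanged.
        if self.handle is None or requested == self.device_type:
            self.device_type = requested
            return self
        # The handle exists and device_type differs: rebuilding needs raw data.
        if self.raw_data is None:
            raise ToyError(
                "Cannot change device_type after constructed Dataset handle."
            )
        self.device_type = requested
        self.handle = None  # free the old handle, then rebuild for the new device
        return self.construct()


kept = ToyResettableDataset([[1.0, 2.0]], free_raw_data=False).construct()
kept.update_params({"device_type": "cuda"})
print(kept.handle)

freed = ToyResettableDataset([[1.0, 2.0]]).construct()
try:
    freed.update_params({"device_type": "cuda"})
except ToyError as err:
    print(err)
```

In practice the simplest way to avoid either path is probably to give the Dataset the same device_type before it is constructed, for example through the params argument of lgb.Dataset, so the Dataset and the training call never disagree.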

What's in this PR

Three code changes plus a test:

- src/c_api.cpp: add device_type (and its device alias) to Booster::CheckDatasetResetConfig so changing it after Dataset construction triggers the standard "free handle and reconstruct" path used by other unchangeable params (max_bin, categorical_feature, etc.).
- include/LightGBM/cuda/cuda_objective_function.hpp: defensive null check in CUDAObjectiveInterface::Init. If anything else in the codebase ever lands here with a null cuda_metadata, a clear Log::Fatal message fires instead of a segfault.
- include/LightGBM/cuda/cuda_metric.hpp: same null check in CUDAMetricInterface::Init.
- tests/python_package_test/test_engine.py: test_cuda_dataset_device_type_unchangeable_after_construct runs the repro under CUDA and asserts the clean error.
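The defensive checks follow a common pattern: validate the pointer once, fail with a descriptive message, and only then read through it. The Python sketch below illustrates that pattern under loud assumptions. ToyFatalError, require_cuda_metadata, and both toy init functions are hypothetical, the error wording is illustrative rather than the message used in this PR, and folding both checks into one helper is only one possible factoring.

```python
class ToyFatalError(RuntimeError):
    """Hypothetical stand-in for the error produced by Log::Fatal."""


def require_cuda_metadata(cuda_metadata, caller):
    """Return the metadata, or fail loudly if it was never created."""
    if cuda_metadata is None:
        # Illustrative wording only; not the message from the PR.
        raise ToyFatalError(
            f"{caller}: Dataset has no CUDA metadata; "
            "was it constructed with device_type=cuda?"
        )
    return cuda_metadata


def toy_objective_init(cuda_metadata):
    # Guard first, then read the labels through the validated object.
    return require_cuda_metadata(cuda_metadata, "objective init")["label"]


def toy_metric_init(cuda_metadata):
    # Same guard, different call site.
    return require_cuda_metadata(cuda_metadata, "metric init")["label"]


print(toy_objective_init({"label": [0.1, 0.2]}))
for init in (toy_objective_init, toy_metric_init):
    try:
        init(None)
    except ToyFatalError as err:
        print(err)
```

Naming the caller in the message keeps the two call sites distinguishable even though they share one guard.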

Why this isn't fixed in PR #1

Different bug. PR #1 is about the discretized histogram path's leaf-state propagation, which only fires once training is in flight. This SIGSEGV fires at booster construction, before CUDAGradientDiscretizer is even touched.

Verified

Built and tested on Linux + CUDA 13.2 + RTX 5090 (sm_120). Required a local-only CMAKE_CUDA_ARCHITECTURES="120" patch to skip dropped Pascal arches; that's not part of this PR.

Same disclosure as #1: Claude Code helped trace this; the bug and fix are real. Let me know if any part wants more detail.

…onstruct

A Dataset constructed without device_type=cuda and then trained with device_type=cuda silently leaves the Dataset's cuda_metadata_ unset and SIGSEGVs inside CUDAObjectiveInterface::Init when it dereferences a null metadata.cuda_metadata().

Repro: lgb.Dataset(X, y).construct() followed by lgb.train({"device_type": "cuda", ...}, ds, ...).

- Add device_type to CheckDatasetResetConfig so changing it after construction triggers the standard "free handle and reconstruct" path used by other unchangeable params such as max_bin. Reuses the existing UpdateParamChecking flow on the Python side, which will reconstruct the Dataset if its raw data is still in memory or surface a clear LightGBMError otherwise.
- Add defensive null checks in CUDAObjectiveInterface::Init and CUDAMetricInterface::Init that emit a clear error if any other code path reaches them with a null cuda_metadata, instead of segfaulting.

Add test_cuda_dataset_device_type_unchangeable_after_construct that asserts the new clean error.

Cross-references upstream PR lightgbm-org#7265 (same fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-05-09T18:52:06Z

Thanks Max for contributing this, and thanks to Claude Code for helping trace it. The c_api change is exactly the right place — it slots into the existing CheckDatasetResetConfig pattern verbatim, and the regression test pins the behavior cleanly. Merging now.

Minor optional follow-up: the two 5-line null checks in cuda_objective_function.hpp and cuda_metric.hpp are identical and could be a single helper. Small enough not to block on though.

BelixRogner merged commit 87f1f7d into BelixRogner:numerai-cuda-fast May 9, 2026

BelixRogner mentioned this pull request May 10, 2026

[cuda] fix unweighted percentile formula (L1 & quantile leaf renewal) #6

Merged

4 tasks

BelixRogner merged 1 commit into BelixRogner:numerai-cuda-fast from maxwbuckley:cuda-objective-init-context
