[cuda] fail with clear error when device_type changes after Dataset construct #2

Merged
BelixRogner merged 1 commit into
BelixRogner:numerai-cuda-fast from
maxwbuckley:cuda-objective-init-context
May 9, 2026

@maxwbuckley

Summary

Port of upstream lightgbm-org#7265. Different scope from #1 — this is the SIGSEGV that fires before any histogram kernel runs, at CUDAObjectiveInterface::Init, when a Dataset was constructed without device_type=cuda and the user then trains with device_type=cuda.

Repro on this fork (numerai-cuda-fast, base 006bf32c, before this PR)

import numpy as np, lightgbm as lgb
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5)).astype(np.float32)
y = rng.uniform(size=100).astype(np.float32)
ds = lgb.Dataset(X, label=y).construct()           # CPU-side construct
lgb.train({"device_type": "cuda", "objective": "regression", "verbose": -1},
          ds, num_boost_round=1)                   # SIGSEGV (exit 139)

The dereference is at include/LightGBM/cuda/cuda_objective_function.hpp:33:

cuda_labels_ = metadata.cuda_metadata()->cuda_label();

Here metadata.cuda_metadata() is nullptr. Dataset::FinishLoad only calls metadata_.CreateCUDAMetadata when device_type_ == "cuda", and the train-side device_type=cuda never propagates to the Dataset because Booster::CheckDatasetResetConfig has no entry for it.

After this PR

Same call sequence raises a clean LightGBMError:

LightGBMError: Cannot change device_type after constructed Dataset handle.

… and the existing Python-side _update_params flow lets a user with free_raw_data=False automatically re-construct the Dataset for CUDA.
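
For readers unfamiliar with that flow, here is a rough, self-contained Python model of it. It is not LightGBM code: ToyDataset, ToyError, DEFAULTS, and update_params are hypothetical stand-ins, and the logic is simplified to the one decision that matters here, namely whether the raw data is still available when an unchangeable parameter changes after construction.

```python
# Toy model only. ToyDataset, ToyError, DEFAULTS and update_params are
# hypothetical stand-ins, not LightGBM classes, constants, or methods.

# Parameters that cannot change once a handle exists, with assumed defaults.
DEFAULTS = {"max_bin": 255, "device_type": "cpu"}


class ToyError(Exception):
    """Stands in for the clean error surfaced to the user."""


class ToyDataset:
    def __init__(self, raw, params=None, free_raw_data=True):
        self.raw = raw
        self.params = dict(params or {})
        self.free_raw_data = free_raw_data
        self.handle = None  # None means "not constructed yet"

    def construct(self):
        if self.handle is None:
            if self.raw is None:
                raise ToyError("Raw data was freed; cannot construct.")
            # The handle records the device it was built for.
            self.handle = {"device_type": self.params.get("device_type", "cpu")}
            if self.free_raw_data:
                self.raw = None
        return self

    def update_params(self, new_params):
        if self.handle is not None:
            for key, default in DEFAULTS.items():
                changed = key in new_params and new_params[key] != self.params.get(key, default)
                if changed:
                    if self.raw is None:
                        # Raw data is gone, so the only option is a clear error.
                        raise ToyError(f"Cannot change {key} after constructed Dataset handle.")
                    # Raw data is still around: free the handle and rebuild later.
                    self.handle = None
                    break
        self.params.update(new_params)
        return self


# Raw data kept: the Dataset is silently rebuilt for the new device.
kept = ToyDataset([1, 2, 3], free_raw_data=False).construct()
kept.update_params({"device_type": "cuda"}).construct()

# Raw data freed: the same change raises instead of crashing later.
freed = ToyDataset([1, 2, 3]).construct()
try:
    freed.update_params({"device_type": "cuda"})
    freed_error = None
except ToyError as exc:
    freed_error = str(exc)
```

In this model the first Dataset ends up with a handle built for cuda, while the second reports the error message at update time, which mirrors the two outcomes described above.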

What's in this PR

Three code changes plus a test:

  1. src/c_api.cpp — add device_type (and its device alias) to Booster::CheckDatasetResetConfig so changing it after Dataset construction triggers the standard "free handle and reconstruct" path used by other unchangeable params (max_bin, categorical_feature, etc.).

  2. include/LightGBM/cuda/cuda_objective_function.hpp — defensive null check in CUDAObjectiveInterface::Init. If anything else in the codebase ever lands here with a null cuda_metadata, a clear Log::Fatal message fires instead of a segfault.

  3. include/LightGBM/cuda/cuda_metric.hpp — same null check in CUDAMetricInterface::Init.

  4. tests/python_package_test/test_engine.py: test_cuda_dataset_device_type_unchangeable_after_construct runs the repro under CUDA and asserts the clean error.
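
The guard in items 2 and 3 lives in C++ headers and is not reproduced in this thread. As a loose illustration of the pattern only, the following Python model shows a check that fails with a readable message before the missing metadata is used. Every name in it (ToyFatal, ToyMetadata, require_cuda_metadata, ToyObjective, ToyMetric) and the error wording are hypothetical, and routing both call sites through one helper is an assumption of this sketch, not a description of the merged code.

```python
# Python model of a C++ guard. Every name and the message text below are
# hypothetical stand-ins, not LightGBM identifiers or log output.


class ToyFatal(RuntimeError):
    """Stands in for the error raised by a fatal log call."""


class ToyMetadata:
    def __init__(self, cuda_metadata=None):
        # None models a Dataset that was constructed without device_type=cuda.
        self.cuda_metadata = cuda_metadata


def require_cuda_metadata(metadata, caller):
    """Return the CUDA metadata, or fail loudly if it was never created."""
    if metadata.cuda_metadata is None:
        raise ToyFatal(
            f"{caller}: Dataset has no CUDA metadata; "
            "construct the Dataset with device_type=cuda."
        )
    return metadata.cuda_metadata


class ToyObjective:
    def init(self, metadata):
        # Check first, then read the labels.
        self.cuda_labels = require_cuda_metadata(metadata, "objective init")["label"]


class ToyMetric:
    def init(self, metadata):
        self.cuda_labels = require_cuda_metadata(metadata, "metric init")["label"]


# With metadata present, init proceeds normally.
objective = ToyObjective()
objective.init(ToyMetadata({"label": [0.5, 0.25]}))

# With metadata missing, both call sites report the same clear error.
messages = []
for cls in (ToyObjective, ToyMetric):
    try:
        cls().init(ToyMetadata())
    except ToyFatal as exc:
        messages.append(str(exc))
```

The point of the model is the ordering: the check runs before any field of the metadata is read, so a missing object produces a message naming the caller rather than a crash.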

Why this isn't fixed in PR #1

Different bug. PR #1 is about the discretized histogram path's leaf-state propagation, which only fires once training is in flight. This SIGSEGV fires at booster construction, before CUDAGradientDiscretizer is even touched.

Verified

Built and tested on Linux + CUDA 13.2 + RTX 5090 (sm_120). Required a local-only CMAKE_CUDA_ARCHITECTURES="120" patch to skip dropped Pascal arches; that's not part of this PR.

Same disclosure as #1: Claude Code helped trace this; the bug and fix are real. Let me know if any part wants more detail.

…onstruct

A Dataset constructed without device_type=cuda and then trained with
device_type=cuda silently leaves the Dataset's cuda_metadata_ unset and
SIGSEGVs inside CUDAObjectiveInterface::Init when it dereferences a
null metadata.cuda_metadata(). Repro: lgb.Dataset(X, y).construct()
followed by lgb.train({"device_type": "cuda", ...}, ds, ...).

- Add device_type to CheckDatasetResetConfig so changing it after
  construction triggers the standard "free handle and reconstruct"
  path used by other unchangeable params such as max_bin. Reuses the
  existing UpdateParamChecking flow on the Python side, which will
  reconstruct the Dataset if its raw data is still in memory or surface
  a clear LightGBMError otherwise.
- Add defensive null checks in CUDAObjectiveInterface::Init and
  CUDAMetricInterface::Init that emit a clear error if any other code
  path reaches them with a null cuda_metadata, instead of segfaulting.

Add test_cuda_dataset_device_type_unchangeable_after_construct that
asserts the new clean error.

Cross-references upstream PR lightgbm-org#7265 — same fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BelixRogner
Owner

Thanks Max for contributing this, and thanks to Claude Code for helping trace it. The c_api change is exactly the right place — it slots into the existing CheckDatasetResetConfig pattern verbatim, and the regression test pins the behavior cleanly. Merging now.

Minor optional follow-up: the two 5-line null checks in cuda_objective_function.hpp and cuda_metric.hpp are identical and could be a single helper. Small enough not to block on though.

@BelixRogner BelixRogner merged commit 87f1f7d into BelixRogner:numerai-cuda-fast May 9, 2026
