Problem Description
Users running parameter sweeps with EnsembleGPUKernel run into trajectory failures and heterogeneous computational requirements across different parameter combinations, which limits the practical usability of GPU acceleration for ensemble simulations.
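For reference, a minimal sketch of the kind of sweep being discussed, following the standard DiffEqGPU Lorenz example; the random parameter scaling stands in for the user's actual sweep:

```julia
# Minimal sketch of a parameter sweep with EnsembleGPUKernel (standard Lorenz
# example). The random parameter scaling is a placeholder for a real sweep.
using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays

function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

# Each trajectory gets a different parameter combination; some combinations
# may overflow or run much longer than others, which is the issue reported here.
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
            trajectories = 10_000, saveat = 1.0f0)
```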
Key Issues
1. Trajectory Failures
Some parameter combinations cause GPU solver failures, likely due to overflow errors rather than divide-by-zero issues. Currently, there's no graceful way to handle these failures within DiffEqGPU.
2. Heterogeneous Computation Times
Not all trajectories require the same execution time. Some parameter combinations complete quickly while others time out, so the entire batch either waits for the slowest trajectory or fails outright.
3. Performance Variability
GPU acceleration doesn't uniformly outperform CPU computation across all problem types, but there's no built-in mechanism to adaptively choose between GPU and CPU execution.
Current Workaround
In the Discourse thread (https://discourse.julialang.org/t/diffeqgpu-trajectory-failure-handling-and-heterogeneous-trajectories/129962/7), the reporter works around this with a bash-level wall-clock time limit on each GPU batch:
"If the time limit is exceeded, I switch to running the batch on the CPU"
This hybrid approach (sketched below) can outperform pure GPU or pure CPU execution by:
- Leveraging GPU speed for tractable parameter combinations
- Falling back to CPU for problematic trajectories
- Working particularly well when problematic trajectories are relatively rare
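A rough Julia-level sketch of this fallback, reusing the `monteprob` ensemble from the sketch above. The wall-clock limit itself is enforced outside Julia (e.g. by the bash wrapper), so only the error-triggered part of the fallback is shown; this is an illustration, not an existing DiffEqGPU feature:

```julia
# Sketch: try the whole batch on the GPU, re-run it on CPU threads if the GPU
# solve errors out. The wall-clock limit from the workaround above is enforced
# by the surrounding bash script and is not shown here.
using OrdinaryDiffEq, DiffEqGPU, CUDA

function solve_batch_with_cpu_fallback(monteprob; trajectories, saveat)
    try
        # Fast path: GPU kernel solve for the whole batch.
        return solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
                     trajectories, saveat)
    catch err
        @warn "GPU batch failed; re-running on CPU threads" exception = err
        # Robust path: same ensemble with a standard CPU solver.
        return solve(monteprob, Tsit5(), EnsembleThreads();
                     trajectories, saveat)
    end
end

sol = solve_batch_with_cpu_fallback(monteprob; trajectories = 10_000, saveat = 1.0f0)
```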
Potential Solutions
- Built-in timeout handling: Allow individual trajectories to timeout and fallback to CPU execution automatically
- Trajectory-level error handling: Provide options to skip, retry on CPU, or otherwise handle failed trajectories without failing the entire ensemble (see the sketch after this list)
- Adaptive execution: Automatically route trajectories to GPU or CPU based on problem characteristics or runtime heuristics
- Better error reporting: Distinguish between different failure modes (overflow, convergence issues, etc.) to help users diagnose problems
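As a rough illustration of the trajectory-level error handling and adaptive execution ideas above (none of which exists in DiffEqGPU today): solve the batch on the GPU, then re-solve only the apparently failed trajectories on the CPU. It assumes the StaticArrays-based `prob` from the first sketch, and that a failed trajectory can be detected after the fact, e.g. by a non-success return code or non-finite values in the final state; that heuristic will not catch every failure mode.

```julia
# Hypothetical per-trajectory retry on the CPU; not part of DiffEqGPU.
using OrdinaryDiffEq, DiffEqGPU, CUDA, SciMLBase

# Heuristic failure check: non-success return code or overflow to Inf/NaN.
failed(traj) = !SciMLBase.successful_retcode(traj) || any(!isfinite, traj.u[end])

function solve_with_trajectory_retry(prob, param_sets; saveat)
    prob_func = (prob_i, i, repeat) -> remake(prob_i, p = param_sets[i])
    monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
    gpu_sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
                    trajectories = length(param_sets), saveat)
    # Re-solve only the problematic parameter combinations on the CPU, where
    # adaptive/stiff solvers and clearer error reporting are available.
    retried = Dict{Int, Any}()
    for (i, traj) in enumerate(gpu_sol.u)
        if failed(traj)
            retried[i] = solve(remake(prob, p = param_sets[i]), Tsit5(); saveat)
        end
    end
    return gpu_sol, retried
end
```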
Related Discussion
Full context: https://discourse.julialang.org/t/diffeqgpu-trajectory-failure-handling-and-heterogeneous-trajectories/129962/7
This affects users running large parameter sweeps where robustness and mixed GPU/CPU execution could significantly improve performance.