Problem Description
Users running parameter sweeps with EnsembleGPUKernel run into trajectory failures and heterogeneous computational requirements across different parameter combinations, which limits the practical usability of GPU acceleration for ensemble simulations.
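For reference, a minimal sketch of the kind of sweep being discussed, following the standard DiffEqGPU Lorenz example; the random parameter scaling stands in for the user's actual sweep:

```julia
# Minimal sketch of a parameter sweep with EnsembleGPUKernel (standard Lorenz
# example). The random parameter scaling is a placeholder for a real sweep.
using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays

function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

# Each trajectory gets a different parameter combination; some combinations
# may overflow or run much longer than others, which is the issue reported here.
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
            trajectories = 10_000, saveat = 1.0f0)
```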
Key Issues
1. Trajectory Failures
Some parameter combinations cause GPU solver failures, likely due to overflow errors rather than divide-by-zero issues. Currently, there's no graceful way to handle these failures within DiffEqGPU.
2. Heterogeneous Computation Times
Not all trajectories require the same execution time. Some parameter combinations complete quickly while others time out, so the entire batch either waits for the slowest trajectory or fails outright.
3. Performance Variability
GPU acceleration doesn't uniformly outperform CPU computation across all problem types, but there's no built-in mechanism to adaptively choose between GPU and CPU execution.
Current Workaround
In the Discourse thread (https://discourse.julialang.org/t/diffeqgpu-trajectory-failure-handling-and-heterogeneous-trajectories/129962/7), the reporter works around this with a bash-level wall-clock time limit on each GPU batch:
"If the time limit is exceeded, I switch to running the batch on the CPU"
This hybrid approach (sketched below) can outperform pure GPU or pure CPU execution by:
- Leveraging GPU speed for tractable parameter combinations
- Falling back to CPU for problematic trajectories
- Working particularly well when problematic trajectories are relatively rare
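A rough Julia-level sketch of this fallback, reusing the `monteprob` ensemble from the sketch above. The wall-clock limit itself is enforced outside Julia (e.g. by the bash wrapper), so only the error-triggered part of the fallback is shown; this is an illustration, not an existing DiffEqGPU feature:

```julia
# Sketch: try the whole batch on the GPU, re-run it on CPU threads if the GPU
# solve errors out. The wall-clock limit from the workaround above is enforced
# by the surrounding bash script and is not shown here.
using OrdinaryDiffEq, DiffEqGPU, CUDA

function solve_batch_with_cpu_fallback(monteprob; trajectories, saveat)
    try
        # Fast path: GPU kernel solve for the whole batch.
        return solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
                     trajectories, saveat)
    catch err
        @warn "GPU batch failed; re-running on CPU threads" exception = err
        # Robust path: same ensemble with a standard CPU solver.
        return solve(monteprob, Tsit5(), EnsembleThreads();
                     trajectories, saveat)
    end
end

sol = solve_batch_with_cpu_fallback(monteprob; trajectories = 10_000, saveat = 1.0f0)
```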
Potential Solutions
- Built-in timeout handling: Allow individual trajectories to timeout and fallback to CPU execution automatically
- Trajectory-level error handling: Provide options to skip, retry on CPU, or otherwise handle failed trajectories without failing the entire ensemble (see the sketch after this list)
- Adaptive execution: Automatically route trajectories to GPU or CPU based on problem characteristics or runtime heuristics
- Better error reporting: Distinguish between different failure modes (overflow, convergence issues, etc.) to help users diagnose problems
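As a rough illustration of the trajectory-level error handling and adaptive execution ideas above (none of which exists in DiffEqGPU today): solve the batch on the GPU, then re-solve only the apparently failed trajectories on the CPU. It assumes the StaticArrays-based `prob` from the first sketch, and that a failed trajectory can be detected after the fact, e.g. by a non-success return code or non-finite values in the final state; that heuristic will not catch every failure mode.

```julia
# Hypothetical per-trajectory retry on the CPU; not part of DiffEqGPU.
using OrdinaryDiffEq, DiffEqGPU, CUDA, SciMLBase

# Heuristic failure check: non-success return code or overflow to Inf/NaN.
failed(traj) = !SciMLBase.successful_retcode(traj) || any(!isfinite, traj.u[end])

function solve_with_trajectory_retry(prob, param_sets; saveat)
    prob_func = (prob_i, i, repeat) -> remake(prob_i, p = param_sets[i])
    monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
    gpu_sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
                    trajectories = length(param_sets), saveat)
    # Re-solve only the problematic parameter combinations on the CPU, where
    # adaptive/stiff solvers and clearer error reporting are available.
    retried = Dict{Int, Any}()
    for (i, traj) in enumerate(gpu_sol.u)
        if failed(traj)
            retried[i] = solve(remake(prob, p = param_sets[i]), Tsit5(); saveat)
        end
    end
    return gpu_sol, retried
end
```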
Related Discussion
Full context: https://discourse.julialang.org/t/diffeqgpu-trajectory-failure-handling-and-heterogeneous-trajectories/129962/7
This affects users running large parameter sweeps where robustness and mixed GPU/CPU execution could significantly improve performance.