Tune single-GPU embed resource heuristic #2089
Greptile Summary
This PR tunes the default Ray resource plan for single-GPU batch ingest: on a system with exactly one GPU, the embedder is now allocated 1 actor at 0.2 GPU rather than the prior scaled default (1 actor at 0.5 GPU), leaving more headroom for OCR and page-element actors. Multi-GPU paths and all explicit overrides are unchanged.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py | Adds two single-GPU constants and a post-resolution override block in resolve_requested_plan that applies them when available_gpu_count == 1 and no explicit override is set; multi-GPU paths are untouched. |
| nemo_retriever/tests/test_resource_heuristics.py | Updates the 1-GPU default assertion to match the new constants, adds embed_min_actors coverage that was previously absent, and adds a new test verifying that explicit overrides win over the single-GPU heuristic. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[resolve_requested_plan called] --> B[Compute available_gpu_count]
B --> C{available_gpu_count == 0\nand not allow_no_gpu?}
C -- Yes --> D[Raise ValueError]
C -- No --> E[_resolve_int/float_actors\nfor all actors using\nmulti-GPU defaults]
E --> F{available_gpu_count == 1?}
F -- No --> G[Keep multi-GPU resolved values]
F -- Yes --> H{override_embed_initial_actors\nis None?}
H -- Yes --> I[embed_initial_actors =\nEMBED_SINGLE_GPU_ACTORS=1]
H -- No --> J[Keep override value]
I --> K{override_embed_min_actors\nis None?}
J --> K
K -- Yes --> L[embed_min_actors =\nEMBED_SINGLE_GPU_ACTORS=1]
K -- No --> M[Keep override value]
L --> N{override_embed_max_actors\nis None?}
M --> N
N -- Yes --> O[embed_max_actors =\nEMBED_SINGLE_GPU_ACTORS=1]
N -- No --> P[Keep override value]
O --> Q{override_embed_gpus_per_actor\nis None?}
P --> Q
Q -- Yes --> R[embed_gpus_per_actor =\nEMBED_SINGLE_GPU_GPUS_PER_ACTOR=0.2]
Q -- No --> S[Keep override value]
R --> T[Continue resolving\nall other actor types]
S --> T
G --> T
T --> U[Return RequestedPlan]
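The flow above can be sketched in Python as follows. This is a minimal illustration, not the code in `ray_resource_hueristics.py`: `EmbedPlan`, `resolve_embed_plan`, and the `GENERAL_*` defaults are hypothetical stand-ins, and only the two `EMBED_SINGLE_GPU_*` constants and the `is None` checks come from the flowchart.

```python
from dataclasses import dataclass
from typing import Optional

# Constants named in the flowchart.
EMBED_SINGLE_GPU_ACTORS = 1
EMBED_SINGLE_GPU_GPUS_PER_ACTOR = 0.2

# Placeholder general defaults, assumed for illustration only.
GENERAL_EMBED_ACTORS_PER_GPU = 1
GENERAL_EMBED_GPUS_PER_ACTOR = 0.5


@dataclass(frozen=True)
class EmbedPlan:
    """Hypothetical stand-in for the embed fields of RequestedPlan."""
    initial_actors: int
    min_actors: int
    max_actors: int
    gpus_per_actor: float


def resolve_embed_plan(
    available_gpu_count: int,
    override_initial: Optional[int] = None,
    override_min: Optional[int] = None,
    override_max: Optional[int] = None,
    override_gpus_per_actor: Optional[float] = None,
    allow_no_gpu: bool = False,
) -> EmbedPlan:
    if available_gpu_count == 0 and not allow_no_gpu:
        raise ValueError("no GPUs available and allow_no_gpu is False")

    # Step 1: general resolution, override first, placeholder default otherwise.
    general_actors = max(1, available_gpu_count * GENERAL_EMBED_ACTORS_PER_GPU)
    initial = general_actors if override_initial is None else override_initial
    minimum = 1 if override_min is None else override_min
    maximum = general_actors if override_max is None else override_max
    gpus = (
        GENERAL_EMBED_GPUS_PER_ACTOR
        if override_gpus_per_actor is None
        else override_gpus_per_actor
    )

    # Step 2: single-GPU post-resolution override, one field at a time.
    if available_gpu_count == 1:
        if override_initial is None:
            initial = EMBED_SINGLE_GPU_ACTORS
        if override_min is None:
            minimum = EMBED_SINGLE_GPU_ACTORS
        if override_max is None:
            maximum = EMBED_SINGLE_GPU_ACTORS
        if override_gpus_per_actor is None:
            gpus = EMBED_SINGLE_GPU_GPUS_PER_ACTOR

    return EmbedPlan(initial, minimum, maximum, gpus)
```

Because each field is checked independently, a caller who overrides only `override_max` still gets the single-GPU values for the other three fields.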
if override_embed_initial_actors is None:
    embed_initial_actors = EMBED_SINGLE_GPU_ACTORS
if override_embed_min_actors is None:
    embed_min_actors = EMBED_SINGLE_GPU_ACTORS
if override_embed_max_actors is None:
    embed_max_actors = EMBED_SINGLE_GPU_ACTORS
if override_embed_gpus_per_actor is None:
    embed_gpus_per_actor = EMBED_SINGLE_GPU_GPUS_PER_ACTOR
Override sentinel mismatch between heuristic block and resolvers
The single-GPU block guards on override is None, but the _resolve_int_actors / _resolve_float_actors helpers treat any value <= 0 the same as None (i.e., they fall through to the default). If a caller passes override_embed_gpus_per_actor=0.0, _resolve_float_actors ignores it and returns 0.5, but the single-GPU block also skips the heuristic (since 0.0 is not None), so the final value is 0.5 — neither the caller's intended value nor the single-GPU default of 0.2. A guard of override is None or override <= 0 (matching the resolver semantics) would make the two layers consistent.
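The mismatch described in this comment can be reproduced with a small sketch. It assumes the resolver behaves as the comment states (`None` and values `<= 0` fall back to a default of `0.5`). `resolve_float`, `is_unset`, and `embed_gpus_per_actor` are hypothetical names for illustration and do not reproduce the signature of `_resolve_float_actors`.

```python
from typing import Optional

DEFAULT_EMBED_GPUS_PER_ACTOR = 0.5  # fallback value quoted in the review comment
EMBED_SINGLE_GPU_GPUS_PER_ACTOR = 0.2


def resolve_float(override: Optional[float], default: float) -> float:
    """Stand-in resolver: None and values <= 0 both fall back to the default."""
    if override is None or override <= 0:
        return default
    return override


def is_unset(override: Optional[float]) -> bool:
    """Proposed guard: the same sentinel rule the resolver uses."""
    return override is None or override <= 0


def embed_gpus_per_actor(
    override: Optional[float],
    available_gpu_count: int,
    consistent_guard: bool,
) -> float:
    value = resolve_float(override, DEFAULT_EMBED_GPUS_PER_ACTOR)
    if available_gpu_count == 1:
        # Either the guard as written in the PR, or the guard the comment suggests.
        unset = is_unset(override) if consistent_guard else override is None
        if unset:
            value = EMBED_SINGLE_GPU_GPUS_PER_ACTOR
    return value


# Guard as written: 0.0 is rejected by the resolver but also skips the heuristic.
print(embed_gpus_per_actor(0.0, 1, consistent_guard=False))  # 0.5
# Guard matching the resolver: 0.0 is treated as unset, so the heuristic applies.
print(embed_gpus_per_actor(0.0, 1, consistent_guard=True))   # 0.2
```

With the consistent guard, a positive explicit override such as `0.3` still passes through unchanged on a single GPU.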
…s.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
NOTE: general perf bump, not too worried about this functionality for 26.05 release. PR against main
Description
--embed-workers and --embed-gpus-per-actor overrides.
Motivation
On a single H100 NVL, the previous default local embedding plan used two embedding actors at 0.5 GPU each. In retriever ingest --run-mode batch, that shape can over-reserve the single GPU for embedding and reduce scheduler room for the rest of the local pipeline.
Empirically, one local vLLM embedding actor with a 0.2 GPU reservation kept the GPU busier and improved end-to-end ingest time while producing the same LanceDB output row count.
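The reservation arithmetic behind this motivation can be written out as a short sketch. It uses the figures quoted in this description (two actors at 0.5 GPU before, one actor at 0.2 GPU after). `reserved_gpu` and `headroom` are hypothetical helpers, and the calculation assumes embedding is the only stage counted against the single GPU.

```python
def reserved_gpu(actors: int, gpus_per_actor: float) -> float:
    """Total logical GPU reserved by a pool of identical actors."""
    return actors * gpus_per_actor


def headroom(total_gpus: float, reserved: float) -> float:
    """Logical GPU left for the scheduler to hand to other stages."""
    return total_gpus - reserved


previous = reserved_gpu(actors=2, gpus_per_actor=0.5)  # whole GPU reserved
tuned = reserved_gpu(actors=1, gpus_per_actor=0.2)     # one fifth reserved

print(f"previous: reserved={previous:.1f}, headroom={headroom(1.0, previous):.1f}")
print(f"tuned:    reserved={tuned:.1f}, headroom={headroom(1.0, tuned):.1f}")
```

Under those figures the previous plan leaves no logical GPU for other GPU-backed actors, while the tuned plan leaves 0.8.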
Measured on the same machine:
For bo767, the LanceDB row count remained 79978, with dropped_no_embedding=0 and the same dropped_bad_length=64 pattern as the prior run. Average GPU utilization increased from ~38.8% to ~65.6%.
Implementation
This keeps the existing heuristic model intact:
resolve_requested_plan();
available_gpu_count == 1;
Validation
uv run --frozen --extra local --extra dev pytest tests/test_resource_heuristics.py
Checklist