feat(crash): lazy Zarr loading for multi-GPU DDP training #1703
Thabhelo wants to merge 3 commits
Defer Zarr mesh and point-data materialization until sample access so each DDP rank no longer loads the full dataset at construction time. Document DistributedSampler memory behavior and enable lazy_load by default.

closes NVIDIA#1550

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
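The deferral described in this commit message can be illustrated with a minimal sketch. It is not the PR's reader: `LazySampleStore`, `fake_loader`, and the per-index cache are hypothetical names and choices made for illustration, and plain Python lists stand in for Zarr arrays.

```python
from typing import Callable, Dict, List


class LazySampleStore:
    """Hypothetical stand-in for a lazily loading reader (not the PR's class)."""

    def __init__(
        self,
        num_samples: int,
        loader: Callable[[int], List[float]],
        lazy_load: bool = True,
    ) -> None:
        self.num_samples = num_samples
        self._loader = loader
        self._cache: Dict[int, List[float]] = {}
        if not lazy_load:
            # Eager mode: materialize every sample at construction time.
            for i in range(num_samples):
                self._cache[i] = loader(i)

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int) -> List[float]:
        # Lazy mode: load on first access, then reuse the cached copy.
        # (Caching is an assumption of this sketch.)
        if idx not in self._cache:
            self._cache[idx] = self._loader(idx)
        return self._cache[idx]


load_calls: List[int] = []


def fake_loader(idx: int) -> List[float]:
    """Pretend to read one sample from disk and record that it happened."""
    load_calls.append(idx)
    return [float(idx)] * 3


store = LazySampleStore(num_samples=4, loader=fake_loader, lazy_load=True)
calls_after_init = len(load_calls)    # no sample has been read yet
sample = store[2]                     # first access triggers one read
sample_again = store[2]               # second access is served from the cache
calls_after_access = len(load_calls)
```

In this sketch a rank that only ever indexes its own shard pays for those samples alone, whereas `lazy_load=False` reads all of them in `__init__`.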
Follow-up: interrogate hook false positives (separate PR)

What this PR is for: lazy Zarr loading and DDP memory documentation for #1550.

While getting this through pre-commit, the interrogate check failed even though the flagged symbols ( […]

Root cause: Because that header is not matched, […]

Temporary workaround in this PR: added small docstrings on those symbols so they are no longer reported as […]

Actual fix: #1704 broadens the header parser so baseline matching works correctly. This means no docstring workarounds are needed going forward, even for the next person who touches an example file with baseline gaps.
Follow-up parser fix opened in #1704.
Greptile Summary

This PR adds lazy Zarr loading ( […]
Important Files Changed
Use uniform m/num_samples accumulation for lazy feature stats, compute lazy edge stats from normalized positions to match eager graph construction, and skip re-normalization when loading eager samples.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
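A rough sketch of what uniform `m / num_samples` accumulation looks like when samples are loaded one at a time. The helper names (`column_means`, `streaming_feature_stats`, `load_sample`) are hypothetical, and deriving the standard deviation as sqrt(E[x²] − E[x]²) is an assumption of this sketch rather than something the commit message states.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Rows = Sequence[Sequence[float]]  # one sample: N rows of D features


def column_means(rows: Rows) -> List[float]:
    """Mean of each feature column over the rows of one sample."""
    n, d = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(d)]


def streaming_feature_stats(
    load_sample: Callable[[int], Rows], num_samples: int
) -> Tuple[List[float], List[float]]:
    """Accumulate mean and mean-of-squares with equal weight per sample."""
    mean: Optional[List[float]] = None
    meansqr: Optional[List[float]] = None
    for i in range(num_samples):
        rows = load_sample(i)  # only one sample is held at a time
        m = column_means(rows)
        msq = column_means([[v * v for v in r] for r in rows])
        if mean is None or meansqr is None:
            # Start from zeros so sample 0 is weighted like every other sample.
            mean = [0.0] * len(m)
            meansqr = [0.0] * len(m)
        for j in range(len(m)):
            mean[j] += m[j] / num_samples
            meansqr[j] += msq[j] / num_samples
    assert mean is not None and meansqr is not None, "need at least one sample"
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(q - mu * mu, 0.0)) for mu, q in zip(mean, meansqr)]
    return mean, std


toy = {
    0: [[1.0, 10.0], [3.0, 10.0]],
    1: [[5.0, 10.0], [7.0, 10.0]],
}
mean, std = streaming_feature_stats(lambda i: toy[i], num_samples=2)
```

Note that averaging per-sample means equals the global mean over all rows only when every sample has the same number of rows, as in the toy data here.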
Reviewed PR comments: triage:
Thanks @greptile-apps for the catch.
Thanks for the detailed triage and for addressing all three issues. The lazy stat accumulation, normalized-position edge stats, and eager double-normalize fixes in […]

One remaining issue worth flagging: the eager […]

```python
# datapipe.py – _compute_edge_stats (eager)
edge_mean = m if edge_mean is None else edge_mean + m / self.num_samples
#           ↑ full weight for i=0; all others get 1/N weight
```

This should use the same […]

```python
def _compute_edge_stats(self):
    edge_mean = None
    edge_meansqr = None
    edge_dim = None
    for i in range(self.num_samples):
        x_e = self.graphs[i].edge_attr.to(torch.float32)  # [E,De]
        if edge_dim is None:
            edge_dim = x_e.shape[1]
            edge_mean = torch.zeros(edge_dim, dtype=torch.float32)
            edge_meansqr = torch.zeros(edge_dim, dtype=torch.float32)
        m = torch.mean(x_e, dim=0)
        msq = torch.mean(x_e * x_e, dim=0)
        edge_mean += m / self.num_samples
        edge_meansqr += msq / self.num_samples
```

The same sentinel pattern also appears in […]
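To make the size of the weighting error concrete, here is a small scalar sketch comparing the sentinel pattern with the zeros-initialized pattern. The function names and the input values are made up for illustration; the numbers stand in for per-graph means, not real edge features.

```python
from typing import List, Optional


def sentinel_mean(per_sample_means: List[float]) -> float:
    """Buggy pattern: the first value is taken whole, the rest get 1/N."""
    n = len(per_sample_means)
    acc: Optional[float] = None
    for m in per_sample_means:
        acc = m if acc is None else acc + m / n
    assert acc is not None, "need at least one sample"
    return acc


def uniform_mean(per_sample_means: List[float]) -> float:
    """Fixed pattern: start from zero so every value gets weight 1/N."""
    n = len(per_sample_means)
    acc = 0.0
    for m in per_sample_means:
        acc += m / n
    return acc


means = [4.0, 8.0, 12.0, 16.0]
biased = sentinel_mean(means)    # 4 + 8/4 + 12/4 + 16/4
unbiased = uniform_mean(means)   # 4/4 + 8/4 + 12/4 + 16/4
excess = biased - unbiased       # first value times (1 - 1/N)
```

The excess is the first sample's mean times (1 − 1/N), so it does not shrink as more samples are added.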
Apply the same zeros + m/num_samples pattern to _compute_edge_stats so the first graph no longer receives full weight in the running mean.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
Good catch, fixed in 349312a.

This is a pre-existing bug (not introduced by lazy loading), but you are right @greptile-apps that it's still in the file we're editing.
Looking at the current PR branch, […]

```python
def _compute_edge_stats(self):
    edge_mean = None
    edge_meansqr = None
    edge_dim = None
    for i in range(self.num_samples):
        x_e = self.graphs[i].edge_attr.to(torch.float32)
        if edge_dim is None:
            edge_dim = x_e.shape[1]
            edge_mean = torch.zeros(edge_dim, dtype=torch.float32)  # ✓ zeros, not m
            edge_meansqr = torch.zeros(edge_dim, dtype=torch.float32)
        m = torch.mean(x_e, dim=0)
        msq = torch.mean(x_e * x_e, dim=0)
        edge_mean += m / self.num_samples  # ✓ uniform weight
        edge_meansqr += msq / self.num_samples
```

The […]
Summary
- […] lazy_load: true by default) so mesh trajectories and point features materialize on first sample access instead of at dataset construction.
- […] DistributedSampler shards indices, not host RAM) in the crash README and log a hint in train.py when multi-GPU + lazy Zarr are active.

closes #1550
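The point that DistributedSampler shards indices rather than host RAM can be seen in a pure-Python imitation of its index split. This is a sketch, not the PyTorch class: `shard_indices` is a hypothetical helper that mirrors the non-shuffled, `drop_last=False` behavior (pad, then stride by rank) and assumes a non-empty dataset.

```python
import math
from typing import List


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Imitate the non-shuffled, drop_last=False split of dataset indices."""
    num_per_rank = math.ceil(dataset_len / num_replicas)
    total_size = num_per_rank * num_replicas
    indices = list(range(dataset_len))
    # Pad by repeating leading indices so every rank gets the same count.
    padding = total_size - len(indices)
    repeats = math.ceil(padding / len(indices))
    indices += (indices * repeats)[:padding]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


# 10 samples over 4 ranks: each rank is handed 3 indices.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
```

The split only decides which indices a rank requests. A dataset that materializes everything in its constructor still holds all samples on every rank, which is why deferring the load to sample access matters for multi-GPU runs.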
Test plan
- pytest examples/structural_mechanics/crash/tests/test_zarr_reader.py examples/structural_mechanics/crash/tests/test_datapipe_lazy.py
- […] Reader(lazy_load=False) (all samples materialized at init)

Notes
- […] check_docstring_coverage.py (details in follow-up comment).