Monitoring improvements #425
Conversation
Commits:
- … time measurement
- …tivations and weights
- Einsum transformer
```python
def __exit__(self, exc_type, exc_value, traceback):
    if self._curr_step is None:
        raise RuntimeError("SteppableMemoryProfilerContext exited without being entered")
    if self._curr_step < self._num_wait_steps + self._num_warmup_steps + self._num_active_steps:
        # if we exit before finishing all steps, raise so the caller notices
        raise RuntimeError("SteppableMemoryProfilerContext exited before finishing all steps")
    return
```
Should we log the error if this exit is reached by an exception?
It's not suppressing the error, and it's still propagated up the call stack. I would leave error handling to the caller.
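As a side note, that matches Python's context-manager contract: an in-flight exception is only swallowed if `__exit__` returns a truthy value. A minimal standalone illustration (class name hypothetical):

```python
# Hypothetical demo: __exit__ returning None (falsy) does NOT suppress an
# exception raised inside the with-block; it propagates after cleanup runs.
class Demo:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print("cleanup runs, exception type:", exc_type)
        return None  # falsy -> exception is re-raised automatically


try:
    with Demo():
        raise ValueError("boom")
except ValueError:
    print("caller still sees the error")
```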
```python
with open(self._memory_snapshot_path, "wb") as output:
    pickle.dump(torch.cuda.memory._snapshot(), output)
```
Suggested change:
```diff
- with open(self._memory_snapshot_path, "wb") as output:
-     pickle.dump(torch.cuda.memory._snapshot(), output)
+ torch.cuda.memory._dump_snapshot(self._memory_snapshot_path)
```
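For context, a sketch of how this snapshot tooling is typically used end to end (the path and workload are placeholders; the `_`-prefixed functions are PyTorch-internal APIs and may change between releases):

```python
import torch

# Assumes a CUDA-enabled PyTorch build.
torch.cuda.memory._record_memory_history()  # start recording allocation events

x = torch.randn(1024, 1024, device="cuda")  # placeholder workload
y = x @ x

# Writes a pickled snapshot that can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```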
```python
self.dp_degree = get_parallel_degree(
    device_mesh, [ParallelismDegrees.DP_REPLICATE, ParallelismDegrees.DP_SHARD]
)
```
Could we encode the information that these two form the data parallel degree somewhere more globally (e.g. in a `get_data_parallel_degree(device_mesh)` function at an appropriate place, or use `components.settings.step_profile.dp_degree`)?
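A minimal sketch of what such a helper could look like, reusing `get_parallel_degree` and `ParallelismDegrees` exactly as in the diff above (the function's home module and signature are assumptions):

```python
# Hypothetical helper; assumes get_parallel_degree and ParallelismDegrees are
# importable from wherever they are defined in this repo.
def get_data_parallel_degree(device_mesh) -> int:
    """Combined data-parallel degree across the replicate and shard dimensions."""
    return get_parallel_degree(
        device_mesh, [ParallelismDegrees.DP_REPLICATE, ParallelismDegrees.DP_SHARD]
    )
```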
```python
return new_path
return original_path
```
Why?
```python
@overload
def forward(self, inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """
    Forward pass of the GPT2LLM module.

    Args:
        inputs (dict[str, torch.Tensor]): A dictionary containing input tensors.
            - sample_key (str): Key for the input tensor containing token ids.

    Returns:
        dict[str, torch.Tensor]: A dictionary containing output tensors.
            - prediction_key (str): Key for the output tensor containing logits.
    """
    ...

@overload
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
    """
    Forward pass of the module.

    Args:
        inputs (torch.Tensor): A tensor containing input token ids.

    Returns:
        torch.Tensor: A tensor containing output logits.
    """
    ...

def forward(self, inputs: dict[str, torch.Tensor] | torch.Tensor) -> dict[str, torch.Tensor] | torch.Tensor:
    """
    Forward pass of the module.

    Args:
        inputs (dict[str, torch.Tensor] | torch.Tensor): Input data.

    Returns:
        dict[str, torch.Tensor] | torch.Tensor: Model output.
    """
    if isinstance(inputs, dict):
        return {self.prediction_key: self.forward_impl(inputs[self.sample_key])}
    else:
        return self.forward_impl(inputs)
```
Maybe we should consider moving this code into NNModel so that you have the option to derive from that class and do not need to copy this stuff when adding a new model.
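A rough sketch of that idea, hoisting the dict/tensor dispatch into the base class so subclasses only implement `forward_impl` (NNModel's actual interface may differ; `sample_key` and `prediction_key` are taken from the diff above):

```python
import torch
from torch import nn


class NNModel(nn.Module):
    """Hypothetical base class owning the dict/tensor dispatch."""

    def __init__(self, sample_key: str, prediction_key: str) -> None:
        super().__init__()
        self.sample_key = sample_key
        self.prediction_key = prediction_key

    def forward(self, inputs: dict[str, torch.Tensor] | torch.Tensor) -> dict[str, torch.Tensor] | torch.Tensor:
        # Dict inputs are unpacked via sample_key and re-wrapped under
        # prediction_key; raw tensors pass straight through.
        if isinstance(inputs, dict):
            return {self.prediction_key: self.forward_impl(inputs[self.sample_key])}
        return self.forward_impl(inputs)

    def forward_impl(self, inputs: torch.Tensor) -> torch.Tensor:
        # Subclasses implement the actual computation on raw token ids.
        raise NotImplementedError
```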
**What’s inside**
- `train.py`: registers the custom model and launches the run.
- `einsum_transformer_config.yaml`: training + model config.
- `run.sh`: example `torchrun` command for 8 GPUs.
Suggested change:
```diff
- `run.sh`: example `torchrun` command for 8 GPUs.
+ `run.sh`: example `torchrun` command for 4 GPUs.
```
What does this PR do?
This PR adds multiple changes at the same time.
General Changes
Breaking Changes
Checklist before submitting final PR
- Tests run through (`python tests/tests.py`)
- Changelog updated (`CHANGELOG_DEV.md`)