Deterministic global sampling of corpus ids #66
Conversation
Naming scheme now consistently uses my_ or get_my_ prefix for local.
There are 4 places where almost the same loop over distributed components is performed, with subtle differences:
1) in train_single.py, when broadcasting initialized parameters
2) in trainer.py, when communicating the gradients
3) in utils/optimizers.py, when stepping the optimizer
4) in utils/module_splitter.py, when saving a checkpoint
DRY could be greatly improved by refactoring these.
Tests are passing, but the modification is still only partway done.
In order to simplify, CPU and single-GPU now also use multiprocessing, placing the dataloader in a separate process.

Several places have been refactored to use the new distributed components:
- Init broadcast in train_single.py
- Gradient communication in trainer.py (still uses only_ready_reduce_and_rescale_grads, though)
- Sub-optimizer construction in utils/optimizers.py

Two places have been identified as potential future candidates:
- model_builder
- module splitter for checkpointing

The task distribution is now logged in the TQM of the main rank. It no longer has to be done in the dataloader, as each TQM has a global view.
This guarantees that the producer and consumer don't share state
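For illustration, a minimal sketch of that producer/consumer split, assuming a bounded multiprocessing queue; the function and variable names here are placeholders, not the actual mammoth API:

```python
# Illustrative sketch only: a dataloader process feeding batches to the trainer
# through a bounded queue, so the two sides share no mutable Python state.
import torch.multiprocessing as mp


def produce_batches(batch_queue, dataset):
    """Runs in the dataloader process; the trainer never touches `dataset`."""
    for batch in dataset:
        batch_queue.put(batch)
    batch_queue.put(None)  # sentinel: no more data


def consume_batches(batch_queue):
    """Runs in the trainer process; yields batches as they arrive."""
    while True:
        batch = batch_queue.get()
        if batch is None:
            return
        yield batch


if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    queue = ctx.Queue(maxsize=40)       # bounded queue applies backpressure to the producer
    dataset = list(range(100))          # stand-in for the real corpus iterator
    producer = ctx.Process(target=produce_batches, args=(queue, dataset))
    producer.start()
    for batch in consume_batches(queue):
        pass                            # training step would go here
    producer.join()
```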
Reads in data until there is a few minibatches' worth of it. Guaranteed not to exceed the maximum minibatch size, to avoid VRAM OOM. Sorts examples locally according to length (max of source and target), which allows some minibatches to use less padding.
Avoids the need to communicate ready_t
Closes #8
LGTM, but I'm not sure about dropping the sentence-level batching + max len padding trick, which thus far is the one with the best tok/s, right?
To what extent could this be added back in, given the current implementation?
@@ -35,81 +36,48 @@ def broadcast_tensors(tensors, src=0, group=None):
         torch.distributed.broadcast(t, src, group=group)


-def only_ready_reduce_and_rescale_grads(named_parameters, group=None):
+def managed_reduce_and_rescale_grads(
"managed" ? I'm not sure about the clarity of the terminology here, could you elaborate on that?
We should rename this, if we can come up with a more descriptive name.
"Reduce and rescale grads or dummy grads, with the caller deciding which one to do."
Maybe externally_managed_reduce_and_rescale_grads would do the trick?
The idea is that only_ready uses the metadata provided by the has_grad hook to make its own decisions about how to communicate certain parameters, but the new implementation needs to be managed by something else, with the caller passing in the has_local_gradient and gradient_norm.
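For concreteness, a minimal sketch of what that externally managed contract could look like; this is an assumption about the shape of the function, not the actual mammoth implementation:

```python
# Hypothetical sketch of the contract discussed above, not the real implementation:
# the caller decides whether this rank contributes real gradients, and supplies
# the normalization factor instead of having it communicated via ready_t.
import torch
import torch.distributed


def externally_managed_reduce_and_rescale_grads(
    named_parameters, has_local_gradient, gradient_norm, group=None
):
    for _name, param in named_parameters:
        if not has_local_gradient or param.grad is None:
            # Participate in the collective with zeros so other ranks don't block.
            param.grad = torch.zeros_like(param)
        torch.distributed.all_reduce(param.grad, group=group)
        param.grad.div_(max(gradient_norm, 1))
```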
help="Maximum number of bins for batching.", | ||
default=4, | ||
help="The number of minibatches that will be yielded once bucketing is complete. " | ||
"Recommended value: same as accum_count, or at least a multiple of it." |
bucket is a terrible name for that?
Bucket is indeed a terrible name. I reused the existing parameter without renaming it, while totally changing its meaning. Classic mammoth.
The plan was to either clean this up or throw this away based on the benchmark results, but unfortunately they were somewhat inconclusive.
choices=["sents", "tokens"], | ||
help="Batch grouping for batch_size. Standard is sents. Tokens will do dynamic batching", | ||
help="Batch grouping for batch_size. Standard is tokens (max of src and tgt). Sents is unimplemented.", |
so no support for max len padding, which is the thing that seems to work the least bad for now?
@@ -83,55 +54,63 @@ def build_torch_optimizer(model, opts, task_queue_manager):
     Returns:
         A ``torch.optim.Optimizer`` instance.
feels like we're missing a class constructor signature inspection + kwargs popping approach
(Apparently there are issues with logging the TQM on this branch. Has this been smoke-tested / tested? If so, I'll dismiss this second review.)
[2024-05-06 11:46:02,478 1083 INFO] world_size = 4, queue_size = 40
[2024-05-06 11:46:02,481 1083 INFO] in task_queue_manager: node_rank 0 local_rank 0
Traceback (most recent call last):
File "/home/nloppi/Tiedemann_project/mammoth/mammoth/train.py", line 6, in <module>
main()
File "/home/nloppi/Tiedemann_project/mammoth/mammoth/mammoth/bin/train.py", line 304, in main
train(opts)
File "/home/nloppi/Tiedemann_project/mammoth/mammoth/mammoth/bin/train.py", line 245, in train
logger.info(f'TaskQueueManager: {global_task_queue_manager}')
File "/home/nloppi/Tiedemann_project/mammoth/mammoth/mammoth/distributed/tasks.py", line 310, in __repr__
kwargs = ',\n '.join(
File "/home/nloppi/Tiedemann_project/mammoth/mammoth/mammoth/distributed/tasks.py", line 311, in <genexpr>
f'{key}={pformat(self.__getattribute__(key))}'
AttributeError: 'TaskQueueManager' object has no attribute 'node_rank'
The sentence-level batching + max len padding trick is implemented as a special case of the spiral bucketing data loader. If we want both dynamic batching and the sentence-level batching + max len padding trick, then the easiest way is to keep all 3 implementations. However, it would be quite confusing to have two different ways of doing dynamic batching, of which one is known to be flaky. So maybe the dynamic spiral bucketing would be there but inaccessible through the config.

A better solution would be to refactor the sentence-level batching + max len padding trick so that it doesn't need the spiral bucketing data loader. For benchmarking, it was just easier to rip it out.
@@ -223,27 +202,17 @@ def zero_grad(self):
         for name in self.optimizers:
             self.optimizers[name].zero_grad()

-    def step(self, grad_scaler=None):
-        """Step through all the suboptimizers"""
+    def managed_step(self, gradient_syncs, grad_scaler=None):
Here we also use the term managed with the same meaning. The caller must supply gradient syncs (a sequence of DistributedComponentGradientSync; should add a type annotation) which determine whether or not to step each suboptimizer.
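A rough sketch of how such a managed step could work; the class shapes and field names below are guesses for illustration, not the actual mammoth code:

```python
# Illustrative sketch: stepping only the suboptimizers whose distributed
# component received real gradients, as decided by the caller.
from dataclasses import dataclass


@dataclass
class DistributedComponentGradientSync:  # stand-in for the real class
    component_name: str
    has_local_gradient: bool


class MultipleOptimizer:  # simplified; attribute names are assumptions
    def __init__(self, optimizers):
        self.optimizers = optimizers     # dict: component name -> torch optimizer

    def zero_grad(self):
        for name in self.optimizers:
            self.optimizers[name].zero_grad()

    def managed_step(self, gradient_syncs, grad_scaler=None):
        """Step only the suboptimizers for components with local gradients."""
        for sync in gradient_syncs:
            if not sync.has_local_gradient:
                continue
            optimizer = self.optimizers[sync.component_name]
            if grad_scaler is not None:
                grad_scaler.step(optimizer)  # torch.cuda.amp.GradScaler API
            else:
                optimizer.step()
```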
In commit fbe4f5c I fixed the bug.
For some reason, setting the logging level of the logger doesn't work. Neither --verbose nor --log_file_level seems to affect the level of the logger. Therefore, messages are still logged as warnings, but only shown if the verbose flag is set.
fix linting and this should be GTG
LGTM! Will let you do the honors of merging :']
Closes #8
Tasks (corpus ids) are now sampled in a deterministic way, such that all TaskQueueManagers can be aware of which task every other device is training on.
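As a rough illustration of the idea (the function name and weighting scheme below are assumptions, not the actual TaskQueueManager code), every rank can reproduce the full schedule from a shared seed instead of being told what the others are doing:

```python
# Sketch of deterministic global task sampling: every rank seeds an identical
# RNG, so each rank can recompute the full mapping (global rank -> corpus id)
# for any training step without communication.
import numpy as np


def sample_corpus_ids(step, corpus_ids, weights, n_ranks, seed=1234):
    """Return the corpus id assigned to every rank at a given training step."""
    rng = np.random.default_rng(seed + step)   # same seed on every rank
    probs = np.asarray(weights, dtype=float)
    probs /= probs.sum()
    return rng.choice(corpus_ids, size=n_ranks, p=probs)


# Every rank computes the identical schedule and picks out its own entry:
schedule = sample_corpus_ids(step=0, corpus_ids=['en-de', 'en-fr', 'en-cs'],
                             weights=[2, 1, 1], n_ranks=4)
my_corpus_id = schedule[2]   # e.g. on global rank 2
```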
One benefit is that it is no longer necessary to communicate ready_t during gradient sync. This simplifies the algorithm, but does not seem to have any performance impact.
The PR includes refactoring of multiple locations in the code to use a joint representation of DistributedComponents, to determine which parameters are on which devices, and how they need to be communicated.
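A hypothetical sketch of what such a joint representation might look like (field and method names are assumptions for illustration, not mammoth's actual classes):

```python
# Hypothetical sketch: one object per shared module, reusable by init broadcast,
# gradient sync, suboptimizer construction and checkpointing instead of four
# hand-rolled loops over distributed components.
from dataclasses import dataclass
from typing import Dict

import torch.nn as nn


@dataclass
class DistributedComponent:
    name: str                 # e.g. an encoder or decoder shared by a language group
    module: nn.Module         # the parameters this component owns
    group: object             # torch.distributed process group that shares it
    is_on_this_rank: bool     # whether the local device holds these parameters

    def named_parameters(self):
        for pname, param in self.module.named_parameters():
            yield f'{self.name}.{pname}', param


def components_on_this_rank(components: Dict[str, DistributedComponent]):
    """The single loop that the duplicated call sites could share."""
    return [c for c in components.values() if c.is_on_this_rank]
```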
There are additional places that could be refactored in future work: model_builder and the module splitter for checkpointing.
This PR also includes a SimpleLookAheadBucketing, which provides dynamic minibatching with guarantees on the maximum batch size. This allows increasing the minibatch size without causing VRAM OOM.
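For illustration, a minimal sketch of the look-ahead bucketing idea; for brevity it measures batch size in examples rather than tokens, and all names are placeholders rather than the real SimpleLookAheadBucketing API:

```python
# Simplified sketch of look-ahead bucketing: read a few minibatches' worth of
# examples, sort them locally by length so that neighbouring examples need
# similar padding, then emit minibatches that never exceed the maximum size.
def look_ahead_bucketing(examples, max_batch_size, look_ahead_minibatches=4):
    def drain(buffer):
        # Local sort by max(src, tgt) length reduces padding within minibatches.
        buffer.sort(key=lambda ex: max(len(ex[0]), len(ex[1])))
        for i in range(0, len(buffer), max_batch_size):
            yield buffer[i:i + max_batch_size]   # never larger than max_batch_size

    buffer = []
    for example in examples:                      # example = (src_tokens, tgt_tokens)
        buffer.append(example)
        if len(buffer) >= max_batch_size * look_ahead_minibatches:
            yield from drain(buffer)
            buffer = []
    if buffer:
        yield from drain(buffer)
```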