Consider making the decoding classes subclasses of torch.nn.Module and registering them via add_module() #8436
Comments
Referenced commit: Initialize cuda tensors lazily on first call of __call__ instead of __init__. We don't know what device is going to be used at construction time, and we can't rely on torch.nn.Module.to() to work here. See here: NVIDIA#8436. This fixes an "Expected all tensors to be on the same device, but found at least two devices" error that happens when you call to() on your torch.nn.Module after constructing it. NVIDIA#8191 (comment). Remove excess imports. Check a few more error messages from nvrtc. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
There's a reason those decoding classes aren't modules. When RNN-T was being developed, it turned out there was a memory leak due to PyTorch graph tracking and autoregressive calls to the decoder joint. That's why .freeze() and .as_frozen() were developed in NeMo's core to deal with that. I don't know if that reason is still valid in PyTorch after the inference_mode() decorator was added; maybe it's safe to make them modules again. As you've mentioned, it is a lot of work to do this change, and I don't feel it is super high priority, but if someone wants to take a look, that's fine. To get the device inside the decoding classes, the preferred mechanism is mostly to infer it from the parameters of the encoder, decoder, or joint (or from the input tensors).
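A minimal sketch of that parameter-based device inference (the helper name is illustrative, not NeMo's actual API):

```python
import torch

def infer_device(module: torch.nn.Module) -> torch.device:
    """Infer which device a module lives on from its first parameter."""
    try:
        return next(module.parameters()).device
    except StopIteration:
        # No parameters registered; fall back to CPU.
        return torch.device("cpu")
```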
This is not a good user experience, and I want to avoid it. Users should not have to set any env flags as a prerequisite to use NeMo ASR.
Well, if you're training, yes, but that's handled by PTL. During inference we stick to a single GPU for now, but with larger models we may consider doing multi-GPU in the future.
I frequently use it in notebooks. I think that there are two possible approaches:
However, such a redesign will break the current checkpoints and will require a lot of work and testing.
It's a few lines to manually manage the device type from the encoder, decoder, or joint's param dtype; we shouldn't make such breaking changes for this.
On top of this, yes, the fact that parameters would be double-counted for the decoder and joint with such a redesign is also bad.
I agree that, for now, such breaking changes are undesirable.
Currently, the RNN-T model owns:
Proposed, the RNN-T model owns:
That's just bad design; why merge the transcription and prediction networks when all the literature denotes them as separate modules? Also, you don't always call the prediction network and the decoder network with the same set of inputs (training prepends blank; at eval time it starts with blank as the first token for autoregressive decoding). This is not a viable proposal, in my opinion.
I don't really get what the big issue is with the decoding framework not being a neural module. Its responsibility is not to act as a parameter-based operation on the NN as part of the forward; it's an agnostic layer that provides a stable interface (Hypothesis) to map the encoder/decoder/joint logprobs to text. It's a logical separation, not a module dependency. I understand that certain issues can arise due to the current design; that's because we're using a more advanced design. However, a very simple solution to this exists, which Daniel has already implemented and which is trivial to do (and is also recommended by PyTorch, by the way: base new tensors on the device of the currently active ones they will interact with). I don't see why such a thing requires a refactoring of the decoding framework. If there's a bug, we fix it; we don't scrap the entire thing and do it all over because of a "PyTorch design pattern" (which NeMo does not fully follow by design; we instead use the PTL design pattern).
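As an illustrative aside, that PyTorch recommendation looks roughly like this (the function and argument names here are hypothetical):

```python
import torch

def make_blank_labels(logprobs: torch.Tensor) -> torch.Tensor:
    # Base the new tensor on the tensor it will interact with, so it
    # automatically lands on the same device.
    return logprobs.new_zeros(logprobs.shape[0], dtype=torch.long)
```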
My overall conclusion is that I have encountered an edge case, and the right approach is just to recreate the appropriate state tensors any time the device of the input tensors changes from what was being used before.
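A minimal sketch of that approach, assuming a hypothetical decoder class that caches its state tensors (the names are illustrative, not NeMo's actual API):

```python
import torch

class LazyDecodeState:
    """Allocates state lazily on first call and recreates it when the
    input's device or batch size changes."""

    def __init__(self):
        self._labels = None  # not allocated in __init__: device unknown here

    def __call__(self, encoder_output: torch.Tensor) -> torch.Tensor:
        batch, device = encoder_output.shape[0], encoder_output.device
        # Recreate the cached state whenever the input's device or shape changes.
        if (
            self._labels is None
            or self._labels.device != device
            or self._labels.shape[0] != batch
        ):
            self._labels = torch.zeros(batch, dtype=torch.long, device=device)
        return self._labels
```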
* Speed up RNN-T greedy decoding with cuda graphs. This uses CUDA 12.3's conditional node support. Initialize cuda tensors lazily on first call of __call__ instead of __init__. We don't know what device is going to be used at construction time, and we can't rely on torch.nn.Module.to() to work here. See here: #8436. This fixes an "Expected all tensors to be on the same device, but found at least two devices" error that happens when you call to() on your torch.nn.Module after constructing it. #8191 (comment). Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Closing this. The better way is lazy initialization, given how NeMo currently is.
Is your feature request related to a problem? Please describe.
@artbataev pointed out to me an issue with calling .to() when the model is using the cuda graph RNN-T decoder:
Basically, if I initialize the cuda decoder with instance variable buffers (or parameters) that are torch.Tensors on device cuda:0, the "to()" method won't move them over to device cuda:1, because to() recurses only into members that are also torch.nn.Modules: https://github.com/pytorch/pytorch/blob/c3b4d78e175920141de210f44d292971d7c52ff0/torch/nn/modules/module.py#L572
In order to make that behavior work, I would need every class that transitively uses an instance of RNNTGreedyDecodeCudaGraph to inherit from torch.nn.Module, so that to() will act properly. This seems like a lot of work, but it would allow this API call to work as expected. Otherwise, you get an error, as Vladimir shows here: #8191 (comment)
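A self-contained illustration of the underlying behavior (the class names are hypothetical, not NeMo code):

```python
import torch

class PlainDecoder:  # a plain Python class, not a torch.nn.Module
    def __init__(self):
        self.state = torch.zeros(4)  # invisible to Module.to()

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)  # registered: moved by to()
        self.decoder = PlainDecoder()        # unregistered: left behind

if torch.cuda.is_available():
    m = Model().to("cuda:0")
    print(m.linear.weight.device)  # cuda:0
    print(m.decoder.state.device)  # cpu -- to() never recursed into PlainDecoder
```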
.to() is being called here transitively (in his transcribe_speech.py command line):
NeMo/nemo/core/connectors/save_restore_connector.py, line 179 (at 21990e4)
NeMo/examples/asr/transcribe_speech.py, lines 217 to 241 (at 21990e4)
Basically, any code in NeMo that allocates a torch.Tensor that isn't tracked by PyTorch (via wrapping it in either a parameter or a buffer and putting it inside a torch.nn.Module) will fail to be converted properly by to(). This could also cause subtle bugs when converting a model from float32 to bfloat16. @erastorgueva-nv I don't think this is what you might be seeing in your Canary debugging, but FYI.
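For contrast, a sketch of the tracked alternative, registering the tensor as a buffer so that both device moves and dtype conversions see it (illustrative names only):

```python
import torch

class TrackedDecoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A registered buffer is visible to to(), cuda(), bfloat16(), etc.,
        # and is saved in the state_dict.
        self.register_buffer("state", torch.zeros(4))

d = TrackedDecoder().to(torch.bfloat16)
print(d.state.dtype)  # torch.bfloat16 -- the buffer was converted too
```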
I ultimately don't think this is a good idea, since it seems like a lot of work. I think a better way to fix this is to avoid calling to() at all, and instead have users set the CUDA_VISIBLE_DEVICES environment variable appropriately before starting a process (and remove the cuda=1 config option in transcribe_speech.py). Note that using CUDA_VISIBLE_DEVICES to specify a single GPU, and simply using torch.device("cuda") instead of specifying a device index in code via torch.device("cuda", my_index), is what's recommended: https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html#torch.cuda.set_device
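A sketch of that workflow (the exact transcribe_speech.py invocation is illustrative):

```python
import torch

# Pin the device when launching the process, e.g.:
#   CUDA_VISIBLE_DEVICES=1 python transcribe_speech.py ...
# Then, inside the process, never hard-code a device index:
device = torch.device("cuda")       # resolves to the single visible GPU
# device = torch.device("cuda", 1)  # discouraged: hard-codes an index

x = torch.zeros(4, device=device)
print(x.device)  # cuda:0 -- i.e., index 0 of the *visible* devices
```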
Do we have any use for multiple cuda devices in a single process in NeMo?