S4D Memory Requirements #51

Closed
stefan-baumann opened this issue Jun 27, 2022 · 26 comments

@stefan-baumann

Hey, I wanted to give S4D a quick try in my research as a drop-in replacement for S4 (which, as far as I gathered, should be a good way to start), but I'm running into some hard memory limitations. I'm trying to train the DiffWave version of SaShiMi as a first experiment, and the memory requirements increase significantly when I replace S4 with an equivalent S4D layer (with default settings), to the point that the model goes OOM in my case (so I don't have precise measurements, but it's at least a 20% increase in overall memory consumption). I use the parameters as discussed in #46. Is this something you'd expect?

@albertfgu
Contributor

The implementation of S4D originally uploaded to this repo does not have a custom kernel for the Vandermonde multiplication. Materializing the matrix on this line uses $O(HNL)$ space, whereas in principle it could be implemented with $O(H(N+L))$ space. These issues are discussed in Sections 3.2-3.4 of the S4D paper.
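
For concreteness, here is a rough sketch of the difference (illustrative only, not the repo's kernel), where H is the number of channels, N the state size, L the sequence length, and C and dA_log are (H, N) complex tensors; any factor of 2 from conjugate-pair symmetry is omitted:

import torch

def s4d_kernel_naive(C, dA_log, L):
    # Materialize the full Vandermonde tensor V[h, n, l] = exp(l * dA_log[h, n]);
    # the (H, N, L) tensor is the O(HNL) memory bottleneck.
    pos = torch.arange(L, device=dA_log.device)
    V = torch.exp(dA_log.unsqueeze(-1) * pos)   # (H, N, L)
    return torch.einsum('hn,hnl->hl', C, V)     # (H, L) convolution kernel

def s4d_kernel_chunked(C, dA_log, L, chunk=1024):
    # Same contraction computed in length chunks, so only an (H, N, chunk)
    # block is resident at a time. (During training, autograd still saves each
    # block unless recomputation is used; a fused reduction such as pykeops'
    # never materializes the blocks at all, giving the O(H(N+L)) footprint.)
    out = []
    for start in range(0, L, chunk):
        pos = torch.arange(start, min(start + chunk, L), device=dA_log.device)
        V = torch.exp(dA_log.unsqueeze(-1) * pos)
        out.append(torch.einsum('hn,hnl->hl', C, V))
    return torch.cat(out, dim=-1)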

S4D was originally meant to be pedagogical, whereas S4 usually requires less tuning out of the box, so I didn't implement the more efficient S4D kernel at first. I now have a version using the pykeops library, which will be released with v3. I can upload a version now; it would also help us if you ran some tests to see whether it works.

@stefan-baumann
Author

stefan-baumann commented Jun 27, 2022

Ah, that makes sense, thanks! I'd love to test it, see what kind of effect it has for me, and report back :)

@albertfgu
Contributor

The new standalone file is here. This one includes all options for all models; for example, you would pass in mode=nplr for S4 or mode=diag for S4D, and then there are the usual arguments. This file is ported from our internal research repo and I'm still testing it, so please open issues if you find any snags.

The measures for S4D are a little different: they are diag-inv for S4D-Inv, diag-lin for S4D-Lin, and diag-legs for S4D-LegS. There's also a mixed option diag that combines S4D-Inv and S4D-Lin.
Note that we have personally found some weird edge cases where S4D doesn't do well out of the box on our YouTubeMix generation dataset. So far only measure=diag-lin works for us, though that one did slightly better than S4.
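
For example, a minimal usage sketch of the standalone (the import path is the file mentioned above; the (batch, d_model, length) layout and the (output, state) return are inferred from the snippets later in this thread, so treat it as illustrative rather than definitive):

import torch
from src.models.sequence.ss.standalone.s4 import S4

x = torch.randn(1, 64, 1024)  # (batch, d_model, length)

s4_layer = S4(64, bidirectional=True, mode='nplr')                       # S4 (NPLR parameterization)
s4d_layer = S4(64, bidirectional=True, mode='diag', measure='diag-lin')  # S4D-Lin
y, _ = s4d_layer(x)  # forward returns an (output, state) tuple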

I'm still thinking about what else to release. While this file has the full model with every option, I was considering releasing an intermediate file, similar to the current s4d.py standalone, that's less scary than this one. The simplest one will be the s4d_minimal.py standalone, which I'm planning to keep for pedagogical purposes (so potentially three models in total, at various levels of simplicity vs. tunability). I'm curious if you have any opinions on what would be most useful to you.

Over the next two days I will be working on double-checking and releasing the original SaShiMi+DiffWave repo, so stay tuned!

@stefan-baumann
Author

Awesome, I'll give it a try. Just to make sure I understood everything correctly: to do a drop-in replacement of S4 -> S4D, I'd take the S4 module and set mode='diag', measure='diag-lin' for a start, potentially switching the measure later, and of course adjust the rest of the training process if needed.

Regarding your other question: right now, I'm trying to apply a variant of SaShiMi trained in a diffusion context for my research. As I'm training on longer sequences than with Speech Commands, I run into memory limitations quickly (> 40GB of memory usage for a forward pass on one GPU with a batch size of 1 per GPU), so I'm particularly interested in any additional efficiency I can get on top of the more common optimizations.
Generally, the way the modules are provided is not that big of an issue for me (the current way is more than fine, and the standalone versions are definitely appreciated). The only immediate improvement I can think of off the top of my head is to make parameters like mode and measure more directly available by including them as explicit arguments in the S4 class etc., as this can get somewhat confusing at times. Apart from that, there are of course ways of improving immediate usability, like the standalone modules, which are definitely nice, but I don't personally have a large need for them. They're appreciated regardless!

@stefan-baumann
Author

Trying it out in practice, I currently get an exception from the pykeops implementation:

Traceback (most recent call last):
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "[...]/learner.py", line 399, in train_distributed
    _train_impl(replica_id, model, dataset, args, params)
  File "[...]/learner.py", line 353, in _train_impl
    learner.train(max_steps=args.max_steps)
  File "[...]/learner.py", line 125, in train
    train_step_result = self.train_step(features)
  File "[...]/learner.py", line 172, in train_step
    predicted = self.model(noisy_audio, t, spectrogram)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/model.py", line 495, in forward
    x, _ = self.sashimi((x, diffusion_step)) # input shape ((batch, length, channels), (batch, 1)); output shape (batch, length, channels)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/src/sashimi/sashimi.py", line 946, in forward
    x, _ = layer(x, **layer_kwargs)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/src/sashimi/sashimi.py", line 356, in forward
    y, _ = self.layer(y)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/src/src/models/sequence/ss/standalone/s4.py", line 1371, in forward
    k, k_state = self.kernel(L=L, rate=rate, state=state) # (C H L) (B C H L)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/src/src/models/sequence/ss/standalone/s4.py", line 1248, in forward
    return self.kernel(state=state, L=L, rate=rate)
  File "[...]/env_a6000/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/src/src/models/sequence/ss/standalone/s4.py", line 1119, in forward
    K = log_vandermonde(C, dA.log(), L)
  File "[...]/src/src/models/sequence/ss/standalone/s4.py", line 101, in log_vandermonde
    r = vandermonde_mult(v, x, l, backend='GPU')
  File "[...]/env_a6000/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 568, in __call__
    out = GenredAutograd.apply(
  File "[...]/env_a6000/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 79, in forward
    result = myconv.genred_pytorch(
RuntimeError: [KeOps] Wrong number of args : is 3 but should be at least 4 in formula.

The same code was working with the previous S4D module, and I'm simply instantiating the S4 modules as S4(d_model, bidirectional=True, mode='diag', measure='diag-lin').

@albertfgu
Contributor

Hmm, I went to branch origin/v3 and added these lines to the bottom of the standalone file:

if __name__ == '__main__':
    torch.manual_seed(42)

    device = 'cuda' # 'cpu'
    device = torch.device(device)
    model = S4(256, bidirectional=True, mode='diag', measure='diag-lin').to(device)  # only constructs the layer on the GPU

then ran
python -m src.models.sequence.ss.standalone.s4
and this doesn't throw an error for me.

Can you try this? What version of torch and pykeops are you on?

@stefan-baumann
Author

Yes, I can confirm that this works for me, too. I'm on torch==1.11.0+cu113 and pykeops==1.5.

@albertfgu
Contributor

Those are my versions too. Is it possible that the erroring code is instantiating the wrong version of the module?

@stefan-baumann
Author

stefan-baumann commented Jun 27, 2022

You mean that I'm using a version other than that from the v3 branch of the repo for the model that errors out? I don't see how this could happen in my case.

@albertfgu
Contributor

Ok, I'm not sure what the difference is between the instantiation that errors and the one that doesn't. They are being instantiated with the same call, right?

@stefan-baumann
Author

Yes, the call is the same, except for the value for d_model.

@albertfgu
Contributor

Can you specify what value you're using then?

@stefan-baumann
Author

It's variable, different values from [64, 128, 256], as in the default small version of SaShiMi.

@albertfgu
Contributor

Also, you can try the latest version, pykeops==2.1. I had issues with it on certain CUDA versions, but it should work on later CUDA versions like 11.3.

You can also try to follow their instructions to clear the cache here: https://www.kernel-operations.io/keops/python/installation.html#part-checkpython

@stefan-baumann
Author

Okay, this is really weird - looking more deeply into it, I seem to get this kind of error randomly (and the exact error sometimes changes). I don't know what causes it, but it seems to depend on the system I train on, so it's most probably not caused by your side of the code. Please excuse the commotion.
It's definitely a weird one, considering that I'm training on the same GPU model on different servers with the same software (maybe it's related to pykeops? That's the only thing I can think of).

I can confirm now, though, that I got your pykeops implementation to work on one machine in both a single- and a dual-GPU setup. I should be able to report some results on the performance impact of the new version tomorrow, provided I get it to work at full scale.

@albertfgu
Contributor

Ok, feel free to file a separate issue as you narrow this down. This seems like quite a strange bug.

Also, just to confirm: does this happen with the pykeops version of S4? Passing in mode=nplr keops=True should trigger the pykeops Cauchy kernel codepath.

@stefan-baumann
Author

stefan-baumann commented Jun 28, 2022

First of all, the promised results regarding the performance impact. They come from an analysis of my full model that I was doing anyway, so they don't isolate S4D, but they should at least give a good lower bound on the improvements, in case that's interesting to you.
Going from S4D at sequence length $2^{16}$ with the normal implementation to the pykeops one (on pykeops==1.5) results in a 2.6x memory reduction (about 40% of what it was before) and even slightly improves training time, to about 88% of what it was before, and everything seems to be working fine (although I did not run full trainings, for obvious reasons).
I have tried mode=nplr keops=True too and didn't get an error during the first few tries. But seeing as those errors occur randomly anyway, I can't guarantee they're not present with that version, too. Whatever it is, I suspect it's not directly caused by your code.

@stefan-baumann
Author

stefan-baumann commented Jun 28, 2022

Did you actually test training the current version of the S4 module in practice, btw? Going back from S4D to S4 in my model (and using the default settings instead of mode='nplr', keops=True for the S4 module - those work) breaks my training, as all gradients are NaN.

I condensed it down to this code, which reliably triggers the issue on my end (added to the end of src.models.sequence.ss.standalone.s4):

if __name__ == '__main__':
    torch.manual_seed(42)

    device = 'cuda' # 'cpu'
    device = torch.device(device)
    # model = S4(256, bidirectional=True, mode='diag', measure='diag-lin').to(device) # works
    # model = S4(256, bidirectional=True, mode='nplr', keops=True).to(device) # works
    model = S4(256, bidirectional=True, mode='nplr').to(device) # torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0
    def f(x):
        return model(x)[0]
    print(model)
    torch.autograd.gradcheck(f, (torch.ones((1, 256, 2**6), requires_grad=True, device=device),))

If this is something you want to investigate, I can also create a separate issue; it just came up while investigating this problem, so I thought I'd post it here first.

@albertfgu
Contributor

albertfgu commented Jun 28, 2022

Going from S4D at sequence length $2^{16}$ with the normal implementation to the pykeops one (on pykeops==1.5) results in a 2.6x memory reduction (about 40% of what it was before)

this is really great to know! that's even a bit better than I expected

Going back from S4D to S4 in my model (and using the default settings instead of mode='nplr', keops=True for the S4 module - those work) breaks my training, as all gradients are NaN.

This works fine for me. My standard testing command python -m train wandb=null pipeline=mnist model=s4 runs fine, and adding +model.layer.keops=true works as well. (You'll need to remove a line from the layer config, which will be updated later as part of the full release.) Let's start a separate issue for tracking this down.

@albertfgu
Contributor

If the code works for you with keops=True but gives NaNs without, it's probably a problem with the CUDA extension for Cauchy. You can try uninstalling and reinstalling it (pip uninstall cauchy_mult, navigate to extensions/cauchy, and python setup.py install). Also keep in mind that it has to be recompiled for different environments (CUDA version, machine, GPU type, etc.) if those change, although usually it'll just throw an error instead of failing with NaNs if there's a mismatch. If it still doesn't work, filing a separate issue would be helpful so we can look into it, but you can just uninstall the extension and rely on pykeops.

One more thing: you can try upgrading to pykeops==2.1 (I found an edge case where it breaks, but when it works it's better than 1.5).

@stefan-baumann
Author

Okay, I was not aware that I'm expected to reinstall the Cauchy extension for every machine. I've been using a virtual environment shared across multiple systems (same GPU setup), which had worked fine before. Creating a new virtual environment and reinstalling the Cauchy extension seems to have fixed the immediate problem, thanks! Maybe a quick hint about this in one of the READMEs would be helpful :)

@albertfgu
Contributor

Curious how things turned out - did you run into any other issues? Did you ever figure out what was going on with the multi-GPU issue?

@stefan-baumann
Author

So, in general, reinstalling the Cauchy extension seems to have resolved most of the issues I was facing. Regarding my multi-GPU problems, PyKeOps seems to have issues when multiprocessing is used and the filesystem hosting its cache is not particularly fast, which I could hotfix by disabling its ability to cache builds. I assume this problem might also be responsible for some of the other issues I encountered, as it seems to both randomly crash some processes and cause unexpected behaviour in others. I haven't investigated it enough to open an issue there yet, though.
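
(For illustration, one way to sidestep a slow shared cache directory is to redirect the pykeops build cache to node-local scratch; the setter names below are assumptions about the pykeops API, which differs between 1.x and 2.x, hence the hasattr guards:)

import os
import tempfile
import pykeops

# Compile/cache kernels on node-local scratch instead of a slow shared filesystem.
local_cache = tempfile.mkdtemp(prefix='keops_', dir=os.environ.get('TMPDIR', '/tmp'))

if hasattr(pykeops, 'set_build_folder'):    # pykeops 2.x name (assumed)
    pykeops.set_build_folder(local_cache)
elif hasattr(pykeops, 'set_bin_folder'):    # pykeops 1.x name (assumed)
    pykeops.set_bin_folder(local_cache)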

Apart from that, I have recently encountered NaN gradients again when using the S4 layer with nplr mode and the default measure on the most recent v3 build with keops=True specified; they don't occur with any other S4 parameters I use. I haven't had time to look into this further yet, and I haven't checked whether I can reliably reproduce it either. If I can nail it down somewhat, I'll open another issue.

@albertfgu
Contributor

Ah yes, I remember issues with pykeops 1.5 on multi-GPU and had to add some hacks around the cache folder. pykeops 2.1 is supposed to resolve some of these problems by avoiding the cache entirely; it has a much faster compilation time, so it just doesn't cache the kernels.

So does your NaN gradient issue only occur with keops=True, or with the CUDA extension as well? I wonder if keops is causing any other hidden issues.

@stefan-baumann
Author

Yes, I experienced the same with 1.5. But 2.1 still had similar issues for me with its cache folder (there's one at ~/.cache/keops2.1), which were a lot less straightforward to both diagnose and fix.

My gradient issues only occur with keops=True, mode='nplr', and seem to be independent of any automatic/manual mixed-precision settings, which I suspected at first. But as I said, I haven't found the time to really look into it yet.

@albertfgu
Contributor

Good to know, thanks for the report! File an issue if you dig into it more.
