A natural approach to faster SAE training is data parallelism. Perhaps we could simply use DDP to make 8 copies of the TransformerLens model to generate activations and synchronize the SAE gradients. This could accelerate activation generation, which is the speed bottleneck for larger LMs.
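A minimal sketch of that idea, assuming one process per GPU, a frozen TransformerLens model per rank, and hypothetical `make_lm` / `make_sae` factories and per-rank token shards (the hook point is just an example):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_sae_ddp(rank, world_size, make_lm, make_sae, token_shards):
    # One process per GPU; each holds a full copy of the frozen LM.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}")

    lm = make_lm().to(device).eval()              # frozen TransformerLens model (hypothetical factory)
    sae = DDP(make_sae().to(device), device_ids=[rank])
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    for tokens in token_shards[rank]:             # each rank consumes its own data shard
        tokens = tokens.to(device)
        with torch.no_grad():                     # activation generation, no LM gradients
            _, cache = lm.run_with_cache(tokens)
            acts = cache["blocks.6.hook_resid_pre"]  # example hook point

        recon = sae(acts)
        loss = (recon - acts).pow(2).mean()       # plus a sparsity penalty in practice
        opt.zero_grad()
        loss.backward()                           # DDP all-reduces SAE gradients here
        opt.step()

    dist.destroy_process_group()
```

Note that only the SAE is wrapped in DDP; the LM stays frozen on every rank, so gradient synchronization cost is limited to the SAE parameters.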
This may not work for larger models, say 70B. The ultimate solution may be a producer-consumer design pattern. Let's leave that for later.
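For concreteness, a producer-consumer split could look roughly like this: dedicated producer processes run the large LM and stream activations through a queue, while a separate consumer trains the SAE. This is a sketch under assumed names (`make_lm`, `make_sae`, `token_batches` are hypothetical placeholders), not the repo's API:

```python
import torch
import torch.multiprocessing as mp

def producer(queue, make_lm, token_batches, device):
    # Dedicated process: runs the (large) LM and streams activations out.
    lm = make_lm().to(device).eval()
    with torch.no_grad():
        for tokens in token_batches:
            _, cache = lm.run_with_cache(tokens.to(device))
            queue.put(cache["blocks.6.hook_resid_pre"].cpu())  # example hook point
    queue.put(None)  # sentinel: no more activations

def consumer(queue, make_sae, device):
    # Separate process: trains the SAE on whatever the producer emits.
    sae = make_sae().to(device)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    while (acts := queue.get()) is not None:
        acts = acts.to(device)
        loss = (sae(acts) - acts).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue(maxsize=8)  # bounded queue applies backpressure to the producer
    p = mp.Process(target=producer, args=(q, make_lm, token_batches, "cuda:0"))
    c = mp.Process(target=consumer, args=(q, make_sae, "cuda:1"))
    p.start(); c.start(); p.join(); c.join()
```

The advantage over plain DDP is that activation generation and SAE training can scale independently: a 70B model could be sharded across several producer GPUs while the SAE trains elsewhere.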
Support DDP
BTW, I made a modification to get past a bug in the DDP code.
Error message without the modification:
...
[rank3]: File "/home/alan/dev/sae/Language-Model-SAEs/TransformerLens/transformer_lens/components/embed.py", line 34, in forward
[rank3]: return self.W_E[tokens, :]
[rank3]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)
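One workaround consistent with this error is to move the token indices onto the same device as `W_E` before indexing, so each DDP replica indexes with tensors on its own GPU. A minimal sketch (the actual patch may differ, and the real `forward` in TransformerLens has a fuller signature):

```python
# Hypothetical patch to transformer_lens/components/embed.py:
def forward(self, tokens):
    # Move indices to W_E's device so indexing works on every replica.
    return self.W_E[tokens.to(self.W_E.device), :]
```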
We are currently working on it. Initially we did not take industry-scale models into account, nor did we plan for DDP. We may need about a week of refactoring to support that.
8B models do work on a single A100 GPU with a small batch size.
If that does not fit your scenario, you may have to wait a while xd.