Data Parallel #1950
Conversation
Signed-off-by: Qubitium <Qubitium@modelcloud.ai>
Awesome, will check.

@Qubitium

@avtc You are a bug magnet, if I may say so. lol. Geez. You keep running into bugs faster than I can fix them. Btw,

Idk if the fact that the model's first layer(s) do not have expert modules, and starting from layer index

Latest main works! Thank you @Qubitium
@avtc Another 75% cut in MoE quantization time on top of the current main branch. No joke. This PR will actually make smaller non-MoE models slower, but it gives a huge boost to MoE models. The bigger the model, the more the extra GPUs will help.
Forwarding is now data-parallel (the model is replicated across all GPUs and the work is sharded between them).
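For illustration only (this is not the code in this PR), a minimal sketch of what data-parallel forwarding over calibration batches can look like in PyTorch. The helper names (`replicate_layer`, `data_parallel_forward`, `batches`) and the thread-per-GPU layout are assumptions:

```python
import copy
from concurrent.futures import ThreadPoolExecutor

import torch


def replicate_layer(layer, devices):
    # One full copy of the layer per GPU (hypothetical helper).
    return [copy.deepcopy(layer).to(d) for d in devices]


def data_parallel_forward(layer, batches, devices):
    replicas = replicate_layer(layer, devices)

    def run_shard(replica, device, shard):
        # Each worker thread forwards its shard of calibration batches on its own GPU.
        outs = []
        with torch.inference_mode():
            for x in shard:
                outs.append(replica(x.to(device)).cpu())
        return outs

    # Round-robin shard the calibration batches across the devices.
    shards = [batches[i::len(devices)] for i in range(len(devices))]
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        results = pool.map(run_shard, replicas, devices, shards)
    # Flatten the per-shard outputs back into one list (ordering is per-shard).
    return [out for shard_outs in results for out in shard_outs]
```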
But based on some small tests, I expect the chance of OOM to actually increase: the model now has to be replicated/copied to multiple GPUs across multiple threads, so gpu:0's memory load goes up.
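As a rough illustration of the memory pressure, a hedged sketch that checks free memory on every visible GPU before copying a replica over; `replica_bytes` is a hypothetical caller-supplied estimate, only `torch.cuda.mem_get_info` is a real API:

```python
import torch


def has_headroom(replica_bytes: int) -> bool:
    # Check every visible GPU for enough free memory to hold one more replica.
    for i in range(torch.cuda.device_count()):
        free, _total = torch.cuda.mem_get_info(i)
        if free < replica_bytes:
            print(f"cuda:{i}: only {free / 2**30:.1f} GiB free, replication may OOM")
            return False
    return True
```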