Threadx #1945
Conversation
@Qubitium I ran GLM-4.5-Air at 4bit, mock_quantization, samples: 1 on this branch, and after a few layers got an error. I have also tried to run this branch with
@avtc The latest commits just unlocked another 20%+ speed improvement in MoE layer quantization. I am getting carried away, so I am not concentrating on memory usage right at this moment.
The error log from the latest main branch: will check if setting the env var helps.
Setting the env var works now: VRAM is reclaimed, and the estimated time looks promising (30m for GLM-4.5-Air, 4bit, samples: 1, mock_quantization=True).
@avtc
Regarding the
@avtc Thanks for your patience in trying different strategies. This PR by itself will not fix your OOM issue out of the gate, but it sets the foundation for data-parallelism and also gives us free metrics (without overhead) on when and how much GPU work has been done, based on tasks submitted/completed on each individual GPU.
Right now torch.cuda.empty_cache() will be called for every 14 background GPU tasks that have been submitted to the GPU work queue.
The limit is currently set in the module_looper __init__. This gives us the flexibility to implement multiple strategies based on subset, layer, etc. It can mimic the old way (wait for all background tasks to complete per layer), or it can do fine-grained control like running the cleanup task for every N submodules we process. Flexibility is good here since one strategy will not fit all. For normal Llama-like models there are about 7 quantized modules per layer; for MoE this can vary wildly.
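For illustration, here is a minimal sketch of the counter-based cleanup pattern described above. The names (`DeviceTaskCounter`, `cleanup_interval`, `on_submit`, `on_complete`) are invented for this example and do not mirror the actual module_looper code; it only shows the idea of tracking submitted/completed tasks per GPU and calling `torch.cuda.empty_cache()` every N submissions:

```python
# Hypothetical sketch only: names (DeviceTaskCounter, cleanup_interval) are
# invented for illustration and are not the actual module_looper implementation.
import threading
from collections import defaultdict

import torch


class DeviceTaskCounter:
    def __init__(self, cleanup_interval: int = 14):
        # Number of background GPU tasks submitted to a device's work queue
        # before we trigger a cache cleanup on that device.
        self.cleanup_interval = cleanup_interval
        self.submitted = defaultdict(int)   # per-device submitted count (free metric)
        self.completed = defaultdict(int)   # per-device completed count (free metric)
        self._lock = threading.Lock()

    def on_submit(self, device: torch.device) -> None:
        with self._lock:
            self.submitted[device] += 1
            cleanup = self.submitted[device] % self.cleanup_interval == 0
        if cleanup and device.type == "cuda":
            # Reclaim cached allocator blocks so peak VRAM stays bounded
            # while background quantization tasks keep streaming in.
            with torch.cuda.device(device):
                torch.cuda.empty_cache()

    def on_complete(self, device: torch.device) -> None:
        with self._lock:
            self.completed[device] += 1


# Example usage: record a submission to cuda:0; every 14th call empties the cache.
counter = DeviceTaskCounter(cleanup_interval=14)
counter.on_submit(torch.device("cuda:0"))
```

The same counter could back the other strategies mentioned above (per layer, per subset, or per N submodules) by changing what counts as a task and how large the interval is.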