remove prev thread fix, replaced by main changes #1916
Conversation
Signed-off-by: Qubitium <Qubitium@modelcloud.ai>

@Qubitium the same error appears again: GLM-4.5-Air, mock_quantization: true, 3.13t, during layer number 1 (the first layer with experts).

@avtc Ok. Still working on fixing multi-thread (GIL=0) issues on main. I guess there are still bugs there. Can you give me a reproducible script? Thanks! No need for private calib data, etc. Just something simple I can run on my local multi-gpu setup.

@Qubitium Python part:

There are new runtime errors in the output:

@Qubitium With 4 GPUs the issue is not reproduced, but with 5 GPUs it is reproduced.

@Qubitium

@avtc Can you open a new issue so I can track this? auto_gc was removed by me because I don't think it matters now. Maybe not. How many GPUs are you running for the GLM-4.5-Air quant, and can you show the "Tokens/Pad Tokens" data output in the new log? Btw, please check your CPU memory usage. It should be much lower than before.

The latest main does not have the fix for threads, so I can use only 4 x 3090. I used a single sample with 80 tokens to quantize to 4-bit, using the script above. I am not sure where to look for Tokens/Pad tokens; I will check later.

@avtc Check for this output:

INFO Calibration: Sort in descending order by length
INFO Calibration: Total padded tokens: 0
INFO Calibration: Total non-padded tokens: 345522
INFO Calibration: Total tokens: 345522
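
For reference, here is a minimal, illustrative sketch (not the library's actual implementation) of how padded vs. non-padded token totals like the ones above can be derived once calibration samples are sorted in descending order by length and batched; the batching-to-longest-sample assumption is mine:

```python
def calibration_token_stats(sample_lengths: list[int], batch_size: int) -> dict:
    # "Sort in descending order by length", as in the log above
    lengths = sorted(sample_lengths, reverse=True)
    padded = non_padded = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        width = max(batch)  # assume each batch is padded to its longest sample
        non_padded += sum(batch)
        padded += sum(width - n for n in batch)
    return {"padded": padded, "non_padded": non_padded, "total": padded + non_padded}

# A batch of equal-length samples (or batch_size=1) yields 0 padded tokens,
# which matches "Total padded tokens: 0" in the log.
```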

@Qubitium The max RAM usage was around 46 GB until layer 19, a great result! But I need to check with more samples...

@Qubitium

@Qubitium The VRAM with this fix is reclaimed after each layer. The offload turned off empty_cache, so the reclaim did not work with just ...

@avtc Yes! I have confirmed that adding the post-layer torch.cuda.empty_cache() op helps lower GPU memory by about 10% on a Llama 3.2 1B model, so you should see much more saving for MoE. But instead of this call, which is very expensive (slow) for non-MoE models where each layer does not have so many modules, I will add a new auto_gc_bytes feature that will auto_gc based on how much estimated memory we have freed. This will make sure we only make (and can tune) this call when absolutely necessary. Should be done today.
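
A minimal sketch of that thresholded idea, assuming an illustrative byte threshold and helper name (this is not the actual auto_gc_bytes implementation):

```python
import gc
import torch

# Hypothetical threshold: only pay for a full GC + CUDA cache flush once we
# estimate that at least this many bytes of tensors have been released.
AUTO_GC_BYTES_THRESHOLD = 1 << 30  # 1 GiB, illustrative value

_pending_freed_bytes = 0

def auto_gc(estimated_freed_bytes: int) -> None:
    """Accumulate estimated freed bytes and flush caches only past the threshold."""
    global _pending_freed_bytes
    _pending_freed_bytes += estimated_freed_bytes
    if _pending_freed_bytes < AUTO_GC_BYTES_THRESHOLD:
        return  # skip the expensive calls for small, cheap layers
    _pending_freed_bytes = 0
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Usage inside a per-layer quantization loop (sketch):
# freed = sum(p.numel() * p.element_size() for p in layer.parameters())
# del layer
# auto_gc(freed)
```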

@Qubitium The main idea behind the fix was to wait for the offload to finish after each layer (...). For small models that fit into VRAM that was unnoticeable, but with large models this fix helps keep only a single layer in VRAM and prevents VRAM OOM. I have checked with ... With this fix I have noticed a slowdown, so a further improvement would be to monitor the VRAM usage and call ...
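
One hedged sketch of that combination, waiting for the per-layer offload copies and only flushing the CUDA cache under memory pressure (the watermark and function name are illustrative assumptions, not the project's code):

```python
import torch

def reclaim_if_pressured(device: torch.device, high_watermark: float = 0.85) -> None:
    """After a layer's offload copies are queued, wait for them and only flush
    the CUDA caching allocator when reserved memory nears the device limit."""
    # Wait for async .to("cpu", non_blocking=True) offload copies to complete;
    # otherwise the GPU tensors are still alive and nothing can be reclaimed.
    torch.cuda.synchronize(device)

    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)
    if reserved / total >= high_watermark:
        torch.cuda.empty_cache()
```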

@avtc Check my latest PR merge; I tagged you for the VRAM auto_gc update. It should be fixed but needs tuning. #1934
Data looks good and as expected. The offload is working, but during model save it is too stupid and is not directly using the ...

@Qubitium On the latest main the CUDA OOM happens on layer 6 (with 4 x 3090, 4-bit, 1 sample).

@Qubitium I have moved ...

@avtc You are right about the placement improving the situation for GLM 4.5 Air MoE. I am just trying to think of a way that fixes all situations. There are models with 4x more MoE experts that will surely OOM well before the entire layer's ...

So the symptoms are correlated.

@avtc Reverting the previous threading fix for multi-gpu, as I believe the recent thread code I pushed on main has nullified this issue at the source. Please test. Also, massive CPU memory usage has been eliminated, including the elimination of the separate packing pass, as pack is now done inline with quant.

Second fix: #1923
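
To illustrate (not the project's actual code) why inline packing lowers peak CPU memory: each module's unpacked quantization intermediates are dropped as soon as they are packed, instead of all of them being held for a second packing pass. The callback names below are hypothetical placeholders:

```python
from typing import Callable, Dict, Tuple

import torch

def quantize_and_pack_inline(
    modules: Dict[str, torch.nn.Module],
    quantize_fn: Callable[[torch.nn.Module], Tuple[torch.Tensor, torch.Tensor]],
    pack_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
) -> Dict[str, torch.Tensor]:
    """Pack each module right after it is quantized, so only one module's
    unpacked intermediates are alive in CPU memory at any time."""
    packed: Dict[str, torch.Tensor] = {}
    for name, module in modules.items():
        qweight, scales = quantize_fn(module)    # hypothetical quantizer callback
        packed[name] = pack_fn(qweight, scales)  # hypothetical packer callback
        del qweight, scales                      # intermediates freed per module
    return packed
```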