New Features:
Fine-tuning, one-shot, and general compression techniques now support large language models built on top of Hugging Face Transformers, including full FSDP support and model stages for transitioning between training and post-training pathways. (#1834, #1891, #1907, #1902, #1940, #1939, #1897, #1912)
SparseML eval pathways have been added, with plugins for perplexity and lm-eval-harness specifically targeting large language model support. (#1834)
AutoModel for causal language models, including quantized and sparse quantized support, has been added.
Resolved Issues:
KV-cache injections now work correctly with MPT models in DeepSparse and SparseML; previously, export crashed for MPT models. (#1801)
SmoothQuant has been updated with proper device forwarding; previously it would not work in FSDP setups and would crash. (#1830)
OBCQ stability has been improved by increasing nsamples to 512, making correct convergence more likely. (#1812)
NaN values that could occur during SmoothQuant computation have been resolved. (#1872)
A TypeError raised by OBCQ when no sequence_length was provided has been resolved. (#1899)
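As background for the perplexity eval plugin above, a minimal sketch of the metric itself (not SparseML's implementation): perplexity is the exponential of the mean per-token negative log-likelihood.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning uniform probability over a 4-symbol vocabulary
# has a per-token NLL of log(4), giving a perplexity of exactly 4.
nlls = [math.log(4)] * 8
print(round(perplexity(nlls), 6))
```

Lower perplexity indicates the model assigns higher probability to the evaluated text, which is why it is a common regression check after compression.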
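For context on the SmoothQuant NaN fix: SmoothQuant migrates quantization difficulty from activations to weights using per-channel scales s = max|X|^α / max|W|^(1−α), so a zero channel maximum turns the division into 0/0 or x/0 and produces NaN or inf. A minimal sketch of guarding against that, with an assumed eps clamp (illustrative only, not SparseML's actual implementation):

```python
import math

def smoothquant_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-8):
    """Per-channel SmoothQuant scales: s = act^alpha / weight^(1 - alpha).

    Clamping both channel maxima away from zero avoids the 0/0 and x/0
    cases that would otherwise yield NaN or inf scales.
    """
    scales = []
    for a, w in zip(act_absmax, weight_absmax):
        a = max(a, eps)
        w = max(w, eps)
        scales.append(a ** alpha / w ** (1.0 - alpha))
    return scales

# Channel 1 has zero activation and weight maxima; scales stay finite.
scales = smoothquant_scales([4.0, 0.0], [1.0, 0.0])
print(scales[0])  # 4**0.5 / 1**0.5 == 2.0
```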
Known Issues:
Memory usage is currently high for one-shot and fine-tuning algorithms on LLMs, requiring GPUs with more memory for model sizes 7B and above.
Memory usage is currently high for export pathways for LLMs, requiring large amounts of CPU RAM (>150 GB) to successfully export model sizes 7B and above.
Currently, exporting models quantized through FSDP pathways fails when reloading the model from disk. The workaround is to perform quantization on a single GPU rather than multiple GPUs. A hotfix is forthcoming.
Currently, multi-stage pipelines that include quantization and run through FSDP will fail after the training stage completes, on initialization of the SparseGPT quantization stage. This is caused by the FSDP state not being propagated correctly. The workaround is to restart the run from the saved checkpoint after training and pruning are finished. A hotfix is forthcoming.