New Features:
Fine-tuning, one-shot, and general compression techniques now support large language models built on top of Hugging Face Transformers, including full FSDP support and model stages for transitioning between training and post-training pathways. (#1834, #1891, #1907, #1902, #1940, #1939, #1897, #1912)
SparseML eval pathways have been added, with plugins for perplexity and lm-eval-harness specifically targeting large language model support. (#1834)
AutoModel for causal language models, including quantized and sparse quantized support, has been added.
Resolved Issues:
KV-cache injections now work correctly with MPT models in DeepSparse and SparseML; previously, export crashed for MPT models. (#1801)
SmoothQuant has been updated with proper device forwarding; previously it would not work in FSDP setups and would crash. (#1830)
OBCQ stability has been improved by increasing nsamples to 512, making correct convergence more likely. (#1812)
NaN values that could occur during SmoothQuant computation have been resolved. (#1872)
A TypeError raised by OBCQ when no sequence_length was provided has been resolved. (#1899)
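As background for the perplexity eval plugin above, a minimal sketch of the metric itself (not SparseML's implementation): perplexity is the exponential of the mean per-token negative log-likelihood.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning uniform probability over a 4-symbol vocabulary
# has a per-token NLL of log(4), giving a perplexity of exactly 4.
nlls = [math.log(4)] * 8
print(round(perplexity(nlls), 6))
```

Lower perplexity indicates the model assigns higher probability to the evaluated text, which is why it is a common regression check after compression.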
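For context on the SmoothQuant NaN fix: SmoothQuant migrates quantization difficulty from activations to weights using per-channel scales s = max|X|^α / max|W|^(1−α), so a zero channel maximum turns the division into 0/0 or x/0 and produces NaN or inf. A minimal sketch of guarding against that, with an assumed eps clamp (illustrative only, not SparseML's actual implementation):

```python
import math

def smoothquant_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-8):
    """Per-channel SmoothQuant scales: s = act^alpha / weight^(1 - alpha).

    Clamping both channel maxima away from zero avoids the 0/0 and x/0
    cases that would otherwise yield NaN or inf scales.
    """
    scales = []
    for a, w in zip(act_absmax, weight_absmax):
        a = max(a, eps)
        w = max(w, eps)
        scales.append(a ** alpha / w ** (1.0 - alpha))
    return scales

# Channel 1 has zero activation and weight maxima; scales stay finite.
scales = smoothquant_scales([4.0, 0.0], [1.0, 0.0])
print(scales[0])  # 4**0.5 / 1**0.5 == 2.0
```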
Known Issues:
Memory usage is currently high for one-shot and fine-tuning algorithms on LLMs, requiring GPUs with more memory for model sizes 7B and above.
Memory usage is currently high for export pathways for LLMs, requiring large amounts of CPU RAM (>150 GB) to successfully export model sizes 7B and above.
Currently, exporting models quantized through FSDP pathways fails when reloading the model from disk. The workaround is to perform quantization on a single GPU rather than multiple GPUs. A hotfix is forthcoming.
Currently, multi-stage pipelines that include quantization and run through FSDP will fail after the training stage completes, on initialization of the SparseGPT quantization stage. This is caused by the FSDP state not being propagated correctly. The workaround is to restart the run from the saved checkpoint after training and pruning are finished. A hotfix is forthcoming.