Lawrence Liu1, Alexander Liu2, Mengdi Wang3, Tuo Zhao4, Lin F. Yang1
1 UCLA, 2 Independent, 3 Princeton University, 4 Georgia Tech
ARMOR was accepted to ICLR 2026!
This is the official implementation of ARMOR, a new 2:4 semi-structured pruning method that leverages adaptive matrix factorization. ARMOR significantly outperforms existing 2:4 semi-structured pruning methods such as Wanda and SparseGPT. Sample results for Qwen2.5 7B/14B/32B/72B are shown below; more results can be found in our paper.
- Python 3.13.2+
- Miniconda/Anaconda
- CUDA
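Since the exact Python version matters here, the sketch below shows one way to check that the active interpreter meets the 3.13.2+ requirement before installing anything. The `version_ge` helper is hypothetical and not part of the repo's scripts.

```shell
# Hypothetical helper (not part of the repo): check the active Python
# against the 3.13.2+ requirement before installing dependencies.
version_ge() {
  # True when version $1 >= version $2 (natural version ordering via sort -V)
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Works with either `python3` or `python` on PATH
CURRENT=$( (python3 --version 2>/dev/null || python --version 2>/dev/null) | awk '{print $2}')
if version_ge "$CURRENT" "3.13.2"; then
  echo "Python $CURRENT satisfies the requirement"
else
  echo "Python $CURRENT is too old; need 3.13.2+" >&2
fi
```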
We use two separate virtual environments: a main environment `ARMOR_main` for pruning and perplexity evaluation, and an environment `ARMOR_eval` for lm-evaluation-harness based evaluation. This is because there are dependency conflicts between hydra and certain tasks in lm-evaluation-harness, such as math.
```shell
# Clone the repository
git clone git@github.com:LawrenceRLiu/ARMOR.git
cd ARMOR

# ARMOR_main must be installed via pip due to dependency conflicts
conda create -n ARMOR_main python=3.13.2 -y
conda activate ARMOR_main
python -m pip install -r requirements_main.txt

# ARMOR_eval can be created from conda
conda env create -f eval_env.yml
```

We have included automated scripts to generate the calibration data and run the pruning and evaluation.
```shell
# llama-2 compression
scripts/replicate/compress_llama2.bash <model_name> <gpus>

# llama-3 compression
scripts/replicate/compress_llama3.bash <model_name> <gpus>

# Qwen2.5/Qwen3 compression
scripts/replicate/compress_qwen.bash <model_name> <gpus> [num_processes]
```

The Qwen script takes an optional argument num_processes that specifies how many processes to use for data-parallel evaluation. If not specified, it uses one process per available GPU. However, larger (70B+) models require both data parallelism and model parallelism, so you may need to set num_processes lower than the number of available GPUs.
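The default described above can be sketched as follows. The variable names are illustrative, not the actual internals of the repo's scripts: with no override, the process count falls back to the number of GPU ids in the comma-separated list.

```shell
# Illustrative sketch (hypothetical variable names, not the repo's script
# internals): one evaluation process per GPU id unless num_processes is given.
GPUS="0,1,2,3,4,5,6,7"   # example <gpus> argument
NUM_PROCESSES=""         # example [num_processes]; empty means "not specified"

if [ -z "$NUM_PROCESSES" ]; then
  # Default: count the comma-separated GPU ids
  NUM_PROCESSES=$(echo "$GPUS" | tr ',' '\n' | wc -l | tr -d ' ')
fi
echo "data-parallel processes: $NUM_PROCESSES"
# For a 70B+ model you might instead pass num_processes=2 so that each
# process shards the model across 4 of the 8 GPUs (model parallelism).
```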
We have also included a modified version of Wanda in the `wanda-main` folder. We modified the original Wanda codebase to work with the updated transformers and datasets libraries, and have included an automated script to run and evaluate Wanda and SparseGPT pruning on Qwen models.
```shell
# Qwen2.5/Qwen3 compression with Wanda or SparseGPT
scripts/baselines/qwen_compress.bash <model_name> <wanda/sparsegpt> [num_processes]
```

Since this repository was built on top of the NoWag repository, we have also included a script to run and evaluate NoWag pruning on Qwen models.
```shell
# Qwen2.5/Qwen3 compression with NoWag
scripts/baselines/NoWag_P.bash <model_name> [num_processes]
```

For questions or issues, please contact lawrencerliu@ucla.edu.

