diff --git a/configs/sparsification/methods/Holitom/holitom.yml b/configs/sparsification/methods/Holitom/holitom.yml
new file mode 100644
index 000000000..344c705f5
--- /dev/null
+++ b/configs/sparsification/methods/Holitom/holitom.yml
@@ -0,0 +1,26 @@
+base:
+    seed: &seed 42
+model:
+    type: Llava OneVision
+    path: model path
+    torch_dtype: auto
+eval:
+    eval_pos: [pretrain, transformed]
+    type: vqa
+    name: [mme]
+    download: False
+    path: MME dataset path
+    bs: 1
+    inference_per_block: False
+sparse:
+    method: TokenReduction
+    special:
+        method: HoliTom
+        RETAIN_RATIO: 0.20
+        T: 0.65
+        HOLITOM_k: 18
+        HOLITOM_r: 0.5
+save:
+    save_trans: False
+    save_fake: False
+    save_path: /path/to/save/
diff --git a/docs/en/source/advanced/token_reduction.md b/docs/en/source/advanced/token_reduction.md
new file mode 100644
index 000000000..68825284f
--- /dev/null
+++ b/docs/en/source/advanced/token_reduction.md
@@ -0,0 +1,72 @@
+
+
+# Token Reduction
+
+LightCompress currently supports token reduction for mainstream multimodal large language models. Configuration is very simple and plug-and-play.
+
+Here is an example configuration:
+
+```yaml
+base:
+    seed: &seed 42
+model:
+    type: Llava
+    path: model path
+    torch_dtype: auto
+eval:
+    eval_pos: [pretrain, transformed]
+    type: vqa
+    name: [gqa, mmbench_en_dev, mme]
+    bs: 1
+    inference_per_block: False
+sparse:
+    method: TokenReduction
+    special:
+        method: FastV
+        pruning_loc: 3
+        rate: 0.778
+save:
+    save_trans: False
+    save_fake: False
+    save_path: /path/to/save/
+```
+
+The configuration file contains three core sections:
+
+1. **`model`**
+   For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, LLaVA OneVision, and more; these models cover both image and video tasks. For the detailed list of supported models, see [this file](https://github.com/ModelTC/LightCompress/blob/main/llmc/models/__init__.py). LightCompress will support more models in the future.
+
+2. **`eval`**
+   For the `eval_pos` parameter:
+   - `pretrain` denotes the original model that keeps all visual tokens.
+   - `transformed` denotes the model with token reduction applied.
+   LightCompress integrates [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate various downstream datasets. Set `type` to `vqa`, and specify the datasets in `name` following the naming conventions in the [lmms-eval documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md).
+
+3. **`sparse`**
+   Set `method` to `TokenReduction` first, and then specify the concrete algorithm and its hyperparameters under `special`. Since each algorithm has different hyperparameters, refer to the [configuration files](https://github.com/ModelTC/LightCompress/tree/main/configs/sparsification/methods) for details.
+
+## Combining with Quantization
+
+LightCompress also supports an extreme compression scheme that combines token reduction with quantization. First, choose a quantization algorithm and save a `fake_quant` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.
+
+```yaml
+quant:
+    method: RTN
+    weight:
+        bit: 4
+        symmetric: False
+        granularity: per_group
+        group_size: 128
+    special:
+        actorder: True
+        static_groups: True
+        percdamp: 0.01
+        blocksize: 128
+        true_sequential: True
+        quant_out: True
+    token_reduction:
+        method: FastV
+        special:
+            pruning_loc: 3
+            rate: 0.778
+```
\ No newline at end of file
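The "Combining with Quantization" section above shows only the second stage of the two-stage flow. As a rough sketch of what the first stage (producing the `fake_quant` model) could look like, the config below reuses the top-level layout and RTN weight settings from the examples in this file; `save_fake: True` and the save path are illustrative assumptions, not verified values.

```yaml
# Hypothetical stage-one config: run RTN weight quantization and save a
# fake-quant checkpoint for a later token-reduction pass.
# `save_fake: True` and the paths are assumptions for illustration.
model:
    type: Llava
    path: model path
    torch_dtype: auto
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
save:
    save_fake: True
    save_path: /path/to/fake_quant_model/
```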
diff --git a/docs/en/source/configs.md b/docs/en/source/configs.md
index eaf3baaff..b7446df92 100644
--- a/docs/en/source/configs.md
+++ b/docs/en/source/configs.md
@@ -360,6 +360,26 @@ quant:
     static: True
 ```
 
+## sparse
+
+    sparse.method
+
+The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.
+
+It’s worth noting that for model sparsification, you need to specify the exact algorithm name, whereas for token reduction, you only need to set it to `TokenReduction` first, and then specify the exact algorithm under `special`.
+
+```yaml
+sparse:
+    method: Wanda
+```
+
+```yaml
+sparse:
+    method: TokenReduction
+    special:
+        method: FastV
+```
+
 ## save
 
 save.save_vllm
diff --git a/docs/zh_cn/source/advanced/token_reduction.md b/docs/zh_cn/source/advanced/token_reduction.md
new file mode 100644
index 000000000..be12c6ad6
--- /dev/null
+++ b/docs/zh_cn/source/advanced/token_reduction.md
@@ -0,0 +1,68 @@
+# Token Reduction
+
+LightCompress currently supports token reduction for mainstream multimodal large language models. Configuration is very simple and plug-and-play.
+
+Here is an example configuration:
+
+```yaml
+base:
+    seed: &seed 42
+model:
+    type: Llava
+    path: model path
+    torch_dtype: auto
+eval:
+    eval_pos: [pretrain, transformed]
+    type: vqa
+    name: [gqa, mmbench_en_dev, mme]
+    bs: 1
+    inference_per_block: False
+sparse:
+    method: TokenReduction
+    special:
+        method: FastV
+        pruning_loc: 3
+        rate: 0.778
+save:
+    save_trans: False
+    save_fake: False
+    save_path: /path/to/save/
+```
+
+The configuration file contains three core sections:
+
+1. `model`
+For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, LLaVA OneVision, and more; these models cover both image and video tasks. The detailed list of supported models can be found in [this file](https://github.com/ModelTC/LightCompress/blob/main/llmc/models/__init__.py), and LightCompress will support more models in the future.
+
+2. `eval`
+For the `eval_pos` parameter, `pretrain` denotes the original model that keeps all visual tokens, while `transformed` denotes the model with the corresponding token-reduction algorithm applied. LightCompress integrates [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate various downstream datasets: set `type` to `vqa`, and name the datasets in `name` following the conventions in the lmms-eval [documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md).
+
+3. `sparse`
+`method` must first be set to `TokenReduction`; the concrete algorithm and its hyperparameters are then specified under `special`. Since each algorithm has different hyperparameters, refer to the [configuration files](https://github.com/ModelTC/LightCompress/tree/main/configs/sparsification/methods) for details.
+
+
+## Combining with Quantization
+
+LightCompress also supports an extreme compression scheme that uses token reduction and quantization together. First, choose a quantization algorithm and save a `fake_quant` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.
+
+```yaml
+quant:
+    method: RTN
+    weight:
+        bit: 4
+        symmetric: False
+        granularity: per_group
+        group_size: 128
+    special:
+        actorder: True
+        static_groups: True
+        percdamp: 0.01
+        blocksize: 128
+        true_sequential: True
+        quant_out: True
+    token_reduction:
+        method: FastV
+        special:
+            pruning_loc: 3
+            rate: 0.778
+```
\ No newline at end of file
diff --git a/docs/zh_cn/source/configs.md b/docs/zh_cn/source/configs.md
index 9fbf1382d..6d61b532d 100644
--- a/docs/zh_cn/source/configs.md
+++ b/docs/zh_cn/source/configs.md
@@ -401,6 +401,26 @@ quant:
     granularity: per_token
 ```
 
+## sparse
+
+    sparse.method
+
+The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.
+
+It’s worth noting that model sparsification requires the exact algorithm name, whereas token reduction only needs `method` set to `TokenReduction` first, with the concrete algorithm then specified under `special`.
+
+```yaml
+sparse:
+    method: Wanda
+```
+
+```yaml
+sparse:
+    method: TokenReduction
+    special:
+        method: FastV
+```
+
 ## save
 
 save.save_vllm
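In the combined scheme described in the "Combining with Quantization" sections above, the second stage loads the saved `fake_quant` model before applying token reduction. A minimal sketch follows, assuming `model.path` simply points at the checkpoint written in the first stage; the path and model type are placeholders rather than verified values.

```yaml
# Hypothetical stage-two config fragment: model.path points at the
# fake-quant checkpoint from stage one, and token_reduction is added
# under quant as the docs above describe. Values are placeholders.
model:
    type: Llava
    path: /path/to/fake_quant_model/
    torch_dtype: auto
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```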