26 changes: 26 additions & 0 deletions configs/sparsification/methods/Holitom/holitom.yml
@@ -0,0 +1,26 @@
base:
    seed: &seed 42
model:
    type: Llava OneVision
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [mme]
    download: False
    path: MME dataset path
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: HoliTom
        RETAIN_RATIO: 0.20
        T: 0.65
        HOLITOM_k: 18
        HOLITOM_r: 0.5
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
72 changes: 72 additions & 0 deletions docs/en/source/advanced/token_reduction.md
@@ -0,0 +1,72 @@


# Token Reduction

LightCompress currently supports token reduction for mainstream multimodal large language models. Configuration is simple and plug-and-play.

Here is an example configuration:

```yaml
base:
    seed: &seed 42
model:
    type: Llava
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [gqa, mmbench_en_dev, mme]
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: FastV
        pruning_loc: 3
        rate: 0.778
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```
The configuration file contains three core sections:
1. **`model`**
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, and LLaVA OneVision, etc. These models cover both image and video tasks. For the detailed list of supported models, see the file. LightCompress will support more models in the future.

2. **`eval`**
For the `eval_pos` parameter:
- `pretrain` denotes the original model that keeps all visual tokens.
- `transformed` denotes the model with token reduction applied.
LightCompress integrates lmms-eval to evaluate various downstream datasets. Set `type` to `vqa`, and specify the datasets in `name` following the naming conventions in the lmms-eval documentation.

3. **`sparse`**
Set `method` to `TokenReduction` first, and then specify the concrete algorithm and related hyperparameters under `special`. Since each algorithm has different hyperparameters, refer to the configuration files for details.
Comment on lines +36 to +46

Severity: medium

This section is missing a few helpful hyperlinks that are present in the Chinese version of the documentation. Adding them would improve the user experience by making it easier to navigate to related resources.

Suggested change
1. **`model`**
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, and LLaVA OneVision, etc. These models cover both image and video tasks. For the detailed list of supported models, see the file. LightCompress will support more models in the future.
2. **`eval`**
For the `eval_pos` parameter:
- `pretrain` denotes the original model that keeps all visual tokens.
- `transformed` denotes the model with token reduction applied.
LightCompress integrates lmms-eval to evaluate various downstream datasets. Set `type` to `vqa`, and specify the datasets in `name` following the naming conventions in the lmms-eval documentation.
3. **`sparse`**
Set `method` to `TokenReduction` first, and then specify the concrete algorithm and related hyperparameters under `special`. Since each algorithm has different hyperparameters, refer to the configuration files for details.
1. **`model`**
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, and LLaVA OneVision, etc. These models cover both image and video tasks. For the detailed list of supported models, see [the file](https://github.com/ModelTC/LightCompress/blob/main/llmc/models/__init__.py). LightCompress will support more models in the future.
2. **`eval`**
For the `eval_pos` parameter:
- `pretrain` denotes the original model that keeps all visual tokens.
- `transformed` denotes the model with token reduction applied.
LightCompress integrates lmms-eval to evaluate various downstream datasets. Set `type` to `vqa`, and specify the datasets in `name` following the naming conventions in the [lmms-eval documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md).
3. **`sparse`**
Set `method` to `TokenReduction` first, and then specify the concrete algorithm and related hyperparameters under `special`. Since each algorithm has different hyperparameters, refer to the [configuration files](https://github.com/ModelTC/LightCompress/tree/main/configs/sparsification/methods) for details.
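
As a concrete illustration, the HoliTom configuration added in this PR (`configs/sparsification/methods/Holitom/holitom.yml`) carries its own set of hyperparameters under `special`. The values below are copied from that file; the inline comments are a reading of the parameter names rather than documented semantics, so treat them as assumptions:

```yaml
sparse:
    method: TokenReduction
    special:
        method: HoliTom
        RETAIN_RATIO: 0.20   # presumably the fraction of visual tokens retained
        T: 0.65              # presumably a similarity threshold for token merging
        HOLITOM_k: 18        # presumably the decoder layer where pruning begins
        HOLITOM_r: 0.5       # presumably the pruning ratio applied at that layer
```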


## Combining with Quantization

LightCompress also supports an extreme compression scheme that combines token reduction with quantization. First, choose a quantization algorithm to save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.

Severity: medium

There's a typo in fake_qunat. It should be fake_quant.

Suggested change
LightCompress also supports an extreme compression scheme that combines token reduction with quantization. First, choose a quantization algorithm to save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.
LightCompress also supports an extreme compression scheme that combines token reduction with quantization. First, choose a quantization algorithm to save a `fake_quant` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.


```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
    quant_out: True
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```
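
The `token_reduction` block reuses the same `method`/`special` layout as the standalone `sparse` configuration. For the first stage, a minimal sketch of a config that produces the fake-quant checkpoint might look like the following, assuming that `save_fake: True` is what triggers the export (inferred from the `save` blocks in the configs above; the quantization docs are authoritative):

```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
save:
    save_fake: True            # assumed switch for exporting the fake-quant model
    save_path: /path/to/save/
```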
20 changes: 20 additions & 0 deletions docs/en/source/configs.md
@@ -360,6 +360,26 @@ quant:
static: True
```

## sparse

<font color=792ee5> sparse.method </font>

The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.

Severity: medium

There's a typo in the file extension in the link to sparsification/__init__.py. It should be .py, not .pyn.

Suggested change
The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.
The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.


It’s worth noting that for model sparsification, you need to specify the exact algorithm name, whereas for token reduction, you only need to set it to `TokenReduction` first, and then specify the exact algorithm under `special`.

```yaml
sparse:
    method: Wanda
```

```yaml
sparse:
    method: TokenReduction
    special:
        method: FastV
```

## save

<font color=792ee5> save.save_vllm</font>
68 changes: 68 additions & 0 deletions docs/zh_cn/source/advanced/token_reduction.md
@@ -0,0 +1,68 @@
# Token Reduction

LightCompress currently supports token reduction for mainstream multimodal large language models. Configuration is simple and plug-and-play.

Here is an example configuration:

```yaml
base:
    seed: &seed 42
model:
    type: Llava
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [gqa, mmbench_en_dev, mme]
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: FastV
        pruning_loc: 3
        rate: 0.778
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```
The configuration file contains three core sections:
1. `model`
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, LLaVA OneVision, and others; these models cover both image and video tasks. The detailed list of supported models can be found in [this file](https://github.com/ModelTC/LightCompress/blob/main/llmc/models/__init__.py), and LightCompress will support more models in the future.

2. `eval`
For the `eval_pos` parameter, `pretrain` denotes the original model that keeps all visual tokens, while `transformed` denotes the model after the chosen token reduction algorithm is applied. LightCompress integrates [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate various downstream datasets: set `type` to `vqa`, and list the evaluation datasets in `name` following the naming conventions in the lmms-eval [documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md).

3. `sparse`
`method` must first be set to `TokenReduction`; the concrete algorithm and its hyperparameters are then specified under `special`. Since each algorithm has different hyperparameters, refer to the [configuration files](https://github.com/ModelTC/LightCompress/tree/main/configs/sparsification/methods) for details.


## Combining with Quantization

LightCompress also supports an extreme compression scheme that applies token reduction and quantization together. First, choose a quantization algorithm and save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.

Severity: medium

There's a typo in fake_qunat. It should be fake_quant.

Suggested change
LightCompress also supports an extreme compression scheme that applies token reduction and quantization together. First, choose a quantization algorithm and save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.
LightCompress also supports an extreme compression scheme that applies token reduction and quantization together. First, choose a quantization algorithm and save a `fake_quant` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.


```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
    quant_out: True
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```
20 changes: 20 additions & 0 deletions docs/zh_cn/source/configs.md
@@ -401,6 +401,26 @@ quant:
granularity: per_token
```

## sparse

<font color=792ee5> sparse.method </font>

The name of the sparsification algorithm to use. This covers both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in these files.

Severity: medium

There's a typo in the file extension in the link to sparsification/__init__.py. It should be .py, not .pyn.

Suggested change
The name of the sparsification algorithm to use. This covers both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in these files.
The name of the sparsification algorithm to use. This covers both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in these files.


Note that for model sparsification you need to specify the concrete algorithm name, whereas for token reduction you first set the method to `TokenReduction` and then specify the concrete algorithm under `special`.

```yaml
sparse:
    method: Wanda
```

```yaml
sparse:
    method: TokenReduction
    special:
        method: FastV
```

## save

<font color=792ee5> save.save_vllm </font>