
[Features] Support multi_modal training #628

Merged: 14 commits into main, Sep 6, 2023

Conversation

@lianqing11 (Collaborator)

  1. Update the multi-modal dataset
  2. Support multi-modal training (i.e., LLaVA)

@research4pan (Contributor) left a comment:

Thanks! Training support for image-text datasets is a very important feature for multimodal models. This PR involves a large amount of work, and I would like to thank @lianqing11 for this amazing contribution 👍 The major changes required before merging into main are highlighted in the following comments.

examples/finetune_multi_modal.py

  • [Style] line 42: Extra tab instead of spaces.

src/lmflow/datasets/dataset.py

  • ⚠️ [Architecture] line 116: normally a backend denotes a way of implementing datasets in general (covering different dataset types), not a specific implementation for one particular type. We can refactor this in later commits so that custom_multi_modal becomes an implementation that only supports image2text, while other implementations simply do not support this dataset type.
  • [Architecture] line 123: to ensure forward compatibility, return len(self.backend_dataset) should only be executed for backend == huggingface and backend == custom_multi_modal; raise NotImplementedError for other backends (a sketch follows this list).
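
A minimal sketch of the suggested guard (the method body is illustrative, not the actual LMFlow code):

def __len__(self):
    # Only backends known to wrap an indexable dataset report a length;
    # any new backend must opt in explicitly rather than silently
    # inheriting this behavior.
    if self.backend in ("huggingface", "custom_multi_modal"):
        return len(self.backend_dataset)
    raise NotImplementedError(
        f'__len__ is not implemented for backend "{self.backend}"'
    )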

src/lmflow/datasets/llava_constants.py

  • [Architecture] Constant definitions go in src/lmflow/utils, e.g. src/lmflow/utils/constants.py.

src/lmflow/datasets/llava_conversation_lib.py

  • [Architecture] Internal libs also go to src/lmflow/utils.
  • [Style] line 1: add a shebang line and encoding declaration, plus license information if necessary (see the header sketch below).
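
For reference, a header of the kind requested (the copyright line is a placeholder, not the project's actual license text):

#!/usr/bin/env python
# coding=utf-8
# Copyright <year> <copyright holder>. Licensed under <license>.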

src/lmflow/datasets/multi_modal_datasets.py

  • [Style] line 18: use an absolute import from lmflow.datasets.llava_constants instead of a relative import, as instructed by the Google coding style (example below).
  • [Style] line 117: typo? tokenizer_image_token -> tokenize_image_token
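
For example (the imported name is illustrative):

# Preferred: absolute import, per the Google Python style guide
from lmflow.datasets.llava_constants import DEFAULT_IMAGE_TOKEN

# Avoid: relative import
# from .llava_constants import DEFAULT_IMAGE_TOKEN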

src/lmflow/models/hf_encoder_decoder_model.py

  • [Style] line 220-222: can use parentheses to avoid the backslash:

self.tokenizer.tokenizer = AutoTokenizer.from_pretrained(
    model_args.llm_model_name_or_path
)
  • 🚫 [Bug] line 236-240: these lines have to be restored to ensure the normal functionality of CPU chatbots.

src/lmflow/models/vision2seq_model.py

src/lmflow/models/vision_encoder/__init__.py

  • [Style] line 1: Absolute import is preferred.

src/lmflow/models/vision_encoder/clip_encoder.py

  • [Style] line 1: add a shebang line and encoding declaration, plus license information if necessary (as noted above).
  • [Style] line 96-222: keep each line at most 80 characters, and prefer parentheses over backslashes for line wraps.

src/lmflow/pipeline/finetuner.py

  • ⚠️ [Question] line 227-228: Normally datasets are understood as objects that store the data, while tokenization is performed by the tokenizer in the model. Registering the tokenizer in the dataset seems counterintuitive in terms of architecture design. The same problem applies to lmflow.datasets.multi_modal_dataset.tokenizer_image_token. It is recommended that lmflow.models.vision2seq_model handle this part (a sketch of the suggested split follows).
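
A minimal sketch of the suggested split (class and method names are illustrative, not the actual LMFlow API): the dataset only stores raw records, while the model owns the tokenizer and performs all preprocessing.

class MultiModalDataset:
    # Plain data container: no tokenizer registered here.
    def __init__(self, records):
        self.records = records  # raw {"text": ..., "image": ...} entries

    def __getitem__(self, idx):
        return self.records[idx]


class Vision2SeqModel:
    # Owns the tokenizer and image processor, so all preprocessing
    # (including image-token handling) stays on the model side.
    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def preprocess(self, record):
        input_ids = self.tokenizer(record["text"]).input_ids
        pixel_values = self.image_processor(record["image"])
        return {"input_ids": input_ids, "pixel_values": pixel_values}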

@lianqing11 (Collaborator, Author) replied:

The reason for checking whether to use deepspeed is that loading the model in 8-bit while using deepspeed raises an error:
huggingface/transformers#24540
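
For context, the shape of the check (argument and attribute names are assumptions, not the exact LMFlow code):

from transformers import AutoModelForCausalLM

# Loading in 8-bit and initializing through deepspeed conflict
# (see huggingface/transformers#24540), so the deepspeed path is
# skipped when 8-bit loading is requested.
if model_args.use_int8:
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        load_in_8bit=True,
        device_map="auto",
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
    )
    # ... deepspeed initialization would happen on this path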

@research4pan (Contributor) replied:
No problem. Thanks!

@research4pan (Contributor) left a comment:

This follow-up PR mainly addresses inference of finetuned multimodal models. Several problems need to be fixed before it can be merged into main. It is recommended to fix and merge the previous PR first, before proceeding to the issues in this follow-up PR; otherwise, a huge amount of technical debt may be incurred.

examples/vis_chatbot.py

  • [Style] line 156: A better argument name for this usage in the vision chatbot would be chatbot_args.multimodal_backend or chatbot_args.backend, since the llava or minigpt4 implementation affects not only prompt formats but other aspects as well (illustrated below).
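
For illustration, the renamed argument could look like this (field name and default are assumptions):

from dataclasses import dataclass, field

@dataclass
class VisChatbotArguments:
    multimodal_backend: str = field(
        default="minigpt4",
        metadata={
            "help": "Multimodal implementation to use, e.g. 'llava' or "
                    "'minigpt4'; this affects prompt formats and more."
        },
    )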

scripts/run_finetune_multi_modal_stage1.sh

  • 🚫 [Bug] line 11: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 9. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.

scripts/run_finetune_multi_modal_stage2.sh

  • 🚫 [Bug] line 8: hardwired path. Better to put this data on lmflow.org:5000 and add the corresponding download logic here. Specifically, modify data/download.sh and add the download command as in line 32 of scripts/run_vis_chatbot_gradio_minigpt4.sh.
  • 🚫 [Bug] line 9: hardwired path. Better to put the data on lmflow.org:5000 (if necessary) and add the corresponding download logic here, or download from official sites (if available).
  • 🚫 [Bug] line 11: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 9. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.
  • 🚫 [Bug] line 52: hardwired path. Better to put the pretrained projection on lmflow.org:5000 and add the corresponding download logic before running the python script.
  • [Style] line 77: incomplete error file name.

scripts/run_vis_chatbot_llava.sh

  • 🚫 [Bug] line 1: hardwired path. Better to put the pretrained checkpoints on lmflow.org:5000 and add the corresponding download logic here.
  • 🚫 [Bug] line 11: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 9. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.

scripts/run_vis_chatbot_minigpt4.sh

  • 🚫 [Bug] line 1: hardwired path. Better to put the pretrained checkpoints on lmflow.org:5000 and add the corresponding download logic here.
  • 🚫 [Bug] line 3: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 8. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.
  • [Style] line 6: line wrapping is recommended.

src/lmflow/datasets/multi_modal_dataset.py

  • ⚠️ [Feature] The format here is too complex for common users. Documentation for this data format is recommended (a sample record is sketched after this list).
  • [Question] line 72-83: What is the major difference between llava_plain and llava_v1? Informative comments in the code are encouraged.
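
As a starting point for such documentation, a hypothetical record in the LLaVA-style conversation format (field names follow the upstream LLaVA convention; the exact LMFlow schema may differ):

example_record = {
    "id": "000000000001",
    "image": "coco/train2017/000000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
        {"from": "gpt", "value": "A dog playing in a park."},
    ],
}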

src/lmflow/models/hf_encoder_decoder_model.py

  • ⚠️ [Architecture] line 117: The mode name "finetune" is very confusing, as later contributors may mix it up with the "normal" mode, which is the real finetuning/training. This may lead to bugs in later contributions, so it is strongly recommended to use another name, such as "freeze", or to keep the name "none", since there is no explicit difference between the "finetune" and "none" modes in this follow-up PR.
  • [Style] line 194: remove useless comments.
  • [Style] line 208-210: can use parentheses to handle this type of line wrap.

src/lmflow/models/utils.py

  • [Architecture] Move this file to src/lmflow/utils/multimodal.py.

src/lmflow/pipeline/finetuner.py

  • [Style] Remove useless comments.

src/lmflow/pipeline/inferencer.py

  • [Style] line 179: the right parenthesis can go on the next line to improve readability (see below).
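
That is, something along these lines (names are placeholders):

# Harder to scan: the closing parenthesis is buried after the last argument
outputs = inferencer.inference(model, dataset,
    max_new_tokens=args.max_new_tokens)

# Easier to scan: the closing parenthesis goes on its own line
outputs = inferencer.inference(
    model,
    dataset,
    max_new_tokens=args.max_new_tokens,
)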

@lianqing11 (Collaborator, Author) replied:

src/lmflow/datasets/llava_conversation_lib.py

I have fixed the typo. I'm not quite sure about the license, so it is still missing from the file.

Regarding the tokenization in the dataset: it puts the image flag into input_ids and preprocesses the image. We can consider how to refactor this.
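
A simplified sketch of that mechanism (based on the upstream LLaVA helper; the LMFlow version may differ): the prompt is split on the "<image>" placeholder, and a sentinel token id is spliced between the tokenized chunks, to be replaced by image features at forward time.

IMAGE_TOKEN_INDEX = -200  # sentinel id, swapped for image features later

def tokenize_with_image_flag(prompt, tokenizer):
    text_chunks = prompt.split("<image>")
    # Only the first chunk keeps special tokens (e.g. BOS); later chunks
    # are tokenized without them so nothing is duplicated mid-sequence.
    input_ids = tokenizer(text_chunks[0]).input_ids
    for chunk in text_chunks[1:]:
        input_ids += [IMAGE_TOKEN_INDEX]
        input_ids += tokenizer(chunk, add_special_tokens=False).input_ids
    return input_ids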

@research4pan (Contributor) left a comment:

The style is greatly improved, thanks! The hardwired-path problem still seems to exist and is better fixed before merging into main. Also, data format documentation is highly recommended.

scripts/run_finetune_multi_modal_stage1.sh

  • 🚫 [Bug] line 8-9: the dataset and images should support automatic download, so that the script is runnable for every user.
  • ⚠️ [Feature] line 11: can add --num_gpus=1 if multi-GPU is not supported.

scripts/run_finetune_multi_modal_stage2.sh

  • 🚫 [Bug] line 9-10: the dataset and images should support automatic download, so that the script is runnable for every user.
  • ⚠️ [Feature] line 12: can add --num_gpus=1 if multi-GPU is not supported.
  • 🚫 [Bug] line 53: the dataset and images should support automatic download, so that the script is runnable for every user.

scripts/run_vis_chatbot_llava.sh

  • 🚫 [Bug] line 1-2: the dataset and images should support automatic download, so that the script is runnable for every user.

scripts/run_vis_chatbot_minigpt4.sh

  • ⚠️ [Feature] line 4: can add --num_gpus=1 if multi-GPU is not supported.

@lianqing11 (Collaborator, Author) replied:

The new commit updates the code to download the dataset and pre-trained model in multimodal training.

@research4pan (Contributor) left a comment:

LGTM 👍 Thanks! Automatic data download logic has been added.

@research4pan merged commit bc569db into main on Sep 6, 2023
2 checks passed