
[Features] Support multi_modal training #628

Merged: 14 commits into main, Sep 6, 2023

Conversation

@lianqing11 (Collaborator)

  1. Update the multi-modal dataset
  2. Support multi-modal training (i.e., LLaVA)

@research4pan (Contributor) left a comment:

Thanks! Training support for image-text datasets is a very important feature for multimodal models. This PR involves a large amount of work, and I would like to thank @lianqing11 for this amazing contribution 👍 The major changes required before merging into main are highlighted in the following comments.

examples/finetune_multi_modal.py

  • [Style] line 42: Extra tab instead of spaces.

src/lmflow/datasets/dataset.py

  • ⚠️ [Architecture] line 116: normally a backend denotes a way of implementing datasets in general (covering different dataset types), not a specific implementation for one particular type. We can refactor this in later commits so that custom_multi_modal becomes an implementation that only supports image2text, while other implementations simply do not support this dataset type.
  • [Architecture] line 123: to ensure forward compatibility, return len(self.backend_dataset) should only be executed for backend == huggingface and backend == custom_multi_modal; raise NotImplementedError for other backends (a sketch follows this list).
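
A minimal sketch of the suggested guard (the method body is illustrative, not the actual LMFlow code):

def __len__(self):
    # Only backends known to wrap an indexable dataset report a length;
    # any new backend must opt in explicitly rather than silently
    # inheriting this behavior.
    if self.backend in ("huggingface", "custom_multi_modal"):
        return len(self.backend_dataset)
    raise NotImplementedError(
        f'__len__ is not implemented for backend "{self.backend}"'
    )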

src/lmflow/datasets/llava_constants.py

  • [Architecture] Constant definitions go in src/lmflow/utils, e.g. src/lmflow/utils/constants.py.

src/lmflow/datasets/llava_conversation_lib.py

  • [Architecture] Internal libs also go to src/lmflow/utils.
  • [Style] line 1: add a shebang line and encoding declaration, plus license information if necessary (see the header sketch below).
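
For reference, a header of the kind requested (the copyright line is a placeholder, not the project's actual license text):

#!/usr/bin/env python
# coding=utf-8
# Copyright <year> <copyright holder>. Licensed under <license>.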

src/lmflow/datasets/multi_modal_datasets.py

  • [Style] line 18: use an absolute import from lmflow.datasets.llava_constants instead of a relative import, as instructed by the Google coding style (example below).
  • [Style] line 117: typo? tokenizer_image_token -> tokenize_image_token
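
For example (the imported name is illustrative):

# Preferred: absolute import, per the Google Python style guide
from lmflow.datasets.llava_constants import DEFAULT_IMAGE_TOKEN

# Avoid: relative import
# from .llava_constants import DEFAULT_IMAGE_TOKEN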

src/lmflow/models/hf_encoder_decoder_model.py

  • [Style] line 220-222: can use parentheses to avoid the backslash:

self.tokenizer.tokenizer = AutoTokenizer.from_pretrained(
    model_args.llm_model_name_or_path
)
  • 🚫 [Bug] line 236-240: these lines have to be restored to ensure the normal functionality of CPU chatbots.

src/lmflow/models/vision2seq_model.py

src/lmflow/models/vision_encoder/__init__.py

  • [Style] line 1: Absolute import is preferred.

src/lmflow/models/vision_encoder/clip_encoder.py

  • [Style] line 1: add a shebang line and encoding declaration, plus license information if necessary (as noted above).
  • [Style] line 96-222: keep each line at most 80 characters, and prefer parentheses over backslashes for line wraps.

src/lmflow/pipeline/finetuner.py

  • ⚠️ [Question] line 227-228: Normally datasets are understood as objects that store the data, while tokenization is performed by the tokenizer in the model. Registering the tokenizer in the dataset seems counterintuitive in terms of architecture design. The same problem applies to lmflow.datasets.multi_modal_dataset.tokenizer_image_token. It is recommended that lmflow.models.vision2seq_model handle this part (a sketch of the suggested split follows).
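
A minimal sketch of the suggested split (class and method names are illustrative, not the actual LMFlow API): the dataset only stores raw records, while the model owns the tokenizer and performs all preprocessing.

class MultiModalDataset:
    # Plain data container: no tokenizer registered here.
    def __init__(self, records):
        self.records = records  # raw {"text": ..., "image": ...} entries

    def __getitem__(self, idx):
        return self.records[idx]


class Vision2SeqModel:
    # Owns the tokenizer and image processor, so all preprocessing
    # (including image-token handling) stays on the model side.
    def __init__(self, tokenizer, image_processor):
        self.tokenizer = tokenizer
        self.image_processor = image_processor

    def preprocess(self, record):
        input_ids = self.tokenizer(record["text"]).input_ids
        pixel_values = self.image_processor(record["image"])
        return {"input_ids": input_ids, "pixel_values": pixel_values}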

@lianqing11 (Collaborator, Author) replied:

The reason for checking whether to use deepspeed is that loading the model in 8-bit while using deepspeed raises an error:
huggingface/transformers#24540
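
For context, the shape of the check (argument and attribute names are assumptions, not the exact LMFlow code):

from transformers import AutoModelForCausalLM

# Loading in 8-bit and initializing through deepspeed conflict
# (see huggingface/transformers#24540), so the deepspeed path is
# skipped when 8-bit loading is requested.
if model_args.use_int8:
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        load_in_8bit=True,
        device_map="auto",
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
    )
    # ... deepspeed initialization would happen on this path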

@research4pan (Contributor) replied:
No problem. Thanks!

@research4pan (Contributor) left a comment:

This follow-up PR mainly addresses inference of finetuned multimodal models. Several problems need to be fixed before it can be merged into main. It is recommended to fix and merge the previous PR first, before proceeding to the issues in this follow-up PR; otherwise, a huge amount of technical debt may be incurred.

examples/vis_chatbot.py

  • [Style] line 156: A better argument name for this usage in the vision chatbot would be chatbot_args.multimodal_backend or chatbot_args.backend, since the llava or minigpt4 implementation affects not only prompt formats but other aspects as well (illustrated below).
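
For illustration, the renamed argument could look like this (field name and default are assumptions):

from dataclasses import dataclass, field

@dataclass
class VisChatbotArguments:
    multimodal_backend: str = field(
        default="minigpt4",
        metadata={
            "help": "Multimodal implementation to use, e.g. 'llava' or "
                    "'minigpt4'; this affects prompt formats and more."
        },
    )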

scripts/run_finetune_multi_modal_stage1.sh

  • 🚫 [Bug] line 11: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 9. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.

scripts/run_finetune_multi_modal_stage2.sh

  • 🚫 [Bug] line 8: hardwired path. Better to put this data on lmflow.org:5000 and add the corresponding download logic here. Specifically, modify data/download.sh and add the download command as in line 32 of scripts/run_vis_chatbot_gradio_minigpt4.sh.
  • 🚫 [Bug] line 9: hardwired path. Better to put the data on lmflow.org:5000 (if necessary) and add the corresponding download logic here, or download from official sites (if available).
  • 🚫 [Bug] line 11: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 9. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.
  • 🚫 [Bug] line 52: hardwired path. Better to put the pretrained projection on lmflow.org:5000 and add the corresponding download logic before running the python script.
  • [Style] line 77: incomplete error file name.

scripts/run_vis_chatbot_llava.sh

  • 🚫 [Bug] line 1: hardwired path. Better to put the pretrained checkpoints on lmflow.org:5000 and add the corresponding download logic here.
  • 🚫 [Bug] line 11: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 9. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.

scripts/run_vis_chatbot_minigpt4.sh

  • 🚫 [Bug] line 1: hardwired path. Better to put the pretrained checkpoints on lmflow.org:5000 and add the corresponding download logic here.
  • 🚫 [Bug] line 3: better to use localhost:0 by default, or not specify this field, since some users may not have a GPU 8. Letting users set export CUDA_VISIBLE_DEVICES=x would be more user-friendly.
  • [Style] line 6: line wrapping is recommended.

src/lmflow/datasets/multi_modal_dataset.py

  • ⚠️ [Feature] The format here is too complex for common users. Documentation for this data format is recommended (a sample record is sketched after this list).
  • [Question] line 72-83: What is the major difference between llava_plain and llava_v1? Informative comments in the code are encouraged.
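
As a starting point for such documentation, a hypothetical record in the LLaVA-style conversation format (field names follow the upstream LLaVA convention; the exact LMFlow schema may differ):

example_record = {
    "id": "000000000001",
    "image": "coco/train2017/000000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
        {"from": "gpt", "value": "A dog playing in a park."},
    ],
}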

src/lmflow/models/hf_encoder_decoder_model.py

  • ⚠️ [Architecture] line 117: The mode name "finetune" is very confusing, as later contributors may mix it up with the "normal" mode, which is the real finetuning/training. This may lead to bugs in later contributions, so it is strongly recommended to use another name, such as "freeze", or to keep the name "none", since there is no explicit difference between the "finetune" and "none" modes in this follow-up PR.
  • [Style] line 194: remove useless comments.
  • [Style] line 208-210: can use parentheses to handle this type of line wrap.

src/lmflow/models/utils.py

  • [Architecture] Move this file to src/lmflow/utils/multimodal.py.

src/lmflow/pipeline/finetuner.py

  • [Style] Remove useless comments.

src/lmflow/pipeline/inferencer.py

  • [Style] line 179: the right parenthesis can go on the next line to improve readability (see below).
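
That is, something along these lines (names are placeholders):

# Harder to scan: the closing parenthesis is buried after the last argument
outputs = inferencer.inference(model, dataset,
    max_new_tokens=args.max_new_tokens)

# Easier to scan: the closing parenthesis goes on its own line
outputs = inferencer.inference(
    model,
    dataset,
    max_new_tokens=args.max_new_tokens,
)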

@lianqing11 (Collaborator, Author) replied:

src/lmflow/datasets/llava_conversation_lib.py

I have fixed the typo. I'm not quite sure about the license, so it is still missing from the file.

Regarding the tokenization in the dataset: it puts the image flag into input_ids and preprocesses the image. We can consider how to refactor this.
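
A simplified sketch of that mechanism (based on the upstream LLaVA helper; the LMFlow version may differ): the prompt is split on the "<image>" placeholder, and a sentinel token id is spliced between the tokenized chunks, to be replaced by image features at forward time.

IMAGE_TOKEN_INDEX = -200  # sentinel id, swapped for image features later

def tokenize_with_image_flag(prompt, tokenizer):
    text_chunks = prompt.split("<image>")
    # Only the first chunk keeps special tokens (e.g. BOS); later chunks
    # are tokenized without them so nothing is duplicated mid-sequence.
    input_ids = tokenizer(text_chunks[0]).input_ids
    for chunk in text_chunks[1:]:
        input_ids += [IMAGE_TOKEN_INDEX]
        input_ids += tokenizer(chunk, add_special_tokens=False).input_ids
    return input_ids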

@research4pan (Contributor) left a comment:

The style is greatly improved, thanks! The hardwired-path problem still seems to exist and is better fixed before merging into main. Also, data format documentation is highly recommended.

scripts/run_finetune_multi_modal_stage1.sh

  • 🚫 [Bug] line 8-9: the dataset and images should support automatic download, so that the script is runnable for every user.
  • ⚠️ [Feature] line 11: can add --num_gpus=1 if multi-GPU is not supported.

scripts/run_finetune_multi_modal_stage2.sh

  • 🚫 [Bug] line 9-10: the dataset and images should support automatic download, so that the script is runnable for every user.
  • ⚠️ [Feature] line 12: can add --num_gpus=1 if multi-GPU is not supported.
  • 🚫 [Bug] line 53: the dataset and images should support automatic download, so that the script is runnable for every user.

scripts/run_vis_chatbot_llava.sh

  • 🚫 [Bug] line 1-2: the dataset and images should support automatic download, so that the script is runnable for every user.

scripts/run_vis_chatbot_minigpt4.sh

  • ⚠️ [Feature] line 4: can add --num_gpus=1 if multi-GPU is not supported.

@lianqing11 (Collaborator, Author) replied:

The new commit updates the code to download the dataset and pre-trained model in multimodal training.

@research4pan (Contributor) left a comment:

LGTM 👍 Thanks! Automatic data download logic has been added.

@research4pan merged commit bc569db into main on Sep 6, 2023
2 checks passed