[Features] Support multi_modal training #628
Conversation
lianqing11
commented
Aug 28, 2023
- Update the multi-modal dataset
- Support multi-modal training (e.g. LLaVA)
Thanks! Training support for image-text datasets is a very important feature for multimodal models. This PR involves a substantial amount of work, and I would like to thank @lianqing11 for this amazing contribution 👍 The major changes required before merging into main are highlighted in the following comments.
examples/finetune_multi_modal.py
- [Style] line 42: Extra tab instead of spaces.
src/lmflow/datasets/dataset.py
- ⚠️ [Architecture] line 116: normally `backend` means a way to implement a dataset (covering different dataset types), not a specific implementation for a certain type. We can refactor this in later commits, where `custom_multi_modal` can be an implementation that only supports `image2text`, while other implementations do not support this dataset type.
- [Architecture] line 123: to ensure forward compatibility, `return len(self.backend_dataset)` should only be executed for `backend == huggingface` and `backend == custom_multi_modal`; raise `NotImplementedError` for other backends.
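The forward-compatible `__len__` suggested above could look roughly like this sketch (the class internals and attribute names are assumptions for illustration, not the actual LMFlow code):

```python
# Sketch of a backend-guarded __len__: only backends known to expose a
# sized backend_dataset return a length; everything else fails loudly.
class Dataset:
    SUPPORTED_BACKENDS = ("huggingface", "custom_multi_modal")

    def __init__(self, backend, backend_dataset):
        self.backend = backend
        self.backend_dataset = backend_dataset

    def __len__(self):
        if self.backend in self.SUPPORTED_BACKENDS:
            return len(self.backend_dataset)
        raise NotImplementedError(
            f"__len__ is not implemented for backend '{self.backend}'"
        )
```

Raising `NotImplementedError` for unknown backends means a future backend that lacks a sized `backend_dataset` fails immediately instead of returning a misleading length.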
src/lmflow/datasets/llava_constants.py
- [Architecture] Constants definitions go to `src/lmflow/utils`, e.g. `src/lmflow/utils/constants.py`.
src/lmflow/datasets/llava_conversation_lib.py
- [Architecture] Internal libs also go to `src/lmflow/utils`.
- [Style] line 1: add shebang line and code encoding, and license information if necessary.
src/lmflow/datasets/multi_modal_datasets.py
- [Style] line 18: use absolute import `from lmflow.datasets.llava_constants` instead of relative import, as instructed by Google coding style.
- [Style] line 117: typo? `tokenizer_image_token` -> `tokenize_image_token`
src/lmflow/models/hf_encoder_decoder_model.py
- [Style] line 220-222: can use parentheses to avoid backslashes:

  ```python
  self.tokenizer.tokenizer = (
      AutoTokenizer.from_pretrained(
          model_args.llm_model_name_or_path)
  )
  ```
- 🚫 [Bug] line 236-240: these lines have to be restored to ensure the normal functionality of cpu chatbots.
src/lmflow/models/vision2seq_model.py
- [Style] line 25-26: Absolute import is preferred, as instructed by Google coding style.
src/lmflow/models/vision_encoder/__init__.py
- [Style] line 1: Absolute import is preferred.
src/lmflow/models/vision_encoder/clip_encoder.py
- [Style] line 1: add shebang line and code encoding, and license information if necessary.
- [Style] line 96-222: make each line at most 80 characters, and prefer parentheses over backslashes for line wraps.
src/lmflow/pipeline/finetuner.py
- ⚠️ [Question] line 227-228: Normally `datasets` are understood as objects that store the data, while tokenization is performed by the tokenizer in the model. Registering the tokenizer in the dataset seems counterintuitive in terms of architecture design. The same problem applies to `lmflow.datasets.multi_modal_dataset.tokenizer_image_token`. This part is recommended to be handled by `lmflow.models.vision2seq_model`.
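The split of responsibilities suggested above could be sketched as follows (class and method names are illustrative assumptions, not the actual LMFlow API):

```python
# Illustrative sketch only: the dataset holds raw records, and the
# model owns the tokenizer and all tokenization logic.
class MultiModalDataset:
    def __init__(self, records):
        # raw text/image records; no tokenizer is registered here
        self.records = records


class Vision2SeqModel:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def tokenize_image_token(self, prompt, image_token="<image>"):
        # split the prompt around the image placeholder and tokenize
        # each text chunk with the model-owned tokenizer
        return [self.tokenizer(chunk) for chunk in prompt.split(image_token)]
```

With this layout, swapping tokenizers never requires touching dataset code, which is the architectural point of the comment above.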
The reason to check whether to use deepspeed is that when loading the model in 8-bit, using deepspeed would raise an error.
huggingface/transformers#24540
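The check described above could be factored out roughly as in this sketch (the function name and arguments are assumptions for illustration):

```python
# Hedged sketch: skip deepspeed when the model is loaded in 8-bit,
# since combining the two raises an error
# (see huggingface/transformers#24540).
def should_use_deepspeed(load_in_8bit: bool, ds_config) -> bool:
    if load_in_8bit:
        return False  # deepspeed + 8-bit loading is incompatible
    return ds_config is not None
```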
No problem. Thanks!
This follow-up PR mainly addresses inference for finetuned multimodal models. Several problems need to be fixed before it can be merged into main. It is recommended to fix and merge the previous PR first before we proceed to address issues in this follow-up PR; otherwise, a huge amount of technical debt may be incurred.
examples/vis_chatbot.py
- [Style] line 156: A better argument name for the usage in the vision chatbot could be `chatbot_args.multimodal_backend` or `chatbot_args.backend`, since the `llava` or `minigpt4` implementation does not only affect prompt formats, but also other aspects.
scripts/run_finetune_multi_modal_stage1.sh
- 🚫 [Bug] line 11: better use `localhost:0` by default, or not specify this field, since some users may not have GPU 9. Letting users use `export CUDA_VISIBLE_DEVICES=x` would be more user-friendly.
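The suggested pattern could be sketched as below; the deepspeed invocation is shown as a comment and its flags are illustrative, not the script's actual command line:

```shell
# Let users pick their GPU via an environment variable instead of
# hardwiring a device index in the script; fall back to GPU 0.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"

# Then launch with the default device, e.g.:
# deepspeed --include "localhost:0" examples/finetune_multi_modal.py ...
echo "Using GPU(s): ${CUDA_VISIBLE_DEVICES}"
```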
scripts/run_finetune_multi_modal_stage2.sh
- 🚫 [Bug] line 8: hardwired path. Better put this data on `lmflow.org:5000` and add corresponding download logic here. Specifically, modify `data/download.sh`, and add the download command as in line 32 of `scripts/run_vis_chatbot_gradio_minigpt4.sh`.
- 🚫 [Bug] line 9: hardwired path. Better put the data on `lmflow.org:5000` (if necessary) and add corresponding download logic here, or download from official sites (if available).
- 🚫 [Bug] line 11: better use `localhost:0` by default, or not specify this field, since some users may not have GPU 9. Letting users use `export CUDA_VISIBLE_DEVICES=x` would be more user-friendly.
- 🚫 [Bug] line 52: hardwired path. Better put the pretrained projection on `lmflow.org:5000` and add corresponding download logic before running the python script.
- [Style] line 77: incomplete error file name.
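A download step in `data/download.sh` could look roughly like this sketch; the helper name and archive filename are made-up placeholders, not the real artifacts:

```shell
# Hypothetical helper: fetch an archive from the lmflow file server
# instead of relying on a hardwired local path. The real download
# command is commented out so the sketch runs without network access.
download_from_lmflow() {
    local url="http://lmflow.org:5000/${1}"
    echo "would download ${url}"
    # wget -O "data/${1}" "${url}" && tar -xzf "data/${1}" -C data/
}

download_from_lmflow "example_dataset.tar.gz"
```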
scripts/run_vis_chatbot_llava.sh
- 🚫 [Bug] line 1: hardwired path. Better put the pretrained checkpoints on `lmflow.org:5000` and add corresponding download logic here.
- 🚫 [Bug] line 11: better use `localhost:0` by default, or not specify this field, since some users may not have GPU 9. Letting users use `export CUDA_VISIBLE_DEVICES=x` would be more user-friendly.
scripts/run_vis_chatbot_minigpt4.sh
- 🚫 [Bug] line 1: hardwired path. Better put the pretrained checkpoints on `lmflow.org:5000` and add corresponding download logic here.
- 🚫 [Bug] line 3: better use `localhost:0` by default, or not specify this field, since some users may not have GPU 8. Letting users use `export CUDA_VISIBLE_DEVICES=x` would be more user-friendly.
- [Style] line 6: line wrapping is recommended.
src/lmflow/datasets/multi_modal_dataset.py
- ⚠️ [Feature] The format here is too complex for common users to use. Documentation for this data format is recommended.
- [Question] line 72-83: What's the major difference between llava_plain and llava_v1? Informative comments are encouraged to be included in the code.
src/lmflow/models/hf_encoder_decoder_model.py
- ⚠️ [Architecture] line 117: Mode name "finetune" is very confusing, as later contributors may mix it up with "normal" mode, which is the real finetuning/training. This may result in bugs in later contributions, so it is strongly recommended to use another name, such as "freeze", or keep the name "none", since there is no explicit difference between the "finetune" and "none" modes in this follow-up PR.
- [Style] line 194: remove useless comments.
- [Style] line 208-210: can use parentheses to handle this type of line wrap.
src/lmflow/models/utils.py
- [Architecture] Move this file to `src/lmflow/utils/multimodal.py`.
src/lmflow/pipeline/finetuner.py
- [Style] Remove useless comments.
src/lmflow/pipeline/inferencer.py
- [Style] line 179: the closing parenthesis can go to the next line to improve readability.
I have fixed the typo. I'm not quite sure about the license, so it is still missing in the file. Regarding the tokenization in the dataset:
The style is greatly improved, thanks! The hardwired path problem still seems to exist; better to fix it before merging into main. Also, data format documentation is highly recommended.
scripts/run_finetune_multi_modal_stage1.sh
- 🚫 [Bug] line 8-9: the dataset and images are recommended to support automatic download, so the script is runnable for every user.
- ⚠️ [Feature] line 11: can add `--num_gpus=1` if multi-GPU is not supported.
scripts/run_finetune_multi_modal_stage2.sh
- 🚫 [Bug] line 9-10: the dataset and images are recommended to support automatic download, so the script is runnable for every user.
- ⚠️ [Feature] line 12: can add `--num_gpus=1` if multi-GPU is not supported.
- 🚫 [Bug] line 53: the dataset and images are recommended to support automatic download, so the script is runnable for every user.
scripts/run_vis_chatbot_llava.sh
- 🚫 [Bug] line 1-2: the dataset and images are recommended to support automatic download, so the script is runnable for every user.
scripts/run_vis_chatbot_minigpt4.sh
- ⚠️ [Feature] line 4: can add `--num_gpus=1` if multi-GPU is not supported.
The new commit updates the code to download the dataset and pre-trained model in multimodal training.
LGTM 👍 Thanks! Automatic data download logics added.