
[feat] Add finetune code for Yi-VL model #368

Open · wants to merge 11 commits into main
Conversation

@minlik commented Jan 30, 2024

The code is mostly modified from LLaVA

likuan and others added 8 commits January 29, 2024 18:36
Prerequistes -> Prerequisites
* [doc][feat] modified readme_CN.

* [doc][feat] modified readme_CN.

---------

Co-authored-by: YShow <66633207+Yimi81@users.noreply.github.com>
@Yimi81 (Contributor) commented Feb 1, 2024

@minlik Thank you for your PR; I will test it.

likuan and others added 3 commits February 1, 2024 18:27
* [doc][feat] modified readme.

* [doc][feat] modified readme.

* [doc][feat] modified readme.

* [doc][feat] modified readme.

* [doc][feat] modified readme.

* [doc][feat] modified readme.
@Yimi81 (Contributor) commented Feb 1, 2024

Can you provide your environment? Both the official LLaVA configuration and the requirements.txt you provided reported errors. It would be great if you could add steps to the readme for reproducing your environment.
The various errors I encountered:

- Import error
    from llava.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
- ModuleNotFoundError: No module named 'llava'
- ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer'
- ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

@minlik (Author) commented Feb 1, 2024

Can you provide your environment? Both the official LLaVA configuration and the requirements.txt you provided reported errors. It would be great if you could add steps to the readme for reproducing your environment. The various errors I encountered:

- Import error
    from llava.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
- ModuleNotFoundError: No module named 'llava'
- ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer'
- ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

I have updated the requirements in the latest commit here. I suggest reinstalling from the requirements.txt under the VL folder; these errors should then be resolved.
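For example, a clean reinstall could look like this (the environment name and clone location are placeholders, just a sketch):

# Fresh environment to avoid version conflicts with an existing llava install.
conda create -n yi-vl python=3.10 -y
conda activate yi-vl
cd Yi/VL
pip install -r requirements.txt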

- Import error
    from llava.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
- ModuleNotFoundError: No module named 'llava'

I have added PYTHONPATH=../../:$PYTHONPATH in the training scripts so that the llava package can be found on the Python path.
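As a quick sanity check (a sketch, run from the same working directory the training scripts use):

# Should print the path of the llava package bundled in this PR
# instead of raising ModuleNotFoundError: No module named 'llava'.
PYTHONPATH=../../:$PYTHONPATH python -c "import llava; print(llava.__file__)"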

- ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer'

I use transformers==4.34.0 because transformers removed ShardedDDPOption in 4.35. I will fix this later, but for now 4.34.0 works.
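If pip pulled in a newer release, pinning it back is enough for now:

# ShardedDDPOption was removed from transformers.trainer in 4.35,
# so stay on the last release that still exposes it.
pip install "transformers==4.34.0"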

- ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

I think this error is related to a mismatch between the torch build and your CUDA version. You may need to reinstall some packages following the instructions here.
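A sketch of how I would check and fix this (the cu117 wheel index matches my CUDA 11.7 setup; adjust it for your driver):

# libcudart.so.12 usually means a package was built against CUDA 12
# while the machine only provides CUDA 11.x runtime libraries.
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# If needed, reinstall torch/torchvision wheels built for CUDA 11.7:
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu117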

Part of my environment is listed below, with CUDA version 11.7:

accelerate==0.26.1
bitsandbytes==0.42.0
deepspeed==0.13.1
flash-attn==2.3.3
huggingface-hub==0.17.3
peft==0.8.1
tokenizers==0.14.1
torch==2.0.1
torchvision==0.15.2
transformers==4.34.0
wandb==0.16.2

@Yimi81 (Contributor) commented Feb 2, 2024

my sh script:

PYTHONPATH=../../:$PYTHONPATH \
deepspeed --include localhost:0,1 --master_port 1234 llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --model_name_or_path /ML-A100/public/tmp/pretrain_weights/Yi-VL-6B \
    --data_path /ML-A100/public/tmp/yiguofeng/contribute/Yi/data.json \
    --image_folder /ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/images \
    --vision_tower /ML-A100/public/tmp/pretrain_weights/Yi-VL-34B/vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-34B-448 \
    --output_dir /ML-A100/public/tmp/VL-FT\
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --report_to wandb

error:

Traceback (most recent call last):
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/train/train.py", line 786, in train
    model = LlavaLlamaForCausalLM.from_pretrained(
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3085, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/llava_llama.py", line 44, in __init__
    self.model = LlavaLlamaModel(config)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/llava_llama.py", line 36, in __init__
    super(LlavaLlamaModel, self).__init__(config)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/llava_arch.py", line 32, in __init__
    self.vision_tower = build_vision_tower(config, delay_load=True)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/clip_encoder/builder.py", line 11, in build_vision_tower
    return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/clip_encoder/clip_encoder.py", line 19, in __init__
    self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/models/clip/configuration_clip.py", line 238, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/configuration_utils.py", line 620, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/configuration_utils.py", line 696, in _get_config_dict
    raise EnvironmentError(
OSError: Can't load the configuration of './vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-6B-448'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-6B-448'

How did you specify the vision_tower? Did you do the same as me?

@weijingxuan commented
Hello,

I've been trying to run the finetune script as per the instructions in VL/scripts/finetune.sh, but I keep encountering an error that I haven't been able to resolve. The script fails to execute properly, and the error seems to originate from the llama_flash_attn_monkey_patch.py file, specifically lines 87 to 89:
output_unpad = flash_attn_unpadded_qkvpacked_func(
    qkv, cu_q_lens, max_s, 0.0, softmax_scale=None, causal=True
)
The error message I receive is as follows: cu_seqlens_q must have shape (batch_size + 1)
I have followed all the setup instructions accurately and ensured that all dependencies are correctly installed.
I would greatly appreciate any guidance on the specific details of using this code or suggestions on how I might resolve this error. Is there a particular setup or dependency version that I should be aware of?

Thank you.

@minlik (Author) commented Feb 2, 2024

my sh script:

PYTHONPATH=../../:$PYTHONPATH \
deepspeed --include localhost:0,1 --master_port 1234 llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --model_name_or_path /ML-A100/public/tmp/pretrain_weights/Yi-VL-6B \
    --data_path /ML-A100/public/tmp/yiguofeng/contribute/Yi/data.json \
    --image_folder /ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/images \
    --vision_tower /ML-A100/public/tmp/pretrain_weights/Yi-VL-34B/vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-34B-448 \
    --output_dir /ML-A100/public/tmp/VL-FT\
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --report_to wandb

error:

Traceback (most recent call last):
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/train/train.py", line 786, in train
    model = LlavaLlamaForCausalLM.from_pretrained(
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3085, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/llava_llama.py", line 44, in __init__
    self.model = LlavaLlamaModel(config)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/llava_llama.py", line 36, in __init__
    super(LlavaLlamaModel, self).__init__(config)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/llava_arch.py", line 32, in __init__
    self.vision_tower = build_vision_tower(config, delay_load=True)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/clip_encoder/builder.py", line 11, in build_vision_tower
    return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/ML-A100/public/tmp/yiguofeng/contribute/Yi/VL/llava/model/clip_encoder/clip_encoder.py", line 19, in __init__
    self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/models/clip/configuration_clip.py", line 238, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/configuration_utils.py", line 620, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/ML-A100/public/tmp/miniconda3/envs/vl_test/lib/python3.10/site-packages/transformers/configuration_utils.py", line 696, in _get_config_dict
    raise EnvironmentError(
OSError: Can't load the configuration of './vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-6B-448'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-6B-448'

How did you specify the vision_tower? Did you do the same as me?


I set the parameter mm_vision_tower in config.json to the local ViT path, as suggested by the previous Yi-VL model readme. However, I just checked the latest readme and found that the suggestion has been removed. Something has changed and I will look into it further, but for now, simply setting mm_vision_tower in config.json to the local ViT path should work.
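For example, a minimal sketch of pointing mm_vision_tower at the local ViT folder (the paths below are placeholders for your own checkpoint layout):

# Placeholder paths; substitute your local Yi-VL-6B checkpoint and its vit/ subfolder.
MODEL_DIR=/path/to/Yi-VL-6B
VIT_DIR=$MODEL_DIR/vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-6B-448

# Rewrite mm_vision_tower in config.json from the relative "./vit/..." value
# (which from_pretrained cannot resolve) to the absolute local path.
python - "$MODEL_DIR/config.json" "$VIT_DIR" <<'EOF'
import json, sys

cfg_path, vit_dir = sys.argv[1], sys.argv[2]
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["mm_vision_tower"] = vit_dir
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
print("mm_vision_tower ->", cfg["mm_vision_tower"])
EOF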

@minlik (Author) commented Feb 2, 2024

Hello,

I've been trying to run the finetune script as per the instructions in VL/scripts/finetune.sh, but I keep encountering an error that I haven't been able to resolve. The script fails to execute properly, and the error seems to originate from the llama_flash_attn_monkey_patch.py file, specifically lines 87 to 89:

output_unpad = flash_attn_unpadded_qkvpacked_func(
    qkv, cu_q_lens, max_s, 0.0, softmax_scale=None, causal=True
)

The error message I receive is as follows: cu_seqlens_q must have shape (batch_size + 1)
I have followed all the setup instructions accurately and ensured that all dependencies are correctly installed. I would greatly appreciate any guidance on the specific details of using this code or suggestions on how I might resolve this error. Is there a particular setup or dependency version that I should be aware of?

Thank you.

I think it is related to the flash-attn or transformers version. You can try the environment listed below. Alternatively, you can simply avoid flash attention by changing train_mem.py to train.py in the script, to see whether that works (see the sketch after the environment list). Note that I have only tested finetune_qlora.sh and finetune_lora.sh, not the full finetune script finetune.sh, due to my GPU memory limit.

accelerate==0.26.1
bitsandbytes==0.42.0
deepspeed==0.13.1
flash-attn==2.3.3
huggingface-hub==0.17.3
peft==0.8.1
tokenizers==0.14.1
torch==2.0.1
torchvision==0.15.2
transformers==4.34.0
wandb==0.16.2
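To illustrate the train_mem.py -> train.py swap mentioned above, a minimal sketch, assuming you run from the VL folder and that finetune.sh launches llava/train/train_mem.py as in the command quoted earlier:

# train.py avoids the flash-attention monkey patch that train_mem.py applies,
# so this is a quick way to check whether flash-attn is the culprit.
sed -i 's#llava/train/train_mem.py#llava/train/train.py#' scripts/finetune.sh
bash scripts/finetune.sh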

@qiuhuiGithub commented Feb 22, 2024

Could you provide a sample of your training data? Is the data format the same as LLaVA's, with <image> simply changed to <image_placeholder>?

@minlik (Author) commented Feb 22, 2024

Could you provide a sample of your training data? Is the data format the same as LLaVA's, with <image> simply changed to <image_placeholder>?

Yes. See here
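For reference, a minimal sketch of a single record in data.json; the field names follow the LLaVA conversation format, and the image path and prompt here are placeholders:

# Writes a one-example dataset; the image token is <image_placeholder> for Yi-VL.
cat > data.json <<'EOF'
[
  {
    "id": "0",
    "image": "images/example.jpg",
    "conversations": [
      {"from": "human", "value": "<image_placeholder>\nWhat is shown in this image?"},
      {"from": "gpt", "value": "A short description of the image."}
    ]
  }
]
EOF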

@murray-z commented Mar 7, 2024

Hello, where can I see the latest finetuning code?
