Support finetuning LLaVA 1.6 #432

Open
choyakawa opened this issue Feb 23, 2024 · 8 comments

@choyakawa

Support finetuning LLaVA 1.6

@LZHgrla
Collaborator

LZHgrla commented Feb 26, 2024

Hi @choyakawa!

Thank you for your interest. The training script for LLaVA 1.6 (NeXT) has not been released yet. We will try to follow up once it is released.

@LZHgrla
Collaborator

LZHgrla commented Mar 14, 2024

Hi @choyakawa

@hhaAndroid is working on it.

Please subscribe to #460!

@choyakawa
Author

Training llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain fails with DeepSpeed ZeRO-3. Is there anything wrong with this command?
NCCL_IB_TIMEOUT=120 XTUNER_DATASET_TIMEOUT=120 NCCL_DEBUG=INFO NPROC_PER_NODE=8 NNODES=4 PORT=12345 ADDR=server0 NODE_RANK=0 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain --deepspeed deepspeed_zero3

  File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/user/.local/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([92544, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
[2024-03-17 12:29:20,707] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 79735) of binary: /usr/local/bin/python3
Traceback (most recent call last):

@choyakawa
Author

ZeRO-2 works, but replicating LLaVA 1.6 with a 34B model is challenging without ZeRO-3.

@choyakawa
Author

choyakawa commented Mar 18, 2024

@LZHgrla Do you have any idea why ZeRO-3 fails? I do not understand why the image features from CLIP have shape torch.Size([0]) here.
It also seems that a batch size greater than 1 does not work with ZeRO-2 either.

@LZHgrla
Collaborator

LZHgrla commented Mar 19, 2024

@choyakawa
Quantization is not compatible with ZeRO-3, so you should remove the quantization_config of model.llm when using ZeRO-3.
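
A minimal sketch of that change, assuming the config's model section is a nested dict with an `llm` entry that may carry a `quantization_config` key. The helper name `strip_quantization`, the `type` value, and the path string are hypothetical placeholders for illustration, not taken from the actual XTuner config:

```python
import copy


def strip_quantization(model_cfg):
    """Return a copy of model_cfg whose llm entry has no quantization_config."""
    cfg = copy.deepcopy(model_cfg)
    # pop() with a default is a no-op when the key is already absent
    cfg.get("llm", {}).pop("quantization_config", None)
    return cfg


# Hypothetical, trimmed-down model section (all values are placeholders)
model = dict(
    type="PlaceholderModel",
    llm=dict(
        pretrained_model_name_or_path="path/to/llm",
        quantization_config=dict(load_in_4bit=True),
    ),
)

zero3_model = strip_quantization(model)
print(sorted(zero3_model["llm"].keys()))
```

In practice the same effect can be had by deleting the `quantization_config` entry from the config file by hand; the helper only makes the intent explicit and leaves the original dict untouched.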

LLaVA 1.6 support is still a work in progress. If you are attempting more advanced setups (such as using a 34B LLM), you are welcome to share detailed configs and executable commands. We will run tests once development is finished to improve robustness.

@choyakawa
Author

I am not using quantization; the failure above was with bf16. I have also tried open_clip instead of the OpenAI ViT-L, and that does not work either.

@awzhgw

awzhgw commented Apr 24, 2024

@LZHgrla

How should I resolve this error?

RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 270, in run
    self.runner.call_hook('before_train')
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 221, in before_train
    self._generate_samples(runner, max_new_tokens=50)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 207, in _generate_samples
    self._eval_images(runner, model, device, max_new_tokens,
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/anyshape_evaluate_chat_hook.py", line 53, in _eval_images
    image_features = model.preprocess_for_pixel_values({
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/model/anyshape_llava.py", line 109, in preprocess_for_pixel_values
    self.image_newline[:, None, None].expand(
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
[2024-04-24 13:52:01,685] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2444983) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
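
For readers of the RuntimeError above: the message says that `image_newline` had shape [0] on that rank when `expand` was called. The sketch below mirrors, in plain Python, the shape rule that `torch.Tensor.expand` applies (an existing dimension must be 1 or equal to the target size). The function name `first_expand_mismatch` is hypothetical. Why the tensor was empty is not confirmed in this thread; one possibility, stated here only as an assumption, is that the parameter was partitioned by ZeRO-3 and only an empty local placeholder was visible.

```python
def first_expand_mismatch(src_shape, target_shape):
    """Return the first target dimension where expand would fail, or None.

    Mirrors the rule used by torch.Tensor.expand: shapes are aligned from
    the trailing dimension, and each existing dimension must either be 1
    or equal the requested size (-1 keeps the existing size).
    """
    if len(target_shape) < len(src_shape):
        raise ValueError("target must have at least as many dimensions as source")
    offset = len(target_shape) - len(src_shape)
    for i, target in enumerate(target_shape):
        if i < offset:
            continue  # new leading dimension, nothing to compare against
        existing = src_shape[i - offset]
        if target != -1 and existing != 1 and existing != target:
            return i
    return None


# Shapes taken from the error message: tensor sizes [0, 1, 1], target [4096, 32, 1]
print(first_expand_mismatch((0, 1, 1), (4096, 32, 1)))
# A full-sized parameter of shape [4096] viewed as [4096, 1, 1] would pass the check
print(first_expand_mismatch((4096, 1, 1), (4096, 32, 1)))
```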
