Support finetuning LLaVA 1.6 #432

choyakawa · 2024-02-23T14:49:29Z

Support finetuning LLaVA 1.6

LZHgrla · 2024-02-26T03:31:51Z

Thank you for your interest. The training script for LLaVA 1.6 (LLaVA-NeXT) has not been released yet. We will try to follow up once it is released.

LZHgrla · 2024-03-14T05:16:50Z

Hi @choyakawa

@hhaAndroid is working on it.

Please subscribe to #460!

choyakawa · 2024-03-17T12:35:22Z

Training llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain with DeepSpeed ZeRO-3 fails. Is there anything wrong with my setup? The launch command and the error are below.
NCCL_IB_TIMEOUT=120 XTUNER_DATASET_TIMEOUT=120 NCCL_DEBUG=INFO NPROC_PER_NODE=8 NNODES=4 PORT=12345 ADDR=server0 NODE_RANK=0 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain --deepspeed deepspeed_zero3

  File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/user/.local/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([92544, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
[2024-03-17 12:29:20,707] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 79735) of binary: /usr/local/bin/python3
Traceback (most recent call last):

choyakawa · 2024-03-17T12:53:33Z

ZeRO-2 works, but replicating LLaVA 1.6 with a 34B model is hard without ZeRO-3.

choyakawa · 2024-03-18T12:15:09Z

@LZHgrla Do you have any idea why ZeRO-3 fails? I do not understand why the image features from CLIP have shape torch.Size([0]) here.
It also seems that a batch size greater than 1 does not work with ZeRO-2 either.

LZHgrla · 2024-03-19T04:57:26Z

@choyakawa
Quantization is not compatible with ZeRO-3, so you should remove the quantization_config of model.llm when using ZeRO-3.
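
A minimal sketch of that change, assuming the config uses a nested-dict layout in which model.llm may carry a quantization_config entry. The helper name strip_quantization and the example values are hypothetical; only the quantization_config key comes from this thread.

```python
import copy


def strip_quantization(model_cfg):
    """Return a copy of model_cfg whose llm entry has no quantization_config.

    model_cfg is assumed to be a plain nested dict with an "llm" sub-dict;
    the input is left untouched.
    """
    cfg = copy.deepcopy(model_cfg)
    # pop() with a default is a no-op when the key is already absent.
    cfg.get("llm", {}).pop("quantization_config", None)
    return cfg


# Hypothetical example config; the values are placeholders, not a real recipe.
model = {
    "llm": {
        "pretrained_model_name_or_path": "path/to/llm",
        "quantization_config": {"load_in_4bit": True},
    },
}

zero3_model = strip_quantization(model)
print("quantization_config" in zero3_model["llm"])  # False
print("quantization_config" in model["llm"])        # True, original unchanged
```

In an actual config file the same effect is usually achieved by deleting the quantization_config entry by hand; the helper only makes the intent explicit.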

The LLaVA 1.6 features are still WIP. If you are trying more advanced setups (such as a 34B LLM), you are welcome to share detailed configs and runnable commands. We will run some tests after development to improve robustness.

choyakawa · 2024-03-19T07:51:14Z

I am not using quantization; the failure above was with bf16. I also tried open_clip instead of the OpenAI ViT-L, and that did not work either.

awzhgw · 2024-04-24T14:01:36Z

@LZHgrla

How should I resolve this error?

RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 270, in run
    self.runner.call_hook('before_train')
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 221, in before_train
    self._generate_samples(runner, max_new_tokens=50)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 207, in _generate_samples
    self._eval_images(runner, model, device, max_new_tokens,
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/anyshape_evaluate_chat_hook.py", line 53, in _eval_images
    image_features = model.preprocess_for_pixel_values({
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/model/anyshape_llava.py", line 109, in preprocess_for_pixel_values
    self.image_newline[:, None, None].expand(
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
[2024-04-24 13:52:01,685] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2444983) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
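
The RuntimeError above follows from the shape rule that Tensor.expand enforces: only dimensions of size 1 may be stretched, so a tensor of shape [0, 1, 1] cannot become [4096, 32, 1]. In other words, image_newline was empty (shape [0]) when the hook ran, rather than holding 4096 values. The pure-Python sketch below restates that rule for illustration; can_expand is a hypothetical helper, not part of PyTorch or XTuner.

```python
def can_expand(src_shape, target_shape):
    """Check whether a tensor of src_shape could be expanded to target_shape.

    Mirrors the rule used by Tensor.expand: shapes are aligned from the
    trailing dimension, a source dimension of size 1 may be stretched to any
    size, every other dimension must match exactly, and the target may add
    new leading dimensions. (The -1 "keep this size" shorthand is not handled.)
    """
    if len(target_shape) < len(src_shape):
        return False
    # Compare trailing dimensions pairwise; extra leading target dims are free.
    for src, tgt in zip(reversed(src_shape), reversed(target_shape)):
        if src != tgt and src != 1:
            return False
    return True


# Shapes from the error message: image_newline[:, None, None] is [0, 1, 1].
print(can_expand((0, 1, 1), (4096, 32, 1)))     # False
# What the call expects: a 4096-element vector viewed as [4096, 1, 1].
print(can_expand((4096, 1, 1), (4096, 32, 1)))  # True
```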

LZHgrla added the feature request label Feb 26, 2024

choyakawa mentioned this issue Mar 17, 2024

[Feature] Support any shape of Llava (from llava 1.6) #460

Open
