
Issues with Running video_chat2 on Multi-GPU Setup with Nvidia Titan Xp #115

ddoron9 opened this issue Jan 26, 2024 · 8 comments

ddoron9 commented Jan 26, 2024

Hi,

I'm currently attempting to run the video_chat2 model on a multi-GPU setup consisting of 8 Nvidia Titan Xp GPUs, each with 12GiB of memory. I'm using the mvbench.ipynb notebook from the Ask-Anything/video_chat2 repository for this purpose.

To ensure the model loads on my GPUs, I've enabled the low_resource option in config.json. Additionally, I've specified device_map="auto" during the initialization of the llama_model in videochat2_it.py. The relevant code snippet is as follows:

if self.low_resource:
    self.llama_model = LlamaForCausalLM.from_pretrained(
        llama_model_path,
        load_in_8bit=True,
        device_map="auto",
        torch_dtype=torch.float16,
    )

However, when I execute the code, I encounter multiple errors originating from the following lines:

seg_embs = [model.llama_model.base_model.model.model.embed_tokens(seg_t).cpu() for seg_t in seg_tokens] # get_context_emb

outputs = model.llama_model.generate()

Could you provide some guidance or suggestions on how to effectively perform inference with sharded models in this multi-GPU environment?

Thank you for your incredible work.

yinanhe (Member) commented Jan 26, 2024

Thank you for your interest in our work. Could you provide more error information? We haven't attempted to load the model in shards before. We've successfully run it on a graphics card with at least 16GB of VRAM. We're happy to research this issue together if you can provide the error information from these lines.

ddoron9 (Author) commented Jan 29, 2024

I'm not sure whether all of these errors are general to this repository, because I've modified demo.py and mvbench.py to run inference without Gradio. There weren't significant changes in the code, but for reference, I'll attach my code file:
infer.txt

Anyway, enabling low_resource does return multiple device-mismatch errors inside llama_model. The locations of the issue are as follows; a few more locations may show up after these examples.

  1. seg_embs = [self.model.llama_model.base_model.model.model.embed_tokens(seg_t) for seg_t in seg_tokens]
    mixed_embs = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair] + [seg_embs[-1]]
    mixed_embs = torch.cat(mixed_embs, dim=1)
  2. return self.weight * hidden_states
  3. hidden_states, self_attn_weights, present_key_value = self.self_attn(

Here is one of the error traces when I enable the low_resource flag.

Traceback (most recent call last):
  File "infer.py", line 88, in <module>
    main_results = main()
  File "infer.py", line 78, in main
    results = ask_questions(chat, chat_state, img_list, questions)
  File "infer.py", line 59, in ask_questions
    llm_message, _, chat_state = chat.answer(conv=chat_state, img_list=img_list, max_new_tokens=1000, num_beams=num_beams, temperature=temperature)
  File "/data1/doyi/Ask-Anything/video_chat2/conversation.py", line 65, in answer
    outputs = self.model.llama_model.generate(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/peft/peft_model.py", line 731, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/generation/utils.py", line 1525, in generate
    return self.sample(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/generation/utils.py", line 2622, in sample
    outputs = self(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 674, in forward
    outputs = self.model(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 563, in forward
    layer_outputs = decoder_layer(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 178, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/peft/tuners/lora.py", line 710, in forward
    self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I found a way to perform inference using about 9GiB of GPU memory by enabling low_resource. Here is how I modified the code:

  1. "cuda:0" to self.device
    seg, return_tensors="pt", add_special_tokens=i == 0).to("cuda:0").input_ids
  2. changing torch_dtype to bool float
                self.llama_model = LlamaForCausalLM.from_pretrained(
                    llama_model_path,
                    torch_dtype=torch.bfloat16,
                    load_in_8bit=True,
                )

What I'm struggling with now is that I want to run inference with llama-7b without low_resource's int8 model loading.

               self.llama_model = LlamaForCausalLM.from_pretrained(
                   llama_model_path,
                   torch_dtype=torch.bfloat16,
                   device_map={0: "4GiB", 1:"12GiB", 2:"12GiB", 3:"12GiB", 4:"12GiB", 5:"12GiB", 6:"12GiB", "cpu": "30GiB"}
               )

After setting device_map like this, it raises a ValueError:

    model = VideoChat2_it(config=cfg.model)
  File "/data1/doyi/Ask-Anything/video_chat2/models/videochat2_it.py", line 142, in __init__
    self.llama_model = LlamaForCausalLM.from_pretrained(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 797, in _load_state_dict_into_meta_model
    raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: model.embed_tokens.weight doesn't have any device set.

yinanhe (Member) commented Jan 30, 2024

@ddoron9 For the first error, I think this may be caused by the hard coding in

seg, return_tensors="pt", add_special_tokens=i == 0).to("cuda:0").input_ids

Changing cuda:0 to self.device may solve this problem, like this.

seg, return_tensors="pt", add_special_tokens=i == 0).to(self.device).input_ids

Please fix it and try again.
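
For reference, here is a minimal sketch of that idea (the helper name and arguments are hypothetical, not code from this repo): resolving the target device from the embedding weights themselves also works when the model is sharded across several GPUs.

def tokenize_segments(tokenizer, segments, embed_tokens):
    """Hypothetical helper: tokenize prompt segments and move them to the
    device that actually holds the (possibly sharded) embedding weights."""
    target_device = embed_tokens.weight.device
    return [
        tokenizer(seg, return_tensors="pt", add_special_tokens=(i == 0))
        .to(target_device)
        .input_ids
        for i, seg in enumerate(segments)
    ]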

yinanhe (Member) commented Jan 30, 2024

@ddoron9 For the second question, the ValueError indicates that model.embed_tokens.weight doesn't have a device set, suggesting that during the model loading process, certain parameters (such as embed_tokens.weight) aren't assigned to any device.

I'm not very familiar with this, but passing device_map={0: "4GiB", 1:"12GiB", 2:"12GiB", 3:"12GiB", 4:"12GiB", 5:"12GiB", 6:"12GiB", "cpu": "30GiB"} doesn't seem to work; that isn't where you give hardware information about your device. According to my research on Hugging Face, the keys of this map should be parameter/module names from llama_model, and the values should indicate the device where you intend to place them. You should make sure that every part of the model has a designated device in the device_map, which may involve examining the structure of the LlamaForCausalLM model and how its parameters are initialized and loaded. For example, you can set

{'embed_tokens.weight': 0,
 'embed_tokens.bias': 1,
 'encoder': "cpu"} 

Perhaps device_map="auto" can also solve this problem.
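
As a rough, untested sketch: if those per-GPU sizes were meant as memory budgets, in transformers they usually go into the max_memory argument, while device_map="auto" lets accelerate assign every parameter to a device, which should avoid the "doesn't have any device set" error.

import torch
from transformers import LlamaForCausalLM

llama_model_path = "path/to/llama-7b"  # placeholder path

# Memory budgets go in `max_memory`; `device_map="auto"` then shards the model
# so that every parameter is assigned a device (GPUs first, then CPU).
model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB",
                4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"},
)
print(model.hf_device_map)  # inspect which device each submodule ended up on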

@Coronal-Halo

Have you solved this problem? I faced a similar problem -- I can perform inference by setting the flag 'low_resource = true' in the config file, but I would always get the "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" Error when I set 'low_resource = false'. The above fixes do not work.

yinanhe (Member) commented Jan 31, 2024

Have you solved this problem? I faced a similar problem -- I can perform inference by setting the flag 'low_resource = true' in the config file, but I would always get the "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" Error when I set 'low_resource = false'. The above fixes do not work.

@Coronal-Halo Hi, it seems that for @ddoron9 the error occurs only when low_resource=True, while your case is exactly the opposite. low_resource=False is our default setting, and the program runs normally with it. I'd like to know whether the solutions you tried include #115 (comment). If possible, could you provide more error information?

@Coronal-Halo

Thanks for your reply. I solved this problem by trying every combination of placing variables on the CPU vs. the GPU.

@kenchanLOL

Hi, I tried to load the model with dual 4090s and still faced the same error after applying the changes. I looked into the debugger and realized that it happens because the input tensor's device is switched automatically by a pre-forward hook, which I believe Hugging Face adds when setting device_map="auto". Here are the steps to reproduce the behavior:

  1. Check the hf_device_map after loading the model and find the index of the layer where the device number changes.
...
'model.layers.15': 0,
'model.layers.16': 1,
...

In my case, it is model.layers.16.
2. Set a conditional breakpoint at

layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)

3. Add hidden_states.device to the watch list and step into the function. The device changes from device(type='cuda', index=0) to device(type='cuda', index=1). Besides, I checked the device of self.input_layernorm.weight and it is located on cuda:0. Therefore, it raises the error

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I did not encounter a similar issue when I loaded v1 with the same setting for inference. Is it because v1 uses the original llama while v2 doesn't? Is there any workaround or fix here? Thanks.
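
One direction that might be worth trying (an untested sketch; the class name LlamaDecoderLayer and the memory budgets below are assumptions about the custom modeling_llama.py and this setup): build the device map explicitly with accelerate and forbid splitting a decoder layer across GPUs, so a layer's input_layernorm / self_attn / mlp weights always sit on the same device as the hidden states entering that layer.

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, LlamaForCausalLM

llama_model_path = "path/to/llama-7b"  # placeholder path

config = AutoConfig.from_pretrained(llama_model_path)
with init_empty_weights():
    empty_model = LlamaForCausalLM(config)  # structure only, no weights allocated

# Keep each decoder layer on a single GPU so its norm/attention/MLP weights
# share a device with the hidden states entering that layer.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "20GiB", 1: "20GiB"},  # assumed budgets for dual 4090s
    no_split_module_classes=["LlamaDecoderLayer"],
)

model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.float16,
    device_map=device_map,
)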
