
When using a Hugging Face pretrained model with multi-GPU, model parameters are duplicated for every GPU in RAM #17043

Open
linyubupa opened this issue Mar 12, 2023 · 19 comments
Labels
3rd party (Related to a 3rd-party) · question (Further information is requested) · strategy: deepspeed · waiting on author (Waiting on user action, correction, or update)

Comments

linyubupa commented Mar 12, 2023

Bug description

When using a Hugging Face pretrained model with multi-GPU training, the model parameters are duplicated in RAM for every GPU.

How to reproduce the bug

from pytorch_lightning import LightningModule, Trainer  # or: from lightning.pytorch import ...
from transformers import (
    AdamW,
    GPTNeoForCausalLM,
    GPT2Tokenizer,
    AutoTokenizer,
    AutoModelForCausalLM,
    get_linear_schedule_with_warmup,
)


class AlpsModule(LightningModule):
    def __init__(
        self,
        model_name_or_path: str = "EleutherAI/gpt-j-6B",
        cache_dir: str = "/mntnlp/yumu/gpt-neo-x/",
        num_labels: int = 2,
        learning_rate: float = 5e-6,
        adam_epsilon: float = 3e-8,
        warmup_steps: int = 30,
        weight_decay: float = 0.01,
        **kwargs,
    ):
        super().__init__()
        self.save_hyperparameters()

        # NOTE: self.tokenizer is assumed to be created before this point.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            pad_token_id=self.tokenizer.pad_token_id,
            bos_token_id=self.tokenizer.bos_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            cache_dir=cache_dir,
            # low_cpu_mem_usage=True,
        ).half()


# args: parsed command-line arguments providing num_devices and num_nodes
trainer = Trainer(
    max_epochs=1,
    devices=args.num_devices,
    precision=16,
    strategy="deepspeed_stage_3",
    accelerator="gpu",
    num_nodes=args.num_nodes,
)

Error messages and logs

No response

Environment

No response

More info

No response

cc @awaelchli

linyubupa added the bug and needs triage labels on Mar 12, 2023
linyubupa (Author) commented Mar 12, 2023

If you have multiple GPUs, the CPU memory cost = 2 * model_size * num_gpus.
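(For a rough worked example based on that formula: GPT-J-6B in fp16 is about 12 GB of weights, so with 8 GPUs that would be roughly 2 × 12 GB × 8 ≈ 192 GB of host RAM.)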

awaelchli (Contributor) commented:

Hey @linyubupa

This is expected given the way you are initializing the model. I can see from the code snippet that you create the model in __init__. This isn't wrong, but for large models like yours it is inefficient. I recommend moving the initialization into this special Lightning hook:

def configure_sharded_model(self):
    self.model = AutoModelForCausalLM.from_pretrained(...)

Here is the documentation for working with deepspeed models (and also documentation for configure_sharded_model): https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#shard-model-instantly-to-reduce-initialization-time-memory
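For reference, a minimal sketch of what this could look like for the module above (an illustration only, reusing the hyperparameter name from the original snippet and omitting the tokenizer handling):

from pytorch_lightning import LightningModule
from transformers import AutoModelForCausalLM


class AlpsModule(LightningModule):
    def __init__(self, model_name_or_path: str = "EleutherAI/gpt-j-6B", **kwargs):
        super().__init__()
        self.save_hyperparameters()
        self.model = None  # defer model creation until the strategy is set up

    def configure_sharded_model(self):
        # Called once the DeepSpeed strategy is initialized, so the weights can be
        # sharded across ranks instead of fully materialized in every process.
        self.model = AutoModelForCausalLM.from_pretrained(self.hparams.model_name_or_path)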

awaelchli added the question and strategy: deepspeed labels and removed the bug and needs triage labels on Mar 15, 2023
awaelchli (Contributor) commented:

Please let me know if that helps :)

awaelchli added the waiting on author label and self-assigned this issue on Mar 18, 2023
cgd-bot commented Mar 23, 2023

In reply to @awaelchli's suggestion to move the model creation into configure_sharded_model: I had the same problem, but this method didn't solve it.

imraviagrawal commented:

In reply to @awaelchli's suggestion to move the model creation into configure_sharded_model: yeah, I had the same issue, and the above does not solve it.

linyubupa (Author) commented:

In reply to the configure_sharded_model suggestion: sorry for the late reply. I now build the model in configure_sharded_model, but the CPU memory usage is still very high.

KzZheng commented Mar 31, 2023

Same issue. After I moved the model initialization into configure_sharded_model, I get a new error showing that the loaded parameters are being assigned to empty tensors.

(screenshot of the error attached: Selection_278)

KzZheng commented Mar 31, 2023

Following up on my previous comment: it seems the model initialization should go here, but loading the pretrained weights should not.
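A hypothetical sketch of that split, building the architecture from the config only so that no pretrained weights are materialized per rank (the hyperparameter name is taken from the original snippet; the pretrained weights would still need to be loaded in a ZeRO-3-aware way, e.g. via the GatheredParameters approach discussed later in this thread):

from transformers import AutoConfig, AutoModelForCausalLM

def configure_sharded_model(self):
    # Build the architecture only; no pretrained weights are downloaded or copied here.
    config = AutoConfig.from_pretrained(self.hparams.model_name_or_path)
    self.model = AutoModelForCausalLM.from_config(config)
    # The pretrained weights then have to be loaded separately, in a way that is
    # aware of the ZeRO stage 3 parameter partitioning.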

munhouiani commented:

In reply to @KzZheng: did you find a solution for this?

leeglg commented Apr 20, 2023

Any solution for this? I really need help.

Borda added the 3rd party label on May 3, 2023
saketsathe commented:

I am facing a similar issue

KzZheng commented May 3, 2023

I think one possible solution is to convert the pretrained model weights to the DeepSpeed ZeRO-3 sharded model format, but I haven't tried it yet.

saketsathe commented:

Is there code to try it out?

leeglg commented May 4, 2023

@saketsathe @KzZheng I tried it like this:

import time

import deepspeed
import torch
import torch.distributed as dist
from transformers import LlamaConfig, LlamaForCausalLM


def configure_sharded_model(self):
    print("start configure sharded model")

    # Randomly initialize the model from its config (no pretrained weights loaded here).
    llamaconfig = LlamaConfig.from_pretrained("decapoda-research/llama-7b-hf")
    self.model = LlamaForCausalLM(llamaconfig)

    self.model.set_adapter(self.adapter_config)
    freeze_except_adapter(self.model, self.adapter_config)

    # A list containing a single weight to inspect and modify.
    params_to_gather = [self.model.model.layers[0].self_attn.q_proj.weight]

    # Runs on every process.
    # Idea: find the named parameters in the checkpoint shards and, if my model has a
    # parameter with the same name, overwrite it.
    # Print the weight once first to see its current value; later run the same code to
    # confirm it has been changed to 0.

    # Check the value before the change.
    with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
        print("\n randomly initialized weight \n", self.model.model.layers[0].self_attn.q_proj.weight[0, :5])

    time.sleep(3)

    # 1. Load the checkpoint files one by one.
    # 2. Match the keys between the model parameters and each file.
    # 3. Assign the values.
    # Run this on a single GPU (rank 0) only.
    if torch.distributed.get_rank() == 0:
        with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
            self.model.model.layers[0].self_attn.q_proj.weight[0, :5] = 0

        # # Inspect the checkpoint files.
        # SHARDED_FILE_PATH = "/home2/leeg/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348"

        # # State dict from my instantiated model
        # # model_named_params = self.model.model.named_parameters()

        # # self.model.get_parameter()

        # # Load the checkpoint shard files.
        # PATH_LIST = [os.path.join(SHARDED_FILE_PATH, f"pytorch_model-000{i:02}-of-00033.bin") for i in range(1, 34)]
        # for PATH in tqdm(PATH_LIST):

        #     # Load a single checkpoint shard file's state dict.
        #     file_state_dict = torch.load(PATH)

        #     # From the named_parameters of my model...
        #     named_parameters = dict(self.model.model.named_parameters())

        #     # ...compare against the keys in the loaded checkpoint file; if a key also
        #     # exists in my model, take its value from the file.
        #     params_to_gather = [named_parameters[k] for k in file_state_dict.keys() if k in named_parameters]

        #     # for cp_k, cp_v in file_state_dict.items():
        #     #     if "inv_freq" in cp_k:
        #     #         continue

        #         # model_p = self.model.model.get_parameter(cp_k)
        #         # sharded_model_ps_dict[cp_k] = self.model.model.get_parameter(cp_k)
        #     with deepspeed.zero.GatheredParameters(params=params_to_gather, modifier_rank=0):
        #         self.model.model.load_state_dict(file_state_dict, strict=False)

    dist.barrier()

    # Runs on every process.
    # Check the value after the change.
    with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
        print("\n barrier weight ", self.model.model.layers[0].self_attn.q_proj.weight[0, :5])

The problem is that when I use from_pretrained("~~") inside the LightningModule's configure_sharded_model, the Lightning DeepSpeed stage 3 strategy interferes with from_pretrained's weight assignment.

So instead of using from_pretrained, I tried manually assigning the parameter tensors from the sharded checkpoint files to my model's variables.

I haven't figured all of this out for certain, but I experimented with the following:

  1. I can access my randomly initialized parameters from the GPUs or CPUs, wherever the DeepSpeed stage 3 strategy has placed them.
  2. I can change such a parameter's value inside the deepspeed.zero.GatheredParameters(...) context manager.

So I think loading the pretrained parameter files manually and overwriting my randomly initialized model parameters inside deepspeed.zero.GatheredParameters is a suitable approach.
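Distilled into a minimal sketch (assuming a ZeRO stage 3 run, a model whose parameters are already partitioned, and Hugging Face checkpoint shards on disk; the helper name, path handling, and shard naming below are illustrative, not from the code above):

import os

import deepspeed
import torch
import torch.distributed as dist


def load_pretrained_into_sharded_model(model, shard_dir, num_shards):
    # Hypothetical helper: copy weights from Hugging Face checkpoint shards into a model
    # whose parameters are already partitioned by DeepSpeed ZeRO stage 3.
    named_parameters = dict(model.named_parameters())
    for i in range(1, num_shards + 1):
        shard_path = os.path.join(shard_dir, f"pytorch_model-{i:05d}-of-{num_shards:05d}.bin")
        file_state_dict = torch.load(shard_path, map_location="cpu")
        # Gather only the partitioned parameters that this shard provides.
        params_to_gather = [named_parameters[k] for k in file_state_dict if k in named_parameters]
        # With modifier_rank=0, rank 0 modifies the gathered (full) parameters and the
        # updated values are re-partitioned to all ranks when the context exits.
        with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
            if dist.get_rank() == 0:
                model.load_state_dict(file_state_dict, strict=False)
    dist.barrier()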

linyubupa (Author) commented:

I solved this by using DeepSpeed initialization with the transformers Trainer: https://huggingface.co/docs/transformers/main_classes/deepspeed

deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
    your_program.py --deepspeed ds_config.json
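For completeness, a hedged sketch of what the Python side of that setup could look like (the config values, output directory, and dataset are illustrative, not taken from this issue; the key point is that creating TrainingArguments with a ZeRO stage 3 config before calling from_pretrained lets transformers build the model under deepspeed.zero.Init, so the full weights are not duplicated in every process):

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Illustrative ZeRO stage 3 config (normally kept in ds_config.json and passed
# on the command line as --deepspeed ds_config.json).
ds_config = {
    "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
}

# Creating TrainingArguments with the DeepSpeed config BEFORE from_pretrained is what
# enables the sharded (zero.Init) model instantiation in transformers.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,
)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)  # train_dataset: user-provided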

zhilif commented Oct 10, 2023

Is there any update on this issue?

awaelchli removed their assignment on Nov 25, 2023
Morizeyao commented:

update?

chuckhope commented:

same...

mickeysun0104 commented:

any updates?
