Unable to load the model after pruning #7

Open

JCDemon opened this issue Apr 6, 2024 · 5 comments


JCDemon commented Apr 6, 2024

I tried using
model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True)
but it raises an error:
Traceback (most recent call last):
File "/home/ubuntu/test_scripts/benchmark_r.py", line 154, in
main()
File "/home/ubuntu/test_scripts/benchmark_r.py", line 63, in main
model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True,
File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 556, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
) = cls._load_pretrained_model(
File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([2048, 2785]) in "weight" (which has shape torch.Size([2048, 5504])), this look incorrect.

I also tried adding the parameter ignore_mismatched_sizes=True when loading:
model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True, ignore_mismatched_sizes=True)

This also raises an error:
Some weights of QWenLMHeadModel were not initialized from the model checkpoint at /data/xxxx and are newly initialized because the shapes did not match:

  • transformer.h.10.mlp.c_proj.weight: found shape torch.Size([2048, 2785]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.10.mlp.w1.weight: found shape torch.Size([2785, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.10.mlp.w2.weight: found shape torch.Size([2785, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.11.mlp.c_proj.weight: found shape torch.Size([2048, 2518]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.11.mlp.w1.weight: found shape torch.Size([2518, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.11.mlp.w2.weight: found shape torch.Size([2518, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.12.attn.c_attn.weight: found shape torch.Size([3840, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.12.attn.c_proj.weight: found shape torch.Size([2048, 1280]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.12.mlp.c_proj.weight: found shape torch.Size([2048, 2393]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.12.mlp.w1.weight: found shape torch.Size([2393, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.12.mlp.w2.weight: found shape torch.Size([2393, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.13.attn.c_attn.weight: found shape torch.Size([3072, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.13.attn.c_proj.weight: found shape torch.Size([2048, 1024]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.13.mlp.c_proj.weight: found shape torch.Size([2048, 3776]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.13.mlp.w1.weight: found shape torch.Size([3776, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.13.mlp.w2.weight: found shape torch.Size([3776, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.14.attn.c_attn.weight: found shape torch.Size([2688, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.14.attn.c_proj.weight: found shape torch.Size([2048, 896]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.14.mlp.c_proj.weight: found shape torch.Size([2048, 3594]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.14.mlp.w1.weight: found shape torch.Size([3594, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.14.mlp.w2.weight: found shape torch.Size([3594, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.15.attn.c_attn.weight: found shape torch.Size([3072, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.15.attn.c_proj.weight: found shape torch.Size([2048, 1024]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.15.mlp.c_proj.weight: found shape torch.Size([2048, 4113]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.15.mlp.w1.weight: found shape torch.Size([4113, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.15.mlp.w2.weight: found shape torch.Size([4113, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.16.attn.c_attn.weight: found shape torch.Size([3072, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.16.attn.c_proj.weight: found shape torch.Size([2048, 1024]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.17.attn.c_attn.weight: found shape torch.Size([2688, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.17.attn.c_proj.weight: found shape torch.Size([2048, 896]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.17.mlp.c_proj.weight: found shape torch.Size([2048, 3263]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.17.mlp.w1.weight: found shape torch.Size([3263, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.17.mlp.w2.weight: found shape torch.Size([3263, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.18.mlp.c_proj.weight: found shape torch.Size([2048, 3861]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.18.mlp.w2.weight: found shape torch.Size([3861, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.18.attn.c_attn.weight: found shape torch.Size([1536, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.18.attn.c_proj.weight: found shape torch.Size([2048, 512]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.18.mlp.w1.weight: found shape torch.Size([3861, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.19.attn.c_attn.weight: found shape torch.Size([2688, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.19.attn.c_proj.weight: found shape torch.Size([2048, 896]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.20.attn.c_attn.weight: found shape torch.Size([2688, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.20.attn.c_proj.weight: found shape torch.Size([2048, 896]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.20.mlp.c_proj.weight: found shape torch.Size([2048, 3291]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.20.mlp.w1.weight: found shape torch.Size([3291, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.20.mlp.w2.weight: found shape torch.Size([3291, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.21.attn.c_attn.weight: found shape torch.Size([1536, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.21.attn.c_proj.weight: found shape torch.Size([2048, 512]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.22.attn.c_attn.weight: found shape torch.Size([3072, 2048]) in the checkpoint and torch.Size([6144, 2048]) in the model instantiated
  • transformer.h.22.attn.c_proj.weight: found shape torch.Size([2048, 1024]) in the checkpoint and torch.Size([2048, 2048]) in the model instantiated
  • transformer.h.9.mlp.c_proj.weight: found shape torch.Size([2048, 2630]) in the checkpoint and torch.Size([2048, 5504]) in the model instantiated
  • transformer.h.9.mlp.w1.weight: found shape torch.Size([2630, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
  • transformer.h.9.mlp.w2.weight: found shape torch.Size([2630, 2048]) in the checkpoint and torch.Size([5504, 2048]) in the model instantiated
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Traceback (most recent call last):
    File "/home/ubuntu/test_scripts/benchmark_r.py", line 152, in
    main()
    File "/home/ubuntu/test_scripts/benchmark_r.py", line 63, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_path, device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True, ignore_mismatched_sizes=True)
    File "/home/ubuntu/miniconda3/envs/qwen/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 556, in from_pretrained
    return model_class.from_pretrained(
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3558, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/accelerate/big_modeling.py", line 474, in dispatch_model
    model.to(device)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
    [Previous line repeated 2 more times]
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
    File "/home/ubuntu/miniconda3/envs/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    NotImplementedError: Cannot copy out of meta tensor; no data!

How do you load the model after pruning it?
I'm eager to try FLAP and look forward to your reply. Thank you.


BenchuYee commented Apr 8, 2024

Hi JCDemon, if you want to load the model, you can use the PyTorch API (torch.save / torch.load) instead of the Hugging Face API (save_pretrained / from_pretrained). When you finish pruning the model, use torch.save to save it, and then use torch.load to load the pruned model.
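
A minimal sketch of that workflow (the variable and file names here are placeholders, not taken from the FLAP code):

import torch

# ... prune the model with FLAP here ...

# Save the whole model object (architecture plus the pruned, irregular layer shapes).
torch.save(model, "pruned_model.pt")

# Load it back later; the pruned shapes come back exactly as saved.
# Note: torch.load needs the model class (e.g. the repo's modified LlamaForCausalLM) to be importable.
model = torch.load("pruned_model.pt", map_location="cpu")
model.eval()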


JCDemon commented Apr 9, 2024


I've solved the loading issue, thank you for your response. I was using FLAP to prune a LLaMA-format version of the Qwen model (I converted the Qwen 1.8B model into LLaMA-2 format beforehand). I found that I could use save_pretrained to save the pruned model in HF format. I then converted the HF-format pruned LLaMA model back into a Qwen model and found that I couldn't load it using from_pretrained(). Fortunately, I managed to modify the Qwen structure defined in modeling.py (by applying a dynamic head size to each layer), and after that I could finally load the model using from_pretrained().
Currently, I have a problem using the pruned model to generate text. The model's output is garbled text instead of correct sentences. I think it could be because I didn't modify the word-embedding step (I checked the input_ids running through the model and they are correct, but when converted to inputs_embeds, the inputs_embeds seem totally wrong). Any suggestions about this?

2. The input_ids here are correct, but the inputs_embeds are wrong. I checked self.wte; it is an nn.Embedding object. The printed inputs_embeds shape is torch.Size([1, 12, 2048]), but I think it should be torch.Size([1, 12, 768]), because I already set the hidden_size of the first layer to 768. I don't know why it is still the original value 2048.
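
As a quick diagnostic (a sketch only; nothing here is taken from the thread), one could compare the embedding layer's output dimension against the hidden size the config reports:

import torch

# model: the pruned Qwen model already loaded via from_pretrained(..., trust_remote_code=True)
wte = model.transformer.wte  # the nn.Embedding of QWenLMHeadModel
print("embedding weight shape:", tuple(wte.weight.shape))  # (vocab_size, embed_dim)
print("config hidden_size:", model.config.hidden_size)

ids = torch.tensor([[1, 2, 3]])
print("inputs_embeds shape:", tuple(wte(ids).shape))  # (1, 3, embed_dim)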

Also, I couldn't use the above-mentioned pruned LLaMA model (which is actually a LLaMA-format Qwen model) to generate text. I loaded it with exactly the same torch.load code as in the generate.py you provide on GitHub, and I ran into the same garbled-output issue mentioned earlier. BTW, the tokenizer I used is the Qwen tokenizer.

@shwu-nyunai

Hi @JCDemon,
can you share the code you used to load the model with from_pretrained()?

I am trying to save with torch.save, but it doesn't seem to save the model for me.
#9


JCDemon commented Apr 9, 2024


Sure thing, here is the code I used to prune the LLaMA-format Qwen model and save it in HF format. The line "model.save_pretrained(args.save_model, safe_serialization=True)" is what actually made it work.

"import argparse
import os
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from models.hf_llama.modeling_llama import LlamaForCausalLM

from importlib.metadata import version

from lib.prune import prune_wanda_sp, prune_flap, prune_magnitude_sp, check_sparsity
from lib.eval import eval_ppl

print('torch', version('torch'))
print('transformers', version('transformers'))
print('accelerate', version('accelerate'))
print('# of gpus: ', torch.cuda.device_count())

def get_llm(model, cache_dir="llm_weights"):
# model = AutoModelForCausalLM.from_pretrained(
# model,
# torch_dtype=torch.float16,
# cache_dir=cache_dir,
# low_cpu_mem_usage=True,
# device_map="auto"
# )
model = LlamaForCausalLM.from_pretrained(
model,
torch_dtype=torch.float16,
cache_dir=cache_dir,
low_cpu_mem_usage=True,
# device_map="auto"
)
print(len(model.model.layers))
for i in range(len(model.model.layers)):
model.model.layers[i].self_attn.o_proj.bias = torch.nn.Parameter(torch.zeros_like(model.model.layers[i].self_attn.o_proj.bias, device='cpu')) # 或 'cuda'
model.model.layers[i].mlp.down_proj.bias = torch.nn.Parameter(torch.zeros_like(model.model.layers[i].mlp.down_proj.bias, device='cpu')) # 或 'cuda'
torch.nn.init.zeros_(model.model.layers[i].self_attn.o_proj.bias)
torch.nn.init.zeros_(model.model.layers[i].mlp.down_proj.bias)

model.seqlen = 128
return model

def main():
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, help='LLaMA model') # Huggingface model name
parser.add_argument('--seed', type=int, default=0, help='Seed for sampling the calibration data.')
parser.add_argument('--nsamples', type=int, default=2048, help='Number of calibration samples.')
parser.add_argument('--pruning_ratio', type=float, default=0, help='Pruning ratio.')
parser.add_argument('--remove_heads', type=int, default=8, help='Remove num_heads')
parser.add_argument("--metrics", type=str, default="WIFV", choices=["IFV", "WIFV", "WIFN", 'N/A'])
parser.add_argument("--structure", type=str, default="AL-AM", choices=["UL-UM", "UL-MM", "AL-MM", "AL-AM", 'N/A'])
parser.add_argument("--prune_method", type=str, default="flap", choices=["flap", "wanda_sp", "mag_sp"])
parser.add_argument("--cache_dir", default="llm_weights", type=str)
parser.add_argument('--unstr', action="store_true")
parser.add_argument('--eval', action="store_true")
parser.add_argument('--save_model', type=str, default=None, help='Path to save the pruned model.')
args = parser.parse_args()

# Setting seeds for reproducibility
np.random.seed(args.seed)
torch.random.manual_seed(args.seed)

# Build the model and tokenizer
print(f"loading llm model {args.model}")
model = get_llm(args.model, args.cache_dir)
device = torch.device("cuda:0")
model.to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False, trust_remote_code=True)

if "30b" in args.model or "65b" in args.model: # for 30b and 65b we use device_map to load onto multiple A6000 GPUs, thus the processing here.
    device = model.hf_device_map["lm_head"]
print("use device ", device)

# Prune the model
print("pruning starts")
if args.prune_method == "flap":
    if args.metrics == 'N/A':
        raise ValueError("For FLAP pruning, the metrics parameter must be chosen from ['IFV', 'WIFV', 'WIFN']. 'N/A' is not a valid choice.")  
    if args.structure == 'N/A':
        raise ValueError("For FLAP pruning, the compressed model structure parameter must be chosen from ['UL-UM', 'UL-MM', 'AL-MM', 'AL-AM']. 'N/A' is not a valid choice.")  
    prune_flap(args, model, tokenizer, device)
elif args.prune_method == "wanda_sp":
    prune_wanda_sp(args, model, tokenizer, device)
elif args.prune_method == "mag_sp":
    prune_magnitude_sp(args, model, tokenizer, device)

# Check the sparsity of the model
print("*"*30)
sparsity_ratio = check_sparsity(model)
print(f"sparsity sanity check {sparsity_ratio:.4f}")
print(f"model parameter {sum(p.numel() for p in model.parameters()) / 1000 ** 3:.2f}B")
print("*"*30)
# Evaluate the model
if args.eval:
    ppl = eval_ppl(model, tokenizer, device)    
    print(f"ppl on wikitext {ppl}")
    
# Save the model
if args.save_model:
    if not os.path.exists(args.save_model):
        os.makedirs(args.save_model)
    # torch.save(model, f'{args.save_model}/pruned_model.pt')    
    # torch.save(model, f'{args.save_model}/pruned_model.bin')
    model.save_pretrained(args.save_model, safe_serialization=True)
    # tokenizer.save_pretrained(args.save_model)

if name == 'main':
main()"


JCDemon commented Apr 9, 2024


I think the reason I can use save_pretrained to save my model might be because it's not an official LLaMA-2 model; the model I used was converted from Qwen 1.8B.
Also, I think the behavior you mention in #9 doesn't make sense, since the state dict should be modified after pruning; it's strange that it saves the original model. Maybe you can try checking the weights before calling save_pretrained?
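
A minimal sketch of such a check (assuming the pruned model object from the script above; the module names follow the LLaMA layout):

# Print the attention/MLP weight shapes after pruning. If they still match the
# original config (e.g. 5504 for the MLP intermediate size), pruning did not
# actually change the tensors that save_pretrained will write out.
for name, param in model.state_dict().items():
    if "self_attn" in name or "mlp" in name:
        print(name, tuple(param.shape))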
