nan when converting neox and opt models with AutoGPTQ-triton #8

Closed
GenTxt opened this issue Apr 22, 2023 · 7 comments


@GenTxt

GenTxt commented Apr 22, 2023

I will test the default CUDA version next, but I'm encountering nan for all conversions using 'AutoGPTQ-triton'.

Using Ubuntu 22.04, Python 3.10, transformers 4.28 (dev), 64 GB RAM, and 2x 24 GB RTX cards.

Installed successfully with all dependencies.

Am I missing a particular package version?

python basic_usage.py

pretrained_model_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

2023-04-22 13:29:42 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 1/44...
2023-04-22 13:29:47 INFO [auto_gptq.quantization.gptq] duration: 5.032277584075928
2023-04-22 13:29:47 INFO [auto_gptq.quantization.gptq] avg loss: 17.77143669128418
2023-04-22 13:29:47 INFO [auto_gptq.modeling._base] Quantizing attention.dense in layer 1/44...
2023-04-22 13:29:48 INFO [auto_gptq.quantization.gptq] duration: 1.7948594093322754
2023-04-22 13:29:48 INFO [auto_gptq.quantization.gptq] avg loss: 1.888306736946106
2023-04-22 13:29:49 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_h_to_4h in layer 1/44...
2023-04-22 13:29:50 INFO [auto_gptq.quantization.gptq] duration: 1.8883254528045654
2023-04-22 13:29:50 INFO [auto_gptq.quantization.gptq] avg loss: 28.566619873046875
2023-04-22 13:29:50 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_4h_to_h in layer 1/44...
2023-04-22 13:30:02 INFO [auto_gptq.quantization.gptq] duration: 11.343331575393677
2023-04-22 13:30:02 INFO [auto_gptq.quantization.gptq] avg loss: nan
2023-04-22 13:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 2/44
2023-04-22 13:30:02 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 2/44...
2023-04-22 13:30:04 INFO [auto_gptq.quantization.gptq] duration: 1.9044442176818848
2023-04-22 13:30:04 INFO [auto_gptq.quantization.gptq] avg loss: nan
etc. stopped

Similar results with opt-30b:

pretrained_model_dir = "models/opt-30b"
quantized_model_dir = "4bit_converted"

Loading checkpoint shards: 100%|██████████████| 267/267 [13:52<00:00, 3.12s/it]
2023-04-22 13:55:41 INFO [auto_gptq.modeling._base] Start quantizing layer 1/44
2023-04-22 13:55:46 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 1/44...
2023-04-22 13:55:51 INFO [auto_gptq.quantization.gptq] duration: 4.748894453048706
2023-04-22 13:55:51 INFO [auto_gptq.quantization.gptq] avg loss: 17.623794555664062
2023-04-22 13:55:51 INFO [auto_gptq.modeling._base] Quantizing attention.dense in layer 1/44...
2023-04-22 13:55:53 INFO [auto_gptq.quantization.gptq] duration: 1.8472576141357422
2023-04-22 13:55:53 INFO [auto_gptq.quantization.gptq] avg loss: 1.9249645471572876
2023-04-22 13:55:53 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_h_to_4h in layer 1/44...
2023-04-22 13:55:55 INFO [auto_gptq.quantization.gptq] duration: 1.9470229148864746
2023-04-22 13:55:55 INFO [auto_gptq.quantization.gptq] avg loss: 28.64271354675293
2023-04-22 13:55:55 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_4h_to_h in layer 1/44...
2023-04-22 13:56:06 INFO [auto_gptq.quantization.gptq] duration: 11.630852222442627
2023-04-22 13:56:06 INFO [auto_gptq.quantization.gptq] avg loss: nan
2023-04-22 13:56:07 INFO [auto_gptq.modeling._base] Start quantizing layer 2/44
2023-04-22 13:56:07 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 2/44...
2023-04-22 13:56:09 INFO [auto_gptq.quantization.gptq] duration: 1.975161075592041
2023-04-22 13:56:09 INFO [auto_gptq.quantization.gptq] avg loss: nan
etc. stopped

@GenTxt
Author

GenTxt commented Apr 22, 2023

Confirming similar nan results to the above for the main CUDA branch, using the same models plus an additional neox-20b.

@PanQiWei
Collaborator

I saw you were using basic_usage.py, which uses one-shot quantization (a single sample) to showcase the basic APIs; it can run into 'nan' when quantizing a big model with so few samples. I would suggest trying quantize_with_alpaca.py, which uses many instruction-following samples to quantize LLMs.

Please let me know if the same problem still occurs when using quantize_with_alpaca.py
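
For reference, here is a minimal sketch of the idea (this is not the exact contents of either example script; the paths, calibration texts, and BaseQuantizeConfig settings below are illustrative):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# GPTQ estimates per-layer statistics from the calibration batch; with a single
# sample those estimates can degenerate on large models and the loss turns to nan.
calibration_texts = [
    "Instruction:\nName three characteristics commonly associated with a strong leader.\nOutput:\n",
    # ...many more instruction-following samples in the real quantize_with_alpaca.py
]
examples = [tokenizer(text) for text in calibration_texts]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)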

@GenTxt
Author

GenTxt commented Apr 24, 2023

Tested the 'quantize_with_alpaca.py' script mentioned above with the latest 0.3 version.

Needed to change the following:

parser.add_argument("--fast_tokenizer", action="store_true")

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.

changed to:

parser.add_argument("--fast_tokenizer", action="store_false")

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/gpt-neox-20b --quantized_model_dir 4bit_converted/neox20b-4bit.safetensor

After the change, quantization proceeded without error, complete with the final examples from the script printed to the terminal.

Unfortunately, the quantized model isn't saved to --quantized_model_dir 4bit_converted/neox20b-4bit.safetensor

2023-04-24 15:03:20 INFO [auto_gptq.modeling._utils] Model packed.

The model 'GPTNeoXGPTQForCausalLM' is not supported for .

Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

prompt: Instruction:
Name three characteristics commonly associated with a strong leader.
Output:
etc.

Tested 3x. The '4bit_converted' folder exists at the same level as the scripts and models.

Am I missing a command to save the model to a local folder or has it been saved to another default location?

Thanks

@PanQiWei
Collaborator

There are two things you should be aware of; maybe it's my fault for not making them clear in the example's README:

  1. There is no need to change the original command-line flag's behavior; you can just enable --fast_tokenizer on the command line when using gpt_neox type models, since they only have GPTNeoXTokenizerFast (see the example command after this list).
  2. The value for --quantized_model_dir should be a path to a local directory, not a file; you may want to check whether the quantized model was saved into a directory named 4bit_converted/neox20b-4bit.safetensor.
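
For example, the earlier command would become something like this (same paths as before, with --quantized_model_dir pointing at a directory and --fast_tokenizer simply enabled):

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/gpt-neox-20b --quantized_model_dir 4bit_converted --fast_tokenizer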

@GenTxt
Author

GenTxt commented Apr 25, 2023

Thanks for the update. Model saved in '4bit_converted' in .bin format.

The "The model 'GPTNeoXGPTQForCausalLM' is not supported for ." warning is still generated, but it's not a big deal.

How do I save as safetensors? I will run again using:

model.save_quantized(args.quantized_model_dir, use_safetensors=True)
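
Presumably loading would then need a matching flag; I'm guessing at something like the line below (the use_safetensors argument here is an assumption on my part, not confirmed from the docs):

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, device="cuda:0")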

Also, is there a simple inference script to use with the generated model above?

Cheers

@PanQiWei
Collaborator

PanQiWei commented Apr 26, 2023

The model 'GPTNeoXGPTQForCausalLM' is not supported for

This is a warning thrown by Hugging Face transformers; you can just ignore it. I will find a way to bypass it in the future.

is there a simple inference script to use with the generated model above

I will consider writing one in examples as soon as possible; for now, you can refer to this code snippet:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextGenerationPipeline

# Directories from this thread; point these at your own tokenizer and quantized model.
tokenizer_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

text = "Hello, World!"

# Load the original tokenizer and the quantized model, then generate with a text-generation pipeline.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
generated_text = pipeline(text, return_full_text=False, num_beams=1, max_new_tokens=128)[0]['generated_text']
print(generated_text)

@GenTxt
Author

GenTxt commented Apr 26, 2023

Thanks. I'll work with the above and close the issue.

Looking forward to the script.

Cheers
