nan when converting neox and opt models with AutoGPTQ-triton #8

Closed
GenTxt opened this issue Apr 22, 2023 · 7 comments


@GenTxt

GenTxt commented Apr 22, 2023

I will test the default CUDA version next, but I'm encountering nan for all conversions using 'AutoGPTQ-triton'.

Using Ubuntu 22.04, Python 3.10, transformers 4.28 (dev), 64 GB RAM, and 2x 24 GB RTX cards.

Installed successfully with all dependencies.

Am I missing a particular package version?

python basic_usage.py

pretrained_model_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

2023-04-22 13:29:42 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 1/44...
2023-04-22 13:29:47 INFO [auto_gptq.quantization.gptq] duration: 5.032277584075928
2023-04-22 13:29:47 INFO [auto_gptq.quantization.gptq] avg loss: 17.77143669128418
2023-04-22 13:29:47 INFO [auto_gptq.modeling._base] Quantizing attention.dense in layer 1/44...
2023-04-22 13:29:48 INFO [auto_gptq.quantization.gptq] duration: 1.7948594093322754
2023-04-22 13:29:48 INFO [auto_gptq.quantization.gptq] avg loss: 1.888306736946106
2023-04-22 13:29:49 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_h_to_4h in layer 1/44...
2023-04-22 13:29:50 INFO [auto_gptq.quantization.gptq] duration: 1.8883254528045654
2023-04-22 13:29:50 INFO [auto_gptq.quantization.gptq] avg loss: 28.566619873046875
2023-04-22 13:29:50 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_4h_to_h in layer 1/44...
2023-04-22 13:30:02 INFO [auto_gptq.quantization.gptq] duration: 11.343331575393677
2023-04-22 13:30:02 INFO [auto_gptq.quantization.gptq] avg loss: nan
2023-04-22 13:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 2/44
2023-04-22 13:30:02 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 2/44...
2023-04-22 13:30:04 INFO [auto_gptq.quantization.gptq] duration: 1.9044442176818848
2023-04-22 13:30:04 INFO [auto_gptq.quantization.gptq] avg loss: nan
etc. stopped

Similar results with opt-30b:

pretrained_model_dir = "models/opt-30b"
quantized_model_dir = "4bit_converted"

Loading checkpoint shards: 100%|██████████████| 267/267 [13:52<00:00, 3.12s/it]
2023-04-22 13:55:41 INFO [auto_gptq.modeling._base] Start quantizing layer 1/44
2023-04-22 13:55:46 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 1/44...
2023-04-22 13:55:51 INFO [auto_gptq.quantization.gptq] duration: 4.748894453048706
2023-04-22 13:55:51 INFO [auto_gptq.quantization.gptq] avg loss: 17.623794555664062
2023-04-22 13:55:51 INFO [auto_gptq.modeling._base] Quantizing attention.dense in layer 1/44...
2023-04-22 13:55:53 INFO [auto_gptq.quantization.gptq] duration: 1.8472576141357422
2023-04-22 13:55:53 INFO [auto_gptq.quantization.gptq] avg loss: 1.9249645471572876
2023-04-22 13:55:53 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_h_to_4h in layer 1/44...
2023-04-22 13:55:55 INFO [auto_gptq.quantization.gptq] duration: 1.9470229148864746
2023-04-22 13:55:55 INFO [auto_gptq.quantization.gptq] avg loss: 28.64271354675293
2023-04-22 13:55:55 INFO [auto_gptq.modeling._base] Quantizing mlp.dense_4h_to_h in layer 1/44...
2023-04-22 13:56:06 INFO [auto_gptq.quantization.gptq] duration: 11.630852222442627
2023-04-22 13:56:06 INFO [auto_gptq.quantization.gptq] avg loss: nan
2023-04-22 13:56:07 INFO [auto_gptq.modeling._base] Start quantizing layer 2/44
2023-04-22 13:56:07 INFO [auto_gptq.modeling._base] Quantizing attention.query_key_value in layer 2/44...
2023-04-22 13:56:09 INFO [auto_gptq.quantization.gptq] duration: 1.975161075592041
2023-04-22 13:56:09 INFO [auto_gptq.quantization.gptq] avg loss: nan
etc. stopped

@GenTxt
Author

GenTxt commented Apr 22, 2023

Confirming similar nan results to the above for the main CUDA branch, using the same models plus an additional neox-20b.

@PanQiWei
Collaborator

I saw you were using basic_usage.py, which uses one-shot quantization (a single sample) to showcase the basic APIs; it can run into 'nan' when quantizing a big model with so few samples. I would suggest trying quantize_with_alpaca.py, which uses many instruction-following samples to quantize LLMs.

Please let me know if the same problem still occurs when using quantize_with_alpaca.py
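
For reference, here is a minimal sketch of the idea (this is not the exact contents of either example script; the paths, calibration texts, and BaseQuantizeConfig settings below are illustrative):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# GPTQ estimates per-layer statistics from the calibration batch; with a single
# sample those estimates can degenerate on large models and the loss turns to nan.
calibration_texts = [
    "Instruction:\nName three characteristics commonly associated with a strong leader.\nOutput:\n",
    # ...many more instruction-following samples in the real quantize_with_alpaca.py
]
examples = [tokenizer(text) for text in calibration_texts]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)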

@GenTxt
Author

GenTxt commented Apr 24, 2023

Tested the 'quantize_with_alpaca.py' script mentioned above with the latest 0.3 version.

Needed to change the following:

parser.add_argument("--fast_tokenizer", action="store_true")

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.

changed to:

parser.add_argument("--fast_tokenizer", action="store_false")

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/gpt-neox-20b --quantized_model_dir 4bit_converted/neox20b-4bit.safetensor

After the change, quantization proceeded without error, complete with the final examples from the script printed to the terminal.

Unfortunately, the quantized model isn't saved to --quantized_model_dir 4bit_converted/neox20b-4bit.safetensor

2023-04-24 15:03:20 INFO [auto_gptq.modeling._utils] Model packed.

The model 'GPTNeoXGPTQForCausalLM' is not supported for .

Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

prompt: Instruction:
Name three characteristics commonly associated with a strong leader.
Output:
etc.

Tested 3x. The '4bit_converted' folder exists at the same level as the scripts and models.

Am I missing a command to save the model to a local folder or has it been saved to another default location?

Thanks

@PanQiWei
Collaborator

There are two things you should be aware of; maybe it's my fault for not making them clear in the example's README:

  1. There is no need to change the original command-line flag's behavior; you can just enable --fast_tokenizer on the command line when using gpt_neox type models, since they only have GPTNeoXTokenizerFast (see the example command after this list).
  2. The value for --quantized_model_dir should be a path to a local directory, not a file; you may want to check whether the quantized model was saved into a directory named 4bit_converted/neox20b-4bit.safetensor.
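
For example, the earlier command would become something like this (same paths as before, with --quantized_model_dir pointing at a directory and --fast_tokenizer simply enabled):

CUDA_VISIBLE_DEVICES="0" python quant_with_alpaca.py --pretrained_model_dir models/gpt-neox-20b --quantized_model_dir 4bit_converted --fast_tokenizer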

@GenTxt
Author

GenTxt commented Apr 25, 2023

Thanks for the update. Model saved in '4bit_converted' in .bin format.

The "The model 'GPTNeoXGPTQForCausalLM' is not supported for ." warning is still generated, but it's not a big deal.

How do I save as safetensors? I will run again using:

model.save_quantized(args.quantized_model_dir, use_safetensors=True)
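
Presumably loading would then need a matching flag; I'm guessing at something like the line below (the use_safetensors argument here is an assumption on my part, not confirmed from the docs):

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, device="cuda:0")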

Also, is there a simple inference script to use with the generated model above?

Cheers

@PanQiWei
Collaborator

PanQiWei commented Apr 26, 2023

The model 'GPTNeoXGPTQForCausalLM' is not supported for

This is a warning thrown by Hugging Face transformers; you can just ignore it. I will find a way to bypass it in the future.

is there a simple inference script to use with the generated model above

I will consider writing one in examples as soon as possible; for now, you can refer to this code snippet:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextGenerationPipeline

# Directories from this thread; point these at your own tokenizer and quantized model.
tokenizer_dir = "models/gpt-neox-20b"
quantized_model_dir = "4bit_converted"

text = "Hello, World!"

# Load the original tokenizer and the quantized model, then generate with a text-generation pipeline.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
generated_text = pipeline(text, return_full_text=False, num_beams=1, max_new_tokens=128)[0]['generated_text']
print(generated_text)

@GenTxt
Author

GenTxt commented Apr 26, 2023

Thanks. I'll work with the above and close the issue.

Looking forward to the script.

Cheers
