System Info
make -C docker release_build (on the v0.10.0 branch)

Who can help?
@Tracin @nv-guomingz
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
model_name=llama3_70b
hf_model_dir=/some-path/Meta-Llama-3-70B-Instruct
convert_model_dir=/some-path
trt_engine_dir=/some-path
tp_size=2 # tp_size=4 and tp_size=8 produce the same error
python3 examples/llama/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache \
    --dtype bfloat16 \
    --output_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/
Expected behavior
Conversion succeeds.
actual behavior
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
0.10.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:23<00:00, 1.27it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Traceback (most recent call last):
File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 464, in
main()
File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 456, in main
convert_and_save_hf(args)
File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 360, in convert_and_save_hf
LLaMAForCausalLM.quantize(args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 414, in quantize
convert.quantize(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1387, in quantize
act_range, llama_qkv_para, llama_smoother = smooth_quant(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1160, in smooth_quant
tokenizer = AutoTokenizer.from_pretrained(model_dir,
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 883, in from_pretrained
return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
return cls._from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 169, in init
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
tokenizer.Load(self.vocab_file)
File "/usr/local/lib/python3.10/dist-packages/sentencepiece/init.py", line 961, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.10/dist-packages/sentencepiece/init.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
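
For context on the TypeError: the traceback goes through the slow (SentencePiece-backed) LlamaTokenizer, which requires a tokenizer.model file, while the Meta-Llama-3 checkpoints (as far as I can tell) ship only a BPE tokenizer.json. A minimal sketch that should reproduce the failure outside TensorRT-LLM; the use_fast values are my assumption about what convert.py effectively requests, not verified:

from transformers import AutoTokenizer

hf_model_dir = "/some-path/Meta-Llama-3-70B-Instruct"

# Forcing the slow tokenizer: vocab_file resolves to None because no
# tokenizer.model exists on disk, and sentencepiece's LoadFromFile(None)
# raises "TypeError: not a string", matching the error above.
try:
    AutoTokenizer.from_pretrained(hf_model_dir, use_fast=False)
except TypeError as e:
    print("slow tokenizer failed:", e)

# The fast tokenizer reads tokenizer.json directly and loads fine.
tokenizer = AutoTokenizer.from_pretrained(hf_model_dir, use_fast=True)
print(tokenizer("hello world").input_ids)
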
additional notes
I also tested llama3-8b (changing hf_model_dir to Meta-Llama-3-8B-Instruct), and the conversion succeeds:
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
0.10.0
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
calibrating model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [01:25<00:00, 6.00it/s]
Weights loaded. Total time: 00:00:41
Weights loaded. Total time: 00:00:36
Total time of converting checkpoints: 00:03:31
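
Since the 8B conversion succeeded with identical flags, the two local directories may simply differ in which tokenizer files they contain. A quick diagnostic to compare them (paths assumed from the variables above):

import os

# Check which tokenizer artifacts each local checkout actually contains.
for d in ("/some-path/Meta-Llama-3-8B-Instruct",
          "/some-path/Meta-Llama-3-70B-Instruct"):
    found = [f for f in ("tokenizer.model", "tokenizer.json", "tokenizer_config.json")
             if os.path.exists(os.path.join(d, f))]
    print(d, "->", found)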