llama3-70b int8+kv8 convert checkpoint failed on v0.10.0 branch #1814

Open

NaNAGISaSA opened this issue Jun 20, 2024 · 2 comments

Labels: bug (Something isn't working), waiting for feedback
NaNAGISaSA commented Jun 20, 2024

System Info

  • CPU architecture: x86_64
  • GPU properties
    • GPU name: NVIDIA A100
    • GPU memory size: 40G
  • Libraries
    • TensorRT-LLM branch or tag: v0.10.0
    • Container used: yes, make -C docker release_build on v0.10.0 branch
  • NVIDIA driver version: 525.89.02
  • OS: Ubuntu 22.04

Who can help?

@Tracin @nv-guomingz

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

model_name=llama3_70b
hf_model_dir=/some-path/Meta-Llama-3-70B-Instruct
convert_model_dir=/some-path
trt_engine_dir=/some-path
dtype=bfloat16
tp_size=2 # tp_size=4 and tp_size=8 produce the same error

python3 examples/llama/convert_checkpoint.py --model_dir ${hf_model_dir} \
    --tp_size ${tp_size} \
    --workers ${tp_size} \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache \
    --dtype ${dtype} \
    --output_dir ${convert_model_dir}/${dtype}/${tp_size}-gpu/

Expected behavior

The checkpoint conversion succeeds.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
0.10.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:23<00:00, 1.27it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Traceback (most recent call last):
  File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 464, in <module>
    main()
  File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 456, in main
    convert_and_save_hf(args)
  File "/workspace/volume/wangchao2/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 360, in convert_and_save_hf
    LLaMAForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 414, in quantize
    convert.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1387, in quantize
    act_range, llama_qkv_para, llama_smoother = smooth_quant(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1160, in smooth_quant
    tokenizer = AutoTokenizer.from_pretrained(model_dir,
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 883, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
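
The failure appears to come from the tokenizer rather than the quantization itself: the traceback ends in the slow, SentencePiece-based LlamaTokenizer, but the Meta-Llama-3 checkpoints ship only a fast tokenizer (tokenizer.json) and no SentencePiece tokenizer.model, so vocab_file ends up as None and sentencepiece rejects it. A minimal standalone check of that hypothesis (the path is the placeholder from the repro above):

# Standalone check, independent of TensorRT-LLM. Assumes the root cause is
# the slow tokenizer being forced on a checkpoint that only ships tokenizer.json.
from transformers import AutoTokenizer

model_dir = "/some-path/Meta-Llama-3-70B-Instruct"  # placeholder path from the repro

# The fast tokenizer loads fine from tokenizer.json.
tok = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
print(type(tok).__name__)  # PreTrainedTokenizerFast

# Forcing the slow tokenizer reproduces the error above: with no
# tokenizer.model present, vocab_file is None and sentencepiece raises
# "TypeError: not a string".
AutoTokenizer.from_pretrained(model_dir, use_fast=False)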

Additional notes

I also tested llama3-8b (changing hf_model_dir to Meta-Llama-3-8B-Instruct), and that conversion succeeds:

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
0.10.0
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
calibrating model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [01:25<00:00, 6.00it/s]
Weights loaded. Total time: 00:00:41
Weights loaded. Total time: 00:00:36
Total time of converting checkpoints: 00:03:31
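
If the analysis above is right, a possible workaround would be to stop forcing the slow tokenizer in the calibration path. An untested sketch; it assumes the AutoTokenizer.from_pretrained call at convert.py line 1160 in the traceback passes use_fast=False:

# Hypothetical patch sketch for smooth_quant() in
# tensorrt_llm/models/llama/convert.py (line 1160 in the traceback above).
# Untested; use_fast=True is the only intended change, so keep whatever
# other keyword arguments the original call passes.
from transformers import AutoTokenizer

model_dir = "/some-path/Meta-Llama-3-70B-Instruct"  # placeholder, as in the repro
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)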

NaNAGISaSA added the bug (Something isn't working) label on Jun 20, 2024
hijkzzz self-assigned this on Jun 23, 2024
hijkzzz (Collaborator) commented Jun 23, 2024

Could you try the latest version, TensorRT-LLM 0.11+? See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
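
Per that page, the wheel can typically be installed with pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com, though the exact command may change between releases, so follow the linked instructions.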

Yoh-Z commented Jun 28, 2024

> Could you try the latest version, TensorRT-LLM 0.11+? See the tutorial: https://nvidia.github.io/TensorRT-LLM/installation/linux.html

Which commit corresponds to version 0.11.0?
