
error loading model vocabulary: unknown pre-tokenizer type: 'qwen2' #4404

Open · HouseYeung opened this issue May 13, 2024 · 5 comments
Labels: bug (Something isn't working)

@HouseYeung

What is the issue?

llama runner process has terminated: signal: abort trap error:error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'

I was running qwen1.5-8B-chat.

The old version of Ollama could run this model properly.

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.1.37

@HouseYeung added the bug (Something isn't working) label on May 13, 2024
@satindergrewal

Yes, I am also getting the same error on macOS, on a MacBook Pro M3 Max with 128GB.

(base) ➜  ollama run Dolphin-2.9.1-Qwen-110b
Error: llama runner process has terminated: signal: abort trap error:error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
(base) ➜  ollama --version
ollama version is 0.1.37
(base) ➜ 

@Anorid

Anorid commented May 20, 2024

I'm also hitting this error, running the qwen1.5-14b model.

@Anorid

Anorid commented May 20, 2024

This is my error log:

root@autodl-container-c438119a3c-80821c25:~/autodl-tmp# ollama serve
2024/05/20 11:28:20 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-20T11:28:20.921+08:00 level=INFO source=images.go:704 msg="total blobs: 0"
time=2024-05-20T11:28:20.921+08:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T11:28:20.922+08:00 level=INFO source=routes.go:1054 msg="Listening on [::]:6006 (version 0.1.38)"
time=2024-05-20T11:28:20.922+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2749468660/runners
time=2024-05-20T11:28:24.936+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T11:28:25.117+08:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-0f3aa8d5-c5ed-3fa3-1cb4-4aef2d3d8317 library=cuda compute=8.6 driver=12.2 name="NVIDIA A40" total="47.5 GiB" available="47.3 GiB"
[GIN] 2024/05/20 - 11:32:05 | 200 | 86.076µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/20 - 11:32:29 | 201 | 12.804363258s | 127.0.0.1 | POST "/api/blobs/sha256:1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245"
[GIN] 2024/05/20 - 11:32:43 | 200 | 14.155378431s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/05/20 - 11:33:04 | 200 | 35.782µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/20 - 11:33:04 | 200 | 1.190285ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/20 - 11:33:04 | 200 | 737.579µs | 127.0.0.1 | POST "/api/show"
time=2024-05-20T11:33:06.243+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2024-05-20T11:33:06.244+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2024-05-20T11:33:06.244+08:00 level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama2749468660/runners/cuda_v11/ollama_llama_server --model /root/autodl-tmp/model/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 39195"
time=2024-05-20T11:33:06.245+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T11:33:06.245+08:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T11:33:06.245+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="140637096448000" timestamp=1716175986
INFO [main] system info | n_threads=64 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140637096448000" timestamp=1716175986 total_threads=128
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="127" port="39195" tid="140637096448000" timestamp=1716175986
llama_model_loader: loaded meta data with 21 key-value pairs and 483 tensors from /root/autodl-tmp/model/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = merge5-1
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-05-20T11:33:06.497+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T11:33:06.872+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
time=2024-05-20T11:33:07.122+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/05/20 - 11:33:07 | 500 | 2.21829574s | 127.0.0.1 | POST "/api/chat"
time=2024-05-20T11:33:12.234+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.112074522
time=2024-05-20T11:33:12.485+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.362608222
time=2024-05-20T11:33:12.734+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.612062447
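For context on the abort in the log above: the `terminate called after throwing an instance of 'std::runtime_error'` comes from llama.cpp's vocabulary loader, which reads the GGUF key `tokenizer.ggml.pre` (here `qwen2`) and matches it against the pre-tokenizer names that particular build knows about; an unrecognized name makes the loader throw, which kills the runner. A minimal Python sketch of that dispatch (illustrative only — the set contents and function name are assumptions, not the actual llama.cpp source):

```python
# Sketch of the pre-tokenizer lookup that produces the error in the log above.
# The name lists below are hypothetical; the real list lives inside the
# llama.cpp build bundled with each Ollama release.

OLD_BUILD = {"default", "llama3", "falcon", "mpt"}   # assumed: build without qwen2
NEW_BUILD = OLD_BUILD | {"qwen2"}                    # assumed: later build adds qwen2

def load_vocabulary(pre_type: str, known: set) -> str:
    """Return the pre-tokenizer name, or fail the way the log shows."""
    if pre_type not in known:
        # Mirrors the logged message before the process aborts.
        raise RuntimeError(
            f"error loading model vocabulary: unknown pre-tokenizer type: '{pre_type}'"
        )
    return pre_type
```

In other words, the model file itself is fine; the same GGUF loads once the bundled llama.cpp recognizes the `qwen2` pre-tokenizer name.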

@GitTurboy

I got the same error on Windows:
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from D:\lamaModels\blobs\sha256-6b22d907af67d494c1194b1bd688423945b4d3009bded2e5ecbc88d426b0c5a3 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1___5-1___8B-Chat
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 5504
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type f16: 170 tensors
time=2024-05-20T16:44:58.427+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
time=2024-05-20T16:44:58.698+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "

@songofhawk

I got the same error: Error: llama runner process has terminated: signal: abort trap error:error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'

5 participants