
Bug: Unexpected output length (only one token in the response) when llama-server is run with "-n -2 -c 256" #239

@jakexcosme

Description

Note: This issue was copied from ggml-org#9933

Original Author: @morgen52
Original Issue Number: ggml-org#9933
Created: 2024-10-18T06:41:56Z


What happened?

Hi there.
As described in the documentation, -n sets the number of tokens to predict (default: -1; -1 = infinity, -2 = until the context is filled), and -c sets the context size.
However, when I start a server with the following command:

./llama.cpp-b3938/build_gpu/bin/llama-server     -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf     -ngl 99 -n -2 -c 256

and send a request with the following command:

curl --request POST     --url http://localhost:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "What is the meaning of life?"}'

I get only one token of output in the response:

{
  "content": " I",
  "id_slot": 0,
  "stop": true,
  "model": "../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf",
  "tokens_predicted": 1,
  "tokens_evaluated": 7,
  "generation_settings": {
    "n_ctx": 256,
    "n_predict": -2,
    "model": "../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf",
    "seed": 4294967295,
    "seed_cur": 3394087514,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "tfs_z": 1.0,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "max_tokens": -1,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": ["top_k", "tfs_z", "typ_p", "top_p", "min_p", "xtc", "temperature"]
  },
  "prompt": "What is the meaning of life?",
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 7,
  "timings": {
    "prompt_n": 7,
    "prompt_ms": 27.275,
    "prompt_per_token_ms": 3.8964285714285714,
    "prompt_per_second": 256.64527956003667,
    "predicted_n": 1,
    "predicted_ms": 0.005,
    "predicted_per_token_ms": 0.005,
    "predicted_per_second": 200000.0
  },
  "index": 0
}
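For reference, a minimal sketch of what the documented "-n -2" behavior implies for this run, assuming the generation budget is the context size minus the evaluated prompt tokens (my reading of "until context filled", not something stated in the report):

```python
# Expected token budget under "-n -2" ("until context filled"),
# assuming budget = context size minus evaluated prompt tokens.
n_ctx = 256           # from the -c 256 server flag
tokens_evaluated = 7  # prompt tokens reported in the response above

expected_budget = n_ctx - tokens_evaluated
print(expected_budget)  # 249
```

Under that assumption roughly 249 tokens would be expected, yet the response reports "tokens_predicted": 1 with "stopped_limit": true.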

Is there something wrong with the way I'm using it? Or is this a bug?
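One diagnostic worth trying (my suggestion, not from the original report): the /completion endpoint also accepts a per-request n_predict field in the JSON body, which should override the server-side -n default. Sending -2 per request would help tell server-default handling apart from request-level handling. A minimal payload builder:

```python
import json

# Hypothetical diagnostic: put n_predict in the request body itself
# instead of relying on the server's -n flag. The /completion endpoint
# accepts a per-request n_predict that overrides the server default.
payload = {
    "prompt": "What is the meaning of life?",
    "n_predict": -2,  # documented as "until context filled"
}
print(json.dumps(payload))
```

The printed JSON can then be passed to curl via --data in the same request as above.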

Name and Version

./llama.cpp-b3938/build_gpu/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 7 (d9a33c5)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response
