
Bug: Unexpected output length (only one token in the response) when llama-server is run with "-n -2 -c 256" #239

@jakexcosme

Description

Note: This issue was copied from ggml-org#9933

Original Author: @morgen52
Original Issue Number: ggml-org#9933
Created: 2024-10-18T06:41:56Z


What happened?

Hi there.
As described in the documentation, -n sets the number of tokens to predict (default: -1; -1 = infinity, -2 = until the context is filled), and -c sets the context size.
However, when I start a server with the following command:

./llama.cpp-b3938/build_gpu/bin/llama-server     -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf     -ngl 99 -n -2 -c 256

and send a request with the following command:

curl --request POST     --url http://localhost:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "What is the meaning of life?"}'

I get only one token of output in the response:

{
  "content": " I",
  "id_slot": 0,
  "stop": true,
  "model": "../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf",
  "tokens_predicted": 1,
  "tokens_evaluated": 7,
  "generation_settings": {
    "n_ctx": 256,
    "n_predict": -2,
    "model": "../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf",
    "seed": 4294967295,
    "seed_cur": 3394087514,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "tfs_z": 1.0,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "max_tokens": -1,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": ["top_k", "tfs_z", "typ_p", "top_p", "min_p", "xtc", "temperature"]
  },
  "prompt": "What is the meaning of life?",
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 7,
  "timings": {
    "prompt_n": 7,
    "prompt_ms": 27.275,
    "prompt_per_token_ms": 3.8964285714285714,
    "prompt_per_second": 256.64527956003667,
    "predicted_n": 1,
    "predicted_ms": 0.005,
    "predicted_per_token_ms": 0.005,
    "predicted_per_second": 200000.0
  },
  "index": 0
}
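For reference, a minimal sketch of what the documented "-n -2" behavior implies for this run, assuming the generation budget is the context size minus the evaluated prompt tokens (my reading of "until context filled", not something stated in the report):

```python
# Expected token budget under "-n -2" ("until context filled"),
# assuming budget = context size minus evaluated prompt tokens.
n_ctx = 256           # from the -c 256 server flag
tokens_evaluated = 7  # prompt tokens reported in the response above

expected_budget = n_ctx - tokens_evaluated
print(expected_budget)  # 249
```

Under that assumption roughly 249 tokens would be expected, yet the response reports "tokens_predicted": 1 with "stopped_limit": true.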

Is there something wrong with the way I'm using it? Or is this a bug?
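One diagnostic worth trying (my suggestion, not from the original report): the /completion endpoint also accepts a per-request n_predict field in the JSON body, which should override the server-side -n default. Sending -2 per request would help tell server-default handling apart from request-level handling. A minimal payload builder:

```python
import json

# Hypothetical diagnostic: put n_predict in the request body itself
# instead of relying on the server's -n flag. The /completion endpoint
# accepts a per-request n_predict that overrides the server default.
payload = {
    "prompt": "What is the meaning of life?",
    "n_predict": -2,  # documented as "until context filled"
}
print(json.dumps(payload))
```

The printed JSON can then be passed to curl via --data in the same request as above.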

Name and Version

./llama.cpp-b3938/build_gpu/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 7 (d9a33c5)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response
