Add OpenAI API compatibility to server
This is a cherry-pick of ggerganov/llama.cpp@af19d35
ggerganov authored and jart committed Nov 30, 2023
1 parent ed87fdb commit 401dd08
Showing 2 changed files with 407 additions and 8 deletions.
51 changes: 51 additions & 0 deletions llama.cpp/server/README.md
@@ -122,6 +122,8 @@ node index.js

`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

`min_p`: The minimum probability for a token to be considered, relative to the probability of the most likely token (default: 0.05). A worked sketch of this rule follows the options below.

`n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).

`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded.
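
To make the `min_p` rule concrete, here is a tiny Python sketch of the filtering it describes. The token probabilities are invented for illustration; this is not the server's sampler code:

```python
# Illustrative sketch of min_p filtering; the probabilities are made up.
probs = {"the": 0.50, "a": 0.20, "cat": 0.04, "zzz": 0.01}

min_p = 0.05  # the default above
threshold = min_p * max(probs.values())  # 0.05 * 0.50 = 0.025

# Keep only tokens at least min_p times as likely as the most likely token.
kept = {tok: p for tok, p in probs.items() if p >= threshold}
print(kept)  # {'the': 0.5, 'a': 0.2, 'cat': 0.04}
```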
@@ -232,6 +234,55 @@ node index.js

- **GET** `/props`: Return the required assistant name and anti-prompt to generate the prompt in case you have specified a system prompt for all slots.

- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claim of compatibility with the OpenAI API spec is made, in our experience it suffices to support many apps. Only ChatML-tuned models, such as Dolphin, OpenOrca, OpenHermes, and OpenChat-3.5, can be used with this endpoint. Compared to `api_like_OAI.py`, this API implementation does not require a wrapper to be served.

*Options:*

See the [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported, as the sketch below illustrates.
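
For example, `/completion`-specific fields can be merged into a Chat Completions request through the Python `openai` client's `extra_body` argument. A minimal sketch, assuming the server setup from the examples below; the `mirostat` and `mirostat_tau` values are only illustrative:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# extra_body merges additional fields into the request JSON, letting the
# standard OpenAI parameters travel alongside llama.cpp /completion options.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a limerick about python exceptions"}],
    extra_body={"mirostat": 2, "mirostat_tau": 5.0},  # illustrative values
)
print(completion.choices[0].message)
```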

*Examples:*

You can use either the Python `openai` library with appropriate checkpoints:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"  # placeholder; the server does not require a real key
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)
```
... or raw HTTP requests:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about python exceptions"
      }
    ]
  }'
```
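
Streaming mode can be exercised from the same client. Below is a minimal sketch, reusing the server address and placeholder key from the examples above; it prints tokens as they arrive:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# stream=True makes the client yield chunks as the server generates tokens.
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a limerick about python exceptions"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content may be None on the final chunk.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```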

## More examples

### Change system prompt on runtime
