Server Missing OpenAI API Support? #24

Closed
jasonacox opened this issue Dec 1, 2023 · 12 comments

@jasonacox

The server presents the UI but seems to be missing the OpenAI-compatible API endpoints?

The example test:

curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

Results in a 404 error:

HTTP/1.1 404 Not Found
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 14
Content-Type: text/plain
Keep-Alive: timeout=5, max=5
Server: llama.cpp

File Not Found
@jart
Collaborator

jart commented Dec 1, 2023

I cherry-picked OpenAI compatibility yesterday in 401dd08. It hasn't been incorporated into a release yet. I'll update this issue when the next release goes out. The llamafiles on Hugging Face will be updated too.

jart self-assigned this Dec 1, 2023
@dzlab

dzlab commented Dec 1, 2023

Will there be new server binaries, or can we use the already downloaded ones like mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile? It would be great if we didn't need to re-download the whole 4 GB file.

@jart
Collaborator

jart commented Dec 1, 2023

I've just published a llamafile 0.2 release (https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2). The downloads on Hugging Face will be updated in a couple of hours.

It would be great if we didn't need to re-download the whole 4 GB file.

You don't have to redownload. Here's what you can try:

  1. Download llamafile-server-0.2 and `chmod +x` it
  2. Download zipalign-0.2 and `chmod +x` it
  3. Run `unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and arguments files.
  4. Run `./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` to put the weights and argument file inside your latest and greatest llamafile executable.
  5. Run `./llamafile-server-0.2` and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest `mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` that I'm uploading to HuggingFace presently.

So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.
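For anyone who prefers copy and paste, here is the same procedure as a single shell snippet. It's just a consolidated sketch of the five steps above, using the 0.2 release asset URLs and assuming the old llamafile sits in the current directory:

# Fetch the 0.2 server binary and the zipalign tool, and make them executable.
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2
chmod +x llamafile-server-0.2 zipalign-0.2

# Extract the GGUF weights and the .args file from the old 0.1 llamafile.
unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile \
      mistral-7b-instruct-v0.1.Q4_K_M.gguf .args

# Repack them into the new 0.2 server binary, then run it.
./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
./llamafile-server-0.2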

@jart
Collaborator

jart commented Dec 1, 2023

OK I've uploaded all the new .llamafiles to Hugging Face, for anyone who'd rather just re-download.

Enjoy!

jart closed this as completed Dec 1, 2023
@dzlab

dzlab commented Dec 1, 2023

@jart Thanks, I followed the instructions you provided and got a v0.2 llamafile server binary. Now when I start the server (on a Mac M1) and try the curl command from llama.cpp/server/README.md, the server crashes consistently with this error:

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
[1]    34103 abort      ./llamafile-server-0.2
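As an aside, the failed assert itself prints the cosmoaddr2line invocation for symbolizing that backtrace; running it as printed (binary path and addresses copied from the message above) should map the hex addresses back to source locations:

# Copied verbatim from the assert message above; substitute your own binary path.
cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 \
  1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc \
  10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0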

@jasonacox
Author

jasonacox commented Dec 2, 2023

First of all, @jart, thank you!!! We are getting close:

curl -i http://localhost:8080/v1/models
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 132
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp

{"data":[{"created":1701489258,"id":"mistral-7b-instruct-v0.1.Q4_K_M.gguf","object":"model","owned_by":"llamacpp"}],"object":"list"

But as @dzlab mentions, there is an assertion failure during the /v1/chat/completions POST that causes the server to crash (core dump).

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed 

llamafile/llama.cpp/server/json.h

Lines 21305 to 21318 in 73ee0b1

/// @brief access specified object element
/// @sa https://json.nlohmann.me/api/basic_json/operator%5B%5D/
const_reference operator[](const typename object_t::key_type& key) const
{
    // const operator[] only works for objects
    if (JSON_HEDLEY_LIKELY(is_object()))
    {
        auto it = m_value.object->find(key);
        JSON_ASSERT(it != m_value.object->end());
        return it->second;
    }
    JSON_THROW(type_error::create(305, detail::concat("cannot use operator[] with a string argument with ", type_name()), this));
}

In other words, the const operator[] asserts when the requested key is absent, so the server appears to be looking up a JSON field that isn't present in the object it received or built.

jart reopened this Dec 2, 2023
@dave1010

dave1010 commented Dec 8, 2023

This one is working for me: https://huggingface.co/jartine/mistral-7b.llamafile/blob/main/mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile

I'm using https://github.com/simonw/llm to connect to it, so I'm not sure of the exact requests it's making.

(base) ➜  ~ llm --version
llm, version 0.1
(base) ➜  ~ cat '/Users/dave/Library/Application Support/io.datasette.llm/extra-openai-models.yaml'
- model_id: llamafile
  model_name: llamafile
  api_base: "http://localhost:8080/v1"
(base) ➜  ~ llm -m llamafile "what llm are you"
I am Mistral, a large language model trained by Mistral AI. How can I assist you today?

@dave1010

dave1010 commented Dec 8, 2023

The request reported in the issue seems to work too:

(base) ➜  ~ curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 470
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702027652,"id":"chatcmpl-PajeeqdFmAP5VNrzZztEJwKi9bF4czMj","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%

@jart
Collaborator

jart commented Dec 8, 2023

@dave1010 glad to hear it's working for you!

@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?

jart closed this as completed Dec 8, 2023
@dave1010

dave1010 commented Dec 8, 2023

For completeness, in case it helps: the curl command from llama.cpp/server/README.md works fine for me too:

(base) ➜  ~ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
    {
        "role": "system",
        "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
    },
    {
        "role": "user",
        "content": "Write a limerick about python exceptions"
    }
    ]
    }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702042664,"id":"chatcmpl-LBodkSXWGkmxLu7pH39Lv2zF8jE6cxny","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%

The server logs:

slot 0 released (155 tokens in cache)
slot 0 is processing [task id: 11]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    5827.25 ms /    77 tokens (   75.68 ms per token,    13.21 tokens per second)
print_timings:        eval time =    1809.26 ms /    43 runs   (   42.08 ms per token,    23.77 tokens per second)
print_timings:       total time =    7636.51 ms
slot 0 released (121 tokens in cache)
{"timestamp":1702042664,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":57680,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

Some debugging info in case it's helpful:

(base) ➜  ~ system_profiler SPHardwareDataType|grep -v UUID
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro18,3
      Model Number: Z15J000PGB/A
      Chip: Apple M1 Pro
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 16 GB
      System Firmware Version: 10151.1.1
      OS Loader Version: 10151.1.1
      Serial Number (system): PL2C3FY765
      Provisioning UDID: 00006000-000861892206801E
      Activation Lock Status: Enabled

@jasonacox
Author

@dave1010 Thank you! This helped me narrow down the issue. I am able to run this model with all of the API curl examples without any problem on my Mac (M2). The assertion error only shows up on my Linux Ubuntu 22.04 box (using CPU only and with an RTX 3090 GPU).

@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?

Will do! I'll open it up focused on Linux.

@mofosyne
Collaborator

mofosyne commented May 13, 2024

I've just published a llamafile 0.2 release (https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2). The downloads on Hugging Face will be updated in a couple of hours.

It would be great if we didn't need to re-download the whole 4 GB file.

You don't have to redownload. Here's what you can try:

1. Download [llamafile-server-0.2](https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2) and chmod +x it

2. Download [zipalign-0.2](https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2) and chmod +x it

3. Run `unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and arguments files.

4. Run `./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` to put the weights and argument file inside your latest and greatest llamafile executable.

5. Run `./llamafile-server-0.2` and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest `mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` that I'm uploading to HuggingFace presently.

So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.

#412 is now merged, which gives you the option of using llamafile-upgrade-engine to upgrade the engine in a more convenient manner once you have installed llamafile on your system.

This is done simply by running `llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` in the folder containing the llamafile.

Usage Example / Expected Console Output
$ llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile 
== Engine Version Check ==
Engine version from mistral-7b-instruct-v0.1-Q4_K_M-server: llamafile v0.4.1
Engine version from /usr/local/bin/llamafile: llamafile v0.8.4
== Repackaging / Upgrading ==
extracting...
Archive:  mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
  inflating: /tmp/tmp.FtvmAfSWty/.symtab.amd64  
  inflating: /tmp/tmp.FtvmAfSWty/.symtab.arm64  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/compcap.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/llamafile.h  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-alloc.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend-impl.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-impl.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.m  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.metal  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-quants.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/completion.js  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.html  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.js  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/json-schema-to-grammar.mjs  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Anchorage  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Beijing  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Berlin  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Boulder  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Chicago  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GMT  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GST  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Honolulu  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Israel  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Japan  
 extracting: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/London  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Melbourne  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/New_York  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/UTC  
 extracting: /tmp/tmp.FtvmAfSWty/.cosmo  
 extracting: /tmp/tmp.FtvmAfSWty/.args  
 extracting: /tmp/tmp.FtvmAfSWty/mistral-7b-instruct-v0.1.Q4_K_M.gguf  
 extracting: /tmp/tmp.FtvmAfSWty/ggml-cuda.dll  
repackaging...
== Completed ==
Original File: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
Upgraded File: mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
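As a quick sanity check before swapping files, you can confirm which engine the repackaged file now embeds. This is a minimal sketch, assuming the upgraded file accepts the --version flag like stock llamafile binaries do:

# Minimal sanity check (assumes --version is supported, as on stock llamafile binaries).
chmod +x mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
./mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile --version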
