Server Missing OpenAI API Support? #24

Closed
jasonacox opened this issue Dec 1, 2023 · 12 comments

@jasonacox

The server presents the UI but seems to be missing the OpenAI-compatible API endpoints?

The example test:

curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

Results in a 404 error:

HTTP/1.1 404 Not Found
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 14
Content-Type: text/plain
Keep-Alive: timeout=5, max=5
Server: llama.cpp

File Not Found
@jart
Collaborator

jart commented Dec 1, 2023

I cherry-picked OpenAI compatibility yesterday in 401dd08. It hasn't been incorporated into a release yet. I'll update this issue when the next release goes out. The llamafiles on Hugging Face will be updated too.

jart self-assigned this Dec 1, 2023
@dzlab

dzlab commented Dec 1, 2023

Will there be new server binaries, or can we use the already downloaded ones like mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile? It would be great if we didn't need to re-download the whole 4 GB file.

@jart
Collaborator

jart commented Dec 1, 2023

I've just published a llamafile 0.2 release (https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2). The downloads on Hugging Face will be updated in a couple of hours.

It would be great if we didn't need to re-download the whole 4 GB file.

You don't have to redownload. Here's what you can try:

  1. Download llamafile-server-0.2 and `chmod +x` it
  2. Download zipalign-0.2 and `chmod +x` it
  3. Run `unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and arguments files.
  4. Run `./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` to put the weights and argument file inside your latest and greatest llamafile executable.
  5. Run `./llamafile-server-0.2` and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest `mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` that I'm uploading to HuggingFace presently.

So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.
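For anyone who prefers copy and paste, here is the same procedure as a single shell snippet. It's just a consolidated sketch of the five steps above, using the 0.2 release asset URLs and assuming the old llamafile sits in the current directory:

# Fetch the 0.2 server binary and the zipalign tool, and make them executable.
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2
chmod +x llamafile-server-0.2 zipalign-0.2

# Extract the GGUF weights and the .args file from the old 0.1 llamafile.
unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile \
      mistral-7b-instruct-v0.1.Q4_K_M.gguf .args

# Repack them into the new 0.2 server binary, then run it.
./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
./llamafile-server-0.2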

@jart
Collaborator

jart commented Dec 1, 2023

OK I've uploaded all the new .llamafiles to Hugging Face, for anyone who'd rather just re-download.

Enjoy!

jart closed this as completed Dec 1, 2023
@dzlab

dzlab commented Dec 1, 2023

@jart Thanks, I followed the instructions you provided and got a v0.2 llamafile server binary. Now when I start the server (on a Mac M1) and try the curl command from llama.cpp/server/README.md, the server crashes consistently with this error:

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
[1]    34103 abort      ./llamafile-server-0.2
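As an aside, the failed assert itself prints the cosmoaddr2line invocation for symbolizing that backtrace; running it as printed (binary path and addresses copied from the message above) should map the hex addresses back to source locations:

# Copied verbatim from the assert message above; substitute your own binary path.
cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 \
  1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc \
  10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0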

@jasonacox
Author

jasonacox commented Dec 2, 2023

First of all, @jart, thank you!!! We are getting close:

curl -i http://localhost:8080/v1/models
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 132
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp

{"data":[{"created":1701489258,"id":"mistral-7b-instruct-v0.1.Q4_K_M.gguf","object":"model","owned_by":"llamacpp"}],"object":"list"

But as @dzlab mentions, there is an assertion failure during the /v1/chat/completions POST that causes the server to crash (core dump).

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed 

llamafile/llama.cpp/server/json.h

Lines 21305 to 21318 in 73ee0b1

/// @brief access specified object element
/// @sa https://json.nlohmann.me/api/basic_json/operator%5B%5D/
const_reference operator[](const typename object_t::key_type& key) const
{
    // const operator[] only works for objects
    if (JSON_HEDLEY_LIKELY(is_object()))
    {
        auto it = m_value.object->find(key);
        JSON_ASSERT(it != m_value.object->end());
        return it->second;
    }
    JSON_THROW(type_error::create(305, detail::concat("cannot use operator[] with a string argument with ", type_name()), this));
}

In other words, the const operator[] asserts when the requested key is absent, so the server appears to be looking up a JSON field that isn't present in the object it received or built.

jart reopened this Dec 2, 2023
@dave1010

dave1010 commented Dec 8, 2023

This one is working for me: https://huggingface.co/jartine/mistral-7b.llamafile/blob/main/mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile

I'm using https://github.com/simonw/llm to connect to it, so I'm not sure of the exact requests it's making.

(base) ➜  ~ llm --version
llm, version 0.1
(base) ➜  ~ cat '/Users/dave/Library/Application Support/io.datasette.llm/extra-openai-models.yaml'
- model_id: llamafile
  model_name: llamafile
  api_base: "http://localhost:8080/v1"
(base) ➜  ~ llm -m llamafile "what llm are you"
I am Mistral, a large language model trained by Mistral AI. How can I assist you today?

@dave1010

dave1010 commented Dec 8, 2023

The request reported in the issue seems to work too:

(base) ➜  ~ curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 470
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702027652,"id":"chatcmpl-PajeeqdFmAP5VNrzZztEJwKi9bF4czMj","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%

@jart
Collaborator

jart commented Dec 8, 2023

@dave1010 glad to hear it's working for you!

@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?

jart closed this as completed Dec 8, 2023
@dave1010

dave1010 commented Dec 8, 2023

For completeness, in case it helps: the curl command from llama.cpp/server/README.md works fine for me too:

(base) ➜  ~ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
    {
        "role": "system",
        "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
    },
    {
        "role": "user",
        "content": "Write a limerick about python exceptions"
    }
    ]
    }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702042664,"id":"chatcmpl-LBodkSXWGkmxLu7pH39Lv2zF8jE6cxny","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%

The server logs:

slot 0 released (155 tokens in cache)
slot 0 is processing [task id: 11]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    5827.25 ms /    77 tokens (   75.68 ms per token,    13.21 tokens per second)
print_timings:        eval time =    1809.26 ms /    43 runs   (   42.08 ms per token,    23.77 tokens per second)
print_timings:       total time =    7636.51 ms
slot 0 released (121 tokens in cache)
{"timestamp":1702042664,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":57680,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

Some debugging info in case it's helpful:

(base) ➜  ~ system_profiler SPHardwareDataType|grep -v UUID
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro18,3
      Model Number: Z15J000PGB/A
      Chip: Apple M1 Pro
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 16 GB
      System Firmware Version: 10151.1.1
      OS Loader Version: 10151.1.1
      Serial Number (system): PL2C3FY765
      Provisioning UDID: 00006000-000861892206801E
      Activation Lock Status: Enabled

@jasonacox
Author

@dave1010 Thank you! This helped me narrow down the issue. I am able to run this model with all of the API curl examples without any problem on my Mac (M2). The assertion error only shows up on my Linux Ubuntu 22.04 box (using CPU only and with an RTX 3090 GPU).

@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?

Will do! I'll open it up focused on Linux.

@mofosyne
Collaborator

mofosyne commented May 13, 2024

I've just published a llamafile 0.2 release (https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2). The downloads on Hugging Face will be updated in a couple of hours.

It would be great if we didn't need to re-download the whole 4 GB file.

You don't have to redownload. Here's what you can try:

1. Download [llamafile-server-0.2](https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2) and chmod +x it

2. Download [zipalign-0.2](https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2) and chmod +x it

3. Run `unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and arguments files.

4. Run `./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` to put the weights and argument file inside your latest and greatest llamafile executable.

5. Run `./llamafile-server-0.2` and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest `mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` that I'm uploading to HuggingFace presently.

So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.

#412 is now merged, which gives you the option of using llamafile-upgrade-engine to upgrade the engine in a more convenient manner once you have installed llamafile on your system.

This is done simply by running `llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` in the folder containing the llamafile.

Usage Example / Expected Console Output
$ llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile 
== Engine Version Check ==
Engine version from mistral-7b-instruct-v0.1-Q4_K_M-server: llamafile v0.4.1
Engine version from /usr/local/bin/llamafile: llamafile v0.8.4
== Repackaging / Upgrading ==
extracting...
Archive:  mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
  inflating: /tmp/tmp.FtvmAfSWty/.symtab.amd64  
  inflating: /tmp/tmp.FtvmAfSWty/.symtab.arm64  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/compcap.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/llamafile.h  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-alloc.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend-impl.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-impl.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.m  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.metal  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-quants.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/completion.js  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.html  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.js  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/json-schema-to-grammar.mjs  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Anchorage  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Beijing  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Berlin  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Boulder  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Chicago  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GMT  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GST  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Honolulu  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Israel  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Japan  
 extracting: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/London  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Melbourne  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/New_York  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/UTC  
 extracting: /tmp/tmp.FtvmAfSWty/.cosmo  
 extracting: /tmp/tmp.FtvmAfSWty/.args  
 extracting: /tmp/tmp.FtvmAfSWty/mistral-7b-instruct-v0.1.Q4_K_M.gguf  
 extracting: /tmp/tmp.FtvmAfSWty/ggml-cuda.dll  
repackaging...
== Completed ==
Original File: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
Upgraded File: mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
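As a quick sanity check before swapping files, you can confirm which engine the repackaged file now embeds. This is a minimal sketch, assuming the upgraded file accepts the --version flag like stock llamafile binaries do:

# Minimal sanity check (assumes --version is supported, as on stock llamafile binaries).
chmod +x mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
./mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile --version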
