[Bug]: Strange Behavior in HuggingChat (Chat-UI) #1222

Open
gururise opened this issue Dec 22, 2023 · 34 comments
Labels
bug Something isn't working

Comments

@gururise

gururise commented Dec 22, 2023

What happened?

When using LiteLLM as a proxy for Together.ai with Mistral-7B-Instruct-v0.2, two strange issues occur during inference with Hugging Face's Chat-UI frontend:

  1. The text is displayed in large chunks rather than streamed to the UI word by word.
  2. The formatting is not respected; newlines are ignored:
    [screenshot: Chat-UI output with newlines ignored]

As seen below, when using gpt-3.5-turbo, the formatting is fine and word-by-word streaming works:
[screenshot: correctly formatted, word-by-word streamed gpt-3.5-turbo output]

Here is the MODELS entry from .env.local for the LiteLLM proxy:

MODELS=`[
    {
      "name": "mistral-7b",
      "displayName": "mistralai/Mistral-7B-Instruct-v0.2",
      "description": "Mistral 7B is a new Apache 2.0 model, released by Mistral AI that outperforms Llama2 13B in benchmarks.",
      "websiteUrl": "https://mistral.ai/news/announcing-mistral-7b/",
      "preprompt": "",
      "chatPromptTemplate" : "<s>{{#each messages}}{{#ifUser}}[INST] {{#if @first}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}}{{content}} [/INST]{{/ifUser}}{{#ifAssistant}}{{content}}</s>{{/ifAssistant}}{{/each}}",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "truncate": 3072,
        "max_new_tokens": 1024,
        "stop": ["</s>"]
      },
      "endpoints": [{
        "type" : "openai",
        "baseURL": "http://localhost:8000/v1"
      }],
      "promptExamples": [
        {
          "title": "Write an email from bullet list",
          "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
        }, {
          "title": "Code a snake game",
          "prompt": "Code a basic snake game in python, give explanations for each step."
        }, {
          "title": "Assist in a task",
          "prompt": "How do I make a delicious lemon cheesecake?"
        }
      ]
    }
]`

To set up the gpt-3.5-turbo model:

MODELS=`[
    {
      "name": "gpt-3.5-turbo",
      "displayName": "GPT 3.5 Turbo",
      "endpoints" : [{
        "type": "openai"
      }],
      "promptExamples": [
        {
          "title": "Write an email from bullet list",
          "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
        }, {
          "title": "Code a snake game",
          "prompt": "Code a basic snake game in python, give explanations for each step."
        }, {
          "title": "Assist in a task",
          "prompt": "How do I make a delicious lemon cheesecake?"
        }
      ]
    }
]`

Relevant log output

No response


@gururise gururise added the bug Something isn't working label Dec 22, 2023
@krrishdholakia
Contributor

Hey @gururise do we know if the large chunk yielding is happening on together ai's side?

Re: newline, what's a fix for this? I believe this is part of the string being returned by togetherai

@gururise
Author

gururise commented Dec 22, 2023

EDIT: I think I've confirmed there is something wrong/different with the together_ai implementation. If I use openai as the LLM provider with the LiteLLM proxy, the application works as expected, but if I switch to together_ai as the provider, it does not.

Hey @gururise do we know if the large chunk yielding is happening on together ai's side?

When I run litellm in debug mode, I can see the tokens being streamed individually.

Re: newline, what's a fix for this? I believe this is part of the string being returned by togetherai

Looking at the debug log when using together_ai, the newlines are escaped. Any ideas why? Is this something LiteLLM is doing?

Here is a snippet of the debug log when I am using together_ai (notice the newline towards the end is escaped):

_reason=None, index=0, delta=Delta(content=' Number', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
success callbacks: [<litellm.proxy.hooks.parallel_request_limiter.MaxParallelRequestsHandler object at 0x7f552a323fd0>]
returned chunk: ModelResponse(id='chatcmpl-dcffe220-a020-4eab-80df-df0214839ccb', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' Number', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
value of chunk: b'' 
value of chunk: b'data: {"choices":[{"text":"]"}],"id":"8399e7011ff32f77-LAX","token":{"id":28793,"text":"]","logprob":-0.0066871643,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}' 
PROCESSED CHUNK PRE CHUNK CREATOR: b'data: {"choices":[{"text":"]"}],"id":"8399e7011ff32f77-LAX","token":{"id":28793,"text":"]","logprob":-0.0066871643,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}'
model_response: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage()); completion_obj: {'content': ']'}
model_response finish reason 3: None
hold - False, model_response_str - ]
model_response: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
PROCESSED CHUNK POST CHUNK CREATOR: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
Logging Details LiteLLM-Success Call
success callbacks: [<litellm.proxy.hooks.parallel_request_limiter.MaxParallelRequestsHandler object at 0x7f552a323fd0>]
line in async streaming: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
returned chunk: ModelResponse(id='chatcmpl-b4739587-4105-4f86-9e8b-dabafae4645b', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=']', role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
value of chunk: b'' 
value of chunk: b'data: {"choices":[{"text":"\\n"}],"id":"8399e7011ff32f77-LAX","token":{"id":13,"text":"\\n","logprob":-0.000019788742,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}' 
PROCESSED CHUNK PRE CHUNK CREATOR: b'data: {"choices":[{"text":"\\n"}],"id":"8399e7011ff32f77-LAX","token":{"id":13,"text":"\\n","logprob":-0.000019788742,"special":false},"generated_text":null,"details":null,"stats":null,"usage":null}'
model_response: ModelResponse(id='chatcmpl-5745b053-586c-4bb2-be7c-9de42c721c31', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None))], created=1703264237, model='mistralai/Mixtral-8x7B-Instruct-v0.1', object='chat.completion.chunk', system_fingerprint=None, usage=Usage()); completion_obj: {'content': '\\n'}
model_response finish reason 3: None
hold - False, model_response_str - \n
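
To make the escaping issue concrete, here is a minimal Python sketch (my own illustration, not LiteLLM's code) of the difference between the literal backslash-n seen in the log above and a real newline, plus a stopgap unescape a client could apply; the helper name is hypothetical:

# Not LiteLLM's fix - just illustrates why rendering breaks: the Together AI path
# appears to emit the two-character sequence backslash + "n" instead of a newline.
import codecs

raw_delta = "\\n"    # what the debug log above shows coming back (2 chars)
real_delta = "\n"    # what an OpenAI-style stream should carry (1 char)
print(len(raw_delta), len(real_delta))  # -> 2 1

def unescape_delta(text: str) -> str:
    # Hypothetical stopgap: convert literal escape sequences back into control chars.
    return codecs.decode(text, "unicode_escape")

assert unescape_delta(raw_delta) == real_delta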

Alright, I think I have confirmed it is something to do with the way LiteLLM is handling Together AI. When I continue to use the LiteLLM proxy but switch the provider to openai (gpt-3.5-turbo), everything works exactly as expected. The streaming occurs token by token and the output is parsed correctly:

Testing the LiteLLM proxy using OpenAI (gpt-3.5-turbo):
[screenshot: correct streaming and formatting with the OpenAI provider]

Snippet of the debug log with openai as the provider:

completion obj content: Restaurant
model_response: ModelResponse(id='chatcmpl-a0cb7cb8-2691-4760-9ccb-3f7f438e2cfe', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage()); completion_obj: {'content': 'Restaurant'}
model_response finish reason 3: None
hold - False, model_response_str - Restaurant
model_response: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
PROCESSED ASYNC CHUNK POST CHUNK CREATOR: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
Logging Details LiteLLM-Success Call
line in async streaming: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
success callbacks: [<litellm.proxy.hooks.parallel_request_limiter.MaxParallelRequestsHandler object at 0x7f40ab127f10>]
returned chunk: ModelResponse(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(tool_calls=None, function_call=None, content='Restaurant', role=None))], created=1703264355, model='gpt-3.5-turbo', object='chat.completion.chunk', system_fingerprint=None, usage=Usage())
INSIDE ASYNC STREAMING!!!
value of async completion stream: <openai.AsyncStream object at 0x7f40a7d0feb0>
value of async chunk: ChatCompletionChunk(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[Choice(delta=ChoiceDelta(content=' Name', function_call=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1703264348, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)
PROCESSED ASYNC CHUNK PRE CHUNK CREATOR: ChatCompletionChunk(id='chatcmpl-8Yd9YizyHpZHuTezfVJLO7JPPDkVa', choices=[Choice(delta=ChoiceDelta(content=' Name', function_call=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1703264348, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)

@krrishdholakia
Contributor

Acknowledging this - will work on it today. Thank you for the debugging so far, @gururise.

@krrishdholakia
Contributor

Looking at the raw Together AI call, it doesn't look like they're streaming in large chunks.
Screenshot 2023-12-25 at 6 45 55 AM

@krrishdholakia
Contributor

Running with together_ai/mistralai/Mistral-7B-Instruct-v0.2, I am unable to repro with a trivial example.
Screenshot 2023-12-25 at 7 25 30 AM

Screenshot 2023-12-25 at 7 25 18 AM

@krrishdholakia
Contributor

Testing with this curl request:

curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "tgai-mistral",
  "messages": [
        {
          "role": "user",
          "content": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
        }
      ],
  "stream": true,
  "temperature": 0.1,
  "top_p": 0.95,
  "repetition_penalty": 1.2,
  "top_k": 50,
  "truncate": 3072,
  "max_new_tokens": 1024,
  "stop": ["</s>"]
}'

I'm unable to repro the large-chunk problem (see "content" for each line).
Screenshot 2023-12-25 at 7 29 46 AM

cc @gururise: do you know what the exact call being received by LiteLLM is?
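
If it helps capture that, here is a rough repro sketch (the base URL, API key, and the "mistral-7b" alias are assumptions taken from the configs earlier in this thread) that prints every streamed delta with repr(), so chunk sizes and any escaped newlines are easy to spot:

# Point the OpenAI client at the LiteLLM proxy and dump each streamed delta verbatim.
# base_url, api_key, and model alias below are assumptions from this thread's configs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-1234")

stream = client.chat.completions.create(
    model="mistral-7b",  # alias configured in the proxy / MODELS above
    messages=[{"role": "user", "content": "Write a 3-item bullet list about wine."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(repr(delta))  # repr() makes an escaped '\\n' stand out from a real '\n'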

@krrishdholakia
Contributor

@gururise bump on this.

@nigh8w0lf

nigh8w0lf commented Jan 15, 2024

Seeing the same formatting issue with TogetherAI and Mixtral-8x7B-Instruct-v0.1.

The output is not formatted, as reported above by the OP.
I'm using the LiteLLM proxy server.
I used Hugging Face Chat-UI and LibreChat; both had the same problem with formatting.

@nigh8w0lf

I can see the tokens streamed individually as well, but as the OP mentioned, they are displayed a chunk at a time, as if the response is first cached until it hits some sort of limit and is then displayed in Chat-UI.
I will test on LibreChat to see if it's the same behaviour.

@nigh8w0lf

Same behaviour in LibreChat as well, so it looks like it's an issue with the proxy when using TogetherAI, and it happens with any model on TogetherAI.
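
One way to tell whether the buffering happens in the proxy or in the UI is to time the gaps between chunks as they leave the proxy. A quick sketch (the proxy URL, key, and model alias are assumptions): if chunks arrive steadily here but appear in bursts in Chat-UI/LibreChat, the buffering is client-side; if they arrive in bursts here, it is on the proxy/provider side.

import time
from openai import OpenAI

# Assumed proxy settings; adjust to your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-1234")

stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Count from 1 to 20, one number per line."}],
    stream=True,
)

last = time.monotonic()
for chunk in stream:
    now = time.monotonic()
    delta = chunk.choices[0].delta.content or ""
    print(f"+{now - last:.3f}s {delta!r}")  # inter-chunk gap and the delta itself
    last = now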

@nigh8w0lf

@gururise have you found a workaround to this issue, or are you not using Together's APIs?

@gururise
Author

@gururise have you found a workaround to this issue, or are you not using Together's APIs?

Unfortunately, I have found no workaround in LiteLLM. I haven't had time to look further into this issue; perhaps if you have time to provide some more debugging information, @krrishdholakia can fix it.

@krrishdholakia
Contributor

krrishdholakia commented Jan 17, 2024

I'll do some further testing here and try to repro this. I'm not seeing this when I just test the proxy chat completion endpoint with Together AI and streaming in Postman.

@nigh8w0lf

Thanks @gururise.
@krrishdholakia, happy to help with debugging info.

@krrishdholakia
Contributor

@nigh8w0lf can you let me know if you're seeing this issue when making a normal curl request to the proxy endpoint?

Also, which version of litellm are you using?

@nigh8w0lf

@krrishdholakia I can see that the tokens are streamed when running curl or when running the proxy in debug mode; the chunking seems to happen when the tokens are displayed in HF Chat-UI and LibreChat. The formatting issue also occurs when the tokens are displayed in HF Chat-UI and LibreChat.

@nigh8w0lf

Sorry, I forgot to mention the litellm version; I'm using 1.17.5.

@nigh8w0lf

Updated to litellm 1.17.14, still the same issue.
Wondering if the chunking is because the API is too fast 😆

@krrishdholakia
Contributor

Is this then a client-side issue with LibreChat / HF Chat-UI?

cc @Manouchehri: I believe you're also using us with LibreChat, are you seeing similar buffering?

@krrishdholakia
Contributor

@nigh8w0lf do you see this buffering happening for a regular openai call via the proxy?

I remember trying LibreChat with Bedrock and that seemed to work fine.

@Manouchehri
Collaborator

I've been using Azure OpenAI, Bedrock, and Cohere. None of them had this issue from what I remember. =)

@nigh8w0lf

@krrishdholakia it doesn't happen with any other API, only with TogetherAI.

@gururise
Author

gururise commented Jan 18, 2024

@krrishdholakia Just to add, I tried HF Chat-UI with LiteLLM (OpenAI API) and it worked as expected. As @nigh8w0lf says, this issue only occurs when using LiteLLM with TogetherAI.

EDIT: If you look at the debug log I attached to an earlier comment, you can see that LiteLLM is returning escaped newline characters when used with TogetherAI.

@ishaan-jaff
Contributor

ishaan-jaff commented Jan 24, 2024

Related PR: I saw this with Sagemaker: #1569

@ishaan-jaff
Contributor

Pushed a fix for TogetherAI; it will be live in 1.18.13.

2d26875

@gururise @nigh8w0lf can I get your help confirming the issue is fixed on 1.18.13+?
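
One quick way to confirm (a sketch; the model name is the one from this thread, and the Together AI API key is assumed to already be set in your environment) is to stream through the litellm SDK directly and check that deltas arrive token by token with real newlines:

# After upgrading litellm, stream a completion and print each delta with repr();
# real newlines should show as '\n', not the escaped '\\n' from the earlier logs.
import litellm

response = litellm.completion(
    model="together_ai/mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a short bulleted shopping list."}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(repr(delta))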

@nigh8w0lf

nigh8w0lf commented Jan 24, 2024

The chunking issue seems to be fixed; I can see the response being streamed correctly.

The formatting issue is still present for the TogetherAI API.

Seeing some new behavior after this update: HF Chat-UI thinks the response is incomplete. The Continue button appears after the response has been streamed completely. I have not seen this before, and it's happening with all APIs.

[screenshot: Continue button shown after a completed response]
@gururise does it happen for you?

@nigh8w0lf

I don't see the "Continue" button issue when using HF Chat-UI without the proxy.

@krrishdholakia
Contributor

Do you see this when calling openai directly? @nigh8w0lf

@nigh8w0lf

Do you see this when calling openai directly? @nigh8w0lf

No, I don't see it when using openai directly; I see it only when using the proxy.

@nigh8w0lf

When I say directly, I mean using HF Chat-UI without LiteLLM.

@nigh8w0lf

I have switched to LibreChat as the frontend; the "Continue" issue is no longer a concern, but the formatting issue still exists on v1.20.0.

@ishaan-jaff
Contributor

@nigh8w0lf can we track the formatting bug in a new issue, since this issue was about the TogetherAI chunking?

@nigh8w0lf

@ishaan-jaff Sure, I can log a new bug report. The initial bug report above mentions both issues, by the way, which is why I was continuing here.

@nigh8w0lf

#1792 @ishaan-jaff
