[BUG]: Outputting Chinese characters may result in incomplete UTF8 encoding, causing garbled text #1048

@jxq1997216

Description

I am building translation software. When the model translates into Chinese, it sometimes emits tokens that contain incomplete UTF-8 byte sequences, and the model then freezes when that output is fed back in as context. I suspect the problem is in the decoding step; it should be possible to check there whether a token forms a complete UTF-8 sequence before emitting it.
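
To illustrate the kind of check suggested above, here is a minimal sketch in Python (chosen only for brevity; it is not LLamaSharp code). It assumes the decoder has access to the raw bytes of each token. The helper names `incomplete_tail_length` and `split_complete` are hypothetical.

```python
def incomplete_tail_length(data: bytes) -> int:
    """Return how many trailing bytes of `data` belong to an unfinished
    UTF-8 sequence (0 if the buffer ends on a character boundary)."""
    # A UTF-8 sequence is at most 4 bytes long, so an unfinished one
    # can occupy at most the last 3 bytes of the buffer.
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0b1100_0000 == 0b1000_0000:
            continue  # continuation byte: keep looking for the lead byte
        if byte & 0b1110_0000 == 0b1100_0000:
            needed = 2
        elif byte & 0b1111_0000 == 0b1110_0000:
            needed = 3
        elif byte & 0b1111_1000 == 0b1111_0000:
            needed = 4
        else:
            return 0  # ASCII or invalid lead byte: nothing is pending
        return back if back < needed else 0
    return 0


def split_complete(data: bytes) -> tuple[str, bytes]:
    """Decode the complete part of `data` and return (text, pending_bytes).
    The pending bytes should be prepended to the next token's bytes."""
    tail = incomplete_tail_length(data)
    cut = len(data) - tail
    return data[:cut].decode("utf-8", errors="replace"), data[cut:]
```

With this approach, only complete characters are ever emitted as text; the unfinished bytes are held back until the following token supplies the rest of the sequence.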

Reproduction Steps

1. Use the Sakura13B model to translate Japanese text into Chinese, with some symbols, emojis, etc. added. Here is my original text:
『いいね』\n『ウインナー好き』\n『おじさんのウインナーも食べて欲しいナ(笑)(^_^)😃✋💕』\n『一回茹でるといいらしいぞ』
2. The output contains garbled symbols such as 口口.
3. The model gets stuck if that output is fed back to it as context.
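
One plausible mechanism for the garbled symbols, shown as a small Python sketch: if a multi-byte character such as an emoji is split across two tokens and each token is decoded on its own, neither half is valid UTF-8. The split point below is an assumption made for illustration, not something taken from the actual tokenizer.

```python
import codecs

emoji_bytes = "😃".encode("utf-8")           # 4 bytes: f0 9f 98 83
pieces = [emoji_bytes[:2], emoji_bytes[2:]]   # hypothetical split across two tokens

# Decoding each piece separately cannot succeed, so U+FFFD
# replacement characters end up in the output text.
naive = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# An incremental decoder buffers the unfinished bytes and only
# produces the character once the remaining bytes arrive.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
streamed = "".join(decoder.decode(p) for p in pieces)

print(repr(naive))
print(repr(streamed))
```

.NET offers the same kind of stateful decoding through `System.Text.Decoder` (obtained from `Encoding.UTF8.GetDecoder()`), which keeps trailing incomplete bytes between calls.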

Environment & Configuration

  • Operating system: Windows 11
  • .NET runtime version: 8.0
  • LLamaSharp version: 0.19
  • CUDA version (if you are using cuda backend): 12.7
  • CPU & GPU device: NVIDIA GeForce RTX 3090

Known Workarounds

No response
