pass stop words to openai api #887

Merged
lvhan028 merged 11 commits into InternLM:main on Jan 11, 2024
Conversation

AllentDan (Collaborator)

No description provided.
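Since the description is empty, here is a hedged sketch of what the title implies: an OpenAI-style request whose stop words are forwarded to the engine. The port, route, and model name are assumptions for illustration, not taken from this PR:

import requests

# Assumed defaults: lmdeploy's api_server conventionally listens on port 23333
# and exposes an OpenAI-compatible /v1/chat/completions route.
resp = requests.post(
    'http://localhost:23333/v1/chat/completions',
    json={
        'model': 'internlm-chat-20b',  # hypothetical model name
        'messages': [{'role': 'user', 'content': 'Count from 1 to 10.'}],
        'stop': ['5'],  # with this PR, these stop words reach the engine
    },
)
print(resp.json())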

AllentDan (Collaborator, Author)

Please note!!! This PR modified the _stop_words function. A full test is needed.

Conflicts:
	lmdeploy/serve/async_engine.py
	lmdeploy/serve/openai/api_server.py
	lmdeploy/tokenizer.py
	lmdeploy/turbomind/turbomind.py
@@ -53,6 +63,27 @@ def _maybe_add_prefix_space(self, tokens, decoded):
        else:
            return decoded

    def indexes_containing_token(self, token: str):
lvhan028 (Collaborator) · Jan 10, 2024

Is indexes_containing_token time-consuming?

AllentDan (Collaborator, Author)

I used maps to get the index, so the time cost should be acceptable.
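As context for the performance question, here is a minimal, self-contained sketch of the approach, reconstructed from the fragments visible in this review; the cache bound, the max_indexes_num value, and the standalone-function shape are assumptions:

from collections import deque

import sentencepiece as spm

_MAX_INDEXES_NUM = 5       # cap on returned indexes; value assumed
_cache = deque(maxlen=10)  # bounded cache of (token, indexes) pairs

def indexes_containing_token(sp: spm.SentencePieceProcessor, token: str):
    # serve repeated lookups from the cache, avoiding a vocab scan
    for cached_token, indexes in _cache:
        if cached_token == token:
            return indexes
    if token == ' ':  # ' ' is special: SentencePiece renders spaces as '▁'
        token = '▁'
    # one linear pass over all vocabulary pieces
    vocab = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
    indexes = [i for i, piece in enumerate(vocab) if token in piece]
    if len(indexes) > _MAX_INDEXES_NUM:
        # too many candidates: fall back to the last id of the token's own encoding
        indexes = sp.encode(token, add_bos=False)[-1:]
    _cache.append((token, indexes))
    return indexes

The scan is O(vocab_size) per new token, but with the cache each distinct stop word pays that cost at most once while it stays cached.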

                f'There are too many (>{self.max_indexes_num}) possible '
                f'indexes that may decode {token}; only {indexes} will be used')
        self._indexes_tokens_deque.append((token, indexes))
        return indexes
Collaborator

What is special about this?

AllentDan (Collaborator, Author)

What do you mean by special?

Collaborator

if token == ' ': # ' ' is special

AllentDan (Collaborator, Author)

In the tokenizer, space characters are all converted into '▁'.
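A quick way to see this behavior (a sketch; the model path is hypothetical, and any SentencePiece model shows the same thing):

import sentencepiece as spm

# load a SentencePiece model; the path here is hypothetical
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

print(sp.encode('hello world', out_type=str))
# e.g. ['▁hello', '▁world']: the space before each word is folded into '▁',
# so a bare ' ' never appears as a vocabulary piece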

        vocab = self.model.IdToPiece(list(range(self.vocab_size)))
        indexes = [i for i, voc in enumerate(vocab) if token in voc]
        if len(indexes) > self.max_indexes_num:
            indexes = self.encode(token, add_bos=False)[-1:]
Collaborator

Can the length here exceed 1? If it can, taking the last element isn't quite right; it might be a common token.

AllentDan (Collaborator, Author)

That case shouldn't happen. If multiple single indexes can each decode to a string containing the token, the token itself should be encoded into only one index.

Collaborator

For example, take token = 'ucke'; internlm-chat-20b contains
ucket
▁bucket
ucker
▁fucked
bucket
Bucket
uckets
_bucket
ucked
▁buckets
▁Bucket
▁sucked
▁Zucker
▁Tucker
(bucket
▁tucked

but the vocabulary does not contain 'ucke' itself.
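This is easy to check by scanning the pieces directly, mirroring the snippet under review (a sketch; the model path is hypothetical):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')  # hypothetical path
vocab = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]

hits = [p for p in vocab if 'ucke' in p]
print(hits)             # many pieces contain 'ucke': ucket, ▁bucket, ucker, ...
print('ucke' in vocab)  # False for internlm-chat-20b, per the reviewer

In that case sp.encode('ucke', add_bos=False) can return more than one id, and keeping only the last one may select a common piece, which is exactly the reviewer's concern.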

lvhan028 merged commit 80cb84c into InternLM:main on Jan 11, 2024
3 of 5 checks passed