pass stop words to openai api #887

Merged
lvhan028 merged 11 commits into InternLM:main on Jan 11, 2024
Conversation

AllentDan (Collaborator)

No description provided.
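Since the description is empty, here is a hedged sketch of what the title implies: an OpenAI-style request whose stop words are forwarded to the engine. The port, route, and model name are assumptions for illustration, not taken from this PR:

import requests

# Assumed defaults: lmdeploy's api_server conventionally listens on port 23333
# and exposes an OpenAI-compatible /v1/chat/completions route.
resp = requests.post(
    'http://localhost:23333/v1/chat/completions',
    json={
        'model': 'internlm-chat-20b',  # hypothetical model name
        'messages': [{'role': 'user', 'content': 'Count from 1 to 10.'}],
        'stop': ['5'],  # with this PR, these stop words reach the engine
    },
)
print(resp.json())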

AllentDan (Collaborator, Author)

Please note!!! This PR modified the _stop_words function. A full test is needed.

Conflicts:
	lmdeploy/serve/async_engine.py
	lmdeploy/serve/openai/api_server.py
	lmdeploy/tokenizer.py
	lmdeploy/turbomind/turbomind.py
@@ -53,6 +63,27 @@ def _maybe_add_prefix_space(self, tokens, decoded):
        else:
            return decoded

    def indexes_containing_token(self, token: str):
lvhan028 (Collaborator) · Jan 10, 2024

Is indexes_containing_token time-consuming?

AllentDan (Collaborator, Author)

I used maps to get the index, so the time cost should be acceptable.
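As context for the performance question, here is a minimal, self-contained sketch of the approach, reconstructed from the fragments visible in this review; the cache bound, the max_indexes_num value, and the standalone-function shape are assumptions:

from collections import deque

import sentencepiece as spm

_MAX_INDEXES_NUM = 5       # cap on returned indexes; value assumed
_cache = deque(maxlen=10)  # bounded cache of (token, indexes) pairs

def indexes_containing_token(sp: spm.SentencePieceProcessor, token: str):
    # serve repeated lookups from the cache, avoiding a vocab scan
    for cached_token, indexes in _cache:
        if cached_token == token:
            return indexes
    if token == ' ':  # ' ' is special: SentencePiece renders spaces as '▁'
        token = '▁'
    # one linear pass over all vocabulary pieces
    vocab = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
    indexes = [i for i, piece in enumerate(vocab) if token in piece]
    if len(indexes) > _MAX_INDEXES_NUM:
        # too many candidates: fall back to the last id of the token's own encoding
        indexes = sp.encode(token, add_bos=False)[-1:]
    _cache.append((token, indexes))
    return indexes

The scan is O(vocab_size) per new token, but with the cache each distinct stop word pays that cost at most once while it stays cached.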

                f'There are too many (>{self.max_indexes_num}) possible '
                f'indexes that may decode {token}; only {indexes} will be used')
        self._indexes_tokens_deque.append((token, indexes))
        return indexes
Collaborator

What is special about this?

AllentDan (Collaborator, Author)

What do you mean by special?

Collaborator

if token == ' ': # ' ' is special

AllentDan (Collaborator, Author)

In the tokenizer, space characters are all converted into '▁'.
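A quick way to see this behavior (a sketch; the model path is hypothetical, and any SentencePiece model shows the same thing):

import sentencepiece as spm

# load a SentencePiece model; the path here is hypothetical
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

print(sp.encode('hello world', out_type=str))
# e.g. ['▁hello', '▁world']: the space before each word is folded into '▁',
# so a bare ' ' never appears as a vocabulary piece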

        vocab = self.model.IdToPiece(list(range(self.vocab_size)))
        indexes = [i for i, voc in enumerate(vocab) if token in voc]
        if len(indexes) > self.max_indexes_num:
            indexes = self.encode(token, add_bos=False)[-1:]
Collaborator

Can the length here exceed 1? If it can, taking the last element isn't quite right; it might be a common token.

AllentDan (Collaborator, Author)

That case shouldn't happen. If multiple single indexes can each decode to a string containing the token, the token itself should be encoded into only one index.

Collaborator

For example, take token = 'ucke'; internlm-chat-20b contains
ucket
▁bucket
ucker
▁fucked
bucket
Bucket
uckets
_bucket
ucked
▁buckets
▁Bucket
▁sucked
▁Zucker
▁Tucker
(bucket
▁tucked

but the vocabulary does not contain 'ucke' itself.
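This is easy to check by scanning the pieces directly, mirroring the snippet under review (a sketch; the model path is hypothetical):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')  # hypothetical path
vocab = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]

hits = [p for p in vocab if 'ucke' in p]
print(hits)             # many pieces contain 'ucke': ucket, ▁bucket, ucker, ...
print('ucke' in vocab)  # False for internlm-chat-20b, per the reviewer

In that case sp.encode('ucke', add_bos=False) can return more than one id, and keeping only the last one may select a common piece, which is exactly the reviewer's concern.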

lvhan028 merged commit 80cb84c into InternLM:main on Jan 11, 2024
3 of 5 checks passed