Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Qwen support #850

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Add Qwen support #850

wants to merge 3 commits into from

Conversation

chenht2026
Copy link

@chenht2026 chenht2026 commented Dec 29, 2023

It works.
Qwen's tokenizer is based on tiktoken, I add the tokenizer(tokenization_qwen.py) from its huggingface repo without any revision. This make the code a little complicate, so maybe do not merge.

May be some one needs it.

Closes #840

@lantiga
Copy link
Contributor

lantiga commented Jan 8, 2024

Thank you @chenht2026, sorry for the wait. A few of us took a break :-)

Copy link
Contributor

@lantiga lantiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contributions! One comment on the config flag

@@ -27,6 +27,8 @@ class Config:
rotary_percentage: float = 0.25
parallel_residual: bool = True
bias: bool = True
# just for Qwen
is_Qwen: Optional[bool] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid this in favor of something that characterizes Qwen, like having bias only in c_attn.
For the time being we could rename this as attn_bias, and then in the future turn bias into a Option[bool, List[str]] if there's a need for it.

@lantiga lantiga added the enhancement New feature or request label Jan 8, 2024
Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a tests/test_model.py test?

Comment on lines +1 to +4
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you didn't write the code from this file (which I assume you were not since you did add this license), you should link it to the original source. Is the original from PaddlePaddle? https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/qwen/tokenizer.py

I would advice that you create a version that only implements that few methods required by tokenizer.py

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copy it from their huggingface repo. tokenization_qwen.py

from typing import Collection, Dict, List, Set, Tuple, Union

import tiktoken
from transformers import PreTrainedTokenizer, AddedToken
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This project doesn't have transformers as a dependency, so this import is not possible

@@ -91,6 +95,8 @@ def encode(
tokens = self.processor.encode(string).ids
elif self.backend == "sentencepiece":
tokens = self.processor.encode(string)
elif self.backend == "tiktoken":
tokens = self.processor.encode(string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like the new processor implements this method. Also, what about decoding?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good mention qwen's recommended languages.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chinese and English. I'll add it.

@samuelazran
Copy link

It works. Qwen's tokenizer is based on tiktoken, I add the tokenizer(tokenization_qwen.py) from its huggingface repo without any revision. This make the code a little complicate, so maybe do not merge.

May be some one needs it.

Closes #840

Would it work with Qwen 2 (Qwen/Qwen1.5-7B-Chat)? if not what needs to be added?
Qwen1.5 improve a lot on its predecessor in performance.
https://huggingface.co/Qwen/Qwen1.5-7B-Chat/tree/main

@rasbt
Copy link
Collaborator

rasbt commented Jul 8, 2024

Hey just pinging to see if you are still interested in pursuing this PR. Personally, I think it'd be awesome to support the Qwen models (1.5 and especially 2) in LitGPT. There have been some improvements in the tokenizer in LitGPT recently that could now make this more easily possible.

Btw if rebasing here based on the main branch (which changed a lot) is too messy, you could also just open a fresh PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Qwen support
5 participants