
Multi Language Tokenization Support #298

Open
andrewdalpino opened this issue May 27, 2023 · 2 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@andrewdalpino
Member

andrewdalpino commented May 27, 2023

I'm hoping that we can get to the point where we fully support the following languages.

  • English
  • Spanish
  • German
  • French
  • Russian
  • Japanese
  • Hindi
  • Farsi
  • Chinese
  • Arabic

I started adding unit tests covering these languages for a few of the tokenizers here: https://github.com/RubixML/ML/tree/master/tests/Tokenizers. However, it doesn't look like we support all of the languages yet. I only speak English, so it's hard for me to tell. Could we get some help from the community to verify that our Tokenizers support all of these languages and, where they don't, contribute a fix?

https://github.com/RubixML/ML/tree/master/src/Tokenizers

Thank you!

@andrewdalpino andrewdalpino added enhancement New feature or request help wanted Extra attention is needed labels May 27, 2023
@taotecode

How can I join the multi-language development effort? I am good at Chinese and English.

@andrewdalpino
Member Author

Hi @taotecode, thanks for your interest in contributing to the project! Here are the unit tests for the Tokenizers implemented in the library.

https://github.com/RubixML/ML/tree/master/tests/Tokenizers

We need help from native language speakers to ensure that we have test coverage for different languages and that the current tests are correct.
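
To illustrate the kind of check being asked for, here is a minimal sketch in Python of how a word tokenizer can silently fail on non-Latin scripts. It is not code from this library (which is written in PHP), and the names `ASCII_WORD`, `UNICODE_WORD`, `tokenize`, and `SAMPLES` are hypothetical, as are the sample sentences. It only assumes that a tokenizer is built on a regular expression, and compares an ASCII-only pattern with a Unicode-aware one.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only.
# ASCII_WORD matches runs of unaccented Latin letters and digits.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
# In Python 3, \w on a str pattern matches Unicode letters, digits, and "_".
UNICODE_WORD = re.compile(r"\w+")


def tokenize(text, pattern):
    """Return every non-overlapping match of `pattern` in `text`."""
    # Normalize to NFC so accented letters are single code points
    # rather than a base letter followed by a combining mark.
    normalized = unicodedata.normalize("NFC", text)
    return pattern.findall(normalized)


# Assumed sample sentences; a native speaker should supply real test data.
SAMPLES = {
    "English": "The quick brown fox",
    "Russian": "Привет мир",
    "Chinese": "我喜欢机器学习",
}

if __name__ == "__main__":
    for language, text in SAMPLES.items():
        print(language)
        print("  ascii:  ", tokenize(text, ASCII_WORD))
        print("  unicode:", tokenize(text, UNICODE_WORD))
```

With the ASCII-only pattern the Russian sample yields no tokens at all, while the Unicode-aware pattern splits it into two words. The Chinese sample comes back as a single token under the Unicode-aware pattern because the text contains no spaces, and whether that is acceptable is exactly the kind of judgment that needs a native speaker. In PHP's PCRE functions, the analogous thing to look at is whether a pattern uses the `u` modifier and Unicode classes such as `\p{L}`.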
