Describe the current vocabularies supported by ModularTokenizer in the documentation #96

Open
sivanravidos opened this issue Jan 11, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@sivanravidos
Collaborator

Is your feature request related to a problem? Please describe.
The ModularTokenizer is very useful for various biologics tasks and for model development. However, it is hard to tell from its Readme which vocabularies are supported.
Instruction on how to extend the tokenizer
Describe the solution you'd like
I suggest that the Readme start from a tokenizer user's perspective: begin with a simple code snippet showing how to use it, then describe which vocabularies are supported and how they were created.
For the latter, note that while the amino acid token vocabulary is small, constant, and well known, the gene and cell type vocabularies are large and vary between sources. A reader would want to know how these vocabularies were created and which trusted sources they were taken from.
Only then would I add the existing sections on the internals of the tokenizer, how to extend it, etc.

Additionally, I would highlight the tokenizer in the main Readme and point to its internal Readme.


3 participants