Skip to content

Add Tokenizer converter#359

Merged
cristian-tamblay merged 4 commits into
developfrom
feat/tokenizer-converter
Oct 30, 2025
Merged

Add Tokenizer converter#359
cristian-tamblay merged 4 commits into
developfrom
feat/tokenizer-converter

Conversation

@Irozuku
Copy link
Copy Markdown
Collaborator

@Irozuku Irozuku commented Oct 28, 2025

This pull request introduces a new Hugging Face-based tokenizer converter to the backend, allowing text columns to be tokenized and each token ID stored in its own column. The main changes involve adding the TokenizerConverter implementation, registering it among available converters, and making it importable from the package.

New Hugging Face Tokenizer Converter

  • Added the TokenizerConverter class in DashAI/back/converters/hugging_face/tokenizer.py, which tokenizes text columns using a selected Hugging Face tokenizer model, storing each token ID in a separate column and supporting batch processing and device selection.

@cristian-tamblay cristian-tamblay merged commit 86086bb into develop Oct 30, 2025
18 checks passed
@cristian-tamblay cristian-tamblay deleted the feat/tokenizer-converter branch October 30, 2025 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants