Add Tokenizer converter by Irozuku · Pull Request #359 · DashAISoftware/DashAI

Irozuku · 2025-10-28T16:25:00Z

This pull request introduces a new Hugging Face-based tokenizer converter to the backend, allowing text columns to be tokenized and each token ID stored in its own column. The main changes involve adding the TokenizerConverter implementation, registering it among available converters, and making it importable from the package.

New Hugging Face Tokenizer Converter

Added the TokenizerConverter class in DashAI/back/converters/hugging_face/tokenizer.py, which tokenizes text columns using a selected Hugging Face tokenizer model, storing each token ID in a separate column and supporting batch processing and device selection.

…erter

Irozuku added 4 commits October 28, 2025 13:22

feat: add TokenizerConverter and related schema for text tokenization

87fb8e6

fix: improve formatting of description and docstring in TokenizerConv…

3dfa851

…erter

Merge branch 'develop' into feat/tokenizer-converter

2005299

Merge branch 'develop' into feat/tokenizer-converter

e5a11d2

cristian-tamblay approved these changes Oct 30, 2025

View reviewed changes

cristian-tamblay merged commit 86086bb into develop Oct 30, 2025
18 checks passed

cristian-tamblay deleted the feat/tokenizer-converter branch October 30, 2025 17:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Tokenizer converter#359

Add Tokenizer converter#359
cristian-tamblay merged 4 commits into
developfrom
feat/tokenizer-converter

Irozuku commented Oct 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Irozuku commented Oct 28, 2025

New Hugging Face Tokenizer Converter

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants