HuggingFace WordPiece Tokenizer in C++

Takes the .json file of a pre-trained tokenizer as input. Only inference mode is available at the moment.
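For readers unfamiliar with WordPiece inference, here is a minimal sketch of the usual greedy longest-match-first split of a single word. It is not taken from this repository: the function name `wordpiece_split`, the in-memory `vocab` set, and the `[UNK]` and `##` defaults are assumptions chosen for illustration. It also works on bytes rather than Unicode characters.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not this repository's API: splits one word into
// WordPiece tokens by greedy longest-match-first lookup in `vocab`.
// Works on raw bytes; Unicode-aware handling is deliberately left out.
std::vector<std::string> wordpiece_split(
    const std::string& word,
    const std::unordered_set<std::string>& vocab,
    const std::string& unk = "[UNK]",
    const std::string& prefix = "##") {
    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string match;
        // Shrink the candidate from the right until it is in the vocabulary.
        while (end > start) {
            std::string candidate = word.substr(start, end - start);
            // Pieces that do not start the word carry the continuation prefix.
            if (start > 0) candidate = prefix + candidate;
            if (vocab.count(candidate) > 0) {
                match = candidate;
                break;
            }
            --end;
        }
        // No piece fits at this position: the whole word maps to the unknown token.
        if (match.empty()) return {unk};
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

With a toy vocabulary of `un`, `##aff` and `##able`, the word `unaffable` splits into those three pieces, while a word with no matching piece comes back as a single `[UNK]`.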

Change the .json path in this line of the source:

WordPieceTokenizer tokenizer("tokenizer.json");

Build:

Requires the International Components for Unicode (ICU) library:

sudo apt-get install libicu-dev

Compile:

g++ tokenizer.cpp -licuuc -o tokenizer

Run:

./tokenizer