RWKV/RWKV-tokenizer-node

Native Node.js tokenizer for RWKV

Zero-dependency tokenizer for the RWKV project.

It should also work for EleutherAI GPT-NeoX and Pythia, as they use the same tokenizer.

Setup

npm i rwkv-tokenizer-node

Usage

// Package name matches the lowercase name used in the install command above
const tokenizer = require("rwkv-tokenizer-node");

// Encode into an array of integer token IDs: [12092, 3645, 2]
const tokens = tokenizer.encode("Hello World!");

// Decode back to "Hello World!"
const decoded = tokenizer.decode(tokens);
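
The encode/decode pair above can be sanity-checked with a small round-trip helper. The sketch below is only an illustration: `roundTrips` and `byteTokenizer` are hypothetical names, and `byteTokenizer` is a toy stand-in that maps text to UTF-8 byte values so the snippet runs without the package installed. It does not use the RWKV vocabulary. Assuming the real module exposes the same `encode`/`decode` shape shown in the usage example, it could be passed in place of the stand-in.

```javascript
// Hypothetical stand-in with the same encode/decode shape as the usage example.
// It maps text to UTF-8 byte values and is NOT the RWKV vocabulary.
const byteTokenizer = {
  encode: (text) => Array.from(Buffer.from(text, "utf8")),
  decode: (tokens) => Buffer.from(tokens).toString("utf8"),
};

// Returns true when decode(encode(text)) reproduces the input exactly.
function roundTrips(tokenizer, text) {
  const tokens = tokenizer.encode(text);
  return tokenizer.decode(tokens) === text;
}

// A few samples mixing ASCII, CJK, accented characters and an emoji.
const samples = ["Hello World!", "こんにちは世界", "naïve café 🚀"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), roundTrips(byteTokenizer, sample));
}
```

To try it against the real tokenizer, replace `byteTokenizer` with the object returned by `require("rwkv-tokenizer-node")`.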

Its primary purpose is to support the implementation of RWKV-cpp-node, though it could probably serve other use cases (e.g. a pure-JS implementation of GPT-NeoX or RWKV).

What can be improved?

  • Performance: it is somewhat disappointing that this is easily 10x slower than the Python implementation (which I believe uses the Rust library); however, it is generally still good enough for most use cases.
  • Why not use the Hugging Face library? Sadly, the official Hugging Face tokenizer library for Node.js is broken: huggingface/tokenizers#911

PS: Anyone with ideas on how to improve its performance without failing the test suite is welcome to contribute.
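
For anyone looking into performance, a rough timing harness can help compare before and after a change. The following is a minimal sketch, not part of this repository: `benchmarkEncode` and `standInTokenizer` are hypothetical names, and the stand-in is a toy byte-level encoder used only so the snippet runs on its own. The numbers it prints say nothing about the real tokenizer until the stand-in is swapped for it.

```javascript
const { performance } = require("node:perf_hooks");

// Hypothetical stand-in exposing encode(); NOT the RWKV tokenizer.
const standInTokenizer = {
  encode: (text) => Array.from(Buffer.from(text, "utf8")),
};

// Times repeated encode() calls and reports totals and the average per call.
function benchmarkEncode(tokenizer, text, iterations = 100) {
  // Warm-up runs so JIT compilation does not dominate the measurement.
  for (let i = 0; i < 10; i++) tokenizer.encode(text);

  let tokenCount = 0;
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    tokenCount += tokenizer.encode(text).length;
  }
  const elapsedMs = performance.now() - start;

  return { iterations, tokenCount, elapsedMs, msPerCall: elapsedMs / iterations };
}

const report = benchmarkEncode(standInTokenizer, "Hello World! ".repeat(1000));
console.log(report);
```

Timings from a loop like this vary between runs and machines, so it is best used for relative comparisons on the same input.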

How to run the test?

# This runs the sole test file, test/tokenizer.test.js
npm run test

The Python script used to seed the reference data (using the Hugging Face tokenizer) is found at test/build-test-token-json.py. The test includes a very extensive UTF-8 test file covering all major (and many minor) languages.
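
The general idea of checking an encoder against reference data can be sketched as follows. This is an illustration only and does not reproduce test/tokenizer.test.js: the `{ text, tokens }` record shape, `findMismatches`, and `toyTokenizer` are all assumptions made for the example, and the toy encoder emits UTF-8 byte values rather than RWKV token IDs.

```javascript
// Hypothetical stand-in exposing encode(); NOT the RWKV tokenizer.
const toyTokenizer = {
  encode: (text) => Array.from(Buffer.from(text, "utf8")),
};

// Assumed reference format: an array of { text, tokens } records.
// Returns every record whose encoded output differs from the expected tokens.
function findMismatches(tokenizer, referenceCases) {
  const mismatches = [];
  for (const { text, tokens: expected } of referenceCases) {
    const actual = tokenizer.encode(text);
    const same =
      actual.length === expected.length &&
      actual.every((token, i) => token === expected[i]);
    if (!same) mismatches.push({ text, expected, actual });
  }
  return mismatches;
}

// Expected values here are UTF-8 bytes, matching the toy encoder above.
const referenceCases = [
  { text: "Hi", tokens: [72, 105] },
  { text: "é", tokens: [195, 169] },
];

console.log("mismatches:", findMismatches(toyTokenizer, referenceCases).length);
```

Collecting all mismatches rather than stopping at the first one makes it easier to see whether a failure is isolated to one script or language.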

Designated maintainer

@picocreator - the current maintainer of the project. Ping him on the RWKV Discord if you have any questions about it.

Special thanks & references

@saharNooby - whose work the current implementation is heavily based on

@cztomsik @josephrocca @BlinkDL - for their various implementations, which were used as references to squash encoding mismatches with the HF implementation.
