
Tokenizers

Go bindings for the HuggingFace Tokenizers library.

Installation

Run make build to build libtokenizers.a, which is needed to link and run any application that uses these bindings.

To use libtokenizers.a in your Go application, either:

  • Place libtokenizers.a in /usr/lib/ and compile your app as usual with go build.
  • Place libtokenizers.a in the Go source directory of the tokenizers module (e.g. /home/user/go/pkg/mod/github.com/knights-analytics/tokenizers@v0.8.0/) and compile with go build -tags tokenizers_srcdir_relative.

Using pre-built binaries

Build your Go application using pre-built native binaries: docker build --platform=linux/amd64 -f example/Dockerfile .

Getting started

TL;DR: see the working example.

Load a tokenizer from a JSON config:

import "github.com/knights-analytics/tokenizers"

tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
if err != nil {
    return err
}
// release native resources
defer tk.Close()

Encode text and decode tokens:

fmt.Println("Vocab size:", tk.VocabSize())
// Vocab size: 30522
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
// [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
// [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
// brown fox jumps over the lazy dog
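Putting the pieces together, a minimal self-contained program might look like the following sketch. It assumes the same bert-base-uncased.json config path as above and uses only the calls shown in this README:

package main

import (
    "fmt"
    "log"

    "github.com/knights-analytics/tokenizers"
)

func main() {
    // Load the tokenizer from its JSON config.
    tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
    if err != nil {
        log.Fatal(err)
    }
    // Release the underlying native resources when done.
    defer tk.Close()

    // Encode with special tokens ([CLS]/[SEP] for a BERT-style tokenizer).
    ids, tokens := tk.Encode("brown fox jumps over the lazy dog", true)
    fmt.Println(ids)
    fmt.Println(tokens)

    // Decode back to text, skipping special tokens.
    fmt.Println(tk.Decode(ids, true))
}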

Benchmarks

go test . -bench=. -benchmem -benchtime=10s

goos: darwin
goarch: arm64
pkg: github.com/knights-analytics/tokenizers
BenchmarkEncodeNTimes-10        996556      11851 ns/op    116 B/op    6 allocs/op
BenchmarkEncodeNChars-10    1000000000      2.446 ns/op      0 B/op    0 allocs/op
BenchmarkDecodeNTimes-10       7286056       1657 ns/op    112 B/op    4 allocs/op
BenchmarkDecodeNTokens-10     65191378      211.0 ns/op      7 B/op    0 allocs/op
PASS
ok      github.com/knights-analytics/tokenizers    126.681s
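The benchmark names suggest repeated Encode/Decode calls over a fixed input. A minimal sketch of what such a benchmark could look like is below; the function name and input string are illustrative, not the repository's actual benchmark code:

package tokenizers_test

import (
    "testing"

    "github.com/knights-analytics/tokenizers"
)

// BenchmarkEncode repeatedly encodes a fixed sentence to measure the
// per-call overhead of crossing the Go <-> native boundary.
func BenchmarkEncode(b *testing.B) {
    tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
    if err != nil {
        b.Fatal(err)
    }
    defer tk.Close()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        tk.Encode("brown fox jumps over the lazy dog", true)
    }
}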

Contributing

Please refer to CONTRIBUTING.md for information on how to contribute a PR to this project.
