How to add a new BERT tokenizer model

We assume the Bling Fire tools are already compiled and the PATH is set.

Initial Steps

  1. Create a new directory under ldbsrc:

cd ldbsrc
mkdir bert_chinese

  2. Copy the content of an existing model similar to yours into the new directory:

cp bert_base_tok/* bert_chinese

  3. Modify options.small to use a new output name for your bin file:

OUTPUT = bert_chinese.bin

USE_CHARMAP = 1

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \
	$(tmpdir)/charmap.mmap.$(mode).dump \

Disable Normalization

If you don't want to use character normalization, such as case folding and accent removal, then you need to remove the charmap.utf8 compilation from the options.small file and from ldb.conf.small:

options.small

OUTPUT = bert_chinese_no_normalization.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

ldb.conf.small

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
# charmap 3

Enable Normalization

If you need normalization, such as case folding, dropping of accents, or something else, you can generate your own charmap.utf8 file. Each line of the file has the format <input character> <output string>, separated by a space. The <output string> is 0 or more characters in length, usually 1. If no entry is found for an <input character>, it remains unchanged. If the <output string> has length 0 (the empty string), then the <input character> is deleted.

charmap.utf8 example

# A --> a
\x0041 \x0061

# B --> b
\x0042 \x0062

# C --> c
\x0043 \x0063

# D --> d
\x0044 \x0064

# E --> e
\x0045 \x0065

# F --> f
\x0046 \x0066

# G --> g
\x0047 \x0067

# H --> h
\x0048 \x0068

It is easy to use a script to generate the charmap you need. For BERT case-folded models we use this command line:

python gen_charmap.py > charmap.utf8
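
gen_charmap.py itself is part of the repository. Purely as an illustration, a minimal generator of a case-folding and accent-removal charmap, based on Python's unicodedata module, might look like the sketch below (an approximation, not the actual script):

import unicodedata

# For every BMP code point, emit an "input output" pair whenever the
# accent-stripped, lowercased form differs from the original character.
for cp in range(0x10000):
    c = chr(cp)
    if unicodedata.category(c).startswith("C"):
        continue  # skip control, surrogate and unassigned code points
    # NFD-decompose, drop combining marks (accents), then lowercase
    stripped = "".join(ch for ch in unicodedata.normalize("NFD", c)
                       if not unicodedata.combining(ch))
    mapped = stripped.lower()
    if mapped != c:
        print("# %s --> %s" % (c, mapped))
        print("%s %s" % ("\\x%04X" % cp,
                         "".join("\\x%04X" % ord(ch) for ch in mapped)))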

After charmap.utf8 is created, you need to make sure options.small and ldb.conf contain the compilation options and the resource reference for charmap.utf8 as before (see the bert_base_tok directory).

Add a New vocab.txt

If your model uses a different vocab.txt file (computation of vocab.txt is outside the scope of this tutorial), then you need to convert it into the "fa_lex" format. We have a simple helper script for this: python vocab_to_fa_lex.py.

Note that the script will create a new wbd.tagset.txt file and a vocab.falex file. vocab.falex contains all the words converted into fa_lex rules; these rules are applied once the tokenizer finds a full token. Each rule maps a token to a unique tag value, which is the same as the token's ID in the original vocab.txt file. This way we marry the tokenizer and a dictionary lookup into one finite-state machine.

vocab.falex example

 < ^ [\[][U][N][K][\]] > --> WORD_ID_100
 < ^ [\[][C][L][S][\]] > --> WORD_ID_101
 < ^ [\[][S][E][P][\]] > --> WORD_ID_102
 < ^ [\[][M][A][S][K][\]] > --> WORD_ID_103
 < ^ [<][S][>] > --> WORD_ID_104
 < ^ [<][T][>] > --> WORD_ID_105
 < ^ [!] > --> WORD_ID_106
 < ^ ["] > --> WORD_ID_107
 < ^ [#] > --> WORD_ID_108
 < ^ [$] > --> WORD_ID_109
 < ^ [%] > --> WORD_ID_110
 < ^ [&] > --> WORD_ID_111
 < ^ ['] > --> WORD_ID_112
 < ^ [(] > --> WORD_ID_113
 < ^ [)] > --> WORD_ID_114
 < ^ [*] > --> WORD_ID_115
 < ^ [+] > --> WORD_ID_116
 < ^ [,] > --> WORD_ID_117
 < ^ [\-] > --> WORD_ID_118
 < ^ [.] > --> WORD_ID_119
 < ^ [/] > --> WORD_ID_120
 < ^ [0] > --> WORD_ID_121
 < ^ [1] > --> WORD_ID_122
 < ^ [2] > --> WORD_ID_123
 < ^ [3] > --> WORD_ID_124
 < ^ [4] > --> WORD_ID_125
 < ^ [5] > --> WORD_ID_126
 < ^ [6] > --> WORD_ID_127
 < ^ [7] > --> WORD_ID_128
 < ^ [8] > --> WORD_ID_129
 < ^ [9] > --> WORD_ID_130
 < ^ [:] > --> WORD_ID_131
 < ^ [;] > --> WORD_ID_132
 < ^ [<] > --> WORD_ID_133
 < ^ [=] > --> WORD_ID_134
 < ^ [>] > --> WORD_ID_135
 < ^ [?] > --> WORD_ID_136
 < ^ [@] > --> WORD_ID_137
 < ^ [\[] > --> WORD_ID_138
 < ^ [\x5C] > --> WORD_ID_139
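
The rules above are produced by vocab_to_fa_lex.py. To make the mapping concrete, here is a simplified converter sketch (the set of characters escaped here is an assumption, and the real script also writes wbd.tagset.txt and handles details omitted in this illustration):

import sys

# Characters assumed to be special inside fa_lex character classes.
SPECIAL = set("[]^-\\")

def char_class(ch):
    # every character becomes its own one-character class, e.g. A -> [A]
    if ch in SPECIAL:
        return "[\\" + ch + "]"
    return "[" + ch + "]"

# vocab.txt: one token per line; the 0-based line number is the token id
with open(sys.argv[1], encoding="utf-8") as f:
    for word_id, line in enumerate(f):
        token = line.rstrip("\n")
        if not token:
            continue
        rule = "".join(char_class(ch) for ch in token)
        print(" < ^ %s > --> WORD_ID_%d" % (rule, word_id))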

Now you also need to make sure that the new vocab.falex file is included in your main tokenization grammar file, wbd.lex.utf8. Note that the path for the _include starts at ldbsrc, so after updating the path you should see something like this in your wbd.lex.utf8:

...
_function FnTokWord
_include bert_chinese/vocab.falex
_end

Compile Your New Model

Assuming you are in ldbsrc directory, type this:

make -f Makefile.gnu lang=bert_chinese all

Since the machine is quite complex, compilation might take a while: 2-4 hours for the existing BERT files. During the compilation make sure there are no "ERROR:" messages printed. If you encounter any, you should not use the bin file even though it may have been created.

Debug Your Model

Sometimes you need to be able to tell why you are getting these IDs and not others. You can use the fa_lex command line tool to see how the text was segmented, where the main words are, and what the sub-tokens for each word are.

Try this:

printf 'Heung-Yeung "Harry" Shum (Chinese: 沈向洋; born in October 1966) is a computer scientist of Chinese origin.' | fa_lex --ldb=ldb/bert_chinese.bin --tagset=bert_chinese/wbd.tagset.txt --normalize-input

You should get:

heung/WORD he/WORD_ID_9245 ung/WORD_ID_9112 -/WORD -/WORD_ID_118 yeung/WORD y/WORD_ID_167 e/WORD_ID_8154 ung/WORD_ID_9112 "/WORD "/WORD_ID_107 harry/WORD harry/WORD_ID_12296 "/WORD "/WORD_ID_107 shum/WORD sh/WORD_ID_11167 um/WORD_ID_8545 (/WORD (/WORD_ID_113 chinese/WORD chinese/WORD_ID_10101 :/WORD :/WORD_ID_131 沈/WORD 沈/WORD_ID_3755 向/WORD 向/WORD_ID_1403 洋/WORD 洋/WORD_ID_3817 ;/WORD ;/WORD_ID_132 born/WORD bo/WORD_ID_11059 rn/WORD_ID_9256 in/WORD in/WORD_ID_8217 october/WORD october/WORD_ID_9548 1966/WORD 1966/WORD_ID_9093 )/WORD )/WORD_ID_114 is/WORD is/WORD_ID_8310 a/WORD a/WORD_ID_143 computer/WORD com/WORD_ID_8134 put/WORD_ID_11300 er/WORD_ID_8196 scientist/WORD sci/WORD_ID_11776 ent/WORD_ID_8936 ist/WORD_ID_9527 of/WORD of/WORD_ID_8205 chinese/WORD chinese/WORD_ID_10101 origin/WORD or/WORD_ID_8549 ig/WORD_ID_11421 in/WORD_ID_8277 ./WORD ./WORD_ID_119

The format is self-explanatory: each WORD is followed by its sub-tokens tagged WORD_ID_NNNN, where NNNN is the id that the API will return.
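
If you only need the numeric ids from this output, a small filter can pull them out (assuming the token/WORD_ID_NNNN format shown above; the script name is arbitrary):

import re
import sys

# collect every NNNN from the .../WORD_ID_NNNN pairs printed by fa_lex
ids = [int(n) for n in re.findall(r"/WORD_ID_(\d+)", sys.stdin.read())]
print(ids)

Pipe the fa_lex output from the command above into this script to see the bare id sequence.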

If the problem you see is in the main word tokenization, you can comment out the sub-token rules and change / recompile the grammar much faster (minutes) until the error is fixed, then uncomment the sub-token logic and do a full final recompile:

...
_function FnTokWord
# comment subtoken rules for faster compilation
# _include bert_chinese/vocab.falex
_end

Test Your Model

You can run your model on a large text (should be in UTF-8 encoding) as follows:

printf 'Heung-Yeung "Harry" Shum (Chinese: 沈向洋; born in October 1966) is a computer scientist of Chinese origin.' | python test_bling.py -m bert_chinese.bin
Heung-Yeung "Harry" Shum (Chinese: 沈向洋; born in October 1966) is a computer scientist of Chinese origin.
[ 9245  9112   118   167  8154  9112   107 12296   107 11167  8545   113
 10101   131  3755  1403  3817   132 11059  9256  8217  9548  9093   114
  8310   143  8134 11300  8196 11776  8936  9527  8205 10101  8549 11421
  8277   119     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
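
test_bling.py ships with the repository; a minimal stand-in that produces similar output with the blingfire Python package might look like this (load_model/text_to_ids are the package's API; the 128-element zero padding and the unknown id of 0 are assumptions matching the output above):

import sys
from blingfire import load_model, free_model, text_to_ids

# load the compiled model once; the handle can be reused across calls
h = load_model("ldb/bert_chinese.bin")

for line in sys.stdin:
    line = line.rstrip("\n")
    print(line)
    # returns a numpy array of ids, zero-padded to 128 elements
    print(text_to_ids(h, line, 128, 0))

free_model(h)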

For a text file:

cat test.txt | python test_bling.py -m ldb/bert_chinese.bin > test.bling.chinese.txt

You can compare the output to that of the original BERT tokenizer code as follows:

cat test.txt | python test_bert.py > test.bert.multi_cased.txt

Please make sure that test_bert.py uses the correct vocab.txt file and drop_case setting.

Then just diff them with your favorite diff program.

Performance on One Thread

ProcTime = Min({Times No Output}) - Min({Times No Process})

Here the "no output" times are the three runs with -s 1 below, and the "no process" times are the three runs with -s 1 -n 1, so:

ProcTime = 1.31 s - 0.19 s = 1.12 s

ProcSpeed = DataSize / ProcTime

With the 4.2 MB test.txt shown below:

ProcSpeed = 4.2 MB / 1.12 s ≈ 3.75 MB/s

(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1
real 1.38
user 2.66
sys 0.99
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1
real 1.31
user 2.42
sys 0.85
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1
real 1.36
user 2.46
sys 0.82
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1 -n 1
real 0.19
user 1.22
sys 0.89
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1 -n 1
real 0.19
user 1.22
sys 0.84
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ time -p cat test.txt | python test_bling.py -m bert_chinese.bin -s 1 -n 1
real 0.19
user 1.23
sys 0.88
(venv) sergeio@Semi-Structure-Store-Sergei-VM:~/BlingFire/BlingFire/Release$ ls -lh test.txt
-rw-rw-r-- 1 sergeio sergeio 4.2M Jul 12 20:08 test.txt

Since the code is written in C++ and does not hold Python's Global Interpreter Lock, you can process your text in parallel. The models are thread safe, so you don't need to keep a pool of them. In a production setting we observed below 1 ms latency per document when the tokenizer was called from a parallel for loop in C++.
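
From Python, a thread pool sharing a single model handle gives a similar effect, since the native call runs outside the interpreter lock (a sketch using the same assumed blingfire API as above):

from concurrent.futures import ThreadPoolExecutor
from blingfire import load_model, free_model, text_to_ids

# one shared handle: the models are thread safe,
# so no per-thread copies are needed
h = load_model("ldb/bert_chinese.bin")

docs = ["first document", "second document", "third document"]

with ThreadPoolExecutor(max_workers=4) as pool:
    all_ids = list(pool.map(lambda d: text_to_ids(h, d, 128, 0), docs))

free_model(h)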