In [257]:
import os, re

Let's compare BPE, SentencePiece, and (pretrained) BERT tokenization.

# Word Tokenizer

But before that, let's look at a word tokenizer first.

In [2]:
from nltk.tokenize import word_tokenize

In [7]:
sent = "There's a son-in-law, mother-in-law, etc."
tokens = word_tokenize(sent)
print(tokens)

['There', "'s", 'a', 'son-in-law', ',', 'mother-in-law', ',', 'etc', '.']


How do we get the original sentence back from the tokens?

In [8]:
" ".join(tokens)

"There 's a son-in-law , mother-in-law , etc ."

In [9]:
"".join(tokens)

"There'sason-in-law,mother-in-law,etc."

Tricky!
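One way around this, sketched here under the assumption that every token appears verbatim in the original sentence (true for this example, but not for every `word_tokenize` output), is to remember the text that preceded each token. The helper names `tokens_with_gaps` and `restore` are made up for this illustration and are not part of NLTK.

```python
def tokens_with_gaps(tokens, text):
    """Pair each token with the raw text (usually spaces) found right before it."""
    pairs, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if tok is not found
        pairs.append((text[pos:start], tok))
        pos = start + len(tok)
    return pairs

def restore(pairs):
    """Rebuild the sentence by putting every gap back in front of its token."""
    return "".join(gap + tok for gap, tok in pairs)

sent = "There's a son-in-law, mother-in-law, etc."
tokens = ['There', "'s", 'a', 'son-in-law', ',', 'mother-in-law', ',', 'etc', '.']
pairs = tokens_with_gaps(tokens, sent)
restored_sent = restore(pairs)
print(restored_sent)
```

The point is that the token list alone has lost the spacing information, so it has to be stored somewhere else. The subword schemes below avoid this by encoding the word boundary in the pieces themselves.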

# BPE

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): [Neural Machine Translation of Rare Words with Subword Units](http://www.aclweb.org/anthology/P16-1162). Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

"Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. We adapt this algorithm for word segmentation. Instead of merging frequent pairs of bytes, we merge characters or character sequences."

https://github.com/rsennrich/subword-nmt

In [4]:
os.system("pip install subword-nmt")

0

### Let's play a little with a toy example

In [261]:
# let's create a sample text
# This is the same example as the one in the above paper.
text ="low\n"*5 + "lower\n"*2 + "newest\n"*6 + "widest\n"*3
print(text)

low
low
low
low
low
lower
lower
newest
newest
newest
newest
newest
newest
widest
widest
widest



In [262]:
with open('toy', 'w') as f:
    f.write(text)

step 1. Learn BPE.  
Run byte pair encoding on the training text and generate the merge operations, i.e., codes.

In [263]:
# Note that -s sets the number of merge operations
learn_bpe = "subword-nmt learn-bpe -s 1 --min-frequency 2 < toy > codes"
os.system(learn_bpe)

0

In [264]:
codes = open('codes', 'r').read()
print("==codes==\n" + codes + "====")
print("number of codes: ", len(codes.splitlines())-1)

==codes==
#version: 0.2
s t</w>
====
number of codes:  1


△ `</w>` means end of a word.

△ Check the toy sample carefully. The last 9 words (6 `newest` + 3 `widest`) end in `st`, so the pair `s t</w>` occurs 9 times, and no other pair occurs more often.
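To check that count by hand, here is a minimal sketch of the pair-counting step. It assumes the end-of-word marker is fused to the last character of each word (mirroring `t</w>` in the codes file); `to_symbols` and `count_pairs` are illustrative names, not functions from `subword-nmt`.

```python
from collections import Counter

# Word frequencies of the toy text.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

def to_symbols(word):
    # Assumption: the last character carries the end-of-word marker.
    return tuple(word[:-1]) + (word[-1] + "</w>",)

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

vocab_symbols = {to_symbols(w): f for w, f in word_freqs.items()}
pair_counts = count_pairs(vocab_symbols)
print(pair_counts[("s", "t</w>")], pair_counts[("e", "s")], pair_counts[("l", "o")])
```

In this sketch `('e', 's')` ties with `('s', 't</w>')` at 9; the codes file above shows that the tool picked `s t</w>` as the first merge.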

step 2. Apply BPE.  
Apply the codes to the designated file so that the original text is segmented.
For the demo, we apply the codes to the same toy file.


In [265]:
apply_bpe = "subword-nmt apply-bpe -c codes < toy > bpe"
os.system(apply_bpe)

0

In [266]:
bpe = open('bpe', 'r').read()
print("==segmented==\n" + bpe + "====")

==segmented==
l@@ o@@ w
l@@ o@@ w
l@@ o@@ w
l@@ o@@ w
l@@ o@@ w
l@@ o@@ w@@ e@@ r
l@@ o@@ w@@ e@@ r
n@@ e@@ w@@ e@@ st
n@@ e@@ w@@ e@@ st
n@@ e@@ w@@ e@@ st
n@@ e@@ w@@ e@@ st
n@@ e@@ w@@ e@@ st
n@@ e@@ w@@ e@@ st
w@@ i@@ d@@ e@@ st
w@@ i@@ d@@ e@@ st
w@@ i@@ d@@ e@@ st
====


△ Note that only `st` is glued.
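What "applying the codes" does to a single word can be sketched as follows. This is a simplified re-implementation that replays the merges in the order they were learned, not the actual `apply-bpe` code; `apply_merges` is a hypothetical helper.

```python
def apply_merges(word, merges):
    """Segment one word by replaying the learned merges in order."""
    # Start from characters, with the end-of-word marker on the last one.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # glue the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Output format: "@@" on every non-final piece, no marker on the last one.
    last = symbols[-1][: -len("</w>")]
    return " ".join([s + "@@" for s in symbols[:-1]] + [last])

one_merge = [("s", "t</w>")]
print(apply_merges("newest", one_merge))
print(apply_merges("low", one_merge))
```

With the single learned merge, only the final `s` and `t` are joined, which is what the segmented file shows.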

step 3. Get vocab. 
We get vocabulary from the segmented file.

In [267]:
get_vocab = "subword-nmt get-vocab < bpe > vocab"
os.system(get_vocab)

0

In [268]:
vocab = open('vocab', 'r').read()
print("==vocab==\n" + vocab + "====")
print("number of vocab: ", len(vocab.splitlines()))

==vocab==
e@@ 17
w@@ 11
st 9
l@@ 7
o@@ 7
n@@ 6
w 5
i@@ 3
d@@ 3
r 2
====
number of vocab:  10


△ Note that # codes (=1) is not the same as # vocab (=10).
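The vocabulary is simply the set of distinct pieces in the segmented text, together with their counts, which is why it is unrelated to the number of codes. A rough equivalent of that counting step, assuming pieces are separated by whitespace as in the output above:

```python
from collections import Counter

# The segmented toy text from step 2, rebuilt here so the snippet is self-contained.
segmented = ("l@@ o@@ w\n" * 5 + "l@@ o@@ w@@ e@@ r\n" * 2
             + "n@@ e@@ w@@ e@@ st\n" * 6 + "w@@ i@@ d@@ e@@ st\n" * 3)

piece_counts = Counter(segmented.split())
for piece, count in piece_counts.most_common():
    print(piece, count)
print("number of vocab: ", len(piece_counts))
```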

### What if we increase the number of operations?

In [269]:
learn_bpe = "subword-nmt learn-bpe -s 10 --min-frequency 2 < toy > codes"
os.system(learn_bpe)
codes = open('codes', 'r').read()
print("==codes==\n" + codes + "====")
print("number of codes: ", len(codes.splitlines())-1)

apply_bpe = "subword-nmt apply-bpe -c codes < toy > bpe"
os.system(apply_bpe)
bpe = open('bpe', 'r').read()
print("\n==segmented==\n" + bpe + "====")

get_vocab = "subword-nmt get-vocab < bpe > vocab"
os.system(get_vocab)
vocab = open('vocab', 'r').read()
print("\n==vocab==\n" + vocab + "====")
print("number of vocab: ", len(vocab.splitlines()))

==codes==
#version: 0.2
s t</w>
e st</w>
l o
w est</w>
n e
ne west</w>
lo w</w>
w i
wi d
wid est</w>
====
number of codes:  10

==segmented==
low
low
low
low
low
lo@@ w@@ e@@ r
lo@@ w@@ e@@ r
newest
newest
newest
newest
newest
newest
widest
widest
widest
====

==vocab==
newest 6
low 5
widest 3
lo@@ 2
w@@ 2
e@@ 2
r 2
====
number of vocab:  7


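△ As you've seen, if you increase the number of operations,   
words are less segmented.
In this toy example the vocabulary also shrinks (10 → 7) because whole words replace many single-character pieces; on a larger corpus, more merges usually means a larger vocabulary.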

### How to restore the original text from the segmented one?

In [270]:
restored = re.sub("@@( |$)", "", bpe)
print(restored)

low
low
low
low
low
lower
lower
newest
newest
newest
newest
newest
newest
widest
widest
widest
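Why the pattern works: every non-final piece ends in `@@` and is followed by a single space, so deleting `@@` plus that space joins the piece to its right neighbour. A small self-contained check (`restore_bpe` is just a wrapper name chosen here):

```python
import re

def restore_bpe(segmented):
    # Drop each "@@" marker together with the space that follows it.
    # The "$" alternative covers a marker at the very end of the string.
    return re.sub(r"@@( |$)", "", segmented)

sample = "lo@@ w@@ e@@ r\nnewest\nw@@ i@@ d@@ e@@ s@@ t\n"
print(restore_bpe(sample))
```

Line breaks are left untouched, so the restored text keeps one word per line, as in the toy file.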



### How to restrict vocabulary?

In [271]:
reapply_bpe = "subword-nmt apply-bpe -c codes --vocabulary vocab --vocabulary-threshold 5 < toy > bpe2"
os.system(reapply_bpe)

0

In [272]:
bpe2 = open('bpe2', 'r').read()
print(bpe2)

low
low
low
low
low
l@@ o@@ w@@ e@@ r
l@@ o@@ w@@ e@@ r
newest
newest
newest
newest
newest
newest
w@@ i@@ d@@ e@@ s@@ t
w@@ i@@ d@@ e@@ s@@ t
w@@ i@@ d@@ e@@ s@@ t



In [273]:
# To compare with the original bpe segmented result, print it again.
print(bpe)

low
low
low
low
low
lo@@ w@@ e@@ r
lo@@ w@@ e@@ r
newest
newest
newest
newest
newest
newest
widest
widest
widest



△ `widest`, which was kept whole before, is now split into `w@@ i@@ d@@ e@@ s@@ t` because its frequency (3) is below the threshold of 5. Likewise, `lo@@` (frequency 2) is split into `l@@ o@@`.
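The effect of the threshold can be imitated with a simplified sketch. It assumes a piece is kept when its frequency is greater than or equal to the threshold (consistent with `low 5` surviving above) and, unlike the real tool, which undoes merges step by step, it falls straight back to single characters. All function names are hypothetical.

```python
def allowed_pieces(vocab_text, threshold):
    """Pieces whose frequency reaches the threshold."""
    allowed = set()
    for line in vocab_text.splitlines():
        piece, freq = line.split()
        if int(freq) >= threshold:
            allowed.add(piece)
    return allowed

def split_to_chars(piece):
    # "lo@@" -> ["l@@", "o@@"]; "widest" -> ["w@@", "i@@", "d@@", "e@@", "s@@", "t"]
    if piece.endswith("@@"):
        return [c + "@@" for c in piece[:-2]]
    return [c + "@@" for c in piece[:-1]] + [piece[-1]]

def restrict(line, allowed):
    """Keep allowed pieces, break every other piece into characters."""
    out = []
    for piece in line.split():
        out.extend([piece] if piece in allowed else split_to_chars(piece))
    return " ".join(out)

vocab_text = "newest 6\nlow 5\nwidest 3\nlo@@ 2\nw@@ 2\ne@@ 2\nr 2\n"
allowed = allowed_pieces(vocab_text, 5)
print(sorted(allowed))
print(restrict("widest", allowed))
print(restrict("lo@@ w@@ e@@ r", allowed))
```

On this toy vocabulary the shortcut reproduces the output above, because no intermediate piece reaches the threshold; on a real corpus the two approaches can differ.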

Note that neither the original vocabulary nor its thresholded subset matches the new segmentation any more, so we need to extract the final vocabulary again.

In [274]:
get_vocab = "subword-nmt get-vocab < bpe2 > vocab2"
os.system(get_vocab)

0

In [275]:
vocab2 = open("vocab2", 'r').read()
print(vocab2)

newest 6
low 5
w@@ 5
e@@ 5
i@@ 3
d@@ 3
s@@ 3
t 3
l@@ 2
o@@ 2
r 2



### Let's test with a bigger text.

Download a sample file for demonstration from subword-nmt.

In [276]:
download = "wget https://github.com/rsennrich/subword-nmt/raw/master/subword_nmt/tests/data/corpus.en"
os.system(download)

0

In [213]:
print(open('corpus.en', 'r').read()[:1000])

iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould .
iron cement protects the ingot against the hot , abrasive steel casting process .
a fire restant repair cement for fire places , ovens , open fireplaces etc .
construction and repair of highways and ...
an announcement must be commercial character .
goods and services advancement through the P.O.Box system is NOT ALLOWED .
deliveries ( spam ) and other improper information deleted .
translator Internet is a Toolbar for MS Internet Explorer .
it allows you to translate in real time any web pasge from one language to another .
you only have to select languages and TI does all the work for you ! automatic dictionary updates ....
this software is written in order to increase your English keyboard typing speed , through teaching the basics of how to put your hand on to the keyboard and give some training examples .
each lesson teaches some extra k

In [217]:
learn_bpe = "subword-nmt learn-bpe -s 1000 --min-frequency 2 < corpus.en > codes"
os.system(learn_bpe)
codes = open('codes', 'r').read()
print("==codes==\n" + codes[:100] + "====")
print("number of codes: ", len(codes.splitlines())-1)

apply_bpe = "subword-nmt apply-bpe -c codes < corpus.en > bpe"
os.system(apply_bpe)
bpe = open('bpe', 'r').read()
print("\n==segmented==\n" + bpe[:100] + "====")

get_vocab = "subword-nmt get-vocab < bpe > vocab"
os.system(get_vocab)
vocab = open('vocab', 'r').read()
print("\n==vocab==\n" + vocab[:100] + "====")
print("number of vocab: ", len(vocab.splitlines()))

==codes==
#version: 0.2
t h
th e</w>
i n
a n
e r
r e
o r
a r
t i
an d</w>
o f</w>
e n
o u
o n
t o</w>
o n</w>
====
number of codes:  1000

==segmented==
ir@@ on c@@ ement is a read@@ y for use pa@@ st@@ e which is la@@ id as a fil@@ let by pu@@ t@@ ty k====

==vocab==
the 1358
, 1291
. 968
and 663
of 651
a 623
in 506
to 490
is 351
ed 279
s@@ 258
c@@ 254
you 253
for 2====
number of vocab:  1120


# (BPE in) SentencePiece

In [104]:
os.system("pip install sentencepiece")

0

In [106]:
import sentencepiece as spm

step 1. Train.  
This should generate `m.model` and `m.vocab`. This is analogous to `learn-bpe` in `subword-nmt`. However, unlike `subword-nmt`, what you fix in advance is the vocabulary size, not the number of merge operations.

In [238]:
train = '--input=corpus.en --model_prefix=m --vocab_size=1000 --model_type=bpe'
spm.SentencePieceTrainer.Train(train)

True

Check the vocab file.

In [237]:
vocab = open('m.vocab', 'r').read()
print("\n==vocab==\n" + vocab[:100] + "\n====")
print("number of vocab: ", len(vocab.splitlines()))


==vocab==
<unk>	0
<s>	0
</s>	0
▁t	-0
▁a	-1
▁th	-2
in	-3
▁the	-4
er	-5
▁o	-6
re	-7
▁,	-8
▁s	-9
at	-10
nd	-11
▁.
====
number of vocab:  1000


△ `▁` stands for a space; it is attached to the front of the piece that follows the space.

step 2. Encode.  
Load the trained model, then segment the designated text file into pieces from the learned vocabulary.

In [285]:
# Load model
sp = spm.SentencePieceProcessor()
sp.Load("m.model")

# Segment
input_text = open('corpus.en', 'r').read()
pieces = sp.EncodeAsPieces(input_text)
ids = sp.EncodeAsIds(input_text)
print(" ".join(pieces[:100]))
print(ids[:100])

▁ ir on ▁c e ment ▁is ▁a ▁read y ▁for ▁use ▁p ast e ▁which ▁is ▁la id ▁as ▁a ▁f ill et ▁by ▁p ut t y ▁kn ife ▁or ▁f ing er ▁in ▁the ▁m ould ▁ ed g es ▁( ▁cor n ers ▁) ▁of ▁the ▁st e el ▁in g ot ▁m ould ▁. ▁ ir on ▁c e ment ▁pr ot ect s ▁the ▁in g ot ▁ag ain st ▁the ▁hot ▁, ▁ab r as ive ▁st e el ▁c ast ing ▁process ▁. ▁a ▁f ire ▁rest ant ▁rep a ir ▁c
[923, 92, 20, 18, 924, 115, 55, 4, 596, 940, 59, 362, 28, 202, 924, 173, 55, 431, 112, 97, 4, 22, 126, 58, 158, 28, 61, 925, 940, 353, 654, 119, 22, 31, 8, 30, 7, 33, 204, 923, 27, 941, 26, 146, 888, 929, 102, 150, 29, 7, 124, 924, 60, 30, 941, 46, 33, 204, 15, 923, 92, 20, 18, 924, 115, 94, 46, 186, 930, 7, 30, 941, 46, 586, 138, 73, 7, 834, 11, 279, 931, 47, 196, 124, 924, 60, 18, 202, 31, 813, 15, 4, 22, 441, 621, 192, 449, 927, 92, 18]


### How to restore?

In [233]:
sp.DecodePieces(pieces[:100])

'iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould . iron cement protects the ingot against the hot , abrasive steel casting process . a fire restant repair c'

In [239]:
sp.DecodeIds(ids[:100])

'iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould . iron cement protects the ingot against the hot , abrasive steel casting process . a fire restant repair c'
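Conceptually, decoding pieces amounts to concatenating them and turning each `▁` back into a space. The following is a plain-Python approximation of that idea, not the library's implementation, and `decode_pieces` is a made-up name.

```python
def decode_pieces(pieces):
    # Concatenate, map every "▁" back to a space, drop the leading space.
    return "".join(pieces).replace("▁", " ").lstrip(" ")

pieces_sample = ["▁", "ir", "on", "▁c", "e", "ment", "▁is", "▁a", "▁read", "y"]
print(decode_pieces(pieces_sample))
```

Because the space is part of the piece, no extra bookkeeping is needed to recover the original spacing.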

# Bert Tokenizer

In [241]:
os.system("pip install pytorch_pretrained_bert")

0

In [242]:
from pytorch_pretrained_bert import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


In [287]:
input_text = open("corpus.en", "r").read()
pieces = tokenizer.tokenize(input_text)

In [288]:
" ".join(pieces[:20])

'iron cement is a ready for use paste which is laid as a fill ##et by put ##ty knife or'

The BERT tokenizer is composed of a Basic Tokenizer, which splits off punctuation, and a WordPiece Tokenizer. Because the spacing around punctuation is not recorded, this can be a problem if you want to restore the original text. https://github.com/huggingface/pytorch-pretrained-BERT/issues/36


The BERT tokenizer uses `##`. It is different from `@@` in BPE or `▁` in SentencePiece.  
`@@` is attached to the end of subwords, while `##` and `▁` are attached to the front.  
`@@ + space` and `space + ##` are removed for restoration, while `▁` is replaced by a space.
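The `space + ##` rule can be sketched in a couple of lines. This is an illustration only (`restore_wordpiece` is not part of `pytorch_pretrained_bert`), and the punctuation example assumes the Basic Tokenizer splits `There's` into three tokens.

```python
def restore_wordpiece(pieces):
    # Join with spaces, then delete every " ##" so that continuation
    # pieces attach to the piece on their left.
    return " ".join(pieces).replace(" ##", "")

wp_sample = ['laid', 'as', 'a', 'fill', '##et', 'by', 'put', '##ty', 'knife']
print(restore_wordpiece(wp_sample))

# Punctuation that was split off stays separated, so "There's" is not recovered.
print(restore_wordpiece(['There', "'", 's']))
```

Subwords are rejoined correctly, but spacing around punctuation is lost, which is the restoration problem mentioned above.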