# Introduction

We will explore the [spaCy](https://spacy.io/) toolkit for Natural Language Processing (NLP). My goal is to investigate subword tokenization with Byte Pair Encoding (BPE) and similar algorithms such as WordPiece.

In [26]:
import spacy

## Working with spaCy

I will not dwell on describing what this package is, but I will highlight what seem to be its most important qualities. spaCy is fast, well established, and focused. Fast is self-explanatory. By well established I mean that the package is used throughout the NLP community and is well documented. By focused I mean that it aims to provide good results quickly rather than to give the user every possible option.

I will note that spaCy also has support for pre-trained transformer models through [Hugging Face](https://huggingface.co/).

Let's look at some basics with spaCy to get an idea of how it works. I have previously downloaded the "en_core_web_sm" pipeline as specified in the spaCy documentation. This is a small English pipeline trained on web text. We load it below.

In [27]:
nlp = spacy.load('en_core_web_sm')

In [28]:
doc = nlp("People in the U.S.A are crazy, don't you think!")
for token in doc:
    print(token.text, token.pos_)

People NOUN
in ADP
the DET
U.S.A PROPN
are AUX
crazy ADJ
, PUNCT
do AUX
n't PART
you PRON
think VERB
! PUNCT


We can see that the pipeline easily parses a made-up document. We have printed the segmented tokens and their parts of speech. For example, "U.S.A" is kept as a single token (which makes sense) and is tagged as a proper noun.

In [29]:
for ent in doc.ents:
    print(ent.text, ent.label_)

U.S.A GPE


The output above shows that the pipeline can also extract named entities from documents. It found the "U.S.A" entity in our toy example and correctly labeled it as a geopolitical entity (GPE).

### BPE Tokenization

I am interested in looking at Byte Pair Encoding (BPE) for document tokenization. To do this I first used the following code to download the wikitext-103 dataset compiled from Wikipedia.

```
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```

Then I followed the Hugging Face documentation [here](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and the spaCy documentation [here](https://spacy.io/usage/linguistic-features) to build a tokenizer that implements BPE.

In [30]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tk_bpe = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tk_bpe.pre_tokenizer = Whitespace()

files = [f'wikitext-103-raw/wiki.{split}.raw' for split in ['test', 'train', 'valid']]

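# Keyword arguments avoid relying on the positional order of train(),
# which differs between tokenizers releases.
tk_bpe.train(files=files, trainer=trainer)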
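
To make it clearer what the trainer is doing, here is a minimal pure-Python sketch of the BPE idea: repeatedly merge the most frequent adjacent pair of symbols, then encode a new word by replaying those merges in order. This is an illustration only, not the `tokenizers` implementation. The function names, the toy word list, and the tie-breaking rule are my own assumptions.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Assumption: ties go to the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
        merges.append(best)
    return merges


def apply_merges(word, merges):
    """Encode a word by replaying the learned merges in order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


# Hypothetical toy corpus, not taken from wikitext-103.
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(toy_words, num_merges=3)
print(merges)
print(apply_merges("lowest", merges))
```

Even though "lowest" never appears in the toy corpus, it can still be encoded from pieces learned from other words, which is the property we rely on later for rare words.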

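Let's also build a WordPiece tokenizer to compare the two approaches. Both grow a vocabulary bottom-up by merging smaller units, but they choose merges differently: BPE merges the most frequent pair, while WordPiece picks the merge that most improves the likelihood of the training data.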

In [31]:
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

tk_wp = Tokenizer(WordPiece(unk_token="[UNK]"))
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tk_wp.pre_tokenizer = Whitespace()

files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]

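# Keyword arguments avoid relying on the positional order of train(),
# which differs between tokenizers releases.
tk_wp.train(files=files, trainer=trainer)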
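
At encoding time, WordPiece as popularized by BERT splits a word greedily: it takes the longest vocabulary entry that matches at the current position and marks every non-initial piece with a `##` prefix. The sketch below illustrates that lookup in pure Python. The function name and the tiny vocabulary are hypothetical, and I am assuming the trained `tokenizers` model behaves in the same spirit.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word (toy version)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary; a trained one would hold tens of thousands of entries.
toy_vocab = {"you", "##self", "absurd", "##ly",
             "un", "##equ", "##ivo", "##ca", "##ll", "##y"}

for word in ["absurdly", "unequivocally", "strange"]:
    print(word, wordpiece_encode(word, toy_vocab))
```

With this toy vocabulary, "strange" falls back to `[UNK]` because no piece matches its first character, which shows why a trained vocabulary normally keeps individual characters as entries.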

We have built and trained both tokenizers on the downloaded dataset, so now let's try them out with spaCy. Below we define a class that wraps either of the tokenizers we created above so that spaCy can use it.

In [32]:
import spacy
from spacy.tokens import Doc

class BPELike:
    """spaCy tokenizer that delegates to a trained Hugging Face tokenizer."""

    def __init__(self, vocab, tokenizer):
        self.vocab = vocab
        self._tokenizer = tokenizer

    def __call__(self, text):
        encoding = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (token, (_, end)) in enumerate(zip(encoding.tokens, encoding.offsets)):
            words.append(token)
            if i < len(encoding.tokens) - 1:
                # A gap between this token's end and the next token's start
                # means whitespace separated them in the original text.
                next_start, _ = encoding.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                # The last token always gets a trailing space.
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)
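
The only subtle part of the class above is how character offsets become the `spaces` flags that spaCy expects. Here is a standalone sketch of that step which runs without spaCy or a trained tokenizer. The helper names are hypothetical, and the regular expression is a stand-in that only splits words from punctuation, which I believe is how the `Whitespace` pre-tokenizer behaves. It does no subword splitting.

```python
import re


def token_offsets(text):
    """Stand-in tokenizer: (start, end) spans of word and punctuation runs."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]+", text)]


def spaces_from_offsets(offsets):
    """One flag per token: True if whitespace follows it in the text."""
    spaces = []
    for i, (_, end) in enumerate(offsets):
        if i < len(offsets) - 1:
            next_start, _ = offsets[i + 1]
            spaces.append(next_start > end)
        else:
            spaces.append(True)  # mirrors the class above
    return spaces


text = "Let's try  it."  # note the double space
offsets = token_offsets(text)
tokens = [text[start:end] for start, end in offsets]
spaces = spaces_from_offsets(offsets)

# Rebuild the text the way a Doc does: token plus one optional space.
rebuilt = "".join(tok + (" " if sp else "") for tok, sp in zip(tokens, spaces))
print(tokens)
print(spaces)
print(repr(rebuilt))
```

Because each flag is a single boolean, any run of whitespace collapses to one space, and the unconditional `True` for the last token adds a trailing space. Both effects are visible in `doc.text` in the outputs below.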

First, we try out the BPE tokenization.

In [41]:
nlp = spacy.blank("en")
nlp.tokenizer = BPELike(nlp.vocab, tk_bpe)

doc = nlp("Let's try to make an uncommon sentence. Whomever you might "
          "be, considering youself to be a person, you must "
          "unequivocally agree this sentence is absurdly strange.")

In [42]:
print(doc.text)
print([token.text for token in doc])

Let's try to make an uncommon sentence. Whomever you might be, considering youself to be a person, you must unequivocally agree this sentence is absurdly strange. 
['Let', "'", 's', 'try', 'to', 'make', 'an', 'uncommon', 'sentence', '.', 'Wh', 'ome', 'ver', 'you', 'might', 'be', ',', 'considering', 'you', 'self', 'to', 'be', 'a', 'person', ',', 'you', 'must', 'une', 'qu', 'iv', 'oc', 'ally', 'agree', 'this', 'sentence', 'is', 'absurd', 'ly', 'strange', '.']


For the most part, words are split on whitespace and punctuation and kept whole. The less common words show why algorithms like this are effective: a word that is not in the vocabulary as a whole can still be represented by pieces that are, instead of being mapped to `[UNK]`. For example, "Whomever" was split into "Wh", "ome", and "ver". Similarly, "unequivocally" was split into subwords.

Next, let's tokenize the same sentence using the WordPiece method.

In [43]:
nlp.tokenizer = BPELike(nlp.vocab, tk_wp)

doc = nlp("Let's try to make an uncommon sentence. Whomever you might "
          "be, considering youself to be a person, you must "
          "unequivocally agree this sentence is absurdly strange.")

In [44]:
print(doc.text)
print([token.text for token in doc])

Let's try to make an uncommon sentence. Who##me##ver you might be, considering you##self to be a person, you must un##equ##ivo##ca##ll##y agree this sentence is absurd##ly strange. 
['Let', "'", 's', 'try', 'to', 'make', 'an', 'uncommon', 'sentence', '.', 'Who', '##me', '##ver', 'you', 'might', 'be', ',', 'considering', 'you', '##self', 'to', 'be', 'a', 'person', ',', 'you', 'must', 'un', '##equ', '##ivo', '##ca', '##ll', '##y', 'agree', 'this', 'sentence', 'is', 'absurd', '##ly', 'strange', '.']


Interestingly, the result is almost identical. The same four words are split in both cases, and only "Whomever" and "unequivocally" are split differently. BPE split "unequivocally" into 5 components, whereas WordPiece split it into 6, and the components themselves differ between the two methods. WordPiece also marks every non-initial piece with a `##` prefix, which is why those markers show up in `doc.text`. It is not clear that one method is better than the other, but by inspection I would tend to say that the WordPiece splits look better.
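
One practical consequence of the `##` markers is that WordPiece tokens can be regrouped into whole words without looking at character offsets. The sketch below does this for the WordPiece token list printed above. The helper name is hypothetical, and the approach assumes no ordinary token in the text starts with `##`.

```python
def group_wordpieces(tokens, prefix="##"):
    """Group tokens into words: a '##' token continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    return words


# Token list copied from the WordPiece output above.
wp_tokens = ['Let', "'", 's', 'try', 'to', 'make', 'an', 'uncommon', 'sentence', '.',
             'Who', '##me', '##ver', 'you', 'might', 'be', ',', 'considering',
             'you', '##self', 'to', 'be', 'a', 'person', ',', 'you', 'must',
             'un', '##equ', '##ivo', '##ca', '##ll', '##y', 'agree', 'this',
             'sentence', 'is', 'absurd', '##ly', 'strange', '.']

# Map each word that was split to its number of pieces.
split_words = {}
for pieces in group_wordpieces(wp_tokens):
    if len(pieces) > 1:
        # Strip the prefix from continuation pieces to recover the word.
        word = pieces[0] + "".join(p[2:] for p in pieces[1:])
        split_words[word] = len(pieces)

print(split_words)
```

The BPE tokenizer trained above emits no such marker, so doing the same for its output would require the character offsets instead.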

## Conclusion

The NLP toolkit spaCy is very capable and easy to use. Incorporating packages like this into regular workflows in research or industry seems almost necessary. Tools of this sort save a great deal of time and let one get to the meat of the work.