[Topic Link](https://spacy.io/usage/linguistic-features#tokenization)

In [1]:
import spacy

- *Important note: spaCy’s tokenization is non-destructive, which means that you’ll always be able to reconstruct the original input from the tokenized output. Whitespace information is preserved in the tokens, and no information is added or removed during tokenization. This is a core principle of spaCy’s `Doc` object: `doc.text == input_text` should always hold true.*

- *During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:*
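*The non-destructive principle can be pictured without spaCy at all. The following is a minimal sketch, assuming a hypothetical helper `toy_tokenize` (not a spaCy function) that only splits on whitespace. It stores each token together with the whitespace that follows it, so iterating over the pairs and joining them gives back the input exactly:*

```python
import re


def toy_tokenize(text):
    """Return (token, trailing_whitespace) pairs that cover every character of text.

    Hypothetical helper for illustration only; spaCy's real tokenizer also
    applies language-specific rules for punctuation, abbreviations and so on.
    """
    pairs = []
    # Whitespace at the very start has no token in front of it, so it is kept
    # as the trailing whitespace of an empty token.
    leading = re.match(r"\s*", text).group()
    if leading:
        pairs.append(("", leading))
    # Every other match is a run of non-space characters plus the whitespace after it.
    pairs.extend((m.group(1), m.group(2)) for m in re.finditer(r"(\S+)(\s*)", text))
    return pairs


text = "  San Francisco considers  banning robots"
pairs = toy_tokenize(text)

# Iterate over the tokens, showing the whitespace each one carries.
for token, whitespace in pairs:
    print(repr(token), repr(whitespace))

# Nothing was added or removed, so the original input can be rebuilt.
reconstructed = "".join(token + whitespace for token, whitespace in pairs)
assert reconstructed == text
```

*Keeping the whitespace next to each token, rather than discarding it, is the design choice that makes the round trip possible.*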

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

[('San Francisco', 0, 13, 'GPE')]


In [4]:
# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]

print(ent_san)
print(ent_francisco)

['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']


*`Adding special case tokenization rules`: Most domains have at least some idiosyncrasies that require custom tokenization rules. These could be particular expressions, or abbreviations used only in that specific field. Here’s how to add a special case rule to an existing `Tokenizer` instance:*

In [5]:
from spacy.symbols import ORTH

doc = nlp("gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']

['gimme', 'that']
['gim', 'me', 'that']


*The special case doesn’t have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring. The special case rules also have precedence over the punctuation splitting.*

In [6]:
assert "gimme" not in [w.text for w in nlp("gimme!")]
assert "gimme" not in [w.text for w in nlp('("...gimme...?")')]

nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
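
*The behavior asserted above can be pictured with a toy loop. This is a simplified sketch, not spaCy's implementation: `toy_split`, `PREFIX_CHARS` and `SUFFIX_CHARS` are hypothetical names, punctuation is split one character at a time (spaCy would keep `...` as a single token), and infixes and URLs are ignored. It checks the special-case table first on every pass, which is why a special case wins over punctuation splitting:*

```python
# Hypothetical, deliberately small punctuation sets for the sketch.
PREFIX_CHARS = set("([\"'.")
SUFFIX_CHARS = set(")]\"'.!?")


def toy_split(chunk, special_cases):
    """Tokenize one whitespace-delimited chunk into a list of strings."""
    prefixes, middle, suffixes = [], [], []
    while chunk:
        # Special cases are looked up before any punctuation is split off.
        if chunk in special_cases:
            middle = list(special_cases[chunk])
            break
        if chunk[0] in PREFIX_CHARS:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk[-1] in SUFFIX_CHARS:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            # Nothing left to split off: the remainder is an ordinary token.
            middle = [chunk]
            break
    # Suffixes were collected from the outside in, so reverse them.
    return prefixes + middle + suffixes[::-1]


cases = {"gimme": ["gim", "me"]}
print(toy_split("gimme!", cases))
print(toy_split('("...gimme...?")', cases))

# A special case for the whole string matches before any splitting happens.
cases["...gimme...?"] = ["...gimme...?"]
print(toy_split("...gimme...?", cases))
```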

*`Debugging the tokenizer`: A working implementation of the tokenization algorithm is available for debugging as `nlp.tokenizer.explain(text)`. It returns a list of tuples showing which tokenizer rule or pattern was matched for each token. The tokens produced are identical to those from `nlp.tokenizer()`, except for whitespace tokens:*

In [7]:
from spacy.lang.en import English

nlp = English()
text = '''"Let's go!"'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])

" 	 PREFIX
Let 	 SPECIAL-1
's 	 SPECIAL-2
go 	 TOKEN
! 	 SUFFIX
" 	 SUFFIX

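*When a text yields many tokens, it can help to summarize the `explain` output instead of reading it line by line. The sketch below is one possible way to do that. The `(rule, token)` pairs are hard-coded from the output above so it runs without spaCy, and it treats the numeric suffix in `SPECIAL-1` / `SPECIAL-2` as the position within a special case, which is an assumption based on that output:*

```python
from collections import Counter

# (rule, token) pairs copied from the explain() output shown above.
tok_exp = [
    ("PREFIX", '"'),
    ("SPECIAL-1", "Let"),
    ("SPECIAL-2", "'s"),
    ("TOKEN", "go"),
    ("SUFFIX", "!"),
    ("SUFFIX", '"'),
]

# Drop the "-1" / "-2" part so both halves of a special case count together.
rule_counts = Counter(rule.split("-")[0] for rule, _ in tok_exp)

# Aligned two-column listing: token first, then the rule that produced it.
width = max(len(token) for _, token in tok_exp)
for rule, token in tok_exp:
    print(f"{token:<{width}}  {rule}")

print(dict(rule_counts))
```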

In [12]:
import re
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
# suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')


def custom_tokenizer(nlp):
    return Tokenizer(
        nlp.vocab,
        rules=special_cases,
        prefix_search=prefix_re.search,
        # suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
        url_match=simple_url_re.match,
    )


nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc])  # ['hello', '-', 'world.', ':)']

['hello', '-', 'world.', ':)']
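
*To see what each callable handed to `Tokenizer` returns on its own, the compiled patterns can be tried directly with the standard `re` module. This is only a sketch of the regex side; how the tokenizer combines the results is not shown. The patterns mirror the ones in the cell above, written here with a single backslash before each bracket, and the sample strings are made up for illustration:*

```python
import re

prefix_re = re.compile(r'''^[\[\("']''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

# prefix_search: a match at position 0 marks one character to split off the front.
prefix_hit = prefix_re.search('("hello')
prefix_miss = prefix_re.search("hello")
print(prefix_hit.group(), prefix_miss)

# infix_finditer: every match marks a split point inside the string.
infix_spans = [m.span() for m in infix_re.finditer("hello-world")]
print(infix_spans)

# url_match: a match means the string starts like an http(s) URL.
print(simple_url_re.match("https://spacy.io") is not None)
print(simple_url_re.match("hello-world") is not None)
```

*Because no `suffix_search` is passed in the cell above, nothing strips the trailing period, which is consistent with `world.` staying a single token in the output.*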
