# The task

The basic input for NLP tasks (perhaps after speech-to-text processing of speech audio) is a sequence of characters. For the purposes of traditional linguistic analysis and for the majority of machine learning-based NLP solutions, the input character sequences have to be segmented into small units: words or, using the terminology of computational linguistics, __tokens__, which in turn make up larger units like sentences, paragraphs etc.:

<a href="https://hackernoon.com/hn-images/1*7zGIb_pi6906J60Dm7YgyA.png"><img src="https://drive.google.com/uc?export=view&id=19xtF7dGrxz4SJrwFP_5hSYG4lPuCWcV9" width="40%"></a>

(image source: [NLP 101: Topic modeling for humans](https://hackernoon.com/nlp-101-topic-modeling-for-humans-part-1-a030e8155584))

Speaking of tokens instead of words has certain advantages:
+ it allows a bit more flexibility as to what we treat as a token, e.g. it might be contentious to call punctuation marks or emoticons _words_, but it can be useful to treat them as legitimate tokens in certain settings.
+ it evokes a perspective which considers these segments as instances of certain __types__ which collectively constitute a __vocabulary__.
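
The token/type/vocabulary distinction can be made concrete with a few lines of Python. This is only a minimal sketch: the example sentence is made up, and plain white-space splitting stands in for a real tokenizer.

```python
from collections import Counter

# A toy sentence; white-space splitting is good enough for this illustration.
tokens = "the cat sat on the mat because the mat was warm".split()

# Every occurrence is a token; every distinct form is a type.
type_counts = Counter(tokens)
vocabulary = sorted(type_counts)

print(len(tokens))         # number of tokens
print(len(vocabulary))     # number of types (vocabulary size)
print(type_counts["the"])  # number of tokens of the type "the"
```

Here "the" is a single type with several token occurrences, which is exactly the perspective the term "token" evokes.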

## What should count as a token?

It has to be emphasized that tokenization is strongly task dependent: what is a useful tokenization for one purpose can be unsatisfactory for another. For instance, for a _bag of words_ representation of documents, punctuation marks are probably not useful as tokens (they can be ignored similarly to white spaces), but if (sub)sentence boundaries are of interest then keeping punctuation mark tokens around is extremely important.

### Penn Treebank tokenization

Despite task dependency, there are some influential styles of tokenization that constitute "quasi standards". For English, the linguistically motivated tokenization style used by the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is very influential.

The surprisingly short [description](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html) is worth reading in full, but, for our purposes, the key features are:

+ punctuation marks are split from words and treated as separate tokens
+ verb contractions (like the "'s" in "she's") and  clitics (like the "n't" in "don't") are split.

## Normalization: when do expressions belong to the same type?

If we take it seriously that tokenization should produce a series of tokens that are all instances of some types, then the tokenization task can also involve decisions about type memberships. Are "apple" in the sentence

> There is an apple on your table.

and "Apple" in

> Apple is my favourite fruit.

instances of the same type? If the answer is yes, then our tokenization __normalizes__ or __standardizes__ the two "surface forms" to a common type, disregarding the capitalization, which is treated as an artifact of the word's position in the sentence.

Although our discussion will focus on the segmentation aspect of tokenization, normalization is super-important in practice, and can mean way more than just disregarding capitalization. To mention only a few representative examples,
+ typos and spelling variations can be "corrected" by tokenizing all variants as instances of the same type,
+ numerical or date-type expressions can be standardized, e.g. "1,000.00" and "1000" can be tokenized to a common standard,
+ punctuation can also be standardized, e.g. for many tasks there is no useful distinction between "!" and "!!", or between "\[" and "\(".

Depending on the task at hand, more radical normalization strategies are also conceivable e.g., regarding all numerical expressions as tokens of a special "number" type, or all words not in a predefined vocabulary ("out of vocabulary" words) as tokens of an "unknown" type, etc.
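
The normalization strategies listed above can be sketched as a function that maps a surface form to its type. Everything here is an illustrative assumption: the tiny `VOCAB` set, the `<num>` and `<unk>` type names and the number pattern are made up for the example and would be chosen per task in practice.

```python
import re

# Hypothetical toy vocabulary; a real one would be derived from a corpus.
VOCAB = {"there", "is", "an", "apple", "on", "your", "table", "!"}

# Digits optionally grouped by commas or periods, e.g. "1,000.00" or "1000".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(token):
    """Map a surface form to an (illustrative) type."""
    if NUMBER.fullmatch(token):
        return "<num>"                   # all numerical expressions share one type
    token = token.lower()                # disregard capitalization
    token = re.sub(r"!+", "!", token)    # do not distinguish "!" and "!!"
    return token if token in VOCAB else "<unk>"  # out-of-vocabulary words

tokens = ["Apple", "apple", "1,000.00", "1000", "!!", "Zebra"]
normalized = [normalize(t) for t in tokens]
print(normalized)
```

Note how aggressive such a scheme is: "Apple" the company and "apple" the fruit end up in the same type, which may or may not be acceptable for the task at hand.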

## Challenges

The challenges of tokenization depend on the task at hand, as discussed in the previous section, but they also strongly depend on the text's
+ writing/alphabet,
+ language,
+ domain/corpus (e.g., texts with mathematical formulas pose specific challenges),
+ amount of noise (e.g., typos).

For European languages and writing systems, special challenges are posed, among others, by

+ abbreviations (frequently ending with a period)
+ number expressions (possibly containing white spaces, commas and periods)
+ "multiword expressions" (MWEs) like "New York": in these cases what should arguably count as one word contains spaces. In contrast to numbers, most tokenizers leave dealing with MWEs to later processing steps, e.g. to named entity recognition.
+ emoticons

and, perhaps most importantly, by

###  The interdependence of tokenization and sentence segmentation

Although it is common (but by no means universal) to have sentence segmentation as a separate processing step in NLP pipelines, this step typically (but not always) _follows_ tokenization, which means that most tokenizers have to separate punctuation marks without knowing the sentence boundaries.

Unfortunately, there are some nasty interdependencies between sentence and token boundaries. E.g., in a text fragment of the form

> xxx yyy. Zzzz

the period after yyy can be sentence-ending punctuation, with Zzzz capitalized because it starts a new sentence, but it is also possible that there is no sentence boundary because yyy. is an abbreviation and Zzzz is a name. In cases of this type, the tokenization of a punctuation mark also involves a decision that influences sentence segmentation.

Using good abbreviation and name lists, many of these tokenization problems can be solved, but sometimes there are ambiguities whose proper resolution would require syntactic or even semantic and pragmatic considerations. Compare, for instance,

> Wir trafen den Abt. Bergbahnen sind seine Leidenschaft. (We met the abbot. Mountain railways are his passion.)

with 

> Wir sahen den Sprecher der Abt. Bergbahnen und Wanderwege. (We saw the spokesman of the dept. of mountain railways and hiking trails.)

(example from [Graën, Johannes, et al. "Cutter–a universal multilingual tokenizer." CEUR Workshop Proceedings. No. 2226. CEUR-WS, 2018.](http://ceur-ws.org/Vol-2226/paper9.pdf))
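
To see why list-based disambiguation cannot resolve this pair, consider the following minimal sketch of a sentence splitter that consults an abbreviation list. The list, the function name and the splitting heuristic are all assumptions made for illustration; this is not how Cutter or any particular tool works.

```python
import re

# Hypothetical, tiny abbreviation list for illustration only.
ABBREVIATIONS = {"Abt.", "Dr.", "etc."}

def naive_sentence_split(text):
    """Split after a period that is followed by white space and a capital
    letter, unless the word carrying the period is a listed abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\S+\.(?=\s+[A-ZÄÖÜ])", text):
        if match.group() not in ABBREVIATIONS:
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    sentences.append(text[start:].strip())
    return sentences

abbot = "Wir trafen den Abt. Bergbahnen sind seine Leidenschaft."
dept = "Wir sahen den Sprecher der Abt. Bergbahnen und Wanderwege."

print(naive_sentence_split(abbot))  # kept as one sentence: wrong
print(naive_sentence_split(dept))   # kept as one sentence: right
```

Since the local context "Abt. Bergbahnen" is identical in both sentences, any rule that only inspects the word and its immediate neighbours must treat the two cases alike, so it gets one of them wrong whichever way the list is set up.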

# A useful baseline: splitting on word dividers (if they exist...)
Limiting our discussion for the moment to writing systems that explicitly indicate (some) word boundaries, like English, __white spaces__ obviously convey fundamental information for tokenization. In fact, splitting the text at white spaces is a useful baseline:

```python
text = "This isn't an easy sentence to tokenize!"
tokenized = text.split()
print(tokenized)
```

```
['This', "isn't", 'an', 'easy', 'sentence', 'to', 'tokenize!']
```


## Baseline problems
Unfortunately, the results produced by this simple white space-based strategy do not satisfy some basic requirements:

+ we typically want to treat __punctuation marks__  as separate tokens (but only if they really _function_ as punctuation, think of the periods in "U.K.", or the comma and period in "10,000.00$" )
+ this solution cannot separate token pairs _without_ white space between them, e.g., expressions with clitics. 
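
Both problems, and the danger of an overly eager fix, can be reproduced in a few lines. The example sentence and the naive pattern are made up for illustration.

```python
import re

text = "The U.K. paid 10,000.00$, didn't it?"

# Baseline: punctuation stays glued to the neighbouring word,
# and the clitic in "didn't" is not separated.
baseline = text.split()
print(baseline)

# Naive fix: every character that is neither a word character nor white space
# becomes a token of its own. This separates real punctuation, but it also
# tears apart "U.K.", the number and "didn't" in an unhelpful way.
naive = re.findall(r"\w+|[^\w\s]", text)
print(naive)
```

The second output shows that "split on every punctuation character" is not an acceptable rule: whether a period or a comma should be split off depends on its function.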

# Rule-based, deterministic solutions

The fact that the above simple baseline can perform acceptably in many cases suggests that more sophisticated, but still deterministic, pattern-matching-based solutions might work for tokenization, at least for writing systems with explicit word boundary indicators. In fact, tokenization is one of the few areas where rule-based, deterministic approaches are still widespread, and, what is more, considered to be state-of-the-art solutions.

## String pattern matching: regular expressions

Motto: 

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. 

(Jamie Zawinski)

Despite the justified criticism of __overusing__ them, regular expressions are and  will most probably remain an important NLP tool in any situation where rule-based pattern matching of strings is called for.

### Core idea

The core idea is very simple: given a (finite) alphabet $\Sigma$ of symbols, a regular expression (regex for short) is a pattern which precisely describes a subset of $\Sigma^*$, i.e., those sequences of $\Sigma$ symbols that "match" the pattern. Both regular expressions and matching are defined recursively as follows:

+ the empty string is a regular expression and matches itself
+ any single symbol in $\Sigma$ is a regular expression and matches itself
+ if $r_1$ and $r_2$ are regexes, then their
    - __concatenation__, $r_1 r_2$ is also a regex and matches exactly those strings that are the concatenation of a string matching $r_1$ and a string matching $r_2$
    - __alternation__, $r_1 \mid r_2$ is also a regex matching those strings that match either $r_1$ or $r_2$,
+ if $r$ is a regex, then by applying the __Kleene star__ operator to $r$ we can form a new regex $r^*$, which matches exactly those strings that are the concatenation of zero or more strings each matching $r$.

Ambiguity of the scope of operations is avoided by using brackets, which can be omitted if the intended ordering corresponds to the following operator precedence convention: Kleene star has the highest priority, followed by concatenation and then alternation. A simple example: $a \mid b^*$ matches $a$ and any $b$-sequence (including the empty one), while $( a \mid b)^*$ matches any string consisting of "a"-s and "b"-s.
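
The precedence example can be checked with Python's `re` module, whose syntax for these three core operations coincides with the notation used here. The helper name `matches` and the list of test strings are just choices made for this sketch.

```python
import re

def matches(pattern, string):
    """True if the WHOLE string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

candidates = ["", "a", "bbb", "ab", "aa"]

# a|b* : a single "a", or any (possibly empty) sequence of "b"-s
star_binds_tighter = [s for s in candidates if matches(r"a|b*", s)]

# (a|b)* : any string over the alphabet {a, b}
bracketed = [s for s in candidates if matches(r"(a|b)*", s)]

print(star_binds_tighter)
print(bracketed)
```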

__Regular languages__

A bit of terminology, to which we will return later: a formal language (which is defined simply as  an arbitrary set of strings in an alphabet $\Sigma$) is __regular__ iff there is a $\Sigma$-regex which matches exactly the strings (so-called words) of the language.

__Advantages__

+ Despite important examples of string sets which cannot be described by regexes (e.g., $a^n b^n$), regexes are flexible enough for a lot of tasks, and, 
+ as importantly, there are efficient algorithms that decide whether a string matches a regex in $O(n)$ time, where $n$ is the string length. (Constructing the underlying deterministic finite-state automaton (FSA) can, however, take time exponential in the length of the regex, so caching compiled regexes is important.)

### Extensions

Motto:

> "Regular expressions" […] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

(Larry Wall, author of the Perl programming language)

Regexes as described above are intentionally very bare-bones to keep the theory simple. Real-life regex libraries add a lot of "syntactic sugar", which makes writing regexes more pleasant without adding expressive power, plus some genuine extensions that increase their expressive power as well.

Typical convenience extensions include:

+ character classes, which match any single letter/symbol in a set, e.g., [abcd] is equivalent to (a|b|c|d), and \s in Python regexes matches any white-space character
+ character classes can be _complemented:_ e.g., \[^a\] matches any character except a
+ there is an operator for optional occurrence: r? matches s if s is empty or matches r
+ there are also operators for specifying the required number of pattern repetitions, e.g., r{m,n} matches s if s repeats the r pattern $k$ times, with $m\leq k \leq n$
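
That these extensions are mere conveniences can be illustrated by comparing each of them with an equivalent pattern that uses only the core operations. The sketch below checks the equivalences on a small, arbitrarily chosen set of candidate strings; this is of course an illustration, not a proof.

```python
import re

def language(pattern, candidates):
    """Return the candidates that the pattern matches as a whole."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["", "a", "b", "e", "aa", "aaa", "aaaa", " "]

# character class vs. explicit alternation
assert language(r"[abcd]", candidates) == language(r"(a|b|c|d)", candidates)

# optional occurrence: a? is the same as "a or the empty string"
assert language(r"a?", candidates) == language(r"(a|)", candidates)

# bounded repetition: a{2,3} is the same as aa|aaa
assert language(r"a{2,3}", candidates) == language(r"aa|aaa", candidates)

# complemented class: any single character except "a"
print(language(r"[^a]", candidates))
```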

Important extensions that actually __increase the expressive power__ are the so-called back-references to bracketed groups. In this case a part of the regex can be referred to later, and for a match it is required that the substrings matched by the group and by the reference be _identical_.

E.g., the Python regex 

~~~python
(?P<a>.*)(?P=a)
~~~
   
matches any "twin word" containing the same expression repeated twice, despite the fact that twin words, famously, cannot be described by "vanilla" regular expressions -- the "twin word language" is not a regular language (it's not even a context-free language, a concept we'll discuss a bit later).
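
A quick check of this pattern with Python's `re` module (the test words are arbitrary):

```python
import re

# A named group followed by a back-reference to whatever the group matched.
TWIN = re.compile(r"(?P<a>.*)(?P=a)")

words = ["abcabc", "bonbon", "abcabd", "aaa"]
results = {word: bool(TWIN.fullmatch(word)) for word in words}
print(results)
```

Note that `fullmatch` is needed here: with `search` or `match` the pattern would trivially succeed on every string by letting the group match the empty string.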

### Further regex uses

In addition to matching strings as a whole against a regex, two regex tasks are very common:

+ __finding the first or all substrings__ of a string that match a given regex,
+ __regex-based find-and-replace__: in its simplest form this means searching for substrings matching a regex and replacing them with a given string. There are, however, two notable extras provided by modern regex libraries:
    - the regex can have look-ahead and look-behind parts, which are used when finding a match but do not count as part of what is replaced -- this makes it possible to make context-sensitive replacements
    - replacements do not have to be fixed -- they can contain back-references to parts of the match. This makes it possible to perform complicated pattern-matching-based transformations of the input, e.g., changing the order of matching expressions etc.
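
These uses map directly onto functions of Python's `re` module. The sketch below uses a made-up example sentence; the patterns are illustrative and not meant to cover real-world number or date formats.

```python
import re

text = "Prices rose from 1,200 to 1,500 on 2021-03-04."

# Finding the first match and all matches of a pattern.
first_date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()
numbers = re.findall(r"\d+,\d+", text)

# Look-behind and look-ahead: remove a comma only if it stands between digits.
# The digits serve as context and are not part of what gets replaced.
no_separators = re.sub(r"(?<=\d),(?=\d)", "", text)

# Back-references in the replacement: reorder year-month-day to day.month.year.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3.\2.\1", text)

print(first_date)
print(numbers)
print(no_separators)
print(reordered)
```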

## Cascade of regular expression-based substitutions

A simple, but surprisingly effective solution is to transform the input text by a series of regex-based substitutions into one which then can simply be split on white space, as in our baseline tokenizer. A good example is [the tokenizer sed script accompanying the Penn Treebank](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenizer.sed). It is important to note that the script is meant for tokenizing individual sentences, i.e. it requires sentence-segmented input.

A few representative rules (the syntax of the rules is s=\[PATTERN\]=\[REPLACEMENT\]=g):

```
s=\.\.\.= ... =g
s=[,;:@#$%&]= & =g

# Assume sentence tokenization has been done first, so split FINAL periods
# only. 
s=\([^.]\)\([.]\)\([])}>"']*\)[ 	]*$=\1 \2\3 =g
# however, we may as well split ALL question marks and exclamation points,
# since they shouldn't have the abbrev.-marker ambiguity problem
s=[?!]= & =g

# parentheses, brackets, etc.
s=[][(){}<>]= & =g
s=--= -- =g

# possessive or close-single-quote
s=\([^']\)' =\1 ' =g
# as in it's, I'm, we'd
s='\([sSmMdD]\) = '\1 =g
s='ll = 'll =g
s='re = 're =g
s='ve = 've =g
s=n't = n't =g
s='LL = 'LL =g
s='RE = 'RE =g
s='VE = 'VE =g
s=N'T = N'T =g
```
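
To make the cascade idea tangible, here is a rough Python port of the rules quoted above. It is a sketch under several assumptions: only the quoted subset of rules is covered, the contraction rules are condensed into alternations, the input is padded with spaces so that rules expecting a trailing space also fire at the end of the sentence, and the names `RULES` and `ptb_like_tokenize` are invented for this example. It should not be expected to reproduce the output of the original script in all cases.

```python
import re

# (pattern, replacement) pairs; sed's "&" (the whole match) becomes \g<0>.
RULES = [
    (r"\.\.\.", " ... "),
    (r"[,;:@#$%&]", r" \g<0> "),
    # split the sentence-FINAL period only (input must be a single sentence)
    (r'''([^.])([.])([\])}>"']*)[ \t]*$''', r"\1 \2\3 "),
    (r"[?!]", r" \g<0> "),
    (r"[\]\[(){}<>]", r" \g<0> "),     # parentheses, brackets, etc.
    (r"--", " -- "),
    (r"([^'])' ", r"\1 ' "),           # possessive or close-single-quote
    (r"'([sSmMdD]) ", r" '\1 "),       # as in it's, I'm, we'd
    (r"'(ll|re|ve|LL|RE|VE) ", r" '\1 "),
    (r"(n't|N'T) ", r" \1 "),
]

def ptb_like_tokenize(sentence):
    """Apply the substitutions in order, then split on white space."""
    text = " " + sentence + " "
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text.split()

print(ptb_like_tokenize("This isn't an easy sentence to tokenize!"))
print(ptb_like_tokenize("She's happy, we'll see."))
```

The order of the rules matters: each substitution sees the output of the previous ones, which is what makes the approach a cascade.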

### Handling exceptions
Abbreviations and words containing punctuation marks pose important problems for this approach since the general rules incorrectly split these expressions. The standard solution to this problem is to replace the problematic expressions with unproblematic placeholders before executing the substitutions in question, e.g. using something like a 

```
(etc\.|i\.e\.|e\.g\.) => <abbrev>
```
substitution. Of course, this solution requires keeping track of the placeholder substitutions and restoring the original expressions after the execution of the problematic rules.
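
The protect-and-restore bookkeeping might be sketched as follows. The abbreviation pattern is the one from the rule above; the indexed `<abbrevN>` placeholder format, the function name and the single "problematic" rule are assumptions made for this illustration (and the sketch assumes that the input does not already contain such placeholder strings). SoMaJo's actual implementation differs.

```python
import re

ABBREV = re.compile(r"\b(?:etc\.|i\.e\.|e\.g\.)")
PLACEHOLDER = re.compile(r"<abbrev(\d+)>")

def tokenize_with_protection(text):
    protected = []

    def protect(match):
        # remember the original expression and leave an indexed placeholder
        protected.append(match.group())
        return f"<abbrev{len(protected) - 1}>"

    text = ABBREV.sub(protect, text)
    # the general rule that would otherwise break abbreviations:
    # split off every period and comma
    text = re.sub(r"[.,]", r" \g<0> ", text)
    # restore the original expressions token by token
    return [
        PLACEHOLDER.sub(lambda m: protected[int(m.group(1))], token)
        for token in text.split()
    ]

tokens = tokenize_with_protection("Use a lexer, e.g. Flex, JFlex etc. for this.")
print(tokens)
```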

A good example is the __SoMaJo German-English tokenizer:__
+ [GitHub](https://github.com/tsproisl/SoMaJo)
+ [Paper](http://aclweb.org/anthology/W16-2607)

## Lexer-based solutions
A very similar, but in certain respects more efficient approach is to use industry-standard, off-the-shelf "lexers" (lexical analyzers), which were originally developed for the tokenization/lexical analysis of computer programs, but -- with appropriate rules -- can also be used for tokenizing natural language texts. A typical lexical analyzer takes a character stream as input and produces a stream of tokens from it, where each token is classified into one of the predefined token classes/types:

<a href="http://quex.sourceforge.net/images/lexical-analysis-process.png"><img src="https://drive.google.com/uc?export=view&id=1V-qcys4WCq_ESDeUpTkesAx2_efAamRW" width="400px"></a>

(Image source: http://quex.sourceforge.net/)

Most lexers are actually __lexical analyzer generators__. Their input is a list of token classes (types), regular expression patterns and

    [REGEX_PATTERN] -> [ACTION]
   
rules (where the most important action is classifying the actual match as a token of a given type), and they generate a  concrete, optimized lexical analyzer implementing the given rules, e.g., by generating the C source code of the analyzer.
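
The rule-table idea can be imitated in plain Python by combining the patterns into one master regex with named groups. This is only a sketch of the principle, not a lexer generator: nothing is compiled to an optimized automaton, and the token classes and patterns in `TOKEN_SPEC` are invented for the example.

```python
import re

# (token class, regex) rules; when several could match at the same position,
# the rule listed first wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:[.,]\d+)*"),
    ("WORD",   r"[^\W\d_]+(?:'[^\W\d_]+)?"),
    ("PUNCT",  r"[^\w\s]"),
    ("SPACE",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(text):
    """Yield (token_class, token_text) pairs, discarding white space.

    Characters matched by no rule (e.g. "_") are silently skipped by finditer.
    """
    for match in MASTER.finditer(text):
        if match.lastgroup != "SPACE":  # the "action" for SPACE: discard
            yield match.lastgroup, match.group()

tokens = list(lex("It cost 10,000.50 dollars, didn't it?"))
print(tokens)
```

Unlike the substitution cascade, the input is scanned only once from left to right, and each token comes out already classified.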

### Commonly used lexers

+ __Lex__ -- the original lexer on UNIX systems, written originally in 1975. See the [Wikipedia entry](https://en.wikipedia.org/wiki/Lex_(software)), and its ["home page"](http://dinosaur.compilertools.net/). Some of its versions are now open source.
+ __Flex__ -- a more recent, open source alternative to Lex. See their [github repo](https://github.com/westes/flex) for more information.

An important limitation of earlier Lex variants is that they do not handle Unicode input -- for modern natural language tokenization this is a huge problem. The following modern Lex variants are fully Unicode-compatible:

+ [__JFlex__](https://jflex.de) -- a lexer written in Java.
+ [__Quex__](http://quex.sourceforge.net/) -- a modern lexer producing highly performant C++ or C lexical analyzer source code.
+ [__PLY (Python Lex-Yacc)__](https://www.dabeaz.com/ply/) --  a pure Python lexer implementation, recommended for studying how lexers work.

### See also

Stanford CoreNLP's JFlex-based PTB tokenizer is a state-of-the-art lexer-based English tokenizer, whose [source code](https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/process/PTBLexer.flex) is worth looking into.

## Case study: spaCy's special purpose tokenizer

To appreciate what a purpose-built, rule-based NLP tokenizer might look like, let's look briefly at spaCy's built-in tokenizer. The following explanation is from the [spaCy website's tokenization section](https://spacy.io/usage/linguistic-features#tokenization):

> First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
> 1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
> 2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

>If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

To sum up, white-space splitting is the foundation (which means that spaCy's tokenizer doesn't allow multi-word tokens like New York or 100 000), but the white-space-separated chunks are further segmented recursively based on specific exception rules and general affix-pattern-based rules:

<a href="https://d33wubrfki0l68.cloudfront.net/fedbc2aef51d678ae40a03cb35253dae2d52b18b/3d4b2/tokenization-57e618bd79d933c4ccd308b5739062d6.svg"><img src="https://drive.google.com/uc?export=view&id=1k3F9PSoCLtwEXqz6rv_9QAjk7lgflPjm" width="600px"></a>
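
The loop described above can be imitated in a few lines of plain Python. This is a heavily simplified sketch and not spaCy's implementation: the exception table and the prefix/suffix patterns are tiny, made-up stand-ins, infixes are not handled at all, and all names are hypothetical.

```python
import re

# Hypothetical, tiny rule set; spaCy's real rules are far more extensive.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX = re.compile(r"""^[\[("']""")
SUFFIX = re.compile(r"""[\])"'.,!?]$""")

def split_chunk(chunk):
    """Tokenize one white-space-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk:
        if chunk in EXCEPTIONS:            # check 1: exception rule
            prefixes.extend(EXCEPTIONS[chunk])
            break
        if PREFIX.search(chunk):           # check 2: split off a prefix ...
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif SUFFIX.search(chunk):         # ... or a suffix, then loop again
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        else:                              # nothing left to split
            prefixes.append(chunk)
            break
    return prefixes + suffixes

def tokenize(text):
    return [token for chunk in text.split() for token in split_chunk(chunk)]

result = tokenize("(I don't live in the U.K.!)")
print(result)
```

The chunk "U.K.!)" shows why the checks are repeated after every split: only after the bracket and the exclamation mark have been removed does the remainder match the exception entry that keeps "U.K." in one piece.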

It is also instructive to have a look at a spaCy tokenizer resource, e.g., the [English tokenizer exception list](https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tokenizer_exceptions.py) contains the following type of information:

```python
for pron in ["i", "you", "we", "they"]:
    for orth in [pron, pron.title()]:
        _exc[orth + "'ve"] = [
            {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
            {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
        ]

        _exc[orth + "ve"] = [
            {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
            {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
        ]
```

i.e., the exception lookup table (which is a Python dict) contains lemmatization and morphological information as well.

# Machine learning-based methods

Although rule-based solutions currently constitute the state of the art and the industry standard for European languages and writing systems, statistical and machine learning-based techniques have also been used:

- __Decision trees__ were used (with hand-engineered features) to decide whether periods in a text are sentence boundaries or not: [Riley, Michael D. "Some applications of tree-based modelling to speech and language" (1989)](https://www.aclweb.org/anthology/H89-2048.pdf).

- A rather influential _unsupervised approach_ to tokenization was developed for the so-called __Punkt__ sentence segmenter, which used MLE methods to determine which word--period combinations in the training corpus are probably abbreviations (as opposed to sentence endings). [Kiss, Tibor, and Jan Strunk: Unsupervised multilingual sentence boundary detection (2006)](https://www.mitpressjournals.org/doi/pdfplus/10.1162/coli.2006.32.4.485)

## HMM-based tokenization

Last, but not least, relatively recently HMM-based methods have been used. The approach described here relied on a pre-segmentation, which indicated possible token boundaries (e.g., any period followed by a space is a possible token boundary), and classified the resulting segments in terms of three binary features:

- beginning of a word (BOW)
- beginning of a sentence (BOS)
- end of sentence (EOS)

These "hidden" features are estimated using an HMM with the following observable features:

- segment length
- typographic class
- contains leading whitespace (binary feature)
- stop word (binary feature)
- letter case

The model learned from the training data provides the following tokenization of an input sentence with pre-segmentation:

<a href="http://drive.google.com/uc?export=view&id=1yJRenGsufgdNIwbaozrl9IPkQM69rszd"><img src="https://drive.google.com/uc?export=view&id=1su_ptkqBUkAePbNrx7WRkrzMcX2ktV1m"></a>

See the whole paper, [Jurish and Würzner: Word and Sentence Tokenization with Hidden Markov Models (2013)](https://www.researchgate.net/profile/Bryan_Jurish/publication/259772781_Word_and_Sentence_Tokenization_with_Hidden_Markov_Models/links/00b4952dce81419f06000000/Word-and-Sentence-Tokenization-with-Hidden-Markov-Models.pdf) (from which the above figure originates) for details.

# Do we really need word-based tokenization?

The development and dominance of DL-based solutions to several NLP tasks (IR, sentiment analysis, chatbots, etc.) have raised an important question concerning tokenization:

> do we really need word-based tokenization for efficient modeling?

NLP tasks typically don't deal with words explicitly: they classify, search for, etc. larger chunks of text, i.e. short documents, sections or paragraphs -- these are the natural units for most tasks. From this perspective, segmenting these units into smaller fragments has only instrumental value and should be done only when it is useful from an architectural or performance point of view.

Of course, if we can segment the input into a sequence of short character chunks which constitute a reasonably sized "vocabulary", this might be very useful, as we can work with shorter sequences than our character-based input, but there can be more natural ways of doing this than the theory-burdened traditional tokenization, which deals with separating punctuation marks, clitics etc. In fact, it seems that there _are_ better solutions: we will return to this question when we discuss __subword tokenization__ methods.