In [None]:
#hide
%load_ext autoreload
%autoreload 2

In [None]:
# default_exp segment

In [None]:
#export
from typing import List

# Segmentation

To handle a big block of text we need ways of *segmenting* it into smaller pieces we can tackle.

The smallest piece is a *byte* (which depends on the *encoding* of the text), followed by a *character*, and then a *grapheme*.
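
To make the three levels concrete, here is a minimal sketch that counts bytes, characters and (approximate) graphemes for one short string. The helper `rough_graphemes` is a hypothetical name introduced for illustration; it only attaches combining marks to the preceding character, which is a simplification of the full grapheme cluster rules in Unicode Standard Annex #29.

```python
import unicodedata

text = "ne\u0301e"  # 'née' written with a combining acute accent (U+0301)

# Bytes: the count depends on the encoding you choose.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Characters: Python strings are sequences of Unicode code points.
chars = list(text)


def rough_graphemes(s):
    """Approximate graphemes by gluing combining marks onto the previous character.

    Assumption: this ignores emoji sequences, Hangul jamo and other cases
    that a full grapheme segmenter would handle.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


graphemes = rough_graphemes(text)
print(len(utf8_bytes), len(utf16_bytes), len(chars), len(graphemes))  # → 5 8 4 3
```

The same text is 5 or 8 bytes depending on the encoding, 4 characters, and 3 graphemes as a reader would perceive them.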

We can also segment into groups of characters, into *words*, and further up into *sentences* and *sections*.

For example, if you're analysing a book you may want to break it into chapters, which contain sentences, which contain words.

The right choice of segmentation depends on the *corpus* and the *task* (and your approach).
For example, if you're using neural networks to translate between languages, *subword tokenizers* like BPE ([Sennrich, Haddow and Birch, 2015](https://arxiv.org/abs/1508.07909)) or the Unigram Language Model ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) perform much better than word-level segmentation because they can handle new words built from previously seen morphemes, prefixes and suffixes.
But if you're just trying to understand sentiment you might be better off breaking the text into words, normalising them and checking them against a sentiment dictionary.
Word segmentation in *unsegmented* languages like Chinese, Japanese, Thai, Balinese, Javanese and Khmer is difficult because even native speakers may not agree on how to break a text into words (see [Sproat et al., 1996](https://www.aclweb.org/anthology/J96-3004/) for an example from evaluating Chinese segmentation).

Because of this we're going to provide a *range* of segmenters to try with different tasks.
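
As a small illustration of how the task drives the choice, the dictionary-based sentiment approach mentioned above could be sketched as follows. The lexicon, its scores and the function name `sentiment_score` are all hypothetical; a real application would load a published sentiment lexicon and use a better word segmenter than whitespace splitting.

```python
# Hypothetical toy lexicon, for illustration only.
SENTIMENT_LEXICON = {"good": 1, "great": 2, "bad": -1, "awful": -2}


def sentiment_score(text: str) -> int:
    """Sum lexicon scores over whitespace-separated, normalised words."""
    score = 0
    for word in text.split():
        # Normalise: strip surrounding punctuation and lower-case.
        normalised = word.strip(".,!?;:\"'").lower()
        score += SENTIMENT_LEXICON.get(normalised, 0)
    return score


print(sentiment_score("Great food, bad service, GREAT view!"))  # → 3
```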

# Normalisation

Normalisation is closely related to segmentation: it maps forms that differ on the surface, but should be treated as equivalent, onto a single representation.

For example, Unicode has multiple ways of representing the same symbol.
There's a [canonical normalisation](http://www.unicode.org/reports/tr15/) form called NFC that represents all of these variants the same way.

It's often performed at the same time as word segmentation for efficiency, but it is logically a separate step (especially if you keep the context in your tokens, as SpaCy does).

In [None]:
import unicodedata
a = u'\u0061\u0301'
a0 = unicodedata.normalize('NFC', u'\u0061\u0301')
a, a0

('á', 'á')

In [None]:
list(a), list(a0)

(['a', '́'], ['á'])
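
One practical consequence is that string comparison should happen after normalisation. The sketch below uses a hypothetical helper, `same_text`, to show this; it also shows that NFC leaves compatibility characters such as the `ﬁ` ligature alone, while the NFKC form folds them.

```python
import unicodedata


def same_text(left: str, right: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same Unicode normalisation form."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


decomposed = "a\u0301"  # 'a' followed by a combining acute accent
composed = "\u00e1"     # precomposed 'á'

print(decomposed == composed)                # → False
print(same_text(decomposed, composed))       # → True

# The 'ﬁ' ligature (U+FB01) is only a *compatibility* equivalent of 'fi'.
print(same_text("\ufb01", "fi"))             # → False
print(same_text("\ufb01", "fi", "NFKC"))     # → True
```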

For some applications, though, you may just want to convert everything to ASCII; this can be useful for e.g. transliterated names.
The [unidecode library](https://github.com/takluyver/Unidecode) helps with that.

In [None]:
!pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.1.2-py2.py3-none-any.whl (239 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.1.2


In [None]:
from unidecode import unidecode

In [None]:
x = "\u5317\u4EB0"
x, unidecode(x)

('北亰', 'Bei Jing ')

In [None]:
x = "Бизнес Ланч" # Business Lunch
x, unidecode(x)

('Бизнес Ланч', 'Biznes Lanch')

A common use is unifying different kinds of punctuation:

In [None]:
x = '–—-' # en dash, em dash, hyphen
x, unidecode(x)

('–—-', '----')

In [None]:
x = '¿¡«…» „“'
x, unidecode(x)

('¿¡«…» „“', '?!<<...>> ,,"')
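
If you want to unify punctuation but keep the rest of the text in its original script, one option is a translation table rather than full ASCII transliteration. This is a minimal sketch using only the standard library; the table `PUNCTUATION_MAP` and the function `unify_punctuation` are hypothetical names, and the choice of replacements is an assumption you would adapt to your corpus.

```python
# Hypothetical mapping: extend it with whichever punctuation matters for your corpus.
PUNCTUATION_MAP = str.maketrans({
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def unify_punctuation(text: str) -> str:
    """Replace the mapped punctuation and leave every other character untouched."""
    return text.translate(PUNCTUATION_MAP)


print(unify_punctuation("\u201cBeijing\u201d \u2013 北京\u2026"))  # → "Beijing" - 北京...
```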

# Word Normalisation

Beyond the character level you may want to treat certain words the same:

Spellings:
* aluminum vs aluminium
* criticize vs criticise

Common misspellings:
* acceptable vs acceptible

Contractions and Abbreviations:
* don't for do not
* Mr. and Mr for Mister

Ideally we would keep the original forms for reference and layer the normalisation on top.
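
One way to layer normalisation on top of the original forms is to store both on each token. The sketch below is only an illustration: the `Token` class, the `NORMAL_FORMS` lookup table and the choice to lower-case unknown words are all assumptions, not part of this library.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical lookup table built from the examples above.
NORMAL_FORMS: Dict[str, str] = {
    "aluminum": "aluminium",
    "criticize": "criticise",
    "acceptible": "acceptable",
    "don't": "do not",
    "mr.": "mister",
    "mr": "mister",
}


@dataclass(frozen=True)
class Token:
    text: str  # original surface form, kept for reference
    norm: str  # normalised form layered on top


def normalise_tokens(words: List[str]) -> List[Token]:
    """Pair each word with its normalised form (lower-cased if not in the table)."""
    return [Token(w, NORMAL_FORMS.get(w.lower(), w.lower())) for w in words]


tokens = normalise_tokens(["Mr.", "Jones", "sells", "Aluminum"])
print([t.text for t in tokens])  # → ['Mr.', 'Jones', 'sells', 'Aluminum']
print([t.norm for t in tokens])  # → ['mister', 'jones', 'sells', 'aluminium']
```

Downstream code can then match on `norm` while still being able to display or audit `text`.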

# Tokenization (Word Segmentation)

Notable tools:

* SpaCy (rule based, a few languages)
* Stanza (Neural based, many languages)
* Stanford NLP (via Stanza, rule based, a few languages)
* NLTK (rule based, a few languages)
* [Moses Tokenizer and Normalizer](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer) (Perl)

In [None]:
#!pip install stanza

!pip install spacy
!pip install spacy-lookups-data
!python -m spacy download en_core_web_sm

import stanza

In [None]:
#export
import re

# From NLTK as referenced in Jurafsky, Speech and Language Processing Chapter 2 (Dec 2020 Draft)

TOKENIZE_RE = re.compile(r'''(?x) # set flag to allow verbose regexps
     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*       # words with optional internal hyphens
    | \$?\d+(\.\d+)?%? # Currency and percentages, e.g. $12.40, 82%
    | \.\.\.           # Ellipsis
    | [\]\[.,;"'?():_`-] # These are separate tokens; '-' goes last so it is not read as a range
''')

class RegexTokenizer():
    def __init__(self, regexp: re.Pattern) -> None:
        self.re = regexp
        
    def __call__(self, text: str) -> List[str]:
        tokens = []
        pos = 0
        while pos < len(text):
            match = self.re.match(text, pos)
            if match:
                tokens.append(match.group(0))
                pos = match.end()
            else:
                # Nothing matches here (e.g. whitespace or '/'), so skip one character
                pos += 1
        return tokens

def tokenize_space(text: str) -> List[str]:
    # Raw string so that \s is a regex escape, not an invalid string escape
    return re.split(r"\s", text)

tokenize_ascii = RegexTokenizer(TOKENIZE_RE)

In [None]:
text = "That U.S.A. poster-print/photograph costs $12.40..."

In [None]:
tokenize_space(text)

['That', 'U.S.A.', 'poster-print/photograph', 'costs', '$12.40...']

In [None]:
tokenize_ascii(text)

['That', 'U.S.A.', 'poster-print', 'photograph', 'costs', '$12.40', '...']

# Subword Segmentation

[SentencePiece](https://github.com/google/sentencepiece) implements both BPE and Unigram Language Model subword segmentation.
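
To give a feel for what a subword segmenter does, here is a toy sketch of byte pair encoding in the spirit of Sennrich, Haddow and Birch (2015). It is not the SentencePiece API and is not optimised; the function names `learn_bpe` and `apply_bpe`, the `</w>` end-of-word marker and the tiny word-count dictionary are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_counts: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn an ordered list of symbol-pair merges from word frequencies."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite every word, joining each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            i += 1
    return symbols


merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5)
print(merges)
print(apply_bpe("lowest", merges))  # → ['low', 'est</w>']
```

The word "lowest" never appears in the training counts, yet it is segmented into two pieces that were both learned from other words, which is the property that makes subword units useful for unseen vocabulary.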

# Export

In [None]:
#hide
from nbdev.export import notebook2script; notebook2script()

Converted 000_data.ipynb.
Converted 00_core.ipynb.
Converted 01_segment.ipynb.
Converted 02_ngram.ipynb.
Converted index.ipynb.
