
In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%cd ..
import statnlpbook.tokenization as tok

# Tokenisation

![nospaces](../img/nospaces.jpg)

* Identify the **meaningful units** in a string of characters: usually **words**.

In Python you can tokenise text via `split`:

In [5]:
text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.\nWhy',
 "doesn't",
 'he',
 'quit?']

What is wrong with this?
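One visible problem is the item `'plan.\nWhy'`: `split(" ")` splits only at the space character, so the newline stays inside a token. As a small sketch (the variable names are chosen here for illustration), calling `split()` without an argument splits at any run of whitespace instead:

```python
text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""

# split(" ") treats only the space character as a separator,
# so the newline survives inside one "token".
by_space = text.split(" ")

# split() with no argument splits at any run of whitespace
# (spaces, newlines, tabs) and drops empty strings.
by_whitespace = text.split()

print(repr(by_space[8]))     # the item that still contains a newline
print(by_whitespace[8:10])   # the same region, now two items
```

Punctuation still sticks to the neighbouring words (`plan.`, `quit?`), which plain whitespace splitting cannot fix.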

In Python you can also tokenise using **patterns** at which to split tokens:
### Regular Expressions

A **regular expression** is a compact definition of a **set** of (character) sequences (strings).

Examples:
* `Mr.`: all strings consisting of `Mr` followed by any single character (except a newline)
* `Mr\.`: only the string `Mr.`
* <code>&nbsp;</code>`|\n|!!!`: only the strings <code>&nbsp;</code> (space), `\n` and `!!!`
* `[abc]`: only the characters `a`, `b` and `c`
* `\s`: all whitespace characters
* `1+`: all sequences of at least one `1`
* `\w+`: all non-empty sequences of alphanumeric characters and `_`
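To make the "set of strings" view concrete, the following sketch uses `re.fullmatch` to test whether a whole string belongs to the set a pattern defines. The test strings are chosen here purely for illustration.

```python
import re

# Each pattern defines a set of strings; re.fullmatch tests whether
# a whole string is a member of that set.
examples = [
    (r"Mr.", "Mrs"),      # "." matches any single character
    (r"Mr\.", "Mrs"),     # "\." matches only a literal dot
    (r"Mr\.", "Mr."),
    (r" |\n|!!!", "!!!"),
    (r"[abc]", "b"),
    (r"[abc]", "ab"),     # a character class matches exactly one character
    (r"\s", "\t"),
    (r"1+", "111"),
    (r"\w+", "token_1"),
    (r"\w+", "isn't"),    # the apostrophe is not a word character
]

results = [re.fullmatch(pattern, string) is not None
           for pattern, string in examples]

for (pattern, string), is_member in zip(examples, results):
    print(f"{pattern!r:14} {string!r:12} {is_member}")
```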


In [11]:
import re
re.compile(r'\s').split(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']

Problems:
* Bad treatment of punctuation.  
* Easier to **define a token** than a gap. 

Let us use `findall` instead:

In [5]:
re.compile(r'\w+|[.?]').findall(text)

['Mr',
 '.',
 'Bob',
 'Dobolina',
 'is',
 'thinkin',
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 'doesn',
 't',
 'he',
 'quit',
 '?']

Problems:
* `Mr.` and `doesn't` are split into two tokens each.
* Lost an apostrophe (`thinkin'`).

Both are fixed below ...

In [6]:
re.compile(r"Mr\.|[\w']+|[.?]").findall(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 "doesn't",
 'he',
 'quit',
 '?']
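Python's `re` module tries the alternatives of a pattern from left to right and takes the first one that matches at a given position, so the order of the alternatives matters. As a small sketch, moving `Mr\.` behind `[\w']+` splits `Mr.` into two tokens again:

```python
import re

text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""

# Specific alternative first: "Mr." is kept together.
specific_first = re.compile(r"Mr\.|[\w']+|[.?]").findall(text)

# General alternative first: [\w']+ consumes "Mr" before
# "Mr\." is ever tried, so the dot becomes a separate token.
general_first = re.compile(r"[\w']+|Mr\.|[.?]").findall(text)

print(specific_first[:2])  # ['Mr.', 'Bob']
print(general_first[:3])   # ['Mr', '.', 'Bob']
```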

## Learning to Tokenise?
* For English, simple pattern matching is often sufficient.
* In some languages (e.g., Japanese), words are not separated by whitespace.
* In some languages (e.g., Vietnamese), whitespace does not always mark a word boundary: a single word can contain spaces.


In [12]:
jap = "今日もしないといけない。"
viet = "thuế  thu nhập cá nhân"

Try lexicon-based tokenisation ...

In [7]:
re.compile('もし|今日|も|しない|と|いけない').findall(jap)

['今日', 'もし', 'と', 'いけない']

In [8]:
re.compile('thuế  thu nhập|cá nhân').findall(viet)

['thuế  thu nhập', 'cá nhân']
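In the Japanese output above, `しない` is missing. `もし` is listed before `も`, so the regex takes `もし` at the second position; the leftover `ない` matches no lexicon entry and `findall` silently skips it. As a sketch of how fragile this is, listing `も` before `もし` (a change made here only for illustration) yields a different segmentation of the same string:

```python
import re

jap = "今日もしないといけない。"

# Original order: もし is tried before も.
original = re.compile('もし|今日|も|しない|と|いけない').findall(jap)

# Same lexicon, but も is tried before もし.
reordered = re.compile('今日|も|もし|しない|と|いけない').findall(jap)

print(original)   # ['今日', 'もし', 'と', 'いけない']
print(reordered)  # ['今日', 'も', 'しない', 'と', 'いけない']
```

Neither ordering is right in general: which entry should win depends on the context, which a fixed lexicon order cannot capture.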

Tokenisation is equally difficult in certain English domains (e.g., biomedical text).

In [9]:
bio = """We developed a nanocarrier system of herceptin-conjugated nanoparticles
of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin
prodrug ..."""

* `d-alpha-tocopheryl-co-poly` is **one** token
* `(TPGS)-cisplatin` is **five** tokens:
  * ( 
  * TPGS 
  * ) 
  * - 
  * cisplatin 

In [10]:
re.compile(r'\s').split(bio)[:15]

['We',
 'developed',
 'a',
 'nanocarrier',
 'system',
 'of',
 'herceptin-conjugated',
 'nanoparticles',
 'of',
 'd-alpha-tocopheryl-co-poly(ethylene',
 'glycol)',
 '1000',
 'succinate',
 '(TPGS)-cisplatin',
 'prodrug']
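For this one sentence, a hand-written pattern can reproduce the token boundaries listed above. The sketch below is an assumption made for illustration, not a general biomedical tokeniser: it keeps hyphens between word characters inside a token and emits every other non-whitespace symbol as a token of its own.

```python
import re

bio = """We developed a nanocarrier system of herceptin-conjugated nanoparticles
of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin
prodrug ..."""

# \w+(?:-\w+)*  word characters, optionally joined by internal hyphens
# [^\w\s]       any single symbol that is neither a word character nor whitespace
bio_token = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

bio_tokens = bio_token.findall(bio)
tpgs_tokens = bio_token.findall("(TPGS)-cisplatin")

print(bio_tokens[9:14])
print(tpgs_tokens)
```

Rules like this tend to need a new exception for each new kind of name, which is hard to maintain by hand.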

In [13]:
re.compile(r'\s').split("New York-based companies")

['New', 'York-based', 'companies']

Solution: Treat tokenisation as a **statistical NLP problem** (and as structured prediction)! 
  * [classification](doc_classify.ipynb)
  * [sequence labelling](sequence_labelling.ipynb)
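One common way to set this up as sequence labelling is to tag every character with whether it starts a token. The sketch below only illustrates the data format: the label names `B`, `I`, `O` and both helper functions are assumptions, not part of `statnlpbook`. A classifier or sequence model would then be trained to predict these labels.

```python
def boundary_labels(text, tokens):
    """Label each character: B = begins a token, I = inside a token, O = outside."""
    labels = ["O"] * len(text)
    position = 0
    for token in tokens:
        start = text.index(token, position)  # tokens must appear in order
        labels[start] = "B"
        for i in range(start + 1, start + len(token)):
            labels[i] = "I"
        position = start + len(token)
    return labels


def tokens_from_labels(text, labels):
    """Invert boundary_labels: rebuild the tokens from per-character labels."""
    tokens = []
    for char, label in zip(text, labels):
        if label == "B":
            tokens.append(char)
        elif label == "I" and tokens:
            tokens[-1] += char
    return tokens


example = "(TPGS)-cisplatin prodrug"
gold = ["(", "TPGS", ")", "-", "cisplatin", "prodrug"]
labels = boundary_labels(example, gold)
print("".join(labels))
print(tokens_from_labels(example, labels) == gold)
```

With this encoding, predicting a tokenisation amounts to predicting one label per character, and the round trip through `tokens_from_labels` recovers the tokens.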

# Sentence Segmentation

* Many NLP tools work sentence-by-sentence. 
* Often trivial after tokenisation: split sentences at sentence-ending punctuation tokens.

In [11]:
tokens = re.compile(r"Mr\.|[\w']+|[.?]").findall(text)
# try different regular expressions
tok.sentence_segment(re.compile(r'\.'), tokens)

[['Mr.',
  'Bob',
  'Dobolina',
  'is',
  "thinkin'",
  'of',
  'a',
  'master',
  'plan',
  '.'],
 ['Why', "doesn't", 'he', 'quit', '?']]
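The implementation of `tok.sentence_segment` is not shown here. As a rough sketch of the idea only, a hypothetical `segment_sentences` helper (an assumption, not the `statnlpbook` code) can close a sentence whenever a whole token matches the end-of-sentence pattern:

```python
import re


def segment_sentences(end_pattern, tokens):
    """Group tokens into sentences, closing one after each token
    that fully matches end_pattern."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if end_pattern.fullmatch(token):
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


tokens = ['Mr.', 'Bob', 'Dobolina', 'is', "thinkin'", 'of', 'a', 'master',
          'plan', '.', 'Why', "doesn't", 'he', 'quit', '?']
sentences = segment_sentences(re.compile(r'[.?]'), tokens)
print(len(sentences))
print(sentences[1])
```

Because the whole token has to match, `Mr.` does not end a sentence here even though it contains a dot.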

<center><img src="../img/quiz_time.png"></center>

What to do with transcribed speech? 

Discuss and enter your answer(s) here:

# [ucph.page.link/nlp_q2](https://ucph.page.link/nlp_q2)


### Solution

* One way to tokenise transcribed speech is to break it up into utterances, and then break each utterance into words
* Train a model on written data that can add punctuation, which can then be applied to add punctuation to the new text
* Tokenise using additionally captured metadata from the speaker's voice (e.g., pauses, prosody)
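The pause-based idea can be sketched as follows, assuming the transcript comes with word-level timestamps. The `split_on_pauses` helper, the 0.5 second threshold and the timings are all made up for illustration.

```python
def split_on_pauses(timed_words, min_pause=0.5):
    """Start a new utterance whenever the silence before a word
    is at least min_pause seconds."""
    utterances, current = [], []
    previous_end = None
    for word, start, end in timed_words:
        if previous_end is not None and start - previous_end >= min_pause:
            utterances.append(current)
            current = []
        current.append(word)
        previous_end = end
    if current:
        utterances.append(current)
    return utterances


# Made-up (word, start, end) timings in seconds, for illustration only.
timed_words = [
    ("he", 0.0, 0.2), ("has", 0.25, 0.5), ("a", 0.55, 0.6),
    ("plan", 0.65, 1.0),
    ("why", 1.8, 2.0), ("not", 2.05, 2.3), ("quit", 2.35, 2.7),
]
utterances = split_on_pauses(timed_words)
print(utterances)
```

A fixed threshold is a crude stand-in for prosody; in practice the pause length would more plausibly be one feature among several in a trained model.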

# Background Reading

* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf): Chapter 2, Regular Expressions, Text Normalization, Edit Distance.
* Manning, Raghavan & Schuetze, Introduction to Information Retrieval: [Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)