```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Introduction to tokenization

## What is tokenization?

* It turns a string or document into tokens (smaller chunks).

* It's one step in preparing a text for NLP.

* There are many different theories and rules for how to tokenize.

* Users can create their own rules using regular expressions.

* Some examples:

	* *breaking out words or sentences*
    
	* *separating punctuation*
    
	* *separating all hashtags in a tweet*
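
The example rules listed above can be sketched with the standard-library `re` module alone. The sample tweet and the three patterns below are illustrative assumptions, not rules prescribed by NLTK:

```python
import re

# Hypothetical sample text, made up for illustration.
tweet = "Loving the new #NLP course! Thanks, #Python."

# Rule 1: break out words (runs of letters, digits, or underscores).
words = re.findall(r"\w+", tweet)

# Rule 2: separate punctuation by matching either a word or a single
# character that is neither a word character nor whitespace.
words_and_punct = re.findall(r"\w+|[^\w\s]", tweet)

# Rule 3: separate all hashtags in the tweet.
hashtags = re.findall(r"#\w+", tweet)

print(words)
print(words_and_punct)
print(hashtags)
```

Rule 2 treats `#` as punctuation and splits it from the tag text, which is why a hashtag-aware rule has to be applied separately or placed first in an alternation.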

## What is the NLTK library?

* NLTK stands for Natural Language Toolkit, a Python library for working with human language data.

## Example of using the NLTK library:

In [1]:
from nltk.tokenize import word_tokenize

word_tokenize("Hi there!")

['Hi', 'there', '!']

## Why tokenize?

* To make it easier to map parts of speech.

* To match common words.

* To remove unwanted tokens.

* E.g., 

    ```
    >>> word_tokenize("I don't like Sam's shoes.")
    ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
    ```
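
Once a sentence is a list of tokens, matching common words and removing unwanted tokens become simple list operations. A minimal sketch, reusing the token list from the output above; the `common_words` set is a made-up stand-in for a real stop-word list:

```python
# Tokens copied from the word_tokenize output shown above.
tokens = ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']

# Remove unwanted tokens: drop anything with no letter or digit (pure punctuation).
no_punct = [t for t in tokens if any(ch.isalnum() for ch in t)]

# Match common words: a tiny hypothetical stop-word set, for illustration only.
common_words = {"i", "do", "n't", "'s"}
content_tokens = [t for t in no_punct if t.lower() not in common_words]

print(no_punct)
print(content_tokens)
```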

## What are the other NLTK tokenizers?

* `sent_tokenize`: tokenize a document into sentences.

* `regexp_tokenize`: tokenize a string or document based on a regular expression pattern.

* `TweetTokenizer`: a special class for tokenizing tweets; it lets you separate hashtags, mentions, and runs of exclamation points such as '!!!'.

## Regex practice: the difference between `re.search()` and `re.match()`:

In [2]:
import re

re.match('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [3]:
re.search('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [4]:
re.match('cd', 'abcde')

In [5]:
re.search('cd', 'abcde')

<re.Match object; span=(2, 4), match='cd'>
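
The four calls above differ only in where the pattern is allowed to start: `re.match()` succeeds only if the pattern matches at the beginning of the string, while `re.search()` scans forward until it finds a match. Both return `None` on failure, which is why `In [4]` shows no output. A small sketch of the same behaviour:

```python
import re

text = "abcde"

at_start = re.match("cd", text)    # None: "cd" is not at position 0
anywhere = re.search("cd", text)   # Match object for the "cd" in the middle
anchored = re.search("^cd", text)  # None: "^" pins the search to the start, like match

print(at_start, anchored)
print(anywhere.span())
```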

## Practice exercises for introduction to tokenization:

$\blacktriangleright$ **Data pre-loading:**

In [6]:
with open("ref2. Scene 1 of Monty Python and the Holy Grail.txt") as f:
    scene_one = f.read()

$\blacktriangleright$ **NLTK word tokenization with practice:**

In [7]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

{'Will', 'Listen', 'What', 'question', "'s", 'Not', 'order', 'Where', 'fly', 'join', "'em", '[', 'sun', 'he', 'but', 'strand', 'Whoa', 'Arthur', 'temperate', 'winter', 'grips', 'ridden', 'point', 'be', 'does', 'other', '.', 'ounce', 'every', 'European', '?', 'SOLDIER', 'my', 'five', 'length', 'by', 'ratios', 'weight', 'wings', 'of', 'or', 'wind', "'ve", 'land', 'bangin', 'plover', 'me', 'Patsy', 'the', 'using', ':', 'bird', 'get', 'Britons', 'non-migratory', 'ARTHUR', 'speak', 'mean', 'suggesting', 'martin', 'one', 'SCENE', 'found', 'Oh', "'", 'times', 'Found', 'then', 'Pull', 'and', 'second', 'lord', 'your', 'swallows', 'breadth', 'search', 'Camelot', 'am', 'warmer', 'KING', 'through', 'you', 'held', 'in', 'trusty', 'got', 'carry', 'carrying', 'I', 'it', '!', 'simple', 'needs', 'empty', 'beat', 'It', 'tropical', 'Please', 'The', 'since', 'Saxons', '1', 'this', ',', 'servant', 'You', ']', 'Am', '--', 'agree', 'not', "'re", 'Pendragon', 'Uther', 'do', 'on', 'So', 'Supposing', 'tell', 'W

$\blacktriangleright$ **Package pre-loading:**

In [8]:
import re

$\blacktriangleright$ **Regex (`re.search()`) practice:**

In [9]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

580 588
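
`re.search()` returns `None` when the pattern is absent, so calling `.start()` on the result would then raise an `AttributeError`. A defensive sketch, using a short made-up line in place of `scene_one`; the helper name `find_span` is hypothetical:

```python
import re

# Hypothetical stand-in for the script text.
sample = "SOLDIER: Where did you get the coconuts?"

def find_span(word, text):
    """Return (start, end) of the first occurrence of word, or None if absent."""
    match = re.search(re.escape(word), text)
    if match is None:
        return None
    return match.start(), match.end()

span = find_span("coconuts", sample)
print(span, sample[span[0]:span[1]])
print(find_span("parrot", sample))
```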


In [10]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first bracketed text (.* is greedy, so the match runs to the last ] on the line)
print(re.search(pattern1, scene_one))

<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
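
The match above covers two bracketed notes because `.*` is greedy: it extends as far as it can while still ending on a `]`. If each note is wanted separately, a lazy quantifier (`.*?`) is one option. A sketch on a one-line stand-in string, assumed here to resemble the start of the scene:

```python
import re

# Stand-in for the first line of the scene (an assumption for illustration).
line = "SCENE 1: [wind] [clop clop clop]"

greedy = re.search(r"\[.*\]", line)       # runs to the last "]" on the line
lazy = re.search(r"\[.*?\]", line)        # stops at the first "]"
all_notes = re.findall(r"\[.*?\]", line)  # every bracketed note, one per item

print(greedy.group())
print(lazy.group())
print(all_notes)
```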


In [11]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s#]+:"
print(re.match(pattern2, sentences[3]))

<re.Match object; span=(0, 7), match='ARTHUR:'>
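
`pattern2` reads as: one or more characters that are word characters, whitespace, or `#`, followed by a colon. Including `\s` and `#` lets it cover multi-word or numbered speaker labels. A sketch with made-up lines (the labels are assumptions, not taken from the file):

```python
import re

pattern2 = r"[\w\s#]+:"

# Hypothetical script lines for illustration.
lines = [
    "ARTHUR: It is I.",
    "SOLDIER #1: Halt!",
    "KING ARTHUR: Whoa there!",
    "[clop clop clop]",
]

labels = []
for line in lines:
    m = re.match(pattern2, line)
    # re.match returns None when the line does not start with a speaker label.
    labels.append(m.group() if m else None)

print(labels)
```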


## Appendixes

### Appendix A: Version Checking:

In [12]:
import nltk

print(f'The nltk version is {nltk.__version__}.')

The nltk version is 3.5.


### Appendix B: Syntax Reference of [NLTK 3.5](https://www.nltk.org):

$\blacktriangleright$ **[`nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize):**

* Return a tokenized copy of `text`, using NLTK’s recommended word tokenizer (currently an improved [`TreebankWordTokenizer`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.treebank.TreebankWordTokenizer) along with [`PunktSentenceTokenizer`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer) for the specified language).

* Parameters:

    * `text` (`str`) – text to split into words

    * `language` (`str`) – the model name in the Punkt corpus

    * `preserve_line` (`bool`) – An option to preserve the sentence and not sentence-tokenize it.

$\blacktriangleright$ **[`nltk.tokenize.sent_tokenize(text, language='english')`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize):**

* Return a sentence-tokenized copy of `text`, using NLTK’s recommended sentence tokenizer (currently [`PunktSentenceTokenizer`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer) for the specified language).

* Parameters:

    * `text` – text to split into sentences

    * `language` – the model name in the Punkt corpus

$\blacktriangleright$ **[`nltk.tokenize.regexp.regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.regexp_tokenize):**

* Return a tokenized copy of `text`. See `RegexpTokenizer` for descriptions of the arguments.

* Caution: The function `regexp_tokenize()` takes the text as its first argument, and the regular expression pattern as its second argument. This differs from the conventions used by Python’s `re` functions, where the pattern is always the first argument. (This is for consistency with the other NLTK tokenizers.)

$\blacktriangleright$ **[class `nltk.tokenize.regexp.RegexpTokenizer(pattern, gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer):**

* A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

In [13]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

* Parameters:

    * `pattern` (`str`) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. `(?:…)`, instead.)

    * `gaps` (`bool`) – `True` if this tokenizer’s pattern should be used to find separators between tokens; `False` if this tokenizer’s pattern should be used to find the tokens themselves.

    * `discard_empty` (`bool`) – `True` if any empty tokens `''` generated by the tokenizer should be discarded. Empty tokens can only be generated if `gaps == True`.

    * `flags` (`int`) – The regexp flags used to compile this tokenizer’s pattern. By default, the following flags are used: `re.UNICODE | re.MULTILINE | re.DOTALL`.
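
The `gaps` and `discard_empty` parameters, and the warning about capturing parentheses, can be illustrated with the standard-library `re` module. This sketch assumes that `gaps=False` behaves like `re.findall()` and `gaps=True` like `re.split()`; it is an analogy rather than NLTK's own code, and the sample sentence is made up:

```python
import re

# Hypothetical sample sentence; the trailing space is deliberate.
text = "Good muffins cost $3.88 in New York. "

# gaps=False analogue: the pattern describes the tokens themselves.
tokens = re.findall(r"\w+|\$[\d\.]+|\S+", text)

# gaps=True analogue: the pattern describes the separators between tokens.
pieces = re.split(r"\s+", text)

# discard_empty=True analogue: drop the empty string left by the trailing space.
non_empty = [p for p in pieces if p]

# Capturing parentheses change what findall returns (only the group),
# which is why a non-capturing group (?:...) is the safer choice.
captured = re.findall(r"(ab)+", "abab cd")
whole = re.findall(r"(?:ab)+", "abab cd")

print(tokens)
print(pieces)
print(non_empty)
print(captured, whole)
```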

$\blacktriangleright$ **[class `nltk.tokenize.casual.TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.casual.TweetTokenizer):**

* Tokenizer for tweets.

In [14]:
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
tknzr.tokenize(s0)

['This',
 'is',
 'a',
 'cooool',
 '#dummysmiley',
 ':',
 ':-)',
 ':-P',
 '<3',
 'and',
 'some',
 'arrows',
 '<',
 '>',
 '->',
 '<--']

* Example using the `strip_handles` and `reduce_len` parameters:

In [15]:
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
tknzr.tokenize(s1)

[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
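
The effect of the two options can be approximated with plain `re` substitutions. This is a simplified sketch, not NLTK's implementation: the handle pattern `@\w+` is an assumption and is cruder than what the library uses.

```python
import re

s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'

# strip_handles analogue: remove @-mentions (simplified pattern, an assumption).
no_handles = re.sub(r"@\w+", "", s1)

# reduce_len analogue: cap any run of three or more identical characters at three.
shortened = re.sub(r"(.)\1{2,}", r"\1\1\1", no_handles)

print(shortened)
```

Splitting `shortened` into tokens is a separate step; the sketch only shows why `waaaaayyyy` becomes `waaayyy` and six exclamation marks become three.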