```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Introduction to tokenization

## What is tokenization?

* It turns a string or document into tokens (smaller chunks).

* It's one step in preparing a text for NLP.

* There are many different theories and rules for how to tokenize.

* Users can create their own rules using regular expressions.

* Some examples:

	* *breaking out words or sentences*
    
	* *separating punctuation*
    
	* *separating all hashtags in a tweet*
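
The three example rules above can be sketched with the standard `re` module. This is a minimal illustration rather than how NLTK does it; the sample tweet and the three patterns are assumptions chosen for the example.

```python
import re

# Hypothetical sample tweet, used only for illustration.
tweet = "Loving #nlp and #python today!"

# Rule 1: break out words (runs of word characters).
words = re.findall(r"\w+", tweet)

# Rule 2: separate punctuation (a word, or any single non-space symbol).
words_and_punct = re.findall(r"\w+|[^\w\s]", tweet)

# Rule 3: pull out only the hashtags.
hashtags = re.findall(r"#\w+", tweet)

print(words)            # ['Loving', 'nlp', 'and', 'python', 'today']
print(words_and_punct)  # ['Loving', '#', 'nlp', 'and', '#', 'python', 'today', '!']
print(hashtags)         # ['#nlp', '#python']
```

Each rule is a different pattern over the same string, which is why tokenization has no single correct answer: the right tokens depend on what the next step needs.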

## What is the `NLTK` library?

* `NLTK`: Natural Language Toolkit

## Example of using the `NLTK` library:

In [1]:
from nltk.tokenize import word_tokenize

word_tokenize("Hi there!")

['Hi', 'there', '!']

## Why tokenize?

* It makes it easier to map parts of speech.

* To match common words.

* To remove unwanted tokens.

* For example:

    ```
    >>> word_tokenize("I don't like Sam's shoes.")
    ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
    ```
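
Once the text is a list of tokens, removing unwanted tokens and matching common words become simple list operations. The sketch below reuses the token list from the output above; the rule for "unwanted" (tokens with no letter or digit) and the small `common_words` set are assumptions made for illustration.

```python
# Tokens copied from the word_tokenize output shown above.
tokens = ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']

# Remove unwanted tokens: here, any token with no letter or digit (pure punctuation).
kept = [t for t in tokens if any(ch.isalnum() for ch in t)]

# Match common words: a small, hypothetical set of words to filter out.
common_words = {"i", "do", "n't"}
content = [t for t in kept if t.lower() not in common_words]

print(kept)     # ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes']
print(content)  # ['like', 'Sam', "'s", 'shoes']
```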

## What are the other `NLTK` tokenizers?

* `sent_tokenize`: tokenize a document into sentences.

* `regexp_tokenize`: tokenize a string or document based on a regular expression pattern.

* `TweetTokenizer`: a class designed for tokenizing tweets; it lets you separate hashtags, mentions, and runs of exclamation points, such as '!!!'.
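
As a rough sketch of two of these tokenizers, the example below calls `regexp_tokenize` and `TweetTokenizer` on short strings. Both sample strings are hypothetical, and the example assumes `nltk` is installed; neither call should need extra NLTK data downloads.

```python
from nltk.tokenize import TweetTokenizer, regexp_tokenize

# regexp_tokenize returns every substring that matches the pattern.
# Hypothetical sample line; the pattern keeps only runs of word characters.
line = "SOLDIER #1: Found them?"
word_tokens = regexp_tokenize(line, r"\w+")
print(word_tokens)  # ['SOLDIER', '1', 'Found', 'them']

# TweetTokenizer keeps hashtags and mentions together as single tokens.
# Hypothetical sample tweet.
tweet = "Thanks @datacamp for the #nlp and #python practice"
tweet_tokens = TweetTokenizer().tokenize(tweet)
print(tweet_tokens)
```

Compare this with `word_tokenize`, which is not tweet-aware and would typically split `#` and `@` from the word that follows.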

## Regex practice: the difference between `re.search()` and `re.match()`

In [2]:
import re

re.match('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [3]:
re.search('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [4]:
re.match('cd', 'abcde')

In [5]:
re.search('cd', 'abcde')

<re.Match object; span=(2, 4), match='cd'>
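
The four cells above differ only in where the pattern is allowed to match: `re.match()` tries the pattern at the start of the string only, while `re.search()` scans forward for the first match anywhere. That is why `re.match('cd', 'abcde')` prints nothing: it returns `None`. A minimal sketch using the same string:

```python
import re

text = "abcde"

# re.match succeeds only if the pattern matches at position 0.
match_result = re.match("cd", text)
print(match_result)            # None: "abcde" does not start with "cd"

# re.search returns the first match found anywhere in the string.
search_result = re.search("cd", text)
print(search_result.span())    # (2, 4)

# Anchoring the pattern with ^ restricts re.search to the start of the string,
# so here it fails just as re.match does.
anchored_result = re.search("^cd", text)
print(anchored_result)         # None
```

Because both functions can return `None`, check the result before calling `.start()`, `.end()`, or `.group()` on it.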

## Practice exercises for introduction to tokenization:

$\blacktriangleright$ **Data pre-loading:**

In [6]:
# Read the whole script into one string; the with-block closes the file afterwards
with open("ref2. Monty Python and the Holy Grail.txt") as f:
    scene_one = f.read()

$\blacktriangleright$ **Practice with `NLTK` word tokenization:**

In [7]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)



$\blacktriangleright$ **Package pre-loading:**

In [8]:
import re

$\blacktriangleright$ **Regex (`re.search()`) practice:**

In [9]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

580 588


In [10]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first match (the greedy .* runs to the last ] on the line)
print(re.search(pattern1, scene_one))

<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
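
The match above covers two bracketed notes at once because `.*` is greedy: it consumes as much as it can and stops at the last `]` on the line. A non-greedy `.*?` stops at the first `]` instead. The sketch below shows the difference; the `line` string is an assumed sample built around the matched text, not the actual file contents.

```python
import re

# Assumed sample line for illustration; the real scene_one is read from the file.
line = "SCENE 1: [wind] [clop clop clop]"

# Greedy: .* extends to the last closing bracket on the line.
greedy = re.search(r"\[.*\]", line)
print(greedy.group())   # [wind] [clop clop clop]

# Non-greedy: .*? stops at the first closing bracket.
lazy = re.search(r"\[.*?\]", line)
print(lazy.group())     # [wind]

# findall with the non-greedy pattern returns each bracketed note separately.
notes = re.findall(r"\[.*?\]", line)
print(notes)            # ['[wind]', '[clop clop clop]']
```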


In [11]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s#]+:"
print(re.match(pattern2, sentences[3]))

<re.Match object; span=(0, 7), match='ARTHUR:'>
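
The character class in `pattern2` accepts word characters, whitespace, and `#`, followed by a colon, so it can also match speaker labels that contain a space and a number. The sketch below tries the same pattern on a few hypothetical script lines (they are illustrative, not quoted from the file).

```python
import re

pattern = r"[\w\s#]+:"

# Hypothetical script lines, for illustration only.
lines = [
    "ARTHUR: It is I, Arthur",
    "SOLDIER #1: Halt! Who goes there?",
    "[clop clop clop]",
]

labels = []
for line in lines:
    m = re.match(pattern, line)
    # re.match returns None when the line does not start with a speaker label.
    labels.append(m.group() if m else None)

print(labels)  # ['ARTHUR:', 'SOLDIER #1:', None]
```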
