```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Introduction to tokenization

## What is tokenization?

* Tokenization turns a string or document into tokens (smaller chunks).

* It is one step in preparing text for NLP.

* There are many different theories and rules for how to tokenize.

* You can create your own rules using regular expressions.

* Some examples:

	* *breaking out words or sentences*
    
	* *separating punctuation*
    
	* *separating all hashtags in a tweet*
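The three examples above can be sketched with the standard `re` module alone. The sample tweet and the patterns are illustrative assumptions, not part of the original notes, and real tokenizers handle many more edge cases.

```python
import re

# Hypothetical sample text, made up for this illustration
tweet = "Loving the #NLP course, thanks @instructor! #python"

# Breaking out words: runs of letters, digits, or underscores
words = re.findall(r"\w+", tweet)

# Separating punctuation: any character that is neither a word character nor whitespace
punctuation = re.findall(r"[^\w\s]", tweet)

# Separating hashtags: a '#' followed by one or more word characters
hashtags = re.findall(r"#\w+", tweet)

print(words)        # ['Loving', 'the', 'NLP', 'course', 'thanks', 'instructor', 'python']
print(punctuation)  # ['#', ',', '@', '!', '#']
print(hashtags)     # ['#NLP', '#python']
```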

## What is the `nltk` library?

* `nltk`: the Natural Language Toolkit, a Python library for working with human-language text.

## Code of the `nltk` library:

In [1]:
from nltk.tokenize import word_tokenize

word_tokenize("Hi there!")

['Hi', 'there', '!']

## Why tokenize?

* It makes it easier to map parts of speech.

* It helps match common words.

* It helps remove unwanted tokens.

* E.g., 

    ```
    >>> word_tokenize("I don't like Sam's shoes.")
    ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
    ```
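As a rough sketch of the last two points, the token list from the example above can be filtered with plain Python. The `common_words` set is a hypothetical stand-in for a real stop-word list, and filtering with `str.isalpha` is only one possible definition of "unwanted".

```python
# Tokens copied from the word_tokenize output shown above
tokens = ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']

# Remove unwanted tokens: keep only the purely alphabetic ones
alpha_tokens = [t for t in tokens if t.isalpha()]

# Match common words against a small, hypothetical word set
common_words = {'i', 'do', 'like'}
matched = [t for t in alpha_tokens if t.lower() in common_words]

print(alpha_tokens)  # ['I', 'do', 'like', 'Sam', 'shoes']
print(matched)       # ['I', 'do', 'like']
```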

## What are the other `nltk` tokenizers?

* `sent_tokenize`: tokenizes a document into sentences.

* `regexp_tokenize`: tokenizes a string or document based on a regular expression pattern.

* `TweetTokenizer`: a class designed for tweets; it lets you separate hashtags, mentions, and runs of exclamation points such as '!!!'.
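A minimal sketch of two of these tokenizers, assuming `nltk` is installed. The sample tweet is made up for illustration. `sent_tokenize` is left out because it also needs the downloadable Punkt sentence models.

```python
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Hypothetical sample tweet
tweet = "This is the best #nlp exercise ive found online! #python"

# regexp_tokenize keeps only the substrings that match the pattern
hashtags = regexp_tokenize(tweet, r"#\w+")

# TweetTokenizer keeps each hashtag together as a single token
tknzr = TweetTokenizer()
tweet_tokens = tknzr.tokenize(tweet)

print(hashtags)      # ['#nlp', '#python']
print(tweet_tokens)
```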

## Code of regex practice (the difference between `re.search()` and `re.match()`):

In [11]:
import re

re.match('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [12]:
re.search('abc', 'abcde')

<re.Match object; span=(0, 3), match='abc'>

In [13]:
re.match('cd', 'abcde')

In [14]:
re.search('cd', 'abcde')

<re.Match object; span=(2, 4), match='cd'>

## What are the common regex patterns?

![Common regex patterns](ref1.%20Common%20regex%20patterns.jpg)
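The table above is an image, so its exact contents are not reproduced here. The sketch below shows a handful of standard `re` patterns that such a table typically lists; which patterns to include and the sample text are assumptions.

```python
import re

# Hypothetical sample text
text = "abc 123 Hi"

words = re.findall(r"\w+", text)             # \w+   one or more word characters
digits = re.findall(r"\d", text)             # \d    a single digit
spaces = re.findall(r"\s", text)             # \s    a single whitespace character
chunks = re.findall(r"\S+", text)            # \S+   one or more non-whitespace characters
lowercase = re.findall(r"[a-z]", text)       # [a-z] a single lowercase letter
everything = re.match(r".*", text).group()   # .*    any characters (greedy wildcard)

print(words)       # ['abc', '123', 'Hi']
print(digits)      # ['1', '2', '3']
print(spaces)      # [' ', ' ']
print(chunks)      # ['abc', '123', 'Hi']
print(lowercase)   # ['a', 'b', 'c', 'i']
print(everything)  # abc 123 Hi
```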

## How to use Python's `re` module?

* `re` module:

    * `split`: split a string on a regex
    
    * `findall`: find all patterns in a string
    
    * `search`: search for a pattern anywhere in a string
    
    * `match`: match a pattern at the beginning of a string

* Pass the pattern as the first argument and the string as the second.

* Depending on the function, the result may be an iterator, a string, a list, or a match object.
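To make the argument order and return types concrete, here is a small sketch. The sample sentence is an assumption, and `re.finditer` is included only as one example of a function that returns an iterator.

```python
import re

# Hypothetical sample text
text = "Tokenize this, then that."

# The pattern comes first, the string second
parts = re.split(r",\s*", text)       # list of strings
found = re.findall(r"th\w+", text)    # list of all matching substrings
first = re.search(r"th\w+", text)     # match object for the first hit anywhere in the string
start = re.match(r"th\w+", text)      # None, because the string does not begin with 'th'
starts = [m.start() for m in re.finditer(r"th\w+", text)]  # finditer yields match objects

print(parts)         # ['Tokenize this', 'then that.']
print(found)         # ['this', 'then', 'that']
print(first.span())  # (9, 13)
print(start)         # None
print(starts)        # [9, 15, 20]
```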

## Code of Python's `re` module:

In [3]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

## Practice question for finding out the corresponding pattern:

* Which of the following regex patterns produces the following output?

    ```
    >>> my_string = "Let's write RegEx!"
    >>> re.findall(PATTERN, my_string)
    ['Let', 's', 'write', 'RegEx']
    ```

    $\Box$ `PATTERN = r"\s+"`.

    $\boxtimes$ `PATTERN = r"\w+"`.
    
    $\Box$ `PATTERN = r"[a-z]"`.
        
    $\Box$ `PATTERN = r"\w"`.

$\blacktriangleright$ **Package pre-loading:**

In [4]:
import re

$\blacktriangleright$ **Question-solving method:**

In [5]:
my_string = "Let's write RegEx!"
PATTERN = r"\s+"
re.findall(PATTERN, my_string)

[' ', ' ']

In [6]:
my_string = "Let's write RegEx!"
PATTERN = r"\w+"
re.findall(PATTERN, my_string)

['Let', 's', 'write', 'RegEx']

In [7]:
my_string = "Let's write RegEx!"
PATTERN = r"[a-z]"
re.findall(PATTERN, my_string)

['e', 't', 's', 'w', 'r', 'i', 't', 'e', 'e', 'g', 'x']

In [8]:
my_string = "Let's write RegEx!"
PATTERN = r"\w"
re.findall(PATTERN, my_string)

['L', 'e', 't', 's', 'w', 'r', 'i', 't', 'e', 'R', 'e', 'g', 'E', 'x']

## Practice exercises for introduction to regular expressions:

$\blacktriangleright$ **Package pre-loading:**

In [9]:
import re

$\blacktriangleright$ **Data pre-loading:**

In [10]:
my_string = "Let's write RegEx!  \
Won't that be fun?  \
I sure think so.  \
Can you find 4 sentences?  \
Or perhaps, all 19 words?"

$\blacktriangleright$ **Regular expressions (`re.split()` and `re.findall()`) practice:**

In [11]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
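The first result above keeps the leading spaces and ends with an empty string, because the split pattern consumes only the punctuation mark. One possible cleanup, offered as a sketch and not as part of the original exercise, is to let the pattern also consume the trailing whitespace and then drop empty pieces.

```python
import re

# Same text as the exercise, written with implicit string concatenation
my_string = ("Let's write RegEx!  Won't that be fun?  I sure think so.  "
             "Can you find 4 sentences?  Or perhaps, all 19 words?")

# Consume the sentence-ending mark plus any whitespace after it,
# then discard the empty string left after the final '?'
sentences = [s for s in re.split(r"[.?!]\s*", my_string) if s]

print(sentences)
# ["Let's write RegEx", "Won't that be fun", 'I sure think so',
#  'Can you find 4 sentences', 'Or perhaps, all 19 words']
```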
