```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Advanced tokenization with NLTK and regex

## How do you define regex groups and indicate "OR"?

* "OR" is represented using `|`.

* A group is defined using `()`; it bundles several alternatives into one unit.

* An explicit character range is defined using `[]`, for example `[a-z0-9]`.

## Code for regex groups and "OR":

In [2]:
import re

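# Raw string, so the backslashes in \d and \w reach the regex engine unchanged.
# The group matches either a run of digits or a run of word characters.
match_digits_and_words = r'(\d+|\w+)'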
re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']
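
The parentheses in the pattern above also change what `re.findall()` returns. The following is a minimal sketch using only the standard `re` module; the variable names and the two-group pattern are illustrative and not part of the original notes.

```python
import re

text = 'He has 11 cats.'

# One capturing group around the whole alternation: findall returns the group's text.
single_group = re.findall(r'(\d+|\w+)', text)

# No group: findall returns each full match, which is the same text here.
no_group = re.findall(r'\d+|\w+', text)

# Two capturing groups: findall returns one tuple per match, with an
# empty string for the group that did not take part in that match.
two_groups = re.findall(r'(\d+)|([a-zA-Z]+)', text)

print(single_group)
print(no_group)
print(two_groups)
```

With a single group that wraps the entire pattern, the result is a flat list of strings, which is why the tokenizer patterns in these notes can keep their outer parentheses.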

## What are regex ranges and groups?

![Regex ranges and groups.jpg](ref3.%20Regex%20ranges%20and%20groups.jpg)

## Code for a character range with `re.match()`:

In [3]:
import re

my_str = 'match lowercase spaces nums like 12, but no commas'
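# re.match() starts at index 0 and stops at the first character
# outside the range, which is the comma after '12'.
re.match('[a-z0-9 ]+', my_str)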

<re.Match object; span=(0, 35), match='match lowercase spaces nums like 12'>
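
The match above ends at the comma because `,` is not in the range `[a-z0-9 ]`. As a hedged illustration of how `re.match()` differs from scanning the whole string, the sketch below reuses the same pattern; the second input string and the variable names are assumptions made for the example.

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'
pattern = r'[a-z0-9 ]+'

# re.match only tries at the start of the string.
prefix = re.match(pattern, my_str)
print(prefix.group())

# A string that begins with an excluded character gives no match at all.
no_match = re.match(pattern, ', but no commas')
print(no_match)

# re.findall scans the whole string and returns every run of allowed characters.
runs = re.findall(pattern, my_str)
print(runs)
```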

## Practice question for choosing a tokenizer:

* Given the following string, which of the patterns below is the best tokenizer? Ideally, sentence punctuation is retained as separate tokens and `'#1'` remains a single token.

    ```
    my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
    ```
    
    $\Box$ `r"(\w+|\?|!)"`.

    $\boxtimes$ `r"(\w+|#\d|\?|!)"`.
    
    $\Box$ `r"(#\d\w+\?!)"`.
        
    $\Box$ `r"\s+"`.

$\blacktriangleright$ **Package pre-loading:**

In [10]:
from nltk.tokenize import regexp_tokenize

$\blacktriangleright$ **Data pre-loading:**

In [11]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
string = my_string

pattern1 = r"(\w+|\?|!)"
pattern2 = r"(\w+|#\d|\?|!)"
pattern3 = r"(#\d\w+\?!)"
pattern4 = r"\s+"

$\blacktriangleright$ **Question-solving method:**

In [12]:
pattern = pattern1
print(regexp_tokenize(string, pattern))

['SOLDIER', '1', 'Found', 'them', '?', 'In', 'Mercea', '?', 'The', 'coconut', 's', 'tropical', '!']


In [13]:
pattern = pattern2
print(regexp_tokenize(string, pattern))

['SOLDIER', '#1', 'Found', 'them', '?', 'In', 'Mercea', '?', 'The', 'coconut', 's', 'tropical', '!']
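
Note that `#\d` matches exactly one digit after the `#`. The sketch below shows what that would mean for a two-digit speaker number; the input line and the `#\d+` variant are hypothetical and do not come from the exercise, and the standard-library `re.findall` stands in for `regexp_tokenize`.

```python
import re

# Hypothetical variant of the line with a two-digit number.
line = "SOLDIER #12: Found them?"

# Original alternative: '#' followed by a single digit.
one_digit = re.findall(r"(\w+|#\d|\?|!)", line)

# Assumed variant: '#' followed by one or more digits.
many_digits = re.findall(r"(\w+|#\d+|\?|!)", line)

print(one_digit)
print(many_digits)
```

For the string in the exercise, both versions give the same tokens, since the only number is `#1`.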


In [14]:
pattern = pattern3
print(regexp_tokenize(string, pattern))

[]
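
The empty result follows from the pattern having no `|`: it describes one fixed sequence (`#`, a digit, word characters, `?`, `!`) rather than a choice between alternatives. A small sketch, using `re.findall` from the standard library and a contrived string that is not part of the exercise:

```python
import re

pattern3 = r"(#\d\w+\?!)"
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

# No part of the original string contains all five pieces back to back.
in_original = re.findall(pattern3, my_string)

# A hypothetical string built to contain the full sequence does match.
in_contrived = re.findall(pattern3, "tag #1abc?! end")

print(in_original)
print(in_contrived)
```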


In [15]:
pattern = pattern4
print(regexp_tokenize(string, pattern))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
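
The fourth pattern returns only spaces because the tokenizer is asked to *match* whitespace rather than to split on it. As a rough sketch of the opposite use, the code below treats `\s+` as a separator with the standard-library `re.split`. NLTK's `regexp_tokenize` has a `gaps` argument meant for this purpose, but this example deliberately avoids relying on it.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

# Use the whitespace pattern as a separator instead of as a token.
pieces = re.split(r"\s+", my_string)
print(pieces)
```

Even as a separator, `\s+` leaves punctuation attached to the neighbouring words (`'them?'`, `'tropical!'`), so it still does not meet the goal of keeping sentence punctuation as separate tokens.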
