```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Charting word length with `NLTK`

## How to make regex groups and how to indicate "OR"?

* "OR" is represented using `|`.

* A group is defined using `()`.

* An explicit character range is defined using `[]`.

## Code for regex groups and "OR":

In [1]:
import re

# Raw string so that \d and \w reach the regex engine unchanged
match_digits_and_words = r'(\d+|\w+)'
re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']
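
How the groups are written changes what `re.findall()` returns. The sketch below uses only the standard `re` module; the variable names are illustrative and not part of the original notes.

```python
import re

text = 'He has 11 cats.'

# One group around the whole alternation: findall returns that group's text.
one_group = re.findall(r'(\d+|\w+)', text)

# Two separate groups: findall returns one tuple per match,
# with '' for the group that did not take part in the match.
two_groups = re.findall(r'(\d+)|(\w+)', text)

# A non-capturing group (?:...) groups without capturing,
# so findall returns the full match again.
non_capturing = re.findall(r'(?:\d+|\w+)', text)

print(one_group)
print(two_groups)
print(non_capturing)
```

Wrapping the whole alternation in a single group, as the cell above does, keeps the result a flat list of strings.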

## What are regex ranges and groups?

![Regex ranges and groups.jpg](ref3.%20Regex%20ranges%20and%20groups.jpg)

## Code for a character range with `re.match()`:

In [2]:
import re

my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<re.Match object; span=(0, 35), match='match lowercase spaces nums like 12'>
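
The match above stops at position 35 because the comma is outside the range `[a-z0-9 ]`, and `re.match()` only tries the pattern at the start of the string. A small sketch of that behaviour, with `re.findall()` shown for contrast (the second test string is a made-up example):

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'
char_range = r'[a-z0-9 ]+'

# re.match only tries the pattern at position 0 and stops at the comma.
first = re.match(char_range, my_str)
print(first.group())

# re.match returns None when the very first character is outside the range.
print(re.match(char_range, 'Match starts with a capital'))

# re.findall keeps scanning, so it also returns the run after the comma.
runs = re.findall(char_range, my_str)
print(runs)
```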

## Practice question for choosing a tokenizer:

* Given the following string, which of the patterns below is the best tokenizer? Ideally, sentence punctuation is retained as separate tokens while `'#1'` remains a single token.

    ```
    my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
    ```
    
    $\Box$ `r"(\w+|\?|!)"`.

    $\boxtimes$ `r"(\w+|#\d|\?|!)"`.
    
    $\Box$ `r"(#\d\w+\?!)"`.
        
    $\Box$ `r"\s+"`.

$\blacktriangleright$ **Package pre-loading:**

In [3]:
from nltk.tokenize import regexp_tokenize

$\blacktriangleright$ **Data pre-loading:**

In [4]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
string = my_string

pattern1 = r"(\w+|\?|!)"
pattern2 = r"(\w+|#\d|\?|!)"
pattern3 = r"(#\d\w+\?!)"
pattern4 = r"\s+"

$\blacktriangleright$ **Question-solving method:**

In [5]:
pattern = pattern1
print(regexp_tokenize(string, pattern))

['SOLDIER', '1', 'Found', 'them', '?', 'In', 'Mercea', '?', 'The', 'coconut', 's', 'tropical', '!']


In [6]:
pattern = pattern2
print(regexp_tokenize(string, pattern))

['SOLDIER', '#1', 'Found', 'them', '?', 'In', 'Mercea', '?', 'The', 'coconut', 's', 'tropical', '!']


In [7]:
pattern = pattern3
print(regexp_tokenize(string, pattern))

[]


In [8]:
pattern = pattern4
print(regexp_tokenize(string, pattern))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
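
The two criteria from the question can also be checked programmatically. This is a minimal sketch that uses `re.findall()` as a stand-in for `regexp_tokenize()`, on the assumption that both return the same tokens for these four patterns (the outputs above are consistent with that); the `results` dictionary is a hypothetical helper.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

patterns = {
    'pattern1': r"(\w+|\?|!)",
    'pattern2': r"(\w+|#\d|\?|!)",
    'pattern3': r"(#\d\w+\?!)",
    'pattern4': r"\s+",
}

results = {}
for name, pattern in patterns.items():
    tokens = re.findall(pattern, my_string)
    # Criterion 1: '#1' survives as a single token.
    keeps_hash = '#1' in tokens
    # Criterion 2: sentence punctuation appears as separate tokens.
    keeps_punct = '?' in tokens and '!' in tokens
    results[name] = (keeps_hash, keeps_punct)
    print(name, keeps_hash, keeps_punct)
```

Only `pattern2` satisfies both criteria, which matches the answer marked in the question.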


## Practice exercises for advanced tokenization with `NLTK` and regex:

$\blacktriangleright$ **Data pre-loading:**

In [9]:
tweets = [
    'This is the best #nlp exercise ive found online! #python',
    '#NLP is super fun! <3 #learning', 'Thanks @datacamp :) #nlp #python'
]

$\blacktriangleright$ **`NLTK` regex tokenization practice:**

In [10]:
# Import the necessary modules
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import regexp_tokenize

In [11]:
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

['#nlp', '#python']


In [12]:
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

['@datacamp', '#nlp', '#python']


In [13]:
# Use the TweetTokenizer to tokenize all tweets into a list of token lists
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]
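
The result above is nested, with one inner list per tweet. If a single flat list of tokens is wanted, one option is `itertools.chain.from_iterable()`. The sketch below copies the printed output rather than calling `TweetTokenizer`, so it runs without `NLTK`; `flat_tokens` is an illustrative name.

```python
from itertools import chain

# Nested result copied from the TweetTokenizer output above.
all_tokens = [
    ['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'],
    ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'],
    ['Thanks', '@datacamp', ':)', '#nlp', '#python'],
]

# Concatenate the per-tweet lists into one flat list of tokens.
flat_tokens = list(chain.from_iterable(all_tokens))
print(len(flat_tokens))
print(flat_tokens[:5])
```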


$\blacktriangleright$ **Package pre-loading:**

In [14]:
from nltk.tokenize import word_tokenize

$\blacktriangleright$ **Data pre-loading:**

In [15]:
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

$\blacktriangleright$ **Non-ASCII tokenization practice:**

In [16]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
# Ranges sit side by side in one character class; quotes and '|' inside [] would be matched literally
emoji = ("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F"
         "\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]")

print(regexp_tokenize(german_text, emoji))

['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']
['Wann', 'Pizza', 'Und', 'Über']
['🍕', '🚕']
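
Inside a character class `[...]`, the characters `|` and `'` are ordinary literals rather than operators, so Unicode ranges are best listed back to back. A minimal sketch with the standard `re` module (used here as an assumption-free stand-in for `regexp_tokenize()`; `emoji_class` is an illustrative name):

```python
import re

german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

# Ranges listed back to back inside one character class.
emoji_class = (
    "[\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\u2600-\u26FF\u2700-\u27BF]"
)
print(re.findall(emoji_class, german_text))

# Inside [...], '|' and quote characters are literals, not operators.
print(re.findall(r"[a|']", "a|b'c"))
```

A class that contains quotes or pipes would therefore also return apostrophes and `|` characters as tokens whenever the text contains them.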
