```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Introduction to regular expressions

## What exactly are regular expressions?

* Strings with a special syntax

* Allow matching patterns in other strings, e.g., 

    * *find all web links in a document*
    
    * *parse email addresses*
    
    * *remove/replace unwanted characters*
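
As a rough illustration of the three uses listed above, the following sketch runs each one on a made-up sentence. The sample text, the variable names, and the deliberately simplified link and email patterns are assumptions for this example only; real-world URLs and addresses need stricter patterns. `re.sub` is not covered elsewhere in these notes, but it is the standard `re` function for replacing matches.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact ana@example.com or visit https://example.com/docs and http://example.org today!!!"

# Find web links: "http" or "https", then "://", then a run of non-whitespace.
links = re.findall(r"https?://\S+", text)

# Parse email addresses with a simplified pattern (not a full validator).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

# Remove unwanted characters: drop everything that is not a word
# character or whitespace.
cleaned = re.sub(r"[^\w\s]", "", "Hello, world!!!")

print(links)    # ['https://example.com/docs', 'http://example.org']
print(emails)   # ['ana@example.com']
print(cleaned)  # Hello world
```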

## Example code for applying regular expressions:

In [1]:
import re

re.match('abc', 'abcdef')

<re.Match object; span=(0, 3), match='abc'>

In [2]:
word_regex = r'\w+'  # raw string, so the backslash reaches the regex engine unchanged
re.match(word_regex, 'hi there!')

<re.Match object; span=(0, 2), match='hi'>

## What are the common RegEx patterns?

![Common RegEx patterns](ref1.%20Common%20RegEx%20patterns.jpg)
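
The figure's contents are not reproduced in text here. As a hedged sketch, and assuming it covers the usual character classes and quantifiers (`\w`, `\d`, `\s`, `\S`, `.`, `+`, `*`, and bracketed sets such as `[a-z]`), the snippet below shows what each one matches on a made-up sample string.

```python
import re

# Hypothetical sample string, invented for this illustration.
sample = "User_7 has 42 apples."

words = re.findall(r"\w+", sample)         # runs of letters, digits, underscore
digits = re.findall(r"\d", sample)         # single digits
spaces = re.findall(r"\s", sample)         # single whitespace characters
non_space = re.findall(r"\S+", sample)     # runs of non-whitespace
lowercase = re.findall(r"[a-z]+", sample)  # runs of lowercase ASCII letters
everything = re.match(r".*", sample).group()  # "." is any character, "*" is zero or more

print(words)       # ['User_7', 'has', '42', 'apples']
print(digits)      # ['7', '4', '2']
print(spaces)      # [' ', ' ', ' ']
print(non_space)   # ['User_7', 'has', '42', 'apples.']
print(lowercase)   # ['ser', 'has', 'apples']
print(everything)  # User_7 has 42 apples.
```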

## How to use Python's `re` module?

* `re` module:

    * `split`: split a string on RegEx
    
    * `findall`: find all matches of a pattern in a string
    
    * `search`: search for the first match of a pattern anywhere in a string
    
    * `match`: match a pattern only at the beginning of a string

* Pass the pattern as the first argument and the string as the second.

* Depending on the function, the result may be a list, an iterator, a string, or a match object (or `None` when nothing matches).
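
A minimal sketch of what each function returns, using a made-up string (the text and variable names are assumptions for illustration). It also shows the practical difference between `search` and `match`, and includes `re.finditer`, which is not listed above but is the `re` function that returns an iterator.

```python
import re

# Hypothetical sample string, invented for this illustration.
text = "hi there, 2 times"

# split -> list of the pieces between matches of the pattern
pieces = re.split(r"\s+", text)

# findall -> list of every non-overlapping match
numbers = re.findall(r"\d+", text)

# search -> match object for the first match anywhere in the string
found = re.search(r"there", text)

# match -> only tries at the start of the string, so this is None
not_at_start = re.match(r"there", text)

# finditer -> iterator of match objects, consumed here into a list of strings
words = [m.group() for m in re.finditer(r"\w+", text)]

print(pieces)        # ['hi', 'there,', '2', 'times']
print(numbers)       # ['2']
print(found.span())  # (3, 8)
print(not_at_start)  # None
print(words)         # ['hi', 'there', '2', 'times']
```

Because `search` and `match` return `None` on failure, it is usually safest to check the result before calling `.group()` or `.span()` on it.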

## Example code for Python's `re` module:

In [3]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

## Practice question for finding out the corresponding pattern:

* Which of the following RegEx patterns produces the output shown below?

    ```
    >>> my_string = "Let's write RegEx!"
    >>> re.findall(PATTERN, my_string)
    ['Let', 's', 'write', 'RegEx']
    ```

    $\Box$ `PATTERN = r"\s+"`.

    $\boxtimes$ `PATTERN = r"\w+"`.
    
    $\Box$ `PATTERN = r"[a-z]"`.
        
    $\Box$ `PATTERN = r"\w"`.

$\blacktriangleright$ **Package pre-loading:**

In [4]:
import re

$\blacktriangleright$ **Question-solving method:**

In [5]:
my_string = "Let's write RegEx!"
PATTERN = r"\s+"
re.findall(PATTERN, my_string)

[' ', ' ']

In [6]:
my_string = "Let's write RegEx!"
PATTERN = r"\w+"
re.findall(PATTERN, my_string)

['Let', 's', 'write', 'RegEx']

In [7]:
my_string = "Let's write RegEx!"
PATTERN = r"[a-z]"
re.findall(PATTERN, my_string)

['e', 't', 's', 'w', 'r', 'i', 't', 'e', 'e', 'g', 'x']

In [8]:
my_string = "Let's write RegEx!"
PATTERN = r"\w"
re.findall(PATTERN, my_string)

['L', 'e', 't', 's', 'w', 'r', 'i', 't', 'e', 'R', 'e', 'g', 'E', 'x']

## Practice exercises for introduction to regular expressions:

$\blacktriangleright$ **Package pre-loading:**

In [9]:
import re

$\blacktriangleright$ **Data pre-loading:**

In [10]:
my_string = "Let's write RegEx!  \
Won't that be fun?  \
I sure think so.  \
Can you find 4 sentences?  \
Or perhaps, all 19 words?"

$\blacktriangleright$ **Regular expressions (`re.split()` and `re.findall()`) practice:**

In [11]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find capitalized words of two or more characters (the single-letter "I" is not matched) and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
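
The split output above keeps leading spaces and a trailing empty string, and the capitalized-word pattern skips the single-letter word "I" because `\w+` needs at least one character after the capital. The sketch below is one possible way to tidy both results; the variable names and the `\w*` variant are assumptions for illustration, not part of the original exercise.

```python
import re

# Same text as the exercise, built with implicit string concatenation.
my_string = (
    "Let's write RegEx!  "
    "Won't that be fun?  "
    "I sure think so.  "
    "Can you find 4 sentences?  "
    "Or perhaps, all 19 words?"
)

# Split on sentence endings, then strip whitespace and drop empty pieces.
sentences = [s.strip() for s in re.split(r"[.?!]", my_string) if s.strip()]

# "\w*" allows zero following characters, so one-letter words also match.
capitalized = re.findall(r"[A-Z]\w*", my_string)

print(sentences)
print(capitalized)  # ['Let', 'RegEx', 'Won', 'I', 'Can', 'Or']
```

With this cleanup the list holds five sentence strings and no empty entry.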
