```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§1 Introduction to Natural Language Processing in Python

§1.1 Regular expressions & word tokenization
```

# Introduction to regular expressions

## What is Natural Language Processing?

* Natural Language Processing (NLP) is a field of study focused on making sense of language using statistics and computers.

* The basics of NLP include:

    * *topic identification*
    
    * *text classification*

* NLP applications include:

    * *chatbots*

    * *translation*

    * *sentiment analysis*

    * *and many more*

## What exactly are regular expressions?

* Strings with a special syntax

* Allow matching patterns in other strings, e.g., 

    * *find all web links in a document*
    
    * *parse email addresses*
    
    * *remove/replace unwanted characters*
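
The three tasks listed above can be sketched with Python's `re` module. The sample text is made up for illustration, and the URL and email patterns are deliberately simplified assumptions; production code would need stricter patterns.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Contact us at info@example.com or visit https://example.com/docs today!!!"

# Find web links: 'http' or 'https', then '://', then a run of non-whitespace.
links = re.findall(r"https?://\S+", text)

# Parse email addresses with a simplified pattern (an assumption, not a full validator).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

# Remove every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", "Hello, world!!!")

print(links)    # ['https://example.com/docs']
print(emails)   # ['info@example.com']
print(cleaned)  # Hello world
```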

## Code of the applications of regular expressions:

In [1]:
import re

re.match('abc', 'abcdef')

<re.Match object; span=(0, 3), match='abc'>

In [2]:
word_regex = r'\w+'  # raw string, so '\w' reaches the regex engine unchanged
re.match(word_regex, 'hi there!')

<re.Match object; span=(0, 2), match='hi'>

## What are the common regex patterns?

![Common regex patterns](ref1.%20Common%20regex%20patterns.jpg)
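
As a text complement to the figure, the sketch below exercises a few standard character classes and quantifiers from Python's `re` syntax. The sample sentence is made up for illustration, and the selection of patterns is an assumption; it may not match the figure row for row.

```python
import re

# Hypothetical sample sentence, made up for this illustration.
sample = "Order 66 shipped on day 4!"

words = re.findall(r"\w+", sample)         # \w+    : one or more word characters
digits = re.findall(r"\d", sample)         # \d     : a single digit
chunks = re.findall(r"\S+", sample)        # \S+    : one or more non-whitespace characters
lowercase = re.findall(r"[a-z]+", sample)  # [a-z]+ : runs of lowercase ASCII letters

print(words)      # ['Order', '66', 'shipped', 'on', 'day', '4']
print(digits)     # ['6', '6', '4']
print(chunks)     # ['Order', '66', 'shipped', 'on', 'day', '4!']
print(lowercase)  # ['rder', 'shipped', 'on', 'day']
```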

## How to use Python's `re` module?

* `re` module:

    * `split`: split a string on regex
    
    * `findall`: find all patterns in a string
    
    * `search`: search for a pattern
    
    * `match`: match a pattern at the beginning of a string (the match may cover the entire string or only a leading substring)

* Pass the pattern as the first argument and the string as the second.

* Depending on the function, the result may be a list, an iterator, a string, or a match object.
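
A minimal sketch of what each function in the list above returns, all called with the pattern first and the string second. `re.finditer` does not appear in the list; it is included here only as an assumption about which function yields an iterator.

```python
import re

text = "Split on spaces."

parts = re.split(r"\s+", text)    # list of strings
words = re.findall(r"\w+", text)  # list of strings
first = re.search(r"on", text)    # match object, or None if nothing matches
start = re.match(r"Split", text)  # match object, or None if the start does not match
lazy = re.finditer(r"\w+", text)  # iterator of match objects (not in the list above)

print(parts)                        # ['Split', 'on', 'spaces.']
print(words)                        # ['Split', 'on', 'spaces']
print(first.span(), first.group())  # (6, 8) on
print(start.group())                # Split
```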

## Code of Python's `re` module:

In [3]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

## Practice question for finding out the corresponding pattern:

* Which of the following regex patterns produces the output shown below?

    ```
    >>> my_string = "Let's write RegEx!"
    >>> re.findall(PATTERN, my_string)
    ['Let', 's', 'write', 'RegEx']
    ```

    $\Box$ `PATTERN = r"\s+"`.

    $\boxtimes$ `PATTERN = r"\w+"`.
    
    $\Box$ `PATTERN = r"[a-z]"`.
        
    $\Box$ `PATTERN = r"\w"`.

$\blacktriangleright$ **Package pre-loading:**

In [4]:
import re

$\blacktriangleright$ **Data pre-loading:**

In [5]:
my_string = "Let's write RegEx!"

$\blacktriangleright$ **Question-solving method:**

In [6]:
PATTERN = r"\s+"
re.findall(PATTERN, my_string)

[' ', ' ']

In [7]:
PATTERN = r"\w+"
re.findall(PATTERN, my_string)

['Let', 's', 'write', 'RegEx']

In [8]:
PATTERN = r"[a-z]"
re.findall(PATTERN, my_string)

['e', 't', 's', 'w', 'r', 'i', 't', 'e', 'e', 'g', 'x']

In [9]:
PATTERN = r"\w"
re.findall(PATTERN, my_string)

['L', 'e', 't', 's', 'w', 'r', 'i', 't', 'e', 'R', 'e', 'g', 'E', 'x']

## Practice exercises for introduction to regular expressions:

$\blacktriangleright$ **Package pre-loading:**

In [10]:
import re

$\blacktriangleright$ **Data pre-loading:**

In [11]:
my_string = "Let's write RegEx! Won't that be fun? I sure think so. \
Can you find 4 sentences? Or perhaps, all 19 words?"

$\blacktriangleright$ **Regular expressions (`re.split()` and `re.findall()`) practice:**

In [12]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[\.\?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", " Won't that be fun", ' I sure think so', ' Can you find 4 sentences', ' Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
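
Note that the pattern `[A-Z]\w+` requires at least two characters, so the one-letter word `I` is missing from the second line of output. A possible variant, offered as an assumption about the intended behavior rather than as part of the original exercise, adds a word boundary and uses `\w*`:

```python
import re

my_string = ("Let's write RegEx! Won't that be fun? I sure think so. "
             "Can you find 4 sentences? Or perhaps, all 19 words?")

# \b anchors the match at a word boundary; \w* allows zero further
# characters, so one-letter words such as "I" are matched too.
capitalized_words = r"\b[A-Z]\w*"
result = re.findall(capitalized_words, my_string)

print(result)  # ['Let', 'RegEx', 'Won', 'I', 'Can', 'Or']
```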


## Appendixes

### Appendix A: Version Checking:

In [13]:
import sys

print('The Python version is {}.'.format(sys.version))

The Python version is 3.7.9 (default, Aug 31 2020, 07:22:35) 
[Clang 10.0.0 ].


### Appendix B: Syntax Reference of [Python 3.7.9](https://docs.python.org/3.7):

#### [`re.findall(pattern, string, flags=0)`](https://docs.python.org/3.7/library/re.html#re.findall):

* Return all non-overlapping matches of `pattern` in `string`, as a list of strings. The `string` is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
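
The effect of groups on the return value can be sketched as follows. The sample string is made up for illustration.

```python
import re

# Hypothetical sample string, made up for this illustration.
text = "width=80, height=24"

no_group = re.findall(r"\w+=\d+", text)        # no groups: whole matches
one_group = re.findall(r"\w+=(\d+)", text)     # one group: list of that group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)  # two groups: list of tuples

print(no_group)    # ['width=80', 'height=24']
print(one_group)   # ['80', '24']
print(two_groups)  # [('width', '80'), ('height', '24')]
```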

#### [`re.match(pattern, string, flags=0)`](https://docs.python.org/3.7/library/re.html#re.match):

* If zero or more characters at the beginning of `string` match the regular expression `pattern`, return a corresponding [match object](https://docs.python.org/3.7/library/re.html#match-objects). Return `None` if the string does not match the pattern; note that this is different from a zero-length match.

* Note that even in [`MULTILINE`](https://docs.python.org/3.7/library/re.html#re.MULTILINE) mode, `re.match()` will only match at the beginning of the string and not at the beginning of each line.

* If you want to locate a match anywhere in `string`, use `search()` instead (see also `search()` vs. `match()`).
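
The difference between `None` and a zero-length match, mentioned above, can be sketched like this. The input string is an arbitrary example.

```python
import re

# \d* may match zero digits, so it succeeds with an empty match at position 0.
empty = re.match(r"\d*", "abc")

# \d+ needs at least one digit at the start, so there is no match at all.
nothing = re.match(r"\d+", "abc")

print(empty.span(), repr(empty.group()))  # (0, 0) ''
print(nothing)                            # None
```

A zero-length match is still a match object, so a test such as `if empty:` succeeds, while `if nothing:` does not.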

#### [`re.search(pattern, string, flags=0)`](https://docs.python.org/3.7/library/re.html#re.search):

* Scan through `string` looking for the first location where the regular expression `pattern` produces a match, and return a corresponding [match object](https://docs.python.org/3.7/library/re.html#match-objects). Return `None` if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
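
A short sketch of how `re.search()` reports only the first location, using a fragment of the practice string from earlier in this section:

```python
import re

text = "Can you find 4 sentences? Or perhaps, all 19 words?"

# Only the first run of digits is reported, even though '19' also matches.
found = re.search(r"\d+", text)

# No digit anywhere in the string: the result is None.
missing = re.search(r"\d+", "no digits here")

print(found.group(), found.span())  # 4 (13, 14)
print(missing)                      # None
```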

#### [`re.split(pattern, string, maxsplit=0, flags=0)`](https://docs.python.org/3.7/library/re.html#re.split):

* Split `string` by the occurrences of `pattern`. If capturing parentheses are used in `pattern`, then the text of all groups in the pattern are also returned as part of the resulting list. If `maxsplit` is nonzero, at most `maxsplit` splits occur, and the remainder of the string is returned as the final element of the list.

In [14]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [15]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [16]:
re.split(r'\W+', 'Words, words, words.', maxsplit=1)

['Words', 'words, words.']

In [17]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

* If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

In [18]:
re.split(r'(\W+)', '...words, words...')

['', '...', 'words', ', ', 'words', '...', '']

* That way, separator components are always found at the same relative indices within the result list.

* Empty matches for the pattern split the string only when not adjacent to a previous empty match.

In [19]:
re.split(r'\b', 'Words, words, words.')

['', 'Words', ', ', 'words', ', ', 'words', '.']

In [20]:
re.split(r'\W*', '...words...')

['', '', 'w', 'o', 'r', 'd', 's', '', '']

In [21]:
re.split(r'(\W*)', '...words...')

['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

#### [`search()` vs. `match()`](https://docs.python.org/3.7/library/re.html#search-vs-match):

* Python offers two different primitive operations based on regular expressions: `re.match()` checks for a match only at the beginning of the string, while `re.search()` checks for a match anywhere in the string (this is what Perl does by default).

* For example:

In [22]:
re.match("c", "abcdef")  # No match

In [23]:
re.search("c", "abcdef")  # Match

<re.Match object; span=(2, 3), match='c'>

* Regular expressions beginning with `'^'` can be used with `search()` to restrict the match at the beginning of the string:

In [24]:
re.match("c", "abcdef")  # No match

In [25]:
re.search("^c", "abcdef")  # No match

In [26]:
re.search("^a", "abcdef")  # Match

<re.Match object; span=(0, 1), match='a'>

* Note however that in [`MULTILINE`](https://docs.python.org/3.7/library/re.html#re.MULTILINE) mode `match()` only matches at the beginning of the string, whereas using `search()` with a regular expression beginning with `'^'` will match at the beginning of each line.

In [27]:
re.match('X', 'A\nB\nX', re.MULTILINE)  # No match

In [28]:
re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match

<re.Match object; span=(4, 5), match='X'>
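
The same line-anchoring behavior can be observed with `re.findall()`, which makes the effect of `re.MULTILINE` on `'^'` easy to compare side by side. This is a supplementary sketch, not part of the quoted reference text.

```python
import re

text = "A\nB\nX"

# Without MULTILINE, '^' matches only at the very start of the string.
default_starts = re.findall(r"^\w", text)

# With MULTILINE, '^' also matches right after each newline.
line_starts = re.findall(r"^\w", text, flags=re.MULTILINE)

print(default_starts)  # ['A']
print(line_starts)     # ['A', 'B', 'X']
```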