# Regular Expressions

Regular expressions (`regex` for short) are patterns used to search for text within strings. Python's built-in [re](https://docs.python.org/3/library/re.html) module provides many useful functions for working with `regex`.

## Normal way to check if a phrase is in a text

In [1]:
text = 'The phone number of the agent is 230-111-1234. The phone is available 24/7. Call soon! You can also call 123-456-7890.'
text

'The phone number of the agent is 230-111-1234. The phone is available 24/7. Call soon! You can also call 123-456-7890.'

In [2]:
'230-111-1234' in text

True

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
# re.search() returns only the first match (or None if there is no match)
first_match = re.search(pattern, text)
first_match

<re.Match object; span=(4, 9), match='phone'>

In [6]:
first_match.span()

(4, 9)

In [7]:
first_match.start()

4

In [8]:
first_match.end()

9

In [9]:
# use re.findall(pattern, text) to search for all matches
all_matches = re.findall(pattern, text)
all_matches

['phone', 'phone']

In [10]:
# re.finditer(pattern, text) returns an iterator of match objects
for match in re.finditer(pattern, text):
    print(match.span()) # match.span() returns a tuple containing the start and end indices of the match
    start_index, end_index = match.span()
    print(text[start_index:end_index])
    print(match.group()) # match.group() is another way to get the actual text that matched

(4, 9)
phone
phone
(51, 56)
phone
phone
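
One detail the cells above rely on implicitly: `re.search` returns `None` when the pattern does not occur, so calling `.span()` on the result would raise an `AttributeError`. The following is a minimal sketch of a guarded lookup; `find_span` and `sample` are hypothetical names used only for illustration.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

sample = 'The phone is available 24/7.'
print(find_span('phone', sample))  # (4, 9)
print(find_span('email', sample))  # None
```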


## Patterns

### Identifiers for Characters in Patterns

Classes of characters, such as digits or whitespace, are represented by special codes. We can combine these codes to build up a pattern string. Notice that they make heavy use of the backslash `\`. Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'

Placing the `r` in front of the string makes it a raw string, so Python treats each `\` in the pattern as a literal backslash rather than the start of an escape sequence.
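
To make the effect of the `r` prefix concrete, here is a small sketch comparing a normal string literal with a raw one. The variable names are illustrative only, and `\b` is chosen because it is an escape sequence (backspace) that Python would otherwise interpret itself.

```python
import re

plain = '\b'   # normal literal: Python turns \b into one backspace character
raw = r'\b'    # raw literal: a backslash followed by the letter b

print(len(plain))  # 1
print(len(raw))    # 2

# Doubling every backslash by hand produces the same string as the raw form
print(r'\d\d' == '\\d\\d')  # True

# The raw form is what we pass to the re module
print(re.findall(r'\d\d', 'file_25'))  # ['25']
```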

Below is a table of the most commonly used identifiers:

<table >
<tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>
<tr><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>
<tr><td><span >\w</span></td><td>Alphanumeric (or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>
<tr><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>
<tr><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>
<tr><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>
<tr><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr>
</table>
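
As a quick check, each row of the table can be verified in code. This sketch uses `re.fullmatch`, which is not used elsewhere in this notebook; it succeeds only when the entire string matches the pattern. The `examples` list is simply the table's pattern and match columns copied over.

```python
import re

# (pattern, example match) pairs copied from the table above
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

for pattern, example in examples:
    # fullmatch returns a match object on success and None otherwise
    matched = re.fullmatch(pattern, example) is not None
    print(pattern, '->', example, matched)
```

Every row should print `True`. Note that `\w` also accepts the underscore, which is why `A-b_1` matches.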

In [11]:
phone_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [12]:
first_phone_number = re.search(phone_pattern, text)
first_phone_number

<re.Match object; span=(33, 45), match='230-111-1234'>

In [13]:
first_phone_number.group()

'230-111-1234'

In [14]:
all_phone_numbers = re.finditer(phone_pattern, text)
for phone in all_phone_numbers:
    print(phone.group())

230-111-1234
123-456-7890


In [15]:
# Another way to search for phone numbers
print(re.findall(phone_pattern, text))

['230-111-1234', '123-456-7890']


### Quantifiers

Notice the repetition of `\d` in the pattern above. That becomes tedious, especially when we are looking for long runs of digits. Now that we know the special character designations, we can combine them with quantifiers to specify how many repetitions we expect.

| Character | Description | Example Pattern Code | Example Match |
|-----------|-------------|----------------------|---------------|
| + | Occurs one or more times | Version \w-\w+ | Version A-b1_1 |
| {3} | Occurs exactly 3 times | \D{3} | abc |
| {2,4} | Occurs 2 to 4 times | \d{2,4} | 123 |
| {3,} | Occurs 3 or more times | \w{3,} | anycharacters |
| \* | Occurs zero or more times | A\*B\*C\* | AAACC |
| ? | Occurs once or not at all | plurals? | plural |
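
The rows of the quantifier table can be checked the same way. This is a sketch using `re.fullmatch` (whole-string matching, not otherwise covered here), with the `examples` list copied from the table.

```python
import re

# (pattern, example match) pairs copied from the quantifier table
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),   # B occurs zero times here
    (r'plurals?', 'plural'),
]

for pattern, example in examples:
    print(pattern, '->', example, bool(re.fullmatch(pattern, example)))

# A quantifier also sets an upper bound: five digits are too many for {2,4}
print(re.fullmatch(r'\d{2,4}', '12345'))       # None
# search, unlike fullmatch, still finds the first four digits inside the string
print(re.search(r'\d{2,4}', '12345').group())  # 1234
```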

In [16]:
phone_quantified_pattern = r'\d{3}-\d{3}-\d{4}'

In [17]:
print(re.findall(phone_quantified_pattern, text))

['230-111-1234', '123-456-7890']


### Groups

What if we wanted to do two tasks: find phone numbers and also quickly extract their area codes (the first three digits)? We can use groups for any task that involves grouping parts of a regular expression together so that we can later pull them apart. Groups are written with parentheses, in the form:

    r'(group1)(group2)'

For the phone number pattern, each run of digits gets its own group:

In [18]:
phone_grouped_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [19]:
# Search for first match
first_result = re.search(phone_grouped_pattern, text)
first_result

<re.Match object; span=(33, 45), match='230-111-1234'>

In [20]:
print(first_result.group()) # The entire match
print(first_result.group(1)) # Extract the first group
print(first_result.group(2)) # Extract the second group
print(first_result.group(3)) # Extract the third group

230-111-1234
230
111
1234


In [21]:
# Search for all matches
all_results = re.finditer(phone_grouped_pattern, text)
for result in all_results:
    print(result.group()) # The entire match
    print(result.group(1)) # Extract the first group
    print(result.group(2)) # Extract the second group
    print(result.group(3)) # Extract the third group

230-111-1234
230
111
1234
123-456-7890
123
456
7890
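
For the area-code task specifically, the loop above can be condensed. The sketch below assumes a shortened sample `text` and uses two things not shown earlier: the `finditer` and `search` methods on a compiled pattern, and `groups()`, which returns all captured groups of a match as a tuple.

```python
import re

text = 'Call 230-111-1234 or 123-456-7890.'
phone_grouped_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# group(1) of every match is the area code
area_codes = [m.group(1) for m in phone_grouped_pattern.finditer(text)]
print(area_codes)  # ['230', '123']

# groups() returns all captured groups of one match at once
first = phone_grouped_pattern.search(text)
print(first.groups())  # ('230', '111', '1234')
```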


## Additional Regex Syntax

### `Or` operator `|` (also called `pipe` operator)

In [22]:
# Use | to have an or statement
re.findall(r'book|computer', 'I am reading a book about computer science.')

['book', 'computer']

### `Wildcard` character `.`

In [23]:
# Use a "wildcard" as a placement that will match any character placed there. You can use a simple period . for this.
re.findall(r'.ook', 'I took money to buy a cook book for learning Python')

['took', 'cook', 'book']

In [24]:
# A more flexible way to do the above search:
# one or more non-whitespace characters followed by 'ok'
re.findall(r'\S+ok', 'Look, I took money to buy a cook book for learning Python')

['Look', 'took', 'cook', 'book']
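
One reason the `\S+` version is more flexible: the wildcard `.` matches any character except a newline, including a space. A small sketch with a made-up sample string shows the difference.

```python
import re

sample = 'cat at bat'

# '.' happily matches the space before 'at'
with_wildcard = re.findall(r'.at', sample)
print(with_wildcard)  # ['cat', ' at', 'bat']

# '\S+' requires at least one non-whitespace character before 'at'
with_non_whitespace = re.findall(r'\S+at', sample)
print(with_non_whitespace)  # ['cat', 'bat']
```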

### `Starts with` (`^`) and `ends with` (`$`)

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [25]:
sample_sentences = [
    '123 people registered for an account',
    'no number like 123 at the start and end of the sentence',
    'the number of objects is 1234'
]

In [26]:
# Find all sentences that end with at least one digit - r'\d+$'
[sentence for sentence in sample_sentences if len(re.findall(r'\d+$', sentence))]

['the number of objects is 1234']

In [27]:
# Find all sentences that start with at least one digit - r'^\d+'
[sentence for sentence in sample_sentences if len(re.findall(r'^\d+', sentence))]

['123 people registered for an account']

In [28]:
re.findall(r'^\d+', sample_sentences[0])

['123']
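
The same filters can be written a little more directly with `re.search`, since a match object is truthy and `None` is falsy. This is an alternative sketch, not a replacement for the cells above; the list names are illustrative.

```python
import re

sample_sentences = [
    '123 people registered for an account',
    'no number like 123 at the start and end of the sentence',
    'the number of objects is 1234',
]

# ^ anchors the digits to the start of the string
starts_with_digits = [s for s in sample_sentences if re.search(r'^\d+', s)]
# $ anchors the digits to the end of the string
ends_with_digits = [s for s in sample_sentences if re.search(r'\d+$', s)]

print(starts_with_digits)  # ['123 people registered for an account']
print(ends_with_digits)    # ['the number of objects is 1234']
```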

### Exclusion

To exclude characters, we can place the **^** symbol as the first character inside a set of brackets **[]**. The pattern then matches any character that is not listed inside the brackets.

In [29]:
# Exclude any characters that are digits
phrase = 'one thing we can do is to add 3 and 4 to get 7.'
re.findall(r'[^\d]', phrase)

['o',
 'n',
 'e',
 ' ',
 't',
 'h',
 'i',
 'n',
 'g',
 ' ',
 'w',
 'e',
 ' ',
 'c',
 'a',
 'n',
 ' ',
 'd',
 'o',
 ' ',
 'i',
 's',
 ' ',
 't',
 'o',
 ' ',
 'a',
 'd',
 'd',
 ' ',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 ' ',
 't',
 'o',
 ' ',
 'g',
 'e',
 't',
 ' ',
 '.']

In [30]:
# Add + to match runs of non-digit characters, which keeps the words together
re.findall(r'[^\d]+', phrase)

['one thing we can do is to add ', ' and ', ' to get ', '.']
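
For digits in particular, the negated set `[^\d]` is equivalent to the `\D` identifier from the table earlier. A short sketch confirming this on the same phrase:

```python
import re

phrase = 'one thing we can do is to add 3 and 4 to get 7.'

negated_set = re.findall(r'[^\d]+', phrase)
non_digit = re.findall(r'\D+', phrase)

print(negated_set == non_digit)  # True
print(non_digit)  # ['one thing we can do is to add ', ' and ', ' to get ', '.']

# The complementary pattern picks out only the digits
print(re.findall(r'\d', phrase))  # ['3', '4', '7']
```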

In [31]:
# Using the same technique above, we can remove punctuation from a sentence
sentence_includes_punc = 'This sentence is a string, and includes punctuation. How to remove thems? Can you try?!'
re.findall(r'[^,.?! ]+', sentence_includes_punc)

['This',
 'sentence',
 'is',
 'a',
 'string',
 'and',
 'includes',
 'punctuation',
 'How',
 'to',
 'remove',
 'thems',
 'Can',
 'you',
 'try']

In [32]:
sentence_punc_excluded = ' '.join(re.findall(r'[^,.?! ]+', sentence_includes_punc))
sentence_punc_excluded

'This sentence is a string and includes punctuation How to remove thems Can you try'

### Brackets `[]` for grouping

As shown above, we can use brackets to group together options that specify `sets` of characters:

* `[amk]` will match `'a'`, `'m'`, or `'k'`
* `[a-z]` will match any lowercase ASCII letter
* Special characters lose their special meaning inside sets. For example, `[(+*)]` will match any of the literal characters `'('`, `'+'`, `'*'`, or `')'`
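
Each of the three bullet points can be tried directly. The sample strings below are made up for illustration.

```python
import re

# [amk] matches a single 'a', 'm', or 'k'
print(re.findall(r'[amk]', 'make a mark'))
# ['m', 'a', 'k', 'a', 'm', 'a', 'k']

# [a-z] matches lowercase ASCII letters only, so the capitals and digits are skipped
print(re.findall(r'[a-z]+', 'Hello World 42'))
# ['ello', 'orld']

# Inside a set, ( + * ) are literal characters, not regex operators
print(re.findall(r'[(+*)]', '(1+2)*3'))
# ['(', '+', ')', '*']
```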

In [33]:
# Find all phrases wrapped in parentheses
# [(] matches an open parenthesis
# [\w\s]+ matches one or more characters that are either alphanumeric (\w) or whitespace (\s)
# [)] matches a close parenthesis
re.findall(r'[(][\w\s]+[)]', '1 (one) is an integer number that is (very important) in mathematics. 2 (two) is an even number.')

['(one)', '(very important)', '(two)']

In [34]:
# Find all hyphenated and/or underscored words
re.findall(r'[\w]+[-_][\w]+', 'A variable name can be like my_var_name. When writing source-code we should use descrip-tive names.')

['my_var_name', 'source-code', 'descrip-tive']

### Parentheses `()` for multiple options

If we have multiple matching options, we can use `()` together with the pipe `|` operator to list them:

`(A|B|C)` matches if any one of the patterns `A`, `B`, or `C` matches

In [35]:
# Find all emails that end with 'com', 'io', or 'dev'
email_pattern = r'([\w]+[@][\w]+[.])(com|io|dev)'
emails = 'test@mail.com, booking@domain.foo, fantastic@mail.dev, e@mail.io'
re.findall(email_pattern, emails)

[('test@mail.', 'com'), ('fantastic@mail.', 'dev'), ('e@mail.', 'io')]

In [36]:
[email.group() for email in re.finditer(email_pattern, emails)]

['test@mail.com', 'fantastic@mail.dev', 'e@mail.io']
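
`re.findall` returned tuples above because the pattern contains capturing groups. If the groups are only needed for listing alternatives, a non-capturing group `(?:...)` lets `findall` return the whole match instead. This is a sketch of that variant, which is not part of the original cells; `email_pattern_nc` is a hypothetical name.

```python
import re

emails = 'test@mail.com, booking@domain.foo, fantastic@mail.dev, e@mail.io'

# (?:...) groups the alternatives without capturing them
email_pattern_nc = r'\w+@\w+\.(?:com|io|dev)'
matched_emails = re.findall(email_pattern_nc, emails)
print(matched_emails)  # ['test@mail.com', 'fantastic@mail.dev', 'e@mail.io']
```

Like the original pattern, this is a simplified matcher for the sample string and is not meant to validate real-world email addresses.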