## Regular Expression (Regex)

**Source:** [RegExOne](https://regexone.com/)

Regular expressions are extremely useful for extracting information from text such as code, log files, spreadsheets, or even documents. While there is a lot of theory behind formal languages, the following lessons and examples explore the more practical uses of regular expressions so that you can start using them as quickly as possible.

The best way to dive into **regex** is to go through examples.

In [2]:
# demo: import re

import re

In [3]:
# demo: findall

some_string = "The quick brown fox jumps over the lazy dogs."

re.findall("THE",some_string.upper())

['THE', 'THE']
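As an alternative to upper-casing the input, the `re.IGNORECASE` flag makes the pattern itself case-insensitive. A minimal sketch, reusing the same sentence; note that the matches keep their original casing:

```python
import re

some_string = "The quick brown fox jumps over the lazy dogs."

# re.IGNORECASE matches regardless of letter case,
# so the text is searched as-is and the original casing is preserved.
matches = re.findall(r"the", some_string, flags=re.IGNORECASE)
print(matches)  # ['The', 'the']
```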

In [4]:
# next few examples courtesy of DataCamp
pattern = r"Cookie"  # raw string literal
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")

Match!
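The examples in this section use both `re.match` and `re.search`. The difference is that `re.match` only tries the pattern at the beginning of the string, while `re.search` scans for the first occurrence anywhere. A short sketch (the sample sentence is an assumption for illustration):

```python
import re

sequence = "Cake and Cookie"

# re.match only tries the pattern at position 0, so this returns None.
at_start = re.match(r"Cookie", sequence)

# re.search scans forward and finds the first occurrence anywhere.
anywhere = re.search(r"Cookie", sequence)

print(at_start)          # None
print(anywhere.group())  # Cookie
print(anywhere.span())   # (9, 15)
```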


In [5]:
# Wildcard
# . - A period. Matches any single character except newline character.
re.search(r'Co.k.e', 'Cookie').group()

# The group() method returns the string matched by the pattern.

'Cookie'

In [6]:
pattern = r'C.....'
print(re.search(pattern,"Cookie").group())
print(re.search(pattern,"Cuckoo").group())
print(re.search(pattern,"Cockroach").group())

Cookie
Cuckoo
Cockro


In [7]:
# \w - Lowercase w. Matches any single letter, digit or underscore.
pattern = r'Co\wk\we'
print(re.search(pattern, 'Cookie').group())
print(re.search(pattern, 'Coffee'))
print(re.search(pattern, 'Co!kee'))

Cookie
None
None


In [8]:
# \W - Uppercase w. Matches any character not part of \w (lowercase w).
pattern = r'C\Wke'
print(re.search(pattern, 'C*ke').group())

C*ke


In [9]:
# \s - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.
print(re.search(r'Eat\scake', 'Eat cake').group())
print(re.search(r'M\sE\s', 'M E ').group())

Eat cake
M E 


In [10]:
# \S - Uppercase s. Matches any character not part of \s (lowercase s).
print(re.search(r'Eats\Sshoots\sand\sleaves', 'Eats,shoots and leaves').group())

Eats,shoots and leaves


In [11]:
#\n - Lowercase n. Matches newline.
#
#\r - Lowercase r. Matches return.
#
#\d - Lowercase d. Matches decimal digit 0-9.

re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

In [12]:
# ^ - Caret. Matches a pattern at the start of the string.
re.search(r'^Eat', 'Eat cake').group()

'Eat'

In [13]:
# $ - Matches a pattern at the end of string.
re.search(r'cake$', 'Throw cake').group()

'cake'

In [14]:
# [abc] - Matches a or b or c.
# [12] - Matches 1 or 2
#
# [a-zA-Z0-9] - Matches any single letter (a to z or A to Z) or digit (0 to 9).
# To match characters that are NOT in a set, complement the set:
# if the first character of the set is ^, any character not in the set will be matched.
re.search(r'Number: [0-6]', 'Number: 6').group()

'Number: 6'
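The complemented set described in the comments can be sketched as follows. The sample strings are assumptions chosen for illustration:

```python
import re

# [^0-9] matches any single character that is NOT a digit.
first_non_digit = re.search(r"[^0-9]", "2024-01-15").group()

# [^aeiou ] excludes lowercase vowels and the space character.
consonants = re.findall(r"[^aeiou ]", "eat cake")

print(first_non_digit)  # -
print(consonants)       # ['t', 'c', 'k']
```

Note that `^` only negates a set when it is the first character inside the brackets; outside brackets it is the start-of-string anchor.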

In [15]:
re.search(r'^09[129][78][-* ]9999999', '0998 9999999').group()

'0998 9999999'

In [16]:
# \A - Uppercase a. Matches only at the very start of the string, even in multiline mode (unlike ^, which then also matches at the start of each line).
pattern = r'\A[A-Za-e]ookie'

print(re.search(pattern, 'Cookie').group())
print(re.search(pattern, 'Bookie').group())
print(re.search(pattern, 'cookie').group())
print(re.search(pattern, 'Zookie').group())

Cookie
Bookie
cookie
Zookie
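The difference between `\A` and `^` only shows up on multi-line input with the `re.MULTILINE` flag. A minimal sketch, assuming a two-line sample string:

```python
import re

text = "Bookie\nCookie"

# With re.MULTILINE, ^ matches at the start of every line...
caret_hits = re.findall(r"^[A-Z]ookie", text, flags=re.MULTILINE)

# ...while \A still matches only at the very start of the string.
anchor_hits = re.findall(r"\A[A-Z]ookie", text, flags=re.MULTILINE)

print(caret_hits)   # ['Bookie', 'Cookie']
print(anchor_hits)  # ['Bookie']
```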


In [17]:
# \b - Lowercase b. Matches at a word boundary, i.e. the position where a word begins or ends.

re.search(r'\b[A-Ka-c]ookie', 'bookie').group()

'bookie'
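Word boundaries are most useful for keeping a pattern from matching inside longer words. A small sketch, with a sample sentence made up for illustration:

```python
import re

text = "cat concatenate cat's bobcat"

# Without boundaries, "cat" is found inside longer words too.
loose = re.findall(r"cat", text)

# \b on both sides keeps only standalone occurrences of the word.
strict = re.findall(r"\bcat\b", text)

print(len(loose))  # 4
print(strict)      # ['cat', 'cat']
```

The second standalone hit comes from `cat's`: the apostrophe is not a word character, so there is a boundary right after the `t`.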

In [18]:
# Warm-up

some_string_2 = "abc123 another123"

print(re.findall("123",some_string_2))
print(re.findall("abc",some_string_2))
print(re.findall("abc123",some_string_2))
print(re.findall("xyz",some_string_2))
print(re.findall("[a-z]",some_string_2)) # find all occurrences of lower case letters (a-z)
print(re.findall("[0-9]",some_string_2)) # find all occurrences of numeric digits (0-9)

['123', '123']
['abc']
['abc123']
[]
['a', 'b', 'c', 'a', 'n', 'o', 't', 'h', 'e', 'r']
['1', '2', '3', '1', '2', '3']


### Repetitions

Writing out long patterns character by character quickly becomes tedious. Fortunately, the `re` module handles repetition with the following special characters, called quantifiers:

`+` - Matches one or more repetitions of the element to its left.

In [19]:
re.search(r'[A-Z].+kie', 'Bo0000okie').group()

'Bo0000okie'

`*` - Matches zero or more repetitions of the element to its left.

In [20]:
# Matches zero or more a's followed by zero or more o's between "C" and "kie"
re.search(r'Ca*o*kie', 'Caaaaokie').group()

'Caaaaokie'

In [21]:
re.search(r'Ca*o*kie', 'Caaaaookie').group()

'Caaaaookie'

`?` - Matches zero or one occurrence of the element to its left.

In [22]:
# The u is optional: the pattern matches both "Color" and "Colour"
re.search(r'Colou?r', 'Colour').group()

'Colour'
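To see that `?` makes only the preceding element optional, the same pattern can be tried against a few spellings. This sketch uses `re.fullmatch`, which requires the whole string to match; the misspelled `Colouur` is an assumption added for contrast:

```python
import re

pattern = r"Colou?r"

# The ? applies only to the "u" directly before it:
# zero or one "u" is accepted, two are not.
results = {word: bool(re.fullmatch(pattern, word))
           for word in ["Color", "Colour", "Colouur"]}
print(results)  # {'Color': True, 'Colour': True, 'Colouur': False}
```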

In [23]:
contact_info = "Tony Stark 0917-9003000, 0917-9003001, 0917-9003002 tony@example.com tony@ateneo.edu"

# Extract possible phone number(s) from text
print(re.findall(r"[0-9]{4}-[0-9]{7}", contact_info)) # find all possible phone numbers in the text
print(re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+", contact_info)) # find simple email addresses; \. matches a literal dot

['0917-9003000', '0917-9003001', '0917-9003002']
['tony@example.com', 'tony@ateneo.edu']
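In an email pattern, the dot before the domain suffix deserves care: an unescaped `.` is the wildcard and accepts any character, while `\.` matches only a literal period. A small sketch with a made-up second address (`tony@example-com` is an assumption for illustration):

```python
import re

text = "tony@example.com tony@example-com"

# Unescaped dot: the "-" in the second address is accepted as "any character".
loose = re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+.[a-z]+", text)

# Escaped dot: only a real period is accepted.
strict = re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+", text)

print(loose)   # ['tony@example.com', 'tony@example-com']
print(strict)  # ['tony@example.com']
```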


But what if you want to match an exact number of repetitions?

This comes up, for example, when checking the validity of a phone number in an application. The `re` module handles this as well, using the following syntax:

`{x}` - Repeat exactly x number of times.

`{x,}` - Repeat at least x times or more.

`{x,y}` - Repeat at least x times but no more than y times. Write it without a space after the comma; Python's `re` treats `{x, y}` as literal text.

In [24]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'

In [25]:
re.search(r'\d{4}[- ]\d{7}', '0987 6543210').group()

'0987 6543210'
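For validation, `re.search` alone is not enough, because it also succeeds when the pattern is found inside a longer string. A minimal sketch of a validator using `re.fullmatch`; the function name `is_valid_mobile` is hypothetical and the accepted format is just the one from the example above:

```python
import re

def is_valid_mobile(number):
    """Hypothetical validator: 4 digits, a hyphen or space, then exactly 7 digits."""
    return re.fullmatch(r"\d{4}[- ]\d{7}", number) is not None

print(is_valid_mobile("0917-9003000"))   # True
print(is_valid_mobile("0917-90030000"))  # False (8 digits after the hyphen)

# {x,} and {x,y} on their own:
at_least_two = re.search(r"\d{2,}", "a1b22c333").group()
two_to_three = re.search(r"\d{2,3}", "12345").group()
print(at_least_two)  # 22
print(two_to_three)  # 123
```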

The `+`, `*`, `?`, and `{x,y}` quantifiers are all greedy: they match as many repetitions as possible.

#### Groups and Groupings

Suppose that, when validating email addresses, you want to check the user name and host separately.

This is when the group feature of regular expression comes in handy. It allows you to pick up parts of the matching text.

Parts of a regular expression pattern enclosed in parentheses `()` are called **groups**. The parentheses do not change what the expression matches; they form groups within the matched sequence. You have been using the `group()` method all along in this tutorial's examples. Calling `match.group()` without any argument still returns the whole matched text, as usual.

In [26]:
email_address = 'Please contact us at: support.coffee@obf.ateneo.edu'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
if match:
  print(match.group()) # The whole matched text
  print(match.group(1)) # The username (group 1)
  print(match.group(2)) # The host (group 2)

support.coffee@obf.ateneo.edu
support.coffee
obf.ateneo.edu
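Besides numbered access, a match object can return all groups at once with `groups()`, and groups can be given names with the `(?P<name>...)` syntax. A short sketch on the same email example; the group names `user` and `host` are chosen here for illustration:

```python
import re

email_address = 'Please contact us at: support.coffee@obf.ateneo.edu'

# (?P<name>...) gives a group a name in addition to its number.
match = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', email_address)

print(match.groups())       # ('support.coffee', 'obf.ateneo.edu')
print(match.group('user'))  # support.coffee
print(match.groupdict())    # {'user': 'support.coffee', 'host': 'obf.ateneo.edu'}
```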


#### Greedy vs. Non-Greedy Matching

When a quantifier matches as much of the search sequence (string) as possible, it is said to be a "greedy match". This is the default behavior of a regular expression, but sometimes it is not desired:

In [27]:
heading = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()

'<h1>TITLE</h1>'

However, if you only wanted to match the first `<h1>` tag, you could use the non-greedy quantifier `*?`, which matches as little text as possible.

Adding `?` after a quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched. When you run `<.*?>`, you will only get a match with `<h1>`.

In [28]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()

'<h1>'
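The contrast is easier to see with `re.findall` on a string containing several tags. A minimal sketch, with an HTML snippet made up for illustration:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string, giving one long match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```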

#### Replace text


In [29]:
contact_info = "Tony Stark 0917-9003000, 0917-9003001, 0917-9003002 tony@example.com tony@ateneo.edu"

re.sub(r"(09[0-9][0-9])-",r"(\1) ",contact_info)

# (09[0-9][0-9]) refers to group 1 containing the mobile number prefix
# \1 refers to group 1

'Tony Stark (0917) 9003000, (0917) 9003001, (0917) 9003002 tony@example.com tony@ateneo.edu'
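Backreferences work with more than one group, and `re.sub` also accepts a `count` argument to limit how many matches are replaced. A sketch on a shortened version of the contact string; the masking format is an assumption for illustration:

```python
import re

contact_info = "Tony Stark 0917-9003000, 0917-9003001"

# Two groups: the prefix (\1) and the first three digits after the hyphen (\2).
# The last four digits are matched but not captured, so they are replaced by ****.
masked = re.sub(r"(\d{4})-(\d{3})\d{4}", r"\1-\2****", contact_info)
print(masked)  # Tony Stark 0917-900****, 0917-900****

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"(\d{4})-", r"(\1) ", contact_info, count=1)
print(first_only)  # Tony Stark (0917) 9003000, 0917-9003001
```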

In [30]:
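brgy = "Sto.                  Nino   "
brgy = re.sub(r"\s+", " ", brgy)         # collapse each run of whitespace into one space
brgy = re.sub(r"Sto\.", "Santo", brgy)   # escaped dot matches a literal period
brgy = re.sub(r"\bSto\b", "Santo", brgy) # also handle the abbreviation without a period
brgy = re.sub(r"Nino\s*", "Niño", brgy)  # restore the ñ and drop trailing whitespace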

In [31]:
brgy

'Santo Niño'

In [32]:
brgy = "Sto. Nino"
brgy = re.sub(r"Sto\.", "Santo", brgy)  # escape the dot so it matches only a literal period

In [33]:
dirty_location_list = [
    "Paranaque",
    "Pnque",
    "Parañaque",
]

clean_location_list = [re.sub(r'(Paranaque|Pnque|Parañaque)',r'PARAÑAQUE',loc) for loc in dirty_location_list]
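When the same pattern is applied to many strings, it can be compiled once with `re.compile`, and `re.IGNORECASE` extends it to lowercase variants. A minimal sketch; the fourth list entry is a hypothetical extra value added to show the flag at work:

```python
import re

dirty_location_list = ["Paranaque", "Pnque", "Parañaque", "paranaque city"]

# Compile once and reuse; re.IGNORECASE also catches lowercase variants.
variants = re.compile(r"Paranaque|Pnque|Parañaque", flags=re.IGNORECASE)

clean_location_list = [variants.sub("PARAÑAQUE", loc) for loc in dirty_location_list]
print(clean_location_list)
# ['PARAÑAQUE', 'PARAÑAQUE', 'PARAÑAQUE', 'PARAÑAQUE city']
```

Only the matched spelling is replaced, so any surrounding text such as `city` is left untouched.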

### IMPORTANT

Please go through the tutorials and sample puzzles [here](https://regexcrossword.com/). We won't have time for a full-blown lecture on regex, but the topic is too important for you not to cover through self-study because future assignments and tests will depend heavily on it.