# Regular expressions
* A formal language for specifying sets of text strings (character sequences)
* Used for pattern matching (e.g. searching and replacing in text)

1. Disjunctions
2. Negation
3. Optionality
4. Aliases
5. Anchors
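
As a quick orientation before the sections below, here is a minimal sketch of the two uses named above, searching and replacing, with Python's `re` module. The sample sentence and variable names are made up for illustration.

```python
import re

# Illustrative sentence, not part of the notebook's text
sentence = "The cat sat on the mat."

# Searching: re.search returns a match object for the first hit, or None
match = re.search(r"cat", sentence)
print(match.group())  # cat
print(match.start())  # 4

# Replacing: re.sub returns a new string with every match replaced
replaced = re.sub(r"cat", "dog", sentence)
print(replaced)       # The dog sat on the mat.
```

The rest of the notebook uses `re.sub` throughout, because replacing the matches makes it easy to see what a pattern matched.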

## 1. Disjunctions

In [1]:
import re

In [2]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [3]:
pattern = r"the"

In [12]:
# this cell was executed after In [11], so the output below uses pattern = r"of|the|we"
print(re.sub(pattern, "X", text))

Most X X time X can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's X use X apostrophes? Or it might need a more up-to-date model.



In [5]:
# also match on upper case T
pattern = r"[Tt]he"

In [7]:
# match on digits
pattern = r"[0-9]"

In [9]:
# match on upper case
pattern = r"[A-Z]"

In [11]:
# match any of these words
pattern = r"of|the|we"
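
The disjunction patterns above can be compared side by side with `re.findall`, which returns the matched substrings instead of replacing them. This is only a sketch: the sample string is an assumption chosen to keep the results short.

```python
import re

# Illustrative sample, not the notebook's text
sample = "The end of the 2nd Act: we wait."

lower_the = re.findall(r"the", sample)
any_the = re.findall(r"[Tt]he", sample)
digits = re.findall(r"[0-9]", sample)
capitals = re.findall(r"[A-Z]", sample)
words = re.findall(r"of|the|we", sample)

print(lower_the)  # ['the']
print(any_the)    # ['The', 'the']
print(digits)     # ['2']
print(capitals)   # ['T', 'A']
print(words)      # ['of', 'the', 'we']
```

Square brackets choose between single characters, while `|` chooses between whole alternatives. None of these patterns respect word boundaries, so `we` would also match inside a word such as `went`.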

## 2. Negation

In [13]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [14]:
# match any character that is NOT a digit
pattern = r"[^0-9]"

In [17]:
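# this cell was executed after In [16], so the output below uses pattern = r"[^a-z]"
# (upper case letters, digits, punctuation and newlines all become spaces)
print(re.sub(pattern, " ", text))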

 ost of the time we can use white space   ut what about fred gmail com or              he students  attempts aren t working   aybe it s the use of apostrophes   r it might need a more up to date model  


In [16]:
# match any character that is NOT a lower case letter
pattern = r"[^a-z]"
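
A small sketch of both negated classes on a short string may make the difference easier to see. The sample string is an assumption for illustration. The last line shows that `^` only negates when it is the first character inside the brackets.

```python
import re

# Illustrative sample, not the notebook's text
sample = "Room 42, Floor 3"

# Delete every character that is not a digit
digits_only = re.sub(r"[^0-9]", "", sample)
print(digits_only)  # 423

# Replace every character that is not a lower case letter
masked = re.sub(r"[^a-z]", "_", sample)
print(masked)       # _oom______loor__

# A caret that is not first in the brackets is an ordinary character
literal_caret = re.findall(r"[a^b]", "a^b")
print(literal_caret)  # ['a', '^', 'b']
```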

## 3. Optionality

In [18]:
text = '''begin began begun beginning'''

In [19]:
# . matches any single character except a newline (like a 'wild card')
pattern = r"beg.n"

In [20]:
print(re.sub(pattern, "X", text))

X X X Xning
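
The wild card has one exception worth seeing in action: by default `.` does not match a newline. The sketch below uses a made-up input with a newline in the vowel position, and also shows a character class as a stricter alternative to `.`.

```python
import re

# Illustrative input: the last "word" has a newline where the vowel would be
text_dot = "begin began begun beg\nn"

default_matches = re.findall(r"beg.n", text_dot)
dotall_matches = re.findall(r"beg.n", text_dot, flags=re.DOTALL)

# A character class only accepts the listed characters, so "begxn" is rejected
class_matches = re.findall(r"beg[aiu]n", "begin began begun begxn")

print(default_matches)  # ['begin', 'began', 'begun']
print(dotall_matches)   # ['begin', 'began', 'begun', 'beg\nn']
print(class_matches)    # ['begin', 'began', 'begun']
```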


In [21]:
text = '''colour can be spelled color'''

In [22]:
# ? means the previous character is optional (0 or 1 occurrences)
pattern = r"colou?r"

In [23]:
print(re.sub(pattern, "", text))

 can be spelled 


In [24]:
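# * is the Kleene star, meaning match 0 or more of the previous character
# here the previous character is . so the pattern matches from a 'w' to the end of the line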
pattern = r"w.*"

In [25]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [26]:
print(re.sub(pattern, "", text))

Most of the time 
But 
The students' attempts aren't 
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [27]:
# add ? after * to make the match non-greedy (stop at the first space instead of the last)
pattern = r"w.*? "
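
The notebook shows no output for the non-greedy pattern, so here is a minimal comparison on a single line. The line is the first sentence of the text above, and the variable names are illustrative.

```python
import re

line = "we can use white space."

# Greedy: .* grabs as much as possible, so the match runs to the LAST space
greedy = re.sub(r"w.* ", "", line)

# Non-greedy: .*? grabs as little as possible, so each match stops at the FIRST space
lazy = re.sub(r"w.*? ", "", line)

print(greedy)  # space.
print(lazy)    # can use space.
```

The greedy version removes everything from the first `w` to the last space in one match. The non-greedy version removes only `we ` and `white `.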

In [28]:
text = '''foo fooo foooo fooooo!'''

In [29]:
# + is the Kleene plus, meaning match 1 or more of previous char
pattern = r"fooo+"

In [30]:
print(re.sub(pattern, "", text))

foo   !
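
The three quantifiers in this section, `?`, `*` and `+`, differ only in how many repetitions they accept. A sketch on a made-up string of `fo...` words:

```python
import re

# Illustrative sample with 1, 2, 3 and 4 o's
words = "fo foo fooo foooo"

optional = re.findall(r"foo?", words)  # fo followed by 0 or 1 more o
star = re.findall(r"foo*", words)      # fo followed by 0 or more o
plus = re.findall(r"foo+", words)      # fo followed by 1 or more o

print(optional)  # ['fo', 'foo', 'foo', 'foo']
print(star)      # ['fo', 'foo', 'fooo', 'foooo']
print(plus)      # ['foo', 'fooo', 'foooo']
```

Note that `foo?` still finds a match inside the longer words, it just stops after two o's.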


## 4. Aliases

#### \w - match a word character (letter, digit or underscore)
#### \d - match a digit
#### \s - match a whitespace character
#### \W - match a non-word character
#### \D - match a non-digit
#### \S - match a non-whitespace character
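
As a rough sketch of how the aliases behave, the snippet below applies several of them to a short string built from fragments of the text used in this notebook. Each alias matches a single character, so `+` is added where whole tokens are wanted.

```python
import re

# Illustrative sample built from fragments of the notebook's text
sample = "fred@gmail.com on 13/01/2021"

digit_runs = re.findall(r"\d+", sample)
word_runs = re.findall(r"\w+", sample)
non_space_runs = re.findall(r"\S+", sample)
non_word_chars = re.findall(r"\W", sample)

print(digit_runs)      # ['13', '01', '2021']
print(word_runs)       # ['fred', 'gmail', 'com', 'on', '13', '01', '2021']
print(non_space_runs)  # ['fred@gmail.com', 'on', '13/01/2021']
print(non_word_chars)  # ['@', '.', ' ', ' ', '/', '/']
```

For `str` patterns in Python 3 these aliases are Unicode-aware by default, so `\w` also matches accented letters. Passing `flags=re.ASCII` restricts them to the ASCII ranges.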

In [31]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [32]:
pattern = r"\w"

In [33]:
# delete all word characters
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [34]:
pattern = r"\d"

In [35]:
# delete all digit characters
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or //?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



## 5. Anchors

In [44]:
# delete all words
pattern = r'\w+'

In [45]:
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [46]:
# delete only words at the start of a string
pattern = r'^\w+'

In [47]:
print(re.sub(pattern, "", text))

 of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [48]:
# switch on multiline mode to delete words at the start of each line
print(re.sub(pattern, "", text, flags=re.MULTILINE))

 of the time we can use white space.
 what about fred@gmail.com or 13/01/2021?
 students' attempts aren't working.
 it's the use of apostrophes? Or it might need a more up-to-date model.



In [49]:
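# use $ to anchor the match at the end of a string
# ($ also matches just before a trailing newline, so both the final '.' and the final newline match)
pattern = r'\W$'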

In [50]:
# delete non-word characters from the end of the string
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model


In [52]:
# switch on multiline mode to delete non-words at the end of each line
print(re.sub(pattern, "", text, flags=re.MULTILINE))

Most of the time we can use white space
But what about fred@gmail.com or 13/01/2021
The students' attempts aren't working
Maybe it's the use of apostrophes? Or it might need a more up-to-date model
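
The anchor behaviour above can also be checked with `re.findall` on a small two-line string. This is a sketch with a made-up input. `\Z`, which is not used elsewhere in this notebook, is included to contrast with `$`: it matches only at the very end of the string.

```python
import re

# Illustrative two-line sample ending in a newline
lines = "alpha one.\nbeta two.\n"

# ^ anchors at the start of the string, or of every line with re.MULTILINE
first_word = re.findall(r"^\w+", lines)
first_words = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# $ anchors at the end of the string (or just before a trailing newline),
# and at the end of every line with re.MULTILINE
last_word = re.findall(r"\w+\.$", lines)
last_words = re.findall(r"\w+\.$", lines, flags=re.MULTILINE)

# \Z only matches at the very end, and the string ends with a newline
strict_end = re.findall(r"\w+\.\Z", lines)

print(first_word)   # ['alpha']
print(first_words)  # ['alpha', 'beta']
print(last_word)    # ['two.']
print(last_words)   # ['one.', 'two.']
print(strict_end)   # []
```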
