# Regular expressions
* A formal language for defining text strings (character sequences)
* Used for pattern matching (e.g. searching and replacing in text)

1. Disjunctions
2. Negation
3. Optionality
4. Aliases
5. Anchors
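
As a quick orientation before the individual topics, here is a minimal sketch of searching and replacing with Python's `re` module. The `sample` sentence is a made-up example and is not part of the notebook's data.

```python
import re

# Hypothetical sample sentence, used only for this illustration
sample = "The cat sat on the mat."

# Searching: locate the first occurrence of "at"
first = re.search(r"at", sample)
print(first.start())   # 5

# Searching: list every lowercase letter followed by "at"
found = re.findall(r"[a-z]at", sample)
print(found)           # ['cat', 'sat', 'mat']

# Replacing: substitute "X" for every match
replaced = re.sub(r"[a-z]at", "X", sample)
print(replaced)        # The X X on the X.
```

The rest of the notebook uses `re.sub` throughout because the substituted text makes it easy to see exactly what a pattern matched.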

# 1. Disjunctions

In [1]:
import re

In [2]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [3]:
pattern = r"the" # The r prefix makes this a raw string literal, so backslashes reach the regex engine unchanged

In [4]:
print(re.sub(pattern, "X", text)) # Substitute X for pattern in text

Most of X time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's X use of apostrophes? Or it might need a more up-to-date model.



In [5]:
pattern = r"[Tt]he" # match with upper or lowercase t

In [6]:
print(re.sub(pattern, "X", text)) # Substitute X for pattern in text

Most of X time we can use white space.
But what about fred@gmail.com or 13/01/2021?
X students' attempts aren't working.
Maybe it's X use of apostrophes? Or it might need a more up-to-date model.



In [7]:
pattern = r"[0-9]" # match on digits

In [8]:
print(re.sub(pattern, "X", text)) # Substitute X for pattern in text

Most of the time we can use white space.
But what about fred@gmail.com or XX/XX/XXXX?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [9]:
pattern = r"[A-Z]" # match on uppercase letters in the range A to Z

In [10]:
print(re.sub(pattern, "X", text)) # Substitute X for pattern in text

Xost of the time we can use white space.
Xut what about fred@gmail.com or 13/01/2021?
Xhe students' attempts aren't working.
Xaybe it's the use of apostrophes? Xr it might need a more up-to-date model.



In [11]:
pattern = r"of|the|we" # | is alternation: match any one of "of", "the" or "we"

In [12]:
print(re.sub(pattern, "X", text)) # Substitute X for pattern in text

Most X X time X can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's X use X apostrophes? Or it might need a more up-to-date model.
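
Note that a disjunction matches anywhere in the text, including inside longer words. The sketch below illustrates this with a made-up `sample` sentence, and shows one common way to restrict the match to whole words using the word-boundary token `\b` and a non-capturing group `(?:...)`. Neither of these appears elsewhere in this notebook, so treat them as an optional extra.

```python
import re

# Hypothetical sentence chosen so the alternatives also occur inside longer words
sample = "We offer the other answer"

# Plain alternation also matches inside "offer", "other" and "answer"
loose = re.sub(r"of|the|we", "X", sample)
print(loose)    # We Xfer X oXr ansXr

# \b marks a word boundary; (?:...) groups the alternatives without capturing
strict = re.sub(r"\b(?:of|the|we)\b", "X", sample)
print(strict)   # We offer X other answer
```

"We" is left alone in both cases because the pattern is lowercase and matching is case-sensitive by default.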



# 2. Negation

In [13]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [14]:
pattern = r"[^0-9]" # ^ at the start of a bracket negates it: match any character that is not a digit

In [15]:
print(re.sub(pattern, "", text)) # Substitute the empty string for pattern in text, i.e. delete every match

13012021


In [16]:
pattern = r"[^a-z]" # match any character that is not a lowercase letter (including newlines)

In [17]:
print(re.sub(pattern, " ", text)) # Substitute a space for pattern in text

 ost of the time we can use white space   ut what about fred gmail com or              he students  attempts aren t working   aybe it s the use of apostrophes   r it might need a more up to date model  
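
The caret only negates when it is the first character inside the square brackets. A small sketch of the difference, using a made-up `sample` string that contains literal carets:

```python
import re

# Hypothetical sample containing literal caret characters
sample = "a^b 2^10"

# Caret first inside the brackets: negation, so everything except digits is deleted
digits_only = re.sub(r"[^0-9]", "", sample)
print(digits_only)      # 210

# Caret later inside the brackets: a literal "^", so digits and carets are deleted
rest = re.sub(r"[0-9^]", "", sample)
print(repr(rest))       # 'ab '
```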


# 3. Optionality

In [18]:
text = '''begin began begun beginning'''

In [19]:
# . is a wildcard: it matches any single character except a newline
pattern = r"beg.n" # 'beg' followed by any one character, followed by 'n'

In [20]:
print(re.sub(pattern, "X", text))

X X X Xning


In [21]:
text = '''colour can be spelled color'''

In [22]:
# ? makes the preceding character optional (0 or 1 occurrences)
pattern = r"colou?r"

In [23]:
print(re.sub(pattern, "", text))

 can be spelled 


In [24]:
# * is the Kleene star: match 0 or more of the preceding character; here .* matches everything from 'w' to the end of the line
pattern = r"w.*"

In [25]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [26]:
print(re.sub(pattern, "", text))

Most of the time 
But 
The students' attempts aren't 
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [27]:
# adding ? after * makes the match non-greedy: .*? matches as few characters as possible (here zero), so only the 'w' itself is replaced
pattern = r"w.*?"

In [28]:
print(re.sub(pattern, "X", text))

Most of the time Xe can use Xhite space.
But Xhat about fred@gmail.com or 13/01/2021?
The students' attempts aren't Xorking.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [29]:
# non-greedy match followed by a space: from 'w' up to and including the next space on the same line
pattern = r"w.*? "

In [30]:
print(re.sub(pattern, "X", text))

Most of the time Xcan use Xspace.
But Xabout fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
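
To see the greedy and non-greedy behaviour side by side, the following sketch lists the matches with `re.findall` instead of replacing them. The `sample` string is a fragment borrowed from the text above, and the patterns `w.*e` and `w.*?e` are illustrative choices, not ones used elsewhere in this notebook.

```python
import re

# Fragment of the notebook's text, used here as a standalone sample
sample = "we can use white space"

# Greedy: .* grabs as much as possible, so the match runs to the last 'e'
greedy = re.findall(r"w.*e", sample)
print(greedy)   # ['we can use white space']

# Non-greedy: .*? stops at the first 'e' that completes the match
lazy = re.findall(r"w.*?e", sample)
print(lazy)     # ['we', 'white']
```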



In [31]:
text = '''foo fooo foooo fooooo!'''

In [32]:
# + is the Kleene plus: match 1 or more of the preceding character, so this is 'foo' followed by at least one more 'o'
pattern = r"fooo+"

In [33]:
print(re.sub(pattern, "X", text))

foo X X X!
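
The quantifiers `?`, `*` and `+` apply only to the single character (or bracket expression) immediately before them. To repeat a longer sequence, it can be wrapped in a group. The sketch below assumes a made-up `sample` string and uses a non-capturing group `(?:...)`, which is not otherwise covered in this notebook.

```python
import re

# Hypothetical sample string
sample = "ha haha haaa"

# + applies to the 'a' only: 'h' followed by one or more 'a'
single = re.findall(r"ha+", sample)
print(single)    # ['ha', 'ha', 'ha', 'haaa']

# + applies to the whole group: one or more repetitions of 'ha'
grouped = re.findall(r"(?:ha)+", sample)
print(grouped)   # ['ha', 'haha', 'ha']
```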


# 4. Aliases
Shortcuts
```
\w - match a word character (letter, digit or underscore)
\d - match a digit
\s - match a whitespace character
\W - match a non-word character
\D - match a non-digit character
\S - match a non-whitespace character
```
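
Each alias stands for a character class that could also be written out with square brackets. The sketch below checks this on a made-up `sample` string. It passes `re.ASCII` because, for Python `str` patterns, `\d` and `\w` otherwise also match non-ASCII digits and letters.

```python
import re

# Hypothetical sample with letters, digits, an underscore and a tab
sample = "Room 42, floor_3\tEast"

# \d behaves like [0-9] in ASCII mode
digits = re.findall(r"\d", sample, flags=re.ASCII)
print(digits)   # ['4', '2', '3']

# \w behaves like [A-Za-z0-9_] in ASCII mode
words = re.findall(r"\w+", sample, flags=re.ASCII)
explicit = re.findall(r"[A-Za-z0-9_]+", sample)
print(words)    # ['Room', '42', 'floor_3', 'East']

# \s matches spaces, tabs and newlines
spaces = re.findall(r"\s", sample)
print(spaces)   # [' ', ' ', '\t']
```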

In [34]:
text = '''Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.
'''

In [35]:
pattern = r"\w"

In [36]:
# delete all word characters
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [37]:
pattern = r"\d"

In [38]:
print(re.sub(pattern, "X", text))

Most of the time we can use white space.
But what about fred@gmail.com or XX/XX/XXXX?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [39]:
pattern = r"\D"

In [40]:
print(re.sub(pattern, "", text))

13012021


# 5. Anchors

In [41]:
# delete all words (raw string, so that \w is not treated as a Python string escape)
pattern = r"\w+"

In [42]:
print(re.sub(pattern, "", text))

        .
   @.  //?
 '  ' .
 '    ?       -- .



In [43]:
# ^ anchors the match at the start of the string: delete only the first word
pattern = r"^\w+"

In [44]:
print(re.sub(pattern, "", text))

 of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model.



In [45]:
# switch on multiline mode so that ^ also matches at the start of each line
print(re.sub(pattern, "", text, flags=re.MULTILINE))

 of the time we can use white space.
 what about fred@gmail.com or 13/01/2021?
 students' attempts aren't working.
 it's the use of apostrophes? Or it might need a more up-to-date model.



In [46]:
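# use $ to anchor the match at the end of the string
# ($ also matches just before a trailing newline, so both the final '.' and the newline are removed)
pattern = r"\W$"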

In [47]:
print(re.sub(pattern, "", text))

Most of the time we can use white space.
But what about fred@gmail.com or 13/01/2021?
The students' attempts aren't working.
Maybe it's the use of apostrophes? Or it might need a more up-to-date model


In [48]:
# switch on multiline mode to delete the non-word character at the end of each line
print(re.sub(pattern, "", text, flags=re.MULTILINE))

Most of the time we can use white space
But what about fred@gmail.com or 13/01/2021
The students' attempts aren't working
Maybe it's the use of apostrophes? Or it might need a more up-to-date model
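
The effect of the anchors and of `re.MULTILINE` is perhaps easiest to see by listing the matches. A minimal sketch with a made-up two-line `sample` string:

```python
import re

# Hypothetical two-line sample ending in a newline
sample = "first line\nsecond line\n"

# Without MULTILINE, ^ matches only at the very start of the string
start_default = re.findall(r"^\w+", sample)
print(start_default)   # ['first']

# With MULTILINE, ^ also matches immediately after each newline
start_multi = re.findall(r"^\w+", sample, flags=re.MULTILINE)
print(start_multi)     # ['first', 'second']

# Without MULTILINE, $ matches at the end of the string or just before a final newline
end_default = re.findall(r"\w+$", sample)
print(end_default)     # ['line']

# With MULTILINE, $ also matches just before every newline
end_multi = re.findall(r"\w+$", sample, flags=re.MULTILINE)
print(end_multi)       # ['line', 'line']
```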
