 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc in Big Data Analytics & Artificial Intelligence*

# Regular Expressions

Regular expressions (sometimes called regex) allow us to search for strings using almost any sort of rule we can come up with. For example, finding all capital letters in a string, or finding a phone number in a document. 

Regular expressions can be strange in their syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.
Regular expressions are handled using Python's built-in **re** library. See [this link](https://docs.python.org/3/library/re.html) for more information.

[This link](https://docs.python.org/3/howto/regex.html) to the Python documentation provides information on all possible regex patterns.

## Searching for Basic Patterns

Consider the following string:

In [None]:
text = "The phone number for ATU's Donegal campus is 074-91-56789, and its email address is info@atu.ie."

We would like to know if the word (or string) "phone" is in the contents of our string called "text". We can quickly do a check using the `in` keyword and check if this word exists in the string:

In [None]:
'phone' in text

The `in` keyword can also be used when we want to check if an exact number such as a phone number is within a string. For example, does the phone number 074-91-56789 exist in our string:

In [None]:
"074-91-56789" in text

## Problems with this Method

This type of string matching technique requires an **exact** copy of your search text to be present for a True result to be returned. For example, this check does not work since one of the dashes is missing:

In [None]:
"074-9156789" in text

Regular expressions allow us to overcome this problem.

## Regular Expressions

What if we don't know the exact number that we're looking for? For example, what if we only know the number format, including dashes? Or what if we wanted to find phone numbers within a document? Or other information such as a date or an email address?

We can use a **regular expression** to find a *specific pattern* within text. Regular expressions allow for pattern searching within a text string, or within an entire document.

In [None]:
# Import the regular expressions library in Python
import re

In regular expressions, every character type has a corresponding pattern code.
For example, digits have a placeholder pattern code of `\d`. The backslash tells the regular expression engine that this is a *special character* and not the letter 'd'. Before using pattern codes, we start with a plain literal pattern:

In [None]:
pattern = 'ATU'

In [None]:
re.search(pattern, text)

There's some useful information within this output that we can examine. For example, we can show where the match occurs within the string. Remember that the text indexing position starts at zero.

In [None]:
# Copy the output of the search keyword into a variable
text_match = re.search(pattern, text)

# show where the match occurs within the string
text_match.span()

This means that the match starts at index 21 (the **A** in the search word **ATU**) and ends at index 24. The end index is exclusive, so the **U** is at index 23.

We can request particular attributes. Press the tab key after pressing the "." to see all potential options associated with your keyed-in command.

In [None]:
text_match.start()

In [None]:
text_match.end()

Be careful using `re.search` when you are looking for multiple occurrences of a string:

In [None]:
my_new_text = "I am a new student in ATU and I think ATU is great."
# Using pattern variable from previous definition

# re.search only finds the first instance
my_match = re.search(pattern, my_new_text)
my_match.span()

Instead we need to use the `re.findall` command to get all matches of the specified keyword.

In [None]:
my_match = re.findall(pattern, my_new_text)
my_match

We can iterate over all match objects using the `re.finditer` command. In this example we use a *for* loop to print the span of each match.

In [None]:
for matched_word in re.finditer(pattern, my_new_text):
    print(matched_word.span())

## Searching for General Text Patterns

So far we've learned how to search for a basic string. Now we will check for specific patterns using character identifiers, which we can combine to build up a pattern string. Notice how these make heavy use of the backslash `\`.

When defining a pattern string for regular expression we use the format:

r'mypattern'

Placing the `r` in front of the string tells Python to treat it as a raw string, so each `\` in the pattern is passed to the regular expression engine unchanged instead of being interpreted as a string escape.
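The effect of the `r` prefix can be sketched with the `\b` code (a word-boundary code covered later in this notebook). Without the prefix, Python treats `\b` as a single backspace character. The variable names below are illustrative only.

```python
import re

# Without the r prefix, Python converts each "\b" into one backspace character
plain_pattern = "\bATU\b"
# With the r prefix, the backslash and the letter b are kept as two characters
raw_pattern = r"\bATU\b"

print(len(plain_pattern))  # 5
print(len(raw_pattern))    # 7

sample_sentence = "I study at ATU."
# The plain string looks for literal backspace characters, so nothing is found
print(re.search(plain_pattern, sample_sentence))        # None
# The raw string is read by the regex engine as word boundaries around ATU
print(re.search(raw_pattern, sample_sentence).group())  # ATU
```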

This table shows the most commonly used character identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>mywork_\d\d\d</td><td>mywork_722</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letter, digit or underscore)</td><td>\w-\w\w\w</td><td>A-z_9</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
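As a quick sketch of the table above, we can try a few of the identifiers with `re.findall` on a short made-up sentence. The string `demo_text` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import re

# Made-up sentence containing letters, digits, spaces and symbols
demo_text = "Room B-12 opens at 9 am!"

# \d picks out each digit separately
print(re.findall(r"\d", demo_text))   # ['1', '2', '9']

# \s picks out each whitespace character (five spaces here)
print(len(re.findall(r"\s", demo_text)))   # 5

# \W picks out anything that is not a letter, digit or underscore
print(re.findall(r"\W", demo_text))   # [' ', '-', ' ', ' ', ' ', ' ', '!']

# Identifiers can be combined with literal characters such as the hyphen
print(re.search(r"\w-\d\d", demo_text).group())   # B-12
```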

Let's use these text patterns to find a phone number pattern within the text string we created earlier in this notebook.

In [None]:
# Show the text pattern to remind us of its contents
text

We're looking for a phone number pattern which consists of three digits, a hyphen, two more digits, another hyphen, and five more digits:

In [None]:
search_pattern = r'\d\d\d-\d\d-\d\d\d\d\d'
phone_numbers = re.search(search_pattern, text)
# show the contents of phone_numbers
phone_numbers

The output shows where the pattern occurred, and the actual found match.

We can show just the search result using the `group` command:

In [None]:
phone_numbers.group()

This will work for all identified patterns within the searched text string. If the phone number in the text string changes, the `re.search` command will still pick out the phone number, as long as it is using the same pattern.



We can use the `re.findall()` command to find multiple occurrences of text with the same pattern. We can use `re.finditer` to work through each iteration, just as we did earlier in this document.
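Here is a minimal sketch of that idea. The string `contacts` and its second phone number are made up for illustration; only the 3-2-5 digit format is taken from the example above.

```python
import re

# Hypothetical text with two numbers in the same 3-2-5 digit format
contacts = "Call 074-91-56789 during the day or 071-93-12345 in the evening."
phone_pattern = r"\d\d\d-\d\d-\d\d\d\d\d"

# findall returns the matched strings as a list
found_numbers = re.findall(phone_pattern, contacts)
print(found_numbers)   # ['074-91-56789', '071-93-12345']

# finditer yields match objects, so the position of each match is available too
for number_match in re.finditer(phone_pattern, contacts):
    print(number_match.group(), number_match.span())
```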

Notice the repetition of `\d`. It can be annoying to use, particularly if we are looking for very long strings of numbers. We can use quantifiers to improve this.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect. Here's a table showing the quantifiers available to us.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>characters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
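The quantifiers in the table can be tried out on short made-up strings. This is only a sketch, and the test strings are assumptions chosen to show what each quantifier accepts and rejects.

```python
import re

# {2,4}: runs of 2 to 4 digits; a single digit is skipped and a
# 5-digit run only contributes its first 4 digits
two_to_four = re.findall(r"\d{2,4}", "7 42 123 98765")
print(two_to_four)   # ['42', '123', '9876']

# ?: the final "s" is optional
optional_s = re.findall(r"plurals?", "plural plurals")
print(optional_s)    # ['plural', 'plurals']

# {3,}: words of three or more alphanumeric characters
long_words = re.findall(r"\w{3,}", "an owl can see")
print(long_words)    # ['owl', 'can', 'see']

# *: zero or more, so the missing B is allowed
print(bool(re.fullmatch(r"A*B*C*", "AAACC")))   # True
```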

Let's convert the phone number pattern using a quantifier. Here's the original pattern from earlier:

In [None]:
search_pattern

In [None]:
# we use 'r' first
# "\d" means a digit (from table above)
# Note there are no spaces between any parts of the expression
search_pattern = r"\d{3}-\d{2}-\d{5}"


In [None]:
re.search(search_pattern, text)

In [None]:
my_match = re.search(search_pattern, text)

In [None]:
my_match.group()

## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together parts of a regular expression. Then we can break them down.

We can add parentheses around each part of the search pattern to create a group. Then we can use these groups to retrieve each part of the match separately.

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [None]:
# Change the search_pattern so that it is split into groups
search_pattern = r"(\d{3})-(\d{2})-(\d{5})"

my_match = re.search(search_pattern, text)

# Extract the text matched by the first group (the first three digits)
my_match.group(1)

If we only wanted the final set of digits from the phone number, we can specify the third set of brackets - group 3.

In [None]:
my_match.group(3)
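As a short sketch, all groups can also be retrieved at once with the match object's `groups()` method, and `group(0)` returns the whole match. The variable names below are illustrative and separate from the ones used in the cells above.

```python
import re

grouped_pattern = r"(\d{3})-(\d{2})-(\d{5})"
grouped_match = re.search(grouped_pattern, "Phone: 074-91-56789")

# group(0) (or group() with no argument) is the whole match
print(grouped_match.group(0))   # 074-91-56789

# groups() returns every group as a tuple, in order
print(grouped_match.groups())   # ('074', '91', '56789')

# The tuple can be unpacked into separate variables
area_code, middle_digits, final_digits = grouped_match.groups()
print(area_code)                # 074
```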

## Other Regex Syntax

We can use the or (`|`) statement to find more than one text item within a string.

In [None]:
text = "A man and a woman were here earlier."

re.search(r"man|woman", text)

Note that `re.search` only finds the first occurrence of the match pattern, just as it did when we used it earlier. So it does not detect the occurrence of `woman` in the search text. We can solve this by using the `re.findall` command, just as we did earlier.

In [None]:
re.findall(r"man|woman", text)

## The Wildcard Character

We can use a wildcard to match any single character in its place. The example below searches for any one character followed by 'at'.

In [None]:
text = "The cat sat on the mat."
# A "." in a regex query represents a wildcard
# Therefore any word matching a letter followed by "at" will be selected.
re.findall(r".at", text)

If we change the text to contain the word 'splat', the search term will only find part of the word 'splat' because of the single '.' in the search term. 

In [None]:
text = "The cat sat on the mat and then went splat."
re.findall(r".at", text)

We can improve on the search to find any two characters before 'at'. Notice that the output picks up the space before 'cat', 'sat' and 'mat'.

In [None]:
text = "The cat sat on the mat and then went splat."
re.findall(r"..at", text)
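If the aim is to pick up whole words ending in 'at' regardless of their length, one possible approach (a sketch, not the only option) is to replace the wildcard with `\w` and the `+` quantifier, so that spaces are never matched. The variable name `cat_text` is illustrative.

```python
import re

cat_text = "The cat sat on the mat and then went splat."

# \w+ matches one or more alphanumeric characters, so spaces are excluded
# and 'splat' is returned in full
words_ending_in_at = re.findall(r"\w+at", cat_text)
print(words_ending_in_at)   # ['cat', 'sat', 'mat', 'splat']
```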

## Starts With and Ends With

We can check to see if text starts or ends with a particular character. We use the `^` and `$` symbols to find these. 

For example, if we want to find any occurrence of a digit at the end of a sentence then we use the `$` sign.

In [None]:
re.findall(r"\d$", "All rooms on the second floor of ATU Donegal end with a 2")

Note that this will only find digits if they are the final element of the search string. It will not work if a `.` is the final character. For example, the following does not work:

In [None]:
re.findall(r"\d$", "All rooms on the second floor of ATU Donegal end with a 2.")

Similarly, we can check whether a sentence begins with a digit using the `^` symbol:

In [None]:
re.findall(r"^\d", "1 divided by 0 gives an error.")

Note that both of these options check an entire string and not individual words.
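One way to cope with a trailing full stop, sketched here under the assumption that at most one full stop follows the digit, is to make an escaped full stop optional with `\.?` and to put the digit in a group so that only the digit is returned.

```python
import re

with_stop = "All rooms on the second floor of ATU Donegal end with a 2."
without_stop = "All rooms on the second floor of ATU Donegal end with a 2"

# \. is a literal full stop (an unescaped . would be a wildcard);
# the ? makes it optional, and the parentheses return only the digit
end_digit_pattern = r"(\d)\.?$"

print(re.findall(end_digit_pattern, with_stop))      # ['2']
print(re.findall(end_digit_pattern, without_stop))   # ['2']
```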

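We can exclude characters from a match by placing `^` as the first character inside square brackets `[]`. The pattern then matches any character that is *not* listed inside the brackets. For example, `[^\d]` matches every character that is not a digit: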

In [None]:
# Here's a search phrase including several numbers
search_phrase = "There are 3 numbers within this sentence. The first 1 is 3, the 2nd is 6, and the 3rd is 9."
# [^\d] matches any single character that is not a digit
re.findall(r"[^\d]", search_phrase)

To strip punctuation from text, we can add the `+` quantifier after the square brackets so that the pattern matches runs of one or more characters that are not in the excluded set. This is really useful in NLP.

Let's look at an example:

In [None]:
search_phrase = "There are lots of punctuation marks within this sentence! And I would like to remove them! But how can I do that?"
re.findall(r"[^.!?]+", search_phrase)

The string is now broken into sections, based on what is found within the square brackets. Each break occurs when there is a match for a character within the square brackets. For example, if we remove the `!` from the square brackets, then sentences ending in a `!` will not have this character removed. And these sentences will remain as one sentence.


In [None]:
re.findall(r"[^.?]+", search_phrase)
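The pieces returned by this kind of pattern keep the space that followed each punctuation mark. A minimal sketch of tidying them up with Python's `str.strip()` is shown below; the short phrase is made up for illustration.

```python
import re

short_phrase = "First sentence! Second one? Third."

# Split on ., ! and ? by matching runs of everything else
pieces = re.findall(r"[^.!?]+", short_phrase)
print(pieces)     # ['First sentence', ' Second one', ' Third']

# Remove the leading space left over after each punctuation mark
sentences = [piece.strip() for piece in pieces]
print(sentences)  # ['First sentence', 'Second one', 'Third']
```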

We can use the `+` after a character class to match one or more consecutive characters of that type.

Let's use this feature to find hyphenated words in a sentence. We use `\w` with `+` to match any number of alphanumeric characters. Refer to the table above to remind yourself why this is the case. Note that the full stop is not an alphanumeric character, and we do not need to specify how long each run of characters needs to be.

In [None]:
search_phrase = "Here is a sentence with some hyphen-words. I want to remove any long-ish words and hyphen-words."
# Using the "+" symbol allows for any length of alphanumeric word to be found.
# The "-" symbol is part of the pattern, i.e. "an alphanumeric word - an alphanumeric word".
re.findall(r"[\w]+-[\w]+", search_phrase)

## Parentheses for Multiple Options

If we have multiple options for matching within a word, we can use parentheses to list out these options. For example:

In [None]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = "Hello, would you like some catfish?"
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [None]:
re.search(r"cat(fish|nap|claw)",text)

In [None]:
re.search(r'cat(fish|nap|claw)',texttwo)

In [None]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)

## Anchors and Word Boundaries
We've just seen that the symbols `^` (caret) and `$` can be used to match the start and end of a string, respectively. When multiline mode is enabled through the `flags` parameter, they also match at the start and end of each line in a multiline string. We can use `\A` and `\Z` to match strictly at the start and end of the whole string, regardless of line breaks. These are known as *anchors*.

Another anchor is `\b`, used to match at a word boundary. A word boundary occurs before the first character of a string (if the first character is a word character); after the last character of a string (if the last character is a word character); and between two characters in a string (where one is a word character and the other is not a word character). This allows us to search for whole words only using syntax like `\bword\b`.
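A small sketch of the difference a word boundary makes, using a made-up string (`boundary_text` is an assumption for illustration):

```python
import re

boundary_text = "cat concat catalog cat"

# Without boundaries, 'cat' is also found inside 'concat' and 'catalog'
all_hits = re.findall(r"cat", boundary_text)
print(len(all_hits))   # 4

# With \b on both sides, only the standalone word is matched
whole_words = re.findall(r"\bcat\b", boundary_text)
print(whole_words)     # ['cat', 'cat']
```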

## Exercise

The code below finds two occurrences of the string `'art'`. Adapt it to 
- instead find only the whole word *art*;
- find only words ending in art which occur at the end of the string.


In [None]:
string = 'art and cart.'
re.findall(r'art', string)

## Substitution

The `re.sub()` function can be used to substitute using regex. The syntax for this is `re.sub(pattern, repl, string, count=0, flags=0)`. Here `pattern` is the regular expression we want to match; `repl` is the replacement; `string` is the original string; `count` is the maximum number of substitutions that should be performed, with `0` indicating replace all; and `flags` allows for a number of options to modify the behaviour of the pattern.

For example, to remove the symbols from a phone number, we could use

In [None]:
phone = '(074) 91-56789'
pattern = r'\D'
re.sub(pattern, '', phone)

Here `\D` is an inverse digit character set, containing all characters *except* digits, so the function removes all non-digit characters.
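The `count` parameter can be sketched as follows. The strings and variable names are made up for illustration.

```python
import re

dashed_number = "074-91-56789"

# Default count=0 replaces every match
no_dashes = re.sub(r"-", "", dashed_number)
print(no_dashes)         # 0749156789

# count=1 replaces only the first match
first_dash_only = re.sub(r"-", "", dashed_number, count=1)
print(first_dash_only)   # 07491-56789

# A quantifier lets one substitution collapse a whole run of spaces
single_spaced = re.sub(r"\s+", " ", "too    many   spaces")
print(single_spaced)     # too many spaces
```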

## Exercises

1. For the following string, use a regular expression to replace the whole word *red* with the word *blue*.

In [None]:
words = 'bread bred red spread predict read featured red.'

2. For the following string, change the whole word *wall* to *law* (i.e. words such as *wallet* should be unchanged).

In [None]:
string = '(wall) call ball pall ball fall mall tall wall call ball pall mall wall ball fall wallet mallet walls wall:call:ball:pall'

3. For the following string, change the whole word *wall* to *law* only if it is at the start of a line. You will need to investigate the `flags` parameter to indicate that the input is a multiline string.

In [None]:
string = '''\
(wall) call ball pall 
ball fall mall tall 
wall call ball pall 
mall wall ball fall 
wallet mallet walls 
wall:call:ball:pall
'''

4. Create a regular expression that extracts the domain name of a URL. It should work regardless of whether the `www` or a subdirectory is included, i.e. it should extract the string `atu` from each of the URLs below.

In [None]:
urls = ['http://atu.ie', 'https://www.atu.ie', 'http://www.atu.ie/donegal/']

In [None]:
for url in urls:
  print(url)