<div style="background-color:lightgrey;
            padding:10px;
            color:black;
            border:black dashed 2px; 
            border-radius:5px;
            margin: 20px 0;">
            
            
# Regular Expressions (RegEx)



**Staff:**  Paavo Van der Eecken  <br/>
**Support Material:**<br/>
- [Tool for constructing and debugging regular expressions](https://regex101.com/)
- [Tutorial on RegEx](https://docs.python.org/3.8/howto/regex.html)
- [Python RegEx cheat sheet](https://www.dataquest.io/blog/regex-cheatsheet/)
- [Exercises](../exercises/Questions_2023/09_EX_regex.ipynb)

**Support Sessions:**  Tuesday, October 10, 10:30AM

</div>


## 1. What are Regular Expressions?

### Very general definition:

**Regular expressions** are used to formalize patterns in character sequences. Any given character sequence (=any string) can be tested against a regular expression. If the pattern occurs in the string, we call this a **match**. In other words: **Regular expressions are used for pattern matching**.

One of the most common use cases is the validation of input strings on web forms, e.g., to test if users entered a properly formatted e-mail address or birth date.

### A more scientific definition:
"**Pattern matching** consists in finding a section of text that is
described (matched) by a regular expression.

A **regular expression** is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves.
**Metacharacters** and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters."
(Stubblebine, Tony. *Regular Expression Pocket Reference*. O'Reilly 2007 )

## 2. First examples: have we already seen 'pattern matching' in this course?

We have previously seen operations that allowed us to check if a string is contained in another string.

### The comparison operator `in`
The operator `in` searches for the string on its left side within the string on its right side. If the left string is contained in the right string, the expression evaluates to `True`, otherwise to `False`.


In [1]:
'ab' in 'table'

True

In [2]:
'ab' in 'chair'

False

In [3]:
if ("a" in "cat"):
    print ("Cat has an a.")
else:
    print ("Deadend.")

Cat has an a.


### The String-Method `string.find(pattern)`
Searches for a pattern in a given string. If the pattern is found, it returns the index at which the first occurrence starts. If the pattern does not occur, it returns -1.


In [8]:
'table'.find('a')

1

In [7]:
'table'.find('b')

2

In [6]:
'table'.find('ble')

2

In [5]:
'table'.find('x') # no occurrence

-1

In [4]:
'salad'.find('a') # returns first occurrence, not second one

1

### Writing our own functions for pattern matching?
We can write our own functions to find patterns in text strings. However, this is relatively laborious. To illustrate this, look at the following example:

In [9]:
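# Does the pattern ('a' followed by anything, followed by 'b') match a string?

def mymatch(x, y, text):
    """Return True if x occurs in text and y occurs somewhere after it."""
    for pos, letter in enumerate(text):
        if letter == x:
            # only look at the characters after the first x
            for later_letter in text[pos + 1:]:
                if later_letter == y:
                    return True
            return False  # no y after the first x, so none after any later x
    return False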

In [10]:
mymatch('a', 'b', 'zanzibar')

True
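
For comparison, the same check can be sketched as a one-line regular expression. This is only a preview: the module `re` and the metacharacters `.` and `*` are introduced in section 3, and the helper name `regex_match` is made up for this illustration. Unlike the loop version, `.` does not cross line breaks by default.

```python
import re

def regex_match(x, y, text):
    """Return True if x is followed, somewhere later in text, by y."""
    # re.escape() makes sure x and y are treated as literal characters
    pattern = re.escape(x) + ".*" + re.escape(y)
    return re.search(pattern, text) is not None

print(regex_match('a', 'b', 'zanzibar'))  # True: 'a' ... 'b' occurs in 'zanzibar'
print(regex_match('a', 'b', 'barn'))      # False: there is no 'b' after the 'a'
```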

## 3. The Python module "re"

### 3.1 Importing the ``re`` module
Regular expressions are implemented in most programming languages. The module `re` contains the functionality for Python's implementation of regular expressions. To use regular expressions for pattern matching in our code, we must first ``import re``.

The following code is a first example that shows how ``re`` is imported and how the module function ``re.findall(pattern**, string)`` is used.

In [11]:
import re

result = re.findall('a', 'zanzibar')
result  # result is a list of all matches! In this case 'a' occurs twice in 'zanzibar', therefore result contains a list with two 'a'

['a', 'a']

Notes: 
- the module function ``re.findall()`` returns a list of all hits, but not their indexes within the string. 
- pattern\*\*: for now we will use static strings as pattern, i.e. we search for 'a' in 'zanzibar'. We will learn to search for abstract patterns (e.g. "a two-digit number followed by a point") later.

### 3.2 Working with so-called match objects

The list returned by ``re.findall()`` is not very powerful, as it tells us nothing about where these matches occurred in the string. For more powerful results, the `re` module provides functions that return so-called 'match objects'. We will talk more about objects in week 3. For this session, we will try to understand 'match objects' simply by using them in examples.

A first example to look at match objects is the module function ``re.match(pattern,string)``

In [21]:
result = re.match('zan','zanzibar')


print(result)
print(result.pos) #index position of search-start
print(result.endpos) #index position of search-end
print(result.string) #entire string that was searched
print(result.start()) #start index position of match
print(result.end()) #end index position of match
print(result.span()) # start and end position as a tuple
print(result.group()) # returns the match as a group (here: the string 'zan')

<re.Match object; span=(0, 3), match='zan'>
0
8
zanzibar
0
3
(0, 3)
zan


Match objects have numerous object properties, containing all the important information about the match(es). The example above gives a first impression and helps us to understand how the results can be accessed in our code. 

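Note: In the example above there is a match. Therefore, attributes like ``pos`` and ``endpos`` and methods like ``start()``, ``end()``, ``span()`` and ``group()`` are available. If there is no match, there is no match object either: ``re.match()`` returns ``None``, and accessing an attribute raises an error: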

In [16]:
result = re.match('miep','zanzibar') # we know that there is no match...
result.pos

AttributeError: 'NoneType' object has no attribute 'pos'
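
To avoid this error, the result can be tested before it is used. The following is a minimal sketch of that check; the variable name `message` is only chosen for this illustration.

```python
import re

result = re.match('miep', 'zanzibar')  # no match, so result is None

# Test for None before touching any attribute of the match object
if result is not None:
    message = "Match at {0}".format(result.span())
else:
    message = "No match"

print(message)  # No match
```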

### 3.3 Module functions of the ``re`` module

The most important functions for this short introduction are:

#### `re.findall(*pattern, string*)`
- returns a list of matches (no match object, only the strings)
- returns empty list when there is no match

In [22]:
text = "am I a happy cat?"
pattern = "a"
re.findall(pattern,text)

['a', 'a', 'a', 'a']

#### `re.match(*pattern, string [,flags]*)`
- looks for a match from the start position of the string
- if a match is found: returns a match object (see above), 
- else: returns *None*
- useful if you want to check whether a string starts with a pattern (to test whether the *entire* string matches the pattern, use ``re.fullmatch()`` instead)

In [23]:
pattern = "am I"
result = re.match(pattern,text)

result.span() # (start, end) tuple of the match, where start is always 0 (re.match() matches from beginning!)

(0, 4)
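
The difference between "starts with the pattern" and "matches the pattern completely" can be sketched with ``re.fullmatch()``, a function of the `re` module that is not otherwise covered in this lecture. The variable names are assumptions made for this example.

```python
import re

text = "am I a happy cat?"

starts_with = re.match("am I", text)       # pattern only has to match at the start
whole_string = re.fullmatch("am I", text)  # pattern has to cover the entire string

print(starts_with is not None)   # True
print(whole_string is not None)  # False: the text continues after 'am I'
```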

#### `re.search(*pattern, string [,flags]*)`
- searches the entire string (not like ``re.match()``, which searches only at the start)
- stops with first/left-most match (finds only one match)
- if a match is found: returns a match object (see above), 
- else: returns *None*
- useful if you want to search within an entire string

In [None]:
pattern = "happy"
result = re.search(pattern,text)
result.span() # (start, end) tuple of the match. Note: search() searches in the entire string. while match() searches only at the beginning
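
A short side-by-side sketch of the two functions on the same text, assuming the sentence from the cells above:

```python
import re

text = "am I a happy cat?"

at_start = re.match("happy", text)   # 'happy' is not at the beginning
anywhere = re.search("happy", text)  # scans the whole string

print(at_start)         # None
print(anywhere.span())  # (7, 12)
```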

#### `re.finditer(*pattern, string*)`
- used to find multiple matches
- loop through (=iterate over) multiple matches, e.g. with for-loop
- each match has a match object

In [30]:
text = "am I a happy cat?"
counter = 0 # iterators are lazy. They do not count/index the positions themselves :(
pattern = "a"

for match in re.finditer(pattern,text):
    counter += 1
    print("The nr {0} match of '{1}' is found at the position {2}.".format(counter, match.group(), match.span()))
    

The nr 1 match of 'a' is found at the position (0, 1).
The nr 2 match of 'a' is found at the position (5, 6).
The nr 3 match of 'a' is found at the position (8, 9).
The nr 4 match of 'a' is found at the position (14, 15).


Notes:
- *\[,flags\]* are optional parameters. More on those later.
- The regular expression module ``re`` has many more module functions, but they cannot be covered in this short lecture.
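
The manual counter in the ``re.finditer()`` loop can also be replaced by Python's built-in ``enumerate()``. A possible sketch, where the list `positions` is only added here to collect the results:

```python
import re

text = "am I a happy cat?"
positions = []

# enumerate() numbers the match objects while we loop over them
for number, match in enumerate(re.finditer("a", text), start=1):
    positions.append((number, match.start()))
    print("Match nr {0} of '{1}' starts at index {2}.".format(number, match.group(), match.start()))
```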

### 3.4 Patterns

In all examples above, we either search for single characters or static character sequences. However, regular expressions become truly powerful only once we formalize more variable patterns. The following section will introduce the basic concepts of pattern design.

**HOW TO SEARCH FOR:**

#### Single characters ('a', 'f', ...)

In [32]:
pattern = "h"
text = "I am a happy cat."
result = re.search(pattern,text)
result.span()

(7, 8)

#### Strings / concatenated characters ('ab', 'bar', ...)

In [31]:
pattern = "happy"
text = "I am a happy cat."
result = re.search(pattern,text)
result.span()

(7, 12)

#### Either-or disjunctions ('a' or 'b'): 'a|b'

In [25]:
pattern = "a|b"

#ask user to input a character:
inputstring = input("please enter a single character")

#check if input string starts with a or b (see pattern)
result = re.match(pattern,inputstring)

if (result):
    print ("You entered an 'a' or 'b'")
    # Let's retrieve from the match object, what was actually (specifically) entered:
    print ("Actually, you entered a letter '{0}'".format(result.group()))
else:
    print ("You did not enter an 'a' or 'b'")

please enter a single characterc
You did not enter an 'a' or 'b'


#### Either-or with concatenations:  '(ab)|c'

In [33]:
pattern = "(ab)|c"   # this is the same as "ab|c"

#ask user to input a character:
inputstring = input("please enter 'ab', then do it again with 'c', then do it again with 'abc'")

#check if input string starts with 'ab' or 'c' (see pattern)
result = re.match(pattern,inputstring)

if (result):
    print ("You entered 'ab' or 'c'")
    # Let's retrieve from the match object, what was actually entered:
    print ("Actually, you entered '{0}'".format(result.group()))
else:
    print ("You did not enter 'ab' or 'c'")


please enter 'ab', then do it again with 'c', then do it again with 'abc'ac
You did not enter 'ab' or 'c'


Note: We matched with ``re.match()`` here, so we only check whether 'ab' or 'c' occurs at the beginning of the input string. The user can enter a much longer string and it will still match, as long as it starts with 'ab' or 'c'.

#### Any option out of a *character set*

In [36]:
pattern = "[abc]d" # any of a,b or c, followed by d

inputstring = input("please enter 'ad', then do it again with 'bd', then do it again with 'abcd'")
result = re.match(pattern,inputstring)

if (result):
    print ("You entered 'ad' or 'bd' or 'cd'")
    # Let's retrieve from the match object, what was actually entered:
    print ("Actually, you entered '{0}'".format(result.group()))
else:
    print ("You did not enter what I asked you to!")

    
# re.match() only looks at the start of the string: for 'abcd' the 'a' is followed
# by 'b', not by 'd', so there is no match. re.search() would find 'cd' instead.


please enter 'ad', then do it again with 'bd', then do it again with 'abcd'abcd
You did not enter what I asked you to!


- More *character sets*:
    - ``[abc]``: 'a', 'b' or 'c' (any character listed or contained within a listed range in square brackets)
    - ``[^abc]``: not 'a', 'b' or 'c' (any character **not** listed)
    - ``[a-z]``: a character in the range a-z 
    - ``[a-zA-Z]``: a character in the range a-z or A-Z (do not shorten this to ``[A-z]``: that range also contains the characters ``[ \ ] ^ _`` and the backtick)
    - ``[^a-zA-Z]``: a character NOT in the range a-z or A-Z
    - ``[0-9]``: a number in the range 0 to 9
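
A few of these character sets in action. This is a small sketch with made-up sample strings; the last line shows why ``[A-z]`` is not a safe shortcut for "all letters".

```python
import re

sample = "Room 42_B, floor [3]"

listed = re.findall("[abc]", "a big cab")      # only the listed characters
no_letters = re.findall("[^a-zA-Z ]", sample)  # everything except letters and spaces
digits = re.findall("[0-9]", sample)           # single digits

# [A-z] also covers the ASCII characters between 'Z' and 'a'
too_wide = re.findall("[A-z]", "_[x]")

print(listed)      # ['a', 'b', 'c', 'a', 'b']
print(no_letters)  # ['4', '2', '_', ',', '[', '3', ']']
print(digits)      # ['4', '2', '3']
print(too_wide)    # ['_', '[', 'x', ']']
```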
   
#### Special sequences for characters and character groups:
- ``[\b]``: backspace (only inside a character set; outside a set, ``\b`` is a word boundary, see below)
- ``\n``: newline
- ``\t``: horizontal tab
- ``\v``: vertical tab
- ``\xhh``: general character escape by two-digit hexadecimal code. Replace the hh with the hexcode of the character in the following list: https://www.rapidtables.com/code/text/ascii-table.html. E.g. backslash \ is escaped as ``\x5C`` or ``\\``.
- ``.``: any character except the newline character
- ``\d``: any digit (for ASCII text equal to ``[0-9]``)
- ``\D``: any non-digit (any character but digits)
- ``\s``: any whitespace character, including tabs etc. For ASCII text short for: ``[ \t\n\r\f\v]``
- ``\S``: any non-whitespace character
- ``\w``: any word character. For ASCII text short for: ``[a-zA-Z0-9_]``; in Python 3, Unicode letters and digits such as 'é' also count
- ``\W``: any non-word character
- ``\b``: word boundary, i.e. the empty position between a word character and a non-word character (or the start/end of the string)
- ``\B``: any position that is not a word boundary
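
The sequences above can be tried out on a short string. The sketch below uses raw strings (``r"..."``), so that Python hands the backslashes to `re` unchanged; in a normal string literal, ``"\b"`` would already be turned into a backspace character. The sample text and variable names are assumptions for this illustration.

```python
import re

text = "cat 42\tcats"

digits = re.findall(r"\d", text)   # ['4', '2']
spaces = re.findall(r"\s", text)   # [' ', '\t']
words = re.findall(r"\w+", text)   # ['cat', '42', 'cats']

# \b and \B match positions, not characters
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]  # [0]
inside_word = [m.start() for m in re.finditer(r"cat\B", text)]   # [7]

print(digits, spaces, words, whole_word, inside_word)
```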

#### Quantifiers (repeat x-times)
- ``a?``: zero or one 'a'
- ``a+``: one or more 'a'
- ``a*``: zero or more 'a'
- ``a{5}``: exactly 5 of 'a'

Some examples for quantifiers:


In [44]:
# Ask user to enter a word with two consecutive vowels (e.g. 'oo' or 'ee', but 'ea' also matches this pattern):
word = input("Please enter a word with a double vowel:")

pattern = '[aeiou]{2}'
result = re.search(pattern,word)

print("The double vowel in '{0}' is '{1}'.".format(word,result.group() )) if(result) else print("I could not find a double vowel in {0}".format(word))



Please enter a word with a double vowel:ea
The double vowel in 'ea' is 'ea'.
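
The four quantifiers can also be compared directly. The following sketch tests a few made-up words against each pattern with ``re.fullmatch()``, so that the whole word has to fit the pattern; the dictionary `results` is only used here to collect the outcome.

```python
import re

words = ["ct", "cat", "caat", "caaaaat"]
patterns = ["ca?t", "ca+t", "ca*t", "ca{5}t"]

# For every pattern, keep the words that match it completely
results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
```

With these words, ``ca?t`` accepts 'ct' and 'cat', ``ca+t`` accepts everything except 'ct', ``ca*t`` accepts all four, and ``ca{5}t`` accepts only 'caaaaat'.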


In [None]:
# extra exercise: correct year of birth, with four digits


In [None]:
# extra exercise: correct format for an email address


#### Beginning and end of a string or a line


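- ``^``: start of string (with the flag ``re.MULTILINE`` also the start of each line)
- ``$``: end of string (with the flag ``re.MULTILINE`` also the end of each line)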
   

In [47]:
# Pattern: a string that starts with c, followed by zero or more word characters, and ends with t
pattern = r"^c\w*t$"  # raw string: Python passes the backslash on to re unchanged

#ask user to enter a string
myword = input("Enter a word that starts with c and ends with t: ")

print("{0} is a correct answer.".format(myword)) if (re.search(pattern, myword)) else print("{0} is incorrect".format(myword))

Enter a word that starts with c and ends with t: chaterine_t
chaterine_t is a correct answer.
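
Whether ``^`` and ``$`` refer to the whole string or to each line depends on the flag ``re.MULTILINE``. A minimal sketch with a made-up three-line string:

```python
import re

text = "cat\ncot\ndog"

# Default: ^ and $ refer to the start and end of the whole string
whole = re.findall(r"^c\w*t$", text)

# With re.MULTILINE: ^ and $ also match at the start and end of every line
per_line = re.findall(r"^c\w*t$", text, flags=re.MULTILINE)

print(whole)     # []
print(per_line)  # ['cat', 'cot']
```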


### 3.5 Look Ahead (simple case): `?=`*expression*
The following example illustrates that we sometimes need to *look ahead*. This is, for example, the case, when we want to match characters only until a certain character occurs - but without including that character into the match.

Let's start the example by searching for all words in a tongue twister that start with 'b' and end with 'r'. Look at the word ```butter's``` in line 3 of the code and ```blisters``` in line 7. Neither of them ends with an 'r', yet their beginnings ('butter' and 'blister') show up in the matches!

In [52]:
ttwister = '''
Betty Botter bought some butter
But she said the butter's bitter
If I put it in my batter, it will make my batter bitter
But a bit of better butter will make my batter better
So 'twas better Betty Botter bought a bit of better butter
blisters
'''

#we are looking for all words that start with b or B and end with an r.
pattern = r"[bB]\w+r"
result = re.findall(pattern,ttwister)
result

['Botter',
 'butter',
 'butter',
 'bitter',
 'batter',
 'batter',
 'bitter',
 'better',
 'butter',
 'batter',
 'better',
 'better',
 'Botter',
 'better',
 'butter',
 'blister']

It helps to add ```\W``` here (i.e., the word must be followed by a non-word character). However, now the non-word character is included in the matches, which is not what we wanted.

In [53]:
#we are looking for all words that start with 'b' or 'B' and end with an 'r'.

pattern = r"[bB]\w+r\W" #followed by a non word character
result = re.findall(pattern,ttwister)
result




['Botter ',
 'butter\n',
 "butter'",
 'bitter\n',
 'batter,',
 'batter ',
 'bitter\n',
 'better ',
 'butter ',
 'batter ',
 'better\n',
 'better ',
 'Botter ',
 'better ',
 'butter\n']

To match the string only *until* the non-word character (```\W```) occurs, we can use a so-called **look ahead**:

```(?=\W)```: the character following the match must be a non-word

In [56]:
# we are looking for all words that start with 'b' or 'B' and end with an 'r'.

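pattern = r"[bB]\w+r(?=\W)"  # look ahead: \W must follow, but is not part of the match
result1 = re.findall(pattern,ttwister)
print(result1)

pattern = r"([bB]\w+r)\W"  # alternative: findall() returns only the group in parentheses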
result2 = re.findall(pattern,ttwister)
print(result2)


['Botter', 'butter', 'butter', 'bitter', 'batter', 'batter', 'bitter', 'better', 'butter', 'batter', 'better', 'better', 'Botter', 'better', 'butter']
['Botter', 'butter', 'butter', 'bitter', 'batter', 'batter', 'bitter', 'better', 'butter', 'batter', 'better', 'better', 'Botter', 'better', 'butter']


(for advanced RegEx examples with complex look aheads, see also this [link](http://www.rexegg.com/regex-disambiguation.html#lookaround))
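
One limitation of ``(?=\W)`` is that it needs an actual character after the word, so a word at the very end of the string is missed. A possible alternative, sketched below, is the word boundary ``\b`` from the list of special sequences. Note that this is a different technique than the look ahead; the sample line is made up for this illustration.

```python
import re

line = "butter's bitter blisters batter"

# Look ahead: requires a non-word character after the final 'r'
with_lookahead = re.findall(r"[bB]\w+r(?=\W)", line)

# Word boundary: also satisfied at the end of the string
with_boundary = re.findall(r"\b[bB]\w+r\b", line)

print(with_lookahead)  # ['butter', 'bitter']
print(with_boundary)   # ['butter', 'bitter', 'batter']
```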

### 3.6 Some fun with more examples

In [57]:
# find a URL in a text
mytext = '''This is a text with a URL hidden somewhere in between. 
        Let's see if we can find the URL http://www.google.com in this
        multiline string with a regular expression. Maybe, instead of
        this first URL, we can also find https://www.amazon.com, but this
        is of course uncertain'''

pattern = r"https?://www\.\w+\.[a-zA-Z]{1,5}"  #assumes top level domains have max 5 characters; ':' and '/' need no escaping

result = re.findall(pattern, mytext)
result


['http://www.google.com', 'https://www.amazon.com']

In [58]:
# find all XML-tags in this string:
myXML = '''<html>
                <head>
                    <title>My first website</title>
                </head>
                <body>
                    <p>Welcome to my first website</p>
                </body>
           </html>
           '''
# pattern for opening tags:
patternOpen = r"<[a-zA-Z]+>"  # '<' and '>' need no escaping
result = re.findall(patternOpen, myXML)
print(result)

#pattern for closing tags:
patternClose = r"</[a-zA-Z]+>"
result = re.findall(patternClose, myXML)
print(result)


['<html>', '<head>', '<title>', '<body>', '<p>']
['</title>', '</head>', '</p>', '</body>', '</html>']


In [None]:
# Can you make the previous example a bit more concise: 1 single pattern for all XML tags?


In [None]:
# Can you think of more examples?

### 3.7 Gaining more efficiency in pattern matching: `re.compile(*pattern*)` 

The function ```re.compile(*pattern*)``` *compiles* the pattern before it is used for pattern matching. This means that the pattern string is translated once into a pattern object, an internal representation that the regex engine can execute directly. When a pre-compiled pattern is used in the code, e.g., in a for- or while-loop, the pattern does not have to be looked up or compiled again in every iteration of the loop, which saves time. In other words: matching against a compiled pattern takes less time and reduces the runtime. Note that the functions of the `re` module cache recently used patterns, so the gain is usually modest.

**Test Case:** In the following cells we monitor the runtime of the cell with ```%%time```. 

- At first, we match the *un-compiled* pattern 1 million times against a string (this is done in a while loop). 
- In a second approach, we compile the pattern first. Then again, we match the pattern 1 million times against the same string. 

Let's see if the runtime of the second example is significantly shorter!

In [59]:
# Compile a regular expression: find all numbers (any length) in a string

p = re.compile('[0-9]+')
p.findall('12/10/1990')



['12', '10', '1990']

In [None]:
%%time

n = 0
while n < 1_000_000:  # 1 million iterations, as described above
    re.findall('[0-9]+', '12/10/1990')
    n += 1
    

In [61]:
%%time
p = re.compile('[0-9]+')
n = 0
while n < 1_000_000:  # 1 million iterations, as described above
    p.findall('12/10/1990')
    n += 1


CPU times: total: 656 ms
Wall time: 652 ms
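
Outside of a notebook, where ```%%time``` is not available, the same comparison can be sketched with the standard-library module `timeit`. The number of runs is reduced here to keep the example quick, and the measured times depend on the machine, so no output is shown.

```python
import re
import timeit

text = '12/10/1990'
compiled = re.compile('[0-9]+')
runs = 100_000  # assumption: fewer runs than in the cells above

# Time the module-level function and the pre-compiled pattern
t_module = timeit.timeit(lambda: re.findall('[0-9]+', text), number=runs)
t_compiled = timeit.timeit(lambda: compiled.findall(text), number=runs)

print("re.findall():       {0:.3f} s".format(t_module))
print("compiled.findall(): {0:.3f} s".format(t_compiled))
```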
